CN112534788B - Lock-free pipelined network data packet bandwidth control - Google Patents


Info

Publication number
CN112534788B
Authority
CN
China
Prior art keywords
cpus
network data
lock
bandwidth
free
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880096155.6A
Other languages
Chinese (zh)
Other versions
CN112534788A (en)
Inventor
喻湘宁
马可
段建军
刘昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Publication of CN112534788A publication Critical patent/CN112534788A/en
Application granted granted Critical
Publication of CN112534788B publication Critical patent/CN112534788B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/50 Queue scheduling
    • H04L 47/62 Queue scheduling characterised by scheduling criteria
    • H04L 47/625 Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L 47/628 Queue scheduling characterised by scheduling criteria for service slots or service orders based on packet size, e.g. shortest packet first

Abstract

Systems and methods are provided for improving multithreaded performance in a computing system having multiple processors by reducing latency and controlling bandwidth through the following steps: grouping network data packets associated with the processors into bandwidth groups based on bandwidth requirements; aggregating the network data packets into a lock-free drain queue; performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet; and sending an inter-processor interrupt (IPI) to a corresponding processor of the currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.

Description

Lock-free pipelined network data packet bandwidth control
Background
In recent years, many businesses have increasingly relied on computing service providers to manage data storage, computing, and the like. These providers support varying service needs for different businesses, which may be referred to as a mixed or multi-tenant scenario. In a multi-tenant scenario, different services share the same physical Network Interface Card (NIC), yet have very different network quality-of-service requirements. For example, online traffic such as e-commerce may require low latency and a high packet-per-second (PPS) rate, whereas offline services such as big data may be insensitive to latency but have high bandwidth requirements. There is therefore a need to manage flows or traffic groups efficiently, isolating and sharing network resources through the same NIC to meet these differing requirements.
Currently, some commercially available Traffic Control (TC) solutions occupy up to 100% of some of a server's Central Processing Unit (CPU) resources to manage flow control, yet still fail to meet high PPS requirements or are unsuitable for large-scale deployment. For example, some flow control algorithms dedicate CPUs to running the flow control algorithm in a busy loop, which consumes valuable CPU resources without ensuring that a network data packet will be processed by its original CPU. If a network data packet is not processed by its original CPU, system performance may be degraded by cache locality issues. Other control algorithms access current bandwidth usage simultaneously from threads on a large number of CPUs and take locks for both bandwidth sharing and bandwidth limiting. The performance of such systems is typically bounded by the global lock, which degrades multithreaded processing performance.
Drawings
The detailed description is set forth with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
Fig. 1 illustrates an example process for bandwidth-based network data packet control.
Fig. 2 illustrates an example flow diagram detailing one of the blocks of fig. 1.
Fig. 3 illustrates an example system for implementing the above-described processes and methods for bandwidth-based network data packet control.
Fig. 4 illustrates a diagrammatic representation 400 of bandwidth-based network data packet control.
Detailed Description
The methods and systems discussed herein relate to improving network resource isolation, and more particularly to improving multithreaded processing performance in a computing system having multiple processors by reducing latency and controlling bandwidth.
In a multi-tenant scenario, the system shares the same physical Network Interface Card (NIC) among different services with different network quality-of-service requirements. The system may include multiple processors, such as CPUs, for controlling network data packets associated with the multiple CPUs. The CPUs may include physical CPUs running on a physical machine and virtual CPUs running on a virtual machine. Rather than dedicating CPUs to flows or groups of traffic, each CPU may be scheduled to perform flow control tasks based on a predetermined schedule, such as a round-robin schedule that provides some fairness among the CPUs, i.e., a single CPU should not perform more flow control tasks than other CPUs. The currently scheduled CPU may also be referred to as the execution CPU. Each CPU may hold its network data packets in its own lock-free queue. The execution CPU may group the network data packets associated with the plurality of CPUs into bandwidth groups based on their corresponding bandwidth requirements and aggregate the network data packets into a lock-free drain queue. As an aggregation thread, the execution CPU may aggregate network data packets into the lock-free drain queue by scanning the queue of each of the plurality of CPUs at a high rate, for example once every one to two microseconds, and queuing the respective network data packets in a predetermined order of the plurality of CPUs, e.g., round-robin. During this aggregation phase, the content of a network data packet is not accessed. The network data packet is therefore likely to remain in the cache of its original CPU, thereby preserving cache locality and avoiding the problem of multiple threads accessing the data simultaneously.
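By way of illustration, each per-CPU lock-free queue may be realized as a single-producer/single-consumer ring in which the original sending CPU pushes packet pointers and the aggregation thread pops them, with no lock on either side. The following is a minimal sketch in C11; the identifiers (pkt_ring, ring_push, ring_pop) and the ring size are illustrative assumptions rather than details taken from the patent.

    /* Minimal sketch of a per-CPU single-producer/single-consumer lock-free
     * ring: the original sending CPU enqueues, the aggregation thread
     * dequeues. All identifiers here are illustrative, not from the patent. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024              /* power of two, so indices can be masked */

    struct pkt_ring {
        _Atomic size_t head;            /* advanced only by the consumer */
        _Atomic size_t tail;            /* advanced only by the producer */
        void *slots[RING_SIZE];         /* network data packet pointers */
    };

    /* Producer side: the original sending CPU parks its packet pointer here. */
    static bool ring_push(struct pkt_ring *r, void *pkt)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;               /* ring full; caller may retry */
        r->slots[tail & (RING_SIZE - 1)] = pkt;
        /* Release ensures the consumer sees the slot before the new tail. */
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: the aggregation thread drains one pointer, lock-free. */
    static void *ring_pop(struct pkt_ring *r)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                /* ring empty */
        void *pkt = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return pkt;
    }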
The execution CPU may then perform bandwidth control on the lock-free drain queue based on the respective length of each network data packet, e.g., by running a token bucket algorithm on the lock-free drain queue or using a thread dedicated to each bandwidth group, and, as a drain thread, send an inter-processor interrupt (IPI) to the corresponding CPU of the network data packet currently queued in the lock-free drain queue to continue processing the queued network data packet.
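A sketch of this drain step is shown below: a token bucket gates each bandwidth group by packet length, and an IPI hands an admitted packet back to its original CPU. The clock source, the send_ipi() primitive, and all identifiers are assumptions added for illustration; the patent specifies the token bucket algorithm and the IPI, not this exact shape.

    /* Sketch of token-bucket bandwidth control over the drain queue.
     * now_ns() and send_ipi() stand in for platform primitives and are
     * assumed, not specified by the patent. */
    #include <stdbool.h>
    #include <stdint.h>

    struct token_bucket {
        uint64_t tokens;    /* bytes currently allowed to pass */
        uint64_t burst;     /* bucket capacity in bytes */
        uint64_t rate;      /* refill rate in bytes per second */
        uint64_t last_ns;   /* timestamp of the previous refill */
    };

    extern uint64_t now_ns(void);       /* assumed monotonic clock */
    extern void send_ipi(int cpu);      /* assumed inter-processor interrupt */

    static bool bucket_admit(struct token_bucket *b, uint64_t pkt_len)
    {
        uint64_t now = now_ns();
        uint64_t refill = (now - b->last_ns) * b->rate / 1000000000ull;
        uint64_t filled = b->tokens + refill;
        b->tokens = filled > b->burst ? b->burst : filled;
        b->last_ns = now;
        if (b->tokens < pkt_len)
            return false;               /* over budget; packet stays queued */
        b->tokens -= pkt_len;
        return true;
    }

    /* Drain step: if the head packet fits its group's budget, interrupt
     * the packet's original CPU so it continues the processing there. */
    static void drain_one(struct token_bucket *b, uint64_t pkt_len, int src_cpu)
    {
        if (bucket_admit(b, pkt_len))
            send_ipi(src_cpu);
    }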
By performing the above process, the system can effectively align the aggregation and drain threads with the bandwidth groups, and align the enqueue and dequeue threads with the appropriate CPUs, so that the original sending CPU retains and processes its network data packets, which allows better concurrency among the different process or pipeline stages.
To further improve the bandwidth performance of the system by decoupling bandwidth sharing from bandwidth limiting, the execution CPU may run a bandwidth allocation thread on one of the plurality of CPUs at a low frequency, such as once per second, to dynamically adjust the corresponding maximum bandwidth of each of the bandwidth groups.
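One plausible shape for such an allocation pass is sketched below: roughly once per second, it measures each group's recent demand and resizes each group's ceiling in proportion. The demand-proportional policy is an illustrative assumption; the patent states only that the maximum bandwidth of each group is adjusted dynamically.

    /* Sketch of a once-per-second bandwidth allocation pass. The
     * demand-proportional split is an illustrative assumption. */
    #include <stdint.h>

    struct bw_group {
        uint64_t demand;    /* bytes/s the group recently tried to send */
        uint64_t max_bw;    /* per-group ceiling, bytes/s */
    };

    static void reallocate(struct bw_group *groups, int n, uint64_t link_bw)
    {
        uint64_t total_demand = 0;
        for (int i = 0; i < n; i++)
            total_demand += groups[i].demand;
        for (int i = 0; i < n; i++) {
            /* Busy groups grow toward the link capacity; idle groups
             * release bandwidth instead of holding a fixed reservation. */
            groups[i].max_bw = total_demand
                ? link_bw * groups[i].demand / total_demand
                : link_bw / (uint64_t)n;
        }
    }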
Fig. 1 illustrates an example process 100 for bandwidth-based network data packet control.
In a system having multiple processors, such as CPUs, for controlling network data packets associated with the multiple CPUs, each CPU may be scheduled based on a predetermined schedule at block 102, and the currently scheduled CPU may be selected at block 104 to carry out network data packet control tasks within a scheduling interval. The CPUs may include physical CPUs running on a physical machine and virtual CPUs running on a virtual machine. The predetermined schedule may be generated in a round-robin fashion, which may ensure some fairness among the CPUs, i.e., a single CPU may not be scheduled to perform more network data packet control tasks than other CPUs. Rather than having a separate dedicated CPU perform the network data packet control tasks, interspersing these tasks among the existing network data packet processing CPUs may reduce overhead in both hardware and complexity. Each CPU may hold its network data packets in its own lock-free queue.
At block 106, the execution CPU may group the network data packets associated with the plurality of CPUs into bandwidth groups based on the bandwidth requirements of the corresponding network data packets and, at block 108, aggregate the network data packets into a lock-free drain queue as an aggregation thread. Because the contents of the network data packets are not accessed during this aggregation phase, each network data packet is likely to remain in the cache of its original CPU, which may preserve cache locality and avoid the problem of multiple threads accessing the data simultaneously.
At block 110, the execution CPU may perform bandwidth control on the lock-free drain queue based on the length of each network data packet. For example, the execution CPU may run a token bucket algorithm on the lock-free drain queue or use a thread dedicated to each bandwidth group. As a drain thread, the execution CPU may then, at block 112, send an inter-processor interrupt (IPI) to the CPU of the network data packet currently queued in the lock-free drain queue, so that that CPU continues processing the packet.
The process described above with reference to fig. 1 can effectively align the aggregation and drain threads with the bandwidth groups and, by aligning the enqueue and dequeue threads with the appropriate CPUs so that the original sending CPU retains and processes its network data packets, improve the concurrency of the different process or pipeline stages.
Fig. 2 illustrates an example flow diagram detailing block 108 of fig. 1.
At block 202, the execution CPU may scan the queue of each of the plurality of CPUs at a high frequency, for example, at a rate between once every microsecond and once every two microseconds. At block 204, the execution CPU may queue the respective network data packets of the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
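To make blocks 202 and 204 concrete, the sketch below visits the per-CPU rings in a fixed round-robin order and moves packet pointers into the drain queue. It reuses the ring_pop() helper from the earlier ring sketch; drain_enqueue() and the CPU count are assumed names added for illustration.

    /* Sketch of one aggregation pass over the per-CPU lock-free queues,
     * run every one to two microseconds. ring_pop() is the consumer
     * helper from the earlier ring sketch; drain_enqueue() is assumed. */
    #define NCPUS 4

    extern struct pkt_ring per_cpu_ring[NCPUS];
    extern void drain_enqueue(void *pkt, int src_cpu);

    static void aggregate_once(void)
    {
        for (int cpu = 0; cpu < NCPUS; cpu++) {
            void *pkt = ring_pop(&per_cpu_ring[cpu]);
            if (pkt != NULL)
                /* Only the pointer moves; the packet payload stays in
                 * the original CPU's cache, preserving locality. */
                drain_enqueue(pkt, cpu);
        }
    }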
To improve the bandwidth performance of the system by decoupling bandwidth sharing from bandwidth limiting, the execution CPU may concurrently run a bandwidth allocation thread on one of the CPUs to dynamically adjust the corresponding maximum bandwidth of each of the bandwidth groups. The bandwidth allocation thread may run at a low frequency, for example, at a rate of once per second.
Fig. 3 illustrates an example system 300 for implementing the above-described processes and methods for bandwidth-based network data packet control.
The techniques and mechanisms described herein may be implemented by multiple instances of system 300, as well as by any other computing device, system, and/or environment, including cloud computing. The system 300 shown in fig. 3 is only one example of a system and is not intended to suggest any limitation as to the scope of use or functionality of any computing device for performing the processes and/or programs described above. Other well known computing devices, systems, environments, and/or configurations that may be suitable for use with embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, implementations using field programmable gate arrays ("FPGAs") and application specific integrated circuits ("ASICs"), and/or the like.
The system 300 may include a plurality of processors (four processors 302, 304, 306, and 308 are shown as CPUs in this example) and a memory 310 coupled to the processors. Each of the processors 302, 304, 306, and 308 may in turn act as the execution processor, executing computer-executable instructions stored in the memory 310 to perform the various functions described above with reference to fig. 1 and fig. 2. Processors 302, 304, 306, and 308 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), both a CPU and a GPU, or other processing units or components known in the art. Additionally, each of the processors 302, 304, 306, and 308 may have its own local memory, which may also store program modules, program data, and/or one or more operating systems.
Depending on the exact configuration and type of system 300, memory 310 may be volatile, such as RAM, non-volatile, such as ROM, flash memory, a miniature hard drive, a memory card, etc., or some combination thereof. Memory 310 may include one or more computer-executable modules (two modules 312 and 314 are shown in this example) that are executable by processors 302, 304, 306, and 308.
The system 300 may additionally include an input/output (I/O) interface 316 for receiving data, such as network data packets, and for outputting processed data. The system 300 may also include a communication module 318 and a network interface module 320 to allow the system 300 to communicate with other devices or systems 322 over a network 324. The network 324 may include the Internet, wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
In the system 300, the execution CPU may be selected based on a predetermined schedule, which may be generated in a round-robin fashion to ensure some fairness among the CPUs, that is, a single CPU may not be scheduled to perform more network data packet control tasks than other CPUs. For example, the order in which the CPUs serve as the execution CPU may be CPU 302, CPU 304, CPU 306, CPU 308, then back to CPU 302, and so on. Depending on the process being performed, the interval during which each CPU acts as the execution CPU may not be the same each time; over time, however, the intervals average out. Rather than having a separate dedicated CPU perform the network data packet control tasks, interspersing these tasks among the existing network data packet processing CPUs may reduce overhead in both hardware and complexity. Each CPU may hold its network data packets in its own lock-free queue.
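A minimal sketch of this rotation, assuming the four CPUs of fig. 3 are indexed 0 through 3: an atomic counter advances the duty in round-robin order so that no CPU is scheduled for more control work than the others. The counter-based scheme is an illustrative assumption.

    /* Sketch of round-robin selection of the execution CPU among the
     * four CPUs of fig. 3 (indexed 0..3). Illustrative assumption only. */
    #include <stdatomic.h>

    #define NCPUS 4

    static _Atomic unsigned next_turn;

    static int pick_execution_cpu(void)
    {
        unsigned turn = atomic_fetch_add(&next_turn, 1);
        return (int)(turn % NCPUS);     /* 302, 304, 306, 308, then repeat */
    }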
The execution CPU, in this example CPU 302, may group the network data packets associated with CPUs 302, 304, 306, and 308 into bandwidth groups based on the bandwidth requirements of the corresponding network data packets and, as an aggregation thread, aggregate the network data packets into a lock-free drain queue. The execution CPU 302 may scan the queue of each of the CPUs 302, 304, 306, and 308 at a high frequency, for example, once every one to two microseconds, and queue the respective network data packets in the lock-free drain queue in a predetermined order of the CPUs 302, 304, 306, and 308, which may be round-robin. In this aggregation phase, the content of the network data packets is not accessed. Each network data packet is therefore likely to remain in the cache of its original CPU, thereby preserving cache locality and avoiding contention from multiple threads accessing the data simultaneously.
The execution CPU 302 may then perform bandwidth control on the lock-free drain queue based on the respective length of each network data packet and, as a drain thread, send an inter-processor interrupt (IPI) to the CPU of the currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet. The execution CPU 302 may perform the bandwidth control by running a token bucket algorithm on the lock-free drain queue or using a thread dedicated to each bandwidth group.
The system described above with reference to fig. 3 can effectively align the aggregation and drain threads with the bandwidth groups and, by aligning the enqueue and dequeue threads with the appropriate CPUs so that the original sending CPU retains and processes its network data packets, improve the concurrency of the different process or pipeline stages.
To improve the bandwidth performance of the system 300 by decoupling bandwidth sharing from bandwidth limiting, the execution CPU 302 may run a bandwidth allocation thread on one of the plurality of CPUs at a low frequency, e.g., once per second, to dynamically adjust the corresponding maximum bandwidth of each of the bandwidth groups up to the maximum available bandwidth.
Fig. 4 illustrates a diagrammatic representation 400 of bandwidth-based network data packet control.
Network data packets 402, 404, 406, and 408 may be held in the lock-free queues of CPUs 302, 304, 306, and 308, respectively. The execution CPU 302 may group the network data packets 402, 404, 406, and 408 into bandwidth groups based on the bandwidth requirements of each network data packet and aggregate them into the lock-free drain queue 410 in the predetermined order of the CPUs, as described above with reference to figs. 1, 2, and 3. The execution CPU 302 may then perform bandwidth control on the lock-free drain queue 410 based on the length of each network data packet, for example, by running a token bucket algorithm on the lock-free drain queue 410. The execution CPU 302 may then send an inter-processor interrupt (IPI) to the CPU of the currently queued network data packet in the lock-free drain queue 410, for example CPU 308 when the currently queued packet is network data packet 408, to continue processing the currently queued network data packet.
Some or all of the operations of the above-described methods may be performed by executing computer-readable instructions stored on a computer-readable storage medium as defined below. The term "computer-readable instructions" as used in the specification and claims includes routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, combinations thereof, and the like.
The computer-readable storage medium may include volatile memory (such as Random Access Memory (RAM)) and/or nonvolatile memory (such as Read Only Memory (ROM), flash memory, etc.). Computer-readable storage media may also include additional removable and/or non-removable storage devices, including, but not limited to, flash memory, magnetic storage devices, optical storage devices, and/or tape storage devices, which may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.
Non-transitory computer-readable storage media are examples of computer-readable media. Computer-readable media include at least two types, namely computer-readable storage media and communication media. Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. Memory 310 is an example of a computer-readable storage medium. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transport mechanism. As defined herein, computer-readable storage media do not include communication media.
The computer-readable instructions stored on the one or more non-transitory computer-readable storage media, when executed by the one or more processors, may perform the operations described above with reference to fig. 1-4. Generally, computer readable instructions include routines, programs, objects, components, data structures, etc. that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the process.
Example clauses
A. A method in a system comprising a plurality of Central Processing Units (CPUs) for controlling network data packets associated with the plurality of CPUs, the method comprising: grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets; aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
B. The method of paragraph A, further comprising: sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
C. The method of paragraph A, prior to grouping the network data packets associated with the plurality of CPUs into bandwidth groups based on the bandwidth requirements, further comprising: scheduling each of the plurality of CPUs based on a predetermined schedule; and selecting a currently scheduled CPU, wherein the method according to paragraph A is performed by the currently scheduled CPU.
D. The method of paragraph A, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
E. The method of paragraph A, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
F. The method of paragraph A, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet includes using a bandwidth-group-dedicated thread.
G. The method of paragraph A, further comprising: running a bandwidth allocation thread on one of the plurality of CPUs; and dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
H. The method of paragraph G, wherein running the bandwidth allocation thread on the one CPU comprises running the bandwidth allocation thread on the one CPU at a rate of once per second.
I. The method as paragraph A recites, wherein the network data packet is held in a lock-free queue of the corresponding CPU.
J. A system, the system comprising: a plurality of Central Processing Units (CPUs); a memory coupled to the plurality of CPUs, the memory storing computer-executable instructions executable by any of the plurality of CPUs, the computer-executable instructions, when executed, causing an executing CPU of the plurality of CPUs to perform operations comprising: grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets; aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
K. The system of paragraph J, wherein the operations further comprise: sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
L. The system of paragraph J, wherein the executing CPU is selected based on a predetermined schedule.
M. The system of paragraph J, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
N. The system as paragraph J recites, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
O. The system as paragraph J recites, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet comprises using a bandwidth-group-dedicated thread.
P. The system as paragraph J recites, wherein the operations further comprise: running a bandwidth allocation thread on one of the plurality of CPUs; and dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
Q. The system of paragraph P, wherein running the bandwidth allocation thread on the one CPU comprises running the bandwidth allocation thread on the one CPU at a rate of once per second.
R. The system as paragraph J recites, wherein the network data packets are held in lock-free queues of the corresponding CPUs.
S. A computer-readable medium storing computer-readable instructions executable by any of a plurality of CPUs in a system, the computer-readable instructions, when executed, causing an executing CPU of the plurality of CPUs to perform operations comprising: grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets; aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
T. The computer-readable medium of paragraph S, wherein the operations further comprise: sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
U. The computer-readable medium of paragraph S, wherein the executing CPU is selected based on a predetermined schedule.
V. The computer-readable medium of paragraph S, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
W. The computer-readable medium of paragraph S, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises: queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
X. The computer-readable medium of paragraph S, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet includes using a bandwidth-group-dedicated thread.
Y. The computer-readable medium of paragraph S, wherein the operations further comprise: running a bandwidth allocation thread on one of the plurality of CPUs; and dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
Z. The computer-readable medium of paragraph Y, wherein running the bandwidth allocation thread on the one CPU comprises running the bandwidth allocation thread on the one CPU at a rate of once per second.
AA. The computer-readable medium as paragraph S recites, wherein the network data packets are held in lock-free queues of the corresponding CPUs.
Conclusion
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claims.

Claims (27)

1. A method in a system comprising a plurality of Central Processing Units (CPUs) for controlling network data packets associated with the plurality of CPUs, the method comprising:
grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets;
aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and
performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
2. The method of claim 1, further comprising:
sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
3. The method of claim 1, wherein prior to grouping the network data packets associated with the plurality of CPUs into bandwidth groups based on the bandwidth requirements, the method further comprises:
scheduling each of the plurality of CPUs based on a predetermined schedule; and
selecting a currently scheduled CPU,
wherein the method of claim 1 is performed by the currently scheduled CPU.
4. The method of claim 1, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
5. The method of claim 1, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
6. The method of claim 1, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet comprises using a bandwidth-group-dedicated thread.
7. The method of claim 1, further comprising:
running a bandwidth allocation thread on one of the plurality of CPUs; and
dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
8. The method of claim 7, wherein running the bandwidth allocation thread on the one of the plurality of CPUs comprises running the bandwidth allocation thread on the one of the plurality of CPUs at a rate of once per second.
9. The method of claim 1, wherein the network data packet is held in a lock-free queue of the corresponding CPU.
10. A system for controlling network data packets associated with a plurality of Central Processing Units (CPUs), the system comprising:
a plurality of CPUs;
a memory coupled to the plurality of CPUs, the memory storing computer-executable instructions executable by any of the plurality of CPUs, the computer-executable instructions, when executed, causing an executing CPU of the plurality of CPUs to perform operations comprising:
grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets;
aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and
performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
11. The system of claim 10, wherein the operations further comprise:
sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
12. The system of claim 11, wherein the execution CPU is selected based on a predetermined schedule.
13. The system of claim 11, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
14. The system of claim 11, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
15. The system of claim 11, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet comprises using a bandwidth-group-dedicated thread.
16. The system of claim 11, wherein the operations further comprise:
running a bandwidth allocation thread on one of the plurality of CPUs; and
dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
17. The system of claim 16, wherein running the bandwidth allocation thread on the one of the plurality of CPUs comprises running the bandwidth allocation thread on the one of the plurality of CPUs at a rate of once per second.
18. The system of claim 11, wherein the network data packet is held in a lock-free queue of the corresponding CPU.
19. A computer readable medium storing computer readable instructions executable by any of a plurality of CPUs in a system, the computer readable instructions, when executed, causing an executing CPU of the plurality of CPUs to perform operations comprising:
grouping network data packets associated with the plurality of CPUs into bandwidth groups based on bandwidth requirements of corresponding network data packets;
aggregating the network data packets associated with the plurality of CPUs into a lock-free drain queue; and
performing bandwidth control on the lock-free drain queue based on a respective length of each network data packet.
20. The computer-readable medium of claim 19, wherein the operations further comprise:
sending an inter-processor interrupt (IPI) to a corresponding CPU of a currently queued network data packet in the lock-free drain queue to continue processing the currently queued network data packet.
21. The computer readable medium of claim 20, wherein the execution CPU is selected based on a predetermined schedule.
22. The computer-readable medium of claim 20, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
scanning the queue of each of the plurality of CPUs at a rate between once every microsecond and once every two microseconds.
23. The computer-readable medium of claim 20, wherein aggregating the network data packets associated with the plurality of CPUs into the lock-free drain queue comprises:
queuing the respective network data packets associated with the plurality of CPUs in the lock-free drain queue in a predetermined order of the plurality of CPUs.
24. The computer-readable medium of claim 20, wherein performing the bandwidth control on the lock-free drain queue based on the respective length of each network data packet includes using a bandwidth-group-dedicated thread.
25. The computer-readable medium of claim 20, wherein the operations further comprise:
running a bandwidth allocation thread on one of the plurality of CPUs; and
dynamically adjusting a corresponding maximum bandwidth for each of the bandwidth groups.
26. The computer-readable medium of claim 25, wherein running the bandwidth allocation thread on the one of the plurality of CPUs includes running the bandwidth allocation thread on the one of the plurality of CPUs at a rate of once per second.
27. The computer-readable medium of claim 20, wherein the network data packet is held in a lock-free queue of the corresponding CPU.
CN201880096155.6A 2018-09-04 2018-09-04 Lock-free pipelined network data packet bandwidth control Active CN112534788B (en)

Applications Claiming Priority (1)

Application Number: PCT/CN2018/103967 (published as WO2020047740A1)
Priority Date / Filing Date: 2018-09-04
Title: Lockless pipelined network data packet bandwidth control

Publications (2)

Publication Number and Publication Date
CN112534788A (en), published 2021-03-19
CN112534788B (en), published 2022-11-11

Family ID: 69721462

Family Applications (1)

Application Number: CN201880096155.6A (granted as CN112534788B, legal status: Active)
Title: Lock-free pipelined network data packet bandwidth control
Priority Date / Filing Date: 2018-09-04

Country Status (2)

Country Link
CN (1) CN112534788B (en)
WO (1) WO2020047740A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103957470A (en) * 2014-05-14 2014-07-30 浙江水利水电学院 Video-oriented traffic control and optimization method and system
CN104145438A (en) * 2012-02-13 2014-11-12 马维尔国际贸易有限公司 Method and apparatus for dynamically allocating bandwidth to a client in a passive optical network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7948896B2 (en) * 2005-02-18 2011-05-24 Broadcom Corporation Weighted-fair-queuing relative bandwidth sharing
US20060227788A1 (en) * 2005-03-29 2006-10-12 Avigdor Eldar Managing queues of packets
US10031786B2 (en) * 2016-01-13 2018-07-24 International Business Machines Corporation Lockless multithreaded completion queue access


Also Published As

Publication number Publication date
WO2020047740A1 (en) 2020-03-12
CN112534788A (en) 2021-03-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant