CN116113917A - Lightweight thread (LWT) rebalancing in storage systems - Google Patents


Info

Publication number
CN116113917A
Authority
CN
China
Prior art keywords
core
partition
numa
rebalancing
queue
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080103408.5A
Other languages
Chinese (zh)
Inventor
陈丽莉
万力
汤志豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN116113917A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Power Sources (AREA)
  • Multi Processors (AREA)

Abstract

A computer-implemented method for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes includes performing a level 1 rebalancing during a current rebalancing period. The level 1 rebalancing includes determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions. A first processing core in the subset is selected based on a maximum queue depth of the plurality of queue depths. A second processing core in the subset is selected based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. A load group of one or more load groups in the first processing core is moved to one or more load groups in the second processing core.

Description

Lightweight thread (LWT) rebalancing in storage systems
Technical Field
The present disclosure relates to storage node computing. Some aspects relate to group-based lightweight thread (LWT) rebalancing in storage systems, including LWT lock-aware mapping, to avoid locking of processing resources.
Background
In a distributed data storage network architecture, input/output (I/O) processes are implemented as LWTs in a central processing unit (CPU) system of a computing device (e.g., a storage node in a distributed data store). Each LWT is mapped to one of a plurality of available processing cores for execution. When the computational load of the LWTs across the cores becomes unbalanced, the LWTs must be rebalanced within the CPU system. However, rebalancing can lead to performance degradation associated with resource contention and inefficient cache and memory usage.
Disclosure of Invention
Various examples are now described to briefly introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of the present disclosure, a computer-implemented method for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes of a communication network is provided. The method includes performing a level 1 rebalancing during a current rebalancing period. The level 1 rebalancing includes: determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions. Each queue depth in the plurality of queue depths indicates, for each processing core in the subset, a number of LWTs scheduled for execution by that processing core. The LWTs are grouped into one or more load groups within a processing core. A first processing core in the subset is selected based on a maximum queue depth of the plurality of queue depths. A second processing core in the subset is selected based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. A load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
In a first implementation of the method according to the first aspect, the moving during the level 1 rebalancing is repeated until a difference between a queue depth of the first processing core and a queue depth of the second processing core is less than a threshold number.
In a second implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, the level 1 rebalancing is performed periodically based on a preconfigured rebalancing period.
In a third implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the plurality of queue depths is determined based on an exponential moving average using a historical queue depth of the subset of the plurality of processing cores, the historical queue depth being determined during the previous rebalancing period.
In a fourth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a subset of the LWTs is grouped into a first load group of the one or more load groups within the first processing core. The subset of the LWTs includes competing LWTs, and the LWTs in the subset are executed sequentially.
In a fifth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a subset of the plurality of core partitions is associated with a first NUMA memory domain of a plurality of non-uniform memory access (NUMA) memory domains. The subset of the plurality of core partitions includes the first core partition.
In a sixth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a level 2 rebalancing is performed on the first NUMA memory domain. The level 2 rebalancing includes: determining a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths. Each partition queue depth of the plurality of partition queue depths is based on an average of queue depths of a plurality of processing cores of a corresponding core partition of the subset. From the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths is selected. From the subset of the plurality of core partitions, a core partition having a highest partition queue depth of the plurality of partition queue depths is selected.
In a seventh implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a processing core is selected from the core partition having the lowest partition queue depth. The selected processing core is associated with a highest core sleep time among the remaining processing cores in the core partition having the lowest partition queue depth. The selected processing core is migrated from the core partition having the lowest partition queue depth to the core partition having the highest partition queue depth.
In an eighth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a level 3 rebalancing is performed on the plurality of NUMA memory domains. The level 3 rebalancing includes: determining a NUMA queue depth for each of the plurality of NUMA memory domains to obtain a plurality of NUMA queue depths. A NUMA memory domain having a highest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains. A NUMA memory domain having a lowest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains.
In a ninth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, a core partition is selected for rebalancing. The selected core partition exists in both the NUMA memory domain having the lowest NUMA queue depth and the NUMA memory domain having the highest NUMA queue depth. At least one load group associated with the core partition selected for rebalancing is migrated from the NUMA memory domain having the highest NUMA queue depth to the NUMA memory domain having the lowest NUMA queue depth.
In a tenth implementation form of the method according to the first aspect as such or any of the implementation forms of the first aspect, power saving rebalancing is performed on the plurality of NUMA memory domains. The power saving rebalancing includes: determining a NUMA memory domain of the plurality of NUMA memory domains having a lowest processing core count. A released NUMA memory domain is generated by moving an available load group from the NUMA memory domain having the lowest processing core count to at least another one of the plurality of NUMA memory domains. The released NUMA memory domain is placed in a power saving mode.
According to a second aspect of the present disclosure, a system for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes of a communication network is provided. The system includes: a memory storing instructions; and one or more processors in communication with the memory. To perform a level 1 rebalancing during a current rebalancing period, the one or more processors execute the instructions to: determine a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions. Each queue depth in the plurality of queue depths indicates, for each processing core in the subset, a number of LWTs scheduled for execution by that processing core. The LWTs are grouped into one or more load groups within a processing core. A first processing core in the subset is selected based on a maximum queue depth of the plurality of queue depths. A second processing core in the subset is selected based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. A load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
In a first implementation form of the distributed storage node according to the second aspect, the moving during the level 1 rebalancing is repeated until a difference between a queue depth of the first processing core and a queue depth of the second processing core is less than a threshold number. The plurality of queue depths are determined based on an exponential moving average using historical queue depths for the subset of the plurality of processing cores. The historical queue depth is determined during the previous rebalancing period.
In a second implementation form of the distributed storage node according to the second aspect as such or any of the implementation forms of the second aspect, a subset of the LWTs includes competing LWTs, and the subset of the LWTs is grouped into a first load group of the one or more load groups within the processing core.
In a third implementation form of the distributed storage node according to the second aspect as such or any of the implementation forms of the second aspect, a subset of the plurality of core partitions is associated with a first NUMA memory domain of a plurality of non-uniform memory access (NUMA) memory domains. The subset of the plurality of core partitions includes the first core partition. To perform a level 2 rebalancing of the first NUMA memory domain, the one or more processors execute the instructions to: determine a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths. Each partition queue depth of the plurality of partition queue depths is based on an average of queue depths of a plurality of processing cores of a corresponding core partition of the subset. From the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths is selected. From the subset of the plurality of core partitions, a core partition having a highest partition queue depth of the plurality of partition queue depths is selected.
In a fourth implementation form of the distributed storage node according to the second aspect as such or any of the implementation forms of the second aspect, a processing core is selected from the core partition having the lowest partition queue depth. The selected processing core is associated with a highest core sleep time among the remaining processing cores in the core partition having the lowest partition queue depth. The selected processing core is migrated from the core partition having the lowest partition queue depth to the core partition having the highest partition queue depth.
In a fifth implementation form of the distributed storage node according to the second aspect as such or any of the implementation forms of the second aspect, to perform a level 3 rebalancing of the plurality of NUMA memory domains, the one or more processors execute the instructions to: determine a NUMA queue depth for each of the plurality of NUMA memory domains to obtain a plurality of NUMA queue depths. A NUMA memory domain having a highest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains. A NUMA memory domain having a lowest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains.
In a sixth implementation form of the distributed storage node according to the second aspect as such or any of the implementation forms of the second aspect, the one or more processors execute the instructions to: select a core partition for rebalancing. The selected core partition exists in both the NUMA memory domain having the lowest NUMA queue depth and the NUMA memory domain having the highest NUMA queue depth. At least one load group associated with the core partition selected for rebalancing is migrated from the NUMA memory domain having the highest NUMA queue depth to the NUMA memory domain having the lowest NUMA queue depth.
According to a third aspect of the present disclosure, there is provided a non-transitory computer readable medium storing instructions for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes of a communication network. To perform a level 1 rebalancing during a current rebalancing period, the instructions, when executed by one or more processors of a computing device, cause the one or more processors to: determine a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions. Each queue depth in the plurality of queue depths indicates, for each processing core in the subset, a number of LWTs scheduled for execution by that processing core. The LWTs are grouped into one or more load groups within a processing core. A first processing core in the subset is selected based on a maximum queue depth of the plurality of queue depths. A second processing core in the subset is selected based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. A load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
In a first implementation form of the non-transitory computer readable medium according to the third aspect, a subset of the plurality of core partitions is associated with a first NUMA memory domain of a plurality of non-uniform memory access (NUMA) memory domains, the subset of the plurality of core partitions including the first core partition. To perform a level 2 rebalancing on the first NUMA memory domain, the instructions, when executed by the one or more processors, further cause the one or more processors to: determine a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths. Each partition queue depth of the plurality of partition queue depths is based on an average of queue depths of a plurality of processing cores of a corresponding core partition of the subset. From the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths is selected. From the subset of the plurality of core partitions, a core partition having a highest partition queue depth of the plurality of partition queue depths is selected.
According to a fourth aspect of the present disclosure, a system for performing rebalancing of LWTs within distributed storage nodes of a communication network is provided. The system includes means for determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions. Each queue depth in the plurality of queue depths indicates, for each processing core in the subset, a number of LWTs scheduled for execution by that processing core, the LWTs being grouped into one or more load groups within the processing core. The system further includes selecting means for selecting a first processing core in the subset based on a maximum queue depth of the plurality of queue depths. The selecting means is further for selecting a second processing core in the subset based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. The system further includes LWT moving means for redistributing (moving) a load group of the one or more load groups of the first processing core to one or more load groups of the second processing core.
Any of the above examples may be combined with any one or more of the other examples described above to create new embodiments within the scope of the present disclosure.
Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The drawings illustrate generally, by way of example and not by way of limitation, the various embodiments discussed herein.
FIG. 1 is a simplified system overview of a network architecture using distributed storage nodes with LWT rebalancing functionality, according to an example embodiment.
FIG. 2 is a block diagram of a CPU system having multiple cores configured with a stackable scheduler and load bank for performing LWT rebalancing, according to an example embodiment.
FIG. 3 is a block diagram of a non-uniform memory access (NUMA) memory domain having core partitions that can be used to perform LWT rebalancing according to an example embodiment.
FIGS. 4A and 4B are flowcharts of a method suitable for performing level 1 rebalancing within core sub-partitions in a NUMA memory domain according to one illustrative embodiment.
FIG. 5 is a flowchart of a method suitable for performing level 2 rebalancing within NUMA memory domains and between core sub-partitions according to an example embodiment.
FIG. 6 is a flowchart of a method suitable for performing level 3 rebalancing between NUMA memory domains according to an illustrative embodiment.
FIG. 7 is a flowchart of a method suitable for performing LWT power saving rebalancing, according to an example embodiment.
FIG. 8 is a block diagram illustrating a representative software architecture that may be used in connection with the various device hardware and LWT rebalancing techniques described herein, according to one illustrative embodiment.
FIG. 9 is a block diagram illustrating a circuit of an apparatus implementing an algorithm and performing a method according to an exemplary embodiment.
Detailed Description
It should be understood at the outset that although illustrative implementations of one or more embodiments are provided below, the disclosed systems and/or methods described in connection with fig. 1-9 may be implemented using any number of techniques, whether currently known or not. The disclosure should in no way be limited to the illustrative embodiments, figures, and techniques shown below, including the exemplary designs and embodiments illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
The following detailed description, taken in conjunction with the accompanying drawings, is a part of the description and shows, by way of illustration, specific embodiments that may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the inventive subject matter, and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the present disclosure. The following description of exemplary embodiments is, therefore, not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
The techniques disclosed herein may be used to perform computational load rebalancing in a distributed storage architecture or another type of network-based service infrastructure (e.g., a Mobile Backend-as-a-Service (MBaaS) infrastructure). More specifically, the disclosed rebalancing techniques may be used to implement: lock-free rebalancing to reduce collisions; rebalancing of LWTs within the same non-uniform memory access (NUMA) memory domain to improve memory resource utilization efficiency; and rebalancing between NUMA domains (e.g., based on user requests, during power saving modes of the CPU system, or for specific computationally intensive workloads).
In this context, the term "network-based service infrastructure" includes a plurality of network devices that provide on-demand computing capabilities (e.g., through one or more virtual machines or other virtual resources running on the network devices) and storage capabilities in the form of services to a community of end recipients (e.g., clients of the service infrastructure), where the end recipients are communicatively coupled to the network devices within the service infrastructure via a network. A customer of the service infrastructure may use one or more computing devices (also referred to as client devices or host devices) to access and manage the services provided by the service infrastructure (e.g., LWT rebalancing and execution) via a network. The above-described client devices, networks, and network-based service infrastructure may be collectively referred to as a "network architecture." Clients of the service infrastructure may also be referred to as "users" or "tenants."
In this document, the term "lightweight thread" (or LWT) refers to a computer program process that implements a user thread (e.g., a user I/O process) and shares address space and device resources (e.g., memory and processing power) with other threads, which reduces the context switch time during thread execution. In contrast, heavyweight threads each have their own address space and/or device resources, which increases the context switch time between threads.
Herein, the term "load group" refers to a grouping of LWTs within a processing core. In an exemplary embodiment, the load group includes competing LWTs that may be executed sequentially to avoid resource locking and to achieve lock-free thread execution. Herein, the term "competing LWT" means an LWT that will contend for (or contend for) the same processing resources if executed simultaneously.
Herein, the term "core partition" refers to a partition in the same storage node that has a particular number (e.g., two or more) of CPU processing cores. A core partition may span multiple NUMA memory domains (e.g., multiple processing cores associated with the same core partition may be associated with different NUMA memory domains). In this context, the term "core sub-partition" refers to a processing core that belongs to the same core partition and the same NUMA memory domain for a core partition that spans at least two NUMA memory domains. In this regard, a core partition that spans multiple NUMA memory domains is made up of a corresponding plurality of core sub-partitions (e.g., one core sub-partition per NUMA memory domain, as shown in FIG. 3). The term "core sub-partition" is interchangeable with "core partition" when the core partition includes a processing core that encompasses only one NUMA memory domain. Since a core partition may include multiple processing cores, multiple LWTs may be executed on the same core partition (e.g., LWTs in the same or different load groups may be executed sequentially or simultaneously while hyper-threading is enabled for the CPU processing cores). Furthermore, core partitions may be expanded or contracted (e.g., by adding or removing processing cores), processing cores of the same partition do not overlap between different NUMA memory domains, and core partitions do not overlap each other (e.g., a processing core may be associated with only one core partition).
Herein, the "queue depth" of a processing core, core partition, or NUMA memory domain refers to the number of LWTs that are present (and waiting to execute) in such a processing core, core partition, or NUMA memory domain. In some aspects, an average queue depth of a core partition including a plurality of processing cores may be determined by averaging core partition queue depths over a predetermined period of time. Similarly, the average queue depth of a NUMA memory domain that includes multiple core partitions can be determined by averaging the queue depths of the core partitions in the NUMA domain over a predetermined period of time. In some embodiments, the average queue depth of a core partition or NUMA memory domain is determined based on an average (e.g., an exponentially moving average) of historical queue depths (e.g., determined in a previous rebalancing cycle) of the core partition or NUMA memory domain.
The disclosed rebalancing techniques include performing a multi-level rebalancing comprising a level 1 rebalancing, a level 2 rebalancing, and a level 3 rebalancing. Level 1 rebalancing is performed within a core sub-partition of a NUMA memory domain. One or more load groups with LWTs within a core partition may be selected (e.g., based on the queue depth of each core within the core partition), and the selected one or more load groups are migrated within the core partition to achieve the level 1 LWT rebalancing.
Level 2 rebalancing is performed within a NUMA memory domain and across core sub-partitions. A core sub-partition may be selected (e.g., based on a queue depth associated with the core sub-partition), and one or more processing cores may be selected and migrated between the sub-partitions in the NUMA memory domain to achieve the level 2 rebalancing.
Level 3 rebalancing is performed across NUMA memory domains and within a core partition. NUMA memory domains may be selected (e.g., based on the queue depths of the available NUMA memory domains), and one or more load groups associated with the core partition may be moved between the NUMA memory domains (e.g., between sub-partitions of the core partition) to achieve the level 3 rebalancing.
In an exemplary embodiment, the disclosed techniques may also be used to perform power saving rebalancing by switching processing resources associated with one or more NUMA memory domains to a power saving mode (e.g., when the number of active threads associated with a workload is below a threshold number).
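The power saving flow is detailed in connection with FIG. 7. As a rough, hedged sketch only, the idea can be expressed with simple dictionary bookkeeping and a caller-supplied power-mode hook; the data shapes, the set_power_saving callback, and the function name are assumptions, not the patent's interfaces.

def power_saving_rebalance(numa_load_groups, numa_core_counts, set_power_saving):
    """numa_load_groups: {numa_id: [load_group, ...]};
    numa_core_counts: {numa_id: number of processing cores};
    set_power_saving: callable invoked with the NUMA domain being released."""
    # Select the NUMA memory domain with the lowest processing core count.
    source = min(numa_core_counts, key=numa_core_counts.get)
    targets = [n for n in numa_load_groups if n != source]
    if not targets:
        return None
    # Drain its available load groups onto the remaining domains, least loaded first.
    while numa_load_groups[source]:
        destination = min(targets, key=lambda n: len(numa_load_groups[n]))
        numa_load_groups[destination].append(numa_load_groups[source].pop())
    # The released domain can now be placed in a power saving mode.
    set_power_saving(source)
    return source

# Example: NUMA1 has the fewest cores, so it is drained onto NUMA0 and then parked.
numa_load_groups = {"NUMA0": ["lg-1", "lg-2"], "NUMA1": ["lg-3"]}
power_saving_rebalance(numa_load_groups, {"NUMA0": 20, "NUMA1": 10},
                       set_power_saving=lambda numa: print("parking", numa))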
Prior techniques for load rebalancing do not utilize the different levels of rebalancing described above. More specifically, prior techniques do not perform rebalancing by moving load groups of LWTs associated with core partitions between NUMA memory domains selected based on a queue depth analysis of the available NUMA memory domains. Additional new techniques include mapping competing LWTs (e.g., LWTs that utilize the same processing resources during execution) to the same load group, and performing the three-level LWT rebalancing using historical and current queue depth data and core sleep times.
FIG. 1 is a simplified system overview of a network architecture using distributed storage nodes with LWT rebalancing functionality, according to an example embodiment. Referring to FIG. 1, a network architecture 100 may include a plurality of devices (e.g., user devices) 102A, ..., 102N (collectively, devices 102) communicatively coupled to a network-based service infrastructure (e.g., MBaaS) 112 through a network 110. The network-based service infrastructure 112 includes distributed storage nodes 114, ..., 116 and a rebalancing module 118. The distributed storage nodes 114, ..., 116 include: corresponding processing cores 120, ..., 122 and 160, ..., 162; corresponding memories 124 (with NUMA memory domains 146, 148, ..., 150) and 164 (with NUMA memory domains 186, 188, ..., 190); and corresponding mapping modules (MMs) 125, ..., 127, discussed in detail below. The devices 102A, ..., 102N are associated with corresponding users 106A, ..., 106N and may be used to interact with the network-based service infrastructure 112 using a network access client (e.g., one of the clients 104A, ..., 104N). The network access clients 104A, ..., 104N may be implemented as web clients or application (app) clients.
The users 106A, ..., 106N may be referred to generically as a "user 106" or collectively as "users 106." Each user 106 may be a human user (e.g., a person), a machine user (e.g., a computer configured by a software program to interact with the device 102 and the network-based service infrastructure 112), or any suitable combination thereof (e.g., a machine-assisted human or a human-supervised machine). The users 106 are not part of the network architecture 100, but are each associated with one or more of the devices 102 and may be users of the devices 102 (e.g., the user 106A may be an owner of the device 102A, and the user 106N may be an owner of the device 102N). For example, device 102A may be a desktop computer, a vehicle-mounted computer, a tablet, a navigation device, a portable media device, or a smartphone belonging to the user 106A. The users 106A, ..., 106N may use the corresponding devices 102A, ..., 102N to access services (e.g., distributed storage services, LWT rebalancing services, data storage services, data replication services, or other storage-related services) provided by the network-based service infrastructure 112.
The network-based service infrastructure 112 may include a plurality of computing devices, such as distributed storage nodes (DSNs) 114, ..., 116 (also referred to as DSN 1, ..., DSN N). The DSN 114 includes processing cores 120, ..., 122 (also referred to as core 0, ..., core N, or C0, ..., CN), and the processing cores 120, ..., 122 may be part of the CPU system of the DSN 114. The processing core 120 includes a base scheduler (BSCH) 126, which includes suitable circuitry, logic, interfaces, and/or code and is operable to manage a pool of sub-schedulers (SSs) 128, ..., 130. Each of the SSs 128, ..., 130 includes suitable circuitry, logic, interfaces, and/or code and is operable to manage execution of LWTs 132, ..., 134 within a load group associated with the SS. The processing core 122 includes a BSCH 136, which includes suitable circuitry, logic, interfaces, and/or code and is operable to manage a pool of SSs 138, ..., 140. Each of the SSs 138, ..., 140 includes suitable circuitry, logic, interfaces, and/or code and is operable to manage execution of LWTs 142, ..., 144 within a load group associated with the SS. An exemplary arrangement of load groups within a processing core for LWT rebalancing is shown in FIG. 2. The DSN 114 also includes an MM 125, which includes suitable circuitry, logic, interfaces, and/or code and is configured to process incoming I/O processes (e.g., I/O process 108) by assigning an LWT to each I/O process and storing each LWT in a load group (e.g., as shown in FIG. 2) within one of the processing cores 120, ..., 122 according to a mapping algorithm. The DSN 114 also includes memory 124, which may include NUMA memory domains 146, 148, ..., 150 (also referred to as NUMA0, NUMA1, ..., NUMA N). Each of the NUMA memory domains 146, ..., 150 may be used to execute LWTs within the core partitions associated with that NUMA memory domain. An exemplary arrangement of processing cores in multiple NUMA memory domains for LWT rebalancing is shown in FIG. 3.
Similarly, the DSN 116 includes processing cores 160, ..., 162 (also referred to as core 0, ..., core N, or C0, ..., CN), and the processing cores 160, ..., 162 may be part of the CPU system of the DSN 116. The processing core 160 includes a BSCH 166, which includes suitable circuitry, logic, interfaces, and/or code and is operable to manage a pool of SSs 168, ..., 170. Each of the SSs 168, ..., 170 includes suitable circuitry, logic, interfaces, and/or code and is operable to manage execution of LWTs 172, ..., 174 within one or more load groups associated with the SS. The processing core 162 includes a BSCH 176, which includes suitable circuitry, logic, interfaces, and/or code and is operable to manage a pool of SSs 178, ..., 180. Each of the SSs 178, ..., 180 includes suitable circuitry, logic, interfaces, and/or code and is operable to manage execution of LWTs 182, ..., 184 within one or more load groups associated with the SS.
The DSN 116 also includes an MM 127, which includes suitable circuitry, logic, interfaces, and/or code and is configured to process incoming I/O processes (e.g., I/O process 108) by assigning an LWT to each I/O process and storing each LWT in a load group (e.g., as shown in FIG. 2) within one of the processing cores 160, ..., 162 according to a mapping algorithm. The DSN 116 also includes memory 164, which may include NUMA memory domains 186, 188, ..., 190 (also referred to as NUMA0, NUMA1, ..., NUMA N). Each of the NUMA memory domains 186, ..., 190 may be used to execute LWTs within the core partitions associated with that NUMA memory domain.
The network-based service infrastructure 112 also includes a rebalancing module 118, which rebalancing module 118 comprises suitable circuitry, logic, interfaces, and/or code and is operable to perform the LWT rebalancing functions described herein, including performing level 1 rebalancing, level 2 rebalancing, level 3 rebalancing, and LWT power saving rebalancing.
In operation, the device 102A uses the network access client 104A to communicate the I/O process 108 to the DSN 114 over the network 110. The MM 125 in the DSN 114 assigns an LWT to the received I/O process 108 and stores the newly assigned LWT in a load group according to a mapping algorithm (e.g., as shown in FIG. 2). The rebalancing module 118 is used to monitor the queue depth within each processing core, the queue depth of each core partition, and the queue depth of each NUMA memory domain associated with the DSNs 114, ..., 116, and to initiate LWT rebalancing (e.g., level 1 rebalancing, level 2 rebalancing, level 3 rebalancing, and power saving rebalancing) accordingly.
In an exemplary embodiment, the rebalancing module 118 may separately monitor the queue depth within each processing core (e.g., to determine whether to perform a level 1 rebalancing), the queue depth within each core partition (e.g., to determine whether to perform a level 2 rebalancing), and the queue depth within each NUMA memory domain (e.g., to determine whether to perform a level 3 rebalancing or a power saving rebalancing). In this case, the rebalancing module 118 may perform each rebalancing based on the determined queue depth (e.g., when the corresponding queue depth of the processing core, core partition, or NUMA memory domain is equal to or above a corresponding threshold). In another embodiment, the rebalancing module 118 may perform the different levels of rebalancing in sequence. For example, if at least one of the corresponding queue depths of the processing cores, core partitions, or memory domains is equal to or above the corresponding threshold, the rebalancing module 118 may first perform a level 3 rebalancing, followed by a level 2 rebalancing and a level 1 rebalancing (e.g., as described in connection with FIG. 6). Different techniques for performing level 1 rebalancing, level 2 rebalancing, level 3 rebalancing, and power saving rebalancing are discussed in connection with FIGS. 4A through 7.
Although the rebalancing module 118 is shown as a separate module (e.g., executing on a computing device such as a configuration server within the network-based service infrastructure 112), the present disclosure is not so limited, and the rebalancing module may be implemented within at least one of the DSNs 114, ..., 116. In an exemplary embodiment, the rebalancing module 118 may perform the LWT rebalancing functions discussed herein periodically (e.g., based on a preconfigured rebalancing period) or automatically based on a configured threshold (e.g., one or more of the level 1 rebalancing, level 2 rebalancing, level 3 rebalancing, and power saving rebalancing may be triggered when the queue depth of an individual processing core, core partition, or NUMA memory domain is above a threshold). In another embodiment, the rebalancing module 118 may perform the LWT rebalancing functions discussed herein based on rebalancing requests (or commands) from one or more of the devices 102 (e.g., based on an indication received at one or more of the devices 102 that a workload managed by at least one of the DSNs 114, ..., 116 includes an LWT associated with a queue depth above a threshold).
Any of the devices shown in FIG. 1 may be implemented in a general-purpose computer modified (e.g., configured or programmed) by software into a special-purpose computer to perform the functions described herein for that computer, database, or device. A "database," as used herein, is a data storage resource whose storage structure is a text file, a table, a spreadsheet, a relational database (e.g., an object-relational database, a NoSQL database, a network or graph database), a triple store, a hierarchical data store, or any suitable combination thereof. In addition, data accessed (or stored) through an application programming interface (API) or remote procedure call (RPC) may be considered to be accessed from (or stored to) a database. Furthermore, any two or more of the devices or databases shown in FIG. 1 may be combined into a single machine, database, or device, and the functions described herein with respect to any single machine, database, or device may be subdivided among multiple machines, databases, or devices.
The network 110 may be any network that supports communication between machines, databases, and devices (e.g., the devices 102A, ..., 102N and the DSNs 114, ..., 116 in the network-based service infrastructure 112). Accordingly, the network 110 may be a wired network, a wireless network (e.g., a mobile network or a cellular network), or any suitable combination thereof. The network 110 may include one or more portions that form a private network, a public network (e.g., the Internet), or any suitable combination thereof.
FIG. 2 is a block diagram of a CPU system 200 having multiple cores configured with stackable schedulers and load groups for performing LWT rebalancing, according to an example embodiment. Referring to FIG. 2, the CPU system 200 includes multiple cores (e.g., processing cores 202, ..., 204) that may be implemented within a computing device such as one of the DSNs 114, ..., 116 in the network-based service infrastructure 112. The processing core 202 includes a BSCH 206, which is used to schedule a pool 208 of SSs 210, ..., 212. Each SS of the pool 208 is responsible for scheduling execution of the LWTs within a load group. For example, an MM within a DSN receiving an I/O process may assign LWTs 218 and 220 to the I/O process and store LWTs 218 and 220 within load group 214. The SS 210 (e.g., based on instructions from the BSCH 206) may schedule execution of the LWTs 218 and 220. Similarly, the SS 212 may schedule execution of the LWT 222 within load group 216.
The processing core 204 includes a BSCH 224, which is used to schedule a pool 226 of SSs 228, ..., 230. The MM within the DSN receiving the I/O process may assign LWTs 236 and 238 to the incoming I/O process and store LWTs 236 and 238 within load group 232. Each SS in the pool 226 is responsible for scheduling execution of the LWTs within a load group. For example, the SS 228 schedules execution of the LWTs 236 and 238 (e.g., based on instructions from the BSCH 224). Similarly, the SS 230 schedules execution of the LWT 240 within load group 234.
In an exemplary embodiment, the MM within each DSN may store competing LWTs in the same load group, and the SS then schedules the LWTs within that load group for sequential execution. In this way, by mapping competing LWTs to the same load group and then executing the LWTs within that load group sequentially, contention for processing resources is avoided and the LWTs are executed without any lock (e.g., execution of one LWT is not blocked waiting for another LWT to complete execution).
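For illustration only, this lock-aware mapping can be sketched as follows: load groups are keyed by the processing resource an LWT touches, so contending LWTs land in the same group and run back to back. The resource key, the callables standing in for LWTs, and the function names are assumptions for illustration, not the MM's actual interface.

from collections import defaultdict

def map_lwt_to_load_group(load_groups, lwt, resource_key):
    """Place the LWT in the load group keyed by the processing resource it uses,
    so LWTs that would contend for that resource end up in the same group."""
    load_groups[resource_key].append(lwt)

def run_load_group(lwts):
    """Execute the LWTs of one load group sequentially; because contending LWTs
    never run concurrently, no lock around the shared resource is needed."""
    return [lwt() for lwt in lwts]

load_groups = defaultdict(list)
map_lwt_to_load_group(load_groups, lambda: "write extent 7", resource_key="extent-7")
map_lwt_to_load_group(load_groups, lambda: "read extent 7", resource_key="extent-7")
print(run_load_group(load_groups["extent-7"]))   # ['write extent 7', 'read extent 7']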
FIG. 3 is a block diagram of a non-uniform memory access (NUMA) memory domain having core partitions that can be used to perform LWT rebalancing, according to an example embodiment. Referring to FIG. 3, a memory 300 may be configured with a plurality of NUMA memory domains 302, 304, ..., 306, which may be used in conjunction with executing LWTs within each NUMA memory domain associated with processing cores grouped into core partitions, and with rebalancing individual processing cores, core partitions, and NUMA memory domains. As described above, a core partition is a collection of processing cores on which LWTs may be executed. A core partition may be used to execute LWTs across multiple NUMA memory domains (e.g., using multiple sub-partitions), and multiple LWTs may execute on the same partition. One or more load groups with LWTs may move between partitions (e.g., from a sub-partition of a first core partition to a sub-partition of a second core partition), which causes the core partitions to be expanded or contracted. Furthermore, core partitions do not overlap between different NUMA memory domains. More specifically, the processing cores associated with the core sub-partitions within a first NUMA memory domain do not overlap with the processing cores associated with the core sub-partitions within a second NUMA memory domain.
As shown in FIG. 3, each of the NUMA memory domains 302, ..., 306 may include multiple core partitions, with each core partition being limited to a single NUMA memory domain (e.g., all processing cores of the core partition are associated with a single NUMA memory domain) or spanning multiple NUMA memory domains (e.g., non-overlapping processing cores of the same partition are associated with different NUMA memory domains). For example, core partition P0 is distributed among the NUMA memory domains 302, 304, and 306 as corresponding core sub-partitions 308-0, 308-1, and 308-N.
As shown in FIG. 3, all processing cores within core partition P0 may be mapped to different sub-partitions in different NUMA memory domains in a non-overlapping manner. For example, core partition P0 includes processing cores C0 through C40. The first processing core sub-partition 308-0 of core partition P0 includes processing cores C0, ..., C10 and C21, ..., C30. The second processing core sub-partition 308-1 of core partition P0 includes processing cores C11, ..., C20. The third processing core sub-partition 308-N of core partition P0 includes processing cores C31, ..., C40. In this regard, all of the processing cores C0 through C40 of core partition P0 map to different NUMA memory domains in a non-overlapping manner.
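The FIG. 3 layout can be mirrored with ordinary data structures, as in the hedged sketch below. The partition and core identifiers follow the figure; the dictionary shapes, placeholder LWT names, and the helper function are illustrative assumptions.

# The sub-partition layout of FIG. 3, rendered with plain Python structures.
sub_partitions = {
    ("P0", "NUMA0"): {f"C{i}" for i in range(0, 11)} | {f"C{i}" for i in range(21, 31)},  # 308-0
    ("P0", "NUMA1"): {f"C{i}" for i in range(11, 21)},                                    # 308-1
    ("P0", "NUMA N"): {f"C{i}" for i in range(31, 41)},                                   # 308-N
}

# Each processing core holds load groups; each load group is an ordered list of
# LWTs (competing LWTs share a group and are executed sequentially).
load_groups_per_core = {
    "C0": [["lwt-a", "lwt-b"], ["lwt-c"]],   # queue depth 3, as annotated in FIG. 3
    "C30": [["lwt-d"]],                      # queue depth 1
}

def core_queue_depth(core_id: str) -> int:
    """Queue depth of a core = total number of LWTs across its load groups."""
    return sum(len(group) for group in load_groups_per_core.get(core_id, []))

print(core_queue_depth("C0"), core_queue_depth("C30"))   # 3 1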
FIGS. 4A and 4B are flowcharts of a method suitable for performing level 1 rebalancing within core sub-partitions in a NUMA memory domain, according to an illustrative embodiment. Referring to FIG. 4A, a method 400A for performing a level 1 rebalancing during a current rebalancing period includes operations 402A, 404A, 406A, and 408A. By way of example and not limitation, the method 400A is described as being performed by the rebalancing module 118. At operation 402A, a plurality of queue depths (QDs) corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions is determined. For example, the first core partition may be partition P0 shown in FIG. 3, which includes processing cores C0 through C40, and the subset of the plurality of processing cores may include cores C0-C10 and C21-C30 of processing core sub-partition 308-0 within NUMA memory domain 302. The QD of a processing core in the subset refers to the number of LWTs scheduled for execution by that processing core, where the LWTs are grouped into one or more load groups within the processing core. For example, processing core 202 (labeled C0) is shown in FIG. 2 and has a queue depth of 3, which is the number of LWTs within the available load groups 214 and 216. Accordingly, the QD of processing core C0 is denoted in FIG. 3 as "QD=3". A similar QD is determined for each of the remaining processing cores of the first processing core sub-partition 308-0 within NUMA memory domain 302. For example, the QD of processing core C30 is denoted in FIG. 3 as "QD=1".
At operation 404A, a first processing core in the subset is selected based on a largest QD of the plurality of QDs. At operation 406A, a second processing core in the subset is selected based on a smallest QD of the plurality of QDs and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period. For example, from among all the processing cores of core partition P0 within NUMA memory domain 302, processing core C0 may be selected as the core with the largest QD (QD=3) and processing core C30 may be selected as the core with the smallest QD (QD=1).
In an exemplary embodiment, the rebalancing module 118 is used to store historical QD data associated with QDs that were previously measured (e.g., during a previous rebalancing period) for the various processing cores, core partitions, and NUMA memory domains. In some aspects, the historical QD data may include average QD data determined, for example, based on an exponential moving average of the historical queue depths. In addition, the rebalancing module 118 is further configured to store the corresponding processing core sleep times, such as the core sleep time since a previous rebalancing. In this regard, processing core C30 may be selected further based on the core sleep time information for each processing core in the subset detected during the previous rebalancing period (e.g., processing core C30 is selected as the core having the lowest queue depth and the highest core sleep time during the previous rebalancing period). At operation 408A, a load group of the one or more load groups in the first processing core is moved to the one or more load groups in the second processing core. For example, load group 216 from core C0 is moved as a new load group in core C30. In an exemplary embodiment, the above-described level 1 rebalancing functions (e.g., as described in connection with FIG. 4B) may be repeated until the difference between the QDs of processing cores C0 and C30 is equal to (or below) a threshold.
Referring to FIG. 4B, a method 400B for performing level 1 rebalancing includes operations 402B, 404B, 406B, and 408B. By way of example and not limitation, the method 400B is described as being performed by the rebalancing module 118. The level 1 rebalancing functions discussed in connection with FIG. 4B (and FIG. 4A) may be performed once every N-second period (where N is a positive number). The N-second period is the duration over which a level 1 rebalancing is performed, also referred to as the rebalancing period. At operation 402B, a QD is determined for each processing core in a core sub-partition within a NUMA memory domain. At operation 404B, the processing core with the highest QD (e.g., core C0 in FIG. 3) is selected from the processing cores of the core sub-partition in the NUMA memory domain. At operation 406B, the processing core with the lowest queue depth is selected from the processing cores of the core sub-partition in the NUMA memory domain. At operation 408B, one or more load groups are moved between the processing core with the highest QD and the processing core with the lowest QD until the QD difference between the two processing cores is equal to or below a threshold. For example, a load group with a single LWT may be migrated from processing core C0 to processing core C30 such that the queue depth of both processing cores is QD=2.
In an exemplary embodiment, the level 1 rebalancing functions described in connection with FIGS. 4A and 4B may be performed periodically, or upon detecting that at least one processing core within a core partition has a QD above a threshold (e.g., a current queue depth or an average QD determined based on historical QD information).
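A compact, hedged sketch of the FIG. 4A/4B loop follows. The dictionary shapes, the sleep-time tie-break, the guard against moving a load group that would widen the imbalance, and the threshold value are assumptions layered on the description above, not the module's actual implementation.

def level1_rebalance(cores, sleep_time, threshold=1):
    """cores: {core_id: [load_group, ...]}, each load group a list of LWTs.
    sleep_time: {core_id: sleep time observed during the previous rebalancing period}."""
    def qd(core_id):
        return sum(len(group) for group in cores[core_id])

    while True:
        busiest = max(cores, key=qd)                      # highest queue depth
        # lowest queue depth, preferring the core that slept the longest
        idlest = min(cores, key=lambda c: (qd(c), -sleep_time.get(c, 0.0)))
        gap = qd(busiest) - qd(idlest)
        if gap <= threshold or not cores[busiest]:
            break
        group = min(cores[busiest], key=len)
        if len(group) >= gap:             # moving this group would not narrow the imbalance
            break
        cores[busiest].remove(group)      # move the smallest load group to the idle core
        cores[idlest].append(group)

# Example mirroring FIG. 3: C0 starts at QD=3, C30 at QD=1; one load group moves.
cores = {"C0": [["lwt1", "lwt2"], ["lwt3"]], "C30": [["lwt4"]]}
level1_rebalance(cores, sleep_time={"C0": 0.1, "C30": 2.5})
print({c: sum(len(g) for g in groups) for c, groups in cores.items()})  # {'C0': 2, 'C30': 2}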
FIG. 5 is a flowchart of a method suitable for performing level 2 rebalancing within NUMA memory domains and between core sub-partitions according to an example embodiment. Referring to FIG. 5, a method 500 for performing level 2 rebalancing includes operations 502, 504, 506, and 508. By way of example and not limitation, the method 500 is described as being performed by the rebalancing module 118. At operation 502, for a NUMA memory domain of a plurality of NUMA memory domains, a partition queue depth (e.g., an average QD based on historical QD information) is calculated for each corresponding core sub-partition in the NUMA memory domain, and a determination is made as to whether to perform a level 2 rebalancing. For example, whether to perform the remaining operations 504-508 of the level 2 rebalancing is determined based on whether the QD difference between the partition queue depths associated with the sub-partitions within the NUMA memory domain is greater than a threshold, based on an average core sleep time since a previous rebalancing, and so forth.
At operation 504, the sub-partition with the lowest QD is determined. For example, the QDs of core sub-partitions 308-0 and 310-0 within NUMA memory domain 302 may be determined as QD=15 and QD=5, respectively. Thus, core sub-partition 310-0 is selected as the core sub-partition having the lowest partition queue depth. In addition, a processing core within core sub-partition 310-0 is selected and released. For example, the selected processing core may be the processing core with the lowest QD within core sub-partition 310-0. The LWTs within the selected processing core are moved to the other processing cores within core sub-partition 310-0. In an exemplary embodiment, the processing core with the lowest QD is also selected based on core sleep time information (e.g., the processing core with the lowest queue depth and the longest sleep time since the previous rebalancing period is selected).
At operation 506, the core sub-partition with the highest partition queue depth is determined and the released core from the previous operation is added to the core sub-partition. For example, core sub-partition 308-0 is determined to be the highest queue depth sub-partition, and the processing cores released from the lowest queue depth sub-partition are moved to core sub-partition 308-0 (of partition P0).
At operation 508, optionally, a level 1 rebalancing may be performed to redistribute (e.g., move) the load groups within each sub-partition.
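By way of illustration only, the following sketch captures the level 2 flow of operations 502-508 for a single NUMA memory domain. The data structures, the gap threshold, and the placeholder core identifiers (Ca, Cb, Cx, Cy) are assumptions; only the sub-partition labels 308-0 and 310-0 and their example queue depths come from the text.

```python
def qd(core):
    return sum(core["groups"])

def partition_qd(sub_partition):
    # partition queue depth: average QD over the cores the sub-partition owns
    cores = sub_partition["cores"]
    return sum(qd(c) for c in cores) / len(cores)

def level2_rebalance(sub_partitions, gap_threshold=3):
    """Release the idlest core of the least-loaded sub-partition and donate it
    to the most-loaded sub-partition of the same NUMA memory domain."""
    lowest = min(sub_partitions, key=partition_qd)    # operation 504
    highest = max(sub_partitions, key=partition_qd)   # operation 506
    if partition_qd(highest) - partition_qd(lowest) <= gap_threshold:
        return                                        # operation 502: skip this period
    released = min(lowest["cores"], key=qd)           # idlest core of the donor
    lowest["cores"].remove(released)
    for group in released["groups"]:                  # keep its LWTs in the donor
        min(lowest["cores"], key=qd)["groups"].append(group)
    released["groups"] = []
    highest["cores"].append(released)                 # donate the freed core

# Mirrors the text: sub-partition 308-0 totals QD=15, 310-0 totals QD=5.
sp_308_0 = {"name": "308-0", "cores": [{"id": "Ca", "groups": [8]},
                                       {"id": "Cb", "groups": [7]}]}
sp_310_0 = {"name": "310-0", "cores": [{"id": "Cx", "groups": [3]},
                                       {"id": "Cy", "groups": [2]}]}
level2_rebalance([sp_308_0, sp_310_0])
print([c["id"] for c in sp_308_0["cores"]])           # -> ['Ca', 'Cb', 'Cy']
```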
FIG. 6 is a flowchart of a method suitable for performing level 3 rebalancing between NUMA memory domains, according to an example embodiment. Referring to fig. 6, a method 600 for performing level 3 rebalancing includes operations 602, 604, 606, 608, and 610. By way of example, and not limitation, the method 600 is described as being performed by the rebalancing module 118. In operation 602, it is determined whether to perform level 3 rebalancing. For example, a NUMA queue depth is determined for each of a plurality of NUMA memory domains (e.g., an average queue depth for each NUMA memory domain is determined based on historical QD information for the core partitions or sub-partitions within the NUMA memory domain). Based on whether a queue depth difference between any two NUMA queue depths associated with any two NUMA memory domains is greater than a threshold, a determination is made as to whether to perform the remaining operations 604-608 of the level 3 rebalancing. In an exemplary embodiment, it may be determined that level 3 rebalancing should be performed based on a user request received from one of the user devices 102. In another exemplary embodiment, it may be determined to perform level 3 rebalancing based on the type of workload being handled by the CPU system of the distributed storage node within the network-based service infrastructure 112. For example, level 3 rebalancing may be triggered before or after performing a compute-intensive workload or a memory-write-intensive workload. In yet another exemplary embodiment, it may be determined to perform level 3 rebalancing based on a duration of time that one or more processing cores of the CPU system have been in a power saving mode.
After determining that level 3 rebalancing is to be performed, at operation 604, the NUMA memory domain with the highest NUMA queue depth is determined. In some aspects, the NUMA memory domain with the highest queue depth is determined using an average queue depth (e.g., using an exponential moving average of historical queue depths for the NUMA memory domains).
At operation 606, the NUMA memory domain having the lowest NUMA queue depth is determined. For example, referring to fig. 3, NUMA memory domain 302 is determined to be the NUMA memory domain with the lowest queue depth (e.g., QD=20), and NUMA memory domain 304 is determined to be the NUMA memory domain with the highest queue depth.
At operation 608, a core partition is selected whose associated processing cores reside in sub-partitions of both the NUMA memory domain with the highest NUMA queue depth (determined at operation 604) and the NUMA memory domain with the lowest NUMA queue depth (determined at operation 606). For example, core partition P0 may be the selected core partition because it includes cores that exist in sub-partitions of both NUMA memory domains 302 and 304. One or more load groups are then moved from the processing cores in the selected core partition of the NUMA memory domain having the highest queue depth to the processing cores in the selected core partition of the NUMA memory domain having the lowest queue depth. For example, a load group within processing core C11 of core partition P0 (and sub-partition 308-1) in NUMA memory domain 304 can be moved to processing core C30 of core partition P0 (and sub-partition 308-0) in NUMA memory domain 302. In operation 610, optionally, level 2 rebalancing and level 1 rebalancing may be performed after the level 3 rebalancing.
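A comparable sketch of the level 3 flow (operations 602-610) follows. The data structures and numeric queue depths are assumptions for illustration; only the identifiers 302, 304, P0, C11 and C30 are taken from the example in the text.

```python
def qd(core):
    return sum(core["groups"])

def numa_qd(domain):
    # NUMA queue depth: total LWTs queued on every core of every sub-partition
    return sum(qd(c) for cores in domain["sub_partitions"].values() for c in cores)

def level3_rebalance(domains, shared_partition="P0"):
    """Move a load group of one core partition from the busiest NUMA memory
    domain to the idlest one, via cores that the partition owns in both."""
    busiest = max(domains, key=numa_qd)                               # operation 604
    idlest = min(domains, key=numa_qd)                                # operation 606
    src = max(busiest["sub_partitions"][shared_partition], key=qd)    # operation 608
    dst = min(idlest["sub_partitions"][shared_partition], key=qd)
    if src["groups"]:
        dst["groups"].append(src["groups"].pop())                     # migrate one load group

# Mirrors the text: partition P0 spans NUMA domains 302 and 304; a load group
# moves from core C11 (domain 304) to core C30 (domain 302).
numa_302 = {"name": "302", "sub_partitions": {"P0": [{"id": "C30", "groups": [1]}]}}
numa_304 = {"name": "304", "sub_partitions": {"P0": [{"id": "C11", "groups": [4, 3]}]}}
level3_rebalance([numa_302, numa_304])
print(qd(numa_302["sub_partitions"]["P0"][0]))          # -> 4 (C30 picked up a group)
```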
Fig. 7 is a flowchart of a method suitable for performing LWT power saving rebalancing, according to an example embodiment. Method 700 includes operations 702, 704, 706, and 708. By way of example and not limitation, the method 700 is described as being performed by the rebalancing module 118. At operation 702, a determination is made as to whether to perform LWT power saving rebalancing (e.g., whether to continue with rebalancing operations 704, 706, and 708). In an exemplary embodiment, whether to perform LWT power saving rebalancing may be determined based on the total number of workloads or LWTs being processed by all cores of the CPU system. For example, when the total number of LWTs across all processing cores and all NUMA memory domains is below a threshold, one or more of the NUMA memory domains may be placed in a power saving mode by performing rebalancing operations 704, 706, and 708. At operation 704, the NUMA memory domain having the lowest number of processing cores (also referred to as the lowest processing core count) is determined from the plurality of available NUMA memory domains. At operation 706, the determined NUMA memory domain is converted into a released NUMA memory domain. More specifically, for each core partition that spans the plurality of available NUMA memory domains, all load groups of that partition in the determined NUMA memory domain (e.g., the load groups in one sub-partition of that partition) are moved to the same partition (e.g., a different sub-partition of the same partition) in at least one other NUMA memory domain of the plurality of NUMA memory domains. For example, NUMA memory domain 304 can be determined to be the NUMA memory domain having the lowest number of processing cores. All of the load groups in the sub-partitions 308-1 and 310-1 of core partitions P0 and P1, respectively, in NUMA memory domain 304 can be migrated to the processing cores in the sub-partitions 308-0 and 310-0, respectively, in NUMA memory domain 302. In this regard, NUMA memory domain 304 is released from LWTs. At operation 708, the NUMA memory domain that has been released from LWTs (e.g., NUMA memory domain 304) is placed in a power saving mode.
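For illustration, the sketch below follows operations 702-708. The threshold, the choice of a single target domain, the evacuation order, and the placeholder core identifiers are assumptions; only the NUMA memory domains 302 and 304 and the core partitions P0 and P1 are taken from the text.

```python
def core_qd(core):
    return sum(core["groups"])

def evacuate_for_power_save(domains, total_lwt_threshold=16):
    """If total load is low, move every load group out of the NUMA memory
    domain with the fewest cores and mark that domain for power-save mode."""
    total_lwts = sum(core_qd(c) for d in domains
                     for cores in d["sub_partitions"].values() for c in cores)
    if total_lwts >= total_lwt_threshold:              # operation 702: keep all domains active
        return None
    victim = min(domains,                              # operation 704: fewest processing cores
                 key=lambda d: sum(len(cs) for cs in d["sub_partitions"].values()))
    others = [d for d in domains if d is not victim]
    for partition, cores in victim["sub_partitions"].items():        # operation 706
        target = others[0]["sub_partitions"][partition]  # same partition, another domain
        for core in cores:
            while core["groups"]:
                min(target, key=core_qd)["groups"].append(core["groups"].pop())
    victim["power_save"] = True                        # operation 708: eligible for power saving
    return victim

# Mirrors the text: domain 304 is evacuated into domain 302 and powered down.
numa_302 = {"name": "302", "sub_partitions": {"P0": [{"id": "Ca", "groups": [2]},
                                                     {"id": "Cb", "groups": [1]}],
                                              "P1": [{"id": "Cc", "groups": [1]}]}}
numa_304 = {"name": "304", "sub_partitions": {"P0": [{"id": "Cd", "groups": [1]}],
                                              "P1": [{"id": "Ce", "groups": [2]}]}}
released = evacuate_for_power_save([numa_302, numa_304])
print(released["name"])                                # -> 304
```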
Fig. 8 is a block diagram illustrating a representative software architecture 800, according to an example embodiment, which may be used in conjunction with the various device hardware and LWT rebalancing techniques described herein. Fig. 8 is merely a non-limiting example of a software architecture 802, and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein. The software architecture 802 may execute on hardware such as the computer 900 illustrated in fig. 9, which includes a processor 905, memory 910, storage 915 and 920, and I/O components 925 and 930. A representative hardware layer 804 is shown, which may represent, for example, the computer 900 of fig. 9. The representative hardware layer 804 includes one or more processing units 806 with associated executable instructions 808. The executable instructions 808 represent the executable instructions of the software architecture 802, including implementations of the methods, modules, and so forth of fig. 1-7. The hardware layer 804 also includes memory and/or storage modules 810, which also have the executable instructions 808. The hardware layer 804 may also include other hardware 812, which represents any other hardware of the hardware layer 804, such as the other hardware illustrated as part of the computer 900.
In the exemplary architecture shown in fig. 8, the software architecture 802 may be conceptualized as layers, with each layer providing specific functionality. For example, the software architecture 802 may include layers of an operating system 814, libraries 816, framework/middleware 818, applications 820, and a presentation layer 844. In operation, the application 820 or other component within the layers may call an application programming interface (application programming interface, API) call 824 through a software stack and receive a response, return value, etc. shown in message 826 in response to the API call 824. The layers shown in fig. 8 are representative in nature and not all software architectures 802 have all layers. For example, some mobile or dedicated operating systems may not provide framework/middleware 818, while other operating systems may provide such a layer. Other software architectures may include other layers or different layers.
Operating system 814 may manage hardware resources and provide common services. The operating system 814 may include, for example, a kernel 828, services 830, drivers 832, and a rebalancing module 860. The kernel 828 may act as an abstraction layer between hardware and other software layers. For example, the kernel 828 may be responsible for memory management, processor management (e.g., scheduling), component management, networking, security settings, and the like. The services 830 may provide other common services for other software layers. The drivers 832 may be responsible for controlling or interfacing with the underlying hardware. For example, depending on the hardware configuration, the drivers 832 may include a display driver, a camera driver, a Bluetooth® driver, a flash memory driver, a serial communication driver (e.g., a universal serial bus (USB) driver), a Wi-Fi® driver, audio drivers, power management drivers, and the like.
In some aspects, rebalancing module 860 may include suitable circuitry, logic, interfaces, and/or code, and may be used to perform one or more of the rebalancing functions discussed in connection with fig. 1 (e.g., performed by rebalancing module 118) and fig. 4A-7. In some aspects, the functionality of rebalancing module 860 can be performed by other operating system modules (e.g., 828, 830, or 832).
The libraries 816 may provide a common infrastructure that may be used by the applications 820 and/or other components and/or layers. The libraries 816 typically provide functionality that allows other software modules to perform tasks more easily than by interfacing directly with the underlying operating system 814 functionality (e.g., kernel 828, services 830, drivers 832, and/or rebalancing module 860). The libraries 816 may include a system library 834 (e.g., a C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematical functions, and the like. In addition, the libraries 816 may include API libraries 836, such as media libraries (e.g., libraries supporting the presentation and manipulation of various media formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG), graphics libraries (e.g., an OpenGL framework that may be used to render 2D and 3D graphics content on a display), database libraries (e.g., SQLite, which may provide various relational database functions), web libraries (e.g., WebKit, which may provide web browsing functionality), and the like. The libraries 816 may also include a wide variety of other libraries 838 to provide many other APIs to the applications 820 and other software components/modules.
Framework/middleware 818 (also sometimes referred to as middleware) can provide a high-level public infrastructure that can be used by applications 820 and/or other software components/modules. For example, the framework/middleware 818 can provide various graphical user interface (graphical user interface, GUI) functions, advanced resource management, advanced location services, and the like. Framework/middleware 818 can provide various other APIs that can be used by applications 820 and/or other software components/modules, some of which can be specific to a particular operating system 814 or platform.
The applications 820 include built-in applications 840 and/or third party applications 842. Examples of representative built-in applications 840 may include, but are not limited to, a contacts application, a browser application, a reader application, a location application, a media application, a communication application, and/or a game application. The third party applications 842 may include any of the built-in applications 840 as well as a wide variety of other applications. In a specific example, a third party application 842 (e.g., an application developed using the Android™ or iOS™ software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as iOS™, Android™, Windows® Phone, or other mobile operating systems. In this example, the third party application 842 may invoke the API calls 824 provided by the mobile operating system (e.g., operating system 814) in order to facilitate the functionality described herein.
The applications 820 may utilize built-in operating system functions (e.g., kernel 828, services 830, drivers 832, and/or rebalancing module 860), libraries (e.g., system library 834, API libraries 836, and other libraries 838), and framework/middleware 818 to create user interfaces for interacting with users of the system. Alternatively or additionally, in some systems, interaction with the user may occur through a presentation layer (e.g., presentation layer 844). In these systems, the application/module "logic" may be separated from the aspects of the application/module that interact with the user.
Some software architectures utilize virtual machines. In the example of fig. 8, a virtual machine 848 is shown. A virtual machine creates a software environment in which applications/modules can execute as if they were executing on a hardware device (e.g., the computer 900 of fig. 9). The virtual machine 848 is hosted by a host operating system (e.g., operating system 814 in fig. 8) and typically, although not always, has a virtual machine monitor 846 that manages the operation of the virtual machine 848 and the interface with the host operating system (i.e., operating system 814). A software architecture executes within the virtual machine 848, such as an operating system 850, libraries 852, framework/middleware 854, applications 856, and/or a presentation layer 858. These layers of the software architecture executing within the virtual machine 848 may be the same as or different from the corresponding layers previously described.
Fig. 9 is a block diagram illustrating circuitry of an apparatus implementing an algorithm and performing a method according to some example embodiments. Not all components need be used in various embodiments. For example, the client, server, and cloud-based network devices may each use a different set of components, or for the server, a larger storage device.
An example computing device in the form of a computer 900 (also referred to as a computing device 900 or a computer system 900) may include a processor 905, memory 910, removable storage 915, non-removable storage 920, an input interface 925, an output interface 930, and a communication interface 935, all connected by a bus 940. Although the example computing device is shown and described as the computer 900, the computing device may take different forms in different embodiments.
The memory 910 may include volatile memory 945 and non-volatile memory 950, and may store a program 955. The computer 900 may include, or have access to, a computing environment that includes a variety of computer-readable media, such as the volatile memory 945, the non-volatile memory 950, the removable storage 915, and the non-removable storage 920. Computer storage includes random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) and electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
Computer-readable instructions stored in a computer-readable medium (e.g., the program 955 stored in the memory 910) may be executed by the processor 905 of the computer 900. Hard disk drives, CD-ROMs, and RAM are some examples of articles of manufacture including a non-transitory computer-readable medium, such as a storage device. The terms "computer-readable medium" and "storage device" do not include carrier waves, to the extent that carrier waves are deemed too transitory. "Computer-readable non-transitory media" include all types of computer-readable media, including magnetic storage media, optical storage media, flash memory media, and solid-state storage media. It should be appreciated that the software may be installed on a computer and sold with the computer. Alternatively, the software may be obtained and loaded into a computer, including by obtaining the software through a physical medium or a distribution system, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software may be stored on a server for distribution over a network, for example. As used herein, the terms "computer-readable medium" and "machine-readable medium" are interchangeable.
Program 955 may utilize a customer preference structure using modules described herein, such as rebalancing module 960. The rebalancing module 960 may be the same as the rebalancing module 860 discussed in connection with FIG. 8.
Any one or more of the modules described herein may be implemented using hardware (e.g., a processor of a machine, an application-specific integrated circuit (ASIC), a field-programmable gate array (field-programmable gate array, FPGA), or any suitable combination thereof). Moreover, any two or more of the modules may be combined into a single module, and the functionality described herein for a single module may be subdivided among multiple modules. Furthermore, in accordance with various exemplary embodiments, modules described herein as being implemented within a single machine, database, or device may be distributed across multiple machines, databases, or devices.
In some aspects, rebalancing module 960 and one or more other modules that are part of program 955 may be integrated into a single module, performing the respective functions of the integrated module.
Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, steps may be deleted from the described flows, and other components may be added or removed from the described systems. Other embodiments may be within the scope of the following claims.
It should also be appreciated that software comprising one or more computer-executable instructions that facilitate the processing and operation of any or all of the steps described above with respect to the present disclosure may be installed on and sold with one or more computing devices consistent with the present disclosure. Alternatively, the software may be obtained and loaded into one or more computing devices, including obtaining the software through a physical medium or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. For example, the software may be stored on a server for distribution over a network.
Furthermore, it will be appreciated by those skilled in the art that the present disclosure is not limited in its application to the details of construction and arrangement of components set forth in the following description or illustrated in the drawings. The embodiments herein are capable of other embodiments and of being practiced or of being carried out in various ways. It is also to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless limited otherwise, the terms "connected," "coupled," and "mounted" and variations thereof are used broadly herein and encompass both direct and indirect connections, couplings, and mountings. Furthermore, the terms "connected" and "coupled" and their variants are not limited to physical or mechanical connections or couplings. In addition, the terms "upper", "lower", "bottom" and "top" are relative and are used to aid in the description, but are not limiting.
The components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments may be implemented at least in part in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. For example, these components may be implemented as a computer program product (e.g., a computer program, program code, or computer instructions) tangibly embodied in an information carrier or in a machine-readable storage device for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers).
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. Furthermore, functional programs, codes, and code segments for accomplishing the techniques described herein are easily construed by programmers skilled in the art to which the techniques described herein pertains to be within the scope of the claims. Method steps associated with the illustrative embodiments may be performed by one or more programmable processors executing a computer program, code, or instructions to perform functions (e.g., manipulate input data and/or generate output). For example, method steps may also be performed by, and means for performing the methods described above may be implemented as, special purpose logic circuitry, e.g., a Field Programmable Gate Array (FPGA) or an application-specific integrated circuit (ASIC).
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (digital signal processor, DSP), an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other similar configuration.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory, or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from and/or transfer data to, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; and data storage disks (e.g., magnetic disks, internal hard disks, or removable disks, magneto-optical disks, and CD-ROM and DVD-ROM disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
As used herein, a "machine-readable medium" (or "computer-readable medium") refers to a device capable of storing instructions and data temporarily or permanently, and may include, but is not limited to, random access memory (RAM), read-only memory (ROM), cache memory, flash memory, optical media, magnetic media, other types of memory (e.g., electrically erasable programmable read-only memory (EEPROM)), and/or any suitable combination thereof. The term "machine-readable medium" shall be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) capable of storing the processor instructions. The term "machine-readable medium" shall also be taken to include any medium or combination of media that is capable of storing instructions for execution by the one or more processors 905, such that the instructions, when executed by the one or more processors 905, cause the one or more processors 905 to perform any one or more of the methodologies described herein. Accordingly, a "machine-readable medium" refers to a single storage device or apparatus, as well as "cloud-based" storage systems or storage networks that include multiple storage devices or apparatuses. The term "machine-readable medium" as used herein excludes signals per se.
Furthermore, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or described as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the scope disclosed herein.
Although the present disclosure has been described with reference to specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made thereto without departing from the scope of the disclosure. For example, other components may be added to or removed from the described systems. Accordingly, the specification and drawings are to be regarded only as illustrative of the present disclosure as defined in the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present disclosure. Other aspects may be within the scope of the following claims.

Claims (20)

1. A computer-implemented method for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes of a communication network, the method comprising:
performing a level 1 rebalancing during a current rebalancing period, the level 1 rebalancing comprising:
determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions, wherein each queue depth of the plurality of queue depths for each processing core of the subset indicates a number of LWTs scheduled for execution by the processing core, the LWTs being grouped into one or more load groups within the processing core;
selecting a first processing core in the subset based on a maximum queue depth of the plurality of queue depths;
selecting a second processing core in the subset based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period; and
a load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
2. The computer-implemented method of claim 1, further comprising:
the moving during the level 1 rebalancing is repeated until a difference between a queue depth of the first processing core and a queue depth of the second processing core is less than a threshold number.
3. The computer-implemented method of claim 1, further comprising:
the level 1 rebalancing is performed periodically based on a preconfigured rebalancing period.
4. A computer-implemented method according to any one of claims 1 to 3, further comprising:
the plurality of queue depths are determined based on an exponential moving average using historical queue depths for the subset of the plurality of processing cores, the historical queue depths determined during the previous rebalancing period.
5. A computer-implemented method according to any one of claims 1 to 3, further comprising:
grouping a subset of the LWTs into a first load group of the one or more load groups within the first processing core, the subset of LWTs comprising competing LWTs; and
the LWTs in the subset are performed sequentially.
6. The computer-implemented method of claim 1, wherein a subset of the plurality of core partitions is associated with a first non-uniform memory access (NUMA) memory domain of a plurality of NUMA memory domains, the subset of the plurality of core partitions including the first core partition.
7. The computer-implemented method of claim 6, further comprising:
performing a level 2 rebalancing of the first NUMA memory domain, wherein the level 2 rebalancing comprises:
determining a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths, each partition queue depth of the plurality of partition queue depths based on an average of the queue depths of a plurality of processing cores of the corresponding core partition of the subset;
selecting, from the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths; and
from the subset of the plurality of core partitions, a core partition is selected having a highest partition queue depth of the plurality of partition queue depths.
8. The computer-implemented method of claim 7, further comprising:
selecting a processing core from the core partition having the lowest partition queue depth, the selected processing core being associated with a highest core sleep time among the remaining processing cores in the core partition having the lowest partition queue depth; and
The selected processing core is migrated from the core partition having the lowest partition queue depth to the core partition having the highest partition queue depth.
9. The computer-implemented method of claim 8, further comprising:
performing a level 3 rebalancing of the plurality of NUMA memory domains, the level 3 rebalancing comprising:
determining a NUMA queue depth for each of the plurality of NUMA memory domains to obtain a plurality of NUMA queue depths;
selecting a NUMA memory domain from the plurality of NUMA memory domains having a highest NUMA queue depth from the plurality of NUMA queue depths; and
a NUMA memory domain having a lowest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains.
10. The computer-implemented method of claim 9, further comprising:
selecting a core partition for rebalancing, the selected core partition being present in the NUMA memory domain having the lowest NUMA queue depth and the NUMA memory domain having the highest NUMA queue depth; and
at least one load group associated with the core partition selected for rebalancing is migrated from the NUMA memory domain having the highest NUMA queue depth to the NUMA memory domain having the lowest NUMA queue depth.
11. The computer-implemented method of claim 6, further comprising:
performing a power saving rebalancing of the plurality of NUMA memory domains, the power saving rebalancing comprising:
determining a NUMA memory domain having a lowest processing core count of the plurality of NUMA memory domains;
generating a released NUMA memory domain by moving an available load group from the NUMA memory domain having the lowest processing core count to at least one other NUMA memory domain of the plurality of NUMA memory domains; and
and placing the released NUMA memory domain in a power saving mode.
12. A system for performing rebalancing of lightweight threads (LWTs) within distributed storage nodes of a communication network, the system comprising:
a memory storing instructions; and
one or more processors in communication with the memory, wherein to perform a level 1 rebalancing during a current rebalancing cycle, the one or more processors execute the instructions to:
determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions, wherein each queue depth of the plurality of queue depths for each processing core of the subset indicates a number of LWTs scheduled for execution by the processing core, the LWTs being grouped into one or more load groups within the processing core;
Selecting a first processing core in the subset based on a maximum queue depth of the plurality of queue depths;
selecting a second processing core in the subset based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period; and
a load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
13. The system of claim 12, wherein the one or more processors execute the instructions to:
repeating the moving during the level 1 rebalancing until a difference between a queue depth of the first processing core and a queue depth of the second processing core is less than a threshold number; and
the plurality of queue depths are determined based on an exponential moving average using historical queue depths for the subset of the plurality of processing cores, the historical queue depths determined during the previous rebalancing period.
14. The system of any of claims 12 and 13, wherein the one or more processors execute the instructions to:
grouping a subset of the LWTs into a first load group of the one or more load groups within the first processing core, the subset of LWTs comprising competing LWTs.
15. The system of any of claims 12 and 13, wherein a subset of the plurality of core partitions is associated with a first non-uniform memory access (NUMA) memory domain of a plurality of NUMA memory domains, the subset of the plurality of core partitions including the first core partition; and
wherein, to perform a level 2 rebalancing of the first NUMA memory domain, the one or more processors execute the instructions to:
determining a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths, each partition queue depth of the plurality of partition queue depths based on an average of the queue depths of a plurality of processing cores of the corresponding core partition of the subset;
selecting, from the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths; and
From the subset of the plurality of core partitions, a core partition is selected having a highest partition queue depth of the plurality of partition queue depths.
16. The system of claim 15, wherein the one or more processors execute the instructions to:
selecting a processing core from the core partition having the lowest partition queue depth, the selected processing core being associated with a highest core sleep time among the remaining processing cores in the core partition having the lowest partition queue depth; and
the selected processing core is migrated from the core partition having the lowest partition queue depth to the core partition having the highest partition queue depth.
17. The system of claim 16, wherein, to perform a level 3 rebalancing of the plurality of NUMA memory domains, the one or more processors execute the instructions to:
determining a NUMA queue depth for each of the plurality of NUMA memory domains to obtain a plurality of NUMA queue depths;
selecting a NUMA memory domain from the plurality of NUMA memory domains having a highest NUMA queue depth from the plurality of NUMA queue depths; and
A NUMA memory domain having a lowest NUMA queue depth of the plurality of NUMA queue depths is selected from the plurality of NUMA memory domains.
18. The system of claim 17, wherein the one or more processors execute the instructions to:
selecting a core partition for rebalancing, the selected core partition being present in the NUMA memory domain having the lowest NUMA queue depth and the NUMA memory domain having the highest NUMA queue depth; and
at least one load group associated with the core partition selected for rebalancing is migrated from the NUMA memory domain having the highest NUMA queue depth to the NUMA memory domain having the lowest NUMA queue depth.
19. A computer-readable medium storing computer instructions for performing rebalancing of lightweight threads (LWT) within distributed storage nodes of a communication network, wherein, to perform a level 1 rebalancing during a current rebalancing cycle, the instructions when executed by one or more processors of a computing device cause the one or more processors to perform operations comprising:
Determining a plurality of queue depths corresponding to a subset of a plurality of processing cores associated with a first core partition of a plurality of core partitions, wherein each queue depth of the plurality of queue depths for each processing core of the subset indicates a number of LWTs scheduled for execution by the processing core, the LWTs being grouped into one or more load groups within the processing core;
selecting a first processing core in the subset based on a maximum queue depth of the plurality of queue depths;
selecting a second processing core in the subset based on a minimum queue depth of the plurality of queue depths and based on a core sleep time of each processing core in the subset detected during a previous rebalancing period; and
a load group of the one or more load groups of the first processing core is moved to one or more load groups of the second processing core.
20. The computer-readable medium of claim 19, wherein a subset of the plurality of core partitions is associated with a first non-uniform memory access (NUMA) memory domain of a plurality of NUMA memory domains, the subset of the plurality of core partitions including the first core partition, and wherein, to perform a level 2 rebalancing of the first NUMA memory domain, the instructions, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
determining a partition queue depth within the first NUMA memory domain for each core partition of the subset of the plurality of core partitions to obtain a plurality of partition queue depths, each partition queue depth of the plurality of partition queue depths based on an average of the queue depths of a plurality of processing cores of the corresponding core partition of the subset;
selecting, from the subset of the plurality of core partitions, a core partition having a lowest partition queue depth of the plurality of partition queue depths; and
from the subset of the plurality of core partitions, a core partition is selected having a highest partition queue depth of the plurality of partition queue depths.
CN202080103408.5A 2020-08-25 2020-08-25 Lightweight thread (LWT) rebalancing in storage systems Pending CN116113917A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/070455 WO2022046150A1 (en) 2020-08-25 2020-08-25 Lightweight thread (lwt) rebalancing in storage systems

Publications (1)

Publication Number Publication Date
CN116113917A true CN116113917A (en) 2023-05-12

Family

ID=72381174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080103408.5A Pending CN116113917A (en) 2020-08-25 2020-08-25 Lightweight thread (LWT) rebalancing in storage systems

Country Status (2)

Country Link
CN (1) CN116113917A (en)
WO (1) WO2022046150A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981819B (en) * 2022-12-30 2023-10-24 摩尔线程智能科技(北京)有限责任公司 Core scheduling method and device for multi-core system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8635376B2 (en) * 2006-02-22 2014-01-21 Emulex Design & Manufacturing Corporation Computer system input/output management
US8738860B1 (en) * 2010-10-25 2014-05-27 Tilera Corporation Computing in parallel processing environments
US10313219B1 (en) * 2016-11-08 2019-06-04 Sprint Communications Company L.P. Predictive intelligent processor balancing in streaming mobile communication device data processing

Also Published As

Publication number Publication date
WO2022046150A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
JP5932043B2 (en) Volatile memory representation of non-volatile storage set
TWI660274B (en) Data processing method and equipment based on blockchain
JP6542909B2 (en) File operation method and apparatus
US8813091B2 (en) Distribution data structures for locality-guided work stealing
KR20120014141A (en) Performing concurrent rehashing of a hash table for multithreaded applications
US8806426B2 (en) Configurable partitioning of parallel data for parallel processing
CN109379398B (en) Data synchronization method and device
WO2020226659A1 (en) Faas warm startup and scheduling
JP6674460B2 (en) System and method for improved latency in a non-uniform memory architecture
WO2016138839A1 (en) Bitmap-based storage space management system and method thereof
CN115421924A (en) Memory allocation method, device and equipment
CN116113917A (en) Lightweight thread (LWT) rebalancing in storage systems
Vinkler et al. Massively parallel hierarchical scene processing with applications in rendering
CN116010093A (en) Data processing method, apparatus, computer device and readable storage medium
US20160292012A1 (en) Method for exploiting parallelism in task-based systems using an iteration space splitter
CN112346879B (en) Process management method, device, computer equipment and storage medium
CN107102898B (en) Memory management and data structure construction method and device based on NUMA (non Uniform memory Access) architecture
US20060059318A1 (en) Managing shared memory usage within a memory resource group infrastructure
US10338837B1 (en) Dynamic mapping of applications on NVRAM/DRAM hybrid memory
JP6333370B2 (en) Method for implementing dynamic array data structures in cache lines
CN108292265B (en) Memory management for high performance memory
CN112559164A (en) Resource sharing method and device
CN107329705B (en) Shuffle method for heterogeneous storage
US20230121052A1 (en) Resource resettable deep neural network accelerator, system, and method
CN116126512A (en) Physical machine scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination