WO2017131668A1 - Scalability predictor - Google Patents

Scalability predictor

Info

Publication number
WO2017131668A1
Authority
WO
WIPO (PCT)
Prior art keywords
task nodes
distributed memory
computer cluster
memory program
task
Application number
PCT/US2016/015154
Other languages
French (fr)
Inventor
Sourav MEDYA
Ludmila Cherkasova
Guilherma de Campos MAGALHAES
Mehmet Kivanc Ozonat
Padmanabha CHAITRA
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development LP
Priority to PCT/US2016/015154
Publication of WO2017131668A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Definitions

  • Parallel computing is a type of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved at the same time.
  • There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism.
  • Parallelism has been employed in high-performance computing, and interest in parallelism has grown lately due to the physical constraints preventing frequency scaling.
  • As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become a dominant paradigm in computer architecture.
  • In parallel computing, a set of computing devices concurrently execute a distributed memory program, such as a graph procedure.
  • graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.
  • a graph in this context is made up of vertices, nodes or points and edges, arcs or lines that connect them.
  • a graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge or the edges of the graph may be directed from one vertex to another.
  • Graph procedures are employed to solve problems related to graph theory.
  • FIG. 1 illustrates a flowchart of an example method for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster.
  • FIG. 2 illustrates an example of a graph being searched.
  • FIG. 3 illustrates an example of a chart plotting a completion time of a distributed memory program as a function of a percentage of available bandwidth.
  • FIG. 4 illustrates another example of a chart plotting a completion time of a distributed memory program as a function of a percentage of available bandwidth.
  • FIG. 5 illustrates an example of a chart plotting a bandwidth demand as a function of a number of task nodes.
  • FIG. 6 illustrates another example of a chart plotting a bandwidth demand as a function of a number of task nodes.
  • FIG. 7 illustrates an example of a chart plotting a bandwidth demand as a function of a percentage of a Completion Time Increment (CTI).
  • FIG. 8 illustrates another example of a chart plotting a bandwidth demand as a function of a percentage of a CTI.
  • FIG. 9 illustrates an example of a medium-scale computer cluster for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster.
  • FIG. 10 illustrates a flowchart of an example method for calculating an average demand constant of a distributed memory program.
  • FIG. 11 illustrates an example of a control node for calculating an average demand constant of a distributed memory program.
  • FIG. 12 illustrates an example of a non-transitory machine readable medium having instructions for calculating an average demand constant of a distributed memory program.
  • Systems and methods for determining a scalability and performance of a distributed memory program for execution on a large-scale computer cluster are described herein.
  • A programmer of the distributed memory program can access a limited size cluster for program debugging and for running related performance (profiling) experiments.
  • Systems and methods described herein can assess and predict the scalability of the distributed memory program and/or the performance of the distributed memory program when this distributed memory program is executed on a large-scale computer cluster.
  • The systems and methods described herein can assess the increased bandwidth demands of a communication volume as a function of the increased cluster size (e.g., a number of task nodes in the computer cluster).
  • The systems and methods described herein include iteratively executing the distributed memory program on a medium-scale computer cluster with an "interconnect bandwidth throttling" tool, which enables the assessment of the communication demands with respect to available bandwidth.
  • The systems and methods described herein can assist in predicting the cluster size at which a communication cost becomes a dominant component, at which point the performance benefit of an increased cluster size leads to a diminishing return or a decrease in performance.
  • FIG. 1 illustrates a flowchart of an example method 10 that can be implemented for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster.
  • The term "computer cluster" denotes a set of loosely or tightly connected computers that work together such that, in many respects, the connected computers can collectively be conceptualized as a single system.
  • Computer clusters can operate as a parallel computing paradigm. Computer clusters can have each node set to perform a task that is controlled and scheduled by a software application (e.g., a scalability predictor) operating on a control node of the computer cluster. Each node performing a task can be referred to as a "task node".
  • The control node of the computer cluster can also be a task node.
  • In some examples, the control node may be considered to be separate from the computer cluster, and in other examples, the control node can be considered to be integrated with the computer cluster.
  • the method 10 of FIG. 1 is shown and described as executing serially, the method 10 is not limited by the illustrated order, as some actions could in other examples occur in different orders and/or concurrently from that shown and described herein. Moreover, in some examples, some actions of the illustrated actions of the method 10 may be omitted.
  • the method 10 can be implemented, for example, by a medium-scale computer cluster.
  • A medium-scale computer cluster can be a computer cluster that has fewer nodes than the computer cluster on which the distributed memory program is designed/anticipated to be executed (e.g., up to about 32 nodes in one example).
  • a large-scale computer cluster can have more nodes than the medium-scale computer cluster (e.g., more than about 33 nodes in one example).
  • Each node on a computer cluster (a medium-scale or large-scale computer cluster) can be a computing device that can be referred to as a processing node (or referred to simply as a "processor").
  • Each node can store and execute machine readable instructions.
  • Each node in a computer cluster can include a partition of a distributed memory.
  • Each partition of the distributed memory can be stored, for example, on a non-transitory machine readable medium that can store machine executable instructions.
  • the distributed memory can be shared among each of the nodes in the computer cluster, and in other examples, each partition (or some subset thereof) of the distributed memory can be stored on a local data store (e.g., hard drive and/or a solid state drive) that is accessible only by a specific node or set of nodes of the computer cluster.
  • each node can include at least one processor core to access the partition of the distributed memory and execute the stored machine readable instructions.
  • some or all of the nodes of the computer cluster can be representative of virtual resources (e.g., in a computing cloud) that operate across multiple instances of computing devices. In other examples, some or all of the nodes of the computer cluster can be representative of separate computing devices. Each node in the computer cluster can communicate with any of the other nodes (either directly or indirectly) via an interconnect of the computing cluster.
  • the medium-scale computer cluster can execute an interconnect emulator, such as InterSense, available from HEWLETT PACKARD ENTERPRISE® that can control (synthetically throttle) the interconnect bandwidth between nodes of the medium-scale computer cluster.
  • The control node of the computer cluster (via the interconnect emulator) can synthetically set (e.g., throttle) a maximum available bandwidth for usage by the distributed memory program (referred to as an "available bandwidth") to a range between 0% and 100%.
  • the method 10 can be employed, for example to execute a distributed memory program on the medium-scale computer cluster to predict the performance and scalability on the large-scale computer cluster.
  • Message passing interface (MPI) is a programming paradigm for scale-out, distributed memory applications.
  • Using the method 10, an accurate analysis and prediction of how the scaling properties of the communication layer impact application performance can be ascertained for complex MPI-based programs that have interleaving computations and communications in inter-tangled patterns.
  • Due to the asynchronous, concurrent execution of different task nodes, many communication delays between the task nodes could be "hidden" such that these delays do not contribute to or impact the overall completion time of the distributed memory program. Such delays can happen when some task nodes are still in "computation-based" or "processing" portions of the code of the distributed memory program, while the other task nodes have already performed communication exchanges.
  • The method 10 described herein can be employed to analyze the utilized (required) interconnect bandwidth (or bandwidth demand) during the execution of the distributed memory program due to a variety of existing MPI collectives and calls that could involve different sets of nodes and communication styles. Moreover, performance of such distributed memory applications inherently depends on the performance of the communication layer of the cluster.
  • Each task node can be allocated with the same or similar resources.
  • each task node in the medium-scale computer cluster can have 14 processor cores that are each 2 GHz, INTEL XEON® E4-2683v3 processors, with a 35 Megabyte last level cache size and 256 Gigabytes (GB) of Dynamic Random Access Memory (DRAM).
  • In the given example, the distributed memory program implements a Breadth-First Search (BFS) technique of a particular graph.
  • BFS is a graph traversal procedure performed on an undirected, unweighted graph.
  • FIG. 2 illustrates an example of a simplified graph 100 depicting eight (8) vertices.
  • the graph 100 is an example for demonstrating the BFS procedure.
  • the given source vertex is labeled as "V1 (s)” and the remaining seven (7) vertices are labeled in FIG. 2 as "V2"-"V8".
  • Each vertex is connected with an edge 102.
  • the minimum number of edges 102 traversed to travel between nodes defines the distance between the two nodes.
  • vertices V2, V3 and V4 are "one-hop” away, and vertices V5 and V6 are "two hops" away.
  • Vertex V7 is "three hops" away from vertex V1 (s), as there are two non-equidistant routes.
  • In the first (shorter) route, the edges 102 for a route of V1 (s)-V4-V5-V7 can be traversed, and in a second (longer) route, the edges 102 for a route of V1 (s)-V3-V6-V8-V7 can also be traversed, such that the BFS procedure selects the first route.
  • Similarly, vertex V8 is also "three hops" away from vertex V1 (s), as there are again two non-equidistant routes.
  • In the first (longer) route, the edges 102 for a route of V1 (s)-V4-V5-V7-V8 can be traversed, and in a second (shorter) route, the edges 102 for a route of V1 (s)-V3-V6-V8 can also be traversed, such that the BFS procedure selects the second route.
  • the graph 100 is illustrated only as a simple example to facilitate understanding of the BFS procedure. As is easily demonstrated in FIG. 2, as the complexity of the graph increases (increasing the number of vertices and edges), the computational resources required to complete the BFS procedure in a reasonable time can increase exponentially. In fact, in many instances, a graph (e.g., representing a social network or computer network) can have tens of billions of vertices and there may be hundreds of billions of edges.
  • graph problems can be implemented in many different ways in terms of graph data partitioning for parallel processing.
  • 2-D partitioning is employed.
  • the control node can set parameters for execution of the distributed memory program.
  • the parameters can include an initial set number of task nodes and a maximum number of task nodes to execute the distributed memory program on the computer cluster.
  • The parameters can also include an initial set available bandwidth and a minimum available bandwidth for the interconnect emulator. Further, the parameters can include a set number of task nodes to increase and a set percentage of available bandwidth to decrement. Additionally, the parameters can include a predefined completion time increment (CTI) corresponding to a maximum acceptable increase in completion time.
  • the initial available bandwidth is set to 100% and the initial number of task nodes is set to 4. Additionally, in the given example, the minimum available bandwidth is set to 20% and the maximum number of task nodes is set to twenty-five (25). In the given example, the CTI has multiple values of 15%, 16%, 17%, 18% and 19%.
  • the control node can cause the computer cluster to execute the distributed memory program.
  • the distributed memory program implements the BFS procedure traversing the particular graph for the set number of nodes at the set available bandwidth.
  • the control node can record a completion time of the distributed memory program for the set bandwidth and the set number of nodes.
  • the "completion time" can be a measured amount of time (e.g., in milliseconds) that the distributed memory program takes to complete operations.
  • The control node can calculate a CTI for the set bandwidth, CTI_bw.
  • The CTI can be calculated with Equation 1.
  • Completion Time at set bw is the completion time at the set bandwidth (bw); and Completion Time at initial bw is the completion time at the initially set available bandwidth (bw) (typically 100%).
  • the control node can cause the interconnect emulator to reduce (e.g., throttle) the set available bandwidth by a set percentage. In the given example, the control node causes the interconnect emulator to reduce the set available bandwidth by 2%.
  • a determination by the control node can be made as to whether the set bandwidth is below the minimum available bandwidth set in the parameters. If the determination at 40 is negative (e.g. "NO"), the method 10 can return to 20. If the determination at 40 is positive (e.g., "YES"), the method can proceed to 45.
  • At the initially available bandwidth, the CTI_bw is equal to '0'.
  • For each other available bandwidth, the CTI_bw is a number greater than or equal to '0', since the completion time is increased (or at least not decreased) as the set amount of available bandwidth is reduced.
  • The CTI_bw can be calculated for each decrement in the set amount of available bandwidth for the set number of task nodes (p).
  • The control node can calculate a Bandwidth Demand for the set number of task nodes (p), BW_{CTI,p}, corresponding to the predefined CTI.
  • BW_{CTI,p} can be calculated for CTI values of 15%, 16%, 17%, 18% and 19%.
  • FIG. 3 illustrates a chart 150 (e.g., a linear graph) that plots completion time (in ms) for the given example as a function of the set available bandwidth for four (4) different numbers of task nodes (labeled in FIG. 3 as "nodes", 4, 9, 16 and 25). More particularly, the chart 150 corresponds to the given example, where the graph being searched by the BFS procedure has a scale (s) of 27. As used herein, the graph scale size s denotes that the graph being searched has 2^s vertices and 16 * 2^s edges.
  • a scale size of 27 represents a graph with about 134 million vertices and about 2.1 billion edges.
  • FIG. 4 illustrates a chart 160 that also plots completion time (in ms) for the given example as a function of the set available bandwidth for four (4) different numbers of task nodes (labeled in FIG. 4 as "nodes", 4, 9, 16 and 25).
  • the graph has a scale size of 28, thereby representing a graph with about 268 million vertices and about 4.2 billion edges.
  • FIG. 5 illustrates a chart 170 (e.g., a linear graph) that plots the bandwidth demand, BW_{CTI}, as a function of a number of task nodes (processors) for 5 different predetermined CTIs, 15%, 16%, 17%, 18% and 19%, with a scale, s, of 27.
  • FIG. 6 illustrates a chart 180 (e.g., a linear graph) that plots the bandwidth demand, BW_{CTI}, as a function of a number of task nodes (processors) for 5 different predetermined CTIs, 15%, 16%, 17%, 18% and 19%, with a scale, s, of 28.
  • The minimum bandwidth needed (the bandwidth demand, BW_{CTI}) to achieve a predetermined CTI increases as a function of the number of task nodes implemented in the medium-scale computer cluster.
  • the control node can determine a demand constant (DC) for each different CTI.
  • the demand constant for a particular CTI and two different number of task nodes can be calculated with Equations 2 and 3.
  • p_1 is a first number of task nodes employed by the medium-scale computer cluster to complete the distributed memory program.
  • p_2 is a second number of task nodes employed by the medium-scale computer cluster to complete the distributed memory program, and p_1 < p_2.
  • BW_{CTI,p_1} is the bandwidth demand for the particular CTI with the first number of task nodes (p_1).
  • BW_{CTI,p_2} is the bandwidth demand for the particular CTI with the second number of task nodes (p_2).
  • DC_{CTI,p_1,p_2} is the demand constant for the particular CTI for the two particular numbers of task nodes, p_1 and p_2.
  • A demand constant for each particular CTI and each consecutive pair of numbers of task nodes (e.g., 4, 9, 16 and 25 in the given example) can be calculated.
  • The control node can determine an average of the demand constants for each predetermined CTI and each consecutive pair of numbers of task nodes, which can be referred to as the average demand constant, DC.
  • The average demand constant, DC, can be, for example, an arithmetic mean, a Pythagorean mean, a geometric mean, a mode, a median, etc.
  • the control node can calculate a predicted (e.g., estimated) bandwidth demand for a large-scale computer cluster.
  • To calculate a predicted bandwidth demand for a large-scale computer cluster, the control node can employ Equations 4 and 5.
  • k is an integer greater than one.
  • DC is the calculated average demand constant.
  • p_k is a next increased number of task nodes in the medium-scale computer cluster and p_k > p_{k-1} (it is noted that eventually p_k is limited by the number of task nodes in the medium-scale computer cluster).
  • BW_{CTI,p_k} is the predicted bandwidth demand for the number of task nodes, p_k, in the large-scale computer cluster.
  • BW_{CTI,p_{k-1}} is either a measured bandwidth demand or a predicted bandwidth demand based on previous calculations for the next smaller number of task nodes, p_{k-1}, in the medium-scale or large-scale computer cluster.
  • By employing Equation 5, the bandwidth demand for a set number of task nodes, p_k, can be predicted. In this manner, if there is a maximum available bandwidth of the interconnect for the large-scale computer cluster, it can be determined whether the maximum available bandwidth of the interconnect meets the predicted bandwidth demand, BW_{CTI,p_k}, for the set number of task nodes, p_k.
  • The control node can employ a plurality of predicted numbers of task nodes, p_k, to calculate a corresponding plurality of predicted bandwidth demands, BW_{CTI,p_k}.
  • A specific predicted number of task nodes, p_k, can be calculated for a point at which a further increase in the number of task nodes would lead to an interconnect bandwidth demand for the distributed memory program that is higher than the predetermined amount of available bandwidth on the interconnect of the large-scale computing cluster.
  • The control node can predict (e.g., via calculation) a number of task nodes, p_k, that reaches a point of decreasing marginal utility for a set amount of bandwidth of the interconnect on the large-scale computing cluster by employing Equations 4 and 5.
  • Equations 4 and 5 provide a mechanism for predicting the number of task nodes that reaches the point of decreasing marginal utility for a given amount of available bandwidth on the interconnect of the large-scale computer cluster.
  • The "point of decreasing marginal utility" denotes the point at which communication cost becomes so highly dominant that a further increase in the number of task nodes would lead to a diminishing return in the completion time of the distributed memory program or an increase in the completion time of the distributed memory program (a decrease in performance).
  • the performance and scalability of the distributed memory program can be predicted. That is, by employing Equations 4-5, and setting a number of task nodes, a bandwidth demand can be predicted by the control node, as explained in action 70. Additionally or alternatively, by employing Equations 4-5 and setting an amount of available bandwidth on an interconnect of the large-scale computer cluster, the control node can calculate the predicted number of task nodes that reaches the point of decreasing marginal utility for the distributed memory program.
  • The method 10 accounts for effects of decreasing available bandwidth during execution of the distributed memory program. As illustrated in FIGS. 3-4, it is often the case that decreasing the bandwidth on an interconnect of a computer cluster has little effect until the bandwidth is throttled (e.g., decreased) to a certain percentage. For instance, in the charts 150 and 160 of FIGS. 3-4, at available bandwidth percentages between 100% and 30%, there is an increase of less than 50 ms in the completion time. However, between 30% and 20% there is about another 50 ms increase in completion time. In such a situation, the CTI below 30% available bandwidth increases at a nonlinear rate. Conventional performance analysis systems do not consider this possibility.
  • FIGS. 7 and 8 illustrate charts 200 and 210 that demonstrate experimental measured results as compared to calculated predicted results.
  • the charts 200 and 210 represent a bandwidth demand percentage plotted as a function of a number of task nodes.
  • By employing a twenty-five (25) task node computer cluster to execute the distributed memory program (a BFS of a graph) with a graph scale of 27 in the chart 200 and a graph scale of 28 in the chart 210, the measured results can be obtained and plotted.
  • the control node can obtain three (3) different combinations for each considered CTI (15 in total), such that the demand constant for each CTI can be calculated.
  • The Average DC can be calculated over these 15 values of DC_{CTI,p_i,p_j}.
  • Bandwidth demands for 25 task nodes can be predicted using the Average DC, Equation 5, and p_{k-1} as 288 (16 task nodes).
  • The average DC values for scaling factors 27 and 28 are 0.8 and 0.86, respectively.
  • FIGS. 7 and 8 demonstrate that the method 10 provides excellent accuracy for predicting performance of a computer cluster (large or medium) executing the distributed memory program.
  • FIG. 9 illustrates an example of a medium-scale computer cluster 300 that could be employed to predict bandwidth demands and/or a number of processors needed to reach a point of marginal utility.
  • the computer cluster 300 can be employed, for example, to implement the method 10 illustrated in FIG. 1 .
  • the computer cluster 300 can be employed to execute a distributed memory program, wherein the execution of the distributed memory program is shared among N number of task nodes 302 of the computer cluster 300, where N is an integer greater than or equal to two (2).
  • the computer cluster 300 can also include a control node 304.
  • Each task node 302 can be a computing device that includes a processor 306 (which can include one or more processor cores).
  • Each task node 302 can also include a memory 308 that stores machine readable instructions that can be accessed and executed by the processor 306.
  • the memory 308 of each task node 302 can be, for example, volatile memory (e.g., RAM), non-volatile memory (e.g., a hard disk drive, a solid state drive, flash memory, etc.) or a combination thereof.
  • Each task node 302 can be employed to implement a task node described with respect to FIG. 1 .
  • each memory 308 of the task nodes 302 can include a partition of a distributed (shared) memory that is employed to execute the distributed memory program.
  • The control node 304 can also include a processor 310 (that includes one or more processor cores) and a memory 312. The control node 304 can be employed to implement the control node described with respect to FIG. 1.
  • The N number of task nodes 302 and the control node 304 can communicate with each other via an interconnect 314.
  • The interconnect 314 can be, for example, InfiniBand, a backplane, a network connection or a combination thereof.
  • some (or all) of the components of the computer cluster 300 can be virtual components executing on a computing cloud. In such a situation, each component could be representative of multiple instances of hardware (i.e., distributed). In other examples, each component of the computer cluster 300 (or some portion thereof) can be representative of a physical hardware device.
  • The control node 304 can initiate and control execution of the distributed memory program among the N number of task nodes 302. Additionally, the control node 304 can calculate performance metrics of the computer cluster 300 that can be employed to predict the scalability of the distributed memory program on a large-scale computer cluster (e.g., 25+ task nodes). For purposes of simplification of explanation, the control node 304 and the N number of task nodes 302 are illustrated and described in FIG. 9 as being separate devices. However, in other examples, the control node 304 can be integrated with a task node 302.
  • the distributed memory program can employ, for example, a graph searching procedure, such as a BFS search procedure.
  • the distributed memory program executed by the computer cluster 300 can be employed to search a graph for a shortest distance between vertices of the graph (e.g., as illustrated and described with respect to FIG. 2).
  • The examples described herein merely illustrate one example of a distributed memory program that could be executed by the computer cluster 300.
  • the control node 304 can execute a scalability predictor 316 to predict the performance and scalability of the distributed memory program on a large-scale computer cluster.
  • The scalability predictor 316 can set parameters for a scalability evaluation of a distributed memory program. Additionally, the scalability predictor 316 can cause the computer cluster 300 to execute the distributed memory program multiple times using a set number of task nodes (e.g., 4, 9, 16 and 25) and record the completion time of each execution.
  • the scalability predictor 316 can cause the computer cluster 300 to re-execute the distributed memory program using an interconnect emulator 318 (e.g., InterSense) to synthetically decrement (e.g., throttle) the available bandwidth of the interconnect 314.
  • The scalability predictor 316 can record the completion time for each incremental decrease of the interconnect bandwidth.
  • The charts 150 and 160 in FIGS. 3 and 4 illustrate an example of the increase in completion time plotted over a range of available bandwidths.
  • The scalability predictor 316 can employ Equation 1 to determine a CTI_bw for each decrement of available bandwidth on the interconnect 314.
  • The scalability predictor 316 can compare the CTI_bw for each decrement of available bandwidth to one or more predetermined CTIs (e.g., 15%-19%) to determine a bandwidth demand at each predetermined CTI, for each set number of task nodes 302 employed, BW_{CTI,p}.
  • The charts 170 and 180 illustrated in FIGS. 5 and 6 plot bandwidth demands as a function of a number of task nodes for five (5) different CTIs (15%-19%).
  • The scalability predictor 316 can employ Equation 3 to calculate a demand constant for each CTI and each pair of numbers of task nodes 302, DC_{CTI,p_1,p_2}.
  • the scalability predictor 316 can calculate an Average Demand Constant, DC.
  • The scalability predictor 316 can employ Equations 4-5 to predict performance and/or scalability limits of the distributed memory program. Specifically, the scalability predictor 316 can employ Equation 5 to predict bandwidth demands for a specific number of task nodes (p_k) in a large-scale computer cluster.
  • Additionally, the scalability predictor 316 can employ Equation 5 to estimate a point at which a further increase in the number of task nodes 302 provides either diminishing returns or an increase in completion time (e.g., decreased overall performance).
  • The scalability predictor 316 of the control node 304 can accurately predict operational requirements and limitations (e.g., the bandwidth demands and/or the greatest number of task nodes 302 before reaching marginal utility) of the distributed memory program. Moreover, since the computer cluster 300 actually executes instances of the distributed memory program with a throttled bandwidth on the interconnect 314, the operational requirements and limitations are accurately predicted.
  • FIG. 10 illustrates an example of a flowchart of a method 400 for predicting an average demand constant for a distributed memory program.
  • the method 400 can be implemented, for example by a control node (e.g., the control node of FIG. 1 and/or the control node 304 of FIG. 9).
  • the control node can initiate an iterative execution of a distributed memory program over multiple numbers of task nodes in a computer cluster.
  • Each task node can include a computing device.
  • the control node can cause an interconnect emulator to change an available bandwidth value of an interconnect of the computer cluster over a range of available bandwidths values.
  • the control node can measure a completion time for execution of the distributed memory program for each change to the available bandwidth value and each change in the number of task nodes.
  • the control node can store an average demand constant in a non-transitory machine readable medium (e.g., memory) for the distributed memory program.
  • The average demand constant can correspond to a CTI for each change of available bandwidth value and each of the multiple numbers of task nodes. (A code sketch of this measurement workflow appears at the end of this list.)
  • FIG. 11 illustrates an example of a control node 450.
  • the control node 450 can be employed to implement the control node of FIG. 1 and/or the control node 304 of FIG. 9.
  • the control node 450 can include a non-transitory memory 452 to store machine readable instructions and a processing unit 454 to access the memory and execute the machine readable instructions.
  • the machine readable instructions can include an interconnect emulator 456 to control an amount of available bandwidth of an interconnect in a computer cluster.
  • the non-transitory memory 452 can also include a scalability predictor 458 to cause a repeated execution of a distributed memory program with multiple numbers of the task nodes in the computer cluster each with multiple set available bandwidths throttled by the interconnect emulator 456.
  • the scalability predictor 458 can also calculate an average demand constant for the distributed memory program, wherein the average demand constant corresponds to a CTI for each change of bandwidth and each of the multiple numbers of task nodes.
  • FIG. 12 illustrates an example of a non-transitory machine readable medium 500 (e.g., a memory) having machine readable instructions.
  • the machine readable instructions can include a scalability predictor 502 to measure a completion time for an execution of a distributed memory program on a computer cluster
  • the scalability predictor 502 can also calculate a plurality of demand constants that each correspond to a respective CTI for a change of the available bandwidth value and each of the multiple number of task nodes. The scalability predictor 502 can further average the plurality of demand constants to determine an average demand constant.
  • portions of the systems and method disclosed herein may be embodied as a method, data processing system, or computer program product such as a non-transitory computer readable medium. Accordingly, these portions of the approach disclosed herein may take the form of an entirely hardware embodiment, an entirely software embodiment (e.g., in a non-transitory machine readable medium), or an embodiment combining software and hardware. Furthermore, portions of the systems and methods disclosed herein may be a computer program product on a computer-usable storage medium having computer readable program code on the medium. Any suitable computer-readable medium may be utilized including, but not limited to, static and dynamic storage devices, hard disks, solid-state storage devices, optical storage devices, and magnetic storage devices.
  • These computer-executable instructions may also be stored in computer- readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture including instructions which implement the function specified in the flowchart block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart block or blocks.
  • Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a switch or a communication network. Examples of communication networks include a local area network ("LAN").
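To make the measurement loop of methods 10 and 400 concrete, the following is a minimal Python sketch of a control-node driver. The run_program and set_available_bandwidth hooks are hypothetical placeholders for launching the distributed memory program and driving the interconnect emulator; this is an illustration under those assumptions, not the patent's implementation.

```python
import time

# Hypothetical hooks: launching the MPI job and throttling the interconnect are
# cluster-specific. These names are placeholders, not the patent's tool or API.
def set_available_bandwidth(percent):
    raise NotImplementedError("drive the interconnect emulator here")

def run_program(num_task_nodes):
    raise NotImplementedError("launch the distributed memory program here")

def measure_completion_times(node_counts, bw_start=100, bw_min=20, bw_step=2):
    """Measure the completion time for every (task-node count, available-bandwidth) pair."""
    results = {}                          # (p, bw) -> completion time in milliseconds
    for p in node_counts:                 # e.g., 4, 9, 16 and 25 task nodes
        bw = bw_start
        while bw >= bw_min:
            set_available_bandwidth(bw)   # throttle the interconnect to bw percent
            start = time.perf_counter()
            run_program(p)                # execute the distributed memory program
            results[(p, bw)] = (time.perf_counter() - start) * 1000.0
            bw -= bw_step                 # decrement the set available bandwidth
        set_available_bandwidth(100)      # restore full bandwidth between node counts
    return results
```

The recorded completion times are the inputs to the CTI, bandwidth-demand and average-demand-constant calculations described in the detailed description below.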

Abstract

A method can include initiating an iterative execution of a distributed memory program over multiple numbers of task nodes in a computer cluster, wherein each task node comprises a computing device. The method can also include changing an available bandwidth value of an interconnect of the computer cluster over a range of available bandwidths values. The method can further include measuring a completion time for execution of the distributed memory program for each change to the available bandwidth value and each change in the number of task nodes. The method can still further include storing an average demand constant in a non-transitory machine readable medium for the distributed memory program. The average demand constant can correspond to a completion time increment (CTI) for each change of the available bandwidth value and each of the multiple numbers of task nodes.

Description

SCALABILITY PREDICTOR
BACKGROUND
[0001] Parallel computing is a type of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed in high-performance computing, and interest in parallelism has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become a dominant paradigm in computer architecture. In parallel computing, a set of computing devices concurrently execute a distributed memory program, such as a graph procedure.
[0002] In mathematics and computer science, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices, nodes or points and edges, arcs or lines that connect them. A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge or the edges of the graph may be directed from one vertex to another. Graph procedures are employed to solve problems related to graph theory.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 illustrates a flowchart of an example method for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster.
[0004] FIG. 2 illustrates an example of a graph being searched.
[0005] FIG. 3 illustrates an example of a chart plotting a completion time of a distributed memory program as a function of a percentage of available bandwidth.
[0006] FIG. 4 illustrates another example of a chart plotting a completion time of a distributed memory program as a function of a percentage of available bandwidth. [0007] FIG. 5 illustrates an example of a chart plotting a bandwidth demand as a function of a number of task nodes.
[0008] FIG. 6 illustrates another example of a chart plotting a bandwidth demand as a function of a number of task nodes.
[0009] FIG. 7 illustrates an example of a chart plotting a bandwidth demand as a function of a percentage of a Completion Time Increment (CTI).
[0010] FIG. 8 illustrates another example of a chart plotting a bandwidth demand as a function of a percentage of a CTI.
[0011] FIG. 9 illustrates an example of a medium-scale computer cluster for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster.
[0012] FIG. 10 illustrates a flowchart of an example method for calculating an average demand constant of a distributed memory program.
[0013] FIG. 11 illustrates an example of a control node for calculating an average demand constant of a distributed memory program.
[0014] FIG. 12 illustrates an example of a non-transitory machine readable medium having instructions for calculating an average demand constant of a distributed memory program.
DETAILED DESCRIPTION
[0015] Systems and methods for determining the scalability and performance of a distributed memory program for execution on a large-scale computer cluster are described herein. During the design and implementation of a distributed memory program, a programmer of the distributed memory program can access a limited size cluster for program debugging and for running related performance (profiling) experiments. Systems and methods described herein can assess and predict the scalability of the distributed memory program and/or the performance of the distributed memory program when this distributed memory program is executed on a large-scale computer cluster. In particular, the systems and methods described herein can assess the increased bandwidth demands of a communication volume as a function of the increased cluster size (e.g., a number of task nodes in the computer cluster). The systems and methods described herein include iteratively executing the distributed memory program on a medium-scale computer cluster with an "interconnect bandwidth throttling" tool, which enables the assessment of the communication demands with respect to available bandwidth. The systems and methods described herein can assist in predicting the cluster size at which a communication cost becomes a dominant component, at which point the performance benefit of an increased cluster size leads to a diminishing return or a decrease in performance.
[0016] FIG. 1 illustrates a flowchart of an example method 10 that can be implemented for analyzing the performance and scalability of a distributed memory program on a large-scale computer cluster. As used herein, the term "computer cluster" denotes a set of loosely or tightly connected computers that work together such that, in many respects, the connected computers can collectively be conceptualized as a single system. Computer clusters can operate as a parallel computing paradigm. Computer clusters can have each node set to perform a task that is controlled and scheduled by a software application (e.g., a scalability predictor) operating on a control node of the computer cluster. Each node performing a task can be referred to as a "task node". Additionally, in some examples, the control node of the computer cluster can also be a task node. Furthermore, in some examples, the control node may be considered to be separate from the computer cluster, and in other examples, the control node can be considered to be integrated with the computer cluster.
[0017] While, for purposes of simplicity of explanation, the method 10 of FIG. 1 is shown and described as executing serially, the method 10 is not limited by the illustrated order, as some actions could in other examples occur in different orders and/or concurrently from that shown and described herein. Moreover, in some examples, some actions of the illustrated actions of the method 10 may be omitted.
[0018] The method 10 can be implemented, for example, by a medium-scale computer cluster. A medium-scale computer cluster can be a computer cluster that has fewer nodes than the computer cluster on which the distributed memory program is designed/anticipated to be executed (e.g., up to about 32 nodes in one example). A large-scale computer cluster can have more nodes than the medium-scale computer cluster (e.g., more than about 33 nodes in one example). Each node on a computer cluster (a medium-scale or large-scale computer cluster) can be a computing device that can be referred to as a processing node (or referred to simply as a "processor"). Each node can store and execute machine readable instructions.
[0019] Each node in a computer cluster (a medium-scale or large-scale computer cluster) can include a partition of a distributed memory. Each partition of the distributed memory can be stored, for example, on a non-transitory machine readable medium that can store machine executable instructions. In some examples, the distributed memory can be shared among each of the nodes in the computer cluster, and in other examples, each partition (or some subset thereof) of the distributed memory can be stored on a local data store (e.g., hard drive and/or a solid state drive) that is accessible only by a specific node or set of nodes of the computer cluster. Additionally, each node can include at least one processor core to access the partition of the distributed memory and execute the stored machine readable instructions. In some examples, some or all of the nodes of the computer cluster can be representative of virtual resources (e.g., in a computing cloud) that operate across multiple instances of computing devices. In other examples, some or all of the nodes of the computer cluster can be representative of separate computing devices. Each node in the computer cluster can communicate with any of the other nodes (either directly or indirectly) via an interconnect of the computing cluster.
[0020] The medium-scale computer cluster can execute an interconnect emulator, such as InterSense, available from HEWLETT PACKARD ENTERPRISE® that can control (synthetically throttle) the interconnect bandwidth between nodes of the medium-scale computer cluster. In particular, the control node of the computer cluster (via the interconnect emulator) can synthetically set (e.g., throttle) a maximum available bandwidth for usage by the distributed memory program (e.g., referred to as an
"available bandwidth") to a range between 0% and 100%. Moreover, as noted, the method 10 can be employed, for example to execute a distributed memory program on the medium-scale computer cluster to predict the performance and scalability on the large-scale computer cluster.
[0021] Message passing interface (MPI) is a programming paradigm for scale-out, distributed memory applications. Using the method 10, an accurate analysis and prediction of how the scaling properties of the communication layer impact application performance can be ascertained for complex MPI-based programs that have interleaving computations and communications in inter-tangled patterns. Due to the asynchronous, concurrent execution of different task nodes, many communication delays between the task nodes could be "hidden" such that these delays do not contribute to or impact the overall completion time of the distributed memory program. Such delays can happen when some task nodes are still in "computation-based" or "processing" portions of the code of the distributed memory program, while the other task nodes have already performed communication exchanges. The method 10 described herein can be employed to analyze the utilized (required) interconnect bandwidth (or bandwidth demand) during the execution of the distributed memory program due to a variety of existing MPI collectives and calls that could involve different sets of nodes and communication styles. Moreover, performance of such distributed memory applications inherently depends on the performance of the communication layer of the cluster.
Therefore, in the method 10, actual executions of the distributed memory program (e.g., experiments) on the medium-scale computer cluster with different numbers of task nodes and different amounts of available bandwidth (set by the interconnect emulator) on the interconnect of the computer cluster are measured to predict performance on the large-scale computer cluster.
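The "hiding" of communication delays described above can be illustrated with a minimal mpi4py sketch (not taken from the patent): non-blocking messages are posted, local computation proceeds while the messages are in flight, and the transfer cost only becomes visible when the computation is too short to cover it.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty_like(send_buf)

# Post non-blocking communication with the neighboring ranks in a ring ...
right, left = (rank + 1) % size, (rank - 1) % size
requests = [comm.Isend(send_buf, dest=right), comm.Irecv(recv_buf, source=left)]

# ... and keep computing while the messages are in flight. If this local work
# takes longer than the transfer, the communication delay is effectively hidden.
local = np.sin(send_buf).sum()

MPI.Request.Waitall(requests)             # communication must finish before recv_buf is used
total = comm.allreduce(local, op=MPI.SUM)
if rank == 0:
    print("global sum:", total)
```

Run, for example, with mpiexec -n 4 python overlap_sketch.py (the file name is arbitrary).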
[0022] Each task node can be allocated with the same or similar resources. For instance, in one extended example of the method 10 (hereinafter, "the given example"), each task node in the medium-scale computer cluster can have 14 processor cores that are each 2 GHz, INTEL XEON® E4-2683v3 processors, with a 35 Megabyte last level cache size and 256 Gigabytes (GB) of Dynamic Random Access Memory (DRAM). Additionally, in the given example, the distributed memory program implements a Breadth-First Search (BFS) technique of a particular graph. BFS is a graph traversal procedure performed on an undirected, unweighted graph. The BFS attempts to determine a distance from a given source vertex to each vertex in the graph to find all vertices which are "one hop" away from the given vertex, "two hops", etc. In the given example, each task node can be configured to execute eighteen (18) MPI processes, each with one (1) thread. [0023] FIG. 2 illustrates an example of a simplified graph 100 depicting eight (8) vertices. The graph 100 is an example for demonstrating the BFS procedure. In FIG. 2 the given source vertex is labeled as "V1 (s)" and the remaining seven (7) vertices are labeled in FIG. 2 as "V2"-"V8". Each vertex is connected with an edge 102. In the BFS procedure, the minimum number of edges 102 traversed to travel between nodes defines the distance between the two nodes. Thus, for the source vertex V1 (s), vertices V2, V3 and V4 are "one-hop" away, and vertices V5 and V6 are "two hops" away.
[0024] Additionally, vertex V7 is "three hops" away from vertex V1 (s), as there are two non-equidistant routes. In the first (shorter) route, the edges 102 for a route of V1 (s)-V4-V5-V7 can be traversed and in a second (longer) route, the edges 102 for a route of V1 (s)-V3-V6-V8-V7 can also be traversed, such that the BFS procedure selects the first route. Similarly, vertex V8 is also "three hops" away from vertex V1 (s), as there are again two non-equidistant routes. In the first (longer) route, the edges 102 for a route of V1 (s)-V4-V5-V7-V8 can be traversed and in a second (shorter) route, the edges 102 for a route of V1 (s)-V3-V6-V8 can also be traversed, such that the BFS procedure selects the second route. It is noted that the graph 100 is illustrated only as a simple example to facilitate understanding of the BFS procedure. As is easily demonstrated in FIG. 2, as the complexity of the graph increases (increasing the number of vertices and edges), the computational resources required to complete the BFS procedure in a reasonable time can increase exponentially. In fact, in many instances, a graph (e.g., representing a social network or computer network) can have tens of billions of vertices and there may be hundreds of billions of edges.
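For reference, the hop distances described for FIG. 2 can be reproduced with a short breadth-first search. The adjacency list below is one reading of the graph 100 that is consistent with the routes described above; the exact edge set of the figure is an assumption.

```python
from collections import deque

# Assumed adjacency list for the graph 100 of FIG. 2, inferred from the routes
# described in the text (V1(s)-V4-V5-V7, V1(s)-V3-V6-V8, etc.).
graph_100 = {
    "V1": ["V2", "V3", "V4"],
    "V2": ["V1"],
    "V3": ["V1", "V6"],
    "V4": ["V1", "V5"],
    "V5": ["V4", "V7"],
    "V6": ["V3", "V8"],
    "V7": ["V5", "V8"],
    "V8": ["V6", "V7"],
}

def bfs_distances(adjacency, source):
    """Return the minimum number of edges (hops) from the source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:            # the first visit uses the shortest route
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

print(bfs_distances(graph_100, "V1"))
# V2, V3 and V4 are one hop from V1; V5 and V6 are two hops; V7 and V8 are three hops.
```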
[0025] Referring back to FIG. 1 , continuing with the given example, graph problems can be implemented in many different ways in terms of graph data partitioning for parallel processing. In the given example, it is presumed that two-dimensional (2-D) partitioning is employed.
[0026] In the method 10 shown in FIG. 1 , at 15, the control node can set parameters for execution of the distributed memory program. In particular, the parameters can include an initial set number of task nodes and a maximum number of task nodes to execute the distributed memory program on the computer cluster.
Additionally, the parameters can also include an initial set available bandwidth and a minimum available bandwidth for the interconnect emulator. Further, the parameters can include a set number of task nodes to increase and a set percentage of available bandwidth to decrement. Additionally, the parameters can include a predefined completion time increment (CTI) corresponding to a maximum acceptable increase in completion time. In the given example, the initial available bandwidth is set to 100% and the initial number of task nodes is set to 4. Additionally, in the given example, the minimum available bandwidth is set to 20% and the maximum number of task nodes is set to twenty-five (25). In the given example, the CTI has multiple values of 15%, 16%, 17%, 18% and 19%.
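The parameters of the given example can be collected in a small configuration structure, for instance as follows (the field names are illustrative, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class ScalabilityRunParameters:
    """Parameters of the given example (action 15 of method 10)."""
    task_node_counts: tuple = (4, 9, 16, 25)       # initial through maximum numbers of task nodes
    initial_bandwidth_pct: int = 100               # initial set available bandwidth
    min_bandwidth_pct: int = 20                    # minimum available bandwidth
    bandwidth_step_pct: int = 2                    # percentage decremented per iteration
    cti_targets_pct: tuple = (15, 16, 17, 18, 19)  # predefined CTI values

params = ScalabilityRunParameters()
print(params)
```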
[0027] At 20, the control node can cause the computer cluster to execute the distributed memory program. As noted, in the given example, the distributed memory program implements the BFS procedure traversing the particular graph for the set number of nodes at the set available bandwidth. At 25, the control node can record a completion time of the distributed memory program for the set bandwidth and the set number of nodes. The "completion time" can be a measured amount of time (e.g., in milliseconds) that the distributed memory program takes to complete operations. At 30, the control node can calculate a CTI for the set bandwidth, CTIbw. The CTI can be calculated with Equation 1 .
Equation 1:

$$ CTI_{bw} = \frac{\text{Completion Time at set } bw - \text{Completion Time at initial } bw}{\text{Completion Time at initial } bw} $$
Wherein:
Completion Time at set bw is the completion time at the set bandwidth (bw); and Completion Time at initial bw is the completion time at the initially set available bandwidth (bw) (typically 100%).
[0028] At 35, the control node can cause the interconnect emulator to reduce (e.g., throttle) the set available bandwidth by a set percentage. In the given example, the control node causes the interconnect emulator to reduce the set available bandwidth by 2%. At 40, a determination by the control node can be made as to whether the set bandwidth is below the minimum available bandwidth set in the parameters. If the determination at 40 is negative (e.g. "NO"), the method 10 can return to 20. If the determination at 40 is positive (e.g., "YES"), the method can proceed to 45.
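The interconnect emulator used for this throttling step is not described in detail in this text. Purely as an illustration of what synthetic bandwidth throttling can look like, the sketch below caps the egress bandwidth of one network interface with the Linux tc traffic shaper; it is a generic stand-in, not the emulator referenced above, and the interface name and rates are example values.

```python
import subprocess

def throttle_link(interface, rate):
    """Cap egress bandwidth on one interface with a token bucket filter (Linux tc)."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", interface, "root",
         "tbf", "rate", rate, "burst", "1mbit", "latency", "400ms"],
        check=True)

def clear_throttle(interface):
    """Remove the shaping qdisc, restoring the full interface bandwidth."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)

# Example: emulate 40% of a 10 Gb/s link (illustrative values only).
# throttle_link("eth0", "4gbit")
```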
[0029] As is seen from Equation 1, at the initially set available bandwidth (e.g., 100%), the CTI_bw is equal to '0'. For each other available bandwidth, the CTI_bw is a number greater than or equal to '0', since the completion time is increased (or at least not decreased) as the set amount of available bandwidth is reduced. Moreover, the CTI_bw can be calculated for each decrement in the set amount of available bandwidth for the set number of task nodes (p).
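Equation 1 translates directly into code. The completion times below are made up purely to illustrate the calculation:

```python
def completion_time_increment(completion_ms_at_set_bw, completion_ms_at_initial_bw):
    """Equation 1: relative increase in completion time versus the run at the
    initially set available bandwidth (typically 100%)."""
    return ((completion_ms_at_set_bw - completion_ms_at_initial_bw)
            / completion_ms_at_initial_bw)

# Illustrative numbers only: a run that takes 230 ms at a throttled bandwidth
# versus 200 ms at 100% gives CTI_bw = 0.15, i.e., a 15% completion time increment.
print(completion_time_increment(230.0, 200.0))   # -> 0.15
```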
[0030] At 45, the control node can calculate a Bandwidth Demand for the set number of task nodes (p), BW_{CTI,p}, corresponding to the predefined CTI. In some instances, at 45 multiple BW_{CTI,p} for different predetermined CTI can be calculated. In the given example, the BW_{CTI,p} can be calculated for CTI values of 15%, 16%, 17%, 18% and 19%.
[0031] At 50, a determination can be made by the control node as to whether the maximum number of task nodes has been reached. If the determination at 50 is negative (e.g. "NO"), the method 10 can proceed to 55. If the determination at 50 is positive, the method 10 can proceed to 60. At 55, the number of task nodes is increased by the set amount defined in the parameters and the method 10 returns to 20.
[0032] In this manner, the BW_{CTI,p} is calculated for each set number of task nodes from the initial set of task nodes to the maximum number of task nodes. FIG. 3 illustrates a chart 150 (e.g., a linear graph) that plots completion time (in ms) for the given example as a function of the set available bandwidth for four (4) different numbers of task nodes (labeled in FIG. 3 as "nodes", 4, 9, 16 and 25). More particularly, the chart 150 corresponds to the given example, where the graph being searched by the BFS procedure has a scale (s) of 27. As used herein, the graph scale size s denotes that the graph being searched has 2^s vertices and 16 * 2^s edges. Thus, a scale size of 27 represents a graph with about 134 million vertices and about 2.1 billion edges. FIG. 4 illustrates a chart 160 that also plots completion time (in ms) for the given example as a function of the set available bandwidth for four (4) different numbers of task nodes (labeled in FIG. 4 as "nodes", 4, 9, 16 and 25). In the chart 160, the graph has a scale size of 28, thereby representing a graph with about 268 million vertices and about 4.2 billion edges.
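The scale-size arithmetic can be checked in a few lines; the vertex and edge counts follow directly from the definition above:

```python
def graph_size(scale):
    """A graph of the given scale has 2**scale vertices and 16 * 2**scale edges."""
    return 2 ** scale, 16 * 2 ** scale

for s in (27, 28):
    vertices, edges = graph_size(s)
    print(f"scale {s}: about {vertices / 1e6:.0f} million vertices "
          f"and {edges / 1e9:.2f} billion edges")
# scale 27: about 134 million vertices and 2.15 billion edges
# scale 28: about 268 million vertices and 4.29 billion edges
```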
[0033] Continuing with the given example, FIG. 5 illustrates a chart 170 (e.g., a linear graph) that plots the bandwidth demand, BW_{CTI}, as a function of a number of task nodes (processors) for 5 different predetermined CTIs, 15%, 16%, 17%, 18% and 19%, with a scale, s, of 27. FIG. 6 illustrates a chart 180 (e.g., a linear graph) that plots the bandwidth demand, BW_{CTI}, as a function of a number of task nodes (processors) for 5 different predetermined CTIs, 15%, 16%, 17%, 18% and 19%, with a scale, s, of 28. As illustrated by the charts 170 and 180, the minimum bandwidth needed (the bandwidth demand, BW_{CTI}) to achieve a predetermined CTI increases as a function of the number of task nodes implemented in the medium-scale computer cluster.
[0034] Referring back to FIG. 1, at 60, the control node can determine a demand constant (DC) for each different CTI. The demand constant for a particular CTI and two different numbers of task nodes can be calculated with Equations 2 and 3.
[Equation 2 and Equation 3 — presented as an image in the original publication; they define the demand constant DC_CTI,p1,p2 in terms of BW_CTI,p1 and BW_CTI,p2.]
Wherein:
p1 is a first number of task nodes employed by the medium-scale computer cluster to complete the distributed memory program;
p2 is a second number of task nodes employed by the medium-scale computer cluster to complete the distributed memory program, and p1 < p2;
BW_CTI,p1 is the bandwidth demand for the particular CTI with the first number of task nodes (p1);
BW_CTI,p2 is the bandwidth demand for the particular CTI with the second number of task nodes (p2); and
DC_CTI,p1,p2 is the demand constant for the particular CTI for two particular numbers of task nodes, p1 and p2.
[0035] Thus, by employing Equation 3, a demand constant for each particular CTI and each consecutive pair of numbers of task nodes (e.g., 4, 9, 16 and 25 in the given example) can be calculated. At 65, the control node can calculate an average of the demand constants over each predetermined CTI and each such pair of numbers of task nodes, which can be referred to as the average demand constant, DC. The average demand constant, DC, can be, for example, an arithmetic mean, another Pythagorean mean (e.g., a geometric or harmonic mean), a mode, a median, etc.
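Because Equations 2 and 3 are only available as an image, the sketch below shows just the bookkeeping of action 65: compute a demand constant for every CTI and every pair of calibrated task-node counts, then average them. The per-pair formula is passed in as `demand_constant`, a placeholder standing in for Equation 3; paragraph [0042] suggests that all pairwise combinations of the calibrated node counts are used.

```python
# Sketch of the averaging step (action 65). `demand_constant` is a placeholder
# for Equations 2-3, which are not reproduced in this text; its arguments follow
# the "wherein" clause above.
from itertools import combinations
from statistics import mean

def average_demand_constant(bw_demand, node_counts, ctis, demand_constant):
    """bw_demand[(cti, p)] -> BW_CTI,p; node_counts ascending, e.g. [4, 9, 16]."""
    dcs = [demand_constant(bw_demand[(c, p1)], bw_demand[(c, p2)], p1, p2)
           for c in ctis
           for p1, p2 in combinations(node_counts, 2)]   # p1 < p2 for ascending input
    return mean(dcs)   # arithmetic mean; the text also allows other averages
```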
[0036] At 70, the control node can calculate a predicted (e.g., estimated) bandwidth demand for a large-scale computer cluster. To calculate the bandwidth demand for a large-scale computer cluster, the control node can employ Equations 4 and 5.
[Equation 4 and Equation 5 — presented as an image in the original publication; they give the predicted bandwidth demand BW_CTI,pk in terms of the average demand constant DC and BW_CTI,pk-1.]
Wherein:
k is an integer greater than one;
DC is the calculated average demand constant;
p_{k-1} is a number of task nodes in the medium-scale or large-scale computer cluster;
p_k is a next increased number of task nodes and p_k > p_{k-1} (it is noted that measured values of p_{k-1} are eventually limited by the number of task nodes in the medium-scale computer cluster);
BW_CTI,pk is the predicted bandwidth demand for the number of task nodes, p_k, in the large-scale computer cluster; and
BW_CTI,pk-1 is either a measured bandwidth demand or a predicted bandwidth demand, based on previous calculations, for the next smaller number of task nodes, p_{k-1}, in the medium-scale or large-scale computer cluster.
[0037] As noted, by employing Equation 5, the bandwidth demand for a set number of task nodes, p_k, can be predicted. In this manner, if there is a maximum available bandwidth of the interconnect for the large-scale computer cluster, it can be determined whether the maximum available bandwidth of the interconnect meets the predicted bandwidth demand, BW_CTI,pk, for the set number of task nodes, p_k.
[0038] In many situations in large-scale computer clusters, less than 100% of the bandwidth (e.g., 75%) of an interconnect may be allocated to the execution of the distributed memory program, which limit can be referred to as a predetermined amount of available bandwidth of the interconnect. It is noted that in some such situations, action 70 can be repeated multiple times. For instance, the control node can employ a plurality of predicted numbers of task nodes, p_k, to calculate a corresponding plurality of predicted bandwidth demands, BW_CTI,pk. In this manner, a specific predicted number of task nodes, p_k, can be calculated for a point at which a further increase in the number of task nodes would lead to an interconnect bandwidth demand for the distributed memory program that is higher than the predetermined amount of available bandwidth on the interconnect of the large-scale computer cluster. At 75, the control node can predict (e.g., via calculation) a number of task nodes, p_k, that reaches a point of decreasing marginal utility for a set amount of bandwidth of the interconnect on the large-scale computer cluster by employing Equations 4 and 5.
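The chained use of Equation 5 described in actions 70 and 75 can be sketched as a loop, with `predict_next_bw` standing in for Equation 5 itself (not reproduced here): each predicted bandwidth demand seeds the prediction for the next larger node count until the demand exceeds the bandwidth allocated on the large-scale cluster's interconnect. A minimal sketch, assuming that interface:

```python
# Sketch of the chained prediction in actions 70/75. `predict_next_bw` is a
# placeholder for Equation 5; the loop structure follows paragraphs [0036]-[0038].

def nodes_at_marginal_utility(bw_measured, p_start, node_counts, avg_dc,
                              available_bw, predict_next_bw):
    """Return the last node count whose predicted demand fits within available_bw."""
    bw_prev, p_prev = bw_measured, p_start          # e.g., measured BW_CTI at 16 nodes
    last_feasible = p_prev
    for p_next in node_counts:                      # increasing counts, e.g. [25, 36, 49, ...]
        bw_next = predict_next_bw(bw_prev, p_prev, p_next, avg_dc)   # Equation 5 placeholder
        if bw_next > available_bw:                  # e.g., 75% of the interconnect
            break
        last_feasible, bw_prev, p_prev = p_next, bw_next, p_next
    return last_feasible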
[0039] As noted, in some situations in large-scale computer clusters, less than 100% of the bandwidth (e.g., 75%) of an interconnect may be allocated to the execution of the distributed memory program. Thus, Equations 4 and 5 provide a mechanism for predicting the number of task nodes that reaches the point of decreasing marginal utility for a given amount of available bandwidth on the interconnect of the large-scale computer cluster. As used herein, the term "point of decreasing marginal utility" denotes the point at which the communication cost becomes so dominant that the performance (scalability) benefit of a further increase in the number of task nodes would lead to a diminishing return in the completion time of the distributed memory program or to an increase in the completion time of the distributed memory program (a decrease in performance).
[0040] By implementing the method 10, the performance and scalability of the distributed memory program can be predicted. That is, by employing Equations 4-5 and setting a number of task nodes, a bandwidth demand can be predicted by the control node, as explained in action 70. Additionally or alternatively, by employing Equations 4-5 and setting an amount of available bandwidth on an interconnect of the large-scale computer cluster, the control node can calculate the predicted number of task nodes that reaches the point of decreasing marginal utility for the distributed memory program.
[0041] The method 10 accounts for the effects of decreasing available bandwidth during execution of the distributed memory program. As illustrated in FIGS. 3-4, it is often the case that decreasing the bandwidth on an interconnect of a computer cluster has little effect until the bandwidth is throttled (e.g., decreased) to a certain percentage. For instance, in the charts 150 and 160 of FIGS. 3-4, at available bandwidth percentages between 100% and 30%, there is an increase of less than 50 ms in the completion time. However, between 30% and 20% there is about another 50 ms increase in completion time. In such a situation, the CTI below 30% available bandwidth increases at a nonlinear rate. Conventional performance analysis systems do not consider this possibility.
[0042] FIGS. 7 and 8 illustrate charts 200 and 210 that compare experimentally measured results to calculated predicted results. In particular, the charts 200 and 210 plot a bandwidth demand percentage as a function of a number of task nodes. By employing a twenty-five (25) task node computer cluster to execute the distributed memory program (a BFS of a graph) with a graph scale of 27 in the chart 200 and a graph scale of 28 in the chart 210, the measured results can be obtained and plotted. Moreover, by employing the collected measurements for computer clusters with 4, 9 and 16 task nodes (where each task node is configured with 18 MPI processes, i.e., with 72, 162 and 288 MPI processes respectively), the control node can obtain three (3) different combinations of task node counts for each considered CTI (15 in total), such that the demand constant for each CTI and each combination can be calculated. The average DC can be calculated over these 15 values of DC_CTI,pi,pj. Bandwidth demands for 25 task nodes (with 450 processor cores) can be predicted using the average DC, Equation 5, and p_{k-1} as 288 processes (16 task nodes). The average DC is 0.8 and 0.86 for scaling factors 27 and 28, respectively.
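The bookkeeping in this paragraph can be checked with a few lines; the numbers below simply restate the text.

```python
# Quick check of the experiment bookkeeping in paragraph [0042].
from itertools import combinations

calib_nodes, mpi_per_node, ctis = [4, 9, 16], 18, [15, 16, 17, 18, 19]
print([n * mpi_per_node for n in calib_nodes])        # [72, 162, 288] MPI processes
pairs = list(combinations(calib_nodes, 2))            # 3 node-count combinations
print(len(pairs) * len(ctis))                         # 15 demand constants in total
print(25 * mpi_per_node)                              # 450 processes on the 25-node cluster
```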
[0043] As is illustrated in the charts 200 and 210 of FIGS. 7 and 8, the prediction is relatively accurate. In fact, the error (the difference between the predicted bandwidth demand and the measured bandwidth demand) is less than 2% for a scaling factor of 27 (in FIG. 7) and less than 0.5% for a scaling factor of 28 (in FIG. 8). Thus, FIGS. 7 and 8 demonstrate that the method 10 provides excellent accuracy for predicting the performance of a computer cluster (large or medium) executing the distributed memory program.
[0044] FIG. 9 illustrates an example of a medium-scale computer cluster 300 that could be employed to predict bandwidth demands and/or a number of processors needed to reach a point of decreasing marginal utility. The computer cluster 300 can be employed, for example, to implement the method 10 illustrated in FIG. 1 .
[0045] The computer cluster 300 can be employed to execute a distributed memory program, wherein the execution of the distributed memory program is shared among N number of task nodes 302 of the computer cluster 300, where N is an integer greater than or equal to two (2). The computer cluster 300 can also include a control node 304. Each task node 302 can be a computing device that includes a
processor 306 including one or more processor cores. Each task node 302 can also include a memory 308 that stores machine readable instructions that can be accessed and executed by the processor 306. The memory 308 of each task node 302 can be, for example, volatile memory (e.g., RAM), non-volatile memory (e.g., a hard disk drive, a solid state drive, flash memory, etc.) or a combination thereof. Each task node 302 can be employed to implement a task node described with respect to FIG. 1. Thus, each memory 308 of the task nodes 302 can include a partition of a distributed (shared) memory that is employed to execute the distributed memory program. The control node 304 can also include a processor 310 (that includes one or more processor cores) and a memory 312. The control node 304 can be employed to implement the control node described with respect to FIG. 1 .
[0046] The N number of task nodes 302 and the control node 304 can
communicate over an interconnect 314 of the computer cluster 300. The
interconnect 314 can be, for example, InfiniBand, a backplane, a network connection or a combination thereof. Moreover, in some examples, some (or all) of the components of the computer cluster 300 can be virtual components executing on a computing cloud. In such a situation, each component could be representative of multiple instances of hardware (i.e., distributed). In other examples, each component of the computer cluster 300 (or some portion thereof) can be representative of a physical hardware device.
[0047] The control node 304 can initiate and control execution of the distributed memory program among the N number of task nodes 302. Additionally, the control node 304 can calculate performance metrics of the computer cluster 300 that can be employed to predict the scalability of the distributed memory program on a large-scale computer cluster (e.g., 25+ task nodes). For purposes of simplification of explanation, the control node 304 and the N number of task nodes 302 are illustrated and described in FIG. 9 as being separate devices. However, in other examples, the control node 304 can be integrated with a task node 302.
[0048] The distributed memory program can employ, for example, a graph searching procedure, such as a BFS procedure. In such a situation, the distributed memory program executed by the computer cluster 300 can be employed to search a graph for a shortest distance between vertices of the graph (e.g., as illustrated and described with respect to FIG. 2). However, it is to be understood that the examples described herein represent merely one example of a distributed memory program that could be executed by the computer cluster 300.
[0049] The control node 304 can execute a scalability predictor 316 to predict the performance and scalability of the distributed memory program on a large-scale computer cluster. The scalability predictor 316 can set parameters for a scalability evaluation of a distributed memory program. Additionally, the scalability predictor 316 can cause the computer cluster 300 to execute the distributed memory program multiple times using a set number of task nodes (e.g., 4, 9, 16 and 25) and record the
completion time (e.g., the time needed for the distributed memory program to achieve a desired result). Moreover, the scalability predictor 316 can cause the computer cluster 300 to re-execute the distributed memory program using an interconnect emulator 318 (e.g., InterSense) to synthetically decrement (e.g., throttle) the available bandwidth of the interconnect 314. The scalability predictor 316 can record the completion time for each incremental decrease of the interconnect bandwidth. As one example, the charts 150 and 160 in FIGS. 3 and 4 illustrate an increase in completion time plotted as a function of available bandwidth over a range of available bandwidths (e.g., 100% to 20%) for different numbers of task nodes.
[0050] The scalability predictor 316 can employ Equation 1 to determine a CTI_bw for each decrement of available bandwidth on the interconnect 314. The scalability predictor 316 can compare the CTI_bw for each decrement of available bandwidth to one or more predetermined CTIs (e.g., 15%-19%) to determine a bandwidth demand, BW_CTI,p, at each predetermined CTI for each set number of task nodes 302 employed. The charts 170 and 180 illustrated in FIGS. 5 and 6 plot bandwidth demands as a function of a number of task nodes for five (5) different CTIs (15%-19%).
[0051] Upon determining the BW_CTI,p for each number of the N number of task nodes 302 employed, the scalability predictor 316 can employ Equation 3 to calculate a demand constant, DC_CTI,p1,p2, for each CTI and each pair of numbers of task nodes 302. Upon determining each DC_CTI,p1,p2, the scalability predictor 316 can calculate an average demand constant, DC.
[0052] The scalability predictor 316 can employ Equations 4-5 to predict performance and/or scalability limits of the distributed memory program. Specifically, the scalability predictor 316 can employ Equation 5 to predict bandwidth demands for a specific number of task nodes (p_k) in a large-scale computer cluster.
[0053] Additionally or alternatively, given a specific bandwidth demand percentage (e.g., 70%), the scalability predictor 316 can employ Equation 5 to determine a number of task nodes 302 that is at (or near) the point of decreasing marginal utility for the distributed memory program. That is, the scalability predictor 316 can estimate a point at which a further increase in the number of task nodes 302 provides either diminishing returns or an increase in completion time (e.g., decreased overall performance) due to increased messaging demands of the distributed memory program.
[0054] By employing the computer cluster 300, the scalability predictor 316 of the control node 304 can accurately predict operational requirements and limitations (e.g., the bandwidth demands and/or the greatest number of task nodes 302 before reaching the point of decreasing marginal utility) of the distributed memory program. Moreover, since the computer cluster 300 actually executes instances of the distributed memory program with a throttled bandwidth on the interconnect 314, the operational requirements and limitations can be predicted accurately.
[0055] FIG. 10 illustrates an example of a flowchart of a method 400 for predicting an average demand constant for a distributed memory program. The method 400 can be implemented, for example, by a control node (e.g., the control node of FIG. 1 and/or the control node 304 of FIG. 9). At 410, the control node can initiate an iterative execution of a distributed memory program over multiple numbers of task nodes in a computer cluster. Each task node can include a computing device. At 420, the control node can cause an interconnect emulator to change an available bandwidth value of an interconnect of the computer cluster over a range of available bandwidth values. At 430, the control node can measure a completion time for execution of the distributed memory program for each change to the available bandwidth value and each change in the number of task nodes. At 440, the control node can store an average demand constant in a non-transitory machine readable medium (e.g., memory) for the distributed memory program. The average demand constant can correspond to a CTI for each change of the available bandwidth value and each of the multiple numbers of task nodes.
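Pulling the earlier sketches together, method 400 can be outlined as a driver loop. The `cluster.run_program` and `emulator.set_available_bandwidth` calls are hypothetical stand-ins for the control node's actual interfaces, and `cti`, `bandwidth_demand` and `average_demand_constant` are the helper sketches given above (with `demand_constant` still a placeholder for Equations 2-3).

```python
# Orchestration sketch of method 400 (FIG. 10); interfaces are assumptions.
def method_400(node_counts, bw_steps, ctis, cluster, emulator, demand_constant):
    bw_demand = {}
    for p in node_counts:                                   # 410: e.g., 4, 9, 16, 25 task nodes
        emulator.set_available_bandwidth(100)
        t_full = cluster.run_program(p)                     # completion time at 100% bandwidth
        cti_by_bw = {}
        for bw in bw_steps:                                 # 420: e.g., 100% down to 20% in 2% steps
            emulator.set_available_bandwidth(bw)            # throttle the interconnect
            cti_by_bw[bw] = cti(cluster.run_program(p), t_full)   # 430: measure and convert
        for c in ctis:
            bw_demand[(c, p)] = bandwidth_demand(cti_by_bw, c / 100.0)
    return average_demand_constant(bw_demand, node_counts, ctis, demand_constant)   # 440
```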
[0056] FIG. 11 illustrates an example of a control node 450. The control node 450 can be employed to implement the control node of FIG. 1 and/or the control node 304 of FIG. 9. The control node 450 can include a non-transitory memory 452 to store machine readable instructions and a processing unit 454 to access the memory and execute the machine readable instructions. The machine readable instructions can include an interconnect emulator 456 to control an amount of available bandwidth of an interconnect in a computer cluster. The machine readable instructions can also include a scalability predictor 458 to cause a repeated execution of a distributed memory program with multiple numbers of task nodes in the computer cluster, each with multiple set available bandwidths throttled by the interconnect emulator 456. The scalability predictor 458 can also calculate an average demand constant for the distributed memory program, wherein the average demand constant corresponds to a CTI for each change of bandwidth and each of the multiple numbers of task nodes.
[0057] FIG. 12 illustrates an example of a non-transitory machine readable medium 500 (e.g., a memory) having machine readable instructions. The machine readable instructions can include a scalability predictor 502 to measure a completion time for an execution of a distributed memory program on a computer cluster
comprising a plurality of task nodes, for each change to an available bandwidth value of an interconnect of the computer cluster and each change in the number of task nodes, wherein each task node comprises a computing device. The scalability predictor 502 can also calculate a plurality of demand constants that each correspond to a respective CTI for a change of the available bandwidth value and each of the multiple numbers of task nodes. The scalability predictor 502 can further average the plurality of demand constants to determine an average demand constant.
[0058] In view of the foregoing structural and functional description, those skilled in the art will appreciate that portions of the systems and method disclosed herein may be embodied as a method, data processing system, or computer program product such as a non-transitory computer readable medium. Accordingly, these portions of the approach disclosed herein may take the form of an entirely hardware embodiment, an entirely software embodiment (e.g., in a non-transitory machine readable medium), or an embodiment combining software and hardware. Furthermore, portions of the systems and methods disclosed herein may be a computer program product on a computer-usable storage medium having computer readable program code on the medium. Any suitable computer-readable medium may be utilized including, but not limited to, static and dynamic storage devices, hard disks, solid-state storage devices, optical storage devices, and magnetic storage devices.
[0059] Certain examples have also been described herein with reference to block illustrations of methods, systems, and computer program products. It will be understood that blocks of the illustrations, and combinations of blocks in the illustrations, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to one or more processors of a general purpose computer, special purpose computer, or other programmable data processing apparatus (or a combination of devices and circuits) to produce a machine, such that the instructions, which execute via the one or more processors, implement the functions specified in the block or blocks.
[0060] These computer-executable instructions may also be stored in computer- readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture including instructions which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart block or blocks.
[0061] Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a switch or a communication network. Examples of communication networks include a local area network ("LAN").
[0062] What have been described above are examples. It is, of course, not possible to describe every conceivable combination of structures, components, or methods, but one of ordinary skill in the art will recognize that many further
combinations and permutations are possible. Accordingly, the invention is intended to embrace all such alterations, modifications, and variations that fall within the scope of this application, including the appended claims. Where the disclosure or claims recite "a," "an," "a first," or "another" element, or the equivalent thereof, it should be interpreted to include one or more than one such element, neither requiring nor excluding two or more such elements. As used herein, the term "includes" means includes but not limited to, and the term "including" means including but not limited to. The term "based on" means based at least in part on.

CLAIMS

What is claimed is:
1. A method comprising:
initiating an iterative execution of a distributed memory program over multiple numbers of task nodes in a computer cluster, wherein each task node comprises a computing device;
changing an available bandwidth value of an interconnect of the computer cluster over a range of available bandwidth values;
measuring a completion time for execution of the distributed memory program for each change to the available bandwidth value and each change in the number of task nodes; and
storing an average demand constant in a non-transitory machine readable medium for the distributed memory program, wherein the average demand constant corresponds to a completion time increment (CTI) for each change of the available bandwidth value and each of the multiple numbers of task nodes.
2. The method of claim 1, further comprising calculating a predicted bandwidth demand for the distributed memory program on a predetermined number of task nodes and a predetermined CTI based on the average demand constant.
3. The method of claim 2, wherein the predetermined number of task nodes is greater than the number of task nodes in the computer cluster.
4. The method of claim 2, wherein the predicted bandwidth demand is further based on a calculated or measured bandwidth demand for the predetermined CTI with a specific number of task nodes that is less than the predetermined number of task nodes.
5. The method of claim 1, further comprising calculating and storing in the non-transitory machine readable medium, a predicted number of task nodes for a point at which a further increase in the number of task nodes leads to an interconnect bandwidth demand for the distributed memory program that is higher than a predetermined amount of available bandwidth of the interconnect, wherein the predicted number of task nodes is based on the average demand constant.
6. The method of claim 1, further comprising calculating and storing in the non-transitory machine readable medium, a predicted number of task nodes for a point at which a further increase in the number of task nodes provides diminishing returns or an increase in completion time for the distributed memory program, wherein the predicted number of task nodes is based on the average demand constant and a predetermined amount of available bandwidth of the interconnect.
7. The method of claim 6, wherein the predicted number of nodes is greater than the number of task nodes in the computer cluster.
8. A control node comprising:
a non-transitory memory to store machine readable instructions; and
a processing unit to access the memory and execute the machine readable instructions, the machine readable instructions comprising:
an interconnect emulator to control an amount of available bandwidth of an interconnect in a computer cluster;
a scalability predictor to:
cause a repeated execution of a distributed memory program with multiple numbers of the task nodes in the computer cluster each with multiple set available bandwidths throttled by the interconnect emulator; and
calculate an average demand constant for the distributed memory program, wherein the average demand constant corresponds to a completion time increment (CTI) for each change of bandwidth and each of the multiple numbers of task nodes.
9. The control node of claim 8, wherein the scalability predictor is further to calculate a predicted bandwidth demand for the distributed memory program on a predetermined number of task nodes of the computer cluster and a predetermined CTI based on the average demand constant.
10. The control node of claim 8, wherein the set number of task nodes is greater than the number of task nodes in the computer cluster.
11. The control node of claim 8, wherein the scalability predictor is further configured to calculate a predicted number of task nodes for a point at which a further increase in the number of task nodes leads to an interconnect bandwidth demand for the distributed memory program that is higher than a predetermined amount of available bandwidth of the interconnect, wherein the predicted number of task nodes is based on the average demand constant.
12. The control node of claim 8, wherein the scalability predictor is further to calculate a predicted number of task nodes for a point at which further increase in the number of task nodes provides diminishing returns or an increase in completion time for the distributed memory program, wherein the predicted number of task nodes is based on the demand constant and a set amount of available bandwidth of the interconnect.
13. The control node of claim 8, wherein the distributed memory program is executed on the computer cluster at least 15 times.
14. A non-transitory machine readable medium having machine readable
instructions, the machine readable instructions comprising:
a scalability predictor to:
measure a completion time for an execution of a distributed memory program on a computer cluster comprising a plurality of task nodes for each change to an available bandwidth value of an interconnect of the computer cluster and each change in the number of task nodes, wherein each task node comprises a computing device; calculate a plurality of demand constants that each correspond to a respective completion time increment (CTI) for each change of the available bandwidth value and each of the multiple numbers of task nodes; and
average the plurality of demand constants to determine an average demand constant.
15. The medium of claim 14, wherein the scalability predictor is further to calculate a predicted bandwidth demand for the distributed memory program on a predetermined number of task nodes of the computer cluster and a predetermined CTI based on the average demand constant.