WO2005111843A2

WO2005111843A2 - Methods for parallel processing communication

Info

Publication number: WO2005111843A2
Application number: PCT/US2005/016407
Authority: WO
Inventors: Kevin D. Howard; James A. Lupo
Original assignee: Massively Parallel Technologies, Inc.
Priority date: 2004-05-11
Filing date: 2005-05-11
Publication date: 2005-11-24
Also published as: WO2005111843A3

Abstract

A system, method and software product facilitate communication to, from and within a parallel processing environment such as a cascade. A partial data-set all-to-all exchange between a plurality of compute nodes of the parallel processing environment is described. A manifold includes a plurality of home nodes organized as a cascade, wherein each home node forms a cascade with one or more compute nodes. A hyper manifold is formed by generating a cascade of home nodes from each communication channel of a first home node. All-to-all communication and next-neighbor data exchange between compute nodes of a cascade is described. An algorithm is organized so that compute nodes concurrently cross-communicate while executing the algorithm. The algorithm may be decomposed into uncoupled functional components that are processed on one or more compute nodes. Checkpointing facilitates failure recovery. Emotional analogues applied to dynamic resource allocation within a parallel processing environment are described.

Description

METHODS FOR PARALLEL PROCESSING COMMUNICATION

RELATED APPLICATIONS This application claims the benefit of priority to U.S. Provisional Application No. 60/569,845, filed May 11, 2004, and to U.S. Provisional Application No. 60/586,852, filed My 9, 2004; each of these applications incorporated herein by reference.

BACKGROUND Over the last thirty years, opinion surrounding high performance and massively parallel computers has assumed that to improve processing speed and to scale parallel performance, faster processors and faster communication channels with low latency are required. However, a human brain is capable of performing sophisticated and intensive levels of multi-processing; and yet, in comparison to modern computer systems, processing within the human brain appears to be weak, point-to-point channel speed is extremely slow, and latency of these point-to-point channels is high. If the current assertion is correct, it should be impossible for the brain to have computational scaling that is completely different from, and profoundly more scaleable than, modern technologies. A compute task may contain serial and/or parallel processing elements. Parallel compute tasks are able to take advantage of parallel processing systems, whereas serial elements within the compute task, by definition, cannot be performed in parallel. Amdahl's Law argues that even where the fraction of serial work (say s) in a given problem is small, the maximum speed increase obtainable from'an infinite number of parallel processors is limited to 1/s.

SUMMARY OF THE INVENTION Commonly-owned U.S. Patent Number 6,857,004 filed June 26, 2000, titled'Collective Network Processing System and Methods!' is incorporated herein by reference. Commonly-owned U.S. Patent Application Number 10/340,524 filed January 10, 2003, titled'Parallel Processing Systems and Method', is incorporated herein by reference. In one embodiment, a method inputs a problem-set to a parallel processing environment based upon Howard cascades. The problem-set is received at a first home node of the parallel processing environment and distributed, from the first home node, to one or more other home nodes of the parallel processing environment. The problem-set is then distributed from the first home node and the other home nodes to a plurality of compute nodes of the parallel processing environment. In another embodiment, a method performs a partial data-set all-to-all exchange between a plurality of compute nodes of a parallel processing environment based upon Howard cascades. First and second identical lists of unique identifiers for the compute nodes are created, wherein the identifiers are organized in ascending order within the first and second lists. If the number of compute nodes is odd, an identifier for a home node of the parallel processing environment is appended to the first and second lists. A first pattern is applied to the first list to identify one or more first node pairs and data is exchanged between each first node pair. If the number of compute nodes is odd, a second pattern is applied to the second list to identify one or more second node pairs and data is exchanged oetween eacn seconα noαe pair, π me numoer of compute nodes is even, a second pattern is applied to the second list to identify one or more third node pairs and data is exchanged between all but the last node pair of the third node pairs. If the number of compute nodes is odd, the second pattern is applied to the second list to identify one or more fourth node pairs and data is exchanged in a reverse direction between each fourth node pair. If the number of compute nodes is even, the second pattern is applied to the second list to identify one or more fifth node pairs and data is exchanged in a reverse direction between all but the last node pair of the fifth node pairs. The first pattern is applied to the first list to identify one or more sixth node pairs and data is exchanged in a reverse direction between each node pair of the sixth node pairs. All but the last entry in the first list is shifted up by one and the identifier moved from the first entry is inserted in the last but one entry. All but the last entry in the second list is shifted down by one and the identifier moved from the last but one entry is inserted into the first entry. The steps of applying and shifting are repeated until all data is exchanged. In another embodiment, a manifold includes a plurality of home nodes organized as a cascade, wherein each home node forms a cascade with one or more compute nodes. In another embodiment, a method generates a hyper-manifold. A cascade of home nodes is generated from each communication channel of a first home node. For each additional home node level, a cascade of home nodes is generated from each communication channel of each home node of the last home node level generated. A cascade group is then generated from each communication channel of each home node. In another embodiment, a method forms a cascade. A network generator is used to generate a pattern of nodes and the pattern is then converted to a tree structure of interconnected nodes, to form the cascade. In another embodiment, a method provides for all-to-all communication within a manifold having two or more cascade groups. Data is exchanged between compute nodes within each cascade group. Data is then exchanged between each cascade group. Then, data is exchanged between top-level nodes of the manifold. In another embodiment, a method provides for next-neighbor data exchange between compute nodes of a cascade. A neighbor stencil is utilized for each element in a dataset allocated to a first compute node to identify nearest neighbor elements in a dataset allocated to other compute nodes. Data is exchanged with the other compute nodes to receive the nearest neighbor elements. In another embodiment, a processor increases memory capacity of a memory by compressing data and includes a plurality of registers, a level 1 cache, and a compression engine located between the registers and the level 1 cache. The registers contain uncompressed data and the level 1 cache contains compressed data such that compressed data is written to the memory. In another embodiment, a method forms a parallel processing environment. First compute nodes of the environment are organized as a cascade. Home nodes of the environment are organized as a manifold. An algorithm for processing by the first compute nodes is organized so that the first compute nodes concurrently cross-communicate while executing the algorithm, such that additional compute nodes added to the environment improve performance over performance attained by the first compute nodes. In another embodiment, a method parallelizes an algorithm for execution on a cascade. The algorithm is decomposed into uncoupled functional components. The uncoupled functional components are processed on one or more compute nodes of the cascade. In another embodiment, a method balances work between multiple controllers within a hyper- manifold. Home nodes are organized as a hyper-manifold. An alternating all-to-all exchange is performed on level-2 nodes of the hyper-manifold. An alternating all-to-all exchange is performed on level- 1 nodes of the hyper-manifold, and any level- 1 home node is used as a controller. In another embodiment, a method reduces communication latency within a compute node of a cascade. A first processing thread is used to handle asynchronous input to the compute node; a second processing thread is used to process a job of the cascade; and a third processing thread is used to handle asynchronous output from the compute node. In another embodiment, a method checkpoints a hyper-manifold. An all-to-all exchange of checkpoint data is performed and checkpoint data of all other compute nodes is stored at each compute node of the hyper-manifold. In another embodiment, a method processes a problem on a cascade. The problem is received from a user. An input mesh, based upon the problem, is created to apportion the problem to compute nodes of the cascade. An input dataset, based upon the input mesh, is acquired on each compute node. The input dataset is processed on each compute node to produce output data. The output data is agglomerated from all compute nodes to form an output mesh and the results, based upon the output mesh, are returned to a user. In another embodiment, a method benchmarks problem-set distribution of parallel processing architecture. If the same code and data is distributed to all nodes, a number of time units required to broadcast the code and data to all nodes is determined. If dissimilar code and data is distributed to each node, a number of time units required to send the code and data to each node from a single controller is determined. In another embodiment, a method applies emotional analogues to dynamic resource allocation within a parallel processing environment. Completion time for each job running on the parallel processing environment is determined. The required time frames for each job are accessed. A number of compute nodes allocated to a job is increased if the priority of the job increases. The number of compute nodes allocated to a job is decreased if the priority of the job decreases. In another embodiment, a method applies emotional analogues to dynamic resource allocation within a parallel processing environment. A first processing state of one or more associated jobs is defined within the parallel processing environment as a first emlog star. A second processing state of one or more associated jobs within the parallel processing environment is defined as a second emlog star. The first emlog star is transitioned to the second emlog star to transition the first processing state to the second processing state. In another embodiment, a method applies emotional analogues to dynamic resource allocation within a parallel processing environment. A processing state of one or more associated algorithms within the parallel processing environment is defined as an emlog star. Completion time for a job running in the parallel processing environment is predicted. Certain time frames of the job are accessed when running in the parallel processing environment. The number of compute nodes allocated to the job is increased if increased processing power is required. The number of compute nodes allocated to the job is decreased if less processing power is required. In another embodiment, a method provides for communication between compute nodes of a parallel processing system. The compute nodes are organized as a cascade. A problem-set is distributed to the compute nodes using type I input. The problem-set is processed on the compute nodes and processing results from the compute nodes are agglomerated using type I agglomeration. In another embodiment, a parallel processing environment includes a remote host, a plurality of home nodes, a gateway for interfacing the remote host to one or more of the plurality of home nodes, and a plurality of compute nodes. Each compute node has one communication channel and are configured as a plurality of cascades. Each cascade is connected to a communication channel of a home node of the manifold. Each home node has at least one communication channel and the home nodes are configured as a manifold. The remote host sends a problem-set to the manifold via the gateway, the problem-set comprising identification of one or more algorithms, the home nodes distributing the problem-set to the cascades, the compute nodes of the cascades process the problem-set and agglomerate data back to the home nodes, and the home nodes agglomerate the data back to a controlling home node of the manifold, and the controlling home node transfers the agglomerated result to the remote host via the gateway.

BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows a graph illustrating one exemplary exchange sub-impulse as a 3-dimensional volume. -- - ^•- - - - _. FIG. 2 shows a graph illustrating one exemplary data exchange impulse consisting of three sub-impulses. FIG. 3 shows a graph illustrating one exemplary data exchange impulse consisting of three sub-impulses. FIG. 4 shows one exemplary parallel processing environment architecture illustrating code flow for various decomposition strategies. FIG. 5 shows one exemplary depth-3 cascade with one home node and seven compute nodes. FIG. 6 shows a data exchange impulse for the cascade of FIG. 5. FIG. 7 shows one exemplary embodiment of a level four cascade performing a Type I agglomeration. FIG. 8 shows a group of seven compute nodes. FIG. 9 shows one exemplary data exchange impulse. FIG. 10 shows one exemplary cascade with one home node and thirty compute nodes. FIG. 11 shows one exemplary data exchange impulse. FIG. 12 shows one exemplary depth-3 cascade illustrating a single channel home node and thirteen compute nodes, each with two independent communication channels. FIG. 13 shows one exemplary data exchange impulse with three sub-impulses associated with the cascade of FIG. 12. π . 1 snow one exemplary cascade illustrating a home node with two independent communication channels and twenty-six compute nodes, each with two independent communication channels. FIG. 15 shows one exemplary data exchange impulse with three sub-impulses associated with the cascade of FIG. 14. FIG. 16 shows a data exchange for distribution of incoming data illustrating three exchange steps on a depth-3 (7 compute node) cascade with a single communication channel in the home node. FIG. 17 shows one exemplary data exchange impulse with three sub-impulses illustrating a trailing edge exchange time. FIG. 18 shows one exemplary cascade that has one home node and seven compute nodes. FIG. 19 shows one exemplary cascade with one home node and seven compute nodes divided into three cascade strips, illustrating that the number of independent communication channels within the home node increases to decrease the number of exchange steps used to load input data to the cascade. FIG. 20 shows a data exchange impulse resulting from the cascade of FIG. 19. FIG. 21 shows one exemplary parallel processing environment illustrating one depth-2 manifold with four home nodes and twenty-eight compute nodes. FIG. 22 shows one exemplary parallel processing environment with one depth-1 manifold of three home nodes and three cascades, each with fourteen compute nodes. FIG. 23 shows one exemplary data exchange impulse with four sub-impulses associated with the parallel processing environment of FIG. 22. FIG. 24 shows one exemplary data exchange impulse for the parallel processing environment^{" "} of FIG. 21. FIG. 25 shows one exemplary data exchange impulse with four sub-impulses for the parallel processing environment of FIG. 22. FIG. 26 shows one exemplary level- 1 depth-2 manifold. FIG. 27 shows one exemplary level-2 hyper-manifold with four home nodes representing level-2 of the hyper-manifold and twelve home nodes representing level- 1 of the hyper-manifold. FIG. 28 shows one exemplary two level hyper-manifold with level- 1 organized as a depth-1 cascade of 2-channel home nodes and level-2 organized as a depth-1 cascade of 4-channel home nodes. FIG. 29 shows one exemplary data exchange impulse for the hyper-manifold of FIG. 28. FIG. 30 shows one exemplary data exchange impulse for the hyper-manifold of FIG. 27. FIG. 31 shows a hyper-manifold that is created by generating a depth-2 cascade using a 2- channel home node and 2-channel compute nodes, then adding a second level of single channel compute nodes, also of depth-2. FIG. 32 shows a data exchange impulse for the hyper-manifold of FIG. 31. FIG. 33 shows one exemplary Fibonacci generation pattern. FIG. 34 shows a Fibonacci tree generated from the Fibonacci generation pattern of FIG. 33. FIG. 35 shows a Tribonacci generation pattern. FIG. 36 shows a Tribonacci tree generated from the Tribonacci generation pattern of FIG. 35. FIG. 37 shows a Bibonacci generation pattern. FIG. 38 shows a Bibonacci tree generated from the Bibonacci generation pattern of FIG. 37. FIG. 39 shows one exemplary cascade generator pattern. FIG. 40 shows a cascade tree with associated network communication paths for the cascade generator pattern of FIG. 39. FIG. 41 shows one exemplary cascade generated pattern resulting from a generator node with two communication channels. FIG. 42 shows one exemplary cascade tree showing generated data paths of the cascade generated pattern of FIG. 41. FIG. 43 shows one exemplary 2-channel cascade generation pattern. FIG. 44 shows a 2-channel cascade generated tree extracted from the cascade generation pattern of FIG. 43 and illustrating generated paths. FIG. 45 shows one exemplary manifold generated pattern. FIG. 46 shows a manifold tree extracted from the manifold generated pattern of FIG. 45 and illustrating generated paths. FIG. 47 shows one exemplary cascade with one home node and seven compute nodes. FIG. 48 shows one exemplary data exchange impulse of the cascade of FIG. 47. FIG. 49 shows one exemplary depth-3 cascade with one home node that has three independent communication channels and seven compute nodes. FIG. 50 shows one exemplary data exchange impulse of the cascade of FIG. 49. FIG. 51 shows one exemplary parallel processing environment illustrating four home nodes, twenty-eight compute nodes and a mass storage device. FIG. 52 shows one exemplary depth-3 cascade with three home nodes, shown as a home node bundle, and seven compute nodes. FIG. 53 shows a data exchange impulse with two sub-impulses, each with 3 exchange steps, and one sub-impulse with one exchange step illustrating a trailing edge exchange time. FIG. 54 shows one exemplary manifold, based upon depth-3 cascade, with one additional home node to form a channel with a compute node. FIG. 55 shows one exemplary data exchange impulse of the manifold of FIG. 54. FIG. 56 shows one exemplary type IHb manifold with twelve home nodes and twenty-eight compute nodes. FIG. 57 shows one exemplary data exchange impulse for the type TJIb manifold of FIG. 56. FIG. 58 shows one exemplary data exchange impulse for the type Illb Manifold of FIG. 56, illustrating addition of four home nodes to form additional home node channels. FIG. 59 shows one exemplary cascade for performing type πic I/O. FIG. 60 shows a data flow diagram for a first exchange step of a Mersenne Prime partial all- to-all exchange between seven data sets and a temporary data location of the cascade of the first example. FIG. 61 shows a data flow diagram for a second exchange step of the Mersenne Prime partial all-to-all exchange and follows the exchange step of FIG. 60. π . ΌΔ iiiusrxaies a nrsi excnange phase of an all-to-all exchange performed on a 15 compute node cascade. FIG. 63, FIG. 64, FIG. 65, FIG. 66, FIG. 67 and FIG. 68 show the remaining six exchange phases of the 15-node cascade all-to-all exchange. FIG. 69 and FIG. 70 show two exemplary data flow diagrams illustrating a first two steps of a data exchange utilizing two communication channels per node. FIG. 71 shows a binary tree broadcast such as used by LAM MPI. FIG. 72 shows one exemplary depth-2 manifold that has depth-2 cascade groups illustrating manifold level cross-communication. FIG. 73 shows a graph that plots communication time units as a function of number of nodes. FIG. 74 shows a data flow diagram illustrating a single channel true broadcast all-to-all exchange with four exchange steps between four nodes. FIG. 75 shows a data flow diagram illustrating nine compute nodes and next-neighbor cross- communication. FIG. 76 shows one exemplary computational domain illustratively described on a 27x31 grid. FIG. 77 shows the computational domain of FIG. 76 illustrating one exemplary stencil with sixteen 2^nd nearest-neighbor elements and ghost cells highlighted for a group allocated to one compute node. FIG. 78 shows one exemplary cascade with seven compute nodes illustrating a single pair- wise exchange between logically nearest neighbors of the group of FIG. 77. FIG. 79 shows a perspective view illustrating nearest neighbors of one cell in three- dimensions. ^{" ~~} FIG. 80 shows exemplary periodic boundary conditions for a two dimensional computational domain where the outer surface cells wrap around to communicate with opposite sides. FIG. 81 shows a checker-board like model illustrating two exemplary red-black exchanges. FIG. 82 shows a linear nearest-neighbor exchange model illustrating communication between one red node and two linear black nodes. FIG. 83 shows one exemplary data exchange impulse with four sub-impulses, each having three exchanges and a latency gap. FIG. 84 shows one exemplary data exchange impulse for a pair-wise exchange. FIG. 85 shows one exemplary binary tree with eight nodes. FIG. 86 shows one exemplary data exchange impulse. FIG. 87 shows a data exchange impulse for a partial exchange. FIG. 88 shows an exemplary data exchange impulse for a full exchange. FIG. 89 shows an exemplary data exchange impulse for a partial exchange. FIG. 90 shows one exemplary 2-D grid divided into nine subgrids for distribution between nine compute nodes. FIG. 91 shows the 2D grid of FIG. 90 with internal points highlighted. flu. y snows a scnematic αiagram illustrating standard data address path connectivity between (a) a microprocessor with registers and a Ll cache, (b) a L2 cache, (c) a RAM and (d) other circuitry through a bus interface. FIG. 93 shows compression/decompression hardware of FIG. 92 in further detail. FIG. 94 shows a schematic diagram illustrating a processor with registers, Ll cache, L2 cache, a compressor/decompressor, located between the registers and the Ll cache, and a bus interface. FIG. 95 shows a graph illustrating exemplary quantization, where m represents a quantization size. FIG. 96 shows two exemplary waveforms that represent either the value 1 or the value 0 depending upon interpretation. FIG. 97 shows five primary waveform transition types that may be applied to a waveform to increase information content. FIG. 98 shows three schematic diagrams illustrating three exemplary circuits for encoding data into signals. FIG. 99 shows four exemplary waveforms illustrating skew. FIG. 100 shows an alpha phase and a beta phase for overlapped computation and I/O illustrating communication in terms of the total communication time (t_c), the priming time (t¹), the overlapped time (t_c-t "), and the processing time (t_p). FIG. 101 shows a two level hierarchical cross-communication model with four cascade groups each having a home node and three compute nodes. FIG. 102 shows a time graph illustrating details of overlapped communication and calculation. -FIG. 103 shows a time graph illustrating details of overlapped communication and calculation with periods of cross communication. FIG. 104 shows a graph illustrating curves for 0.36 seconds of exposed time, 3.6 seconds of exposed time, 36 seconds of exposed time and 6 minutes of exposed time. FIG. 105 shows a graph illustrating the effect if the exposure time is reduced to that of typical network latencies. FIG. 106 shows a graph illustrating the number of comparison nodes used to match the performance of the specified number of reference nodes, given different values of Ω. FIG. 107 shows a graph illustrating the number of comparison nodes used to match the performance of the specified number of reference nodes, given different values of Ω. FIG. 108 shows a graph with three curves illustrating the number of comparison nodes used to match the performance of the specified number of reference nodes, given different three different values of Ω. FIG. 109 shows a graph illustrating AmdahTs law for a cascade with type IJIb input output. FIG. 110 shows a graph illustrating superlinear start properties for 5 interpretations of the limit value of QQ. FIG. Ill shows a graph with curves for 5 interpretations of the limit value of ΩO, illustrating that from the first node count to the second node count the system (that is from φ = 1, to φ = 2) has linear performance. FIG. 112 shows a graph 2340, with curves for 5 interpretations of the limit value of Ω?0,illustrating that when sublinear scaling occurs between the first node count to the second node count, the system (that is from φ = 1, to φ = 2) has sublinear performance. FIG. 113 shows a graph with curves for 5 interpretations of the limit value of Ω? 0, illustrating that starting with standard scaling between the first node count and the second node count the system (that is from φ = 1, to φ = 2) has standard Amdahl starting performance. FIG. 114 shows a graph with curves for 5 interpretations of the limit value of Ω?0, illustrating that, starting with negative scaling between the first node count to the second node count, the system (that is from φ = 1, to φ = 2) has negative scaling starting performance. FIG. 115 shows a block diagram illustrating functional components of an algorithm. FIG. 116 shows a parallel processing environment with four compute nodes where uncoupled functional components FI, F2 and F3 are assigned to different processing nodes of parallel processing environment. FIG. 117 shows a pipeline with four functional components FI, F2, F3 and F4 and four phases. FIG. 118 shows a more correct depiction of a pipeline with two phases and four functions, illustrating latency and data movement for each function. FIG. 119 shows one exemplary two phase pipeline where each functional component doubles the time used by the preceding functional component. FIG. 120 shows one exemplary pipeline with two phases illustrating one scenario where each functional component utilizes half the processing time used by the preceding .functional component. FIG. 121 shows one exemplary pipeline with two phases illustrating mixed duration functional components. FIG. 122 shows a block diagram of three exemplary home nodes and communication channels. FIG. 123 shows one exemplary hyper-manifold with five level 1 home nodes, each representing a group of five level 2 home nodes. FIG. 124 shows one exemplary hierarchy with a thread model one-to-one, thread model one- to-many, thread model many-to-one and thread model many-to-many. FIG. 125 shows one exemplary job with one thread on one node. FIG. 126 shows one exemplary job that utilizes two threads running on two nodes. FIG. 127 shows two jobs being processed by a single thread on a single node. FIG. 128 shows two jobs being processed by two threads on two nodes. FIG. 129 shows one job running on two nodes, each with an input thread, a processing thread and an output thread. FIG. 130 shows two jobs being processed on two nodes where each node has three processes allocated to each job. FIG. 131 shows a parallel processing environment illustrating transfer of checkpoint information to a master node from all other nodes. FIG. 132 shows a parallel processing environment with a three node cascade and allot spare' node, illustrating recovery when a node fails. FIG. 133 shows one exemplary parallel processing environment that has one cascade of seven nodes, illustrating recovery when one node fails. FIG. 134 shows one exemplary processing environment that has one home node and seven compute nodes illustrating cascade expansion. FIG. 135 shows a block diagram illustrating three data movement phases of an algorithm processing life cycle and times when movement of data is a factor. FIG. 136 shows a schematic diagram illustrating transaction data movement between a remote host and a home node (operating as a controller), and between the home node and three compute nodes. FIG. 137 shows a schematic diagram illustrating transaction data movement between a remote host, a home node (operating as a controller) and three compute nodes. FIG. 138 shows a hierarchy diagram illustrating hierarchy of models: Embarrassingly Parallel Algorithm (EPA), Data Parallel (DP), Parallel Random Access Model (PRAM), Shared Memory (SM), Block Data Memory (BDM) and Massively Parallel Technologies Block Data Memory (MPT-BDM). FIG. 139 shows a function communication life-cycle illustratively shown as three planes that represent the kind of processing accomplished by the function. FIG. 140 shows the I/O plane of FIG. 139 depicted as an input sub-plane and an output sub- plane. ^" FIG. 141 shows the translation plane of FIG. 139 separated into a system translation sub-plane and an algorithm translation sub-plane. FIG. 142 shows one exemplary output mesh illustrating portioning oi^" the output mesh (and hence computation) onto a plurality of compute nodes. FIG. 143 shows a first exemplary screen illustrating definition of an algorithm within the system. FIG. 144 shows a second exemplary screen illustrating input of an algorithm's input dataset and its location. FIG. 145 shows one exemplary screen for specifying algorithm input conversion. FIG. 146 shows a third exemplary screen for specification of algorithm cross-communication. FIG. 147 shows one exemplary screen for selecting an agglomeration type for the algorithm using one of buttons. FIG. 148 shows a fifth exemplary screen for specifying the algorithm's output dataset and its location to the system. FIG. 149 shows one exemplary screen for specifying the programming model. FIG. 150 is a functional diagram illustrating six functions grouped according to scalability. FIG. 150 shows a single job with six functions, each with different scalability. FIG. 151 shows a node map for a parallel compute system logically separated into zones. FIG. 152 shows a programming model screen illustrating a job with algorithm/function displays linked by arrow-headed lines to indicate processing order. FIG. 153 shows a process flow display indicating that a decision point has been reached, as indicated by indicator. FIG. 154 shows a process flow display illustrating process flow display after selection of an arrow to indicate that processing continues with a specific function, which is then highlighted. FIG. 155 shows a process flow display illustrating the process flow display of FIG. 154 after an automated decision has been made by a function. FIG. 156 shows a programming model screen illustrating one programming model where a function encounters anomalous data that cannot be categorized, and therefore selects an unknown function to handle the anomalous data. FIG. 157 shows one example of problem set distribution illustrating multiple data transfers. FIG. 158 shows one exemplary distribution for a dissimilar problem set illustrating multiple data transfers. FIG. 159 shows a graphical representation of single processor timing and multi-processor timing. FIG. 160 shows a simple linear resource priority called an EmLog. FIG. 161 shows one exemplary Emlog star with five linear resource priorities. FIG. 162 shows a data exchange impulse illustrating sub-impulses that represent priorities of the Emlog start of FIG. 161. FIG. 163 shows a program allocation impulse function with five concurrent exchange steps, one for each program. FIG. 164 shows an exemplary grid with an output imbalance region. FIG. 165 show one exemplary output grid that represents an image of a tree (not identified) illustrating identification of a white outlined area that has the characteristics of a stick. FIG. 166 shows one exemplary emlog star transition map illustrating a Linear translation from one emlog star to the next, using the output data generated by each current emlog star to transition to the next emlog star. FIG. 167 shows a Culog star transition map illustrating how the analysis of one culog star allows for the transition to the next portion of the analysis. FIG. 168 shows an Emlog star budding process illustrating how, with the assistance of the culog stars, a new emlog star can be generated. FIGs. 169 is a flowchart illustrating one exemplary process for communicating between nodes of a parallel processing environment. FIG. 170 is a flowchart illustrating one exemplary sub-process for initializing home and compute nodes within a parallel processing environment. FIG. 171 is a flowchart illustrating one sub-process for checking the validity of the problem- set and data description. FIG. 172 is a flowchart illustrating one sub-process for determining cascade size based upon the input/output type of the problem-set and data description. FIG. 173 is a flowchart illustrating one sub-process for distributing the problem-set and data description to top level compute nodes from the home node. FIG. 174 is a flowchart illustrating one exemplary sub-process for distributing problem-set and data description to lower compute nodes. FIG. 175 is a flowchart illustrating one exemplary sub-process for processing the problem-set and data description on compute nodes and exchanging data if necessary. FIG. 176 is a flowchart illustrating one exemplary sub-process for performing an all-to-all exchange. FIG. 177 is a flowchart illustrating one exemplary sub-process for performing a next neighbor exchange. FIG. 178 is a flowchart illustrating one sub-process for agglomerating results. FIG. 179 is a flowchart illustrating one process for increasing information content within an electrical signal. FIGs.180, 181 and 182 show a flowchart illustrating one exemplary process for increasing effective memory capacity within a node. FIG. 183 is a flowchart illustrating one exemplary process for improving parallel processing performance by optimizing communication time. FIGs. 184 and 185 show a flowchart illustrating one process for comparing two parallel exchange models using exchange entropy metrics. FIG. 186 is a flowchart illustrating one exemplary process for determining information carrying capacity based upon Shannon's equation. FIG. 187 illustrates one exemplary Howard Cascade Architecture System (HCAS) that provides an algorithm-centric parallel processing environment.

DETAILED DESCRIPTION OF THE FIGURES FIG. 187 illustrates one exemplary Howard Cascade Architecture System (HCAS) 5702 that provides an algorithm-centric parallel processing environment 5700. HCAS 5702 has a gateway 5704, a home node 5706 and, illustratively, three compute nodes 5710, 5712 and 5714. Three nodes 5710, 5712, 5714 are shown for purposes of illustration, though more nodes may be included within HCAS 5702. Compute nodes 5710, 5712 and 5714 (and any other nodes of HCAS 5702) are formulated into one or more Howard cascades, described in more detail below. Gateway 5704 communicates with a remote host 5722 and with home node 5706; home node 5706 facilitates communication to and among processing nodes 5710, 5712 and 5714. Each processing node 5710, 5712 and 5714 has an algorithm library 5718 that contains computationally intensive algorithms; algorithm library 5718 does not necessarily contain graphic user interfaces, application software, and/or computationally non-intensive functions. Remote host 5722 is shown with a remote application 5724 that has been constructed using computationally intensive algorithm library API 5726. Computationally intensive algorithm library API 5726 defines an interface for computationally intense functions in the algorithm library 5718 of processing nodes 5710, 5712 and 5714. In operation, remote host 5722 sends an algorithm processing request 5728, generated by computationally intensive algorithm library API 5726, to gateway 5704. Gateway 5704 communicates request 5728 to controller 5708 of home node 5706, via data path 5732. Since the computationally intensive algorithms of libraries 5718 are unchanged, and remain identical across processing nodes, 'barallelizatiori' within HCAS 5702 occurs as a function of how an algorithm traverses a dataset. Each of the algorithms, when placed on processing nodes 5710, 5712 and 5714, is integrated with a data template 5720. Controller 5708 adds additional information to algorithm processing request 5728 and distributes the request and the additional information to processing nodes 5710, 5712 and 5714via data paths 5734, 5736 and 5738, respectively; the additional information details (a) the number of processing nodes (e.g., N=3 in this example) and (b) data distribution information. Each processing node 5710, 5712 and 5714has identical control software 5716 that routes algorithm processing request 5728 to data template software 5720. Data template software 5720 computes data indexes and input parameters to communicate with a particular algorithm identified by algorithm processing request 5728 in algorithm library 5718. Data template software 5720 determines whether or not a particular computationally intensive algorithm requires data. If the algorithm requires data, data template 5720 requests such data from home node 5706. The algorithm in library 5718 is then invoked with the appropriate parameters, including where to find the data, how much data there is, and where to place results. Host 5722 need not have information concerning HCAS 5702 since only the data set is being manipulated. Specifically, remote host 5722 does not directly send information, data, or programs to any processing node 5710, 5712 and 5714. HCAS 5702 appears as a single machine to remote host 5722, via gateway 5704. Once HCAS 5702 completes its processing, results from each node 5710, 5712 and 5714 are agglomerated (described in more detail below) and communicated to remote host 5722 as results 5730. An HCAS may maximize the number of nodes that communicate in a given number of time units. The HCAS may thus avoid inefficiencies (e.g., collisions in shared memory environments, the bottle-neck of a central data source, and the requirement of N messages for an N node cluster) in the prior art by, for example, broadcasting a full data set to all processing nodes at once. Even though the same amount of data is transferred over the communication channel, the broadcasting reduces overhead of using N separate messages. A parallel processing environment (e.g., parallel processing environment 5700, also referred to as a "system" herein) may, for example, include two or more compute nodes (e.g., nodes within the parallel processing environment 5700 that are used for processing purposes), where each compute node performs one parallel activity of an algorithm. The algorithm may be specified, or included, within a problem-set that defines a problem to be solved by the parallel processing environment. A dataset includes data associated with, or included within, the problem-set and is processed by one or more compute nodes of the parallel processing environment to provide a result for the problem-set. FIGs. 1 through 186 illustrate embodiments and examples of communication between nodes (e.g., home nodes and compute nodes) and a remote host (e.g., remote host 5722), such as now described below. A generally-accepted mathematical relationship for parallel processing speed-up is'ΑmdahTs Law', named after Dr. Gene Amdahl, and shown in Equation 1 below. In particular, AmdahTs Law relates serial and parallel activity of an algorithm to a number of compute nodes working in parallel, to provide a speed-up factor compared to the same algorithm performed on a single processor. This relationship shows that even for an algorithm with 90% parallel activity and 10% serial activity, at the algorithmic level, a speed-up factor of only 10 is generated for an infinite number of compute nodes. In a first interpretation of this relationship, only algorithms that are primarily decoupled from any serial activity in their parallel instantiation may achieve a high speed-up factor. In a second interpretation of this relationship, for highly coupled algorithms, extremely fast communication between compute nodes may also result in a high speed-up factor. Although both these interpretations may seem reasonable, a third interpretation of this relationship is that low discrete entropy communication may also result in a high speed-up factor. To understand this third interpretation, Amdahl's Law must be understood in the context of communication theory. Since, for example, the second interpretation of Amdahl^'s Law shows that, for highly cross-coupled algorithms, high communication channel speed is required, it is also necessary to ensure that as much as possible of each communication channel is used. By increasing the likelihood of channel use, the discrete communication entropy of the system effectively decreases. In certain of the following examples, it is assumed that communication channels are non- blocking, point-to-point communication routes between compute nodes. Multiple commumcation channels on the same compute node normally have unique end points; but, if they share the same end point, they are not considered as bound channels. This allows for communication without explicit synchronization and prevents potential problems with skew that may occur on bound channels. Amdahl's Law can be expressed as: Equation 1. Amdahl's Law

where: S ≡ System speed-up p ≡ fraction of parallel activity at the algorithm level q = 1-p ≡ fraction of serial activity at the algorithm level P ≡ # of compute nodes

Since discrete entropy in a communication system may be expressed as a measure of uncertainty that the system is in a particular state, given a fixed number of discrete channels, the above fractions may represent the probability that a particular compute node is in one state or another.

Further, since the communication channels discussed by users of Amdahl's Law in the production of high-speed-up systems are in fact channels which meet Shannon's prerequisites for using his equations, the discrete entropy of the system may be estimated as follows. Let: P - probability of being in a parallel state q = l-p ≡ probability of being in a serial state The discrete entropy H_s for a system of P compute nodes, is then given by: Equation 2. Discrete Entropy

^Hs = -∑ (Pi ^log Pi + ii log q_t ) i=l

Equation 2 indicates that H_s may be driven to 0 (i.e., to remove all entropy) if all possible parallelism is exploited or if all activity is perfectly serialized. In a parallel processing environment, communication issues as well as node level algorithm issues should be considered. Processor level uncertainties involve inability to access resources when needed; latency effects may occur at all levels of memory access and communication. Speed-up is achieved by reducing latency effects. In one example, data and instruction blocks may be fitted within a cache. In another example, data block size may be maximized for communication. Speed-up may also be achieved by masking latency effects. For example, when a processor reads data from main memory, an access penalty is incurred. However, if ten processors read data from the main memory in parallel, ten times the amount of data is accessed for the same access penalty (i.e., the parallel memory access latency is one tenth that of the single access latency) thereby masking the effect of latency. If communication channels are available when needed, the uncertainty that a channel will be available when needed is eliminated. Thus, within a parallel processing environment, entropy may be minimized by eliminating uncertainty of communication channel availability and masking the effects of latency. To compare the effects of different data exchange methods, assume that the fastest, lowest latency communication channels are used such that the data exchange method comparison is based upon the effects of the exchanges themselves.

These data exchanges, whether input output exchanges or cross-communication exchanges, may be broken into discrete exchange steps involving one or more point-to-point exchanges. Multiple point- to-point exchanges may, for example, indicate that parallel communication is occurring. In certain of the following examples, a data exchange has a sequence of steps called sub- impulses that move data. Each sub-impulse is illustratively shown as a 3-dimensional volume defined by three orthogonal axes: time, number of exchanges, and channel bandwidth. Data exchange methods may thus be compared in an absolute fashion since each data exchange is defined as an impulse function, where the best impulse function has a minimum pulse width. FIG. 1 is a graph showing one exemplary data exchange impulse 100 illustrating an exchange sub-impulse 106 as a 3-dimensional volume that starts at time 108 and ends at time 110. Sub-impulse 106 includes a data transfer 102, shown as an invariant solid volume (i.e. bandwidth * time), and a protocol latency 104 (illustratively shown as an empty volume) representing lost data movement opportunity due to latency. Data exchange impulse 100 may be reduced to 2-dimensions if bandwidth is assumed constant, for example. If the number of exchanges increases (e.g., by spreading the data load across multiple channels) or if the bandwidth of a channel increases, the same amount of data may be exchanged in less time. Since protocol latency 104 represents lost data transmission time, it takes up space on the time line but does not contribute to data movement. Latency and bandwidth are characteristics (e.g., hardware specific characteristics) of an underlying compute systems of the parallel processing environment; the number of exchanges and the volume of data moved are characteristics (e.g., software characteristics) of an exchange algorithm. Although it appears that these two sets of characteristics are independent and have little effect on each other, FIG. 2 illustrates effects that occurs when several sub-impulses are combined to form a data exchange impulse. FIG. 2 is a graph showing one exemplary data exchange impulse 120 with three sub-impulses 126(1), 126(2) and 126(3). Data exchange impulse 120 starts at time 132 and ends at time 134, thereby having a duration 130. Each sub-impulse 126 includes one data transfer 122 and one protocol latency 124; data exchange impulse 120 thereby includes three protocol latencies 124(1), 124(2) and 124(3) in this example. If the exchange algorithm is modified to reduce the number of sub-impulses used to complete the data exchange impulse, the number of protocol latency periods is also reduced by the same amount. Thus, software contributes to latency reduction, as does hardware utilization techniques that may increase the number of exchanges possible in each sub-impulse, for example. Bandwidth and protocol latency thereby form engineering and economic considerations for a given parallel architecture, but say nothing about the best way to accomplish a data exchange. Bandwidth is not considered in future descriptions of data exchange impulses, thereby reducing them to 2-dimensional figures, since increasing bandwidth improves the overall system, but does impact the choice or design of the communication method-the best exchange algorithm performs better with faster hardware support.

Impulse Latency Discussion

Two types of latency are associated with a data exchange impulse. The first, protocol latency 124 (discussed above), occurs with each sub-impulse and may be masked. Protocol latency may also be reduced by decreasing the number of sub-impulses. A second type of latency occurs within the sub-impulse exchange itself as a function of parallel processing environment topology. Communication between two nodes is called a communication 'hop'. Thus, at least one hop is used for any communication. Additional hops occur when there is no direct connection between two nodes. This means that a connection is made either a) through multiple nodes or b) across multiple networks / sub-networks. Each hop increases the sub- impulse width without increasing the number exchanges or the amount of data moved. This increase is knows as liop latency. The more communication hops used in moving the data within a sub-pulse, the greater the "hop latency. The presence of different hop latencies in a system has the effect of delaying the completion of an impulse. This is represented by a tail region of each sub-impulse as the number of active exchanges dies out. The hop latency is primarily a network topological feature. FIG. 3 is a graph 140 illustrating one exemplary data exchange impulse 144 with three sub- impulses 142(1), 142(2) and 142(3). Sub-impulse 142(1) is shown with one data transfer 122(1), one protocol latency 124(1) and four hop latencies 146(1). Sub-impulses 142(2), 142(3) are similarly shown with data transfers 122(2), 122(3), protocol latencies 124(2), 124(3) and hop latencies 146(2) and 146(3), respectively. The presence of hop latencies 146 in a parallel processing environment delays completion of data exchange impulse 144. Examining data exchanges in this way leads to several insights. These insights can be quantified to improve the use of the impulse approach.

Howard's 1^st Observation If the dataset size, the number of nodes, the channel speed, and the number of channels used in a data exchange impulse are fixed, then the volume defined by the integral of the number of exchanges over the duration of the exchange multiplied by the channel bandwidth is also fixed. The following impulse discussion utilizes the following mathematical notation. The following quantities are used: φ ≡ The exchange step of a data exchange impulse. D_a ≡ The amount of data the algorithm requires to be moved at step φ. D_φ ≡ The amount of data moved by the implementation at step φ. T_φ ^c ≡ The sub-pulse width at step φ. T_ψ ^λ ≡ The latency time at step φ. This latency includes any latency in the protocol stacks as well as delays incurred in the communication subsystems. T_φ ≡ The total time to complete step φ. T ≡ The total impulse width. b_φ ≡ The bandwidth of the communication channels at step φ. b_φ ≡ The effect bandwidth of the communication channels at step φ. υ ≡ Some hardware characteristic of the system on which the number of exchanges is dependent. Later on in this paper, it represents the number of channels available per node. e_φ(υ) ≡ The number of point-to-point exchanges that occur at step φ. λ_p(φ) ≡ The protocol latency time at step φ. λ φ) ≡ The hop latency time a step φ. The sub-impulse width, the impulse width, impulse exchanges and impulse data may be computed as: Equation 3. Sub-Pulse Latency Width

Equation 4. Sub-Pulse Width

Equation 5. Impulse Width

Equation 6. Impulse Exchanges E = ∑e_φ(v) Equation 7. Impulse Data

Neither increasing nor decreasing the number of exchanges, or the bandwidth, has effect on the amount of data moved during an impulse. Thus, changing the number of exchanges can be treated as providing an effective change in bandwidth. Effective bandwidth can be define as: Equation 8. Impulse Effective Bandwidth

Clearly, the software design and hardware utilization considerations can have at least as big an impact on the impulse width as does changing the bandwidth. Arguably, pursuing raw bandwidth may not be as productive as proper software design and utilization. Use of multiple channels increases the number of exchanges that occur in a single sub- impulse. Since the number of channels is a hardware feature and may be fixed in a system, the ratio of (a) the number of channels that are unused during an exchange to (b) the total number available gives a measure of the channel use efficiency of the exchange method. If all of the exchange resources are fixed, the data exchange impulse width is equivalent to the minimum time used by the algorithm to move the data. This leads to a second observation:

Howard's 2^nd Observation The invariant volume of the data exchange impulse is meaningful only for the least number of exchanges used, as defined by the algorithm. Increasing the number of exchanges above the minimum has the effect of expanding the volume and necessarily increasing the data exchange impulse width. The implication of this observation is that unnecessary (i.e., non-algorithm required) exchanges should be removed. More directly, moving more data than is required by the algorithm should also be avoided unless some other factor has a bigger effect.

Howard's 3^rd Observation Total time folding effects are proportionate to the sum of the leading/trailing edge exchange times to the exchange pulse width. This observation means that if the individual elements of the exchange are not synchronized in time, some or all of the time gains may be lost. The sum of the leading and trailing edge exchange times is: Equation 9. Data Exchange Leading/Trailing Edge Exchange Time ζ = Leading edge exchange time + Trailing edge exchange time where: ζ ≡ Jitter skew leading/trailing edge exchange times. There are two sources of timing effects which contribute to the leading/trailing edge times: algorithmic and jitter. Algorithmic contributions occur as a natural feature of the exchange method. For example, a certain amount of channel blocking may be due to the exchange method itself. Blocking has the effect of reducing the data exchange rates, thus extending the impulse width. Jitter is due to unexpected or improper synchronization processes or load balancing. Jitter can be mitigated through better system control. Both leading and trailing edge exchange times are computed differently for different exchange types. This is because any algorithmic data leading/trailing edge exchange time is a function of the exchange itself. An analytical measure that serves to compare both the raw performance and the efficiency of various data exchange methods is now discussed. The following two ratios compare the channel availability and masking during a particular data exchange operation; these are called the Lupo Entropy Metrics: Equation 10. Lupo Entropy Metrics _fγ — used Ω — unused γ₌ TL #, N_s D_a where: Ns ≡ Number of node-to-node exchange steps required to complete a data transfer operation. Each step may involve one or more simultaneous node-to-node data movements. Nused = Sum of the number of channels used during each step of the data transfer operation. Nunused = Sum of the number of channels which went unused during each step of the data transfer operation, α s Used channel entropy metric, β ≡ Unused channel entropy metric. γ = Redundant data measure. α is expected to be large on high availability systems, since this ratio reflects the fact that more data is being moved during each exchange step. It is also a measure of the amount of latency hiding in the system, since α increases only if multiple simultaneous channels are in use. On the other hand, β is expected to be small, since unused channels represent a lost opportunity cost during data movement. A low entropy exchange method ideally has a β of 0, indicating no channels go unused at any time, γ is expected to be 1 since the amount data exchanged should equal the amount of data the algorithm exchanges. N_s is a direct measure of latency hiding; it gives exactly the number of communication latencies exposed during the course of the data exchange operation. When comparing identical exchange operations on different systems, the five numbers - N_s, a, β, γ, D_φ, D_a - indicate which one is optimally configured. For two optimally configured systems, the only difference lies in their raw hardware performance. Since the algorithmic skew is a natural response to a particular exchange, each exchange with an algorithmic skew has a different equation. The mean exchange value is calculated (i.e., the average number of exchanges per partial exchange sub-impulse) which exposes the leading and trailing edges. The jitter skew as a function of synchronization, etc., may be calculated for each machine topology as well as for the communication type. The variable for this type of skew is designated as ζ. Jitter skew is not computed in this document because it is too topology dependent. Howard's 4^th Observation If the dataset size, number of channels, and channel speed is fixed then, the total volume of the data exchange impulse width remains fixed as the number of n-space dimensions describing the data exchange impulse width increases. This observation means that if two exchange methods are compared and the communication resources and the dataset sizes used in the exchanges are equal, then the exchange method with the largest number of n-space dimensions takes the least amount of time. This volumetric interpretation can be continued through hyper- volumes et cetera. Equation 10 is applied in discussions of cross-communication where various cross- communication models are compared. The general theme of maximally using resources reappears throughout the rest of this specification as the optimal way of achieving parallel speed-up. In order to create an impulse diagram for a particular exchange all of the equations are first summarized into a single table. This table, called an Impulse Analysis Form (IAF), provides information used to draw the exchange impulse diagram.

Table 1. Impulse Analysis Form Valid and Invalid N-space Transformations

Instead of comparing either different parallel processing systems or different parallel data exchange methods, the effects of different data movement dimensions are determined. The following list introduces various data movement dimensions: 1) Channel Speed The data movement frequency of a point-to-point communication channel. 2) Number of Channels The number of point-to-point communication channels per node. 3) Virtual Channels The number of parallel data movements per sub-impulse width, without a physical channel to support such movements. 4) Over laying Processing and Data Exchanges The number of parallel data movements that occur while processing data, without decreasing the processor bandwidth. 5) Compressing Data Prior to Exchanging Decreasing the total amount of information exchanged. 6) Exchanging Cross-communication for input/output (IO Exchanges Decreasing the total number of exchanges by moving overlapped data during I/O rather than performing a cross-communication exchange. An I/O exchange is a data exchange that moves the data physically off of the compute nodes and only onto the I/O controller node (e.g., Home Nodes). 7) Pipe-line Exchanges The handling of multiple independent I/O.

Channel Speed Analysis Changing the channel speed does not affect the total number of exchanges required and thus maintains the invariance required. Changing the physical channel speed is therefore a valid dimensional parameter; that is, channel speed represents an n-space dimension. Other n-space dimensions may have stronger effects than the physical channel speed dimension, which gives only linear time folding effects.

Number of Channels Changing the number of physical channels does not affect the total number of exchanges required. As for Channel Speed Analysis, above, this means that changing the number of physical channels is a valid dimensional parameter. Changing the number of physical channels, if performed correctly, may give non-linear time folding results.

Virtual Channels Virtual channels occur when physical channels are reused during an exchange, such that the virtual channel affect is indistinguishable from a physical channel effect. Like the physical channels described above, virtual channels do not affect the total number of exchanges required, which means that they are a valid dimensional parameter. Virtual channels have the added benefit of providing an effect with no hardware and no synchronization issues. Creation of virtual channels for system I/O is described below. Overlaying Processing and Data Exchanges It may be possible (e.g., using multiple threads) to commingle algorithm processing with data exchanges. Performing this commingling does not change the number of exchanges required, which again means that this is a valid dimensional parameter. Examples of these techniques are discussed below and from these techniques parameters required for scaling may be determined.

Compressing Data Prior to Exchanging Compressing data has the affect of decreasing the number of exchanges required. Thus it is not a valid dimensional parameter. In certain models presented herein, data compression is treated as jirocessing' and data to be exchanged is the communicated data. This means that, even though it may be valid to compress data to increase performance, that compression is utilized. Throughout this document though, the effects of data compression on cross-communication are described. Technically, compressing data changes the number of exchanges required and may not normally be utilized, however, in a Shannon sense compression is getting to the mimmum algorithmically required dataset size and is therefore included.

Exchanging Cross-communication for I O Exchanges Many types of cross-communication require more exchanges than required by an I/O data exchange. It is therefore sometimes advantageous to oversubscribe the data when performing the initial data transmission, rather than performing a cross-communication exchange. However valid this technique is, it does change the total number of exchanges and so is again not utilized.

Pipe-line Exchanges ^' Pipe-line exchanges take advantage of the fact that a dataset may be separable into pieces and these pieces may use independent channels. This type of exchange does not change the total number of exchanges required and is therefore a valid dimensional transformation.

Type I I/O and Howard Cascades

A Howard cascade (hereinafter 'cascade?) fully utilizes available communication channels for Type I I/O, which includes problem-set distribution and cross-sum-like agglomeration. A cascade is defined as a home node and all of the compute nodes which communicate directly with it. See FIG. 5, for example. The cascade utilizes one communication channel per compute node and one channel on a single home node. These communication channels may be implemented by switching technologies without limiting the physical interconnections between machines. Moving programs and/or data into or out of a node represents one type communication exchange. Type I problem-set decomposition (Input) is characterized by a fixed size information movement from the top of the tree structure to the bottom of the tree structure. There are two problem- set decomposition methods outlined: code movement problem-set decomposition (CMPD) and code tag movement problem-set decomposition (CTMPD).

Input Code Movement Problem-set Decomposition CMPD is the standard form of problem-set decomposition. CMPD assumes that code written for a parallel processing environment requires a run-time download of that code. FIG. 4 shows one exemplary parallel processing environment 160 illustrating code flow for various decomposition strategies. Parallel processing environment 160 includes a remote host 162 and a parallel processor 164. Remote host 162 is illustratively shown with a compiler 166 and a distribution model 168. Parallel processor 164 is illustratively shown with gateway nodes 170, controller nodes 172 and compute nodes 174. Arrowed lines represent data paths for decomposing a problem by generating parallel processing code (not shown) utilizing compiler 166 and distribute the parallel processing code onto parallel processor 164. These data paths are useful for distributing non-production code (i.e., code which is run only a few times and then discarded). The time overhead associated with this code movement is represented in Equation 15 below. The parallel processing code represents data being moved from remote host 162 to parallel processor 164. In particular, the parallel processing code is moved onto compute nodes 174 for execution. Therefore, as the size of the parallel processing code increases, so does system overhead. Further, since the parallel processing code may be considerably different for each node of compute nodes 174, it is usually not possible to use Type I input to distribute the parallel processing code. This is not the case for a transactional decomposition model. In a transactional decomposition model, parallel processing code remains the same size from node-to-node and, thus, may take advantage of Type I input.

Input Code Tag Movement Problem-set Decomposition Unlike CMPD, which moves the entire program from the remote host to the compute nodes, CTMPD is more amenable to production codes as it only moves a tag that specifies which function/algorithm is to be invoked. More complex algorithms can be constructed from aggregates of these tagged functions, described in further detail below. The use of tags decreases the amount of information that is transferred from the remote host to the compute nodes. However, other considerations are: 1) The ability to automatically profile complex algorithms as a function of the composition of less complex, profiled functions/algorithms. 2) The ability to predict, or at least bound, the scaling and execution timing of most complex algorithms. 3) The ability to simplify the parallel programming process. 4) The ability to automatically change the number of processors used on different sections of a complex problem. 5) The ability to interactively steer processing. A complex algorithm may contain a large number of tags (each with an associated parameter lists), and the time required to upload this information can itself cause significant overhead. However, as long as an MPT Block Data Memory model is used (described below), Type I input may also be used to move the tags onto a parallel processor (e.g., parallel processor 164, FIG. 4), since the MPT Block Data Memory model, when used in problem-set decomposition, ensures that the tag size remains the same as the tags move through the compute nodes (e.g., compute nodes 174). Use of Type I input hides most of the tag transport time, and is directly proportional to the number of nodes engaged in the cascade and the number of exchange steps required to move data to those nodes. The number of compute nodes in a cascade is given by Equation 11 (see also Equation 15): Equation 11. Number of Compute Nodes in a Cascade p₉ = ₊ιγ -ι] Where: v = # of channels per compute node φ = # of expansion steps, or depth of the cascade Pφ = Number of compute nodes in resulting cascade. FIG. 5 shows one exemplary depth-3 cascade 180 with one home node 182 and seven compute nodes 184(1-7). Arrows 186(1-7) indicate data paths through which data movement occurs during an associated exchange step. For example, in exchange step 1, data moves from home node 182 to compute node 184(1) via data path 186(1). In exchange step 2, data moves from home node 182 to compute node 184(2) via data path 186(2) and data moves from compute node 184(1) to compute node 184(5) via data path 185(5). In exchange step 3, data moves from home node 182 to compute node 184(3) via data path 186(3), data moves from compute node 184(1) to compute node 184(4) via data path 186(4), data moves from compute node 184(2) to compute node 184(6) via data path 186(6) and data moves from compute node 184(5) to compute node 184(7) via data path 186(7). The CTMPD number of exchanges per exchange step is given by: Equation 12. CTMPD Sub-pulse # of Exchanges

Table 2 shows each exchange step for indicated data movement within cascade 180 based upon movement of 3 MB of data and each point-to-point exchange moving data at 100 Mb/s.

Table 2. 7-Node CTMPD Impulse Analysis Form

FIG. 6 shows a data exchange impulse 200 for cascade 180, FIG. 5, as described in Table 2. In particular, data exchange impulse 200 has three sub-impulses 202, 204 and 206 representing the three exchange steps φ of Table 2, respectively. Sub-impulse 202 is shown with a latency period 208 and a data transfer 214; sub-impulse 204 is shown with a latency period 210 and a data transfer 216; and sub- impulse 206 is shown with a latency period 212 and a data transfer 218. Type I Agglomeration A cascade clears data off compute nodes (e.g., compute nodes 184, FIG. 5) to a home node (e.g., home node 182) in a determined number of exchange steps. This number of exchange steps defines a 'depth' of the cascade. FIG. 7 shows a depth-four cascade 220 performing a Type I agglomeration. Cascade 220 has one home node 222 and fifteen compute nodes 224(1-15). Type I agglomeration is a cross-summed result with uniform-sized data set movements between all nodes. Note that at any given step, all data paths of a node are either participating in the data movement or the node has completed its communication and is free for other work. In a first exchange step, compute nodes 224(4), 224(5), 224(8), 224(10), 224(11), 224(12), 224(14) and 224(15) transfer data to nodes 222, 224(1), 224(2), 224(3), 224(6), 224(7), 224(9) and 224(13), respectively. In a second exchange step, compute nodes 224(3), 224(6), 224(9) and 224(13) transfer data to nodes 222, 224(1), 224(2) and 224(7), respectively. In a third exchange step, compute nodes 224(2), and 224(7) transfer data to nodes 222 and 224(1), respectively. In a fourth exchange step, compute node 224(1) transfers data to home node 222. Equation 13. Howard Cascade Sub-pulse # of Exchanges

Where: v = # of channels per compute node ψ_max - Maximum exchange depth φ = # of exchange steps

If cascade 220 moves 4 MB of data and each communication channel moves data at 100 Mb/s, then its analysis impulse form is shown in Table 3, and FIG. 9 shows its associated data exchange impulse 240 with a width 242.

Table 3. 15-Node Single Channel Howard Cascade Impulse Analysis Form FIG. 8 shows a group 230 of seven compute nodes 232(1-7). Group 230 does not include a home node and, therefore, all algorithm specific information resides on compute nodes 232. Since most communication interlaces do not saturate the data handling capability of a machine, a second independent channel may be added to a home node. Adding communication channels at the home node level has the effect of increasing the amplitude (i.e., maximum number of concurrent data exchanges) of the data exchange impulse, while holding the pulse width constant. Equation 13 becomes: Equation 14. Howard Cascade Sub-pulse # of Exchanges, Multiple Home Node Channels

where ψ is the number of communication channels at the home node level. FIG. 10 shows one exemplary cascade 260 with one home node 262 and thirty compute nodes 264. Home node 262 has two independent communication channels 266(1) and 266(2) that allow cascade 260 to clear (i.e., agglomerate) its data in 4 exchange steps. Thus, twice the number of compute nodes is cleared in 4 exchange steps as compared with the example of FIG. 7 (in which fifteen compute nodes cleared in 4 exchange steps). If the system moves 4 MB of data and each communication channel moves data at 100 Mb/s, then its analysis impulse form is shown in Table 4, and FIG. 11 shows its associated data exchange impulse 280. In particular, data exchange impulse 280 has four sub-impulses 282, 284, 286 and 288 corresponding to exchange steps 1-4 of Table 4.

Table 4. 30-Node (ψ=2, ψ=4) Howard Cascade Outflow Impulse Analysis Form

FIG. 12 shows one exemplary depth-3 cascade 300 illustrating a single channel home node 302 and thirteen compute nodes 304, each with two independent communication channels 306(1) and 306(2). Cascade 300 clears (i.e., agglomerates) all thirteen compute nodes 304 in 3 exchange steps. If cascade 300 moves 3 MB of data and each communication channel moves data at 100 Mb/s, then its analysis impulse form is shown in Table 5, and FIG. 13 shows its associated data exchange impulse 320 with three sub-impulses 322, 324 and 326 that represent exchange steps 1-3 of Table 5.

Table 5. 13-Node Howard Cascade; 2-Channel Compute Node Case Impulse Analysis Form

FIG. 14 show one exemplary cascade 340 illustrating a home node 342 with two independent communication channels 346(1) and 346(2) and twenty-six compute nodes 344, each with two independent communication channels 348(1) and 348(2). Cascade 340 clears all twenty-six compute nodes 344 in 3 exchange steps. If cascade 340 moves 3 MB of data and each commumcation channel moves data at 100 Mb/s, then its analysis impulse form is shown in Table 6, and FIG. 15 shows its associated data exchange impulse 360 with three sub-impulses 362, 364 and 366.

Table 6. 26-Node Howard Cascade; AU 2-Channel Case Impulse Analysis Form

The size of a cascade is thus affected independently by the numbers of channels on the home and the numbers of channels on each compute node. An equation that describes the number of compute nodes in a cascade may be written as: Equation 15. Howard Cascade Formula

where P. ≡ Number of compute nodes connected to a channel at a given cascade depth. Cascade depth. υ Number of communication channels on a compute node Ψ Number of communication channels on a home node In Equation 15, the term (tx-1) represents the number of n-space dimensions used in the communication pattern. The (+1) term results from communication channels that are reused during communication exchanges and are thus called Υirtual channels'. The use of virtual channels increases the number of available channels as a function of cascade depth. Communication channel reuse adds anotner n-space αimension to me excnange and thereby decreases the time required to perform the data exchange. The entropy of the cascade may be computed as follows. If ζ = 0 then N_s ≡ φ P i-l N * ≡ ψ∑(v + l) i=l ^unused " Note that N _ed is equivalent to P_φ, and after substituting these values into Equation 10, cascade I/O entropy becomes: Equation 16. Cascade I/O Entropy P « = -*-, β = 0, γ = l φ The χterm can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since γ parameter of the Cascade I/O Entropy = 1, data compression effects on cascade I/O entropy is: Equation 17. Data Compression Effects on Cascade I/O Entropy γ- Compression Ratio -The cascade has a natural leading and trailing edge data exchange time. Consider an example of a depth-37-node cascade Q7IG. 5). Table 7 shows the number of point- wise exchanges that take place at each exchange step.

Table 7. Exchange Step to Number of Exchanges

The mean number of exchanges, or cascade mean value, can be calculated from: Equation 18. Cascade Mean Value

M„ = Cascade

where: M_c ≡ The cascade mean value. Cascade ≡ A function that finds the next highest valid cascade value, z" ≡ A loop index. E] ≡ The number of exchanges during exchange step i. φ ≡ The total number of exchange steps. This generates a cascade mean or 4 nodes transmitting at one exchange step. This value defines the leading and trailing exchange and times. Note that the order of the exchanges is 1, 2, and then 4. FIG. 16 shows that the data exchange required during distribution of incoming data involves three exchange steps 382, 384 and 386 on a depth-3 (7 compute node) cascade with a single communication channel in the home node. Since the mean number of exchanges occurs only for 1 exchange step 386, 2 exchange steps 382, 384 represent the leading edge exchange time 388 as shown in FIG. 16. Bandwidth is constant within this figure. FIG. 16, constructed from the dataflow of FIG. 5 where data flows from home node 182 to compute nodes 184. An edge exchange time may appear as a leading or trailing edge depending upon the direction of the dataflow within the cascade. By reversing the direction of dataflow (i.e., the arrowheads) of FIG. 5, the position of the edge exchange time is also reversed from leading to trailing as shown in a data exchange 400 of FIG. 17. In particular, FIG. 17 shows three sub-impulses 402, 404 and 406, and trailing edge exchange time 408. To remove the effects of this exchange time means manipulating the n-space dimensions. It is first noted that a cascade consists of data passing zones that are called cascade strips as shown in FIG. 18. FIG. 18 shows one exemplary cascade 420 that has one home node 422 and seven compute nodes 424(1-7). Compute nodes 424 are divided into three cascade strips 426(1-3); each cascade strip represents a group of compute nodes that directly interact with other compute nodes in a cascade. These cascade strips (e.g., cascade strips 426) are separable by data movement interactions. Since cascade strips 426 are data movement interaction independent, the total data exchange impulse amplitude may be increased by considering communication for each strip independently as shown in FIG. 19. FIG. 19 shows one exemplary cascade 440 illustrating one home node 442 and seven compute nodes 444 divided into three cascade strips 446(1-3). In particular, FIG. 19 shows that the number of independent communication channels within home node 442 is increased (i.e., increased to four in this example) in order to decrease the number of exchange steps required to load input data to the cascade. In the example of FIG. 19, if the number of independent channels of home node 442 is increased to four, one exchange step is eliminated, thereby changing the data exchange impulse to that shown in FIG. 20. FIG. 20 shows a data exchange impulse 460 resulting from data input to cascade 440 of FIG.

19. In particular, data exchange impulse 460 has four exchanges in exchange step 1 followed by three trailing edge exchanges in exchange step 2. Bandwidth is constant in FIG. 20. Note that for a cascade, the data exchange impulse amplitude may also be obtained (i.e., other than finding the mean) as shown in Equation 19. Particularly, since the first cascade strip contains half the total number of compute nodes in the cascade plus one, this defines the largest number of simultaneous data exchanges. The cascade amplitude is thus given by: Equation 19. Cascade Amplitude

where: A_c = The cascade amplitude value. It can be further noted that the total number of exchange steps required to complete an exchange is given by the cascade depth. Thus the leading/trailing edge time is given by: Equation 20. Cascade Leading/Trailing Edge Exchange steps π_c = φ - 1 where: π_c ≡ The leading/trailing edge exchange time of the cascade.

Howard-Lupo Manifolds A Howard-Lupo Manifold, hereafter referred to as a 'manifold', may be defined as a cascade of home nodes. The organization of a manifold is thereby analogous to compute node cascades. For example, 4 7-node cascades could be grouped into a manifold that clears 28 compute nodes in 5 exchange steps as shown in FIG. 21. In particular, FIG. 21 shows one exemplary parallel processing environment 480 illustrating one depth-2 manifold 482 with four home nodes and twenty-eight compute nodes. Parallel processing environment 480 is configured as four depth-3 cascades 484, 486, 488 and 490 that clears (i.e., agglomerates) data from all twenty-eight compute nodes to one home node in 5 exchanges steps. In one example of agglomeration for parallel processing environment 480, cascades 484, 486,

488 and 490 clears data to their respective controlling home node as previously described. Then, manifold 482, using a similar technique as described for a cascade, clears data to one particular home node (i.e., home node 492 in the example of FIG. 21). The example of FIG. 21 does not yield time advantage in clearing data from the compute nodes. As stated above, twenty-eight compute nodes may be cleared in 5 exchange steps, whereas a regular cascade of thirty-one compute nodes clears in the same amount of time. However, since multiple communication channels may be used to form the manifold, an advantage may be gained. FIG. 22 shows one exemplary parallel processing environment 500 illustrating one depth-1 manifold 502 with three home nodes 506(1-3) and three cascades 504(1-3), each with fourteen compute nodes. Each home node 506 has two independent communication channels and clears data from its associated cascade of fourteen compute nodes in three exchange steps. Manifold 502 clears data from all 42 compute nodes in 4 exchange steps, thereby clearing 2.8 times the number of compute nodes cleared by a single cascade in 4 exchange steps. These additional advantages of manifold 502 may result from an increase in the number of virtual channels. The size of a manifold is thus affected independently by the numbers of channels on the home and compute nodes and the number of exchanges required between home nodes to complete the total data exchange impulse. Equation 21 describes the number of compute nodes in a manifold as follows: Let: ψ ≡ the number of channels per home node v ≡ the number of channels per compute node φ ≡ the number of exchange steps at the compute node level m = the number of exchange steps at the home node level Then: Equation 21. Howard-Lupo Manifold Equation P_φ = [ψ(ψ + l)^m /v] [(v + l)^φ - l] If parallel processing environment 500 moves 4 MB of data and each communication channel moves data at 100 Mb/s, then an analysis of the data exchange impulse is shown in Table 8 and FIG. 23. In particular, FIG. 23 shows four sub-impulses 522, 524, 526 and 528.

Table 8. 42-Node Howard-Lupo Manifold; 2-Channel Manifold Level Impulse Analysis Form

A 'cascade group' is defined as a home node channel and all compute nodes that communicate directly with it. For example, with reference to FIG. 22, six cascade groups are shown (two within each cascade 504). The number of nodes in each cascade group is P_φ where φ refers to the depth of the cascade group or number of exchange steps required to clear the cascade group (i.e., the depth is 3 for each cascade group 504 of FIG. 22). In analogous fashion, additional clearing time for home nodes 506 is defined as the depth of the manifold (i.e., manifold 502 has a depth of 1). Equation 22. Manifold Mean Value Calculation

where: M_m The manifold mean value. Manifold A function that finds the next highest valid manifold value. i A loop index. The number of exchanges in an exchange step i. φ The cascade depth (number of clearing exchange steps). Using Equation 22 on manifold 502 of FIG. 22 generates a manifold mean value of 8 nodes transmitting at one exchange step. With this value the leading and trailing data exchange angles and time may be defined. Note that the order of the exchanges is 1, 2, and then 4, as shown in FIG. 24. FIG. 24 shows one exemplary data exchange impulse 540 for parallel processing environment 480 of FIG. 21. Data exchange impulse 540 represents data exchanges for a one home node channel manifold. Bandwidth is constant in FIG. 24. Using Equation 22 to calculate a manifold leading/trailing edge exchange time for manifold 502 of FIG. 22 gives a manifold-mean value of 11. As above, the position of the edge exchange time may be reversed by reversing the direction of arrowheads (i.e., the data flow direction) in FIG. 21. FIG. 25 shows one exemplary data exchange impulse 560, with four sub-impulses 562, 564, 566 and 568, of parallel processing environment 500 of FIG. 22. Bandwidth is constant in FIG. 25. By adding 1 additional independent communication channel at the Home Node level, the exchange amplitude changes from 16 to 24, a 50% increase, while the leading edge exchange time 570 stays the same. Notice further that the number of nodes exchanging information in the allotted time is also 50%, that is, 28 nodes in 4 exchange steps versus 42 nodes in 4 exchange steps (the slight discrepancy is due to rounding errors). As in the cascade, the manifold exchange amplitude can be computed without generating a mean. In fact, Equation 19 can be used to calculate the amplitude and Equation 20 can be used to calculate the leading edge exchange time. Entropy for the manifold may also be calculated. lf ζ = 0 then N_s ≡ φ+ m Mused ~ * φ unused "

After substituting these values into Equation 10, this gives: Equation 23. Manifold I/O Entropy P φ + m The χterm can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since ^parameter of the Manifold I O Entropy = 1, this gives: Equation 24. Data Compression Effects on Manifold I/O Entropy γ= Compression Ratio FIG. 22 shows a pattern for additional levels of expansion. Note that each channel of the top level home node clears a cascade group, plus one full cascade. If higher level home nodes are added and a cascade group plus a lower level manifold are connected, a hyper-manifold is formed. Hyper-Manifolds A hyper-manifold serves to carry the organization of compute and home node cascades to higher "dimensions" while keeping communication channels consumed. One manifold, shown in FIG. 26, is illustratively used as a start when developing a hyper-manifold. FIG. 26 shows one exemplary level-1 depth-2 manifold 582. Each channel of a home node 584 is the starting point of a home node cascade, such that a cascade group, each with three compute nodes 586, is attached to each resulting home node channel. A level-2 manifold consists of a cascade of home nodes, to each commumcation channel of which is attached a cascade of level- 1 home nodes and a cascade group, as shown in FIG. 27. FIG. 27 shows one exemplary level-2 hyper-manifold 600 with four home nodes 602(1-4) representing level-2 of the hyper-manifold and twelve home nodes 604(1-12) representing level- 1 of the hyper-manifold. Each level is organized as a depth-2 home node cascade. This case is equivalent to a depth 4 home node cascade using single channel home nodes. Notice that the cascade group is used to keep the channel on the level-2 home nodes consumed during data clearing. By the time the cascade groups on the level- 1 home nodes have cleared, the level- 2 home nodes are ready to start clearing the level- 1 nodes. Going to level-3 implies generating a home node cascade. Each level-3 home node channel then generates a level-2 home node cascade. The resulting channels then each generate a level- 1 home node cascade. Finally, all channels generate a cascade group. This allows the level-3 home nodes to clear a cascade group, followed by the level- 1 and -2 home node cascades, continuously keeping their channels consumed. FIG. 27 is identical to a level- 1 manifold of depfh-4. The situation becomes more interesting, and more complicated, when the number of channels used in the home node controls is changed with the manifold. However, it does help illustrate general rules for generating a hyper-manifold: 1) Starting with a single home node, each available home node channel generates a home node cascade. 2) For each additional level desired, every channel at the current level is used to generate a cascade of home.nodes at the next level. 3) After all levels have been generated, each home node channel is used to generate a cascade group. The hyper-manifold of FIG. 27 clears 48 nodes in 6 exchange steps, compared to 63 nodes in a single depth-6 cascade. However, a beneficial effect is obtained by increasing the channel counts at the various manifold levels. For example, consider using a 4-channel depth-1 manifold at level-2 and a 2- channel depth-1 manifold at level- 1, with single channel level-0 home nodes and compute nodes as shown in FIG. 28. Given its complexity, only the lower right channel of the top level home node is fully expanded in FIG. 28, for clarity of illustration. Note that this hyper-manifold clears 360 nodes in just 4 exchange steps. FIG. 28 shows one exemplary two level hyper-manifold 620 with level-1 organized as a depth-1 cascade of 2-channel home nodes 624 and level-2 as a depth-1 cascade of 4-channel home nodes 622. It contains a total of 360 compute nodes 628 in 90 depth-2 single channel cascades (i.e., using 90 single channel home nodes 626). The hyper-manifold clears all 360 compute nodes in 4 time units. With the knowledge of how the hyper-manifold is generated, the total number of compute nodes may be calculated. This is a matter of calculating the total number of home node level channels and multiplying by the number of computational nodes in the cascade hung off each. At the top, or iV*, level of the hyper-manifold, the number of channels is given by: Equation 25. Number of Channels in Top Level of a Hyper-Manifold C = ψ_N(ψ_N + l)^m" This is simply the number of home nodes created at the top level times the number of channels per home node. Moving down to the next level, the total number of channels in the top two levels is given by: Equation 26. Number of Channels in Top 2 Levels of a Hyper-Manifold C = _ΨN [(_ΨN + if* + (_ΨN + ι)^m» _, + ι)^m- - ι}J Continuing this process down to the 0^th, or cascade level, some cancellation of terms yields a final expression for the total number of channels in a hyper-manifold. From this, the expression for the total number of compute nodes becomes: Equation 27. Total Number of Compute Nodes in a Hyper-Manifold ^p _N = ^p _φψNYl{ψi + i=0 Note that this equation implies that a Howard Cascade may be a level-0, depth-0 hyper- manifold. Each term (ψi-1 + 1) in Equation 27 represents the n-space dimensions of the exchange. Since each group of n-space dimensions are multiplied together, the total number of n-space dimensions grows rapidly. The n-space dimensions that are growing are the dimensions attributable to Virtual Channels. Thus, a hyper-manifold folds time faster than a cascade or manifold, but without a lot of additional hardware, (except for hardware used to support the basic series expansion growth rate). The entropy calculation for the hyper-manifold is given below. If ζ = 0 then N Ns ≡ φ + m_{ «=ι ^ sed ~ " φ ^unused ⁼ After substituting these values into Equation 10, this gives: Equation 28. Hyper-manifold I/O Entropy

The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Hyper-manifold I/O Entropy = 1, this gives: Equation 29. Data Compression Effects on Hyper-manifold I O Entropy γ- Compression Ratio If the system is moving 4 MB of data and each point-to-point exchange moves data at 100 Mb/s, then a 360-Node Multi-channel Two Level Hyper-Manifold Impulse Analysis Form is generated as shown in Table 9.

Table 9. 360-Node Multi-channel Two Level Hyper-Manifold Impulse Analysis Form

FIG. 29 shows one exemplary data exchange impulse 640 for hyper-manifold 620 of FIG. 28. In particular, FIG. 29 shows four sub-impulses 642, 644, 646 and 648. FIG. 30 shows one exemplary data exchange impulse 660 for hyper-manifold 600 of FIG. 27. Bandwidth is constant in FIG. 30. In particular, data exchange impulse 660 has six sub-impulses 662, 664, 666, 668, 670 and 672, and an edge exchange time 674. As above, the position of the edge exchange time is reversed by reversing the direction of the arrowheads, and hence the data flow direction, found in FIG. 26.

Howard-Lupo Sub-Cascades The rules for building the hyper-manifold can be modified to change the form of the cascade itself. For example, consider a cascade built using compute nodes with different numbers of channels, as long as the clearing time requirements are maintained. The rules for the creation of sub-levels within a cascade are: 1) Each home node generates a cascade of compute nodes. 2) For each additional sub-level, all channels generate another cascade. For instance, FIG. 31 shows a hyper-manifold 680 that is created by generating a depth-2 cascade using a 2-channel home node 682, and 2-channel compute nodes 684, then adding a second level of single channel compute nodes 686 also of depth-2. Hyper-manifold 680 results in 62 nodes that clear in 4 exchange steps. This can be compared to a single channel cascade, which clears 15 nodes in four exchange steps, and a dual channel cascade, which clears 80 nodes in four exchange steps. An advantage of the hyper-manifold may for example be the added flexibility for achieving maximum performance with available resources. Sub-cascades are created by changing the number of channels in the compute nodes at some depth into the cascade. Hyper-manifold 680 starts as a 2-channel cascade off of a 2-channel home node with depth-2, then continues as a single channel cascade to an additional depth of 2. If hyper-manifold 680 moves 4 MB of data and each communication interface moves data at 100 Mb/s, then its impulse analysis form is shown in Table 10.

Table 10: Impulse Analysis Form for hyper-manifold 680, FIG. 31.

FIG. 32 shows a data exchange impulse 700 for hyper-manifold 680 of FIG. 31. In particular, FIG. 32 shows four sub-impulses 702, 704, 706 and 708 illustrating that 72 exchanges are required to clear hyper-manifold 680. Data exchange impulse 700 has a trailing edge exchange 710. Bandwidth is constant in FIG. 32. Effective Clearing Bandwidth The impact of clearing the compute nodes in fewer and fewer exchange steps is equivalent to using faster and faster channels to clear the data in serial fashion. Hyper-manifold 620 of FIG. 28, for example, clears 360 compute nodes in 4 exchange steps; this is faster than the single channel speeds suggests. An effective bandwidth for the clearing operation may be computed as follows: Equation 30. Hyper-manifoU Effective Clearing Time Bandwidth M ^beff = M + φ where: b_eιι ≡ The effective bandwidth of the clearing operation. b ≡ Bandwidth of a compute node channel. M ≡ Total depth or number of steps to clear the hyper-manifold. PM = Total number of compute nodes in the hyper-manifold.

Assuming 100 Mb/s channels, Equation 30 suggests that hyper-manifold 620 of FIG. 28 achieves an effective bandwidth of 9 Gb/s. Manifolds as Growth Series Well-defined numerical series which define growth patterns may be used to describe the development of whole families of tree-based networks. Cascades and manifolds may be described as extensions to the trees generated by Fibonacci-like sequences. In addition to the clearing time metric, the development factor Df of a series provides a measure of how fast the network grows in size. For a given series R[i], Df is defined as: Equation 31. Development Factor

In the case of a binary sequence, D_f- 2. The Df of the well-known Fibonacci Sequence is approximately 1.61803, otherwise known as the Golden Section. The number of compute nodes in commumcation at a given depth φ may be considered a sequence. Since the numbers of compute nodes at each time unit is given by Equation 15, the computation of the development factor is straightforward. Note that the leading factors of the equation cancels under division, giving: Equation 32. Hyper-manifold Development Factor Calculation n - lim- L . n ϋ±lϊ i - (v + 1) ' <~ ,P_f(i-l) <→- (r + l)^rf -l Hence, the development factor of a hyper-manifold is related to the number of available channels per compute node. All other parameters serve as multiplicative factors, and the hyper- manifold can be said to grow as 0(υ ^φ). This growth factor implies that the number of channels is the primary driver of network growth. Network Generators A'network generator' may be used to describe production of growth series. Networks can be generated from numerical series. Generation of a network may be described by specifying the roles played by various nodes in the process. A generator may first be described as a node which adds network nodes at specified time intervals or units. A newly added node is said to be'ltistantiated'and, depending on the sequence generation rules, may or may not replicate additional nodes at later'time steps^''. In the case of the Fibonacci sequence, a generator produces just one initial node. From then on, an instantiated node begins to replicate additional nodes following the rule of adding one node per time step, beginning with the second time step after its instantiation. FIG. 33 shows one exemplary Fibonacci generation pattern 720. A generator node 722 starts the growth process and produces one node 724 in time step 1. This is an instantiated node, as indicated by a white circle in Fibonacci generation pattern 720, and does not reproduce until the second time step after its instantiation; thus, at time step 3 (i.e., point 726), node 724 reproduces node 728, and in time step 4 node 724 reproduces node 730, and so on. Node 728 starts reproducing in time step 5 and node 730 starts reproducing in time step 6. Reproducing nodes are shown as solid circles. A reproducing node generates one new node on each subsequent time step. FIG. 34 shows a Fibonacci tree 740 generated from Fibonacci generation pattern 720, FIG. 33. Fibonacci tree 740 is extracted by collapsing Fibonacci generation pattern 720 along the time traces of each node, reducing them to single points. These are the paths used to define the relationship of nodes during data movement through the network, but they need not represent the only paths available to the nodes. This process may be used to generate a whole family of networks. For example, replication could be delayed until the 3^rd time step after instantiation (euphemistically called a "Tribonacci Tree") to produce a Tribonacci generation pattern 760 shown in FIG. 35 and a Tribonacci tree 780 shown in FIG. 36. In another example, replication may begin immediately with the next time step after instantiation (a so-called "Bibonancci Tree") to produce a Bibonacci generation pattern 800 shown in FIG. 37 and a Bibonacci tree 820 shown in FIG. 38. Trees 740, 780 and 820 are similar, though Tribonacci tree 780 grows more slowly, having a development factor of about 1.465, while Bibonacci tree 820 has a development factor of 2. Thus, the primary difference in these trees is their rate of growth. Given enough time, all can reach approximately the same number of nodes. This description may also be applied to cascade generation. The primary difference is the ability of the generator to continue inserting nodes at each exchange step. This has the effect of growing the network faster than Bibonacci tree 820 even though both have a development factor of 2. FIG. 39 shows one exemplary cascade generator pattern 840. In particular, a generator 842 generates a node 844 in time step 1, a node 846 in time step 2, and so on. Node 844 generates a node 848 in time step 2, and so on. FIG. 40 shows a cascade tree 860 with associated network communication paths for cascade generator pattern 840 of FIG. 39. A cascade and a Bibonacci Tree both guarantee that data can flow from the lowest level nodes to the top level or generator node while fully utilizing all upper level communication paths. For example, at a given starting exchange step, all lower nodes can immediately move data to the node above. The same occurs at every later exchange step until all data arrives at the top-most node. This means that both the cascade and the Bibonacci Tree have zero entropy at all exchange steps. Other tree patterns cannot guarantee this. If one examines FIG. 34 and FIG. 36, it is fairly easy to pick out several nodes that attempt to communicate to the same node at the same time (for instance, the triplet on the right side of FIG. 36). The cascade's faster growth rate is important because it implies greater channel utilization. While the cascade can clear 31 nodes in 5 exchange steps, the Bibonacci Tree requires 5 exchange steps to clear just 16 nodes. This effect can be carried to higher dimensions in several ways. A first method involves adding additional independent communication channels to a generator node for the cascade so that multiple cascades are created. Adding a second independent channel to the generator node, for example, allows twice as many compute nodes to be cleared in a given number of time units. FIG. 41 shows one exemplary cascade generated pattern 880 resulting from two communication channel in a generator node 882. With two communication channels, generator node 882 simultaneously generates nodes 884 and 886. FIG. 42 shows one exemplary cascade tree 900 showing generated data paths of cascade generated pattern 880. A second method involves adding additional growth channels to compute nodes. By adding multiple channels to the generator and compute nodes, each node may branch multiple times at each allowed time step. For example, adding a second channel to the compute nodes of 41 allows an even faster rate of network growth, while maintaining the ability to clear data from the nodes in a time which utilizes all available channels. FIG. 43 shows one exemplary 2-channel cascade generation pattern 920. In FIG. 43, a two channel generator simultaneously generates nodes 924 and 926. In a next time step, nodes 924 and 926 simultaneously each generates nodes 928, 930, 932 and 934, respectively. FIG. 44 shows a 2-channel cascade generated tree 940 extracted from cascade generation pattern 920 of FIG. 43 and illustrating generated paths. The first and second methods may also be combined. This can be further expanded to cover manifolds and hyper-manifolds. A manifold is produced by a rule which has the generator first produce the equivalent of a cascade of generators. Each generator in this cascade then produces a cascade of compute nodes. FIG. 45 shows one exemplary manifold generated pattern 960. In particular, pattern 960 shows a generator 962 that generates two generators 964 and 968 in two time steps; generator 964 generates generator 966 in the second of these time steps. Each generator 962, 964, 966 and 968 then generates a cascade of nodes. For example, generator 962 generates nodes 970, 972 and 974 in time steps 3, 4 and 5. These nodes may also replicate as shown. FIG. 46 shows a manifold tree 980 extracted from manifold generated pattern 960 of FIG. 45 and illustrating generated paths. The hyper-manifold carries this rule to higher dimensions. For instance, a second manifold dimension uses the generator to produce a cascade of generators; then each 2^nd level generator produces a cascade of 1^st level generators below it; and finally, every generator produces a cascade. This can be expanded further to produce interconnection networks of arbitrary size and complexity, all of which ensures zero-entropy Type I data I/O. As shown above, the introduction of arbitrary connections or termination of sequence growth leads to violation of the channel availability rule for zero-entropy systems. By construction and elimination by counter-example, it appears that the Howard-Lupo Hyper-Manifold covers all cases of such networks.

Type H I O

Type Ha Type Ila I/O is primarily used for agglomeration and involves the movement of datasets that increase in size as one moves up the cascade and manifold levels. The ideal case requires each node to contribute the same size partial result to the total so that the final result is proportional to the number of nodes; that is, the data is evenly distributed across the nodes. Since the size changes uniformly as data travels up the levels, the time to complete the data movement also increases uniformly. For Type II agglomeration, the data size grows in proportion to the number of nodes traversed, since each upper level node passes on its data plus that from all nodes below it. In the following example, each node is assumed to start with 1 unit of data. FIG. 47 shows one exemplary cascade 1000 with one home node 1002 and seven compute nodes 1004(1-7). During a first exchange step, one unit of data is moved from compute nodes 1004(1), 1004(4), 1004(6) and 1004(7) to nodes 1002, 1004(2), 1004(3) and 1004(5), respectively. During a second exchange step, two units of data are moved from compute nodes 1004(2) and 1004(5) to nodes 1002 and 1004(3), respectively; and during a third exchange step four units of data are moved from compute node 1004(3) to home node 1002. Thus, 7 exchange steps, instead of 3, are required to clear this depth-3 cascade. Increasing the number of channels between nodes and/or increasing the speed of the channels may reduce the time. An advantage over sequentially moving data from the nodes may be to decrease total latency. The time to move data through a manifold can be expressed as: Equation 33. Data Movement Time Through a Type Ha Manifold P ' _mmD_φ φ t_e = bψ Equation 34. Data Movement Latency Time Through a Type Ha Manifold t_λ = λ(φ + m) where: φ ≡ Depth of the cascade. D_ψ ≡ Data set size on a compute node. ψ ≡ Number of home node channels. b ≡ Channel bandwidth. λ ≡ Channel effective latencies. m ≡ Depth of the manifold. P_m ≡ Total number of compute nodes in the manifold. Calculating the entropy values for Type Ila I O yields: Equation 35. Type Ha I O Entropy R, a Ila = ( , + 1>², β_JIa = 0, γ_IIa = φ The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Type Ua I/O Entropy = ¥_φ I φ, this gives: Equation 36. Data Compression Effects on Type Ha I/O Entropy γ= (P_φ I φ) Compression-Ratio If cascade 1000 moves 3 MB of data and each point-to-point exchange moves data at 100 Mb/s, then a data exchange impulse analysis produces the agglomeration impulse analysis form shown in Table 11.

Table 11. 7-Node Type H Agglomeration Impulse Analysis Form FIG. 48 shows one exemplary data exchange impulse 1020 of cascade 1000, FIG. 48. The effects of increasing data size are illustrated in data exchange impulse 1020 by the decreasing number of exchanges coupled with the increasing time to complete an exchange. For example, in a first sub- impulse 1022, four simultaneous exchanges occur in one time period. In a second sub-impulse 1024, two exchanges of two data units occur simultaneously, in two time units. And in a third sub-impulse 1026, one exchange of four data units takes four time units. Thus, trailing edge exchange 1028 shows an increased time requirement. An exchange step (i.e., a time unit) is defined as the time required to move a single node's data, D_ψ, to another node. Bandwidth is constant in FIG. 48. Type Hb The introduction of multiple channels at the home node level may reduce the amount of time required to clear a cascade. While any number of additional channels may help, there is one case of particular interest; it occurs when the number of channels at the home node level satisfies: Equation 37. Home Node Level Channels Required For TypeH Clearing in φ Steps ψ = log₂(P_φ +l)= φ The channels can be provided either as multiple channels on a single home node, or as some combination of multiple channels and multiple home nodes. For the first φ-l exchange steps, all of the home node channels may be fully utilized. The final step is used to clear the remaining nodes. For example, a depth-3 cascade requires 3 channels at the home node level. FIG. 49 shows one exemplary depth-3 cascade 1040 with one home node 1042 that has three independent communication channels and seven compute nodes 1044(1-7). In a first exchange step, compute nodes 1044(1), 1044(2) and 1044(4) simultaneously transfer data to home node 1042; in a second exchange step, compute nodes 1044(3), 1044(5) and 1044(6) simultaneously transfer data to home node 1042; and in a third exchange step compute node 1044(7) transfers data to home node 1044. Similarly, a depth-4 cascade uses 4 independent communication channels on a home node to allow the first 3 steps to clear 4 nodes each, followed by 3 nodes on the last step. Computing the entropy for this exchange is straight forward: Equation 38. Type Hb Entropy Calculation

The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Type lib Entropy = 1, this gives: Equation 39. Data Compression Effects on Type Hb Entropy /= Compression-Ratio The resulting analysis table and impulse shape for cascade 1040 of FIG. 49 is shown in Table

12, which assumes each node outputs 1MB of data, and FIG. 50 shows one exemplary data exchange impulse 1060 of cascade 1040. In particular, data exchange impulse 1060 shows sub-impulses 1062, 1064 and 1066. Sub-impulses 1062 and 1064 each have three simultaneous exchanges, and sub-

Table 12. Type Hb Exchange Impulse Analysis Form, Depth-3 Cascade

By distinguishing between ψ, the number of channels at the home node level which participate in cascade generation, and ψ_a, some number of available auxiliary channels at the home (unused channels on one home node, or additional nodes at the home node level), the number of steps required to clear the cascade in a Type lib data movement may be expressed as: Equation 40. General Case Type Hb Clearing Steps

the number of home node channel groups may be defined as: Equation 41. Number of Home Node Channel Groups

then the number of clearing steps reduces to: Equation 42. Home Node Channel Group Restricted Type Hb Clearing Steps

G_hv For the general case, the number of home node channel groups required for a Type Ub clearing of a cascade in a number of steps equal to its depth is given by: Equation 43. General Type Hb Clearing in φ Steps

Type HI Input/Output

If the data is to be streamed off the system for additional processing or assembly elsewhere, additional home nodes, separate from the cascade nodes, may be used to create multiple data streams off of the parallel processing environment. Individual cascades could be moved off the parallel processing environment in one time unit if at least as many channels exist at the home node level as on the total of all compute nodes. The total number of time units may then depend on the number of manifold groups: Equation 44. Data Movement Time Through a Type IHa Manifold M " Q bv„ Hψ Equation 45. Data Movement Latency Time Through a Type Hla Manifold

FIG. 51 shows one exemplary parallel processing environment 1080 illustrating four home nodes 1082, twenty-eight compute nodes 1084 and a mass storage device 1088. Parallel processing environment 1080 also has four additional home nodes 1086, each with eight independent communication channels, that allow data to be cleared from all twenty-eight compute nodes 1084 to mass storage device 1088 in one exchange step and concurrently with type I agglomeration. FIG. 52 shows one exemplary depth-3 cascade 1100 with three home nodes 1102(1-3), shown as a home node bundle 1106, and seven compute nodes 1104(1-7). Home node bundle 1106 facilitates movement of data from a mass storage device 1108 to compute nodes 1104. In a first exchange step, data is moved from home nodes 1102(1), 1102(2) and 1102(3) to compute nodes 1104(1), 1104(2) and 1104(3), respectively. In a second exchange step, data is moved from home nodes 1102(1), 1102(2) and 1102(3) to compute nodes 1104(4), 1104(5) and 1104(6), respectively. In a third exchange step, data is moved from home node 1102(1) to compute node 1104(7). Thus, data is transferred to all compute nodes in three exchange steps, in this example. This model may match the Type I I/O exchange steps for a non-collapsing dataset. A home node bundle may be defined as a group of home nodes that act in concert, analogously to a single home node in a cascade, using the least amount of channel capacity to generate a linear time growth with an expanding network. The following equation relates I O channel capacity to the number of nodes used for a given cascade. Let: VΕ_ase ^≡ The Home Node channel capacity required to clear the cascade in φ exchange steps. H s Number of Home Nodes (home node bundle count). b ≡ Channel speed. ψ = Number of channels per Home Node. φ ≡ Number of cascade expansion exchange steps. P_φ ≡ Number of compute nodes. Ceil ≡ If the real number within this function has a non-zero decimal value, the function selects the next highest integer value; otherwise, it selects the current integer value.

Then: Equation 46. Home Node Channel Capacity Required to Clear a Type Hlb Cascade in φ Exciiange steps

Ψ_Base = Hψ

And: Equation 47. Data Movement Time Through a Type Hlb Cascade

Table 13. Relationships Between υ_base and Cascade Parameters The type Hlb cascade may be treated in a manner that is analogous to the type I cascade; that is, it can be formed into manifold and hyper-manifold structures. The number of n-space dimensions used in this exchange is given by ψ _ase- Table 13 show relationships between ψ_base and other cascade parameters. Extra n-space dimensions occur because of channel reuse; this is another example of virtual channels increasing the effect of the communication channels in a non-linear manner. Adding more channels than the base formula shows increases the throughput of the I/O transfer, while subtracting channels from the base formula decreases the throughput of the I O transfer. Equation 48. Type Hlb Cascade Mean Value

where: M_m The Type Iϋb cascade mean value. Cascade A function that finds the next highest valid cascade value. i A loop index. E, The number of exchanges a exchange step i. φ The total number of exchange steps.

AA Using Equation 48 on the type HTb cascade generates a type IHb cascade mean of 3 nodes transmitting at one exchange step. This value defines the leading and trailing edge exchange times. As shown in FIG. 53, data exchange impulse 1120 has two sub-impulses 1122, 1124 of 3 exchange steps, and one sub-impulse 1126 of one exchange step; sub-impulse 1126 represents the trailing edge exchange time. If cascade 1100 moves 4 MB of data and each commumcation channel moves data at 100 Mb/s, then its impulse analysis form is shown by Table 14.

Table 14. 7-Node Type-HIb I/O Impulse Analysis Form

Once again, the edge exchange time position (leading or trailing) is a function of the dataflow direction, in this case from the mass storage device to the compute nodes. Reversing the arrowhead direction in FIG. 52 reverses the position of the edge exchange time from trailing to leading. As demonstrated in FIG. 20, the natural leading edge exchange time of the cascade is decreased at a cost of communication channel count. The communication channel cost is minimal to mitigate the natural trailing edge exchange time. Continuing the example of FIG. 52, a single home node channel is added to depth-3 cascade 1100 to time shift the trailing edge exchange time to the first φ time. FIG. 54 shows one exemplary manifold 1140, based upon depth-3 cascade 1100, with one additional home node 1142 to form a channel with compute node 1104(7). In particular, the number of additional home node channels used is a function of the number of nodes left out of the exchange, in this case 1. FIG. 55 shows one exemplary data exchange impulse 1160 of manifold 1140 of FIG. 54. In particular, data exchange impulse 1160 has a first sub-impulse 1162 with four exchanges, and a second sub-impulse 1164 with three exchanges. Thus, there are no leading edge or trailing edge exchange times. Entropy of type Hlb I O is calculated as follows: Equation 49. Type Hlb Cascade Entropy <Xπi_b-c = _φ(lP_φ

O, γ_m__c = ι The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since / parameter of the Type HTb Cascade I O Entropy = 1: Equation 50. Data Compression Effects on Type nib Cascadel/O Entropy /= Compression-Ratio Howard-Lupo Type HTb Manifold A Howard-Lupo type IITb manifold is hereafter referred to simply as a type Hlb manifold. A type HTb manifold may be defined as a cascade of home node bundles. The organization of the home node bundle cascades is analogous to the home node cascades found in manifolds. FIG. 56 shows one exemplary type Hlb manifold 1180 with twelve home nodes 1182 and twenty-eight compute nodes 1184. Home nodes 1182 of type Hlb manifold 1180 are also shown connecting to a mass storage device 1186. Type Hlb manifold 1180 clears twenty-eight compute nodes 1184 in three exchange steps. Unlike the manifold depicted in FIG. 21, which takes 5 exchange steps to clear, type Hlb manifold 1180 only takes 3 exchange steps since the data is not destined for a single home node. This implies that there is no hyper-manifold equivalent with a type Hlb cascade. Equation 51. Type Hlb Manifold Mean Value

M_m =

where: M_m The Type Hlb manifold mean value. Manifold A function that finds the next highest type Hlb manifold value. i A loop index. E The number of exchanges a exchange step i. φ The total number of exchange steps. Using Equation 51 on the Type Hlb manifold generates a Type Hlb manifold mean value of 12 nodes transmitting at one exchange step. This value defines the leading and trailing edge exchange times. Note that the order of the exchange is 12, 12, and then 4. If type nib manifold 1180 moves 3 MB of data and each communication channel moves data at 100 Mb/s, then its impulse analysis form is shown in Table 15.

Table 15. 28-Node Type-IHb Manifold Impulse Analysis Form

FIG. 57 shows one exemplary data exchange impulse 1200 for type Hlb manifold 1180 of FIG. 56. In particular, data exchange impulse 1200 has two sub-impulses 1202, 1204 of 12 exchanges and one sub-impulse 1206 of 4 exchanges; sub-impulse 1206 represents a trailing edge exchange time. Of course, reversing the arrowhead direction within type Hlb manifold 1180 repositions the edge exchange time from trailing to leading. It should be noted that a φ-ύ e shift that corresponds to FIG. 54 and FIG. 55 may be performed on type Hlb manifold 1180. Entropy for Type Hlb manifold I/O may be calculated as follows: Equation 52. Type Hlb Manifold I/O Entropy

* ^ai -m ~

The /term may be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Manifold I/O Entropy = 1 : Equation 53. Data Compression Effects on Type THb Manifold I/O Entropy /= Compression-Ratio0 FIG. 58 shows one exemplary data exchange impulse 1220 for type HTb Manifold 1180, FIG. 56, with four additional home nodes to form additional home node channels. In particular, data exchange impulse 1220 has a first sub-impulse 1222 with 16 exchanges and a second impulse 1224 with 12 exchanges. Thus there are no leading or trailing edge exchanges. Bandwidth is constant in FIG. 58. 5 Pipe-Lining

A pipe line process handles multiple independent input and output datasets. Such processing is characterized by a very large total data set size that may be subdivided into smaller parcels representing unique and independent data units. These data units may be processed to produce results independently of each other and these results also may be handled independently. Consequently, these data units may be distributed across individual cascades to minimize processing time. Gathering the results from each cascade is a Type I I/O operation. Each compute node of the cascades stops processing for the time it takes to agglomerate results to the home node and acquire a new data unit. As described above, the depth of the cascade determines this agglomeration time. A modified version of Type III I/O, also called Type IIIc I/O hereinafter, provides one way to decrease the amount of time required to off load results and begin processing the next data unit. Auxiliary home nodes are introduced to serve as I/O processors. Each home node is assigned a collection of compute nodes in the form of a small sub-cascade. For example, a 15-node depfh-4 cascade may be subdivided into 5 depfh-2 sub-cascades, each containing 3 compute nodes. This allows results to be agglomerated to the auxiliary home nodes in 2 exchange steps, rather than 4. However, it does require 3 additional exchange steps to agglomerate data off the auxiliary home nodes. As long as processing time is greater than 3 exchange steps, this represents an advantage since I/O may be overlapped with processing time. FIG. 59 shows one exemplary cascade 1240 for performing type HIc I/O. Cascade 1240 has five home nodes 1242(1-5) and fifteen compute nodes 1244. Home node 1242(1) allows cascade 1240 to operate as a normal cascade with normal cascade generation data flow. Cascade 1240 is also shown divided into sub-cascades 1246(1-5), where each sub-cascade 1246(1-5) has one home node 1242(1-5), respectively. Agglomeration of temporary results is shown by arrowed lines indicating an upward direction. Since each sub-cascade is of depth 1, data may be agglomerated to sub-cascade home nodes in two exchange steps. The number of auxiliary home nodes can be used to tune agglomeration times. For example, a depth-8 cascade with 255 compute nodes may be subdivided into 36 depth 3 and 1 depth 2 sub-cascades, requiring 37 auxiliary home nodes, and clears in 3 time units, rather than 8. Distribution of subsequent data units can also make use of this structure. Data units may be distributed to all nodes in the same exchange steps. Thus, for the depth-8 cascade example, distribution and agglomeration of results for a single data unit occurs in 6 rather than 16 exchange steps. Equation 54. Sub-Cascade I/O Entropy

The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Sub-Cascade I/O Entropy = 1 : Equation 55. Date Compression Effects on Sub-Cascade I/O Entropy / = Compression-Ratio

Cross-communication

Many types of parallel algorithms eventually reach points in their calculation where data is exchanged with other nodes. While the Howard Cascade and the Howard-Lupo Manifolds are organized logically as cascades, point-to-point communication between any compute nodes and home nodes is fully supported. Therefore, while cascade organization allows for efficient movement of data onto and off of the cascade, other logical connection architectures may be assumed as dictated by the algorithm. An all-to-all exchange involves moving some or all local data to every other node working on the algorithm. The movement of a complete copy of a data set from one node to another is referred to hereinafter as a full all-to-all exchange and the movement of a portion of a data set from one node to another node is referred to hereinafter as a partial all-to-all exchange. The exchange time is proportional to the amount of data moved. Further, in a partial all-to-all exchange, a small amount of calculation selects the correct pieces of the data set. As shown below, the pattern of data movement is the same for both full and partial all-to-all exchanges.

Mersenne Prime Cascade AlI-to-AU Cross-Communication A prime number of the form P = 2^φ -1 where φ is also prime is called a Mersenne Prime. This happens to be the equation for a Howard Cascade when ψ= v= 1 in Equation 15. If P (the number of compute nodes in a cascade) is prime it is also a Mersenne Prime. The following all-to-all exchange completes for all Mersenne Primes. In a first example, a cascade of depth-3 (i.e., ψ- v= l and φ= 3) has seven compute nodes (i.e., P = 7 which is a Mersenne Prime) and 1 home node. In a partial dataset all-to-all exchange, specific packets of data are moved from one compute node to every other compute node. The home node is used as a temporary storage area, allowing all communication channels in the cascade to be fully utilized at every exchange step in the all-to-all exchange. FIG. 60 shows a data flow diagram 1260 for a first exchange step of a Mersenne Prime partial all-to-all exchange between seven datasets 1262(1-7) and a temporary data location 1268 of the cascade of the first example. Temporary data location 1268 may be located on the home node and datasets 1262(1-7) located on compute nodes of the cascade. For example, data set 1262(1) is located on compute node 1 of the cascade, dataset 1262(2) is located on compute node 2 of the cascade, and so on. In this exchange step, one data packet moves from a compute node to the home node for temporary storage. In FIG. 60, vertical columns 1264 of implicitly numbered (sequentially from top to bottom) storage locations within each data set 1262 represent data locations that are filled from a correspondingly numbered compute node of the cascade. Diagonal locations 1266 of implicitly numbered (sequentially from top left to bottom right) storage locations within each data set 1262 represent data packets to be sent to a correspondingly numbered compute node of the cascade. For example, a data packet of node 1 that is destined for node 2 is transferred from a second location of diagonal 1266 within data set 1262(1) to the first location of column 1264 within data set 1262(2). Once a column is filled, all necessary data packets have been received from all other nodes. The temporary data location 1268 at the home node level represents a temporary holding area, allowing all the channels to remain occupied during the exchange while maintaining progress. In particular, within FIG. 60, a diagonal element 1266 of each data set 1262 represents a portion of data that is moved to a corresponding data set 1262 of a compute node. A vertical column 1264 within each data set 1262 represents data locations for data packets received from other compute nodes, and includes a portion of the data that is local to the compute node and not moved. For example, in data set 1262(1), column 1264 intersects with diagonal 1266 at a first data location; thus the first data packet is local and is not moved. In data set 1262(2), column 1264 intersects with diagonal 1266 at a second data location; thus the second data packet is local to data set 1262(2) and is not moved. Remaining datasets 1262(3-7) are similarly shown. In FIG. 60, a seventh data packet of diagonal 1266 of data set 1262(1) is moved to temporary data location 1268; a seventh data packet of diagonal 1266 of data set 1262(2) is moved to a second data location of column 1264 of data set 1262(7); a sixth data packet of diagonal 1266 of data set 1262(3) is moved to a third data location of column 1264 of data set 1262(6); and a fifth data packet of diagonal 1264 of data set 1262(4) is moved to a fifth data location of column 1264 of data set 1262(5). Thus all exchanges of the first exchange step are completed simultaneously and all communication channels are utilized. FIG. 61 shows a data flow diagram 1280 for a second exchange step of the Mersenne Prime partial all-to-all exchange and follows the exchange step of FIG. 60. In FIG. 61, the data packet stored in temporary data location 1268 is moved to a first data location of column 1264 of data set 1262(7); a third data packet of diagonal 1266 of data set 1262(4) is moved to a fourth data location of column 1264 of data set 1262(3); a second data packet of diagonal 1266 of data set 1262(5) is moved to a fifth data location of column 1264 of data set 1262(2); and a first data packet of diagonal 1266 of data set 1262(6) is moved to a sixth data location of column 1264 of data set 1262(1). This sequence of steps, moving one data packet up to temporary data location 1268 within the home node and then down to its destination while other nodes interchange amongst themselves, continues until all packets have been moved. The total number of exchange steps required to complete the movement of all data is then given by: Equation 56. Number of All-To-All Exchange Steps Required for P_φ Nodes tf_f = 2(p, -l) Table 16 shows exchange steps for completing a full all-to-all exchange on a depth-3 cascade (i.e., with seven compute nodes).

Table 16. All-To-All Exchange Sequence, Single Channel

Table 16 displays the exchange steps for a 7-node (φ=3) cascade with a single channel on each node. Each step is numbered in the order in which it is performed. All nodes are either transmitting or receiving simultaneously, with the node on the left side of the arrow transmitting to the node on then right side. The full all-to-all exchange completes in 12 exchange steps. The arrowheads shown in FIG. 60 and FIG. 61 show a single transfer direction per exchange step. If the communication channel that provides the connection for the data transfer is full-duplex (i.e., a node can transmit and receive simultaneously), the number of exchange steps is reduced by half as shown in Table 17. Offset

Table 17. All-To-All Exchange Sequence, Single Channel Full-Duplex

Table 17 shows the exchange steps for a depth-3 (φ=3) cascade with a single full-duplex communication channel per node. The exchange steps are performed in sequence, and all nodes either transmit or receive simultaneously during each step. Not all channels are utilized at all times since the cascade has an odd number of nodes. In the example of Table 17, the use of full-duplex communication channels gives two effective channels, and the number of exchange steps is determined by: Equation 57. All-To-All Exchange Steps for Full-Duplex Channel N_s = P_φ X General Cascade Level All-to-All Cross-Communication An expression can be derived to give the amount of time required to complete an all-to-all data exchange, using the same notations introduced earlier for FIGs. 60 and 61. In a first example, a deρth-3 cascade (i.e., seven compute nodes and one home node) has one communication channel per node. A partial all-to-all exchange moves specific packets of data from 1 node to every other node. The home node is used as a temporary storage area, allowing all channels in the cascade to be fully utilized. FIG. 60 and FIG. 61 show the first two exchanges. As above, diagonal elements represent the portion of the data that is moved to a corresponding node, and column elements represent the data packets received from other nodes, as well as the portion of the data that is local to the node and not moved. This sequence of steps, moving one packet up to the home node then down to its destination while other nodes interchange data amongst themselves, continues until all packets have been moved. Since a cascade group always involves at least one home node and an odd number of compute nodes, this pattern may be extended to all cascade sizes. The total number of exchange steps required to complete the movement of all data is then given by Equation 56 above. As discussed above, it may be possible to reduce the number of exchange steps required to perform an exchange by half, using a full duplex-like exchange. The general process of completing an all-to-all exchange for any number of nodes can be specified using two communication patterns and a specific order for applying them to the list of nodes. An exchange phase is defined as a set of 4 exchange steps involving two node lists and two communication patterns. Every all-to-all exchange involves a fixed number of exchanges related to the number of exchange steps by: Equation 58. Total Number of Exchanges Calculation for General Cascade All-to-All Exchange

where: θ ≡ Number of all-to-all exchange exchanges. For the single channel case, θ represents the number of possible next Λ^-neighbors. Each exchange involves two communication patterns, taken in a forward and reverse direction as shown in Table 18. Pattern 1 is the first pattern, pattern 2 is the second pattern, pattern 3 is the second pattern reversed, and pattern 4 is the first pattern reversed.

All-To-All Exchange Patterns Basic Reversed Pattern 1 Pattern 2 Pattern 3 Pattern 4 e(0) → e(P-l) e(P-l) → e(P-2) e(P-2) → e(P-ϊ) e(P-l) ^■ e(0) e(l) → e(P-2) e(P-3) → e(0) e(Q) → e(P-3) e(P-2) ^• e(l) e(2) → e P-3) e(P-4) → e(l) e(l) → e(P-4) e(P-3) - e(2) e(P/2-l) → e(P/2) e(P/2-l) → e(P/2-2) e(P/2-2) → e(P/2-l) e(PI2) → e(P/2-l) Table 18. All-To-All Exchange Pattern

Node indices, using 0-based numbering, are used here. If the number of nodes is odd, then P is set to that number plus 1, and P-l represents a home node. If the number of nodes is even, P is unmodified, and the steps of pattern 2 and pattern 3 are not performed during the last exchange. Table 18 shows the endpoints of the exchange operation with e(i) representing the node number of an endpoint of the exchange. If the number of nodes is odd, as it is in a single channel cascade, then the number of nodes is incremented by 1, and P-l represents a home node. If the number of nodes is even, then no adjustment is necessary to P; however, the two middle steps of the last exchange are not performed. An example of the endpoint lists required for an all-to-all exchange on a 15-node (depth 4) cascade is shown in Table 19. It is worth noting that the same list of endpoints is usable for a 16 node all-to-all exchange, modifying only the steps applied in the last exchange of the operation. Listl List 2 End Exchange Exchanj Point ϊe 1 2 3 4 5 6 7 1 2 3 4 5 6 7 e(0) 0 1 2 3 4 5 6 0 14 13 12 11 10 9 e(l) 1 2 3 4 5 6 7 1 0 14 13 12 11 10 e(2) 2 3 4 5 6 7 8 2 1 0 14 13 12 11 e(3) 3 4 5 6 7 8 9 3 2 1 0 14 13 12 e(4) 4 5 6 7 8 9 10 4 3 2 1 0 14 13 e(5) 5 6 7 8 9 10 11 5 4 3 2 1 0 14 e(6) 6 7 8 9 10 11 12 6 5 4 3 2 1 0 e(7) 7 8 9 10 11 12 13 7 6 5 4 3 2 1 e(8) 8 9 10 11 12 13 14 8 7 6 5 4 3 2 e(9) 9 10 11 12 13 14 0 9 8 7 6 5 4 3 e(10) 10 11 12 13 14 0 1 10 9 8 7 6 5 4 e(ll) 11 12 13 14 0 1 2 11 10 9 8 7 6 5 e(12) 12 13 14 0 1 2 3 12 11 10 9 8 7 6 e(13) 13 14 0 1 2 3 4 13 12 11 10 9 8 7 e(14) 14 0 1 2 3 4 5 14 13 12 11 10 9 8 e(15) H H H H H H H H H H H H H H Table 19. Node Lists Required for All-To-All Exchange on 15 Nodes

Table 19 shows the sequence of node lists for each exchange of a 15 node all-to-all exchange. The same pattern works for 16 nodes if the node identification number 15 replaces the H node, and steps 2 and 3 of the last exchange are omitted. FIG. 62 illustrates a first exchange phase 1340 of an all-to-all exchange performed on a 15 compute node cascade. In FIG. 62, H represents a home node, and numbers 0-14 indicate a compute node identifier. Arrows indicate the direction in which a pair-wise exchange is performed. The patterns indicated do not change in the process of completing an all-to-all exchange; however, the list of nodes is shifted by one position P/2 times. The following rules define the overall exchange process: 1) Start with 2 lists of the node identification numbers entered in ascending order. If the number of nodes is odd, append a home node to the end of the list. Call these lists List 1 and List 2. 2) Repeat the following ^/2 times: a. Apply Pattern 1 to List 1, then Pattern 2 to List 2. If in the last exchange and the number of nodes is even, apply only Pattern 1 to List 1. b. Reversing the transmission directions, apply Pattern 2 to List 2 then Pattern 1 to List 1. If in the last exchange and the number of nodes is even, apply only Pattern 1 to List 1. c. Shift the order of List 1 up by 1 and the order of List 2 down by 1, holding the last node in its original location. FIG. 63 through FIG. 68 show the remaining six exchange phases 1360, 1380, 1400, 1420, 1440 and 1460 of a 15-node cascade all-to-all exchange, which requires 28 exchange steps. Adding communication channels to the compute and home nodes may speed up the exchange process. Additional communication channels at the home node level may be multiple channels on a single home node or a single channel on multiple home nodes. The sequence of required exchanges remains unchanged and different portions of the process are assigned to different channels, as long as the first two and last two exchange steps are paired on a given channel. In one example, two channels may be used to complete each half of an exchange simultaneously, as shown in Table 20. FIG. 69 and FIG. 70 show two exemplary data flow diagrams 1480 and 1500, respectively, illustrating the first two steps of such an exchange with two communication channels per node. The representation within FIGs. 69 and 70 is the same as used in FIG. 60 and FIG. 61. The primary difference between the exchange of FIGs. 60, 61 and FIGs. 69, 70 is that in FIGs. 69, 70 each node sends or receives two packets at a time, and two data storage locations are used within temporary data storage 1488. In one example of operation, in data flow diagram 1480, a seventh packet of data from diagonal 1486 of compute node 1482(1) is transferred to temporary data storage 1488 and a sixth packet of data from diagonal 1486 of compute node 1482(1) is transferred to a first data location of column 1484 of compute node 1482(6). Similarly, in data flow diagram 1500, compute node 1482(1) receives and stores a packet of data from temporary storage location 1488 into a seventh location of column 1484 and compute node 1482(1) receives and stores a packet of data from a first location of diagonal 1486 of compute node 1482(6) into a sixth location of column 1484. Note in particular that temporary storage location receives two data packets in data flow diagram 1480, and sends these two data packets in data flow diagram 1500.

Table 20. 2-Channel All-To-All Exchange Pattern Table 20 lists the steps of an all-to-all exchange on a 7-Node (φ=3) Cascade where each node has two communication channels. The required 12 steps are split between the channels by assigning Channel 0 to the first two exchange steps of each neighbor offset, and Channel 1 to the second two steps. The sequence of exchange steps for each channel is shown by the number in parentheses. Using the notation introduced above, a 'channel group' may be defined as the ratio ψ/υ, giving the balance between home node channels and compute node channels. In the 2-channel example above, there is 1-channel group since ψ = 2 and υ = 2. Adding home nodes provides more temporary storage locations, and adding more communication channels increases the amount of data exchanged on each step. Thus, a more general equation to describe the number of steps involved in an exchange is: Equation 59. General Pair-wise All-to-All Number of Exchanges 2.{p -l) N, = — — -, iff Hψ = v Hψ ^r This indicates that the number of available channels at the home node level is equal to the number of channels per compute node. Equation 59 may be used to estimate the total amount of time consumed if uniform packet sizes are moved. Let: t_e ≡ the time to transmit a data packet during a single exchange. λ ≡ the channel latency per exchange. D_φ ≡ the data set size. Then t_c, the total time to complete the communication exchange within a cascade, is given by: Equation 60. General Cascade All-to-All Exchange Total Time

But Equation 59 relates Hψ to υ, so Equation 60 may be rewritten as: Equation 61. General Cascade All-to-All Exchange Total Time

Equation 61 may be separated into transmission and latency parts. Equation 62. General Cascade All-to-All Exchange Transmission Time 2£> , -ι) t_C(e) bv Equation 63. General Cascade All-to-All Exchange Latency Time _ 2λ(P₉ -l) ^{cw ~} v As noted above, it may be possible to double the exchange rate for the cascade exchange by using the communication channels in a full duplex-like mode. Incorporating full-duplex channels into Equation 61 yields: Equation 64. Full-Duplex Cascade All-to-All Total Time

This change, which reduces the time by a factor of 2, yields: Equation 65. Full-Duplex General Cascade All-to-All Exchange Transmission Time

Equation 66. Full-Duplex General Cascade All-to-All Exchange Latency Time

It is evident from Equation 65 and Equation 66 that bandwidth and number of channels do not have an interchangeable effect. Increasing the number of channels has the effect of masking channel latency, while increasing bandwidth has no impact at all upon latency. The effectiveness of the general cascade cross-communication model and the Butterfly model may be compared as follows. The Butterfly model is commonly used in algorithms such as a Fast Fourier transform (FFT). The number of exchanges required in a Butterfly model is given by: Equation 67. Butterfly Exchange Total Number of Required Exchanges N_s = P(P -l) Equation 68 and Equation 69, for the Butterfly exchange, are analogous to Equation 65 and Equation 66: Equation 68. Butterfly Exchange Transmission Time Component

Equation 69. Butterfly Exchange Latency Time Component t ₎ = λP(P-l) Both Equation 68 and Equation 69 demonstrate that bandwidth and number of channels represent equivalent tradeoffs for the Butterfly exchange. For example, doubling the speed is equivalent to doubling the bandwidth. However, the Butterfly exchange shows no ability to adjust latency, except by hardware.

Comparison of All-To-All Exchange Methods By understanding relative performance of various all-to-all exchange methods, more efficient methods may be used.

Cascade Versus Butterfly Exchange One advantage of the general cascade exchange model over the Butterfly exchange model is for example shown in Table 21. In particular, Table 21 provides a comparison between the Butterfly exchange and a general cascade exchange for certain numbers of one channel nodes. Cascade Cascade Butterfly Depth Nodes Time Advantage Time Units Units Ratio

Table 21. 1-Channel Butterfly and 1-Channel Cascade Comparison

As can be seen from Table 21, for any number of nodes, the cascade exchange has a distinct advantage over the Butterfly exchange. Equation 10 emphasizes the reasons for this advantage. For the cascade exchange: lf ζ = 0 then N_s = 2(P_φ -\)lυ unused ⁼⁼ " Nused = 2(P_φ-l)/υ * υ (P_ψ+l)/2 = (P_Ψ -l)v Inserting into Equation 10, and performing a little algebra, yields: Equation 70. General Pair-wise Cascade Exchange Entropy a a. la-c βa2s-c = Q' Y_t a2a-c 1

The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Cascade Exchange Entropy = 1, this gives: Equation 71 Data Compression Effects on Cascade Exchange Entropy /= Compression-Ratio As expected, the cascade exchange allows no channels to go unused and, therefore, represents a zero entropy exchange. Channel availability ratio increases with the number of compute nodes and with the square of the number of available channels for the cascade exchange. Repeating this analysis for the Butterfly exchange: N, = P(P-l) Nunused = P(P-l)(P 2-l) Nused = PGP-1)

After substituting these values into Equation 10 and simplifying: Equation 72. Butterfly Pair-wise Exchange Entropy βf b, utterfly ^•*-> "butterfly r. ^' / butterfly *^■ The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Butterfly Exchange Entropy = 1 : Equation 73. Data Compression Effects on Butterfly Pair-wise Exchange Entropy /= Compression-Ratio 5 For the Butterfly exchange, the availability ratio remains unchanged. It essentially operates at a fixed rate of data movement, showing no improvement as more nodes, and thus more channels, are made available on a parallel processing environment. Conversely, the measure of unused channels is optimal only on 2 nodes and gets continually worse as the number of nodes increases.

A True Broadcast Exchange

10 An alternate way of implementing an all-to-all data exchange is to allow each node, in turn, to broadcast to all others, assuming a) that a reliable broadcast protocol is available and b) that the above channel definition holds, either literally or in practice. Applying our analysis tool to the broadcast method: N_s = P

J-J iV unused ⁼ ^ N_used = P²/2 After insertion into Equation 10 and simplifying: Equation 74. Single Channel Broadcast Exchange Entropy p broadcast . ' broadcast ' I broadcast

20 The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Broadcast Exchange Entropy = 1: Equation 75. Data Compression Effects on Broadcast Exchange Entropy /= Compression-Ratio Thus, both the broadcast and cascade methods can be considered zero entropy methods, since 25 both have a β of 0. They also have very similar availability ratios; however, the cascade ratio is consistently larger for a given number of nodes, and it has a quadratic relationship to the number of channels per node. This analysis may be repeated for the hypothetical case of a broadcast method using multiple channels per node:

■2V.J unused ⁼ ^ Nused = P²/2 After insertion into Equation 10 and simplifying, gives: Equation 76. Multiple Channel Broadcast Exchange Entropy

^multi-b ~ r ' Pmulti-b ~~ "' Tmulti-b ~ *^■ The allowance for additional channels helps with the availability ratio but only in a linear fashion. This result indicates that moving data between multiple pairs of nodes simultaneously has a larger impact on channel usage than a broadcast.

A Binary Tree Broadcast A reliable broadcast method is often difficult to implement. The portable versions of the MPI

(Message Passing Interface) library, such as MPICH and LAM/MPI, use a binary tree broadcast method. The technique below uses a Bibonacci tree (as described above) to distribute data from each node in turn to all others. FIG. 71 shows a binary tree broadcast 1520, such as used by LAM/MPI. In binary tree broadcast 1520, data from a broadcasting node 1522 is re-sent or relayed by each node as it is received until all nodes have received the data. For example, broadcasting node 1522 sends data to nodes 1524, 1526 and 1528; node 1524 sends data to nodes 1530 and 1532; node 1526 sends data to node 1534; and node 1532 sends data to node 1536. As each sender completes, another node begins the broadcast process. Computing the necessary elements for the entropy metrics: N_s = log₂P nused = 2 lθg₂P-l Nused = 2 lo&P-l So the metrics are given by: Equation 77. Binary Tree Broadcast Entropy = — 21og 22₂ P - l ₁ β _a - — 21og £2₂ P - l _{γ= l} log₂ P log₂ P

Note that as P gets large, both metrics approach the value 2. The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Tree Broadcast Exchange Entropy = 1: Equation 78. Data Compression Effects on Binary Tree Broadcast Exchange Entropy /= Compression-Ratio

Comparison Summary The results for the various methods are summarized in Table 22, where P = number of nodes; υ = number of channels. Exchange Method α β Cascade I 2 J v² 0 P Butterfly 1 β = --l 2 P True Broadcast 0 2 f(21og₂ P)- 1 ) ((21og₂ P)-l ) MPI Broadcast (Tree) V log₂ P ) I ι° ₂ P J Table 22. Summary of AU-To-All Entropy Metric Results

Manifold and Hyper-Manifold Level All-to-All Cross-Communication In considering the data present on the nodes in each cascade group, each node has all of the data from every node in the group. Thus, data may be exchanged between each cascade group similarly to data exchanges between nodes: each node exchanges only with its corresponding node in the other groups. However, temporary storage locations, such as temporary storage locations within home nodes for a cascade exchange, are not accessible. Therefore, the number of exchanges for a manifold or hyper-manifold exchange differ, depending on whether the number of cascades in the group is even or odd. An even number of cascades allows for the optimum number of exchanges, while an odd number of cascades require more exchanges.

Type I Manifold AU-to-AH Cross-Communication If G is the number of groups, then the number of exchanges required is given by: Equation 79. Manifold and Hyper-manifold Even and Odd Node Number of Exchanges Required 2(G -l), G even N = 2G, G odd While development may continue by branching at each level for both the even and the odd case, the manifold or hyper-manifold design may be such that the number of cascades in a group is always even (as often assumed herein). The exchange process can be organized into 3 distinct steps: 1) All compute nodes in a cascade group exchange among their member nodes. 2) All cascade groups connected to single top level hyper-manifold channel exchange. 3) The nodes on each top level channels exchange. After step 1, each node in a cascade group has a copy of all the data on the group. Thus, step 2 proceeds by having each node exchange only with its counter part in the other groups. When step 2 is completed, each node has a copy of all data from all nodes attached to that top level channel. Finally, in step 3, corresponding nodes attached to different top level channels may exchange data, completing the process. At each step, the data set increases in proportion to the number of nodes, but the number of exchanges needed is lower than if the exchange proceeded node by node (see FIG. 72). FIG. 72 shows one exemplary depth-2 manifold 1540 that has depth-2 cascade groups illustrating manifold level cross-communication. Manifold 1540 cross-communicates by first exchanging data within each group (e.g., home node 1542, compute nodes CO, Cl and C2 exchange data), and then corresponding nodes in each group participate in a group level exchange (e.g., compute nodes CO and C6 exchange data, compute nodes C2 and C8 exchange data, compute nodes Cl and C7 exchange data, and so on). Equation 61 gives the time for step 1. For step 2, the number of cascade groups is given by the total number of channels connected to a top level channel. From Equation 27: Equation 80. Cascade Groups Connected to a Single Top Level Hyper-manifold Channel

In treating the cascade group exchanges, the first thing to note is that the amount of data being moved has increased in size. Since the nodes have accumulated the data from all others in their cascade, data set size has increased to P_φD_φ. The communication time between nodes within the cascade groups on a home node is then: Equation 81. Communication Time Between Nodes Within the Cascade Group

with the explicit understanding that there is an even number of cascade groups. At this point, all nodes attached to a single top level channel have completed a mutual all-to- all exchange. The data set size on each node has increased to P_ψD_ψC. The final exchange occurs between the corresponding nodes on each top level channel. Equation 27 gives an expression for the total number of channels at the top level: Equation 82. Cascade Groups On Single Top Level Hyper-manifold Channel

The time for the top level exchange is then given by: Equation 83. Top Level Exchange Time Calculation P_φD_φC_{l + χ} ) 2(C₂ -l) T ι₃ = — b v The total communication exchange time is then: Equation 84. Total Communication Exchange Time ^•* C ⁼ ^C ~^ * 2 ^"r" -* 3 Equation 84 may be divided into exchange and latency times. Inserting Equation 61, Equation 81, and Equation 83, performing some algebra, and substituting Equation^'27 produces: Equation 85. Hyper-manifold All-to-All Exchange Time 2Z> r ι 2 „ T_e ^ [^p _ClC₂ ~l] = ^-(P_N -l) bv bv and Equation 86. Hyper-manifold All-to-All Exchange Latency Time

Equation 85 and Equation 86 show that exchange time and latency time decrease by a factor of two if full-duplex communication channels are used. Equation 85 and Equation 86 also reinforces that data movement is not accelerated, except by changing the technique, bandwidth, and, more importantly, the number of channels. However, intelligent grouping of data movements leads to latency hiding; the reduction factor being proportional to the ratio of the product of parts over the sum of the parts. An example of such an exchange operation is shown in FIG. 74 for a small manifold with 4 home nodes and 12 compute nodes. Table 23 through Table 26 shows the effects of increasing communication channels and dimensionality of a hyper-manifold, based upon Equation 85. FIG. 73 shows a graph 1560 that plots communication time units as a function of number of nodes, thereby summarizing Table 23 through Table 26 (note that Table 23 values are extended to a depth of 13). The ending value of Table 23 is recomputed for depth- 13 so its line appears on the chart. The slope continuously decreases with the dimensionality of the hyper-manifold, indicating an ability to tune the hyper-manifold based upon the number of nodes and desired cross-communication time constraints. In FIG. 73, all-to-all cross- communication time units are plotted as a function of the number of nodes using the data from Table 23 (line 1562), Table 24 (line 1564), Table 25 (line 1566), and Table 26 (line 1568). The line for Table 25 (line 1566) overlies that of Table 24 (line 1564). Note that Equation 70 through Equation 74 may also apply. As communication channels are added to a hyper-manifold (e.g., hyper-manifold 1450), all channels are used all the time. However, as communication channels are added to the Butterfly model, and there is no change in the Butterfly exchange behavior to utilize all channels all the time. Table 26 illustrates the value of hiding the number of exposed exchanges. For example, if the 6-channel Butterfly exchange used one channel with 6 times the communication speed, the exchange time remains the same. In order to match the manifold performance, the speed of each single channel used for the butterfly exchange is increased by, at least, the ratio shown in Table 27.

Table 23. 1-Channel Butterfly and 2-Channel Manifold Comparison

Table 23 shows a comparison for an all-to-all exchange on a single channel butterfly with an all-to-all exchange on a manifold specified by mo=2, ψ₀=2 and υ=l.

Table 24. 4-Channel Butterfly and 4-Channel Manifold Comparison

Table 24 shows a comparison between a butterfly all-to-all exchange using 4 channels per node and a manifold all-to-all exchange specified by mo=2, ψo=2 and υ=4.

Table 25. 4-Channel Butterfly and 4-Channel Hyper-manifold Comparison

Table 25 shows a comparison between a 4-channel butterfly all-to-all exchange and a hyper- manifold all-to-all exchange specified by mι=l, ψι=3, m₀=l, ψo=2 and υ=4.

Table 26. 6-Channel Butterfly and 6-Channel Hyper-manifold Comparison

Table 26 shows a comparison between a 6-channel butterfly all-to-all exchange and a hyper- manifold all-to-all exchange specified by m₂=l, ψ₂=8, mι=l, ψι=4, mo=l, ψo=2 and υ=6.

Table 27. Speed-up Required for 1-Channel Butterfly to Match 6-Channel Hyper-manifold (m₂=l, ψ₂=8, mι=l, ψ₁₌4, m,,=l, ψ₀=2, υ=6)

Cross-Communication Controlling Factors Equation 27 and Equation 85 are complex and are not readily manipulated to give insight into cross-communication performance as a function of individual variables. Based upon construction of the cross-communication process, it is clear that the channels involved at the manifold level and higher do not play a direct role. The hyper-manifold structure serves to organize groups of compute nodes for the higher levels of cross-communication. The equations show that terms involving the ψβ and m,s collectively form a multiplier which gives the total number of compute nodes for a given cascade depth. Of the other variables in the equations, υ is directly related to compute nodes. Reducing the number of cross-communication exchange steps on any number of compute nodes may improve performance of the parallel processing environment; a cross-communication performance metric may be defined as the ratio of the required number of cross-communication exchange steps to the total number of compute nodes. The smaller this ratio, the faster the exchange. Given the complexity of the equations, Equation 27 is referred to as Total_Compute_Nodes and Equation 85 is referred to as XCom_Time_Units. For large numbers of nodes, the ratio of the two approaches: Equation 87. Total Number of Compute Nodes to Cross-communication Time

Thus, for a given data set size and channel bandwidth, a primary factor that determines performance is the number of communication channels on each compute node. The number of compute node communication channels has a quadratic effect on performance improvement and is independent of the total number of compute nodes, reinforcing the proposition that more channels are better than faster channels. It also indicates that the system architecture utilize a hyper-manifold to control data I/O into and out of the parallel processing environment, while the compute node channels play a dominant role in cross-communication tasks. The limit of Equation 87 is approached in a monotonically increasing fashion. The higher the node count, the closer its value is to the limit. To see this behavior, the performance ratios for several different hyper-manifolds are shown in Table 28, which starts with a depth-3 cascade and extends to 1 and 2 dimensions. The compute nodes may each have 1 or 4 communication channels. The behavior predicted by Equation 87 and demonstrated in Table 28 leads to conclusions. Once the number of compute node channels has been fixed, the hyper-manifold's performance for cross-communication is bounded by a linear scaling law. Compare this to the Butterfly, where the same ratio scales quadratically with the number of compute nodes. Equally important, adding communication channels allows the magnitude of the scaling factor to be controlled, while the Butterfly sees only an initial time reduction from added channels. Scaling a hyper-manifold to higher dimensions adds compute nodes and impacts data I/O and latency hiding, whereas cross- communication costs remain under control.

Table 28. Cross-Communication Performance

The exchange equations illustrate the situation under the assumption that all datasets being moved are of uniform size. If an all-to-all exchange is performed with non-uniform data set sizes, then the exchange times are controlled by the movement of the largest data set at each step. This introduces a timing skew having an additional worst-case latency effect equal to: Equation 88. Timing Skew Calculation p-max φ-ave t = N. bv_n where: Uφ-max Size of the largest data set exchanged. I φ-ave Average size of the datasets exchanged. N. Number of exchange steps performed. Broadcast and Hyper-Manifold All-to-All Exchange Comparison Modern cluster computers provide what are called broadcast methods to implement all-to-all exchanges. A true broadcast allows one node to send its data simultaneously to all other nodes. FIG. 74 shows a data flow diagram 1580 illustrating a single channel true' broadcast all-to-all exchange with four exchange steps 1582(1-4) between four nodes 1584. The modifier true is used to emphasize that data movement to other nodes is in fact simultaneous and direct. FIG. 74 illustrates that the number of broadcasts required to execute an exchange between four nodes is four. Multiple channels may also be utilized in a broadcast exchange, and therefore the communication time for such an exchange between P nodes is given by: Equation 89. True Broadcast All-to-All Exchange Communication Time

As with Equation 61, Equation 89 may be split into separate exchange and latency times: Equation 90. True Multi-Channel Broadcast All-to-All Exchange Time DP ^fB(e) ~ bv Equation 91. True Multi-Channel Broadcast ExchangeAlI-to-AU Latency Time Ratio _ λP

Dividing Equation 62 by Equation 90 gives an expression for the ratio, r_e, of the hyper- manifold exchange method to the broadcast method: Equation 92. Hyper-manifold to True Multi-Channel Broadcast Exchange Time Ratio

Thus it appears that a true broadcast supporting multiple channels has a factor of two advantage over the hyper-manifold. In practice, true broadcast methods are not used since broadcast protocols (e.g., UDP) are unreliable. As described above, broadcast methods may be implemented using point-to-point methods. For now, assume the standard practice of using a single channel and that a reliable true broadcast method is available. Under these conditions, the hyper-manifold does show an advantage: Equation 93. Hyper-manifold to True Single-Channel Broadcast Exchange Time Ratio

This ratio approaches 2/υ as P increase, indicating that the all-to-all exchange with two communication channels is similar to a true broadcast method on a single channel. Any number of channels over two results in a linear improvement over a single channel true broadcast. Hence it appears that any networking protocol that provides reliable point-to-point communication can implement a reliable exchange with true broadcast speed, or better, using two or more communication channels.

Next-Neighbor Cross-Communication A next-neighbor cross-communication exchange is often involved when the problem domain is spatially decomposed and data needed by the algorithm on one node is assigned to some other node. A next-neighbor exchange involves moving some or all local data to every logically adjacent node which needs it. FIG. 75 shows a data flow diagram 1600 illustrating nine compute nodes 1602 and next- neighbor cross-communication. As seen in FIG. 75, next-neighbor exchanges may be considered as a partial dataset all-to-all exchange with some of the cross-communication deleted. This is actually a worst-case situation as it assumes the logical and physical problem discretizations are identical. In general, this is not the case. Logical and physical problem discretizations are controlled by the type of data decomposition and a neighborhood stencil that describes the required exchanges.

Nearest-Neighbor in 2-D FIG. 76 shows one exemplary computational domain 1620 illustratively described on a 27x31 grid 1622. Grid 1622 thus consists of 837 computational elements 1624 that are divided into seven groups 1626(0-6) and assigned to seven compute nodes (e.g., compute nodes 184, FIG. 5). FIG. 76 also shows three exemplary neighbor stencils 1628(1-3) illustrating 1^st, 2^nd, and 3^rd nearest-neighbor cases. In stencil 1628(1), a center element interacts with eight adjacent elements within stencil 1628(1), and thus represents the 1^st nearest-neighbor case. In stencil 1628(2), a center element interacts with sixteen 2^nd nearest-neighbor elements; it does not interact with any 1^st neighbor elements which are masked out in this example. In stencil 1628(3), a center element interacts with twenty-four 3^rd nearest-neighbor elements; it does not interact with 1^st or 2^nd nearest neighbor elements which are masked out in this example. Data may be over-subscribed on each node to provide the additional data for the algorithm.

The additional data is located in ghost cells (also known as boundary cells) around the assigned group (e.g., group 1626(3)) of work cells for a node; these ghost cells are not considered work cells, since they are not directly processed by the node. If the algorithm requires updated values for ghost cells, nodes may exchange ghost cells with other nodes. This may involve moving data to 2, 4, or possibly all other nodes, depending on the needs of the maximum neighbor stencil used by the algorithm. A worst case is a full all-to-all exchange, while a best case is each node exchanging with only two other nodes, a subset of a full or partial all-to- all exchange. FIG. 77 shows computational domain 1620 of FIG. 76 illustrating one exemplary stencil 1644 with sixteen 2^nd nearest-neighbor elements and ghost cells 1642 highlighted for group 1626(3). FIG. 78 shows one exemplary cascade 1660 with seven compute nodes 1662 illustrating a single pair- wise exchange between the logically nearest neighbors of FIG. 77. The exchange uses 4 exchange steps, regardless of the number of nodes in the cascade. Additional exchanges may be utilized to move data for longer range neighbor stencils, but the worst case becomes a full all-to-all exchange. The latency time for this exchange is just 4λ, and the exchange time is given by: Equation 94. Pair-wise Nearest Neighbor Exchange Time AD ^~ bv This method of determining ghost cells for a group is also applicable to irregular meshes, although determination of ghost cells is more complicated. Further, it is also readily extensible to multiple dimensions as the stencil works equally well in volume and hyper- volume application. The entropy effects of the nearest neighbor exchange may be compared using Equation 10. If ζ= 0 then N_s = 4 N_used = 4 IV unused ~ After substituting these values into Equation 3, this gives: Equation 95. Pair-wise Nearest-Neighbor Exchange Entropy neighbor = 1 ^■"^■> β neighbor = 0 ' r / neighbor = 1 The /term may be changed by the use of data compression. The compression ratio has the effect of shrinking the amount of data used. Since /parameter of the Nearest-Neighbor Exchange Entropy = 1: Equation 96. Data Compression Effects of Nearest-Neighbor Exchange Entropy /= Compression-Ratio Nearest-Neighbor in 3-D FIG. 79 shows a perspective view of one exemplary stencil 1680 illustrating nearest neighbors of one cell in three-dimensions. In this example, the nearest neighbor problem is complicated by having many more neighboring cells. Stencil 1680 has twenty-six 1^st neighbors, whereas there are only eight in a two dimensional stencil. A direct exchange of information uses each cell to exchange data with its 26 neighbors. The time required to complete such an exchange is: Equation 97. Half-duplex, single channel 3-D nearest neighbor exchange time.

where d is introduced to represent the duplex characteristic of the channels. For half-duplex d=l, and for full duplex d=2. An approach is to exchange data in planar order; for instance front-back, followed by top-down, followed by left-right. The data collected on cell from one exchange is added to the data that is moved in the next exchange. Thus, while the number of exchanges can be reduced to 6 using a full duplex channel, the total time remains the same: jcquauon vs. i-i) learest-Neighbor Exchange Time for Planar Exchange Method t ^bC = t ^l front-back + ^~ t ^l top-bottom 4 ^~- t ^l left-right

Because 26 has only 2 prime factors (2 and 13), multiple channels can gain advantage only if they split the data set sent to one end point. However, this does not prevent use of arbitrary numbers of channels. Achieving a zero entropy exchange does face one minor difficulty. The cells aligned on the outside of the computational volume have fewer neighbors than cells located inside the volume. The outside cells find themselves with no neighbors to communicate with during portions of the exchange. The total time is still that of the internal cells, but some channels go unused during the exchange rather than be fully utilized. If periodic boundary conditions are used in the calculation, this issue goes away, since the outer surface cells wrap around to communicate with the opposite sides. FIG. 80 shows exemplary periodic boundary conditions for a two dimensional computational domain 1700 where the outer surface cells wrap around to communicate with opposite sides (i.e. the top and bottom, left and right, and front and back cells communicate directly with each other). In FIG. 80, the computation domain has nine cells A, B, C, D, E, F, G, H and I (shown in a solid grid). These cells wrap such that cell G also appears above cell A as shown within the dashed grid. Thus, using the dashed grid, 1^st, 2^nd and 3^rd nearest neighbors are determined for all cells of computational domain 1700. The calculation of the exchange entropy may also consider the cases of an isolated system and one with periodic boundary conditions separately. As an example, consider systems with cubic topology. There are 4 cases to compute: P^1/3 even or odd, and an isolated or periodic boundary system (P being the total number of processors). For each case, the effects of single and dual channels are shown in Table 29.

Table 29. Entropy Metrics for Nearest-Neighbor Exchange.

The /term may be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the 3D Nearest-Neighbor Exchange Entropy = 1: Equation 99. Data Compression Effects on 3D Nearest-Neighbor Exchange Entropy /= Compression-Ratio Next-Neighbor Channel Over-subscription Method Adding channels may improve the exchange time in two ways. First, additional channels allow for simultaneous exchange with multiple neighbors. In the above example, 4 channels allow the entire exchange to occur in 1 exchange step. Adding additional channels allows a reduction in the amount of data that had to be moved over each.

Red-Black and Left-Right as Degenerate Nearest-Neighbor Cases

Red-Black Exchange There are two degenerate cases of the next neighbor exchange: the Red-Black Exchange and the Left-Right Exchange. The Red-Black exchange assumes a checker-board like connection model. FIG. 81 shows a checker-board like model 1720 illustrating two exemplary red-black exchanges. In a first exchange, a node 1722, assigned the color red (shown as a hollow circle in FIG. 81) in model 1720, communicates only with adjacent nodes 1724, 1725, 1726 and 1727 assigned the color black (shown as hatched circles in FIG. 81). Conversely, a node 1732, assigned the color black in model 1720, communicates only with adjacent red nodes 1734, 1735, 1736 and 1737. Thus, in a red-black exchange model (e.g., model 1720), each node exchanges data with a neighbor of a different color; hence, a black node exchanges data with its 4 red neighbor nodes and each red node exchanges data with its 4 black neighbor nodes. The red-black exchange model is a degenerate form of the nearest-neighbor cross- communication method since exchanges with corner neighbors are eliminated. The exchange time amongst the immediate neighbors is given by: Equation 100. Red-black Exchange Time Formula ,_, = * , _{iff v}<₈ ^e bv In order to keep the exchanges within a red-black exchange zero entropy, the number of channels is a factor of 8; that is 1, 2, 4, or 8 channels. Since the next-neighbor exchange takes only , = 4D/bυ communications, it is faster to perform the next-neighbor exchange than the red-black exchange. Entropy effects of the red-black exchange may be compared using Equation 10. If ς = 0 then Ns = 8 N_used = 1 ^un sed ⁼ ' After substituting these values into Equation 10, gives: Equation 101. Red-black Exchange Entropy ^red-black = 1/8) βred-black = 7/8, Jred-black = 1 The /term can be changed by the use of data compression. The compression ratio has the effect of shrinking the amount data used. Since /parameter of the Read-Black Exchange Entropy = 1 , this gives: Equation 102. Data Compression Effects on Red-Black Exchange Entropy 5 /= Compression-Ratio Comparing the next-neighbor entropy values gives: O ighbor Orod-black ^an Pneighbor ^ Pred-black Therefore, the nearest-neighbor exchange is both faster and more efficient than the red-black 10 exchange. Left-Right Exchange FIG. 82 shows a linear nearest-neighbor exchange model 1740 illustrating communication between one red node 1742 (shown as a hollow circle in FIG. 82) and two linear black nodes 1744 and 1746 (shown with hatching in FIG. 82). Linear nearest-neighbor exchange model 1740 assumes that all 15 communication takes place along a line, and is, in fact, the degenerate case of the Red-Black communication model 1720 of FIG. 81. The number of communication exchanges required to perform a linear nearest-neighbor exchange is given by: Equation 103. Left-right Exchange Time Formula

20 t = iff υ < 4 ^e bv In order to keep the exchanges within a linear nearest-neighbor zero entropy, the number of channels is a factor of 4; hat is 1, 2, or 4 channels. As in the red-black exchange, it is interesting to note that since the next-neighbor exchange takes only L, = 4/υ communications, it is faster to perform the next-neighbor exchange than the Linear 25 Next Neighbor exchange for channel numbers greater than 1. The entropy effects of the Left-Right exchange may be compared using Equation 10. If ζ = 0 then N_s = 4 ^■ used ⁼ 1

After substituting these values into Equation 10, this gives: Equation 104. Left-right Exchange Entropy *^■

The /term can be changed by the use of data compression. The compression ratio has the 35 effect of shrinking the amount data used. Since /parameter of the Left-Right Exchange Entropy = 1, this gives: Λ--fuucττ.iΛJiJi ut n Compression Effects on Left-Right Exchange Entropy /= Compression-Ratio Comparing the next neighbor entropy values gives: Equation 106. Left-right to Next Neighbor Entropy Comparison Calculation &¹ neighbor ^> ^left-right r neighbor rleft-right Therefore, the nearest-neighbor exchange is both faster and more efficient than the Left-Right exchange.

Cross Communication Impulses

True Broadcast Exchange Impulse The relative performance of the various communication techniques can be discerned directly from their exchange impulse forms. Given the poor performance of the Butterfly exchange, only the true broadcast is illustrated, the binary-tree broadcast, and the cascade exchange for the two cases of full and partial all-to-all exchanges. The following examples are based on 4 nodes, which is sufficient to highlight the differences in the exchanges. In the full true broadcast exchange, each node broadcasts its data set to the other three nodes simultaneously, as shown in Table 30, which assumes each node transfers 1MB of data to the other three nodes, resulting in a data exchange impulse 1760 shown in FIG. 83. In particular, FIG. 83 shows four sub-impulses 1762, 1764, 1766 and 1768, each having three exchanges and a latency gap 1770, 1772, 1774 and 1776, respectively.

Table 30. True Broadcast on 4 Nodes Analysis Form

A partial all-to-all exchange sends a different subset of a node's data set to each node. A true broadcast uses either an alternative method, or considers broadcasting all data, thus requiring each receiving node to select its appropriate data sub-set. The latter works only if the receiving node is able to calculate the data it is receiving. The effect of choosing the pair- wise exchange is shown, recognizing that the binary-tree method described below could also be used. The information pertaining to the pair wise "broadcast" is shown in Table 31, which assumes each node sends 1MB of data to each of the other three nodes. FIG. 84 shows one exemplary data exchange impulse 1780 for the pair-wise exchange information of Table 31. Data exchange impulse 1780 is shown with twelve sub-impulses 1782(1-12), each with an associated latency gap 1784(1-12).

Table 31. True Broadcast Partial Exchange on 4 Nodes Analysis

If the data subset is 1/4 the amount moved in the full exchange, the impulse width may still be greater, given the increased number of latency times. If the data subset is 1/3 the amount of the full exchange, or greater, then it is faster to broadcast all data, and have each node select its required data from the broadcast.

The Binary-Tree Exchange Impulse The binary-tree full all-to-all exchange is straightforward. Since one node sends all its data to all others, the data set size remains the same during the data movement. Only the number of exchanges during each step increases as shown in Table 32. In a partial all-to-all exchange, the data size does not remain constant. FIG. 85 shows one exemplary binary tree 1800 with eight nodes 1802(1-8). During a first exchange step, node 1802(1) sends four sets of data to node 1802(2). In a second exchange step, node 1802(1) sends two sets of data to node 1802(3) and node 1802(2) sends two sets of data to node 1802(6), having extracted data for itself and for node 1802(5). In a third exchange step, node 1802(1) sends one set of data to node 1802(4), node 1802(2) sends one set of data to node 1802(5), node 1802(3) sends one set of data to node 1802(7) and node 1802(6) sends one set of data to node 1802(8). Continuing with the four node example of Table 32, in the full all-to-all exchange, it is assumed that each node sends 1MB of data to each of the other three nodes.

Table 32. Binary Tree Full All-To-All Exchange Analysis, 4 Nodes

The analysis for a full all-to-all exchange on 4 nodes is shown in Table 32, and the resulting data exchange impulse 1820 is shown in FIG. 86, while the partial exchange is illustrated by Table 33 and the resulting data exchange impulse 1840 is shown in FIG. 87. In particular, data exchange impulse 1820 has eight sub-impulses 1822(1-8), each transferring the same data size and representing 1 exchange time unit (i.e., 0.08 sec. in this example), as shown in Table 32.

Table 33. Binary-Tree Partial All-to-All Exchange Analysis Form

Table 33 assumes each node sends 340KB of data to each of the other 3 nodes. In FIG. 87, data exchange impulse 1840 has eight sub-impulses 1842(1-8); a first sub-impulse 1842(1) utilizes two exchange time units (i.e., each time unit of 0.0136 sec. in this example) to move data for two nodes in a first exchange, and a second sub-impulse 1842(2) utilizes one exchange time unit to move data for the following two exchanges (i.e., the first node to receive data from the first sub-impulse keeps its data and in the second sub-impulse passes data to the next node). ^astauc jϋj cuaiigc ±iuμuise

The behavior of a cascade exchange is different. The same set of 6 point-to-point exchange steps are used, but only the data required by each node is transmitted. Thus the longest impulse width occurs for the full exchange, and the partial exchanges run faster in proportion to the decrease in data set size. The full exchange is illustrated by Table 34 and FIG. 88 and assumes that each node sends 1MB of data to every other node, while the partial exchange is presented in Table 35 and FIG. 89 and assumes each node exchanges 340KB of data with the other three nodes (just over 1/3 the data requirement of the full all-to-all exchange. In particular, FIG. 88 shows one exemplary data exchange impulse 1860 with six sub-impulses 1862, 1864, 1866, 1868, 1870 and 1872, each having two exchanges. FIG. 89 shows one exemplary data exchange impulse 1880 with six sub-impulses 1882, 1884, 1886, 1888, 1890 and 1892, each having two exchanges.

Table 34. Cascade Full Exchange Impulse Analysis Form

Table 35. Cascade Partial Exchange Impulse Analysis Form

Comparing the Entropy of All Exchanges This completes the analysis of the various exchange methods. Table 36 summarizes the entropy found in those exchange methods. Note that these exchange entropy calculations assume a distributed network cluster using a cross-bar type switching fabric.

Table 36. Exchange Entropy Summery Overlapping Cross-Communication with Computation Achieving high efficiency in distributed parallel processing systems makes sure the parallel processors are performing computation. Actions performed by the parallel processors that may not be performed when the same algorithm is executed on a single processor contribute to inefficiency. This principle is exemplified in the parallel speed-up factor defined by the equation for Amdahl's Law, which is presented in Equation 1 and is repeated below for convenience. Equation 107. Amdahl's Law Restatement

where: S ≡ System speed-up p ≡ fraction of parallel activity at the algorithm level q ≡ l-p s fraction of serial activity at the algorithm level P ≡ number of processors The speed-up factor computed in AmdahFs equation is defined relative to an equivalent single processor execution. This equation separates processing activity in a parallel system into two components: parallel and serial. For the purpose of this discussion, the parallel activity is defined as any action that may be performed in the equivalent single processor execution. Conversely, the serial activity is defined as any action performed in the parallel execution that is not present in the single processor execution. One serial action that certain types of parallel algorithms perform is to exchange the results of intermediate computations among processors. This is termed cross-communication and can be a source of serial activity. The typical execution flow for these types of parallel algorithms is to compute, synchronize, cross-communicate, and repeat. The sequential nature of the cross-communication makes it explicitly clear it is a serial activity. For example, many numerical algorithms represent the problem domain as a grid of data. Each data point in the grid represents an approximate solution. An approximate solution is computed for each data point from its current value and those of selected neighboring points. The computations are iterated until the approximate solution converges to a desired degree of precision. When these types of algorithms are implemented in parallel on P processors, the grid is partitioned into P non-overlapping sub-grids. Each processor is responsible for computing approximate solutions for each point in its sub-grid. Since the computation requires values from neighboring points, the points on the perimeter of the sub-grid require values of the corresponding approximate solution from one or more adjacent processor. The points that are located on adjacent processors are herein referred to as overlap points. In order to facilitate the exchange of intermediate results with neighboring processors, the sub-grids are expanded to include the overlap points. FIG. 90 shows one exemplary 2-D grid 1900 divided into nine sub-grids 1902(1-9) for distribution between nine compute nodes (e.g., compute nodes 224, FIG. 7). For sub-grid 1902(5), points 1904 (shown within ovals) represent data that is exchanged with other compute nodes that are processing adjacent sub-grids (i.e., a next neighbor exchange). In this example of 2-D square grid 1900 partitioned into square sub-grids 1902, the number of interior points scale as the square of the sub-grid's dimension, while the number of overlap points scale linearly. As such, the dimension is computed of a sub-grid for which the cross-communication time can be totally hidden by the interior computation time. In this example, an equation is derived to compute the sub-grid dimensions needed to hide the cross-communication in a next-neighbor exchange. The cross-communication used to exchange data (indicated by overlap points 1904) between compute nodes is an obstacle to parallel efficiency; the main reason is the serial nature of the computation and cross-communication activities. Consequently, an advantage may be gained by temporally overlapping these two activities. By balancing these two activities, both can complete in the time it takes a single one to execute. A general approach is to restructure a parallel algorithm to compute the overlap points first, pass these points off to an I/O device for delivery to the neighboring processors, and then compute the interior points. While the interior points are computed, the I/O device exchanges the overlap points with the neighboring sub-grids. By the time computation of the interior points is complete, the overlap points from the adjacent processors are available for the next iteration, and the sequence repeats. FIG. 91 shows the 2D grid 1900 of FIG. 90 with internal points 1922 highlighted. The maximum time allowable for the exchange of data associated with overlap points 1904 is given by the time required to process all data associated with internal points 1922: Equation 108. Maximum Time for the Exchange of Overlap Data Formula

where: N ≡ dimension of the sub-grid t_c ≡ time to compute one interior point. The time required to exchange data associated with overlap points 1904 is given by: Equation 109. Overlap Data Exchange Time Formula _ S(N + l)D_p ^{e ~} b where: D_p ≡ data set size associated with the exchange of one point. b ≡ channel bandwidth. By setting Equation 108 equal to Equation 109 and solving for N, the size required to assure the computation time is longer than the overlap exchange time can be estimated: Equation 110. Dataset Size to Computation Comparison

no -,₄ αu_Uu i ±\J siivjwo, as ιuug » the dataset is large enough, or the interior compute time is sufficiently long, then the entire time for cross-communication can be hidden. One consequence of this relationship is that the choice of N is limited by the desired amount of cross-communication exposure. This process also generalizes to higher dimensions and provides a means to hide cross- communication time for problems with N-dimensional neighborhoods. Using Howard-Lupo manifolds, the exchange can be optimized for maximum efficiency. One possible criticism is that the separation of the sub-grid points into interior and overlap regions complicates the way in which an algorithm sequences through the data points. This might make it more difficult to take advantage of the most efficient caching strategies. Indeed, this gives up some computational efficiency, but given the dominant nature of cross-communication, the gain from overlapping the operations may be larger than any loss due to non-optimal caching strategy. After all, if parallel efficiency were simply a function of caching strategy, then the problem of parallel efficiency may not be an issue. To gain maximum benefit, this technique uses dedicated hardware. The challenge of parallel processing is to overlap as many operations in time as possible, and given the advantage of this technique, the cost of some dedicated hardware can be justified. The hardware has intelligent I/O channels with ability to directly associate application memory space with individual logical connections on the physical I/O channels; it also provides a path between application memory on one node and application memory on another node. The application memory is for example dual-ported so that the I/O channel and the CPU can access it independently. The I/O channel accesses application memory to exchange the overlap points at the same time that the CPU is accessing memory to compute the interior points. In addition, the I/O channel may efficiently access non-contiguous memory. Since N- dimensional gridded data, which is logically accessed as an N-dimensional array, is actually stored in a linear 1-D array of physical memory, some of the overlap points will be non-contiguous. In the 2-D case of FIG. 91, the overlap points 1904 for the left and right neighbors are columns in the 2-D array of data. Given a row-major storage format for the array, the columns have a stride value equal to the length of a row. By having the channel hardware apply a stride value to its memory accesses, it can access non-consecutive data as efficiently as consecutive data. Although this type of operation is not uncommon for contemporary DMA controllers, the key is to have the channel's DMA intimately coupled with the CPU and its application memory. One additional issue that is addressed is that of operating systems that map memory. In most contemporary systems, application memory is mapped. The operating system allocates physical resources, and in the case of application memory, it can be swapped out without an application's knowledge. Therefore, the memory locations that contain the overlap points are locked in place when the I/O channels are setup. Since distributed parallel compute nodes are usually dedicated to a single parallel application, this is a case where memory-mapped operating systems might not be ideal. From a computational efficiency point of view, virtual memory swapping is anathema. As long as there's enough physical memory to contain it, the data for a parallel application should be memory resident throughout the execution time. Even a small amount of swapping is enough to destroy any parallel efficiency, so it should be avoided. In order for the intelligent I/O channels to work with a mapped memory operating system, care is taken to make sure the application memory containing the data associated with the overlap points (e.g., overlap points 1904) remains memory resident. In order to achieve maximum benefit from the Howard-Lupo manifolds, each physical channel may have its own dedicated intelligent I/O channel hardware. Therefore, for the example of an 8-neighbor exchange shown in FIG. 90, having 4 physical channels on the compute nodes allows 4 simultaneous exchanges at a time. There may be a small amount of contention due to arbitration for the memory bus by the I/O channels, but this is negligible since the bandwidth of the memory bus is likely greater than the communication channels themselves. A beneficial side effect of the intelligent I/O channels is a reduction in commumcation latency. Since latency is determined by the total time it takes to get data from source memory to the channel, send it over the channel, and have it available in the destination memory, these direct memory links may reduce it. It takes some time to set up the I/O channels so they may not be suitable for algorithms that repeatedly make and break connections. However, for algorithms that require persistent N-neighbor connections and compute an iterative solution, the I/O channels provide a substantial advantage. With the advances in Field Programmable Logic Devices (FPGA), the hardware for the intelligent I/O channels may be implemented in a cost effective manner.

RAM Size as a Determiner of Computational Job Size Implicit to the various node-to-node communication techniques discussed above is knowledge that each technique moves either the partial dataset or the full dataset. Moving a partial dataset takes place within functions such as a matrix transpose which is used in such algorithms as the multidimensional FFT, Wavelet, and Cosine transforms. The full dataset exchanges are used to compute long-range forces like gravitational, electro-static, or photonic forces used in various simulations. Full dataset exchanges require each processor to have access to all of the data that is exchanged. In the case of an all-to-all exchange, all nodes need access to the full exchange datasets of all other nodes. This increases the random access memory requirements of algorithms performing such exchanges. Since the primary cost item within a shared memory machine is the cost of the memory, decreasing the amount of memory required to perform a given job introduces a negative cost driver for such systems. A method of decreasing the memory requirements of such systems without changing the underlying algorithms is now introduced.

Lower Tier Cache Compression/Decompression Since the performance difference between having Ll and L2 Cache versus Ll Cache alone may be ten percent or less, performing compression between RAM and the L2 Cache may give mimmum system performance differences. It should be noted that if there are higher levels of cache than L2, then the highest level cache in the system should be the one that is in direct communication with the compression/decompression portion of the system, if system processing performance is an issue. Put another way, the cache closest to the RAM is the cache in communication with the compression/decompression portion of the system. This is because the effect of adding time to the highest level cache also has the least affect on the total system performance. Placing the compression at the lowest cache level allows the RAM to decrease its frequency (and associated heat) while maintaining the same effective bandwidth, as long the clock rate decrease is less than or equal to the compression ratio. FIG. 92 shows a schematic diagram 1940 illustrating standard data/address path connectivity between (a) a microprocessor 1942 with registers 1944 and a Ll cache 1946, (b) a L2 cache 1948, (c) a RAM 1952 and (d) other circuitry through bus interface 1954. The compression data/address path goes from the processor 1942 to the L2 cache 1948 and if required from the L2 cache 1948 to compression/decompression hardware 1950 and then to the RAM 1952. FIG. 93 shows compression/decompression hardware 1950 in further detail. Compression/decompression hardware 1950 has an address conversion table 1962 and a compressor/de-compressor 1964 that contains appropriate hardware and firmware to operate a compression/decompression algorithm for example. If instead the compression/decompression system is placed between the Ll cache and the registers then the total cache size as well as the RAM size can be decreased as a function of the compression ratio. FIG. 94 shows a schematic diagram 1980 illustrating a processor 1982 with registers 1984, Ll cache 1986, L2 cache 1990, a compressor/decompressor 1988 located between registers 1984 and Ll cache 1986, and a bus interface 1994. With this location of compressor/decompressor 1988, both a RAM 1992 and processor 1982 may decrease there respective data transfer clock rates without decreasing overall performance, providing the decrease in clock rate is less than or equal to the compression ratio of compressor/decompressor 1988. Decreasing the clock rates also decreases the amount of heat generated by both RAM 1988 and microprocessor 1982 proportional to the compression ratio. Cache Line Compression Issues When recompressing a cache line so that it can return to the RAM there are three possible cache line compressed size outcomes: the cache line returns to the original size, the cache line is larger than the original size, or the cache line is smaller than the original size. Both the address conversion and the compression/decompression are affected by the compression method. Below is a discussion of various compression methods and there effects upon the address conversion and the compression/decompression capabilities.

Recompressed Cache Line the Same Size as Original If the recompressed cache line is the same size as the original cache line then the recompressed cache line can return back to the original RAM location. Recompressed Cache Line Greater Than the Original If the recompressed cache line is greater than the size of the original cache line then the recompressed cache line is placed in a larger data space and a bitmap is updated to indicate the new data location, ine Ditmap may, tor example, represent an address mapping table for cache lines within the cache.

Recompressed Cache Line Less Than the Original If the recompressed cache line is less than the size of the original cache line then the recompressed cache line can return back to the original RAM location, with the RAM hole noted for use by additional data.

Various Compression/Decompression Methods There are two general types of compression codes lossless and lossy. The following codes are discussed: o Lossless Codes - Huffman Codes, Arithmetic Codes, Dictionary Compression, o Lossy Codes-Quantization Codes.

Lossless Codes There are several advantages to lossless compression codes, the most important being no need to worry about the bit accuracy of the results, by definition the bit accuracy remains the same. However, lossless compression ratios may be less than that found with lossy codes.

Huffman Codes - The simple Huffman code generates a binary tree whose hierarchy determines the encoding length. An example of this table is given below in Table 37.

Character Huffman Bit Code ASCII Bit Code A 111 01000001 B 110 01000010 C 10 01000011 D 0 01000100 Table 37. Example Huffman Code Table

If the input string is:

DDDCCBADDDCCBADDDCCBADDDCCBA

Then the Huffman Output is: 0001010110111000101011011100 01010110 1110001010110111

And ASCH Output is: 01000100010001000100010001000011010000110100001001000001 01000100010001000100010001000011010000110100001001000001 01000100010001000100010001000011010000110100001001000001 01000100010001000100010001000011010000110100001001000001 As can be seen the Huffman output is about 24% of the ASCII output. If the cache is 4 characters wide then the following index table may be generated at compression time which maps the

Huffman output to the character location. Character Number ASCII Bit Huffman Bit Location Locations 1-4 1-32 1-5 5-8 33-64 6-14 9-12 65-96 15-20 13-17 97-128 21-28 18-21 129-160 29-36 22-25 161-192 37-42 26-29 193-224 43-52 Table 38. Example Huffman Address Conversion Table

It is clear that the Huffman encoding is a function of the cache size. This means that the entire cache is read/written at a time. A variation of this is to evenly divide the cache into smaller sections. This allows the cache segments to be read/written allowing for more efficient cache utilization. Utilizing Table 38, a cache may be filled/emptied from/to a compressed RAM region. An adaptive Huffman code generates the Huffman output on-fhe-fly and has no need to ship this table.

Arithmetic Codes The arithmetic coding method of compression is a variable length compression method very similar to the Huffman coding method. Rather than creating a binary tree from the probability distribution this method calculates an upper and lower bounded binary fraction between the numbers 0 and 1, giving a higher compression ratio. Below is three letter code with the following probabilities of occurrence: X = .7, Y = .12, Z = .1. Now the following pattern is compressed as shown in Table 39: XXXY

Letter Lower Bound Upper Bound Base 10 Base 10 X .30000 1 X .51000 1 X .65700 1 Y .71874 .7599 Table 39. Arithmetic Compression Table

Since the lower bound is .71874ι₀ = 10110111111111111₂ and the upper bound is .75990₁₀ = 11000010100001₂ then the bits are compared (from left to right) until a difference is found. Once a difference is found checking stops and a 1 is added. This looks like the following: Compare Direction 10110111111111111 11000010100001000

Difference This gives 11 as the compressed value.

Character Number ASCπ Bit Arithmetic Bit Location Locations 1-4 1-32 1-2 Table 40. Example Arithmetic Address Conversion Table Since this compression method is defined for a group of characters, as long as that group size corresponds to the cache line size and as long as this includes a proper escape character it is possible to treat this compression analogously to the Huffman coding method. Table 40 shows an exemplary arithmetic conversion table for the above example.

Dictionary Compression Methods Dictionary compression techniques include Lempel-Ziv, Lempel-Ziv-Welch (LZW), and

Lempel-Ziv-Storer-Symanski (LZSS). Because of the similarities in the techniques, only LZW is described because of its simplicity. LZW builds a dictionary of' vords"'by starting at the beginning of the data and comparing other characters to the saved dictionary items. Once the words are detected then they are tracked. An index to the dictionary words then replaces the words in the document. For example: ABAXABBAXABB is transformed into Table 41.

Dictionary Index Dictionary Words 1 A 2 AB 3 BA 4 AX 5 XA 6 BB 7 BAX 8 AXA 9 ABB A AXABB B BAXAB C XABB Table 41. LZW Dictionary Table

So using the largest possible dictionary words first gives: 2AA Which reduces the dictionary to:

Dictionary Index Dictionary Words 1 AB 2 AXABB Table 42. Reduced LZW Dictionary Table And translates the text to: 122 If the dictionary word size can be no larger than the cache line size then, at most, a cache lines worth of information is available. Dictionary address conversion table can thus be built analogous to that found in the Huffman codes.

Character Number ASCπ Bit Dictionary Bit Location Locations 1-12 1-96 1-3 Table 43 Dictionary Address Conversion Table

Lossy Codes With lossy codes the bit accuracy may also be considered. This is because all lossy codes may be considered a type of data rounding. The bit loss cannot exceed the bit accuracy requirements of the target algorithms. The algorithm required bit accuracy is know a priori, thus the bit loss tolerance can be computed a priori to compressing. With these caveats in mind, it is possible to maintain compression ratios that exceed what is possible with lossless compression techniques.

Quantization Codes In quantization coding, each real number x is replaced with an integer i such that mi + b is as close as required. FIG. 95 shows a graph 2000 illustrating exemplary quantization where m represents a quantization size. The larger the quanta the more compression and lower the computational accuracy. A function which leads to a distribution can be used to generate quanta. Some function examples include: the Gaussian distribution function, wavelet transform, Poisson distribution function, discrete cosine transforms, and Fourier transforms. If the quanta are evenly distributed between the range x_min> x_max or the logarithm of the quanta are evenly distributed then the median error is m 4 and half of the values are closer than m/4 while the other half are less than m/4.

Multibit Techniques to Decrease Chip Level I/O, Address Lines

As discussed above, there may be advantages to using compression between the registers (e.g., registers 1984), RAM (e.g., RAM 1992), and both the highest and lowest level of cache, that is, between the Ll cache (e.g., cache 1986) and the registers or L2/L3 cache (e.g., cache 1990) and RAM. Two areas not addressed by conventional compression techniques are the number of I/O channels required to support higher precision memory (whether in RAM or cache) and the number of address channels required to handle denser memory. For example, a 16 Mbit memory chip configured as 4Mx4 (which means 4,194,304 addresses with 4 bits per address) requires 22-bit (to form 2²² addresses) address lines to access the memory. This used to be arranged as 2 11-bit addresses, one for row, and the other one for column. So 11- address lines plus 1 -control (row, column selector) line are required. Because of the performance hit (it takes twice as many access cycles) this is now generally avoided and the full compliment of address bits are used. For a modern computer system 32 address lines are usually required, this forms 2³² or 4 Gb of addressable memory. In addition to the address lines, I/O channels are required. For a 64-bit system this means 64- input data lines and, for dual ported memory, 64-output data lines. So to address and use modern dual ported memory requires a minimum 160 data lines. Each of these data lines is maintained across the entire motherboard including the various system buses, increasing the motherboard complexity and cost.

Waveform Analysis According to Shannon, as the information content increases the data looks more and more like randomness. This holds true even at the single bit level. Since a single binary digit is encoded as a single wave pulse and further that wave pulse only has 2 states that correspond to a 0 or 1 condition, then a single wave pulse is highly deterministic. The Shannon's channel capacity equation below shows why:

Let: p_x ≡ the probability of outcome x Log₂ ≡ the logarithm with a base equal to the number of possible bit states So: Equation 111. Shannon's Channel Capacity H = pi log₂ (1/pι) + p₂ log₂ (I/P2) +• -+p_n log2 (1/pn) bits per unit time Since a single binary digit means: x = 2

Then: H = V2 log₂ (1/ Vi) + Vi log₂ (1/ Vi) = Vz \og₂2^l + Vι \og₂2^l = 1 bit, the smallest quanta of information Since the 0 and 1 condition represents one transition and that a single transition is the smallest possible quantum of information. Thus a binary representation has the lowest possible information carrying capacity. As can be seen from the above example, the sum of all of the probabilities of outcome (ρ_x ) is unity. Further, if the probability of outcome is an even multiple of some number of transitions, that is 2^X, then that multiple can be algebraically extracted from the log₂ portion of the equation leaving only the x term. This is used later to simplify Shannon's channel capacity equation. The reason for its wide use is its ease of use and ready translation into electronic hardware. FIG. 96 shows two exemplary waveforms 2020, 2022 that represent either the value 1 or the value 0 depending upon interpretation. These 2 values may be readily translated into a simple hardware switch. Representation of data may be 'ori = 1, 'off = 0, or 'ori - 0 and 'off = 1. By modulating the waveform in one or more of the ways, as shown in FIG. 97, the possible carrying capacity of the bit area may be increased. As shown in FIG. 97, there are five types of primary waveform transition types 2040, 2042, 2044, 2046, 2048 that can be used. By using both the positive 2040, 2042, 2044, 2046, 2048 and the negative 2050, 2052, 2054, 2056, 2058 versions of the transformations, this can increase to ten waveform transition types. FIG. 98 shows three schematic diagrams 2060, 2062, 2064 illustrating three exemplary circuits for encoding data into a signal. Diagram 2060 shows one circuit for encoding amplitude by varying voltage levels using three resistors and a switch. Diagram 2062 shows one circuit for a high pass filter that may encode phase changes into a signal. Diagram 2064 shows one circuit for a low pass filter that may encode phase changes into a signal. Diagrams 2060, 2062 and 2064 do not represent full circuitry for encoding data. In addition to each waveform carrying multiple bit information it may also be possible to have multiple parallel waveforms. These multiple waveforms can and often do, have a skew factor associated with them. FIG. 99 shows four exemplary waveforms 2080, 2082, 2084 and 2086 illustrating skew. Waveforms 2080 and 2084 show waveforms without skew; waveform 2082 shows a waveform with a half phase forward skew; and waveform 2086 shows a waveform with a half phase backward skew. If a bit relative skew factor that is less than the waveforms minimum frequency is induced, then each parallel waveform may be further encoded using the relative skew time between them. Each relative skew time is independent of all other waveform encoding and so can be considered independent dimensions. Since only the relative skew matters, it is unnecessary to have a continuously increasing skew between the parallel waveforms. Below shows the relative skew waveform encoding dimension count equation. Let: S = The number of parallel bits with per relative skew Then: Equation 112. Relative Skew Dimension Count Equation D, = S-1 Positional skew occurs because there is a known time when the waveform should arrive. If it arrives a little before or a little after the expected time, this is called positional skew. If instead of relative skew, positional skew is used, it gives: Equation 113. Positional Skew Dimension Count Equation D_S = S Finally, if the data is compressed prior to encoding, the compression ratio can be considered an independent dimension. Combining Shannon's bit channel carrying capacity equation with multiple dimensions of encoding plus compression gives: Let: Cr ≡ The dataset compression ratio t = The number of transitions D ≡ The number of independent dimension Ds ≡ The number of independent relative skew dimensions Then: Equation 114. Augmented Channel Capacity Equation H= (l/Cr)(_Pl log₂ (1/pι) + P₂ log₂ (l/p₂) +..+p„ log₂ (l/p„)) If the probability of each outcome is the same then the sum total of all possible outcomes equals 1. Further, if those probabilities are evenly divisible by a binary transition count, it gives: H = (l/Cr) log₂ (l/_Pl) = (l/Cr)log₂ 2^t Since each waveform type represents different dimensions of encoding, the above definition with an added number of dimensions, assuming that each dimension holds the same number of transitions, then: H = (l/Cr₁) log₂ 2^tD So, adding relative skew dimensions means:

Therefore: Equation 115. Simplified Channel Capacity Equation H = (1/Cr) log₂ 2^(Ds)tD = D_s(tD)/Cr By encoding two transitions on each of the twelve possible waveform types, with no compression, the following bit carrying capacity is available: H = log₂ (2^2*12) = 24 bits Adding a 2:1 compression ratio gives: H = (1/ Vi ) log₂ C2²⁴) = 2(24) = 48 bits

Analysis of Communication Phases Effects of manifold and hyper-manifold cross-communication and overlapping communication may be combined with computation to enhance cross-communication performance for transpose type partial dataset and other types of all-to-all exchanges. Equation 110 shows how much computation per data size is present in order to show advantage. This equation implies that an initial amount of work is ready to compute prior to performing the overlapped computation and I/O activity. In general, overlapped cross-communication and computation may assume two distinct relationships, or phases. FIG. 100 shows an alpha phases 2100 and a beta phase 2110 for overlapped computation and I/O illustrating communication in terms of the total commumcation time (t_c), the priming time (f), the overlapped time (t_c-f), and the processing time (t_p). Note that t_p may be obtained by profiling the algorithm on a representative node as described below. Conventional programs are in a sequential phase since most processing and commumcation occurs in turn. Scaling of these programs is controlled by the sum of all the communication time and latencies over the course of the process. Therefore, if the communication time is a significant portion of the compute time, the process is communication dominated and scaling is severely limited. The Alpha phase 2100 is similar to the sequential phase. It also occurs when the communication time is greater than the processing time, but allows for the possibility some significant portion of the communication can be overlapped by computation. The scaling of such a system is controlled by the initial latency and priming time, plus the sum of the exposed communication time. Thus while a problem may be described as communication dominated in terms of total processing and communication times, the overlapped operations have a strong beneficial impact on scaling. The Beta phase 2110 occurs when the overlapped communication and latency has been reduced to only the initial priming time plus initial latency. Scaling actually improves as the process runs longer, since the exposed communication time becomes a smaller and smaller fraction of the total time. The exchange process can be organized into only 2 distinct steps: 1) Priming Step — All compute nodes in a cascade group exchange among their member nodes. With the member node count and the number of cascade groups selected to meet the Equation 110 criteria. 2) Overlap Step — All cascade groups connected to single top level hyper-manifold channel exchange data while the additional computation is performed upon the prior exchanged data. Communication configuration required to construct an alpha phase commumcation is shown below. FIG. 101 shows a two level hierarchical cross-communication model 2120 with four cascade groups 2122(1-4) each having a home node 2124 and three compute nodes 2126. At the lowest level, each cascade group 2122 performs cross-communication in parallel. If the algorithm is designed such that the data set exchanged at the cascade group level equals the priming data set, then all of the higher level cross-communication is hidden. This gives: Equation 116. All-to-All Exchange to Overlap I/O Exchange Conversion lim T = t = f D'→Dφ ^c where D' is the priming data set. Equation 116 means that at the limit overlapping all-to-all communication and computation converts into a standard all-to-all exchange with the maximum exposed communication time.

Alpha-phase to Beta-phase Conversion Barrier As described above, transitioning from Alpha-phase to Beta-phase generates benefits in terms of maximizing the scaling of a parallel system. Though it is beneficial to be in Beta-phase and not Alpha-phase, there is difficulty in transitioning into Beta-phase. As shown in FIG. 100, for a given algorithm, alpha phase 2100 occurs when the time for processing data is less than the time to transfer the data to be processed and beta phase 2110 occurs when the time for processing the data is greater than or equal to the time it takes to transfer the data. Here, performance of either processor speed or communication channel speed does not matter; their performance ratio is more important. Entropy metrics and Equation 10 are used to develop an expression for the relationship between channel speed and processing speed. Given that exchange steps are constant in time at a given level within a hyper-manifold (e.g. cascade group, manifold group), the effective bandwidth of a systems may be expressed as: Equation 117. Effective Bandwidth From Entropy a Metric N b

Since the γ metric relates the data moved to the data required by the algorithm, communication time may be expressed as: Equation 118. Total Commumcation Time in Terms of Entropy Metrics (vD \ t_c = { —ab ) The conditions for being in Alpha-phase or Beta-phase can be written as: Equation 119. Alpha- and Beta-phase Conditions Alpha - phase : t_c - > t Beta - phase : t_c — ≤ t Substituting Equation 118 into Equation 119, and, making note of the fact that the priming time (?) and the processing time (t_p) can be determined by profiling, gives: Equation 120. Lupo-Howard Phase Barrier Equations

Alpha - phase : — - > — (t_p - )

Beta -phase : ^- ≤ —{t -f)

To account for the changing data set sizes transferred between levels of a hyper-manifold, the notation is changed to take the levels into account. Letting n indicate the level, gives: D_a —> D„ The algorithmically required data set size moved during the level n exchange steps. tp ^→ t_p(n) The processing time at level n. t' —> t'(n) The priming time at level n. The general statement for the phase balance equations becomes: Equation 121. The General Lupo-Howard Phase Barrier Equations Alpha-phase : -r > - (t_p{n)-t(n)) n " i n Beta-phase : ∑^ ≤ ∑ (, („)-,^•(„)) _n b γ „

Alpha and Beta Phases for Embarrassingly Parallel Problems So called embarrassingly parallel problems contain a large number of commercially useful algorithms, for such fields as oil and gas exploration, database searches, and image processing.

Although the processing scales perfectly, the end-to-end speedup of such jobs may not scale well since such systems often become I/O bound. The behavior of such algorithms exhibits characteristics of an Alpha-phase non-embarrassingly parallel problem. Viewing I/O as a special case of communication, the I/O problem may be treated as a cross-communication problem; the only differences are the types of "exchanges'' used. I/O communication may be one of the following: 1) Type I Input/Output and Howard Cascades 2) Type I Agglomeration 3) Howard-Lupo Manifolds 4) Hyper-Manifolds 5) Type II Input/Output 6) Type Ila 7) Type lib 8) Type III Input/Output 9) Howard-Lupo Type Illb Manifold Since these are all zero entropy communication models, described above, and further, since these models cover collapsing and non-collapsing data structures I/O may be considered analogous to the cross-communication model. All discussions of manifold or hyper-manifold communication can be taken to mean either cross-communication or I/O usage. Converting to Beta-phase, where there is perfect overlap between I/O and computation, allows data bound embarrassingly parallel algorithms to scale.

Laws Governing Type II Manifold Hyper-Manifold All-to-All Exchange or Type I/Type H/Type UI I/O Commumcation Time Masking

The laws governing time masking by overlapping communication with computation allow computational systems to scale to extremely large numbers of nodes. Type H Manifold/Hyper-Manifold All-to-All Exchange or Type Type ϋ/Type m I O Communication Time Masking and Amdahl's Law Given the ability to speed up communication, something interesting occurs within the details of overlapping communication with calculation. Start with a single process, and consider what might occur at the beginning and end. JL VJ. lui auυws a u e grapn i U illustrating details of overlapped communication and calculation. Data input to a process involves some total input time 2141 broken down into 3 time intervals: some inherent system latency 2142, a data priming time 2144 which represents moving enough data to allow the processing to start, and input data transfer 2146 fully overlapping the processing 2148. The analogous situation occurs at the end of processing 2148: a total data output time 2149 has some inherent system latency 2150, a period of overlapped output data transfer 2152 with processing 2148, and a drain time 2154 representing the transfer of the remaining data after processing 2148 has stopped. This procedure has essentially hidden some input data transfer 2146 and output data transfer 2152 time plus one output latency period 2150; exposing only one input latency period 2142 and priming time 2144. Total processing time may be given as: Equation 122. Total Time for Process with Overlapped Cross-Communication and Processing T = t_λ + t'+t_p +t_d where: T ≡ Total procedure time. t ≡ Exposed latency time. t' = D'/bo ≡ Priming time t_p ≡ Processing time. t_d = D/bv ≡ Drain time. FIG. 103 shows a time graph 2160 illustrating details of overlapped communication and calculation with periods of cross communication. As described above with respect to FIG. 102, data input to a process involves latency 2142, data priming time 2144, and data input transfer 2146 that overlaps processing 2168. At the end of processing 2169 a latency 2150, a period of output data transfer 2152 overlapped with processing 2169, and a drain time 2154 represents the movement of the remaining data after processing 2169 has stopped. In addition, one exchange of data between nodes is shown. The output from one node represents input to another node, and the priming time then means the amount of data required to keep prevent the receiving node processing from halting. For the exchange time, lead time 2165 (r) represents the time from the start of the exchange until the end of the processing step. In FIG. 107, lead time 1265 is long enough to allow for the latency period 2162 and the priming period 2164 to complete just as the next processing step 2169 is ready to begin. A longer lead time has no impact, but a short lead time implies processing step 2169 waits until at least priming data 2164 has been received. This induces dead time in the processing stream and leads to an increased total time. If there is a total of N processing steps (e.g., processing 2168, 2169) each joined by a communication exchange (e.g., latency 2162 + priming period 2164 + transfer period 2166), then the total processing time is represented by: Equation 123. Total Time for Process with Overlapped Cross-Communication and Processing

T = t_κ +

+ max(t_u ^! ₊₁ - f_m ,θ))+ t_pN + t_d Implementation requires that function profiling (described below) provide information on processing time, while algorithm design provides information on the t's and t_ds. Note that the exposed times may be separated from the processing times to rewrite Equation 123 as: Equation 124. Total Time as Sum of Exposed Communication Elements and Processing Time T = Ω + T_p where Ω = t_κ + ,θ))

If all of the exchange steps are fully overlapped, then the sum in Ω goes to 0 leaving only the leading latency (e.g., latency 2142), priming time (e.g., priming time 2144), and drain time 2154. This holds true for any number of nodes, so leading to an equation for speedup as: Equation 125. Speedup Prediction Overlapped I/O Timing, Amdahl's Law

where P = Number of processors S ≡ Speedup factor as a function of number of processors. Since omega represents all of the exposed latency and communication time, as that time approaches zero the system speedup approaches the number of processors involved in the calculation. That is: Equation 126. Amdahl's Infinite Linear Speedup Prerequisite lim S(P) = P Ω→O This is similar to Amdahl's equation, except that the serial and parallel components are explicitly identified. Properly designed algorithms and supporting hardware only expose the initial latency and priming time, allowing for tremendous amounts of scaling. FIG. 104 shows a graph illustrating the effect on scaling for various total exposure times (latency plus priming time). In particular, FIG. 104 shows a curve for 0.36 seconds of exposed time 2182, 3.6 seconds of exposed time 2184, 36 seconds of exposed time 2186 and 6 minutes of exposed time 2188. FIG. 105 shows a graph illustrating the effect if the exposure time is reduced to that of typical network latencies. In FIG. 105, lines for an example low latency commercial network (Myrinet at 3 micro-seconds), a 1000 BaseT network and a 100 BaseT network appear close to a linear line 2202; a 100 BaseT network is shown as curve 2204; a 10 BaseT network is shown as curve 2206; and a human neuron (about 0.1 sec) is shown as curve 2208. As can be seen in FIG. 105, even extremely poor latency (as with a neuron, curve 2208) within the beta-phase allows scaling of up to 5,000 processors.

Linear Regime The linear regime means the minimum and maximum number of nodes that scale perfectly with the number of nodes added. By definition the minimum node count within the linear regime is one node. The maximum number of nodes within a system that is considered linear consists of the maximum N such that N nodes are N times faster than 1 node: Equation 127. Linear Regime Test M_; £ {l,max(N^rrae(sw=r'^/;v))} where: Ti ≡ The total procedure time for one node Ν ^s Node count M; ≡ The set of all nodes within the linear regime True ≡ A function that produces a 1 if the internal condition is true and produces a 0 if the internal condition is false. Equation 126 shows that the linear regime increases as the amount of exposed latency and communication time decreases.

Howard's Law

The scaling behavior of an arbitrary collection of nodes can be quantified if the nodes can be grouped in the linear portion of either the alpha or beta phases. Let: Mi be the set of all nodes in the linear alpha or beta regime. M_a be an arbitrary set of nodes. M_b be an arbitrary set of nodes. M_x be an arbitrary set of nodes.

Further, let: Equation 128. Sets of Machines in Linear Regime

M_b c M, M_{x z} M_b ≥ M_a

And: Speed( _α) > Speed( _fc) Where: Speed () = Is a function that calculates the processing performance of set of nodes within the function. Then: Equation 129. Howard's Law - Speed Matching Heterogeneous Machines Speed( ₆ uM_x) ≥ Speed( _α)

W Equation 130. Heterogeneous Machine-to-Machine Speed Matching Condition ^x Speed( ₆) Howard's Law means that within the linear regime, it is possible to meet or exceed the processing performance of any another machine regardless of the speed of the respective processors, the speed of the respective channels, the channel latency, or the processing phase, as long as the prerequisites are met.

1^st Corollary: The processing performance of any machine in the linear regime can be made to approach infinity if the communication latency of that machine is zero, regardless of the speed of the respective processors, the speed of the respective channels, or the processing phase. Let: M_aιι equal the set of all possible nodes. If: Equation 131. Nodes in the Linear Regime are a Subset of All Nodes. M, C M att According to Equation 126, when Ω approaches 0, the system scaling approaches P (the number of all possible nodes). Therefore by Equation 126: Equation 132. Unbounded Speedup Corollary lim Speed( _au) = M^ => M^ = M_t Ω→0 Since Ω approaching 0 can only occur when there is no latency and the priming period is zero, and further since this can only occur in the beta-phase this means that the Unbounded Speedup Corollary is only valid in the beta-phase. 2^nd Corollary Any machine in the regime of the Unbounded Speedup Corollary can be made equal to any other machine in the regime of the Unbounded Speedup Corollary. Let:

Then the Unbounded Speedup Corollary gives: Equation 133. Equal Performance Corollary lim i = _aιι and lim M₂ = M_au Ω->0 Ω→O

Therefore, 3 M_x such that Speed(Mι u M_x) > Speed(M₂) For any M_\ and M₂ The second corollary means that any compute systems can be made meet or exceed the processing speed of any other compute system regardless of the speed of the respective processors or the speed of the respective channels. 3^rd Corollary - The Sufficient Channel Speed Corollary For any system in the linear regime, increasing the effective bandwidth has no impact on that system's performance. This is because maximal masking of communication with processing has already occurred and speeding up the communication (decreasing the communication time) does nothing to improve performance. This is another way of stating the system processing power and effective bandwidth should be balanced.

4^th Corollary - The Infinite Processing Speed Corollary No processing system with infinite processing speed can be scaled.

Let: Ω = 2, the minimum conventional channel number of time units T_p = 0, the time it takes to process data with infinite processing speed

Then, from Equation 125: S(Infmite Processing Speed) = (2 + 0)/(2 + 0/P)

So: Lim S(Infϊnite Processing Speed) = 1 P-»∞ .'. An infinite processing speed computer system with conventional channels does not scale beyond a single infinite processing speed computer system.

5th Corollary — The Zero Speedup Corollary

Let: Ω = 0, the minimum unconventional channel number of time units T_p' = 0, the time it takes to process data with infinite processing speed

Then, from Equation 125: S(Infinite Processing and Communication Speed) = (0 + 0)/(0 + 0/P) So: Lim S(Infinite Processing Speed) = 0 P→∞

Λ An infinite processing speed computer system with infinite channel speed and no latency does no work. Scaling Behavior Impact on Machine Comparisons

Comparing parallel processing systems has always been difficult when the goal is to decide which type of machine is best for a given job. Peak performance figures are seldom accurate predictors of what a machine may do with any given real problem. The following discussion shows the relative importance between machine speed and the ability to mask (overlap) communication. In effect, a machine made up of slow processors can better a machine using faster processors.

Performance Equality To examine the relative performance of two parallel machines, Amdahl's Law OEquation 125) is used to find the conditions under which one machine can better another. For two machines with equal performance on some problem, their single processor speeds divided by their speedups should be equal: Equation 134. Condition for Performance Equality between Two Parallel Machines

where the r subscript is associated with a machine called the reference machine, and the c subscript indicates the comparison machine. Let R=T T_r d ε=Ω/Ω_c. Substituting into Equation 134 and solving for N_c, gives: Equation 135. Comparison Nodes Needed to Match Reference Nodes R²τ_r(εΩ_c +τ_r) N = €Ω_c ²(l- R)+RT_rΩ_c(ε-l)+^(Ω_c +RT_r)

This equation demonstrates rather surprising behavior as illustrated by an example. Let R=4, so the processors on the reference machine are 4 times faster than those on the comparison machine; T_c=7200 seconds (2 hours); and β^=60 seconds (60 seconds of unmasked I/O time). FIG. 106 shows a graph 2220 illustrating the number of comparison nodes required to match the performance of the specified number of reference nodes, given different values of Ω. The reference machine is 4 times faster than the comparison machine (T_p of 1,800 versus 7,200 seconds), and exposes 60 seconds of non- masked I/O. In FIG. 106, a curve 2222 shows the number of comparison nodes required to match the performance of the specified number of reference nodes, given Ω has a value of 1.93548387, a curve 2224 shows the number of comparison nodes required to match the performance of the specified number of reference nodes, given Ω has a value of 1.93548393, and curve 2226 shows the number of comparison nodes required to match the performance of the specified number of reference nodes, given Ω has a value of 3.0. The implications of FIG. 106 are surprising. It indicates that a comparison machine with better Ω can not only equal the performance of the reference machine, but can also reach a performance level that the reference machine can never match. The behavior is better understood by allowing the s ccu uuicicii cs cuiu υvcuaμ cu i/u umerences to vary separately. First Ω is set constant on the reference and comparison machines, such that Equation 135 becomes: Equation 136. Comparison Nodes Required to Match Reference Nodes at Constant Ω _N R%N_r(Ω + T_r) ^c Ω²N_r(l-R)+T_r(Ω + RT_r) This is only true if the result is greater than 0, implying the denominator is greater than 0.

That means a solution for N_c exists only if: Equation 137. Condition for N_s Existence at Constant Ω _Λr T_r(Ω + RT_r) N_r < - X X- Ω(R - l) If this condition is satisfied, then a machine with some number of slower nodes (albeit possibly infinitely slower) can always meet or beat the performance of the reference machine. The limit of this equation (either infinitely slow processors compared to fast processors, or fast processors compared to infinitely faster ones) yields: Equation 138. Condition for N_s Existence at Limit of Infinite Speed Ratio ( TT. V limN. < κ→~ _j In the example above, if .2=60 seconds, N_r is greater than 129,000 nodes to ensure that the slowest possible machine can't meet or beat its performance. Continuing with our example, if N_r<Y12,92Q then some number of processors 4 times slower can meet or exceed its performance. If N_r =100,000, for instance, the slow machine needs /V_c=950,522 to match it, but adding more nodes exceeds the fast machine's performance. A second way to examine Equation 135 is to hold the processor speeds fixed, but vary the Ω's.

This has the effect of measuring the impact of the implementation quality, since a better implementation uses fewer machines to reach a given performance level. For this purpose, consider Ω_c > Ω_r (the comparison machine has more exposed communication time than the reference machine). The number of reference nodes needed to match a given number of comparison nodes is then: Equation 139. Number of Reference Nodes Needed to Match Comparison Nodes at Constant T_p

For a limit as N_c gets arbitrarily large, there is a finite value for N_r which meets or exceeds the performance of an infinite number of comparison nodes: Equation 140. Reference Nodes Needed to Match Arbitrary Comparison Nodes at Constant T_p Ω + T lim N =— — ^p- w_c→- ^r Ω_c-Ω_r The behavior of both Equation 139 and Equation 140 is shown in FIG. 107. In particular, FIG. 107 shows a graph 2240 illustrating the number of comparison nodes required to match the performance of the specified number of reference nodes, given different values of Ω. In FIG. 107, processing time is fixed at 2 hours (7200 seconds) and the reference machine has 60 seconds of exposed communication. A Une 2242 shows a linear relationship when Ω for the comparison machine is the same as the reference machine (i.e., 60). A curve 2244 shows the number of comparison nodes with Ω = 59 required to match the reference nodes. A curve 2246 shows the number of comparison nodes with Ω = 58 required to match the reference nodes. A curve 2248 shows the number of comparison nodes with Ω = 50 required to match the reference nodes. A curve 2250 shows the number of comparison nodes with Ω = 30 required to match the reference nodes. A curve 2252 shows the number of comparison nodes with Ω = 1 required to match the reference nodes. The comparison . machine is matched at a number of different exposed communication times. The limits on N_r for each case plotted is show in Table 44.

Ω_c (seconds) Limiting N_r 60 No 59 7,200 58 3,600 50 720 30 240 1 123 Table 44. Limiting Values on N_r Assuming T_p=7,200 seconds

Table 44 demonstrates the tremendous impact algorithm design can have on system resources required for even modest problems.

Behavior at the Alpha-Beta Boundary As described above with reference to FIG. 100, the change from alpha phase 2100 to beta phase 2110 occurs when the amount of exposed communication and latency drops below the amount of processing time. This occurs when Ω<T_p. By setting the reference machine to lie just above the beta phase, performance may be examined by comparing the reference machine to machines deeper in the alpha phase and by comparing the reference machine to machines in the beta phase. Assume the reference machine has: T_r = 1800 seconds Ω = 1801 seconds. This reference machine may be compared to comparison machines with Ω set to values of 1810, 1801, and 1789 seconds, as shown in graph 2260 of FIG. 108. In particular, graph 2260 shows three curves 2262, 2264 and 2266 illustrating the number of comparison nodes required to match the performance of the specified number of reference nodes, given different three different values of Ω: 1811, 1801 and 1789, respectively. Note in particular that the comparison machine deeper in the alpha- phase (i.e., Ω=1811 curve 2262) fails to keep up with the reference machine, eventually requiring an infinite number of nodes to match the reference machine's performance at only 401 nodes. Conversely, the comparison machine in the beta phase (i.e., Ω=1789 curve 2266) requires only 300 nodes to match any number ol nodes on toe reterence macmne. This change from inferior to vastly superior performance is achieved by reducing Ω by only 1.1%. Working on limiting Ω may be hard, but the benefits are crucial to good performance.

Variable Ω Extension to Amdahl's Law, Gamma Phase Amdahl's law is defined only for Ω that is independent of the number of nodes. This means that it does not cover the condition where a bottleneck within a single node is relieved because multiple nodes are utilized. Nor does it anticipate negative scaling, see FIG. 104, FIG. 105, and FIG. 106 for examples of Amdahl^'s law scaling. As can be seen there is a performance roll off, followed by essentially a flat performance region. The effects of Ω, which is dependent upon the number of nodes, is explored below. Most parallel processing practitioners have seen both negative scaling and super-linear scaling. Super-linear performance is generally attributed to caching effects. For example: better cache fit because of increased node count or allowing the entire data structure to stay within RAM and thus eliminating a disk access because of increased node count, etc. A new cause of super-linearity, geometrically induced super-linearity, as well as the causes of negative scaling is discussed below. As discuss above, system input/output and cross-communication using cascades, manifolds, and hyper-manifolds are all zero-max entropy structures. The effects of some of these structures used during priming time are shown below. In a first example, a hypothetical embarrassingly parallel algorithm is I/O bound. In particular, Ω = 50 seconds and T_p = 400 seconds. Equation 125 derives the speedups shown in Table 44 below.

Table 45. Speedup Calculations for Ω = 50, T_p = 400, Type Hlb I/O

FIG. 109 shows a graph 2280 illustrating Amdahl's law. In particular, graph 2280 shows a speedup chart for omega = 50, T_p = 400 for a cascade with type Hlb input/output as shown in Table 44. Such graphs may not show negative scaling, showing only scaling that 'tails off. To obtain negative scaling requires Ω to vary with the number of nodes, or more generally with the node count of a given φ. When Ω varies with the number of nodes it is designated gamma-phase. To see the effects of gamma phase — Let: Ω'0 = 3(Ω, φ) T = T_P /P_φ This means that for each P_φ step the Ω term may be used to calculate some Ω'() function. Function 3(Ω, φ) is different for each exchange type. This changes the basic Amdahl equation to: Equation 141. Extension to Amdahl's Law S'(P_φ) = (Ω + T_p) / [Ω'() + (Tp/P_φ)] Since Ω'() is a function, taking the limit of that function yields its maximum behavior. As can be seen below, the limit values of Ω'() can have one of 5 basic interpretations: 1^st Interpretation: Let: lim Ω = Ω => a constant for all P_φ

lim Ω'O = Ω'() => Ω'() is a constant for all P_φ P_φ→∞

If: Ω'() = Ω Then: S'(P_φ) = S(P_φ) for all P_φ So: Equation 142. Standard Amdahl's Law Prerequisite lim Ω'() = Ω'() =-> Amdahl's initial condition: Pφ→∞ This is because under the Equation 142 conditions Ω'() is a constant with respect to the number of processors. That is, a constant as the node count increases. Therefore under these conditions: Equation 143. Howard's Second Law, Amdahl's Law Extension Operational Regime

Let: lim Ω'O ≠ Ω'O

Then: lim Ω'() - lim T = Ω'() => sublinear scaling; 2ⁿ Interpretation: Pφ→∞ Pφ→∞ lim Ω'O - lim T = C => linear scaling; 3^rd Interpretation

lim Ω'() - lim T = -T => superlinear scaling; 4^th Interpretation

lim Ω'() - lim T = ∞ = negative scaling; 5^th Interpretation

Where: 3(,)≡ A function which relates Ω to the expansion of the cluster as given by φ. This function depends upon the algorithm in question. Ω ≡ The single node version of Ω. The second operational interpretation occurs because Ω'() is not a constant and the T approaches zero faster than Ω'() approaches zero effectively making Ω'() a constant. This causes the behavior to be analogous to the standard Amdahl's Law behavior. The third operational interpretation occurs because Ω'() is approaching zero at the same rate as T. This remains within the linear regime. The forth operational interpretation occurs because Ω'()aρproaches zero at a faster rate than T.

This means that the overhead is shrinking faster than linear, since T always shrinks linearly this results in an overall super linear effect. The fifth operational interpretation occurs because is growing as a function of the number of processors. This eventually forces the system to zero performance. To get to zero performance requires negative speedup. There are four scaling starting positions: Super linear, Linear, Sub linear, and Negative. A scaling starting position occurs when φ = 0 transitions to φ = 1. Below is a graph that charts extensions to Amdahl's law properties (1^st, 2^nd, 3^rd, 4^th, and 5^th) from φ = 0 through φ = ∞. Please note that only the endpoints and graph shape matter in these figures. To show the shape of the curves, the inflexion points for various Ω'() functions are first found. The inflexion point calculation is then shown for the embarrassingly parallel model where the only concern is with I/O. If the Type Hlb I/O model is used, then: Let: 3(Ω, φ) = (Ω/φ); because the number of I/O channels equals φ so the I/O overhead is divided by number of channels

By Equation 11: P_φ = [(v + l)^φ - l] / v Therefore Equation 141 becomes: s'(p_φ) = s'([(v + ιy - i]) = (Ω + T_p)/[(Ω/φ) + (vT_p / ((v + If - 1))]

So : d/d_φ [S'(P_φ)] = -[(Ω + T_p)/( Ω/φ + (vT_p / ((v + If - 1))²] d d_φ[Ω/φ + (vT_p / ((v + lf - l))] = [(Ω + T_p)( ΩP_φ ² + φ² T_p(P_φ + 1) In (v+1)] / [ΩP_φ + T_pφ]² Therefore, the inflection point occurs when K changes sign: K = [ ²y/ d_φ ²] / [l + (dy/ d_φ)²]^{3 2} This shows the inflection points for one of our exchange methods. All other exchange method inflection points can be computed analogously. The end-points combined with the inflection points can be used to generate the curves 2302, 2304, 2306, 2308 and 2310 as shown in graph 2300, FIG. 110. In particular, curves 2302, 2304, 2306, 2308 and 2310 of graph 2300 illustrate superlinear start properties for the basic 5 interpretations of the limit value of Ω0, respectively. This means that from the first node count to the second node count the system (that is from φ = 1, to φ = 2) has superlinear performance. As can be seen, those systems which have decreasing 0 first have superlinear scaling and then either linear scaling, sublinear scaling or superlinear scaling depending upon the rate that Ω0 decreases. Those systems which have increasing Ω0 first go linear, then sublinear, then negatively scale, and finally, at infinity, there is zero performance. FIG. Ill shows a graph 2320 illustrating curves 2322, 2324, 2326, 2328 and 2330 for the basic 5 interpretations of the limit value of Ω0, respectively. This shows that from the first node count to the second node count the system (that is from φ = 1, to φ = 2) has linear performance. As can be seen, those systems which have decreasing 0 first have linear scaling and then either linear scaling, sublinear scaling or superlinear scaling depending upon the rate that 0 decreases. Those systems which have increasing ΩQ first go sublinear, then negatively scale, and finally, at infinity, there is zero performance. FIG. 112 shows a graph 2340 illustrating curves 2342, 2344, 2346, 2348 and 2350 for the basic 5 interpretations of the limit value of Ω0, respectively. In particular, graph 2340 illustrates that when sublinear scaling occurs between the first node count to the second node count, the system (that is from φ = 1, to φ = 2) has sublinear performance. As shown by graph 2340, where Ω0 first first descreases, sublinear scaling and then either linear scaling, sublinear scaling or superlinear scaling occurs, depending upon the rate that 0 decreases. Where Ω() first increases, negative scaling occurs, and at infinity, there is zero performance. FIG. 113 shows a graph 2360 illustrating curves 2362, 2364, 2366, 2368 and 2370 for the basic 5 interpretations of the limit value of Ω0, respectively. In particular, graph 2360 illustrates that starting with standard scaling between the first node count and the second node count the system (that is from φ = 1, to φ = 2) has standard Amdahl starting performance. As shown in graph 2360, those systems which have decreasing 0 first have standard scaling and then either linear scaling, sublinear scaling or superlinear scaling depending upon the rate that 0 decreases. Those systems which have increasing 0 stay negatively scaling, and finally, at infinity, there is zero performance. FIG. 114 shows a graph 2380 illustrating curves 2382, 2384, 2386, 2388 and 2390 for the basic 5 interpretations of the limit value of ΩQ, respectively. In particular, graph 2380 shows that starting with negative scaling between the first node count to the second node count, the system (that is from φ = 1, to φ = 2) has negative scaling starting performance. As shown in graph 2380, systems with decreasing Ω0 first have negative scaling and then either linear scaling, sublinear scaling or superlinear scaling depending upon the rate that 0 decreases. Those systems which have increasing 0 remain with negative scaling until, at infinity, there is zero performance. Communication Degenerate Case Expanded Amdahl's Law Examining the derivation leading to Amdahls Law, Equation 125, it can be seen that it is applicable only when commumcation and computation are combined. This means that when the system is only computing and not communicating you get the following degenerative case: Let: Ω'() = 0 Then: Equation 144. Compute Bound Degenerative Amdahl's Law S'(P_φ) = P_φ, for all P_φ Note: This degenerative case always has linear speedup. To show non-linear responses see:

Computation Degenerate Case Further Expanded Amdahl's Law When the system is only communicating and not computing you get the following degenerative case: Let: T_p = 0

Then: Equation 145. Communication Bound Degenerative Amdahl's Law S(P) = Ω / Ω = 1; iff lim Ω'() = Ω'()

S'(P) = Ω / Ω'0; ϊ lim Ω'() = 0 or

lim Ω'() = ∞

The first part of Equation 145 allows for sublinear, linear, superlinear, and negative speedups as function of Ω'()- These conditions are shown below: Let: Y = Ω /P_φ lim Ω'() = Ω'() => Speedup = lx, 1^st Operational Regime P_φ→∞ This is because Ω'() is a constant with respect to the number of processors. That is, a constant as the node count increases. Further, since there is no calculation component, the speedup equals the overhead. Thus the first portion of Equation 145 is true. If: lim Ω'() = 0 or lim Ω'O = ∞

Then: Equation 146. Amdahl's Law Extension Communication Degeneritive Case Operational Regime lim ΩO-lhn Y = Ω0; Sublinear speedup, 2^nd Operational Regime Pφ→∞ Pφ→∞ lim Ω0-lim Y = Constant, linear speedup, 3^rd Operational Regime Pφ→∞ Pφ→∞ lim ΩO-lim Y = -Y; Superlinear speedup, 4^th Operational Regime Pφ→∞ Pφ→∞ lim ΩO-lim Y = ∞; Negative speedup, 5^th Operational Regime Pφ→∞ Pφ→∞ The second operational interpretation occurs because 0 is not a constant and the Y approaches zero faster than ΩO approaches zero effectively making 0 a constant. This causes the behavior to be analogous to the standard Amdahl's Law behavior. The third operational interpretation occurs because O is approaching zero at the same rate as Y. This remains within the linear regime. The forth operational interpretation occurs because ΩO approaches zero at a faster rate than Y. This means that the overhead is shrinking faster than linear, since T always shrinks linearly, this results in an overall superlinear effect. The fifth operational interpretation occurs because O is growing as a function of the number of processors. This eventually forces the system to zero performance. To get to zero performance requires negative speedup.

Optimizing Ω'() Parallel Effects There are six areas that may be balanced when optimizing parallel effects for a given algorithm. They are: problem-set decomposition (D), data input (I), computation (C), cross- communication (X), agglomeration (A), and data output (O). Most of these areas can be performed by combining calculation with communication; however, the cases where this is not true are also accounted for. In addition, it is necessary to be able to handle one or more of these areas not occurring for a particular algorithm-the degenerate cases.

Algorithm Required Multiple Omega Primes The total end-to-end time for an algorithm is given by: Let: T_D ≡ Total inline" problem-set decomposition time T_ϊ ≡ Total exposed"^' data input time T_c ≡ Total computation time T_x ≡ Total exposed cross-communication time T_A ≡ Total exposed agglomeration time To = Total exposed data output time

So: Equation 147. Total End-to-End Algorithm Run Time Tend-to-end = T_D + Tj + Tς + T_x + T_A + To Note: oc - inline time is the time it takes to decompose a problem during the course of performing the algorithm. It does not include compilation, algorithm analysis, or programming time Note: ∞ - exposed time is that time which is additive to the end-to-end algorithm performance time. It does not include time that is masked because of parallelism. As stated above, each of these areas can have either communication only, computation only or both. This means that the scaling of Tend-to-end is the weighted average of the scaling of each area as it exits in the full and degenerative states. Let: S'(P_φ, x) ≡ The variable Ω speedup of area x, I/O and computation S'(P_φ, x) ≡ The variable Ω speedup of area x, communication only S°(P_φ, x) ≡ The variable Ω speedup of area x, computation only S^x(P_φ, x)≡ The variable Ω speedup of area x, any type Ω_TD ≡ The exposed non-algorithm time for problem-set decomposition ΩTI ≡ The exposed non-algorithm time for data input ^"Ω_Tc ≡ The exposed non-algorithm time for computation Ωx ≡ The exposed non-algorithm time for cross-communication Ω_TA ≡ The exposed non-algorithm time for agglomeration Ω_τo ≡ The exposed non-algorithm time for data output T_Di ≡ Total inline problemset decomposition time for all areas!' T_B ≡ Total exposed" data input time for all areas'!' T i ≡ Total computation time for all areas'!' Tχι ≡ Total exposed cross-communication time for all areas'!' TAJ ≡ Total exposed agglomeration time for all areas'!' T₀i ≡ Total exposed data output time for all areas'!' Note: an area is one of the six exposed time consuming portions of the algorithm; i.e., problem-set decomposition, data input, computation, cross-communication, agglomeration, and data output So: Equation 148. Ω'() Speedup Equations for all Component Areas S'(P_φ, T_D) = (ΩTD + T_D) / [Ω(T_D) + (T_D/P_φ)], overlapped I/O and computation S'(P_φ, T_D) = Ω_TO / Ω(T_D), using I/O only S°(P_φ, T_D) = P_φ, using computation only

S'(P_φ, Ti) = (Ωn + T ) / [Ω (T£) + (Tι/P_φ) overlapped I/O and computation S"(P_φ, T = Ω_Ή / Ω'(T , using I/O only S" '(P_φ, Ti) = P_φ, using computation only

S'(P_φ, T_c) = (Ω_τc + T_c) / [Ω'(T_C) + (T_c P_φ)], overlapped I/O and computation S"(P_φ, T_c) = Ω_τc / Ω'(T_C), using I/O only S " ' (P_φ, T_c) = P_φ, using computation only

S'(P_φ, T_x) = (Ω_τx + T_x) / [Ω'(Tχ) + (Tχ/P_φ)],overlapped I/O and computation S"(P_φ, T_x) = Ω_τx / Ω'(T_X), using I/O only S " ' (P_φ, T ) = P_φ, using computation only

S'(P_φ, T_A) = (Ω_TA + T_A) / [Ω'(T_A) + (T_A/P_φ)],overlapped I/O and computation S"(P_φ, T_A) = Ω_TA / Ω'(T_A), using I/O only S " ' (P_φ, T_A) = P_φ, using computation only

S'(P_φ, T₀) = (Ω_τo + T₀) / [Ω'(T₀) + (T₀ P_φ)], overlapped I/O and computation S"(P_φ, To) = Ω_τo / Ω'(T₀), using I/O only S' ' '(P_φ, T₀) = P_φ, using computation only Because there are six identified areas that may be addressed for each algorithm, both Ω and

Ω'() can be expressed such that: Let: 6 6 Ω = ∑ Ω; = ∑ (T_Dι+ T_H+ T_Ci+ T_xi+ T_M + T_0i) i = l i = l

6 6 Ω' = ∑ Ω'_i = ∑ (T'_Di + T'_Ii+ T'c_i + T'χ_i + T'_Ai+ TOi) i = 1 i=l Then: Equation 149. Speedup Equation In Terms of Algorithm Activity Areas for Ω'O

Tm+ Tn+ Tci + Txi+ TAi+ Toi 6 S'(P_φ) = ∑ S^x'^TDi(P_φ) T_Di) + S^x-^τ,i(P_φ, TD) + S^x'^τci(P_φ, T_ci) + S^x'^τxi(P_φ, T_xi) + i = l S^x-^TAi(Pφ, T_Ai) + S^x'^τoi(P_φ, Toi) In order to optimize the speedup, a commumcation model for each area of the algorithm may be selected on an algorithm-by-algorithm basis. This leads to the following: Let: A_m ≡ the current mathematical algorithm needing optimum parallelization fpⁱ ≡ the j* valid communication model for the i^Λ area ™^ ≡ the problem-set decomposition area, j^Λ valid communication model ^Tij __ _{me m}p_Ut j_{ata areaj} j^Λ _wa\id _coιrιmunication model ^TCj = the calculation area, j* valid communication model ^τ ^,j __ he cross-communication area, j^Λ valid communication model ?^TAj ≡ the agglomeration area, j"¹ valid communication model ^T0J ≡ the output data area, j"¹ valid communication model I() ≡ the inspection function Min ≡ the minimum value function The inspection function translates a given valid communication model into an O function. A valid commumcation model is one that is valid for a particular area for a given algorithm. This requires some level of human interaction or some type of machine intelligence. Once found both the valid communication model and its O function is saved. The translation is accomplished via table lookup (an example table is shown below). So: Equation 150. Optimum Communication Ω'() Model Selection Per Algorithm 6 n Ω0 = Σ ∑ Min ( I(A_m, g> ) ^{i=1 j=1} n n n = ∑ Min ( I(A_m, ^j) + ∑ Min ( I(A_m, ^Uj) + ∑ Min ( I(A_m, ^TCJ) + j=l j=l j=l

∑ Min ( I(A_m, p™J) + ∑ Min ( I(A_m, ^TAJ) + ∑ Min ( I(A_m, p¹⁰ⁱ) j=l j=l j=l Equation 150 shows that finding the minimum 0 function also finds the list of best valid communication models for each area.

Calculating Ω'() Functions for Various Exchange Methods As discussed above, a separate omega prime function is calculated for each exchange method. Type la I/O Cascade Exchange Ω'() Function The Type la I/O Cascade exchange method allows the number of nodes defined in Equation

15 to communicate in φ time steps. In other words P_φ nodes complete this exchange in φ time steps. Mathematically this means that the omega prime function for this exchange method is: Let: φ s the number of time steps required to move data in the cascade Then: Equation 151. Type la I/O Cascade Exchange Ω'() Function Ω0 = Ω / (P_φ - φ) Since the non-degenerate extension to Amdahfs law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time K = O - φ = 0

So: Ω0 = Ω / P_φ Which according to the third interpretation of Equation 143 is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives:

Let: O ≡ overlap time K = 0 - φ So: Ω0 = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time K = φ - 0 So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition.

Type lb I/O Manifold Exchange Ω'O Function The Type lb I/O Manifold exchange method allows the number of nodes defined in Equation

21 to communicate at the cascade level in φ time steps and at the manifold level in m time steps. In other words P_φ nodes complete this exchange in φ + m time steps. Mathematically this means that the omega prime function for this exchange method is: Let: m s the number of time steps for the manifold Then: Equation 152. Type lb I/O Manifold Exchange Ω'() Function ΩQ = Ω / (P_φ - (φ + m)) Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: O s overlap time K = O - (cp + m) = 0 So: Ω'() = Ω / P_φ Which according to the third interpretation of Equation 143 is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time K = O - (φ + m) So: Ω'() = Ω / (P_φ + K) Which according to the fourth interpretation of Equation 143 is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time K = (φ + m) - 0

So: Ω'() = Ω / (P_φ - K) Which according to the second interpretation of Equation 143 is the sublinear speedup condition. The advantage of the manifold over the cascade may be that (if set up properly) the manifold allows communication to a lot more nodes per time step then the cascade. Thus it is easier to mask the overhead time with overlapped calculations.

Type Ic I/O Hyper-Manifold Exchange Ω'() Function The Type Ic I/O Hyper-Manifold exchange method allows the number of nodes defined in Equation 27 to communicate at the cascade level in φ time steps and at the manifold level in m time steps. In other words P_φ nodes complete this exchange in φ plus the sum of all m time steps. Mathematically this means that the omega prime function for this exchange method is: Let: m_; ≡ the number of time steps required for the ^ manifold N ≡ the maximum hyper-manifold level Let: Equation 153. Type Ic I/O Hyper-Manifold Exchange Ω'Q Function N Ω'() = Ω / (P_φ - (φ + ∑ m_i )) i=l Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time N K = O - (φ + Σ m;) = 0 i=l

So: Ω'() = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: O -= overlap time

K = O - (φ + Σ m i-l

So: Ω'() = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time N

So: Ω'() = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition. The advantage of the manifold over the cascade is that (if set up properly) the manifold allows communication to a lot more nodes per time step then the cascade. Thus it is easier to mask the overhead time with overlapped calculations.

Type Ha I/O Exchange Ω'O Function The Type II I/O exchange method allows the number of nodes defined in Equation 15 to communicate P_φ time steps. In other words P_φ nodes complete this exchange in P_φ time steps. Mathematically this means that the omega prime function for this exchange method is: Equation 154. Type Ha I/O Exchange Ω'() Function Ω0 = Ω + P_φ Since the non-degenerate extension to Amdahls law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: K = O-P_φ = 0 So: Ω() = Ω This, according to the second interpretation of Equation 143, is the sublinear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: K = 0-P_φ

So: Ω0 = Ω - K This, according to the second interpretation of Equation 143, is the sublinear speedup condition, unless K = Ω, in which case, by the third interpretation it is the linear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: K= P_φ - O

So: ΩO = Ω + K This, according to the fifth interpretation of Equation 143, is the negative speedup condition.

Type Hb I/O Exchange Ω'() Function The Type lib I/O Cascade exchange method allows the number of nodes defined in Equation 15 to communicate in φ time steps. In other words P_φ nodes complete this exchange in φ time steps. Mathematically this means that the omega prime function for this exchange method is: Equation 155. Type Hb I O Exchange Ω'() Function Ω0 = Ω / (P_φ - φ) Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: K = O - φ = 0 So: Ω0 = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: K = 0 - φ

So: Ω0 = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: K = φ - 0 So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition.

Type Hla I/O Manifold Exchange Ω'() Function The Type Ilia I/O Manifold exchange method allows the number of nodes defined in Equation 44 to communicate in 1 time steps. In other words P_φ nodes complete this exchange in 1 time steps. Mathematically this means that the omega prime function for this exchange method is: Equation 156. Type Hla I/O Manifold Exchange Ω'() Function Ω0 = Ω / (P_φ - l) Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: K = O - l = 0 So: Ω0 = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: K = 0 - 1 So: Ω0 = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: κ= ι - o

So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition.

Type Hlb Cascade I/O Exchange Ω'() Function The Type Hlb I/O Cascade exchange method allows the number of nodes defined in Equation 46 to communicate in φ time steps. In other words P_φ nodes complete this exchange in φ time steps. Mathematically this means that the omega prime function for this exchange method is: Equation 157. Type DJb Cascade I O Exchange Ω'() Function Ω0 = Ω / (P_φ - φ) Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: O ≡ overlap time K = O - φ = 0 So: Ω0 = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives:

Let: O ≡ overlap time K = 0 - φ

So: ΩO = Ω / (Pφ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives:

Let: O ≡ overlap time K = φ - 0 So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition. Type Hlb Manifold I/O Exchange Ω'() Function The Type IUb I/O Manifold exchange method allows the number of nodes to communicate at the cascade level in φ time steps and at the manifold level in m time steps. In other words P_φ nodes complete this exchange in φ time steps. Mathematically this means that the omega prime function for this exchange method is: Equation 158. Type Hlb Manifold I/O Exchange Ω'() Function Ω0 = Ω / (P_φ - φ) This is the same as the Type Hlb Cascade exchange O function. Since the non-degenerate extension to Amdahl's law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: K = O-φ= 0 So: ΩQ = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: K = 0-φ So: Ω0 = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: K = φ - 0 So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition. The advantage of the manifold over the cascade is that (if set up properly) the manifold allows communication to a lot more nodes per time step then the cascade. Thus it is easier to mask the overhead time with overlapped calculations. Cascade, Butterfly, Broadcast, Tree Broadcast, Nearest Neighbor, 3D-Nearest Neighbor, Red- Black, and Left Right Exchanges These communication exchanges cause data to be traded in φ time steps. However, the φ and P_φ values are different for each exchange case. In other words P_φ nodes complete this exchange in φ time steps. These exchanges are grouped because they are all usually used in cross-communication. Mathematically this means that the omega prime function for this exchange method is: Ω0 = Ω / (P_φ - φ) Since the non-degenerate extension to Amdahls law requires that communication and computation occur together, then the question becomes, what is the effect of various overlap amounts. If the overlap amount is equals the exposed time it takes to move data in this exchange method then this gives: Let: K = O-φ= 0 So: Ω0 = Ω / P_φ This, according to the third interpretation of Equation 143, is the linear speedup condition. If the overlap amount is greater than the exposed time it takes to move data in this exchange method then this gives: Let: K = 0-φ

So: Ω() = Ω / (P_φ + K) This, according to the fourth interpretation of Equation 143, is the superlinear speedup condition. If the overlap amount is less than the exposed time it takes to move data in this exchange method then this gives: Let: K = φ - 0 So: Ω0 = Ω / (P_φ - K) This, according to the second interpretation of Equation 143, is the sublinear speedup condition. An advantage of the manifold over the cascade may be that (if set up properly) the manifold allows communication to a lot more nodes per time step then the cascade; thus it is easier to mask the overhead time with overlapped calculations. Exchange Method Ω'() Type la I/O Ω/(P_φ-φ) Cascade Type lb I/O Ω/(P_φ-(φ + m)) Manifold Ω N Type Ic I/O 0 - (φ + Σ mi) i=l Hyper-Manifold

Type Ila Ω + P_φ

Type lib IO Ω/(P„-φ) Type Ilia I/O Manifold Ω/(P_φ-l) Type Hlb Cascade I/O Ω/(P_φ-φ) Type Hlb Manifold I/O Ω/(P_φ-φ) Cascade All-to-All Exchange Ω/(P_φ-φ) Manifold Hyper- Manifold All-to-All Ω/(P_φ-φ) Exchange Butterfly All-to-All Exchange Ω/(P_φ-φ) True Broadcast All- to-All Exchange, Ω/(P_φ-φ) Partial Dataset True Broadcast All- to-All Exchange, Ω/(P_φ-φ) Full Dataset MPI Broadcast (Tree) All-to-All Ω/(P_φ-φ) Exchange Nearest Neighbor Ω/(P_φ-φ) 3-D Nearest Neighbor, 1 Ω/(P_φ-φ) Channel, Odd Red-Black Ω/(P_φ-φ) Left-Right Ω/(P_φ-φ)

Table 46. Communication Model to Ω'Q Lookup Table variaoie i,, an r_φ extensions to Amdahl's Law, Gamma Phase AmdahPs law is defined for P_φ only when the processor performance is a constant with respect to the individual node resources for a particular algorithm and is defined for T_p only when the algorithmic workload is constant with respect to the dataset size for a particular algorithm. This means that AmdahTs law is silent when either the there is a bottleneck in the node or when the workload represented by N bytes of data grows at something other than 0(N).

Selected Parallelization Techniques Analyzed

Various common parallelization techniques are discussed below using our current understanding of the communication effects generated above. The functional decomposition method, the loop unrolling method and the pipelining method of parallelization are examined.

Functional Decomposition Method Analysis Functional decomposition is a method whereby an algorithm's functional components are analyzed for functional areas that have no dependence on other functional components. These uncoupled functional components are then run on separate nodes. Once run the output of those separate nodes are recombined such that their output is available for the coupled functional components. . FIG. 115 shows a block diagram illustrating functional components FI, F2, F3 and FN of an algorithm 2400. In particular, algorithm 2400 has three uncoupled functional components FI, F2 and F3, and one functional component FN that is coupled to function components FI, F2 and F3. FIG. 116 shows a parallel processing environment 2420 with four compute nodes 2422, 2424, 2426 and 2428. Uncoupled functional components FI, F2 and F3 are assigned to different compute nodes of parallel processing environment 2420, as shown in FIG. 116. In particular, functional component FI is assigned to compute node 2422, functional component F2 is assigned to compute node 2424, functional component F3 is assigned to compute node 2426, and functional component FN is assigned to compute node 2428 and requires input from (i.e., is coupled to) functional components FI, F2 and F3. Equation 159. Functional Decomposition Logical Argument

Let: C ≡ # of channels per node A = fi, f₂, f₃₎. ,f_n where f_x = functional components of a single algorithm'Α' If: P c A V {P} are uncoupled and are on separate nodes And: {P} output returns to the same node v for completion

Then: 1) Ω increases in relationship to'Α'on a single node if {P} are not workload balanced and or C<n

2) Ω remains the same as'Α'on a single node if {P} are workload balanced and C ≥ n Current human constructed parallel computer systems are not designed to ensure that the number of channels equals or is greater than the number of uncoupled functional components in an arbitrary algorithm. Therefore, only the first condition of Equation 159 holds; that is Ω increases. From the Alpha-Beta Phase Postulate it can now be determined that all current human constructed parallel computer systems are in the alpha phase. From Equation 119 this gives: Alpha-phase => T_c-f > t_p Beta-phase = T_c-ι < t_p From FIG. 100 this gives: Ω = f Equation 160. Relative Ω Size Between Alpha and Beta Phases Equation So: Alpha-phase => T_c-Ω > t_p = Ω > t_p + T_c Beta-phase => T_c-Ω < t_p :=> Ω < t_p + T_c .'. Ω is larger in alpha-phase than in beta-phase Since Ω is larger in alpha-phase than in beta-phase, a smaller Ω scales better than a larger Ω and the functional decomposition method is currently only in the alpha phase for all current parallel processing systems, any beta-phase parallelization method scales better than the functional decomposition method. Loop Unrolling Analysis Loop unrolling is a parallelization technique whereby the looping structures within an algorithm are spread among multiple machines. This is only possible if the loops are uncoupled (that is non-recursive) to the other instances of the looping structure.

Let: L ≡ a looping structure in "A' L = lι, l₂, Is,.., l_a where: l_x are uncoupled loop instances

Then: L = P the results of the functional decomposition analysis follows directly Loop unrolling is generally considered at a lower parallelization level than functional decomposition. If the parallelization level continues to decrease, a processor per operation code is achieved. An operation code can be considered a function in the sense of f_x while the complete program may be analogous to"A

Pipeline Analysis A pipeline is a parallelization technique that overlaps the processing of multiple functional components. There are four basic ways that the functional components can line up in time. The first method is to have all of the functional components balanced in time, the second method is to have the time trailing functional components take longer then the time preceding functional components, the third method is to have the time trailing functional components take less time then the time preceding functional components, and the forth method is a mixture of the preceding methods. Workload Balanced Pipelined Functional Components Workload balanced pipelined functional components ensure that each functional component takes the same amount of processing time. For a heterogeneous system this means that the workload is matched to the node performance, described in further detail below. For a homogenous system this usually means that the dataset size is the same for each node.

FIG. 117 shows a pipeline 2440 with four functional components FI, F2, F3 and F4 and four phases 2442, 2444, 2446 and 2448. Thus, pipeline 2440 has sixteen functional components processing data in 7 time units. However, this depiction is not complete because it does not show either the data movement or the latency involved with a pipeline. The more correct depiction is given in FIG. 118 which shows a pipeline 2460 with two phases 2462 and 2464 and four functions FI, F2, F3 and F4, illustrating latency L and data movement D for each function. As can be seen in this depiction the data communication and the latencies add to the processing time. If Ω is calculated for FIG. 117 using the corrections found in FIG. 118 analogously to the calculation shown above, this gives: Ω = 7(0.0008 seconds) + 7(0.250 seconds) = 1.7556 seconds Using the overlapped communication described above with the Type-lib input output and assuming that 100Mb of data is input, and 100Mb of data is output, this gives: Ω = 4(0.0008 seconds) + 4(0.250 seconds) = 1.0032 seconds Giving an Ω difference of 43%.

Workload Expanding Pipelined Functional Components Workload expanding pipelined functional components have an ever-expanding functional time component progressing down the time line. FIG. 119 shows an example where each functional component doubles the time required by the preceding functional component. As shown in FIG. 119, eleven time units were taken to complete the functional processing as compared to the expected seven time units. This is because of the processing time gaps. Phase 2484 could not start processing functional component F2 until phase 1282 has completed its processing of functional unit F2, for example.

Workload Contracting Pipelined Functional Components Workload contracting pipelined functional components has an ever-decreasing functional time component when progressing down the time line. FIG. 120 shows one exemplary pipeline 2500 with two phases 2502 and 2504 illustrating one scenario where each functional component (i.e., functional components FI, F2 and F3) utilizes half the processing time required by the preceding functional component. Surprisingly, contracting workloads generate the same 11 time units as the expanding time units. This is because a functional component that takes the most time is pushing out the start of the other functional components. Mixed Workload Pipelined Functional Components Mixing workload types produces a time unit number that is between that found in balanced workload and the expanding/contracting workloads. FIG. 121 shows one exemplary pipeline 2520 with two phases 1522 and 1524 illustrating mixed duration functional components. Work Balancing Between Multiple Controllers

Multiple controllers may work on a single job. In addition, multiple jobs may also benefit from having multiple controllers. There are two broad methods that can be used to allow multiple controllers to perform on the same machine. The first method is to assign some compute nodes to a controller and other compute nodes to other controllers. In this method the node to controller assignments change only rarely. In effect the machine is split into multiple machines. However, if the compute node information is shared by the various home nodes then a single cohesive machine remains. FIG. 122 shows a block diagram 2540 illustrating three exemplary home nodes 2542, 2544 and 2546, illustrating communication channels. As can be seen one or more channels can connect together the Home Nodes. As can be seen in Equation 65, with a single channel the time it takes to perform the exchange is roughly equal to P_ψ - 1. This suggests that the optimum way to load-balance the Home Nodes is to use a two phase method. First, organize the Home Nodes in a Hyper-manifold sense. Then, perform an alternating all-to-all exchange first on level-2 and then on level- 1 nodes. FIG. 123 shows one exemplary hyper-manifold 2560 with five level 1 home nodes 2562, each representing a group of five level 2 home nodes 2564. Hyper-manifold may, for example, represent hyper-manifold 620, FIG. 28. If home node data is shared at level 2 nodes 2564 first, followed by level- 1 nodes 2562 on an alternating basis, then each load balancing cycle takes only 4bD_φ time units (where D_φ is the load balancing data and b is the bandwidth of each communication channel). By periodically load balancing home nodes 2562 and 2564 using separate channels, any home node may be used to control a job.

Processing Thread Models and Communication

Processing threads have overhead components that are analogous to communication overheads and while independent, consume processor bandwidth. One can express the number of processor cycles available to a particular processing thread on a particular node in the presence of multiple processing threads as: Equation 161. Processor Cycle Availability Formula

where: C_x ≡ Amount of processing time available to thread C_x T = Amount of processing time available to current node S_x = Context switch time needed to transition to thread C_x n ≡ Number of threads running on current node Sj ≡ Context switch time needed to transition to thread C_t C; ≡ Amount of time allocated to thread C, As can be seen, the primary effect of having multiple threads per node is to decrease the time available to each thread on that node. There are four general models for using processing threads with multiple compute nodes: the One-to-One Thread model (one processing thread per job per node), the One-to-Many Thread model (one processing thread for multiple jobs per node), the Many-to-One Thread model (multiple processing threads for one job per node), and the Many-to-Many Thread model (multiple processing threads for multiple jobs per node). FIG. 124 shows hierarchy 2580 of thread model one-to-one 2582, thread model one-to-many 2584, thread model many-to-one 2586 and thread model many-to-many 2588. These thread models are described in further detail below. One-to-One Thread Model Thread model One-to-One 2582 has the most time per node associated with the thread. In addition to having the most processing time, the node's behavior is the most deterministic of the thread models. FIG. 125 shows one exemplary job 2600 with one thread 2602 on one node 2604. Thread 2602 is also shown with input 2606 and output 2608. Where a job utilizes multiple nodes, there is a serialization-at-communication effect which decreases overall performance. Since there is only one processing thread (e.g., thread 2602) within each node (e.g., node 2604), sending or receiving data from/to another node can only occur at certain specific times. FIG. 126 shows one exemplary job 2620 that utilizes a thread 2622 running on a node 2626 and a thread 2624 running on a node 2628. Thread 2622 is shown with input 2623 and output 2627, and thread 2624 is shown with input 2625 and output 2629. In a pair-wise exchange between nodes 2626 and 2628, threads 2622 and 2624 are synchronized. For example, for thread 2622 to send data to thread 2624, thread first sends a request to send signal 2630 to thread 2624. Thread 2624 then acknowledges this signal, when ready to receive the data, by sending a clear to send signal 2632. When thread 2622 receives this clear to send signal 2632 it sends the data 2634 to thread 2624. Thus, both tiireads 2622 and 2624 may wait for each other to become ready to transfer the data. The situation becomes worse with a broadcast exchange; that is, every node in the broadcast is synchronized.

One-to-Many Thread Model The one-to-many thread model has the effect of increasing the probability of serializing multiple jobs. This is because one job may have to wait for the other job to complete prior to running for that second job. This is a useful effect when locking or unlocking a node for use. FIG. 127 shows two jobs 2640, 2650 being processed by a single thread 2642 on a single node 2644. In particular, arrow 2646 represents input to thread 2642 and arrow 2648 represents output from thread 2642 for job 2640; arrow 2652 represents input to thread 2642 and arrow 2654 represents output from thread 2642 for job 2650. Where multiple nodes are utilized, there is an additional serialization-at-communication effect that further decreases overall performance. Since there is only one processing thread in each node and it is shared between jobs, sending or receiving data from/to another node can only occur at certain specific times, but unlike thread model One-to-One 2582, one job waits until the other job is completely finished before transmitting and/or receiving. FIG. 128 shows two jobs 2660, 2670 being processed by two threads 2662, 2672 on two nodes 2664, 2674, respectively. Job 2660 provides input 2663 to thread 2662 and input 2673 to thread 2672. Job 2670 provides input 2666 to thread 2662 and input 2676 to thread 2672. Thread 2662 sends output 2665 to job 2660 and sends output 2667 to job 2670. Thread 2672 sends output 2675 to job 2660 and sends output 2677 to job 2670. This means that, in a pah-wise exchange as shown in FIG. 126, nodes 2664 and 2674 are synchronized at two levels.

The first synchronization level incurs the same wait period as experienced by thread model One-to-One 2582. For example, thread 2662 running job 2660 sends a request to send signal 2668 to thread 2672 running job 2660 and receives a clear to send signal 2671 when thread 2672 is ready to receive data 2669. The second synchronization is the job-to-job synchronization (i.e., synchronization between jobs 2660 and 2670). For example, thread 2662, running job 2660 sends a request to send signal 2661 to thread 2672 running job 2670, thread 2672, once thread 2672 is ready to switch from processing of job 2670 to processing job 2660, sends a clear to send signal 2678. Thus, two synchronizations are required for sending data from thread 2662 to thread 2672 to ensure both threads are running the same job. The situation becomes worse with a broadcast exchange; that is, every node in the broadcast is synchronized at both levels.

Many-to-One Thread Model The many-to-one thread model has a high processing share penalty (as described above). Limiting the number of threads in operation can mitigate this penalty. FIG. 129 shows one job 2680 running on two nodes 2682, 2684, each with an input thread 2686, a processing thread 2688 and an output thread 2690. These three threads represent a minimum number of threads required to communicate without synchronization. As shown in FIG. 129, input thread 2686 receives input asynchronously and communicates this input to process thread 2688 as required. Process thread 2688 may provide output data to output thread 2690 as required, and without delay, such that output thread 2690 asynchronously sends the output data. Thus, during the illustrated communication between node 2682 and 2684, neither processing thread 2688 is necessarily stalled. Additional processing threads generate the same effect as the One-to-One multi-node thread model unless balanced by matching asynchronous transmit and receive threads. If the processing thread services more than one job, then the behavior is the same as the One-to-Many Multi-Node thread model, unless there are matching asynchronous transmit and receive threads. Many-to-Many Thread Model The Many-to-Many thread model also has a high processing share penalty (as described above). Limiting the number of threads in operation can mitigate this penalty. As previously described, in a multi-node system, the minimum number of threads required to communicate without synchronization is three: the input thread, the output thread, and the processing thread. Thus, the -_t- _ω ϋ in ui. uii -αu, one output thread, and one processing thread per job. It should be noted that the use of a Many-to-Many thread model consumes more resources than the Many-to-One thread model. This resource consumption is, at best, linear to the number of jobs within the node. FIG. 130 shows two jobs 2700, 2710 being processed on two nodes 2702, 2704, where each node has three processes 2706, 2708 and 2712, allocated to each job 2700, 2702; thus totaling six processes on each node. As shown in FIG. 130, synchronization is not required providing there is a balance between the number of input threads, the number of output threads, and the number of jobs. If this balance is not maintained, then the effects are analogous to the Many-to-One multi-node thread case with the addition of multiple jobs as described above.

Hardware Enhancements for Multiple Thread Models In order to decrease the overhead spent in an I/O thread versus a processing thread (to maximize computation), Direct Memory Access technology (DMA) may be apphed to perform the I/O. This has the affect of accelerating the total processing time by off-loading I/O from the processor. Since a DMA transfer requires only the source and destination hardware registers to be filled, filling those registers becomes the only task required of the I/O threads. This technique benefits the Many-to- One and the Many-to-Many Thread Models.

Check Pointing As a Cross-Communication Issue

Check pointing is the process of capturing sufficient state information about a running program to allow it to be interrupted and restarted at some later time. Such actions might be required if a system is shut down for maintenance or to allow for recovery of a job if there is a system failure. Single processor systems or multi-processor systems with coherent system images are able to capture all the pertinent data and write it to disk. Multi-processor systems consisting of clustered computers, or those with distributed views of the operating system, are required to gather information from all compute nodes to ensure the checkpoint captures a complete view of the system. This implies that checkpointing on clustered systems can be considered as a case of Type H agglomeration. For the Howard-Lupo Hyper-Manifold, one could certainly use its Type II agglomeration process to capture checkpoint information on the home node. However, an all-to-all exchange mechanism may also be used to place checkpoint information for all nodes on every node. That allows any number of nodes, except all, to fail and still be able to recover the system. Hot spares could be assigned to replace the failed nodes, and/or work may be reassigned to continue on fewer nodes. Having each node obtain a global view is not practical in other systems because the Butterfly all-to-all exchange time scales as 0(P_φ ²/b), and broadcast exchange time scales as 0(P_< bv), whereas the Howard-Lupo Hyper-Manifold all-to-all exchange time scales as 0(P, btf). This time-scaling makes obtaining a global state view by all system elements more practical within a hyper-manifold. Each compute node sends its execution state information to a designated master node, and if an error is detected, the master node uses the state information to restart the job from the most recent checkpoint. Comprehensively defined systems use this state information to assign the failed node's portion of the job to another node, allowing processing to continue without human intervention. If errors should occur on both the master node and in one of the compute nodes of this prior art system, the job fails.

The time cost to perform a master/slave checkpoint is: Equation 162. Master-slave Checkpoint Exchange step Formula ηπ c bψ where: ^L msc # of time-units for the checkpoint operation. D_c = Size of the checkpoint data set. N = # of compute nodes. b = Bandwidth of the channels connecting the compute nodes. ψ = # of channels at the master level. FIG. 131 shows a parallel processing environment 2720 illustrating transfer of checkpoint information to a master node 2722 from all other nodes 2724. Each node 2724 in turn sends its checkpoint data to master node 2722, requiring as many time units to complete the exchange as there are nodes 2724 sending checkpoint information. In this example, checkpoint data is stored on only one node; this one node reduces the probability of recovery. To increase the probability of recovery, the number of master nodes 2722 is increased, which in turn increases the time required to save the checkpoint information in proportion to the number of additional master nodes. Equation 162 becomes: Equation 163. Multi-master Master-slave Checkpoint Exchange step Formula xNZ T msc = ^■ bψ where: x = # of master nodes There is a way, however, to guarantee the ability to recover the system as long as at least one node in the system does not fail; it requhes that the checkpoint data be stored on every node. In a sense, every node becomes a master node for check-pointing purposes and may participate in system recovery. Type I Checkpoint, Data Stays on the Cluster Checkpoint K the amount of data is low for a particular checkpoint then a truly global checkpoint may be provided. Providing system checkpoint data to all nodes requhes some form of all-to-all exchange. Each compute node belongs to a so-called broadcast family that allows each node to know which other nodes to which to broadcast. The time to perform a broadcast all-to-all checkpoint is then: Equation 164. True Broadcast All-to-All Exchange Time Formula NZ> τ_bc =- bv where: Tι,_c = # time required to move some unit block of checkpoint data. N = # of compute nodes. b = Bandwidth of the channels connecting the compute nodes. υ = # of channels on each compute node. Equation 162 and Equation 164 are quite similar and, in fact, are equal if each compute node has the same number of communication channels as the master node. Given the case of one channel per node, Equation 93 shows that using a cascade all-to-all exchange to perform the checkpoint is approximately v/2 times faster. For example, if there are 8 channels per node in a cascade, the checkpoint proceeds 32 times faster than a broadcast on the same number of single channel nodes, or 4 times faster than a broadcast on the same number of 2-channel nodes.

Overlapping Computation with Checkpoint Communication When there is data to checkpoint and if each node has significant disk storage, then a modified global all-to-all exchange can be used. First the dataset size to computation relationship is computed, see Equation 108, Equation 109, and Equation 110 above. Ifthe dataset communication time compared to its processing time is favorable then the all-to-all cross-communication, used in the checkpoint above, may be overlapped with processing of the dataset. This decreases exposure of the time required moving checkpoint data. The total amount of data that is saved is the amount of data generated since the last checkpoint divided by the number of compute nodes, that is: Equation 165. Node Checkpoint Data Calculation L φ — ^checkpoint ⁼ L'since_last_checkpoint ' l φ

Type H Checkpoint, Data Off of the Cluster Checkpoint The type π checkpoint uses type Hlb manifold I/O to move large quantities of data from the cluster to the associated disk array. Since the multiple home nodes are aware of which home nodes are to receive data from which compute nodes, dropped node detection is possible as in the type I checkpoint. In addition, overlapping computation with this checkpoint is also possible, if the prerequisites are met. Once the data is saved an all-to-all exchange may be performed at the home node level. This exchange allows each home node to be able to reconstruct the data of any compute node, eliminating the single points of failure. See FIG. 56 and associated description for more information.

When to Perform a Checkpoint There are two primary ways to notify a parallel processing system to perform a checkpoint.

The first way is to insert special checkpoint commands into the parallel processing code. This is difficult because of the need to ensure that all checkpoint calls are made at the same execution point, and it has the disadvantage of requiring the program source code to be changed. The second method entails embedding information on checkpoint times into the parallel operating system. This method can be used without changing the source code but does require an externally detectable event to trigger the checkpoint. To ensure the operational checkpoint methodology, the second method may be utilized in the following examples. There are several events that are detectable externally to both the source code and the node: elapsed time from last checkpoint, checkpoint after data transmission event, and elapsed time from data transmission.

Elapsed Time from Last Checkpoint Event The elapsed time event requhes the internal clocks of all nodes to be reasonably synchronized.

This can be accomplished via a synchronization pulse at the start of the job. In a cascade, an exchange of the initial problem definition can be used for this purpose. The elapsed time is then calculated on each node by a timer thread that is separate from any computational threads. The timer thread takes processor control after the deshed elapsed time has occurred, and starts performing the checkpoint. In addition, a checkpoint receive thread, fully independent of the timer thread, exists and continuously listens for checkpoint information data from other nodes. If the checkpoint receive thread begins receiving checkpoint information data prior to a checkpoint timer event, it triggers a checkpoint timer event, thereby re-synchronizing the checkpoint timer.

Checkpoint After Data Transmission Event This event requhes a message to be sent by the current node. Since a parallel processing environment may have its own group of send/receive commands, these commands may need to be modified such that a count is made of the transmissions. When the proper number of transmissions occurs, the checkpoint thread is notified and the transmission count reset. For this to work, the transmissions occur at about the same time. Parallel processing environment transmission methods ensure that this is the case.

Elapse time From Data Transmission Event The elapse-time from data transmission is a combination event. The initial trigger occurs with the start of a data transmission event, starting the timer. Once the elapsed time is reached, the timer thread triggers a checkpoint event. What State Information To Checkpoint Restoration of system functionality uses knowledge of a node's current execution state and access to the data being processed. State information consists of: data pointers, thread pointers, thread identities, timer values, stack pointers, heap pointers, stack values, heap values, node identity, list of D? addresses, socket data, node cascade position, cascade size, etc. All data may be captured during every checkpoint operation, or incremental methods could be used to capture only information that has changed from some referenced earlier checkpoint. Where to Perform the Checkpoint There are several places where checkpoint triggers may be placed. Each node may store its copy of the checkpoint data on local disk. It is also possible to include one or more hot spares in the system to keep additional copies. The latter technique offers the potential for a hot spare to assume the identity of a failed node, allowing extremely rapid fault recovery. Since the checkpoint data for all nodes is present, the failed node's data may not need be sent to some other node to initiate recover.

Automated System Restoration from Checkpoint If the hot spare method is used and a spare is available, the next closest node could assume the responsibility of activating it and providing it with the checkpoint data necessary to assume the failed node's role. FIG. 132 shows a parallel processing envhonment 2740 with a three node 2742, 2744, 2746 cascade and a hot spare' node 2748. In this example, node 2744 determines that node 2746 has failed and sends a shutdown message 2750 to node 2746. Node 2744 then sends a message 2752 to node 2748 (the hot spare) become active and replace failed node 2746. Node 2744 then sends messages 2754 and 2756 to nodes 2748 and 2742, respectively, to initiate a checkpoint restart. Thus, parallel processing system 2740 recovers from failure of node 2746. If there are no liot spares' available when a node fails in a cascade, then the cascade is collapsed by one cascade depth to free one or more nodes for use as hot spares'. After collapsing the cascade by one cascade depth, all but one of the remaining nodes in the cascade spawns two additional processing threads (and associated communication threads, if necessary) to assume the work of the collapsed nodes. FIG. 133 shows one exemplary parallel processing envhonment 2760 that has one cascade of seven nodes 2762, 2763, 2764, 2765, 2766, 2767 and 2768. In this example, node 2763 determines that node 2765 has failed and therefore sends a shutdown message 2770 to node 2765. Node 2763 then sends a collapse cascade message to remaining nodes 2762, 2764, 2766, 2767 and 2768 of the cascade. Nodes 2764, 2767 and 2768 assume the roll of hot spares. Node 2763 then insttucts nodes 2762 and 2766 that form the reduced size cascade to spawn associate threads and restart from the latest checkpoint. This is possible because the all-to-all exchange enables each node to have the state information of every other node. In the example of FIGs. 132 and 133, shutdown messages 2750 and 2770 may be remote network interface card control messages, using a simple network management protocol (SNMP) for example, that powers-down the node (i.e., nodes 2746 and 2765. It may be prudent to electrically isolate failed nodes so that spurious signals cannot interfere with operation of the remaining nodes. This also facilitates human repair intervention by making the failed node more obvious in a large array of nodes. If data needs to be transferred from the Mass Storage device to the cluster then a type nib Cascade or Manifold may be used to ttansfer data to the new nodes in the mimmum amount of time. If a non-cascade system is used, one or more of the nodes can still spawn additional threads and decrease the number of nodes involved with the job such that sufficient spare nodes are made available. As can be seen above, it takes three messages to recover the system, and human intervention is not required. Dynamically Increasing Cascade Size To Increase Job Performance Normally, prior art parallel processing environments have a fixed allocation of nodes given at the start of a job. This remains true even if there are additional compute nodes available and the job's performance could be increased via the use of additional nodes. This is a problem because optimal system performance cannot take place under those chcumstances. One way to dynamically assign new nodes to a job is to use the above-defined global checkpoint data (found on each compute node) to perform the proper problem re-decomposition. FIG. 134 shows one exemplary processing envhonment 2780 that has one home node 2782 and seven compute nodes 2784. In particular, home node 2782 and nodes 2784(1), 2784(2) and 2784(4) are configured as a cascade. Fhst, home node 2782 (i.e., a job controller) detects a circumstance whereby additional nodes may be utilized by the job. This circumstance detection could be as simple as the prediction of reduced job completion time, in conjunction with sufficient free nodes and an empty job queue. Home node 2782 broadcasts a job nodal expansion message to nodes 2784(1), 2784(2) and 2784(4) (i.e., the nodes assigned to the current job). This message identifies nodes 2784(3), 2784(5), 2784(6) and 2784(7) (i.e., free nodes) specifies a cascade expansion size as a deshed increase in cascade depth. This job nodal expansion message causes nodes 2784(1), 2784(2) and 2784(4) to suspend the current job. Since each node knows the cascade topology and size, each node allocated to a job has enough information to repartition the dataset (e.g., in a virtual manner, such as by rearrange the pointers into the dataset). Each of nodes 2784(1), 2784(2) and 2784(4) also knows which additional nodes (e.g., nodes 2784(3), 2784(5), 2784(6) and 2784(7)) they are to activate. Node 2784(1) activates node 2784(5), node 2784(2) activates nodes 2784(6) and 2784(3) and node 2784(4) activates node 2784(7); these node activation messages contain the job, state, and data as appropriate for the destination node. Upon activation of node 2784(3), the last node to be activated in this example, node 2784(3) sends a signal to home node 2782 indicating all nodes are activated. Finally, home node 2782 sends a restart- from-last-checkpoint message to the new cascade (i.e., nodes 2784) thereby causing the job to resume. The time it takes to perform an expansion is a function of the time it takes to activate a single processor. The following equation shows how to calculate that time: Equation 166. Node Expansion Required Dataset Size Let: *Jnode-expansion-data ⁼ JJ+ + M where: D ≡ the required dataset size in bytes S ≡ the state information size in bytes M = the grow cascade and restart message size Then the time to complete the expansion is given by: Equation 167. Node Expansion Time r_ji 2\ Vp new — p φ _~ ι )n tiode-exp ansion-data , ^ exp ansion i 2 S bv where: P new ≡ the number of processors in the expanded cascade P_φ = the number of processor in the original cascade λg ≡ transmission startup latency This expansion method can be readily changed for the manifold and hyper-manifold cases.

Parallel Programming as a Communication Issue

Programming a parallel machine is generally thought of as a way to implement an algorithm using multiple processors. Algorithms are generally classified as either embarrassingly parallel, parallelizable, or not parallelizable. Embarrassingly parallel algorithms are implemented as transactional processes, while parallelizable algorithms requhe special parallel processing codes injected into the algorithms to facilitate coordination and communication between multiple processors. Rather than there being two separate and incompatible parallel processing methods, it is proposed that there is only one over-arching technique. The only major difference between a single processor and a multiple processor implementation of an algorithm is the need to handle the distributed data load. This means that there are only five possible times during the processing life of an algorithm when the movement of data is a factor. FIG. 135 shows a block diagram 2800 illustrating three data movement phases 2802, 2804 and 2806 of an algorithm processing life cycle and the possible times when movement of data is a factor. In particular, pre-execution phase 2802 includes algorithm distribution 2808 and dataset input 2810; execution phase 2804 includes cross communication 2812; and post-execution phase 2806 includes agglomeration 2814 and dataset output 2816. An algorithm is defined to be any mathematical/logical function or functions whereby access to the portions of the function(s) bounded by data movement can occur through parameter lists. Embarrassingly Parallel Algorithm Communication Issues Since embarrassingly parallel algorithms O^PAs) can be modeled as transactions, one can now see how that model fits the Data Movement Life-Cycle model. FIG. 136 shows a schematic diagram 2820 illustrating transaction data movement between a remote host 2822 and a home node 2824 (operating as a controller), and between the home node 2824 and compute nodes 2826(1-3). Remote host 2822 may, for example, send transactions to home node 2824 as shown by data paths 2823. These transactions may represent programs with their respective data (Type 1), data when the programs were distributed a priori (Type 2), or simply a start signal as the program and data were distributed beforehand, (Types 3 and 4). Home node 2824 then distributes these transactions to compute nodes 2826, as shown by data paths 2825. There is no cross-communication between compute nodes 2826, and once the transactional processing is complete, responses are sent from compute nodes 2826 back to home node 2824, as shown by data paths 2827, and then back to remote host 2822 as shown by data paths 2828. Table 47 shows the results of describing this transaction in the form of an algorithm data movement life-cycle.

Embarrassingly Parallel Algorithm Data Movement Life-Cycle Table Pre-Execution Execution Post-Execution Algorithm Dataset Cross-

Pattern # Agglomeration Dataset Output Distribution Input Communication 1 One-to-Many One-to-Many None None Many-to-One 2 None One-to-Many None None Many-to-One 3 One-to-Many None None None Many-to-One 4 None None None None Many-to-One Table 47. EPA Data Movement Life-Cycle

Transactional processing takes its name from banking transactions. One banking transaction has no interaction with any other banking ttansaction. For instance, Sally's deposit is independent of Joe's withdrawal. This independence means that the processing of these transactions may occur on different compute nodes without recourse to inter-compute node communication. Parallelizable Algorithm Communication Issues Parallelizable algorithms (PAs) may be modeled as transactions with cross-communication and/or agglomeration. Therefore, its description may be cast into a Data Movement Life-Cycle model as well. FIG. 137 shows a schematic diagram 2840 illustrating transaction data movement between a remote host 2842, a home node 2844 (operating as a controller) and three compute nodes 2846(1-3). Remote host 2842 sends, as shown by data path 2843, a PA transaction to home node 2844. Home node 2844 separates the program and data into sub-components that are disttibuted, as shown by data paths 2845, to compute nodes 2846. During execution of these sub-components by compute nodes 2846, cross-communication may occur to allow exchange of some or all sub-component data, as indicated by data paths 2858. As used above, a PA transaction may represent a) a program with the associated data or b) the data alone where the program is disttibuted a priori to program start or c) a start signal where program and data are distributed beforehand. Once the data is exchanged, the processing is continued. Compute nodes 2846 continue processing and data exchanging until processing is complete. Results are transmitted from compute nodes 2846 to home node 2844, either after agglomeration (a kind of exchange) or directly. There are 8 types of communication patterns required by PAs, as shown in Table 48 and Table 49.

Table 48. PA Communication Patterns

Table 49. Data Movement Life-Cycle Chart for PAs From Table 48 and Table 49, we may characterize data centric parallelizable algorithms as data movement life-cycle tables, but does not capture the essence of most of the parallelization techniques. Common Parallelization Models To determine the efficacy of this approach, the common parallelization models are first examined, followed by a unification of those models via a single common data centtic model, and finally, it is shown that the data centric unified view is at least as efficient as could be achieved without the unified view. This provides the first unified computation model for parallel computation.

Parallel Random Access Model Parallel Random Access Model (PRAM) is a shared memory arbitrary processor count model. Each processor executes the program out of its own local memory. Program synchronization is such that each processor runs the same code at the same time. Shared memory access is controlled via a number associated with each processor. That number is used as an index into specific portions of memory. Although simultaneous reads are allowed, only a single write request from a single processor can be processed at any particular time. The maximum speedup for this model occurs when the total processing time with'n'processors approaches the fastest time possible on a single processor divided by the'n'processors.

Data Parallel Model This is another shared memory model. In this model, it is assumed that there are enough vhtual processors such that all concurrent operations occur in one exchange step and that there are one or more exchange steps, which implies the need for synchronous processing. All process variables are accessible to all processors; hence, the need for shared memory.

Shared Memory Model In this model, all of the processors are connected via a single, large, shared access memory, not through any sort of switching network. The primary problems with this model include resource contention and the fact that only a relatively small number of processors are cost effective to be connected in this way. This model is the PRAM model without the imposed synchronization requhement. The programmer sets up synchronization using the single processor multiple data (SPMD) methods; that is, the same program is run on different processors. Block Data Memory Model The Block Data Memory (BDM) model uses messages to communicate and access data between two or processing systems. A processing system consists of local memory, processor, and communication system. Data is broken into blocks of fixed size and distributed among the various processing systems. A processing system identity is used to differentiate between the various systems and, thus, theh data. Low-level communication is prohibited, with communication limited to the block fransfer of data elements from one processing system to another. Resource contention is handled via various synchronization messages.

MPT Block Data Memory Model Removing the BDM constraint that the data blocks are of uniform size and remove the constraint on low-level communication, results in the MPT Block Data Memory (MPT-BDM) model. Removing these constraints allows for better load balancing (especially in a heterogeneous envhonment) and also allows low-level communication to be properly overlayed.

Unified Model The Embarrassingly Parallel Algorithm (EPA) model can be thought of as a special case of the Data Parallel (DP) model. That is, if cross-communication is allowed then you have the DP model.

Both the DP and PRAM models can be expressed as a PRAM model. All of the salient activities of the PRAM model can be duplicated in the Shared Memory (SM) model but not vice- versa. The Block Data Memory (BDM) model, like the shared memory model, has multiple programs running, which are mostly asynchronous. However, synchronization is allowed via a messaging system. By relaxing the constraint that memory is cohesive, salient features of the shared memory model can be duplicated using the BDM model. Finally, all of the salient aspects of the BDM model can be duplicated within the MPT-BDM model. FIG. 138 shows a hierarchy diagram 2860 illustrating hierarchy of models EPA, DP, PRAM, SM, BDM and MPT-BDM. The DP, PRAM and SM models require function code changes (i.e., the insertion of communication codes within the functional code). Using the model shown in FIG. 135, and described above, functional code may be separated from communication code based upon Life-Cycles.

Depicting Algorithms Using Life-Cycle Planes FIG. 139 shows a function commumcation life-cycle 2880 illustratively shown as three planes 2882, 2884 and 2886 that represent the kind of processing that may be accomplished by the function. Planes 2882, 2884 and 2886 may be further separated into sub-planes. FIG. 140 shows I/O plane 2882 of FIG. 139 depicted as an input sub-plane 2902 and an output sub-plane 2912. Input sub-plane 2902 is illustratively shown with type 19204, type H 9206 and type UI 9208 input functionality and contains communication functions requhed to receive data from an external source. This input functionality also includes special functions like'NULE'9210 which do not receive information from outside of the system. Similarly, output sub-plane 2912 is illustratively shown with type 1 9214, type II 9216 and type III 9218 output functionality and contains all of the commumcation functions requhed to send data to a destination that is external to the system. Output sub-plane 2912 may also include special functions like'NULE'2919 that do not send information from the system. FIG. 141 shows translation plane 2884 separated into a system translation sub-plane 2930 and an algorithm translation sub-plane 2922. System translation sub-plane 2930 performs all the hardware requhed translations, including precision changes, type conversions and endian conversions; system translation sub-plane 2930 is illustratively shown with a real-to-integer converter 2932, a big-endian- to-little-endian converter 2934 and an N-bit precision-to-M-bit precision converter 2936. Arbitrary precision mathematics can allow any level precision, regardless of the hardware, providing for better parallel processing scaling. At the expense of slower processing speed per compute node, it becomes possible to precision match different processor families. This plane is amenable to hardware acceleration as the functions are small and known. Algorithm translation plane 2922 provides for algorithm specific translations, such as, row and column ordering, and vector or scalar type conversions. Algorithm translation plane 2922 is illustratively shown with a vector-to-scalar converter 2923, a scalar-to-vector converter 2924, a 2D-col- major-to-row-major converter 2926 and a 2D-row-major-to-col-major converter 2928. Algorithm translation plane 2922 requhes the most interaction with the algorithm developer. The entire structure revolves around the notion of a user-defined function. A user-defined function can be any algorithm such that for a given set of input values, only one output value is generated. For example, a 2-dimensional correlation function could be defined as: Image_output = 2D-Correlation(Imagei„_pUt, Image_kemai) Creating a program in this model requhes the order and the planes of the functions to be specified. Table 50 shows a programming template for a 2D-Correlation function.

Table 50. Filled Programming Template

In Table 50, each variable name specifies a description, type, and size structure arranged as shown in Table 51.

Table 51. Variable Description Table

In Table 51, each variable type specifies a type endian value and any ttanslation function associated with that type.

Table 52. Variable Type Description Table

A more complex algorithm, that shows multiple functions could be written as follows: Image_{o put} = 2D-lF^' l'(2D-Correlation(2D-FFT(Image_inpUt), Image_ker„_ai))

Its corresponding programming table entries are shown in Table 53.

Table 53. Complex Algorithm Programming Template

If programming requhes filling the programming template directly, it may not provide much benefit over existing models. However, this is not the case. A simplified program construction method is now defined. Simplified Programming Pre-requisites and Activities Rather than unrolling loops or using multiple threads to achieve parallel effects, in one method an algorithm traverses its data transformation space as the parallel arena. Using a composition-of- functions, in which simple functions are used to construct complex functions, thereby allowing computation on any one compute node to continue uninterrupted, which is a necessary pre-requisite for using this simplified programming model. This allows arbitrarily complex algorithms to be used. Further, a second prerequisite is that algorithms are not requhed to have an input dataset. A data generator function may be used to generate data, as opposed to transforming data with transforming functions. This increases the range of functions that can be parallelized. These two pre-requisites allow each algorithm to remain whole on each compute node, be arbitrarily complex, and not be limited to functions that require input datasets. When combined, these two pre-requisites allow single processor/single processing thread algorithms to be parallelized external to theh software embodiment. Thus, each compute node may completely solve its portion of a problem independently of other compute nodes, thereby decreasing unnecessary cross-communication. Each compute node may therefore contain the complete algorithmic code. The separate portion of the problem solution from each compute node is agglomerated with particular attention paid to maximizing the amount of parallelism achieved. To use this method, the developer need only develop one set of single processing thread/single processor code for execution on all compute nodes. Balancing compute nodes requhes only that the workload is balanced on each node. This allows different processor speeds to be computationally balanced (described in further detail below). New Data and Results Mapping Since computation may be broken up in ways that are asymmetric in time, it becomes difficult to calculate the flow of information through the system a priori. This time asymmetry is largely due to the fact that the algorithm itself has been split across nodal boundaries (i.e., different parts of the algorithm are performed of different compute nodes). This requhes each node to cross-communicate at times not endemic to the algorithm. Eliminating this non-algorithm endemic cross-communication decreases the total system overhead. An example of a computation commonly performed on parallel processing machines is the normalized cross-correlation (Equation 168). This algorithm is used to compare a small object image (kernel) to another image for any matches. A real number value is computed for each pixel that corresponds to how well the kernel matches the image at that location. The result of the computation is an array of real numbers whose dimensions match the dimensions of the input image being tested. Equation 168. Normalized Cross-correlation Formula

where: «,v) ≡ Normalized Cross-Correlation βx,y) ^≡ The region under the kernel K^χ>y) ^≡ The kernel / ≡ The mean of the function/under the kernel t ≡ The mean of the kernel

Each compute node involved in the computation determines how much of the solution it is responsible for (called the output mesh) by dividing the number of rows by the number of processors. If there is a remainder of rows, the first few compute nodes in the computation each do one additional row so that all rows are computed. The input mesh is computed in the same manner except that each node also has an additional number of rows to match the number of rows in the kernel (-1). These calculations are as follows: Output Mesh Calculation: Out eshRows = Inpu ImageRows / ComputeNodeCount RemainderRo s = Input ImageRows - ( OutMeshRows * ComputeNodeCount ) OutMeshlndex = OutMeshRows * ComputeNodeIndex if ( ComputeNodeIndex < RemainderRows ) OutMeshRows = OutMeshRows + 1 OutMeshlndex = OutMeshlndex + ComputeNodelndex Else OutMeshlndex = OutMeshlndex + RemainderRows

FIG. 142 shows one exemplary output mesh 2940 illustrating portioning of output mesh 2940 (and hence computation) onto a plurality of compute nodes 2942. In particular, output mesh 2940 if formed of 31 rows, each with 27 columns. A first 120 elements are allocated to a compute node 2942(0); a next 120 elements are allocated to a compute node 2942(1); a next 120 elements are allocated to compute node 1942(2); a next 120 elements are allocated to compute node 2942(3); a next 119 elements are allocated to compute node 2942(4); a next 119 elements are allocated to compute node 2942(5); and a final 119 elements are allocated to a compute node 2942(6). Using the output mesh calculation, each node determines which portion of the result it is responsible for and proceeds to completely compute that portion of the result. All portions (i.e., from all compute nodes computing a portion of the result) are combined to produce the complete result for the computation. The code on all compute nodes is identical and is able to compute a portion of the result. This requires that the code performs a number of distinct steps to compute its portion of the solution. See, for example, steps shown in FIG. 135. The first step, algorithm distribution, determines the input and output meshes, which specify how much of the input data set is needed by this node and how much of the problem solution may be computed by this node. The code starts by determining the size and characteristics of the input data set. From the information about the input data set and the computation type to be performed, the code can determine the size and characteristics of the result set. Using the computation type and how many compute nodes are involved in the computation, the code can determine what portion of the problem solution is to be computed by this node and how much of the input data set is needed for the computation. The second step, data input, is to acquhe the input dataset or at least a portion of the data set that is needed by this node to calculate its portion of the result set. This step determines the source of the input data set and acquires the needed data from the source. The third step, execution, is to perform the actual computation. For this step, the code is written to completely compute the node's portion of the problem solution. Some computations may only be completed during the agglomeration exchange, but each compute node completes as much of the computation as possible. The code is also able to perform the computation no matter what the size of its portion of the computation (within reason). This code is then written to minimize or eliminate communication with other compute nodes (i.e., eliminate cross-communication), which slows the completion of the computation. Once the step is completed, the node's portion of the problem solution is ready for agglomeration with other node's portions and/or returns to the requestor (e.g., a remote host). The fourth step, agglomeration, is where the results of all nodes' computations are combined to form the complete problem solution. This may be done by Type-I agglomeration. In Type-I agglomeration, each node receives partial solutions from nodes to which it forwarded the computation request and combines the partial solutions. The fifth and last step, dataset output, is to return the partial (possibly agglomerated) or complete solution to either the node from which the node received the compute request (Type-I agglomeration), to a home node (Type-II), or to multiple home nodes (Type-HI Output). In this step, a message is formatted according to the computation type and the type of node to which the partial or complete solution is being sent. The message is used to return the data to the appropriate destination.

Toolkit Developed After examining the above steps, a fair amount of code is written for each computation type such that the code includes all five of these steps. This amount of code slows the addition of new algorithms to the compute node's capabilities and uses a large amount of maintenance as the number of supported computation types increases. As many types of computation use the same method for computing its meshes, acquiring input datasets, agglomerating partial problem solutions, and returning partial or complete results to another node, it simplifies and speeds up addition of new computation types when a set of tools is available for the code writer to use. A set of tools is developed for use when additional computation types are added to a parallel processing envhonment (e.g., a cluster computer). Separate tools were created for the Compute Mesh, Acquhe Data, Agglomeration, and Return Results steps. These tools were developed after examining the current set of computation types and types of computation that may be added in the future. To link the tools together, a data (mesh) structure is developed. This sttucture is set up at the beginning of, and used throughout, the computation process. It holds information about the input and output datasets as well as the local mesh sizes (amount of the solution set that is the responsibility of this node). It holds locators that identify where input, intermediate, and result data are to be stored. It holds an identifier that specifies the computation function to use. It also holds information for each of the tools as to what method to employ in performing the computation process. Lastly, it holds information that is passed from one tool to later tools that help them complete theh task. Each tool takes the computation request and the above described mesh structure as input, giving all of the tools a common, simple interface. Because of this regularity of input and use, all algorithms can be processed from a single driver function, eliminating more development and maintenance costs.

Using the Toolkit To use the toolkit to implement a new computation type, several pieces of code are used. The first piece of code provides a setup function. This function interprets the computation request and extracts the description of the input data set and computation parameters, such as computation node count, and place them in the mesh structure. It sets up the mesh structure with all of the information listed above that are used throughout the computation process, including the computation function identifier. The second piece of code is the computation function itself. It takes the computation request and the mesh structure as inputs and computes the partial solution, for which it is responsible to complete as much as possible. This function also updates the mesh structure as to the size and location of the result data to be returned. A third piece of code is dictated by the computation needs. This is a post-agglomeration processing function. This function may be called by the agglomeration tool, after the agglomeration step has been performed, to do any post-agglomeration processing. This processing may involve stitching or melding agglomerated partial results or converting the partial or complete results into the needed form for return to another node or to the requestor. These two or three functions are added to the code on each node to be available for use. These functions are, for example, added to the list of files to be linked into the final node executable. They may be added as a DLL or similar dynamic link library. One final step needed to add the computation type to the node code is to add a link between the computation type identifier and the setup function.

Simplified Parallel Programming Method Arguably, the easiest computer program manipulation model is the installation method used by modern personal computers. This model consists of a number of displays that offer a limited set of options to configure a computer program to work on the computer system. An analog is used to create a parallel programming model in a like fashion. Each step in the analog uses the single processor algorithm as the basis of the computation. Any data that is to be sent to or from the single processor algorithm is passed by reference, not by value. This allows the data location to be accessed externally. FIG. 135 outlines the steps requhed. They are: 1) Pre-Execution 2) Execution 3) Post Execution The pre-execution step allows the programmer to specify how the data is to be loaded into the machine. There are two parameters: Algorithm Distribution and Dataset Input. FIG. 143 shows a first exemplary screen 2960 illustrating how this information may be presented. Screen 2960 allows an algorithm that is defined for a single processor to be identified by the system. That means that the function name and location path are entered into field 2962. In addition, the algorithm distribution method is selected using one of buttons 2964, 2966, 2968 and 2970. The algorithm distribution method determines what kind of downward cascade is used to activate the program on the compute nodes. FIG. 144 shows a second exemplary screen 2980 illustrating input of an algorithm's input dataset and its location. That means that the dataset name and location path are entered in field 2982. After the dataset name and path are given, the system requests a variable name, in field 2984, that points to the current location within the dataset for the algorithm to start processing that dataset. In addition, the input datasets distribution method is selected using buttons 2986, 2988, 2990 and 2992. Like the algorithm distribution method given above, this distribution method determines what kind of downward cascade is used to distribute the input dataset to the compute nodes. FIG. 145 shows one exemplary screen 3000 for specifying algorithm input conversion. In order to be able to convert outputs of a different function into inputs of the current algorithm, the current input data type(s) may be selected using one of buttons 3002, 3004, 3006 and 3008. Once selected, either automated conversion can take place or the system can request a new conversion model. FIG. 146 shows a third exemplary screen 3020 for specification of algorithm cross- communication. The execution step defines the cross-communication model used by the algorithm. There are only four cross-communication types, plus a NULL function. These communication models are discussed above and consist of the Howard-Lupo Hyper-manifold cross-communication model (which includes the Howard Cascade Cross-Communication model and the Howard-Lupo Manifold Cross-Communication model), the Broadcast Cross-Communication model, the Next'WNeighbor Cross-Communication model, and the Butterfly Cross-Communication model. After the cross- communication model is selected, the data ttansfer type may be selected using one of buttons 3022, 3024, 3026 and 3028 of screen 3020. The two types are full dataset exchange (FX) or partial dataset exchange (PX). The full dataset exchange may be used to ttansfer all of the data on each compute node to every other compute node. The partial dataset exchange is used in functions like the mathematical transpose function. FIG. 147 shows one exemplary screen 3040 for selecting an agglomeration type for the algorithm using one of buttons 3042, 3044, 3046 and 3048. The agglomeration model marks the beginning of the post-execution work. Agglomeration is a parallel processing term that means to gather up the final solution. There is only one way to gather the final solution that is not an output event — the cross-sum like (button 3042) or Type I Agglomeration. FIG. 148 shows a fifth exemplary screen 3060 for specifying the algorithm's output dataset and its location to the system. That means that the dataset name and location path are specified in field 3062. The output datasets distribution method is selected using one of buttons 3064, 3066, 3068 and 3070. This distribution method determines what kind of upward cascade is used to distribute the output dataset outside of the system. Once the all of the functions are entered into the system, real programming starts, to chain together functions/algorithms to form more complex function/algorithms. FIG. 149 shows one exemplary screen 3080 for specifying the programming model. In FIG. 149, data path 3081 represents input, and data path 3082 represents output. Data paths 3083 represent cross-communication associated with a function 3084, and data paths 3085 represent function-to-function I/O. In addition, data paths 3085 also depict the processing order of functions 3084, 3086 and 3087. If a multi-channel I/O type is requested, then additional information is requhed to bind the channels. The data conversion (e.g., row- to-col conversion 3088) that may occur between functions is automatic unless there is no proper conversion routine. If there is no proper conversion routine, the system may provide information on what kind of conversion is requhed and then ask that the proper conversion routine be loaded; this information is displayed in the information box. The programming model may be named to allow the model (i.e., the group of functions) as a single new algorithm. Adaptive Processing Power

In prior art parallel processing environments, the number of compute nodes used on a job is fixed. Since the scaling factor of a complex algorithm can vary as different functions within the algorithm are executed, the total scaling factor achieved is reduced to some fraction of the least scaling portion of the algorithm. Thus, the over-all efficiency of the job is low, and there is a consummate over-allocation of processing elements for the job. By viewing a complex algorithm as a series of smaller algorithms and or functions that are linked together, like scaling algorithms may be linked together to form pockets of programming efficiency. FIG. 150 is a functional diagram 3100 illustrating six functions 3102, 3104, 3106, 3108, 3110 and 3112 grouped according to scalability. In particular, functions 3102 and 3104 have an acceptable scaling performance for a 31 node group 3114, functions 3106, 3108 and 3110 have an acceptable scaling performance for a 255 node group 3116, and function 3112 has an acceptable scalability for a 63 node group 3118. As long as the time to expand from 31 nodes to 255 nodes is less than the time to expand to, and calculate on, 255 nodes, and still generates an acceptable scaling factor, then functions 3106, 3108 and 3110 may be processed using theh maximum number of nodes. Finally, since functions 3102 and 3104 scale poorly, the number of nodes used is reduced, as compared to the number of nodes used for other functions of the algorithm. To determine if a node count expansion is justified, perform the following: Equation 169. Node Count Expansion Justification Formula Expand the number of nodes iff 1 expansion-time ~r U 1 ' ^ l φ-new "^^ - ' ^l φ-current Where: T = the total processing time for one node S ≡ the scaling constant for the current function D ≡ The total data requhed for a node to the last checkpoint

The formation of the scaling grouping as well as the decision to change the node count at the transitions between groups may be calculated once by a home node or by each compute node, depending upon the implementation. The output of the decisions goes into the following group transition table:

Starting Changed Group Job Number Processor Processor Number Count Count

Table 54. Group Transition Table Template The job number is the currently assigned job code for the current algorithm or group of algorithms. The group number is a number assigned by the system that represents a group of algorithms/functions that scale to the same number of processors. The starting processor count is the number of processors the system has assigned prior to changing that number, the changed processor count is the changed number of processors based upon analysis performed by the system; it may be higher, lower, or the same as the starting processor count. If the changed processor count is different from the starting processor count, the new number of nodes is allocated by the system. Below is an example of a completed Group Transition Table:

Table 55. Filled Group Transition Table Scheduling as a Communication Issue

In order to perform scheduling that takes advantage of the actual node level utilization of both individual jobs and the system as a whole, more process flow information is utilized. The process flow information may contain data on the process flow components per job, the component level scaling polynomial coefficients, the dataset size or equivalent per component, an estimate of component completion time, and the communication pattern per component. This information allows use of the scaling variability found within a job and thereby allows commingling of multiple jobs on the same set of nodal resources. This commingling of jobs increases the net efficiency of the system compared to non-commingled systems. This information is transmitted to controllers and the compute nodes. If done properly, each components attribute list is attached when a component is profiled on a machine and saved on each node for future use. When a component is used in a job flow, the run-time attribute information (dataset size, calculated coefficients, calculated start time, etc.) is presented to the initial set of nodes, together with node transition information. The node transition information defines a set of nodes that the current set of nodes collapses or expands into. Finally, checkpoint times, if any, are determined. Commingled Scheduling FIG. 150 shows a single job with six functions. The functional elements (e.g., algorithm parts) are various computational functions found within that single job. As shown in FIG. 150, the functional elements may (and usually do) scale differently from one another. This scaling difference offers an opportunity to increase actual system utilization over that obtainable by non-commingled schedulers. A parallel compute system may be logically separated into zones as shown in FIG. 151. In particular, FIG. 151 shows a node map 3120 for a parallel processing system that has 960 compute nodes (each cell 3121 represents 16 compute nodes). Node map 3120 is shown with three zones 3122, 3124 and 3126 that may represent different kinds of job queue; a queue that operates with zone 3122 for large jobs (those requhing greater than 64 nodes), a queue that operates with zone 3124 for medium jobs (those requiring 64 or fewer nodes), and a queue that operates with zone 3126 for small jobs (those requhing 16 or fewer nodes). This zoning effectively creates three machines. A system administrator may change the number of nodes associated with each queue. Common scheduling techniques include block-scheduling (a user is allowed to schedule nodes by allocating x nodes per allocation), node level scheduling (the user is allowed to select individual nodes with the minimum/maximum node count equal to the queue constraint). Once the number of nodes has been selected, that count remains for the duration of the job.

Process Flow Steering for Parallel Processing

Prior art parallel processing systems currently run a job until completion then, after job completion, start the data analysis and/or visualization cycle. This, in fact, is the standard batch processing mode of operations. Many problems could benefit from having a'man-in-the-lood'to help steer the processing such that solution convergence can be assisted by the human. The prerequisites to performing such run-time processing are: • Process flow stop/restart capability • Process flow visualization interface • Process flow branch point selection capability

Process Flow Steering Interface FIG. 152 shows a programming model screen 3140 illustrating a job with algorithm/function displays 3141, 3142, 3143, 3144, 3145 and 3146 linked by arrow-headed lines to indicate processing order. Each function display 3141, 3142, 3143, 3144, 3145 and 3146 indicates a number of processors assigned and a maximum number of processors that may provide useful scaling. The current processing position is indicated by highlighting completed functions, for example. An estimated activation time 3148 of a next algorithm function is indicated, and an estimated time to completion 3150 of the job is displayed. In one example of operation, if the user selects a function display (e.g., function displays 3141, 3142, 3143, 3144, 3145 and 3146) prior to its activation or if the select function button 3152 is pushed, then programming model screen 3140 allows functions to be added or deleted. The order of execution may also be changed. The number of nodes allocated to a function may be selected and updated. For example, Table 56 may be displayed to allow modification.

Job Function # Nodes Maximum # Requested # Delay Time System # Name Assigned Nodes Nodes To Get New Override Nodes Code

Table 56. Node Allocation Change Table Template The job number, function name, number of nodes assigned, and maximum number of nodes data come from the process flow diagram. The requested number of nodes is input by the user. Because multiple jobs can exist simultaneously within the system, those jobs consume system resources; that is, nodes. This means that if the current job requests more nodes than currently assigned, it has to wait until some of the nodes in other jobs are freed before they can be assigned to this job. The expected time delay is displayed in the delay time to get new nodes field. For emergencies, a system override code can be entered that instantly reassigns nodes to this job at the expense of other running jobs. FIG. 153 shows a process flow display 3160 indicating that a decision point has been reached, as indicated by indicator 3162. Once a decision point has been reached, the system stops, the time to next function is set to zero, the time to complete clock is frozen, and the decision point function 3142 is highlighted 3162. All decision functions should be display functions; this allows the intermediate data to be shown to the user so that the proper process decision can be made. The system knows that the decision function is a display function because that function is associated with the output plane. Next, the user selects an arrowed line 3164, 3166 that attaches the next requhed function. For example, FIG. 154 shows a process flow display 3180 illustrating process flow display 3160 after selection of arrow 3166 to indicate that continues with function 3144, which is then highlighted 3182. The process flow continues until another event takes place that stops processing.

Automated Decisions and Anomalous Data Detection

Standard compute systems may be confused by data anomalies, leading to processing errors, lost data (from various data rejection techniques), and increased human perusal to determine the cause of the anomaly. Natural systems behave in a very different way. Rather than rejecting anomalous data, natural systems attempt to determine enough about the anomaly to incorporate this new data into its overall processing. Generally, this is considered curiosity and self-directed learning. Curiosity is defined as'ϊ/ze detection and attempted resolution of anomalous data'.' If decision point functions are analysis functions rather than display functions, then the process flow may be changed from one with human intervention to an autonomous activity as shown in FIG. 155. In particular, FIG. 155 shows a process flow display 3200 illustrating process flow display 3160 after an automated decision has been made by function 3142. This method allows for the mixing of human selection and autonomous selection of process pathways. In addition, this is a basis for anomalous data detection. If algorithm/function 3142 is comprised of multiple vhtual sensors, it could be defined as a process flow in and of itself. As can be seen, algorithm/function 3142 only knows enough to select path 3164 or 3166. However, it also knows what it does not know; that is, it can also generate an unknown data path. This path is not formally part of the process flow but is, instead, a connection to another process flow that operates as a background process. This background process attempts to reconcile the input data to this new output data. It does this by trying to model the analyze data function in this example. The background process tries to model the analyze data function and then determines whether the data anomaly fits one of the following categories: mathematical incompatibility, scale incompatibility, data type incompatibility, or dataset size incompatibility. If the anomaly category cannot be determined, then the data associated with the anomaly is saved as a Cariosity Analogue (Culog) storage area. The structure of the Culog Storage is discussed in detail below. FIG. 156 shows a programming model screen 3220 illustrating one programming model where a function 3222 encounters anomalous data that cannot be categorized, and therefore selects an unknown function 3228 to handle the anomalous data. Architectural Performance Benchmarks

Comparisons of high performance computing systems may use a series of benchmark tests that evaluate these systems. These comparisons are used in an attempt to define the relative performance merits of different architectures. Tests such as LinPack and ID-EFT are thought to give real-world architectural comparisons. However, what is really being tested is the processor and cross- communication performance; i.e., they test how well the hardware performs on specific tasks - not how well different architectures perform. The definition of benchmarks is therefore expanded to include what most interests users and developers of machines, i.e., what is the best architecture. The following compares how data flows through various computing systems rather than comparing the component performance of those systems. In reality, there are only a few times when the architecture of a parallel processing machine is stressed: initial problem-set distribution, bulk I/O data transfers, data exchange times, and result agglomeration time.

Problem-set Distribution Problem-set distribution has two main components: distribution of the same components over all nodes and distribution of dissimilar components to individual nodes. If the same code and data are distributed to all nodes under the assumption that each node knows how to determine which part of the data on which to work, a simple broadcast mechanism can be used. An appropriate test then examines the number of input time units various architectures requhed to broadcast the program and the data requhed for different types of problems. Input time units may be defined as the number of exposed time units requhed to move the program and data onto all compute nodes from a remote host. For example, if there are one or more channels connecting the remote host with one or more gateway nodes, and if there are one or more channels connecting the gateway nodes with one or more controller nodes, and if each processor node is connected to the cluster network by one or more channels, then the possible number of architectural time units for a broadcast based problem-set distribution is given by: Equation 170. Problem-set Distribution Time Formula T . D D - O^ _D - O_c ^{l psd} ~ θ ⁺ μ ⁺ ψ where: T_pSd = time units requhed for problem-set distribution. D = Total size of problem-set to be distributed. θ = # of channels between remote host and gateway node. μ = # of channels between gateway and controller nodes. ψ ≡ # of channels between the controller and compute nodes. O_g = Data overlap between gateway and controller nodes. O_c = Data overlap between controller and compute nodes The data overlap represents what happens when two sets of serial data movements occur at the same time. This may be more easily seen in FIG. 157. FIG. 157 shows one example of problem-set distribution illustrating a data ttansfer 3250, starting at time 0, from a remote host 3242 to a gateway 3244, a data transfer 3252, starting at time 1, from gateway 3244 to a home node 3246 and a data ttansfer 3254, starting at time 2, from home node 3246 to compute node 3248. To make the example more concrete, assume 100 bytes of data are ttansferred from the remote host to the compute node with high efficiency. That is, transmission to the next node in line begins as soon as the first byte is received from the upstream node. This gives: D - 101 _g = 99 o_c = 99 θ = 1 μ = 1 ψ = 1 Equation 170 yields T_psd = 102 time units. This implies that the architecture is 98% efficient at problem-set distribution. If dissimilar components are to be distributed to individual nodes, each component is ttansferred to each compute node separately; this means the communication from controller to compute nodes is point-to-point commumcation, rather than a broadcast. Equation 170 is then modified by adding an additional term: Equation 171. Augmented Problem-set Distribution Formula D D - O_g D - O, -* psd — —— ++ -- ^{8 s} + υ-υ ₊ ^^ ₊ (N-l)(D -O_c ) θ μ ψ where: N = # of compute nodes. Additional compute nodes are then attached to the end of home node 3246 in FIG. 157, producing the situation shown in FIG. 158. FIG. 158 shows one exemplary distribution 3260 for a dissimilar problem-set. In particular, distribution 3260 shows a data transfer 3264, starting at time 0, from a remote host 3242 to a gateway 3244, a data ttansfer 3266, starting at time 1, from gateway 3244 to a home node 3246 and three data transfers 3268(1-2), starting at times 2, 3 and 4 from home node 3246 to compute nodes 3262(1-3), respectively. Using the same 100 byte example from above, this gives T_ps = 104 time units, so the efficiency of the example architecture drops to 96%. Note that these figures represent the theoretical maximum that the architecture can achieve. The actual performance may be determined by a benchmark and then compared to theory. Problem-set Distribution Benchmarks Two benchmarks are used for problem-set distribution. The first benchmark represents the similar components benchmark. The benchmark consists of a dataset, whose size is determined by the generator of the benchmark, and a sequence of data transmissions, the last of which is consistent with a broadcast data ttansmission. The sequence involves transmission of the dataset from the remote host through the gateway and controller nodes and ending with a broadcast to all processors. The processor nodes each transmit a data-received acknowledgement message to the controller nodes. After all compute nodes report back, or the controller nodes perform a timeout from non-receipt of the acknowledgment message, the controllers send either a timeout or completion message to the gateway node. The gateway node then passes the message back to the remote host. The controller nodes monitor the elapsed time for data transmission to each node, as well as the total elapsed time from start of the first dataset transmission to receipt of the last acknowledgement message. These timings are sent back to the gateway node. The gateway node monitors the elapsed time of the dataset transmission to the controller nodes and the elapsed time from end of dataset transmission to receipt of the last home node acknowledgement message. The gateway node then calculates the total internal system latency using the following.

Let: Equation 172. First Benchmark Timing Calculation Formula a - - Han HI b - T ¹ Can —T ² Cl

Then: Equation 173. First Benchmark Total Internal System Latency Formula E_≠ = a + b, iff α > 0 θb > 0 E_Λ = 0 otherwise where: THOII = Time stamp of the last data received acknowledgement message from the home nodes. Tm Time stamp of the first data ttansmission byte to the home nodes. T- Can Time stamp of the last received acknowledgement message from the compute nodes. Ta ^≡ Time stamp of the first data transmission byte to the compute nodes. The second benchmark is the dissimilar components benchmark. This benchmark consists of a test dataset, whose size is determined by the generator of the benchmark, and a series of point-to- point data transmissions. The dataset is transmitted from the remote host node through the gateway and controller nodes and then to all processor nodes in sequence. The processor nodes then transmit a dataset acknowledgement message to the controller nodes. After all compute nodes report back, or the controller nodes perform a timeout from non-receipt of an acknowledgement message, the controllers send either a timeout or completion message to the gateway machine. The gateway relays this message on to the remote host. The controller nodes monitor: 1) the elapsed time requhed to transmit the datasets to all nodes and 2) the time between the last dataset send completion and receipt of the last acknowledgement message. These timings are sent back to the gateway node. The gateway node monitors: 1) the elapsed time requhed to transmit the datasets to the controllers and 2) the time between the end of dataset transmission and receipt of the first acknowledgement message. The gateway node then calculates the total internal system latency using the following. Let: Equation 174. Second Benchmark Timing Calculation Formula ^a ~ Han ~ *- H\ ^{b = T}Can - ^TCl Then: Equation 175. Second Benchmark Total Internal Latency Formula E = a + b, iff α > 0Θ b > 0 E_w - 0 otherwise where: T_HOΠ ^≡ Time stamp of the last data received acknowledgement message from the home nodes. T_H1 ≡ Time stamp of the first data ttansmission byte to the home nodes. Tc _n - Time stamp of the last received acknowledgement message from the compute nodes. Ta ^s Time stamp of the fhst data transmission byte to the compute nodes. As can be seen, the E_γb and E_yp are the same equation. However, they produce different results because of the difference between the broadcast data transmission from the home node(s) to the compute nodes, which is described by E_yb, versus the sequential data ttansmission from the home node(s) to the compute nodes, which is described by E_yp. The internal system latency expressed as system efficiency can now be compared against the architecture's theoretical efficiency using: Equation 176. Detected Versus Theoretical System Efficency Formula

Multi-Job Interference Benchmark for Problem-set Distribution Distributing a single job into a machine poses one particular set of problems. Dealing with multiple simultaneous jobs introduces an additional set of problems. Different architectures respond differently in the face of multiple simultaneous job requests. These differences can be expressed as the percentage of the implementation efficiency for multiple jobs. The number of jobs is increased from one job until there is one job per compute node. An array may be created which shows both the theoretical and implementation efficiencies.

Table 57. Benchmark Table Layout

The efficiency can then be calculated from the data as: Equation 177. Multiple Job Mean Implementation Performance Efficiency ¹ T i em = Ϋ / i— _j

Where: I_em ≡ Mean implementation efficiency across all simultaneous job counts. x ≡ Current simultaneous job count. J ≡ Maximum number of simultaneous jobs. I_ex = Implementation efficiency for a given number of jobs. I_es ≡ Implementation efficiency standard deviation across all job sizes. The number of jobs grows at the natural growth rate of the number of processor nodes within a system. This means that a cascade-based system increases the number of jobs at the cascade growth rate. Table 57, is then modified as follows:

Table 58. Benchmark Table Layout for Cascading Systems Bulk I/O data transfer The bulk I/O data fransfer moves the bulk data from the remote host to all of the requhed compute nodes. Other than the size of the data, which should range from Mega-to Tera-byte data transfers, all activity corresponds to the problem-set distribution test activities.

Data Exchange Times Data Exchange Times are the times within an algorithm during which cross-communication between compute nodes takes place. This document has aheady shown the principal cross- communication techniques: Butterfly all-to-all exchange, Broadcast all-to-all exchange, Manifold all- to-all exchange, Next-Neighbor Exchange. The most difficult of the exchanges are the all-to-all exchanges. Below is a discussion of how to compare the efficiencies of various topologies when using one of the all-to-all cross-communication techniques.

Butterfly All-to-All Exchange The butterfly exchange is a point-to-point pah-wise data exchange model. Since the model of this exchange method is described above, effects of transfers endemic to the topology (but not to the model) are considered. One case in point is the number of hops it takes to complete the ttansfer. A hop is a data ttansfer through a node that does not requhe that ttansfer. Alternately, hops can be the number of ttansfers requhed to traverse a switching network. Equation 178. Restatement of Exchange Count Formula for Butterfly T = ^-P(P - l) bv This equation takes into consideration the number of channels and nodes, the channel bandwidth, and the dataset size but does not consider hops. Since a hop penalty occurs as a function of the number of compute nodes in the transfer, a strong per-node relationship is implied. However, in some topologies, the number of hops is statistically based so that a statistical relationship is used. The topological time unit cost can now be given by: Equation 179. Effect of Network Hops Formula

T = ^(P +Yh)(P + Yh - l) bv

Where: Y = the probability of a hop h ≡ # of hops

Butterfly Exchange Benchmark This benchmark consists of a test dataset, whose size is determined by the generator of the benchmark, and a series of data transmissions which are consistent with the butterfly all-to-all data exchange. The exchange-start message is generated from the controller node(s). The controller node(s) measures the elapsed time from the sending of the first byte of the exchange-start message until receipt of the last byte of the exchange-completed message from the compute nodes. These timings are sent back to the gateway node. There is no gateway node timing. The gateway node then transmits the timing data to the remote host.

Multi-Job Interference Benchmark for Butterfly All-to-All Exchange Once again, performing a single exchange on the cluster offers one set of problems. However, having multiple jobs performing multiple exchanges compounds those problems. Different topologies respond differently to having multiple simultaneous exchange jobs requested. These differences can be expressed as the percentage of the implementation efficiency for multiple jobs. The number of jobs is increased from one job until the maximum number of jobs performing an exchange for the given topology is used. An array is created showing both the theoretical and implementation efficiencies (as described above).

Broadcast All-to-All Exchange The Broadcast exchange is not a point-to-point data exchange model. However, since the model of this exchange method is described above, only the effects of ttansfers which are endemic to the topology (but not to the model) are considered. This is directly analogous to the butterfly exchange work. Equation 180. Restatement of Broadcast Exchange Timing Formula PD„ T = - bv Equation 180 takes into consideration the number of channels, nodes, the channel bandwidth and the dataset size but does not consider hops. Since a hop penalty occurs as a function of the number of compute nodes in the ttansfer, a strong per node relationship is implied. However, in some topologies the number of hops is statistically based so that statistical relationship may be used. The topological time unit cost can now be given by: Equation 181. Broadcast Exchange Topological Cost Formula (P + Yh)D_ε T = ^■ bv

Where again: Y ≡ the probability of a hop h ≡ # of hops

Broadcast Exchange Benchmark This benchmark consists of a test dataset, whose size is determined by the generator of the benchmark, and a series of data transmissions which are consistent with the broadcast all-to-all data exchange. The exchange-start message is generated from the controller node(s). The controller node(s) measure the elapsed time between the sending of the fhst byte of the exchange-start message and receipt of the last byte of the exchange- completed message from all of the compute nodes. These timings are sent back to the gateway node. There is no gateway node timing. The gateway node then transmits the timing data to the remote host.

Multi-Job Interference Benchmark for Broadcast All-to-All Exchange Performing a single exchange on the cluster offers one set of problems. However, having multiple jobs performing multiple exchanges compounds those problems. Different topologies respond differently to having multiple simultaneous exchange jobs requested. These differences can be expressed as the percentage of the implementation efficiency for multiple jobs. The number of jobs is increased from one job until the maximum number of jobs performing an exchange for the given topology is used. An array is created showing both the theoretical and implementation efficiencies (see above).

Manifold All-To-All Exchange The Manifold exchange is a point-to-point data exchange model. A model of this exchange method aheady exists and only the effects of transfers which are endemic to the topology but not to the model are considered. This is directly analogous to the butterfly exchange work. Equation 182. Restatement of Manifold Exchange Timing Formula

_T = ^(p _i) ^ε bv ^ψ ' Equation 182 takes into consideration the number of channels, nodes, the channel bandwidth and the dataset size but does not consider hops. Since a hop penalty occurs as a function of the number of compute nodes in the ttansfer, a strong per node relationship is implied. However, in some topologies the number of hops is statistically based so that statistical relationship may be used. The topological time unit cost can now be given by: Equation 183. Topological Time Unit Cost 2D T_ε = -^-{p_{φ +}Yh-l)

Where: Y s the probability of a hop h s # of hops

Manifold Exchange Benchmark As with the Butterfly and Broadcast Exchange Benchmarks, this benchmark consists of a test dataset, whose size is determined by the generator of the benchmark, and a series of data transmissions, which are consistent with the broadcast all-to-all data exchange. The exchange-start message is generated from the controller node(s). The controller node(s) measure the elapsed time between the sending of the fhst byte of the exchange-start message and the receipt of the last byte of the exchange- complete from all of the compute nodes. These timings are sent back to the gateway node. There is no gateway node timing. The gateway node then transmits the timing data to the remote host.

Multi-Job Interference Benchmark For Manifold All-to-AH Exchange As with the Butterfly and Broadcast Job Interference benchmarks, performing a single exchange on a cluster exposes just one set of problems. The presence of multiple jobs on a single cluster exposes a different set of problems. Various topologies respond in different fashion to having multiple jobs simultaneous request an exchange. These differences can be expressed as the percentage of the implementation efficiency for multiple jobs. The number of jobs is increased from one job until the maximum number of jobs performing an exchange for the given topology is used. An array is created showing both the theoretical and implementation efficiencies (see above). Combining Benchmarks into One Measure Since all of the benchmarks generate a single metric pah (mean and standard deviation), they can be combined to make a single composite metric.

Program Profiling for Better System Performance Before the scaling of a system can be computed (described above), information concerning the computational performance of each program on each node is gathered. Because of differences in cache sizes, ram sizes, disk size, disk access speeds, processor speeds, number of processors, bus speeds, etc., it is not reasonable to use a simple metric like the processor name and processor speed to predict the performance of any computer program running on any particular computer. Since a parallel processing machine can consist of machines of varying types and speeds, the only way to obtain the correct performance value is to run a copy of a program on each of the compute nodes with the same dataset. This may need to be enhanced by varying the dataset size and running the programs on each compute node with each dataset size. Once the single processor versions of the computer programs have been run on each node, all of the timing data is sent to a single controller machine (Home Node). This controller machine takes the slowest time recorded and other times are relative to this base time. Equation 184. Relative Scale Factor Calculation Formula

N„ =

where: Relative scale factor for node c. scale The relative time scale constant. rnd() The rounding function. Maximum execution time on any node. Execution time on node c.

Below is a hypothetical example of a program run on 7 nodes: scale = 1000 7Λ = 250 7/₂ = 280 T₃ = 246 T₄ = 259 T₅ = 260 T₆ = 237 T_η = 266 Tmax = 280

Equation 184 then yields: iVι = 1120 iV₂ = 1000 #₃ = 1138 #₄ = 1081 N₅ = 1077 #₆ = 1181 N_η = 1053 The information concerning the relative performance of the nodes on the current computer program is then distributed among all of the home nodes. Once compute nodes are assigned, the home node is able to compute theh relative performance. In addition, the scaling order S, or computational workload as a function of the dataset size D, is known. This may be expressed in the form O ( fnj), where βn) is a function of the number of elements. Thus, a function which scales linearly is said to be of O (n), while an algorithm whose work scales quadratically may be expressed as O (n²), and so on. Thus, S can be written as: Equation 185. Dataset Size Scaling Factor Calculation Formula

where: S ≡ Data set size scaling factor. D_c = Data set size for node c. £>_ref ≡ Reference data set size. The final requhed information is the base node count (B_n). B_n is the number of nodes that a program/algorithm is expected to use, based on the profiling user's expectation of the dataset sizes and scaling requirements likely to be encountered. This number is converted to H where H is the Howard- Lupo Hyper-manifold (or Howard-Lupo manifold, or Howard Cascade) node count that comes closest to the requhed node count without exceeding it. For instance, if the user requests 66 nodes, then H=63 on a single channel cascade is the best match, resulting in setting B_n=63. The Base Priority Vector Value (B_c) can now be defined. This value is the percent of B_n that can normally be expected for this program/algorithm. If the expected node count used is 70% of the Base Node Count, then B_c = 70. The Minimum Priority Value (M_p) may be defined as the minimum percent of the B_n that can be used by the program/algorithm. Finally, the user decides on a value for the System Core Response Urgency Flag. If this flag is set, then the current program/algorithm is allowed to override the Minimum Priority Value. These values are used below.

Work Share Calculation Each selected node is placed in a cell containing its scale factor; for example; Cellj is assigned node Nl with scale factor 1120.

Cell Cell Cell 1 2 3 N1 N4 N6 1120 1081 1181 Table 59. Node Allocation to Program Speed For this cascade configuration, the relative amount of work, expressed as a percentage of theob size, which each node should be assigned may be calculated from: Equation 186. Nodal Work Share Calculation Formula w =- s.

1=1 Table 59, can be expanded to show the share of work each node has:

Table 60. Node Allocation to Work Percentage Now each node has the appropriate work share to start the job with reasonable expectation that the work time is balanced. If the job requhes rebalancing after its currently assigned activities are performed, then some form of exchange may be used.

End-To-End Scaling Predictive Model

In the prior art, there is no way to predict the behavior of an algorithm written for a single processor when operating in a parallel processing envhonment. Being able to predict the parallel performance of an algorithm allows for both the efficient allocation of processors and an understanding of the time requhed to complete the job. If the single processor total compute time is multiplied by the requhed percentage scale-up factor, then the time for cross-communication, load balancing, check point computation, and I/O are determinable. FIG. 159 shows a graphical representation 3280 single processor timing 3282 and multi-processor timing 3284. To describe the times available on a single node to accomplish its tasks, let: T = Total processing time on single processor (e.g., single processor timing 3282). P ≡ Total number of multi-processor nodes. S ≡ End-to-end system speed-up. K ≡ Requhed scaling percentage. 1-κ is total time available to node, α ≡ Total processing time on one node, β ≡ Total time spent in inter-node communication on one node, χ ≡ Total check-pointing time on one node, δ ≡ Total system I/O time on one node. λ ≡ Total exposed latency experienced on one node. Then linear, or even super-linear, scaling, can occur if: Equation 187. Scaling Predictive Model (l- κ)- ≥ ( + β + χ+ δ ^'+ )=> S ≥ xP This equation allows one to characterize the scalability of a particular program and identifies those elements of software design which directly impacts scaling. For example, let: T s Total processing time on single processor P ≡ total number of multi-processor nodes α s Total processing time on one node β s Total time spent in inter-node communication on one node X ^s Total check-pointing time on one node δ = Total system I/O time on one node λ = Total exposed latency experienced on one node O s Total system overhead = (oe + β + χ + δ + γ) Then super-linear performance requhes: Equation 188. Super-linier Performance Predictive Model

Emotional Analogues for Dynamic Resource Allocation

All natural compute systems have an emotional component to them. This emotional component is generally considered, at best, unnecessary and, at worst, detrimental to the computational process. Since emotional traits can be ttaced through all animals with central nervous systems, it is a very strong trait, with an ongoing requhement. Natural compute systems all deal directly with a chaotic natural envhonment that requhes multiple simultaneous responses, many of which have strong, yet transient, real-time requirements. Since physical computational elements do not increase and decrease to handle these problems with these natural compute systems, the computational elements is dynamically allocated. To meet the realtime requhements, this dynamic allocation occurs very rapidly. Using emotional ttaits to perform gross level computational resource allocation (a) offers a model that can perform rapid resource allocation, (b) offers a rapid learning method, and (c) provides a centtal system focus even though the computational components are distributed. Emotional analogues (Emlogs) are defined to be the computational resource allocation and prioritization scheme(s) needed to best meet the processing timing requhements of multiple dynamically changing compute jobs. To understand how this definition can be both appropriate and meaningful, consider the following hypothetical example.

Emotional Resource Allocation Response Example Story Emotion 1: Calm, Hungry An elk in a valley feels hunger, and there are no other overriding requhements. Computational resource allocation might be: 80% to eating, 15% to general observation of envhonment, 5% to general physical activity. Emotion 2: Wary, Hungry The elk sees something in the distance - is it safe? Computational resource allocation: 65% to eating, 15% to general observing of envhonment, 10% to threat identification, 10% to general physical activity (more head/eye, ear movement). Emotion 3: Concern, Hungry The elk identifies object as a threat. Computational resource allocation: 40% to eating, 20% to general observing, 20% to predicting threat level, 15% to general physical activity, 5% to preparation for flight or fight response. Emotion 4: Fear (Hunger Overridden) The threat is within strike range and appears to be hunting. Computational resource allocation: 0% to eating, 20% to general observing, 20% to predicting threat level, 15% to general physical activity, 45% to preparation for flight or fight response. Emotion 5: Terror (Hunger Overridden) The threat attack: elk chooses flight. Computational resource allocation: 0% to eating, 0% to general observing, 0% to predicting threat level, 0% to general physical activity, 0% to prepare for threat activity, 100% to threat evasion. Emotion 6: Calm, Tired, Thirsty (Hunger Overridden) The threat is gone. Computational resource allocation: 55% to resting, 15% to general observing, 25% to seeking water, 5% to general physical activity. As can be seen in the above example, it is possible to relate a set of natural system responses to external stimuli as a emotion-to-resource allocation mapping. To give a man-made compute system (e.g., a parallel processing envhonment) analogous capability, several primary prerequisites are met: • The ability to predict the completion time for each program/job in the system. • Access to the requhed time frames for each program/job in the system. • The ability to add new processing elements to a program/job in the system/ • The ability to subtract processing elements from a program/job in the system. The fhst prerequisite requhes the ability to profile each program in the system. This is accomplished using the methods outlined above. The second primary prerequisite has no special requhements. The thhd primary prerequisite requhes global checkpointing capabilities and the ability to add additional nodes. This is accomplished using methods outlined above. The fourth primary prerequisite also requhes global check pointing capabilities and the ability to remove nodes. This may also be accomplished using methods discussed earlier. With these prerequisites in mind, a multi-dimensional resource prioritization model that determines when a program meets a given set of capabilities may be constructed. FIG. 160 shows a simple linear resource priority called an EmLog 3300. The size of a vector 3302 of emlog 3300 represents the maximum number of nodes that a program is able to use efficiently. A point 3304 denotes a nominal, or starting, number of nodes assigned to the program. A bar 3306 indicates a minimum number of nodes the program requhes to meet the specified completion time. In any collection of running programs, each program has its own resource vector (e.g., emlog 3300). FIG. 161 shows one exemplary emlog star 3320 with five emlogs 3321, 3322, 3323, 3324 and 3325. Emlogs 3321, 3322, 3323, 3324 and 3325 are treated as an orthogonal basis set, and emlog star 3320 is said to define a particular system state. The priority values define a point in the emotional space. The purpose of an emlog star (e.g., emlog star 3320) is to take advantage of the fact that the different programs are running on different sets of processors. This means that rather than a single node or group of nodes controlling the activity of the programs, this conttol can instead be distributed over each program domain. To put this in terms of an impulse function, program allocation priority of Emlog star 3320 uses an impulse 3340 as shown in FIG. 162. Sub-impulses 3344, 3346, 3348, 3350 and 3352 represent emlogs 3321, 3322, 3323, 3324 and 3325 of FIG. 161. Impulse 3340 has an impulse width 3342. FIG. 163 shows a program allocation impulse function 3360 with five concurrent exchange steps, one for each program. The node count requhed by a particular Emlog Star is the sum of all the nominal node counts of all the programs in the star. Since the Howard Cascade, and thus all other structures associated therewith, has set node count groupings that are not consecutive (such as 1, 3, 7 or 15 for the single channel Howard Cascade), the node count may be adjusted to the closest value that neither exceeds the total nominal node count nor falls below the total minimum node count. Each program/algorithm carries with it the program/algorithm requhed completion time, the completion time estimate, the starting number of nodes used, the starting resource priority vector value, the minimum resource vector value, and the system core response urgency flag. These values and flag may be arranged as shown in Table 61.

Table 61. Algorithm Specific Emlog Data

The requhed completion time is an externally received value that bounds the process timing requhements. This allows real-time requhements to be associated with a particular program under a particular set of chcumstances. The estimated completion time is a value that is computed as a function of the algorithm's scaling behavior: Equation 189. Estimated Time To Complete an Emlog Supported Algorithm E_t = N_CST The Base Priority Vector Value comes from the normal percentage of the nodes that are used by the program/algorithm. The Minimum Priority Value (M_p) is obtained from the minimum number of nodes which can be used by the program/algorithm. The largest Maximum Priority Value possible is always 100%. The System Core Response Urgency Flag is an independent flag that allows a particular program/algorithm to override the normal priority schemes and set the priority values of other programs/algorithms below theh normal Minimum Priority Value. One might consider this flag as authorization to perform an emergency over-ride of other program settings. This flag is set during profiling to allow the system to find the best normal balance for a particular state.

Instinct Analogues for Initial System Conditions The initial groups of Emlog Stars are called the Instinct Analogue Stars (Inlog Stars). They differ from normal Emlog Stars in that they cannot be deleted or overridden. Otherwise, theh behavior is exactly the same as any other Emlog Star. After a system is operational, additional Inlog Stars can be created by the user.

Multiple Emlog Stars Since an Emlog Star represents both the priority and the resource allocation for a particular group of programs/algorithms, if programs/algorithms are added to or deleted from the system, then the resource allocation has to be changed. This implies that there can be more than one Emlog Star. In fact, there may be one Emlog Star for every grouping of program/algorithms. Rather than programming in all possible states, it is proposed that new stars can be developed naturally from an existing star in a kind of budding process, described in more detail below. With multiple Emlog Stars comes the requhement to move from one star to another. Without an efficient method of transitioning between stars, there cannot be near real-time or real-time requhements given to the system.

Emlog Star Traversal There are two signals which can cause an Emlog Star to transition to another Emlog Star: (a) detection of a new program/algorithm in the system and (b) deletion of a program/algorithm from the system. In order to efficiently transition from one Star to the next, at least two Emlog Star Vector Tables are associated with each star. The fhst table contains vector information on new program/algorithms, and the second contains vectors for dropped program/algorithms.

Emlog Star Number Algorithm Name Min Dataset Size Max Dataset Size New Emlog Star #

Table 62. Added Algorithm Emlog Star Vector Table Template This vector table is used when an added program/algorithm is detected running in the system.

The new algorithm name is used to search the vector table, and then the dataset attributes are checked to ensure that the correct response occurs. Once the correct new algorithm and dataset attribute combination is found, the associated Emlog Star number is used to vector to the new state. If a match is not found, then one of two transitions occurs, a non-linear transfer takes place or a new Emlog Star is budded and a new vector entry is made to the vector table.

Deleted Emlog Star Number Algorithm New Emlog Star # Name

Table 63. Deleted Algorithm Emlog Star Vector Table When a program/algorithm completes its processing and is no longer in the system, the name of the deleted algorithm is used to search the vector table for the associated Emlog Star number corresponding to the current state. Once found, the New Emlog Star number is used to vector to the new state. If the algorithm name is not found, a system error condition exists.

Da t At P nter e

Table 64. Dataset Attributes Table Template

Non-linear Emlog Star Transfer When a normal Emlog Star data dhected transition cannot occur because there is no transition vector for the current dataset attributes, and before an Emlog Star budding process takes place, the system attempts to perform a non-linear Emlog Star ttansfer. Each of the dataset attributes is treated as a tuple in the relational database sense. A database query is then made of the existing Emlog Stars. If the dataset attributes do not match an existing Emlog Star, then an Emlog Star budding process is initiated. If a match is found then a new vector entry is made from the old Emlog Star to the newly found Emlog Star (to eliminate future database calls for this transition) and the newly found Emlog Star is accessed.

Output Imbalance Model A job can experience an output imbalance if some regions of an input dataset produce significantly more output than the average over the other regions. This can be represented as a certain sigma value departure from the average. FIG. 164 shows an example grid 3380 with an output imbalance region. In this example grid 3380, each cell represents an output. Cells 3384 generate an average output of 30. Cells 3386 each generate an average output of 50, and cells 3382 each generate an output of 100. The average data output value is calculated as: [(757 * 30) + (40 * 50) + (8 * 100)] / 805 = 33.7 Thus, cells 3382 represent a value more than three sigma above the average data scale. Several analysis methods may now be applied such as size filtering, various equalization techniques, or pattern matching to the area of interest. Once the area of interest is correlated to some known object, then the analysis attributes of the known object are associated with the new object. The object and its known analysis atttibutes are associated as a Culog. Dataset Size Incompatibility Model A Dataset Size incompatibility occurs whenever the output from a sensor exceeds the expected amount for that sensor. The result is a need to generate a new Emlog that uses enough compute nodes to meet the processing requhements. If additional nodes cannot be accessed, then a report on the number of additional nodes requhed is generated. In addition to the report, a Culog is also generated that associates the report information with the current incompatibility so that the system does not continue requesting additional nodes.

Data Type Incompatibility A Data Type Incompatibility is where the data accuracy or data type is misaligned for a function. Examples are included in the following table:

Table 65. Type Conversion Table

The above table gives an example of the type conversions which can be detected and changed at run-time. Once a type conversion has been found, it, along with the information concerning which interface, is saved as a Culog. Correlation Incompatibility Model A Correlation incompatibility occurs when the system tries to perform a correlation on an object within a data stream and finds that the object does not correspond to any remembered object. When this occurs, the system tries to perform a sub-correlation that attempts to match sub-parts of the object. FIG. 165 shows one exemplary output grid 3400 that represents an image of a tree. In this example, the object is not found in the object database; however, the white outlined area 3402 has the characteristics of a stick (found within the database). Therefore, stick-like attributes can be attached to this portion of the object. By decomposing an object into sub-objects, it becomes possible to do a partial categorization of an object. This in turn allows a model of what an object can do to be built up. This information is stored in a Culog.

Mathematical Incompatibility Model Mathematical incompatibility occurs if the data leads to a mathematically nonsensical result, such as a division by zero, an infinity, lack of convergence after some number of iterations, or solution divergence after some number of iterations, etc. When an incompatibility of this type occurs, then a wide array of analysis tools is brought to bear, based upon the type of incompatibility encountered. If an existing tool can generate more reasonable results, the results are shown to a human in the loop to verify those results or to declare the results incorrect. If the results are reasonable, then the analysis tools used to generate those results are associated with the data incompatibility and saved as a Culog. If the results are unreasonable, then this type of data incompatibility, plus what has been tried, are saved as a Culog.

Emlog Star Budding Process The emlog Star budding process begins by fhst duplicating the current Star. A new emlog Star Number is associated with the duplicated star. Next, a new priority vector is created for the out-of- tolerance program or new program (whichever the case may be). If the new emlog Star is created because a dataset attribute is out-of-tolerance, then the old associated priority vector is deleted in the newly budded emlog Star. Finally, the linkage is made between the old emlog Star and the new emlog Star. The linkage is in the form of an added algorithm emlog Star entty in the old emlog Star and a deleted algorithm emlog Star entty in the new emlog star. To select the new group of algorithms/functions used within an emlog Star requhes the anomalous data be analyzed. There are a special group of emlog Stars whose sole purpose is to analyze anomalous data and associate computer programs with that data. These specific emlog Stars are called Curiosity Analogues (culogs). culogs differ from emlogs in that they are groups of programs that only assist in the creation of new emlogs. That is to say the only way to transition to the culog space is when an emlog bud needs to be created (See FIGs. 166, 167 and 168). If the culog stars that exist cannot resolve the data such that a new emlog Star can be created, then this is an indication that additional culog stars need to be created. There is an additional type of emlog Star called a Sentient Analogue (selog). Like the culog Stars the selog stars are only called at special times. The selog stars are only called when all culog stars have been used, and they are used to create additional culogs Stars. It should be noted that the algorithms/programs that are represented by the emlog star-like objects still send and receive data as expected of any algorithm program. It is only the higher order use of that data to specify additional algorithm/programs and ttansition to or create new emlog stars that is different. FIG. 166 shows one exemplary emlog star ttansition map 3420 illustrating a Linear translation from one emlog star to the next, using the output data generated by each current emlog star to ttansition to the next emlog star. FIG. 167 shows a culog star ttansition map 3440 illustrating how the analysis of one culog star allows for the ttansition to the next portion of the analysis. FIG. 168 shows an emlog star budding process 3460 illustrating how, with the assistance of the culog stars, a new emlog star can be generated. It now becomes possible to combine the reactions of multiple systems using emlog star state ttansition maps.

Merging Compute Systems Using Emlog State Transition Maps Two emlog state transition maps are combined by fhst copying the set of each machine onto every other machine as an all-to-all exchange. Each machine merges each set into one master set by deleting all duplicate emlog state ttansition maps. For non-duplicated emlog stars, some of theh associated priority vectors may be bound to a program/algorithm that is not within the merging machine. This can be used to determine which computer programs need to be shared or which of the merged compute systems should process the data. Selection of the proper machine requhes the addition of a machine name field to the emlog data, Table 66.

Algorithm Machine Require Max System Name Completion CH) (Bo) (M_P) Priority Name Time (Et) Value Flag

Table 66. Algorithm Specific Emlog Data with Machine Name

Using a machine name attached to an algorithm eliminates the need to move codes around.

Use of Data Patterns to Select an Emotional Analogue State Returning to the emotional resource response story and look at the fhst two entries, it gives: Emotional State 1: Calm, Hungry An elk in a valley feels hunger, and there are no overriding requhements. Computational resource allocation might be: 80% to eating, 15% to general observing of envhonment, and 5% to general physical activity. Emotional State 2: Wary, Hungry The elk sees something in the distance - is it safe? Computational resource allocation: 65% to eating, 15% to general observing of envhonment, 10% to threat identification, 10% to general physical activity (more head/eye, ear movement). The question becomes/how did the elk transition from emotional state 1 to 2?' The trigger occurs when the elk's general observations notice something that has not been quantified (something new in the distance). The reaction is a kind of curiosity to determine what it has detected. Curiosity in this instance can be described as The detection and attempted resolution of anomalous data. In this case the anomalous data causes a new program, threat identification, to be called. Having this new program causes the emotional state transition. Even though the input data stream from the elk's external sensors remains the same, there is now a new data stream component: the output of the general observing program which added the anomalous detected data, thus changing how the input stream should be utilized. The same effect occurs from the emotional state 2 to 3 transition: Emotional State 2: Wary, Hungry The elk sees something in the distance - is it safe? Computational resource allocation: 65% to eating, 15% to general observing of envhonment, 10% to threat identification, 10% to general physical activity head/eye, ear movement). Emotional State 3: Concern, Hungry The elk identifies object as a threat. Computational resource allocation: 40% to eating, 20% to general observing, 20% to predicting threat level, 15% to general physical activity, 5% prepare for flight or fight response. Again, a program generates additional data; in this case, the threat identification program generates threat detected data which causes the system state to change. It does not matter if new sensor data is detected or if the equivalent of new sensor data (new output from an internal process) occurs. Rapid reallocation of programs and program resources can now occur fast enough to meet the needs of a changing envhonment. In general, the sensor type and the amount of data are used to access an emlog. For instance, the "Calm/Hungry" emlog might be identified as number 0110 with the following characteristics:

Sensor Group Name: Calm/Hungry Em tog Star #0110 Vector Tab] e Sensor Name Min Dataset Size j Max Dataset Size Eating 10 MB 20 MB General Observing 30 MB 80 MB General Physical 300 KB 600 KB Table 67. Emlog Vector Table Example

The sensor group name is the name of the current grouping of sensors used to vector to the emlog Star. The sensor name is a list of programs/algorithms/detection devices which are used to generate the vector. The Min Dataset Size and Max Dataset Size define the minimum and maximum amount of data generated by the current sensor name. The emlog star number is the vector to the correct emotional analogue state. Below is an example of the Data to Emlog Vector Table with values. Sensor Group Name: Wary/Hungry Emlog Star # 0111 Vector Table Max Dataset Sensor Name Min Dataset Size Size Eating 10 MB 20 MB General Observing 30 KB 80 KB Anomalous Data General Physical 300 KB 600 KB Table 68. Adding Sensor Group to Emlog

The above example allows the system to ttansition from Calm/Hungry to Wary/Hungry. Using the data to perform the primary state transitions allows for a wider range of responses. When a new emlog Star is created, an entty is made to both its data to emlog vector table and its culog storage tables (see CMriosity Analogue Storage Definition, below). This allows the system to grow and learn over time.

Cariosity Analogue Storage Definition A culog storage is a set of tables that contains information on what is known, as well as what remains unknown about data objects. This includes output balance information, data type incompatibility information, dataset size incompatibility, and correlation incompatibility.

Object Output Data Type Dataset Size Correlation Forward Backward Name Balance Incompatibility Incompatibility Incompatibility Pointer Pointer Pointer Pointer Pointer Pointer

Table 69. Main Culog Template

Entry Mean Standard Coordinate Coordinate Data Forward Backward Point Balance Deviation Value Value Amount for Pointer Pointer Coordinates

Table 70. Output Balance Template

Table 71. Data Type Incompatibility Template

Entry Mean Standard Coordinate Coordinate Data Set Forward Backward Point Balance Deviation Size Size Size for Pointer Pointer Coordinates Table 72. Data Set Size Template

Entty Componen Component Componen Component Forward Backward Point t Coordinate t Description Pointer Pointer Coordinate Size Correlatio Pointer Size n Value

Table 73. Correlation Incompatibility Template The Templates shown above are only representative of those requhed in a full system.

Sleep-like System Requirements In order to keep down the number of excess emlog stars, theh associated datasets, and theh dataset templates, it is necessary to prune those objects. The pruning of the emlog stars takes place when the system is quieted. The dataset attributes of the newly budded emlog stars are treated, once again, as tupels in the relational database sense. A database query is conducted against all old emlog stars. If a match is found a transition link is formed between where the newly budded emlog star is connected and the old emlog star. Then the newly budded emlog star is deleted. This is repeated for all emlog stars budded after the last reconciliation and for any newly created inlog, culog, or selog stars. FIGs. 169 through 178 are flow charts for one exemplary process 4000 and nine sub-processes

5000, 5020, 5040, 5060, 5080, 5100, 5120, 5140 and 5160 illustrating communication between nodes of a parallel processing envhonment. In step 4002, process 4000 invoked sub-process 5000, FIG. 170, to initialize home nodes and compute nodes of a parallel processing envhonment (e.g., parallel processing envhonment 160, FIG. 4). In one example of step 4002, remote host 162 initializes parallel processor 164. In step 4004, a problem-set and a data description is distributed to home nodes from the remote host. In one example of step 4004, remote host 162 sends a problem-set and a data description to home node 182 of cascade 180, FIG. 5. In step 4006, process 4000 invokes sub-process 5020 to verify that the problem-set and data description are valid, terminating process 4000 if an error is detected. In one example of step 4006, home node 182 analyzes the problem-set and data description of step 4004 to ensure all functions are loaded onto compute nodes and that the data description is consistent with the functions specified in the problem-set. In step 4008, process 4000 invokes sub-process 5040 to determine cascade size based upon input/output type requhements of the problem-set and data description. In one example of step 4008, home node 182 determines that a type HI input/output is necessary. In step 4010, process 4000 invoked sub-process 5060 to distribute the problem-set and data description from the home nodes to top level compute nodes of the parallel processing envhonment. In one example of step 4010, home node 182 distributes the problem set of step 4004 to top level nodes 184(1), 184(2) and 184(3) of cascade 180. In step 4012, process 4000 invokes sub-process 5080 to distribute the problem-set and data description to lower level compute nodes. In one example of step 4012, using cascade 180 of FIG. 5, compute nodes 184(1) and 184(2) distribute the problem-set and data description to compute nodes 184(4), 184(5) and 184(6), and compute node 184(5) distributes the problem-set and data description to compute node 184(7). In step 4014, process 4000 invokes sub-process 5100 to process the problem-set and data description on compute nodes, exchanging data between compute nodes as necessary. In one example of step 4014, compute nodes 184 of cascade 180 process the problem-set and data description of step 4004, performing an all-to-all full data set exchange. In step 4016, process 4000 invokes sub-process 5160 to agglomerate results from the compute nodes. In one example of step 4016, compute nodes 184 agglomerate results back to home node 182, which then sends the results to remote host 162. FIG. 170 shows a flow chart illustrating one exemplary sub-process 5000 for initializing home and compute nodes within a parallel processing envhonment. In step 5002, sub-process 5000 connects compute nodes of the parallel processing envhonment together to form a network with a low hop count. In one example of step 5002, a network switch is configured such that compute nodes are organized within a network allowing a remote host to access each compute node in as few hops as possible (i.e., communication does not unnecessarily pass through nodes for which it is not destined). In step 5004, sub-process 5000 loads compute node controller software onto each compute node within the parallel processing envhonment. In one example of step 5004, the remote host loads compute node controller software onto each compute node using the connections configured in step 5002. In step 5006, sub-process 5000 connects home nodes as a flat network of nodes. In one example of step 5006, a network switch is programmed to connect home nodes of the parallel processing envhonment into a flat topology. This step may be performed for both manifold and hyper- manifold for connection of home nodes. In step 5008, sub-process 5000 loads home node controller software onto the home nodes. In one example of step 5008, the remote host loads home node controller software onto each home node using the network configured in step 5006. In step 5010, sub-process 5000 loads initialization files from the remote host onto each home node and compute node. In one example of step 5010, the remote host sends an initialization file to each home node and to each compute node. This initialization file specifies port numbers for the gateway computer, home node computers and compute node computers, thereby defining part of the parallel processing envhonment communication protocol. For example, the initialization file may specify shorted TCP/IP addresses for compute nodes and home nodes, specific TCP/IP addresses for home nodes and compute nodes, indicating the number of threads running on each machine. In step 5012, sub-process 5000 sends a problem-set and a data description to each home node from the remote host. In one example of step 5012, the remote host sends a problem-set and data description to each home node using the networks connections of steps 5002 and 5006. Sub-process 5000 then terminate, returning conttol to process 4000. FIG. 171 is a flow chart illustrating one sub-process 5020 for checking the validity of the problem-set and data description. Step 5022 is a decision. If, in step 5022, sub-process 5020 determines that all functions requhed by the problem-set and data description are loaded onto the compute nodes, sub-process 5020 continues with step 5024; otherwise sub-process 5020 continues with step 5026. In one example of step 5022, home node 182 analyzes the problem-set and data description, received in step 4004 of process 4000, and generates a list of functions requhed by the problem-set and compares this list to a list of functions pre-loaded into each compute node 184. If each function in the list is pre-loaded on each compute node 184, the home node continues with step 5024; otherwise the home node continues with step 5026. Step 5024 is a decision. If, in step 5024, sub-process 5020 determines an error exists in the data description, received in step 4004 of process 4000, sub-process 5020 continues with step 5026; otherwise sub-process 5020 terminate and return control to process 4000. In one example of step 5024, home node 182 determines that the data description is not suited to the problem-set, and therefore continues with step 5026 to report the error. In step 5026, sub-process 5020 generates an error code and sends this error code to the remote host from the home node. In one example of step 5026, home node 182, FIG. 5, generates an error code indicating an error in the problem-set and/or data description (i.e., that certain functions requhed by the problem-set are not pre-loaded within compute nodes 184), and sends the error code to the remote host. Process 5000 then terminates, returning control to process 4000, which also terminates since an error occurred. FIG. 172 is a flow chart illustrating one sub-process 5040 for determining cascade size based upon the input/output type requhed by the problem-set and data description. Sub-process 5040 is implemented within home nodes of a parallel processing envhonment. Step 5042 is a decision. If, in step 5042, sub-process 5040 determines that type I or type Ila input/output is requhed, sub-process 5040 continues with step 5044; otherwise sub-process 5040 continues with step 5046. In one example of step 5042, home node 182 determines that the problem- set, received in step 4004 of process 4000, requhes type Ila input/output, and therefore continues with step 5044. In step 5044, process 5040 uses pre-computed load balancing information, stored within the home node, to calculate the size of the cascade requhed to process the problem-set received in step 4004. In one example of step 5044, home node 182 determines that seven compute nodes are requhed to process the problem-set. Process 5040 then terminate returning control to process 4000. In step 5046, sub-process 5040 uses pre-computed load balancing information, stored on the home node, to calculate the size of the cascade requhed to process the problem-set received in step

4004. In one example of step 5046, home node 182 determines that seven compute nodes are requhed to process the problem-set. Process 5040 then terminates returning conttol to process 4000. FIG. 173 is a flow chart illustrating one sub-process 5060 for distributing the problem-set and data description to top level compute nodes from the home node. Sub-process 5060 is implemented by each home node within the parallel processing envhonment, for example. In step 5062, sub-process 5060 computes data pointers for top level compute nodes. In one example of step 5062, home node 182, FIG. 5, computes data pointers for top level compute nodes 184(1), 184(2) and 184(3). In step 5064, sub-process 5060 sends data description messages from home node to top level compute nodes. In one example of step 5064, home node 182 sends data description messages, including the pointers determined in step 5062, to compute nodes 184(1), 184(2) and 184(3). Sub- process 5060 then terminates, returning control to process 4000. FIG. 174 is a flow chart illustrating one exemplary sub-process 5080 for distributing problem- set and data description to lower compute nodes. In step 5082, sub-process 5080 computes data pointers for lower tier compute nodes. In one example of step 5082, compute nodes 184(1) and 184(2) compute data pointers for compute nodes 184(4), 184(5) and 184(6), thereby indicating data areas allocated to those lower tier compute nodes. In step 5084, sub-process 5080 sends data descriptions message from upper level compute nodes to lower level compute nodes. In one example of step 5084, compute node 184(1) first sends a data description message, containing data pointers determined in step 5082 for compute node 184(4), to compute node 184(4) and then sends a data description message, containing data pointers determined in step 5082 for compute node 184(5), to compute node 184(5). Similarly, compute node 184(2) sends a data description message, containing data pointers determined in step 5082 for compute node 184(6), to compute node 184(6). Steps 5082 and 5084 repeat until the data description is distributed to all compute nodes within the cascade. For example, compute node 184(5) compute data pointers for compute node 184(7) in step 5082 and then sends data description message, including these data pointers, to compute node 184(7) in step 5084. Sub-process 5080 then terminate, returning conttol to process 4000. FIG. 175 is a flow chart illustrating one exemplary sub-process 5100 for processing the problem-set and data description on compute nodes and exchanging data if necessary. In step 5102, sub-process 5100 processes the problem-set and data description. In one example of step 5102, compute nodes 184 process the problem-set and data description until complete, or until a data exchange is requhed. Step 5104 is a decision. If, in step 5104, sub-process 5100 determines that a data exchange is requhed, sub-process 5100 continues with step 5106; otherwise sub-process 5100 terminates as all processing is complete. In one example of step 5104, compute nodes 184 reach a point where a data exchange is requhed between all nodes before further processing can continues, and therefore each compute node continues with step 5106. Step 5106 is a decision. If, in step 5106, sub-process 5100 determines than an all-to-all data exchange is requhed, sub-process 5100 continues with step 5108; otherwise sub-process 5100 continues with step 5110. In one example of step 5108, compute nodes 184, based upon the problem- set, determine that an all-to-all data exchange is requhed and continue with step 5108. In step 5108, sub-process 5100 invokes sub-process 5120 to perform an all-to-all data exchange. In one example of step 5108, compute nodes 184 invoke sub-process 5120 to perform an all- to-all full data set exchange. Sub-process 5100 then continues with step 5102 to continue processing the problem-set and data description, including exchanged data. In step 5110, sub-process 5100 invokes sub-process 5140 to perform a next neighbor exchange. In one example of step 5110, compute nodes 184 invoke sub-process 5140 to perform a 3-D partial data exchange. Sub-process 5100 then continues with step 5102 to continue processing the problem-set and data description, including exchanged data. FIG. 176 is a flowchart illustrating one exemplary sub-process 5120 for performing an all-to- all exchange. Step 5122 is a decision. If, in step 5122, sub-process 5120 determines that a full dataset exchange is requhed, sub-process 5120 continues with step 5124; otherwise sub-process continues with step 5126. In step 5124, sub-process 5120 performs an bi-directional all-to-all exchange using an (n- l)/bv timestep model. For example, see Equation 57 titled All-to-All Exchange Steps for Full Duplex Channel. Sub-process 5120 then terminate, returning control to sub-process 5100. In step 5126, sub-process 5120 performs a bi-directional all-to-all exchange using (n-l)/bv timestep model. For example, see Equation 65 titled Full-Duplex General Cascade All-to-All Exchange. Sub-process 5120 then terminate, returning conttol to sub-process 5100. FIG. 177 is a flow chart illustrating one exemplary sub-process 5140 for performing a next neighbor exchange. Step 5142 is a decision. If, in step 5142, sub-process 5140 determines that a 3-D exchange is requhed, sub-process 5140 continues with step 5144' otherwise sub-process 5140 continues with step 5150. Step 5144 is a decision. If, in step 5144, sub-process 5140 determines that a partial dataset exchange is requhed, sub-process 5140 continues with step 5146; otherwise sub-process 5140 continues with step 5148. In step 5146, sub-process 5140 defines a 26 cell 3-D region and performs an exchange using

(26+λ)/v exchanges with some data in each compute node being exchanged. In one example of step 5146, a 26 cell 3-D region is defined and an exchange using (26+λ)/v exchanges, with some data in each node being exchanged. For example, see Equation 98 titled 3-D Nearest-Neighbor Exchange Time for Planar Exchange Method. Process 5140 then terminates returning control to sub-process 5120. In step 5148, sub-process 5140 defines a 26 cell 3-D region and performs an exchange using

(26)/v exchanges with all data in each compute node being exchanged. In one example of step 5148, a 26 cell 3-D region is defined and an exchange using (26)/v exchanges, with all data in each node being exchanged. . For example, see Equation 98 titled 3-D Nearest-Neighbor Exchange Time for Planar Exchange Method. Process 5140 then terminates returning control to sub-process 5120. Step 5150 is a decision. If, in step 5150, sub-process 5140 determines that a partial dataset exchange is requhed, sub-process 5140 continues with step 5152; otherwise sub-process 5140 continues with step 5154. In step 5152, sub-process 5140 defines an 8 cell 2-D region and performs an exchange using (4)/v exchanges, with some data in each node being exchanged. For example, see Equation 94 titled Pah- wise Nearest Neighbor Exchange Time. Process 5140 then terminates returning control to sub- process 5120. In step 5154, sub-process 5140 defines an 8 cell 2-D region and performs an exchange using (4)/v exchanges, with all data in each node being exchanged. For example, see Equation 94 titled Pah- wise Nearest Neighbor Exchange Time. Process 5140 then terminates returning conttol to sub-process 5120. FIG. 178 is a flow chart illustrating one sub-process 5160 for agglomerating results. Sub- process 5160 is invoked in step 4016 of process 4000. In step 5162, sub-process 5160 determines a next highest level of communication for a current node. In one example of step 5162, compute node 224(13), FIG. 7, determines that compute node 224(13) is a next highest level of communication. Step 5164 is a decision. If, in step 5164, sub-process 5160 determines that the level above is a compute node, sub-process 5160 continues with step 5166; otherwise sub-process 5160 continues with step 5168. In step 5166, sub-process 5160 sends an agglomeration message to the compute node at the level above as determined in step 5162. In one example of step 5166, compute node 224(15) sends an agglomeration message containing result data of compute node 224(15) to compute node 224(13). Sub- process 5160 then terminates, returning conttol to process 4000. Step 5168 is a decision. If, in step 5168, sub-process 5160 determines that the level above the current node is a home node, sub-process 5160 continues with step 5170; otherwise sub-process 5160 terminates, returning control to process 4000. In step 5170, sub-process 5160 sends an agglomeration message to an associated home node. In one example of step 5170, compute node 224(2), FIG. 7, sends an agglomeration message containing result data to home node 222. In another example of step 5170, compute node 1044(7), FIG. 49, sends an agglomeration message containing result data to home node 1042. Sub-process 1560 then terminates, returning conttol to process 4000. FIG. 179 is a flow chart illustrating one process 5200 for increasing information content within an electrical signal. In step 5202, process 5200 determines a number of bits requhed per waveform. In one example of step 5202, if a waveform of a signal has a period of 1 μ-second and contains one bit of data (as contained in a conventional digital waveform), the signal has a bandwidth of 1 M-bit (i.e., the clock rate is 1 MHz). If the signal is requhed to have a bandwidth of 12 M-bits and the clock rate cannot be increased, each waveform conveys twelve bits. In step 5204, process 5200 determines the number of ttansitions requhed for the number of bits using: Trans = P₁log₂(l/Pι)+..+P_nlog₂(l/P_n)- In one example of step 5204, process 5600, FIG. 186, is used to determine the number of ttansitions requhed for the number of bits determined in step 5202. In step 5206, process 5200 distributes the transitions across the ten primary waveform regions. In one example of step 5206, the transitions, determined in step 5204, are disttibuted across primary waveform regions 2040, 2042, 2044, 2406, 2048, 2050, 2052, 2054, 2056 and 2058. In step 5208, a signal is encoded by varying each waveform region identified in step 5206. FIGs.180, 181 and 182 show a flow chart illustrating one exemplary process 5300 for increasing effective memory capacity within a node. In step 5302, process 5300 ttansmits an algorithm requhed precision to one or more home nodes, and these home nodes transmit the algorithm requhed precision to compute nodes. In one example of step 5302, remote host 162, FIG. 4, transmits an algorithm requhed precision to home node 182, FIG. 5, and home node 182 transmits the algorithm requhed precision to compute nodes 184. Step 5304 is a decision. If, in step 5304, process 5300 determines that a loss-less compression is requhed, process 5300 continues with step 5306; otherwise process 5300 continues with step 5308. Step 5306 is a decision. If, in step 5306, process 5300 determines that the hardware has a L2 and/or a L3 cache, process 5300 continues with step 5312; otherwise process 5300 continues with step 5310. Step 5310 is a decision. If, in step 5310, process 5300 determines that the hardware has a Ll cache, then process 5300 continues with step 5312; otherwise process 5300 continues with step 5320. Step 5312 is a decision. If, in step 5312, process 5300 determines that a Huffman compression is to be used, process 5300 continues with step 5340; otherwise process 5300 continues with step 5316. Step 5316 is a decision. If, in step 5316, process 5300 determines that an arithmetic compression is requhed, process 5300 continues with step 5350; otherwise process 5300 continues with step 5318. Step 5318 is a decision. If, in step 5318, process 5300 determines that a dictionary compression is requhed, process 5300 continues with step 5360; otherwise process 5300 continues with step 5320. In step 5320, process 5300 creates an error code to indicate an error has occurred and sends the error code to the home node. Process 5300 then terminates. Step 5308 is a decision. If, in step 5308, process 5300 determines that no compression is requhed, process 5300 terminates; otherwise process 5300 continues with step 5320 to inform the home node of an error. In step 5330, process 5300 receives the algorithm requhed precision in the compute nodes from the home node message. In step 5332, process 5300 computes the lossy quanta size requhed. In one example of step 5332, compute node 184(1) computes a quanta size based upon a message from home node 182 that includes the algorithm requhed precision. In step 5334, process 5300 creates a pointer to a data area that contains the dataset. In step 5336, process 5300 compresses data in the data area based upon the lossy quanta size determined in step 5332. In one example of step 5336, compute node 184(1) compresses a dataset indicated by the pointer of step 5334 and based upon the quanta size determined in step 5332. In step 5338, process 5300 saves the position of the compressed and uncompressed data locations in an address conversion table. In one example of step 5338, compute node 184(1) stores compressed and uncompressed data locations in address conversion table 1962, FIG. 93. Process 5300 then terminates. In step 5340, process 5300 computes a Huffman distribution free. In one example of step 5340, compute node 184(1) generates a Huffman code table 37. In step 5342, process 5300 compresses data using Huffman bit encoding schema and the Huffman compression table generated in step 5340. Process 5300 then terminates. In step 5350, process 5300 computes both upper and lower bounds for the data. In one example of step 5350, pattern'XXXY'is compressed to determine a lower bound of 0.71874 and an upper bound of 0.7599 as described in Table 39. Arithmetic Compression Table. In step 5352, process 5300 constructs a compression table using both the bounding data and the compression characters. In one example of step 5352, compute node 184(1) constructs arithmetic compression table 39. In step 5354, process 5300 saves the position of both compressed and uncompressed data locations in the arithmetic compression table. In one example of step 5354, compute node 184(1) stores compressed and uncompressed data locations in Table 40. Example Arithmetic Address Conversion Table. In step 5356, the compressed data is saved. In one example of step 5356, compute node 184(1) saves compressed data to RAM 1992, FIG. 94. Process 5300 then terminates. In step 5360, process 5300 constructs a dictionary table. In one example of step 5360, compute node 184(1) constructs LZW dictionary table 41 based upon input stting'ABAXABBAXABB'. In step 5362, process 5300 uses the dictionary table of step 5360 to compress data from the dataset. In step 5364, process 5300 saves the position of both compressed and uncompressed data locations in the dictionary compression table of step 5360. Process 5300 then continues with step 5356 to save the compressed data. FIG. 183 is a flow chart illustrating one process 5400 for improving parallel processing performance by optimizing communication time. FIG. 100 and description thereof introduced total communication time, t_c, priming time, £ and processing time t_p. In step 5402, process 5400 gets the exposed latency time t from a home node. In step 5404, process 5400 calculates priming time based upon t = D/bv. In step 5406, process 5400 calculates algorithm processing time t_p. In step 5408, process 5400 calculates non-algorithm processing time t_n. In step 5410, process 5400 calculates data drain time t_d = Dd/bv. In step 5412, process 5400 calculates total procedure time T = t + 1 + t_p + t + t_n. In step 5414, process 5400 calculates total communication time ^ = γD_a/αβ. Step 5416 is a decision. If, in step 5416, process 5400 determines t_c-t is greater than t_p then alpha phase is achieved and process 5400 continues with step 5418; otherwise process 5400 terminates. In step 5418, process 5400 increases D_a by increasing the number of fhst level cascade or manifold communications. Step 5420 is a decision. If, in step 5420, process 5400 determines that fhst level communication is at a maximum, process 5400 terminates; otherwise process 5400 continues with step 5416. Steps 5416 through 5420 thus repeat until first level communications are at a maximum. FIGs. 184 and 185 show a flow chart illustrating one process 5500 for comparing two parallel exchange models using Exchange Entropy Metrics. In step 5502, process 5500 calculates a number of node-to-node steps requhed by first and second exchange models. In step 5504, process 5500 calculates a number of communication channels used by fhst and second exchange models. In step 5506, process 5500 calculates a number of communication channels left unused by fhst and second exchange models. In step 5508, process 5500 determines selected algorithm requhed data movement. In step 5510, process 5500 determines exchange model required data movement. In step 5512, process 5500 determines: α = used channels / number of exchange steps (1^st exchange model) and of = used channels / number of exchange steps (2^nd exchange model). In step 5514, process 5500 determines: β = unused channels / number of exchange steps (1^st exchange model) and β = unused channels / number of exchange steps (2^nd exchange model). In step 5516, process 5500 determines γ = exchange requhed data amount/algorithm requhed data amount (1^st exchange model) and f = exchange requhed data amount algorithm requhed data amount (2^nd exchange model) . Step 5518 is a decision. If, in step 5518, process 5500 determines that α > of then process 5500 continues with step 5520; otherwise process 5500 continues with step 5524. Step 5520 is a decision. If, in step 5520, process 5500 determines that β > β\ then process 5520 continues with step 5522; otherwise process 5500 continues with step 5524. Step 5522 is a decision. If, in step 5522, γ > f, then process 5500 continues with step 5526; otherwise process 5500 continues with step 5524. In step 5524, process 5500 displays 'SECOND EXCHANGE MODEM MAY BE BETTER THAN FIRST EXCHANGE MODEL'. Process 5500 then terminates. In step 5526, process 5500 displays'FJRST EXCHANGE MODEL IS BETTER THAN THE SECOND EXCHANGE MODEL". Process 5500 then terminates. FIG. 186 is a flowchart illustrating one process 5600 for determining information carrying capacity using Shannon's equation. In step 5602, process 5600 inputs the number of transitions into variable T. In one example of step 5602, an end-user specified the number of ttansitions as 12 and T is set equal to 12. In step 5604, a variable Sum is set equal to zero. Step 5606 is a decision. If, in step 5606, T = 0, process 5600 continues with step 5612; otherwise process 5600 continues with step 5608. In step 5608, process 5600 computes a new value for variable sum by adding log₂(l/T)/T. In one example of step 5608, if T is 12, log₂(lll2)IV2 is computed and added to variable sum. In step 5610, T is decremented by one. Process 5600 then continues with step 5606. In step 5612, process returns the value stored in variable sum as the information carrying capacity. In one example of step 5612, the value of variable sum is returned to the end-user. Changes may be made in the above methods and systems without departing from the scope hereof. It should thus be noted that the matter contained in the above description or shown in the accompanying drawings should be interpreted as illustrative and not in a limiting sense. The following claims are intended to cover all generic and specific features described herein, as well as all statements of the scope of the present method and system, which, as a matter of language, might be said to fall there between.

Claims

CLAIMS What is claimed is:

1. A method for inputting a problem-set to a parallel processing envhonment based upon

Howard cascades, comprising: receiving the problem-set at a first home node of the parallel processing envhonment; distributing the problem-set from the fhst home node to one or more other home nodes of the parallel processing envhonment; and distributing the problem-set from the first home node and the other home nodes to a plurality of compute nodes of the parallel processing envhonment.

2. The method of claim 1, the step of distributing comprising distributing the problem-set from the first home nodes to the other home nodes using type I input.

3. The method of claim 2, the step of distributing comprising distributing the problem-set from the first home nodes and the other home nodes to the plurality of compute nodes using type I input.

4. The method of claim 1, further comprising outputting results from the parallel processing envhonment, comprising: agglomerating results from a plurality of compute nodes of the parallel processing envhonment to a plurality of home nodes of the parallel processing envhonment; agglomerating results from the plurality of home nodes to a conttol node of the plurality of home nodes; and outputting the results from the conttol node to a remote host.

5. The method of claim 4, the step of agglomerating comprising agglomerating results from the plurality of compute nodes to the plurality of home nodes using Type I output.

6. The method of claim 5, further comprising agglomerating data from the plurality of home nodes to a control node of plurality of home nodes using Type I output.

7. The method of claim 4, the steps of agglomerating comprising agglomerating results from the plurality of compute nodes to the plurality of home nodes using Type II output and agglomerating results from the plurality of home nodes to the conttol node using Type II output.

8. The method of claim 4, the steps of agglomerating comprising agglomerating results from the plurality of compute nodes to the plurality of home nodes using Type UI output and agglomerating results from the plurality of home nodes to the control node using Type HI output.

9. A method for performing a partial data-set all-to-all exchange between a plurality of compute nodes of a parallel processing envhonment based upon Howard cascades, comprising: creating first and second identical lists of unique identifiers for the compute nodes, wherein the identifiers are organized in ascending order within the first and second lists; appending an identifier for a home node of the parallel processing envhonment to the fhst and second lists, if the number of compute nodes is odd; applying a first pattern to the fhst list to identify one or more first node pahs and exchanging data between each first node pah; applying, if the number of compute nodes is odd, a second pattern to the second list to identify one or more second node pahs and exchanging data between each second node pah; applying, if the number of compute nodes is even, a second pattern to the second list to identify one or more thhd node pahs and exchanging data between all but the last node pah of the thhd node pahs; applying, if the number of compute nodes is odd, the second pattern to the second list to identify one or more fourth node pahs and exchanging data in a reverse direction between each fourth node pah; applying, if the number of compute nodes is even, the second pattern to the second list to identify one or more fifth node pahs and exchanging data in a reverse dhection between all but the last node pah of the fifth node pahs; applying the fhst pattern to the first list to identify one or more sixth node pahs and exchanging data in a reverse dhection between each node pah of the sixth node pahs; shifting all but the last entty in the first list up by one and inserting the identifier moved from the fhst entty in the last but one entry; shifting all but the last entry in the second list down by one and inserting the identifier moved from the last but one entty into the first entty; and repeating the steps of applying and shifting until all data is exchanged.

10. The method of claim 9, the step of repeating comprising: determining an integer value by dividing the number of compute nodes by two and ignoring any remainder; and repeating the steps of applying and shifting value number of times.

11. A manifold, comprising: a plurality of home nodes organized as a cascade; wherein each home node forms a cascade with one or more compute nodes.

12. The manifold of claim 11 configured as a hyper-manifold, comprising: the plurality of home nodes organized as a level-2, or higher, cascade; wherein each communication channel of each home node forms a cascade with a plurality of compute nodes.

13. The manifold of claim 12, wherein each communication channel is fully consumed during input and output data exchanges.

14. A method for generating a hyper-manifold, comprising: generating a cascade of home nodes from each communication channel of a fhst home node; for each additional home node level, generate a cascade of home nodes from each communication channel of each home node of the last home node level generated; generating a cascade group from each communication channel of each home node.

15. A method for forming a cascade, comprising: using a network generator to generate a pattern of nodes; and converting the pattern to a tree structure of interconnected nodes, to form the cascade.

16. The method of claim 15, the network generator utilizing a numerical series to generate the pattern.

17. The method of claim 15, further comprising forming a manifold using the network generator, comprising: generating a cascade of generators; generating a cascade of compute nodes from each generator.

18. The method of claim 15, further comprising generating a hyper-manifold using the network generator, comprising: generating a cascade of first generators; generating a cascade of next generators from each generator of the fhst generators; generating a cascade of compute nodes from each generator of the first generators and next generators; and replacing each generator with a home node.

19. A method for all-to-all communication within a manifold having two or more cascade groups, comprising: exchanging data between compute nodes within each cascade group; exchanging data between each cascade group; and exchanging data between top-level nodes of the manifold.

20. The method of claim 19, wherein the manifold is a hyper-manifold, and further comprising: exchanging data between nodes at each level of the hyper-manifold.

21. The method of claim 19, further comprising: converting an algorithm for execution on the compute nodes from alpha-phase to beta-phase by overlapping communication time with processing time; and wherein the step of converting masks communication time and improves communication performance during all-to-all data exchanges.

22. A method for next-neighbor data exchange between compute nodes of a cascade, comprising: utilizing a neighbor stencil for each element in a dataset allocated to a fhst compute node to identify nearest neighbor elements in a dataset allocated to other compute nodes; and exchanging data with the other compute nodes to receive the nearest neighbor elements.

23. The method of claim 22, further comprising adding communication channels to each compute node to lower time for a nearest neighbor exchange.

24. The method of claim 22, the communication channels comprising full-duplex communication channels.

25. The method of claim 22, wherein the data exchange is one of a 2_D data exchange, a 3_D data exchange, a red-black data exchange, and a left-right data exchange.

26. The method of claim 22, comprising: disttibuting ghost cells with a dataset for each compute node such that additional data requhed by the compute node resides with the compute node.

27. A processor for increasing memory capacity of a memory by compressing data, comprising: a plurality of registers; a level 1 cache; and a compression engine located between the registers and the level 1 cache; wherein the registers contain uncompressed data and the level 1 cache contains compressed data such that compressed data is written to the memory.

28. The processor of claim 27, further comprising a bitmap for locating lines within the Ll cache.

29. The processor of claim 27, the compression engine comprising one of a lossless compression algorithm and a lossy compression algorithm.

30. The processor of claim 29, the lossless compression algorithm comprising one of Huffman compression, arithmetic compression and dictionary compression.

31. The processor of claim 29, the lossy compression algorithm comprising a quantization algorithm.

32. A method for forming a parallel processing envhonment, comprising: organizing first compute nodes of the envhonment as a cascade; organizing home nodes of the envhonment as a manifold; and organizing an algorithm for processing by the fhst compute nodes so that the fhst compute nodes concurrently cross-communicate while executing the algorithm, such that additional compute nodes added to the envhonment improve performance over performance attained by the fhst compute nodes.

33. A method for parallelizing an algorithm for execution on a cascade, comprising: decomposing the algorithm into uncoupled functional components; and processing the uncoupled functional components on one or more compute nodes of the cascade.

34. A method for balancing work between multiple controllers within a hyper-manifold, comprising: organizing home nodes as a hyper-manifold; performing an alternating all-to-all exchange on level-2 nodes of the hyper-manifold; performing an alternating all-to-all exchange on level- 1 nodes of the hyper-manifold; and using any level- 1 home node as a controller.

35. A method for reducing communication latency within a compute node of a cascade, comprising: using a fhst processing thread to handle asynchronous input to the compute node; using a second processing thread to process a job of the cascade; and using a thhd processing thread to handle asynchronous output from the compute node.

36. The method of claim 35, further comprising utilizing dhect memory access hardware to handle input to the compute node and output from the compute node.

37. A method for checkpointing a hyper-manifold, comprising: performing an all-to-all exchange of checkpoint data; and storing, at each compute node of the hyper-manifold, checkpoint data of all other compute nodes.

38. The method of claim 37, wherein the all-to-all exchange is performed concurrently with processing of a job.

39. The method of claim 38, further comprising recovering from failure of a node of the hyper- manifold, comprising: detecting a failed node; isolating the failed node from the hyper-manifold; replacing the failed node with a spare node; loading the spare node to a status based upon a most recent checkpoint of the failed node using checkpoint data from a neighbor or parent node; and restarting processing by the hyper-manifold from the most recent checkpoint.

40. The method of claim 38, further comprising recovering from failure of a node of a hyper- manifold, comprising: detecting a failed node; isolating the failed node from the hyper-manifold; reducing size of the hyper-manifold to create one or more spare nodes; replacing the failed node with a spare node; repartitioning work to each node based upon the reduced size of the hyper-manifold; and restarting processing by the hyper-manifold from the most recent checkpoint.

41. The method of claim 37, further comprising dynamically increasing cascade size to increase processing performance of a job, comprising: determining, within a home node, that additional compute nodes may be used; broadcasting a nodal expansion message to all nodes currently allocated to the job; suspending processing of the job; activating additional nodes to create a larger cascade; repartitioning datasets within each compute node of the larger cascade based upon the nodal expansion message; restarting execution of the job using the larger cascade.

42. A method for processing a problem on a cascade, comprising: receiving the problem from a user; creating an input mesh, based upon the problem, to apportion the problem to compute nodes of the cascade; acquhing an input dataset, based upon the input mesh, on each compute node; processing, on each compute node, the input dataset to produce output data; agglomerating the output data from all compute nodes to form an output mesh; and returning the results, based upon the output mesh, to a user.

43. The method of claim 42, further comprising adaptively processing a job on a cascade, comprising: dividing the job into a plurality of small algorithm parts; determining a number of compute nodes to use for each algorithm part; for each algorithm part: sizing the cascade based upon the algorithm part; processing data of the job on the cascade using the algorithm parts; and agglomerating data from the compute nodes of the cascade.

44. The method of claim 43, the step of sizing further comprising repartitioning the data to each compute node of the resized cascade.

45. The method of claim 43, wherein the repartitioning is faciUtated by checkpoint data stored on each compute node of the cascade.

46. The method of claim 42, further comprising zoning a plurality of nodes to form two or more cascades of differing sizes to facilitate job execution efficiency, comprising: constructing a first cascade using a first portion of the plurality of nodes; constructing at lease a second cascade from the remaimng nodes of the plurality of nodes; and allocating a job to the first or second cascade based upon the number of compute nodes requhed for processing the job.

47. The method of claim 42, further comprising interacting with the cascade by: providing process stop/restart capability; providing a flow visualization interface; and providing process flow branch selection capability.

48. The method of claim 47, the step of providing a flow visualization interface comprising: displaying one or more algorithm parts of the job; displaying the number of compute nodes allocated to each algorithm part; displaying a current processing location; estimating and displaying activation time of the next algorithm part; and estimating and displaying time to complete the job.

49. The method of claim 47, the step of providing process stop/restart capability comprising interacting with a user, through use of the flow visuahzation interface, to allow the user to stop execution of the job on the cascade and to restart execution of the job on the cascade.

50. The method of claim 47, further comprising allowing the user to add, delete and change the order of algorithm parts through use of the flow visualization interface.

51. The method of claim 47, further comprising allowing the user to change the number of compute nodes requested for each algorithm part through use of the flow visualization interface.

52. The method of claim 47, wherein processing of the job stops at a decision point to await selection, by a user, of a path with which to continue.

53. The method of claim 52, further comprising allowing, through use of the flow visualization interface, the user to select a path for the job to proceed.

54. The method of claim 47, wherein a decision point is an analysis function such that the cascade makes a decision as to which path to follow when the decision point is reached.

55. The method of claim 47, further comprising determining a path to follow at a decision based upon curiosities.

56. The method of claim 47, further comprising selecting an alternate process flow when a curiosity is detected.

57. The method of claim 56, further comprising storing anomalous data as a Culog.

58. A method for benchmarking problem-set distribution of parallel processing architecture, comprising: determining, if the same code and data is distributed to all nodes, a number of time units requhed to broadcast the code and data to all nodes; and determining, if dissimilar code and data is distributed to each node, a number of time units requhed to send the code and data to each node from a single controller.

59. A method for applying emotional analogues to dynamic resource allocation within a parallel processing envhonment, comprising: predicting completion time for each job running on the parallel processing envhonment; accessing the requhed time frames for each job; increasing a number of compute nodes allocated to a job if the priority of the job increases; and decreasing the number of compute nodes allocated to a job if the priority of the job decreases.

60. The method of claim 59, wherein each job is profiled within the parallel processing envhonment.

61. The method of claim 59, wherein the parallel processing envhonment has global checkpoint capabilities and the ability to increase and decrease the number of compute nodes allocated to a job.

62. A method for applying emotional analogues to dynamic resource allocation within .a parallel processing envhonment, comprising: defining a first processing state of one or more associated jobs within the parallel processing envhonment as a fhst emlog star; defining a second processing state of one or more associated jobs within the parallel processing envhonment as a second emlog star; and transitioning from the fhst emlog star to the second emlog star to ttansition from the fhst processing state to the second processing state.

63. The method of claim 62, further comprising defining one or more additional emlogs to define one or more additional processing states for one or more associated jobs within the parallel processing envhonment.

64. The method of claim 62, further comprising budding new emlog stars from a current emlog star if no emlog star exists for a current processing state.

65. The method of claim 62, wherein the parallel processing envhonment ttansitions from the fhst emlog star to the second emlog star when another job is added to the one or more associated jobs within the parallel processing envhonment.

66. The method of claim 62, wherein the parallel processing envhonment ttansitions from the first emlog star to the second emlog star when one of the associated jobs is deleted from the parallel processing envhonment.

67. The method of clahn 62, wherein each emlog star includes one or more emlogs that define a resource priority indicating a maximum number of compute nodes, a minimum number of computed nodes requhed to meet a specified completion time and a nominal, or starting, number of compute nodes.

68. The method of claim 62, wherein each emlog star transition is represented within an added algorithm emlog star vector table or a delected algorithm vector table.

69. The method of claim 62, further comprising transitioning from the fhst emlog star to a culog star, the culog star initiating algorithms to analyze anomalous data and to assist in creation of a new emlog star.

70. The method of clahn 69, transitioning from the culog star to a second culog star to initiate further algorithms to analyze the anomalous data and to assist in creation of a new emlog star.

71. The method of claim 62, further comprising transitioning from the fhst emlog star to a selog star to create a new culog star, the selog star being utilized after all culog stars have been used.

72. A method for applying emotional analogues to dynamic resource allocation within a parallel processing envhonment, comprising: defining a processing state of one or more associated algorithms within the parallel processing envhonment as an emlog star; predicting completion time for a job running in the parallel processing envhonment; accessing certain time frames of the job when running in the parallel processing envhonment; increasing the number of compute nodes allocated to the job if increased processing power is requhed; and decreasing the number of compute nodes allocated to the job if less processing power is requhed.

73. A method for communication between compute nodes of a parallel processing system, comprising: organizing the compute nodes as a cascade; distributing, using type I input, a problem-set to the compute nodes; processing the problem-set on the compute nodes; and agglomerating, using type I agglomeration, processing results from the compute nodes.

74. The method of claim 73, the step of distributing a problem-set comprising disttibuting function and algorithm code and data to the compute nodes.

75. The method of claim 73, the step of disttibuting a problem-set comprising distributing tags and data to the compute nodes, wherein the tags identify functions and algorithms that are pre-installed on the compute nodes for use in processing the data.

76. A parallel processing envhonment, comprising: a remote host; a plurality of home nodes, each home node having at least one communication channel, the home nodes configured as a manifold; a gateway for interfacing the remote host to one or more of the plurality of home nodes; and a plurahty of compute nodes, each compute node having one communication channel, the compute nodes are configured as a plurality of cascades, each cascade connected to a communication channel of a home node of the manifold; wherein the remote host sends a problem-set to the manifold via the gateway, the problem-set comprising identification of one or more algorithms, the home nodes distributing the problem-set to the cascades, the compute nodes of the cascades process the problem- set and agglomerate data back to the home nodes, and the home nodes agglomerate the data back to a controlling home node of the manifold, and the controlling home node ttansfers the agglomerated result to the remote host via the gateway.

77. The parallel processing envhonment of claim 76, wherein each home node has two or more communication channels, the two or more communication channels facilitating faster distribution of the problem-set and faster agglomeration of the results.

78. The parallel processing envhonment of claim 76, wherein the compute nodes and the home nodes are dynamically reconfigured into manifolds and cascades.

79. The parallel processing envhonment of claim 76, further comprising one or more spare nodes, the spare nodes configurable for operation as home nodes and/or compute nodes to replace a failed home node or compute nodes in the manifold and/or cascades.

80. The parallel processing envhonment of claim 79, wherein the manifold is a hyper-manifold.

81. The parallel processing envhonment of claim 80, the hyper-manifold comprising a fhst level having a plurality of first manifolds, each fhst manifold having a plurality of home nodes, and wherein one home node of each of the fhst manifolds forms a second manifold at a higher level.

82. The parallel processing envhonment of claim 81 , each of the fhst manifolds being equal in size.

83. The parallel processing envhonment of claim 81, the first and second manifolds configured as a cascade of home nodes.

84. The parallel processing envhonment of claim 81 , the hyper-manifold further comprising one or more additional manifold levels.

85. The parallel processing envhonment of claim 81, the hyper-manifold further comprising sub- cascades with a plurality of compute node with two or more communication channels, where each communication channel generates a cascade.

86. A software product comprising instructions, stored on computer-readable media, wherein the insttuctions, when executed by a computer, perform steps for executing a job on a parallel processing envhonment, comprising: insttuctions for implementing any of claims 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74 and 75.