WO2011114539A1

WO2011114539A1 - A programming supporting device and a method for generating routing information for an array of computing nodes

Info

Publication number: WO2011114539A1
Application number: PCT/JP2010/055149
Authority: WO
Inventors: James Awuor Okello
Original assignee: Nec Corporation
Priority date: 2010-03-17
Filing date: 2010-03-17
Publication date: 2011-09-22

Abstract

A device for aiding programming of an array of computing nodes connected to each other through routing devices and communication paths. In the array, connection between a pair of communicating computing nodes for a given state is fixed, and a connecting path between the pair of nodes is the same or different at different states. The device includes: a storage unit storing an application code for the array of computing nodes; and a time dependent route resource allocator that evaluates a most appropriate path using information that describes computing nodes that are communicating and a time-log for all the computing nodes that describes application dependent specific time and duration when communicating computing nodes are active.

Description

DESCRIPTION

A PROGRAMMING SUPPORTING DEVICE AND A METHOD FOR GENERATING ROUTING INFORMATION FOR AN ARRAY OF COMPUTING NODES

Technical Field:

The present invention relates to a processor with a set number of computing nodes and its programming supporting device.

Background Art:

The need to implement powerful algorithms for communications systems, such as SVD (Singular Value Decomposition), QRD (QR Decomposition) and RLS (Recursive Least Squares) filtering, demands that system developers utilize powerful processors that can execute these algorithms in the shortest possible time. This demand initially resulted in the design and development of single core processors that were clocked at high frequency. However, in the last two decades, the trend has been to move from single core processors to parallel architectures with multiple processing elements (or computing nodes) and multiple storage regions (distributed memory and registers). The driving force behind the change has been the recognition that a parallel architecture offers a better cost-power-performance trade-off in comparison to single core processors [1].

In general, parallel processors consist of at least two on-chip computing nodes (or processors) that communicate with each other using on-chip communication bus(es). In addition, distributed memory could also be available, with communication among all the modules being done using on-chip communication bus(es). In the present Description, we shall refer to all these modules and on-chip multiple processors as an array of computing nodes. Currently, there are two ways of implementing the communication bus(es), as shown in FIGs. 1 and 2, respectively; namely:

(a) Communication bus 101 is shared by at least two computing nodes as indicated in FIG. 1. In this case, computing nodes (including processing nodes 102, memory module 103, and other modules 105) poll the shared communication bus 101, with transmission initiated only when communication bus 101 is detected as being idle. Bus arbiter 104 is connected to communication bus 101. Though this approach minimizes wires on the processor with multiple computing nodes, it demands the design of a high speed communication bus, which unfortunately suffers from severe cross-talk and high power dissipation.

(b) Multiple communication buses 201 are shared by multiple computing nodes 202 as indicated in FIG. 2. Access of computing nodes 202 to individual buses is controlled by multiple routers 203 [2]. Information relating to how the routers route data could however be generated at distributed routing nodes 203 [2] or at a central arbiter [3]. At this point, it should be noted that the method using the central arbiter [3] relies on internet protocol (IP) packet based communication that relies on a polling system. As in any other IP-based routing scheme, data is routed between processors using a polling scheme that is not always optimum. There is always a possibility of data collision or detection of a busy bus, resulting in data buffering that increases processing delay of the process.

The present invention relates to the second case (b) of linking a pair of computing nodes through routers. Typical routers include an input buffer, a controller and an output buffer. The buffer acts as a temporary storage for holding data while a bus is busy. Whereas this mode guarantees data transfer between two computing nodes, it also introduces additional processing delay whenever a bus is detected as busy and transmission has to be re-routed or data transfer has to be temporarily suspended. As a solution, fixed routing (or static routing), such as the table-less X-Y routing, has already been proposed [4]. These fixed routing methods reduce the latency but do not completely eliminate the random delay that may be generated during the polling state.

In addition, a concept of look-ahead routing [5] has also been incorporated to reduce delay by eliminating the need for table lookup for routing in the current router. The approach also minimizes routing delay but does not eliminate the possibility of data collision, and hence the random routing delay that arises from polling.

Thus, the present invention relates only to a processor with multiple computing nodes, but with fixed routing for a given state and no polling or arbitration done on each block of data. An arbitration-like process, if any, is done a priori using the programming environment or a centralized arbiter. In this case, a state is defined as a condition where one computing node is communicating with itself or with another computing node. FIG. 3 illustrates a case where computing node 301 communicates with node 306. The communication is undertaken through paths P300 and P304. During the same state, computing node 306 communicates with node 305 using path P307. At a different state, as illustrated in FIG. 4, computing node 306 no longer communicates with node 305. Instead, at this state, node 306 communicates with node 303 using paths P308 and P305.

Data that is transferred between two computing nodes is accompanied by a header or control data that aids the routers in routing the data through the communication paths. A packet or a frame of data associated with a particular state is of the format shown in FIG. 5. Due to the block characteristic of data, a processor with multiple computing nodes and buses may have different sections of the processor in different states. If we consider the time span when the processor is running, the states of different sections of the processor can be envisioned as illustrated in FIG. 6. In FIG. 6, three states State-1, State-2 and State-3 are monitored at three different computing nodes, and the region with hatched lines indicates the specific time when the relevant state is active. In the illustrated example, State-0 is active between time "0" and "14," State-1 is active between time "0" and "24," and State-2 is active between time "10" and "24."

From the above discussion, we note that static routing needs to be done in advance by considering all the possible states that exist, and routing information needs to be generated to avoid any conflict during data transfer between a communicating pair of nodes. A typical programming environment for a processor with multiple computing nodes that possess different states at different times dictates that a user defines which path resources will be used during each state.

As an example, consider a section of a processor with multiple computing nodes defined as shown in FIG. 7. In FIG. 7, several computing nodes 310, 311 and 312 communicate with another node 313, and multiple paths P400, P401, ... are provided for the communication between the nodes. Application dependent time sharing of a path therefore arises, and optimum path selection needs to be done. In this example, a programmer's target is to design a code that enables computation of data at node 311. The result of the computation is then forwarded to node 313. At a time that depends on some parameters, node 312 communicates its result to node 313. The programmer therefore ends up in a dilemma of determining the paths that will not cause a run time conflict. Alternatively, the same programmer could rely on conventional programming to generate static routing information, but he or she does not consider the time dependent nature of communications between a pair of computing nodes. Such a design ends up creating routing information with redundant states or routing information that causes buffering of data at the point of conflict.

Simulation Environment:

Next, a simulation environment according to the related art will be described.

FIG. 8 illustrates a typical simulating system for simulating performance of a program on a processor with multiple computing nodes [6]. In FIG. 8, there is provided an input unit 600 that acts as an interface between a programmer and the simulating system and has an input function and a displaying function. Compiler 601 compiles the code from input unit 600 and converts it to a format suitable for execution in simulator 602. Simulator 602 then generates an optimum number of computing nodes. Simulator 602 also analyzes the programmer's code for any syntax error or modeling that violates the programming rule, and evaluates the processing time of each process which has been obtained by function division in compiler 601. Whereas simulator 602 offers or generates a performance index, it is based on analysis of text code that does not define explicitly the process that is implemented on a given computing node. Compiler 601 also generates codes for multiple computing nodes. Multiprocessor simulator 603 has a function of emulating the multiple computing nodes and executes the codes generated by compiler 601. The emulation results obtained at multiprocessor simulator 603 are then sent to input unit 600. Coding might be simpler, but as is known in the field of programming, such a mode of coding is never optimum.

In an alternative typical environment [7], a simulator is provided for use in debugging hardware with multiple computing nodes. In this environment, there is provided a processor that sets or controls communication between computing nodes in the simulator. Additional processors are provided to model and simulate operations in the multiple computing nodes. However, this environment does not aid in development of codes with optimum routing information and minimum processing delay.

In summary, a typical device for developing an application specific code with optimum routing and resource allocation for a processor with multiple computing nodes demands that a programmer presents detailed optimum routing information. Failure to provide this detailed information results in sub-optimum routing information being generated by the programming environment. On the other hand, providing parameters for optimization results in additional overhead in the design process.

Therefore, it remains an issue to develop an environment for programming processors with multiple computing nodes. Such an environment will generate optimized routing information for routing data between computing nodes in a processor with a number of computing nodes and communication paths. There are demands for a new application specific approach for generating routing information for a processor with a set number of multiple computing nodes. Such an approach eliminates all overheads that exist in the routing technology of the related art for processors with multiple computing nodes and non-polling mode of data routing. In addition, the approach eliminates the design stage where a programmer has to generate and provide information related to static routing, while at the same time offering optimum routing information for running an application on a processor with multiple computing nodes.

SUMMARY OF THE INVENTION:

An exemplary object of the present invention is to provide a device for a programmer to avoid providing explicit or implicit optimum routing information and resource allocation, while at the same time generating application specific codes that are optimized to a processor with multiple computing nodes.

Another exemplary object of the present invention is to provide a method for a programmer to avoid providing explicit or implicit optimum routing information and resource allocation, while at the same time generating application specific codes that are optimized to a processor with multiple computing nodes.

According to one exemplary aspect of the present invention, provided is a device for aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed and a connecting path between the pair of nodes is the same or different at different states. The device includes: a storage unit storing an application code for the array of computing nodes; and a time dependent route resource allocator that evaluates a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

According to another exemplary aspect of the present invention, provided is a method of aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed and a connecting path between the pair of nodes is the same or different at different states. The method includes evaluating a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

According to the exemplary aspects of the present invention, since the time dependent routing resource allocator that generates optimum routing information using a time log or time stamp of the various states of an application is incorporated, a programmer can implement application specific codes, which are optimized for a specific processor with multiple computing nodes, without specifying or providing optimum information that aids data routing. The routing information automatically generated enables routers with non-polling mode of operation to efficiently route data using static mode of routing. In addition, the optimum routing information thus generated eliminates unpredictable delay in routers with polling mode of routing.

The above and other objects, features, and advantages of the present invention will become apparent from the following description based on the accompanying drawings which illustrate exemplary embodiments of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a processor with multiple computing nodes, where the nodes communicate with each other using a single shared bus;

FIG. 2 is a block diagram illustrating a processor with multiple computing nodes, where the nodes communicate with each other using multiple shared buses;

FIG. 3 is a diagram illustrating a state that defines communication between pairs of computing nodes in a processor with multiple computing nodes;

FIG. 4 is a diagram illustrating another state that defines communication between pairs of computing nodes in a processor with multiple computing nodes;

FIG. 5 is a diagram illustrating two kinds of blocks of data that could be transmitted between two communicating computing nodes;

FIG. 6 is a diagram illustrating time span of three states when monitored at three different computing nodes;

FIG. 7 is a diagram illustrating a scenario where several computing nodes communicate with another node;

FIG. 8 is a block diagram illustrating a conventional simulation environment for a processor with multiple computing nodes;

FIG. 9 is a block diagram illustrating a device for automatically generating optimum routing information for a processor with multiple computing nodes, according to an exemplary embodiment of the present invention;

FIG. 10 is a block diagram illustrating main functions of the device of the exemplary embodiment and parameters exchanged between the main functions;

FIG. 11 is a flowchart illustrating how the device aids a programmer to develop codes with optimum routing information;

FIG. 12 is a flowchart illustrating the operation of the device of the exemplary embodiment;

FIG. 13 is a diagram illustrating a processor with four computing nodes;

FIG. 14 is a diagram illustrating time bands and regions of overlap;

FIG. 15 is a flowchart illustrating a TDRA (time dependent route resource allocator) function that generates optimum routing information for a processor with multiple computing nodes that is operating in the static mode of data routing;

FIG. 16 is a flowchart illustrating a second method for implementing the TDRA;

FIG. 17 is a diagram illustrating two computing nodes, with four paths available for communication;

FIG. 18 is a diagram illustrating three computing nodes; one node with four paths available for communication, and a second node with one path available for communication; and

FIG. 19 is a flowchart illustrating a third method for implementing the TDRA.

DESCRIPTION OF EXEMPLARY EMBODIMENTS OF INVENTION

FIG. 9 shows device 650 which aids a programmer to come up with codes that are optimized for a processor with a set number of computing nodes. Device 650 has a function of automatically generating optimum routing information for a processor with multiple computing nodes, the processor operating in the static mode of data routing.

Device 650 could be a standalone computing device such as a personal computer with single or multiple processors 654. It could also be a work station with single or multiple processors 654. As illustrated in FIG. 9, device 650 further includes: interface 651; memory 653 for storing program 900; and second memory 652 for holding program 900 while it is being executed by processor or processors 654. In one example, memory 652 is an internal RAM (random access memory) while memory 653 is an external memory or storage such as a disk drive.

Program 900 is constituted of multiple functions that include main program portion 901 for automatic generation of the optimum routing information, as illustrated in FIG. 10. FIG. 10 illustrates main functions of program 900 and also shows parameters that are exchanged between the main functions. Program 900 aids in writing codes for a processor with multiple computing nodes that are operating in the static mode of data routing.

In main program portion 901, there is a function called a time dependent routing information generator (TDRA) 902 that evaluates optimum routing information. The information is generated based on the state of all the computing nodes at different time stamps, under the assumption that the processor with multiple computing nodes is running a specific application. The TDRA is also referred to as a time dependent route resource allocator or a time dependent path resource allocator. TDRA 902 uses results generated from behavior level simulation function 802 and application codes 820. In this case, application codes 820 are Type II user's codes stored in advance in storage units 700, 751, and are void of any specific optimum or sub-optimum routing information. Storage units 700 store the codes for the respective nodes and storage unit 751 stores source and target information. TDRA 902 generates detailed information or codes 852 with optimum routing information. Detailed information 852 and intermediate codes 851 that have been lexically analyzed and/or parsed in the lexical analyzing function 801 are used to generate an assembler or machine code by assembler/machine code generating function 803 for the processor with multiple computing nodes. Lexical analyzer and parser unit 801 outputs information 952 of source and destination nodes to TDRA 902.

Assembler/machine code generator 803 outputs simulation results and log 755, and the results 755 are represented as assembler (.assm) files 752.

Also, for main program portion 901, the input codes/scripts/model stored in storage units 700 and 751, and test data stored in storage unit 702, are written by a programmer. The scripts are used for testing the programmer's codes/scripts/model. In contrast to the approach in the related art, no optimum routing information is provided (either directly or implied). In addition, as in the environment in the related art, the device according to the exemplary embodiment has lexical analyzer and parser 801 that analyzes programmer's code 820 for any syntax error or modeling that violates the programming rule of the programming environment. Lexical analyzer and parser unit 801 generates intermediate codes/tokens 853 that are used by behavior level simulator 802 to generate a time log 855 that includes the state of each node, i.e., the source and destination nodes, during the simulation. Generation of log 855 is aided by the test data (or test bench) 702 including control information. Behavior level simulator 802 also outputs simulation results and log 854, and these simulation results and log 854 are sent to graphical user interface 804. In addition, the simulation results for behavior level simulator 802 are output as text (.txt) file 751 with the simulation results.

As shown in FIG. 11, device 650 illustrated in FIG. 9 operates as follows:

Upon receiving the start command at step 500, device 650 loads, at step 501, program 900 into memory 652. Next, at step 502, device 650 executes program 900 in memory 652 using processor or processors 654. On completion of step 502, device 650 stops at step 503.

Detail of step 502 is illustrated in FIG. 12.

When the operation of the device is initiated at step 504, the device starts the execution of program 900 by entering a wait state at step 505. In this state, device 650 monitors all user interfaces for any input from a programmer. The input describes the user's codes that should be considered for execution. This input could also determine if device 650 should terminate execution of program 900. Alternatively, device 650 executes a code or script that describes the programmer's codes to be considered for execution.

Next, in step 506, the programmer's code that has been selected in step 505 is loaded into memory 652. Loading into memory 652 could be skipped if the codes were loaded together with program 900. In next step 507, programmer's codes 820 are analyzed for any errors by lexical analyzer and parser function 801. It then makes a decision in step 508 by either stopping executing the programmer's code in step 509, or moving to next step 510. In step 510, device 650 executes function 802 that simulates behavior level description of programmer's code 820. During this step, data that describes specific times when a computing node is active is evaluated. We refer to these data obtained by function 802 as time log 855 of programmer's codes 820. The time-log information is generated as follows:

Consider an example shown in FIG. 13 where a processor has four computing nodes. As an example, consider also a case where a programmer is interested in writing code where Node-0 communicates with Node-2 and Node-1 communicates with Node-2. The time when each node starts communication is not defined by the programmer, as such a definition would result in additional coding overhead to the programmer. In this example, it should be noted that Node-0 can communicate with Node-2 using path-A or path-B. On the other hand, Node-1 can communicate with Node-2 using path-B only. The straightforward design procedure is to specify within the programmer's code that Node-0 will communicate with Node-2 using path-A. However, this straightforward approach also results in program design overhead when there are very many computing nodes. Thus, in this exemplary embodiment, step 510 will generate information that describes when two communicating computing nodes are active. In this particular case, the computing nodes refer to the computing nodes of the processor in which a programmer is interested upon programming. In addition, other data that describes output results of the programmer's code are also generated by behavior level simulator 802. These results 854 are displayed, at step 511, on interface apparatus 651 by graphical user interface 804 or stored as text files 751 in the memory at step 512 for the programmer's verification. It should be noted that steps 511 and 512 can be reversed or combined without affecting the overall execution of program 900.

FIG. 14 illustrates the kind of time log that will be generated for the example shown in FIG. 13 and explained above. In FIG. 14, the time when a pair of communicating nodes are active and communicating with each other has been marked with a digit "1." Thus, communication between Node-0 and Node-2 is active in the time interval "0" to "14."

In step 514, device 650 executes TDRA function 902, which generates optimized routing information for the target processor with multiple computing nodes. If step 514 is not completed successfully, based on a decision made in step 515, the device stops, at step 516, further execution of programmer's codes 820 or analysis of data generated in step 510. If no error is detected in step 515, device 650 executes a function that generates, at step 517, assembler codes or machine codes for the processor with multiple computing nodes by assembler/machine code generator 803.

Details of Operation of TDRA: Method-1

Step 514 described in FIG. 12 for executing the TDRA function is described in detail in FIG. 15. FIG. 15 illustrates the process by which routing information is generated using TDRA function 902. After simulating programmer's code 820 as described in step 510 of FIG. 12, TDRA 902 will receive time log information 855 from behavior level simulator 802.

In the first step, i.e., step 1100, information 855 will be used to identify when a node is active, and all the computing nodes are marked with a flag as indicated in the example of the table shown in FIG. 14. The table of FIG. 14 represents a single state scenario, where Node-0 communicates with Node-2 and Node-1 also communicates with Node-2. It is assumed for simplicity that communication between Node-0 and Node-2 is active between time intervals "0" and "14," while communication between Node-1 and Node-2 is active between time intervals "10" and "24." Also in this step 1100, TDRA 902 generates overlapping time bands for all the available paths. In the example of FIG. 13, data from Node-0 and Node-1 would overlap or collide along path-B during the time interval "10" to "14." This region of possible data collision is referred to as the time overlap region or time overlap band.
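As an illustration of step 1100, the following is a minimal sketch in Python of how activity flags and time overlap bands might be derived from a time log. It assumes inclusive integer time indices and a simple dictionary layout; the names `time_log`, `candidate_paths`, `activity_flags` and `overlap_bands` are hypothetical and do not appear in the embodiment.

```python
# Time log for the single-state example of FIG. 14. Inclusive integer time
# indices are an assumption; the description only gives the interval endpoints.
time_log = {
    ("Node-0", "Node-2"): (0, 14),
    ("Node-1", "Node-2"): (10, 24),
}

# Paths each communicating pair could use, as described for FIG. 13.
candidate_paths = {
    ("Node-0", "Node-2"): ["path-A", "path-B"],
    ("Node-1", "Node-2"): ["path-B"],
}


def activity_flags(interval, horizon):
    """Return a list holding 1 at every time index where the pair is active."""
    start, end = interval
    return [1 if start <= t <= end else 0 for t in range(horizon + 1)]


def overlap_bands(time_log, candidate_paths):
    """For each path, list the (start, end) bands where two pairs could collide."""
    bands = {}
    pairs = list(time_log)
    for i, first in enumerate(pairs):
        for second in pairs[i + 1:]:
            shared = set(candidate_paths[first]) & set(candidate_paths[second])
            start = max(time_log[first][0], time_log[second][0])
            end = min(time_log[first][1], time_log[second][1])
            if start <= end:  # the two activity intervals intersect
                for path in sorted(shared):
                    bands.setdefault(path, []).append((start, end))
    return bands


flags = activity_flags(time_log[("Node-0", "Node-2")], horizon=24)
overlaps = overlap_bands(time_log, candidate_paths)
print(overlaps)
```

For the example data, the only path shared by both pairs is path-B, and the only band reported for it is the interval "10" to "14" named in the text.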

In the second step, i.e., step 1200, TDRA 902 ranks the nodes in accordance with the number of possible paths that could be used to transmit data between the communicating nodes. In the example of FIG. 13, communication between Node-1 and Node-2 will be ranked first, followed by communication between Node-0 and Node-2.

In the third step, i.e., step 1300, it is checked whether all computing nodes under different states have been assigned communication paths at expected time of communication. If there is no node to be routed at step 1300, then the routing code is generated by generator 803 at step 1350 and then assembler files 752 are outputted. If there is a node that has not been assigned a communication path at step 1300, then the fourth step, i.e., step 1400 is performed.

Step 1400 consists of the following three substeps 1401 to 1403:

(a) Step-A (step 1401) checks the paths of the pairs of communicating computing nodes for existence of a path without time overlap;

(b) Step-B (step 1402) is implemented if there is a node (Node_No_Overlap) with a path without time overlap at step 1401. If there is such a node (Node_No_Overlap), the node (Node_No_Overlap) is assigned to the path without time overlap;

(c) Step-C (step 1403) is implemented if all nodes have paths that have a time overlap at step 1401. A node with the least number of possible paths is assigned to a path with minimum number of communicating pair of computing nodes.

In the fifth step, i.e., step 1500, the pair of communicating nodes that have been assigned their corresponding paths, are removed from the list (or table) and new time bands are generated. The new time bands may or may not include an overlapping time band.
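The loop formed by steps 1200 to 1500 can be sketched as follows. This is only one possible reading of Method-1, written in Python under stated assumptions: a path is treated as free of time overlap for a pair when no other pair that is active at the same time either already holds that path or could still be assigned to it. The function names and data layout are hypothetical.

```python
def intervals_overlap(a, b):
    """True when two inclusive (start, end) intervals share a time index."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def contenders(pair, path, time_log, candidate_paths, assigned, pending):
    """Pairs whose activity could collide with `pair` on `path` (assumed rule)."""
    rivals = []
    for other in time_log:
        if other == pair or not intervals_overlap(time_log[pair], time_log[other]):
            continue
        holds_path = assigned.get(other) == path
        may_use_path = other in pending and path in candidate_paths[other]
        if holds_path or may_use_path:
            rivals.append(other)
    return rivals


def allocate_paths(time_log, candidate_paths):
    # Step 1200: rank pairs, fewest candidate paths first.
    pending = sorted(time_log, key=lambda p: len(candidate_paths[p]))
    assigned = {}
    while pending:  # Step 1300: stop once every pair has a path.
        args = (time_log, candidate_paths, assigned, pending)
        choice = None
        # Steps 1401 and 1402: prefer a pair that has an overlap-free path.
        for pair in pending:
            for path in candidate_paths[pair]:
                if not contenders(pair, path, *args):
                    choice = (pair, path)
                    break
            if choice:
                break
        if choice is None:
            # Step 1403: the most constrained pair takes the least contended path.
            pair = pending[0]
            path = min(candidate_paths[pair],
                       key=lambda q: len(contenders(pair, q, *args)))
            choice = (pair, path)
        assigned[choice[0]] = choice[1]
        pending.remove(choice[0])  # Step 1500: drop the pair and re-evaluate.
    return assigned


time_log = {("Node-0", "Node-2"): (0, 14), ("Node-1", "Node-2"): (10, 24)}
candidate_paths = {
    ("Node-0", "Node-2"): ["path-A", "path-B"],
    ("Node-1", "Node-2"): ["path-B"],
}
routing = allocate_paths(time_log, candidate_paths)
print(routing)
```

With the FIG. 13 data, the sketch assigns path-A to the Node-0/Node-2 pair first, because path-A has no competing pair, and path-B then becomes free of overlap for the Node-1/Node-2 pair. Note that the fallback of step 1403 can still place two pairs on a contended path; how such a residual conflict is handled is not specified here.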

Details of Implementing TDRA: Method-2

FIG. 16 illustrates implementation of the TDRA for generating optimum routing information for a processor with multiple computing nodes that is operating in the static mode of data routing. In this implementation, optimum information for routing is evaluated by evaluating a parameter that is related to the joint probability of all pairs of communicating computing nodes in the processor.

In step 2100, TDRA 902 generates an array of flags (marked as "1" in FIG. 14) for all possible paths that can be used to connect two communicating nodes. These flags are positioned at application specific time indices corresponding to the time when a pair of communicating nodes is active. Judgment is based on the time-log information 855 provided by simulator 802. In addition, time bands that represent the state of actively communicating nodes are generated. Time bands are generated in such a way that there is at least one unique active path for connecting a pair of communicating nodes in two adjacent time bands. In time band "0," only path-A is active, while in time band "1," path-A and path-B are active. Thus, path-B is unique in time band "1."
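One way to picture the time band generation of step 2100 is to cut the timeline wherever the set of active communicating pairs changes. The Python sketch below does this for the intervals of FIG. 14; it assumes inclusive integer time indices starting at 0 and segments on active pairs rather than on active paths, which is a simplification. The function name `time_bands` is hypothetical.

```python
def time_bands(time_log):
    """Split the timeline into bands in which the set of active pairs is constant."""
    horizon = max(end for _, end in time_log.values())
    bands = []
    for t in range(horizon + 1):
        active = frozenset(
            pair for pair, (start, end) in time_log.items() if start <= t <= end
        )
        if bands and bands[-1]["active"] == active:
            bands[-1]["end"] = t  # same active set: extend the current band
        else:
            bands.append({"start": t, "end": t, "active": active})
    return bands


time_log = {("Node-0", "Node-2"): (0, 14), ("Node-1", "Node-2"): (10, 24)}
bands = time_bands(time_log)
for index, band in enumerate(bands):
    print(index, band["start"], band["end"], sorted(band["active"]))
```

For this input the sketch yields three bands: one where only Node-0/Node-2 is active, one where both pairs are active, and one where only Node-1/Node-2 is active.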

In step 2200, TDRA 902 evaluates a first probability, which is the probability of assigning a path to two communicating nodes. As an example, FIG. 17 illustrates two nodes (Node-0 and Node-2) that communicate at some time instant (or within some time band). In this case, four paths P10, P11, P12 and P13 are available from Node-0 to Node-2, and the first probability for each of the four available possibilities of Node-0 communicating with Node-2 is given by the following equation:

$$\text{First Probability} = \frac{1}{\text{Number of available possible paths}} \qquad (1)$$

Thus, the first probability for the case of FIG. 17 will be described in TABLE 1 below.

TABLE 1

If multiple pairs of computing nodes communicate at the same time band, each pair of nodes is considered independently at each time band. As an example, FIG. 18 illustrates Node-0 communicating with Node-2. At the same time, Node-3 communicates with Node-2. Path P14 is equivalent to path P13. For convenience of explanation, we assume that the paths between the routers are unidirectional. It should be noted that the same approach can be extended to a case where the paths support full duplex communication between routers.

In step 2300, a second probability is evaluated by finding the probability of assigning a path to communicating nodes given pairs of other communicating nodes. This is a conditional probability of assigning a path to a pair of communicating nodes under the assumption of the existence of other communicating nodes and availability of alternate paths of communication. For each path, the second probability is evaluated by finding an average of the first probability. That is, for each pair of communicating nodes, the second probability is given by the following equation (2):

$$\left.(\text{Second Probability})\right|_{\text{Given a path}} = \frac{\left.(\text{First Probability})\right|_{\text{Given a path}}}{\left.(\text{Sum of all First Probability of Pairs of Communicating Nodes})\right|_{\text{Given a path}}} \qquad (2)$$

Thus, the second probability for the case of FIG. 18 will be given by the results provided in TABLE 2 below.

TABLE 2
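The second probability can be sketched as below: for every path, each pair's first probability is divided by the sum of the first probabilities of all pairs that can use that path. The data layout and the function name `second_probability` are assumptions for illustration. The sketch further assumes that Node-3 to Node-2 has a single candidate path sharing its link with P13, which is one reading of "path P14 is equivalent to path P13."

```python
def second_probability(first):
    """Share of each pair's first probability on every path it can use.

    first: dict mapping a pair of nodes to {path: first probability}.
    """
    # Sum of the first probabilities of all pairs that can use a given path.
    total = {}
    for probs in first.values():
        for path, value in probs.items():
            total[path] = total.get(path, 0.0) + value
    return {
        pair: {path: value / total[path] for path, value in probs.items()}
        for pair, probs in first.items()
    }

# Assumed candidates: four paths for Node-0 -> Node-2, one for Node-3 -> Node-2.
first = {
    ("Node-0", "Node-2"): {"P10": 0.25, "P11": 0.25, "P12": 0.25, "P13": 0.25},
    ("Node-3", "Node-2"): {"P13": 1.0},
}
second = second_probability(first)
for pair, probs in second.items():
    print(pair, probs)
```

Under these assumptions the uncontested paths P10, P11 and P12 score 1 for Node-0 and Node-2, while the contested path scores 0.2 for that pair and 0.8 for Node-3 and Node-2, in line with the values quoted in the text for steps 2400 and 2600.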

In step 2400, the pair of communicating nodes with the highest probability of being assigned a path is determined. This is done by determining the maximum value of the second probability for each pair of communicating nodes. As an example, with the second probability specified in TABLE 2 for a particular time band, the maximum second probability for the pair of communicating nodes Node-0 and Node-2 is 1, while the maximum second probability for the pair of communicating nodes Node-3 and Node-2 is 0.8. Thus, priority for path assignment in this time band is given to the communication between Node-0 and Node-2.
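The selection in step 2400 amounts to taking, for each pair, the maximum of its second probabilities and then ranking the pairs. A minimal sketch, using the values quoted in the text (1, 1, 1, 0.2 and 0.8) and a hypothetical dictionary layout; the pair Node-3/Node-2 is the second pair of FIG. 18:

```python
def pair_priorities(second):
    """Step 2400: a pair's priority is its largest second probability."""
    return {pair: max(probs.values()) for pair, probs in second.items()}

# Second probabilities for one time band (layout is illustrative).
second = {
    ("Node-0", "Node-2"): {"P10": 1.0, "P11": 1.0, "P12": 1.0, "P13": 0.2},
    ("Node-3", "Node-2"): {"P13": 0.8},
}
priorities = pair_priorities(second)
first_pair = max(priorities, key=priorities.get)
print(priorities, first_pair)
```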

In step 2500, the priority level is evaluated for all time bands. For purposes of explanation, it is assumed that in the first time band the priority levels are as indicated in TABLE 3. For completeness, we assume that the priority levels in the second time band have also been calculated as explained earlier. TABLE 4 indicates possible results. In TABLE 4, we assume that Node-0 and Node-2 communicate only in time band "0." The overall priority level for each pair of communicating nodes is evaluated by taking the mean of the priority levels at the time bands when the pair of communicating nodes is active. Communication between Node-0 and Node-2 is active only at time band "0," hence its mean priority level is 1. On the other hand, the mean priority level of communication between Node-3 and Node-2 is evaluated using the two time bands. Thus, the link with the highest priority level is the link for communications between Node-0 and Node-2.

TABLE 3

TABLE 4
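The overall priority of step 2500 can be illustrated with a short sketch. The band "0" values follow the text (1 and 0.8); the band "1" value for Node-3 and Node-2 is a hypothetical placeholder chosen only for this example, and the function name `overall_priority` is likewise an assumption.

```python
from statistics import mean

def overall_priority(priority_by_band):
    """Step 2500: mean of a pair's priority levels over its active time bands.

    priority_by_band: {pair: {time band index: priority level}}; a band is
    listed for a pair only if the pair is active in that band.
    """
    return {
        pair: mean(list(levels.values()))
        for pair, levels in priority_by_band.items()
    }

priority_by_band = {
    ("Node-0", "Node-2"): {0: 1.0},          # active in band 0 only
    ("Node-3", "Node-2"): {0: 0.8, 1: 1.0},  # band 1 value is hypothetical
}
overall = overall_priority(priority_by_band)
highest = max(overall, key=overall.get)
print(overall, highest)
```

With these numbers, Node-0 and Node-2 keep a mean priority of 1, Node-3 and Node-2 average about 0.9, and the first pair is served first.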

In step 2600, a path is assigned to the pair of nodes with the highest priority level. This path is selected by evaluating the joint probabilities at all active time bands. Communication between Node-0 and Node-2 is active during one time band, hence the joint probability will correspond to the second probability values of this pair of communicating nodes. From TABLE 2, the joint probability at time band "0" and time band "1" and for all paths is given by {1, 1, 1, 0.2}. Thus, path P10, P11 or P12 can be assigned safely to the pair of communicating nodes Node-0 and Node-2.

When a pair of communicating nodes is active during multiple time bands, the joint probability is evaluated by multiplying the second probability values of the paths at each time band. In other words, joint probability of a path is given by the following equation:

$$\text{(Joint probability of a pair of nodes)} = \prod_{\substack{\text{time band } tb = 0,\\ \text{path is active}}}^{\text{Maximum index of time band}} \text{(Second probability at time band } tb\text{)} \qquad (3).$$

Alternatively, the joint probability can be simplified and evaluated by taking the mean of second probability as indicated below:

$$\text{(Joint probability of a pair of nodes)} = \frac{1}{\text{Number of active time bands}} \sum_{\substack{\text{time band } tb = 0,\\ \text{path is active}}}^{\text{Maximum index of time band}} \text{(Second probability at time band } tb\text{)} \qquad (4).$$
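The two ways of combining second probabilities over time bands, the product of equation (3) and the mean of equation (4), can be compared with a small sketch. The function name `joint_probability`, its `use_mean` switch and the two-band sample values are assumptions for illustration.

```python
from math import prod
from statistics import mean

def joint_probability(second_by_band, use_mean=False):
    """Combine one path's second probabilities over its active time bands.

    second_by_band: list with one second-probability value per active band.
    use_mean=False follows equation (3) (product); True follows equation (4).
    """
    return mean(second_by_band) if use_mean else prod(second_by_band)

# A pair active in a single band keeps its second probability unchanged.
single = joint_probability([0.2])

# Hypothetical path that is active in two time bands.
product_value = joint_probability([0.8, 0.5])
mean_value = joint_probability([0.8, 0.5], use_mean=True)
print(single, product_value, mean_value)
```

The product penalizes a path that is heavily contested in any one band more strongly than the mean does, which is the practical difference between the two variants.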

After the path is assigned in step 2600, it is checked in step 2700 whether all of the pairs of communicating nodes have been assigned paths. If any pair of communicating nodes is still without an assigned path, the nodes to which paths have been assigned are removed from the list in step 2800 and the process goes back to step 3100 to handle the next pair of communicating nodes. If all pairs of communicating nodes have been assigned paths in step 2700, the process in the TDRA is completed.
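The loop formed by the assignment, check and removal steps can be sketched for a single time band as follows. This is an illustrative simplification and not the full procedure of FIG. 16: it assumes that paths with the same name compete for the same link, that an assigned path is withdrawn from the candidates of the remaining pairs, and that ties are broken by path name. The function name `assign_paths` is hypothetical.

```python
def assign_paths(candidates):
    """Greedy assignment for one time band (sketch of steps 2200 to 2800).

    candidates: dict mapping a pair of nodes to the names of its possible paths.
    """
    remaining = {pair: set(paths) for pair, paths in candidates.items()}
    assignment = {}
    while remaining:  # step 2700: stop once every pair has a path
        for pair, paths in remaining.items():
            if not paths:
                raise ValueError(f"no free path left for {pair}")
        # Step 2200: first probability, equation (1).
        first = {pair: {p: 1 / len(paths) for p in paths}
                 for pair, paths in remaining.items()}
        # Step 2300: second probability, equation (2).
        total = {}
        for probs in first.values():
            for path, value in probs.items():
                total[path] = total.get(path, 0.0) + value
        second = {pair: {p: v / total[p] for p, v in probs.items()}
                  for pair, probs in first.items()}
        # Steps 2400 and 2500: the pair whose best path scores highest goes first.
        best_pair = max(second, key=lambda pair: max(second[pair].values()))
        # Step 2600: pick that pair's best path (ties broken by name).
        scores = second[best_pair]
        best_path = max(sorted(scores), key=scores.get)
        assignment[best_pair] = best_path
        # Step 2800: drop the served pair. Withdrawing the assigned path from
        # the other pairs is an assumption of this sketch.
        del remaining[best_pair]
        for paths in remaining.values():
            paths.discard(best_path)
    return assignment

candidates = {
    ("Node-0", "Node-2"): ["P10", "P11", "P12", "P13"],
    ("Node-3", "Node-2"): ["P13"],
}
result = assign_paths(candidates)
print(result)
```

In this example Node-0 and Node-2 are served first with an uncontested path, which leaves the contested path free for Node-3 and Node-2.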

Details of Implementing TDRA: Method-3

Another method of implementing the TDRA is illustrated in FIG. 19. In this case, optimum information for routing is evaluated by evaluating a parameter that is related to the joint probability of all pairs of communicating computing nodes in the processor.

The TDRA generates flags and time bands in step 3100, evaluates a first probability in step 3200, and evaluates a second probability in step 3300, in the same manner as steps 2100, 2200 and 2300 shown in FIG. 16, respectively.

Next, in step 3400, the TDRA evaluates a third probability for nodes that communicate at multiple time bands. The third probability corresponds to the joint probability as defined in the alternative step of step 2600 shown in FIG. 16. The value corresponding to the joint probability is then used to replace the second probability values that have been evaluated in step 3300. The replacement is done in all time bands.

In fifth step 3500, a pair of communicating nodes with the highest probability of being assigned a path is determined in the same manner as step 2400 of FIG. 16. In step 3600, overall priority is evaluated for each pair of communicating nodes in the same manner as step 2500 of FIG. 16. However, in contrast to step 2500 of FIG. 16, step 3600 in this case does not have to re-evaluate the joint probabilities.

In step 3700, a path is assigned to the pair of nodes with the highest priority level. This path is selected from among the list of possible paths, with the criterion of selection being based on the joint probability values as evaluated in step 3400. This mode of selection is similar to step 2600 of FIG. 16. In order to repeat the above steps to assign paths to the possible pairs of communicating nodes, steps 3800 and 3900 are performed in the same manner as steps 2700 and 2800 of FIG. 16, respectively.
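The replacement performed in step 3400 can be sketched for one pair of nodes as below. The sketch uses the product form of the joint probability; the data layout, the function name `third_probability` and the path names PA and PB are hypothetical.

```python
from math import prod

def third_probability(second):
    """Replace second probabilities by the joint probability in all bands.

    second: {time band index: {path: second probability}} for one pair of
    nodes; a path is listed in a band only if it is active there.
    """
    paths = {p for probs in second.values() for p in probs}
    # Joint probability of each path over the bands in which it is active.
    joint = {
        p: prod(probs[p] for probs in second.values() if p in probs)
        for p in paths
    }
    return {band: {p: joint[p] for p in probs} for band, probs in second.items()}

# Hypothetical pair that is active in time bands 0 and 1.
second = {
    0: {"PA": 0.5, "PB": 1.0},
    1: {"PA": 0.5},
}
third = third_probability(second)
print(third)
```

Because every band then carries the same value for a given path, later steps can rank pairs and select paths without recomputing the joint probability.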

In the above description, device 650 for generating the optimum routing information is implemented in software. However, those skilled in the art could easily implement device 650 in hardware. For example, each of the function blocks in program 900 can be configured as a hardware component. Such a device in the hardware configuration includes, for example: storage devices for storing the programmer's codes and test data; a user interface device for receiving the user's instructions and displaying the results and logs; a file output device for outputting the resultant files; a lexical analyzer and parser unit; a behavior level simulation unit; a time dependent route resource allocator (TDRA) unit; and a code generator unit for generating assembler and/or machine codes. Each of these units is implemented as a hardware apparatus.

The technical concept here can be applied to a programming environment that generates optimum routing information for a processor with multiple computing nodes. The description was made in connection with the examples described above. The present invention, however, is not limited to a software environment, but can be extended to a processor with a central arbiter and a number of computing nodes. In this case, the central arbiter would include a state controller that generates time-log information using the control part of the code of an application. The TDRA in this arbiter will then generate optimum routing information as explained in the examples above.

In addition, functions described in this invention can be combined or broken down into smaller functions. The functions of the invented program can then be executed in a device with multiple processors, such as a network of computers.

The whole or part of the exemplary embodiments disclosed above can be described as, but not limited to, the following supplementary notes.

(Supplementary note 1) A device for aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed and a connecting path between the pair of nodes is the same or different at different states, the device comprising:

a storage unit storing an application code for the array of computing nodes; and a time dependent route resource allocator that evaluates a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

(Supplementary note 2) The device according to Supplementary note 1, further comprising:

a behavior level simulator that simulates the application code under constraint that there is an infinite number of paths or no routing is needed between a pair of communicating computing nodes to determine the time-log information,

wherein the time dependent route resource allocator generates and stores information indicating application specific time when pairs of communicating computing nodes have an active link of connection.

(Supplementary note 3) A method of aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed, and a connecting path between the pair of nodes is the same or different at different states; the method comprising:

evaluating a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

(Supplementary note 4) The method according to Supplementary note 3, further comprising:

simulating an application code under constraint that there is an infinite number of paths or no routing is needed between a pair of communicating computing nodes to determine the time-log information,

wherein the evaluating of the most appropriate path includes time dependent generating and storing information indicating application specific time when pairs of communicating computing nodes have an active link of connection.

(Supplementary note 5) The method according to Supplementary note 4, wherein the evaluating of the most appropriate path includes:

identifying all possible paths from different computing nodes that overlap in time specified by the time-log.

(Supplementary note 6) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes:

generating an array of flags for all possible paths that can be used to connect two communicating computing nodes, the flags being positioned at application specific time indices corresponding to the time when the communicating computing nodes are active, and each pair of communicating computing nodes at different states being provided with its own array of flags.

(Supplementary note 7) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes: ranking all pairs of computing nodes according to the number of possible paths for connecting each pair of communicating computing nodes.

(Supplementary note 8) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes breaking down the time-log data into time bands, where two adjacent time bands are composed of active possible paths with at least one of the active paths being unique in a given band.

(Supplementary note 9) The method according to Supplementary note 8, wherein the evaluating of the most appropriate path includes: determining priority of assigning a path to a pair of communicating computing nodes at one or more time bands.

(Supplementary note 10) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes assigning a path to a given pair of communicating computing nodes, wherein the assigning includes: a first step of checking existence of non-overlapping paths; a second step of assigning a path to a pair of communicating computing nodes that has a non-overlapping path, or assigning a path to a pair of communicating computing nodes with minimum number of possible paths if there are no overlapping paths; and a third step of removing the assigned path from a list of possible paths in the time bands corresponding to the time when the assigned node is active.

(Supplementary note 11) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes:

generating an array of flags for all possible paths that can be used to connect two communicating computing nodes, the flags being positioned at application specific time indices corresponding to the time when the communicating computing nodes are active, and each pair of communicating computing nodes at different states being provided with its own array of flags; ranking all pairs of computing nodes according to the number of possible paths for connecting each pair of communicating computing nodes;

identifying all possible paths from different computing nodes that overlap in time specified by the time-log; and

assigning paths to non-assigned pairs of communicating computing nodes by using a function that involves a first step of checking existence of non-overlapping paths; a second step that assigns a path to a pair of communicating computing nodes that has a non-overlapping path, or assigns a path to a pair of communicating computing nodes with minimum number of possible paths if there are no overlapping paths; and a third step that removes the assigned path from the list of possible paths in the time bands corresponding to the time when the assigned node is active.

(Supplementary note 12) The method according to Supplementary note 3, wherein the evaluating of the most appropriate path includes evaluating a measure of probability of a pair of communicating computing nodes being assigned a path among available paths in the array of computing nodes.

(Supplementary note 13) The method according to Supplementary note 12, wherein evaluation of the measure of probability is performed by a first step that evaluates a measure of average {FlagAverage(node_x, node_y, pnl_n2)} of the flags associated with a pair of connected nodes (node_x, node_y) at a particular state; a second step that scales {Flag(node_x, node_y)/FlagAverage(node_x, node_y, pnl_n2)} the flags {Flag(node_x, node_y)} using the averages {FlagAverage(node_x, node_y, pnl_n2)}; and a third step that evaluates the mean of the scaled path flags {Flag(node_x, node_y)/FlagAverage(node_x, node_y, pnl_n2)}, the mean being evaluated separately for each path at a particular time band, considering all the pairs of communicating computing nodes with flags on the path at the same time band.

(Supplementary note 14) The method according to Supplementary note 13, wherein a maximum value among the values obtained from the mean is identified, the maximum of which being evaluated by considering all paths with flags for a pair of communicating computing nodes at a particular time band, and the mean is normalized with the maximum.

(Supplementary note 15) A centralized arbiter generating routing information of routers in an array of computing nodes, the routers enabling communications between computing nodes in the array, wherein connection between a pair of communicating nodes for a given state is fixed and connecting path between the pair of communicating nodes is the same or different at different states, the central arbiter comprising:

a state controller generating or predicting time-log that defines application specific time when a pair of communicating computing nodes will be actively communicating; and

a time dependent route resource allocator that evaluates a most appropriate path using information that describes computing nodes that are communicating and the time-log for all the computing nodes that describes application dependent specific time when the communicating computing nodes are active.

(Supplementary note 16) The centralized arbiter according to Supplementary note 15, wherein the state controller runs control code of an application code for the array of computing nodes under constraint that no routing is needed between a pair of communicating computing nodes, and the state controller generates and stores information indicating application specific time when pairs of communicating nodes will have an active link of connection to determine the time-log information.

(Supplementary note 17) The centralized arbiter according to Supplementary note 16, wherein the time dependent route resource allocator identifies all possible paths from different computing nodes that overlap in time specified by the time-log.

(Supplementary note 18) The centralized arbiter according to Supplementary note 15, wherein the time dependent route resource allocator generates an array of flags for all possible paths that can be used to connect two communicating computing nodes, the flags being positioned at application specific time indices corresponding to the time when communicating computing nodes are expected to be active, and each pair of communicating computing nodes at different states being provided with its own array of flags.

(Supplementary note 19) The centralized arbiter according to Supplementary note 15, wherein the time dependent route resource allocator ranks all pairs of computing nodes according to the number of possible paths for connecting each pair of communicating computing nodes.

(Supplementary note 20) The centralized arbiter according to Supplementary note 15, wherein the time dependent route resource allocator breaks down the time-log data into time bands, where two adjacent time bands are composed of active possible paths with at least one of paths being unique in a given band.

(Supplementary note 21) The centralized arbiter according to Supplementary note 20, wherein the time dependent route resource allocator determines priority of assigning a path to a pair of communicating computing nodes at one or more time bands.

(Supplementary note 22) The centralized arbiter according to Supplementary note 15, wherein the time dependent route resource allocator: assigns a path to a given pair of communicating computing nodes using a function, or checks existence of non-overlapping paths; assigns a path to a pair of communicating computing nodes that have a non-overlapping path, or assigns a path to a pair of communicating computing nodes with minimum number of possible paths if there are no overlapping paths; and removes the assigned path from a list or table with possible paths in the time bands corresponding to the time when the assigned node will be active.

(Supplementary note 23) The centralized arbiter according to Supplementary note 15, wherein the time dependent route resource allocator: generates an array of flags for all possible paths that can be used to connect two communicating computing nodes, the flags are positioned at application specific time indices corresponding to the time when the communicating computing nodes are active, and each pair of communicating computing nodes at different states is provided with its own array of flags; ranks all pairs of computing nodes according to the number of possible paths for connecting each pair of communicating computing nodes;

identifies all possible paths from different computing nodes that overlap in time specified by the time-log; and assigns paths to non-assigned pairs of communicating computing nodes by using a function that involves a first step for checking existence of non-overlapping paths, a second step that assigns a path to a pair of communicating computing nodes that has a non-overlapping path, or assigns a path to a pair of communicating computing nodes with minimum number of possible paths if there are no overlapping paths, and a third step that removes the assigned path from the list of possible paths in the time bands corresponding to the time when the assigned node is active.

(Supplementary note 24) The centralized arbiter according to Supplementary note 13, wherein the time dependent route resource allocator evaluates a measure of probability of a pair of communicating computing nodes being assigned a path among available paths in the array of computing nodes.

(Supplementary note 25) The centralized arbiter according to Supplementary note 24, wherein the time dependent route resource allocator: evaluates a measure of average {FlagAverage(node_x, node_y, pnl_n2)} of the flags associated with a pair of connected computing nodes (node_x, node_y) at a particular state; scales {Flag(node_x, node_y)/FlagAverage(node_x, node_y, pnl_n2)} the flags {Flag(node_x, node_y)} using the averages {FlagAverage(node_x, node_y, pnl_n2)}; and evaluates the mean of the scaled path flags {Flag(node_x, node_y)/FlagAverage(node_x, node_y, pnl_n2)}, the mean being evaluated separately for each path at a particular time band, considering all the pairs of communicating computing nodes with flags on the path at the same time band.

(Supplementary note 26) The centralized arbiter according to Supplementary note 24, wherein the measure of priority is evaluated by identifying a maximum value among the values obtained from the mean, the maximum value being evaluated by considering all paths with flags for a pair of communicating computing nodes at a particular time band; and normalizing the mean with the maximum.

It will be apparent that other variations and modifications may be made to the above described embodiments and functionality, with the attainment of some or all of their advantages. It is an object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

References:

[1] Zhaohui Liu, Kevin Dickson, and John V. McCanny, "Application-Specific Instruction Set Processor for SoC Implementation of Modern Signal Processing Algorithms," IEEE Transactions on Circuits and Systems, Vol. 52, No. 4, pp. 775-765, April 2005.

[2] William J. Dally and Brian Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," 38th Conference on Design Automation (DAC '01), IEEE, pp. 684-689, June 18-22, 2001, Las Vegas.

[3] JP, 2008-022245A

[4] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny, "QNoC: QoS architecture and design process for networks on chip," Journal of Systems Architecture, Vol. 50, No. 2-3, pp. 105-128, February 2004.

[5] Aniruddha S. Vaidya, Anand Sivasubramaniam, and Chita R. Das, "LAPSES: A Recipe for High Performance Adaptive Router Design," The Fifth International Symposium on High-Performance Computer Architecture, IEEE, pp. 236-243, January 9-13, 1999, Orlando.

[6] JP, 05-088912A

[7] JP, 07-249012A

Claims

1. A device for aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed and a connecting path between the pair of nodes is the same or different at different states, the device comprising:

a storage unit storing an application code for the array of computing nodes; and a time dependent route resource allocator that evaluates a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

2. The device according to claim 1, further comprising:

3. A method of aiding programming of an array of computing nodes connected to each other through routing devices and communication paths, wherein connection between a pair of communicating computing nodes for a given state is fixed and a connecting path between the pair of nodes is the same or different at different states, the method comprising:

evaluating a most appropriate path using information that describes computing nodes that are communicating and time-log for all the computing nodes that describes application dependent specific time and duration when the communicating computing nodes are active.

4. The method according to claim 3, further comprising:

5. The method according to claim 3, wherein the evaluating of the most appropriate path includes:

generating an array of flags for all possible paths that can be used to connect two communicating computing nodes, the flags being positioned at application specific time indices corresponding to the time when the communicating computing nodes are active, and each pair of communicating computing nodes at different states being provided with its own array of flags.

6. The method according to claim 3, wherein the evaluating of the most appropriate path includes: ranking all pairs of computing nodes according to the number of possible paths for connecting each pair of communicating computing nodes.

7. The method according to claim 3, wherein the evaluating of the most appropriate path includes breaking down the time-log data into time bands, where two adjacent time bands are composed of active possible paths with at least one of the active paths being unique in a given band.

8. The method according to claim 3, wherein the evaluating of the most appropriate path includes assigning a path to a given pair of communicating computing nodes, wherein the assigning includes: a first step of checking existence of non-overlapping paths; a second step of assigning a path to a pair of communicating computing nodes that has a non-overlapping path, or assigning a path to a pair of communicating computing nodes with minimum number of possible paths if there are no overlapping paths; and a third step of removing the assigned path from a list of possible paths in the time bands corresponding to the time when the assigned node is active.

9. The method according to claim 3, wherein the evaluating of the most appropriate path includes evaluating a measure of probability of a pair of communicating computing nodes being assigned a path among available paths in the array of computing nodes.

10. The method according to claim 9, wherein the evaluating the measure of probability includes:

a first step of evaluating a measure of average of flags associated with a pair of connected computing nodes at a particular state; a second step of scaling the flags using the averages; and

a third step of evaluating a mean of the scaled path flags, the mean being evaluated separately for each path at a particular time band, considering all the pairs of communicating computing nodes with flags on the path at the same time band.