WO2005109232A1 - Cluster switch - Google Patents

Cluster switch

Info

Publication number
WO2005109232A1
WO2005109232A1 (PCT/SE2004/000735)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
master
switch
nodes
node
Prior art date
Application number
PCT/SE2004/000735
Other languages
French (fr)
Inventor
Johan Sjöholm
Original Assignee
Building 31 Clustering Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Building 31 Clustering Ab filed Critical Building 31 Clustering Ab
Priority to PCT/SE2004/000735 priority Critical patent/WO2005109232A1/en
Publication of WO2005109232A1 publication Critical patent/WO2005109232A1/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 - Packet switching elements
    • H04L49/30 - Peripheral units, e.g. input or output ports
    • H04L49/3036 - Shared queuing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 - Packet switching elements
    • H04L49/30 - Peripheral units, e.g. input or output ports
    • H04L49/3018 - Input queuing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 - Packet switching elements
    • H04L49/35 - Switches specially adapted for specific applications
    • H04L49/351 - Switches specially adapted for specific applications for local area network [LAN], e.g. Ethernet switches

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Small-Scale Networks (AREA)

Abstract

The present invention relates to a method and system for connecting cluster nodes, a cluster switch for facilitating connections of cluster nodes, and a cluster system using the cluster apparatus or method, wherein cluster nodes communicate with a master node through said cluster switch using a shared memory architecture with dual-port memory modules and trigger interrupts at each other, for data and control traffic transmission purposes, using said shared memory architecture. Communication signaling between the cluster switch and the nodes may use several different protocols, such as LVDS solutions or wireless communication links.

Description

Cluster switch
Field of the invention
The present invention relates to computer cluster networks and in particular to an arrangement and method for transferring, controlling, and switching signals between cluster nodes.
Background of the invention
A computer cluster may be defined as a collection of interconnected computational devices working together to, for instance, solve complex arithmetic problems, handle large quantities of data such as in databases, or provide scalability and/or redundancy in business services. Standard solutions are available for the latter types of cluster applications (redundant web servers, NAS systems, etc.). However, the demands on a high-performance computer cluster are very different from those of simple redundancy applications.
The cluster nodes are interconnected using a communication network; this network may be built on several different technologies, such as Ethernet. In such a system the workload is divided into small pieces and the pieces are distributed to computational devices, or nodes, each handling its own computational task. It is thus possible to acquire large computational capacity at a lower cost than with a large supercomputer.
A master or hub computer controls the computational process and handles the division of the workload into smaller pieces. The central master or hub computer also collects data from the nodes and assembles the final result of the computational task.
This kind of processing architecture has in recent years attracted great interest both from larger corporations building supercomputer systems and from smaller entities such as university groups and non-profit organizations. One example is the SETI group at the University of California, Berkeley, which uses a very large computer network for its now famous SETI@Home project. The project takes data collected at a radio telescope and uses ordinary personal computers running advanced signal processing algorithms to search for signs of intelligent life in the radio data. Data are distributed over the Internet to computers connected to the project. However, this kind of Internet- or Ethernet-based distribution architecture has several drawbacks: data distributed in this way must be of an appropriate size to limit bandwidth requirements, and the data cannot have any real-time dependencies, since latency is a major problem in such systems.
For this purpose dedicated cluster systems have been developed, wherein the computational nodes are located within the same premises and dedicated hardware setups interconnect the nodes with each other. Examples include supercomputer systems such as the Earth Simulator in Yokohama, Japan, and the Tungsten system at NCSA in Urbana-Champaign, USA.
One of the crucial parts in such systems is the interconnect hardware enabling the communication between cluster nodes and master node(s). Several such interconnect systems are available commercially, for instance from Myricom with their Myrinet solution. This solution includes switch ports in a dedicated cluster switch and interface ports located in the nodes.
Another available solution is the InfiniBand™ architecture, adopted by several vendors, which includes dedicated and specialized hardware components such as cabling, interfaces, and switches.
The above-mentioned solutions are very costly due to their design characteristics and have technical drawbacks, such as latency problems.
Summary of the invention
It is the main object of the preferred embodiment of the present invention to provide an arrangement that overcomes some of the above-mentioned problems. The purpose is to provide a high-speed, low-latency cluster switch using an emulated memory fabric solution, for use as a switch in transparent clustering and as a gigabit Ethernet replacement in traditional cluster solutions such as Beowulf, Mosix, or OpenMosix systems.
This is done by supplying a cluster switch comprising a shared memory structure, in which cluster nodes and the master node can communicate through the shared memory structure located in the switch. Thus data and control traffic can be interchanged and interrupts can be triggered at respective devices by writing data bits to specific memory locations in the shared memory unit.
In a preferred embodiment a network cluster switch is provided, comprising: a plurality of connectors connectable to cluster nodes; a plurality of memory banks, one memory bank for each cluster node connected to the cluster switch, each memory bank comprising at least one storage means and memory controller means; and a connector connected to a cluster master, operatively connecting the master to all memory banks for essentially simultaneous communication with at least part of the plurality of memory banks, the memory banks being essentially simultaneously accessible by both the cluster nodes and the master.
The cluster master may trigger an interrupt at at least part of the cluster nodes, and a cluster node may trigger an interrupt at the master. Traffic between nodes and master is echoed back to the sending device for error detection purposes. The storage means may comprise dual-port memory chips, and the connection between cluster nodes and cluster switch may comprise parallel connections using a Low Voltage Differential Signaling (LVDS) architecture. The cluster switch may also comprise error detection means.
In another embodiment, a system for network clusters is built up. The system may comprise: a network cluster switch according to the present invention as described above; a plurality of cluster nodes; and a cluster master, the cluster nodes being connected to the cluster master via the cluster switch, the system being characterized in that communication between the cluster nodes and the cluster master is maintained through the cluster switch using shared memory banks in the cluster switch. The master can trigger an interrupt through the shared memory bank at at least part of the cluster nodes, and/or a cluster node can trigger an interrupt through the shared memory bank at the master. The system may also be configured so that traffic between nodes and master is error detected.
In the system the memory banks may comprise dual port memory chips.
Yet further in the system, the connection between cluster nodes and cluster switch may comprise parallel connections, wherein communication signals may be transmitted using Low Voltage Differential Signaling (LVDS) and/or a wireless communication protocol.
The above-mentioned system may be used in yet another embodiment: a method for connecting and controlling cluster nodes, comprising the steps of: connecting the cluster nodes to a cluster switch comprising a memory bank for each cluster node, the memory bank being shared between the cluster node and the master node, and memory controlling means; connecting the cluster switch to a cluster master; and communicating between the cluster nodes and the cluster master using interrupt signaling via the shared memory banks.
In the method, all data traffic between cluster nodes and the cluster master may be error detected.
Yet further in the method, the master may trigger an interrupt through the shared memory bank at at least one of the cluster nodes through a write operation at a memory bank in said cluster switch being operatively connected to the cluster node. Also the opposite may be utilized, wherein the cluster node may trigger an interrupt through the shared memory bank at the master through a write operation at a memory bank in the cluster switch being operatively connected to the master.
In the method, the memory banks may comprise dual port memory chips.
Further in the method, the connection between cluster nodes and cluster switch may comprise parallel connections, wherein communication signals may be transmitted using Low Voltage Differential Signaling (LVDS) and/or a wireless communication protocol.
Brief description of the drawings
In the following, the invention will be described in a non-limiting way and in more detail with reference to exemplary embodiments illustrated in the enclosed drawings, in which:
Fig. 1 shows a schematic drawing of a computer cluster according to the present invention.
Fig. 2 shows a schematic view of a cluster switch.
Fig. 3 shows a more detailed schematic block diagram of a cluster switch according to the present invention.
Fig. 4 shows a detailed schematic block diagram of the components of a cluster communication device located at the node.
Fig. 5 shows a schematic block diagram of a cluster or master node.
Detailed description of the invention
In computer or network cluster applications, several different technologies have been developed in order to share information and resources, for instance memory resources. For this purpose, interconnection devices and components are necessary to provide an efficient connection between the nodes of the cluster.
Fig. 1 shows a schematic drawing of a computer cluster 100 wherein a cluster switch 101 is connected to a master 102 controlling the switch 101 and the cluster working process. Further, a plurality of nodes 104, 105, 106...10n is connected to the switch 101 using connection means 103 in a point-to-point manner, meaning that every node has an individual connection to the switch. The connection means 103 is preferably a multi-channel connection cable transmitting information on parallel lines in order to increase the communication capacity. In one embodiment of the present invention, LVDS (Low Voltage Differential Signaling) is used in order to obtain a stable, high-speed cabled connection. However, the connection means are not limited to LVDS signaling systems; other solutions may be used within the scope of the present invention as disclosed in the claims. Such other solutions may include, but are not limited to, different types of radio or wireless communication solutions, such as Bluetooth, wireless LAN solutions (the 802.11 standard series, the HiperLAN standard series, HomeRF, IR (infrared) solutions, and UWB (Ultra WideBand)), or different types of fixed-line communication signaling protocols such as RS-232, RS-422, RS-485, ECL, TTL, and so on.
Referring to Fig. 2, the cluster switch 101 consists of an array of storage means, such as memory banks 204, 205...207, each connected to a respective cluster node through a connector 210, 211...212. Each memory bank is also connected to the master 102, and both the node 104, 105...107 and the master can communicate with the memory bank simultaneously.
The master 103 can write to and read from any combination of memory banks simultaneously. This is necessary since the master 103 controls the operation of the cluster process and distributes the workload. In one preferred embodiment the master 103 may invoke a hardware interrupt in any node or combination of nodes 104, 105...107 by writing a message to a specific memory location in the memory bank or banks 204, 205...207 associated with the node or nodes. Using this interrupt process it is possible to exchange short messages between the master 103 and a node 104, 105...107 or to control specific functions in cluster nodes. In a similar fashion, and for the same purposes, one or several nodes 104, 105...107 may also invoke a hardware interrupt in the master 103. Thus the message exchange may be a two-way communication link; however, it may also be configured as a one-way communication link.
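To illustrate the principle, the following C sketch models one shared memory bank as a structure with a data area, a mailbox word, and an interrupt flag, and shows the master triggering a "node interrupt" by writing to that location. The layout, sizes, and names are illustrative assumptions and are not details taken from the patent.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical layout of one shared memory bank in the switch;
 * sizes and field names are assumptions made for this sketch. */
#define BANK_DATA_WORDS 1024

typedef struct {
    uint32_t data[BANK_DATA_WORDS]; /* bulk data exchanged between node and master */
    uint32_t mailbox;               /* short message written by the other side     */
    volatile uint32_t irq_pending;  /* a write here stands in for the interrupt    */
} memory_bank_t;

/* Master side: invoke a "hardware interrupt" in a node by writing a
 * message to a specific location in that node's memory bank. */
static void master_notify_node(memory_bank_t *bank, uint32_t message)
{
    bank->mailbox = message;   /* short control message                            */
    bank->irq_pending = 1;     /* in hardware this write would assert the node IRQ */
}

/* Node side: service the interrupt by reading the mailbox and clearing the flag. */
static void node_service_irq(memory_bank_t *bank)
{
    if (bank->irq_pending) {
        printf("node received message 0x%08x from master\n", (unsigned)bank->mailbox);
        bank->irq_pending = 0;
    }
}

int main(void)
{
    memory_bank_t bank;
    memset(&bank, 0, sizeof bank);
    master_notify_node(&bank, 0xCAFE0001u);
    node_service_irq(&bank);   /* prints the message, mimicking the interrupt path */
    return 0;
}
```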
The memory banks consist of storage circuitry and control circuitry. The storage circuitry may include, but is not limited to, RAM chips or similar volatile memory solutions. Other storage solutions may be utilized, like for example, non-volatile storage solutions, e.g. hard drive, Flash memory, or other similar products.
The present invention preferably uses a dual-port memory function in the cluster switch so that the storage means is accessible to both nodes and the master simultaneously; this configuration ensures a high-speed connection and an efficient data management process according to the present invention. Such a process of sharing an external memory pool may be described as an Emulated Memory Fabric (EMF). This technology lets computers or nodes in a cluster share an external memory pool through a peripheral bus connector. EMF hardware may be used, for example, but not exclusively, in transparent clusters and as a high-speed Ethernet replacement in traditional cluster environments such as Beowulf systems.
Let us now take a closer look at the design and function of the cluster switch as seen in Fig. 3, which depicts a schematic block diagram of the EMF hardware, used for instance as a cluster switch. A master bus 1 connects to a peripheral bus of the master computer 103 and to a local bus bridge 2 in the switch 101. The local bus bridge 2 allows a memory bank 204, 205...207 to be selected for read and write operations, or a combination of banks to be selected for write operations. When selecting several memory banks 204, 205...207 for simultaneous write operations, a bit mask is used. The local bus bridge also provides the control system for servicing interrupts, e.g. it provides the functionality for triggering interrupts at the master and provides a bit mask of the interrupts that have not yet been serviced.
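The bank-selection and interrupt bit masks can be pictured with the C sketch below. The mask width, the number of banks, and the helper names are assumptions chosen for the example and do not reflect the actual register layout of the switch.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 8            /* illustrative; the patent allows 2 to 256 nodes */

static uint32_t bank_mem[NUM_BANKS];   /* stand-in for one word per memory bank      */
static uint32_t irq_unserviced = 0;    /* bit i set => interrupt from node i pending */

/* Write the same word to every bank whose bit is set in the mask,
 * mimicking a simultaneous multi-bank write selected by a bit mask. */
static void masked_write(uint32_t mask, uint32_t value)
{
    for (int i = 0; i < NUM_BANKS; i++)
        if (mask & (1u << i))
            bank_mem[i] = value;
}

/* Node i raises an interrupt towards the master: set its bit in the pending mask. */
static void node_raise_irq(int i) { irq_unserviced |= (1u << i); }

/* Master reads the mask of interrupts not yet serviced and acknowledges them. */
static uint32_t master_poll_and_ack(void)
{
    uint32_t pending = irq_unserviced;
    irq_unserviced = 0;
    return pending;
}

int main(void)
{
    masked_write(0x05u, 0xDEADBEEFu);   /* write to banks 0 and 2 in one operation */
    node_raise_irq(2);
    node_raise_irq(5);
    printf("bank0=0x%08x bank1=0x%08x pending=0x%02x\n",
           (unsigned)bank_mem[0], (unsigned)bank_mem[1], (unsigned)master_poll_and_ack());
    return 0;
}
```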
The cluster nodes or slaves 104, 105...107 connect through a node peripheral bus 29 of the node computer to a local bus bridge 28, as seen in Fig. 4. The point-to-point cable 15 connects to identical coder/decoder units 14 and 16 in the switch 101 and in the cluster node connection hardware (not shown). In one preferred embodiment the coder/decoder units 14 and 16 provide four communication channels 10, 17, 11, 18, 12, 20, 13, and 19; however, it should be appreciated by the person skilled in the art that the invention is not limited to four channels and that more or fewer channels may be used. The bandwidth allocated to each channel depends on the specific task for that channel and the configuration of the switch 101.
The local bus bridge 2 of the master connects to storage means, such as a random access memory (RAM) 3 in each memory bank 204, 205...207 with control, address, and data connection lines. Since the RAM in a preferred embodiment of the present invention is a dual port RAM, control, address, and data connections are supplied simultaneously from two sides. Identical control, address, and data connection lines connect from the node. Control and address information 6 is supplied from the node via a channel 10 and data information 7 is received from the node through a separate channel 13 and delivered to the node through yet another channel 12.
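As a rough software analogy of the dual-port RAM, the C sketch below gives the master side and the node side their own port handle onto the same storage, so that both can read and write it independently. The structure is purely illustrative and does not model the real control, address, and data lines.

```c
#include <stdint.h>
#include <stdio.h>

#define RAM_WORDS 256

typedef struct { uint32_t cells[RAM_WORDS]; } dpram_t;

/* A "port" is just a handle onto the same storage; in real dual-port RAM
 * each port has its own control, address, and data lines. */
typedef struct { dpram_t *ram; const char *name; } dpram_port_t;

static void port_write(dpram_port_t p, uint32_t addr, uint32_t v)
{
    p.ram->cells[addr % RAM_WORDS] = v;
}

static uint32_t port_read(dpram_port_t p, uint32_t addr)
{
    return p.ram->cells[addr % RAM_WORDS];
}

int main(void)
{
    dpram_t ram = {0};
    dpram_port_t master_port = { &ram, "master" };
    dpram_port_t node_port   = { &ram, "node"   };

    port_write(master_port, 0x10, 0x1234u);                       /* master writes via its port */
    printf("node reads 0x%04x\n", (unsigned)port_read(node_port, 0x10)); /* node sees the same cell */
    return 0;
}
```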
In order to synchronize work being undertaken in the cluster setup, synchronization means and/or a voting system is utilized. In one preferred embodiment this is facilitated by a system that enables high-speed, hardware-assisted barrier synchronization and voting. Barrier synchronization lets the cluster nodes synchronize by using a barrier level that forces cluster nodes to wait until all cluster nodes have reached the barrier level before continuing operation. A voting system handles scheduling of the workload and guarantees fairness. The fairness concept includes fair sharing and/or usage of cluster resources by giving all users an equal allocation of resources; however, the fairness concept may also be based on other decision criteria such as historical resource usage, political considerations, and job value.
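A software analogy of such a barrier is sketched below in C: a shared counter that every node increments and then spins on until all nodes have arrived. In the invention the barrier is hardware assisted, so the thread-and-spin form, the node count, and the variable names are purely illustrative assumptions.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>

#define NUM_NODES 4                 /* illustrative cluster size */

static atomic_int arrived = 0;      /* counter kept in the shared memory bank */

/* Each node increments the barrier counter and spins until every node
 * has reached the barrier level, then continues its work. */
static void *node_work(void *arg)
{
    long id = (long)arg;
    printf("node %ld reached barrier\n", id);
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < NUM_NODES)
        ;                           /* hardware-assisted wait in the real switch */
    printf("node %ld passed barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_NODES];
    for (long i = 0; i < NUM_NODES; i++)
        pthread_create(&t[i], NULL, node_work, (void *)i);
    for (int i = 0; i < NUM_NODES; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```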
Different types of error detection systems may be utilized; below, three different solutions are discussed, with the common denominator that the error detection is implemented in hardware.
1) All information transferred between the master and a node is echoed back for verification to the respective sending node or master. Control and address information 21 and 17 is sent to the master, received 10 and 6, sent back 11, received by the slave 18, and compared 22 with the initial information. If they mismatch, a wait signal 27 is triggered and the information is sent again.
Data sent from the node to the switch 23 and 19 is verified in the same way. It is received 13 and 7, echoed back 12, received 20 and 24, and compared 26. Upon a mismatch 27, the node's local bus bridge 28 retries the write operation.
Data requested and sent from the switch to the node 8 is verified in a slightly different way. It is sent 12, received 20 and 25, echoed 19, received 13 and 19, and compared 4 with the original data. Upon a mismatch, a wait signal 5 is triggered and sent via the interconnection lines 14, 15, and 16, which triggers the node's wait signal 27, which in turn triggers the node's local bus bridge 28 to retry the read operation.
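As an illustration of this echo-and-compare scheme, the following C sketch models the link as a function that returns what the far end actually latched; the sender compares the echo with the original and retries on mismatch. The corruption model, the names, and the retry limit are assumptions made for the example and are not part of the patent.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Toy "link": forwards a word to the far end, which stores it and echoes
 * it back; the first transfer is deliberately corrupted to force a retry. */
static uint32_t far_end_storage;
static int transfer_count = 0;

static uint32_t send_and_echo(uint32_t word)
{
    uint32_t received = word;
    if (transfer_count++ == 0)
        received ^= 0x1u;            /* simulate a single-bit transmission error */
    far_end_storage = received;      /* far end latches what it actually received */
    return received;                 /* and echoes it back to the sender */
}

/* Sender side: compare the echo with the original and retry on mismatch,
 * mirroring the wait-and-resend behaviour described above. */
static bool verified_write(uint32_t word)
{
    for (int attempt = 0; attempt < 8; attempt++) {
        if (send_and_echo(word) == word)
            return true;             /* echo matches: the transfer is accepted */
        printf("mismatch on attempt %d, retrying\n", attempt + 1);
    }
    return false;
}

int main(void)
{
    if (verified_write(0x12345678u))
        printf("far end holds 0x%08x after verification\n", (unsigned)far_end_storage);
    return 0;
}
```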
2) In another embodiment the error detection system includes simple parity testing of transmitted data and control traffic. 3) In yet another error detection solution, the system utilizes a Cyclic Redundancy Check (CRC) on information in communication or in at least part of a storage module or storage modules.
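The patent does not specify a particular parity scheme or CRC polynomial; the C sketch below shows one plausible interpretation, with an even-parity bit over a 32-bit word and a bitwise CRC-32 using the common reflected polynomial 0xEDB88320. These choices are illustrative assumptions only.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Parity bit of a 32-bit word: 1 if the number of set bits is odd. */
static unsigned parity32(uint32_t w)
{
    w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1u;
}

/* Bitwise CRC-32 over a buffer, reflected polynomial 0xEDB88320.
 * The polynomial is an assumption; the patent only mentions
 * "a Cyclic Redundancy Check" without naming one. */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
    }
    return ~crc;
}

int main(void)
{
    const uint8_t msg[] = "cluster switch";
    printf("parity(0x%08x) = %u\n", 0xDEADBEEFu, parity32(0xDEADBEEFu));
    printf("crc32 = 0x%08x\n", (unsigned)crc32_bitwise(msg, sizeof msg - 1));
    return 0;
}
```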
The switch and the node communication hardware are similar in architecture and handle the same error functions and communication processes, as can be seen by comparing Figs. 3 and 4.
The number of nodes 104, 105...107 connecting to the switch 101 may be any suitable number; however, in one preferred embodiment the number of nodes is between 2 and 256, with a corresponding number of memory banks, one per node 104, 105...107.
The switching device is built using ordinary electronic design technologies and the components are mounted on an electronic circuit board or boards of single or multilayer type. Some parts may be mounted on a second circuit board that is either mounted directly (such as a daughter board) or indirectly connected via connection means (e.g. a wire line connector) to the main electronic circuit board.
In Fig. 5 a cluster or master node 500 is schematically shown as a block diagram. A connectivity board 502 is installed in each cluster node and in the master, and communication between the nodes, the master, and the switch is maintained using these connectivity boards 502. The node connectivity boards 502 may be used in two different system setups: a) in a system wherein a switch according to the present invention is present, the system will be able to use the added functionality provided by the switch as discussed above; b) the connectivity boards may also be used in an ordinary network configuration such as an Ethernet system, in which case the boards act as fast, low-latency network cards and use an ordinary TCP/IP protocol architecture or other standard communication protocols. In such a case the connectivity cards will be transparent in the sense that any application may use them for connectivity purposes as long as a standard communication protocol is used over the network.
In case a) above, the switch provides a cluster-transparent system in which any third-party cluster application software may be set up, as long as it fulfills one demand: the cluster application software must be able to run in threaded mode or as an SMP (Symmetric Multi-Processing) system. Preferably, the nodes according to the present invention run under a Linux environment (there are no special demands on the Linux dialect); however, it should be appreciated by the person skilled in the art that other operating systems (OS) may be used in the nodes, such as, but not limited to, Windows, Mac, Unix, FreeBSD, VMS, and variations thereof, and also standard or proprietary OSs for embedded applications, all depending on the computing hardware used and the configuration demands of the cluster application. When setting up a complete system according to the present invention, the third-party cluster application software may be used without any recompilation or other modifications, due to the transparency of the system setup.
There is no special demand on the hardware used for building the cluster nodes or the master node; however, the nodes should be able to run the above-mentioned OSs and to host the connectivity board 502 in an appropriate interface 504 in the nodes and the master. This interface may include, but is not limited to, a PCI bus, an ISA bus, or similar parallel connectivity buses. Also, the nodes need not be of the same hardware type or run the same OS environment in order to operate together in a cluster application. However, they will need to have some common elements: a processing unit 501 (CPU), some internal memory or storage means 505, and an optional non-volatile memory 506 such as a hard drive, flash disk, or similar element. The CPU 501 communicates with the connectivity cards 502 and 503 using an internal communication bus 507, such as a PCI bus or ISA bus.
A node 500, whether a cluster or master node, may also include one or more separate standard network connectivity cards 503, such as an Ethernet card or similar, enabling the node to also communicate with other devices connected to a network or to access the Internet. Over such a connection the nodes may communicate with applications that do not depend on the cluster, or transmit statistical data, control data, or other data that does not depend on high-speed, low-latency connections.
Although the invention has been described in detail for the purpose of illustration, it is to be understood that such detail is solely for that purpose and that variations can be made therein by those skilled in the art without departing from the spirit and scope of the invention except as it may be limited by the following claims.

Claims

1. A network cluster switch (101), comprising: a plurality of connectors (210, 211...212) each connectable to a cluster node(s) (104, 105, 106...10n), a plurality of memory banks, at least one memory bank connected to each cluster node connector, each memory bank comprising: at least one storage means; and memory controller means; a connector (220) connectable (103) to a cluster master (102), for operatively connecting said master (102) to all memory banks (204, 205...207) for essentially simultaneous communication with at least part of the plurality of the memory banks (204, 205...207), said memory banks being essentially simultaneously accessible by both said cluster nodes and said master.
2. The network cluster switch (101) according to claim 1, wherein said cluster switch (101) comprise error detection means supervising each memory bank (204, 205...207).
3. The network cluster switch (101) according to claim 1, wherein said master is provided to be able to trigger an interrupt at at least one of the cluster nodes (104,105,106...10n) through a write operation at a memory bank (204, 205...207) in said cluster switch (101) being operatively connected to said cluster node.
4. The network cluster switch (101) according to claim 1, wherein said cluster node (104,105, 106...10n) being able to trigger an interrupt at the master (102) through a write operation at a memory bank (204, 205...207) in said cluster switch being operatively connected to said master (102).
5. The network cluster switch (101) according to claim 1, wherein traffic between cluster nodes (104, 105, 106...10n) and master (102) is echoed back to sending device for error detection purposes.
6. The network cluster switch (101) according to claim 1, wherein said storage means comprises dual port memory chips.
7. The network cluster switch (101) according to claim 1, wherein connection between cluster nodes (104,105, 106...10n) and cluster switch comprise parallel connections.
8. The network cluster switch (101) according to claim 7, wherein communication signals are transmitted using Low voltage differential signaling (LVDS).
9. The network cluster switch (101) according to claim 1, wherein communication signals are transmitted using a wireless communication protocol.
10. A system (100) for network clusters comprising: a network cluster switch (101) according to claim 1; a plurality of cluster nodes (104, 105, 106...10n); a cluster master (102); said cluster nodes (104, 105, 106...10n) being connected to said cluster master (102) via said cluster switch (101), characterized in that communication between said cluster nodes (104, 105, 106...10n) and cluster master (102) is maintained through said cluster switch (101) using shared memory banks (204, 205...207) in said cluster switch (101), said shared memory (204, 205...207) being capable of essentially simultaneous accessibility by both said cluster node (104, 105, 106...10n) and master (102).
11. The system (100) according to claim 10, wherein said master can trigger an interrupt through said shared memory bank (204, 205...207) at at least one of the cluster nodes (104, 105,106...10n) through a write operation at a memory bank (204, 205...207) in said cluster switch (101) being operatively connected to said cluster node (104,105,106...10n).
12. The system (100) according to claim 10, wherein said cluster node (104,105,106...10n) can trigger an interrupt through said shared memory bank (204, 205...207) at the master (102) through a write operation at a memory bank (204, 205...207) in said cluster switch (101) being operatively connected to said master (102).
13. The system (100) according to claim 10, wherein traffic between nodes (104,105,106...10n) and master is error detected.
14. The system (100) according to claim 10, wherein said memory banks comprise dual port memory chips.
15. The system (100) according to claim 10, wherein connection (103) between cluster nodes (104, 105,106...10n) and cluster switch (101) comprise parallel connections (103).
16. The system (100) according to claim 15, wherein communication signals are transmitted using Low Voltage Differential Signaling (LVDS).
17. The system (100) according to claim 10, wherein communication signals are transmitted using a wireless communication protocol.
18. A method for connecting and controlling cluster nodes, said method comprising: connecting said cluster nodes to a cluster switch (101) comprising memory banks (204, 205...207) for each cluster node (104, 105, 106...10n), said memory bank (204, 205...207) shared between said cluster node (104, 105, 106...10n) and master node (102); memory controlling means; and said method comprising the steps of: connecting said cluster switch (101) to a cluster master (102); and communicating between said cluster nodes (104, 105, 106...10n) and said cluster master (102) using interrupt signaling via said shared memory banks (204, 205...207).
19. The method according to claim 18, wherein all data traffic between cluster nodes (104,105, 106...10n) and the cluster master (102) is error detected.
20. The method according to claim 18, wherein said master (102) can trigger an interrupt through said shared memory bank (204, 205...207) at at least one of the cluster nodes (104, 105, 106...10n) through a write operation at a memory bank (204, 205...207) in said cluster switch (101) being operatively connected to said cluster node (104, 105, 106...10n).
21. The method according to claim 18, wherein said cluster node (104, 105, 106...10n) can trigger an interrupt through said shared memory bank (204, 205...207) at the master (102) through a write operation at a memory bank (204, 205...207) in said cluster switch (101) being operatively connected to said master (102).
22. The method according to claim 18, wherein said memory banks (204, 205...207) comprise dual port memory chips.
23. The method according to claim 18, wherein connection (103) between cluster nodes and cluster switch (101) comprise parallel connections.
24. The method according to claim 23, wherein communication signals are transmitted using Low Voltage Differential Signaling (LVDS).
25. The method according to claim 18, wherein communication signals are transmitted using a wireless communication protocol.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SE2004/000735 WO2005109232A1 (en) 2004-05-12 2004-05-12 Cluster switch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SE2004/000735 WO2005109232A1 (en) 2004-05-12 2004-05-12 Cluster switch

Publications (1)

Publication Number Publication Date
WO2005109232A1 true WO2005109232A1 (en) 2005-11-17

Family

ID=35320389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2004/000735 WO2005109232A1 (en) 2004-05-12 2004-05-12 Cluster switch

Country Status (1)

Country Link
WO (1) WO2005109232A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088754A1 (en) * 1997-10-10 2003-05-08 Barry Edwin F. Methods and apparatus for manifold array processing
WO2000016202A1 (en) * 1998-09-16 2000-03-23 Sony Electronics Inc. Apparatus and method to efficiently implement a switch architecture for a multiprocessor system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111884950A (en) * 2020-07-27 2020-11-03 深圳市信锐网科技术有限公司 Data transmission method, target switch, designated switch and switch system
CN111884950B (en) * 2020-07-27 2022-08-05 深圳市信锐网科技术有限公司 Data transmission method, target switch, designated switch and switch system

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase