EP1565822A2 - Cluster system and cluster method with network connection - Google Patents

Cluster system and cluster method with network connection

Info

Publication number
EP1565822A2
Authority
EP
European Patent Office
Prior art keywords
node
nodes
network
set forth
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP03783677A
Other languages
English (en)
French (fr)
Inventor
Wim A. Coekaerts
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corp
Publication of EP1565822A2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error
    • G06F11/1425 Reconfiguring to eliminate the error by reconfiguration of node membership
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • The invention relates to node clustering. It finds particular application to a clustering system and method that includes interconnection between nodes.
  • A cluster is a group of independent servers that collaborate as a single system.
  • The primary cluster components are processor nodes, a cluster interconnect (private network), and a disk subsystem.
  • The cluster nodes share disk access and the resources that manage the data, but the distinct hardware cluster nodes do not share memory.
  • Each node has its own dedicated system memory as well as its own operating system, database instance, and application software.
  • Clusters can provide improved fault resilience and modular incremental system growth over single symmetric multiprocessor systems. In the event of subsystem failures, clustering ensures high availability. Redundant hardware components, such as additional nodes, interconnects, and shared disks, provide higher availability. Such redundant hardware architectures avoid single points of failure and provide fault resilience.
  • The interconnect was typically a low-cost, slow-speed Ethernet card running TCP/IP or UDP, or a high-cost, high-speed proprietary interconnect such as Compaq's Memory Channel running Reliable DataGram (RDG) or Hewlett-Packard's Hyperfabric/2 with Hyper Messaging Protocol (HMP).
  • RDG: Reliable DataGram
  • HMP: Hyper Messaging Protocol
  • The present invention provides a new and useful method and system of clustering that addresses the above problems.
  • In one embodiment, a cluster includes one or more data storage devices and a plurality of nodes, each having data communication access with the one or more data storage devices.
  • An interconnect bus provides a node-to-node communications link between the plurality of nodes.
  • Self-monitoring logic detects topology changes in the cluster based on a signal on the interconnect bus.
  • In a method of communicating data in a cluster, the cluster includes a plurality of nodes in a communication network with one or more data storage devices, and each of the plurality of nodes controls a software instance.
  • The plurality of nodes are in further communication with each other via an interconnect bus.
  • A data request message is sent from a first node directly to a second node in the plurality of nodes over the interconnect bus, requesting selected data from the second node.
  • The selected data is retrieved by the second node by direct memory access if the selected data is available, and the selected data is transmitted from the second node directly to the first node over the interconnect bus.
  • Figure 1 is an example system diagram of one embodiment of a cluster node in accordance with the present invention.
  • Figure 2 is an example diagram of the interconnect bus controller of Figure 1.
  • Figure 3 is an example of a shared disk cluster architecture.
  • Figure 4 is an example of a shared-nothing cluster architecture.
  • Figure 5 is an example methodology of communicating data using the interconnect bus.
  • Figure 6 is an example methodology of detecting a topology change.
  • Figure 7 is another example methodology of detecting a topology change.
  • Figure 8 is another embodiment of a cluster including a heartbeat system.
  • Figure 9 is another embodiment of a heartbeat system.
  • Figure 10 is an example methodology of maintaining a quorum file.
  • Figure 11 is an example methodology of determining the status of a node using the quorum file.
  • Computer-readable medium refers to any medium that participates in directly or indirectly providing signals, instructions and/or data to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory. Transmission media may include coaxial cables, copper wire, and fiber optic cables. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave/pulse, or any other medium from which a computer can read.
  • Logic includes but is not limited to hardware, firmware, software and/or combinations of each to perform a function(s) or an action(s), and/or to cause a function or action from another component.
  • Logic may include a software-controlled microprocessor, discrete logic such as an application specific integrated circuit (ASIC), or other programmed logic device.
  • ASIC: application specific integrated circuit
  • Logic may also be fully embodied as software.
  • Signal includes but is not limited to one or more electrical signals, analog or digital signals, a change in a signal's state (e.g. a voltage increase/drop), one or more computer instructions, messages, a bit or bit stream, or other means that can be received, transmitted, and/or detected.
  • Software includes but is not limited to one or more computer readable and/or executable instructions that cause a computer or other electronic device to perform functions, actions, and/or behave in a desired manner.
  • the instructions may be embodied in various forms such as routines, algorithms, modules or programs including separate applications or code from dynamically linked libraries.
  • Software may also be implemented in various forms such as a stand-alone program, a function call, a servlet, an applet, instructions stored in a memory, part of an operating system or other type of executable instructions. It will be appreciated by one of ordinary skill in the art that the form of software is dependent on, for example, requirements of a desired application, the environment it runs on, and/or the desires of a designer/programmer or the like.
  • Illustrated in Figure 1 is a simplified clustered database system 100 in accordance with one embodiment of the present invention.
  • Two nodes are shown, node 105 and node 110, although different numbers of nodes may be used and clustered in different configurations.
  • Although a database cluster is used as an example, the system can also be applied to other types of clustered systems.
  • Each node is a computer system that executes software and processes information.
  • the computer system may be a personal computer, a server, or other computing device.
  • Each node may include a variety of components and devices such as one or more processors 115, an operating system 120, memories, data storage devices, data communication buses, and network communication devices.
  • Each node may have a different configuration from other nodes.
  • node 105 will be used to describe an example configuration of a node in the clustered database system 100.
  • nodes are networked in a data sharing arrangement where each node has access to one or more data storage devices 125.
  • the data storage devices 125 can maintain a variety of files such as database files that may be shared by the nodes connected in the cluster.
  • a network controller 130 connects the node 105 to a network 135.
  • the operating system 120 includes a communication interface between software applications running on the node 105 and the network controller 130.
  • the interface may be a network device driver 140 that is programmed in accordance with the selected communications protocol of the network 135.
  • Examples of communication protocols that may be used for network controller 130 and network 135 include the Fibre Channel ANSI Standard X3.230 and/or the SCSI-3 ANSI Standard X3.270.
  • the Fibre Channel architecture provides high speed interface links to both serial communications and storage I/O.
  • Other embodiments of the network controller 130 may support other methods of connecting the storage device 125 and nodes 105, 110 such as embodiments utilizing Fast-40 (Ultra-SCSI), Serial Storage Architecture (SSA), IEEE Standard 1394, Asynchronous Transfer Mode (ATM), Scalable Coherent Interface (SCI) IEEE Standard 1596-1992, or some combination of the above, among other possibilities.
  • the node 105 further includes a database instance 145 that manages and controls access to data maintained in the one or more storage devices 125. Since each node in the clustered database system 100 executes a database instance that allows that particular node to access and manipulate data on the shared database in the storage device 125, a lock manager 150 is provided.
  • the lock manager 150 is an entity that is responsible for granting, queuing, and keeping track of locks on one or more resources, such as the shared database stored on the storage device 125. Before a process can perform an operation on the shared database, the process is required to obtain a lock that grants to the process a right to perform a desired operation on the database. To obtain a lock, a process transmits a request for the lock to a lock manager. To manage the use of resources in a network system, lock managers are executed on one or more nodes in the network.
  • a lock is a data structure that indicates that a particular process has been granted certain rights with respect to the resource.
  • A more detailed description of one example of a lock management system is found in U.S. Patent Number 6,405,274 B1, entitled "ANTICIPATORY LOCK MODE CONVERSIONS IN A LOCK MANAGEMENT SYSTEM," assigned to the present assignee, and which is incorporated herein by reference in its entirety for all purposes.
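  • The grant, queue, and track responsibilities of a lock manager described above can be illustrated with a minimal sketch. The class name `LockManager`, its method names, and the exclusive-only lock mode are assumptions made for illustration; they are not taken from the patent or from the referenced lock management system.

```python
from collections import deque


class LockManager:
    """Toy lock manager (hypothetical): grants, queues, and tracks exclusive locks."""

    def __init__(self):
        self.holders = {}   # resource -> process currently holding the lock
        self.waiters = {}   # resource -> deque of processes waiting for the lock

    def request(self, process, resource):
        """Grant the lock if the resource is free; otherwise queue the request."""
        if resource not in self.holders:
            self.holders[resource] = process
            return True
        self.waiters.setdefault(resource, deque()).append(process)
        return False

    def release(self, process, resource):
        """Release a held lock and hand it to the next queued process, if any."""
        if self.holders.get(resource) != process:
            raise ValueError(f"{process} does not hold a lock on {resource}")
        queue = self.waiters.get(resource)
        if queue:
            next_process = queue.popleft()
            self.holders[resource] = next_process
            return next_process
        del self.holders[resource]
        return None


manager = LockManager()
granted_first = manager.request("process-1", "shared-db")    # granted immediately
granted_second = manager.request("process-2", "shared-db")   # queued behind process-1
next_holder = manager.release("process-1", "shared-db")      # lock passes to process-2
```

    A real lock manager would also support multiple lock modes and run on one or more nodes; the sketch only shows the bookkeeping.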
  • a cluster configuration file 155 is maintained.
  • the cluster configuration file 155 contains a current list of active nodes in the cluster including identification information such as node address, node ID, and connectivity structure (e.g. neighbor nodes, parent-child nodes). Of course, other types of information may be included in such a configuration file and may vary based on the type of network system.
  • topology changes include when a node is added, removed, or stops operating.
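  • As a rough illustration of the cluster configuration file 155, the sketch below keeps a list of active nodes with a node ID, an address, and neighbor links, and serializes it as JSON. The field names, the JSON layout, and the example addresses are assumptions; the patent does not specify a file format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class NodeEntry:
    """One active node in the (hypothetical) cluster configuration."""
    node_id: int
    address: str
    neighbors: list = field(default_factory=list)


def add_node(config, entry):
    """Record a node that was added to the cluster."""
    config[entry.node_id] = entry


def remove_node(config, node_id):
    """Drop a node that was removed or stopped operating, and unlink it from its neighbors."""
    config.pop(node_id, None)
    for entry in config.values():
        if node_id in entry.neighbors:
            entry.neighbors.remove(node_id)


def dump_config(config):
    """Serialize the current list of active nodes."""
    return json.dumps([asdict(entry) for entry in config.values()], indent=2)


config = {}
add_node(config, NodeEntry(105, "10.0.0.5", neighbors=[110]))
add_node(config, NodeEntry(110, "10.0.0.10", neighbors=[105]))
remove_node(config, 110)          # topology change: node 110 leaves the cluster
serialized = dump_config(config)
```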
  • the database cluster system 100 further includes an interconnect network 160 that provides node-to-node communication between the nodes 105 and 110.
  • the interconnect network 160 provides a bus that allows all nodes on the network to have two-way communication with each other.
  • the interconnect 160 provides an active communication protocol for sending messages and data to and from each node over the same bus.
  • each node includes an interconnect bus controller 165 which may be a peripheral card plugged into a PCI slot of the node.
  • the controller 165 includes one or more connection ports 170 for connecting cables between nodes.
  • connection ports are illustrated in port 170 although different numbers of ports may be used.
  • The interconnect bus controller 165 operates in accordance with the IEEE 1394 protocol, also known as FireWire or i.LINK.
  • a bus device driver 175 is provided in order for the database instance 145, or other application running on node 105, to communicate with the interconnect bus 160.
  • the bus device driver 175 works with the operating system 120 to interface applications with the interconnect bus controller 165.
  • Database commands from the database instance 145 are translated by the bus device driver 175 to IEEE 1394 commands or open host controller interface (OHCI) commands.
  • OHCI: open host controller interface
  • the IEEE 1394 OHCI specification defines standard hardware and software for connections to the IEEE 1394 bus. OHCI defines standard register addresses and functions, data structures, and direct memory access (DMA) models.
  • IEEE 1394 is a bus protocol that provides easy to use, low cost, high speed communications.
  • The protocol is very scalable, provides for both asynchronous and isochronous applications, allows for access to large amounts of memory-mapped address space, and allows peer-to-peer communication.
  • the interconnect bus controller 165 may be modified to accommodate other versions of the IEEE 1394 protocol such as IEEE 1394a, 1394b, or other future modifications and enhancements.
  • the IEEE 1394 protocol is a peer-to-peer network with a point-to-point signaling environment.
  • Nodes on the bus 160 may have several ports on them, for example ports 170. Each of these ports acts as a repeater, retransmitting any data packets received by other ports within the node.
  • Each node maintains a node map 180 that keeps track of the current state of the network topology/configuration.
  • the IEEE 1394 protocol supports up to 63 devices on a single bus, and connecting to a device is as easy as plugging in a telephone jack. Nodes, and other devices, can be instantly connected without first powering down the node and rebooting the network. Management of the database cluster topology will be described in greater detail below.
  • the database 145 in node 105 may directly request data, transmit/receive data, or send messages to a running database application on node 110 or other node in the cluster. This avoids having to send messages or data packets to the storage device 125 which would involve one or more intermediate steps, additional disk I/O, and would increase latency.
  • Illustrated in Figure 2 is an example of the interconnect bus controller 165 based on the IEEE 1394 standard. It includes three ISO protocol layers: a transaction layer 200, a link layer 205 and a physical layer 210. The layers may be implemented in logic as defined above including hardware, software, or both.
  • the transaction layer 200 defines a complete request-response protocol to perform bus transactions with three basic operations: read, write, and lock.
  • the link layer 205 is the midlevel layer and it interacts with both the transaction layer 200 and the physical layer 210, providing asynchronous and isochronous delivery service for data packets. Components to control data delivery include a data packet transmitter, data packet receiver, and a clock cycle controller.
  • The physical layer 210 provides the electrical and mechanical interface between the controller 165 and a cable(s) that forms part of the interconnect bus 160. This includes the physical ports 170.
  • the physical layer 210 also ensures that all nodes have fair access to the bus using an arbitration mechanism. For example, when a node needs to access the bus, it sends a request to its parent node(s), which forwards the request to a root node. The first request received by the root is accepted; all others are rejected and withdrawn. The closer the node is to the root, the better its chance of acceptance. To solve consequent arbitration unfairness, periods of bus activity are split into intervals. During an interval, each node gets to transmit once and then it waits until the next interval. Of course, other schemes may be used for arbitration.
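  • The fairness-interval idea above can be sketched as follows. This is a simplified model under stated assumptions: the function name `arbitrate`, the `depth` map (hops from the root), and the deterministic rule that the requester closest to the root always wins are illustrative only; on a real bus a closer node merely has a better chance of acceptance.

```python
def arbitrate(requests, depth):
    """Return the order in which nodes transmit.

    requests: dict mapping node -> number of packets it wants to send
    depth:    dict mapping node -> distance from the root (hypothetical input)
    Within one fairness interval each node transmits at most once.
    """
    pending = dict(requests)
    order = []
    while any(count > 0 for count in pending.values()):
        sent_this_interval = set()          # a new fairness interval begins
        while True:
            eligible = [
                node for node, count in pending.items()
                if count > 0 and node not in sent_this_interval
            ]
            if not eligible:
                break                       # interval over; everyone may request again
            # Assumption: the requester nearest the root wins the arbitration.
            winner = min(eligible, key=lambda node: (depth[node], node))
            order.append(winner)
            pending[winner] -= 1
            sent_this_interval.add(winner)
    return order


depth = {"A": 0, "B": 1, "C": 2}
order = arbitrate({"A": 2, "B": 1, "C": 1}, depth)
```

    Without the interval rule, node A (nearest the root) would send both of its packets before node C sent any; with it, A's second packet waits for the next interval.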
  • Other functions of the physical layer 210 include data resynchronization, encoding and decoding, bus initialization, and controlling signal levels.
  • the physical layer of each node also acts as a repeater, translating the point-to-point connections into a virtual broadcast bus.
  • A standard IEEE 1394 cable provides up to 1.5 amps of DC power to keep remote devices "aware," even when they are powered down.
  • the physical layer also allows nodes to transmit data at different speeds on a single medium. Nodes, or other devices, with different data rate capabilities communicate at the slower device rate.
  • The interconnect bus controller 165, operating based on the IEEE 1394 protocol, is an active port and provides for a self-monitoring/self-configuring serial bus. This capability, known as hot plug-and-play, allows users to add or remove devices even while the bus is active. Thus, nodes and other devices may be connected and disconnected without interrupting network operation.
  • A self-monitoring/self-configuring logic 215 automatically detects topology changes in the cluster system based on changes in the interconnect bus signal.
  • the bus controller 165 of a node places a bias signal on the interconnect bus 160 once the node is connected to the bus. Neighboring nodes, through the self-monitoring logic 215, automatically detect the bias signal which may appear as a change in voltage.
  • the detected bias signal indicates that a node has been added and/or that the node is still active. Conversely, the absence of the bias signal indicates that a node has been removed or has stopped functioning. In this manner, topology changes can be detected without using polling messages that are transmitted between nodes.
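  • A minimal sketch of this polling-free detection, assuming each port exposes a sampled voltage and that a bias signal counts as present above a fixed threshold. The threshold value, the function name, and the event strings are hypothetical and are not taken from the IEEE 1394 specification.

```python
BIAS_THRESHOLD_VOLTS = 0.8  # hypothetical threshold, not from the IEEE 1394 specification


def detect_topology_changes(previous, samples, threshold=BIAS_THRESHOLD_VOLTS):
    """Compare sampled port voltages with the previous bias state.

    previous: dict port -> bool (bias was present at the last sample)
    samples:  dict port -> measured voltage on that port
    Returns the new state and a list of (port, event) tuples.
    """
    current = {}
    events = []
    for port, volts in samples.items():
        present = volts >= threshold
        current[port] = present
        was_present = previous.get(port, False)
        if present and not was_present:
            events.append((port, "node added"))
        elif was_present and not present:
            events.append((port, "node removed or stopped"))
    return current, events


state, first_events = detect_topology_changes({}, {0: 1.0, 1: 0.0})
state, second_events = detect_topology_changes(state, {0: 0.1, 1: 1.2})
```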
  • the self-configuring aspect of the logic 215 will be described in greater detail with reference to Figures 6 and 7.
  • An application program interface (API) layer 220 may be included in the bus controller 165 as an interface to the bus device driver 175. It generally includes higher level system guidelines/interfaces that bring the data, the end system design, and the application together.
  • the API layer 220 may be programmed with desired features to customize communication between the database instance 145 (and other applications) and the interconnect bus controller 165.
  • the functions of the API layer 220 may be embodied in whole or in part within the transaction layer 200 or the bus device driver 175.
  • A database cluster architecture 300, shown in Figure 3, is one in which the present system and method may be implemented.
  • the architecture 300 is generally known as a shared disk architecture and is similar to Figure 1 except that additional nodes are shown.
  • files and/or data are logically shared among the nodes with each database instance having access to all data.
  • the shared disk access is accomplished, for example, by direct hardware connectivity to one or more storage devices 305 that maintain the files.
  • the connections may be performed by using an operating system abstraction layer that provides a single view of all the storage devices 305 on all the nodes.
  • The nodes A-D are also connected via the node interconnect 160 to provide node-to-node communication.
  • transactions running on any database instance within a node can directly read or modify any part of the database on storage device 305. Access is controlled by one or more lock managers as described previously.
  • Cluster architecture 400 is typically referred to as a shared-nothing architecture.
  • An example of a shared-nothing architecture is described in U.S. Patent Number 6,321,218, entitled "HYBRID SHARED NOTHING/SHARED DISK DATABASE SYSTEM," assigned to the present assignee, and which is incorporated herein by reference in its entirety for all purposes.
  • Database files, for example, are partitioned among the database instances running on nodes A-D. Each database instance or node has ownership of a distinct subset of the data, and all access to this data is performed exclusively by this "owning" instance.
  • the nodes are also connected with the interconnect 160.
  • For example, if the data files stored on storage devices A-D contained employee files, the data files may be partitioned such that node A controls employee files for employee names beginning with the letters A-G, node B controls employee files on storage device B for employee names H-N, node C controls employee files for names O-U on storage device C, and node D controls employee files for names V-Z on storage device D.
  • If node D desired an employee file that was controlled by node A, a message would be sent to node A requesting such data.
  • Node A would then retrieve the data file from storage device A and transmit the data to node D.
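  • The employee-name partitioning in this example can be expressed as a small routing function. The letter ranges come from the text; the function name `owning_node` and the table layout are assumptions for illustration.

```python
# Letter ranges from the example: (first letter, last letter, owning node).
PARTITIONS = [
    ("A", "G", "node A"),
    ("H", "N", "node B"),
    ("O", "U", "node C"),
    ("V", "Z", "node D"),
]


def owning_node(employee_name):
    """Return the node that owns the employee file for the given name."""
    initial = employee_name.strip()[:1].upper()
    for first, last, node in PARTITIONS:
        if first <= initial <= last:
            return node
    raise ValueError(f"no partition owns the name {employee_name!r}")


# Node D needs the file for "Adams": the request must be sent to node A.
target = owning_node("Adams")
```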
  • the present system and method may be implemented on other cluster architectures and configurations such as tree structures and with other data access rights and/or restrictions as desired for a particular application.
  • Illustrated in Figure 5 is one embodiment of a methodology associated with the cluster system of Figure 3 or 4.
  • the embodiment describes directly transmitting and receiving data between nodes using the interconnect bus 160.
  • the illustrated elements denote "processing blocks" and represent computer software instructions or groups of instructions that cause a computer to perform an action(s) and/or to make decisions.
  • the processing blocks may represent functions and/or actions performed by functionally equivalent circuits such as a digital signal processor circuit or an application specific integrated circuit (ASIC).
  • The diagram, as well as the other illustrated diagrams, does not depict the syntax of any particular programming language. Rather, the diagram illustrates functional information one skilled in the art could use to fabricate circuits, to generate computer software, or to use a combination of hardware and software to perform the illustrated processing.
  • diagram 500 is one example of communicating data between nodes using the node-to-node interconnect network 160.
  • When a node (a requesting node) needs data, a data request message is transmitted (block 505) to a destination node via the interconnect bus 160.
  • The data request may be sent directly to one or more selected destination nodes by attaching the node name and/or address to the request. If the location of the requested data is unknown, the data request may be broadcast to each node in the interconnect network.
  • At the destination node, the database instance determines whether the data is available on that node (block 510). If the data is not available, a message is transmitted to the requesting node that the data is not available (block 515). If the data is available, the data is retrieved from local memory (block 520) by direct memory access, and it is transmitted to the requesting node over the interconnect bus (block 525). Remote direct memory access can also be implemented to perform a direct memory-to-memory transfer. In this manner, messages and data may be transmitted directly between nodes without having to transmit the messages or data to a shared storage device. The node-to-node communication reduces latency and reduces the number of disk inputs/outputs.
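  • The request flow of blocks 505 to 525 can be modeled in a few lines. This is a sketch only: the `Node` class, its in-memory dictionary standing in for local memory, and the direct method calls standing in for interconnect bus messages are all assumptions, and no real IEEE 1394 or DMA interface is used.

```python
class Node:
    """Hypothetical cluster node; method calls stand in for interconnect bus messages."""

    def __init__(self, name):
        self.name = name
        self.memory = {}    # data held in this node's local memory
        self.peers = {}     # nodes reachable over the interconnect

    def connect(self, other):
        self.peers[other.name] = other
        other.peers[self.name] = self

    def handle_request(self, key):
        """Destination side: check availability (block 510) and answer."""
        if key in self.memory:
            return ("data", self.memory[key])      # blocks 520 and 525
        return ("not available", None)             # block 515

    def request_data(self, key, destination=None):
        """Requesting side (block 505): ask one node, or broadcast if the owner is unknown."""
        targets = [destination] if destination else sorted(self.peers)
        for name in targets:
            status, payload = self.peers[name].handle_request(key)
            if status == "data":
                return name, payload
        return None, None


node_a, node_b, node_c = Node("A"), Node("B"), Node("C")
node_a.connect(node_b)
node_a.connect(node_c)
node_c.memory["block-42"] = b"payload"

direct = node_a.request_data("block-42", destination="B")   # B does not hold the data
broadcast = node_a.request_data("block-42")                  # found on node C
```

    Note that no shared storage device appears anywhere in the exchange, which is the point of the node-to-node path.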
  • Illustrated in Figure 6 is an example methodology of reconfiguring the cluster architecture based on the IEEE 1394 bus protocol.
  • The interconnect bus controller 165 (Figure 1), operating based on the IEEE 1394 protocol, is an active port and provides for a self-configuring serial bus.
  • nodes and other devices may be connected and disconnected without interrupting network operation.
  • the bus is reset (block 605).
  • the interconnect controller 165 of the added node automatically sends a bias signal on the bus and neighboring nodes can detect its bias signal (block 610).
  • the absence of a node's bias signal can be detected when a node is removed.
  • the interconnect controller 165 of neighboring nodes can detect signal changes on the interconnect bus 160 such as a change in the bus signal strength caused by adding or removing a node.
  • the topology change is then transmitted to all other nodes in the database cluster.
  • the bus node map is rebuilt with the changes (block 615).
  • the node map can be updated with the changes.
  • the database instance is notified and it updates the cluster configuration file (block 620) to keep track of the active nodes for the lock managers.
  • the order of the illustrated sequence may be implemented in other ways.
  • the interconnect controller 165 is an active port that includes a self-monitoring/self-configuration mechanism as described above. With this mechanism, the database cluster system can be reconfigured without the added latency involved with polling mechanisms since nodes can virtually instantly detect a change in the topology. The active port also allows reconfiguration of the cluster without having to power-down the network.
  • Illustrated in Figure 7 is another embodiment of detecting and reconfiguring the cluster.
  • Each node monitors the interconnect bus (block 705) to detect a change in the bus signal such as the presence or absence of a bias signal.
  • When a node detects a topology change (block 710), it sends a bus reset signal on the bus, starting a self-configuring mechanism.
  • This mechanism, managed by the physical layer 210, may include three phases: bus initialization, tree identification, and self identification.
  • During bus initialization, active nodes are identified and a treelike logical topology is built (block 715). Each active node is assigned an address, a root node is dynamically assigned, and the node map is rebuilt or updated with the new topology (block 720).
  • the nodes can then access the bus.
  • the database instances on each node are notified of the topology change (block 725) and the database lock manager(s) are reconfigured with the changes so that the shared database can be managed properly throughout the cluster (block 730).
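  • The reconfiguration sequence of blocks 715 to 725 can be approximated with the sketch below, which rebuilds a tree-shaped node map from the cable links between active nodes and then notifies listeners. It is not the IEEE 1394 tree-identification or self-identification procedure: choosing the alphabetically first node as root, the breadth-first traversal, and the sequential address assignment are simplifying assumptions, as are all names used.

```python
from collections import deque


def reconfigure(links, active, listeners=()):
    """Rebuild the node map after a bus reset.

    links:     iterable of (node, node) cable connections
    active:    set of nodes whose bias signal is currently present
    listeners: callbacks standing in for database instances to notify (block 725)
    """
    neighbors = {node: set() for node in active}
    for a, b in links:
        if a in active and b in active:
            neighbors[a].add(b)
            neighbors[b].add(a)

    root = min(active)                      # assumption: deterministic root choice
    parent = {root: None}
    queue = deque([root])
    while queue:                            # build the tree-like logical topology
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    addresses = {node: index for index, node in enumerate(parent)}
    node_map = {"root": root, "parent": parent, "addresses": addresses}
    for notify in listeners:
        notify(node_map)
    return node_map


links = [("A", "B"), ("B", "C"), ("B", "D")]
notifications = []
before = reconfigure(links, {"A", "B", "C", "D"}, listeners=[notifications.append])
after = reconfigure(links, {"A", "B", "C"}, listeners=[notifications.append])  # node D removed
```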
  • network connections such as network 135 may be implemented in other ways.
  • It may include communications or networking software such as the software available from Novell, Microsoft, Artisoft, and other vendors, and may operate using TCP/IP, SPX, IPX, and other protocols over twisted pair, coaxial, or optical fiber cables, telephone lines, satellites, microwave relays, radio frequency signals, modulated AC power lines, and/or other data transmission media known to those of skill in the art.
  • the network 135 can be connectable to other networks through a gateway or similar mechanism.
  • the protocol of the interconnect bus 160 may include a wireless version.
  • a heartbeat system is a mechanism where nodes periodically generate signals or messages indicating that they are active and functioning. The mechanism also allows nodes to determine the health or status of other nodes in the cluster based on the generated signals.
  • the cluster 800 includes nodes 805 and 810 although any number of nodes may be connected to the cluster.
  • The illustrated nodes may have a configuration similar to that of the nodes shown in Figure 1. However, a simplified configuration is shown for illustrative purposes.
  • the nodes 805, 810 share access to a storage device 815 that maintains files such as database files.
  • the nodes are connected to the storage device 815 by a shared storage network 820.
  • the network 820 is based on IEEE 1394 communication protocol.
  • nodes 805, 810 and the storage device 815 include an IEEE 1394 network controller 825.
  • the network controller 825 is similar to the interconnect bus controller 165 and in one embodiment, is a network card that is plugged into each device. Alternatively, the controller may be fixed within the node.
  • the network controller 825 includes one or more ports so that cables can be connected between each device. Additionally, other types of network connections may be utilized, for example wireless connections, that are based on the IEEE 1394 protocol, or other similar protocol standards.
  • each node includes a database instance 830 that controls access to the files on the storage device 815. Since resources are shared between nodes in the database cluster 800, each node includes logic to inform other nodes of its health and logic to determine the health of other nodes on the network. For example, a heartbeat logic 835 is programmed to generate and transmit a heartbeat message within a predetermined time interval. A heartbeat message is also referred to as a status signal.
  • the predetermined time interval may be any selected interval but is typically on the order of milliseconds to seconds, for example, 300 milliseconds to 5 seconds. So if the interval is one second, each node would transmit a heartbeat message every second.
  • the network load is used as a factor in determining the heartbeat time interval. For example, if heartbeat messages are transmitted on the same network as data, then a high frequency of heartbeat messages on the network may cause delays in data transmission processes.
  • Figure 8 shows a network that may be impacted by this situation while Figure 9 shows a network that reduces the amount of network traffic by implementing the heartbeat system on a different network. It will be further appreciated that the networks of Figures 8 and 9 may also be configured as a shared-nothing architecture.
  • heartbeat messages from each node are collected and stored in a quorum file 840.
  • the quorum file 840 is one or more files or areas defined within the storage device 815 which also maintains the shared files.
  • Each node in the cluster 800 is allocated address space within the quorum file 840 in which its heartbeat messages are stored.
  • the space of the quorum file 840 is typically equally divided and allocated to each node although other configurations may be possible.
  • the quorum file 840 can be implemented as a separate file for each node rather than one file for the entire cluster even though the file may be logically defined as one data structure.
  • the quorum file may be implemented as a stack, an array, a table, a linked list, a text file or other type of data structure, stored in one or more memory locations, registers, or other type of storage area.
  • Illustrated in Figure 9 is another embodiment of a database cluster 900 and a heartbeat system.
  • nodes 905 and 910 communicate with a quorum device 915 over a quorum network 920.
  • the quorum network 920 is separate from a shared storage network 925.
  • the nodes access shared files on storage device 930 using a different network bus than the quorum network.
  • the quorum network 920 may be part of a node-to-node interconnect network as previously described.
  • the quorum device 915 includes data storage configured to maintain a quorum file for storing heartbeat messages received from the nodes in the cluster.
  • the nodes 905, 910 are connected to the quorum device 915 and communicate with each other in accordance with the IEEE 1394 communication protocol.
  • Each node and the quorum device 915 include an IEEE 1394 controller 935 similar to the controllers described previously. Since a separate network is configured for data communication to the files, each node includes a separate shared network controller 940 that communicates with the storage device 930.
  • the shared network controller 940 may be an IEEE 1394 controller or a controller for another network protocol such as the Fibre Channel protocol.
  • a database instance 945 within each node processes data requests over the shared network controller 940.
  • a heartbeat logic 950 controls the heartbeat mechanism and uses the IEEE 1394 controller 935 to communicate with the quorum device 915.
  • adding or replacing a quorum device 915 within an existing database cluster 900 can be easily performed with minimal impact on the existing network.
  • traffic on the shared storage network 925 is reduced, allowing quicker responses for data processing requests.
  • the clusters of Figures 8 and 9 may include a node-to-node interconnect network.
  • Illustrated in Figure 10 is an example methodology 1000 of a heartbeat system performed with the quorum file 840 or quorum device 915, both of which will be referred to below as a quorum file.
  • Once a quorum file is configured and activated within a database cluster, memory within the quorum file is allocated to each of the nodes in the cluster (block 1005). The quorum file may be equally divided and allocated to each node or other allocations may be defined.
  • the quorum file receives heartbeat messages from each node in accordance with the IEEE 1394 protocol (block 1010). Each heartbeat message includes a node identifier that identifies the node sending the message and a time stamp indicating the time of the message. Each message received by the quorum file is then stored in its node's allocated location (block 1015) and the process repeats for each received heartbeat message.
  • heartbeat messages are stored in the quorum file in the order they are received. Thus, by comparing the most recently received time stamps to the current time, the system can determine which nodes are actively sending their heartbeat messages. This information can indicate whether a node is active or not. For example, if a node has missed a predetermined number of consecutive time stamps, a potential problem may be assumed. Any number of messages can be stored for each node including one message.
  • the heartbeat logic of each node is programmed to generate and transmit a heartbeat message at a predetermined interval. Thus, by reading the data from the quorum file, the logic can determine if a number of missed intervals has occurred. This type of status check logic may be part of the heartbeat logic 835 or 950 and will be described in greater detail with reference to Figure 11.
  • Figure 11 illustrates an example methodology for determining the health or status of a node.
  • the heartbeat logic includes logic for generating each heartbeat message at the predetermined time interval and transmitting the message to the quorum file.
  • the heartbeat logic of a node may update its cluster configuration file to determine the current set of active nodes and to determine if any nodes have stopped functioning or otherwise have been removed from the network. This determination may also be synchronized throughout the cluster.
  • a status check logic (not illustrated) may be programmed as part of the heartbeat logic to perform this task as follows. To begin a status check, the quorum file is read to review the time stamped information for each of the nodes (block 1105).
  • the logic can determine if a particular node is still functioning based on the time of the last messages written to the quorum file (block 1110).
  • a threshold may be set to allow a predetermined number of time stamps to be missed before the determination indicates that a problem may exist. For example, a node may be allowed to miss two consecutive time stamps but if a third is missed, then the node may not be functioning properly.
  • the threshold may also be set to other values, for example a value of 1.
  • If a node misses the designated number of time stamp messages (block 1120), this does not necessarily mean that the node has stopped functioning. Since the nodes are connected to the quorum file in accordance with the IEEE 1394 standard, an additional status check can be performed. As explained previously, the IEEE 1394 bus is active and each device connected to the bus can detect if a neighboring node stops functioning or is removed from the network. This additional information may help to better determine the health of a node. The status logic can compare the time stamp information from the quorum file and the node map data maintained by the IEEE 1394 controller.
  • If a node does not miss its time stamp, the node is presumably functioning properly. However, an additional determination may be made by checking if the node is active in the node map (block 1140). If the node is active (block 1145), then the node is functioning properly. If the node is not active (block 1150), then a possible network bus error may exist. Thus, with information from both the quorum file and the node map of the IEEE 1394 bus, a more detailed analysis of node health may be performed. Furthermore, in the cluster configuration shown in Figure 9 in the embodiment where the shared storage network 925 is also an IEEE 1394 bus, two separate network node maps are maintained. The additional node map may also be included in the above comparison process and status check.
  • a simplified embodiment may be implemented.
  • the logic can declare that node as non-functioning and remove it from the cluster configuration file of the database instances. In this process, the node maps are not reviewed.
  • a storage device may include one or more dedicated storage devices such as magnetic or optical disk drives, tape drives, electronic memories or the like.
  • a storage device may also include one or more processing devices such as a computer, a server, a hand-held processing device, or similar device that contains storage, memories, or combinations of these for maintaining data.
  • the storage device may also be any computer-readable medium.
  • Suitable software for implementing the various components of the present system and method is readily provided by those of skill in the art using the teachings presented here and programming languages and tools such as Java, Pascal, C++, C, CGI, Perl, SQL, APIs, SDKs, assembly, firmware, microcode, and/or other languages and tools.
  • the components embodied as software include computer readable/executable instructions that cause a computer to behave in a prescribed manner.
  • the software may be embodied as an article of manufacture and/or stored in a computer readable medium as defined previously.
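
By way of illustration only, the heartbeat methodology of Figures 10 and 11 can be sketched in Python as follows. The sketch is not taken from the specification: the names QuorumFile, NodeStatus, missed_intervals and check_node are hypothetical, the quorum file is modelled as an in-memory dictionary rather than an area on a storage device, and the IEEE 1394 node map is modelled as a plain set of node identifiers. The text states an outcome only for a node whose time stamps are current (blocks 1140 to 1150); the SUSPECT and DOWN results for a node that has exceeded the missed time stamp threshold are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeStatus(Enum):
    HEALTHY = "healthy"               # time stamps current, node active in node map
    BUS_ERROR = "possible bus error"  # time stamps current, node absent from node map
    SUSPECT = "suspect"               # assumption: threshold exceeded, still in node map
    DOWN = "down"                     # assumption: threshold exceeded, absent from node map


@dataclass
class QuorumFile:
    """In-memory stand-in for the quorum file, with one slot per node."""
    max_per_node: int = 4                      # hypothetical slot capacity
    slots: dict = field(default_factory=dict)

    def allocate(self, node_ids):
        # Block 1005: allocate space in the quorum file to each node.
        for node_id in node_ids:
            self.slots[node_id] = []

    def store(self, node_id, timestamp):
        # Blocks 1010-1015: store the message in the sending node's slot,
        # in order of receipt, discarding the oldest entry when full.
        slot = self.slots[node_id]
        slot.append(timestamp)
        if len(slot) > self.max_per_node:
            del slot[0]

    def last_timestamp(self, node_id):
        slot = self.slots[node_id]
        return slot[-1] if slot else None


def missed_intervals(last_ts, now, interval):
    """Whole heartbeat intervals elapsed since the last stored time stamp."""
    if last_ts is None:
        return float("inf")
    return max(0, int((now - last_ts) // interval))


def check_node(quorum, node_map, node_id, now, interval, allowed_misses=2):
    """Combine quorum file time stamps with node map membership."""
    # Blocks 1105-1110: read the quorum file and compare with the current time.
    missed = missed_intervals(quorum.last_timestamp(node_id), now, interval)
    in_map = node_id in node_map
    if missed <= allowed_misses:
        # Blocks 1140-1150: time stamps are current, so consult the node map.
        return NodeStatus.HEALTHY if in_map else NodeStatus.BUS_ERROR
    # Threshold exceeded (block 1120); these outcomes are assumptions.
    return NodeStatus.SUSPECT if in_map else NodeStatus.DOWN
```

With a one second interval and the default threshold of two allowed misses, a node whose last stored time stamp is three or more intervals old is no longer treated as current, which mirrors the example in which a third missed time stamp indicates a potential problem. The simplified embodiment, in which the node maps are not reviewed, would correspond to ignoring the node_map argument and acting on the missed count alone.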

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Small-Scale Networks (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)
EP03783677A 2002-11-27 2003-11-19 Clustering system and method having interconnect Ceased EP1565822A2 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US30556702A 2002-11-27 2002-11-27
US305567 2002-11-27
PCT/US2003/036944 WO2004051474A2 (en) 2002-11-27 2003-11-19 Clustering system and method having interconnect

Publications (1)

Publication Number Publication Date
EP1565822A2 (de) 2005-08-24

Family

ID=32467766

Family Applications (1)

Application Number Title Priority Date Filing Date
EP03783677A Ceased EP1565822A2 (de) 2002-11-27 2003-11-19 Clustering system and method having interconnect

Country Status (6)

Country Link
EP (1) EP1565822A2 (de)
JP (1) JP4653490B2 (de)
CN (1) CN1717659B (de)
AU (1) AU2003291089A1 (de)
CA (1) CA2504170C (de)
WO (1) WO2004051474A2 (de)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100375427C (zh) * 2005-11-25 2008-03-12 杭州华三通信技术有限公司 一种集群设备批量传输文件的方法及文件传输设备
US8922559B2 (en) * 2010-03-26 2014-12-30 Microsoft Corporation Graph clustering
WO2012032572A1 (ja) * 2010-09-08 2012-03-15 株式会社日立製作所 計算機
CN102521297B (zh) * 2011-11-30 2015-09-09 北京人大金仓信息技术股份有限公司 无共享数据库集群中实现系统动态扩展的方法
CN103905499B (zh) * 2012-12-27 2017-03-22 深圳市金蝶天燕中间件股份有限公司 利用共享磁盘构建通信通道的方法和系统
CN103631623A (zh) * 2013-11-29 2014-03-12 浪潮(北京)电子信息产业有限公司 一种集群系统中部署应用软件的方法及装置
CN104753702B (zh) * 2013-12-27 2018-11-20 华为技术有限公司 一种集群系统中的集群处理方法、装置及系统
CN104052804A (zh) * 2014-06-09 2014-09-17 深圳先进技术研究院 一种不同任务拓扑间共享数据流的方法、装置及集群
CN109299407A (zh) * 2018-10-22 2019-02-01 田大可 一种自主构建的多地址多网点信息推送的方法
CN113590709B (zh) * 2021-06-18 2023-11-14 浙江中控技术股份有限公司 工业数据库集群系统及其数据访问方法
CN113608932B (zh) * 2021-10-09 2022-02-15 深圳市科力锐科技有限公司 数据库演练方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS58124353A (ja) * 1982-01-21 1983-07-23 Fujitsu Ltd バス監視方式
JPH11252093A (ja) * 1998-02-27 1999-09-17 Sony Corp 情報処理装置および方法、並びに提供媒体
JP4035235B2 (ja) * 1998-08-24 2008-01-16 キヤノン株式会社 電子機器
US6438705B1 (en) * 1999-01-29 2002-08-20 International Business Machines Corporation Method and apparatus for building and managing multi-clustered computer systems
JP2002328823A (ja) * 2001-04-27 2002-11-15 Toshiba Corp 非共有型パラレルデータベースサーバシステム、このシステムにおけるデータ書き込み方法及び一致化処理方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2004051474A3 *

Also Published As

Publication number Publication date
CA2504170A1 (en) 2004-06-17
WO2004051474A2 (en) 2004-06-17
CA2504170C (en) 2016-06-21
CN1717659B (zh) 2010-04-28
AU2003291089A1 (en) 2004-06-23
CN1717659A (zh) 2006-01-04
JP4653490B2 (ja) 2011-03-16
WO2004051474A3 (en) 2004-07-29
JP2006508469A (ja) 2006-03-09

Similar Documents

Publication Publication Date Title
US7451359B1 (en) Heartbeat mechanism for cluster systems
US11354179B2 (en) System, method and computer program product for sharing information in a distributed framework
US6934878B2 (en) Failure detection and failure handling in cluster controller networks
US8055735B2 (en) Method and system for forming a cluster of networked nodes
US6892316B2 (en) Switchable resource management in clustered computer system
CA2504170C (en) Clustering system and method having interconnect
US7499987B2 (en) Deterministically electing an active node
CN104657240B (zh) 多内核操作系统的失效控制方法及装置
WO2001075677A1 (en) Constructing a component management database for managing roles using a directed graph
US20040024732A1 (en) Constructing a component management database for managing roles using a directed graph
Farazdel et al. Understanding and using the SP Switch

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20050614

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK

DAX Request for extension of the european patent (deleted)
RBV Designated contracting states (corrected)

Designated state(s): DE FR GB NL

REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1080560

Country of ref document: HK

RIN1 Information on inventor provided before grant (corrected)

Inventor name: COEKAERTS, WIM, A.

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20070209

REG Reference to a national code

Ref country code: HK

Ref legal event code: WD

Ref document number: 1080560

Country of ref document: HK