WO2003088594A1 - A method for providing redundancy for channel adapter failure - Google Patents

A method for providing redundancy for channel adapter failure

Info

Publication number
WO2003088594A1
Authority
WO
WIPO (PCT)
Prior art keywords
channel adapter
ports
control information
providing
host channel
Prior art date
Application number
PCT/EP2003/003530
Other languages
English (en)
Inventor
Thomas Schlipf
Gerd Konrad Bayer
Wolfgang Eckert
Markus Helms
Juergen Maergner
Christoph Raisch
Klaus Theurich
Original Assignee
International Business Machines Corporation
Ibm Deutschland Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation, IBM Deutschland GmbH
Priority to JP2003585378A (published as JP2005527898A)
Priority to AU2003226784A (published as AU2003226784A1)
Priority to KR10-2004-7014653A (published as KR20050002865A)
Publication of WO2003088594A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/22 Alternate routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/58 Association of routers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/35 Switches specially adapted for specific applications
    • H04L49/356 Switches specially adapted for specific applications for storage area networks
    • H04L49/358 Infiniband Switches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/557 Error correction, e.g. fault recovery or fault tolerance

Definitions

  • the present invention relates generally to digital network communication, and specifically to improving the reliability of a computer system or any other node attached to an InfiniBand subnet or fabric.
  • I/O interconnect architectures in which computing hosts and peripherals are linked by a switching network, commonly referred to as a switching fabric.
  • IB InfiniBand
  • the IB architecture is described in detail in the InfiniBand Architecture Specification, Release 1.0.a, which is available from the InfiniBand Trade Association at www.infinibandta.org and is incorporated herein by reference.
  • HCAs Host Channel Adapters
  • TCAs Target Channel Adapters
  • the HCAs tend to be located near the servers' CPUs and memory, while the TCAs tend to be located near the systems' disk storage and other peripherals.
  • Switches or Routers may be located between HCAs and TCAs, directing data packets to the correct TCA destination based on information that is contained in the data packets themselves.
  • The link between HCAs and TCAs is either an InfiniBand point-to-point link or a switch or router, which allows a uniform InfiniBand subnet or fabric environment, respectively, to be created.
  • One of the key points of this switch is that it allows packets of information (or data) to be managed based on variables, such as service level (SL) and a destination identifier (DLID/DGID).
  • SL service level
  • DLID/DGID destination identifier
  • the InfiniBand architecture is developed with a serial, switched fabric approach. This switched nature allows for low-latency, high-bandwidth characteristics of the InfiniBand architecture. Clustered systems and networks require a connectivity standard that allows for fault tolerant interconnects.
  • InfiniBand architecture which incorporates advanced fault detection and correction mechanisms.
  • IBM PCI-X to InfiniBand Host Channel Adapter which allows connectivity between a host's PCI-X bus and an InfiniBand network.
  • the dual InfiniBand ports provide the capability to support Automatic Path Migration and single or multiple subnet connections with a single HCA device.
  • APM Automatic Path Migration
  • HCA Host Channel Adapter
  • TCA Target Channel Adapter
  • APM provides a redundancy mechanism in case of a port failure of an HCA or TCA or a link, switch, or router failure in a subnet or fabric.
  • InfiniBand defines a redundancy mechanism only for the case where one or more ports of an HCA fail, not for the case where the entire HCA fails.
  • Summary of the invention
  • the invention provides for a redundancy mechanism for a Channel Adapter (CA), such as a Host Channel Adapter (HCA) or a Target Channel Adapter (TCA), in case of a complete Channel Adapter failure. It is a particular advantage of the invention that the redundancy mechanism fits seamlessly into the InfiniBand architecture and relies on the fault detection and correction methods which are specified in the InfiniBand architecture.
  • CA Channel Adapter
  • At least two physical Host Channel Adapters are provided.
  • the two physical Host Channel Adapters are registered as one logical Host Channel Adapter in terms of the InfiniBand architecture.
  • Both Host Channel Adapters have dedicated caching means which cooperate with the system memory for storing Queue Pair (QP) control information in the form of Queue Pair Control Blocks (QPCBs).
  • QP Queue Pair
  • QPCBs Queue Pair Control Blocks
  • write-through caches are utilized.
  • the QPCBs stored in system memory are an exact copy of the contents of the dedicated cache of each physical Host Channel Adapter.
  • write-back caches are used for Host Channel Adapters.
  • the system memory is synchronized with the caches at certain times and does not always reflect the actual contents of the caches at any given point in time.
  • This copy may contain stale data.
  • the fault detection and correction mechanisms provided by the InfiniBand architecture are utilized.
  • CAs Channel Adapters
  • Figure 1 shows a block diagram illustrating the operation of a single Host Channel Adapter with a dedicated cache memory
  • Figure 2 shows a block diagram of a computer system having a redundant logical Host Channel Adapter for the case of a write-through cache
  • Figure 3 shows the block diagram of figure 2 after the replacement of the failing Host Channel Adapter by the redundancy mechanism
  • Figure 4 illustrates the discrepancy which can occur between the state of a cache and system memory for a write-back cache
  • Figures 5 to 7 illustrate the utilization of the fault detection and correction methods provided by the InfiniBand architecture for implementing the redundancy mechanism of the invention in case of the use of write-back caches.
  • Figure 1 shows a computer system having a Host Channel Adapter 1 comprising a cache 2 and a cache directory 3. Further the computer system has system memory 4.
  • through cache directory 3 and cache 2, the address space for the Queue Pair Control Blocks (QPCBs) is virtualized.
  • All Queue Pair Control Blocks reside in system memory 4 and are loaded into the Host Channel Adapter cache 2 when used and unloaded when no longer used. A failure of the Host Channel Adapter 1 does not prevent this data from being accessed from a physically different Host Channel Adapter.
  • Figure 2 shows a block diagram of a preferred embodiment of the invention which illustrates the redundancy mechanism. Like elements of the computer system of figure 2 and the computer system of figure 1 are designated by the same reference numerals.
  • the computer system has a physical Host Channel Adapter 1 with one or more ports 6 and a physical Host Channel Adapter 7 with one or more ports 8.
  • the ports 6 and 8 are connected to an InfiniBand subnet or fabric 9.
  • the two physical Host Channel Adapters 1 and 7 are recognized as one single Host Channel Adapter according to the InfiniBand architecture. Thereby a logical Host Channel Adapter 10 is constituted.
  • the logical Host Channel Adapter 10 has the ports 6 and 8 of the physical Host Channel Adapters 1 and 7.
  • the physical Host Channel Adapter 1 has the cache 2 and the physical Host Channel Adapter 7 has the cache 11. Both caches 2 and 11 are organized as write-through caches.
  • the computer system has system memory 4 for storage of Queue Pair control block data for the physical Host Channel Adapters 1 and 7.
  • the Queue Pair numbers of the different physical Host Channel Adapters 1 and 7 are disjoint.
  • There is no further restriction on the Queue Pair numbers.
  • the physical Host Channel Adapter 1 has a block 12 of Queue Pair Control Blocks QPCB_2 to QPCB_m, and the physical Host Channel Adapter 7 has a block 13 of Queue Pair Control Blocks QPCB_m+1 to QPCB_n.
  • QPCB_0 and QPCB_1 are used for subnet management purposes and are not further considered here.
  • the QPCB data in the system memory 4 is identical to the data in the caches 2 and 11.
  • Figure 3 illustrates the redundancy mechanism for dealing with a complete failure of the physical Host Channel Adapter 1 of figure 2.
  • a copy of a QPCB in block 12 is made into the cache 11 as needed.
  • because block 12 contains an exact copy of the contents of cache 2, no further recovery mechanisms are required.
  • Figure 4 illustrates the situation in the case of write-back caches. If a write-back cache 14 is used rather than a write-through cache, the QPCBs stored in system memory 4 do not always reflect the up-to-date state of the QPCB data in cache 14. This is why, when a write-back cache is used, an additional fault detection and correction method of the InfiniBand architecture needs to be invoked.
  • Figure 5 illustrates the situation before failover of one of the physical Host Channel Adapters.
  • PSNs packet sequence numbers
  • sequence 16 of outstanding PSNs is stored in the local cache memory, which is a write-back cache.
  • This sequence 16 represents the up-to-date sequence of transmitted packets.
  • sequence number Sn is up-to-date in this sequence
  • At the receiver's side there is a sequence 17 of PSNs.
  • the next packet expected by the receiver is the packet with the sequence number Rn.
  • After failover of one of the physical Host Channel Adapters the sequence 15 remains unaffected as it is stored in system memory 4.
  • a copy of the sequence 15 is provided to the remaining still operating physical Host Channel Adapter. This way the sequence 16 of the cache of the failing Host Channel Adapter is replaced by the sequence 15 in the cache of the remaining still operating physical Host Channel Adapter.
  • the receiver returns an acknowledgement (ACK) to the sending Host Channel Adapter and discards the packet.
  • ACK acknowledgement
  • the Host Channel Adapter sends the next packet identified in the sequence 15. This way the sequence 15 is processed until it reaches the original state of the sequence 16 before failover. After this state is reached, normal system operation continues.
  • the receiver sends an acknowledgement for having received the packet with the sequence number Sn to the logical Host Channel Adapter.
  • the logical Host Channel Adapter, i.e. the remaining still operating physical Host Channel Adapter, interprets this acknowledgement as a ghost acknowledgement and ignores it.
  • the sender sends the packet with the sequence number Sm of the sequence 15 as in the scenario shown in figure 5.
  • Figure 7 shows a scenario where the Host Channel Adapter acts as a receiver.
  • a sequence 18 of PSNs is stored in the system memory and an up-to-date sequence 19 in cache memory. Further there is a sequence 20 of outstanding PSNs to be sent by the sender. This is the situation before failover.
  • sequence 19 is replaced by the sequence 18, i.e. a copy of the sequence 18 is provided from system memory to the cache of the remaining still operating physical Host Channel Adapter part of the logical Host Channel Adapter.
  • sequence 20 remains unchanged.
  • NAK negative acknowledgement
  • HCA: 1
  • ports: 6
  • physical Host Channel Adapter 2: 7
  • InfiniBand fabric: 9
  • logical Host Channel Adapter: 10
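
The write-back failover described above for Figures 5 and 6 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `Receiver`, `replay_after_failover`, and `is_ghost_ack` are hypothetical, PSNs are modelled as plain integers without wraparound, and the NAK branch for a sequence gap is an assumption about receiver behaviour. The surviving adapter resumes from the stale sequence number found in system memory; the receiver acknowledges and discards packets it has already received until the sender catches up with the state before failover.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiver that tracks the next expected PSN (Rn)."""
    expected_psn: int
    delivered: list = field(default_factory=list)

    def receive(self, psn: int) -> str:
        if psn < self.expected_psn:
            # Already received: acknowledge again and discard the packet.
            return "ACK-duplicate"
        if psn == self.expected_psn:
            self.delivered.append(psn)
            self.expected_psn += 1
            return "ACK"
        # Assumption: a gap in the sequence is reported with a NAK.
        return "NAK"


def replay_after_failover(stale_next_psn: int, last_psn: int, receiver: Receiver) -> list:
    """Resend from the stale PSN held in system memory up to last_psn."""
    return [(psn, receiver.receive(psn)) for psn in range(stale_next_psn, last_psn + 1)]


def is_ghost_ack(ack_psn: int, next_psn_to_send: int) -> bool:
    """An ACK for a PSN the surviving adapter has not sent yet is ignored."""
    return ack_psn >= next_psn_to_send


# The failed adapter had already sent PSNs up to 104 (cache state), while
# system memory still recorded 102 as the next PSN to send.
receiver = Receiver(expected_psn=105)
log = replay_after_failover(stale_next_psn=102, last_psn=106, receiver=receiver)
```

In this run the first three packets are acknowledged and discarded as duplicates, after which transmission proceeds normally. An acknowledgement for PSN 104 arriving while the surviving adapter still believes 102 is next would be classified by `is_ghost_ack` as a ghost acknowledgement.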

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

This invention relates to a method for improving the reliability of nodes attached to an InfiniBand switching fabric. The method comprises: providing a first and a second physical channel adapter having a first and a second number of ports; providing a program for registering the first and the second physical channel adapter as a single logical channel adapter having the first and second numbers of ports; providing a first and a second cache memory for storing first and second control information for the first and the second channel adapter; providing a system memory for storing the first and second control information; and providing means for copying the first control information from the system memory into the second cache memory in case of a failure of the first channel adapter, and for initiating an automatic path migration from the first number of ports to the second number of ports.
PCT/EP2003/003530 2002-04-18 2003-04-04 A method for providing redundancy for channel adapter failure WO2003088594A1 (fr)
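
The steps named in the abstract can be sketched for the write-through case as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the classes `PhysicalHCA` and `LogicalHCA`, the dictionary-based control blocks, and the `next_psn` field are all hypothetical, and dropping the failed adapter's ports from the logical adapter merely stands in for automatic path migration.

```python
class PhysicalHCA:
    """Hypothetical model of a physical adapter with a write-through cache."""

    def __init__(self, name, ports, system_memory):
        self.name = name
        self.ports = list(ports)
        self.cache = {}
        self.system_memory = system_memory  # shared dict: QP number -> QPCB
        self.failed = False

    def write_qpcb(self, qp_number, qpcb):
        # Write-through: update the cache and system memory together.
        self.cache[qp_number] = dict(qpcb)
        self.system_memory[qp_number] = dict(qpcb)

    def load_qpcb(self, qp_number):
        # Copy a control block from system memory into the cache on demand.
        if qp_number not in self.cache:
            self.cache[qp_number] = dict(self.system_memory[qp_number])
        return self.cache[qp_number]


class LogicalHCA:
    """Presents two physical adapters as one adapter with all of their ports."""

    def __init__(self, first, second):
        self.adapters = [first, second]

    @property
    def ports(self):
        return [p for a in self.adapters if not a.failed for p in a.ports]

    def fail(self, adapter):
        # The failed adapter's ports drop out of the logical adapter,
        # standing in for path migration to the surviving ports.
        adapter.failed = True

    def qpcb(self, qp_number):
        survivor = next(a for a in self.adapters if not a.failed)
        return survivor.load_qpcb(qp_number)


memory = {}
hca1 = PhysicalHCA("hca1", ports=[6], system_memory=memory)
hca7 = PhysicalHCA("hca7", ports=[8], system_memory=memory)
logical = LogicalHCA(hca1, hca7)
hca1.write_qpcb(2, {"next_psn": 40})
ports_before = logical.ports
logical.fail(hca1)
recovered = logical.qpcb(2)
```

Because the cache is write-through, the copy in system memory is always current, so the surviving adapter can load the control block as needed without any further recovery step.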

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2003585378A JP2005527898A (ja) 2002-04-18 2003-04-04 Method for providing redundancy for channel adapter failure
AU2003226784A AU2003226784A1 (en) 2002-04-18 2003-04-04 A method for providing redundancy for channel adapter failure
KR10-2004-7014653A KR20050002865A (ko) 2002-04-18 2003-04-04 Method and computer system for providing redundancy for InfiniBand channel adapter failure

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP02008692.2 2002-04-18
EP02008692 2002-04-18

Publications (1)

Publication Number Publication Date
WO2003088594A1 (fr) 2003-10-23

Family

ID=29225590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2003/003530 WO2003088594A1 (fr) 2002-04-18 2003-04-04 A method for providing redundancy for channel adapter failure

Country Status (5)

Country Link
JP (1) JP2005527898A (fr)
KR (1) KR20050002865A (fr)
CN (1) CN1647466A (fr)
AU (1) AU2003226784A1 (fr)
WO (1) WO2003088594A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100403248C (zh) * 2005-06-07 2008-07-16 富士通株式会社 Library apparatus
CN102566944A (zh) * 2011-12-31 2012-07-11 曙光信息产业股份有限公司 Storage path redundancy method
CN107451092A (zh) * 2017-08-09 2017-12-08 郑州云海信息技术有限公司 A data transmission system based on an IB network

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7756012B2 (en) * 2007-05-18 2010-07-13 Nvidia Corporation Intelligent failover in a load-balanced network environment
CN101510142B (zh) * 2008-02-15 2011-12-21 环旭电子股份有限公司 Multi-input/output interface system and communication method for storage devices
US10051054B2 (en) 2013-03-15 2018-08-14 Oracle International Corporation System and method for efficient virtualization in lossless interconnection networks
US9990221B2 (en) * 2013-03-15 2018-06-05 Oracle International Corporation System and method for providing an infiniband SR-IOV vSwitch architecture for a high performance cloud computing environment
CN103312564B (zh) * 2013-06-24 2016-07-06 曙光信息产业(北京)有限公司 InfiniBand network detection method
US10397105B2 (en) 2014-03-26 2019-08-27 Oracle International Corporation System and method for scalable multi-homed routing for vSwitch based HCA virtualization
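CN107547260B (zh) * 2017-07-24 2020-12-22 杭州沃趣科技股份有限公司 A method for detection, switchover, and repair of long-distance InfiniBand links
CN107592361B (zh) * 2017-09-20 2020-05-29 郑州云海信息技术有限公司 A data transmission method, apparatus, and device based on dual IB networks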

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835696A (en) * 1995-11-22 1998-11-10 Lucent Technologies Inc. Data router backup feature
US5963540A (en) * 1997-12-19 1999-10-05 Holontech Corporation Router pooling in a network flowswitch
US6195705B1 (en) * 1998-06-30 2001-02-27 Cisco Technology, Inc. Mobile IP mobility agent standby protocol
US6295276B1 (en) * 1999-12-31 2001-09-25 Ragula Systems Combining routers to increase concurrency and redundancy in external network access
EP1158725A2 (fr) * 2000-05-24 2001-11-28 Alcatel Internetworking (PE), Inc. Method and device for multi-protocol redundant router support

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KNIGHT S ET AL: "Virtual Router Redundancy Protocol", IETF - RFC, April 1998 (1998-04-01), XP002135272, Retrieved from the Internet <URL:ftp://ftp.isi.edu/in-notes/rfc2338.txt> [retrieved on 20000410] *

Also Published As

Publication number Publication date
CN1647466A (zh) 2005-07-27
JP2005527898A (ja) 2005-09-15
KR20050002865A (ko) 2005-01-10
AU2003226784A1 (en) 2003-10-27

Similar Documents

Publication Publication Date Title
US6545981B1 (en) System and method for implementing error detection and recovery in a system area network
EP1543422B1 (fr) Remote direct memory access enabled network interface controller switchover and switchback support
US6493343B1 (en) System and method for implementing multi-pathing data transfers in a system area network
US6925578B2 (en) Fault-tolerant switch architecture
US7145837B2 (en) Global recovery for time of day synchronization
US7668923B2 (en) Master-slave adapter
CA2483197C (fr) System, method, and product for managing data transfers in a network
US6970972B2 (en) High-availability disk control device and failure processing method thereof and high-availability disk subsystem
US6760859B1 (en) Fault tolerant local area network connectivity
US7509419B2 (en) Method for providing remote access redirect capability in a channel adapter of a system area network
US7844730B2 (en) Computer system and method of communication between modules within computer system
US20050081080A1 (en) Error recovery for data processing systems transferring message packets through communications adapters
US20050091383A1 (en) Efficient zero copy transfer of messages between nodes in a data processing system
JP2004032224A (ja) Server takeover system and method
WO2000072421A1 (fr) Reliable multicast
US20050080869A1 (en) Transferring message packets from a first node to a plurality of nodes in broadcast fashion via direct memory to memory transfer
WO2003105417A1 (fr) Methods and devices for error correction of data packet streams
US20070230469A1 (en) Transmission apparatus
US20050080920A1 (en) Interpartition control facility for processing commands that effectuate direct memory to memory information transfer
WO2003088594A1 (fr) A method for providing redundancy for channel adapter failure
US20050080945A1 (en) Transferring message packets from data continued in disparate areas of source memory via preloading
US20050078708A1 (en) Formatting packet headers in a communications adapter

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 1020047014653

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2003808578X

Country of ref document: CN

Ref document number: 2003585378

Country of ref document: JP

WWP Wipo information: published in national office

Ref document number: 1020047014653

Country of ref document: KR

122 Ep: pct application non-entry in european phase
WWR Wipo information: refused in national office

Ref document number: 1020047014653

Country of ref document: KR