BE1020450A5

BE1020450A5 - CLUSTER SYSTEM FOR STORAGE OF DATA FILES.

Info

Publication number: BE1020450A5
Application number: BE2012/0148A
Authority: BE
Inventors: Luc Maria Jozef Andries; Piet Marie Alfons Rosa Demeester
Original assignee: Candit Media
Priority date: 2012-03-08
Filing date: 2012-03-08
Publication date: 2013-10-01

Description

CLUSTER SYSTEM FOR STORAGE OF DATA FILES

Field of the invention

[0001] The present invention relates to the general field of storing data files, in particular media files, on a cluster system.

Background of the invention

[0002] The rise and maturation of internet technology over recent decades has completely changed the landscape of the information technology (IT) industry. Owing to the undeniable success and popularity of (mainly Ethernet-based) IP (Internet Protocol) networks, this technology has become the first choice for the architecture of most IT environments. In most cases, central mainframe computers have been replaced by a distributed client/server architecture interconnected by very powerful IP networks.

[0003] This technology has also gained a firm place in the media industry. An IP-based architecture is now fully accepted as the standard solution for file-based media production, and it has radically changed the internal operations and workflows of broadcasters. The use of an ICT-based infrastructure with IP networks as the means of transport offers a number of substantial potential advantages, in particular for video/media production, and enables a fundamental paradigm shift from traditional tape-based video manipulation to file-based production. This technological leap allows video material to be handled, processed, stored and transported as ordinary files, regardless of the video format, instead of as the continuous streams used in classic media technology. As a result, media infrastructure technology has undergone a major shift towards central disk-based storage. Many broadcasters have since adopted the vision of tapeless television production. This concept is further supported by the appearance of camera equipment with storage facilities other than traditional video tape, such as optical disks (Sony) and solid-state memory cards (Panasonic).

[0004] Camera crews nowadays usually arrive at the facility with their video material stored as ordinary files on memory cards rather than on video tapes. The memory cards are connected to ingest stations, for example an ordinary PC, and the files are transferred as quickly as possible, preferably faster than real time, to a central disk-based storage system. Once the material has been stored on the central system, everyone can access it simultaneously.

[0005] Storage is one of the most important media services in a file-based media environment. As with the use of IP networks for media, the requirements for media storage differ largely from those for classic IT storage solutions. An architecture based on generic IT storage components is preferable to very expensive and less reliable proprietary media solutions, mainly because of advantages in scale, reliability, cost and so on, but media place very high demands on the file system. These special requirements stem from extreme characteristics in terms of (parallel) throughput, storage capacity, scalability, redundancy, availability, reliability, etc.

[0006] The General Parallel File System (GPFS) from IBM is one of the most powerful media file systems currently on the market. It is a file management infrastructure that offers high performance and reliability combined with scalable access to critical file data. In addition to file storage, GPFS offers storage management, information lifecycle management tools, centralized administration and shared access to file systems from remote GPFS clusters. GPFS provides scalable, high-performance data access, ranging from a single-node cluster, or a two-node cluster that for example provides a high-availability platform supporting a database application, to a cluster with 2000 or more nodes. From the outset, the design of GPFS has aimed at supporting high-performance parallel applications, and it has since proved very effective in a variety of applications.

[0007] Traditionally, the storage industry builds highly scalable storage clusters (2) (i.e. groups of storage nodes) on a classic SAN (Storage Area Network) architecture over FC (Fibre Channel) (see Fig. 1), in which the FC protocol is used to transport storage traffic. In this architecture, an increase in throughput usually goes hand in hand with an increase in storage capacity. If more throughput is required, more disks are needed, and vice versa. This solution relies on high-end storage systems (10) and is therefore quite expensive. When this classic IT architecture is heavily loaded with media traffic, the Fibre Channel network often ends up being the bottleneck for scalability.

[0008] Many essential media services, however, require a platform with processing power close to the storage device, for example for transcoding and rewrapping services. This leads to the definition of a large number of different storage services, each with its own characteristics. A distinction can be made between, for example, primary capacity storage suitable for HD (high definition), SD (standard definition) and low-resolution video and audio; high-volume, low-cost secondary storage on disk and tape for backup and recovery; ingest storage; central storage for editing; temporary storage; distribution storage; etc. There is therefore a need for a cost-effective, fit-for-purpose storage cluster architecture that can provide independent and fine-grained scalability of processing power, throughput, storage capacity and availability, preferably using low-end components that are cheaply available off the shelf.

[0009] A GPFS cluster based on a NAN ("network attached node") model fully meets these requirements (see the example in Fig. 2). A GPFS cluster based on the NAN node model consists of storage cluster nodes (4) and network attached cluster nodes (6). The storage servers (4) have local storage or are directly connected to an external storage system (10), via a local connection or a SAN architecture. NAN nodes are connected to all storage nodes via a cluster network, but are not directly attached to the underlying storage system (10). Each storage node is the primary server for part of the total storage. The data requests from a NAN node are striped across all storage nodes, which aggregates the available bandwidth of all individual storage nodes and their attached storage subsystems.
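The striping described in [0009] can be illustrated with a minimal sketch. It assumes a simple round-robin placement of fixed-size blocks over the storage nodes; the function names, the block size and the placement rule are illustrative assumptions and are not taken from GPFS itself.

```python
MIB = 1024 * 1024


def stripe_request(offset, length, block_size, num_storage_nodes):
    """Split a byte range into per-node block requests.

    Assumption: block i lives on storage node (i mod num_storage_nodes).
    Returns a list of (node, block_index, offset_in_block, byte_count).
    """
    requests = []
    pos = offset
    end = offset + length
    while pos < end:
        block_index = pos // block_size
        node = block_index % num_storage_nodes
        block_end = (block_index + 1) * block_size
        chunk = min(end, block_end) - pos
        requests.append((node, block_index, pos - block_index * block_size, chunk))
        pos += chunk
    return requests


def aggregate_bandwidth(per_node_bandwidth):
    """Upper bound on striped throughput: the sum of the per-node bandwidths."""
    return sum(per_node_bandwidth)


if __name__ == "__main__":
    # A 16 MiB read with 4 MiB blocks touches each of four storage nodes once.
    for request in stripe_request(0, 16 * MIB, 4 * MIB, 4):
        print(request)
```

Because consecutive blocks land on different nodes, one large sequential request keeps every storage node busy at the same time, which is where the aggregated bandwidth comes from.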

[0010] Initially, TCP/IP was used as the network protocol and architecture for the cluster network. It turned out that the same traffic can also be carried unchanged over InfiniBand (IB) by means of IPoIB. A later version added support for IB verbs, also called native IB. The cluster shown in Fig. 2 uses IB (5) as its cluster network.

[0011] The cluster can be scaled independently for processing power by upgrading the CPUs of the NAN nodes or by increasing the number of NAN nodes in the cluster. If more throughput is required, the cluster network can be scaled to a higher bandwidth, e.g. from SDR (single data rate) InfiniBand to DDR (double data rate) and, in the future, to QDR (quadruple data rate). The throughput towards the clients can be increased by adding NAN nodes. The storage throughput can be optimized by using faster disks or by increasing the number of storage nodes. The raw storage capacity can be increased by using larger hard disks, by attaching more storage to each storage node or by increasing the number of storage nodes. Every component in the cluster can be made redundant to avoid single points of failure. In addition, GPFS offers the concept of failure groups, with which the storage system can be protected even better.

[0012] The data storage traffic over the cluster network (i.e. the network connecting the storage nodes to the NAN nodes) can be regarded as a special case of media traffic. To optimize the hard disks for media use, the segment size of the disks must be set as large as possible. For data protection, the disks are configured as a Redundant Array of Independent Disks (RAID). As a result, the file system transports very large I/O blocks, typically 4 MB, over the cluster network. The traffic is therefore extremely bursty.

[0013] For both read and write operations, the cluster network exhibits a many-to-one traffic pattern. When a NAN node reads from the storage nodes, all storage nodes send large bursts back to that NAN node at the same time. Conversely, when several NAN nodes write data to storage, the bursts of all writing NAN nodes arrive at the storage nodes concerned at the same time. Both situations cause congestion in the cluster network. Since high efficiency is of particular importance in a media storage architecture, packet loss and the resulting TCP retransmissions are highly undesirable and must be prevented at all times. Some media file systems that use an IP network as cluster network try to work around this by using UDP, relying on very large switch buffers to counter packet loss caused by congestion. This is only effective if the number of devices actively participating in the cluster architecture is relatively small and if pre-fetching is not used aggressively. However, it severely limits the maximum throughput, and it does not work when too many traffic requests interleave.
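The many-to-one congestion described in [0013] can be made concrete with a small tick-based model of a single switch output port. This is a simplified sketch under stated assumptions: every sender and the output port run at the same link rate, arrivals are processed before the port drains in each tick, and the function name and parameters are illustrative only.

```python
def simulate_incast(num_senders, burst_bytes, link_bytes_per_tick, buffer_bytes):
    """Model simultaneous bursts converging on one lossy output port.

    Returns (delivered_bytes, dropped_bytes).
    """
    remaining = [burst_bytes] * num_senders
    queue = 0
    delivered = 0
    dropped = 0
    while any(remaining) or queue:
        # Every sender pushes data towards the same output port in this tick.
        for i in range(num_senders):
            sent = min(remaining[i], link_bytes_per_tick)
            remaining[i] -= sent
            accepted = min(sent, buffer_bytes - queue)
            queue += accepted
            dropped += sent - accepted  # buffer overflow means packet loss
        # The output port forwards at most one link's worth of data per tick.
        out = min(queue, link_bytes_per_tick)
        queue -= out
        delivered += out
    return delivered, dropped


if __name__ == "__main__":
    print(simulate_incast(num_senders=1, burst_bytes=4, link_bytes_per_tick=1, buffer_bytes=2))
    print(simulate_incast(num_senders=4, burst_bytes=4, link_bytes_per_tick=1, buffer_bytes=2))
```

With a single sender nothing is lost, while four simultaneous senders overflow the small buffer. Enlarging `buffer_bytes` postpones the loss but does not remove it once the number of senders or the burst size grows, which matches the limitation noted for the UDP approach.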

[0014] GPFS originally used TCP/IP as its protocol stack. This provides flow control, but at the expense of throughput. Because the most widely used network technology for TCP/IP is Ethernet, packet loss in the Ethernet network leads to retransmissions, which reduces throughput efficiency even further. For this specific type of traffic, the network technology used by Fibre Channel or InfiniBand is particularly effective. Its flow control mechanism is based on buffer-to-buffer credits, which prevents packet loss under congestion altogether. Credits for available buffers are exchanged continuously between the ports on the same link. When no buffer credits are available, no packets are sent until the network congestion has been resolved and buffers become free again.
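The buffer-to-buffer credit mechanism described in [0014] can be sketched as follows. This is a toy model of a single link, not the Fibre Channel or InfiniBand wire protocol; the class, its methods and the tick-based driver are hypothetical names introduced for illustration.

```python
from collections import deque


class CreditLink:
    """One link whose sender may transmit only while it holds credits."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers  # one credit per free receive buffer
        self.rx_queue = deque()

    def try_send(self, frame):
        """Send a frame if a credit is available; otherwise hold it back."""
        if self.credits == 0:
            return False  # the sender waits, nothing is dropped
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self, max_frames=1):
        """Receiver consumes frames and returns one credit per freed buffer."""
        freed = 0
        while self.rx_queue and freed < max_frames:
            self.rx_queue.popleft()
            self.credits += 1
            freed += 1
        return freed


def run(total_frames, receiver_buffers, send_per_tick, drain_per_tick):
    """Drive the link; returns (delivered, stalls, max_queue_depth)."""
    link = CreditLink(receiver_buffers)
    pending = total_frames
    delivered = stalls = max_queue = 0
    while pending or link.rx_queue:
        for _ in range(send_per_tick):
            if pending == 0:
                break
            if link.try_send(pending):
                pending -= 1
            else:
                stalls += 1
                break
        max_queue = max(max_queue, len(link.rx_queue))
        delivered += link.receiver_drain(drain_per_tick)
    return delivered, stalls, max_queue


if __name__ == "__main__":
    print(run(total_frames=10, receiver_buffers=2, send_per_tick=4, drain_per_tick=1))
```

Even though the sender offers more frames per tick than the receiver can absorb, every frame is eventually delivered and the receive queue never exceeds the number of advertised buffers. The cost is that the sender stalls, which is the throughput trade-off of lossless operation.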

[0015] For this reason, the cluster shown above uses InfiniBand as its cluster network technology. It is a very cheap, high-bandwidth technology. The net data bandwidth is 8 Gb/s for SDR IB and 16 Gb/s for DDR IB. The capacity of the PCI Express bus then becomes the next bottleneck. Moreover, because the buffer-to-buffer credit flow control mechanism acts on all traffic on the link at once, it limits the linear scalability of this solution.
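The net figures quoted in [0015] are consistent with a 4x InfiniBand link using 8b/10b line encoding. The source states only the net rates, so the lane count, the per-lane signalling rates (2.5 Gb/s for SDR, 5 Gb/s for DDR) and the encoding used in this short sketch are background assumptions rather than statements from the text.

```python
from fractions import Fraction


def net_data_rate_gbps(lane_signalling_gbps, lanes=4, encoding=Fraction(8, 10)):
    """Net data rate of a link: lanes x signalling rate x encoding efficiency."""
    return lane_signalling_gbps * lanes * encoding


SDR_LANE_GBPS = Fraction(5, 2)  # assumed 2.5 Gb/s signalling per lane
DDR_LANE_GBPS = Fraction(5, 1)  # assumed 5 Gb/s signalling per lane

if __name__ == "__main__":
    print("SDR 4x net Gb/s:", net_data_rate_gbps(SDR_LANE_GBPS))
    print("DDR 4x net Gb/s:", net_data_rate_gbps(DDR_LANE_GBPS))
```

Exact fractions are used instead of floats so that the results compare cleanly against the integer figures in the text.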

[0016] The InfiniBand stack is particularly efficient for servers running Linux, where the full physical capacity of the underlying bus technology is exploited. Processing of the protocol stack is offloaded entirely to the Host Channel Adapters (HCAs), the IB network cards. Even RDMA (remote direct memory access) is fully supported and used. This results in a very powerful cluster architecture that is particularly well suited to the file-based media production environment.

[0017] Many media client applications, however, require the Microsoft Windows operating system. This applies both to Windows applications that have to run on the NAN cluster nodes and to applications that need a connection to the central file system via the CIFS (Common Internet File System) protocol. IBM recently added a GPFS client for Windows to the NAN node configuration. This allows a Microsoft Windows Server 2003 or 2008 machine to act as a NAN node in the GPFS cluster. However, the performance of the brand-new InfiniBand stack for Windows machines currently lags well behind that of its Linux counterpart. The cluster protocol stack has to fall back on IPoIB without offloading, because the native IB stack for Windows does not yet support all GPFS commands. This reduces the performance of the cluster network by a factor of five.

[0018] Recently, new developments backed by a number of leading IP networking companies have resulted in the definition and implementation of Data Center Ethernet (DCE). Data Center Ethernet is a term referring to enhancements to Ethernet bridges that enable lossless LAN and SAN connections over an Ethernet network. The term "lossless" means that the Ethernet bridges (i.e. switches) do not lose frames under congestion. DCE, also called CEE (Converged or Convergence Enhanced Ethernet) and DCB (Data Center Bridging), denotes an enhanced Ethernet that makes it possible to consolidate different data center applications (LAN, SAN, high-performance computing) onto a single interconnect technology.

[0019] DCE is known in the art, for example from Cisco patent application US2006/251067, which concerns an Ethernet network design for data centers and associated methods and devices in the context of Fibre Channel over Ethernet. A DCE network simplifies data center connectivity and provides a high-bandwidth, low-latency network for carrying Ethernet, storage and other traffic.

[0020] The Cisco white paper "Data Center Ethernet: Cisco's Innovation for Data Center Networks" provides an overview of DCE. In the past, separate physical network infrastructures were deployed side by side to support different traffic types, for example a Fibre Channel network for storage traffic, a classic lossy Ethernet network for IP data traffic or iSCSI storage traffic, and an IB network for cluster traffic. Each network technology has its own characteristics, matched to its application. DCE supports multi-protocol transport over one and the same Ethernet fabric, so that these different applications and protocols can be consolidated on the same physical network infrastructure. To this end, a PFC (Priority-based Flow Control) mechanism is defined that distinguishes between traffic classes and selectively puts a particular traffic class on hold ("pause") while transmission of the other traffic classes on the same link continues. A "no-drop" service class can be defined for FC traffic over the Ethernet link (FCoE), creating a lossless Ethernet fabric for storage traffic based on the FC protocol, while the remaining Ethernet traffic is transported in the ordinary lossy manner. Coupled to the PFC classification, priority-based bandwidth allocation can be introduced. Utilization of the available physical bandwidth can be optimized by introducing Layer 2 Multipathing, which increases the throughput and scalability of Layer 2 Ethernet network topologies. Layer 2 is the well-known data link layer of the seven-layer OSI model.
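The selective pausing performed by PFC, as described in [0020], can be sketched with a toy output port that keeps one queue per traffic class. The class and method names are hypothetical, the use of eight priorities follows the usual Ethernet priority field and is an assumption here, and strict-priority scheduling is chosen only to keep the example short; the source does not prescribe a scheduler.

```python
from collections import deque


class PfcPort:
    """Output port with one queue per priority and per-priority pause state."""

    def __init__(self, num_priorities=8):
        self.queues = [deque() for _ in range(num_priorities)]
        self.paused = set()

    def enqueue(self, priority, frame):
        self.queues[priority].append(frame)

    def pause(self, priority):
        """Peer signalled congestion for this class only."""
        self.paused.add(priority)

    def resume(self, priority):
        self.paused.discard(priority)

    def transmit(self):
        """Send one frame from the highest unpaused, non-empty class.

        Returns (priority, frame), or None if nothing may be sent.
        """
        for priority in reversed(range(len(self.queues))):
            if priority not in self.paused and self.queues[priority]:
                return priority, self.queues[priority].popleft()
        return None


if __name__ == "__main__":
    port = PfcPort()
    port.enqueue(3, "fcoe-frame")  # assumed no-drop storage class
    port.enqueue(0, "ip-frame")    # ordinary lossy class
    port.pause(3)
    print(port.transmit())  # the IP frame still goes out
    port.resume(3)
    print(port.transmit())  # the held storage frame follows
```

The paused class keeps its frames queued instead of dropping them, while the other classes continue to use the link. This is the difference from a link-level pause, which would hold all traffic at once.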

[0021] Cisco application US2006/087989 describes methods and devices for implementing a low-latency Ethernet solution, also referred to as the Data Center Ethernet solution, which simplifies data center connectivity and provides a high-bandwidth, low-latency network for carrying Ethernet and storage traffic. In some preferred implementations of that patent application, multiple virtual lanes (VLs) are provided within a single physical link of a data center or similar network. There are "drop" VLs, which exhibit Ethernet-like behaviour, and "no-drop" lanes, which exhibit FC-like behaviour. Active buffer management enables both high reliability and low latency while using small frame buffers.

[0022] The Cisco white paper 'Fibre Channel over Ethernet Storage Networking Evolution' describes the successive development stages for the introduction of Fibre Channel over Ethernet (FCoE) to replace traditional FC network environments with a single Ethernet structure. First, FCoE is implemented on the servers, allowing standalone servers to communicate with FC-attached storage systems using the traditional FC protocol, but via a single Ethernet interface and cabling, so that the existing model of SAN operation and management can be retained. Cisco predicts that in a next phase blade servers will use the same principle and that FC switches will support FCoE. In the third phase, storage arrays and tape libraries will support native FCoE interfaces. This will allow LAN traffic and FCoE SAN traffic to converge onto a single network structure in the future.

Objects of the invention

[0023] The object of the present invention is to provide a highly scalable, high-performance cluster system for storing files, in particular media files, based on the widespread Ethernet network technology.

Summary

[0024] The present invention relates to a cluster system for supporting a clustered file system. A cluster system means a collection of interconnected computers that work as one system. A clustered file system is a file system that can be mounted simultaneously by multiple servers. The cluster system consists of at least two nodes and is characterized in that the at least two nodes are configured to exchange data traffic of the clustered file system with each other via an IP protocol over a lossless Ethernet network. This lossless Ethernet network is preferably a network based on Data Center Ethernet.

[0025] In the most preferred embodiment, this lossless Ethernet network is configured to use a Priority Flow Control mechanism that classifies different traffic flows into various lossless IP-over-Ethernet traffic classes, so that said different traffic flows, possibly each of said traffic flows, are given a mutually independent flow control mechanism. Optionally, multiple traffic flows can also be mapped to one lossless IP-over-Ethernet traffic class. In this way, mutual interference between the different traffic flows caused by network congestion is reduced, and preferably eliminated, and the scalability of the cluster system is increased.

[0026] In a preferred embodiment, two or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, are treated with equal priority.

[0027] In a preferred embodiment, the bandwidth of one or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, is regulated selectively and mutually independently.

[0028] In a preferred embodiment, the bandwidth of one or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, is limited selectively and mutually independently to a maximum bandwidth value.

[0029] In a preferred embodiment, the bandwidth of one or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, is guaranteed selectively and mutually independently at a minimum bandwidth value.

[0030] In a preferred embodiment, the bandwidth of one or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, is selectively and mutually independently guaranteed at a minimum bandwidth value and limited to a maximum bandwidth value.

[0031] In a preferred embodiment, one or more traffic flows, each mapped to its own mutually independent Ethernet traffic class, each of these Ethernet traffic classes having its own mutually independent flow control mechanism, are shaped.

[0032] In a preferred embodiment, at least one node of the cluster system is a storage node configured to store at least a portion of the clustered file system. In one embodiment, the storage node comprises a means for local storage. In another embodiment, the storage node is configured for connection to an external storage means. These two options can be combined.

[0033] In a preferred embodiment, the cluster system comprises at least one storage node and further at least one node configured to exchange data traffic with an external device, the latter node being further configured to run a different operating system from that of the storage node.

[0034] In a preferred embodiment, at least one of the at least two nodes is configured to exchange data traffic with an external device, preferably via a lossless Ethernet connection. In one embodiment, this lossless Ethernet connection is part of the lossless Ethernet network set up between the at least two nodes. Alternatively, the lossless Ethernet connection can be part of another lossless Ethernet network.

[0035] In a useful embodiment, the external device is a high-resolution editing client.

[0036] Preferably, at least one of the nodes is provided with processing means suitable for media applications. Some examples are transcoding, rewrapping and conversion of the media file format, quality control, etc.

[0037] The invention also relates to an assembly of at least two cluster systems as described above, wherein the at least two cluster systems are configured to exchange the data traffic of at least one of the clustered file systems via an IP protocol over a dedicated lossless Ethernet network. This dedicated lossless Ethernet network is preferably a network based on Data Center Ethernet.

[0038] In one embodiment, the dedicated network is at least part of the lossless Ethernet network of at least one of the at least two cluster systems, and possibly of both systems.

Brief description of the drawings

[0039] Fig. 1 is an illustration of a classic GPFS cluster architecture based on Fibre Channel.

[0040] Fig. 2 is an illustration of a GPFS NAN node cluster based on Infiniband (IB) as known in the prior art.

[0041] Fig. 3 is an illustration of the PAUSE mechanism of 802.3x.

[0042] Fig. 4 is an illustration of the PAUSE mechanism of PFC applied to an FCoE traffic class.

[0043] Fig. 5 compares a GPFS NAN node cluster based on IB (prior art) with such a cluster based on DCE (invention).

[0044] Fig. 6 is an illustration of a GPFS NAN node cluster according to a particular embodiment of the invention.

[0045] Fig. 7 is an illustration of a GPFS NAN node cluster according to an embodiment of the invention in which PFC is used.

[0046] Fig. 8 is an illustration of the PAUSE mechanism of PFC applied to several lossless IP protocol traffic classes.

[0047] Fig. 9 is an illustration of a client network for central editing.

[0048] Fig. 10 is an illustration of a converged data center Ethernet network.

Detailed description of the invention

[0049] The present invention makes use of the fact that Data Center Ethernet (DCE) creates a lossless Ethernet structure through a slight adaptation, or extension, of some known Ethernet standards. To this end, DCE uses a mechanism equivalent to the buffer-to-buffer credits of Fibre Channel and Infiniband to perform flow control without packet loss. This equivalent mechanism in Ethernet is based on the PAUSE frame defined in IEEE 802.3 (see Fig. 3). The PAUSE mechanism is used to suppress the transmission of packets for a certain time when a receiving switch buffer is full. DCE is implemented on a new type of 10 Gb Ethernet network adapter, called the Converged Network Adapter (CNA). These adapters offer full offload, also for the Microsoft Windows operating system. Compared with traditional 10 Gb/s Ethernet, DCE switches and ports are much cheaper because they use SFP+ with a copper twinax cable.
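As a rough illustration of the PAUSE mechanism described above, the following Python sketch builds an IEEE 802.3x PAUSE frame and computes how long a given pause value holds off the sender. The function names are hypothetical and chosen for this example only; the frame layout (reserved multicast destination address, MAC Control EtherType 0x8808, opcode 0x0001, pause time expressed in quanta of 512 bit times) follows the IEEE 802.3 MAC Control definition. The frame check sequence is omitted.

```python
import struct

# Constants from the IEEE 802.3 MAC Control (PAUSE) definition.
PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
BITS_PER_QUANTUM = 512


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a PAUSE frame (without FCS), padded to 60 bytes.

    Hypothetical helper for illustration; quanta=0 asks the sender to resume.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    body = struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta)
    return (PAUSE_DST + src_mac + body).ljust(60, b"\x00")


def pause_duration_seconds(quanta: int, link_bps: float) -> float:
    """Time the sender stays silent for a given pause value and link speed."""
    return quanta * BITS_PER_QUANTUM / link_bps


if __name__ == "__main__":
    frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
    print(frame.hex())
    # Longest single pause on a 10 Gb/s link, in milliseconds.
    print(pause_duration_seconds(0xFFFF, 10e9) * 1e3)
```

On a 10 Gb/s link one quantum lasts 51.2 ns, so even the maximum pause value covers only a few milliseconds; a receiver whose buffer stays full has to keep refreshing the pause.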

[0050] In a preferred embodiment, the present invention uses a priority mapping as known from priority flow control (PFC), the DCE enhancement already mentioned, to prevent mutual interference between the different traffic flows. A PFC mechanism makes it possible to map different traffic classes or flows to a specific priority value. In PFC, these priority values are defined in the three-bit priority field of the IEEE 802.1Q tag. Eight different priorities can thus be distinguished. PFC makes it possible to apply the same PAUSE mechanism to each priority value or class, using a pause frame that carries information for one, some, or all priorities. This creates the possibility of a selective flow control mechanism for the different traffic classes or flows, associated with different priority values.
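The priority field and the per-priority pause frame can be sketched as follows. This is a minimal illustration, not an implementation taken from the patent: the helper names are hypothetical, while the bit layout of the 802.1Q tag control information (3-bit priority, 1-bit DEI, 12-bit VLAN ID) and the PFC frame layout (MAC Control opcode 0x0101, a priority-enable vector, then eight 16-bit pause timers) follow the IEEE 802.1Q and 802.1Qbb definitions.

```python
import struct
from typing import Dict

PFC_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def vlan_tci(priority: int, vlan_id: int, dei: int = 0) -> int:
    """Pack the 16-bit 802.1Q tag control information field."""
    if not 0 <= priority <= 7:
        raise ValueError("priority is a 3-bit value (0..7)")
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("vlan_id is a 12-bit value")
    return (priority << 13) | ((dei & 1) << 12) | vlan_id


def build_pfc_frame(src_mac: bytes, pause_quanta: Dict[int, int]) -> bytes:
    """Return a PFC frame (without FCS) pausing only the listed priorities.

    pause_quanta maps a priority (0..7) to its pause time in quanta.
    """
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7 or not 0 <= quanta <= 0xFFFF:
            raise ValueError("invalid priority or pause time")
        enable_vector |= 1 << priority
        timers[priority] = quanta
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, PFC_OPCODE,
                       enable_vector, *timers)
    return (PFC_DST + src_mac + body).ljust(60, b"\x00")


if __name__ == "__main__":
    print(hex(vlan_tci(priority=5, vlan_id=100)))
    # Pause priority 4 only; traffic tagged with other priorities keeps flowing.
    print(build_pfc_frame(bytes.fromhex("020000000001"), {4: 0xFFFF}).hex())
```

A single frame can carry pause times for one, several, or all eight priorities, which is what allows the selective flow control described above.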

[0051] While 802.3x makes it possible to create a lossless Ethernet structure for all traffic simultaneously, without any form of classification, implementations of PFC allow this lossless attribute to be defined exclusively for a certain traffic class. This function has therefore been used in the past to map storage traffic that uses the FC protocol to a certain priority value and to define the lossless attribute for that particular class (see Fig. 4). This allows lossless transport of FC protocol-based storage traffic over the same Ethernet structure that also carries conventional IP protocol traffic, to which a priority value linked to a lossy attribute has been assigned. Unlike the application of the lossless Ethernet structure to FC storage traffic in the prior art, the present invention uses the functions of DCE, both 802.3x and PFC, to create a lossless structure for one or more IP protocol traffic classes.

[0052] The present invention makes use of the fact that, although mutually different priority values can be assigned to different IP protocol traffic classes, each of these traffic classes having its own mutually independent flow control mechanism, two or more traffic flows belonging to different IP protocol traffic classes can nevertheless be treated with equal priority when these flows are sent over the same link or network port. This allows the available bandwidth of a network link to be shared between different traffic flows with equal priority, while at the same time each traffic flow has its own mutually independent flow control mechanism, so that mutual interference between the traffic flows and possible packet loss are avoided.

[0053] The present invention also makes use of the fact that the available, reserved and/or guaranteed bandwidth of two or more traffic flows belonging to mutually different IP protocol traffic classes can be selectively regulated.
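One way to picture bandwidth regulation per traffic class is a simple allocation that first honors each class's guaranteed minimum and then shares the remaining link capacity equally, never exceeding a class's maximum. The Python sketch below is an assumption made for illustration: the function name, the equal-share policy and the example figures are hypothetical and do not describe the scheduler of any particular DCE switch.

```python
from typing import Dict, Tuple

EPS = 1e-9


def allocate_bandwidth(capacity: float,
                       classes: Dict[str, Tuple[float, float, float]]
                       ) -> Dict[str, float]:
    """Share link capacity among traffic classes.

    classes maps a class name to (demand, guaranteed_min, max_limit),
    all in the same unit as capacity (for example Gb/s).
    """
    # A class never receives more than it asks for or than its limit allows.
    target = {n: min(d, mx) for n, (d, _mn, mx) in classes.items()}
    # Step 1: serve the guaranteed minimum (or less, if demand is lower).
    alloc = {n: min(target[n], mn) for n, (_d, mn, _mx) in classes.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < -EPS:
        raise ValueError("guaranteed minimums exceed link capacity")
    # Step 2: hand out what is left in equal shares until demand or
    # capacity is exhausted.
    active = [n for n in alloc if target[n] - alloc[n] > EPS]
    while active and remaining > EPS:
        share = remaining / len(active)
        for n in active:
            give = min(share, target[n] - alloc[n])
            alloc[n] += give
            remaining -= give
        active = [n for n in active if target[n] - alloc[n] > EPS]
    return alloc


if __name__ == "__main__":
    # Hypothetical classes on a 10 Gb/s link: (demand, min, max).
    print(allocate_bandwidth(10.0, {
        "storage": (8.0, 2.0, 10.0),
        "editing": (8.0, 2.0, 3.0),
        "control": (1.0, 2.0, 10.0),
    }))
```

In this example the "editing" class is capped at its 3 Gb/s limit, "control" receives only the 1 Gb/s it asks for, and "storage" picks up the rest.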

[0054] The present invention also makes use of the fact that the traffic of two or more traffic flows belonging to mutually different IP protocol traffic classes can be shaped mutually independently, in order to minimize the bursts of the traffic flows or to limit the influence of bursts. This also allows the latency of the traffic flows to be influenced and adjusted, as well as the latency introduced by the protocol used on top of the IP protocol.
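Traffic shaping of this kind is commonly illustrated with a token bucket, which bounds both the sustained rate and the largest burst a flow may send. The sketch below is an assumption for illustration only: the text does not name a shaping algorithm, and the class name and parameters are hypothetical. One shaper instance per traffic class keeps the classes independent of each other.

```python
class TokenBucket:
    """Token-bucket shaper: sustained rate in bytes/s, burst size in bytes."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # time of the previous update, in seconds

    def try_send(self, size_bytes: float, now: float) -> bool:
        """Return True and consume tokens if the packet may be sent at `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


if __name__ == "__main__":
    # Hypothetical shaper: 1000 bytes/s sustained, bursts of at most 1500 bytes.
    shaper = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
    for t in (0.0, 0.5, 1.5):
        print(t, shaper.try_send(1500, now=t))
```

A smaller burst size smooths the flow more aggressively at the cost of added queuing delay, which is the latency trade-off the paragraph above refers to.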

[0055] The cluster system according to the present invention can support a clustered file system and comprises at least two nodes. The nodes may be storage nodes, in which a portion of the clustered file system can be stored, or nodes for exchanging data traffic with an external device. The clustered file system is in principle a GPFS file system that comprises several storage cluster nodes and cluster nodes for data exchange with an attached network, as mentioned earlier. However, a cluster system comprising only storage nodes or only data exchange nodes is also conceivable. The at least two nodes of the cluster system are connected to each other via a lossless Ethernet network, so that the nodes can exchange data traffic exclusively using an IP protocol. The lossless Ethernet network is preferably a network based on Data Center Ethernet.

[0056] Since GPFS uses TCP/IP as its network protocol and is agnostic to the underlying Layer 2 network technology (OSI model), DCE technology (3) can be used instead of Infiniband as the cluster network in the design of the GPFS NAN cluster for the exchange of data traffic (see the example in Fig. 5). This is particularly beneficial in the Microsoft Windows NAN node environment. A DCE-based GPFS NAN cluster can therefore utilize the full bandwidth provided by 10 Gb/s DCE.

This is only possible if the capabilities of the DCE technology match the functions and performance of Infiniband at several levels:

- The offload performance of the DCE network adapters (CNA) must be comparable to that of the Infiniband network adapters (Host Channel Adapter (HCA)).

- Although the flow control mechanism of DCE is equivalent to the buffer-to-buffer credit mechanism of Infiniband, the implementation of the PAUSE mechanism must be sufficiently responsive to achieve lossless behavior under the demanding storage traffic of GPFS.

- The buffer capacity of the DCE switches must be sufficient to support this mechanism successfully.

- The DCE switches must support sufficient virtual lanes or equivalent P values to ensure adequate scalability of the cluster.

- The implementation of DCE technology in the GPFS NAN cluster architecture must match the performance, scalability and high availability offered by the Infiniband implementation.

[0057] Fig. 6 is an illustration of an embodiment in which a number of network attached nodes (6) (NAN) are connected to a number of storage nodes (4). The storage nodes are in turn connected to a storage system (10) comprising several storage means. In the example shown, the nodes exchange storage traffic with each other exclusively over a Data Center Ethernet network using an IP protocol. In this embodiment, this Data Center Ethernet network uses 802.3x link flow control (12) to create a lossless Ethernet structure.

[0058] Fig. 7 is an illustration of a preferred embodiment in which the Data Center network uses a PFC mechanism for flow control (13). In this embodiment, the IP traffic flows between different NAN nodes (6) and storage nodes (4) are mapped to different PFC values. Fig. 8 illustrates an example of such a mapping, in which 4 different IP traffic flows are mapped to 4 different PFC values, e.g. the values 4 to 7. Each traffic class is assigned the QoS attribute of a lossless Ethernet class. This makes it possible to use a separate flow control mechanism for each IP traffic flow between different NAN nodes (6) and storage nodes (4). Traffic disruptions caused by congestion of a link in the cluster network are thereby prevented.
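The mapping of IP traffic flows to PFC values can be sketched in a few lines of Python. The flow names, the round-robin assignment used when there are more flows than classes, and the `Link` class are hypothetical and serve only to show the effect: pausing one priority holds back the flows mapped to it and leaves the others untouched.

```python
from itertools import cycle
from typing import Dict, Iterable, Set


def assign_priorities(flows: Iterable[str],
                      priorities: Iterable[int] = (4, 5, 6, 7)) -> Dict[str, int]:
    """Map each flow to a PFC priority, reusing priorities round-robin
    when there are more flows than classes (an illustrative policy)."""
    return {flow: prio for flow, prio in zip(flows, cycle(priorities))}


class Link:
    """Tracks which priorities are currently paused on one link."""

    def __init__(self, flow_priority: Dict[str, int]) -> None:
        self.flow_priority = flow_priority
        self.paused: Set[int] = set()

    def pause(self, priority: int) -> None:
        self.paused.add(priority)

    def resume(self, priority: int) -> None:
        self.paused.discard(priority)

    def can_send(self, flow: str) -> bool:
        return self.flow_priority[flow] not in self.paused


if __name__ == "__main__":
    # Hypothetical flows between NAN nodes and storage nodes.
    flows = ["nan1->stor1", "nan1->stor2", "nan2->stor1", "nan2->stor2"]
    link = Link(assign_priorities(flows))
    link.pause(5)  # congestion reported for priority 5 only
    for flow in flows:
        print(flow, link.flow_priority[flow], link.can_send(flow))
```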

[0059] This architecture can be used for any media service that relies on a fit-for-purpose storage cluster and requires either a server running Microsoft Windows or a mount of the central file system via the CIFS protocol (see Figs. 6 and 7).

[0060] In a preferred embodiment of the invention, at least one of the nodes serves to exchange data traffic with an external device via a lossless Ethernet network, for example a network based on Data Center Ethernet. DCE can, for example, be used to optimally accommodate the media traffic between a central GPFS cluster storage and a high-resolution editing client, as illustrated in Fig. 9.

[0061] Some media client applications prefer to access the central storage via the NFS (Network File System) protocol or the SMB (Server Message Block) protocol, both of which require Linux as the operating system on the NAN node. Others prefer the CIFS (Common Internet File System) protocol, which requires Windows on the NAN node. The media traffic between the editing clients and the central storage forms a traffic class of small-block file system traffic. This type of traffic is less bursty than GPFS storage traffic; its bursts are typically 64 kB.

[0062] However, there is a potential new source of congestion. If the NAN nodes of the cluster on the client side are equipped with 10 Gb/s Ethernet, which is always the case when DCE is used, while the media clients only have a 1 Gb/s Ethernet interface, destination congestion arises from the speed mismatch between source and destination. The PAUSE frame flow control mechanism of DCE could be a natural solution to this problem. Since no 1 Gb/s DCE CNAs currently exist, the switch on the client side, the DCE extender (11), must send the PAUSE frames towards the 10 Gb/s server side. (The DCE extender makes it possible to fan out the 10 Gb/s DCE ports of the DCE switch to 1 Gb/s Ethernet ports.)
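The size of this mismatch can be made concrete with a short calculation. The sketch below is a simplification and not part of the application: it assumes a single burst arriving at line rate, ignores frame overhead, and takes the typical 64 kB burst as 64 × 1024 bytes. The function names are hypothetical.

```python
def burst_arrival_time_s(burst_bytes: float, in_rate_bps: float) -> float:
    """Time needed for a burst to arrive at the ingress line rate."""
    return burst_bytes * 8 / in_rate_bps


def backlog_bytes(burst_bytes: float, in_rate_bps: float, out_rate_bps: float) -> float:
    """Bytes still queued when the burst has fully arrived.

    While the burst arrives at in_rate_bps, the slower egress port drains at
    out_rate_bps, so the fraction (1 - out/in) of the burst has to be buffered
    or held back by flow control.
    """
    if out_rate_bps >= in_rate_bps:
        return 0.0
    return burst_bytes * (1 - out_rate_bps / in_rate_bps)


BURST = 64 * 1024   # assumption: 64 kB interpreted as 65536 bytes
SERVER_SIDE = 10e9  # 10 Gb/s NAN node interface
CLIENT_SIDE = 1e9   # 1 Gb/s media client interface

arrival_us = burst_arrival_time_s(BURST, SERVER_SIDE) * 1e6
queued = backlog_bytes(BURST, SERVER_SIDE, CLIENT_SIDE)
print(f"burst arrives in {arrival_us:.1f} us, {queued:.0f} bytes still queued")
```

Under these assumptions, about nine tenths of every burst is still queued at the 1 Gb/s port when the burst has arrived. Without PAUSE frames sent back towards the 10 Gb/s side, that backlog would have to be absorbed by the buffers of the DCE extender or be dropped.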

[0063] The two network architectures, the cluster network and the editing client network, can be merged into a single network based on Data Center Ethernet that provides both functions (see Fig. 10). This is likely to be particularly practical and cost-effective in small-scale environments.

[0064] It should be noted that this invention is not limited to clusters based on the GPFS file system. On the contrary, the proposed solution can be applied to any cluster based on another file system that uses IP protocols as its network protocol.

[0065] Although the present invention has been illustrated with reference to specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be carried out with various changes and modifications without departing from the scope of the invention. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive; the scope of the invention is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the appended claims are therefore intended to be embraced therein. In other words, the patent is intended to cover all modifications, variations and equivalents that fall within the scope of the underlying basic principles and whose essential features fall within the claims of this patent application. Furthermore, the reader of this patent application will understand that the words 'consisting of' and 'comprise' do not exclude other elements or steps, that the word 'a' does not exclude a plurality, and that a single element, such as a computer system, a processor or another integrated unit, may fulfil the functions of several means recited in the claims. Any reference signs in the claims shall not be construed as limiting the claims concerned. The terms 'first', 'second', 'third', 'a', 'b', 'c' and the like, when used in the description or in the claims, serve to distinguish between similar elements or steps and do not necessarily describe a sequential or chronological order. Similarly, the terms 'top', 'bottom', 'over', 'under' and the like are used for descriptive purposes and do not necessarily denote relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that embodiments of the invention are capable of operating in other sequences or orientations than those described or illustrated above.

Claims

1. A cluster system (1) configured to support a clustered file system, said cluster system consisting of at least two nodes (4, 6) configured to exchange data traffic of said clustered file system via an IP protocol over a lossless Ethernet network (3), said lossless Ethernet network being configured to use a Priority Flow Control mechanism (13) to map different data traffic flows to several lossless IP over Ethernet traffic classes, so that said different data traffic flows are each provided with an independent flow control mechanism, characterized in that two or more of said data traffic flows belonging to different ones of said traffic classes are treated with equal priority.

2. A cluster system as in claim 1, wherein said lossless Ethernet network is a network based on Data Center Ethernet.

3. A cluster system as in claim 1 or 2, wherein at least one of said nodes is a storage node configured to store at least a portion of said clustered file system.

4. A cluster system as in claim 3, wherein said storage node comprises local storage means.

5. A cluster system as in claim 3 or 4, wherein said storage node is configured for connection to external storage means (10).

6. A cluster system as in any of claims 1 to 5, wherein at least one of said nodes is configured to exchange said data traffic with an external device (9).

7. A cluster system as in any of claims 3 to 5, further comprising a node configured to exchange said data traffic with an external device (9), said node being further configured to run an operating system different from that of said storage node.

8. A cluster system as in claim 6 or 7, wherein said at least one node is configured to exchange said data traffic with an external device (9) via a lossless Ethernet connection.

9. A cluster system as in claim 8, wherein said lossless Ethernet connection is part of said lossless Ethernet network.

10. A cluster system as in any of claims 6 to 9, wherein said external device is a high-resolution editing client.

11. A cluster system as in any preceding claim, wherein at least one of said nodes comprises media application processing means.

12. A cluster system as in any preceding claim, arranged to selectively control the bandwidth of two or more of said data traffic flows belonging to different ones of said traffic classes.

13. A cluster system as in any preceding claim, adapted to shape the traffic of two or more of said data traffic flows belonging to different ones of said traffic classes independently of each other.

14. An assembly of at least two cluster systems as in any of the preceding claims, wherein said at least two cluster systems are configured to exchange data traffic of at least one of the clustered file systems via an IP protocol over a dedicated lossless Ethernet network.

15. An assembly as in claim 14, wherein said dedicated lossless Ethernet network is a network based on Data Center Ethernet.

16. An assembly as in claim 14 or 15, wherein said dedicated lossless Ethernet network is at least part of the lossless Ethernet network of at least one of said at least two cluster systems.