WO2007088081A1 - Efficient data management in a cluster file system - Google Patents


Info

Publication number
WO2007088081A1
WO2007088081A1 (PCT/EP2007/050047; EP2007050047W)
Authority
WO
WIPO (PCT)
Prior art keywords
node
stored
dataset
specified dataset
specified
Application number
PCT/EP2007/050047
Other languages
French (fr)
Inventor
Pradeep Vincent
Original Assignee
International Business Machines Corporation
Ibm United Kingdom Limited
Application filed by International Business Machines Corporation and IBM United Kingdom Limited
Priority application: EP07700245A (published as EP1979806A1)
Publication: WO2007088081A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G06F3/0605 Improving or facilitating administration, e.g. storage management by facilitating the interaction with a user or administrator
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]



Abstract

Methods and systems manage datasets in a cluster file system. A request is received from a client to perform a file system operation on a specified dataset stored in one of a plurality of nodes in a cluster. The specified dataset is retrieved from a first node through a backbone switch and stored in a cache in a second node. The requested file system operation is performed on the specified dataset and, upon completion of the requested operation, metadata is modified to indicate that the specified dataset is stored in the second node. The specified dataset is not returned through the backbone switch to the first node.

Description

EFFICIENT DATA MANAGEMENT IN A CLUSTER FILE SYSTEM
TECHNICAL FIELD OF THE INVENTION
The present invention is directed generally to the storage of digital information in a cluster file system and, in particular, to the efficient use of inter-node bandwidth.
BACKGROUND OF THE INVENTION
A cluster file system allows multiple servers to access the same files using independent paths to data storage. A group of independent nodes is interconnected through a backbone switch and works together as a single system. Users (clients) are provided with access to all files located on the storage devices in the system using common file system paths. In one cluster file system, each node is configured into two virtual servers, a front-end server and a back-end server. The location of datasets on the various servers is maintained in metadata. A request by a client for an operation on a specified dataset may be received by any node in the cluster. By accessing the metadata, the specified dataset may be located on one of the virtual servers (or on one of the nodes if the nodes are not configured with virtual servers). The write data is then typically stored by the receiving node in a cache in that node. Upon completion of the operation, the modified dataset is flushed out of the cache and sent to its original location. If the original location is on a virtual server in a node other than the receiving node, the dataset must be transferred across the backbone switch, consuming backbone resources and bandwidth.
DISCLOSURE OF THE INVENTION
The present invention provides a cluster file system accessible to clients through a network. The file system comprises a plurality of file system nodes in a cluster, including a first node and a second node, a backbone switch interconnecting the first node and the second node and a metadata structure identifying the node on which datasets are stored. The first node comprises a first cache and a dataset controller. The dataset controller is configured to, if a specified dataset is stored on the second node, receive a request from a client to perform a file system operation on the specified dataset, access the metadata structure to determine the node on which the specified dataset is stored, retrieve through the backbone switch from the second node a first portion of the specified dataset to which the file system operation is directed and leave a remainder portion of the specified dataset stored in the second node, store the retrieved first portion in the first cache and, upon completion of the file system operation, modify the metadata structure to indicate that at least the first portion of the specified dataset is stored in the first node, whereby the first portion is not returned through the backbone switch to the second node.
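The controller behaviour described above can be sketched in a few lines of Python; every name (the metadata dictionary, `fetch_over_backbone`, the node labels) is a hypothetical illustration, not a term from the specification:

```python
# Hypothetical sketch of the dataset controller's decision path:
# consult the metadata, pull only the addressed portion over the
# backbone, operate in the local cache, then record the new location
# instead of writing the data back.

metadata = {"dataset1": "node2"}   # dataset -> node currently holding it
local_cache = {}                   # the first cache, on the receiving node

def fetch_over_backbone(node, dataset):
    """Placeholder for retrieving the addressed portion through the switch."""
    return b"original portion"

def perform_operation(dataset, new_data, local_node="node1"):
    owner = metadata[dataset]                     # access the metadata structure
    if owner != local_node:
        # retrieve only the portion the operation is directed to;
        # the remainder portion stays on its current node
        local_cache[dataset] = fetch_over_backbone(owner, dataset)
    local_cache[dataset] = new_data               # the file system operation
    metadata[dataset] = local_node                # new location recorded; nothing
                                                  # returns through the switch

perform_operation("dataset1", b"modified")
```

After the call, the metadata names the receiving node as the dataset's location, so no flush back across the backbone is needed.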
The present invention further provides a method for managing datasets in a cluster file system. The method comprises receiving a request from a client to perform a file system operation on a specified dataset stored in one of a plurality of nodes in a cluster, retrieving the specified dataset from a first node through a backbone switch, storing the retrieved specified dataset in a cache in a second node, performing the requested file system operation on the specified dataset and, upon completion of the requested operation, modifying metadata to indicate that the specified dataset is stored in the second node, whereby the specified dataset is not returned through the backbone switch to the first node.
The present invention further provides a computer program product comprising a computer-readable medium usable with a programmable computer and having computer-readable code embodied therein for managing datasets in a cluster file system. The computer-readable code comprises instructions for receiving a request from a client to perform a file system operation on a specified dataset stored in one of a plurality of nodes in a cluster, retrieving the specified dataset from a first node through a backbone switch, storing the retrieved specified dataset in a cache in a second node, performing the requested file system operation on the specified dataset and, upon completion of the requested operation, modifying metadata to indicate that the specified dataset is stored in the second node, whereby the specified dataset is not returned through the backbone switch to the first node.
The present invention further provides a file system node in a multi-node cluster file system. The node comprises means for interconnecting the node to at least a second node through a backbone switch, a cache, a metadata structure identifying the node on which datasets are stored, means for receiving a request from a client to perform a file system operation on a specified dataset, means for accessing the metadata structure to determine the node on which the specified dataset is stored, means for retrieving through the backbone switch that first portion of the specified dataset to which the file system operation is directed and leaving a remainder portion of the specified dataset stored in the second node if the specified dataset is stored on the second node, means for storing the retrieved first portion in the first cache and means for modifying the metadata structure upon completion of the file system operation to indicate that at least the first portion of the specified dataset is stored in the first node, whereby the first portion is not returned through the backbone switch to the second node.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a block diagram of a cluster file system in which the present invention may be implemented;
Fig. 2 is a block diagram of one configuration of a node of the cluster file system of Fig. 1;
Figs. 3A-3C are sequential functional block diagrams of one embodiment of a cluster file system of the present invention in which the location of an entire dataset is transferred from one node to another;
Fig. 4 is a flowchart of a method of the embodiment of the present invention illustrated in Figs. 3A-3C;
Figs. 5A-5C are sequential functional block diagrams of initial dataset processing in which a dataset is dividable into subsets;
Fig. 6 is a flowchart of a method of the embodiment of the present invention illustrated in Figs. 5A-5C;
Figs. 7A and 7B continue from the sequential functional block diagrams of Figs. 5A-5C and illustrate an embodiment of a cluster file system of the present invention in which the subsets are reassembled in one node;
Fig. 8 is a flowchart of a method of the embodiment of the present invention illustrated in Figs. 7A and 7B;
Fig. 9 continues from the sequential functional block diagrams of Figs. 5A and 5B and illustrates another embodiment of a cluster file system of the present invention in which the ultimate locations of the subsets are split between two nodes;
Fig. 10 is a flowchart of a method of the embodiment of the present invention illustrated in Fig. 9;
Figs. 11A-11C continue from the sequential functional block diagrams of Figs. 5A and 5B and illustrate an embodiment of the present invention in which the subsets are rejoined in their original node location during a period of reduced activity of the backbone switch; and
Fig. 12 is a flowchart of a method of the embodiment of the present invention illustrated in Figs. 11A-11C.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Fig. 1 is a block diagram of a cluster file system 100 in which the present invention may be implemented. The system 100 includes clients 110 and a plurality of nodes. For clarity, two nodes 120 and 200 are illustrated and included in the description; however, the system 100 may include additional nodes and the scope and operation of the present invention do not depend upon the number of nodes. A backbone switch 130 couples the nodes 200 and 120, herein referred to as Node 1 and Node 2, respectively, enabling datasets to be transferred between the nodes 200 and 120.
Fig. 2 is a block diagram of one configuration of Node 1 200; it will be appreciated that the other node(s) may have the same or similar configuration. Node 1 200 has been configured to include two virtual servers, a front-end load balancing server 202 and a back-end dataset storage server 204. The front-end server 202 receives file system requests from clients, determines the appropriate node to which the request is to be routed and decides when and how to flush the cache. The back-end server 204 manages the datasets and provides a locking/leasing mechanism for the front-end server to use. In addition, Node 1 200 includes a memory cache 210, a dataset controller 220 and storage for dataset metadata 230. For each dataset stored in the cluster file system 100, the metadata 230 identifies its location in a virtual server (if the nodes are so configured) or in a node (if virtual servers are not used).
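The node layout of Fig. 2 might be modelled minimally as follows; the class and field names are hypothetical stand-ins for the cache 210, the back-end dataset storage and the metadata 230, not structures defined by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One cluster node: a memory cache and back-end dataset storage."""
    name: str
    cache: dict = field(default_factory=dict)     # stands in for cache 210
    storage: dict = field(default_factory=dict)   # back-end dataset storage

# Stands in for metadata 230: maps each dataset to the node
# (or virtual server) on which it is currently stored.
metadata = {"dataset1": "node2", "dataset2": "node2"}

node1 = Node("node1")
node2 = Node("node2", storage={"dataset1": b"data1", "dataset2": b"data2"})

# The front-end server consults the metadata to route a request.
owner = metadata["dataset1"]
```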
Turning now to the block diagrams of Figs. 3A-3C and the flow chart of Fig. 4, the operation of one embodiment of the present invention will be described. When a file system request is sent by a client 110 (step 400), such as a write operation on a specified dataset, the request is received by one of the nodes 200, 120. For purposes of this description, it will be assumed that the request is received by Node 1 200 (step 402). The write data or modified data is stored in the cache 210 (Fig. 3A; step 404). The dataset controller 220 determines from the metadata 230 the location of the specified dataset on which the operation is to be performed (step 406). For example, the metadata 230 may indicate that the specified dataset is dataset 1 122 and is located in Node 2 120 (Fig. 3B).
In a conventional cluster file system, upon completion of the requested operation, cache 210 would be flushed and the modified dataset 122 would be transferred through the backbone switch 130 to Node 2 120 to be stored. However, in order to reduce bandwidth usage through the backbone switch 130, in the embodiment of the present invention illustrated in Figs. 3A-3C, the cache 210 is instead flushed (step 408) and the modified dataset 122 stored in Node 1 200 (step 410). The metadata 230 is updated (step 412) to reflect the new location (Fig. 3C).
Figs. 5A-5C and the accompanying flowchart of Fig. 6 illustrate the initial dataset processing during another embodiment of the present invention. As in the previous embodiment, when a file system request is sent by a client 110 (step 600), the request is received by one of the nodes 200, 120. For purposes of this description, it will again be assumed that the request is received by Node 1 200 (step 602). The write or modified data is stored in the cache 210 (Fig. 5A; step 604). The dataset controller 220 determines from the metadata 230 the location of the dataset on which the operation is to be performed (step 606). For example, the metadata 230 may indicate that the specified dataset is dataset 2 124 and is located in Node 2 120 (Fig. 5B). If the dataset 2 124 is large relative to the aggregate write size, it may be subdivided into subsets (Fig. 5C; step 608). For example, the size of the dataset 2 124 may be 8 GB but the requested file operation pertains to only 6 GB. The dataset 2 124 may then be divided into four subsets DS-2A - DS-2D in the cache 210 in Node 1 200 (Fig. 5C). Once creation of the subsets DS-2A - DS-2D has been performed in the cache 210, the requested file system operation may be completed (step 610).
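The 8 GB example above reduces to a simple partition calculation: divided into four equal 2 GB subsets, a 6 GB operation beginning at the start of the dataset touches only three of them, so only those three need to cross the backbone. A sketch, with hypothetical names:

```python
# Hypothetical illustration of the subset division in Figs. 5A-5C.
DATASET_SIZE_GB = 8
NUM_SUBSETS = 4
SUBSET_SIZE_GB = DATASET_SIZE_GB // NUM_SUBSETS   # 2 GB per subset

def subsets_touched(write_size_gb, subset_size_gb=SUBSET_SIZE_GB):
    """Number of subsets an operation starting at offset 0 overlaps
    (ceiling division)."""
    return -(-write_size_gb // subset_size_gb)

# A 6 GB operation over 2 GB subsets touches DS-2A, DS-2B and DS-2C;
# DS-2D is left untouched.
touched = subsets_touched(6)
```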
The present invention provides several alternatives for handling the subsets after the requested file system operation has been performed. Figs. 7A and 7B and the flowchart of Fig. 8 illustrate one such alternative. Rather than transfer the modified subsets DS-2A - DS-2C through the backbone switch 130 from Node 1 200 to Node 2 120, it is a more efficient use of backbone resources to reassemble the subsets DS-2A - DS-2D of dataset 2 124 (Fig. 7A; step 800) and store the reassembled dataset in Node 1 200 (step 802). The metadata 230 is then updated to reflect that the dataset 2 124 is now stored in Node 1 200 (step 804; Fig. 7B).
Fig. 9 and the flowchart of Fig. 10 illustrate another alternative. Rather than transfer the modified subsets DS-2A - DS-2C through the backbone switch 130 from Node 1 200 to Node 2 120 (thereby using backbone bandwidth and resources), the modified subsets DS-2A - DS-2C are separated from the remaining subset DS-2D (step 1000) and then flushed from the cache 210 into storage in Node 1 200 (step 1002) while the other subset DS-2D remains in Node 2 120. The metadata 230 is updated to reflect the new location of subsets DS-2A - DS-2C and the location of subset DS-2D (step 1004).
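The split-location bookkeeping of Figs. 9 and 10 amounts to a per-subset metadata update: the modified subsets are recorded as stored on the receiving node while the untouched subset keeps its original location. A minimal sketch with hypothetical names:

```python
# Hypothetical per-subset metadata before the flush: all four subsets
# of dataset 2 are recorded on Node 2.
metadata = {s: "node2" for s in ("DS-2A", "DS-2B", "DS-2C", "DS-2D")}

def flush_modified(modified, local_node="node1"):
    """Step 1002/1004: store modified subsets locally and record
    their new location; unmodified subsets never cross the switch."""
    for subset in modified:
        metadata[subset] = local_node

flush_modified(["DS-2A", "DS-2B", "DS-2C"])
```

Only the metadata changes for DS-2D's siblings; DS-2D itself is never transferred.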
In still a further embodiment of the present invention, illustrated in the block diagrams of Figs. 11A and 11B and the flowchart of Fig. 12, if the subsets DS-2A - DS-2C have been stored in Node 1 as described with respect to Figs. 9 and 10, they may be reassembled with subset DS-2D in Node 2 during a period in which the backbone switch 130 is idle or otherwise at a reduced activity level (step 1200); that is, when the backbone switch 130 is idle or the full backbone bandwidth is otherwise not being used. Thus, the subsets DS-2A - DS-2C may be transferred back through the backbone switch 130 (Fig. 11A; step 1202) to be joined with the remaining subset DS-2D (step 1204). The metadata 230 is then updated to reflect the change in location of the subsets DS-2A - DS-2C and the reassembly of dataset 2 (Fig. 11B; step 1206).
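The deferred reassembly of Figs. 11A-11C can be sketched as a policy that migrates the split subsets back to their original node only when backbone utilisation drops below some threshold; the threshold value and all names here are hypothetical:

```python
# Hypothetical state after the split of Figs. 9 and 10.
metadata = {"DS-2A": "node1", "DS-2B": "node1",
            "DS-2C": "node1", "DS-2D": "node2"}

IDLE_THRESHOLD = 0.1   # hypothetical: rejoin below 10% backbone utilisation

def maybe_rejoin(backbone_utilisation, home="node2"):
    """Steps 1200-1206: transfer the split subsets home and update the
    metadata, but only while the backbone switch is quiet."""
    if backbone_utilisation >= IDLE_THRESHOLD:
        return False                    # backbone busy: defer the transfer
    for subset, node in metadata.items():
        if node != home:
            metadata[subset] = home     # transfer back; record new location
    return True

busy_result = maybe_rejoin(0.8)         # busy period: nothing moves
idle_result = maybe_rejoin(0.0)         # idle period: dataset 2 reassembled
```

The design point is that the rejoin consumes backbone bandwidth only when that bandwidth would otherwise go unused.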
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media such as a floppy disk, a hard disk drive, a RAM, and CD-ROMs and transmission-type media such as digital and analog communication links. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. Moreover, although described above with respect to methods and systems, the need in the art may also be met with a computer program product containing instructions for managing datasets in a cluster file system.

Claims

1. A cluster file system accessible to clients through a network, comprising:
a plurality of file system nodes in a cluster, comprising a first node and a second node;
a backbone switch operable to interconnect the first node and the second node;
a metadata structure identifying the node on which datasets are stored; and
the first node comprising a first cache and a dataset controller configurable to, if a specified dataset is stored on the second node:
receive a request from a client to perform a file system operation on the specified dataset;
access the metadata structure to determine the node on which the specified dataset is stored;
retrieve through the backbone switch from the second node that first portion of the specified dataset to which the file system operation is directed and leave a remainder portion of the specified dataset stored in the second node;
store the retrieved first portion in the first cache; and
upon completion of the file system operation, modify the metadata structure to indicate that at least the first portion of the specified dataset is stored in the first node.
2. The system of claim 1, wherein: the first portion is not returned through the backbone switch to the second node.
3. The system of claim 1, wherein:
the first node and the second node each comprise a virtual front-end server and a virtual back-end server; and the metadata structure identifies the virtual server and the node on which datasets are stored.
4. The system of claim 1, wherein the dataset controller is further configurable to:
upon completion of the file system operation, retrieve through the backbone switch the remainder portion of the specified dataset;
modify the metadata structure to indicate that the entire specified dataset is stored in the first node; and
store the entire specified dataset in the first node.
5. The system of claim 1, wherein the dataset controller is further configurable to:
divide the specified dataset into a plurality of subsets, wherein the first portion and the remainder portion of the specified dataset each comprise at least one subset;
modify the metadata structure to indicate that subsets comprising the first portion are stored in the first node and subsets comprising the remainder portion are stored in the second node; and
store the subsets of the first portion in the first node.
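The subset division recited in claim 5 can be illustrated with a brief sketch. The fixed chunk size, the `(dataset, index)` keys, and the function names are assumptions made for illustration, not limitations of the claim:

```python
# Illustrative sketch of claim 5: divide a dataset into subsets, then
# migrate only the subsets the file system operation touches, with
# per-subset location tracking in the metadata. Naming is hypothetical.

def partition(data, chunk_size):
    """Divide a dataset into fixed-size subsets (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def migrate_subsets(metadata, nodes, dataset, wanted, src, dst):
    """Move only the wanted subsets from src to dst; update metadata."""
    for i in wanted:
        nodes[dst][(dataset, i)] = nodes[src].pop((dataset, i))
        metadata[(dataset, i)] = dst  # each subset's location is tracked
```

Tracking locations per subset is what lets the remainder portion stay on the second node while the first portion is served from the first node's cache.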
6. The system of claim 5, wherein the dataset controller is further configurable to, during a time in which the backbone switch is at a reduced level of activity:
transfer the subsets comprising the first portion from the first node through the backbone switch to the second node;
combine the at least one subset of the first portion with the at least one subset of the remainder portion to reform the specified dataset;
store the reformed specified dataset in the second node; and
modify the metadata structure to indicate that the specified dataset is stored in the second node.
7. The system of claim 1, wherein the dataset controller is further configurable to, during a time in which the backbone switch is at a reduced level of activity:
transfer the first portion from the second node through the backbone switch to the first node;
combine the first portion with the remainder portion to reform the specified dataset;
store the reformed specified dataset in the first node; and
modify the metadata structure to indicate that the specified dataset is stored in the first node.
8. A method for managing datasets in a cluster file system, comprising:
receiving a request from a client to perform a file system operation on a specified dataset stored in one of a plurality of nodes in a cluster;
retrieving the specified dataset from a first node through a backbone switch;
storing the retrieved specified dataset in a cache in a second node;
performing the requested file system operation on the specified dataset; and
upon completion of the requested operation, modifying metadata to indicate that the specified dataset is stored in the second node.
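A minimal sketch of the method of claims 8 and 9 — retrieve the dataset once through the backbone switch, operate on it locally, and record the new location in the metadata rather than transferring the data back — might look like the following. The dictionary-based node and metadata structures and the transfer log are illustrative assumptions, not the claimed system:

```python
# Hypothetical sketch of claims 8-9: one transfer through the backbone
# switch; the dataset then stays at the node where the operation ran.

def handle_request(metadata, nodes, target, dataset, op, transfer_log):
    """Perform op on the dataset at the target node, migrating it if needed."""
    source = metadata[dataset]                   # look up current location
    if source != target:
        transfer_log.append((source, target))    # one trip through the switch
        nodes[target][dataset] = nodes[source].pop(dataset)  # cache locally
        metadata[dataset] = target               # update metadata; no return trip
    return op(nodes[target][dataset])            # perform the requested operation
```

Note that a second operation on the same dataset from the same node causes no further backbone traffic, which is the efficiency the method is aimed at.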
9. The method of claim 8, wherein the specified dataset is not returned through the backbone switch to the first node.
10. The method of claim 8, wherein:
the file system operation is requested to be performed on a first portion of the specified dataset; and retrieving the specified dataset comprises retrieving the first portion through the backbone switch whereby a second portion remains stored in the first node.
11. The method of claim 10, wherein modifying the metadata comprises modifying the metadata to indicate that the first portion of the specified dataset is stored in the second node and the second portion is stored in the first node.
12. The method of claim 10, wherein:
the method further comprises dividing the specified dataset into a plurality of subsets wherein the first portion and the second portion each comprise at least one subset; and
modifying the metadata comprises modifying the metadata to indicate that subsets comprising the first portion are stored in the second node and subsets comprising the second portion are stored in the first node.
13. The method of claim 12, further comprising, during a time in which the backbone switch is at a reduced level of activity:
transferring the at least one subset of the first portion from the second node through the backbone switch to the first node;
combining the at least one subset of the first portion with the at least one subset of the second portion to reform the specified dataset;
storing the reformed specified dataset in the first node; and
modifying the metadata structure to indicate that the specified dataset is stored in the first node.
14. The method of claim 8, further comprising, during a time in which the backbone switch is at a reduced level of activity:
transferring the first portion from the second node through the backbone switch to the first node;
combining the first portion with the second portion to reform the specified dataset;
storing the reformed specified dataset in the first node; and
modifying the metadata structure to indicate that the specified dataset is stored in the first node.
15. A computer program product of a computer readable medium usable with a programmable computer, the computer program product having computer-readable code embodied therein for managing datasets in a cluster file system, the computer-readable code comprising instructions for:
receiving a request from a client to perform a file system operation on a specified dataset stored in one of a plurality of nodes in a cluster;
retrieving the specified dataset from a first node through a backbone switch;
storing the retrieved specified dataset in a cache in a second node;
performing the requested file system operation on the specified dataset; and
upon completion of the requested operation, modifying metadata to indicate that the specified dataset is stored in the second node.
16. The computer program product of claim 15, wherein the specified dataset is not returned through the backbone switch to the first node.
17. The computer program product of claim 15, wherein:
the file system operation is requested to be performed on a first portion of the specified dataset; and
the instructions for retrieving the specified dataset comprise instructions for retrieving the first portion through the backbone switch whereby a second portion remains stored in the first node.
18. The computer program product of claim 17, wherein the instructions for modifying the metadata comprise instructions for modifying the metadata to indicate that the first portion of the specified dataset is stored in the second node and the second portion is stored in the first node.
19. The computer program product of claim 17, wherein:
the instructions further comprise instructions for dividing the specified dataset into a plurality of subsets wherein the first portion and the second portion each comprise at least one subset; and
the instructions for modifying the metadata comprise instructions for modifying the metadata to indicate that subsets comprising the first portion are stored in the second node and subsets comprising the second portion are stored in the first node.
20. The computer program product of claim 19, further comprising instructions for, during a time in which the backbone switch is at a reduced level of activity:
transferring the at least one subset of the first portion from the second node through the backbone switch to the first node;
combining the at least one subset of the first portion with the at least one subset of the second portion to reform the specified dataset;
storing the reformed specified dataset in the first node; and
modifying the metadata structure to indicate that the specified dataset is stored in the first node.
21. The computer program product of claim 15, further comprising instructions for, during a time in which the backbone switch is at a reduced level of activity:
transferring the first portion from the second node through the backbone switch to the first node;
combining the first portion with the second portion to reform the specified dataset;
storing the reformed specified dataset in the first node; and
modifying the metadata structure to indicate that the specified dataset is stored in the first node.
22. A file system node in a multi-node cluster file system, comprising:
means for interconnecting the node to at least a second node through a backbone switch;
a cache;
a metadata structure identifying the node on which datasets are stored;
means for receiving a request from a client to perform a file system operation on a specified dataset;
means for accessing the metadata structure to determine the node on which the specified dataset is stored;
if the specified dataset is stored on the second node, means for retrieving through the backbone switch that first portion of the specified dataset to which the file system operation is directed and leaving a remainder portion of the specified dataset stored in the second node;
means for storing the retrieved first portion in the cache; and
means for modifying the metadata structure upon completion of the file system operation to indicate that at least the first portion of the specified dataset is stored in the first node.
23. The file system node of claim 22, whereby the first portion is not returned through the backbone switch to the second node.
24. The file system node of claim 22, further comprising:
means for retrieving through the backbone switch the remainder portion of the specified dataset upon completion of the file system operation;
means for modifying the metadata structure to indicate that the entire specified dataset is stored in the first node; and
means for storing the entire specified dataset in the first node.
25. The file system node of claim 22, further comprising:
means for dividing the specified dataset into a plurality of subsets, wherein the first portion and the remainder portion of the specified dataset each comprise at least one subset;
means for modifying the metadata structure to indicate that subsets comprising the first portion are stored in the first node and subsets comprising the remainder portion are stored in the second node; and
means for storing the subsets of the first portion in the first node.
26. The file system node of claim 25, further comprising:
means for transferring the subsets comprising the first portion from the first node through the backbone switch to the second node, during a time in which the backbone switch is at a reduced level of activity;
means for combining the at least one subset of the first portion with the at least one subset of the remainder portion to reform the specified dataset;
means for storing the reformed specified dataset in the second node; and
means for modifying the metadata structure to indicate that the specified dataset is stored in the second node.
27. The file system node of claim 22, further comprising:
means for transferring the first portion from the second node through the backbone switch to the first node during a time in which the backbone switch is at a reduced level of activity;
means for combining the first portion with the remainder portion to reform the specified dataset;
means for storing the reformed specified dataset in the first node; and
means for modifying the metadata structure to indicate that the specified dataset is stored in the first node.

Also Published As

Publication number Publication date
CN101375241A (en) 2009-02-25
US20070179981A1 (en) 2007-08-02
EP1979806A1 (en) 2008-10-15

Similar Documents

Publication Publication Date Title
US20070179981A1 (en) Efficient data management in a cluster file system
US7076553B2 (en) Method and apparatus for real-time parallel delivery of segments of a large payload file
JP5411250B2 (en) Data placement according to instructions to redundant data storage system
CN102483768B (en) Memory structure based on strategy distributes
US7546486B2 (en) Scalable distributed object management in a distributed fixed content storage system
US8046422B2 (en) Automatic load spreading in a clustered network storage system
JP5765416B2 (en) Distributed storage system and method
US7539735B2 (en) Multi-session no query restore
CN102708165B (en) Document handling method in distributed file system and device
US20060167838A1 (en) File-based hybrid file storage scheme supporting multiple file switches
US7689764B1 (en) Network routing of data based on content thereof
US20150347434A1 (en) Reducing metadata in a write-anywhere storage system
EP1902394B1 (en) Moving data from file on storage volume to alternate location to free space
JP5330503B2 (en) Optimize storage performance
US7506005B2 (en) Moving data from file on storage volume to alternate location to free space
JP5516575B2 (en) Data insertion system
US8700684B2 (en) Apparatus and method for managing a file in a distributed storage system
US20100318584A1 (en) Distributed Cache Availability During Garbage Collection
KR20100073154A (en) Method for data processing and asymmetric clustered distributed file system using the same
WO2023207492A1 (en) Data processing method and apparatus, device, and readable storage medium
US10057348B2 (en) Storage fabric address based data block retrieval
US7792966B2 (en) Zone control weights
JP7421078B2 (en) Information processing equipment, information processing system, and data relocation program
US11188258B2 (en) Distributed storage system
JP4224279B2 (en) File management program
