CN111417939A - Hierarchical storage in a distributed file system - Google Patents

Hierarchical storage in a distributed file system

Info

Publication number
CN111417939A
CN111417939A (application CN201880065539.1A)
Authority
CN
China
Prior art keywords
data
file server
identifier
virtual cluster
cluster descriptor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880065539.1A
Other languages
Chinese (zh)
Inventor
U·V·萨拉蒂
A·A·潘德
K·拉斯托基
G·P·雷迪蒂
N·布帕雷
R·波度
C·G·K·B·萨纳帕拉
P·炯纳拉
A·桑万
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Mapr Technologies Inc
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Publication of CN111417939A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9017 Indexing; Data structures therefor; Storage structures using directory or table look-up
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Abstract

The file server receives a request for data from a user device. The data is represented at the file server by a virtual cluster descriptor. The file server queries the identifier mapping using the identifier of the virtual cluster descriptor. In response to the identifier mapping indicating that the requested data is stored at a location remote from the file server, the file server accesses a cold layer translation table that stores a mapping between an identifier of each of the plurality of virtual cluster descriptors and a storage location of the data associated with the respective virtual cluster descriptor. The cold layer translation table is queried using an identifier of the virtual cluster descriptor to identify a storage location of the requested data, and the data is loaded from the identified storage location to the file server.

Description

Hierarchical storage in a distributed file system
Cross Reference to Related Applications
This application claims the benefit of U.S. patent application No. 16/xxx, filed on August 16, 2018, and U.S. patent application No. 62/546,272, filed on August 16, 2017, which are incorporated herein by reference in their entirety.
Technical Field
Various embodiments disclosed relate to distributed file systems, and more particularly to tiered storage in distributed file systems.
Background
Enterprises are seeking solutions that meet the conflicting requirements of low-cost storage (typically at an external location) and high-speed data access. They also want nearly unlimited storage capacity. With current approaches, users typically have to purchase inefficient and expensive third-party products (such as cloud gateways), which introduce administrative and application complexity.
There are also additional considerations in modern big data systems when attempting to move cold data to a cold storage tier, where "cold" or "frozen" data is data that is rarely accessed. One particular aspect of many low-cost object stores, such as Amazon S3 or the Azure object store, is that it is preferable to make the objects in the object store relatively large (10 MB or larger). Much smaller objects can be stored, but storage efficiency, performance, and cost considerations favor designs that use larger objects.
For example, modern big data systems may contain enormous numbers of files. Some of these systems hold over 1 trillion files, file creation rates exceed 20 billion per day, and these numbers are only expected to grow. In a system with so many files, the average and median file sizes are necessarily much smaller than the data units expected by cold tier storage. For example, a system with an exabyte of storage and a trillion files has an average file size of 10^18/10^12 = 1 MB, far below the desired object size. Moreover, many file systems with very large file counts have a total size well under a petabyte, with an average file size of roughly 100 kB. As of 2014, Amazon S3 held only about 2 trillion objects across all of its users. Due to transaction costs alone, writing just one trillion objects into S3 would cost on the order of 5 million dollars. For a 100 kB object, the upload cost alone is equivalent to about two months of storage fees, and objects smaller than 128 kB cost the same as objects of 128 kB. These cost structures reflect the efficiency of the underlying object store and are one way Amazon encourages users to keep objects large.
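As a rough illustration of this arithmetic, the short Python sketch below reproduces the figures above; the per-request and per-gigabyte-month prices are assumed values chosen only to match the cost structure described, not quoted object store rates.

```python
# Back-of-the-envelope arithmetic for the figures above. The prices are
# illustrative assumptions, not quoted object-store rates.
total_bytes = 10**18                    # an exabyte of data
file_count = 10**12                     # a trillion files
avg_size = total_bytes / file_count
print(f"average file size: {avg_size / 1e6:.0f} MB")                            # 1 MB

put_price = 0.005 / 1000                # assumed dollars per PUT request
print(f"upload cost for {file_count} objects: ${file_count * put_price:,.0f}")  # $5,000,000

storage_price = 0.023                   # assumed dollars per GB-month
small_object_gb = 100e3 / 1e9           # a 100 kB object
print(f"one upload: ${put_price:.6f} vs two months of storage: "
      f"${2 * small_object_gb * storage_price:.6f}")                            # roughly equal
```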
Data types beyond traditional files, such as message streams and key-value tables, further exacerbate the problem of inefficient cloud storage. An important characteristic of a message stream is that it is usually a very long-lived object (a lifetime of several years is not unreasonable) but is updated and accessed throughout its lifetime. To save space, it may be desirable for the file server to offload parts of a stream to a third-party cloud service, yet those parts may remain active and therefore be accessed frequently by file server processes. This typically means that only a small portion of a message stream can be sent to the cold tier at a time, while most of the object remains stored at the file server.
Security is also a critical requirement for any system that stores cold data in a cloud service.
Drawings
One or more embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
FIG. 1A is a block diagram illustrating an environment for implementing a hierarchical file storage system in accordance with one embodiment.
FIG. 1B is a schematic diagram illustrating the logical organization of data in a file system.
FIG. 2A illustrates an example of snapshots of a data volume.
FIG. 2B is a block diagram illustrating a process for offloading data to a cold layer.
FIG. 3 is a block diagram illustrating elements and communication paths in a read operation in a hierarchical file system, according to one embodiment.
FIG. 4 is a block diagram illustrating elements and communication paths in a write operation in a hierarchical file system, according to one embodiment.
FIG. 5 is a block diagram of a computer system that may be used to implement certain features of some embodiments.
Detailed Description
Various example embodiments will now be described. The following description provides certain specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant art will understand, however, that some of the disclosed embodiments may be practiced without many of these details.
Similarly, one skilled in the relevant art will also appreciate that some embodiments may include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below to avoid unnecessarily obscuring the relevant description of the various examples.
The terminology used below should be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of embodiments. Indeed, certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this detailed description section.
Overview of the System
The hierarchical file storage system provides policy-based automatic tiering, using a file system with full read-write semantics and third-party cloud-based object storage as an additional storage tier. The hierarchical file storage system maintains different types of data using file servers (e.g., operated internally by a company) that communicate with remote third-party servers. In some embodiments, a file server receives a request for data from a user device. Data is represented at the file server by virtual cluster descriptors. The file server queries an identifier mapping using the identifier of the virtual cluster descriptor. In response to the identifier mapping indicating that the requested data is stored at a location remote from the file server, the file server accesses a cold layer translation table that stores a mapping between an identifier of each of a plurality of virtual cluster descriptors and a storage location of the data associated with the respective virtual cluster descriptor. The cold layer translation table is queried using the identifier of the virtual cluster descriptor to identify a storage location of the requested data, and the data is loaded from the identified storage location to the file server.
Using third-party storage addresses rapid data growth and makes better use of data center storage resources by treating third-party storage as an economical tier well suited to "cold" or "frozen" data that is rarely accessed. In this way, valuable local storage resources can be devoted to more active data and applications, while cold data is retained at reduced cost and administrative burden. Data structures in the file server allow cold data to be accessed with the same methods as hot data.
FIG. 1A is a block diagram illustrating an environment for implementing a hierarchical file storage system in accordance with one embodiment. As shown in FIG. 1A, an environment may include a file system 100 and one or more cold storage devices 150. The file system 100 may be a distributed file system that supports traditional objects such as files, directories, and links, as well as objects such as key value tables and message streams. The cold storage device 150 may be co-located with a storage device associated with the file system 100, or the cold storage device 150 may include one or more servers physically remote from the file system 100. For example, the cold storage device 150 may be a cloud storage device. The data stored by the cold storage device 150 may be organized into one or more object pools 155, each object pool 155 being a logical representation of a data set.
Data stored by the file system 100 and the cold storage device 150 is classified into a "hot" tier and a "cold" tier. Generally, "hot" data is data determined to be actively used or frequently accessed, while "cold" data is data considered to be rarely used or accessed. For example, cold data may include data that must be retained for regulatory or compliance purposes. The storage devices associated with the file system 100 constitute the hot tier, which stores hot data. Storing hot data locally at the file system 100 enables the file system 100 to access it quickly when requested, providing fast responses to data requests at lower processing cost than accessing the cold tier. The cold storage device 150 may store cold data and constitutes the cold tier. Offloading infrequently used data to the cold tier frees space at the file system 100 for new data. However, recalling data from the cold tier takes more cost and time than accessing locally stored data.
Data may be identified as hot or cold based on rules and policies set by an administrator of the file system 100. These rules may consider, for example, the time since last access, last modification, or creation. The rules may differ by data type (e.g., the rules applied to a file may be different from the rules applied to a directory). Any new data created within the file system 100 may initially be classified as hot and written to a local storage device in the file system 100. Once data is classified as cold, it is offloaded to the cold tier. Reads and writes to cold data may result in partial caching or other temporary local storage of the data in the file system 100. However, absent administrative action, such as changing the rules applied to the data or recalling the entire data volume to the file system 100, offloaded data is not reclassified as "hot".
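As an illustration only, a rule of the kind described above might be sketched as follows in Python; the attribute names, the single 90-day threshold, and the function itself are assumptions, not the rule format actually used by the file system 100.

```python
import time

# Illustrative cold-data rule. The attribute names and the 90-day threshold are
# assumptions; real rules are configured per volume by an administrator and may
# differ by data type.
def is_cold(file_attrs, now=None, max_idle_days=90):
    """Return True if the file has not been accessed, modified, or created
    within the last max_idle_days days."""
    if now is None:
        now = time.time()
    last_touched = max(file_attrs["atime"], file_attrs["mtime"], file_attrs["ctime"])
    return (now - last_touched) > max_idle_days * 24 * 3600

# Example: a file last touched about 200 days ago is classified as cold.
old = time.time() - 200 * 24 * 3600
print(is_cold({"atime": old, "mtime": old, "ctime": old}))  # True
```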
The file system 100 maintains data stored across multiple cluster nodes 120, each including one or more storage devices. Each cluster node 120 hosts one or more storage pools 125. Within each storage pool 125, data is structured into containers 127. A container 127 may hold files, directories, tables, and fragments of streams, as well as link data representing logical connections between these items. Each container 127 may hold up to a specified amount of data, such as 30 GB, and each container 127 is fully contained within one of the storage pools 125. A container 127 may be replicated to other cluster nodes 120, with one copy designated as the master container. For example, container 127A may be the master container for some data stored therein, and container 127D may store a copy of that data. End users of the file system 100 typically do not see the containers 127 or the logical representation of the data provided by the containers.
When data is written to a container 127, the data is also written to each container 127 that holds a copy of the data before the write is acknowledged. In some embodiments, data to be written to a container 127 is first sent to the master container, which in turn sends the write to the other copies. If any copy fails to acknowledge the write within a threshold time and after a specified number of retries, the replication chain of the container 127 can be updated. A cycle counter associated with the container 127 may also be updated. The cycle counter enables each container 127 to verify that data to be written is current and to reject stale writes from a master container of a previous cycle.
When a storage pool 125 recovers from a transient failure, the containers 127 in that pool 125 may be out of date. As such, the file system 100 may apply a grace period after noticing the loss of a container copy before creating a new copy. If the missing copy of the container reappears before the grace period ends, it can be resynchronized with the current state of the container. Once the copy is up to date, the cycle of the container is incremented and the copy is added back to the container's replication chain.
Within a container 127, data is segmented into blocks and organized in a data structure such as a b-tree. Data blocks hold up to a specified amount of data (such as 8 kB) and may be compressed in groups of a specified number of blocks (e.g., 8). If a group is compressed, updating one block may require reading and rewriting several blocks from the group. If the data is uncompressed, each individual block can be overwritten directly.
Data stored in the file system 100 may be presented to end users as volumes. Each volume may include one or more containers 127. When presented to an end user, a volume looks similar to a directory but includes additional management capabilities. Each volume may have a mount point that defines the location in the namespace where the volume is visible. Operations in the file system 100 for handling cold tier data, such as snapshots, mirroring, and defining data locality within a cluster, may be performed at the volume level.
The file system 100 also includes a container location database (CLDB) 110. The CLDB 110 maintains information about where each container 127 is located and establishes the structure of each replication chain for the data stored by the file system 100. The CLDB 110 may be maintained by several redundant servers, and the data in the CLDB itself may be stored in containers 127. The CLDB 110 can therefore be replicated in a similar manner to other data in the file system 100, giving the CLDB several hot standbys that can take over in the event of a CLDB failure.
For example, the CLDB 110 may store rules for selectively identifying data to offload to the cold tier and a schedule for when to offload the data. The CLDB 110 may also store object pool attributes used for storing and accessing the offloaded data. For example, the CLDB 110 may store the IP addresses of the storage devices that hold the offloaded data, authentication credentials for accessing those devices, compression levels, encryption details, or recommended object sizes.
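For illustration only, the kind of per-volume cold tier configuration the CLDB 110 might hold is sketched below; every field name and value is hypothetical and chosen solely to exemplify the attributes listed above.

```python
# Hypothetical per-volume cold-tier configuration of the kind the CLDB could
# store; every field name and value here is illustrative only.
cold_tier_config = {
    "volume": "projects_archive",
    "offload_rules": {
        "min_days_since_last_access": 90,      # offload files idle for 90+ days
        "min_file_size_bytes": 64 * 1024,      # skip very small files
    },
    "offload_schedule": "0 2 * * *",           # run nightly at 02:00 (cron syntax)
    "object_pool": {
        "endpoint": "203.0.113.10",            # storage device address (example IP)
        "credentials_ref": "vault:cold-pool-key",  # reference only, never inline secrets
        "compression": "lz4",
        "encryption": "aes-256-gcm",
        "target_object_size_bytes": 10 * 1024 * 1024,  # ~10 MB objects
    },
}
```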
In general, the term "tiering services" is used herein to refer to the various independent services that manage different aspects of the data lifecycle for a particular tier. These services are configured in the CLDB 110 for each tier enabled on each volume. The CLDB 110 manages the discovery, availability, and some global state of these services. The CLDB 110 may also manage any volumes these services need for storing their private data (e.g., metadata for the tiering services), as well as any service-specific configuration, such as which hosts the services may run on.
As described above, data is stored in the file system 100 and the cold storage device 150 in the form of blocks. FIG. 1B is a schematic diagram illustrating the logical organization of data in the file system 100. As shown in FIG. 1B, data blocks 167 may be logically grouped into virtual cluster descriptors (VCDs) 165. For example, each VCD 165 may contain up to eight data blocks. One or more VCDs 165 together represent a discrete data object, such as a file, stored by the file system 100. The VCD 165 introduces a layer of indirection between the underlying physical storage of data and the high-level operations that create, read, write, modify, and delete data in the tiered storage system. These high-level operations may include, for example, reads, writes, snapshot creation, replication, resynchronization, and mirroring. The indirection allows these operations to continue to work with the VCD abstraction without needing to know how or where the data belonging to a VCD is physically stored. In some embodiments, the abstraction applies only to the substantive data stored in the tiered storage system; file system metadata (such as namespace metadata, inode lists, and fidmaps) may be stored persistently at the file system 100, which therefore may not benefit from abstracting the location of the metadata. In other cases, however, file metadata may also be represented by VCDs.
Each VCD 165 is assigned a unique identifier (referred to herein as a VCDID). The file system 100 maintains one or more mappings 160 (referred to herein as VCDID maps) from each VCDID to the physical location where the data associated with that VCDID is stored. For example, each container 127 may have a corresponding VCDID map 160. In the ordinary case, where data has not been offloaded to the object pool 155, the VCDID map 160 may be a one-to-one mapping from VCDIDs to the physical block addresses where the associated data is stored. Thus, when data is stored locally at the file system 100, the file system 100 can query the VCDID map 160 with a VCDID to identify the physical location of the data. Once data has been offloaded to the object pool, its VCDID map 160 entry may be empty or may otherwise indicate that the data has been offloaded from the file system 100.
Typically, when the file system 100 receives a request (e.g., a read request or a write request) associated with stored data, the file system 100 looks up the VCDID associated with the requested data in the VCDID map 160. If the VCDID map 160 lists the physical block addresses of the requested data, the file system 100 can use the listed addresses to access the data and satisfy the request directly. If the entry is empty, or the VCDID map 160 otherwise indicates that the data has been offloaded, the file system 100 may query a sequence of cold tier services to find the data associated with the VCDID. For example, the cold tier services may be arranged in priority order such that erasure coding is preferred over cloud storage. Searching the tiering services in priority order also allows data to be available in multiple tiers at once (e.g., the hot and cold tiers), which simplifies moving data between tiers.
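A minimal sketch of this lookup order follows; the in-memory maps, the service interface, and all identifiers are assumptions for illustration, not the actual interfaces of the file system 100.

```python
# Illustrative lookup order: consult the VCDID map first, then probe the
# tiering services in priority order (e.g., erasure coding before cloud
# object storage). All names here are assumptions, not real interfaces.
local_blocks = {0x10: b"hot-block-0", 0x11: b"hot-block-1"}   # address -> 8 kB block

def read_vcd(vcdid, vcdid_map, tier_services):
    addresses = vcdid_map.get(vcdid)
    if addresses:                                 # hot: data is stored locally
        return b"".join(local_blocks[a] for a in addresses)
    for service in tier_services:                 # cold: probe tiers by priority
        data = service.get(vcdid)
        if data is not None:
            return data
    raise KeyError(f"VCDID {vcdid:#x} not found in any tier")

# A hot VCD maps to physical block addresses; an offloaded VCD has an empty entry.
vcdid_map = {0xA1: [0x10, 0x11], 0xB2: None}      # VCDID -> block addresses
cold_tier = {0xB2: b"recalled-from-object-pool"}  # stand-in for a cold-tier service
print(read_vcd(0xA1, vcdid_map, [cold_tier]))     # served from local blocks
print(read_vcd(0xB2, vcdid_map, [cold_tier]))     # served by the cold tier
```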
The use and maintenance of the VCDID map can affect the data retrieval performance of the file system 100 in two main ways. First, querying the VCDID map for the local location of the data in a VCD adds a lookup step compared to, for example, querying the file b-tree alone. This additional lookup incurs a cost to the file system 100, primarily the cost of loading VCDID map entries into cache. However, the ratio of the actual data in a container to the size of the VCDID map itself is large enough that the cost of loading the map is small on an amortized basis. Furthermore, the ability to enable tiering selectively on some volumes and not others allows volumes with transient, very hot data to avoid this cost entirely.
The second type of performance impact is caused by interference between background file system operations and foreground I/O operations; in particular, inserting data into the VCDID map takes time and processing resources of the file system 100. In some embodiments, the cost of these insertions may be reduced by using techniques similar to log-structured merge (LSM) trees.
Offloading data to a cold layer
Data operations in a hierarchical file system may be configured at the volume level. These operations may include, for example, replication and mirroring of data within the file system 100, as well as tiering services such as cold tiering using the object pool 155. An administrator can configure different tiering services on the same volume, just as multiple mirrors can be defined independently.
From the user's perspective, a file appears to be the smallest logical unit of user data identified for offloading to the cold tier, since the offload rules defined for a volume refer to file-level attributes. However, offloading data on a per-file basis has a disadvantage: snapshots share unmodified data at the physical block level in the file system 100, so the same file across snapshots may share many blocks. Offloading at the file level would therefore duplicate the shared data for each snapshot's copy of the file, whereas tiering at the VCD level can exploit the shared data to save space.
FIG. 2A illustrates an example of snapshots of a data volume. In FIG. 2A, data blocks in a file are shared between snapshots and the latest writable view of the data. The example file experiences the following sequence of events:
1. The first 192 kB of the file (represented by three VCDs) is written.
2. Snapshot S1 is created.
3. The last 128 kB of the file (represented by two VCDs) is overwritten.
4. Snapshot S2 is created.
5. The last 64 kB of the file (represented by one VCD) is overwritten.
If the blocks in snapshot S1 are moved to the cold storage device 150, tiering at the VCD level allows snapshot S2 and the current version of the file to share the tiered data with snapshot S1. Conversely, offloading at the file level would not take advantage of the space that could be saved by shared blocks. This wasted storage space can have a significant impact on the efficiency and cost of maintaining data in the cold tier, especially with long-lived or numerous snapshots.
As shown in FIG. 2A, data blocks in a file are shared between snapshots and the latest writable view of the data. When a data block is overwritten, the new block obscures the corresponding block in older snapshots but may still be shared with newer views. Here, the block starting at offset 0 is never overwritten, the blocks starting at 64k and 128k are overwritten before snapshot 2 is taken, and the block at 128k is overwritten again sometime after snapshot 2.
If the data represented in FIG. 2A is offloaded at the file level, the entire file must be either "hot" (available on local storage) or "cold" (stored in the object pool), and remote I/O on partial blocks of the file is difficult to manage. Since certain data types (such as message streams) may have both very hot and very cold data in the same object, deciding whether the entire object should be stored locally or in the cold tier is inefficient. In contrast, tiering at the virtual cluster descriptor level enables the file system 100 to classify data more efficiently. For example, for the data blocks in FIG. 2A, all of the blocks in snapshots 1 and 2 may be treated as cold, while the file system 100 keeps the latest version of the unique block as hot data.
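The sharing argument of FIG. 2A can be made concrete with the following sketch, in which the VCD identifiers v1 through v6 are invented for the event sequence listed above.

```python
# The event sequence of FIG. 2A with invented VCDIDs v1..v6.
snapshot_s1 = {0: "v1", 64: "v2", 128: "v3"}   # offsets in kB -> VCDID
snapshot_s2 = {0: "v1", 64: "v4", 128: "v5"}   # 64k and 128k overwritten after S1
current     = {0: "v1", 64: "v4", 128: "v6"}   # 128k overwritten again after S2

in_snapshots = set(snapshot_s1.values()) | set(snapshot_s2.values())
only_current = set(current.values()) - in_snapshots

# Tiering per VCD lets S1, S2, and the current view share v1 (and S2/current
# share v4) in the cold tier, instead of offloading three whole file copies.
print("referenced by snapshots (can be tiered once, shared):", sorted(in_snapshots))
print("unique to the latest view (kept hot):", sorted(only_current))
```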
FIG. 2B is a block diagram illustrating processes for offloading data to the cold tier. As shown in FIG. 2B, these processes may include a cold tier translator (CTT) 202, a cold tier offloader (CTO) 205, and a cold tier compactor (CTC) 204. Each of the CTT 202, the CTO 205, and the CTC 204 may be executed by one or more processors of the file system 100 and may be implemented as a software module, a hardware module, or a combination of software and hardware. Alternatively, each process may be executed by a computing device other than the file system 100 and invoked by the file system 100.
The CTT 202 translates VCDIDs into the locations of the corresponding data in the object pool 155. To do so, the CTT 202 maintains an internal database table 203 that maps each VCDID to a location, returned as an object identifier and an offset. The CTT 202 may also store any information needed for validating data retrieved from the object pool 155 (e.g., a hash or checksum), for decompressing the data if the compression level differs between the object pool 155 and the file system 100, and for decrypting the data if encryption is enabled. When data is offloaded to the object pool 155, the CTT table 203 is updated with entries for the VCDIDs corresponding to the offloaded data. The CTT 202 may also update the table 203 after any reorganization of the objects in the object pool 155; an example of such a reorganization is compaction of the object pool 155 by the cold tier compactor 204. The CTT 202 may be a persistent process, and its active contact location may be recorded in the file system 100 (e.g., in the CLDB 110) so that each container process knows where to connect to the CTT 202.
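A minimal sketch of the translation and validation step performed by the CTT 202 follows; the entry layout, the SHA-256 checksum, zlib compression, and the in-memory object pool are assumptions for illustration, not the actual design.

```python
import hashlib
import zlib
from dataclasses import dataclass

# Illustrative CTT entry layout and recall path; field names, the checksum
# choice, and the object-store interface are assumptions, not the real design.
@dataclass
class CttEntry:
    object_id: str     # object in the object pool holding this VCD's data
    offset: int        # byte offset of the VCD inside the object
    length: int        # stored (possibly compressed) length in bytes
    checksum: str      # hash of the stored bytes
    compressed: bool   # whether the bytes were compressed before upload

def recall_vcd(vcdid, ctt_table, object_pool):
    entry = ctt_table[vcdid]
    raw = object_pool[entry.object_id][entry.offset:entry.offset + entry.length]
    if hashlib.sha256(raw).hexdigest() != entry.checksum:
        raise IOError(f"checksum mismatch recalling VCDID {vcdid}")
    return zlib.decompress(raw) if entry.compressed else raw

# Example: one VCD packed (compressed) into a larger object.
payload = zlib.compress(b"cold data for one VCD")
pool = {"obj-0001": b"\x00" * 64 + payload}          # stand-in object store
table = {0xB2: CttEntry("obj-0001", 64, len(payload),
                        hashlib.sha256(payload).hexdigest(), True)}
print(recall_vcd(0xB2, table, pool))                 # b'cold data for one VCD'
```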
The cold tier offloader (CTO) 205 identifies files in a volume that are ready to be offloaded, retrieves the corresponding data from the file system 100, and packs that data into objects to be written to the object pool 155. The CTO 205 process may be started according to a defined schedule, which may be configured in the CLDB 110. To identify files to offload, the CTO 205 may retrieve information 207 about which containers 127 are in the volume and then retrieve 208 a list of inodes and attributes of those containers from the file system 100. The CTO 205 applies the volume-specific tiering rules to this information and identifies files or file portions that qualify for transfer to the cold tier.
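For illustration, the packing step performed by the CTO 205 might look like the sketch below, which groups cold VCDs into objects of roughly the recommended size; the structures, the 10 MB target, and the update of the translation table are simplifying assumptions.

```python
# Illustrative packing of cold VCDs into ~10 MB objects, as favored by object
# stores. The interfaces and the CTT/VCDID map updates are simplified sketches.
TARGET_OBJECT_SIZE = 10 * 1024 * 1024   # assumed recommended object size

def offload(cold_vcds, object_pool, ctt_table, vcdid_map):
    """cold_vcds: iterable of (vcdid, bytes) selected by the volume's rules."""
    buffer, members, object_seq = bytearray(), [], 0

    def flush():
        nonlocal buffer, members, object_seq
        if not members:
            return
        object_id = f"obj-{object_seq:08d}"
        object_pool[object_id] = bytes(buffer)
        for vcdid, offset, length in members:
            ctt_table[vcdid] = (object_id, offset, length)  # where the data now lives
            vcdid_map[vcdid] = None        # local entry cleared: data is offloaded
        buffer, members, object_seq = bytearray(), [], object_seq + 1

    for vcdid, data in cold_vcds:
        members.append((vcdid, len(buffer), len(data)))
        buffer.extend(data)
        if len(buffer) >= TARGET_OBJECT_SIZE:
            flush()
    flush()

# Usage: two small VCDs end up packed into a single object.
pool, ctt, vmap = {}, {}, {0xA1: [0x10], 0xA2: [0x12]}
offload([(0xA1, b"x" * 100), (0xA2, b"y" * 200)], pool, ctt, vmap)
print(ctt)   # both VCDIDs map into 'obj-00000000' at different offsets
```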
The cold tier compactor (CTC) 204 identifies deleted VCDIDs and removes them from the CTT table 203. Operations such as file deletion, snapshot deletion, and overwriting of existing data result in logical removal of data in the file system 100. Ultimately, these operations translate into deleting VCDIDs from the VCDID map. To clean up deleted VCDIDs, the CTC 204 checks 214 the VCDID map to find opportunities to delete entire objects stored in the cold tier or to compact 215 them.
The compaction performed by the CTC 204 can proceed safely even in the face of updates to data in the file system. Because the VCDID map and each cold tier are probed in sequence, adding a reference to a particular block in the VCDID map makes any changes in the downstream tiers irrelevant. The CTC 204 can therefore change the tiered data before or after changing the VCDID map without affecting the user's view of the data state. Furthermore, because the tiered copy of the data is immutable and any reference from one data block to another ultimately resolves through the VCDID map, data can be updated cleanly without mechanisms such as distributed locks.
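A compaction pass of the kind performed by the CTC 204 might be sketched as follows, reusing the structures of the earlier sketches; it is a simplification under those assumptions, not the actual compaction algorithm.

```python
# Illustrative compaction pass: drop CTT entries whose VCDIDs no longer exist
# in the VCDID map, and rewrite the surviving data into a fresh, dense object.
def compact_object(object_id, object_pool, ctt_table, vcdid_map):
    live = [(vcdid, off, length) for vcdid, (oid, off, length) in ctt_table.items()
            if oid == object_id and vcdid in vcdid_map]
    dead = [vcdid for vcdid, (oid, _, _) in ctt_table.items()
            if oid == object_id and vcdid not in vcdid_map]

    old = object_pool[object_id]
    new_id, buffer = object_id + "-compacted", bytearray()
    for vcdid, off, length in live:                 # copy only live VCD data
        ctt_table[vcdid] = (new_id, len(buffer), length)
        buffer.extend(old[off:off + length])
    for vcdid in dead:                              # forget deleted VCDIDs
        del ctt_table[vcdid]

    if live:
        object_pool[new_id] = bytes(buffer)
    del object_pool[object_id]                      # the old object can be reclaimed

# Usage: 0xA2 was deleted from the VCDID map, so only 0xA1's data survives.
pool = {"obj-0": b"x" * 100 + b"y" * 200}
ctt = {0xA1: ("obj-0", 0, 100), 0xA2: ("obj-0", 100, 200)}
vcdid_map = {0xA1: None}        # offloaded but still referenced; 0xA2 is gone
compact_object("obj-0", pool, ctt, vcdid_map)
print(ctt, list(pool))          # only 0xA1 survives, now inside 'obj-0-compacted'
```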
Each of the CTT 202, CTO 205, and CTC 204 may serve multiple volumes, because internal metadata is partitioned per volume. In some embodiments, the CLDB 201 ensures that only one service of each type is active for a given volume at a given time. The CLDB 201 may also stop or restart services based on cluster status and heartbeats received from the services, thereby keeping the tiering services highly available.
Example operations on tiered data
FIG. 3 is a block diagram illustrating elements and communication paths in a read operation in a hierarchical file system, according to one embodiment. The components and processes described with respect to fig. 3 may be similar to the components and processes described with respect to fig. 1 and 2B.
As shown in FIG. 3, a client 301 sends 302 a read request to a file server 303. The read request identifies data requested by the client 301, e.g., for use by an application executed by the client 301. The file server 303 may hold a mutable container or an immutable replica of the desired data. Each container or replica is associated with a set of directory information and file data, e.g., stored in a b-tree.
The file server 303 may examine the b-tree to find the VCDID corresponding to the requested data and examine the VCDID map to identify the location of the data for that VCDID. If the VCDID map lists one or more physical block addresses where the data is stored, the file server 303 reads the data from the locations indicated by those addresses, stores the data in a local cache, and sends 304 a response to the client 301. If the VCDID map indicates that the data is not stored locally (e.g., if the mapping for the given VCDID is empty), the file server 303 identifies the object pool to which the data has been offloaded.
Because retrieving data from the object pool takes more time than reading data from disk, the file server 303 may send 305 an error message (EMOVED) to the client 301. In response to the error message, the client 301 may delay a subsequent read operation 306 for a preset time interval. In some embodiments, the client 301 may repeat the read operation 306 a specified number of times. If the client 301 cannot read the data from the file server 303 cache after the specified number of attempts, the client 301 may return an error to the application and make no further attempts to read the data. The time between read attempts may be constant or may be gradually increased after each failed attempt.
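For illustration, the client-side retry behavior described above might be sketched as follows; the EMOVED signal is modeled as an exception, and the retry count and backoff schedule are assumed values rather than the protocol's actual constants.

```python
import time

class EMovedError(Exception):
    """Stand-in for the EMOVED signal: the data is being recalled from the cold tier."""

def read_with_retry(read_fn, max_attempts=5, initial_delay=0.5, backoff=2.0):
    """Retry a read that may hit EMOVED while the file server recalls data.
    The constants are illustrative; a real client would use configured values."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except EMovedError:
            if attempt == max_attempts:
                raise              # give up and surface the error to the application
            time.sleep(delay)      # wait while the server recalls the data
            delay *= backoff       # optionally grow the interval after each failure

# Usage: the first two attempts hit EMOVED, the third is served from the cache.
attempts = iter([EMovedError, EMovedError, b"data from cache"])
def fake_read():
    result = next(attempts)
    if isinstance(result, type) and issubclass(result, Exception):
        raise result()
    return result
print(read_with_retry(fake_read, initial_delay=0.01))
```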
After sending the EMOVED error message to the client 301, the file server 303 may begin recalling the data from the cold tier. The file server 303 may send 307 a request to the CTT 308 with a list of one or more VCDIDs corresponding to the requested data.
The CTT 308 queries its translation table for each of the one or more VCDIDs. The translation table contains a mapping from VCDID to an object ID and an offset identifying the location of the corresponding data. Using the object ID and offset, the CTT 308 retrieves 310 the data from the cold tier 311. The CTT 308 verifies the returned data against the expected checksum and, if the expected and actual values match, returns 312 the data to the file server 303. If the stored data is compressed or encrypted, the CTT 308 may decompress or decrypt it before returning 312 it to the file server 303.
When file server 303 receives data from CTT 308, file server 303 stores the received data in a local cache. If a subsequent read request 306 is received from the client 301, the file server 303 returns 304 the desired data from the cache.
FIG. 3 provides an overall outline of the elements and communication paths in a read operation. If the data is stored locally at the file server 303, the read can be satisfied quickly. If the data is not stored locally, the file server 303 may return an error message to the client 301, causing the client to re-request the data periodically while the file server 303 retrieves the desired data asynchronously. This style of read avoids long-blocking client requests; instead, the client repeats the request until a specified number of failed attempts is reached or the desired data is received. Because the client 301 repeats the data request, the file server 303 need not retain information about client state while retrieving data from the cold tier. Using the process described with respect to FIG. 3, many client requests can be satisfied quickly, which reduces the number of pending requests on the server side and reduces the impact of a file server crash. Since there are typically many clients making requests to each file server, keeping more state on the client side means that more state survives a file server crash, so operations can resume faster.
FIG. 4 is a block diagram illustrating elements and communication paths in a write operation in a hierarchical file system, according to one embodiment. The components and processes described with respect to fig. 4 may be similar to the components and processes described with respect to fig. 1, 2B, and 3.
As shown in fig. 4, a file client 401 sends 402 a write request to a file server 403. The write request includes a modification to data stored by the file server 403 or a remote storage device, such as changing a portion of the stored data or adding to the stored data. The data to be modified may be replicated between multiple storage devices. For example, the data may be stored on both the file server 403 and one or more remote storage devices, or the data may be stored on multiple remote storage devices.
When the file server 403 receives a write request from the client 401, the file server 403 may allocate a new VCDID for the newly written data. The new data may be sent to any other storage devices 404 that maintain copies of the data to be modified, enabling those servers 404 to update their copies.
The file server 403 may examine the b-tree to retrieve the VCDID of the data to be modified. Using the retrieved VCDID, the file server 403 may access the metadata of the VCD from the VCDID map. If the metadata contains a list of one or more physical block addresses identifying the location of the data to be modified, the file server 403 may read the data from those locations and write it to the local cache. The file server 403 may then modify the cached data according to the instructions in the write request. The write operation may also be sent 406 to all devices that store copies of the data. Once the original data and the copies have been updated, the file server 403 may send 405 a response to the client 401 indicating that the write operation completed successfully.
If the metadata does not identify the physical block addresses of the data to be modified (e.g., if the mapping for the given VCDID is empty), the file server 403 identifies the object pool to which the data has been offloaded. Since retrieving data from the object pool takes more time than reading data from disk, the file server 403 may send 407 an error message (EMOVED) to the client 401. In response to the error message, the client 401 may delay a subsequent write operation 408 by a preset time interval. In some embodiments, the client 401 may repeat the write operation 408 a specified number of times. If the write still fails after the specified number of attempts, the client 401 may return an error to the application and make no further attempts to write the data. The time interval between write attempts may be constant or may be gradually increased after each failure.
After sending the EMOVED error message to the client 401, the file server 403 may begin recalling the data from the cold tier so that it can be updated. The file server 403 may send 409 a request to the CTT 410 with a list of one or more VCDIDs corresponding to the data to be modified.
The CTT 410 looks up the one or more VCDIDs in its translation table and retrieves 411 the data from the cold tier 412 using the object ID and offset returned by the table. The CTT 410 verifies the returned data against the expected checksum and, if the expected and actual values match, returns 413 the data to the file server 403. If the stored data is compressed or encrypted, the CTT 410 may decompress or decrypt it before returning 413 it to the file server 403.
When the file server 403 receives the data from the CTT 410, the file server 403 replicates 406 the unchanged data to any copies and writes the data to the local cache under the same VCDID (turning the data back into hot data). If a subsequent write request is received from the client 401, the file server 403 can overwrite the recalled data to update it according to the instructions in the write request.
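For illustration, the write path once the VCDID has been looked up might be sketched as follows; the EMOVED round trip is omitted, and the recall function, replica representation, and byte-range patch model are assumptions rather than the actual mechanisms.

```python
# Simplified write path once the VCDID has been looked up. The EMOVED round
# trip is omitted: the recall is done inline here, and the cache, replicas,
# and recall function are stand-ins for the real mechanisms.
def handle_write(vcdid, offset, patch, vcdid_map, local_cache, replicas, recall_fn):
    if vcdid_map.get(vcdid):                 # hot: the data is already local
        current = local_cache[vcdid]
    else:                                    # cold: recall through the CTT first
        current = recall_fn(vcdid)
    updated = current[:offset] + patch + current[offset + len(patch):]
    local_cache[vcdid] = updated             # recalled data becomes hot again
    for replica in replicas:                 # every copy applies the same write
        replica[vcdid] = updated
    return updated

# Usage with toy stand-ins: VCD 0xB2 was offloaded, so it is recalled, patched,
# cached locally, and written to the replica as well.
vcdid_map, cache, replica = {0xB2: None}, {}, {}
print(handle_write(0xB2, 4, b"WXYZ", vcdid_map, cache, [replica],
                   recall_fn=lambda _: b"0123456789"))   # b'0123WXYZ89'
```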
In the process described with respect to FIG. 4, the data flow is the same whether the data is stored locally at the file server 403 or has been offloaded to the cold tier. Because the write data is sent to the replicas before the b-tree is checked to determine the location of the data to be modified, the replicas may need to discard the write data if the data to be modified has already been offloaded. However, even though this process can produce replicated data that is subsequently discarded, that happens only when the data has been offloaded, and the file server 403 does not need different processes for data in the hot and cold tiers. In other embodiments, the steps described with reference to FIG. 4 may be performed in a different order; for example, the file server 403 may check the b-tree to identify the location of the data before sending the write to the replicas.
Cold tier storage using the object pool enables a new option for creating read-only images for disaster recovery (referred to herein as DR images). The object pool is typically hosted by a cloud service provider and is therefore stored on servers physically remote from the file server. A volume whose data has been offloaded to the cold tier may contain only metadata, and this metadata, together with the metadata stored in the volumes used by the cold tier services, occupies only a small fraction (e.g., less than 5%) of the storage space the data itself would occupy. By mirroring the user volumes and the volumes used by the cold tier services to a location remote from the file server (and thus potentially outside a disaster area affecting the file server), an inexpensive DR image can be constructed. For recovery, a new set of cold tier services can be instantiated, giving the DR image read-only access to a nearly consistent copy of the user volume.
Computer system
FIG. 5 is a block diagram of a computer system that may be used to implement certain features of some embodiments. The computer system may be a server computer, a client computer, a Personal Computer (PC), a user device, a tablet computer, a laptop computer, a Personal Digital Assistant (PDA), a cellular telephone, an iPhone, an iPad, a blackberry, a processor, a telephone, a network device, a network router, switch or bridge, a console, a handheld console, a (handheld) gaming device, a music player, any portable device, mobile device, handheld device, wearable device, or any machine capable of executing a sequence of instructions, sequential or otherwise, that specify operations to be performed by that machine.
Computing system 500 may include one or more central processing units ("processors") 505, memory 510, input/output devices 525 (e.g., keyboard and pointing devices, touch devices, display devices), storage devices 520 (e.g., disk drives), and network adapters 530 (e.g., network interfaces) connected to interconnect 515. Interconnect 515 is shown as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both, connected by appropriate bridges, adapters, or controllers. Thus, interconnect 515 may include, for example, a system bus, a Peripheral Component Interconnect (PCI) or PCI-Express bus, a HyperTransport or Industry Standard Architecture (ISA) bus, a Small Computer System Interface (SCSI) bus, a Universal Serial Bus (USB), an IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus (also known as FireWire).
Memory 510 and storage 520 are computer-readable storage media that may store instructions that implement at least portions of various embodiments. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium (e.g., a signal on a communication link). Various communication links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media may include computer-readable storage media (e.g., non-transitory media) and computer-readable transmission media.
The instructions stored in memory 510 may be implemented as software and/or firmware for programming processor 505 to perform the actions described above. In some embodiments, such software or firmware may be initially provided to the processing system 500 by being downloaded from a remote system through the computing system 500, for example, via the network adapter 530.
The various embodiments described herein may be implemented, for example, by programmable circuitry (e.g., one or more microprocessors) that may be programmed by software and/or firmware, or by dedicated hardwired (non-programmable) circuitry altogether, or by a combination of such forms.
Remarks for note
The foregoing description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the present disclosure. However, in certain instances, well-known details are not described in order to avoid obscuring the description. In addition, various modifications may be made without departing from the scope of the embodiments.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. In addition, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments.
The terms used in this specification generally have their ordinary meanings in the art, in the context of the present disclosure, and in the specific context in which each term is used. Certain terms used to describe the present disclosure are discussed above or elsewhere in the specification in order to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, certain terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting does not affect the scope and meaning of the term; in the same context, the terms have the same scope and meaning, whether highlighted or not. It should be understood that the same thing can be said in more than one way.
Thus, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is there any special meaning as to whether or not a term is set forth or discussed herein. Synonyms for certain terms are provided. The recitation of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification (including examples of any terms discussed herein) is illustrative only and is not intended to further limit the scope and meaning of the disclosure or any example terms. Also, the present disclosure is not limited to the various embodiments presented in this specification.
Without intending to further limit the scope of the present disclosure, examples of devices, apparatuses, methods, and their related results according to embodiments of the present disclosure are given above. Note that for the convenience of the reader, titles or subtitles may be used in the examples, which should in no way limit the scope of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. In case of conflict, the present document, including definitions, will control.

Claims (20)

1. A method, comprising:
receiving, at a file server, a request from a user device for data represented by a virtual cluster descriptor;
querying an identifier map using an identifier of the virtual cluster descriptor;
in response to the identifier mapping indicating that the requested data is stored at a location remote from the file server, accessing a cold layer translation table that stores a mapping between an identifier of each virtual cluster descriptor in a plurality of virtual cluster descriptors and a storage location of data associated with the respective virtual cluster descriptor;
query the cold layer translation table using the identifier of the virtual cluster descriptor associated with the requested data to identify a storage location of the requested data; and
loading the requested data from the identified storage location to the file server.
2. The method of claim 1, further comprising:
in response to the identifier mapping indicating that the requested data is stored locally at the file server, retrieving the requested data from the file server and providing the requested data to the user device.
3. The method of claim 1, further comprising:
further in response to the identifier mapping indicating that the requested data is stored at the location remote from the file server, sending a notification to the user device, the notification causing the user device to resend the request for data after a specified time interval.
4. The method of claim 3, wherein the notification causes the user device to resend the request for data a preset number of times.
5. The method of claim 3, wherein the notification causes the user device to increase an amount of time between each subsequent request for data.
6. The method of claim 1, further comprising:
identifying a data set stored at the file server to be offloaded from the file server to a new location remote from the file server, the identified data set associated with a second virtual cluster descriptor; and
updating the cold layer translation table to map an identifier of the second virtual cluster descriptor to the new location remote from the file server.
7. The method of claim 1, wherein the identifier map stores a mapping between an identifier of a virtual cluster descriptor and a physical storage location at the file server if data corresponding to the virtual cluster descriptor is stored at the file server, and wherein the identifier map stores a mapping between the identifier of the virtual cluster descriptor and a blank location if the data corresponding to the virtual cluster descriptor is stored remotely from the file server.
8. A method, comprising:
receiving, at a file server, a request for data stored at a cold storage location remote from the file server;
accessing a cold layer translation table that stores a mapping between an identifier of each of a plurality of virtual cluster descriptors and a storage location of data associated with the respective virtual cluster descriptor;
query the cold layer translation table using an identifier of a virtual cluster descriptor associated with the requested data to identify a storage location of the requested data; and
loading the requested data from the identified storage location to the file server.
9. The method of claim 8, further comprising:
storing an identifier map at the file server, the identifier map storing a mapping between an identifier of a virtual cluster descriptor and a physical storage location at the file server if data corresponding to the virtual cluster descriptor is stored at the file server, and the identifier map storing a mapping between the identifier of the virtual cluster descriptor and a blank location if the data corresponding to the virtual cluster descriptor is stored remotely from the file server.
10. The method of claim 9, further comprising:
querying the identifier map using the identifier of the virtual cluster descriptor associated with the requested data; and
querying the cold layer translation table in response to the identifier mapping indicating that the requested data is stored at a location remote from the file server.
11. The method of claim 8, further comprising:
in response to the request for the data, sending a notification to a user device, the notification causing the user device to resend the request for data after a specified time interval.
12. The method of claim 11, wherein the notification causes the user device to resend the request for data a preset number of times.
13. The method of claim 11, wherein the notification causes the user device to increase an amount of time between each subsequent request for data.
14. The method of claim 8, further comprising:
identifying a data set stored at the file server to be offloaded from the file server to a new location remote from the file server, the identified data set associated with a second virtual cluster descriptor; and
updating the cold layer translation table to map an identifier of the second virtual cluster descriptor to the new location remote from the file server.
15. A system, comprising:
a cold layer translator storing a translation table that maps an identifier of each of a plurality of virtual cluster descriptors to a physical storage location of data corresponding to each virtual cluster descriptor; and
a file server communicatively coupled to the cold layer translator, the file server configured to:
querying the cold layer translator using an identifier of a virtual cluster descriptor associated with requested data to identify a storage location of the requested data; and
loading the requested data from the identified storage location to the file server.
16. The system of claim 15, further comprising:
a cold layer offloader communicatively coupled to the file server and configured to:
identifying a data set stored at the file server to be offloaded from the file server to a new location remote from the file server, the identified data set associated with a second virtual cluster descriptor; and
updating the cold layer translation table to map an identifier of the second virtual cluster descriptor to the new location remote from the file server.
17. The system of claim 15, wherein the requested data is specified in a data request transmitted by a user device to the file server, and wherein the file server is further configured to:
in response to the data request, sending a notification to the user device, the notification causing the user device to resend the data request after a specified time interval.
18. The system of claim 17, wherein the notification causes the user device to resend the request for data a preset number of times.
19. The system of claim 17, wherein the notification causes the user device to increase an amount of time between each subsequent request for data.
20. The system of claim 15, wherein the requested data is specified in a data request transmitted by a user device to the file server, and wherein the file server is further configured to:
storing an identifier map that stores a mapping between an identifier of a virtual cluster descriptor and a physical storage location at the file server if data corresponding to the virtual cluster descriptor is stored at the file server, and that stores a mapping between the identifier of the virtual cluster descriptor and a blank location if the data corresponding to the virtual cluster descriptor is stored remotely from the file server; and
in response to the identifier mapping indicating that the requested data is stored locally at the file server, retrieving the requested data from the file server and providing the requested data to the user device.
CN201880065539.1A 2017-08-16 2018-08-17 Hierarchical storage in a distributed file system Pending CN111417939A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762546272P 2017-08-16 2017-08-16
US15/999,199 US11386044B2 (en) 2017-08-16 2018-08-17 Tiered storage in a distributed file system
PCT/US2018/000337 WO2019036045A1 (en) 2017-08-16 2018-08-17 Tiered storage in a distributed file system

Publications (1)

Publication Number Publication Date
CN111417939A true CN111417939A (en) 2020-07-14

Family

ID=65361903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880065539.1A Pending CN111417939A (en) 2017-08-16 2018-08-17 Hierarchical storage in a distributed file system

Country Status (4)

Country Link
US (1) US11386044B2 (en)
CN (1) CN111417939A (en)
DE (1) DE112018004178B4 (en)
WO (1) WO2019036045A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113973112A (en) * 2020-07-23 2022-01-25 戴尔产品有限公司 Method for optimizing access to data nodes of a data cluster using a data access gateway and a bid based on metadata mapping
CN113973112B (en) * 2020-07-23 2024-04-26 戴尔产品有限公司 Method and system for optimizing access to data nodes of a data cluster using a data access gateway and metadata mapping based bidding

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552389B2 (en) * 2017-04-28 2020-02-04 Oath Inc. Object and sequence number management
US10635632B2 (en) 2017-08-29 2020-04-28 Cohesity, Inc. Snapshot archive management
US11874805B2 (en) 2017-09-07 2024-01-16 Cohesity, Inc. Remotely mounted file system with stubs
US11321192B2 (en) 2017-09-07 2022-05-03 Cohesity, Inc. Restoration of specified content from an archive
US10719484B2 (en) 2017-09-07 2020-07-21 Cohesity, Inc. Remotely mounted file system with stubs
US10721304B2 (en) 2017-09-14 2020-07-21 International Business Machines Corporation Storage system using cloud storage as a rank
US10372363B2 (en) 2017-09-14 2019-08-06 International Business Machines Corporation Thin provisioning using cloud based ranks
US10581969B2 (en) 2017-09-14 2020-03-03 International Business Machines Corporation Storage system using cloud based ranks as replica storage
CN110519776B (en) * 2019-08-07 2021-09-17 东南大学 Balanced clustering and joint resource allocation method in fog computing system
US10942852B1 (en) 2019-09-12 2021-03-09 Advanced New Technologies Co., Ltd. Log-structured storage systems
WO2019228571A2 (en) 2019-09-12 2019-12-05 Alibaba Group Holding Limited Log-structured storage systems
SG11202002588RA (en) 2019-09-12 2020-04-29 Alibaba Group Holding Ltd Log-structured storage systems
SG11202002363QA (en) 2019-09-12 2020-04-29 Alibaba Group Holding Ltd Log-structured storage systems
EP3673376B1 (en) 2019-09-12 2022-11-30 Advanced New Technologies Co., Ltd. Log-structured storage systems
WO2019228568A2 (en) * 2019-09-12 2019-12-05 Alibaba Group Holding Limited Log-structured storage systems
WO2019228575A2 (en) 2019-09-12 2019-12-05 Alibaba Group Holding Limited Log-structured storage systems
SG11202002732TA (en) 2019-09-12 2020-04-29 Alibaba Group Holding Ltd Log-structured storage systems
SG11202002027TA (en) 2019-09-12 2020-04-29 Alibaba Group Holding Ltd Log-structured storage systems
US11487701B2 (en) * 2020-09-24 2022-11-01 Cohesity, Inc. Incremental access requests for portions of files from a cloud archival storage tier
US11669318B2 (en) * 2021-05-28 2023-06-06 Oracle International Corporation Rapid software provisioning and patching
US11907241B2 (en) * 2022-06-17 2024-02-20 Hewlett Packard Enterprise Development Lp Data recommender using lineage to propagate value indicators

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050160229A1 (en) * 2004-01-16 2005-07-21 International Business Machines Corporation Method and apparatus for preloading translation buffers
JP5032191B2 (en) * 2007-04-20 2012-09-26 株式会社日立製作所 Cluster system configuration method and cluster system in server virtualization environment
US7840839B2 (en) 2007-11-06 2010-11-23 Vmware, Inc. Storage handling for fault tolerance in virtual machines
US8229945B2 (en) * 2008-03-20 2012-07-24 Schooner Information Technology, Inc. Scalable database management software on a cluster of nodes using a shared-distributed flash memory
US9600558B2 (en) 2013-06-25 2017-03-21 Google Inc. Grouping of objects in a distributed storage system based on journals and placement policies
US9489239B2 (en) 2014-08-08 2016-11-08 PernixData, Inc. Systems and methods to manage tiered cache data storage


Also Published As

Publication number Publication date
WO2019036045A8 (en) 2020-10-08
WO2019036045A1 (en) 2019-02-21
US20190095458A1 (en) 2019-03-28
DE112018004178B4 (en) 2024-03-07
DE112018004178T5 (en) 2020-05-14
US11386044B2 (en) 2022-07-12

Similar Documents

Publication Publication Date Title
US11386044B2 (en) Tiered storage in a distributed file system
US11169972B2 (en) Handling data extent size asymmetry during logical replication in a storage system
US11797510B2 (en) Key-value store and file system integration
US20190370225A1 (en) Tiered storage in a distributed file system
US9798486B1 (en) Method and system for file system based replication of a deduplicated storage system
US11687265B2 (en) Transferring snapshot copy to object store with deduplication preservation and additional compression
US8166260B2 (en) Method and system for managing inactive snapshot blocks
US11144503B2 (en) Snapshot storage and management within an object store
US11625306B2 (en) Data connector component for implementing data requests
US20230336183A1 (en) File system format for persistent memory
US11144498B2 (en) Defragmentation for objects within object store
US11630807B2 (en) Garbage collection for objects within object store
US9396205B1 (en) Detection and handling of namespace changes in a data replication system
US11914884B2 (en) Immutable snapshot copies stored in write once read many (WORM) storage
US11061868B1 (en) Persistent cache layer to tier data to cloud storage
WO2023009769A1 (en) Flexible tiering of snapshots to archival storage in remote object stores
US20220107916A1 (en) Supporting a lookup structure for a file system implementing hierarchical reference counting
US9626332B1 (en) Restore aware cache in edge device
CN117076413B (en) Object multi-version storage system supporting multi-protocol intercommunication
US20240126466A1 (en) Transferring snapshot copy to object store with deduplication preservation and additional compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination