US20150248443A1 - Hierarchical host-based storage - Google Patents

Hierarchical host-based storage

Info

Publication number
US20150248443A1
Authority
US
United States
Prior art keywords
network node
memory
network
memory record
file system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/635,261
Inventor
Amit Golander
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NetApp Inc
Original Assignee
Plexistor Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Plexistor Ltd filed Critical Plexistor Ltd
Priority to US14/635,261
Assigned to Plexistor Ltd. reassignment Plexistor Ltd. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GOLANDER, AMIT
Publication of US20150248443A1
Assigned to NETAPP, INC. reassignment NETAPP, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: Plexistor Ltd.
Status: Abandoned (current)

Classifications

    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/182 Distributed file systems
    • G06F16/185 Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
    • G06F16/2455 Query execution
    • G06F17/30312
    • G06F17/30477
    • H04L47/70 Admission control; Resource allocation
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • the present invention in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.
  • Direct-attached storage (DAS) is a model in which data is local on a server and benefits from low-latency access. However, when multiple servers are connected to a network, the DAS model is inefficient, because there is no resource sharing between servers; inconvenient, because data cannot be shared between processes running on different application servers; and not resilient, because data is lost upon a single server failure.
  • Shared-storage systems store all or most metadata and data on a server, which is typically an over-the-network server and not the same server that runs the application/s that generates and consumes the stored data.
  • This architecture can be seen both in traditional shared storage systems, such as NetApp FAS and/or EMC Isilon, where all of the data is accessed via the network; and/or in host-based storage, such as Redhat Gluster and/or EMC Scale-io, in which application servers also run storage functions, but the data is uniformly distributed across the cluster of servers (so 1/n of the data is accessed locally by each server and the remaining (n−1)/n of the data is accessed via the network).
  • Another well-known variant of shared storage is shared storage with (typically read) caches. In this design, the application server includes local storage media (such as a Flash card) that holds data recently accessed by the application server, which is typically beneficial for recurring read requests. Caching can be used in front of a traditional shared storage (for example in the Linux block layer cache (BCache)), or in front of a host-based storage (for example in VMware vSAN). These caching solutions tend to be block-based solutions, i.e. a DAS file system layer on top of a shared block layer.
  • Some storage protocols, such as the Hadoop distributed file system (HDFS) and the parallel network file system (pNFS), allow metadata to be served from a centralized shared node while data is served from multiple nodes; the data (not the metadata) is typically uniformly distributed among the nodes for load balancing purposes.
  • a method of accessing a memory record in distributed network storage comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for accessing one of the plurality of memory records, the request is received from an application executed in the first network node; querying a first file system segment stored in the first network node for the memory record; when the memory record is missing from the first memory records subset, querying for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node; and providing the first network node with an access to the memory record at the second network node via a network according to the address.
  • the providing comprises establishing a direct communication channel between the first network node and the second network node via the network according to the address to provide the access.
  • the querying for the address includes: sending a request to a catalog service via the network; and receiving a reply message from the catalog service, the reply message including the address.
  • the querying for the address includes sending a request to each of the plurality of network nodes to receive the address.
  • the querying for the address includes querying for a last known location of the memory record cached in the first file system segment.
  • the second network node temporarily blocks write access to the memory record for the first network node when the memory record is currently accessed by any other of the plurality of network nodes.
  • the second network node temporarily blocks access to the memory record for the first network node when the memory record is currently written by any other of the plurality of network nodes.
  • a copy of the memory record is also stored in a third of the plurality of network nodes.
  • the method further comprises: when the second network node is unavailable, querying for an address of the third network node; and establishing a direct communication channel between the first network node and the third network node via the network according to the address to provide access to the memory record.
  • a copy of the memory record is also stored in the first network node and may be accessed instead of accessing the memory record at the second network node via the network.
  • the method further comprises, before the querying: querying for an address of a directory containing the memory record; and querying for an address of the memory record in the directory.
  • the memory record includes multiple file segments.
  • the querying for the address includes providing an inode number of the memory record.
  • the querying for the address includes providing a layout number of the memory record.
  • a computer readable medium comprising computer executable instructions adapted to perform the method.
  • a system of managing a distributed network storage comprising: a file system segment stored in a first of a plurality of network nodes, the file system segment is one of a plurality of file system segments of a file system mapping a plurality of memory records; a program store storing a storage managing code; and a processor, coupled to the program store, for implementing the storage managing code, the storage managing code comprising: code to receive an access request to a memory record of the plurality of memory records from an application executed in the first network node; code to query the file system segment for the memory record in the first memory records subset; code to query for an address of a second network node of the plurality of network nodes when the memory record is missing from the first memory records subset, wherein the memory record is stored in a second memory records subset of the second network node; and code to provide the first network node with an access to the memory record at the second network node via a network according to the address.
  • a distributed network storage system comprising: a plurality of network nodes connected via a network, each including a storage managing module; a plurality of file system segments of a file system, each stored in one of the plurality of network nodes; a plurality of memory records managed by the plurality of file system segments, wherein each of the plurality of memory records is owned by one of the plurality of network nodes and stored in at least one of the plurality of network nodes; and wherein when an application executed in a first of the plurality of network nodes requests an access to one of the plurality of memory records, and the memory record is missing from a memory records subset stored in the first network node, a storage managing module included in the first network node queries for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node; and providing the first network node with an access to the memory record at the second network node
  • a method of creating a memory record in distributed network storage comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for creating a new one of the plurality of memory records, the request is received from an application executed in the first network node; creating the memory record in the first network node; and registering the memory record in a catalog service via the network.
  • the creating includes assigning a prefix unique to the first network node to an inode number of the memory record.
  • Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
  • For example, one or more tasks may be performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • FIG. 1 is a schematic illustration of a distributed network storage system that includes memory records managed by a shared file system, according to some embodiments of the present invention
  • FIG. 2A is a schematic illustration of an exemplary file system segment stored by a network node, according to some embodiments of the present invention.
  • FIG. 2B is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes, according to some embodiments of the present invention
  • FIG. 2C is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B , stored by a network node, according to some embodiments of the present invention
  • FIG. 3 is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention
  • FIG. 4 is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention.
  • FIG. 5 is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.
  • Storage media, typically thought of as non-volatile memory such as magnetic hard-disk drive (HDD) or Flash-based solid-state drive (SSD), offers affordable capacity, but at 1,000 to 100,000 times longer latency compared to volatile memory such as dynamic random-access memory (DRAM).
  • Newly developed storage media, such as storage class memory (SCM), which is a form of persistent memory, promises DRAM-like ultra-low latency. When ultra-low latency storage is used, network latency is no longer a relatively insignificant delay like in traditional shared storage architectures, and new shared storage architectures are required that minimize network access and therefore overall network latency.
  • a hierarchical shared file system and methods of managing the file system by distributing segments of the file system to reduce network latency and augmenting local file management into a distributed storage solution are provided. These embodiments are a hybrid between DAS and shared storage. In this system, metadata and data are predicted to be local, and the rest of the shared file system hierarchy is only searched upon a misprediction.
  • the system includes multiple memory records that are stored in multiple network nodes.
  • Each network node stores a segment of the file system that maps a subset of the memory records stored in that network node.
  • Each memory record such as a record represented by an inode in Linux or an entry in the master file table in Windows' new technology file system (NTFS), is a directory or a file in the file system or a file segment such as a range of data blocks.
  • Each memory record is owned (e.g. access managed and/or access controlled) by a single network node in the system, at a given time.
  • the owning network node is the only entity in the system that is allowed to commit changes to its memory records.
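Purely as an illustration of the ownership model just described, the following minimal Python sketch shows one way a network node could represent its memory records and the file system segment that maps them. The names (MemoryRecord, RecordKind, FileSystemSegment, owner_node) are assumptions for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum

class RecordKind(Enum):
    FILE = "file"
    DIRECTORY = "directory"
    LAYOUT = "layout"      # e.g. a file segment / range of data blocks

@dataclass
class MemoryRecord:
    inode_num: int         # inode-like identifier of the record
    kind: RecordKind
    owner_node: str        # ID of the single node currently allowed to commit changes
    data: bytes = b""

@dataclass
class FileSystemSegment:
    """The subset of the shared file system mapped (and owned) by one network node."""
    node_id: str
    records: dict = field(default_factory=dict)   # inode_num -> MemoryRecord

    def lookup(self, inode_num):
        """Return the record if this node stores it locally, else None."""
        return self.records.get(inode_num)
```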
  • A memory record requested by an application that is executed in one of the network nodes is first speculated to be owned, and therefore stored, in a local memory of that network node.
  • When the prediction is correct, only local information is traversed, which results in ultra-low-latency access.
  • When the speculation fails and the memory record is missing, it is searched for in other network nodes in the system.
  • Optionally, when the last network node known to own the memory record is cached as a hint, this last known network node is contacted first with a request for the memory record.
  • Optionally, the search is done by connecting to a catalog service (typically over the network), which replies with the address of the network node that currently owns the desired memory record.
  • the logically centralized catalog service is sharded, so that each subset of the catalog, such as a range of inode numbers or range of hash values of inode numbers, is served by a different network node, for load balancing purposes.
  • Alternatively, the information of memory record ownership is distributed between the network nodes in the system, and the search is done by storing and maintaining hints in the parent directory of each file, by broadcasting the search query to all the network nodes, and/or by sequentially querying different nodes and/or groups of nodes based on an estimation and/or geographical or contextual proximity, as sketched below.
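As a rough illustration of the search alternatives just listed, the sketch below tries a cached hint first and then queries candidate nodes group by group in proximity order. The helper names (find_owner, node_owns, node_groups_by_proximity) are assumptions for illustration only, not part of the disclosed system.

```python
def find_owner(inode_num, cached_hint, node_groups_by_proximity, node_owns):
    """Locate the node owning 'inode_num' using the search order described above.

    cached_hint              -- last node known to own the record (e.g. from a parent
                                directory hint), or None
    node_groups_by_proximity -- groups of candidate nodes, nearest first
                                (e.g. same data center before remote sites)
    node_owns                -- function(node, inode_num) -> bool, a remote query
    """
    if cached_hint is not None and node_owns(cached_hint, inode_num):
        return cached_hint                     # the hint was still valid
    for group in node_groups_by_proximity:     # sequential, proximity-ordered querying
        for node in group:
            if node_owns(node, inode_num):
                return node
    return None                                # not found on any queried node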
  • a request is sent to the owning network node and access to the memory record is granted.
  • the ownership of the memory record is changed, for example due to trends in data consumption and/or failures, and the memory record is transferred to the requesting network node.
  • copies of some or all of the memory records are also stored as a secondary and/or backup in one or more of the other network nodes, that may replace the owning network node, for example in case of failure and/or network traffic load, and create high availability and persistency of the data.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a schematic illustration of a distributed network storage system, such as a cluster of application servers, that includes memory records managed by a shared file system wherein segments of the file system are stored in a common network node with the records they map, according to some embodiments of the present invention.
  • the system includes multiple network nodes 110 such as application servers, each storing, in a memory 120 , a subset of memory records from all the memory records stored in the system.
  • Each one of network nodes 110 includes a file system segment of the file system, mapping the subset of memory records.
  • a memory record represents an addressable data unit, such as a file, a directory, a layout or a file segment such as a data block.
  • a memory record may be of any offset and size. The data block does not have to reside on memory storage media.
  • Each memory record is owned by one network node. Owning a memory record means that the owning network node is storing its memory records and is the only entity in the system that is allowed to commit changes to its memory records. Committing changes may include modifying a file by changing the data (e.g. write system calls), cutting it short (e.g. truncate), adding information (e.g. append) and many other portable operating system interface (POSIX) variants. Similar commit operations are required in a directory for rename, touch, remove and other POSIX commands. A sketch of such an owner-only commit check is given below.
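Below is a minimal sketch of such an owner-only commit guard, reusing the hypothetical FileSystemSegment and MemoryRecord types from the earlier sketch; a real implementation would of course cover the full set of POSIX operations rather than a single mutate callback.

```python
class NotOwnerError(Exception):
    """Raised when a node is asked to commit a change to a record it does not own."""

def commit_change(segment, inode_num, mutate):
    """Apply 'mutate' (e.g. a write, truncate or append) to a locally owned record.

    segment   -- this node's FileSystemSegment (see the earlier sketch)
    inode_num -- identifier of the record to modify
    mutate    -- function(MemoryRecord) performing the actual modification
    """
    record = segment.lookup(inode_num)
    if record is None or record.owner_node != segment.node_id:
        # Not owned here: the request must be forwarded to the owning node,
        # or an ownership migration must be negotiated, before committing.
        raise NotOwnerError(inode_num)
    mutate(record)
    return record
```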
  • Each of network nodes 110 may be, for example, a physical computer such as a mainframe computer, a workstation, a conventional personal computer (PC), a server-class computer or multiple connected computers, and/or a virtual server.
  • Network nodes 110 are connected via one or more network(s) 101 .
  • Network 101 may include, for example, LAN, high performance computing network such as Infiniband and/or any other network.
  • Optionally, network 101 is comprised of a hierarchy of different networks, such as multiple vLANs or a hybrid of LANs connected by a WAN.
  • Memory 120 of each network node 110 may include, for example non-volatile memory (NVM), also known as persistent memory (PM), and/or solid-state drive (SSD), and/or magnetic hard-disk drive (HDD), and optionally optical disks or tape drives.
  • These technologies can be internal or external devices or systems, including memory bricks accessible by the network node 110 .
  • Memory 120 may also be made by DRAM that is backed up with supercapacitor and Flash, or other hybrid technologies. In some use cases, such as ephemeral computing (e.g. most cloud services), memory 120 may even be comprised of volatile memory.
  • Memory 120 may also include a partition in a disk drive, or even a file, a logical volume presented by a logical volume manager built on top of one or more disk drives, a persistent media, such as non-volatile memory (NVDIMM or NVRAM), a persistent array of storage blocks, and/or the like. Memory 120 may also be divided into a plurality of volumes, also referred to as partitions or block volumes. When memory 120 includes non-volatile memory, which is considerably faster than other storage media types, the reduced network latency achieved by the method is more significant.
  • the file system segment stored by each network node is typically implemented using tree structures.
  • One tree supports lookups, adds and removes of inodes, while another is used to do the same for data blocks per file or directory.
  • Some file systems may be implemented using other structures such as hash tables.
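The following sketch illustrates the two-index idea in plain Python, with a dictionary standing in for the inode tree and a sorted list of (offset, block) pairs standing in for the per-file block tree. It is an illustrative simplification under those assumptions, not the structure used by any particular file system.

```python
import bisect

class SegmentIndex:
    """One index for inodes plus, per inode, an offset-ordered index of data blocks."""

    def __init__(self):
        self.inodes = {}   # inode number -> inode metadata (stand-in for the inode tree)
        self.blocks = {}   # inode number -> sorted list of (offset, data block)

    def add_inode(self, inode_num, metadata):
        self.inodes[inode_num] = metadata
        self.blocks.setdefault(inode_num, [])

    def remove_inode(self, inode_num):
        self.inodes.pop(inode_num, None)
        self.blocks.pop(inode_num, None)

    def put_block(self, inode_num, offset, block):
        """Insert or replace the data block at 'offset', keeping blocks sorted by offset."""
        entries = self.blocks.setdefault(inode_num, [])
        offsets = [o for o, _ in entries]
        i = bisect.bisect_left(offsets, offset)
        if i < len(entries) and entries[i][0] == offset:
            entries[i] = (offset, block)
        else:
            entries.insert(i, (offset, block))

    def get_block(self, inode_num, offset):
        """Return the block stored at exactly 'offset', or None."""
        for o, block in self.blocks.get(inode_num, []):
            if o == offset:
                return block
        return None
```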
  • FIG. 2A is a schematic illustration of an exemplary file system segment stored by a network node 111 , according to some embodiments of the present invention.
  • Marked entities are owned by network node 111 , while white entities are not owned by network node 111 , which may or may not have a hint as to who owns them.
  • In this example, the first, second and seventh inodes are owned by network node 111 .
  • The third to sixth, as well as the eighth, tenth and beyond, are not, and may not even exist in the cluster.
  • FIG. 2A implies that the entire information represented by an inode is fully owned by a single node. While a convenient implementation, it is also possible to partition a file into layouts, of fixed or flexible sizes and let different nodes own different layouts.
  • the information of memory record ownership may be implemented using a different architecture at the second level of the hierarchy, such as a catalog service and partially independent local file systems in the first level of the hierarchy; or using a distributed architecture such as shared (even if hierarchical) file systems.
  • the file system segment stored in a network node includes cached hints of the last known network node to own the memory record.
  • each network node holds and mainly uses a subset of the global metadata, but there is for example a centralized catalog service 102 that provides the owning network node per memory record number.
  • Centralized catalog service 102 may be any kind of network node, as described above.
  • Centralized catalog service 102 may be implemented, for example, by using off-the-shelf key-value store services such as Redis or by implementing a network layer on top of a hash structure, in which the key is the inode number (or some hash applied to it) and the value is the network node identification (ID) (e.g. internet protocol (IP) address).
  • ID network node identification
  • IP internet protocol
  • the catalog service is sharded or distributed, but logically acts as a centralized one.
  • Optionally, a subset S out of the N network nodes is also used for holding a subset of the ownership information, and in order to know which shard [0, 1, . . . , (S−1)] serves a particular inode number, the modulo of that number divided by S is calculated, as sketched below.
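A minimal sketch of such a sharded catalog client follows, assuming each shard exposes a Redis-style get/set interface; the class and method names are illustrative assumptions, and the shard serving an inode number is chosen by the modulo operation described above.

```python
class CatalogClient:
    """Resolves an inode number to the ID/address of the network node that owns it."""

    def __init__(self, shard_stores):
        # shard_stores: list of S key-value stores (e.g. Redis-like clients), one per shard
        self.shards = shard_stores

    def _shard_for(self, inode_num):
        return self.shards[inode_num % len(self.shards)]   # shard index in [0, S-1]

    def register(self, inode_num, node_id):
        """Record that 'node_id' currently owns the record 'inode_num'."""
        self._shard_for(inode_num).set(str(inode_num), node_id)

    def lookup(self, inode_num):
        """Return the owning node ID, or None if no node has registered the record."""
        return self._shard_for(inode_num).get(str(inode_num))
```

For example, with three shard stores, CatalogClient([shard0, shard1, shard2]).lookup(inode_num) would consult shard number inode_num % 3.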
  • FIG. 2B is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes 110 , according to some embodiments of the present invention.
  • FIG. 2C is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B , stored by a network node 111 , according to some embodiments of the present invention.
  • Lightly marked entities are owned by network node 111 .
  • The ownership of white entities is unknown to network node 111 , while other, darker entities are considered speculated cached hints for performance optimization purposes.
  • Hints are often located at elements higher in the hierarchy than, or sibling to, the owned elements, and not at the owned elements themselves. Also, hints may be around data which is mirrored in the local network node but owned by another network node.
  • copies of some or all of the memory records are stored in one or more of the network nodes, for example in network node 111 , in addition to the owning network node.
  • a secondary and/or backup network node may replace the owning network node, for example in case of failure and/or network traffic load, and create high availability and persistency of the data. This could also be leveraged for load balancing for read-only access requests, when no writing is needed.
  • a particular file, such as a golden image or template may be mirrored multiple times or even to all network nodes, in order to reduce network traffic and/or increase local deduplication ratio.
  • Ownership of entire clones and snapshots, or even sets of snapshots, can be re-evaluated as a whole, and outweigh the per-file or per-layout ownership process.
  • a remote and cheaper site e.g. cloud storage
  • all versions probably in an efficient deduplicated format, reside on the relevant network node.
  • Such a remote site may be accessed, for example, via the network file system (NFS) protocol, cloud storage such as AWS S3, or the Hypertext Transfer Protocol (HTTP).
  • a shared file system architecture could replace inodes with similar pointers.
  • FIG. 3 is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention.
  • the memory records are stored in network nodes 110 (such as network nodes 111 and 112 ), to create the file system.
  • the file system may include root, catalog service, tables of nodes, etc. Data and metadata are also created.
  • a request for accessing one of the memory records is received by a storage managing module 131 of network node 111 from an application 141 executed in network node 111 , for example to read or write B bytes at offset O in a file.
  • the application may be any software component that accesses, for example for read and/or writes operations, either directly or via libraries, middleware and/or an overlay file system manager.
  • the file system segment stored in network node 111 is queried for the requested memory record. This may be done by storage managing module 131 in any known way of memory access.
  • When the memory record is found locally, as shown in 304 , it is accessed and the data is provided to the application. In this case, no network latency is experienced, as no access to network 101 is required.
  • The address may be, for example, an IP address of network node 112 , an identification number of network node 112 such as ‘NodeID’ which can be used to calculate or look up the IP address, a media access control (MAC) address, an HTTP address, or any other network address.
  • Optionally, when the last known location of the memory record is cached in the first file system segment, network node 111 connects to this last known network node with a request for the memory record, before searching other network nodes.
  • This last known network node may be network node 112 still owning and storing the memory record, or may provide the address of network node 112 where the memory record is stored, or may fail and respond that the hint is no longer correct.
  • the search is done by connecting to a catalog service, such as a centralized catalog service 102 , typically over network 101 , which replies with the address of network node 112 .
  • Optionally, the search is done by broadcasting the search query to all the network nodes and/or by sequentially querying different nodes and/or groups of nodes based on an estimation and/or geographical or contextual proximity. For example, when the cluster of network nodes is spread over multiple data centers, missing memory records are first searched for in the same data center because of the superior local network resources and the higher probability of data sharing. In an opposite example, certain data is expected to be found in another geography at certain hours, for example when two sites are in different time zones and the output data of the first team is used by the second team as their input data.
  • Optionally, the memory record representing the parent directory may be queried before querying for the requested memory record. This may be the direct parent directory and also other directories up in the hierarchy.
  • the memory record representing the parent directory may be locally owned by network node 111 , may be owned by network node 112 or may be owned by a different network node. This process may be repeated for any directory structure. When the memory record representing the parent directory is found, it is read and may be locally cached and saved as a future hint.
  • a direct communication channel is established between network node 111 and network node 112 for example via network 101 , according to the address. Access to the memory record is then provided to network node 111 .
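Putting these steps together, the sketch below shows the access path of FIG. 3 in simplified form: the local segment is queried first, the catalog is consulted on a miss, and a direct channel to the owning node is then opened. The names local_segment, catalog, connect_to and the response object's status/record fields are assumptions for illustration, including a simple handling of the temporary-block (RETRY) case discussed in the following paragraphs.

```python
def access_record(inode_num, local_segment, catalog, connect_to, retries=3):
    """Access a memory record: local segment first, then the catalog and a direct channel.

    local_segment -- this node's file system segment (queried first, with no network cost)
    catalog       -- catalog client mapping inode number -> owning node address
    connect_to    -- function(address) returning a channel with a request(inode_num) method
    """
    record = local_segment.lookup(inode_num)
    if record is not None:
        return record                          # the local prediction was correct

    for _ in range(retries):
        address = catalog.lookup(inode_num)
        if address is None:
            raise FileNotFoundError(inode_num)
        response = connect_to(address).request(inode_num)
        if response.status == "OK":
            return response.record             # remote IO access (or start of a migration)
        if response.status != "RETRY":
            raise PermissionError(inode_num)   # permanently blocked, e.g. corrupted data
        # "RETRY": the record is temporarily locked by another node; ask again
    raise TimeoutError(inode_num)
```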
  • network node 112 makes a decision to either change the ownership of the layout, file or directory, and start a migration process of the memory record and potentially its surrounding information to network node 111 , or to perform a remote input/output (IO) access protocol with network node 111 .
  • the IO protocol may include any over-the-network file-access interfaces/libraries and object store application programming interface (API).
  • network node 112 may block access, temporarily or permanently, to the memory record.
  • a temporary block for example, may occur when the memory record is currently locked or accessed by an application 142 or another one of network nodes 110 so network node 112 temporarily blocks read and/or write access for network node 111 (e.g. via a RETRY response).
  • Another type of blocked response may occur if the requested data is corrupted with no means to reconstruct it.
  • In such cases, network node 111 queries, by storage managing module 131 , for an address of another network node having a copy of the memory record.
  • Storage managing module 131 may query, for example, via centralized catalog service 102 .
  • a copy of the memory record may be located in network node 111 , saving the need to traverse network 101 .
  • Optionally, network node 112 informs the network nodes holding copies of any changes made to the memory record, or at least invalidates them, so that outdated copies may be removed or updated.
  • network node 112 may send a notification to centralized catalog service 102 regarding the changes.
  • network node 112 records a small number of network nodes to be informed and uses broadcast when that number crosses a threshold.
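One possible way to realize this bookkeeping, sketched here with assumed send_invalidate and broadcast_invalidate callbacks (the patent does not specify these interfaces), is for the owning node to track copy holders per record and to switch from targeted notifications to a broadcast once their number crosses a threshold.

```python
class CopyTracker:
    """Kept by the owning node: which other nodes hold copies, and how to notify them."""

    def __init__(self, send_invalidate, broadcast_invalidate, threshold=8):
        self.copies = {}                                   # inode_num -> set of node IDs
        self.send_invalidate = send_invalidate             # function(node_id, inode_num)
        self.broadcast_invalidate = broadcast_invalidate   # function(inode_num)
        self.threshold = threshold

    def record_copy(self, inode_num, node_id):
        """Remember that 'node_id' now holds a copy of the record."""
        self.copies.setdefault(inode_num, set()).add(node_id)

    def notify_change(self, inode_num):
        """Invalidate or update copies after the owner commits a change."""
        holders = self.copies.get(inode_num, set())
        if len(holders) > self.threshold:
            self.broadcast_invalidate(inode_num)           # many holders: one broadcast
        else:
            for node_id in holders:
                self.send_invalidate(node_id, inode_num)   # few holders: targeted messages
```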
  • FIG. 4 is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention.
  • the exemplary scenario is demonstrated using the POSIX semantics and Linux implementation (e.g. virtual file system (VFS) and its dentry cache).
  • the exemplary scenario shows different types of local ownership, remote ownership and caching, which are underlined.
  • a network node ‘Node_c’ 421 includes an application (App), a VFS, a front end (FE) and a local file system (FS).
  • The front end is optional and/or implementation-dependent, as some flexibility exists in the way the front end is implemented; for example, unlike in this example, the front end may just be an escape option in the local file system.
  • the FE and FS may both be represented as storage managing module 130 in FIG. 1 . Other ways to partition storage managing module 130 exist, such as shown below as optional.
  • the application requests to open file c that is located in directory b that is in directory a ( 401 ) that is under the root.
  • the open function call may include relevant flag argument(s) that can later be passed to NodeID.
  • the VFS looks for directory a in root, which turns out to be locally cached, and then looks for directory b in directory a, which is not locally cached ( 402 ).
  • the lookup request is then transferred to the FE ( 403 ) and then to the local FS.
  • the local FS predicts that directory b is locally owned and stored, looks for directory b, finds that it is indeed locally owned and returns it to the FE ( 404 ).
  • the FE then returns directory b to the VFS ( 405 ), which caches it ( 406 ), looks up file c in directory b, finds the inode number (inodeNum) of file c, but the inode itself is not found in the VFS inode cache ( 406 ).
  • the open inodeNum c request is then transferred to the FE ( 406 ) and to the local FS ( 407 ).
  • the local FS predicts that file c is locally owned and stored, tries to open file c but finds that this is a misprediction and returns, because inodeNum c is not locally owned ( 408 ).
  • the FE then connects over the network to the catalog service, ‘Node_cs’ 422 , requesting the value that matches the inodeNum key ( 409 ).
  • the catalog service is a key-value store, so it searches and returns the value that matched the inodeNum key.
  • the value is NodeID, i.e. the identity of the network node that owns file c ( 410 ).
  • the FE receives the NodeID, checks for its validity, calculates the node address and establishes a direct connection (P2P handshake) to the owning network node based on the NodeID ( 411 ).
  • the open request is complete and returns to the VFS ( 412 ), which caches it in its inode cache and returns the file descriptor to the application ( 413 ).
  • When a memory record has to be created, for example as requested by application 141 , it is created locally in memory 121 of network node 111 .
  • Storage managing module 131 assigns a new inode number to the new memory record.
  • the inode number may include a prefix unique to network node 111 .
  • Storage managing module 131 then updates the directory containing the new memory record, and when the directory is owned by another network node 112 , contacts network node 112 to update.
  • network node 111 registers the new memory record in centralized catalog service 102 by connecting to centralized catalog service 102 and providing the new inode number. Similarly, deletion of a memory record is done by the owning network node and the centralized catalog service 102 is updated.
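The creation path just described might look like the following sketch, which reuses the hypothetical MemoryRecord type and catalog client from the earlier sketches; the particular bit layout of the node-unique inode-number prefix, and all helper names, are assumptions for illustration.

```python
NODE_PREFIX_BITS = 48   # assumed split: high bits identify the node, low bits are a counter

def create_record(segment, catalog, parent_dir, name, node_prefix, local_counter):
    """Create a record locally, update its parent directory and register it in the catalog.

    node_prefix   -- small integer unique to the creating node (makes inode numbers unique)
    local_counter -- per-node, monotonically increasing counter
    parent_dir    -- directory object with an add_entry(name, inode_num) method; it may
                     be owned by, and therefore updated at, another network node
    """
    inode_num = (node_prefix << NODE_PREFIX_BITS) | local_counter

    record = MemoryRecord(inode_num=inode_num, kind=RecordKind.FILE,
                          owner_node=segment.node_id)
    segment.records[inode_num] = record             # created and owned locally first

    parent_dir.add_entry(name, inode_num)           # update the containing directory
    catalog.register(inode_num, segment.node_id)    # make the new record discoverable
    return record
```

Because the prefix occupies the high bits, two nodes can create records concurrently without coordinating on inode numbers.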
  • FIG. 5 is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.
  • the system looks for the directories and file as described above ( 402 - 406 ). However, when file c is not mentioned in directory b and the application requested the open system call using an O_CREAT argument (back in step 501 ), then the FE is requested to create a new file, a request that continues to the local FS ( 507 ).
  • the local FS creates file c locally and assigns a new inodeNum to it ( 508 ), typically using a local prefix node number.
  • the local FS updates directory b and returns to the FE ( 508 ).
  • the FE then sends the new inodeNum with the NodeID of Node_c to update the catalog Node_cs ( 509 ), which registers it by adding the Key-Value pair <inodeNum c, Node_c> ( 510 ) and returning an acknowledgement to the FE ( 511 ), which returns the inode to the VFS, which caches it in the inode cache ( 412 ), and returns the file descriptor to the application ( 413 ).
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

Abstract

A method of accessing a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each stores a file system segment of a file system mapping the memory records, each file system segment maps a subset of the memory records; receiving, by a storage managing module of a first network node, a request for accessing one of the memory records from an application executed in the first network node; querying a file system segment stored in the first network node for the memory record; when the memory record is missing, querying for an address of a second network node, wherein the memory record is stored in the second network node; and providing said first network node with an access to said memory record at said second network node via a network according to said address.

Description

    RELATED APPLICATION
  • This application claims the benefit of priority under 35 USC 119(e) of U.S. Provisional Patent Application No. 61/946,847 filed on Mar. 2, 2014, the contents of which are incorporated herein by reference in their entirety.
  • FIELD AND BACKGROUND OF THE INVENTION
  • The present invention, in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.
  • Direct-attached storage (DAS) is a model in which data is local on a server and benefits from low latency access. However, when multiple servers are connected to a network, the DAS model is: inefficient, because there is no resource sharing between servers; inconvenient since data cannot be shared between processes running on different application servers; and not resilient because data is lost upon a single server failure.
  • To overcome the weaknesses of DAS, the shared storage model was invented. Shared-storage systems store all or most metadata and data on a server, which is typically an over-the-network server and not the same server that runs the application/s that generates and consumes the stored data. This architecture can be seen both in traditional shared storage systems, such as NetApp FAS and/or EMC Isilon, where all of the data is accessed via the network; and/or in host-based storage, such as Redhat Gluster and/or EMC Scale-io, in which application servers also run storage functions, but the data is uniformly distributed across the cluster of servers (so 1/n of the data is accessed locally by each server and the remaining (n−1)/n of the data is accessed via the network).
  • Another well known variant of shared storage is shared storage with (typically read) caches. In this design the application server includes local storage media (such as a Flash card) that holds data that was recently accessed by the application server. This is typically beneficial for recurring read requests. Caching can be used in front of a traditional shared storage (for example in Linux block layer cache (BCache)), or in front of a host-based storage (for example in VMware vSAN). These caching solutions tend to be block-based solutions—i.e. DAS file system layer on top of a shared block layer.
  • Finally, some storage protocols such as Hadoop distributed file system (HDFS) and parallel network file system (pNFS), allow for metadata to be served from a centralized shared node, while data is served from multiple nodes. The data (not metadata) is typically uniformly distributed among the nodes for load balancing purposes.
  • SUMMARY OF THE INVENTION
  • According to an aspect of some embodiments of the present invention there is provided a method of accessing a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for accessing one of the plurality of memory records, the request is received from an application executed in the first network node; querying a first file system segment stored in the first network node for the memory record; when the memory record is missing from the first memory records subset, querying for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node; and providing the first network node with an access to the memory record at the second network node via a network according to the address.
  • Optionally, the providing comprises establishing a direct communication channel between the first network node and the second network node via the network according to the address to provide the access.
  • Optionally, the querying for the address includes: sending a request to a catalog service via the network; and receiving a reply message from the catalog service, the reply message including the address.
  • Optionally, the querying for the address includes sending a request to each of the plurality of network nodes to receive the address.
  • Optionally, the querying for the address includes querying for a last known location of the memory record cached in the first file system segment.
  • Optionally, the second network node temporarily blocks write access to the memory record for the first network node when the memory record is currently accessed by any other of the plurality of network nodes.
  • More optionally, the second network node temporarily blocks access to the memory record for the first network node when the memory record is currently written by any other of the plurality of network nodes.
  • Optionally, a copy of the memory record is also stored in a third of the plurality of network nodes.
  • Optionally, the method further comprises: when the second network node is unavailable, querying for an address of the third network node; and establishing a direct communication channel between the first network node and the third network node via the network according to the address to provide access to the memory record.
  • Optionally, a copy of the memory record is also stored in the first network node and may be accessed instead of accessing the memory record at the second network node via the network.
  • Optionally, the method further comprises, before the querying: querying for an address of a directory containing the memory record; and querying for an address of the memory record in the directory.
  • Optionally, the memory record includes multiple file segments.
  • Optionally, the querying for the address includes providing an inode number of the memory record.
  • Optionally, the querying for the address includes providing a layout number of the memory record.
  • According to some embodiments of the invention there is provided a computer readable medium comprising computer executable instructions adapted to perform the method.
  • According to an aspect of some embodiments of the present invention there is provided a system of managing a distributed network storage, comprising: a file system segment stored in a first of a plurality of network nodes, the file system segment is one of a plurality of file system segments of a file system mapping a plurality of memory records; a program store storing a storage managing code; and a processor, coupled to the program store, for implementing the storage managing code, the storage managing code comprising: code to receive an access request to a memory record of the plurality of memory records from an application executed in the first network node; code to query the file system segment for the memory record in the first memory records subset; code to query for an address of a second network node of the plurality of network nodes when the memory record is missing from the first memory records subset, wherein the memory record is stored in a second memory records subset of the second network node; and code to provide the first network node with an access to the memory record at the second network node via a network according to the address.
  • According to an aspect of some embodiments of the present invention there is provided a distributed network storage system, comprising: a plurality of network nodes connected via a network, each including a storage managing module; a plurality of file system segments of a file system, each stored in one of the plurality of network nodes; a plurality of memory records managed by the plurality of file system segments, wherein each of the plurality of memory records is owned by one of the plurality of network nodes and stored in at least one of the plurality of network nodes; and wherein when an application executed in a first of the plurality of network nodes requests an access to one of the plurality of memory records, and the memory record is missing from a memory records subset stored in the first network node, a storage managing module included in the first network node queries for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node; and providing the first network node with an access to the memory record at the second network node via a network according to the address.
  • According to an aspect of some embodiments of the present invention there is provided a method of creating a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for creating a new one of the plurality of memory records, the request is received from an application executed in the first network node; creating the memory record in the first network node; and registering the memory record in a catalog service via the network.
  • Optionally, the creating includes assigning a prefix unique to the first network node to an inode number of the memory record.
  • Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
  • Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
  • For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
  • Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
  • In the drawings:
  • FIG. 1 is a schematic illustration of a distributed network storage system that includes memory records and is managed by a shared file system, according to some embodiments of the present invention;
  • FIG. 2A is a schematic illustration of an exemplary file system segment stored by a network node, according to some embodiments of the present invention;
  • FIG. 2B is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes, according to some embodiments of the present invention;
  • FIG. 2C is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B, stored by a network node, according to some embodiments of the present invention;
  • FIG. 3 is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention;
  • FIG. 4 is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention; and
  • FIG. 5 is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.
  • DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • The present invention, in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.
  • Storage media, typically thought of as non-volatile memory such as magnetic hard-disk drive (HDD) or Flash-based solid-state drive (SSD), offers affordable capacity, but at 1,000 to 100,000 times longer latency compared to volatile memory such as dynamic random-access memory (DRAM). Newly developed storage media, such as storage class memory (SCM), which is a form of persistent memory, promise DRAM-like ultra-low latency. When ultra-low latency storage is used, network latency is no longer a relatively insignificant delay, as it is in traditional shared storage architectures. New shared storage architectures are required that minimize network access and therefore overall network latency.
  • According to some embodiments of the present invention, there is provided a hierarchical shared file system and methods of managing the file system by distributing segments of the file system to reduce network latency and augmenting local file management into a distributed storage solution. These embodiments are a hybrid between direct-attached storage (DAS) and shared storage. In this system, metadata and data are predicted to be local, and the rest of the shared file system hierarchy is only searched upon a misprediction.
  • The system includes multiple memory records that are stored in multiple network nodes. Each network node stores a segment of the file system that maps a subset of the memory records stored in that network node. Each memory record, such as a record represented by an inode in Linux or an entry in the master file table in Windows' new technology file system (NTFS), is a directory or a file in the file system or a file segment such as a range of data blocks. Each memory record is owned (e.g. access managed and/or access controlled) by a single network node in the system, at a given time. The owning network node is the only entity in the system that is allowed to commit changes to its memory records.
  • When the method of accessing a memory record is applied, according to some embodiments of the present invention, a memory record, requested by an application that is executed in one of the network nodes, is first speculated to be owned and therefore stored in a local memory of that network node. When the prediction is correct, only local information is traversed, which results in ultra-low latency access. However, when the speculation fails and the memory record is missing, it is searched for in other network nodes in the system.
  • When the file system segment stored in that network node includes cached hints of the last known network node to own the memory record, this last network node is contacted first with a request for the memory record.
  • Optionally, the search is done by connecting to a catalog service (typically over the network), which replies with the address of the network node that currently owns the desired memory record.
  • Optionally, the logically centralized catalog service is sharded, so that each subset of the catalog, such as a range of inode numbers or range of hash values of inode numbers, is served by a different network node, for load balancing purposes.
  • Optionally, the information of memory record ownership is distributed between the network nodes in the system, and the search is done by storing and maintaining hints in the parent directory of each file, by broadcasting the search query to all the network nodes, and/or by sequentially querying different nodes and/or groups of nodes based on an estimation and/or geographical or contextual proximity.
  • Finally, a request is sent to the owning network node and access to the memory record is granted. Optionally, the ownership of the memory record is changed, for example due to trends in data consumption and/or failures, and the memory record is transferred to the requesting network node.
  • Optionally, copies of some or all of the memory records are also stored as a secondary and/or backup in one or more of the other network nodes, which may replace the owning network node, for example in case of failure and/or network traffic load, and provide high availability and persistency of the data.
  • Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
  • The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
  • Referring now to the drawings, FIG. 1 is a schematic illustration of a distributed network storage system, such as a cluster of application servers, that includes memory records and is managed by a shared file system, wherein segments of the file system are stored in a common network node with the records they map, according to some embodiments of the present invention.
  • The system includes multiple network nodes 110 such as application servers, each storing, in a memory 120, a subset of memory records from all the memory records stored in the system. Each one of network nodes 110 includes a file system segment of the file system, mapping the subset of memory records.
  • A memory record represents an addressable data unit, such as a file, a directory, a layout, or a file segment such as a data block. A memory record may be of any offset and size. The data block does not have to reside on memory storage media.
  • Each memory record is owned by one network node. Owning a memory record means that the owning network node stores its memory records and is the only entity in the system that is allowed to commit changes to its memory records. Committing changes may include modifying a file by changing the data (e.g. write system calls), cutting it short (e.g. truncate), adding information (e.g. append) and many other portable operating system interface (POSIX) variants. Similar commit operations are required in a directory for rename, touch, remove and other POSIX commands.
  • Each of network nodes 110 (such as network nodes 111 and 112) may be, for example, a physical computer such as a mainframe computer, a workstation, a conventional personal computer (PC), a server-class computer or multiple connected computers, and/or a virtual server.
  • Network nodes 110 are connected via one or more network(s) 101. Network 101 may include, for example, a LAN, a high performance computing network such as InfiniBand, and/or any other network. Optionally, network 101 is comprised of a hierarchy of different networks, such as multiple vLANs or a hybrid of LANs connected by a WAN.
  • Memory 120 of each network node 110 may include, for example, non-volatile memory (NVM), also known as persistent memory (PM), and/or solid-state drive (SSD), and/or magnetic hard-disk drive (HDD), and optionally optical disks or tape drives. These technologies can be internal or external devices or systems, including memory bricks accessible by network node 110. Memory 120 may also be made of DRAM that is backed up with a supercapacitor and Flash, or other hybrid technologies. In some use cases, such as ephemeral computing (e.g. most cloud services), memory 120 may even be comprised of volatile memory. Memory 120 may also include a partition in a disk drive, or even a file, a logical volume presented by a logical volume manager built on top of one or more disk drives, a persistent medium, such as non-volatile memory (NVDIMM or NVRAM), a persistent array of storage blocks, and/or the like. Memory 120 may also be divided into a plurality of volumes, also referred to as partitions or block volumes. When memory 120 includes non-volatile memory, which is considerably faster than other storage media types, the reduced network latency achieved by the method is more significant.
  • The file system segment stored by each network node is typically implemented using tree structures. One tree supports lookups, adds and removes of inodes, while another is used to do the same for data blocks per file or directory. Some file systems may be implemented using other structures such as hash tables.
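  • For illustration only, such a file system segment might be modeled as a pair of lookup structures, one keyed by inode number and one keyed by data block, together with a cache of ownership hints. The Python sketch below uses plain dictionaries as stand-ins for the tree or hash structures described above; the class and field names are hypothetical, not part of any actual implementation.

```python
# Illustrative sketch only: a file system segment as two lookup structures,
# one for inodes and one for data blocks, plus cached ownership hints.
# Production implementations would use tree or hash structures on the
# storage media rather than Python dictionaries.

class FileSystemSegment:
    def __init__(self, node_id):
        self.node_id = node_id
        self.inodes = {}        # inode number -> metadata of locally owned records
        self.blocks = {}        # (inode number, block index) -> data bytes
        self.owner_hints = {}   # inode number -> last known owning node (cached hint)

    def lookup_inode(self, inode_num):
        """Return the locally owned inode, or None on a misprediction."""
        return self.inodes.get(inode_num)

    def hint_for(self, inode_num):
        """Return a cached hint of the last known owner, if any."""
        return self.owner_hints.get(inode_num)
```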
  • Reference is now made to FIG. 2A, which is a schematic illustration of an exemplary file system segment stored by a network node 111, according to some embodiments of the present invention. Marked entities are owned by network node 111, while white entities are not owned by network node 111, which may or may not have a hint as to which node owns them. In FIG. 2A, the first, second and seventh inodes (representing files or directories) are owned by network node 111. The third to sixth inodes, as well as the eighth, tenth and later inodes, are not, and may not even exist in the cluster.
  • Note that while the indirect level and the File/Directory level are drawn as single entities, they are typically comprised of many smaller entities, just like the inode level. FIG. 2A implies that the entire information represented by an inode is fully owned by a single node. While this is a convenient implementation, it is also possible to partition a file into layouts of fixed or flexible sizes and let different nodes own different layouts.
  • The information of memory record ownership may be implemented using a different architecture at the second level of the hierarchy, such as a catalog service and partially independent local file systems in the first level of the hierarchy; or using a distributed architecture such as shared (even if hierarchical) file systems. Optionally, the file system segment stored in a network node includes cached hints of the last known network node to own the memory record.
  • In the catalog architecture, each network node holds and mainly uses a subset of the global metadata, but there is, for example, a centralized catalog service 102 that provides the owning network node per memory record number. Centralized catalog service 102 may be any kind of network node, as described above. Centralized catalog service 102 may be implemented, for example, by using off-the-shelf key-value store services such as Redis or by implementing a network layer on top of a hash structure, in which the key is the inode number (or some hash applied to it) and the value is the network node identification (ID) (e.g. internet protocol (IP) address). Optionally, the catalog service is sharded or distributed, but logically acts as a centralized one. This too can be implemented using off-the-shelf software such as Cassandra, or by sharding alone. In one embodiment, a subset of S out of the N network nodes is also used for holding a subset of the ownership information, and in order to know which shard [0, 1, . . . , (S−1)] serves a particular inode number, the inode number modulo S is calculated.
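  • As a rough sketch of the sharded catalog described above, the following snippet selects the shard serving an inode number by computing the inode number modulo S and then performs a key-value lookup. The in-memory dictionaries, class and method names are assumptions standing in for an off-the-shelf key-value store such as Redis or Cassandra, where each shard would live on a different network node.

```python
# Illustrative sketch of a sharded catalog: S of the N network nodes each
# hold one shard of the inode-number -> owning-node-ID mapping. The local
# dictionaries stand in for shards that would normally be remote services.

class ShardedCatalog:
    def __init__(self, num_shards: int = 4):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]  # inode number -> node ID

    def shard_for(self, inode_num: int) -> int:
        """Shard [0 .. S-1] serving this inode number: the inode number modulo S."""
        return inode_num % self.num_shards

    def register_owner(self, inode_num: int, node_id: str) -> None:
        self.shards[self.shard_for(inode_num)][inode_num] = node_id

    def lookup_owner(self, inode_num: int):
        """Return the owning node ID, or None if the record is not registered."""
        return self.shards[self.shard_for(inode_num)].get(inode_num)
```

  • Sharding by a range of hash values, as mentioned above, would correspond to hashing the inode number before taking the modulo in shard_for.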
  • In the distributed architecture there is no catalog service. Instead, all network nodes use the same tree root of the file system, but hold different subsets of that tree.
  • Reference is now made to FIG. 2B, which is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes 110, according to some embodiments of the present invention. Reference is also made to FIG. 2C, which is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B, stored by a network node 111, according to some embodiments of the present invention. Lightly marked entities are owned by network node 111. The ownership of white entities is unknown to network node 111, while other, darker entities are considered as speculated cached hints for performance optimization purposes. Hints are often located at elements that are higher in the hierarchy than, or siblings of, the owned elements, and not at the owned elements themselves. Also, hints may relate to data which is mirrored in the local network node but owned by another network node.
  • Optionally, for both hierarchical architectures, copies of some or all of the memory records are stored in one or more of the network nodes, for example in network node 111, in addition to the owning network node. In this case, a secondary and/or backup network node may replace the owning network node, for example in case of failure and/or network traffic load, and create high availability and persistency of the data. This could also be leveraged for load balancing for read-only access requests, when no writing is needed. A particular file, such as a golden image or template may be mirrored multiple times or even to all network nodes, in order to reduce network traffic and/or increase local deduplication ratio.
  • Optionally, ownership of entire clones and snapshots, or even sets of snapshots (e.g. all snapshots older than snapshot i), can be re-evaluated as a whole and outweigh the per-file or per-layout ownership process. For example, in order to back up and reduce cost, migration of all data older than or belonging to a daily snapshot to a remote and cheaper site (e.g. cloud storage) is possible. For example, when a file is snapshotted every night, it is possible to have all versions, probably in an efficient deduplicated format, reside on the relevant network node. However, at some point in time, for instance when memory 120 crosses its lowest tier watermark, it may be decided to move cold data and all snapshots that are more than a week old to a third-party storage system, such as a network file system (NFS) server or cloud storage (e.g. AWS S3). In such scenarios the catalog service can point to a Hypertext Transfer Protocol (HTTP) address or an NFS server and path. A shared file system architecture could replace inodes with similar pointers.
  • Reference is now made to FIG. 3, which is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention.
  • First, as shown in 301, the memory records are stored in network nodes 110 (such as network nodes 111 and 112), to create the file system. The file system may include a root, a catalog service, tables of nodes, etc. Data and metadata are also created.
  • Then, as shown in 302, a request for accessing one of the memory records is received by a storage managing module 131 of network node 111 from an application 141 executed in network node 111, for example to read or write B bytes at offset O in a file.
  • The application may be any software component that accesses the memory records, for example for read and/or write operations, either directly or via libraries, middleware and/or an overlay file system manager.
  • Then, as shown in 303, the file system segment stored in network node 111 is queried for the requested memory record. This may be done by storage managing module 131 in any known way of memory access.
  • When the memory record is found locally, as shown in 304, it is accessed and the data is provided to the application. In this case, no network latency is experienced, as no access to network 101 is required.
  • However, as shown in 305, when the memory record is missing from the records subset stored in network node 111, an address of a network node, such as network node 112, owning the memory record is queried for by storage managing module 131.
  • In this case, there is an associated added latency for falsely speculating that the relevant memory record is local. Nevertheless, the latency added by the local search may not be significant when NVM media is used, making it negligible compared to accessing over-the-network servers.
  • The address may be, for example, an IP address of network node 112, or an identification number of network node 112 such as ‘NodeID’ which can be used to calculate or look up the IP, media access control (MAC), HTTP or any other network address.
  • Optionally, when the file system segment stored in that network node includes cached hints of the last known network node to own the memory record, network node 111 connects to this last network node with a request of the memory record, before searching other network nodes. This last known network node may be network node 112 still owning and storing the memory record, or may provide the address of network node 112 where the memory record is stored, or may fail and respond that the hint is no longer correct.
  • Optionally, in catalog architecture, the search is done by connecting to a catalog service, such as a centralized catalog service 102, typically over network 101, which replies with the address of network node 112.
  • Optionally, in a distributed architecture, the search is done by broadcasting the search query to all the network nodes and/or by sequentially querying different nodes and/or groups of nodes based on an estimation and/or geographical or contextual proximity. For example, when the cluster of network nodes is spread over multiple data centers, missing memory records are first searched for in the same data center because of the superior local network resources and the higher probability of data sharing. In an opposite example, certain data is expected to be found in another geography at certain hours, for example when two sites operate in different time zones and the output data of the first team is used by the second team as its input data.
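  • A minimal sketch of this proximity-ordered search, assuming a hypothetical query_node function that asks a single node whether it owns the record, might look as follows; the grouping of nodes (e.g. same rack, same data center, remote sites) is an input supplied by the caller and is not prescribed by the system.

```python
# Illustrative sketch of the catalog-less search: query groups of nodes in
# order of estimated proximity; within a group the request goes to every
# node, approximating a scoped broadcast. query_node is a hypothetical
# network call, not an actual interface of the system.

def find_owner(inode_num, node_groups, query_node):
    """Return the address of the owning node, or None if no node claims it.
    node_groups is a list of node lists ordered by proximity; query_node(node,
    inode_num) returns the owner's address when that node owns the record."""
    for group in node_groups:
        replies = [query_node(node, inode_num) for node in group]
        owners = [reply for reply in replies if reply is not None]
        if owners:
            return owners[0]
    return None
```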
  • Optionally, when a memory record representing a file or a directory is requested, the memory record representing the parent directory may be queried for before querying for the requested memory record. This may be the direct parent directory and also other directories up in the hierarchy. The memory record representing the parent directory may be locally owned by network node 111, may be owned by network node 112 or may be owned by a different network node. This process may be repeated for any directory structure. When the memory record representing the parent directory is found, it is read and may be locally cached and saved as a future hint.
  • Finally, as shown in 306, a direct communication channel is established between network node 111 and network node 112 for example via network 101, according to the address. Access to the memory record is then provided to network node 111.
  • Then, optionally, the information is not read and locally cached, but instead network node 112 makes a decision to either change the ownership of the layout, file or directory, and start a migration process of the memory record and potentially its surrounding information to network node 111, or to perform a remote input/output (IO) access protocol with network node 111. The IO protocol may include any over-the-network file-access interfaces/libraries and object store application programming interface (API).
  • Optionally, network node 112 may block access, temporarily or permanently, to the memory record. A temporary block, for example, may occur when the memory record is currently locked or accessed by an application 142 or another one of network nodes 110, so network node 112 temporarily blocks read and/or write access for network node 111 (e.g. via a RETRY response). Another type of blocked response may occur if the requested data is corrupted with no means to reconstruct it.
  • Optionally, when network node 112 is unavailable, network node 111 queries, by storage managing module 131, for an address of another network node having a copy of the memory record. In the catalog architecture, storage managing module 131 may query via centralized catalog service 102. Also, a copy of the memory record may be located in network node 111, saving the need to traverse network 101.
  • Optionally, when the system includes copies of the memory record in other network node(s), and providing that these are not only treated as hints that will be validated later on, network node 112 informs of any changes made to the memory record, or at least invalidates it, so that outdated copies may be removed or updated. In catalog architecture, network node 112 may send a notification to centralized catalog service 102 regarding the changes. In a distributed architecture network node 112 records a small number of network nodes to be informed and uses broadcast when that number crosses a threshold.
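  • Putting the steps of FIG. 3 together, a highly simplified control flow could resemble the sketch below. Every helper it relies on (lookup_inode, hint_for, lookup_owner, open_remote) is a hypothetical placeholder for the mechanisms described above, not an actual interface of the system.

```python
# Illustrative sketch of the access path of FIG. 3: speculate that the record
# is local, then try a cached hint, then ask the catalog service, and finally
# open a direct channel to the owning node. segment, catalog and open_remote
# are hypothetical stand-ins for the components described in the text.

def access_record(segment, catalog, inode_num, open_remote):
    record = segment.lookup_inode(inode_num)      # steps 303-304: local speculation
    if record is not None:
        return record                             # correct prediction: no network access

    owner = segment.hint_for(inode_num)           # optional cached hint of the last owner
    if owner is not None:
        try:
            return open_remote(owner, inode_num)  # the hint may be stale or the node down
        except (ConnectionError, KeyError):
            pass

    owner = catalog.lookup_owner(inode_num)       # step 305: query the catalog service
    if owner is None:
        raise FileNotFoundError(inode_num)
    return open_remote(owner, inode_num)          # step 306: direct channel to the owner
```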
  • Reference is now made to FIG. 4, which is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention. The exemplary scenario is demonstrated using the POSIX semantics and Linux implementation (e.g. virtual file system (VFS) and its dentry cache). The exemplary scenario shows different types of local ownership, remote ownership and caching, which are underlined.
  • In this example, a network node ‘Node_c’ 421 includes an application (App), a VFS, a front end (FE) and a local file system (FS). The front end is optional and/or implementation dependent, as some flexibility exists in the way the front end is implemented; for example, unlike in this example, the front end may just be an escape option in the local file system. The FE and FS may both be represented as storage managing module 130 in FIG. 1. Other ways to partition storage managing module 130 exist, such as shown below as optional.
  • The application requests to open file c that is located in directory b that is in directory a (401) that is under the root. The open function call may include relevant flag argument(s) that can later be passed to NodeID.
  • The VFS looks for directory a in root, which turns out to be locally cached, and then looks for directory b in directory a, which is not locally cached (402). The lookup request is then transferred to the FE (403) and then to the local FS. The local FS predicts that directory b is locally owned and stored, looks for directory b, finds that it is indeed locally owned and returns it to the FE (404). The FE then returns directory b to the VFS (405), which caches it (406), looks up file c in directory b, finds the inode number (inodeNum) of file c, but the inode itself is not found in the VFS inode cache (406). The open inodeNum c request is then transferred to the FE (406) and to the local FS (407). The local FS predicts that file c is locally owned and stored, tries to open file c but finds that it is a misprediction and returns because inodeNum c is not locally owned (408). The FE then connects over the network to the catalog service, ‘Node_cs’ 422, requesting the value that matches the inodeNum key (409). The catalog service is a key-value store, so it searches and returns the value that matches the inodeNum key. The value is NodeID, i.e. the identity of the network node that owns file c (410). The FE receives the NodeID, checks for its validity, calculates the node address and establishes a direct connection (P2P handshake) to the owning network node based on the NodeID (411). When the ownership is resolved between the network nodes, the open request is complete and returns to the VFS (412), which caches it in its inode cache and returns the file descriptor to the application (413).
  • Optionally, when a memory record has to be created, for example as requested by application 141, it is created locally in memory 121 of network node 111. Storage managing module 131 assigns a new inode number to the new memory record. The inode number may include a prefix unique to network node 111. Storage managing module 131 then updates the directory containing the new memory record, and when the directory is owned by another network node, for example network node 112, contacts that network node to update it. In the catalog architecture, network node 111 registers the new memory record in centralized catalog service 102 by connecting to centralized catalog service 102 and providing the new inode number. Similarly, deletion of a memory record is done by the owning network node and centralized catalog service 102 is updated.
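  • For illustration, the node-unique prefix can be thought of as the high-order bits of the inode number, with a node-local counter in the low-order bits. The sketch below assumes a 64-bit inode number space and hypothetical helper names, and registers the new record with the catalog service as described above; it is not a prescribed layout.

```python
# Illustrative sketch of record creation: the new inode number carries a
# prefix unique to the creating node so that numbers never collide across
# nodes, and the new record is then registered with the catalog service.
# The 64-bit layout and all names are assumptions for illustration.

NODE_PREFIX_BITS = 16  # hypothetical split of the inode number space

def make_inode_num(node_prefix: int, local_counter: int) -> int:
    """Compose a cluster-unique inode number from the node prefix (high bits)
    and a node-local counter (low bits)."""
    return (node_prefix << (64 - NODE_PREFIX_BITS)) | local_counter

def create_record(segment, catalog, node_prefix, node_id, next_local_num, metadata):
    inode_num = make_inode_num(node_prefix, next_local_num)
    segment.inodes[inode_num] = metadata        # create the record locally first
    catalog.register_owner(inode_num, node_id)  # then register ownership via the catalog
    return inode_num
```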
  • Reference is now made to FIG. 5, which is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.
  • When the application requests to open a new file c in directory b that is in directory a (501) that is under the root, the system looks for the directories and file as described above (402-406). However, when file c is not mentioned in directory b and the application requested the open system call using an O_CREAT argument (back in step 501), then the FE is requested to create a new file, a request that continues to the local FS (507). The local FS creates file c locally and assigns a new inodeNum to it (508), typically using a local node prefix number. The local FS updates directory b and returns to the FE (508). The FE then sends the new inodeNum with the NodeID of Node_c to update the catalog Node_cs (509), which registers it by adding the Key-Value pair <inodeNum c, Node_c> (510) and returning an acknowledgement to the FE (511), which returns the inode to the VFS, which caches it in the inode cache (412) and returns the file descriptor to the application (413).
  • The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
  • It is expected that during the life of a patent maturing from this application many relevant shared file systems will be developed and the scope of the term shared file system is intended to include all such new technologies a priori.
  • The terms “comprises”, “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”. This term encompasses the terms “consisting of” and “consisting essentially of”.
  • As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
  • The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.
  • Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
  • It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
  • Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
  • All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims (19)

What is claimed is:
1. A method of accessing a memory record in distributed network storage, comprising:
storing a plurality of memory records in a plurality of network nodes, each one of said plurality of network nodes storing a plurality of file system segments of a file system mapping said plurality of memory records, each one of said plurality of file system segments maps a subset of said plurality of memory records;
receiving, by a storage managing module of a first network node of said plurality of network nodes, a request for accessing one of said plurality of memory records, said request is received from an application executed in said first network node;
querying a first file system segment stored in said first network node for said memory record;
when said memory record is missing from said first memory records subset, querying for an address of a second network node of said plurality of network nodes, wherein said memory record is stored in a second memory records subset of said second network node; and
providing said first network node with an access to said memory record at said second network node via a network according to said address.
2. The method of claim 1, wherein said providing comprises establishing a direct communication channel between said first network node and said second network node via said network according to said address to provide said access.
3. The method of claim 1, wherein said querying for said address includes:
sending a request to a catalog service via said network; and
receiving a reply message from said catalog service, said reply message including said address.
4. The method of claim 1, wherein said querying for said address includes sending a request to each of said plurality of network nodes to receive said address.
5. The method of claim 1, wherein said querying for said address includes querying for a last known location of said memory record cached in said first file system segment.
6. The method of claim 1, wherein said second network node temporarily blocks write access to said memory record for said first network node when said memory record is currently accessed by any other of said plurality of network nodes.
7. The method of claim 6, wherein said second network node temporarily blocks access to said memory record for said first network node when said memory record is currently written by any other of said plurality of network nodes.
8. The method of claim 1, wherein a copy of said memory record is also stored in a third of said plurality of network nodes.
9. The method of claim 8, further comprising:
when said second network node is unavailable, querying for an address of said third network node; and
establishing a direct communication channel between said first network node and said third network node via said network according to said address to provide access to said memory record.
10. The method of claim 1, wherein a copy of said memory record is also stored in said first network node and may be accessed instead of accessing the memory record at said second network node via said network.
11. The method of claim 1, further comprising, before said querying:
querying for an address of a directory containing said memory record; and
querying for an address of said memory record in said directory.
12. The method of claim 1, wherein said memory record includes multiple file segments.
13. The method of claim 1, wherein said querying for said address includes providing an inode number of said memory record.
14. The method of claim 1, wherein said querying for said address includes providing a layout number of said memory record.
15. A computer readable medium comprising computer executable instructions adapted to perform the method of claim 1.
16. A system of managing a distributed network storage, comprising:
a file system segment stored in a first of a plurality of network nodes, said file system segment is one of a plurality of file system segments of a file system mapping a plurality of memory records;
a program store storing a storage managing code; and
a processor, coupled to said program store, for implementing said storage managing code, the storage managing code comprising:
code to receive an access request to a memory record of said plurality of memory records from an application executed in said first network node;
code to query said file system segment for said memory record in said first memory records subset;
code to query for an address of a second network node of said plurality of network nodes when said memory record is missing from said first memory records subset, wherein said memory record is stored in a second memory records subset of said second network node; and
code to provide said first network node with an access to said memory record at said second network node via a network according to said address.
17. A distributed network storage system, comprising:
a plurality of network nodes connected via a network, each including a storage managing module;
a plurality of file system segments of a file system, each stored in one of said plurality of network nodes;
a plurality of memory records managed by said plurality of file system segments, wherein each of said plurality of memory records is owned by one of said plurality of network nodes and stored in at least one of said plurality of network nodes; and
wherein when an application executed in a first of said plurality of network nodes requests an access to one of said plurality of memory records, and said memory record is missing from a memory records subset stored in said first network node, a storage managing module included in said first network node queries for an address of a second network node of said plurality of network nodes, wherein said memory record is stored in a second memory records subset of said second network node; and providing said first network node with an access to said memory record at said second network node via a network according to said address.
18. A method of creating a memory record in distributed network storage, comprising:
storing a plurality of memory records in a plurality of network nodes, each one of said plurality of network nodes storing a plurality of file system segments of a file system mapping said plurality of memory records, each one of said plurality of file system segments maps a subset of said plurality of memory records;
receiving, by a storage managing module of a first network node of said plurality of network nodes, a request for creating a new one of said plurality of memory records, said request is received from an application executed in said first network node;
creating said memory record in said first network node; and
registering said memory record in a catalog service via said network.
19. The method of claim 18, wherein said creating includes assigning a prefix unique to said first network node to an inode number of said memory record.
US14/635,261 2014-03-02 2015-03-02 Hierarchical host-based storage Abandoned US20150248443A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/635,261 US20150248443A1 (en) 2014-03-02 2015-03-02 Hierarchical host-based storage

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461946847P 2014-03-02 2014-03-02
US14/635,261 US20150248443A1 (en) 2014-03-02 2015-03-02 Hierarchical host-based storage

Publications (1)

Publication Number Publication Date
US20150248443A1 true US20150248443A1 (en) 2015-09-03

Family

ID=54006866

Family Applications (4)

Application Number Title Priority Date Filing Date
US14/635,236 Active 2035-08-01 US10031933B2 (en) 2014-03-02 2015-03-02 Peer to peer ownership negotiation
US14/635,261 Abandoned US20150248443A1 (en) 2014-03-02 2015-03-02 Hierarchical host-based storage
US16/040,358 Active US10430397B2 (en) 2014-03-02 2018-07-19 Peer to peer ownership negotiation
US16/585,528 Active US10853339B2 (en) 2014-03-02 2019-09-27 Peer to peer ownership negotiation

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/635,236 Active 2035-08-01 US10031933B2 (en) 2014-03-02 2015-03-02 Peer to peer ownership negotiation

Family Applications After (2)

Application Number Title Priority Date Filing Date
US16/040,358 Active US10430397B2 (en) 2014-03-02 2018-07-19 Peer to peer ownership negotiation
US16/585,528 Active US10853339B2 (en) 2014-03-02 2019-09-27 Peer to peer ownership negotiation

Country Status (1)

Country Link
US (4) US10031933B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170123692A1 (en) * 2015-11-01 2017-05-04 International Business Machines Corporation Pdse extended generation grouping for tiered storage systems
US9996426B1 (en) * 2015-06-30 2018-06-12 EMC IP Holding Company LLC Sparse segment trees for high metadata churn workloads
US10031933B2 (en) 2014-03-02 2018-07-24 Netapp, Inc. Peer to peer ownership negotiation
CN108418874A (en) * 2018-02-12 2018-08-17 平安科技(深圳)有限公司 Guiding method, device, computer equipment and storage medium are returned across wide area network data
US10055420B1 (en) 2015-06-30 2018-08-21 EMC IP Holding Company LLC Method to optimize random IOS of a storage device for multiple versions of backups using incremental metadata
US20190141131A1 (en) * 2015-04-09 2019-05-09 Pure Storage, Inc. Point to point based backend communication layer for storage processing
CN110311953A (en) * 2019-05-24 2019-10-08 杭州网络传媒有限公司 A kind of media article uploads and storage system and method
CN110737547A (en) * 2019-10-22 2020-01-31 第四范式(北京)技术有限公司 Method and apparatus for restoring memory database using non-volatile memory (NVM)
CN111061681A (en) * 2019-11-15 2020-04-24 浪潮电子信息产业股份有限公司 Method and device for partitioning directory based on case insensitivity and storage medium
US11381642B2 (en) * 2017-11-06 2022-07-05 Nippon Telegraph And Telephone Corporation Distributed storage system suitable for sensor data
US20230239547A1 (en) * 2016-12-09 2023-07-27 The Nielsen Company (Us), Llc Scalable architectures for reference signature matching and updating

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10574745B2 (en) * 2015-03-31 2020-02-25 Western Digital Technologies, Inc. Syncing with a local paired device to obtain data from a remote server using point-to-point communication
US10706970B1 (en) 2015-04-06 2020-07-07 EMC IP Holding Company LLC Distributed data analytics
US10277668B1 (en) * 2015-04-06 2019-04-30 EMC IP Holding Company LLC Beacon-based distributed data processing platform
US10691553B2 (en) * 2015-12-16 2020-06-23 Netapp, Inc. Persistent memory based distributed-journal file system
EP3495981B1 (en) * 2016-11-16 2021-08-25 Huawei Technologies Co., Ltd. Directory deletion method and device, and storage server
CN110018998B (en) * 2019-04-12 2023-05-12 深信服科技股份有限公司 File management method and system, electronic equipment and storage medium
CN112527186B (en) * 2019-09-18 2023-09-08 华为技术有限公司 Storage system, storage node and data storage method
US11513970B2 (en) * 2019-11-01 2022-11-29 International Business Machines Corporation Split virtual memory address loading mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020046232A1 (en) * 2000-09-15 2002-04-18 Adams Colin John Organizing content on a distributed file-sharing network
US20040064523A1 (en) * 2002-10-01 2004-04-01 Zheng Zhang Placing an object at a node within a logical space in a peer-to-peer system
US20040111486A1 (en) * 2002-12-06 2004-06-10 Karl Schuh Distributed cache between servers of a network
US20070094269A1 (en) * 2005-10-21 2007-04-26 Mikesell Paul A Systems and methods for distributed system scanning
US20100274772A1 (en) * 2009-04-23 2010-10-28 Allen Samuels Compressed data objects referenced via address references and compression references

Family Cites Families (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5050213A (en) * 1986-10-14 1991-09-17 Electronic Publishing Resources, Inc. Database usage metering and protection system and method
US5235690A (en) * 1990-08-31 1993-08-10 International Business Machines Corporation Method for operating a cached peripheral data storage subsystem including a step of subsetting the data transfer into subsets of data records
US5560006A (en) * 1991-05-15 1996-09-24 Automated Technology Associates, Inc. Entity-relation database
US5982747A (en) * 1995-12-28 1999-11-09 Dynarc Inc. Method for managing failures on dynamic synchronous transfer mode dual ring topologies
US5864854A (en) 1996-01-05 1999-01-26 Lsi Logic Corporation System and method for maintaining a shared cache look-up table
US6101420A (en) * 1997-10-24 2000-08-08 Compaq Computer Corporation Method and apparatus for disambiguating change-to-dirty commands in a switch based multi-processing system with coarse directories
US6697846B1 (en) * 1998-03-20 2004-02-24 Dataplow, Inc. Shared file system
US6360331B2 (en) * 1998-04-17 2002-03-19 Microsoft Corporation Method and system for transparently failing over application configuration information in a server cluster
US6384825B2 (en) * 1998-06-25 2002-05-07 Tektronix, Inc. Method of controlling a sparse vector rasterizer
US6321238B1 (en) * 1998-12-28 2001-11-20 Oracle Corporation Hybrid shared nothing/shared disk database system
US6341340B1 (en) * 1998-12-28 2002-01-22 Oracle Corporation Transitioning ownership of data items between ownership groups
US6453354B1 (en) * 1999-03-03 2002-09-17 Emc Corporation File server system using connection-oriented protocol and sharing data sets among data movers
US6374332B1 (en) * 1999-09-30 2002-04-16 Unisys Corporation Cache control system for performing multiple outstanding ownership requests
US6691178B1 (en) * 2000-02-22 2004-02-10 Stmicroelectronics, Inc. Fencepost descriptor caching mechanism and method therefor
US6516393B1 (en) * 2000-09-29 2003-02-04 International Business Machines Corporation Dynamic serialization of memory access in a multi-processor system
US7165096B2 (en) * 2000-12-22 2007-01-16 Data Plow, Inc. Storage area network file system
US7620955B1 (en) * 2001-06-08 2009-11-17 Vmware, Inc. High-performance virtual machine networking
US7155722B1 (en) * 2001-07-10 2006-12-26 Cisco Technology, Inc. System and method for process load balancing in a multi-processor environment
US7058948B2 (en) * 2001-08-10 2006-06-06 Hewlett-Packard Development Company, L.P. Synchronization objects for multi-computer systems
US7120631B1 (en) * 2001-12-21 2006-10-10 Emc Corporation File server system providing direct data sharing between clients with a server acting as an arbiter and coordinator
US7243142B2 (en) * 2002-02-01 2007-07-10 Sun Microsystems, Inc Distributed computer system enhancing a protocol service to a highly available service
CA2377649C (en) * 2002-03-20 2009-02-03 Ibm Canada Limited-Ibm Canada Limitee Dynamic cluster database architecture
US6993634B2 (en) * 2002-04-29 2006-01-31 Intel Corporation Active tracking and retrieval of shared memory resource information
US20040068607A1 (en) * 2002-10-07 2004-04-08 Narad Charles E. Locking memory locations
US8041735B1 (en) * 2002-11-01 2011-10-18 Bluearc Uk Limited Distributed file system and method
US7024580B2 (en) * 2002-11-15 2006-04-04 Microsoft Corporation Markov model of availability for clustered systems
US6981106B1 (en) * 2002-11-26 2005-12-27 Unisys Corporation System and method for accelerating ownership within a directory-based memory system
US7340522B1 (en) * 2003-07-31 2008-03-04 Hewlett-Packard Development Company, L.P. Method and system for pinning a resource having an affinity to a user for resource allocation
US7139772B2 (en) * 2003-08-01 2006-11-21 Oracle International Corporation Ownership reassignment in a shared-nothing database system
US7962696B2 (en) * 2004-01-15 2011-06-14 Hewlett-Packard Development Company, L.P. System and method for updating owner predictors
US20140149783A1 (en) * 2004-06-01 2014-05-29 Ivan I. Georgiev Methods and apparatus facilitating access to storage among multiple computers
GB0420057D0 (en) * 2004-09-09 2004-10-13 Level 5 Networks Ltd Dynamic resource allocation
CA2622404A1 (en) * 2004-09-15 2006-03-23 Adesso Systems, Inc. System and method for managing data in a distributed computer system
US8549180B2 (en) * 2004-10-22 2013-10-01 Microsoft Corporation Optimizing access to federation infrastructure-based resources
US9037698B1 (en) * 2006-03-14 2015-05-19 Amazon Technologies, Inc. Method and system for collecting and analyzing time-series data
US7743018B2 (en) * 2006-04-10 2010-06-22 International Business Machines Corporation Transient storage in distributed collaborative computing environments
US8972345B1 (en) * 2006-09-27 2015-03-03 Hewlett-Packard Development Company, L.P. Modifying data structures in distributed file systems
WO2008055271A2 (en) * 2006-11-04 2008-05-08 Virident Systems, Inc. Seamless application access to hybrid main memory
US7613947B1 (en) * 2006-11-30 2009-11-03 Netapp, Inc. System and method for storage takeover
US7711683B1 (en) * 2006-11-30 2010-05-04 Netapp, Inc. Method and system for maintaining disk location via homeness
WO2008070814A2 (en) * 2006-12-06 2008-06-12 Fusion Multisystems, Inc. (Dba Fusion-Io) Apparatus, system, and method for a scalable, composite, reconfigurable backplane
JP4369471B2 (en) * 2006-12-27 2009-11-18 富士通株式会社 Mirroring program, mirroring method, information storage device
US9292620B1 (en) 2007-09-14 2016-03-22 Hewlett Packard Enterprise Development LP Retrieving data from multiple locations in storage systems
US8392370B1 (en) * 2008-03-28 2013-03-05 Emc Corporation Managing data on data storage systems
US7873619B1 (en) * 2008-03-31 2011-01-18 Emc Corporation Managing metadata
US7869383B2 (en) * 2008-07-24 2011-01-11 Symform, Inc. Shared community storage network
US8244951B2 (en) * 2008-09-25 2012-08-14 Intel Corporation Method and apparatus to facilitate system to system protocol exchange in back to back non-transparent bridges
WO2010108186A1 (en) * 2009-03-20 2010-09-23 Georgia Tech Research Corporation Methods and apparatuses for using a mobile device to provide remote assistance
US8370571B2 (en) * 2009-04-08 2013-02-05 Hewlett-Packard Development Company, L.P. Transfer control of a storage volume between storage controllers in a cluster
US8244988B2 (en) * 2009-04-30 2012-08-14 International Business Machines Corporation Predictive ownership control of shared memory computing system data
US20100332285A1 (en) * 2009-06-24 2010-12-30 International Business Machines Corporation Intellectual Property Component Business Model for Client Services
US8452835B2 (en) * 2009-12-23 2013-05-28 Citrix Systems, Inc. Systems and methods for object rate limiting in multi-core system
US8438341B2 (en) * 2010-06-16 2013-05-07 International Business Machines Corporation Common memory programming
CN103620576B (en) * 2010-11-01 2016-11-09 Seven Networks, Inc. Caching adapted for mobile application behavior and network conditions
US20120203733A1 (en) * 2011-02-09 2012-08-09 Zhang Amy H Method and system for personal cloud engine
CN106407766A (en) * 2011-03-07 2017-02-15 Security First Corp. Secure file sharing method and system
EP2700019B1 (en) * 2011-04-19 2019-03-27 Seven Networks, LLC Social caching for device resource sharing and management
US8713577B2 (en) * 2011-06-03 2014-04-29 Hitachi, Ltd. Storage apparatus and storage apparatus management method performing data I/O processing using a plurality of microprocessors
US9116812B2 (en) * 2012-01-27 2015-08-25 Intelligent Intellectual Property Holdings 2 LLC Systems and methods for a de-duplication cache
US20140007189A1 (en) * 2012-06-28 2014-01-02 International Business Machines Corporation Secure access to shared storage resources
US10339056B2 (en) * 2012-07-03 2019-07-02 SanDisk Technologies LLC Systems, methods and apparatus for cache transfers
US20140115579A1 (en) * 2012-10-19 2014-04-24 Jonathan Kong Datacenter storage system
US9088450B2 (en) * 2012-10-31 2015-07-21 Elwha LLC Methods and systems for data services
WO2014096970A2 (en) * 2012-12-20 2014-06-26 Marvell World Trade Ltd. Memory sharing in a network device
US9424301B2 (en) * 2013-11-22 2016-08-23 Netapp, Inc. System and method for negotiated takeover of storage objects
US10031933B2 (en) 2014-03-02 2018-07-24 Netapp, Inc. Peer to peer ownership negotiation
US20160212198A1 (en) * 2015-01-16 2016-07-21 Netapp, Inc. System of host caches managed in a unified manner

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020046232A1 (en) * 2000-09-15 2002-04-18 Adams Colin John Organizing content on a distributed file-sharing network
US20040064523A1 (en) * 2002-10-01 2004-04-01 Zheng Zhang Placing an object at a node within a logical space in a peer-to-peer system
US20040111486A1 (en) * 2002-12-06 2004-06-10 Karl Schuh Distributed cache between servers of a network
US20070094269A1 (en) * 2005-10-21 2007-04-26 Mikesell Paul A Systems and methods for distributed system scanning
US20100274772A1 (en) * 2009-04-23 2010-10-28 Allen Samuels Compressed data objects referenced via address references and compression references

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430397B2 (en) 2014-03-02 2019-10-01 Netapp, Inc. Peer to peer ownership negotiation
US10853339B2 (en) 2014-03-02 2020-12-01 Netapp Inc. Peer to peer ownership negotiation
US10031933B2 (en) 2014-03-02 2018-07-24 Netapp, Inc. Peer to peer ownership negotiation
US10693964B2 (en) * 2015-04-09 2020-06-23 Pure Storage, Inc. Storage unit communication within a storage system
US20190141131A1 (en) * 2015-04-09 2019-05-09 Pure Storage, Inc. Point to point based backend communication layer for storage processing
US10055420B1 (en) 2015-06-30 2018-08-21 EMC IP Holding Company LLC Method to optimize random IOS of a storage device for multiple versions of backups using incremental metadata
US9996426B1 (en) * 2015-06-30 2018-06-12 EMC IP Holding Company LLC Sparse segment trees for high metadata churn workloads
US20180157432A1 (en) * 2015-11-01 2018-06-07 International Business Machines Corporation PDSE extended generation grouping for tiered storage systems
US9927989B2 (en) * 2015-11-01 2018-03-27 International Business Machines Corporation PDSE extended generation grouping for tiered storage systems
US20170123692A1 (en) * 2015-11-01 2017-05-04 International Business Machines Corporation PDSE extended generation grouping for tiered storage systems
US11237735B2 (en) * 2015-11-01 2022-02-01 International Business Machines Corporation PDSE extended generation grouping for tiered storage systems
US20230239547A1 (en) * 2016-12-09 2023-07-27 The Nielsen Company (US), LLC Scalable architectures for reference signature matching and updating
US11381642B2 (en) * 2017-11-06 2022-07-05 Nippon Telegraph And Telephone Corporation Distributed storage system suitable for sensor data
CN108418874A (en) * 2018-02-12 2018-08-17 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus, computer device and storage medium for guiding data return across a wide area network
CN110311953A (en) * 2019-05-24 2019-10-08 Hangzhou Network Media Co., Ltd. Media article upload and storage system and method
CN110737547A (en) * 2019-10-22 2020-01-31 4Paradigm (Beijing) Technology Co., Ltd. Method and apparatus for restoring an in-memory database using non-volatile memory (NVM)
CN111061681A (en) * 2019-11-15 2020-04-24 Inspur Electronic Information Industry Co., Ltd. Method, device and storage medium for partitioning a directory based on case insensitivity

Also Published As

Publication number Publication date
US20180322152A1 (en) 2018-11-08
US10853339B2 (en) 2020-12-01
US10031933B2 (en) 2018-07-24
US10430397B2 (en) 2019-10-01
US20150249618A1 (en) 2015-09-03
US20200026694A1 (en) 2020-01-23

Similar Documents

Publication Publication Date Title
US20150248443A1 (en) Hierarchical host-based storage
US10013317B1 (en) Restoring a volume in a storage system
US10789217B2 (en) Hierarchical namespace with strong consistency and horizontal scalability
US8600949B2 (en) Deduplication in an extent-based architecture
US8694469B2 (en) Cloud synthetic backups
US7376796B2 (en) Lightweight coherency control protocol for clustered storage system
US10210191B2 (en) Accelerated access to objects in an object store implemented utilizing a file storage system
US10102211B2 (en) Systems and methods for multi-threaded shadow migration
US11297031B2 (en) Hierarchical namespace service with distributed name resolution caching and synchronization
US11106625B2 (en) Enabling a Hadoop file system with POSIX compliance
US11055265B2 (en) Scale out chunk store to multiple nodes to allow concurrent deduplication
US11210006B2 (en) Distributed scalable storage
US20190258604A1 (en) System and method for implementing a quota system in a distributed file system
US20230359374A1 (en) Method and System for Dynamic Storage Scaling
US20140181036A1 (en) Log consolidation
Liu et al. CFS: A distributed file system for large scale container platforms
US8918378B1 (en) Cloning using an extent-based architecture
US10423583B1 (en) Efficient caching and configuration for retrieving data from a storage system
US10223545B1 (en) System and method for creating security slices with storage system resources and related operations relevant in software defined/as-a-service models, on a purpose built backup appliance (PBBA)/protection storage appliance natively
US20200242086A1 (en) Distribution of global namespace to achieve performance and capacity linear scaling in cluster filesystems
US11782882B2 (en) Methods for automated artifact storage management and devices thereof
US11455114B1 (en) Consolidation and migration of cloud data
US10713121B1 (en) Dynamic migration of a cloud based distributed file system metadata server
US20220197860A1 (en) Hybrid snapshot of a global namespace
US20220114139A1 (en) Fractional consistent global snapshots of a distributed namespace

Legal Events

Date Code Title Description
AS Assignment

Owner name: PLEXISTOR LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GOLANDER, AMIT;REEL/FRAME:035136/0039

Effective date: 20150301

AS Assignment

Owner name: NETAPP, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:PLEXISTOR LTD.;REEL/FRAME:043375/0358

Effective date: 20170823

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION