US20170068475A1 - Intra-rack and inter-rack erasure code distribution - Google Patents

Intra-rack and inter-rack erasure code distribution

Info

Publication number
US20170068475A1
Authority
US
United States
Prior art keywords
storage
objects
server
rack
given
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/353,772
Inventor
Danny Harnik
Michael Factor
Dmitry Sotnikov
Paula Ta-Shma
Lukas Kull
Thomas Morf
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US15/353,772
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION reassignment INTERNATIONAL BUSINESS MACHINES CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FACTOR, MICHAEL, TA-SHMA, PAULA, HARNIK, DANNY, SOTNIKOV, DMITRY
Publication of US20170068475A1

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
              • G06F 3/0601 Interfaces specially adapted for storage systems
                • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
                  • G06F 3/0614 Improving the reliability of storage systems
                    • G06F 3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
                • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
                  • G06F 3/0638 Organizing or formatting or addressing of data
                    • G06F 3/064 Management of blocks
                  • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
                    • G06F 3/065 Replication mechanisms
                • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
                  • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
          • G06F 11/00 Error detection; Error correction; Monitoring
            • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
              • G06F 11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
                • G06F 11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
                  • G06F 11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
                    • G06F 11/1088 Reconstruction on already foreseen single or plurality of spare disks
                    • G06F 11/1092 Rebuilding, e.g. when physically replacing a failing disk
    • H ELECTRICITY
      • H03 ELECTRONIC CIRCUITRY
        • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
          • H03M 13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
            • H03M 13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
              • H03M 13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
                • H03M 13/13 Linear codes
                  • H03M 13/15 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
                    • H03M 13/151 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
                      • H03M 13/154 Error and erasure correction, e.g. by using the error and erasure locator or Forney polynomial

Definitions

  • The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • FIG. 4 is a block diagram showing sets 70 of the storage objects stored in the storage facility, in accordance with an embodiment of the present invention.
  • The sets can be differentiated by appending a letter to the identifying numeral, so that the sets comprise sets 70A-70C.
  • Set 70A represents object A and comprises storage objects A1, A2, A3 and A4.
  • Set 70B represents object B and comprises storage objects B1, B2, B3 and B4.
  • Set 70C represents object C and comprises storage objects C1, C2, C3 and C4.
  • Storage objects A1, A2, B1, B2, C1, and C2 comprise data objects 56, and storage objects A3, A4, B3, B4, C3, and C4 comprise (inter-rack) protection objects 56.
  • FIG. 5 is a flow diagram that schematically illustrates a method of creating and managing intra-rack protection objects 56, in accordance with an embodiment of the present invention.
  • Processor 32 detects multiple sets 70.
  • Each set 70 comprises one or more data objects 56 and one or more protection objects 56, each of the data and the protection objects stored on separate server racks 38.
  • For example, set 70A comprises storage object 56A that is stored in server rack 38A, storage object 56B that is stored in server rack 38B, storage object 56C that is stored in server rack 38C, and storage object 56D that is stored in server rack 38D.
  • Protection objects 56C and 56D protect set 70A.
  • If a given storage object in set 70A cannot be read, its contents can be recovered using data stored in the other storage objects in set 70A.
  • Processor 32 identifies, in a given server rack 38, a specified number of storage objects that are stored in separate server computers in the given server rack. For example, if the specified number is three and the given server rack is server rack 38A, then the identified storage objects comprise storage objects A1, B1 and C1.
  • Processor 32 identifies, in the given server rack, one or more server computers 48 not storing any of the identified storage objects. For example, in the configuration shown in FIG. 1, processor 32 can identify either server computer 48B or server computer 48E.
  • Processor 32 creates and manages, on each of the one or more identified server computers, an additional protection object 56 that is configured to protect the identified storage objects, and the method ends. (These steps are sketched in code following this list.)
  • To create the additional protection object, processor 32 calculates erasure correction codes based on contents of the identified storage objects, and to manage the one or more additional protection objects, the management processor updates the erasure correction codes upon detecting any changes to the identified storage objects.
  • The additional protection object protects the identified storage objects in the given rack.
  • For example, protection object R1 protects storage objects A1, B1, C1 and D1.
  • Storage objects A1, B1, C1, D1 and R1 can be referred to as a rack set. Therefore, if a given storage object 56 in the rack set cannot be read, contents of the given storage object can be recovered using data stored in the other storage objects 56 in the rack set.
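The steps just described for FIG. 5 (identify a specified number of storage objects stored on distinct server computers within a rack, find a server computer that holds none of them, and create the additional protection object there) can be outlined roughly as follows. The object model and the XOR parity used here are illustrative assumptions for the sketch, not structures defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class StoredObject:
    name: str      # e.g. "A1"
    server: str    # server computer holding the object, e.g. "48A"
    data: bytes

def create_intra_rack_protection(rack_objects, rack_servers, count):
    """Sketch of FIG. 5: choose `count` storage objects on distinct servers, choose a
    server computer that stores none of them, and build an XOR protection object."""
    chosen, used = [], set()
    for obj in rack_objects:
        if obj.server not in used:
            chosen.append(obj)
            used.add(obj.server)
        if len(chosen) == count:
            break
    free = [s for s in rack_servers if s not in used]
    if len(chosen) < count or not free:
        return None                              # nothing to protect in this rack
    width = max(len(o.data) for o in chosen)
    parity = bytearray(width)                    # zero-padded XOR parity
    for o in chosen:
        for i, b in enumerate(o.data):
            parity[i] ^= b
    return StoredObject("R", free[0], bytes(parity))

rack_38a = [
    StoredObject("A1", "48A", b"aaaa"),
    StoredObject("B1", "48D", b"bbbb"),
    StoredObject("C1", "48C", b"cccc"),
]
r1 = create_intra_rack_protection(rack_38a, ["48A", "48B", "48C", "48D", "48E"], 3)
print(r1.server)   # "48B": the first server computer holding none of A1, B1, C1
```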
  • FIG. 6 is a block diagram showing a first distribution of storage objects 56 among server computers 48 prior to performing the steps described in FIG. 5, in accordance with an embodiment of the present invention.
  • The storage objects comprise sets 70A, 70B and 70C.
  • FIG. 7 is a block diagram showing a second distribution of storage objects 56 among server computers 48 in the data facility subsequent to performing the steps described in FIG. 5, in accordance with an embodiment of the present invention.
  • FIG. 7 shows protection objects R1, R2, R3 and R4, one stored in each server rack 38.
  • Processor 32 can maintain a list of storage objects within each rack 38 that are candidates for intra-rack erasure coding, including their sizes and respective server computers 48 and/or storage devices 54.
  • Facility 20 can maintain a “rack candidate list” using centralized or distributed logic.
  • Processor 32 can create the additional intra-rack parity objects in a “lazy” manner, because while their presence reduces the amount of inter-rack communication needed for recovery, recovery is possible without them. Therefore, processor 32 does not need to create the additional intra-rack protection objects immediately (or at all), and can create them later in a number of scenarios.
  • Another enhancement for “lazy” intra-rack parity creation comprises supporting both lazy and immediate protection object creation.
  • Processor 32 can use the lazy mechanism as a default, but in case of data compromise in the cloud data center, the management processor can upgrade some of the protection object creation operations to be performed immediately.
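One minimal way to support both behaviors described above (lazy creation by default, immediate creation when needed) is a pending queue plus an escalation flag. The names below are assumptions made for the sketch, not identifiers from the patent.

```python
from collections import deque

pending_rack_sets = deque()    # rack sets still waiting for their intra-rack parity

def schedule_protection(rack_set, build, immediate=False):
    """Default to lazy creation; escalate to immediate creation, e.g. after a data
    compromise in the cloud data center, by building the protection object now."""
    if immediate:
        build(rack_set)                       # create the intra-rack protection object
    else:
        pending_rack_sets.append(rack_set)    # defer, e.g. to an idle or batch period
```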
  • Upon a failure in a first server rack 38, processor 32 can regenerate the missing data from a second server rack 38 (e.g., rack 38B).
  • Processor 32 can also quickly upgrade the internal parity creation of any relevant storage objects 56 in the second server rack, thereby reducing the probability of data loss due to a potential failure in the second server rack. Note that this expedited protection block creation should not interfere with the recovery of data to the first rack, since both processes attempt to read the same affected data from a given storage device in the second rack, and thus can “piggyback” their respective reads.
  • Intra-rack protection is based upon data that is independent from any inter-rack parity protection. Therefore, the same logic in ECC management application 36 can be used for both intra-rack and inter-rack protection without loss of redundancy.
  • Overwriting or migrating a given storage object 56 is equivalent to deleting the given storage object and creating a new storage object. If a given storage object 56 is deleted, its respective erasure correction codes in a given inter-rack protection object 56 may also be deleted. However, each of these inter-rack blocks now also belongs to a given intra-rack protection object 56, so processor 32 can handle this situation in one of several alternative ways.
  • Policies can be defined for prioritizing the inclusion of storage objects 56 into a given intra-rack protection object.
  • Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • The functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods, computing systems and computer program products implement embodiments of the present invention that include detecting multiple sets of storage objects stored in a data facility including multiple server racks, each of the server racks including a plurality of server computers, each of the storage objects in each set being stored in a separate one of the server racks and including one or more data objects and one or more protection objects. A specified number of the storage objects are identified in a given server rack, each of the identified storage objects being stored in a separate one of the server computers, and one or more server computers in the given server rack not storing any of the identified storage objects are identified. Finally, in the identified one or more server computers, an additional protection object is created and managed for the identified storage objects.

Description

    FIELD OF THE INVENTION
  • The present invention relates generally to erasure codes and specifically to local recovery for distributed erasure codes.
  • BACKGROUND
  • Erasure coding is a technique used to greatly reduce the storage space required to safely store a dataset. For example, compared to three-way data replication that has an overhead of 200% and can survive two failures, a 10:4 Reed-Solomon erasure correction code (which divides the data into ten blocks and adds four parity blocks) has an overhead of 40% and can survive four failures. To maximize survivability, each of the replicas or different blocks of the erasure coded data are placed in different failure domains, where a failure domain at scale would be different racks or even different aisles within a data center. Typically, the distribution of replicas or blocks is implemented in a declustered configuration, so that the data on a given storage device can be protected by a large number of other storage devices.
  • To recover from a failure with simple replication, data from a surviving replica is read. In other words, the amount of data that must be read to recover from a storage device failure (the most common non-transient failure) is the amount of data that was on the failed device. At scale, where a failure domain is a rack, the amount of data that must cross the aggregation network switches between the racks is proportional to the data on the failed drive. By contrast, with k:r erasure coding, the amount of data that must be read and transferred over the aggregation switches is k times the amount of data on the failed device.
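As a rough illustration of the storage-overhead and recovery-traffic tradeoff described above, the following sketch compares three-way replication with a k:r erasure code; the figures for the 10:4 Reed-Solomon example match the 200% versus 40% overheads quoted in the background. The helper names are illustrative only and are not part of the patent.

```python
def replication_costs(copies: int, failed_bytes: int):
    """Storage overhead and recovery read traffic for simple replication."""
    overhead = (copies - 1) * 100      # extra copies, as a percent of the user data
    recovery_read = failed_bytes       # read one surviving replica
    return overhead, recovery_read

def erasure_costs(k: int, r: int, failed_bytes: int):
    """Storage overhead and recovery read traffic for a k:r erasure code."""
    overhead = r / k * 100             # parity blocks, as a percent of the user data
    recovery_read = k * failed_bytes   # k surviving blocks must be read to rebuild one
    return overhead, recovery_read

# 3-way replication vs. the 10:4 Reed-Solomon example from the text,
# recovering 1 TB of data lost on a failed device.
print(replication_costs(3, 10**12))    # (200, 1000000000000)
print(erasure_costs(10, 4, 10**12))    # (40.0, 10000000000000)
```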
  • The description above is presented as a general overview of related art in this field and should not be construed as an admission that any of the information it contains constitutes prior art against the present patent application.
  • SUMMARY
  • There is provided, in accordance with an embodiment of the present invention a method, including detecting multiple sets of storage objects stored in a data facility including multiple server racks, each of the server racks including a plurality of server computers, each of the storage objects in each given set being stored in separate server racks and including one or more data objects and one or more protection objects for the given set, identifying, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers, identifying one or more server computers in the given server rack not storing any of the identified storage objects, and creating and managing, in the identified one or more server computers, an additional protection object for the identified storage objects.
  • There is also provided, in accordance with an embodiment of the present invention, a storage facility, including multiple server racks, each of the server racks including a plurality of server computers, and a processor configured to detect multiple sets of storage objects, each of the storage objects in each given set being stored in separate server racks and including one or more data objects and one or more protection objects for the given set, to identify, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers, to identify one or more server computers in the given server rack not storing any of the identified storage objects, and to create and manage, in the identified one or more server computers, an additional protection object for the identified storage objects.
  • There is further provided, in accordance with an embodiment of the present invention a computer program product, the computer program product including a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code including computer readable program code configured to detect multiple sets of storage objects stored in a data facility including multiple server racks, each of the server racks including a plurality of server computers, each of the storage objects in each given set being stored in separate server racks and including one or more data objects and one or more protection objects for the given set, computer readable program code configured to identify, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers, computer readable program code configured to identify one or more server computers in the given server rack not storing any of the identified storage objects, and computer readable program code configured to create and manage, in the identified one or more server computers, an additional protection object for the identified storage objects.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The disclosure is herein described, by way of example only, with reference to the accompanying drawings, wherein:
  • FIGS. 1A-1C are block diagrams of a data facility configured to perform local recovery using distributed erasure correction codes, in accordance with an embodiment of the present invention;
  • FIG. 2 is a block diagram detailing a first given storage object stored in the data facility and configured as a data object, in accordance with a first embodiment of the present invention;
  • FIG. 3 is a block diagram detailing a second given storage object stored in the data facility and configured as a protection object, in accordance with a second embodiment of the present invention;
  • FIG. 4 is a block diagram showing sets of the storage objects stored in the storage facility, in accordance with an embodiment of the present invention;
  • FIG. 5 is a flow diagram that schematically illustrates a method of creating and managing intra-rack erasure correction codes in the data facility, in accordance with an embodiment of the present invention;
  • FIG. 6 is a block diagram showing a first distribution of the storage objects among server computers in the data facility prior to creating intra-rack protection objects, in accordance with an embodiment of the present invention; and
  • FIG. 7 is a block diagram showing a second distribution of the storage objects among the server computers in the data facility subsequent to creating intra-rack protection objects, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS
  • Overview
  • Data centers such as cloud data centers typically comprise multiple server racks, wherein each rack comprises multiple server computers, and wherein each of the server computers comprises one or more storage devices. Typically, replication or erasure coding is used to protect individual storage objects on the storage devices. These individual storage objects can be either replicated or divided into storage units (i.e., physical blocks of data), and protected by a set of linear equations performed on data in the storage units.
  • Each of the server racks typically comprises a plurality of server computers connected to a top-of-rack network switch. The multiple server racks can be connected via an aggregation switch to communicate with server computers in other server racks. Typically, intra-rack bandwidth is higher and more plentiful than inter-rack bandwidth.
  • Embodiments of the present invention provide methods and systems for combining inter-rack (i.e., cross failure zone) erasure coding of individual storage objects with orthogonal inter-object, intra-rack erasure coding. Using orthogonal inter-object, intra-rack erasure coding enables recovering from a single storage failure based upon data stored in the same server rack as the failed storage device. This may reduce the inter-rack bandwidth to a bandwidth that is less than or equal to the bandwidth for replication (i.e., there may be a need to transfer the data via the aggregation switch if the data is to be recovered in a different server rack).
  • Systems can implement embodiments described herein with either a small increase in space overhead or a small decrease in resiliency. For example, implementing 10:3 inter-rack resiliency plus one-failure intra-rack resiliency can enable a storage facility to survive four storage device failures, but only enable the facility to survive three server rack failures.
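To make the space/resiliency tradeoff above concrete, the sketch below compares a 10:4 inter-rack code against a 10:3 inter-rack code with one additional intra-rack parity object per rack set. The model, in particular the assumption that one intra-rack parity object covers a rack set of `w` stored objects, is an illustrative reading of the text, not a formula given by the patent.

```python
def inter_rack_overhead(k: int, r: int) -> float:
    """Space overhead of a k:r inter-rack erasure code, as a percent of user data."""
    return r / k * 100

def combined_overhead(k: int, r: int, w: int) -> float:
    """Overhead of a k:r inter-rack code plus one intra-rack parity per rack set of
    w stored objects (assumption: every stored object, data or parity, joins a set)."""
    inter = r / k
    intra = (1 + inter) / w              # one extra object per w stored objects
    return (inter + intra) * 100

print(inter_rack_overhead(10, 4))              # 40.0 -> survives 4 device or rack failures
print(round(combined_overhead(10, 3, 10), 1))  # 43.0 -> survives 4 device failures,
                                               #         but only 3 whole-rack failures
```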
  • System Description
  • FIGS. 1A-1C, referred to collectively as FIG. 1, are block diagrams of a storage facility 20 configured to perform local recovery using distributed erasure correction codes, in accordance with an embodiment of the present invention. Facility 20 comprises a local data center 22 and a cloud data center 24 that communicate via Internet 26.
  • Local data center 22 comprises one or more host computers 28 (e.g., database servers and/or e-mail servers) and a management system 30 that communicate via a local area network (LAN) 31. LAN 31 couples local data center 22 to Internet 26.
  • Management system 30 comprises a management processor 32 and a management memory 34. Processor 32 executes an erasure correction code (ECC) management application 36 from memory 34. In operation, ECC management application 36 manages the distributed erasure correction codes, as described hereinbelow.
  • Cloud data center 24 comprises multiple server racks 38. In the example shown in FIG. 1, server racks 38 and their respective components can be differentiated by appending a letter to the identifying numeral, so that the server racks comprise server racks 38A-38D. Server racks 38A and 38B communicate via an aggregation switch 40, and server racks 38C and 38D communicate via an aggregation switch 42. Aggregation switches 40 and 42 communicate via a data center switch 44 that also couples cloud data center 24 to Internet 26.
  • Each server rack 38 comprises a top-of-rack switch (TOR) 46, and multiple server computers 48. In a given server rack 38, server computers 48 communicate with each other via top-of-rack switch 46, which is also coupled to a given aggregation switch (i.e., switch 40 or 42, depending on the given server rack). Typically intra-rack bandwidth (i.e., bandwidth between two server computers in the same server rack 38 that communicate via switch 46) is higher and more plentiful than inter-rack bandwidth (i.e., bandwidth between two server computers in different server racks 38 that communicate via switch 40).
  • Each server computer 48 comprises a server processor 50, a server memory 52, and one or more storage devices 54 such as hard disks and solid-state disks that store storage objects 56, which are described in detail hereinbelow. In the configuration shown in FIG. 1, storage device 54A stores storage object 56A that is also referred to herein as storage object A1, storage device 54B stores storage object 56B that is also referred to herein as storage object R1, storage device 54C stores storage object 56C that is also referred to herein as storage object C1, storage device 54D stores storage object 56D that is also referred to herein as storage object B1, storage device 54G stores storage object 56G that is also referred to herein as storage object B2, storage device 54H stores storage object 56H that is also referred to herein as storage object C2, storage device 54I stores storage object 56I that is also referred to herein as storage object A2, storage device 54J stores storage object 56J that is also referred to herein as storage object R2, storage device 54K stores storage object 56K that is also referred to herein as storage object B3, storage device 54M stores storage object 56M that is also referred to herein as storage object A3, storage device 54N stores storage object 56N that is also referred to herein as storage object R3, storage device 54O stores storage object 56O that is also referred to herein as storage object C3, storage device 54P stores storage object 56P that is also referred to herein as storage object R4, storage device 54Q stores storage object 56Q that is also referred to herein as storage object C4, storage device 54S stores storage object 56S that is also referred to herein as storage object A4, and storage device 54T stores storage object 56T that is also referred to herein as storage object B4.
  • In the example shown in FIG. 1, there are three objects A (comprising storage objects A1-A4), B (comprising storage objects B1-B4) and C (comprising storage objects C1-C4), and cloud data center 24 implements an inter-rack 2:2 erasure code so that each of the objects comprises two data objects and two protection objects. While the example in FIG. 1 shows an inter-rack 2:2 erasure code for sake of simplicity, any k:r code is considered to be within the spirit and scope of the present invention.
  • As described in the description referencing FIGS. 2 and 3 hereinbelow, storage objects 56 storing user data may also be referred to herein as data objects 56, and storage objects storing protection data such as erasure correction codes may also be referred to herein as protection objects 56.
  • Additionally, protection objects 56 that protect storage objects that are stored in different server racks 38 may also be referred to herein as inter-rack protection objects 56, and protection objects 56 that protect storage objects that are stored in different server computers 48 in a given server rack may also be referred to herein as intra-rack protection objects 56.
  • Therefore, in the example:
      • Storage objects A1 and A2 are the two data objects 56 of object A, and storage objects A3 and A4 are the two protection objects 56 (e.g., parity blocks) of object A.
      • Storage objects B1 and B2 are the two data objects 56 of object B, and storage objects B3 and B4 are the two protection objects 56 of object B.
      • Storage objects C1 and C2 are the two data objects 56 of object C, and storage objects C3 and C4 are the two protection objects 56 of object C.
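The layout that FIG. 1 (and FIG. 7) describes can be restated as a simple placement map. The grouping of blocks into racks below follows the storage-device listing above (A1, R1, C1 and B1 together in one rack, and so on); it is a summary for orientation, not code or data structures from the patent.

```python
# Storage objects per server rack in the FIG. 1 / FIG. 7 example; each object in a
# rack resides on a separate server computer, and RN is that rack's intra-rack
# protection object.
racks = {
    "38A": ["A1", "R1", "C1", "B1"],
    "38B": ["B2", "C2", "A2", "R2"],
    "38C": ["B3", "A3", "R3", "C3"],
    "38D": ["R4", "C4", "A4", "B4"],
}

# Inter-rack 2:2 erasure-coded sets: two data blocks plus two protection blocks,
# with every block of a set stored in a different rack.
objects = {
    "A": {"data": ["A1", "A2"], "protection": ["A3", "A4"]},
    "B": {"data": ["B1", "B2"], "protection": ["B3", "B4"]},
    "C": {"data": ["C1", "C2"], "protection": ["C3", "C4"]},
}
```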
  • To enable recovery of a single failed storage object without resulting in extensive inter-rack communication, embodiments of the present invention add an intra-rack protection object RN (i.e., a given protection object 56), which processor 32 can construct from a linear function of a collection of storage objects 56 within a given server rack 38 so that all of the storage objects are on different server computers 48 (i.e., different storage devices 54). To recover from a single failure of a given server computer 48 or a given storage device 54 (e.g., storage device 54C containing storage object C1), processor 32 can read storage objects A1, R1, and B1 and apply the inverse of the linear function used to construct protection object R1, thereby rebuilding storage object C1. While rebuilding storage object C1, inter-rack communication is required only if the rebuilt storage object C1 should be placed on another rack.
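The intra-rack protection object is only required to be a linear function of the storage objects in the rack. A minimal sketch, using bytewise XOR (the simplest such function, and its own inverse) in place of whatever code an implementation would actually choose, shows the construction of R1 and the local rebuild of C1 described above.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (the simplest linear code)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Intra-rack protection object R1 for rack 38A (A1, B1, C1 live on distinct servers).
a1, b1, c1 = b"object-A1...", b"object-B1...", b"object-C1..."
r1 = xor_blocks(a1, b1, c1)

# The server (or disk) holding C1 fails: rebuild C1 from the surviving objects in the
# same rack; no inter-rack traffic is needed unless the rebuilt copy must move racks.
rebuilt_c1 = xor_blocks(a1, b1, r1)
assert rebuilt_c1 == c1
```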
  • In order to protect against a failure of a given server computer 48, the storage objects in a given server rack 38 that are combined using a linear function to create a given intra-rack protection object 56 typically need to reside on distinct server computers 48.
  • Additionally, if the protection objects comprise any linear codes (e.g. Reed Solomon codes), the codes typically need to have the same size. In systems where this is not the case, this can be handled in different ways, for example (a) padding the smaller storage objects 56 to bring them to the same size as the others, since zero padding does not change the storage object's parity and can be implemented with a negligible increase of physical footprint, and/or (b) combining smaller storage objects to make a larger storage object 56 (a bin packing algorithm can be used to do this efficiently in terms of space usage or other factors).
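Option (a), zero padding, can be sketched as below; because XOR with zero bytes (or, more generally, encoding zeros under any linear code) leaves the parity unchanged, the pad is logical only and never needs to be written to disk. The helper name is illustrative.

```python
def pad_to(block: bytes, size: int) -> bytes:
    """Logically extend a smaller storage object with zero bytes before it joins a
    parity computation; the zeros do not change the resulting parity."""
    return block + bytes(size - len(block))

small, large = b"short", b"a-much-longer-object"
width = max(len(small), len(large))
parity = bytes(x ^ y for x, y in zip(pad_to(small, width), pad_to(large, width)))
```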
  • Furthermore, the width of a given intra-rack protection object 56 may be less than or equal to the number of server computers in a given server rack 38.
  • FIG. 2 is a block diagram detailing storage object 56A, in accordance with a first embodiment of the present invention, and FIG. 3 is a block diagram detailing storage object 56N, in accordance with a second embodiment of the present invention. Storage objects 56 comprise multiple storage units 60 that comprise physical blocks of data on storage devices 54. Since storage units 60 in storage object 56A comprise user data, storage object 56A may also be referred to as data object 56A. Likewise, since storage units 60 in storage object 56N comprise erasure correction codes, storage object 56N may also be referred to as protection object 56N.
  • In operation, processor 32 manages a given protection object 56 for multiple storage objects 56 by calculating erasure correction codes for corresponding contents in the multiple storage objects. For a given data object 56, the contents comprise user data, and for a given protection object 56, the contents comprise erasure correction codes.
  • While embodiments herein describe protection objects 56 using erasure correction codes, other error correction mechanisms are considered to be within the spirit and scope of the present invention. For example, a given protection object 56 (i.e., either intra-rack or inter-rack) may comprise a replication of a given data object.
  • In embodiments described herein, processor 32 creates and manages intra-rack protection objects 56. In some embodiments, local data center 22 can configure management system 30 to intercept write requests from host computer 28, and update the intra-rack protection objects 56 as necessary. In alternative embodiments, processor 32 can monitor data objects 56, and update intra-rack protection objects 56 as necessary. In further embodiments, the functionality of management system 30 can be performed by a given server processor 50, or by a virtual machine instance executing in a given memory 52. Additionally, while embodiments herein describe creating a single intra-rack protection object 56 for a given server rack 38, creating and managing multiple intra-rack protection objects for a given server rack 38 is considered to be within the spirit and scope of the present invention.
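When management system 30 (or a server processor 50) intercepts a write, an XOR-based intra-rack protection object can be kept current with the usual read-modify-write parity update, touching only the old data and the parity. This is a hedged sketch under the XOR assumption used above, not code from the patent.

```python
def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Incrementally update an XOR intra-rack protection object after a write:
    new_parity = old_parity XOR old_data XOR new_data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# On an intercepted write to A1, only A1's old contents and R1 are read; B1 and C1
# are untouched, and no inter-rack traffic is generated.
```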
  • Processors 32 and 50 typically comprise general-purpose computers, which are programmed in software to carry out the functions described herein. The software may be downloaded to system 22 and server computers 48 in electronic form, over a network, for example, or it may be provided on non-transitory tangible media, such as optical, magnetic or electronic memory media. Alternatively, some or all of the functions of processors 32 and 50 may be carried out by dedicated or programmable digital hardware components, or using a combination of hardware and software elements.
  • The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • Intra-Rack Erasure Correction Codes
  • FIG. 4 is a block diagram showing sets 70 of the storage objects stored in the storage facility, in accordance with an embodiment of the present invention. In the example shown in FIG. 4, sets can be differentiated by appending a letter to the identifying numeral, so that the sets comprise sets 70A-70C.
  • Set 70A represents object A and comprises storage objects A1, A2, A3 and A4. Set 70B represents object B and comprises storage objects B1, B2, B3 and B4. Set 70C represents object C and comprises storage objects C1, C2, C3 and C4. As described supra, storage objects A1, A2, B1, B2, C1, and C2 comprise data objects 56, and storage objects A3, A4, B3, B4, C3, and C4 comprise (inter-rack) protection objects 56.
  • FIG. 5 is a flow diagram that schematically illustrates a method of creating and managing intra-rack protection objects 56, in accordance with an embodiment of the present invention. In a detection step 80, processor 32 detects multiple sets 70. As described supra, each set 70 comprises one or more data objects 56 and one or more protection objects 56, each of the data and the protection objects stored on separate server racks 38. For example, set 70A comprises data object 56A that is stored in server rack 38A, data object 56B that is stored in server rack 38B, protection object 56C that is stored in server rack 38C, and protection object 56D that is stored in server rack 38D. Protection objects 56C and 56D protect set 70A. In other words, if a given storage object 56 in set 70A cannot be read, then contents of the given storage object can be recovered using data stored in the other storage objects in set 70A.
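  • A minimal sketch of detection step 80 is shown below, assuming the facility exposes a catalog that maps each storage object to its rack and server. The `Placement` record and `detect_sets` function are hypothetical illustrations, not identifiers from the specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    """Location of one storage object within the facility (illustrative)."""
    object_id: str   # e.g. "A1"
    set_id: str      # e.g. "70A"
    rack: str        # e.g. "38A"
    server: str      # e.g. "48A"

def detect_sets(catalog: list[Placement]) -> dict[str, list[Placement]]:
    """Group storage objects by set and keep only sets whose members
    are stored on separate server racks (detection step 80)."""
    sets: dict[str, list[Placement]] = {}
    for placement in catalog:
        sets.setdefault(placement.set_id, []).append(placement)
    return {
        set_id: members
        for set_id, members in sets.items()
        if len({m.rack for m in members}) == len(members)
    }
```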
  • In a first identification step 82, processor 32 identifies, in a given server rack 38, a specified number of storage objects that are stored in separate server computers in the given server rack. For example, if the specified number is three and the given server rack is server rack 38A, then the identified storage objects comprise storage objects A1, B1 and C1.
  • In a second identification step 84, processor 32 identifies, in the given server rack, one or more server computers 48 not storing any of the identified storage objects. For example, in the configuration shown in FIG. 1, processor 32 can identify either server computer 48B or server computer 48E. Finally, in a creation step 86, using embodiments described herein, processor 32 creates and manages, on each of the one or more identified server computers, an additional protection object 56 that is configured to protect the identified storage objects, and the method ends. To create the one or more additional protection objects, processor 32 calculates erasure correction codes based on contents of the identified storage objects, and to manage the one or more additional protection objects, the management processor updates the erasure correction codes upon detecting any changes to the identified storage objects.
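  • Identification steps 82 and 84 and the placement decision for creation step 86 can be sketched as follows. The function name, the server identifiers, and the example layout are assumptions made for illustration; the actual rack layout comes from FIG. 1.

```python
def plan_intra_rack_protection(rack_objects: dict[str, str],
                               rack_servers: set[str],
                               count: int) -> tuple[list[str], list[str]]:
    """Steps 82-86 (sketch): pick `count` storage objects that live on
    distinct servers of one rack, then pick the servers in that rack that
    hold none of them as hosts for the additional protection object.

    rack_objects maps object id -> server id within the given rack,
    e.g. {"A1": "48A", "B1": "48C", "C1": "48D"}.
    """
    identified: list[str] = []
    used_servers: set[str] = set()
    for obj, server in rack_objects.items():            # step 82
        if server not in used_servers:
            identified.append(obj)
            used_servers.add(server)
        if len(identified) == count:
            break
    if len(identified) < count:
        raise RuntimeError("not enough objects on distinct servers")
    free_servers = sorted(rack_servers - used_servers)   # step 84
    return identified, free_servers                      # step 86 builds the
                                                          # protection object on
                                                          # one of free_servers


# Illustrative layout for rack 38A (hypothetical identifiers).
objs, hosts = plan_intra_rack_protection(
    {"A1": "48A", "B1": "48C", "C1": "48D"},
    rack_servers={"48A", "48B", "48C", "48D", "48E"},
    count=3,
)
```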
  • The additional protection object protects the identified storage objects in the given rack. For example, in rack 38A, protection object R1 protects storage objects A1, B1 and C1, and storage objects A1, B1, C1 and R1 can be referred to as a rack set. Therefore, if a given storage object 56 in the rack set cannot be read, contents of the given storage object can be recovered using data stored in the other storage objects 56 in the rack set.
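  • For the XOR-parity illustration used above, recovering an unreadable member of a rack set amounts to XOR-ing the surviving members; the sketch below assumes a single-parity rack set with equally sized objects, and its name is hypothetical.

```python
def recover_from_rack_set(surviving_objects: list[bytes]) -> bytes:
    """Reconstruct the single unreadable member of a rack set protected by
    XOR parity, using the surviving data and parity objects."""
    if not surviving_objects:
        raise ValueError("nothing to recover from")
    size = len(surviving_objects[0])
    recovered = bytearray(size)
    for obj in surviving_objects:
        if len(obj) != size:
            raise ValueError("rack set members must be equally sized")
        for i, byte in enumerate(obj):
            recovered[i] ^= byte
    return bytes(recovered)
```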
  • FIG. 6 is a block diagram showing a first distribution of storage objects 56 among server computers 48 prior to performing the steps described in FIG. 5, in accordance with an embodiment of the present invention. In FIG. 6, the storage objects comprise sets 70A, 70B and 70C.
  • FIG. 7 is a block diagram showing a second distribution of storage objects 56 among server computers 48 in the data facility subsequent to performing the steps described in FIG. 5, in accordance with an embodiment of the present invention. In addition to sets 70A-70C, FIG. 7 shows protection objects R1, R2, R3 and R4, one stored in each server rack 38.
  • In operation, processor 32 can maintain a list of storage objects within each rack 38 that are candidates for intra-rack erasure coding, including their sizes and respective server computers 48 and/or storage devices 54. Facility 20 can maintain a “rack candidate list” using centralized or distributed logic.
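  • One possible shape for such a rack candidate list, keyed by rack and recording the size and location of each candidate object, is sketched below; the class and field names are illustrative assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """One storage object eligible for intra-rack erasure coding (sketch)."""
    object_id: str
    size_bytes: int
    server: str
    device: str

@dataclass
class RackCandidateList:
    """Per-rack list of intra-rack coding candidates (illustrative only)."""
    candidates: dict[str, list[Candidate]] = field(default_factory=dict)

    def add(self, rack: str, candidate: Candidate) -> None:
        self.candidates.setdefault(rack, []).append(candidate)

    def similar_sized(self, rack: str, size: int, tolerance: float = 0.1) -> list[Candidate]:
        """Return candidates in `rack` whose size is within `tolerance` of `size`."""
        return [c for c in self.candidates.get(rack, [])
                if abs(c.size_bytes - size) <= tolerance * size]
```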
  • In some embodiments, processor 32 can create the additional intra-rack parity objects in a “lazy” manner, because while their presence reduces the amount of inter-rack communication needed for recovery, recovery is possible without them. Therefore, processor 32 does not need to create the additional intra-rack protection objects immediately (or at all), and can create them in any of the following scenarios (a decision sketch follows this list):
      • Processor 32 detects a suitable set of similarly sized storage units 60 available on a distinct server computer 48.
      • Processor 32 detects sufficient system resources to read the storage objects, compute the erasure correction codes and store the computed codes.
      • Processor 32 determines that workload parameters suggest that creating the intra-rack protection objects is beneficial to the facility.
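  • A minimal sketch of that lazy-creation decision treats the three scenarios as independent predicates; every parameter name and threshold below is an assumption made for illustration.

```python
def should_create_lazy_protection(similar_sized_candidates: int,
                                  required_candidates: int,
                                  cpu_idle_fraction: float,
                                  io_idle_fraction: float,
                                  expected_recovery_benefit: float) -> bool:
    """Decide whether to build an intra-rack protection object now.

    Returns True when (1) enough similarly sized candidates exist on
    distinct servers, (2) spare CPU and I/O resources are available, and
    (3) workload parameters suggest the protection object pays off.
    The thresholds are arbitrary placeholders.
    """
    has_candidates = similar_sized_candidates >= required_candidates
    has_resources = cpu_idle_fraction > 0.25 and io_idle_fraction > 0.25
    is_beneficial = expected_recovery_benefit > 1.0
    return has_candidates and has_resources and is_beneficial
```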
  • Another enhancement for “lazy” intra-rack parity creation comprises supporting both lazy and immediate protection object creation. Processor 32 can use the lazy mechanism as a default, but in case of data compromise in the cloud data center, the management processor can upgrade some of the protection object creation operations to be performed immediately.
  • For example, if a failure occurred in a first server rack (e.g., rack 38A) before processor 32 created intra-rack protection object R1, then processor 32 can regenerate the missing data from a second server rack 38 (e.g., rack 38B). At this point, processor 32 can quickly upgrade the internal parity creation of any relevant storage objects 56 in the second server rack, thereby reducing the probability of data loss due to a potential failure in the second server rack. Note that this expedited protection block creation should not interfere with the recovery of data to the first rack, since both processes attempt to read the same affected data from a given storage device in the second rack, and thus can “piggyback” their respective reads.
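  • A sketch of that escalation policy is given below, assuming a queue of pending lazy parity-creation tasks keyed by rack; the task structure and all names are hypothetical.

```python
from collections import deque

def upgrade_parity_tasks_for_rack(pending_tasks: deque, affected_rack: str) -> list:
    """After a failure that is being repaired using data from `affected_rack`,
    pull every pending lazy parity-creation task touching that rack out of
    the queue so it can run immediately, piggybacking on the recovery reads.
    """
    urgent, remaining = [], deque()
    while pending_tasks:
        task = pending_tasks.popleft()           # task: (rack, object_ids)
        (urgent if task[0] == affected_rack else remaining).append(task)
    pending_tasks.extend(remaining)              # the rest stay lazy
    return urgent


# Illustrative use: rack 38B supplies data for recovering rack 38A.
queue = deque([("38B", ["A2", "B2", "C2"]), ("38C", ["A3", "B3", "C3"])])
run_now = upgrade_parity_tasks_for_rack(queue, "38B")
```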
  • In operation, intra-rack protection is based upon data that is independent from any inter-rack parity protection. Therefore, the same logic in ECC management application 36 can be used for both intra-rack and inter-rack protection without loss of redundancy.
  • In some embodiments, overwriting or migrating a given storage object 56 is equivalent to deleting the given storage object and creating a new storage object. If a given storage object 56 is deleted, its respective erasure correction codes in a given inter-rack protection object 56 may also be deleted. However, each of these inter-rack blocks now also belongs to a given intra-rack protection object 56. To handle this situation, processor 32 can perform one of the following alternative operations (a dispatch sketch follows this list):
      • “A”. Abandon the given intra-rack protection object, delete the protection object's erasure correction codes, and return the other storage objects that were covered by the given intra-rack protection object to the server rack's candidate list.
      • “B”. Maintain the given intra-rack protection object, mark the deleted storage units 60 as deleted in the given protection object, but retain the given protection object's erasure correction codes to support the other storage objects 56 covered by the codes.
      • “C”. Apply operation “A” described hereinabove if a certain percentage of the erasure correction codes in the given protection object have been deleted. Otherwise apply operation “B” described hereinabove.
      • “D”. Place the given intra-rack protection object on a list for rebuilding. Therefore, when new additional storage objects 56 (i.e., storage units 60) become available on a given storage device 54 that previously stored the deleted storage object, use the new storage objects as a replacement in the intra-rack protection object.
      • “E”. Replace the deleted storage object's contents with a known fixed byte sequence and delete the storage object. This reduces the efficiency of the intra-rack encoding but does not “waste” space for storing a deleted storage object 56.
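  • The following sketch dispatches among alternatives “A” through “E” for a deleted member of an intra-rack protection object; every name and threshold here is an assumption made for illustration, not part of the specification.

```python
from enum import Enum, auto

class DeleteAction(Enum):
    ABANDON = auto()         # "A": drop the intra-rack protection object
    MAINTAIN = auto()        # "B": keep the codes, mark the member deleted
    REBUILD_LATER = auto()   # "D": queue the protection object for rebuild
    ZERO_FILL = auto()       # "E": treat the member as a fixed byte pattern

def choose_delete_action(deleted_members: int,
                         total_members: int,
                         abandon_threshold: float = 0.5,
                         rebuild_queue_open: bool = True) -> DeleteAction:
    """Pick a handling strategy for a deleted member of an intra-rack
    protection object, roughly following alternatives "A"-"E".
    Alternative "C" is the threshold test that chooses between "A" and "B".
    """
    deleted_fraction = deleted_members / total_members
    if deleted_fraction >= abandon_threshold:       # "C" resolves to "A"
        return DeleteAction.ABANDON
    if rebuild_queue_open:                          # "D"
        return DeleteAction.REBUILD_LATER
    return DeleteAction.MAINTAIN                    # "B" ("E" is a variant)
```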
  • These alternative operations typically trade off space usage against CPU and I/O resources. Additionally, in order to improve reliability, multiple policies can be defined for prioritizing the inclusion of storage objects 56 into a given intra-rack protection object (a scoring sketch follows this list). These policies may include:
      • Assigning higher importance to frequently accessed data objects 56.
      • Assigning higher importance to de-duplicated storage objects 56, since losing a given de-duplicated storage object 56 may result in losing all the storage objects that refer to it.
      • User-specific service level agreements (SLAs) can be used for prioritization.
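  • One way to combine those policies is a simple weighted score per object; the weights, normalizations, and field names below are illustrative assumptions, not values from the specification.

```python
def protection_priority(access_frequency: float,
                        dedup_reference_count: int,
                        sla_tier: int) -> float:
    """Score a storage object for inclusion in an intra-rack protection object.

    Higher scores mean the object should be protected sooner. The weights
    (0.5, 0.3, 0.2) and the normalizations are placeholders.
    """
    access_score = min(access_frequency / 100.0, 1.0)        # hot data first
    dedup_score = min(dedup_reference_count / 10.0, 1.0)      # widely shared data
    sla_score = {0: 0.2, 1: 0.6, 2: 1.0}.get(sla_tier, 0.2)   # stricter SLAs first
    return 0.5 * access_score + 0.3 * dedup_score + 0.2 * sla_score


# Rank hypothetical candidates for protection.
candidates = {"A1": (80.0, 3, 2), "B1": (5.0, 12, 1), "C1": (40.0, 1, 0)}
ranked = sorted(candidates, key=lambda k: protection_priority(*candidates[k]), reverse=True)
```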
  • The flowchart(s) and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

Claims (20)

1. A method, comprising:
detecting, by a management processor, multiple sets of storage objects stored in a data facility comprising multiple server racks, each of the server racks comprising a plurality of server computers, each of the storage objects in each given set being stored in separate server racks and comprising one or more data objects and one or more protection objects for the given set;
identifying, by the management processor, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers;
identifying, by the management processor, one or more server computers in the given server rack not storing any of the identified storage objects; and
creating and managing, by the management processor, in the identified one or more server computers, an additional protection object for the identified storage objects,
wherein the multiple server racks are coupled via one or more first network switches having one or more respective first bandwidths, and wherein each of the server computers in a given server rack are coupled via a second network switch having a second bandwidth, and wherein the second bandwidth is greater than each of the one or more first bandwidths, and wherein the second bandwidth is more plentiful than the first bandwidth.
2. The method according to claim 1, wherein the additional protection object comprises a replicated storage object.
3. (canceled)
4. The method according to claim 1, wherein the corresponding contents of a given identified data object comprises user data.
5. The method according to claim 1, wherein the erasure correction codes comprise first erasure correction codes, and wherein the corresponding contents of a given identified protection object comprises second erasure correction codes.
6. (canceled)
7. The method according to claim 1, wherein each of the one or more first communication switches are selected from a list consisting of a data center switch and an aggregation switch, and wherein the second communication switch comprises a top-of-rack switch.
8. A storage facility, comprising:
multiple server racks, each of the server racks comprising a plurality of server computers; and
a processor configured:
to detect multiple sets of storage objects, each of the storage objects in each given set being stored in separate server racks and comprising one or more data objects and one or more protection objects for the given set,
to identify, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers,
to identify one or more server computers in the given server rack not storing any of the identified storage objects, and
to create and manage, in the identified one or more server computers, an additional protection object for the identified storage objects,
wherein the multiple server racks are coupled via one or more first network switches having one or more respective first bandwidths, and wherein each of the server computers in a given server rack are coupled via a second network switch having a second bandwidth, and wherein the second bandwidth is greater than each of the one or more first bandwidths, and wherein the second bandwidth is more plentiful than the first bandwidth.
9. The storage facility according to claim 8, wherein the additional protection object comprises a replicated storage object.
10. (canceled)
11. The storage facility according to claim 8, wherein the corresponding contents of a given identified data object comprises user data.
12. The storage facility according to claim 8, wherein the erasure correction codes comprise first erasure correction codes, and wherein the corresponding contents of a given identified protection object comprises second erasure correction codes.
13. (canceled)
14. The storage facility according to claim 8, wherein each of the one or more first communication switches are selected from a list consisting of a data center switch and an aggregation switch, and wherein the second communication switch comprises a top-of-rack switch.
15. A computer program product, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to detect multiple sets of storage objects stored in a data facility comprising multiple server racks, each of the server racks comprising a plurality of server computers, each of the storage objects in each given set being stored in separate server racks and comprising one or more data objects and one or more protection objects for the given set;
computer readable program code configured to identify, in a given server rack, a specified number of the storage objects, each of the identified storage objects being stored in separate server computers;
computer readable program code configured to identify one or more server computers in the given server rack not storing any of the identified storage objects; and
computer readable program code configured to create and manage, in the identified one or more server computers, an additional protection object for the identified storage objects,
wherein the multiple server racks are coupled via one or more first network switches having one or more respective first bandwidths, and wherein each of the server computers in a given server rack are coupled via a second network switch having a second bandwidth, and wherein the second bandwidth is greater than each of the one or more first bandwidths, and wherein the second bandwidth is more plentiful than the first bandwidth, and wherein each of the one or more first communication switches are selected from a list consisting of a data center switch and an aggregation switch, and wherein the second communication switch comprises a top-of-rack switch.
16. The computer program product according to claim 15, wherein the additional protection object comprises a replicated storage object.
17. (canceled)
18. The computer program product according to claim 15, wherein the corresponding contents of a given identified data object comprises user data.
19. The computer program product according to claim 15, wherein the erasure correction codes comprise first erasure correction codes, and wherein the corresponding contents of a given identified protection object comprises second erasure correction codes.
20. (canceled)
US15/353,772 2014-12-24 2016-11-17 Intra-rack and inter-rack erasure code distribution Abandoned US20170068475A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/353,772 US20170068475A1 (en) 2014-12-24 2016-11-17 Intra-rack and inter-rack erasure code distribution

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/582,227 US9547458B2 (en) 2014-12-24 2014-12-24 Intra-rack and inter-rack erasure code distribution
US15/353,772 US20170068475A1 (en) 2014-12-24 2016-11-17 Intra-rack and inter-rack erasure code distribution

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/582,227 Continuation US9547458B2 (en) 2014-12-24 2014-12-24 Intra-rack and inter-rack erasure code distribution

Publications (1)

Publication Number Publication Date
US20170068475A1 true US20170068475A1 (en) 2017-03-09

Family

ID=56164285

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/582,227 Expired - Fee Related US9547458B2 (en) 2014-12-24 2014-12-24 Intra-rack and inter-rack erasure code distribution
US15/353,772 Abandoned US20170068475A1 (en) 2014-12-24 2016-11-17 Intra-rack and inter-rack erasure code distribution

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/582,227 Expired - Fee Related US9547458B2 (en) 2014-12-24 2014-12-24 Intra-rack and inter-rack erasure code distribution

Country Status (1)

Country Link
US (2) US9547458B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347526A (en) * 2019-06-28 2019-10-18 华中科技大学 Promote the method, apparatus and system of LRC code repairing performance in distributed storage cluster

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11188665B2 (en) * 2015-02-27 2021-11-30 Pure Storage, Inc. Using internal sensors to detect adverse interference and take defensive actions
US10169126B2 (en) * 2016-10-12 2019-01-01 Samsung Electronics Co., Ltd. Memory module, memory controller and systems responsive to memory chip read fail information and related methods of operation
US10942807B2 (en) 2018-06-12 2021-03-09 Weka.IO Ltd. Storage system spanning multiple failure domains
US11061772B2 (en) 2018-12-14 2021-07-13 Samsung Electronics Co., Ltd. FPGA acceleration system for MSR codes
US11269745B2 (en) * 2019-10-29 2022-03-08 International Business Machines Corporation Two-node high availability storage system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553336B1 (en) * 1999-06-25 2003-04-22 Telemonitor, Inc. Smart remote monitoring system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6990573B2 (en) * 2003-02-05 2006-01-24 Dell Products L.P. System and method for sharing storage to boot multiple servers
AU2003304502A1 (en) * 2003-10-13 2005-04-27 Illuminator (Israel) Ltd. Apparatus and method for information recovery quality assessment in a computer system
US7930611B2 (en) 2007-03-09 2011-04-19 Microsoft Corporation Erasure-resilient codes having multiple protection groups
US20100138717A1 (en) 2008-12-02 2010-06-03 Microsoft Corporation Fork codes for erasure coding of data blocks
US8112661B1 (en) * 2009-02-02 2012-02-07 Netapp, Inc. Method and system for changing a protection policy for a dataset in a network storage system
US8190974B2 (en) * 2009-09-28 2012-05-29 Nvidia Corporation Error detection and correction for external DRAM
US8631269B2 (en) 2010-05-21 2014-01-14 Indian Institute Of Science Methods and system for replacing a failed node in a distributed storage network
US8719675B1 (en) 2010-06-16 2014-05-06 Google Inc. Orthogonal coding for data storage, access, and maintenance
US8484433B2 (en) * 2010-11-19 2013-07-09 Netapp, Inc. Dynamic detection and reduction of unaligned I/O operations
US9277010B2 (en) * 2012-12-21 2016-03-01 Atlantis Computing, Inc. Systems and apparatuses for aggregating nodes to form an aggregated virtual storage for a virtualized desktop environment



Also Published As

Publication number Publication date
US9547458B2 (en) 2017-01-17
US20160188406A1 (en) 2016-06-30

