WO2021189314A1 - Data server crash recovery in object storage system using enhanced meta structure - Google Patents

Data server crash recovery in object storage system using enhanced meta structure

Info

Publication number
WO2021189314A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
data
meta
data server
volume group
Application number
PCT/CN2020/081169
Other languages
French (fr)
Inventor
Li Wang
Yiming Zhang
Mingya SHI
Chunhua Huang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2020/081169 priority Critical patent/WO2021189314A1/en
Publication of WO2021189314A1 publication Critical patent/WO2021189314A1/en

Classifications

    • G (Physics) > G06 (Computing; calculating or counting) > G06F (Electric digital data processing) > G06F 11/00 (Error detection; error correction; monitoring) > G06F 11/07 (Responding to the occurrence of a fault, e.g. fault tolerance)
    • G06F 11/1471: Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G06F 11/0757: Error or fault detection not based on redundancy, by exceeding a time limit (time-out, e.g. watchdogs)
    • G06F 11/0772: Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
    • G06F 11/2025: Failover techniques using centralised failover control functionality
    • G06F 11/2038: Failover techniques with a single idle spare processing component
    • G06F 11/2048: Failover techniques where the redundant components share neither address space nor persistent storage
    • G06F 11/2094: Redundant storage or storage space

Definitions

  • the present disclosure relates to systems and methods for object storage, and more particularly to, systems and methods for recovering a data server crash in an object storage system using an enhanced meta structure.
  • Object storage has been widely adopted by online-data-intensive (OLDI) applications for data analysis and computation.
  • Compared to files or blobs, of which the sizes vary from several kilobytes to many gigabytes, the sizes of objects (like photos, audio/video pieces, and H5 files) in the OLDI applications are relatively small (at most a few megabytes) , so as to support large-scale, latency-sensitive computation.
  • the two main placement architectures for object data are: (1) directory-based placement (e.g., Haystack) , where a directory service is maintained from which the clients can retrieve the mapping from objects to disk volumes; and (2) the calculation-based placement (e.g., CRUSH) , where the mapping from objects to volumes can be independently calculated based on the objects’ globally unique names.
  • Calculation-based placement cannot control the mapping from objects to volumes and thus causes massive data migration during storage capacity expansions, resulting in significant performance degradation. Therefore, calculation-based placement is not smoothly expandable, which is unacceptable to fast-growing, latency-sensitive OLDI applications.
  • While the directory-based placement is flexible in controlling the object-to-volume mapping and can easily realize migration-free expansions, it requires writing the storage meta-data (e.g., mappings from objects both to volumes and to in-volume positions) and the actual object data before acknowledgements.
  • the co-existence of object data and meta-data complicates the processing of requests (e.g., object PUT request) in current object stores.
  • current directory-based object stores have to orchestrate a sequence of distributed writes in a particular order. The distributed ordering requirement severely affects the write performance of small objects, making directory-based object stores inefficient for write-intensive applications.
  • Embodiments of the disclosure address the above problems by providing systems and methods for recovering server crashes in object storage system using an enhanced meta structure.
  • Embodiments of the disclosure provide a method for recovering a data server crash.
  • the exemplary method includes detecting a crash of a first data server and marking an affected volume group associated with the first data server read-only.
  • the method further includes determining that the first data server is not back online within a predetermined time.
  • the method also includes replacing the first data server with a second data server and recovering object data of the affected volume group to the second data server.
  • Embodiments of the disclosure also provide a method for performing a write operation of an object.
  • the exemplary method includes issuing, by a proxy server, a request for the write operation to a meta server.
  • the method further includes selecting, by the meta server, a first volume group for writing the object.
  • the first volume group comprises a volume of a data server.
  • the method also includes detecting a crash of the data server and reissuing, by the proxy server, the request for the write operation to the meta server.
  • the method additionally includes selecting, by the meta server, a second volume group for writing the object and withdrawing from writing the object to the first volume group.
  • Embodiments of the disclosure further provide an object storage system, including a master cluster and a data server cluster, which includes a first data server and a second data server.
  • the master cluster is configured to detect a crash of the first data server and mark an affected volume group associated with the first data server read-only.
  • the master cluster is further configured to determine that the first data server is not back online within a predetermined time.
  • the master cluster is also configured to replace the first data server with the second data server and recover object data of the affected volume group to the second data server.
  • Embodiments of the disclosure further provide a system for performing a write operation of an object.
  • the exemplary system includes a proxy server, a meta server, a data server cluster including a data server, and a master cluster.
  • the proxy server is configured to issue a request for the write operation to the meta server.
  • the meta server is configured to select a first volume group for writing the object.
  • the first volume group includes a volume of the data server.
  • the master cluster is configured to detect a crash of the data server.
  • the proxy server is further configured to reissue the request for the write operation to the meta server.
  • the meta server is further configured to select a second volume group for writing the object and withdraw from writing the object to the first volume group.
  • Embodiments of the disclosure also provide a non-transitory computer readable medium storing computer instructions that, when executed by a master cluster of an object storage system, perform a method for recovering a data server crash.
  • the exemplary method includes detecting a crash of a first data server and marking an affected volume group associated with the first data server read-only.
  • the method further includes determining that the first data server is not back online within a predetermined time.
  • the method also includes replacing the first data server with a second data server and recovering object data of the affected volume group to the second data server.
  • FIG. 1 illustrates a schematic diagram of an exemplary object storage system, according to embodiments of the disclosure.
  • FIG. 2 illustrates a flow diagram of an exemplary write operation implementing local write atomicity, according to embodiments of the disclosure.
  • FIG. 3 is a flowchart of an exemplary method for performing the write operation of FIG. 2, according to embodiments of the disclosure.
  • FIG. 4 illustrates an exemplary framework for hybrid object placement, according to embodiments of the disclosure.
  • FIG. 5 illustrates an exemplary METAX structure for meta-data storage, according to embodiments of the disclosure.
  • FIG. 6 illustrates a schematic diagram of an exemplary object storage system configured to process an operation request using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 8 is a flowchart of an exemplary method for recovering a meta server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 9 is a flowchart of an exemplary method for performing an object write operation amid a meta server crash, according to embodiments of the disclosure.
  • FIG. 10 is a flowchart of an exemplary method for recovering a proxy server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 11 is a flowchart of an exemplary method for recovering a data server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 12 is a flowchart of an exemplary method for performing an object write operation amid a data server crash, according to embodiments of the disclosure.
  • FIG. 1 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 100” ) , according to embodiments of the disclosure.
  • storage system 100 may include components shown in FIG. 1, including a user device 110 used by a client to send client requests, a proxy cluster 120 for surrogating client requests, a meta server cluster 130 for providing meta service, a data server cluster 140 for providing data storage service, a master cluster 150 for managing the clusters and disseminating directory mappings, and a network 160 for facilitating communications among the various components.
  • storage system 100 is configured to provide high input and output performance for write-intensive OLDI applications (e.g., in-journey audio recording, face identification, online sentiment analysis, driver/passenger assessment, etc. ) while ensuring the crash consistency by enforcing the local write atomicity of the enhanced meta structure (e.g., METAX) instead of applying a distributed write ordering.
  • Storage system 100 may be configured to perform various object operations including, for example, an operation that writes an object in the storage system (also referred to as an object write operation or an object PUT) , an operation that reads an object stored in the storage system (also referred to as an object read operation or an object GET) , and an operation that deletes an object from the storage system (also referred to as an object delete operation or an object DELETE) .
  • storage system 100 may perform some other object operations by combining the above basic operations. For example, an object overwrite operation can be performed by first deleting the object and then writing a new one with the same object name.
  • user device 110 may include, but not limited to, a facial recognition camera, a video/audio recorder, a server, a database/repository, a netbook computer, a laptop computer, a desktop computer, a media center, a gaming console, a television set, a set-top box, a handheld device (e.g., smart phone, tablet, etc. ) , a smart wearable device (e.g., eyeglasses, wrist watch, etc. ) , or any other suitable device.
  • storage system 100 may perform an object operation when receiving a request from user device 110, e.g., an object PUT, GET or DELETE request.
  • a request may include the object data, the application meta-data and a URL.
  • the object data may be organized into groups and then be assigned to data server cluster 140.
  • the ID of the groups, the in-group sequence number of the ongoing object PUT request and the allocation result (e.g., a current bitmap) of the assigned volume groups may be logged by meta server cluster 130.
  • the object data provided by user device 110 may contain data of a document, a picture, a piece of video or audio, an e-book, etc. Consistent with the present disclosure, the object data may be generated by user device 110 and may be sent to proxy cluster 120 for storage and/or replication. In some embodiments, the object data may be striped/divided into multiple objects and may be stored to data server cluster 140, according to a directory mapping generated by master cluster 150.
  • proxy cluster 120 may be any suitable device (e.g., a server, a computer, etc. ) that can provide object input/output (I/O) service to the clients such as user device 110, and can provide object PUT/GET/DELETE request service (s) for the clients as a surrogate.
  • the objects may be replicated and stored on proxy cluster 120 and they are identified by their globally unique identifications (e.g., object names) .
  • meta server cluster 130 may include one or more machines each installing multiple disks for providing meta service (s) .
  • each disk of meta server cluster 130 may run a meta server process.
  • machines within meta server cluster 130 may maintain a local store which supports both atomic writes and efficient queries of the meta-data/meta-log of the data file.
  • data server cluster 140 may include one or more storage machines each installing many disks for providing raw disk service (s) .
  • each disk in data server cluster 140 may run a lightweight data server process (e.g., process for executing database retrieval and updates, managing data integrity, dispatching responses to client requests, etc. ) .
  • data server cluster 140 may purely perform disk I/O to directly read/write object data without filesystem overhead (e.g., the meta-data/meta-log of the object data) .
  • storage system 100 may also include a master cluster 150 being responsible for managing the entire system.
  • master cluster 150 may be configured to maintain a consistent cluster map (e.g., by running Paxos) of proxy cluster 120, meta server cluster 130, data server cluster 140 as well as master cluster 150 itself.
  • master cluster 150 may also maintain and disseminate the directory mapping. Master cluster 150 may further periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
  • master cluster 150 may include a general-purpose or special-purpose server, a computing device, or any suitable device that may be responsible for managing the entire system.
  • master cluster 150 may include an odd number of cluster master processes (e.g., processes that manage the clusters state) , and may run Paxos (protocols for solving consensus in a network of processors) to maintain a consistent cluster map of proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150 itself.
  • Master cluster 150 may also maintain and disseminate the directory mapping. For example, master cluster 150 may periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
  • storage system 100 may include network 160 to facilitate the communication among the various components of storage system 100, such as user device 110, proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150.
  • the network may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc.
  • network may be replaced by wired data communication systems or devices.
  • storage system 100 may include more or fewer components compared to those shown in FIG. 1.
  • the various components of storage system 100 may be remote from each other or in different locations and be connected through the network.
  • certain components of storage system 100 may be located on the same site or inside one device.
  • proxy cluster 120 may be located on-site with or be part of user device 110 such as being an application or a computer program that can be executed by user device 110.
  • storage system 100 may be configured to perform object PUT operations that eliminate the ordering requirement of conventional write operations, which usually penalizes storage performance.
  • storage system 100 places the object meta-data, ongoing meta-logs and object attributes on meta server cluster 130, while leaving proxy cluster 120 to be a stateless surrogate and data server cluster 140 to provide only raw disk service.
  • storage system 100 may use an enhanced meta structure (referred to as METAX) that contains the traditional meta-data of objects, the meta-logs and object attributes.
  • a local atomic write using the METAX structure therefore does not require distributed write ordering of data/meta-data and thus may be applied to ensure crash consistency based on the object immutability. This would significantly increase the I/O performance of storage system 100.
  • meta server cluster 130 when performing the local atomic write, all file data except for the object data may be placed on meta server cluster 130. This may avoid uncertainty caused by applying distributed write ordering.
  • meta server cluster 130 may maintain (i) the complete placement meta-data recording the allocated mapping from objects to volumes and in-volume positions, (ii) the object attributes including data checksum, (iii) the meta-logs of the ongoing object PUT request (s) recording IDs of proxy cluster 120 and the object keys, and (iv) the volume allocation results in the form of bitmaps recording the status of disk blocks in the allocated volumes.
  • Data server cluster 140 may provide raw disk service that is agnostic to the existence of objects and only responsible for storing pure object data onto specific disk blocks.
  • FIG. 2 illustrates a flow diagram 200 of an exemplary write operation implementing a local write atomicity, according to embodiments of the disclosure.
  • master cluster 150 may coordinate proxy cluster 120, meta server cluster 130 and data server cluster 140 according to flow diagram 200.
  • FIG. 3 is a flowchart of an exemplary method 300 for performing the write operation of FIG. 2, according to embodiments of the disclosure.
  • method 300 may be performed by storage system 100 and may include steps S302-S310 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 2 and FIG. 3 will be described together.
  • a client 210 may send an object PUT request (e.g., via user device 110) to a proxy server 220 (e.g., among proxy cluster 120) .
  • the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or the user (e.g., user device 110) and object is the object name.
  • proxy server 220 may send application meta-data to a meta server 230 (e.g., among meta server cluster 130) .
  • application meta-data of the object may include the object key (e.g., a key computed by hashing the object name) .
  • proxy server 220 may also send attributes of the object to meta server 230 such as a data checksum computed using the object data.
  • meta server 230 may schedule and return allocation data to proxy server 220, and simultaneously perform an atomic write of the various data associated with the object and the operation request, such as the allocation data, meta-logs of the request, and the application meta-data, to the local store of meta server 230.
  • the allocation data may include placement meta-data for the object (e.g., the allocated disk volume and in-volume position for the object) .
  • meta server 230 further determines an allocation result bitmap, where each bit indicates the allocation status of one disk block in the allocated disk volume at the in-volume position.
  • meta server 230 may also atomically write the allocation result bitmap to its local store along with the other information and data described above.
  • an atomic write requires that all operations in the local write either succeed or fail together.
  • method 300 only requires the write on meta server 230 to be atomic.
  • meta server 230 may also return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the atomic write is successfully completed. It is contemplated that atomic write can also be implemented on other devices, such as proxy server 220.
  • proxy server 220 may send object data and allocation data for the object (e.g., volume ID and in-volume offset) to the assigned disk volume within a data server 240 (e.g., data server among data server cluster 140 that includes the assigned disk volume) .
  • data server 240 may write the received object data in the disk volume according to the allocation data such as the volume ID and the in-volume offset, and return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the write is completed.
  • proxy server 220 may send an acknowledgement (e.g., SUCCESS) to client 210 (e.g., via user device 110) indicating the object PUT request has been successfully fulfilled.
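  • The flow of steps S302-S310 above can be summarized in code. The following is a minimal, illustrative Python sketch, assuming in-memory stand-ins for the meta server's local store and the data server's raw volume; the class and function names (MetaServer, DataServer, put_object) are hypothetical and not part of the disclosure.

      import zlib, uuid

      class MetaServer:
          """Illustrative stand-in: keeps METAX-like records in one dict and
          applies them in a single update, mirroring the local atomic write (cf. S306)."""
          def __init__(self):
              self.store = {}          # local KV store (e.g., RocksDB in the disclosure)

          def allocate_and_log(self, objectname, app_meta, checksum, pxid, reqid):
              vgid, extent = "vg-1", (0, 1)          # toy allocation: volume group + (offset, length)
              batch = {                              # all KVs applied together = local atomic write
                  f"OBMETA_{objectname}": {"app": app_meta, "vgid": vgid,
                                           "extents": [extent], "checksum": checksum},
                  f"REQLOG_{pxid}_{reqid}": {"objectname": objectname},
                  f"VOLBITMAP_{vgid}": {extent[0]: True},
              }
              self.store.update(batch)               # single update stands in for a WriteBatch-style commit
              return vgid, [extent]

      class DataServer:
          """Raw-disk stand-in: writes object bytes at the allocated positions only."""
          def __init__(self):
              self.volume = bytearray(1024)
          def write(self, data, extents):
              for off, _ in extents:
                  self.volume[off:off + len(data)] = data

      def put_object(objectname, data, meta_server, data_server, pxid="px-1"):
          # Proxy side: compute the data checksum and a request ID, send meta-data to the meta server (cf. S304).
          checksum = zlib.crc32(data)
          reqid = uuid.uuid4().hex
          vgid, extents = meta_server.allocate_and_log(objectname, {"len": len(data)},
                                                       checksum, pxid, reqid)
          # Forward the object data plus allocation data to the data server (cf. S308).
          data_server.write(data, extents)
          return "SUCCESS"                           # acknowledge the client (cf. S310)

      print(put_object("ns/photo-1", b"object-bytes", MetaServer(), DataServer()))

  • Note that in this sketch the meta server write and the data server write share no ordering dependency; only the meta server's batched update is required to be atomic, which mirrors the point made above.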
  • meta server cluster 130 may include multiple meta servers to provide scalable meta service.
  • in some embodiments, the responsible meta server (e.g., meta server 230) for an object may be identified using a predetermined lookup table or directly calculated via a decentralized CRUSH algorithm without extra lookup overhead.
  • storage system 100 may implement a hybrid placement architecture where the directory (e.g., master cluster 150) may maintain the mapping from objects to volumes and disseminate the mapping to the meta servers (e.g., meta server cluster 130) , and the meta servers (e.g., meta server cluster 130) may maintain the mapping from objects to in-volume positions and provide meta service (s) .
  • FIG. 4 illustrates an exemplary framework 400 for hybrid object placement, according to embodiments of the disclosure.
  • an intermediate mapping layer of volume group lists (VGL) each having multiple volume groups (VG) may be included in the hybrid placement architecture.
  • framework 400 may include VGLs 420-430 where VGL 420 may include VGs 421 and 422.
  • a disk may be organized into multiple volumes and a VG (similar to a logical volume in Haystack) may have f volumes on f disks (e.g., for f-way replication) that have exactly the same bitmap and data.
  • An f-way replication stores f replicas for each object.
  • objects to be stored in the storage system are first organized into object groups (OGs) , e.g., including object groups 411-413.
  • an OG (e.g., OG 411) may be mapped to meta server 230 by a decentralized CRUSH algorithm.
  • the OG may be additionally mapped to a VGL (e.g., VGL 420) by the central directory, e.g., master cluster 150 (known as a “directory mapping” ) .
  • the central directory may disseminate the directory mapping (i.e., OG-to-VGL mapping) to meta server 230.
  • the mapped meta server may select a VG (e.g., VG 421) in the mapped VGL (e.g., VGL 420) and may map the object to the f positions in the f volumes of the selected VG (known as a “volume mapping” ) .
  • the f positions may be exactly the same in the respective f volumes.
  • the hybrid placement architecture disclosed herein avoids unnecessary migration of object data by keeping the existing mapping (e.g., from objects to VGLs, VGs and in-volume positions) unaffected whenever possible. For example, when cluster expansion happens (e.g., one or more disks are added to storage system 100) , the directory service may construct new VGs on the newly added disks and later map the newly constructed VGs to the OGs on demand. In some embodiments, if one of the meta servers in meta server cluster 130 (e.g., meta server 230) fails, OGs assigned to that meta server would be “crushed” to other meta servers, but the directory/volume mappings can remain unchanged after recovery.
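  • As a rough illustration of the hybrid placement described above, the sketch below hashes an object name to an object group, uses a simple hash-based rule as a placeholder for a real CRUSH computation, and looks the OG up in a directory-style OG-to-VGL mapping; the constants, server names, and the simplified placement rule are assumptions for illustration only.

      import hashlib

      NUM_OGS = 64
      META_SERVERS = ["meta-1", "meta-2", "meta-3", "meta-4", "meta-5"]
      # Directory mapping maintained by the master cluster: OG id -> volume group list (VGL).
      DIRECTORY = {og: [f"vg-{og}-a", f"vg-{og}-b"] for og in range(NUM_OGS)}

      def object_group(objectname: str) -> int:
          """Objects are first organized into object groups (OGs)."""
          return int(hashlib.sha1(objectname.encode()).hexdigest(), 16) % NUM_OGS

      def responsible_meta_servers(og: int, replicas: int = 3):
          """Placeholder for the decentralized CRUSH calculation (no lookup table needed)."""
          start = og % len(META_SERVERS)
          return [META_SERVERS[(start + i) % len(META_SERVERS)] for i in range(replicas)]

      def place(objectname: str):
          og = object_group(objectname)
          vgl = DIRECTORY[og]                 # OG-to-VGL mapping disseminated by the directory
          vg = vgl[0]                         # the meta server picks one VG in the mapped VGL
          return og, responsible_meta_servers(og), vg

      print(place("ns/photo-1"))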
  • both the placement/application meta-data and the meta-log (e.g., proxy server ID and request ID) of the object request may be written atomically.
  • each meta server of meta server cluster 130 may run RocksDB as the local store to maintain the meta-data/meta-log in the form of a METAX structure that includes several key values (KVs) .
  • FIG. 5 illustrates an exemplary METAX structure 500 for meta-data storage, according to embodiments of the disclosure. As illustrated in FIG. 5, each record of the meta-data/meta-log of PUT request (s) in METAX structure 500 may include four KVs 510-540.
  • KV 510 may store the placement/application meta-data, in a form of <OBMETA_objectname, {app-meta-data, vgid, extents, checksum}>, where OBMETA stands for “object metadata” (e.g., may be queried by key prefix matching in RocksDB) , and objectname stands for the globally unique identification (e.g., object name) of the object.
  • the value of KV 510 may include the app-meta-data (e.g., the attributes) , the ID of the assigned volume group (vgid) , the allocated in-volume positions (extents) , and the object data checksum (from the proxy server) .
  • KV 520 may be used for logging the ongoing object PUT request(s) of the proxy server (e.g., proxy cluster 120) , in a form of <REQLOG_pxid_reqid, {objectname, oplogkey}>, where REQLOG stands for “request log” , pxid stands for the ID of the proxy server and reqid stands for the request ID received from the proxy server.
  • the value of KV 520 includes the globally unique name of the object (objectname) and the key of the current operation’s log OPLOG_ogid_opseq.
  • KV 530 may be used for logging the ongoing object PUT request(s) of the object groups, in a form of <OPLOG_ogid_opseq, {objectname, reqlogkey}>, where OPLOG stands for the “operation log” , ogid stands for the ID of the object group and opseq stands for a version number monotonically increased for every PUT of an object in the object group.
  • the value of KV 530 includes the name of the object (objectname) and the key of the current PUT request (s) ’ log REQLOG_pxid_reqid.
  • KV 540 may record the volume allocation result, in a form of <VOLBITMAP_vgid, bitmap>, where VOLBITMAP stands for the “volume bitmap” , and vgid stands for the ID of the assigned volume group.
  • the value bitmap stands for the allocated bitmap, where each bit indicates the allocation status of one disk block in the volume and is coherently updated according to the allocated extents in KV 510.
  • the bitmap is auxiliary and may be cached in memory for fast volume space allocation.
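  • For concreteness, the sketch below builds the four KV records of a METAX entry for one PUT as plain Python dictionaries. The key formats follow the description above, while the field values and the helper name build_metax are illustrative assumptions.

      def build_metax(objectname, app_meta, vgid, extents, checksum,
                      pxid, reqid, ogid, opseq, bitmap):
          """Return the four KVs written atomically for one object PUT."""
          reqlog_key = f"REQLOG_{pxid}_{reqid}"
          oplog_key = f"OPLOG_{ogid}_{opseq}"
          return {
              # KV 510: placement/application meta-data of the object.
              f"OBMETA_{objectname}": {"app-meta-data": app_meta, "vgid": vgid,
                                       "extents": extents, "checksum": checksum},
              # KV 520: request log of the ongoing PUT from the proxy server.
              reqlog_key: {"objectname": objectname, "oplogkey": oplog_key},
              # KV 530: operation log of the ongoing PUT for the object group.
              oplog_key: {"objectname": objectname, "reqlogkey": reqlog_key},
              # KV 540: volume allocation result; each bit marks one disk block.
              f"VOLBITMAP_{vgid}": bitmap,
          }

      kvs = build_metax("ns/photo-1", {"len": 12}, "vg-7", [(4096, 12)], 0x1A2B,
                        "px-1", "req-42", ogid=11, opseq=5, bitmap=[1, 0, 0, 0])
      for k in kvs:
          print(k)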
  • object request (s) may be processed by a high-performance object store (e.g., CHEETA) based on the METAX structure.
  • Both the object data and the meta-data may adopt a f-way replication, and parallel disk writes may be applied.
  • CHEETA combined with the METAX structure may avoid the ordering requirement in multiple disk writes when processing the request(s) . This allows it to significantly outperform conventional object stores (e.g., Haystack and Ceph) in PUT/GET/DELETE latency and throughput.
  • FIG. 6 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 600” ) configured to process an operation request using METAX structure 500 of FIG. 5, according to embodiments of the disclosure.
  • system 600 may include similar components to those in storage system 100, such as a user device 610, a proxy server 620, a meta server cluster 630, a data server cluster 640, e.g., including data servers 641-643, and a master cluster 650. Detailed descriptions of the similar components will be omitted.
  • each meta server may include a plurality of disks 634 and each data server may include a plurality of disks 644, for storing the meta-data and the object data respectively.
  • in some embodiments where f = 3, meta server cluster 630 may include a primary meta server 631 and two backup meta servers 632 and 633 for the replication.
  • other replication factors can be adopted, and storage system 600 can be configured according to the selected number f.
  • FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 700 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 700 may include steps S702-S718 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7. FIG. 6 and FIG. 7 would be described together.
  • the client may send an object PUT request through user device 610 to proxy server 620.
  • the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or user and object is the object name.
  • proxy server 620 may concatenate ns and object to get a globally unique object name (objectname) .
  • proxy server 620 may adopt the CRUSH algorithm to calculate the ID of the responsible meta server in meta server cluster 630 for the object. Specifically, the object may first be classified to an object group and then be crushed to the three meta servers (e.g., the primary meta server 631 and two backup meta servers 632 and 633) of meta server cluster 630. The checksum of the object data may be computed and a UUID (e.g., the reqid) may be generated for the object PUT request.
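  • A minimal sketch of this proxy-side preparation (cf. steps S702-S704), assuming SHA-1 as the object-key hash, CRC32 as the data checksum, and a UUID as the request ID; the disclosure does not mandate these particular functions, and prepare_put is a hypothetical helper.

      import hashlib, uuid, zlib

      def prepare_put(ns: str, obj: str, data: bytes):
          objectname = f"{ns}/{obj}"                                   # concatenate ns and object name
          object_key = hashlib.sha1(objectname.encode()).hexdigest()   # key used for placement/CRUSH
          checksum = zlib.crc32(data)                                  # checksum of the object data
          reqid = uuid.uuid4().hex                                     # UUID identifying this PUT request
          return {"objectname": objectname, "object_key": object_key,
                  "checksum": checksum, "size": len(data), "reqid": reqid}

      print(prepare_put("ns", "photo-1", b"object-bytes"))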
  • proxy server 620 may send application meta-data, and attributes of the object such as checksum, object data size and reid, to primary meta server 631.
  • proxy server 620 may additionally send log information such as request and operation logs to primary meta server 631.
  • primary meta server 631 may generate the METAX structure and send the KVs of the structure to backup meta servers 632-633.
  • primary meta server 631 may perform the following sub-steps S708 (i) - (iii) after receiving the object PUT request.
  • primary meta server 631 may generate allocation data for the object.
  • the allocation data may be recorded in a list of extents, each of which records the offset and length of a contiguous free space.
  • primary meta server 631 may generate a METAX structure that include multiple KVs (e.g., KVs 510-540 in FIG. 5) .
  • primary meta server 631 may perform the following operations in parallel: (a) returning the allocation data, e.g., the allocated replication group ID (vgid) and space allocation result (extents) , to proxy server 620; (b) sending the KVs to backup meta servers 632-633 of meta server cluster 630 for the object group (selected by CRUSH) ; and (c) writing the KVs to the local store (e.g., local RocksDB) using operations such as WriteBatch() .
  • backup meta servers 632-633 may atomically write the received KVs to their local store (e.g., the local RocksDB) and return an acknowledgement (e.g., a SUCCESS) to primary meta server 631.
  • primary meta server 631 may return another acknowledgement (e.g., another SUCCESS) to proxy server 620.
  • proxy server 620 may send the object data and the allocation data to the corresponding data servers (e.g., data servers 641-643) in data server cluster 640.
  • proxy server 620 may receive the allocation data from primary meta server 631 in step S708 and store it in its in-memory cluster map.
  • the data servers may write the received object data to the positions of disks 644 as specified by the allocation result (extents) and return an acknowledgement to proxy server 620.
  • proxy server 620 may acknowledge to the client via user device 610 after receiving acknowledgement from both primary meta server 631 and data servers 641-643.
  • in some embodiments, proxy server 620 may periodically (instead of immediately) notify the primary meta server about the committed object PUT request and the corresponding meta-log (e.g., the request log in KV 520 and the operation log in KV 530 of FIG. 5) .
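  • The following sketch models the primary meta server's handling in step S708 under an assumed f = 3: it generates the METAX KVs, replies to the proxy, pushes the KVs to the backups, and applies them to its own store in one batch. For readability the sketch runs these serially rather than truly in parallel, and the class names are illustrative only.

      class KVStore:
          """Stand-in for each meta server's local store; write_batch() plays the role of an
          atomic WriteBatch-style commit of all KVs together."""
          def __init__(self):
              self.db = {}
          def write_batch(self, kvs):
              self.db.update(kvs)

      class BackupMetaServer:
          def __init__(self):
              self.store = KVStore()
          def replicate(self, kvs):
              self.store.write_batch(kvs)           # backups atomically write the received KVs (cf. S710)
              return True                           # SUCCESS back to the primary

      class PrimaryMetaServer:
          def __init__(self, backups):
              self.store = KVStore()
              self.backups = backups                # backup meta servers for the object group (f = 3)

          def handle_put(self, objectname, checksum, pxid, reqid, ogid, opseq):
              vgid, extents = "vg-7", [(4096, 128)]             # toy space allocation (cf. S708(i))
              kvs = {                                           # METAX KVs (cf. S708(ii))
                  f"OBMETA_{objectname}": {"vgid": vgid, "extents": extents, "checksum": checksum},
                  f"REQLOG_{pxid}_{reqid}": {"objectname": objectname},
                  f"OPLOG_{ogid}_{opseq}": {"objectname": objectname},
                  f"VOLBITMAP_{vgid}": {4096: True},
              }
              acks = [b.replicate(kvs) for b in self.backups]   # (b) send KVs to the backups
              self.store.write_batch(kvs)                       # (c) local atomic write
              assert all(acks)
              return vgid, extents                              # (a) allocation data back to the proxy

      primary = PrimaryMetaServer([BackupMetaServer(), BackupMetaServer()])
      print(primary.handle_put("ns/photo-1", 0x1A2B, "px-1", "req-42", ogid=11, opseq=5))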
  • method 700 may be implemented without the distributed ordering requirement, which is the key factor limiting object storage performance.
  • the parallel writes on primary/backup meta servers and data servers disclosed herein maintain crash consistency. Therefore, storage system 600 and method 700 provide better storage performance compared to conventional object stores.
  • proxy server 620, meta server cluster 630, and data server cluster 640 periodically send ALIVE messages to master cluster 650 to confirm that the servers are up and running.
  • Master cluster 650 runs Paxos to maintain a consistent map of the entire system (storage system 600) .
  • master cluster 650 may coordinate crash recovery to get the server back online.
  • the local atomic write implemented by method 300 ensures crash consistency among these servers because it avoids the ordering requirement for separate writes of meta-data on meta server 230 and data server 240 respectively.
  • proxy server 220 can re-issue the request to the recovered meta server 230, which will either return the previously-assigned meta-data (meaning the previous atomic write succeeds) , or handle the request as a new one no matter whether data server 240 has written the data (because the bitmap is not updated) .
  • since proxy server 220 is stateless, its crash can be handled by letting meta server cluster 130 check relevant request logs.
  • data server 240 provides only raw disk service, its crash can be handled by verifying the checksum after it recovers and, if the checksum is not correct, letting proxy server 220 re-issue the request.
  • FIG. 8 is a flowchart of an exemplary method 800 for recovering a meta server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 800 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 800 may include steps S802-S820 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 8.
  • master cluster 650 may detect a crash of meta server M (e.g., one of meta servers 631, 632, or 633) .
  • master cluster 650 may determine that meta server M has crashed if it does not receive the periodic ALIVE message from meta server M.
  • the crash of meta server M may affect object groups that are mapped to it (e.g., OG 411 is mapped to meta server 230) .
  • meta servers 631, 632, or 633 may be assigned to the object groups, and therefore the crash of any of them may affect these object groups.
  • master cluster 650 determines whether meta server M is a primary meta server. If it is a primary meta server (e.g., primary meta server 631) (S804: YES) , a new primary meta server may be automatically selected from the backup meta servers (e.g., backup meta servers 632 and 633) . In some embodiments, the new primary meta server may be selected using CRUSH algorithm. The selected new primary meta server replaces meta server M and becomes the primary meta server.
  • if meta server M is not a primary meta server (e.g., backup meta server 632 or 633) (S804: NO) , method 800 skips step S806 and proceeds directly to step S808.
  • master cluster 650 may determine whether meta server M is back online within a predetermined time T1. For example, master cluster 650 measures the time length between the last received ALIVE message and the next ALIVE message and compares that time length with T1. During T1, it is allowable for each of the affected object groups on meta server M to temporarily have only f-1 meta servers (in an f-way replication) .
  • if meta server M is back online within time T1 (S808: YES) , method 800 proceeds to recover meta server M by performing steps S810-S814. In some embodiments, meta server M updates its local store RocksDB to the latest state. Otherwise, if meta server M is not back within time T1 (S808: NO) , method 800 proceeds to replace meta server M by performing steps S816-S820. For high availability, PUT requests are still permitted for the affected object groups even before meta server M recovers.
  • meta server M compares the operation logs retrieved from its own local store with the operation logs stored on other meta servers and adds missing operation logs to its local store.
  • meta server M negotiates with the other meta servers assigned to the object group and compares the opseq in the key stored in the respective METAX structures on those meta servers. The missing operation logs would be the difference between the local operation logs and the operation logs on the other meta servers.
  • meta server M recovers KV 530 of its METAX structure 500 to the most updated value.
  • meta server M recovers object meta-data and request logs based on the operation logs. For example, meta server M may recover the missing object meta-data with key OBMETA_objectname and the missing request log with key REQLOG_pxid_reqid by retrieving the same from the other meta servers. In some embodiments, meta server M recovers KV 510 and KV 520 of its METAX structure 500 to the most updated values. After steps S810-S814 are performed for each affected object group, the meta data structures (e.g., METAX structure 500) on meta server M will become up-to-date, and meta server M is fully recovered.
  • master cluster 650 may label meta server M “out” for each affected object group and disseminate the updated cluster map.
  • master cluster 650 replaces meta server M with a new meta server M*.
  • meta server M* may be identified among meta server cluster 630 using a CRUSH algorithm.
  • master cluster 650 reassigns volume groups of each affected object group to meta server M* and copies the meta data structures for the object groups over to meta server M*.
  • the meta data structures may be copied from the other meta servers within the same group of meta servers as meta server M that is mapped to the object groups.
  • the key values of each meta data structure may be atomically written in a local store of meta server M*. After steps S816-S820, meta server M* is up-to-date and can functionally replace meta server M, as if meta server M were fully recovered.
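  • To illustrate the log reconciliation in steps S810-S814, the sketch below compares per-object-group operation-log keys between a recovering meta server and a peer and copies over the missing operation-log, object meta-data, and request-log entries. The dictionary record layout and the helper names are illustrative assumptions consistent with the earlier sketches.

      def missing_oplog_keys(local_db, peer_db, ogid):
          """Keys OPLOG_<ogid>_<opseq> present on the peer but absent locally."""
          prefix = f"OPLOG_{ogid}_"
          local_keys = {k for k in local_db if k.startswith(prefix)}
          return [k for k in peer_db if k.startswith(prefix) and k not in local_keys]

      def recover_object_group(local_db, peer_db, ogid):
          for oplog_key in missing_oplog_keys(local_db, peer_db, ogid):
              oplog = peer_db[oplog_key]
              local_db[oplog_key] = oplog                          # S810: add the missing operation log
              obmeta_key = f"OBMETA_{oplog['objectname']}"
              reqlog_key = oplog["reqlogkey"]
              for key in (obmeta_key, reqlog_key):                 # S814: recover meta-data and request log
                  if key in peer_db:
                      local_db[key] = peer_db[key]

      # Toy example: the peer has one PUT (opseq 6) that the recovering server missed.
      peer = {"OPLOG_11_6": {"objectname": "ns/photo-2", "reqlogkey": "REQLOG_px-1_req-43"},
              "OBMETA_ns/photo-2": {"vgid": "vg-7"},
              "REQLOG_px-1_req-43": {"objectname": "ns/photo-2"}}
      local = {}
      recover_object_group(local, peer, ogid=11)
      print(sorted(local))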
  • the meta server may crash when it is serving as the primary meta server (e.g., meta server M is primary meta server 631) for an ongoing object write operation.
  • FIG. 9 is a flowchart of an exemplary method 900 for performing an object write operation amid a meta server crash, according to embodiments of the disclosure.
  • method 900 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 900 may include steps S902-S916 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
  • proxy server 620 may send an object PUT request to primary meta server 631, e.g., as in step S706.
  • master cluster 650 may detect a crash of primary meta server 631, e.g., as in step S802.
  • master cluster 650 may perform steps S804-S806 to designate one of backup meta servers 632 and 633 to replace primary meta server 631 as the primary, and perform steps S808-S820 to recover crashed primary meta server 631.
  • proxy server 620 reissues the object PUT request to the new primary meta server.
  • the PUT request may be marked as “reissued. ”
  • the new primary meta server may look up the METAX structure associated with the PUT request. For example, the METAX structure may be searched using key OBMETA_objectname. If the METAX structure is found (S912: YES) , in step S914, the new primary meta server resumes processing of the PUT request, e.g., by returning the vgid and extents to proxy server and sending the retrieved meta-data and meta-log to the relevant backup meta servers (e.g., backup meta servers 632-633) .
  • otherwise, if the METAX structure is not found (S912: NO) , in step S916 the new primary meta server processes the PUT request from scratch as if it were a new request, e.g., by generating the METAX structure using information sent by proxy server 620.
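  • A small sketch of the decision in steps S910-S916, assuming the same dictionary-based store as the earlier sketches: the new primary looks up OBMETA_objectname and either resumes with the previously assigned allocation or treats the PUT as new. The allocate_new callback stands in for the normal allocation path and is a hypothetical name.

      def handle_reissued_put(store, objectname, allocate_new):
          """allocate_new(objectname) is a placeholder for the normal S708-style allocation."""
          obmeta = store.get(f"OBMETA_{objectname}")
          if obmeta is not None:                       # S912: METAX found -> S914: resume processing
              return "resumed", obmeta["vgid"], obmeta["extents"]
          vgid, extents = allocate_new(objectname)     # S916: process the PUT from scratch
          return "new", vgid, extents

      store = {"OBMETA_ns/photo-1": {"vgid": "vg-7", "extents": [(4096, 128)]}}
      print(handle_reissued_put(store, "ns/photo-1", lambda name: ("vg-9", [(0, 128)])))
      print(handle_reissued_put(store, "ns/photo-2", lambda name: ("vg-9", [(0, 128)])))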
  • FIG. 10 is a flowchart of an exemplary method 1000 for recovering a proxy server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 1000 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 1000 may include steps S1002-S1018 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 10.
  • master cluster 650 may detect a crash of proxy server 620. In some embodiments, master cluster 650 may determine that proxy server 620 has crashed if it does not receive the periodic ALIVE message from proxy server 620. In some embodiments, if proxy server 620 crashes, every meta server in meta server cluster 630 may check whether it has ongoing write requests from proxy server 620. For example, proxy server 620 may have a PUT request out to meta server M among meta server cluster 630. The crash of stateless proxy server 620 may affect the PUT requests.
  • for example, meta server M may look up the request log in its local store (e.g., RocksDB, stored in KV 520 of METAX structure 500) . If such a request log exists, the object meta-data typically also exists on meta server M because these values are atomically written during a write operation.
  • if the object meta-data exists (S1008: YES) , meta server M obtains the object data according to the allocation data in the object meta-data in step S1010.
  • meta server M can read the ID of the assigned volume group (vgid) , the allocated in-volume positions (extents) , and the object data checksum (from the proxy server) .
  • Meta server M may send the allocation data (e.g., vgid and extents) to the corresponding data server to obtain the object data stored in the disk volume and the in-volume position of the data server.
  • meta server M validates the checksum of the object data against the checksum in the object meta-data by comparing them.
  • the checksum in the object meta-data is provided by proxy server 620 as part of the write request. If the checksums match (S1012: YES) , the write request is finished, and in step S1014 meta server M may look up the operation log in its local store and atomically delete the request log and the operation log. For example, meta server M may look up the operation log in RocksDB (e.g., stored in KV 530 of METAX structure 500) with key oplogkey.
  • otherwise, if the checksums do not match (S1012: NO) , in step S1016 meta server M may look up the operation log in its local store and atomically delete the object meta-data, the request log and the operation log.
  • meta server M may additionally look up the bitmap and clear the corresponding bits of the bitmap. For example, meta server M may look up the bitmap stored in KV 540 of METAX structure 500 using the vgid in the object meta-data.
  • if the object meta-data does not exist (S1008: NO) , meta server M simply deletes the request log in step S1018.
  • master cluster 650 may further detect that proxy server 620 is recovered from the crash (e.g., master cluster 650 receives ALIVE messages from proxy server 620 again) . Proxy server 620 may reissue the write request to meta server M so that the write request can be processed normally.
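  • The sketch below walks the branches of method 1000 for one dangling request log, assuming a dictionary store, a CRC32 checksum, and a callable that reads the object bytes back from the data server; all helper names are hypothetical.

      import zlib

      def recover_after_proxy_crash(store, reqlog_key, read_from_data_server):
          reqlog = store[reqlog_key]
          obmeta_key = f"OBMETA_{reqlog['objectname']}"
          obmeta = store.get(obmeta_key)
          if obmeta is None:                                   # S1008: NO -> S1018
              del store[reqlog_key]
              return "deleted request log"
          data = read_from_data_server(obmeta["vgid"], obmeta["extents"])   # S1010
          if zlib.crc32(data) == obmeta["checksum"]:           # S1012: YES -> S1014
              del store[reqlog_key]
              del store[reqlog["oplogkey"]]
              return "write finished; logs cleaned up"
          # S1012: NO -> S1016: roll the unfinished write back and free the allocated blocks.
          for key in (obmeta_key, reqlog_key, reqlog["oplogkey"]):
              del store[key]
          for off, _ in obmeta["extents"]:
              store[f"VOLBITMAP_{obmeta['vgid']}"][off] = False
          return "unfinished write rolled back"

      store = {"REQLOG_px-1_req-42": {"objectname": "ns/photo-1", "oplogkey": "OPLOG_11_5"},
               "OPLOG_11_5": {"objectname": "ns/photo-1"},
               "OBMETA_ns/photo-1": {"vgid": "vg-7", "extents": [(0, 12)],
                                     "checksum": zlib.crc32(b"object-bytes")},
               "VOLBITMAP_vg-7": {0: True}}
      print(recover_after_proxy_crash(store, "REQLOG_px-1_req-42",
                                      lambda vgid, extents: b"object-bytes"))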
  • FIG. 11 is a flowchart of an exemplary method 1100 for recovering a data server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 1100 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 1100 may include steps S1102-S1118 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 11.
  • master cluster 650 may detect a crash of data server D (e.g., one of data servers 641, 642, or 643) . In some embodiments, master cluster 650 may determine that data server D has crashed if it does not receive the periodic ALIVE message from data server D. The crash of data server D may affect volume groups on data server D that the responsible meta server assigns to the object groups (e.g., VG 421 is assigned to OG 411 by meta server 230) .
  • master cluster 650 may mark data server D read-only so that no object data can be written to data server D before it recovers.
  • master cluster 650 also notifies the responsible meta servers of the affected volume groups about data server D’s new status (i.e., read-only) .
  • primary meta server 631 may be a “responsible meta server” if it maps certain object groups it handles to one or more affected volume groups. After the notification, the responsible meta servers will temporarily skip the affected volume groups while handling new PUT requests.
  • master cluster 650 may determine whether data server D is back online within a predetermined time T2. For example, master cluster 650 measures the time length between the last received ALIVE message and the next ALIVE message and compares that time length with T2. If data server D is not back within time T2 (S1108: NO) , method 1100 proceeds to perform steps S1110-S1114 to replace data server D. For high availability, PUT requests are still permitted for the affected object groups even before data server D recovers (as will be described using FIG. 12) .
  • in step S1110, master cluster 650 replaces data server D with a new data server D*.
  • in some embodiments, data server D* may be identified among data server cluster 640.
  • master cluster 650 notifies the meta servers responsible for the affected volume groups of the replacement.
  • object data stored in the affected volume groups may be recovered from one of the remaining data servers in data server cluster 640 (e.g., a data server different from data server D and data server D*) to data server D*.
  • a VG may have f volumes on f disks (e.g., for f-way replication) on different data servers.
  • the object data can be recovered by first identifying another data server that has the replication data stored thereon and then copying over the replication data from that data server to data server D*.
  • Data server D* is now recovered.
  • master cluster 650 may mark data server D* as writable. In other words, data server D has been functionally replaced by data server D* and volume groups on data server D* are available for assignment.
  • if data server D is back within time T2 (S1108: YES) , method 1100 proceeds directly to step S1116 to mark data server D as writable. In other words, data server D is fully recovered and volume groups on data server D are available for assignment again. In step S1118, data service of data server D or data server D* may be resumed.
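  • The following sketch models the master cluster's side of method 1100 with a simple heartbeat table: a data server whose ALIVE message is missing is marked read-only, and if it stays down past a timeout T2 it is replaced by a spare which is then marked writable. The class name, timeout values, and status strings are illustrative assumptions.

      import time

      class MasterCluster:
          def __init__(self, alive_timeout=5.0, replace_timeout=60.0):
              self.alive_timeout = alive_timeout       # missing ALIVE -> treat as crashed
              self.replace_timeout = replace_timeout   # T2: grace period before replacement
              self.last_alive = {}                     # data server id -> last ALIVE timestamp
              self.status = {}                         # data server id -> "writable" / "read-only" / "replaced"

          def on_alive(self, server_id):
              self.last_alive[server_id] = time.monotonic()
              self.status.setdefault(server_id, "writable")

          def check(self, server_id, spare_id, now=None):
              now = time.monotonic() if now is None else now
              down_for = now - self.last_alive[server_id]
              if down_for <= self.alive_timeout:
                  self.status[server_id] = "writable"          # S1108 YES -> S1116: back online
              elif down_for <= self.replace_timeout:
                  self.status[server_id] = "read-only"         # S1104: crash detected, mark read-only
              else:                                            # S1108 NO -> S1110-S1114: replace with D*
                  self.status[server_id] = "replaced"
                  self.status[spare_id] = "writable"           # D* recovered and marked writable
              return self.status[server_id]

      master = MasterCluster()
      master.on_alive("data-D")
      print(master.check("data-D", "data-D*", now=time.monotonic() + 120))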
  • the data server may crash during an ongoing object write operation, e.g., when object data is being written to the data server.
  • FIG. 12 is a flowchart of an exemplary method 1200 for performing an object write operation amid a data server crash, according to embodiments of the disclosure.
  • method 1200 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600.
  • Method 1200 may include steps S1202-S1214 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 12.
  • proxy server 620 may send object data to data server D for writing, e.g., as in step S714.
  • proxy server 620 may issue an object PUT request to primary meta server 631, which performs volume mapping to map the object group to a volume group of a volume group listing.
  • Proxy server 620 may send the object data to be written in the f volumes of the volume group (in an f-way replication) .
  • One or more volumes of the volume group may be on data server D.
  • master cluster 650 may detect a crash of data server D, e.g., as in step S1102.
  • master cluster 650 may perform steps 1104-1116 to recover crashed data server D.
  • in parallel to the recovery in step S1206 and without waiting for its completion, proxy server 620 reissues the object PUT request to primary meta server 631 in step S1208.
  • the PUT request may be marked as “reissued. ”
  • primary meta server 631 may select a new volume group from the healthy volume groups of the VGL that are available for assignment.
  • VGL 420 may include VGs 421 and 422. If previously selected VG 421 is on crashed data server D, primary meta server 630 may select a new VG (e.g., VG 422) in the mapped VGL and may volume map the object to the selected VG.
  • primary meta server 631 may withdraw assignment to the affected volume group (e.g., VG 421) .
  • primary meta server 631 may do so by adding one more operation to the current atomic KV writes to the local RocksDB.
  • the operation may be to update the bitmap of the affected VG by clearing the bits of the bitmap.
  • step S1214 normal processing of PUT request may be resumed, for example, according to method 700.
  • all servers may become offline temporarily and reboot, e.g., due to a power outage.
  • the meta servers of meta server cluster 630 may negotiate with each other for the operation logs to synchronize the meta-data. For example, the meta servers compare notes of the operation logs retrieved from their local stores (e.g., RocksDBs) and construct a most complete version of the operation log. The meta servers may then roll back any unfinished operations according to the request logs. Afterwards, storage system 600 may resume the service for processing object operations.
  • the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer- readable storage devices.
  • the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
  • the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

Abstract

Embodiments of the disclosure provide systems and methods for recovering a data server crash. The exemplary method includes detecting a crash of a first data server and marking an affected volume group associated with the first data server read-only. The method further includes determining that the first data server is not back online within a predetermined time. The method also includes replacing the first data server with a second data server and recovering object data of the affected volume group to the second data server.

Description

DATA SERVER CRASH RECOVERY IN OBJECT STORAGE SYSTEM USING ENHANCED META STRUCTURE
TECHNICAL FIELD
The present disclosure relates to systems and methods for object storage, and more particularly to, systems and methods for recovering a data server crash in an object storage system using an enhanced meta structure.
BACKGROUND
Object storage has been widely adopted by online-data-intensive (OLDI) applications for data analysis and computation. Compared to files or blobs of which the sizes vary from several kilobytes to many gigabytes, the sizes of objects (like photos, audio/video pieces, and H5 files) in the OLDI applications are relatively small (at most a few megabytes) , so as to support large-scale, latency-sensitive computation.
Different from social media applications (e.g., Facebook TM) , where reads are predominant, currently many OLDI applications are write-intensive such that they write a large number of objects online of which only a fraction will be read later for computation. Data placement is critical for the performance of these write-intensive applications storing hundreds of billions of small objects in thousands of machines.
The two main placement architectures for object data are: (1) directory-based placement (e.g., Haystack) , where a directory service is maintained from which the clients can retrieve the mapping from objects to disk volumes; and (2) the calculation-based placement (e.g., CRUSH) , where the mapping from objects to volumes can be independently calculated based on the objects’ globally unique names.
Calculation-based placement cannot control the mapping from objects to volumes and thus will cause huge data migration in storage capacity expansions resulting in significant performance degradation. Therefore, the calculation-based placement is not smoothly expandable. This is unacceptable to the fast-growing latency sensitive OLDI applications.
On the other hand, although the directory-based placement is flexible in controlling the object-to-volume mapping and can easily realize migration-free expansions, it requires writing of the storage meta-data (e.g., mappings from objects both to volumes and to in-volume positions) and the actual object data before acknowledgements. The co-existence of object data and meta-data complicates the processing of requests (e.g., object PUT request) in current object stores. To ensure  crash consistency of an object PUT request, current directory-based object stores have to orchestrate a sequence of distributed writes in a particular order. The distributed ordering requirement severely affects the write performance of small objects, making directory-based object stores inefficient for write-intensive applications.
Embodiments of the disclosure address the above problems by providing systems and methods for recovering server crashes in object storage system using an enhanced meta structure.
SUMMARY
Embodiments of the disclosure provide a method for recovering a data server crash. The exemplary method includes detecting a crash of a first data server and marking an affected volume group associated with the first data server read-only. The method further includes determining that the first data server is not back online within a predetermined time. The method also includes replacing the first data server with a second data server and recovering object data of the affected volume group to the second data server.
Embodiments of the disclosure also provide a method for performing a write operation of an object. The exemplary method includes issuing, by a proxy server, a request for the write operation to a meta server. The method further includes selecting, by the meta server, a first volume group for writing the object. The first volume group comprises a volume of a data server. The method also includes detecting a crash of the data server and reissuing, by the proxy server, the request for the write operation to the meta server. The method additionally includes selecting, by the meta server, a second volume group for writing the object and withdrawing from writing the object to the first volume group.
Embodiments of the disclosure further provide an object storage system, including a master cluster and a data server cluster, which includes a first data server and a second data server. The master cluster is configured to detect a crash of the first data server and mark an affected volume group associated with the first data server read-only. The master cluster is further configured to determine that the first data server is not back online within a predetermined time. The master cluster is also configured to replace the first data server with the second data server and recover object data of the affected volume group to the second data server.
Embodiments of the disclosure further provide a system for performing a write operation of an object. The exemplary system includes a proxy server; meta server, a  data server cluster including a data server, and a master cluster. The proxy server is configured to issue a request for the write operation to the meta server. The meta server is configured to select a first volume group for writing the object. The first volume group includes a volume of the data server. The master cluster is configured to detect a crash of the data server. The proxy server is further configured to reissue the request for the write operation to the meta server. The meta server is further configured to select a second volume group for writing the object and withdraw from writing the object to the first volume group.
Embodiments of the disclosure also provide a non-transitory computer readable medium storing computer instructions, when executed by a master cluster of an object storage system, perform a method for recovering a data server crash. The exemplary method includes detecting a crash of a first data server and marking an affected volume group associated with the first data server read-only. The method further includes determining that the first data server is not back online within a predetermined time. The method also includes replacing the first data server with a second data server and recovering object data of the affected volume group to the second data server.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a schematic diagram of an exemplary object storage system, according to embodiments of the disclosure.
FIG. 2 illustrates a flow diagram of an exemplary write operation implementing local write atomicity, according to embodiments of the disclosure.
FIG. 3 is a flowchart of an exemplary method for performing the write operation of FIG. 2, according to embodiments of the disclosure.
FIG. 4 illustrates an exemplary framework for hybrid object placement, according to embodiments of the disclosure.
FIG. 5 illustrates an exemplary METAX structure for meta-data storage, according to embodiments of the disclosure.
FIG. 6 illustrates a schematic diagram of an exemplary object storage system configured to process an operation request using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 8 is a flowchart of an exemplary method for recovering a meta server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 9 is a flowchart of an exemplary method for performing an object write operation amid a meta server crash, according to embodiments of the disclosure.
FIG. 10 is a flowchart of an exemplary method for recovering a proxy server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 11 is a flowchart of an exemplary method for recovering a data server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 12 is a flowchart of an exemplary method for performing an object write operation amid a data server crash, according to embodiments of the disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 100” ) , according to embodiments of the disclosure. In some embodiments, storage system 100 may include components shown in FIG. 1, including a user device 110 used by a client to send client requests, a proxy cluster 120 for surrogating client requests, a meta server cluster 130 for providing meta service, a data server cluster 140 for providing data storage service, a master cluster 150 for managing the clusters and disseminating directory mappings, and a network 160 for facilitating communications among the various components.
Consistent with the present disclosure, storage system 100 is configured to provide high input and output performance for write-intensive OLDI applications (e.g., in-journey audio recording, face identification, online sentiment analysis, driver/passenger assessment, etc. ) while ensuring the crash consistency by enforcing the local write atomicity of the enhanced meta structure (e.g., METAX) instead of  applying a distributed write ordering. Storage system 100 may be configured to perform various object operations including, for example, an operation that writes an object in the storage system (also referred to as an object write operation or an object PUT) , an operation that reads an object stored in the storage system (also referred to as an object read operation or an object GET) , and an operation that deletes an object from the storage system (also referred to as an object delete operation or an object DELETE) . In some embodiments, storage system 100 may perform some other object operations by combining the above basic operations. For example, an object overwrite operation can be performed by first deleting the object and then writing a new one with the same object name. It is contemplated that storage system 100 may also be configured to perform other auxiliary operations, such as object LIST and namespace CREATE and DELETE.
In some embodiments, user device 110 may include, but is not limited to, a facial recognition camera, a video/audio recorder, a server, a database/repository, a netbook computer, a laptop computer, a desktop computer, a media center, a gaming console, a television set, a set-top box, a handheld device (e.g., smart phone, tablet, etc.), a smart wearable device (e.g., eyeglasses, wrist watch, etc.), or any other suitable device. In some embodiments, storage system 100 may perform an object operation when receiving a request from user device 110, e.g., an object PUT, GET or DELETE request. In some embodiments, a request may include the object data, the application meta-data and a URL. For example, in an object PUT request, the object data may be organized into groups and then be assigned to data server cluster 140. The ID of the groups, the in-group sequence number of the ongoing object PUT request and the allocation result (e.g., a current bitmap) of the assigned volume groups may be logged by meta server cluster 130.
In some embodiments, the object data provided by user device 110 may contain data of a document, a picture, a piece of video or audio, an e-book, etc. Consistent with the present disclosure, the object data may be generated by user device 110 and may be sent to proxy cluster 120 for storage and/or replication. In some embodiments, the object data may be striped/divided into multiple objects and may be stored to data server cluster 140, according to a directory mapping generated by master cluster 150.
In some embodiments, proxy cluster 120 may be any suitable device (e.g., a server, a computer, etc. ) that can provide object input/output (I/O) service to the clients such as user device 110, and can provide object PUT/GET/DELETE request service (s) for the clients as a surrogate. For example, the objects may be replicated and stored on  proxy cluster 120 and they are identified by their globally unique identifications (e.g., object names) .
In some embodiments, meta server cluster 130 may include one or more machines each installing multiple disks for providing meta service (s) . In some embodiments, each disk of meta server cluster 130 may run a meta server process. For example, machines within meta server cluster 130 may maintain a local store which supports both atomic writes and efficient queries of the meta-data/meta-log of the data file.
In some embodiments, data server cluster 140 may include one or more storage machines each installing many disks for providing raw disk service (s) . In some embodiments, each disk in data server cluster 140 may run a lightweight data server process (e.g., process for executing database retrieval and updates, managing data integrity, dispatching responses to client requests, etc. ) . For example, data server cluster 140 may purely perform disk I/O to directly read/write object data without filesystem overhead (e.g., the meta-data/meta-log of the object data) .
In some embodiments, storage system 100 may also include a master cluster 150 being responsible for managing the entire system. For example, master cluster 150 may be configured to maintain a consistent cluster map (e.g., by running Paxos) of proxy cluster 120, meta server cluster 130, data server cluster 140 as well as master cluster 150 itself. In some embodiments, master cluster 150 may also maintain and disseminate the directory mapping. Master cluster 150 may further periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
In some embodiments, master cluster 150 may include a general-purpose or special-purpose server, a computing device, or any suitable device that may be responsible for managing the entire system. For example, master cluster 150 may include an odd number of cluster master processes (e.g., processes that manage the clusters state) , and may run Paxos (protocols for solving consensus in a network of processors) to maintain a consistent cluster map of proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150 itself. Master cluster 150 may also maintain and disseminate the directory mapping. For example, master cluster 150 may periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
In some embodiments, storage system 100 may include network 160 to facilitate the communication among the various components of storage system 100, such as user device 110, proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150. For example, the network may be a local area network (LAN), a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service), a client-server network, a wide area network (WAN), etc. In some embodiments, the network may be replaced by wired data communication systems or devices.
It is contemplated that storage system 100 may include more or fewer components compared to those shown in FIG. 1. In some embodiments, the various components of storage system 100 may be remote from each other or in different locations and be connected through the network. In some alternative embodiments, certain components of storage system 100 may be located on the same site or inside one device. For example, proxy cluster 120 may be located on-site with or be part of user device 110, such as being an application or a computer program that can be executed by user device 110.
In some embodiments, storage system 100 may be configured to perform object PUT operations that eliminate the ordering requirement of conventional write operations, which usually penalizes storage performance. In some embodiments, storage system 100 places the object meta-data, ongoing meta-logs and object attributes on meta server cluster 130, while leaving proxy cluster 120 to be a stateless surrogate and data server cluster 140 to provide only raw disk service. In some embodiments, storage system 100 may use an enhanced meta structure (referred to as METAX) that contains the traditional meta-data of objects, the meta-logs and object attributes. A local atomic write using the METAX structure therefore does not require distributed write ordering of data/meta-data and thus may be applied to ensure crash consistency based on the object immutability. This would significantly increase the I/O performance of storage system 100.
In some embodiments, when performing the local atomic write, all file data except for the object data may be placed on meta server cluster 130. This may avoid uncertainty caused by applying distributed write ordering. For example, when processing object PUT request(s), meta server cluster 130 may maintain (i) the complete placement meta-data recording the allocated mapping from objects to volumes and in-volume positions, (ii) the object attributes including data checksum, (iii) the meta-logs of the ongoing object PUT request(s) recording IDs of proxy cluster 120 and the object keys, and (iv) the volume allocation results in the form of bitmaps recording the status of disk blocks in the allocated volumes. Data server cluster 140 may provide raw disk service that is agnostic to the existence of objects and is only responsible for storing pure object data onto specific disk blocks.
FIG. 2 illustrates a flow diagram 200 of an exemplary write operation implementing a local write atomicity, according to embodiments of the disclosure. In some embodiments, master cluster 150 may coordinate proxy cluster 120, meta server cluster 130 and data server cluster 140 according to flow diagram 200. FIG. 3 is a flowchart of an exemplary method 300 for performing the write operation of FIG. 2, according to embodiments of the disclosure. In some embodiments, method 300 may be performed by storage system 100 and may include steps S302-S310 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 2 and FIG. 3 will be described together.
In step S302, a client 210 may send an object PUT request (e.g., via user device 110) to a proxy server 220 (e.g., among proxy cluster 120) . In some embodiments, the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or the user (e.g., user device 110) and object is the object name. In step S304, proxy server 220 may send application meta-data to a meta server 230 (e.g., among meta server cluster 130) . In some embodiments, application meta-data of the object may include the object key (e.g., a key computed by hashing the object name) . In some embodiments, proxy server 220 may also send attributes of the object to meta server 230 such as a data checksum computed using the object data.
In step S306, meta server 230 may schedule and return allocation data to proxy server 220, and simultaneously perform an atomic write of the various data associated with the object and the operation request, such as the allocation data, meta-logs of the request, and the application meta-data, to the local store of meta server 230. In some embodiments, the allocation data may include placement meta-data for the object (e.g., the allocated disk volume and in-volume position for the object). In some embodiments, meta server 230 further determines an allocation result bitmap, where each bit represents the allocation status of one disk block in the allocated disk volume at the in-volume position. In some embodiments, meta server 230 may also atomically write the allocation result bitmap to its local store along with the other information and data described above.
Consistent with the present disclosure, an atomic write requires all operations in the local write either succeed or fail together. In some embodiments, method 300 only  requires the write on meta server 230 to be atomic. In some embodiments, meta server 230 may also return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the atomic write is successfully completed. It is contemplated that atomic write can also be implemented on other devices, such as proxy server 220.
In step S308, proxy server 220 may send object data and allocation data for the object (e.g., volume ID and in-volume offset) to the assigned disk volume within a data server 240 (e.g., data server among data server cluster 140 that includes the assigned disk volume) . In step S310, data server 240 may write the received object data in the disk volume according to the allocation data such as the volume ID and the in-volume offset, and return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the write is completed. In some embodiments, after receiving the acknowledgements from both meta server 230 (in step S306) and data server 240 (in step S310) , proxy server 220 may send an acknowledgement (e.g., SUCCESS) to client 210 (e.g., via user device 110) indicating the object PUT request has been successfully fulfilled.
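By way of illustration only, the coordination of steps S304-S310 can be sketched with a short, self-contained simulation. The class names, the single-extent allocator, and the CRC32 checksum below are illustrative assumptions and do not reflect the disclosed implementation; the point of the sketch is that only the meta-server write needs to be all-or-nothing.

```python
import zlib

class MetaServer:
    """Simplified meta server: one atomic local write covers all meta records."""
    def __init__(self):
        self.store = {}          # stands in for the local store (e.g., RocksDB)
        self.next_offset = 0

    def put_meta(self, object_name, app_meta, checksum, size):
        # Allocate an in-volume position, then record the placement meta-data,
        # the checksum and the allocation together in one all-or-nothing update.
        allocation = {"vgid": 1, "offset": self.next_offset, "length": size}
        self.next_offset += size
        self.store["OBMETA_" + object_name] = {
            "app_meta": app_meta, "checksum": checksum, **allocation}
        return allocation

class DataServer:
    """Raw disk service: writes bytes at the given position, nothing more."""
    def __init__(self):
        self.disk = bytearray(1 << 20)

    def write(self, allocation, data):
        off = allocation["offset"]
        self.disk[off:off + len(data)] = data
        return "SUCCESS"

def put_object(meta_server, data_server, object_name, data, app_meta):
    # Step S304: the proxy sends application meta-data and a checksum to the meta server.
    checksum = zlib.crc32(data)
    allocation = meta_server.put_meta(object_name, app_meta, checksum, len(data))
    # Steps S308-S310: the proxy sends the object data plus allocation to the data server.
    ack = data_server.write(allocation, data)
    # The client is acknowledged only after both servers have acknowledged.
    return ack == "SUCCESS"

print(put_object(MetaServer(), DataServer(), "ns/photo1", b"object bytes", {"type": "jpeg"}))
```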
In some embodiments, as shown in FIG. 1, meta server cluster 130 may include multiple meta servers to provide scalable meta service. In some embodiments, a meta server (e.g., meta server 230) may be identified among meta server cluster 130 as responsible for a given object. In some embodiments, the responsible meta server may be identified using a predetermined lookup table or directly calculated via a decentralized CRUSH algorithm without extra lookup overhead.
In some embodiments, storage system 100 may implement a hybrid placement architecture where the directory (e.g., master cluster 150) may maintain the mapping from objects to volumes and disseminate the mapping to the meta servers (e.g., meta server cluster 130) , and the meta servers (e.g., meta server cluster 130) may maintain the mapping from objects to in-volume positions and provide meta service (s) . For example, FIG. 4 illustrates an exemplary framework 400 for hybrid object placement, according to embodiments of the disclosure. In some embodiments, an intermediate mapping layer of volume group lists (VGL) each having multiple volume groups (VG) may be included in the hybrid placement architecture. For example, as shown in FIG. 4, framework 400 may include VGLs 420-430 where VGL 420 may include  VGs  421 and 422. In some embodiments, a disk may be organized into multiple volumes and a VG (similar to a logical volume in Haystack) may have f volumes on f disks (e.g., for f-way replication) that have exactly the same bitmap and data. An f-way replication stores f replicas for each object.
As illustrated in FIG. 4, objects to be stored in the storage system are first organized into object groups (OGs), e.g., including object groups 411-413. In some embodiments, the OG for a particular object may be determined by computing the modulo of the hashing of the object name, e.g., ogid = HASH (name) mod OG_NUM, where OG_NUM is the number of OGs (e.g., OG_NUM = 2^n). In some embodiments, an OG (e.g., OG 411) may be mapped to meta server 230 by a decentralized CRUSH algorithm. The OG may be additionally mapped to a VGL (e.g., VGL 420) by the central directory, e.g., master cluster 150 (known as a "directory mapping"). The central directory may disseminate the directory mapping (i.e., OG-to-VGL mapping) to meta server 230. For example, after organizing an object to an OG (e.g., OG 411), the mapped meta server (e.g., meta server 230) may select a VG (e.g., VG 421) in the mapped VGL (e.g., VGL 420) and may map the object to the f positions in the f volumes of the selected VG (known as a "volume mapping"). The f positions may be exactly the same in the respective f volumes.
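By way of illustration only, the two-level lookup can be sketched as below. The MD5 hash, the 16-group configuration, and the modulo stand-in for the CRUSH placement are illustrative assumptions made for this sketch only.

```python
import hashlib

OG_NUM = 2 ** 4                                              # number of object groups (assumed)
DIRECTORY = {og: f"VGL_{og % 2}" for og in range(OG_NUM)}    # central OG -> VGL mapping
META_SERVERS = ["meta-0", "meta-1", "meta-2"]

def object_group(name: str) -> int:
    """ogid = HASH(name) mod OG_NUM, as described above."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % OG_NUM

def responsible_meta_server(ogid: int) -> str:
    # Stand-in for the decentralized CRUSH placement: any deterministic,
    # lookup-free function of the object group ID serves the sketch.
    return META_SERVERS[ogid % len(META_SERVERS)]

name = "ns/photo1"
ogid = object_group(name)
print(ogid, responsible_meta_server(ogid), DIRECTORY[ogid])
```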
The hybrid placement architecture disclosed herein avoids unnecessary migration of object data by keeping the existing mapping (e.g., from objects to VGLs, VGs and in-volume positions) unaffected whenever possible. For example, when cluster expansion happens (e.g., one or more disks are added to storage system 100), the directory service may construct new VGs on the newly added disks and later map the newly constructed VGs to the OGs on demand. In some embodiments, if one of the meta servers in meta server cluster 130 (e.g., meta server 230) fails, OGs assigned to that meta server would be "crushed" to other meta servers, but the directory/volume mappings can remain unchanged after recovery.
When processing object request (s) , to ensure crash consistency without the distributed write ordering, both the placement/application meta-data and the meta-log (e.g., proxy server ID and request ID) of the object request may be written atomically. Also, because objects are organized into groups before being assigned to data servers (e.g., data server cluster 140) , the meta server (e.g., meta server 230) needs to log the object group ID (ogid) and the in-group sequence number of the ongoing request. Moreover, the meta server may also record the allocation result (current bitmap) of the assigned volume group with ID= vgid.
In order for the meta-data/meta-log to be efficiently stored (e.g., in a PUT request) and queried (e.g., in a GET request), they may be maintained in a particular form of data structure. For example, each meta server of meta server cluster 130 may run RocksDB as the local store to maintain the meta-data/meta-log in the form of a METAX structure that includes several key values (KVs). FIG. 5 illustrates an exemplary METAX structure 500 for meta-data storage, according to embodiments of the disclosure. As illustrated in FIG. 5, each record of the meta-data/meta-log of PUT request(s) in METAX structure 500 may include four KVs 510-540. For example, for every object PUT request, the located primary meta server (e.g., the meta server in meta server cluster 130 located by the CRUSH algorithm) may atomically write KVs 510-540 to the local store (e.g., RocksDB), which could collaboratively ensure crash consistency.
In some embodiments, KV 510 may store the placement/application meta-data, in a form of <OBMETA_objectname, {app-meta-data, vgid, extents, checksum}>, where OBMETA stands for "object metadata" (e.g., may be queried by key prefix matching in RocksDB), and objectname stands for the globally unique identification (e.g., object name) of the object. In some embodiments, the value of KV 510 may include the app-meta-data (e.g., the attributes), the ID of the assigned volume group (vgid), the allocated in-volume positions (extents), and the object data checksum (from the proxy server).
In some embodiments, KV 520 may be used for logging the ongoing object PUT request(s) of the proxy server (e.g., proxy cluster 120), in a form of <REQLOG_pxid_reqid, {objectname, oplogkey}>, where REQLOG stands for "request log", pxid stands for the ID of the proxy server and reqid stands for the request ID received from the proxy server. In some embodiments, the value of KV 520 includes the globally unique name of the object objectname, and the key of the current operation's log OPLOG_ogid_opseq.
In some embodiments, KV 530 may be used for logging the ongoing object PUT request(s) of the object groups, in a form of <OPLOG_ogid_opseq, {objectname, reqlogkey}>, where OPLOG stands for the "operation log", ogid stands for the ID of the object group and opseq stands for a version number monotonically increased for every PUT of an object in the object group. In some embodiments, the value of KV 530 includes the name of the object (objectname) and the key of the current PUT request's log REQLOG_pxid_reqid.
In some embodiments, KV 540 may record the volume allocation result, in a form of <VOLBITMAP_vgid, bitmap>, where VOLBITMAP stands for the "volume bitmap", and vgid stands for the ID of the assigned volume group. The value bitmap stands for the allocated bitmap, where each bit represents the allocation status of one disk block in the volume, which is coherently updated according to the allocated extents in KV 510. In some embodiments, the bitmap is auxiliary and may be cached in memory for fast volume space allocation.
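By way of illustration only, the sketch below assembles the four KVs of one METAX record for a single PUT. The concrete key formatting and the sample values are illustrative assumptions and do not reflect the exact on-disk encoding.

```python
def build_metax_record(objectname, app_meta, vgid, extents, checksum,
                       pxid, reqid, ogid, opseq, bitmap):
    """Return the four key-values written atomically for one object PUT."""
    return {
        # KV 510: placement/application meta-data
        f"OBMETA_{objectname}": {"app-meta-data": app_meta, "vgid": vgid,
                                 "extents": extents, "checksum": checksum},
        # KV 520: request log of the proxy, pointing at the operation log key
        f"REQLOG_{pxid}_{reqid}": {"objectname": objectname,
                                   "oplogkey": f"OPLOG_{ogid}_{opseq}"},
        # KV 530: operation log of the object group, pointing back at the request log
        f"OPLOG_{ogid}_{opseq}": {"objectname": objectname,
                                  "reqlogkey": f"REQLOG_{pxid}_{reqid}"},
        # KV 540: allocation bitmap of the assigned volume group
        f"VOLBITMAP_{vgid}": bitmap,
    }

record = build_metax_record("ns/photo1", {"type": "jpeg"}, vgid=7,
                            extents=[(4096, 1024)], checksum=0xDEADBEEF,
                            pxid="proxy-3", reqid="req-42", ogid=11, opseq=5,
                            bitmap=0b1)
print(list(record))
```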
In some embodiments, object request(s) (e.g., PUT, GET and DELETE) may be processed by a high-performance object store (e.g., CHEETA) based on the METAX structure. Both the object data and the meta-data may adopt an f-way replication, and parallel disk writes may be applied. As a result, CHEETA combined with the METAX structure may avoid the ordering requirement in multiple disk writes when processing the request(s). This would significantly outperform the conventional object stores (e.g., Haystack and Ceph) in PUT/GET/DELETE latency and throughput.
FIG. 6 illustrates a schematic diagram of an exemplary object storage system (referred to as "storage system 600") configured to process an operation request using METAX structure 500 of FIG. 5, according to embodiments of the disclosure. In some embodiments, system 600 may include similar components to those in storage system 100, such as a user device 610, a proxy server 620, a meta server cluster 630, a data server cluster 640, e.g., including data servers 641-643, and a master cluster 650. Detailed descriptions of the similar components will be omitted. In some embodiments, each meta server may include a plurality of disks 634 and each data server may include a plurality of disks 644, for storing the meta-data and the object data respectively. In some embodiments, when the object data and the meta-data adopt a three-way replication (f = 3), as illustrated in FIG. 6, meta server cluster 630 may include a primary meta server 631 and two backup meta servers 632 and 633 for the replication. However, it is contemplated that other replication factors can be adopted and storage system 600 can be configured according to the number f selected.
FIG. 7 is a flowchart of an exemplary method 700 for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 700 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 700 may include steps S702-S718 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7. FIG. 6 and FIG. 7 will be described together.
In step S702, the client may send an object PUT request through user device 610 to proxy server 620. In some embodiments, the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or user and object is the object name.
In step S704, proxy server 620 may concatenate ns and object to get a globally unique object name (objectname). For example, proxy server 620 may adopt the CRUSH algorithm to calculate the ID of the responsible meta server in meta server cluster 630 for the object. Specifically, the object may first be classified to an object group and then be crushed to the three meta servers (e.g., the primary meta server 631 and two backup meta servers 632 and 633) of meta server cluster 630. The checksum of the object data may be computed and a UUID (e.g., the reqid) may be generated for the object PUT request. In step S706, proxy server 620 may send the application meta-data, and attributes of the object such as the checksum, object data size and reqid, to primary meta server 631. In some embodiments, proxy server 620 may additionally send log information such as request and operation logs to primary meta server 631.
In step S708, primary meta server 631 may generate the METAX structure and send the KVs of the structure to backup meta servers 632-633. In some embodiments, primary meta server 631 may perform the following sub-steps S708 (i)-(iii) after receiving the object PUT request. In sub-step S708 (i), primary meta server 631 may generate allocation data for the object. In some embodiments, primary meta server 631 may select a volume group (with ID = vgid) for the object, and allocate free space on the selected volumes of the volume group. In some embodiments, the allocation data may be recorded in a list of extents, each of which records the offset and length of a contiguous free space. In sub-step S708 (ii), primary meta server 631 may generate a METAX structure that includes multiple KVs (e.g., KVs 510-540 in FIG. 5). In sub-step S708 (iii), primary meta server 631 may perform the following operations in parallel: (a) returning the allocation data, e.g., the allocated replication group ID (vgid) and space allocation result (extents), to proxy server 620; (b) sending the KVs to backup meta servers 632-633 of meta server cluster 630 for the object group (selected by CRUSH); and (c) writing the KVs to the local store (e.g., the local RocksDB) using operations such as WriteBatch ().
In step S710, backup meta servers 632-633 may atomically write the received KVs to their local store (e.g., the local RocksDB) and return an acknowledgement (e.g., a SUCCESS) to primary meta server 631. In step S712, upon receiving the acknowledgement from backup meta servers 632-633, primary meta server 631 may return another acknowledgement (e.g., another SUCCESS) to proxy server 620.
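By way of illustration only, the replicated primary/backup write of steps S708-S712 can be sketched as below. The LocalStore class is an in-memory stand-in for the per-server RocksDB (whose WriteBatch provides the real all-or-nothing semantics), and the synchronous loop stands in for the parallel sends to the backups.

```python
class LocalStore:
    """In-memory stand-in for the per-meta-server RocksDB instance."""
    def __init__(self):
        self.kv = {}

    def write_batch(self, kvs: dict) -> str:
        # Models an all-or-nothing batch (WriteBatch in RocksDB): every KV of
        # the METAX record becomes visible together.
        self.kv.update(kvs)
        return "SUCCESS"

def replicated_metax_write(primary, backups, kvs: dict) -> str:
    # The primary writes locally and ships the same KVs to the backups; it
    # acknowledges the proxy only after every replica reports SUCCESS.
    acks = [primary.write_batch(kvs)] + [b.write_batch(kvs) for b in backups]
    return "SUCCESS" if all(a == "SUCCESS" for a in acks) else "RETRY"

primary, backups = LocalStore(), [LocalStore(), LocalStore()]
print(replicated_metax_write(primary, backups, {"OBMETA_ns/photo1": {"vgid": 7}}))
```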
In step S714, proxy server 620 may send the object data and the allocation data to the corresponding data servers (e.g., data servers 641-643) in data server cluster 640. In some embodiments, proxy server 620 may receive the allocation data from  primary meta server 631 in step S708 and store it in its in-memory cluster map. Proxy server 620 may look up the volume group (with ID=vgid) and allocation result (extents) in the in-memory cluster map and send to the corresponding data servers. In step S716, the data servers may write the received object data to the positions of disks 644 as specified by the allocation result (extents) and return an acknowledgement to proxy server 620.
In step S718, proxy server 620 may acknowledge to the client via user device 610 after receiving acknowledgement from both primary meta server 631 and data servers 641-643. In some embodiments, the meta-log of the object PUT request (e.g., request log in KV 520 and operation log in KV 530 of FIG. 5) may then be cleaned. In some alternative embodiments, for efficiency purposes, proxy server 620 may periodically (instead of immediately) notify the primary meta server about the committed object PUT request and the corresponding meta-log.
Consistent with the present disclosure, method 700 may be implemented without the distributed ordering requirement, which is the key factor limiting object storage performance in the process. The parallel writes on primary/backup meta servers and data servers disclosed herein keep the crash consistency. Therefore, storage system 600 and method 700 provide better storage performance compared to conventional object stores.
In some embodiments, proxy server 620, meta server cluster 630, and data server cluster 640 periodically send ALIVE messages to master cluster 650 to confirm that the servers are up and running. Master cluster 650 runs Paxos to maintain a consistent map of the entire system (storage system 600). In some embodiments, when an ALIVE message is missing from a particular server, master cluster 650 may coordinate crash recovery to get the server back online.
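By way of illustration only, the heartbeat-based crash detection can be sketched as below. The timeout value and server identifiers are illustrative, and a real master cluster would additionally agree on the outcome via Paxos.

```python
import time

class CrashDetector:
    """Tracks the last ALIVE message per server and flags missing heartbeats."""
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_alive = {}

    def on_alive(self, server_id: str):
        self.last_alive[server_id] = time.monotonic()

    def crashed_servers(self):
        now = time.monotonic()
        return [sid for sid, t in self.last_alive.items()
                if now - t > self.timeout_s]

detector = CrashDetector(timeout_s=0.05)
detector.on_alive("data-server-641")
detector.on_alive("meta-server-631")
time.sleep(0.1)
detector.on_alive("meta-server-631")        # only the meta server checks in again
print(detector.crashed_servers())           # ['data-server-641']
```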
The local atomic write implemented by method 300 (and similarly by method 700) ensures crash consistency among these servers because it avoids the ordering requirement for separate writes of meta-data on meta server 230 and data on data server 240 respectively. For example, if meta server 230 crashes during processing a PUT and then recovers, proxy server 220 can re-issue the request to the recovered meta server 230, which will either return the previously-assigned meta-data (meaning the previous atomic write succeeded), or handle the request as a new one no matter whether data server 240 has written the data (because the bitmap is not updated). As another example, since proxy server 220 is stateless, its crash can be handled by letting meta server 230 check the relevant request logs. As yet another example, since data server 240 provides only raw disk service, its crash can be handled by verifying the checksum after it recovers and, if the checksum is not correct, letting proxy server 220 re-issue the request.
FIG. 8 is a flowchart of an exemplary method 800 for recovering a meta server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 800 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 800 may include steps S802-S820 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 8.
In step S802, master cluster 650 may detect a crash of meta server M (e.g., one of  meta servers  631, 632, or 633) . In some embodiments, master cluster 650 may determine that meta server M has crashed if it does not receive the periodic ALIVE message from meta server M. The crash of meta server M may affect object groups that are mapped to it (e.g., OG 411 is mapped to meta server 230) . For example,  meta servers  631, 632, or 633 may be assigned to the object groups, and therefore the crash of any of them may affect these object groups.
In step S804, master cluster 650 determines whether meta server M is a primary meta server. If it is a primary meta server (e.g., primary meta server 631) (S804: YES), in step S806 a new primary meta server may be automatically selected from the backup meta servers (e.g., backup meta servers 632 and 633). In some embodiments, the new primary meta server may be selected using the CRUSH algorithm. The selected new primary meta server replaces meta server M and becomes the primary meta server.
If meta server M is not a primary meta server (e.g., backup meta server 632 or 633) (S804: NO), method 800 skips step S806 and proceeds directly to step S808. In step S808, master cluster 650 may determine whether meta server M is back online within a predetermined time T1. For example, master cluster 650 measures the time length between the last received ALIVE message and the next ALIVE message and compares that time length with T1. During T1, it is allowable for each of the affected object groups on meta server M to temporarily have only f-1 meta servers (in an f-way replication).
If meta server M is back within time T1 (S808: YES) , method 800 proceeds to recover meta server M by performing steps S810-S814. In some embodiments, meta server M updates its local store RocksDB to the latest state. Otherwise, if meta server  M is not back within time T1 (S808: NO) , method 800 proceeds to replace meta server M by performing steps S816-S820. For high availability, PUT requests are still permitted for the affected object groups even before meta server M recovers.
In step S810, meta server M retrieves operation logs for each of the affected object groups. For example, for each of the affected object groups (with ID = ogid), meta server M uses OPLOG_ogid as the key prefix to retrieve the group's operation logs, e.g., stored in KV 530 of METAX structure 500. In step S812, meta server M compares the operation logs retrieved from its own local store with the operation logs stored on other meta servers and adds missing operation logs to its local store. In some embodiments, meta server M negotiates with the other meta servers assigned to the object group and compares the opseq in the keys stored in the respective METAX structures on those meta servers. The missing operation logs would be the difference between the local operation logs and the operation logs on the other meta servers. As a result, in some embodiments, meta server M recovers KV 530 of its METAX structure 500 to the most updated value.
In step S814, meta server M recovers object meta-data and request logs based on the operation logs. For example, meta server M may recover the missing object meta-data with key OBMETA_objectname and the missing request log with key REQLOG_pxid_reqid by retrieving the same from the other meta servers. In some embodiments, meta server M recovers KV 510 and KV 520 of its METAX structure 500 to the most updated values. After steps S810-S814 are performed for each affected object group, the meta data structures (e.g., METAX structure 500) on meta server M will become up-to-date, and meta server M is fully recovered.
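By way of illustration only, the catch-up of steps S810-S814 can be sketched as a key-set difference over the OPLOG entries of one object group. The dictionary stores and sample keys below are illustrative stand-ins for the RocksDB prefix scans.

```python
def missing_operation_logs(local_logs: dict, peer_logs: dict, ogid: int) -> dict:
    """Return peer OPLOG entries for object group `ogid` absent from the local store."""
    prefix = f"OPLOG_{ogid}_"
    local_keys = {k for k in local_logs if k.startswith(prefix)}
    return {k: v for k, v in peer_logs.items()
            if k.startswith(prefix) and k not in local_keys}

local = {"OPLOG_11_1": {"objectname": "a"}, "OPLOG_11_2": {"objectname": "b"}}
peer  = {"OPLOG_11_1": {"objectname": "a"}, "OPLOG_11_2": {"objectname": "b"},
         "OPLOG_11_3": {"objectname": "c"}}          # written while M was down

catch_up = missing_operation_logs(local, peer, ogid=11)
local.update(catch_up)     # the OBMETA and REQLOG entries for "c" are then fetched too
print(sorted(local))
```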
If meta server M is not back within time T1, in step S816, master cluster 650 may label meta server M "out" for each affected object group and disseminate the updated cluster map. In step S818, master cluster 650 replaces meta server M with a new meta server M*. In some embodiments, meta server M* may be identified among meta server cluster 630 using a CRUSH algorithm. In step S820, master cluster 650 reassigns volume groups of each affected object group to meta server M* and copies the meta data structures for the object groups over to meta server M*. The meta data structures may be copied from the other meta servers within the same group of meta servers as meta server M that is mapped to the object groups. In some embodiments, the key values of each meta data structure may be atomically written in a local store of meta server M*. After steps S816-S820, meta server M* is up-to-date and can functionally replace meta server M, as if meta server M is fully recovered.
In some embodiments, the meta server may crash when it is serving as the primary meta server (e.g., meta server M is primary meta server 631) for an ongoing object write operation. FIG. 9 is a flowchart of an exemplary method 900 for performing an object write operation amid a meta server crash, according to embodiments of the disclosure. In some embodiments, method 900 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 900 may include steps S902-S916 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
In step S902, proxy server 620 may send an object PUT request to primary meta server 631, e.g., as in step S706. In step S904, master cluster 650 may detect a crash of primary meta server 631, e.g., as in step S802. In step S906, master cluster 650 may perform steps S804-S806 to designate one of backup meta servers 632 and 633 to replace primary meta server 631 as the primary, and perform steps S808-S820 to recover crashed primary meta server 631. In step S908, after the recovery completes, proxy server 620 reissues the object PUT request to the new primary meta server. In some embodiments, the PUT request may be marked as "reissued."
In step S910, the new primary meta server may look up the METAX structure associated with the PUT request. For example, the METAX structure may be searched using key OBMETA_objectname. If the METAX structure is found (S912: YES), in step S914, the new primary meta server resumes processing of the PUT request, e.g., by returning the vgid and extents to proxy server 620 and sending the retrieved meta-data and meta-log to the relevant backup meta servers (e.g., backup meta servers 632-633). If the METAX structure is not found (S912: NO), in step S916, the new primary meta server processes the PUT request from scratch as if it is a new request, e.g., by generating the METAX structure using information sent by proxy server 620.
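By way of illustration only, the branch taken by the new primary on a reissued PUT can be sketched as below; the plain dictionary store and the request fields are illustrative assumptions.

```python
def handle_reissued_put(store: dict, objectname: str, request: dict) -> dict:
    """Resume the PUT if its METAX record survived the crash, else start over."""
    key = f"OBMETA_{objectname}"
    meta = store.get(key)
    if meta is not None:
        # The previous atomic write succeeded: return the existing allocation.
        return {"status": "resumed", "vgid": meta["vgid"], "extents": meta["extents"]}
    # No record found: treat the reissued request exactly like a new one.
    store[key] = {"vgid": request["vgid"], "extents": request["extents"]}
    return {"status": "new", **store[key]}

store = {"OBMETA_ns/photo1": {"vgid": 7, "extents": [(4096, 1024)]}}
print(handle_reissued_put(store, "ns/photo1", {"vgid": 9, "extents": [(0, 1024)]}))
print(handle_reissued_put(store, "ns/photo2", {"vgid": 9, "extents": [(0, 1024)]}))
```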
FIG. 10 is a flowchart of an exemplary method 1000 for recovering a proxy server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 1000 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 1000 may include steps S1002-S1018 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 10.
In step S1002, master cluster 650 may detect a crash of proxy server 620. In some embodiments, master cluster 650 may determine that proxy server 620 has crashed if it does not receive the periodic ALIVE message from proxy server 620. In some embodiments, if proxy server 620 crashes, every meta server in meta server cluster 630 may check whether it has ongoing write requests from proxy server 620. For example, proxy server 620 may have a PUT request out to meta server M among meta server cluster 630. The crash of stateless proxy server 620 may affect the PUT requests.
In step S1004, meta server M may look up a request log associated with the write request from proxy server 620. For example, meta server M may look up the request log in its RocksDB (e.g., stored in KV 520 of METAX structure 500) with key prefix REQLOG_pxid. The retrieved request log represents possibly unfinished PUT requests (with name = objectname in the value. ) In step S1006, meta server M may further look up the object meta-data based on the request log. For example, meta server M may look up the object meta-data in its RocksDB (e.g., stored in KV 510 of METAX structure 500) with key prefix OBMETA_objectname.
For a write request that has a request log stored on meta server M, the object meta-data typically also exists on meta server M because these values are atomically written during a write operation. For example, KV 510 (which stores the object meta-data) and KV 520 (which stores the request log) of METAX structure 500 are atomically written to the RocksDB. If the object meta-data exists (S1008: YES), meta server M obtains the object data according to the allocation data in the object meta-data in step S1010. In some embodiments, the object meta-data may include the value of KV 510, i.e., value = {app-meta-data, vgid, extents, checksum}. Accordingly, meta server M can read the ID of the assigned volume group (vgid), the allocated in-volume positions (extents), and the object data checksum (from the proxy server). Meta server M may send the allocation data (e.g., vgid and extents) to the corresponding data server to obtain the object data stored in the disk volume and the in-volume position of the data server.
In step S1012, meta server M validates the checksum of the object data against the checksum in the object meta-data by comparing them. The checksum in the object meta-data is provided by proxy server 620 as part of the write request. If the computed checksum matches the stored checksum (S1012: YES), the write request is finished, and in step S1014 meta server M may look up the operation log in its local store and atomically delete the request log and the operation log. For example, meta server M may look up the operation log in RocksDB (e.g., stored in KV 530 of METAX structure 500) with key oplogkey. If the computed checksum does not match the stored checksum (S1012: NO), the write request is unfinished, and in step S1016 meta server M may look up the operation log in its local store and atomically delete the object meta-data, the request log and the operation log. In some embodiments, in step S1016, meta server M may additionally look up the bitmap and clear the corresponding bits of the bitmap. For example, meta server M may look up the bitmap stored in KV 540 of METAX structure 500 using the vgid in the object meta-data.
If the object meta-data does not exist (S1008: NO) , e.g., due to a delete request after the write operation is committed and before the logs are cleaned, meta server M simply deletes the request log in step S1018.
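By way of illustration only, the request-log scan of FIG. 10 can be sketched as below. The in-memory store, the CRC32 checksum, and the single-extent layout are illustrative assumptions; in the disclosed system the deletions would go through one atomic batch on the local RocksDB.

```python
import zlib

def recover_proxy_requests(store: dict, disks: dict, pxid: str):
    """Walk REQLOG_<pxid>_* entries and finish or roll back each PUT."""
    for reqkey in [k for k in list(store) if k.startswith(f"REQLOG_{pxid}_")]:
        objectname = store[reqkey]["objectname"]
        oplogkey = store[reqkey]["oplogkey"]
        meta = store.get(f"OBMETA_{objectname}")
        if meta is None:                      # e.g., deleted after the PUT was committed
            del store[reqkey]
            continue
        data = disks.get((meta["vgid"], meta["extent"]), b"")
        if zlib.crc32(data) == meta["checksum"]:
            # Write finished: drop only the logs.
            store.pop(reqkey); store.pop(oplogkey, None)
        else:
            # Write unfinished: also drop the meta-data
            # (the real system additionally clears the bitmap bits).
            store.pop(reqkey); store.pop(oplogkey, None)
            store.pop(f"OBMETA_{objectname}")

store = {"REQLOG_px3_r42": {"objectname": "ns/photo1", "oplogkey": "OPLOG_11_5"},
         "OPLOG_11_5": {"objectname": "ns/photo1", "reqlogkey": "REQLOG_px3_r42"},
         "OBMETA_ns/photo1": {"vgid": 7, "extent": 4096,
                              "checksum": zlib.crc32(b"object bytes")}}
disks = {(7, 4096): b"object bytes"}
recover_proxy_requests(store, disks, "px3")
print(sorted(store))                          # only OBMETA survives: the PUT was finished
```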
In some embodiments, master cluster 650 may further detect that proxy server 620 is recovered from the crash (e.g., master cluster 650 receives ALIVE messages from proxy server 620 again) . Proxy server 620 may reissue the write request to meta server M so that the write request can be processed normally.
FIG. 11 is a flowchart of an exemplary method 1100 for recovering a data server crash in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 1100 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 1100 may include steps S1102-S1118 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 11.
In step S1102, master cluster 650 may detect a crash of data server D (e.g., one of data servers 641, 642, or 643). In some embodiments, master cluster 650 may determine that data server D has crashed if it does not receive the periodic ALIVE message from data server D. The crash of data server D may affect volume groups on data server D that the responsible meta server assigns to the object groups (e.g., VG 421 is assigned to OG 411 by meta server 230).
In step S1104, master cluster 650 may mark data server D read-only so that no object data can be written to data server D before it recovers. In step S1106, master cluster 650 also notifies the responsible meta servers of the affected volume groups about data server D’s new status (i.e., read-only) . For example, primary meta server 631 may be a “responsible meta server” if it maps certain object groups it handles to one or more affected volume groups. After the notification, the responsible meta  servers will temporarily skip the affected volume groups while handling new PUT requests.
In step S1108, master cluster 650 may determine whether data server D is back online within a predetermined time T2. For example, master cluster 650 measures the time length between the last received ALIVE message and the next ALIVE message and compares that time length with T2. If data server D is not back within time T2 (S1108: NO) , method 1100 proceeds to perform steps S1110-S1114 to replace data server D. For high availability, PUT requests are still permitted for the affected object groups even before data server D recovers (as will be described using FIG. 12) .
In step S1110, master cluster 650 replaces data server D with a new data server D*. In some embodiments, data server D* may be identified among data server cluster 640. In step S1112, master cluster 650 notifies the meta servers responsible for the affected volume groups of the replacement. In step S1114, object data stored in the affected volume groups may be recovered from one of the remaining data servers in data server cluster 640 (e.g., a data server different from data server D and data server D*) to data server D*. For example, a VG may have f volumes on f disks (e.g., for f-way replication) on different data servers. Therefore, the object data can be recovered by first identifying another data server that has the replication data stored thereon and then copying over the replication data from that data server to data server D*. Data server D* is now recovered. In some embodiments, master cluster 650 may mark data server D* as writable. In other words, data server D has been functionally replaced by data server D* and volume groups on data server D* are available for assignment.
If data server D is back within time T2 (S1108: YES), method 1100 proceeds directly to step S1116 to mark data server D as writable. In other words, data server D is fully recovered and volume groups on data server D are available for assignment again. In step S1118, data service of data server D or data server D* may be resumed.
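By way of illustration only, the overall branch of method 1100 can be sketched as below; the cluster map is a plain dictionary and the single surviving replica is an illustrative assumption.

```python
def recover_data_server(cluster: dict, crashed: str, replacement: str, back_online: bool) -> str:
    """Sketch of method 1100: wait for D, otherwise rebuild it on D* from a replica."""
    cluster[crashed]["writable"] = False                 # step S1104: mark read-only
    if back_online:                                      # S1108: YES
        cluster[crashed]["writable"] = True              # step S1116
        return crashed
    # S1108: NO -> steps S1110-S1114: copy replica data from a surviving peer.
    donor = next(s for s, v in cluster.items()
                 if s not in (crashed, replacement) and v["volumes"])
    cluster[replacement]["volumes"] = dict(cluster[donor]["volumes"])
    cluster[replacement]["writable"] = True
    return replacement

cluster = {"D":  {"writable": True, "volumes": {"VG421": b"..."}},
           "D2": {"writable": True, "volumes": {"VG421": b"..."}},   # surviving replica
           "D*": {"writable": True, "volumes": {}}}
print(recover_data_server(cluster, "D", "D*", back_online=False))    # -> 'D*'
```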
In some embodiments, the data server may crash during an ongoing object write operation, e.g., when object data is being written to the data server. FIG. 12 is a flowchart of an exemplary method 1200 for performing an object write operation amid a data server crash, according to embodiments of the disclosure. In some embodiments, method 1200 may be performed by storage system 600 directed by master cluster 650 coordinating the various components within storage system 600. Method 1200 may include steps S1202-S1214 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 12.
In step S1202, proxy server 620 may send object data to data server D for writing, e.g., as in step S714. For example, proxy server 620 may issue an object PUT request to primary meta server 631, which performs volume mapping to map the object group to a volume group of a volume group listing. Proxy server 620 may send the object data to be written in the f volumes of the volume group (in an f-way replication). One or more volumes of the volume group may be on data server D. In step S1204, master cluster 650 may detect a crash of data server D, e.g., as in step S1102. In step S1206, master cluster 650 may perform steps S1104-S1116 to recover crashed data server D. In parallel to the recovery in step S1206 and without waiting for the completion of the recovery, proxy server 620 reissues the object PUT request to primary meta server 631 in step S1208. In some embodiments, the PUT request may be marked as "reissued."
In step S1210, primary meta server 631 may select a new volume group from the healthy volume groups of the VGL that are available for assignment. For example, as shown in FIG. 4, VGL 420 may include VGs 421 and 422. If previously selected VG 421 is on crashed data server D, primary meta server 631 may select a new VG (e.g., VG 422) in the mapped VGL and may volume map the object to the selected VG.
In step S1212, primary meta server 631 may withdraw the assignment to the affected volume group (e.g., VG 421). In some embodiments, primary meta server 631 may do so by adding one more operation to the current atomic KV writes to the local RocksDB. For example, the operation may update the bitmap of the affected VG by clearing the corresponding bits of the bitmap. In step S1214, normal processing of the PUT request may be resumed, for example, according to method 700.
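The withdrawal in step S1212 may, for example, be bundled into the same atomic key-value batch as the re-mapping, as sketched below for a RocksDB-like store. The key layout ("obj_vg/<object id>", "vg_bitmap/<vg id>") and the helper names are illustrative assumptions, not taken from the disclosure.

```python
# Sketch only: remap the object to a healthy VG and withdraw the old assignment
# by clearing the affected VG's bitmap bits in the same atomic KV batch.
def remap_after_crash(meta, obj_id, vgl, crashed_server):
    old_vg = meta.current_vg(obj_id)
    # Step S1210: pick a healthy volume group with no volume on the crashed data server.
    new_vg = next(vg for vg in vgl
                  if vg.id != old_vg.id and crashed_server not in vg.data_servers)

    batch = meta.kv.write_batch()                    # atomic KV writes to the local store
    batch.put(f"obj_vg/{obj_id}", new_vg.id)         # record the new object-to-VG mapping
    # Step S1212: one more operation that clears the affected VG's bitmap bits.
    batch.put(f"vg_bitmap/{old_vg.id}",
              clear_bits(meta.kv.get(f"vg_bitmap/{old_vg.id}"), old_vg.slot(obj_id)))
    batch.commit()
    return new_vg


def clear_bits(bitmap: bytes, slot: int) -> bytes:
    """Clear the bit for the given slot in a byte-array bitmap."""
    buf = bytearray(bitmap)
    buf[slot // 8] &= ~(1 << (slot % 8)) & 0xFF
    return bytes(buf)
```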
In some embodiments, all servers may go offline temporarily and reboot, e.g., due to a power outage. In that case, the meta servers of meta server cluster 630 may negotiate with each other over the operation logs to synchronize the meta-data. For example, the meta servers compare the operation logs retrieved from their local stores (e.g., RocksDBs) and construct the most complete version of the operation log. The meta servers may then roll back any unfinished operations according to the request logs. Afterwards, storage system 600 may resume the service for processing object operations.
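One way such a negotiation could be realized is sketched below, assuming each log entry carries a monotonically increasing sequence number and a commit flag. LogEntry and its fields are hypothetical, and the disclosure does not prescribe this particular merge rule.

```python
# Sketch only: merge per-server operation logs and roll back unfinished operations.
from dataclasses import dataclass


@dataclass
class LogEntry:
    seq: int          # monotonically increasing sequence number (assumed)
    op: str           # e.g. "put", "delete"
    committed: bool   # whether the operation finished


def merge_operation_logs(local_logs):
    """Combine per-server logs into the most complete version, deduplicated by seq."""
    merged = {}
    for log in local_logs:
        for entry in log:
            merged.setdefault(entry.seq, entry)
            if entry.committed:
                merged[entry.seq] = entry      # prefer the entry that records the commit
    return [merged[s] for s in sorted(merged)]


def roll_back_unfinished(merged, rollback):
    """Undo every operation without a commit record, newest first."""
    for entry in reversed(merged):
        if not entry.committed:
            rollback(entry)                    # e.g. remove a partially written object
```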
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (20)

  1. A method for recovering a data server crash, comprising:
    detecting a crash of a first data server;
    marking an affected volume group associated with the first data server read-only;
    determining that the first data server is not back online within a predetermined time;
    replacing the first data server with a second data server; and
    recovering object data of the affected volume group to the second data server.
  2. The method of claim 1, further comprising:
    notifying a meta server responsible for the affected volume group that the affected volume group is read-only; and
    temporarily skipping the affected volume group, by the meta server, in processing a write operation of an object.
  3. The method of claim 2, further comprising:
    notifying the meta server that the first data server is replaced by the second data server.
  4. The method of claim 1, further comprising:
    marking the affected volume group writable after the object data of the affected volume group is recovered.
  5. The method of claim 1, wherein recovering the object data of the affected volume group further comprises:
    identifying a third data server that stores the object data of the affected volume group; and
    copying the object data from the third data server to the second data server.
  6. A method for performing a write operation of an object, comprising:
    issuing, by a proxy server, a request for the write operation to a meta server;
    selecting, by the meta server, a first volume group for writing the object, wherein the first volume group comprises a volume of a data server;
    detecting a crash of the data server;
    reissuing, by the proxy server, the request for the write operation to the meta server;
    selecting, by the meta server, a second volume group for writing the object; and
    withdrawing from writing the object to the first volume group.
  7. The method of claim 6, wherein the second volume group does not have any volume on the data server.
  8. The method of claim 6, further comprising:
    marking the first volume group of the data server read-only; and
    notifying the meta server that the first volume group is read-only.
  9. The method of claim 6, wherein withdrawing from writing the object to the first volume group further comprises:
    updating a bitmap of the first volume group.
  10. An object storage system, comprising a master cluster and a data server cluster comprising a first data server and a second data server, wherein the master cluster is configured to:
    detect a crash of the first data server;
    mark an affected volume group associated with the first data server read-only;
    determine that the first data server is not back online within a predetermined time;
    replace the first data server with the second data server; and
    recover object data of the affected volume group to the second data server.
  11. The object storage system of claim 10, further comprising a meta server responsible for the affected volume group,
    wherein the master cluster is further configured to notify the meta server that the affected volume group is read-only,
    wherein the meta server is configured to temporarily skip the affected volume group in processing a write operation of an object.
  12. The object storage system of claim 11, wherein the master cluster is further configured to:
    notify the meta server that the first data server is replaced by the second data server.
  13. The object storage system of claim 10, wherein the master cluster is further configured to:
    mark the affected volume group writable after the object data of the affected volume group is recovered.
  14. The object storage system of claim 10, wherein, to recover the object data of the affected volume group,
    the master cluster is further configured to identify a third data server, from the data server cluster, that stores the object data of the affected volume group,
    wherein the second data server is configured to copy the object data from the third data server.
  15. A system for performing a write operation of an object, comprising a proxy server; a meta server; a data server cluster comprising a data server; and a master cluster, wherein:
    the proxy server is configured to issue a request for the write operation to the meta server;
    the meta server is configured to select a first volume group for writing the object, wherein the first volume group comprises a volume of the data server;
    the master cluster is configured to detect a crash of the data server;
    the proxy server is further configured to reissue the request for the write operation to the meta server;
    the meta server is further configured to:
    select a second volume group for writing the object; and
    withdraw from writing the object to the first volume group.
  16. The system of claim 15, wherein the second volume group does not have any volume on the data server.
  17. The system of claim 15, wherein the master cluster is further configured to:
    mark the first volume group of the data server read-only; and
    notify the meta server that the first volume group is read-only.
  18. The system of claim 15, wherein to withdraw from writing the object to the first volume group, the meta server is further configured to:
    update a bitmap of the first volume group.
  19. A non-transitory computer readable medium storing computer instructions which, when executed by a master cluster of an object storage system, cause the master cluster to perform a method for recovering a data server crash, the method comprising:
    detecting a crash of a first data server;
    marking an affected volume group associated with the first data server read-only;
    determining that the first data server is not back online within a predetermined time;
    replacing the first data server with a second data server; and
    recovering object data of the affected volume group to the second data server.
  20. The non-transitory computer readable medium of claim 19, wherein the method further comprises:
    notifying a meta server responsible for the affected volume group that the affected volume group is read-only;
    notifying the meta server that the first data server is replaced by the second data server; and
    marking the affected volume group writable after the object data of the affected volume group is recovered.
PCT/CN2020/081169 2020-03-25 2020-03-25 Data server crash recovery in object storage system using enhanced meta structure WO2021189314A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/081169 WO2021189314A1 (en) 2020-03-25 2020-03-25 Data server crash recovery in object storage system using enhanced meta structure

Publications (1)

Publication Number Publication Date
WO2021189314A1 true WO2021189314A1 (en) 2021-09-30

Family

ID=77890873

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081169 WO2021189314A1 (en) 2020-03-25 2020-03-25 Data server crash recovery in object storage system using enhanced meta structure

Country Status (1)

Country Link
WO (1) WO2021189314A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8712970B1 (en) * 2007-04-09 2014-04-29 Dell Software Inc. Recovering a database to any point-in-time in the past with guaranteed data consistency
US20140143778A1 (en) * 2011-04-28 2014-05-22 Netapp, Inc. Method and system for providing storage services
CN107015885A (en) * 2016-07-12 2017-08-04 阿里巴巴集团控股有限公司 Service providing method, system and configuration center when primary standby data storehouse switches
CN108509153A (en) * 2018-03-23 2018-09-07 新华三技术有限公司 OSD selection methods, data write-in and read method, monitor and server cluster

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20926903; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20926903; Country of ref document: EP; Kind code of ref document: A1)