WO2021189308A1 - Delete operation in object storage system using enhanced meta structure - Google Patents

Delete operation in object storage system using enhanced meta structure Download PDF

Info

Publication number
WO2021189308A1
WO2021189308A1 (PCT/CN2020/081160)
Authority
WO
WIPO (PCT)
Prior art keywords
meta
server
data
primary
identification
Prior art date
Application number
PCT/CN2020/081160
Other languages
French (fr)
Inventor
Li Wang
Yiming Zhang
Mingya SHI
Chunhua Huang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2020/081160 priority Critical patent/WO2021189308A1/en
Publication of WO2021189308A1 publication Critical patent/WO2021189308A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162Delete operations

Definitions

  • the present disclosure relates to systems and methods for object storage, and more particularly to, systems and methods for performing object delete operations in an object storage system using an enhanced meta structure.
  • Object storage has been widely adopted by online-data-intensive (OLDI) applications for data analysis and computation.
  • Compared to files or blobs, whose sizes vary from several kilobytes to many gigabytes, the sizes of objects (like photos, audio/video pieces, and H5 files) in the OLDI applications are relatively small (at most a few megabytes) , so as to support large-scale, latency-sensitive computation.
  • the two main placement architectures for object data are: (1) directory-based placement (e.g., Haystack) , where a directory service is maintained from which the clients can retrieve the mapping from objects to disk volumes; and (2) the calculation-based placement (e.g., CRUSH) , where the mapping from objects to volumes can be independently calculated based on the objects’ globally unique names.
  • Calculation-based placement cannot control the mapping from objects to volumes and thus will cause huge data migration in storage capacity expansions, resulting in significant performance degradation. Therefore, the calculation-based placement is not smoothly expandable. This is unacceptable to the fast-growing, latency-sensitive OLDI applications.
  • Although the directory-based placement is flexible in controlling the object-to-volume mapping and can easily realize migration-free expansions, it requires writing of the storage meta-data (e.g., mappings from objects both to volumes and to in-volume positions) and the actual object data before acknowledgements.
  • the co-existence of object data and meta-data complicates the processing of requests (e.g., object PUT request) in current object stores.
  • To ensure crash consistency of an object PUT request, current directory-based object stores have to orchestrate a sequence of distributed writes in a particular order. The distributed ordering requirement severely affects the write performance of small objects, making directory-based object stores inefficient for write-intensive applications.
  • Embodiments of the disclosure address the above problems by providing systems and methods for operations in an object storage system using an enhanced meta structure.
  • Embodiments of the disclosure provide a method for performing a delete operation of an object.
  • the exemplary method includes receiving, by a proxy server, a request for the delete operation, the request including an identification of the object.
  • the method further includes sending, by the proxy server, a meta-data query to a primary meta server, the meta-data query including the identification of the object.
  • the method also includes atomically performing, by the primary meta-server, update operations to meta-data associated with the identification of the object.
  • the meta-data is stored in a local store of the primary meta server and includes allocation data of the object.
  • the method additionally includes sending, by the primary meta server, the update operations to at least one backup meta server, and performing, by the backup meta server, the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
  • Embodiments of the disclosure also provide a system for performing a delete operation of an object.
  • the exemplary system includes a proxy server and a meta server cluster.
  • the proxy server is configured to receive a request for the delete operation, the request including an identification of the object, and send a meta-data query to a primary meta server in the meta server cluster, the meta-data query including the identification of the object.
  • the primary meta server is configured to atomically perform update operations to meta-data associated with the identification of the object.
  • the meta-data is stored in a local store of the primary meta server and includes allocation data of the object.
  • the primary meta server is configured to send the update operations to at least one backup meta server.
  • the backup meta server is configured to perform the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
  • FIG. 1 illustrates a schematic diagram of an exemplary object storage system, according to embodiments of the disclosure.
  • FIG. 2 illustrates a flow diagram of an exemplary write operation implementing local write atomicity, according to embodiments of the disclosure.
  • FIG. 3 is a flowchart of an exemplary method for performing the write operation of FIG. 2, according to embodiments of the disclosure.
  • FIG. 4 illustrates an exemplary framework for hybrid object placement, according to embodiments of the disclosure.
  • FIG. 5 illustrates an exemplary METAX structure for meta-data storage, according to embodiments of the disclosure.
  • FIG. 6 illustrates a schematic diagram of an exemplary object storage system configured to process an operation request using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 8 is a flowchart of an exemplary method for performing an object read operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 9 is a flowchart of an exemplary method for performing an object delete operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • FIG. 1 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 100” ) , according to embodiments of the disclosure.
  • storage system 100 may include components shown in FIG. 1, including a user device 110 used by a client to send client requests, a proxy cluster 120 for surrogating client requests, a meta server cluster 130 for providing meta service, a data server cluster 140 for providing data storage service, a master cluster 150 for managing the clusters and disseminating directory mappings, and a network 160 for facilitating communications among the various components.
  • storage system 100 is configured to provide high input and output performance for write-intensive OLDI applications (e.g., in-journey audio recording, face identification, online sentiment analysis, driver/passenger assessment, etc. ) while ensuring the crash consistency by enforcing the local write atomicity of the enhanced meta structure (e.g., METAX) instead of applying a distributed write ordering.
  • Storage system 100 may be configured to perform various object operations including, for example, an operation that writes an object in the storage system (also referred to as an object write operation or an object PUT) , an operation that reads an object stored in the storage system (also referred to as an object read operation or an object GET) , and an operation that deletes an object from the storage system (also referred to as an object delete operation or an object DELETE) .
  • storage system 100 may perform some other object operations by combining the above basic operations. For example, an object overwrite operation can be performed by first deleting the object and then writing a new one with the same object name. It is contemplated that storage system 100 may also be configured to perform other auxiliary operations, such as object LIST and namespace CREATE and DELETE.
  • user device 110 may include, but not limited to, a facial recognition camera, a video/audio recorder, a server, a database/repository, a netbook computer, a laptop computer, a desktop computer, a media center, a gaming console, a television set, a set-top box, a handheld device (e.g., smart phone, tablet, etc. ) , a smart wearable device (e.g., eyeglasses, wrist watch, etc. ) , or any other suitable device.
  • storage system 100 may perform an object operation when receiving a request from user device 110, e.g., an object PUT, GET or DELETE request.
  • a request may include the object data, the application meta-data and a URL.
  • the object data may be organized into groups and then be assigned to data server cluster 140.
  • the ID of the groups, the in-group sequence number of the ongoing object PUT request and the allocation result (e.g., a current bitmap) of the assigned volume groups may be logged by meta server cluster 130.
  • the object data provided by user device 110 may contain data of a document, a picture, a piece of video or audio, an e-book, etc. Consistent with the present disclosure, the object data may be generated by user device 110 and may be sent to proxy cluster 120 for storage and/or replication. In some embodiments, the object data may be striped/divided into multiple objects and may be stored to data server cluster 140, according to a directory mapping generated by master cluster 150.
  • proxy cluster 120 may be any suitable device (e.g., a server, a computer, etc. ) that can provide object input/output (I/O) service to the clients such as user device 110, and can provide object PUT/GET/DELETE request service (s) for the clients as a surrogate.
  • the objects may be replicated and stored on proxy cluster 120 and they are identified by their globally unique identifications (e.g., object names) .
  • meta server cluster 130 may include one or more machines each installing multiple disks for providing meta service (s) .
  • each disk of meta server cluster 130 may run a meta server process.
  • machines within meta server cluster 130 may maintain a local store which supports both atomic writes and efficient queries of the meta-data/meta-log of the data file.
  • data server cluster 140 may include one or more storage machines each installing many disks for providing raw disk service (s) .
  • each disk in data server cluster 140 may run a lightweight data server process (e.g., process for executing database retrieval and updates, managing data integrity, dispatching responses to client requests, etc. ) .
  • data server cluster 140 may purely perform disk I/O to directly read/write object data without filesystem overhead (e.g., the meta-data/meta-log of the object data) .
  • storage system 100 may also include a master cluster 150 being responsible for managing the entire system.
  • master cluster 150 may be configured to maintain a consistent cluster map (e.g., by running Paxos) of proxy cluster 120, meta server cluster 130, data server cluster 140 as well as master cluster 150 itself.
  • master cluster 150 may also maintain and disseminate the directory mapping. Master cluster 150 may further periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
  • master cluster 150 may include a general-purpose or special-purpose server, a computing device, or any suitable device that may be responsible for managing the entire system.
  • master cluster 150 may include an odd number of cluster master processes (e.g., processes that manage the clusters state) , and may run Paxos (protocols for solving consensus in a network of processors) to maintain a consistent cluster map of proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150 itself.
  • Master cluster 150 may also maintain and disseminate the directory mapping. For example, master cluster 150 may periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
  • storage system 100 may include network 160 to facilitate the communication among the various components of storage system 100, such as user device 110, proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150.
  • the network may be a local area network (LAN) , a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service) , a client-server, a wide area network (WAN) , etc.
  • network may be replaced by wired data communication systems or devices.
  • storage system 100 may include more or less components compared to those shown in FIG. 1.
  • the various components of storage system 100 may be remote from each other or in different locations and be connected through the network.
  • certain components of storage system 100 may be located on the same site or inside one device.
  • proxy cluster 120 may be located on-site with or be part of user device 110 such as being an application or a computer program that can be executed by user device 110.
  • storage system 100 may be configured to perform object PUT operations that eliminate the ordering requirement of conventional write operations, which usually penalizes storage performance.
  • storage system 100 places the object meta-data, ongoing meta-logs and object attributes on meta server cluster 130, while leaving proxy cluster 120 to be a stateless surrogate and data server cluster 140 to provide only raw disk service.
  • storage system 100 may use an enhanced meta structure (referred to as METAX) that contains the traditional meta-data of objects, the meta-logs and object attributes.
  • a local atomic write using the METAX structure therefore does not require distributed write ordering of data/meta-data and thus may be applied to ensure crash consistency based on the object immutability. This would significantly increase the I/O performance of storage system 100.
  • when performing the local atomic write, all file data except for the object data may be placed on meta server cluster 130. This may avoid uncertainty caused by applying distributed write ordering.
  • meta server cluster 130 may maintain (i) the complete placement meta-data recording the allocated mapping from objects to volumes and in-volume positions, (ii) the object attributes including data checksum, (iii) the meta-logs of the ongoing object PUT request (s) recording IDs of proxy cluster 120 and the object keys, and (iv) the volume allocation results in the form of bitmaps recording the status of disk blocks in the allocated volumes.
  • Data server cluster 140 may provide raw disk service that is agnostic to the existence of objects and is only responsible for storing pure object data onto specific disk blocks.
  • FIG. 2 illustrates a flow diagram 200 of an exemplary write operation implementing a local write atomicity, according to embodiments of the disclosure.
  • master cluster 150 may coordinate proxy cluster 120, meta server cluster 130 and data server cluster 140 according to flow diagram 200.
  • FIG. 3 is a flowchart of an exemplary method 300 for performing the write operation of FIG. 2, according to embodiments of the disclosure.
  • method 300 may be performed by storage system 100 and may include steps S302-S310 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 2 and FIG. 3 will be described together.
  • a client 210 may send an object PUT request (e.g., via user device 110) to a proxy server 220 (e.g., among proxy cluster 120) .
  • the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or the user (e.g., user device 110) and object is the object name.
  • proxy server 220 may send application meta-data to a meta server 230 (e.g., among meta server cluster 130) .
  • application meta-data of the object may include the object key (e.g., a key computed by hashing the object name) .
  • proxy server 220 may also send attributes of the object to meta server 230 such as a data checksum computed using the object data.
  • meta server 230 may schedule and return allocation data to proxy server 220, and simultaneously perform an atomic write of the various data associated with the object and the operation request, such as the allocation data, meta-logs of the request, and the application meta-data, to the local store of meta server 230.
  • the allocation data may include placement meta-data for the object (e.g., the allocated disk volume and in-volume position for the object) .
  • meta server 230 further determines an allocation result bitmap, where each bit represents the allocation status of one disk block in the allocated disk volume at the corresponding in-volume position.
  • meta server 230 may also atomically write the allocation result bitmap to its local store along with the other information and data described above.
  • an atomic write requires that all operations in the local write either succeed or fail together.
  • method 300 only requires the write on meta server 230 to be atomic.
  • meta server 230 may also return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the atomic write is successfully completed. It is contemplated that atomic write can also be implemented on other devices, such as proxy server 220.
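  • For illustration, the following Python sketch models the all-or-nothing local write described above using an in-memory dictionary; the names LocalStore and write_batch are hypothetical stand-ins (a production system might instead use an embedded store such as RocksDB) and are not part of the disclosure.

    class LocalStore:
        """Toy in-memory stand-in for a meta server's local store."""

        def __init__(self):
            self.kv = {}

        def write_batch(self, puts, deletes=()):
            """Apply all puts and deletes together, or none of them (local write atomicity)."""
            staged = dict(self.kv)        # stage the whole batch on a copy
            staged.update(puts)
            for key in deletes:
                staged.pop(key, None)
            self.kv = staged              # a single reference swap commits the batch

    # The allocation data, the request meta-log and the volume bitmap commit together.
    store = LocalStore()
    store.write_batch({
        "OBMETA_ns/obj1": {"vgid": "vg1", "extents": [(0, 65536)], "checksum": "..."},
        "REQLOG_proxy1_req1": {"objectname": "ns/obj1"},
        "VOLBITMAP_vg1": "1100...",
    })
    print(sorted(store.kv))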
  • proxy server 220 may send object data and allocation data for the object (e.g., volume ID and in-volume offset) to the assigned disk volume within a data server 240 (e.g., data server among data server cluster 140 that includes the assigned disk volume) .
  • data server 240 may write the received object data in the disk volume according to the allocation data such as the volume ID and the in-volume offset, and return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the write is completed.
  • proxy server 220 may send an acknowledgement (e.g., SUCCESS) to client 210 (e.g., via user device 110) indicating the object PUT request has been successfully fulfilled.
  • the local atomic write implemented by method 300 avoids the ordering requirement for two separate writes respectively on meta server 230 and data server 240.
  • This local atomicity may ensure crash consistency. For example, since proxy server 220 is stateless, its crash can be handled by letting meta server 230 check relevant request logs. As another example, since data server 240 provides only raw disk service, its crash can be handled by verifying the checksum after it recovers and, if the checksum is not correct, letting proxy server 220 re-issue the request.
  • proxy server 220 can re-issue the request to the recovered meta server 230, which will either return the previously-assigned meta-data (meaning the previous atomic write succeeds) , or handle the request as a new one no matter whether data server 240 has written the data (because the bitmap is not updated) .
  • meta server cluster 130 may include multiple meta servers to provide scalable meta service.
  • the responsible meta server may be identified using a predetermined lookup table or directly calculated via a decentralized CRUSH algorithm without extra lookup overhead.
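  • As a simplified, self-contained stand-in for that calculation (not the actual CRUSH algorithm) , the Python sketch below hashes an object name to an object group and then picks f meta servers by plain modular arithmetic; the group count, hash choices and server naming are assumptions made only for illustration.

    import hashlib

    NUM_OBJECT_GROUPS = 1024                     # assumed number of object groups (OGs)

    def object_key(object_name):
        """Key computed by hashing the globally unique object name."""
        return hashlib.sha1(object_name.encode()).hexdigest()

    def object_group(object_name):
        """Classify an object into an object group by hashing its name."""
        return int(object_key(object_name), 16) % NUM_OBJECT_GROUPS

    def responsible_meta_servers(object_name, meta_servers, f=3):
        """Pick f meta servers for the object's group: primary first, then backups.

        A real deployment would run CRUSH over the cluster map; modular hashing
        is used here only to keep the sketch runnable without extra dependencies.
        """
        start = object_group(object_name) % len(meta_servers)
        return [meta_servers[(start + i) % len(meta_servers)] for i in range(f)]

    servers = ["meta-{}".format(i) for i in range(10)]
    print(responsible_meta_servers("ns1/photo-001", servers))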
  • storage system 100 may implement a hybrid placement architecture where the directory (e.g., master cluster 150) may maintain the mapping from objects to volumes and disseminate the mapping to the meta servers (e.g., meta server cluster 130) , and the meta servers (e.g., meta server cluster 130) may maintain the mapping from objects to in-volume positions and provide meta service (s) .
  • FIG. 4 illustrates an exemplary framework 400 for hybrid object placement, according to embodiments of the disclosure.
  • an intermediate mapping layer of volume group lists (VGL) each having multiple volume groups (VG) may be included in the hybrid placement architecture.
  • framework 400 may include VGLs 420-430 where VGL 420 may include VGs 421 and 422.
  • a disk may be organized into multiple volumes and a VG (similar to a logical volume in Haystack) may have f volumes on f disks (e.g., for f-way replication) that have exactly the same bitmap and data.
  • An f-way replication stores f replicas for each object.
  • objects to be stored in the storage system are first organized into object groups (OGs) , e.g., including object groups 411-413.
  • an OG (e.g., OG 411) may be mapped to meta server 230 by a decentralized CRUSH algorithm.
  • the OG may be additionally mapped to a VGL (e.g., VGL 420) by the central directory, e.g., master cluster 150 (known as a “directory mapping” ) .
  • the central directory may disseminate the directory mapping (i.e., OG-to-VGL mapping) to meta server 230.
  • the mapped meta server may select a VG (e.g., VG 421) in the mapped VGL (e.g., VGL 420) and may map the object to the f positions in the f volumes of the selected VG (known as a “volume mapping” ) .
  • the f positions may be exactly the same in the respective f volumes.
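  • The Python sketch below walks through this hybrid placement for a single PUT: the directory mapping (OG to VGL) would be disseminated by the master cluster, and the meta server then chooses a VG and allocates in-volume extents that are identical on its f volumes. The dataclass names, the least-used VG choice and the cursor-style allocator are illustrative assumptions only.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class VolumeGroup:
        vgid: int
        next_free: int = 0                     # toy cursor; a real VG tracks a block bitmap

        def allocate(self, size: int) -> List[Tuple[int, int]]:
            """Return in-volume extents (offset, length), identical on all f replica volumes."""
            extent = (self.next_free, size)
            self.next_free += size
            return [extent]

    @dataclass
    class MetaServer:
        # OG -> VGL (list of candidate VGs), as disseminated by the directory/master cluster
        directory_mapping: Dict[int, List[VolumeGroup]] = field(default_factory=dict)

        def place(self, og: int, size: int):
            """Volume mapping: pick a VG from the OG's VGL and allocate in-volume positions."""
            vgl = self.directory_mapping[og]
            vg = min(vgl, key=lambda g: g.next_free)     # simple least-used choice (assumption)
            return vg.vgid, vg.allocate(size)

    meta = MetaServer({411: [VolumeGroup(421), VolumeGroup(422)]})
    print(meta.place(og=411, size=64 * 1024))            # -> (421, [(0, 65536)])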
  • the hybrid placement architecture disclosed herein avoids unnecessary migration of object data by keeping the existing mapping (e.g., from objects to VGLs, VGs and in-volume positions) unaffected whenever possible. For example, when cluster expansion happens (e.g., one or more disks are added to storage system 100) , the directory service may construct new VGs on the newly added disks and later map the newly constructed VGs to the OGs on demand. In some embodiments, if one of the meta servers in meta server cluster 130 (e.g., meta server 230) fails, OGs assigned to that meta server would be “crushed” to other meta servers, but the directory/volume mappings can remain unchanged after recovery.
  • both the placement/application meta-data and the meta-log (e.g., proxy server ID and request ID) of the object request may be written atomically by the meta server (e.g., meta server 230) .
  • each meta server of meta server cluster 130 may run RocksDB as the local store to maintain the meta-data/meta-log in the form of a METAX structure that includes several key-value pairs (KVs) .
  • FIG. 5 illustrates an exemplary METAX structure 500 for meta-data storage, according to embodiments of the disclosure. As illustrated in FIG. 5, each record of the meta-data/meta-log of PUT request (s) in METAX structure 500 may include four KVs 510-540.
  • KV 510 may store the placement/application meta-data, in the form of <OBMETA_objectname, {app-meta-data, vgid, extents, checksum}>, where OBMETA stands for “object metadata” (e.g., may be queried by key prefix matching in RocksDB) , and objectname stands for the globally unique identification (e.g., object name) of the object.
  • the value of KV 510 may include the app-meta-data (e.g., the attributes) , the ID of the assigned volume group (vgid) , the allocated in-volume positions (extents) , and the object data checksum (from the proxy server) .
  • KV 520 may be used for logging the ongoing object PUT request (s) of the proxy server (e.g., proxy cluster 120) , in the form of <REQLOG_pxid_reqid, {objectname, oplogkey}>, where REQLOG stands for “request log” , pxid stands for the ID of the proxy server and reqid stands for the request ID received from the proxy server.
  • the value of KV 520 includes the globally unique name of the object objectname, and the key of the current operation’s log OPLOG_ogid_opseq.
  • KV 530 may be used for logging the ongoing object PUT request (s) of the object groups, in the form of <OPLOG_ogid_opseq, {objectname, reqlogkey}>, where OPLOG stands for the “operation log” , ogid stands for the ID of the object group and opseq stands for a version number monotonically increased for every PUT of an object in the object group.
  • the value of KV 530 includes the name of the object (objectname) and the key of the current PUT request (s) ’ log REQLOG_pxid_reqid.
  • KV 540 may record the volume allocation result, in the form of <VOLBITMAP_vgid, bitmap>, where VOLBITMAP stands for the “volume bitmap” , and vgid stands for the ID of the assigned volume group.
  • the value bitmap stands for the allocation bitmap, where each bit represents the allocation status of one disk block in the volume; the bitmap is coherently updated according to the allocated extents in KV 510.
  • the bitmap is auxiliary and may be cached in memory for fast volume space allocation.
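  • To make the layout concrete, the Python sketch below assembles the four key-value pairs (KVs 510-540) that would be written atomically for one object PUT; the serialization, key separators and sample values are assumptions for illustration and do not prescribe an on-disk format.

    def build_metax_records(objectname, app_metadata, vgid, extents, checksum,
                            pxid, reqid, ogid, opseq, bitmap):
        """Return the four KV pairs committed together for one object PUT.

        Mirrors KV 510-540: object meta-data, per-proxy request log,
        per-object-group operation log, and the volume-group allocation bitmap.
        """
        reqlog_key = "REQLOG_{}_{}".format(pxid, reqid)
        oplog_key = "OPLOG_{}_{}".format(ogid, opseq)
        return {
            "OBMETA_" + objectname: {"app-meta-data": app_metadata, "vgid": vgid,
                                     "extents": extents, "checksum": checksum},
            reqlog_key: {"objectname": objectname, "oplogkey": oplog_key},
            oplog_key: {"objectname": objectname, "reqlogkey": reqlog_key},
            "VOLBITMAP_{}".format(vgid): bitmap,
        }

    records = build_metax_records(
        objectname="ns1/photo-001", app_metadata={"content-type": "image/jpeg"},
        vgid=421, extents=[(0, 65536)], checksum="md5-of-object-data",
        pxid="px-620", reqid="req-42", ogid=411, opseq=7, bitmap=b"\x01")
    print(sorted(records))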
  • object request (s) may be processed by a high-performance object store (e.g., CHEETA) based on the METAX structure.
  • Both the object data and the meta-data may adopt f-way replication, and parallel disk writes may be applied.
  • CHEETA combined with the METAX structure may avoid the ordering requirement in multiple disk writes when processing the request (s) . This would significantly outperform the conventional object stores (e.g., Haystack and Ceph) in PUT/GET/DELETE latency and throughput.
  • FIG. 6 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 600” ) configured to process an operation request using METAX structure 500 of FIG. 5, according to embodiments of the disclosure.
  • system 600 may include similar components to those in storage system 100, such as a user device 610, a proxy server 620, a meta server cluster 630, and a data server cluster 640, e.g., including data servers 641-643. Detailed descriptions of the similar components will be omitted.
  • each meta server may include a plurality of disks 634 and each data server may include a plurality of disks 644, for storing the meta-data and the object data respectively.
  • for example, when f = 3, meta server cluster 630 may include a primary meta server 631 and two backup meta servers 632 and 633 for the replication.
  • other replication factors can be adopted, and storage system 600 can be configured according to the number f selected.
  • FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 700 may be performed by storage system 600 directed by a master cluster (not shown) coordinating the various components within storage system 600.
  • Method 700 may include steps S702-S718 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7. FIG. 6 and FIG. 7 will be described together.
  • the client may send an object PUT request through user device 610 to proxy server 620.
  • the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or user and object is the object name.
  • proxy server 620 may concatenate ns and object to get a globally unique object name (objectname) .
  • proxy server 620 may adopt the CRUSH algorithm to calculate the ID of the responsible meta server in meta server cluster 630 for the object. Specifically, the object may first be classified to an object group and then be crushed to the three meta servers (e.g., the primary meta server 631 and two backup meta servers 632 and 633) of meta server cluster 630. The checksum of the object data may be computed and a UUID (e.g., the reqid) may be generated for the object PUT request.
  • proxy server 620 may send application meta-data, and attributes of the object such as checksum, object data size and reqid, to primary meta server 631.
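  • A hedged Python sketch of this request preparation on the proxy follows; the name concatenation separator, the hash and checksum algorithms and the field names are assumptions chosen only to make the example runnable.

    import hashlib
    import uuid

    def prepare_put_request(ns, obj, data, pxid):
        """Derive the global object name, object key, checksum and request UUID
        that the proxy sends to the primary meta server."""
        objectname = "{}/{}".format(ns, obj)                     # globally unique object name
        return {
            "objectname": objectname,
            "objectkey": hashlib.sha1(objectname.encode()).hexdigest(),
            "checksum": hashlib.md5(data).hexdigest(),           # object-data checksum attribute
            "size": len(data),
            "pxid": pxid,
            "reqid": str(uuid.uuid4()),                          # UUID for this PUT request
        }

    print(prepare_put_request("ns1", "photo-001", b"...jpeg bytes...", pxid="px-620"))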
  • proxy server 620 may additionally send log information such as request and operation logs to primary meta server 631.
  • primary meta server 631 may generate the METAX structure and send the KVs of the structure to backup meta servers 632-633.
  • primary meta server 631 may perform the following sub-steps S708 (i) - (iii) after receiving the object PUT request.
  • primary meta server 631 may generate allocation data for the object.
  • the allocation data may be recorded in a list of extents, each of which records the offset and length of a contiguous free space.
  • primary meta server 631 may generate a METAX structure that includes multiple KVs (e.g., KVs 510-540 in FIG. 5) .
  • primary meta server 631 may perform the following operations in parallel: (a) returning the allocation data, e.g., the allocated replication group ID (vgid) and space allocation result (extents) , to proxy server 620; (b) sending the KVs to backup meta servers 632-633 of meta server cluster 630 for the object group (selected by CRUSH) ; and (c) writing the KVs to the local store (e.g., local RocksDB) using operations such as WriteBatch () .
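  • The following Python sketch shows one way the three parallel actions (a) - (c) could be driven with a thread pool; the stub classes and method names are assumptions, and this toy version simply waits for (b) and (c) before handing the allocation data back, whereas the described flow lets (a) proceed concurrently.

    from concurrent.futures import ThreadPoolExecutor

    class _Store:
        def __init__(self):
            self.kv = {}

        def write_batch(self, records):
            self.kv.update(records)            # stand-in for an atomic local write

    class _Backup(_Store):
        def replicate(self, records):
            self.write_batch(records)          # stand-in for applying replicated KVs

    def handle_put_on_primary(local_store, backups, records, allocation):
        """Drive (a) return allocation, (b) replicate KVs, (c) local atomic write."""
        with ThreadPoolExecutor() as pool:
            local = pool.submit(local_store.write_batch, records)               # (c)
            backup_acks = [pool.submit(b.replicate, records) for b in backups]  # (b)
            local.result()
            for ack in backup_acks:
                ack.result()
        return allocation                                                        # (a)

    primary, backup_servers = _Store(), [_Backup(), _Backup()]
    metax = {"OBMETA_ns1/photo-001": {"vgid": 421, "extents": [(0, 65536)]}}
    print(handle_put_on_primary(primary, backup_servers, metax, (421, [(0, 65536)])))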
  • backup meta servers 632-633 may atomically write the received KVs to their local store (e.g., the local RocksDB) and return an acknowledgement (e.g., a SUCCESS) to primary meta server 631.
  • primary meta server 631 may return another acknowledgement (e.g., another SUCCESS) to proxy server 620.
  • proxy server 620 may send the object data and the allocation data to the corresponding data servers (e.g., data servers 641-643) in data server cluster 640.
  • proxy server 620 may receive the allocation data from primary meta server 631 in step S708 and store it in its in-memory cluster map.
  • the data servers may write the received object data to the positions of disks 644 as specified by the allocation result (extents) and return an acknowledgement to proxy server 620.
  • proxy server 620 may acknowledge to the client via user device 610 after receiving acknowledgement from both primary meta server 631 and data servers 641-643.
  • as for the meta-log of the object PUT request (e.g., the request log in KV 520 and the operation log in KV 530 of FIG. 5) , proxy server 620 may periodically (instead of immediately) notify the primary meta server about the committed object PUT request and the corresponding meta-log.
  • method 700 may be implemented without the distributed ordering requirement, which is the key factor limiting object storage performance.
  • the parallel writes on the primary/backup meta servers and data servers disclosed herein preserve crash consistency. Therefore, storage system 600 and method 700 provide better storage performance compared to conventional object stores.
  • storage system 600 may also perform other object operations such as an object read operation (e.g., object GET) .
  • FIG. 8 is a flowchart of an exemplary method 800 for performing an object read operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 800 may be performed by storage system 600 directed by a master cluster (not shown in FIG. 6) coordinating the various components within storage system 600.
  • Method 800 may include steps S802-S814 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 8.
  • the client may send an object GET request through user device 610 to proxy server 620.
  • proxy server 620 may identify primary meta server 631 based on the object name.
  • proxy server 620 may perform a CRUSH algorithm on the object name to obtain the identity of primary meta server 631.
  • Proxy server 620 then sends a meta-data query request to primary meta server 631 for reading the object.
  • the meta-data query request may also include the object name.
  • primary meta server 631 may look up the meta-data using the object name objectname and return the meta-data of the object to proxy server 620.
  • for example, primary meta server 631 may look up the METAX structure stored in the local store (e.g., RocksDB) by the key OBMETA_objectname.
  • KV 520 and KV 530 of METAX structure 500 store the object name. Therefore, METAX structure 500 can be looked up using the objectname stored in those key values.
  • Other meta-data values stored in the METAX structure can be read and returned to proxy server 620.
  • the meta-data returned to proxy server 620 may include the application meta-data, the allocation data (e.g., the ID of the assigned volume group vgid, the allocated in-volume positions extents) , and attributes of the object (e.g., a data checksum of the object data) .
  • based on the volume group ID vgid returned from primary meta server 631, proxy server 620 may look up the corresponding data server (e.g., data server 641) that hosts the volume group. Proxy server 620 then sends a read request to the corresponding data server, and the read request includes the in-volume positions (extents) .
  • the read request may, among other things, include the allocation data.
  • the corresponding data server may read the requested object data according to the allocation data specified by the read request and return the object data to proxy server 620.
  • proxy server 620 may check the checksum of object data received from the corresponding data server. For example, proxy server 620 may compare the checksum received from primary meta server 631 with the checksum calculated based on the object data returned by data server 641. In step S814, if the checksum matches, proxy server 620 returns the object data and the application meta-data to the client through user device 610.
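  • A minimal Python sketch of this read-and-verify path is shown below; meta_lookup and data_read are hypothetical callables standing in for the primary meta server query and the raw-disk read, and MD5 is assumed as the checksum algorithm purely for illustration.

    import hashlib

    def get_object(object_name, meta_lookup, data_read):
        """Fetch the object's meta-data, read its extents, and verify the checksum."""
        meta = meta_lookup("OBMETA_" + object_name)         # vgid, extents, checksum, app meta-data
        data = data_read(meta["vgid"], meta["extents"])     # raw bytes from the allocated positions
        if hashlib.md5(data).hexdigest() != meta["checksum"]:
            raise IOError("checksum mismatch: re-issue the read or report corruption")
        return data, meta["app-meta-data"]

    # Toy stand-ins so the sketch runs end to end.
    payload = b"hello object"
    meta_db = {"OBMETA_ns1/hello": {"vgid": 421, "extents": [(0, len(payload))],
                                    "checksum": hashlib.md5(payload).hexdigest(),
                                    "app-meta-data": {"content-type": "text/plain"}}}
    print(get_object("ns1/hello", meta_db.__getitem__, lambda vgid, ext: payload))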
  • storage system 600 may also optimize the method 800 by performing steps S806 and S810 in parallel when the client previously requested writing the same object in storage system 600 (i.e., the client sent the object PUT request) .
  • the client would have received the meta-data identifying the allocation data of the object.
  • proxy server 620 may return the meta-data (vgid and extents) in its acknowledgement to user device 610 in step S718 of method 700.
  • proxy server 620 may simultaneously read the meta-data from primary meta server 631 and the object data from the corresponding data server (e.g., data server 641) .
  • the meta-data read from primary meta server 631 is used for validation purpose.
  • the checksum returned by primary meta server 631 may be used by proxy server 620 to validate the object data received from the data server to prevent malicious clients from tampering with the values of the request (e.g., the vgid and extents of the request) .
  • storage system 600 may also perform other object operations such as an object DELETE request for deleting an object.
  • the object may be previously stored in storage system 600, e.g., through an object write operation by performing method 700.
  • FIG. 9 is a flowchart of an exemplary method 900 for performing an object delete operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
  • method 900 may be performed by storage system 600 directed by a master cluster (not shown in FIG. 6) coordinating the various components within storage system 600.
  • Method 900 may include steps S902-S914 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
  • the client may send an object DELETE request through user device 610 to proxy server 620.
  • proxy server 620 may identify primary meta server 631 based on the object name.
  • proxy server 620 may perform a CRUSH algorithm on the object name to obtain the identity of primary meta server 631 as well as backup meta servers 632 and 633.
  • Proxy server 620 then sends a meta-data query to primary meta server 631 for deleting the object.
  • the meta-data query request may also include the object name.
  • primary meta server 631 may look up the METAX structure of the object using the object name objectname, e.g., by the key OBMETA_objectname.
  • KV 520 and KV 530 of METAX structure 500 store the object name. Therefore, METAX structure 500 can be looked up using the objectname stored in those key values.
  • Other meta-data values, such as the allocation data, stored in the METAX structure can be read by primary meta server 631.
  • the allocation data read by primary meta server 631 may include the ID of the assigned volume group vgid and the allocated in-volume positions extents.
  • vgid and extents may be stored in KV 510 of METAX structure 500.
  • KV 540 of METAX structure 500 records the volume bitmap VOLBITMAP for vgid.
  • primary meta server 631 may atomically commit two operations to the METAX structure identified in step S908: (1) delete the object meta-data, and (2) update the bitmap.
  • primary meta server 631 may use WriteBatch () to atomically delete <OBMETA_objectname, {metadata, vgid, extents, checksum}> stored in KV 510, and update <VOLBITMAP_vgid, bitmap> stored in KV 540.
  • each bit represents the allocation status of one disk block in the volume group vgid, and the bitmap is coherently updated according to the allocated extents in KV 510.
  • primary meta server 631 may update the bitmap to clear the bits specified by extents.
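  • The Python sketch below shows the shape of this atomic delete on a toy dict-based store: the OBMETA record is removed and the bits covering its extents are cleared in one staged commit, and the same two operations would then be forwarded to the backup meta servers. The block size, bitmap encoding and helper names are assumptions for illustration.

    BLOCK_SIZE = 4096                                        # assumed disk block size

    def delete_object_on_primary(store, object_name):
        """Atomically delete the OBMETA record and clear the volume-group bitmap bits."""
        meta = store["OBMETA_" + object_name]
        bitmap_key = "VOLBITMAP_{}".format(meta["vgid"])
        bitmap = bytearray(store[bitmap_key])
        for offset, length in meta["extents"]:               # clear the bits covering each extent
            first = offset // BLOCK_SIZE
            last = (offset + length - 1) // BLOCK_SIZE
            for block in range(first, last + 1):
                bitmap[block // 8] &= ~(1 << (block % 8))
        staged = dict(store)                                  # both updates commit together,
        del staged["OBMETA_" + object_name]                   # standing in for one WriteBatch()
        staged[bitmap_key] = bytes(bitmap)
        store.clear()
        store.update(staged)

    store = {"OBMETA_ns1/photo-001": {"vgid": 421, "extents": [(0, 8192)]},
             "VOLBITMAP_421": bytes([0b00000011])}
    delete_object_on_primary(store, "ns1/photo-001")
    print(store)                                              # {'VOLBITMAP_421': b'\x00'}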
  • primary meta server 631 may send the two operations committed in step S910 to backup meta servers 632 and 633 to update the copies of METAX structures stored thereon.
  • backup meta servers may be identified using the object name during step S904. The METAX structure of the object is backed up on backup meta servers 632 and 633 during the write operation of the object. Therefore, the backup copies need to be updated as well.
  • backup meta servers 632 and 633 may perform the two operations atomically to update the meta-data stored in their respective local stores.
  • backup meta servers 632 and 633 may send an acknowledgement (e.g., SUCCESS) to primary meta server 631.
  • in step S914, after successfully committing the two operations to its local store and receiving the acknowledgements from backup meta servers 632 and 633, primary meta server 631 may return an acknowledgement (e.g., SUCCESS) to proxy server 620.
  • proxy server 620 subsequently may acknowledge the client (e.g., through user device 610) that the DELETE operation has been successfully performed.
  • storage system 600 may process the object DELETE request in a lightweight and compaction-free fashion.
  • Storage system 600 may allow the allocator (e.g., BitMapAllocator) to allocate reclaimed spaces for similarly sized newly PUT objects in OLDI applications. Accordingly, the space of deleted objects may immediately become reusable for the allocator by updating the bitmap, without compaction.
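  • The compaction-free reuse can be illustrated with a toy bitmap allocator in Python; the class name echoes the BitMapAllocator mentioned above, but the block size, first-fit search and interfaces are assumptions rather than the disclosed implementation.

    class BitMapAllocator:
        """Toy block allocator over a volume-group bitmap (one bit per disk block)."""

        BLOCK_SIZE = 4096

        def __init__(self, num_blocks):
            self.bits = [0] * num_blocks

        def allocate(self, size):
            """First-fit search for a free run large enough for `size` bytes."""
            need = -(-size // self.BLOCK_SIZE)               # ceiling division
            run = 0
            for i, bit in enumerate(self.bits):
                run = run + 1 if bit == 0 else 0
                if run == need:
                    start = i - need + 1
                    for j in range(start, i + 1):
                        self.bits[j] = 1
                    return start * self.BLOCK_SIZE, need * self.BLOCK_SIZE
            raise MemoryError("no contiguous free space in this volume group")

        def free(self, offset, length):
            """Clear the bits of a deleted extent so the space is immediately reusable."""
            first = offset // self.BLOCK_SIZE
            for j in range(first, first + length // self.BLOCK_SIZE):
                self.bits[j] = 0

    alloc = BitMapAllocator(num_blocks=8)
    offset, length = alloc.allocate(10_000)   # occupies 3 blocks
    alloc.free(offset, length)                # reclaimed without compaction
    print(alloc.allocate(10_000))             # the same space is handed out again: (0, 12288)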
  • meta servers relevant to the respective requests may clean request logs and operation logs.
  • the relevant meta servers may include primary meta server 631 and backup meta servers 632 and 633.
  • Each meta server within that group may look up the KVs (e.g., KV 520 and KV 530 of METAX structure 500) that store the request logs and operation logs by matching pxid and reqid, and ogid and opseq, respectively. These logs may be cleaned once the request is complete.
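  • A short Python sketch of this lazy log cleanup follows; the key formats mirror KV 520 and KV 530, the store is a plain dictionary, and the function name is a hypothetical illustration rather than a disclosed interface.

    def clean_request_logs(store, pxid, reqid, ogid, opseq):
        """Drop the REQLOG/OPLOG entries of a committed request from the local store."""
        store.pop("REQLOG_{}_{}".format(pxid, reqid), None)
        store.pop("OPLOG_{}_{}".format(ogid, opseq), None)

    store = {"REQLOG_px-620_req-42": {"objectname": "ns1/photo-001"},
             "OPLOG_411_7": {"objectname": "ns1/photo-001"},
             "OBMETA_ns1/photo-001": {"vgid": 421}}
    clean_request_logs(store, "px-620", "req-42", 411, 7)
    print(sorted(store))                      # only the OBMETA record remains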
  • the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices.
  • the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
  • the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.

Abstract

Embodiments of the disclosure provide systems and methods for performing a delete operation of an object. The exemplary method includes receiving, by a proxy server, a request for the delete operation, the request including an identification of the object. The method further includes sending, by the proxy server, a meta-data query to a primary meta server, the meta-data query including the identification of the object. The method also includes atomically performing, by the primary meta server, update operations to meta-data associated with the identification of the object. The meta-data is stored in a local store of the primary meta server and includes allocation data of the object. The method additionally includes sending, by the primary meta server, the update operations to at least one backup meta server, and performing, by the backup meta server, the update operations to corresponding backup meta-data stored in a local store of the backup meta server.

Description

DELETE OPERATION IN OBJECT STORAGE SYSTEM USING ENHANCED META STRUCTURE TECHNICAL FIELD
The present disclosure relates to systems and methods for object storage, and more particularly to, systems and methods for performing object delete operations in an object storage system using an enhanced meta structure.
BACKGROUND
Object storage has been widely adopted by online-data-intensive (OLDI) applications for data analysis and computation. Compared to files or blobs of which the sizes vary from several kilobytes to many gigabytes, the sizes of objects (like photos, audio/video pieces, and H5 files) in the OLDI applications are relatively small (at most a few megabytes) , so as to support large-scale, latency-sensitive computation.
Different from social media applications (e.g., Facebook TM) , where reads are predominant, currently many OLDI applications are write-intensive such that they write a large number of objects online of which only a fraction will be read later for computation. Data placement is critical for the performance of these write-intensive applications storing hundreds of billions of small objects in thousands of machines.
The two main placement architectures for object data are: (1) directory-based placement (e.g., Haystack) , where a directory service is maintained from which the clients can retrieve the mapping from objects to disk volumes; and (2) the calculation-based placement (e.g., CRUSH) , where the mapping from objects to volumes can be independently calculated based on the objects’ globally unique names.
Calculation-based placement cannot control the mapping from objects to volumes and thus will cause huge data migration in storage capacity expansions resulting in significant performance degradation. Therefore, the calculation-based placement is not smoothly expandable. This is unacceptable to the fast-growing latency sensitive OLDI applications.
On the other hand, although the directory-based placement is flexible in controlling the object-to-volume mapping and can easily realize migration-free expansions, it requires writing of the storage meta-data (e.g., mappings from objects both to volumes and to in-volume positions) and the actual object data before acknowledgements. The co-existence of object data and meta-data complicates the processing of requests (e.g., object PUT request) in current object stores. To ensure crash consistency of an object PUT request, current directory-based object stores have  to orchestrate a sequence of distributed writes in a particular order. The distributed ordering requirement severely affects the write performance of small objects, making directory-based object stores inefficient for write-intensive applications.
Embodiments of the disclosure address the above problems by providing systems and methods for operations in an object storage system using an enhanced meta structure.
SUMMARY
Embodiments of the disclosure provide a method for performing a delete operation of an object. The exemplary method includes receiving, by a proxy server, a request for the delete operation, the request including an identification of the object. The method further includes sending, by the proxy server, a meta-data query to a primary meta server, the meta-data query including the identification of the object. The method also includes atomically performing, by the primary meta server, update operations to meta-data associated with the identification of the object. The meta-data is stored in a local store of the primary meta server and includes allocation data of the object. The method additionally includes sending, by the primary meta server, the update operations to at least one backup meta server, and performing, by the backup meta server, the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
Embodiments of the disclosure also provide a system for performing a delete operation of an object. The exemplary system includes a proxy server and a meta server cluster. The proxy server is configured to receive a request for the delete operation, the request including an identification of the object, and send a meta-data query to a primary meta server in the meta server cluster, the meta-data query including the identification of the object. The primary meta server is configured to atomically perform update operations to meta-data associated with the identification of the object. The meta-data is stored in a local store of the primary meta server and includes allocation data of the object. The primary meta server is configured to send the update operations to at least one backup meta server. The backup meta server is configured to perform the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a schematic diagram of an exemplary object storage system, according to embodiments of the disclosure.
FIG. 2 illustrates a flow diagram of an exemplary write operation implementing local write atomicity, according to embodiments of the disclosure.
FIG. 3 is a flowchart of an exemplary method for performing the write operation of FIG. 2, according to embodiments of the disclosure.
FIG. 4 illustrates an exemplary framework for hybrid object placement, according to embodiments of the disclosure.
FIG. 5 illustrates an exemplary METAX structure for meta-data storage, according to embodiments of the disclosure.
FIG. 6 illustrates a schematic diagram of an exemplary object storage system configured to process an operation request using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 7 is a flowchart of an exemplary method for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 8 is a flowchart of an exemplary method for performing an object read operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
FIG. 9 is a flowchart of an exemplary method for performing an object delete operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 illustrates a schematic diagram of an exemplary object storage system (referred to as “storage system 100” ) , according to embodiments of the disclosure. In some embodiments, storage system 100 may include components shown in FIG. 1, including a user device 110 used by a client to send client requests, a proxy cluster 120 for surrogating client requests, a meta server cluster 130 for providing meta service, a data server cluster 140 for providing data storage service, a master cluster 150 for  managing the clusters and disseminating directory mappings, and a network 160 for facilitating communications among the various components.
Consistent with the present disclosure, storage system 100 is configured to provide high input and output performance for write-intensive OLDI applications (e.g., in-journey audio recording, face identification, online sentiment analysis, driver/passenger assessment, etc. ) while ensuring the crash consistency by enforcing the local write atomicity of the enhanced meta structure (e.g., METAX) instead of applying a distributed write ordering. Storage system 100 may be configured to perform various object operations including, for example, an operation that writes an object in the storage system (also referred to as an object write operation or an object PUT) , an operation that reads an object stored in the storage system (also referred to as an object read operation or an object GET) , and an operation that deletes an object from the storage system (also referred to as an object delete operation or an object DELETE) . In some embodiments, storage system 100 may perform some other object operations by combining the above basic operations. For example, an object overwrite operation can be performed by first deleting the object and then writing a new one with the same object name. It is contemplated that storage system 100 may also be configured to perform other auxiliary operations, such as object LIST and namespace CREATE and DELETE.
In some embodiments, user device 110 may include, but is not limited to, a facial recognition camera, a video/audio recorder, a server, a database/repository, a netbook computer, a laptop computer, a desktop computer, a media center, a gaming console, a television set, a set-top box, a handheld device (e.g., smart phone, tablet, etc. ) , a smart wearable device (e.g., eyeglasses, wrist watch, etc. ) , or any other suitable device. In some embodiments, storage system 100 may perform an object operation when receiving a request from user device 110, e.g., an object PUT, GET or DELETE request. In some embodiments, a request may include the object data, the application meta-data and a URL. For example, in an object PUT request, the object data may be organized into groups and then be assigned to data server cluster 140. The ID of the groups, the in-group sequence number of the ongoing object PUT request and the allocation result (e.g., a current bitmap) of the assigned volume groups may be logged by meta server cluster 130.
In some embodiments, the object data provided by user device 110 may contain data of a document, a picture, a piece of video or audio, an e-book, etc. Consistent with the present disclosure, the object data may be generated by user device 110 and may  be sent to proxy cluster 120 for storage and/or replication. In some embodiments, the object data may be striped/divided into multiple objects and may be stored to data server cluster 140, according to a directory mapping generated by master cluster 150.
In some embodiments, proxy cluster 120 may be any suitable device (e.g., a server, a computer, etc. ) that can provide object input/output (I/O) service to the clients such as user device 110, and can provide object PUT/GET/DELETE request service (s) for the clients as a surrogate. For example, the objects may be replicated and stored on proxy cluster 120 and they are identified by their globally unique identifications (e.g., object names) .
In some embodiments, meta server cluster 130 may include one or more machines each installing multiple disks for providing meta service (s) . In some embodiments, each disk of meta server cluster 130 may run a meta server process. For example, machines within meta server cluster 130 may maintain a local store which supports both atomic writes and efficient queries of the meta-data/meta-log of the data file.
In some embodiments, data server cluster 140 may include one or more storage machines each installing many disks for providing raw disk service (s) . In some embodiments, each disk in data server cluster 140 may run a lightweight data server process (e.g., process for executing database retrieval and updates, managing data integrity, dispatching responses to client requests, etc. ) . For example, data server cluster 140 may purely perform disk I/O to directly read/write object data without filesystem overhead (e.g., the meta-data/meta-log of the object data) .
In some embodiments, storage system 100 may also include a master cluster 150 being responsible for managing the entire system. For example, master cluster 150 may be configured to maintain a consistent cluster map (e.g., by running Paxos) of proxy cluster 120, meta server cluster 130, data server cluster 140 as well as master cluster 150 itself. In some embodiments, master cluster 150 may also maintain and disseminate the directory mapping. Master cluster 150 may further periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
In some embodiments, master cluster 150 may include a general-purpose or special-purpose server, a computing device, or any suitable device that may be responsible for managing the entire system. For example, master cluster 150 may include an odd number of cluster master processes (e.g., processes that manage the clusters state) , and may run Paxos (protocols for solving consensus in a network of  processors) to maintain a consistent cluster map of proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150 itself. Master cluster 150 may also maintain and disseminate the directory mapping. For example, master cluster 150 may periodically update the topology map to proxy cluster 120 and meta server cluster 130 and distribute the directory mapping to meta server cluster 130.
In some embodiments, storage system 100 may include network 160 to facilitate the communication among the various components of storage system 100, such as user device 110, proxy cluster 120, meta server cluster 130, data server cluster 140 and master cluster 150. For example, the network may be a local area network (LAN), a wireless network, a cloud computing environment (e.g., software as a service, platform as a service, infrastructure as a service), a client-server, a wide area network (WAN), etc. In some embodiments, network 160 may be replaced by wired data communication systems or devices.
It is contemplated that storage system 100 may include more or fewer components compared to those shown in FIG. 1. In some embodiments, the various components of storage system 100 may be remote from each other or in different locations and be connected through the network. In some alternative embodiments, certain components of storage system 100 may be located on the same site or inside one device. For example, master cluster 150 may be located on-site with or be part of user device 110, such as being an application or a computer program that can be executed by user device 110.
In some embodiments, storage system 100 may be configured to perform object PUT operations that eliminate the ordering requirement of conventional write operations, which usually penalizes storage performance. In some embodiments, storage system 100 places the object meta-data, ongoing meta-logs and object attributes on meta server cluster 130, while leaving proxy cluster 120 to be a stateless surrogate and data server cluster 140 to provide only raw disk service. In some embodiments, storage system 100 may use an enhanced meta structure (referred to as METAX) that contains the traditional meta-data of objects, the meta-logs and object attributes. A local atomic write using the METAX structure therefore does not require distributed write ordering of data/meta-data and thus may be applied to ensure crash consistency based on the object immutability. This would significantly increase the I/O performance of storage system 100.
In some embodiments, when performing the local atomic write, all file data except for the object data may be placed on meta server cluster 130. This may avoid uncertainty caused by applying distributed write ordering. For example, when processing object PUT request(s), meta server cluster 130 may maintain (i) the complete placement meta-data recording the allocated mapping from objects to volumes and in-volume positions, (ii) the object attributes including data checksum, (iii) the meta-logs of the ongoing object PUT request(s) recording IDs of proxy cluster 120 and the object keys, and (iv) the volume allocation results in the form of bitmaps recording the status of disk blocks in the allocated volumes. Data server cluster 140 may provide raw disk service that is agnostic to the existence of objects and only responsible for storing pure object data onto specific disk blocks.
FIG. 2 illustrates a flow diagram 200 of an exemplary write operation implementing a local write atomicity, according to embodiments of the disclosure. In some embodiments, master cluster 150 may coordinate proxy cluster 120, meta server cluster 130 and data server cluster 140 according to flow diagram 200. FIG. 3 is a flowchart of an exemplary method 300 for performing the write operation of FIG. 2, according to embodiments of the disclosure. In some embodiments, method 300 may be performed by storage system 100 and may include steps S302-S310 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3. FIG. 2 and FIG. 3 will be described together.
In step S302, a client 210 may send an object PUT request (e.g., via user device 110) to a proxy server 220 (e.g., among proxy cluster 120) . In some embodiments, the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or the user (e.g., user device 110) and object is the object name. In step S304, proxy server 220 may send application meta-data to a meta server 230 (e.g., among meta server cluster 130) . In some embodiments, application meta-data of the object may include the object key (e.g., a key computed by hashing the object name) . In some embodiments, proxy server 220 may also send attributes of the object to meta server 230 such as a data checksum computed using the object data.
In step S306, meta server 230 may schedule and return allocation data to proxy server 220, and simultaneously perform an atomic write of the various data associated with the object and the operation request, such as the allocation data, meta-logs of the request, and the application meta-data, to the local store of meta server 230. In some embodiments, the allocation data may include placement meta-data for the object (e.g., the allocated disk volume and in-volume position for the object). In some embodiments, meta server 230 further determines an allocation result bitmap, where each bit represents the allocation status of one disk block in the allocated disk volume at the in-volume position. In some embodiments, meta server 230 may also atomically write the allocation result bitmap to its local store along with the other information and data described above.
Consistent with the present disclosure, an atomic write requires that all operations in the local write either succeed or fail together. In some embodiments, method 300 only requires the write on meta server 230 to be atomic. In some embodiments, meta server 230 may also return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the atomic write is successfully completed. It is contemplated that an atomic write can also be implemented on other devices, such as proxy server 220.
In step S308, proxy server 220 may send object data and allocation data for the object (e.g., volume ID and in-volume offset) to the assigned disk volume within a data server 240 (e.g., data server among data server cluster 140 that includes the assigned disk volume) . In step S310, data server 240 may write the received object data in the disk volume according to the allocation data such as the volume ID and the in-volume offset, and return an acknowledgement (e.g., SUCCESS) to proxy server 220 after the write is completed. In some embodiments, after receiving the acknowledgements from both meta server 230 (in step S306) and data server 240 (in step S310) , proxy server 220 may send an acknowledgement (e.g., SUCCESS) to client 210 (e.g., via user device 110) indicating the object PUT request has been successfully fulfilled.
The local atomic write implemented by method 300 avoids the ordering requirement for two separate writes respectively on meta server 230 and data server 240. This local atomicity may ensure crash consistency. For example, since proxy server 220 is stateless, its crash can be handled by letting meta server 230 check relevant request logs. As another example, since data server 240 provides only raw disk service, its crash can be handled by verifying the checksum after it recovers and, if the checksum is not correct, letting proxy server 220 re-issue the request. As yet another example, if meta server 230 crashes while processing a PUT request and then recovers, proxy server 220 can re-issue the request to the recovered meta server 230, which will either return the previously-assigned meta-data (meaning the previous atomic write succeeded), or handle the request as a new one no matter whether data server 240 has written the data (because the bitmap is not updated).
In some embodiments, as shown in FIG. 1, meta server cluster 130 may include multiple meta servers to provide scalable meta service. In some embodiments, a meta  server (e.g., meta server 230) may be identified among meta server cluster 130 as responsible for a given object. In some embodiments, the responsible meta server may be identified using a predetermined lookup table or directly calculated via a decentralized CRUSH algorithm without extra lookup overhead.
In some embodiments, storage system 100 may implement a hybrid placement architecture where the directory (e.g., master cluster 150) may maintain the mapping from objects to volumes and disseminate the mapping to the meta servers (e.g., meta server cluster 130) , and the meta servers (e.g., meta server cluster 130) may maintain the mapping from objects to in-volume positions and provide meta service (s) . For example, FIG. 4 illustrates an exemplary framework 400 for hybrid object placement, according to embodiments of the disclosure. In some embodiments, an intermediate mapping layer of volume group lists (VGL) each having multiple volume groups (VG) may be included in the hybrid placement architecture. For example, as shown in FIG. 4, framework 400 may include VGLs 420-430 where VGL 420 may include  VGs  421 and 422. In some embodiments, a disk may be organized into multiple volumes and a VG (similar to a logical volume in Haystack) may have f volumes on f disks (e.g., for f-way replication) that have exactly the same bitmap and data. An f-way replication stores f replicas for each object.
As illustrated in FIG. 4, objects to be stored in the storage system are first organized into object groups (OGs), e.g., including object groups 411-413. In some embodiments, the OG for a particular object may be determined by computing the modulo of the hashing of the object name, e.g., ogid = HASH (name) mod OG_NUM, where OG_NUM is the number of OGs (e.g., OG_NUM = 2^n). In some embodiments, an OG (e.g., OG 411) may be mapped to meta server 230 by a decentralized CRUSH algorithm. The OG may be additionally mapped to a VGL (e.g., VGL 420) by the central directory, e.g., master cluster 150 (known as a "directory mapping"). The central directory may disseminate the directory mapping (i.e., OG-to-VGL mapping) to meta server 230. For example, after organizing an object to an OG (e.g., OG 411), the mapped meta server (e.g., meta server 230) may select a VG (e.g., VG 421) in the mapped VGL (e.g., VGL 420) and may map the object to the f positions in the f volumes of the selected VG (known as a "volume mapping"). The f positions may be exactly the same in the respective f volumes.
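For illustration only, the following Python sketch reproduces the object-to-group computation described above (ogid = HASH (name) mod OG_NUM). The choice of SHA-1 as HASH and the particular value of OG_NUM are assumptions made for this example and are not mandated by the present disclosure.

```python
import hashlib

OG_NUM = 2 ** 10  # assumed OG_NUM = 2^n, so the modulo is inexpensive


def object_group_id(objectname: str) -> int:
    """Return ogid = HASH(objectname) mod OG_NUM."""
    digest = hashlib.sha1(objectname.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % OG_NUM


# The same object name always resolves to the same object group.
print(object_group_id("ns_photo_0001"))
```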
The hybrid placement architecture disclosed herein avoids unnecessary migration of object data by keeping the existing mapping (e.g., from objects to VGLs, VGs and in-volume positions) unaffected whenever possible. For example, when cluster expansion happens (e.g., one or more disks are added to storage system 100), the directory service may construct new VGs on the newly added disks and later map the newly constructed VGs to the OGs on demand. In some embodiments, if one of the meta servers in meta server cluster 130 (e.g., meta server 230) fails, OGs assigned to that meta server would be "crushed" to other meta servers, but the directory/volume mappings can remain unchanged after recovery.
When processing object request (s) , to ensure crash consistency without the distributed write ordering, both the placement/application meta-data and the meta-log (e.g., proxy server ID and request ID) of the object request may be written atomically. Also, because objects are organized into groups before being assigned to data servers (e.g., data server cluster 140) , the meta server (e.g., meta server 230) needs to log the object group ID (ogid) and the in-group sequence number of the ongoing request. Moreover, the meta server may also record the allocation result (current bitmap) of the assigned volume group with ID= vgid.
In order for the meta-data/meta-log to be efficiently stored (e.g., in a PUT request) and queried (e.g., in a GET request), they may be maintained in a particular form of data structure. For example, each meta server of meta server cluster 130 may run RocksDB as the local store to maintain the meta-data/meta-log in the form of a METAX structure that includes several key values (KVs). FIG. 5 illustrates an exemplary METAX structure 500 for meta-data storage, according to embodiments of the disclosure. As illustrated in FIG. 5, each record of the meta-data/meta-log of PUT request(s) in METAX structure 500 may include four KVs 510-540. For example, for every object PUT request, the located primary meta server (e.g., the meta server in meta server cluster 130 located by the CRUSH algorithm) may atomically write KVs 510-540 to the local store (e.g., RocksDB), which could collaboratively ensure crash consistency.
In some embodiments, KV 510 may store the placement/application meta-data, in a form of <OBMETA_objectname, {app-meta-data, vgid, extents, checksum}>, where OBMETA stands for "object metadata" (e.g., may be queried by key prefix matching in RocksDB), and objectname stands for the globally unique identification (e.g., object name) of the object. In some embodiments, the value of KV 510 may include the app-meta-data (e.g., the attributes), the ID of the assigned volume group (vgid), the allocated in-volume positions (extents), and the object data checksum (from the proxy server).
In some embodiments, KV 520 may be used for logging the ongoing object PUT request(s) of the proxy server (e.g., proxy cluster 120), in a form of <REQLOG_pxid_reqid, {objectname, oplogkey}>, where REQLOG stands for "request log", pxid stands for the ID of the proxy server and reqid stands for the request ID received from the proxy server. In some embodiments, the value of KV 520 includes the globally unique name of the object objectname, and the key of the current operation's log OPLOG_ogid_opseq.
In some embodiments, KV 530 may be used for logging the ongoing object PUT request(s) of the object groups, in a form of <OPLOG_ogid_opseq, {objectname, reqlogkey}>, where OPLOG stands for the "operation log", ogid stands for the ID of the object group and opseq stands for a version number monotonically increased for every PUT of an object in the object group. In some embodiments, the value of KV 530 includes the name of the object (objectname) and the key of the current PUT request's log REQLOG_pxid_reqid.
In some embodiments, KV 540 may record the volume allocation result, in a form of <VOLBITMAP_vgid, bitmap>, where VOLBITMAP stands for the "volume bitmap", and vgid stands for the ID of the assigned volume group. The value bitmap stands for the allocated bitmap, where each bit represents the allocation status of one disk block in the volume, which is coherently updated according to the allocated extents in KV 510. In some embodiments, the bitmap is auxiliary and may be cached in memory for fast volume space allocation.
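As a non-limiting illustration of KVs 510-540, the Python sketch below assembles the four key values for a single object PUT in a plain dictionary standing in for the meta server's local store. The key prefixes follow the forms given above, while the field names and value layout are assumptions made only for this example.

```python
def build_metax_record(objectname, app_meta, vgid, extents, checksum,
                       pxid, reqid, ogid, opseq, bitmap):
    """Assemble the four METAX key values (KVs 510-540) for one object PUT."""
    reqlog_key = f"REQLOG_{pxid}_{reqid}"   # KV 520 key
    oplog_key = f"OPLOG_{ogid}_{opseq}"     # KV 530 key
    return {
        # KV 510: placement/application meta-data of the object
        f"OBMETA_{objectname}": {"app_meta": app_meta, "vgid": vgid,
                                 "extents": extents, "checksum": checksum},
        # KV 520: request log of the proxy server, pointing at the operation log
        reqlog_key: {"objectname": objectname, "oplogkey": oplog_key},
        # KV 530: operation log of the object group, pointing back at the request log
        oplog_key: {"objectname": objectname, "reqlogkey": reqlog_key},
        # KV 540: allocation bitmap of the assigned volume group
        f"VOLBITMAP_{vgid}": bitmap,
    }
```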
In some embodiments, object request(s) (e.g., PUT, GET and DELETE) may be processed by a high-performance object store (e.g., CHEETA) based on the METAX structure. Both the object data and the meta-data may adopt an f-way replication, and parallel disk writes may be applied. As a result, CHEETA combined with the METAX structure may avoid the ordering requirement in multiple disk writes when processing the request(s). This would significantly outperform the conventional object stores (e.g., Haystack and Ceph) in PUT/GET/DELETE latency and throughput.
FIG. 6 illustrates a schematic diagram of an exemplary object storage system (referred to as "storage system 600") configured to process an operation request using METAX structure 500 of FIG. 5, according to embodiments of the disclosure. In some embodiments, system 600 may include similar components to those in storage system 100, such as a user device 610, a proxy server 620, a meta server cluster 630, and a data server cluster 640, e.g., including data servers 641-643. Detailed descriptions of the similar components will be omitted. In some embodiments, each meta server may include a plurality of disks 634 and each data server may include a plurality of disks 644, for storing the meta-data and the object data respectively. In some embodiments, when the object data and the meta-data adopt a three-way replication (f = 3), as illustrated in FIG. 6, meta server cluster 630 may include a primary meta server 631 and two backup meta servers 632 and 633 for the replication. However, it is contemplated that other replication factors can be adopted and storage system 600 can be configured according to the number f selected.
FIG. 7 is a flowchart of an exemplary method 700 for performing an object write operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 700 may be performed by storage system 600 directed by a master cluster (not shown) coordinating the various components within storage system 600. Method 700 may include steps S702-S718 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7. FIG. 6 and FIG. 7 will be described together.
In step S702, the client may send an object PUT request through user device 610 to proxy server 620. In some embodiments, the object PUT request may include the object data, the application meta-data and a URL (domain/ns/object) , where ns is the namespace of the application or user and object is the object name.
In step S704, proxy server 620 may concatenate ns and object to get a globally unique object name (objectname). For example, proxy server 620 may adopt the CRUSH algorithm to calculate the ID of the responsible meta server in meta server cluster 630 for the object. Specifically, the object may first be classified to an object group and then be crushed to the three meta servers (e.g., the primary meta server 631 and two backup meta servers 632 and 633) of meta server cluster 630. The checksum of the object data may be computed and a UUID (e.g., the reqid) may be generated for the object PUT request. In step S706, proxy server 620 may send application meta-data, and attributes of the object such as checksum, object data size and reqid, to primary meta server 631. In some embodiments, proxy server 620 may additionally send log information such as request and operation logs to primary meta server 631.
In step S708, primary meta server 631 may generate the METAX structure and send the KVs of the structure to backup meta servers 632-633. In some embodiments, primary meta server 631 may perform the following sub-steps S708 (i)-(iii) after receiving the object PUT request. In sub-step S708 (i), primary meta server 631 may generate allocation data for the object. In some embodiments, primary meta server 631 may select a volume group (with ID = vgid) for the object, and allocate free space on the selected volumes of the volume group. In some embodiments, the allocation data may be recorded in a list of extents, each of which records the offset and length of a contiguous free space. In sub-step S708 (ii), primary meta server 631 may generate a METAX structure that includes multiple KVs (e.g., KVs 510-540 in FIG. 5). In sub-step S708 (iii), primary meta server 631 may perform the following operations in parallel: (a) returning the allocation data, e.g., the allocated replication group ID (vgid) and space allocation result (extents), to proxy server 620; (b) sending the KVs to backup meta servers 632-633 of meta server cluster 630 for the object group (selected by CRUSH); and (c) writing the KVs to the local store (e.g., local RocksDB) using operations such as WriteBatch ().
In step S710, backup meta servers 632-633 may atomically write the received KVs to their local store (e.g., the local RocksDB) and return an acknowledgement (e.g., a SUCCESS) to primary meta server 631. In step S712, upon receiving the acknowledgement from backup meta servers 632-633, primary meta server 631 may return another acknowledgement (e.g., another SUCCESS) to proxy server 620.
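The Python sketch below illustrates the flow of steps S708-S712 under simplified assumptions: the local store is modeled as a dictionary updated in one batch, and the backup writes are ordinary method calls rather than remote ones. In a deployment, the write would go through an atomic primitive such as RocksDB's WriteBatch and the replication would be performed over the network.

```python
class MetaServer:
    """Toy meta server; a dict stands in for the local store (e.g., RocksDB)."""

    def __init__(self):
        self.local_store = {}

    def write_batch(self, kvs: dict) -> bool:
        self.local_store.update(kvs)   # all METAX KVs are applied together
        return True                    # acknowledgement (SUCCESS)


def put_metax(primary: MetaServer, backups: list, kvs: dict) -> bool:
    ok_local = primary.write_batch(kvs)                     # sub-step S708 (iii)(c)
    ok_backups = all(b.write_batch(kvs) for b in backups)   # sub-step S708 (iii)(b) and step S710
    return ok_local and ok_backups                          # step S712: acknowledge the proxy server
```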
In step S714, proxy server 620 may send the object data and the allocation data to the corresponding data servers (e.g., data servers 641-643) in data server cluster 640. In some embodiments, proxy server 620 may receive the allocation data from primary meta server 631 in step S708 and store it in its in-memory cluster map. Proxy server 620 may look up the volume group (with ID = vgid) and allocation result (extents) in the in-memory cluster map and send them to the corresponding data servers. In step S716, the data servers may write the received object data to the positions of disks 644 as specified by the allocation result (extents) and return an acknowledgement to proxy server 620.
In step S718, proxy server 620 may acknowledge to the client via user device 610 after receiving acknowledgement from both primary meta server 631 and data servers 641-643. In some embodiments, the meta-log of the object PUT request (e.g., request log in KV 520 and operation log in KV 530 of FIG. 5) may then be cleaned. In some alternative embodiments, for efficiency purposes, proxy server 620 may periodically (instead of immediately) notify the primary meta server about the committed object PUT request and the corresponding meta-log.
Consistent with the present disclosure, method 700 may be implemented without the distributed ordering requirement, which is a key factor limiting object storage performance. The parallel writes on primary/backup meta servers and data servers disclosed herein maintain crash consistency. Therefore, storage system 600 and method 700 provide better storage performance compared to conventional object stores.
In some embodiments, storage system 600 may also perform other object operations such as an object read operation (e.g., object GET) . For example, the object may be previously stored in storage system 600, e.g., through an object write operation by performing method 700. FIG. 8 is a flowchart of an exemplary method 800 for performing an object read operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 800 may be performed by storage system 600 directed by a master cluster (not shown in FIG. 6) coordinating the various components within storage system 600. Method 800 may include steps S802-S814 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 8.
In step S802, the client may send an object GET request through user device 610 to proxy server 620. The object GET request may include information such as the object name (e.g., objectname = ns_object) . In step S804, proxy server 620 may identify primary meta server 631 based on the object name. In some embodiments, proxy server 620 may perform a CRUSH algorithm on the object name to obtain the identity of primary meta server 631. Proxy server 620 then sends a meta-data query request to primary meta server 631 for reading the object. For example, the meta-data query request may also include the object name.
In step S806, primary meta server 631 may look up the meta-data using the object name objectname and return the meta-data of the object to proxy server 620. As described above, several key values of the METAX structure stored in the local store (e.g., RocksDB) include the object name of the object. Therefore, the METAX structure of the object can be uniquely looked up in the local store using a key generated based on the object name, e.g., key = OBMETA_objectname. For example, both KV 520 and KV 530 of METAX structure 500 store the object name. Therefore, METAX structure 500 can be looked up using the objectname stored in those key values. Other meta-data values stored in the METAX structure can be read and returned to proxy server 620. In some embodiments, the meta-data returned to proxy server 620 may include the application meta-data, the allocation data (e.g., the ID of the assigned volume group vgid, the allocated in-volume positions extents), and attributes of the object (e.g., a data checksum of the object data). For example, the value stored in KV 510 of METAX structure 500, i.e., value = {metadata, vgid, extents, checksum}, may be returned to proxy server 620.
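As a simple illustration of this lookup, and continuing the dictionary-based store assumed in the earlier sketches:

```python
def lookup_object_meta(local_store: dict, objectname: str):
    """Step S806: fetch KV 510 by the key OBMETA_<objectname>."""
    # Returns {app_meta, vgid, extents, checksum} if the object exists, else None.
    return local_store.get(f"OBMETA_{objectname}")
```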
In step S808, proxy server 620 may look up the allocation data (such as vgid and extents) in the in-memory cluster map and may send a read request to the corresponding data server in data server cluster 640. For example, using ID = vgid returned from primary meta server 631, proxy server 620 may look up the corresponding data server (e.g., data server 641) that hosts the volume group. Proxy server 620 then sends a read request to the corresponding data server and the read request includes the in-volume positions (extents) . The read request may, among other things, include the allocation data.
In step S810, the corresponding data server (e.g., data server 641) may read the requested object data according to the allocation data specified by the read request and return the object data to proxy server 620. For example, data server 641 may read the data stored in the volume group with ID = vgid at the position indicated by the extents.
In step S812, proxy server 620 may check the checksum of object data received from the corresponding data server. For example, proxy server 620 may compare the checksum received from primary meta server 631 with the checksum calculated based on the object data returned by data server 641. In step S814, if the checksum matches, proxy server 620 returns the object data and the application meta-data to the client through user device 610.
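The checksum comparison of step S812 may be sketched as follows; the use of SHA-1 is an assumption made for illustration, since the present disclosure does not fix a particular checksum algorithm.

```python
import hashlib


def verify_checksum(object_data: bytes, stored_checksum: str) -> bool:
    """Step S812: recompute the checksum of the returned data and compare it."""
    return hashlib.sha1(object_data).hexdigest() == stored_checksum
```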
In some embodiments, storage system 600 may also optimize method 800 by performing steps S806 and S810 in parallel when the client previously requested writing the same object in storage system 600 (i.e., the client sent the object PUT request). In that case, the client would have received the meta-data identifying the allocation data of the object. For example, proxy server 620 may return the meta-data (vgid and extents) in its acknowledgement to user device 610 in step S718 of method 700. User device 610 may cache the storage meta-data and send the GET request with vgid=<vgid>&offset=<offset>&length=<length>. After receiving the GET request, proxy server 620 may simultaneously read the meta-data from primary meta server 631 and the object data from the corresponding data server (e.g., data server 641). The meta-data read from primary meta server 631 is used for validation purposes. For example, the checksum returned by primary meta server 631 may be used by proxy server 620 to validate the object data received from the data server, to prevent malicious clients from tampering with the values of the request (e.g., the vgid and extents of the request).
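For illustration, a proxy could overlap the two reads as sketched below; read_meta and read_data are hypothetical callables standing in for the requests to primary meta server 631 and the data server, respectively.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_get(read_meta, read_data, objectname, vgid, extents):
    """Issue the meta-data read (for validation) and the object-data read concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        meta_future = pool.submit(read_meta, objectname)     # from the primary meta server
        data_future = pool.submit(read_data, vgid, extents)  # from the data server
        return meta_future.result(), data_future.result()
```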
In some embodiments, storage system 600 may also perform other object operations such as an object DELETE request for deleting an object. For example, the object may be previously stored in storage system 600, e.g., through an object write operation by performing method 700. FIG. 9 is a flowchart of an exemplary method 900 for performing an object delete operation in an object storage system using the METAX structure of FIG. 5, according to embodiments of the disclosure. In some embodiments, method 900 may be performed by storage system 600 directed by a master cluster (not shown in FIG. 6) coordinating the various components within storage system 600. Method 900 may include steps S902-S916 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9.
In step S902, the client may send an object DELETE request through user device 610 to proxy server 620. The object DELETE request may include information such as the object name (e.g., objectname = ns_object). In step S904, proxy server 620 may identify primary meta server 631 based on the object name. In some embodiments, proxy server 620 may perform a CRUSH algorithm on the object name to obtain the identity of primary meta server 631 as well as backup meta servers 632 and 633. Proxy server 620 then sends a meta-data query to primary meta server 631 for deleting the object. For example, the meta-data query request may also include the object name.
In step S906, primary meta server 631 may look up the METAX structure of the object using the object name objectname. As described above, several key values of the METAX structure stored in the local store (e.g., RocksDB) include the object name of the object. Therefore, the METAX structure of the object can be uniquely looked up in the local store using a key generated based on the object name, e.g., key = OBMETA_objectname. For example, both KV 520 and KV 530 of METAX structure 500 store the object name. Therefore, METAX structure 500 can be looked up using the objectname stored in those key values. Other meta-data values, such as the allocation data, stored in the METAX structure can be read by primary meta server 631. In some embodiments, the allocation data read by primary meta server 631 may include the ID of the assigned volume group vgid and the allocated in-volume positions extents. For example, vgid and extents may be stored in KV 510 of METAX structure 500. In some embodiments, primary meta server 631 may further read the bitmap of the volume group (with ID = vgid) in the METAX structure located in step S906. For example, KV 540 of METAX structure 500 records the volume bitmap VOLBITMAP for vgid.
In step S910, primary meta server 631 may atomically commit two operations to the METAX structure identified in step S906: (1) delete the object meta-data, and (2) update the bitmap. For example, primary meta server 631 may use WriteBatch () to atomically delete <OBMETA_objectname, {metadata, vgid, extents, checksum}> stored in KV 510, and update <VOLBITMAP_vgid, bitmap> stored in KV 540. In the bitmap, each bit represents the allocation status of one disk block in the volume group vgid, which is coherently updated according to the allocated extents in KV 510. In some embodiments, primary meta server 631 may update the bitmap to clear the bits specified by extents.
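A minimal sketch of step S910 is given below, continuing the dictionary-based store of the earlier sketches: the OBMETA entry is removed and the bits covered by the object's extents are cleared, with the two changes intended to be applied as a single atomic batch. Byte-addressed extents and a 4 KB block size are assumptions made only for this example.

```python
BLOCK_SIZE = 4096  # assumed disk block size


def delete_object(local_store: dict, objectname: str) -> None:
    """Step S910: atomically drop KV 510 and clear the object's bits in KV 540."""
    meta = local_store[f"OBMETA_{objectname}"]
    vgid, extents = meta["vgid"], meta["extents"]
    bitmap = bytearray(local_store[f"VOLBITMAP_{vgid}"])
    for offset, length in extents:                    # each extent: (offset, length) in bytes
        first = offset // BLOCK_SIZE
        last = (offset + length + BLOCK_SIZE - 1) // BLOCK_SIZE
        for block in range(first, last):
            bitmap[block] = 0                         # the block becomes free space again
    # In a real meta server, both changes would go into one WriteBatch-style commit.
    del local_store[f"OBMETA_{objectname}"]
    local_store[f"VOLBITMAP_{vgid}"] = bytes(bitmap)
```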
In step S912, primary meta server 631 may send the two operations committed in step S910 to backup meta servers 632 and 633 to update the copies of METAX structures stored thereon. In some embodiments, the backup meta servers may be identified using the object name during step S904. The METAX structure of the object is backed up on backup meta servers 632 and 633 during the write operation of the object. Therefore, the backup copies need to be updated as well. In some embodiments, backup meta servers 632 and 633 may perform the two operations atomically to update the meta-data stored in their local stores.
In step S912, after the two operations are committed to the METAX structures stored on local stores of backup  meta servers  632 and 633, backup  meta servers  632 and 633 may send an acknowledgement (e.g., SUCCESS) to primary meta server 631.
In step S914, after successfully committing the two operations to its local store and receiving the acknowledgement from backup meta servers 632 and 633, primary meta server 631 may return an acknowledgement (e.g., SUCCESS) to proxy server 620. In step S916, proxy server 620 may subsequently acknowledge the client (e.g., through user device 610) that the DELETE operation has been successfully performed.
By performing method 900, because data server cluster 640 provides mostly a raw disk service instead of also providing filesystem operations, storage system 600 may process the object DELETE request in a lightweight and compaction-free fashion. Storage system 600 may allow the allocator (e.g., BitMapAllocator) to allocate reclaimed spaces for similarly sized newly PUT objects in OLDI applications. Accordingly, the space of deleted objects may immediately become reusable for the allocator by updating the bitmap, without compaction.
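The compaction-free reuse described above can be illustrated with a simple first-fit scan over the volume bitmap. The actual allocation policy of BitMapAllocator is not detailed in the present disclosure, so the sketch below is only an assumption for illustration.

```python
def allocate_blocks(bitmap: bytearray, blocks_needed: int):
    """Find a run of free (0) blocks, mark them allocated (1), and return the first block index."""
    run_start = run_len = 0
    for i, bit in enumerate(bitmap):
        if bit == 0:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == blocks_needed:
                for j in range(run_start, run_start + blocks_needed):
                    bitmap[j] = 1          # blocks reclaimed by earlier deletes are reused directly
                return run_start
        else:
            run_len = 0
    return None  # no contiguous run is large enough
```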
In some embodiments, after receiving periodic notifications of completed requests (e.g., SUCCESS) from proxy server 620, the meta servers relevant to the respective requests may clean the request logs and operation logs. For example, for a particular object operation request (e.g., a PUT request, a GET request, or a DELETE request), the relevant meta servers may include primary meta server 631 and backup meta servers 632 and 633. Each meta server within that group may look up the KVs (e.g., KV 520 and KV 530 of METAX structure 500) that store the request logs and operation logs by matching pxid and reqid, and matching ogid and opseq, respectively. These logs may be cleaned once the request is complete.
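As an illustrative continuation of the earlier dictionary-based sketches, the log cleanup amounts to dropping the matching KV 520 and KV 530 entries:

```python
def clean_request_logs(local_store: dict, pxid, reqid, ogid, opseq) -> None:
    """Remove the request log (KV 520) and operation log (KV 530) once the request is complete."""
    local_store.pop(f"REQLOG_{pxid}_{reqid}", None)
    local_store.pop(f"OPLOG_{ogid}_{opseq}", None)
```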
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (20)

  1. A method for performing a delete operation of an object, comprising:
    receiving, by a proxy server, a request for the delete operation, the request including an identification of the object;
    sending, by the proxy server, a meta-data query to a primary meta server, the meta-data query including the identification of the object;
    atomically performing, by the primary meta server, update operations to meta-data associated with the identification of the object, the meta-data stored in a local store of the primary meta server and including allocation data of the object;
    sending, by the primary meta server, the update operations to at least one backup meta server; and
    performing, by the backup meta server, the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
  2. The method of claim 1, further comprising:
    identifying, by the proxy server, the primary meta server and the at least one backup meta server from a meta server cluster based on the identification of the object.
  3. The method of claim 2, wherein the primary meta server and the at least one backup meta server are identified by applying a CRUSH method to the identification of the object.
  4. The method of claim 1, wherein performing the update operations to the meta-data associated with the identification of the object further comprises:
    identifying a meta-data structure associated with the identification of the object from the local store of the primary meta server, the meta-data structure comprising a plurality of key values; and
    performing the update operations to at least one key value of the meta-data structure.
  5. The method of claim 4, wherein the at least one key value of the meta-data structure stores the identification of the object.
  6. The method of claim 1, wherein the allocation data includes placement meta-data recording disk volumes on a data server and positions within the disk volumes allocated to the object,
    wherein the update operations include a delete operation configured to delete the placement meta-data from the local store of the primary meta server.
  7. The method of claim 6, wherein the allocation data further includes a bitmap including bits recording status of disk blocks at the allocated positions of the allocated disk volumes,
    wherein the update operations further include a clear operation configured to clear the bits of the bitmap,
    wherein the delete operation and the clear operation are performed atomically.
  8. The method of claim 1, wherein the identification of the object is an object name.
  9. The method of claim 1, further comprising:
    returning a first acknowledgement, by the at least one backup meta server, to the primary meta server after successfully performing the update operations.
  10. The method of claim 9, further comprising:
    returning a second acknowledgement, by the primary meta server, to the proxy server after successfully performing the update operations and receiving the first acknowledgement.
  11. A system for performing a delete operation of an object, comprising a proxy server; and a meta server cluster,
    wherein the proxy server is configured to:
    receive a request for the delete operation, the request including an identification of the object; and
    send a meta-data query to a primary meta server in the meta server cluster, the meta-data query including the identification of the object,
    wherein the primary meta server is configured to:
    atomically perform update operations to meta-data associated with the identification of the object, the meta-data stored in a local store of the primary meta server and including allocation data of the object; and
    send the update operations to at least one backup meta server,
    wherein the backup meta server is configured to perform the update operations to corresponding backup meta-data stored in a local store of the backup meta server.
  12. The system of claim 11, wherein the proxy server is further configured to:
    identify the primary meta server and the at least one backup meta server from the meta server cluster based on the identification of the object.
  13. The system of claim 12, wherein the primary meta server and the at least one backup meta server are identified by applying a CRUSH method to the identification of the object.
  14. The system of claim 11, wherein the primary meta server is further configured to:
    identify a meta-data structure associated with the identification of the object from the local store of the primary meta server, the meta-data structure comprising a plurality of key values; and
    perform the update operations to at least one key value of the meta-data structure.
  15. The system of claim 14, wherein the at least one key value of the meta-data structure stores the identification of the object.
  16. The system of claim 11, wherein the allocation data includes placement meta-data recording disk volumes on a data server and positions within the disk volumes allocated to the object,
    wherein the update operations include a delete operation configured to delete the placement meta-data from the local store of the primary meta server.
  17. The system of claim 16, wherein the allocation data further includes a bitmap including bits recording status of disk blocks at the allocated positions of the allocated disk volumes,
    wherein the update operations further include a clear operation configured to clear the bits of the bitmap,
    wherein the delete operation and the clear operation are performed atomically.
  18. The system of claim 11, wherein the identification of the object is an object name.
  19. The system of claim 11, wherein the at least one backup meta server is further configured to:
    return a first acknowledgement to the primary meta server after successfully performing the update operations.
  20. The system of claim 19, wherein the primary meta server is further configured to:
    return a second acknowledgement to the proxy server after successfully performing the update operations and receiving the first acknowledgement.
PCT/CN2020/081160 2020-03-25 2020-03-25 Delete operation in object storage system using enhanced meta structure WO2021189308A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/081160 WO2021189308A1 (en) 2020-03-25 2020-03-25 Delete operation in object storage system using enhanced meta structure


Publications (1)

Publication Number Publication Date
WO2021189308A1 true WO2021189308A1 (en) 2021-09-30

Family

ID=77890868

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/081160 WO2021189308A1 (en) 2020-03-25 2020-03-25 Delete operation in object storage system using enhanced meta structure

Country Status (1)

Country Link
WO (1) WO2021189308A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101099149A (en) * 2004-01-16 2008-01-02 Hillcrest Laboratories, Inc. Metadata brokering server and methods
CN104754001A (en) * 2013-12-30 2015-07-01 Founder Broadband Network Service Co., Ltd. Cloud storage system and data storage method
US20180121129A1 (en) * 2016-10-31 2018-05-03 Oracle International Corporation Data layout schemas for seamless data migration
CN109344122A (en) * 2018-10-15 2019-02-15 Sun Yat-sen University Distributed metadata management method and system based on a file pre-creation strategy


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20927845; Country of ref document: EP; Kind code of ref document: A1)

NENP: Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 20927845; Country of ref document: EP; Kind code of ref document: A1)