TW201933121A - Data storage system and method for writing object of key-value pair

Info

Publication number: TW201933121A
Application number: TW107140309A
Authority: TW
Inventors: 奇亮奭
Original assignee: 南韓商三星電子股份有限公司
Priority date: 2018-01-19
Filing date: 2018-11-14
Publication date: 2019-08-16
Also published as: JP2019128960A; JP7140688B2; KR20190088873A; TWI750425B; CN110058804A; KR102460568B1

Abstract

A data storage system includes: a plurality of data storage devices for storing a plurality of objects of a key-value pair; and a virtual storage layer that applies different data reliability schemes, including a data replication scheme and an erasure coding scheme, based on a size of an object of the plurality of objects. The plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size. The virtual storage layer classifies the first object as a small object, applies the data replication scheme, and stores the small object across one or more of the plurality of data storage devices. The virtual storage layer classifies the second object as a huge object, splits the huge object into one or more chunks of a same size, applies the erasure coding scheme, and stores the one or more chunks in a distributed manner across the plurality of data storage devices. A method for writing an object of a key-value pair is also provided.

Description

System and method for storing large key-value objects

The present invention generally relates to data storage systems and, more particularly, to a method for storing very large key-value objects in a data storage system.

Data reliability is a key requirement of a data storage system. Data reliability for conventional block devices has been well studied and is implemented through various data replication technologies such as Redundant Array of Independent Disks (RAID) and erasure coding. RAID spreads (or replicates) data across a set of data storage drives to prevent permanent data loss on a particular drive. RAID falls largely into two categories: either a complete mirror image of the data is kept on a second drive, or parity blocks are added to the data so that failed blocks can be recovered after a failure. Erasure coding adds a set of parity blocks using complex algorithms that provide robust data protection and recovery and can tolerate high levels of failure. For example, erasure coding can virtualize physical drives to create a virtual drive that is spread over multiple physical drives to achieve fast recovery. Data replication using RAID may be too expensive for replicating large objects, whereas erasure coding may waste storage space for small objects.

A key-value solid-state drive (KV SSD) is a new type of storage device that has a different interface and different semantics compared with conventional block devices such as hard disk drives (HDDs) and solid-state drives (SSDs). A KV SSD can directly store the data values of key-value pairs. The data values stored in a KV SSD may be large or small depending on the application and the characteristics of the data. There is a need for an efficient data reliability model that stores objects of different sizes efficiently, without performance bottlenecks or space limitations.

According to one embodiment, a data storage system includes: a plurality of data storage devices for storing a plurality of objects of a key-value pair; and a virtual storage layer that applies different data reliability schemes, including a data replication scheme and an erasure coding scheme, based on a size of an object of the plurality of objects. The plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size. The virtual storage layer classifies the first object as a small object, applies the data replication scheme, and stores the small object across one or more of the plurality of data storage devices. The virtual storage layer classifies the second object as a huge object, splits the huge object into one or more chunks of a same size, applies the erasure coding scheme, and stores the one or more chunks in a distributed manner across the plurality of data storage devices.

According to another embodiment, a method for writing an object of a key-value pair includes: receiving a plurality of objects of the key-value pair, wherein the plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size; classifying the first object as a small object; applying a data replication scheme to the small object; storing the small object across one or more of a plurality of data storage devices; classifying the second object as a huge object; splitting the huge object into one or more chunks of a same size; applying an erasure coding scheme to the huge object; and storing the one or more chunks in a distributed manner across the plurality of data storage devices.

The above and other preferred features, including various novel details of implementation and combinations of events, will now be described more particularly with reference to the accompanying drawings and pointed out in the claims. It will be understood that the particular systems and methods described herein are shown by way of illustration only and not as limitations. As will be understood by those skilled in the art, the principles and features described herein may be employed in various and numerous embodiments without departing from the scope of the invention.

Each of the features and teachings disclosed herein can be used separately or in conjunction with other features and teachings to provide a data storage system for efficiently storing objects of different sizes and a method for storing those objects in the data storage system. Representative examples that use many of these additional features and teachings, both separately and in combination, are described in further detail with reference to the accompanying drawings. This detailed description is merely intended to teach a person skilled in the art further details for practicing preferred aspects of the present teachings and is not intended to limit the scope of the claims. Therefore, combinations of features disclosed in the detailed description may not be necessary to practice the teachings in the broadest sense, and are instead taught merely to describe particularly representative examples of the present teachings.

In the following description, specific nomenclature is set forth for purposes of explanation only, to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required to practice the embodiments of the present invention.

Some portions of the detailed description herein are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are used by those skilled in the data processing arts to effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as is apparent from the following discussion, it is appreciated that throughout the description, discussions using terms such as "processing," "computing," "calculating," "determining," "displaying," or the like refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.

Moreover, the various features of the representative examples and the appended claims may be combined in ways that are not specifically and explicitly enumerated in order to provide additional useful embodiments of the present teachings. It is also expressly noted that all value ranges or indications of groups of entities disclose every possible intermediate value or intermediate entity, both for the purpose of an original disclosure and for the purpose of restricting the claimed subject matter. It is also expressly noted that the dimensions and shapes of the components shown in the figures are designed to help understand how the present teachings are practiced, and are not intended to limit the dimensions and shapes shown in the examples.

The present disclosure describes a data storage system and a method for storing large key-value objects in the data storage system. The present data storage system can store data with high reliability in one or more data storage devices. In particular, the present data storage system can store objects differently based on their sizes, reducing cost and storage space while achieving high reliability. Data associated with a large object can be split into small pieces and stored in one or more data storage devices. Herein, a data storage device may also be referred to as a key-value solid-state drive (KV SSD) when the data stored in the data storage device is a data array associated with key-value pairs.

An object (e.g., the value of a key-value pair) can be split into multiple pieces, or chunks, of the same size. The chunk size can be determined dynamically at runtime on a per-object basis. Objects may therefore differ in both the number and the size of their chunks.

A group is defined as a set of data storage devices that together achieve a target data reliability. A group may include one or more data storage devices within a single enclosure (e.g., a chassis or a rack) or across multiple enclosures, and may be structured in a hierarchical or a non-hierarchical manner.

The present data storage system includes a virtual storage layer that manages the grouping of one or more data storage devices and presents the group as a whole to a user application as a single virtual storage unit. The virtual storage layer may be used to manage multiple drivers that control the one or more data storage devices. The number of data storage devices managed by the virtual storage layer can be configured based on a reliability target. For erasure coding, the total number of data storage devices may be the sum of the number of data devices (D) and the number of parity devices (P), which tolerates P failures. For replication, the total number of data storage devices needed to tolerate P failures may be P+1. The storage capacities of the data storage devices may be roughly similar, whether erasure coding or replication is used. The storage capacity of the virtual storage may be determined by the sum of the data storage space of all data storage devices in the group.
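The device counts above can be illustrated with a short sketch. The function names `devices_in_group` and `raw_space_factor` and the scheme labels are hypothetical and are not taken from the disclosure; the space factor ignores padding and metadata.

```python
def devices_in_group(scheme: str, p: int, d: int = 1) -> int:
    """Total devices in a group that tolerates p device failures."""
    if p < 0 or d < 1:
        raise ValueError("p must be >= 0 and d must be >= 1")
    if scheme == "replication":
        return p + 1  # one original copy plus p replicas
    if scheme == "erasure_coding":
        return d + p  # d data devices plus p parity devices
    raise ValueError(f"unknown scheme: {scheme}")


def raw_space_factor(scheme: str, p: int, d: int = 1) -> float:
    """Raw bytes written per byte of user data (padding and metadata ignored)."""
    total = devices_in_group(scheme, p, d)
    data_devices = 1 if scheme == "replication" else d
    return total / data_devices
```

For example, tolerating two failures takes three devices with replication (three times the raw space), but six devices with D = 4 erasure coding (1.5 times the raw space), which is why erasure coding tends to be preferred for large objects.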

According to one embodiment, the virtual storage layer manages a group of one or more data storage devices in a stateless manner. That is, the virtual storage layer does not, and need not, maintain any key information or mapping information between objects and data storage devices. However, the virtual storage layer can cache and maintain essential metadata of the one or more data storage devices, such as the number of objects and the available storage capacity, dynamically at runtime.

Data storage devices such as KV SSDs have implementation-specific constraints on the objects, and on the operations on objects, that they can handle. The virtual storage layer of the data storage system is aware of the minimum and maximum value sizes that each data storage device can support, and can determine the minimum and maximum value sizes for the group as a whole.

For example, let VMIN_i be the minimum value size of the i-th KV SSD. The minimum value size of the virtual storage (VMIN_VS) can be defined as the maximum of the minimum value sizes of the individual KV SSDs in the group:

$$\mathrm{VMIN}_{\mathrm{VS}} = \max_{i}\, \mathrm{VMIN}_{i} \qquad (1)$$

Similarly, let VMAX_i be the maximum value size of the i-th KV SSD. The maximum value size of the virtual storage (VMAX_VS) can be defined as the minimum of the maximum value sizes of the individual KV SSDs in the group:

$$\mathrm{VMAX}_{\mathrm{VS}} = \min_{i}\, \mathrm{VMAX}_{i} \qquad (2)$$
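As a minimal sketch of how a virtual storage layer might derive VMIN_VS and VMAX_VS from per-device limits, consider the following. The `DeviceLimits` type and the function name are hypothetical, as is the validity check that rejects a group whose devices share no common value size.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DeviceLimits:
    vmin: int  # smallest value size the device accepts, in bytes
    vmax: int  # largest value size the device accepts, in bytes


def virtual_storage_limits(devices: Iterable[DeviceLimits]) -> Tuple[int, int]:
    """Return (VMIN_VS, VMAX_VS) for a group of devices."""
    devices = list(devices)
    if not devices:
        raise ValueError("a group needs at least one device")
    vmin_vs = max(dev.vmin for dev in devices)  # largest of the minimums
    vmax_vs = min(dev.vmax for dev in devices)  # smallest of the maximums
    if vmin_vs > vmax_vs:
        raise ValueError("no value size is accepted by every device")
    return vmin_vs, vmax_vs
```

Taking the largest minimum and the smallest maximum ensures that any chunk size inside the resulting range is acceptable to every device in the group.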

In one example, a maximum distance separable (MDS) algorithm, such as Reed-Solomon (RS) coding, may be used for erasure coding.

According to one embodiment, the present system and method provides a hybrid data reliability mechanism that uses both replication and erasure coding based on the size of an object. Data replication is fast for small objects and can be lightweight. However, data replication may consume more storage space for large objects than erasure coding does. Erasure coding may consume less storage space for large objects than data replication, and it can reconstruct those large objects by using multiple data storage devices. However, erasure coding typically involves heavy computation, and it may take longer to reconstruct small objects because it requires multiple chunks from multiple data storage devices. The present approach is not a hierarchical one in which data is replicated after being erasure-coded. Rather, the present system and method makes a flexible choice between data replication and erasure coding based on the size of the object. In other words, the present system and method can decide at runtime which data reliability scheme to apply, i.e., data replication or erasure coding, based on the size of the object and the total space overhead of storing it.

According to one embodiment, the present system and method incurs no read-modify-write overhead when updating an object. Conventional erasure coding or data replication for block devices carries a high penalty for partial updates. If a small piece of data is updated, all the blocks used for erasure coding must be read and updated, and after the update the parity blocks must be recalculated and written back to the data storage devices. In other words, updating an object requires a series of read-modify-write operations because a block may be shared by multiple objects. In contrast, the present system and method provides reliability on the basis of an object and its characteristics (e.g., size). If an object is updated, only the updated object needs to be rewritten, without incurring read-modify-write overhead.

According to one embodiment, the present system and method supports a wide range of object sizes in a unified framework. If an object is very large (or huge), exceeding the storage capacity limit of a data storage device, the data storage device may not be able to store the object as a whole. In this case, the object needs to be split into small pieces, referred to herein as chunks. The present disclosure discusses how an object can be split into chunks based on the capabilities of the data storage devices and how the object can be rebuilt from the split chunks. For example, the present system and method provides multiple split and rebuild mechanisms depending on whether the data storage devices support a group feature, as discussed in detail below.

The present system and method provides reliability per object rather than per fixed block. Data replication and erasure coding can be mixed to achieve a target reliability for the objects of a single disk group by bifurcating objects based on their size. This approach differs from conventional data reliability techniques that have a hierarchical structure (i.e., replication of erasure-coded objects). The present system and method treats space efficiency as the primary concern and performance as a secondary metric in determining the reliability mechanism suitable for a particular object. Objects are stored statelessly: neither replication nor erasure coding requires any additional information to be stored. Updates incur no read-modify-write overhead regardless of object size.

Herein, an object refers to static data that has a fixed value during input and output (I/O) operations. An object may be associated with the key of a key-value pair, in which case the object corresponds to the value of the key-value pair. An object can be split into multiple chunks of the same size, and the chunk size can be determined dynamically at runtime on a per-object basis. Each object may have a different number and size of chunks when it is saved to the one or more data storage devices.

Herein, a minimum set of data chunks of an object that is stored all at once across the one or more data storage devices (e.g., KV SSDs) without data reliability is referred to as a fragment. For example, a fragment corresponds to the number of chunks into which an object is split across the one or more data storage devices. A minimum set of data chunks of an object that implements data reliability (e.g., replication or erasure coding) is referred to as a band. Because of the parity chunks, the number of chunks in a band may be greater than the number of chunks in a fragment. The set of chunks of an object that is stored on one data storage device among the one or more data storage devices in a group is referred to as a split.

For data replication, a fragment may contain only one chunk (of the target object), i.e., a replicated copy of the original object. For erasure coding, a fragment may contain the D chunks that make up the target object. A band, on the other hand, may contain one original chunk and P replica chunks for data replication, or D data chunks and P parity chunks for erasure coding. The number of fragments or bands corresponds to the total number of chunks in a split. The split size (i.e., the number of chunks stored in one data storage device) and the band size (i.e., the number of chunks stored across the one or more data storage devices) can vary depending on the sizes of the original object and of the chunks. For example, the split size may be the object size for data replication, whereas for erasure coding the split size is the chunk size multiplied by the number of chunks per split.

FIG. 1 is a schematic diagram of an object stored in an example data storage system, according to one embodiment. The object 110 can be split into a plurality of chunks. For example, the object 110 is split into 3S chunks, Chunk 1 through Chunk 3S, where S is a natural number. It is noted that the total of 3S chunks is only an example; the object 110 may have any number of chunks and need not be split into a multiple of three. The object 110 may be classified as a very large or huge object, as discussed in more detail below. The split chunks can be distributed in a virtual storage across one or more data storage devices. In the present example, the virtual storage includes D data storage devices (Disk 1 through Disk D) and P parity devices (Disk D+1 through Disk D+P), and S is equal to D.

According to one embodiment, the virtual storage layer of the data storage system can distribute chunks in a split-first scheme (split-first distribution) or in a band-first scheme (band-first distribution). In the split-first scheme, shown as virtual storage 120, the virtual storage layer stores Chunk 1, Chunk 2, and Chunk 3 in Disk 1, stores Chunk 4, Chunk 5, and Chunk 6 in Disk 2, and so on until Chunk 3D-2, Chunk 3D-1, and Chunk 3D are stored in Disk D. The band 150 includes data Chunk 1, data Chunk 4, through data Chunk 3D-2, together with Parity 1 through Parity P. In the band-first scheme, shown as virtual storage 121, the virtual storage layer stores Chunk 1 through Chunk D in Disk 1 through Disk D, respectively, stores Chunk D+1 through Chunk 2D in Disk 1 through Disk D, respectively, and so on until Chunk 2D+1 through Chunk 3D are stored in Disk 1 through Disk D, respectively. The split 151 includes data Chunk 1, data Chunk D+1, and data Chunk 2D+1. The parity chunks (Parity 1 through Parity P) are stored in the parity disks (Disk D+1 through Disk D+P). Although, for convenience, both virtual storage 120 and virtual storage 121 are shown in this example as storing the parity chunks in the band-first scheme, it is understood that the parity chunks may be stored in the split-first scheme without departing from the scope of the present invention.

Regardless of which chunk distribution scheme is used, I/O operations can be performed in a band-first manner. In this case, even under the split-first scheme, the I/O operations for Chunk 1, Chunk 4, through Parity P are performed in parallel.
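The two data-chunk placements can be sketched as follows. This illustration covers data chunks only (parity is omitted), numbers chunks from 1 as in FIG. 1, and uses the hypothetical helper names `split_first`, `band_first`, and `band`.

```python
from typing import List


def split_first(num_chunks: int, d: int) -> List[List[int]]:
    """Fill each data disk with consecutive chunks before moving to the next."""
    per_split = -(-num_chunks // d)  # ceiling division: chunks per disk
    disks: List[List[int]] = [[] for _ in range(d)]
    for chunk in range(1, num_chunks + 1):
        disks[(chunk - 1) // per_split].append(chunk)
    return disks


def band_first(num_chunks: int, d: int) -> List[List[int]]:
    """Place consecutive chunks on consecutive data disks, one band at a time."""
    disks: List[List[int]] = [[] for _ in range(d)]
    for chunk in range(1, num_chunks + 1):
        disks[(chunk - 1) % d].append(chunk)
    return disks


def band(disks: List[List[int]], k: int) -> List[int]:
    """Data chunks of the k-th band (0-based): the k-th chunk of every split."""
    return [split[k] for split in disks if k < len(split)]
```

With nine chunks on three data disks, `split_first` yields `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, so the first band holds Chunks 1, 4, and 7, while `band_first` yields `[[1, 4, 7], [2, 5, 8], [3, 6, 9]]`, so the first split holds Chunks 1, D+1, and 2D+1.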

According to one embodiment, erasure coding is applied to an object based on its size. For the i-th object, the number of chunks per device (i.e., per split) is defined by Equation (3) below. For replication, the number of chunks per device can be 1.
Equation (3)

The number of chunks per device is the minimum number of chunks per split needed to store the object across the data disks (i.e., the data storage devices Disk 1 through Disk D) when the maximum chunk size is used. If the object size is not aligned with the allocation or alignment unit of the data storage devices, the extra space allocated for the object in a band can be padded with zeros.
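One reading of this description, offered only as an assumption, is that the chunk count per split is the object size divided by the capacity of one band of maximum-size data chunks, rounded up. The sketch below follows that reading. The function names are hypothetical, and per-chunk metadata is ignored.

```python
def chunks_per_split(object_size: int, d: int, max_chunk_size: int) -> int:
    """Minimum chunks per data disk when every chunk has the maximum size.

    Assumed reading: ceil(object_size / (d * max_chunk_size)).
    """
    if object_size <= 0 or d < 1 or max_chunk_size < 1:
        raise ValueError("sizes and device count must be positive")
    band_capacity = d * max_chunk_size  # user data held by one band
    return -(-object_size // band_capacity)  # ceiling division


def zero_padding(object_size: int, d: int, chunk_size: int, n_per_split: int) -> int:
    """Bytes of zero fill needed to complete the last band."""
    return n_per_split * d * chunk_size - object_size
```

For a 10,000,000-byte object on D = 4 data disks with a 1 MiB maximum chunk size, this gives three chunks per split and 2,582,912 bytes of padding, which shows how much space the maximum chunk size can waste.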

If the maximum chunk size were always used, too much storage space would tend to be wasted on padding. Therefore, the actual chunk size of an object is determined more tightly by Equation (4). If additional metadata is stored together with the data of each chunk, the size of that metadata is accounted for in Equation (4). If the data storage device supports a group feature, some metadata, such as a group identifier (ID) and the total number of chunks, may be stored in each chunk. If the data storage device does not support a group feature, no metadata is stored (i.e., the metadata size is 0). For replication, the actual chunk size may be equal to the original object size.
Equation (4)

Equation (4) places the chunk size within a bounded range, close to one end of that range. The chunk size determined by Equation (4) minimizes the number of I/O operations while maximizing I/O bandwidth. The amount of data stored on each data storage device (i.e., the split size) is then defined by Equation (5).
Equation (5)

Finally, the total amount of data written across the data storage devices at one time (i.e., the band size) is defined by Equation (6). For replication, D may be equal to 1.
Equation (6)
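The relationship between chunk size, split size, and band size can be sketched as follows. The split and band sizes follow the definitions given in the text (chunk size times chunks per split, and chunk size times the D + P devices written at once). The `tight_chunk_size` helper is only one plausible way to pick a chunk size smaller than the maximum and is not a statement of Equation (4). All names, the alignment unit, and the placement of the metadata term are assumptions.

```python
def align_up(value: int, unit: int) -> int:
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def tight_chunk_size(object_size: int, d: int, n_per_split: int,
                     align_unit: int, meta_size: int = 0) -> int:
    """Assumed heuristic: spread the object evenly, align, then add metadata."""
    payload = -(-object_size // (d * n_per_split))  # bytes of user data per chunk
    return align_up(payload, align_unit) + meta_size


def split_size(chunk_size: int, n_per_split: int) -> int:
    """Amount of data one device stores for the object."""
    return chunk_size * n_per_split


def band_size(chunk_size: int, d: int, p: int) -> int:
    """Amount of data written across all D + P devices at one time."""
    return chunk_size * (d + p)
```

For a 10,000-byte object with D = 4, one chunk per split, and a 512-byte alignment unit, the chunk size is 2,560 bytes, so the data chunks carry only 240 bytes of padding. With P = 2, the band size is 15,360 bytes. For replication (D = 1, chunk size equal to the object size), the band size is 30,000 bytes.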

As described above, a data storage device may limit the size of the objects it can store. Some data storage devices may not support very large or very small objects. To store objects of different sizes reliably and efficiently, the present data storage system employs different data reliability schemes based on the size of the object. According to one embodiment, the virtual storage layer of the present data storage system can categorize objects into four types based on their size: huge, large, medium, and small. An object is classified as huge if multiple bands are used to store it. An object is classified as large if it almost fully uses one band. An object is classified as small if it uses only a small fraction of a band. Finally, an object is classified as medium if it could be categorized as either small or large. Accordingly, chunks of different sizes can coexist not only in the same virtual storage but also in the individual data storage devices that form the virtual storage.

An object is classified as small if the space overhead of replicating the object is less than the space overhead of erasure-coding it. In this case, replication is preferred because it provides better read performance and handles updates better than a complex erasure coding scheme. This is also reasonable given the observation that application metadata tends to be small. In one embodiment, a small object satisfies Equation (7):
Equation (7)

An object is classified as large if the space overhead of erasure-coding the object is less than the space overhead of replicating it. In this case, erasure coding is preferred because it occupies less space. Specifically, a large object satisfies Equation (8):
Equation (8)

A large object can be constructed similarly to FIG. 1, except that it has only one band, that is, only one chunk within a split.

An object is classified as huge if it has more than one chunk within a split. In this case, erasure coding is preferred. Specifically, a huge object satisfies Equation (9):
Equation (9)

A huge object can be constructed similarly to FIG. 1, and it can have multiple bands, that is, more than one chunk within a split.

There may be a range of object sizes that can be classified as either small or large. An object that satisfies the following inequality is classified as medium:
Equation (10)

In this case, either data replication or erasure coding can be used. If performance is more important and the object is updated frequently, data replication may be the better choice, and the medium object is treated as small. If space efficiency is more important, erasure coding can be used, and the medium object is treated as large.
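The classification rules above can be sketched as a comparison of storage footprints. Equations (7) through (10) are not reproduced in this text, so the cost model below is an assumption made purely for illustration and is not the formula of the disclosure: every stored value is padded to a device alignment `align`, replication stores P+1 copies, and erasure coding stores D+P chunks. The names `padded`, `classify`, and `max_chunk`, and the choice to treat equal footprints as medium, are likewise hypothetical.

```python
import math


def padded(size, align):
    """Round a size up to the (assumed) device alignment, at least one unit."""
    return max(1, math.ceil(size / align)) * align


def classify(size, d, p, align, max_chunk):
    """Classify an object as small, medium, large, or huge.

    Assumed cost model (not taken from Equations (7)-(10)):
      replication footprint = (P + 1) padded copies of the whole object
      erasure footprint     = (D + P) padded chunks of size/D each
    """
    chunk = padded(math.ceil(size / d), align)
    if chunk > max_chunk:
        # One chunk per device is not enough, so several bands are needed.
        return "huge"
    replication = (p + 1) * padded(size, align)
    erasure = (d + p) * chunk
    if replication < erasure:
        return "small"
    if erasure < replication:
        return "large"
    return "medium"  # either scheme costs the same under this model
```

Under this model, with D = 4, P = 2, and 512-byte alignment, a 100-byte object is cheaper to replicate, while an 8 KiB object is cheaper to erasure code. A medium result would then be resolved by policy, as described above.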

To store a huge object, the virtual storage layer may need to split it into small data chunks, and to retrieve it, the layer must reconstruct the object from those chunks. For this purpose, internal keys generated from the user key (for example, a key supplied by a user application running on a host) can be used so that the virtual storage layer remains stateless. According to one embodiment, the virtual storage layer reserves a few bytes of the key space supported by the device for internal use in distributing chunks, and exposes the rest of the key space to the user. In this case, the object key specified by the user represents the set of internal keys of the one or more chunks into which the object is split.

FIG. 2 illustrates an example internal key that includes a user key, according to one embodiment. The internal key 200 includes a first portion holding the user key 201 and a second portion holding the band identifier (ID) 202. The internal key 200 can be used to identify the whole group of chunks, or a subset of the chunks, of the corresponding object. Here the object corresponds to the value of a key-value pair whose key is the internal key 200. In this example, the maximum key length supported by the virtual storage layer and/or the data storage device is L, and the number of bytes reserved for the group specification is G. The virtual storage layer advertises L-G as the maximum key length available to users.
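The key layout of FIG. 2 can be illustrated with a minimal sketch. The function names `make_internal_key` and `split_internal_key`, the placement of the band ID after the user key, and the big-endian encoding of the band ID are assumptions for illustration; the description only states that G bytes of the key space are reserved and that users may use at most L-G bytes.

```python
def make_internal_key(user_key, band_id, max_key_len, group_bytes):
    """Build an internal key: user key followed by a G-byte band ID.

    max_key_len is L and group_bytes is G in the description.
    """
    if len(user_key) > max_key_len - group_bytes:
        raise ValueError("user key exceeds the advertised length L - G")
    # Big-endian encoding is an assumption; it keeps band IDs sortable.
    return user_key + band_id.to_bytes(group_bytes, "big")


def split_internal_key(internal_key, group_bytes):
    """Recover (user key, band ID) from an internal key."""
    user_key = internal_key[:-group_bytes]
    band_id = int.from_bytes(internal_key[-group_bytes:], "big")
    return user_key, band_id
```

For example, with L = 16 and G = 1, the user key `b"photo"` and band 2 would give the internal key `b"photo\x02"`, and a small or large object would use band ID 0.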

For a small or large object, the G bytes of the band ID 202 may be filled with zeros by default. For a huge object, the virtual storage layer may compute the number of bands of the object. Individual bands can be identified using the internal key 200. The bands can be written one by one, according to the split-first scheme or the band-first scheme, to the one or more data storage devices assigned to store the object.

According to one embodiment, the data storage devices may support a group feature. The virtual storage layer can identify the split stored in a data storage device by specifying a group based on the user key 201. In this case, no additional metadata may be needed (SZ_meta = 0). The virtual storage layer can retrieve the chunks of an object by broadcasting, to all data storage devices, the user key 201 together with a band ID 202 filled with "don't care" bits (arbitrary multi-bit data, for example 0xFF). When the band ID is "don't care", the band ID field is ignored. It is assumed that the data storage devices implement the group feature efficiently. For example, a trie structure can easily identify objects with a given prefix of the user key 201, whereas a hash table can locate objects in a hash bucket using the user key only if the metadata field is fixed. The virtual storage layer can sort the returned chunks in ascending order of band ID 202 for each device, reconstruct the bands and then the object, and return the object with the user key 201.

FIG. 3 illustrates an example of object retrieval using the group feature, according to one embodiment. Each of the data storage devices assigned to store the object, namely disk 1 through disk D and disk D+1 through disk P, supports the group feature. In this case, the band ID 302 is set to "don't care", indicating that the band ID 302 is ignored. The virtual storage layer uses the user key 301 to collect the chunks belonging to band 350 (that is, data chunk 1, data chunk 2, ..., and parity chunk 1 through parity chunk P), and reconstructs from band 350 the first fragment, which contains chunk 1 through chunk D. The virtual storage layer then collects the remaining chunks and reconstructs the remaining fragments in order. Once all fragments are built, the virtual storage layer reconstructs from them the object 310, which contains data chunk 1 through data chunk 3D. With erasure coding, the virtual storage layer additionally uses parity chunk 1 through parity chunk P to reconstruct the parity block. This example illustrates the band-first scheme for distributing chunks; however, the split-first scheme can equally be applied to chunk distribution and object retrieval without departing from the scope of the present disclosure.

FIG. 4 illustrates an example of object retrieval without the group feature, according to one embodiment. In this case, the virtual storage layer attaches additional metadata to large or huge objects (that is, SZ_meta ≠ 0) because the data storage devices assigned to store the object do not support the group feature. Each chunk can be identified by a band ID 402 that is 1 byte long. In this example there are three bands, band 0, band 1, and band 2, so the number of bands fits within the 1-byte length. The virtual storage layer can use the band ID 402 to build the fragments one by one. First, the virtual storage layer broadcasts the user key 401 with a band ID equal to 0 (BID = 0) to all data storage devices. It receives the chunks of band 0 from the data storage devices and retrieves the band information from any of the received chunks belonging to band 0. From this band information, the virtual storage layer learns how many bands it must retrieve for the object. If the object is large, there is only one band, so the virtual storage layer can reconstruct the whole object from the chunks in that band. If the object is huge and has more than one band, the virtual storage layer must retrieve the remaining bands (for example band 1 and band 2) one by one. In this case, the virtual storage layer broadcasts further retrieval requests with an adjusted band ID (for example BID = 1 or BID = 2) until it has retrieved all chunks. Once the virtual storage layer has built all fragments, it can reconstruct the object 410. Note that a small object may carry no metadata, regardless of whether the devices support the group feature. By checking the chunk size, the virtual storage layer can use inequalities (7) and (10) to determine whether the object 410 is small.
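The band-by-band retrieval of FIG. 4 can be sketched as a loop over band IDs. This is a simplified model under stated assumptions: each device is represented by a Python dict keyed by internal key, each stored chunk is a dict with hypothetical fields `total_bands`, `index`, and `payload`, and all D data chunks are assumed to be readable, so erasure decoding and removal of padding are not shown.

```python
def read_object(devices, user_key, d):
    """Read an object band by band when devices lack the group feature.

    devices: list of dicts mapping internal key -> chunk record (assumed model).
    Chunks with index < d are data chunks; the rest are parity chunks.
    """
    def broadcast(band_id):
        # 1-byte band ID appended to the user key, as in the FIG. 4 example.
        key = user_key + bytes([band_id])
        return [dev[key] for dev in devices if key in dev]

    chunks = broadcast(0)
    if not chunks:
        raise KeyError("NOT_EXIST")
    # Any chunk of band 0 carries the total number of bands as metadata.
    total_bands = chunks[0]["total_bands"]

    value = b""
    for band_id in range(total_bands):
        if band_id > 0:
            chunks = broadcast(band_id)
        data = sorted((c for c in chunks if c["index"] < d),
                      key=lambda c: c["index"])
        if len(data) < d:
            raise ValueError("data chunk missing; erasure decoding not shown")
        value += b"".join(c["payload"] for c in data)
    return value
```

A large object takes a single pass through the loop, while a huge object issues one additional broadcast per remaining band.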

To write a huge object to one or more data storage devices, the object is split into (number of bands) × D chunks of the same size, that is, one fragment of D data chunks per band. To satisfy alignment requirements, the last data chunk may be padded with zeros (for example, data 4a of object 510a and data 4b of object 510b in FIG. 5), and P parity chunks are generated from the D data chunks of each fragment.
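The splitting and zero-padding step can be sketched as follows. The chunk size is taken as a given parameter here, since the equations that determine it are not reproduced in this passage; the function name `split_into_fragments` and the choice to fill one band with consecutive chunks before moving to the next are assumptions for illustration. Parity generation is not shown.

```python
def split_into_fragments(value, d, chunk_size):
    """Split a value into fragments of D equal-size data chunks each.

    The tail is padded with zero bytes so that every chunk has chunk_size
    bytes. Returns a list of fragments; each fragment is a list of D chunks.
    """
    fragment_size = d * chunk_size
    # Ceiling division; an empty value still occupies one (all-zero) band.
    n_bands = max(1, -(-len(value) // fragment_size))
    padded = value.ljust(n_bands * fragment_size, b"\x00")
    return [
        [padded[(band * d + i) * chunk_size:(band * d + i + 1) * chunk_size]
         for i in range(d)]
        for band in range(n_bands)
    ]
```

With D = 4 and 2-byte chunks, a 15-byte value yields two fragments, and only the last data chunk of the second fragment receives a padding byte.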

FIG. 5 illustrates an example of erasure coding without a dedicated parity device, according to one embodiment. FIG. 6 illustrates an example of erasure coding with one or more dedicated parity devices, according to one embodiment. For each band, a total of (D+P) chunks, consisting of D data chunks and P parity chunks, are distributed over the one or more data storage devices, and this is repeated until all bands are written. The parity chunks can be distributed over the D+P devices (for example SSD 4 through SSD 6 in FIG. 5) or stored on P dedicated devices (for example SSD 5 and SSD 6 in FIG. 6). The primary data storage device can be selected using the hash value of the user key without the band ID (denoted below as "Hash(user key)"). All (D+P) devices, or a subset of them, can be selected in the example of FIG. 5, and D devices can be selected in the example of FIG. 6. The starting device is determined by Hash(user key) % (D+P) if there is no dedicated parity device, or by Hash(user key) % D if there are dedicated parity devices. Subsequent chunks are written sequentially to the next devices, for example (Hash(user key) + 1) % (D+P), (Hash(user key) + 2) % (D+P), ..., (Hash(user key) + D + P - 1) % (D+P), or (Hash(user key) + 1) % D, (Hash(user key) + 2) % D, ..., (Hash(user key) + D - 1) % D. This is done for each band, and the virtual storage layer repeats the process for all bands to write the chunks of the object. The hash value of the user key needs to be computed only once per object.
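The device selection rule above can be sketched with 0-based device indices. The description does not name a hash function, so the use of SHA-256 in `key_hash` is an assumption, as are the function names and the convention that dedicated parity devices occupy indices D through D+P-1 in order.

```python
import hashlib


def key_hash(user_key):
    """Stable hash of the user key (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(user_key).digest()[:8], "big")


def band_devices(user_key, d, p, dedicated_parity):
    """Return (data device indices, parity device indices) for one band."""
    h = key_hash(user_key)  # computed once per object
    if dedicated_parity:
        # Data chunks rotate over the D data devices only.
        data = [(h + i) % d for i in range(d)]
        parity = list(range(d, d + p))
    else:
        # Data and parity chunks rotate over all D + P devices.
        order = [(h + i) % (d + p) for i in range(d + p)]
        data, parity = order[:d], order[d:]
    return data, parity
```

Because the rotation starts at a key-dependent device, different objects place their parity chunks on different devices when no device is dedicated to parity, which matches the mixed layout of FIG. 5.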

If the data storage devices do not support the group feature, each chunk carries additional metadata consisting of the band ID and the total number of bands, as shown in FIG. 4. The number of bands can be determined by Equation (3). A chunk in a band can carry the pair (band ID, total number of bands) as metadata.

Referring to FIG. 5, the virtual storage layer stores the data chunks (data 1a through data 4a) and parity chunks (parity 1a and parity 2a) of object 510a on data storage devices SSD1 through SSD6. The virtual storage layer stores the data chunks (data 1b through data 4b) and parity chunks (parity 1b and parity 2b) of another object 510b on data storage devices SSD1 through SSD6. The starting device (for example SSD1 for object 510a and SSD6 for object 510b) can be determined by the hash value of the user key, as discussed above. Because there is no dedicated parity device, the virtual storage layer can distribute the data chunks and parity chunks over SSD1 through SSD6 without distinguishing between them. In this example, SSD4 and SSD6 each contain both data chunks and parity chunks.

Referring to FIG. 6, the virtual storage layer stores the data chunks (data 1a through data 4a) and parity chunks (parity 1a and parity 2a) of object 610a on data storage devices SSD1 through SSD6. Similarly, the virtual storage layer stores the data chunks (data 1b through data 4b) and parity chunks (parity 1b and parity 2b) of object 610b on data storage devices SSD1 through SSD6. Because SSD5 and SSD6 are assigned as parity devices, SSD5 and SSD6 contain only parity chunks.

To write a large object to one or more data storage devices, the object is split into (number of bands) × D chunks of the same size. A large object can be handled like a huge object, except that it has only one band, that is, the number of bands is 1.

To store a small object, (P+1) replicas of the object are created. The replicas, padded for alignment where necessary, are distributed over (P+1) devices. The primary device can be selected among the (D+P) devices using the hash value of the user key. The devices for the P additional replicas can be selected deterministically based on various factors such as storage organization and performance. For example, when there is no dedicated parity device, the replicas can simply be stored on devices (Hash(key)+1) % (D+P), (Hash(key)+2) % (D+P), ..., (Hash(key)+P) % (D+P), or on different nodes or racks. If there are dedicated parity devices or nodes, the replicas can be stored on devices (Hash(key)+1) % D, (Hash(key)+2) % D, ..., (Hash(key)+P) % D. A small object may carry no metadata regardless of the capabilities of the devices.
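The simple replica placement described above reduces to modular arithmetic on the key hash. In this sketch the hash value is passed in directly so that the arithmetic is visible; the function name `replica_devices` and the 0-based device indices are assumptions for illustration.

```python
def replica_devices(key_hash, d, p, dedicated_parity=False):
    """Return the P + 1 device indices that hold the copies of a small object.

    The first index is the primary device; the remaining P indices follow it
    in order, wrapping around the eligible devices.
    """
    # With dedicated parity devices, only the D data devices are eligible.
    n = d if dedicated_parity else d + p
    return [(key_hash + i) % n for i in range(p + 1)]
```

With D = 4 and P = 2, a hash value of 0 selects devices 0, 1, and 2, and a hash value of 2 selects devices 2, 3, and 4. Node-aware or rack-aware placement, which the description also allows, is not modeled here.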

FIG. 7 illustrates an example replication scheme for small objects on one or more data storage devices without a parity device, according to one embodiment. The virtual storage layer can store the small object 710a (object 1) on data storage devices SSD1, SSD2, and SSD3. The virtual storage layer can store the small object 710b (object 2) on data storage devices SSD3, SSD4, and SSD5. Note that the small objects 710a and 710b are not split into smaller data chunks. The starting device for storing each of the objects 710a and 710b can be determined by the hash value of the corresponding user key. In this example, the starting device of object 710a is SSD1, whereas the starting device of object 710b is SSD3. In this example, the total number of replicas (P+1) of each of the objects 710a and 710b is 3 (that is, P = 2).

FIG. 8 illustrates an example flowchart for writing an object, according to one embodiment. The virtual storage layer of the data storage system receives a write request for an object (801). The write request, received from a user (or a user application running on a host), may include a user key. The virtual storage layer determines whether the object is huge or large (802), for example using inequalities (8) and (9). For a large or huge object, the virtual storage layer determines the chunk size and the number of chunks per split and per band (811), for example using Equations (3) and (4), writes the data to one or more bands on the data storage devices (812), and completes the write process (815).

If the virtual storage layer determines that the object is neither large nor huge, it further determines whether the object is small (803), for example using Equation (7). For a small object, the virtual storage layer determines one or more data storage devices on which to store the data, comprising the original data and its replicas, based on the distribution policy (813), writes the data to a band on the one or more devices (814), and completes the write process (815). For example, the virtual storage layer may adopt the band-first policy (distributing the data across multiple data storage devices). The virtual storage layer may use the hash value of the user key to determine the starting device.

If the virtual storage layer determines that the object is neither huge, large, nor small, it treats the object as medium (804), determines one or more data storage devices on which to store the data, comprising the original data and its replicas, based on the distribution policy (813), writes the data to a band on the one or more devices (814), and completes the write process (815).

The process 812 of writing data to one or more bands on the data storage devices may include several sub-processes. First, the virtual storage layer determines whether there is any fragment to be written (820). If there is no fragment to be written, the virtual storage layer saves an error log, terminates the operation, and notifies the user application running on the host that sent the write request and/or the host's operating system. Otherwise, the virtual storage layer creates a fragment of the object and creates internal keys for the chunks in the fragment (821). It then checks whether this is the last fragment to be written (822). For the last fragment, if the fragment cannot completely fill a band, the virtual storage layer may add padding to the data chunks (823), and it computes the P parity chunks of the band using the erasure coding algorithm (824). For fragments other than the last, the virtual storage layer may skip the padding process 823. The virtual storage layer then determines the devices for the data chunks and parity chunks based on the distribution policy (825), which may be, for example, the split-first policy or the band-first policy. Finally, the virtual storage layer writes the data chunks and parity chunks of the band to the one or more data storage devices using the internal keys generated in 821 (826).

The virtual storage layer may not know whether an object to be read is small, large, or huge, because it does not maintain object metadata such as user keys and value sizes. The virtual storage layer therefore broadcasts a read request with the object's user key to all data storage devices in the virtual storage (for example (D+P) data storage devices) and determines the appropriate way to reconstruct the object based on the classification of the object size. Depending on whether the data storage devices support the group feature, the virtual storage layer performs read and write operations in different ways.

A huge object is read differently depending on whether the data storage devices support the group feature. If the data storage devices in the virtual storage support the group feature, the virtual storage layer broadcasts the user key with BID = "don't care" to all data storage devices. In response, if the broadcast produces no error, each of the (D+P) data storage devices returns the chunks in its split, one per band. The virtual storage layer then groups the received chunks into bands by band ID and sorts the bands in ascending order of band ID. As long as the virtual storage layer receives any D chunks of the same size for a band, it can rebuild that band. An error occurs if the total number of chunks received for a band is smaller than D or if the chunk sizes differ. An error may result from reading a non-existent object, in which case all data storage devices return a NOT_EXIST error, or from an unrecoverable error. The virtual storage layer reconstructs the bands one by one and then reconstructs the object from the bands.
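The grouping and validation of chunks returned by a "don't care" broadcast can be sketched as follows. The response format, a tuple of (band ID, chunk index within the band, payload), and the function name `group_chunks_by_band` are assumptions for illustration; decoding a band from its chunks is left out.

```python
from collections import defaultdict


def group_chunks_by_band(responses, d):
    """Group returned chunks by band ID and check each band is rebuildable.

    responses: iterable of (band_id, chunk_index, payload) tuples.
    Returns a list of (band_id, chunks) in ascending band ID order.
    """
    bands = defaultdict(list)
    for band_id, chunk_index, payload in responses:
        bands[band_id].append((chunk_index, payload))

    ordered = []
    for band_id in sorted(bands):
        chunks = sorted(bands[band_id])
        sizes = {len(payload) for _, payload in chunks}
        # A band needs at least D chunks, all of the same size.
        if len(chunks) < d or len(sizes) != 1:
            raise ValueError(f"band {band_id} cannot be rebuilt")
        ordered.append((band_id, chunks))
    return ordered
```

Responses may arrive in any order from the devices; sorting by band ID restores the order in which the fragments must be concatenated.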

If the data storage devices in the virtual storage do not support the group feature, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to read a huge object. In response, each of the (D+P) data storage devices returns one chunk of the first band (band 0). The virtual storage layer can identify the number of bands of the object by examining the metadata stored with any of the received chunks. It then reconstructs all bands one by one in ascending order of band ID and reconstructs the object from the bands. In some embodiments, the virtual storage layer may request all remaining bands asynchronously and reconstruct the object in the same way as in the group-support case.

The read process for a large object is similar to that for a huge object, except that a large object has only one band. If the data storage devices support the group feature, the virtual storage layer broadcasts the user key with BID = "don't care" to all data storage devices. In response, each of the (D+P) data storage devices returns the one chunk in its split. As long as the virtual storage layer receives D chunks of the same size, it can rebuild the object. An error occurs if the total number of chunks received for the band is smaller than the number of data chunks D or if the chunk sizes differ. An error may result from reading a non-existent object, in which case all data storage devices return a NOT_EXIST error, or from an unrecoverable error.

If the data storage devices in the virtual storage do not support the group feature, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to read a large object. In response, each of the (D+P) data storage devices returns one chunk of the band (BID = 0). By examining the metadata stored with any of the received chunks, the virtual storage layer recognizes that the large object has only one band, and it reconstructs the object by rebuilding that band.

The read process for a small object is similar to that for a large object, except that it relies on replication. If the data storage devices support the group feature, the virtual storage layer broadcasts the user key with BID = "don't care" to all data storage devices. In response, each of the (D+P) data storage devices returns a response. Because the object is small, some data storage devices may return a NOT_EXIST error while others return a chunk. The virtual storage layer can rebuild the object from any valid chunk received from one or more of the data storage devices. If all data storage devices return a NOT_EXIST error, the object identified by the read request may not exist, or an unrecoverable error may have occurred.

If the data storage devices in the virtual storage do not support the group feature, the virtual storage layer broadcasts the user key with BID = 0 to all data storage devices to read a small object. In response, each of the (D+P) data storage devices returns a response. The virtual storage layer can recognize that the object is small from any valid chunk received from any data storage device. Note that a small object does not hold additional metadata.

FIG. 9 illustrates an example flowchart for reading an object, according to one embodiment. The virtual storage layer receives a read request for an object (901). The read request, received from a user (or a user application running on a host), may include a user key. The virtual storage layer determines whether the one or more data storage devices that store the one or more chunks of the object support the group feature (902). If the group feature is supported, the virtual storage layer broadcasts a read request with an internal key having BID = "don't care" to all data storage devices (903); otherwise, it sets BID = 0 for all data storage devices (904). The virtual storage layer receives a chunk from one of the data storage devices (905). If there is no error in the received chunk (906), the virtual storage layer determines whether the group feature is supported (907) and retrieves the metadata in the chunk (908); otherwise, the virtual storage layer proceeds to determine the size of the object.

If the virtual storage layer determines that the object is huge or large (909), it checks whether the whole object has been reconstructed (910), continues to reconstruct fragments one by one from the chunks received from the one or more data storage devices until the whole object is reconstructed (912), and completes the read process (913). For a small object, the virtual storage layer reconstructs the object from the chunks received from the one or more data storage devices (911).

The process 911 of reconstructing a small object may include several sub-processes. The virtual storage layer first checks whether the received chunk is small (921). If the virtual storage layer expects a small chunk of a small object but receives a large chunk, it generates an error (924). If the virtual storage layer determines that the received chunk is valid (922), it reconstructs the small object from the received chunk (923); otherwise, it receives a chunk from another data storage device (925).

The process 912 of reconstructing a fragment may include several sub-processes. The virtual storage layer checks whether all D chunks needed to reconstruct the fragment have been received (931). If so, the virtual storage layer reconstructs the fragment from the D chunks using the erasure coding algorithm (935). If not, the virtual storage layer further checks whether all chunks in the current band have been received (932). If all chunks in the band have been received, the virtual storage layer generates an error (936), because at least one of the received chunks may be invalid. If there are more chunks to be received, the virtual storage layer continues to receive them from another data storage device (933) and repeats the process until all D chunks for reconstructing the fragment have been received. If any of the received chunks is neither large nor huge (for example, the chunk belongs to a small object), the virtual storage layer generates an error (936).
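The fragment reconstruction step can be illustrated with the simplest possible erasure code. The description does not specify the erasure coding algorithm, so the sketch below substitutes single XOR parity (P = 1), which tolerates the loss of any one chunk; a deployment with P > 1 would need a different code such as Reed-Solomon. The function names and the convention that chunk index D denotes the parity chunk are assumptions for illustration.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data_chunks):
    """Single XOR parity chunk over the D data chunks of a fragment."""
    return reduce(xor_bytes, data_chunks)


def reconstruct_fragment(chunks, d):
    """Rebuild a fragment from any D of its D + 1 chunks.

    chunks: dict mapping chunk index -> payload; indices 0..d-1 are data
    chunks and index d is the parity chunk.
    """
    if len(chunks) < d:
        raise ValueError("fewer than D chunks received")
    if len({len(c) for c in chunks.values()}) != 1:
        raise ValueError("chunk sizes differ")
    data = dict(chunks)
    missing = [i for i in range(d) if i not in chunks]
    if missing:
        # Exactly one data chunk is absent, so the XOR of the D chunks
        # that did arrive (parity included) equals the missing chunk.
        data[missing[0]] = reduce(xor_bytes, chunks.values())
    return b"".join(data[i] for i in range(d))
```

If all D data chunks arrive, the parity chunk is not needed and the fragment is simply their concatenation.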

According to one embodiment, a data storage system includes: a plurality of data storage devices for storing a plurality of objects of a key-value pair; and a virtual storage layer that applies different data reliability schemes, including a data replication scheme and an erasure coding scheme, based on the size of an object of the plurality of objects. The plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size. The virtual storage layer classifies the first object as a small object, applies the data replication scheme, and stores the small object on one or more of the plurality of data storage devices. The virtual storage layer classifies the second object as a huge object, splits the huge object into one or more chunks of the same size, applies the erasure coding scheme, and stores the one or more chunks in a distributed manner across the plurality of data storage devices.

分佈方案可以是拆分優先分佈，其中虛擬儲存層將超大物件的一個或多個分塊拆分到多個資料儲存裝置中的每一個且將一個或多個分塊中的每一個中的又一分塊儲存在多個資料儲存裝置上。The distribution scheme may be a split-first distribution, in which a virtual storage layer splits one or more blocks of an oversized object into each of a plurality of data storage devices and divides each of the one or more blocks into A block is stored on multiple data storage devices.

分佈方案可以是帶優先分佈方案，其中虛擬儲存層將超大物件的一個分塊儲存到多個資料儲存裝置中的每一個直到超大物件的一個或多個分塊完全地儲存在一個或多個資料儲存裝置中為止。The distribution scheme may be a band-first distribution scheme, in which a virtual storage layer stores one block of an oversized object to each of a plurality of data storage devices until one or more blocks of the oversized object are completely stored in one or more data To the storage device.
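
The difference between the two distribution schemes can be illustrated by the device index assigned to each chunk. The sketch below is one possible reading, not the claimed method: it ignores parity chunks, and the function names and the even `ceil`-based division used for the split-first case are assumptions.

```python
from math import ceil
from typing import List


def band_first(num_chunks: int, num_devices: int) -> List[int]:
    """Device index per chunk when one chunk goes to each device in turn."""
    return [i % num_devices for i in range(num_chunks)]


def split_first(num_chunks: int, num_devices: int) -> List[int]:
    """Device index per chunk when consecutive chunks fill one device first."""
    per_device = ceil(num_chunks / num_devices)  # chunks placed on each device
    return [i // per_device for i in range(num_chunks)]
```

With six chunks and three devices, the band-first ordering cycles through the devices, while the split-first ordering places two consecutive chunks on each device.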

The virtual storage layer may further classify a third object having a third size as a large object, split the large object into one or more chunks of the same size, apply the erasure coding scheme, and store the one or more chunks in a distributed manner across the plurality of data storage devices, wherein the third size is larger than the first size and smaller than the second size, and wherein the large object has only one band, or only one chunk within a split.

The virtual storage layer may further classify a fourth object having a fourth size as a medium object, wherein the fourth size is larger than the first size and smaller than the third size, and wherein the virtual storage layer applies one of the data replication scheme and the erasure coding scheme.
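
The size-based classification described above can be sketched as follows. The threshold parameters (`small_max`, `medium_max`, `large_max`) and the `medium_uses_replication` switch are hypothetical; the embodiments state only the ordering of the sizes and that a medium object uses one of the two schemes.

```python
from enum import Enum


class SizeClass(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"
    HUGE = "huge"


def classify(size: int, small_max: int, medium_max: int, large_max: int) -> SizeClass:
    """Map an object size to a class using assumed, caller-supplied thresholds."""
    if size <= small_max:
        return SizeClass.SMALL
    if size <= medium_max:
        return SizeClass.MEDIUM
    if size <= large_max:
        return SizeClass.LARGE
    return SizeClass.HUGE


def reliability_scheme(size_class: SizeClass, medium_uses_replication: bool = True) -> str:
    """Small objects are replicated; large and huge objects are erasure coded."""
    if size_class is SizeClass.SMALL:
        return "replication"
    if size_class is SizeClass.MEDIUM:
        # A medium object may use either scheme; the choice here is a parameter.
        return "replication" if medium_uses_replication else "erasure_coding"
    return "erasure_coding"
```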

An object may be identified by a user key, and the virtual storage layer may create an internal key that includes the user key and a band identifier of the huge object, wherein the band identifier identifies one band among a plurality of bands, and each of the plurality of bands includes one chunk of the one or more chunks distributed across the plurality of data storage devices.
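
A minimal sketch of building such an internal key is given below. The layout is an assumption made for illustration: the band identifier is taken to occupy the last G bytes of a key of at most L bytes, leaving L-G bytes for the user key, and the big-endian encoding, default values, and function name are hypothetical.

```python
def make_internal_key(user_key: bytes, band_id: int, g: int = 2, max_len: int = 255) -> bytes:
    """Append a G-byte band identifier to the user key.

    Assumption: max_len plays the role of L, so the user key may use at most
    L - G bytes.
    """
    if len(user_key) > max_len - g:
        raise ValueError("user key longer than L - G bytes")
    # int.to_bytes raises OverflowError if band_id does not fit in g bytes.
    return user_key + band_id.to_bytes(g, "big")
```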

The virtual storage layer may use a hash value of the user key to identify a starting data storage device, among the plurality of data storage devices, for writing or reading a first chunk of the one or more chunks.
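
Selecting the starting device from a hash of the user key might look like the sketch below. The choice of SHA-256, the modulo mapping, and the wrap-around device order are assumptions; the embodiments do not name a hash function.

```python
import hashlib
from typing import List


def start_device(user_key: bytes, num_devices: int) -> int:
    """Map the user key to a starting device index (hash function is assumed)."""
    digest = hashlib.sha256(user_key).digest()
    return int.from_bytes(digest[:8], "big") % num_devices


def device_order(user_key: bytes, num_devices: int) -> List[int]:
    """Visit devices from the starting device onward, wrapping around."""
    start = start_device(user_key, num_devices)
    return [(start + i) % num_devices for i in range(num_devices)]
```

Because the same user key always hashes to the same index, a reader can locate the first chunk without any stored mapping.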

The plurality of data storage devices may include one or more dedicated parity data storage devices that store parity chunks associated with the huge object.

The plurality of data storage devices may support a group feature, and the virtual storage layer may broadcast to the plurality of data storage devices with the band identifier set to bits of arbitrary data when reading the huge object.

The plurality of data storage devices may not support a group feature, and the virtual storage layer may broadcast to the plurality of data storage devices with the band identifier set to a unique band identifier when reading the huge object.

Each of the plurality of data storage devices may be a key-value solid state drive (KV SSD).

The method may further include: receiving a write request for an object; determining that the object is a huge object; determining a chunk size and a number of chunks for the huge object; and writing the one or more chunks of the huge object to the plurality of data storage devices based on the chunk size and the number of chunks.
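
Splitting a huge object into chunks of the same size can be sketched as below. The chunk size is taken as an input, and zero-padding of the final chunk is an assumption made so that all chunks have equal length; a real system would also need to record the original length.

```python
from typing import List


def split_object(value: bytes, chunk_size: int) -> List[bytes]:
    """Split a value into equal-size chunks, zero-padding the last one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    num_chunks = max(1, -(-len(value) // chunk_size))  # ceiling division
    padded = value.ljust(num_chunks * chunk_size, b"\x00")
    return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]
```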

The method may further include: creating a segment that contains, from among the one or more chunks, one chunk for each of the plurality of data storage devices, and creating an internal key using the user key appended with a band identifier for each of the one or more chunks contained in the segment; creating a band using the erasure coding scheme, the band including one or more parity chunks and the one or more chunks corresponding to the segment; determining the plurality of data storage devices to store the band based on a distribution scheme; and writing the one or more chunks in the band with the internal key.
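
The relationship between a segment, its parity, and the resulting band can be illustrated with a single XOR parity chunk. XOR parity is a deliberately simple stand-in chosen for this sketch; it tolerates the loss of one chunk only and is not the erasure coding algorithm of the embodiments, which is left unspecified. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_band(data_chunks: Sequence[bytes]) -> List[bytes]:
    """Return the data chunks of a segment followed by one XOR parity chunk."""
    if len({len(c) for c in data_chunks}) != 1:
        raise ValueError("segment must hold at least one chunk, all of equal size")
    parity = reduce(xor_bytes, data_chunks)
    return list(data_chunks) + [parity]


def recover_chunk(band: Sequence[bytes], missing: int) -> bytes:
    """Rebuild the chunk at index `missing` from the other chunks in the band."""
    survivors = [c for i, c in enumerate(band) if i != missing]
    return reduce(xor_bytes, survivors)
```

Each chunk of the band would then be written to its device under the internal key for that band.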

The method may further include: receiving a write request for an object; determining that the object is a small object; determining a subset of the plurality of data storage devices to store one or more chunks of the small object based on a distribution scheme; and writing the one or more chunks of the small object to the subset of the plurality of data storage devices.

The method may further include: receiving a read request for an object, the read request including the user key; determining whether the plurality of data storage devices support a group feature; and broadcasting the read request with the internal key to the plurality of data storage devices.

If the group feature is supported, the band identifier of the internal key may be set to bits of arbitrary data.

If the group feature is not supported, the band identifier of the internal key may be set to a unique band identifier.
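
The two cases can be sketched as the set of internal keys used for the broadcast read. This is one possible reading under stated assumptions: with the group feature a single key with filler band-identifier bytes is broadcast, and without it one key per band identifier is broadcast. The function name, the zero filler, the G-byte suffix layout, and the big-endian encoding are all hypothetical.

```python
from typing import List


def read_request_keys(user_key: bytes, num_bands: int, group_supported: bool, g: int = 2) -> List[bytes]:
    """Build the internal key(s) to broadcast when reading an object."""
    if group_supported:
        # Band-identifier bits are arbitrary here; zeros are used as filler.
        return [user_key + bytes(g)]
    # Without the group feature, each band is addressed by its own identifier.
    return [user_key + band_id.to_bytes(g, "big") for band_id in range(num_bands)]
```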

The method may further include: receiving at least one chunk from each of the plurality of data storage devices; and reconstructing a segment, using the erasure coding scheme, from the at least one chunk received from the plurality of data storage devices.

The above example embodiments have been described to illustrate various embodiments of a data storage system capable of efficiently storing objects of different sizes and of a method for storing those objects in the data storage system. Various modifications to and departures from the disclosed example embodiments will occur to those of ordinary skill in the art. The subject matter that is intended to be within the scope of the invention is set forth in the appended claims.

110, 310, 410, 510a, 510b, 610a, 610b, 710a, 710b‧‧‧objects

120, 121, 321, 421‧‧‧virtual storage

150, 350‧‧‧band

151‧‧‧split

200‧‧‧internal key

201, 301, 401‧‧‧user key

202, 302, 402‧‧‧band identifier

801, 802, 803, 804, 811, 812, 813, 814, 815, 820, 821, 822, 823, 824, 825, 826, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 921, 922, 923, 924, 925, 931, 932, 933, 934, 935, 936‧‧‧steps

G‧‧‧number of bytes

L, L-G‧‧‧maximum key length

SSD1 to SSD6‧‧‧data storage devices

The accompanying drawings, which are included as part of this specification, illustrate preferred embodiments of the present invention and, together with the general description given above and the detailed description of the preferred embodiments given below, serve to explain and teach the principles described herein.

FIG. 1 is a schematic diagram of objects stored in an example data storage system according to one embodiment.

FIG. 2 illustrates an example internal key including a user key according to one embodiment.

FIG. 3 illustrates an example of object retrieval using a group feature according to one embodiment.

FIG. 4 illustrates an example of object retrieval without a group feature according to one embodiment.

FIG. 5 illustrates an example of erasure coding without a dedicated parity device according to one embodiment.

FIG. 6 illustrates an example of erasure coding with one or more dedicated parity devices according to one embodiment.

FIG. 7 illustrates an example replication scheme for small objects across one or more data storage devices without a parity device according to one embodiment.

FIG. 8 illustrates an example flowchart for writing an object according to one embodiment.

FIG. 9 illustrates an example flowchart for reading an object according to one embodiment.

The drawings are not necessarily drawn to scale, and elements of similar structure or function are generally denoted by the same reference numerals throughout the drawings for illustrative purposes. The drawings are intended only to facilitate the description of the various embodiments described herein. The drawings do not describe every aspect of the teachings disclosed herein and do not limit the scope of the claims.

Claims

A data storage system comprising: a plurality of data storage devices for storing a plurality of objects of a key-value pair; and a virtual storage layer that applies different data reliability schemes, including a data replication scheme and an erasure coding scheme, based on a size of an object of the plurality of objects, wherein the plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size, wherein the virtual storage layer classifies the first object as a small object, applies the data replication scheme, and stores the small object across one or more of the plurality of data storage devices, and wherein the virtual storage layer classifies the second object as a huge object, splits the huge object into one or more chunks of a same size, applies the erasure coding scheme, and stores the one or more chunks in a distributed manner across the plurality of data storage devices.

The data storage system according to claim 1, wherein the virtual storage layer stores the one or more chunks across the plurality of data storage devices in a split-first distribution scheme, in which the virtual storage layer stores some of the one or more chunks of the huge object to one of the plurality of data storage devices and stores others of the one or more chunks to another one of the plurality of data storage devices.

The data storage system according to claim 1, wherein the virtual storage layer stores the one or more chunks across the plurality of data storage devices in a band-first distribution scheme, in which the virtual storage layer stores one chunk of the huge object to each of the plurality of data storage devices until the one or more chunks of the huge object are completely stored in the plurality of data storage devices.

The data storage system according to claim 1, wherein the virtual storage layer further classifies a third object having a third size as a large object, splits the large object into one or more chunks of the same size, applies the erasure coding scheme, and stores the one or more chunks in a distributed manner across the plurality of data storage devices, wherein the third size is larger than the first size and smaller than the second size, and wherein the large object has only one band, or only one chunk within a split.

The data storage system according to claim 4, wherein the virtual storage layer further classifies a fourth object having a fourth size as a medium object, wherein the fourth size is larger than the first size and smaller than the third size, and wherein the virtual storage layer applies one of the data replication scheme and the erasure coding scheme.

The data storage system according to claim 1, wherein the object is identified by a user key, and the virtual storage layer creates an internal key including the user key and a band identifier of the huge object, wherein the band identifier identifies one band among a plurality of bands, and each of the plurality of bands includes one chunk of the one or more chunks distributed across the plurality of data storage devices.

The data storage system according to claim 6, wherein the virtual storage layer uses a hash value of the user key to identify a starting data storage device, among the plurality of data storage devices, for writing or reading a first chunk of the one or more chunks.

The data storage system according to claim 1, wherein the plurality of data storage devices includes one or more dedicated parity data storage devices that store parity chunks associated with the huge object.

The data storage system according to claim 6, wherein the plurality of data storage devices support a group feature, and wherein, when reading the huge object, the virtual storage layer broadcasts to the plurality of data storage devices with the band identifier set to bits of arbitrary data.

The data storage system according to claim 6, wherein the plurality of data storage devices do not support a group feature, and wherein, when reading the huge object, the virtual storage layer broadcasts to the plurality of data storage devices with the band identifier set to a unique band identifier.

The data storage system according to claim 1, wherein each of the plurality of data storage devices is a key-value solid state drive.

A method for writing an object of a key-value pair, the method comprising: receiving a plurality of objects of a key-value pair, wherein the plurality of objects includes a first object having a first size and a second object having a second size that is larger than the first size; classifying the first object as a small object; applying a data replication scheme to the small object; storing the small object across one or more of a plurality of data storage devices; classifying the second object as a huge object; splitting the huge object into one or more chunks of a same size; applying an erasure coding scheme to the huge object; and storing the one or more chunks in a distributed manner across the plurality of data storage devices.

The method according to claim 12, further comprising: receiving a write request for the object; determining that the object is the huge object; determining a chunk size and a number of chunks for the huge object; and writing the one or more chunks of the huge object to the plurality of data storage devices based on the chunk size and the number of chunks.

The method according to claim 13, further comprising: creating a segment that contains, from among the one or more chunks, one chunk for each of the plurality of data storage devices, and creating an internal key using a user key appended with a band identifier for each of the one or more chunks contained in the segment; creating a band using the erasure coding scheme, the band including one or more parity chunks and the one or more chunks corresponding to the segment; determining the plurality of data storage devices to store the band based on a distribution scheme; and writing the one or more chunks in the band with the internal key.

The method according to claim 12, further comprising: receiving a write request for the object; determining that the object is the small object; determining a subset of the plurality of data storage devices to store one or more chunks of the small object based on a distribution scheme; and writing the one or more chunks of the small object to the subset of the plurality of data storage devices.

The method according to claim 14, further comprising: receiving a read request for the object, the read request including the user key; determining whether the plurality of data storage devices support a group feature; and broadcasting the read request with the internal key to the plurality of data storage devices.

The method according to claim 16, wherein, if the group feature is supported, the band identifier of the internal key is set to bits of arbitrary data.

The method according to claim 16, wherein, if the group feature is not supported, the band identifier of the internal key is set to a unique band identifier.

The method according to claim 16, further comprising: receiving at least one chunk from each of the plurality of data storage devices; and reconstructing a segment, using the erasure coding scheme, from the at least one chunk received from the plurality of data storage devices.

The method according to claim 12, wherein each of the plurality of data storage devices is a key-value solid state drive.