WO2022121274A1 - Metadata management method and apparatus in a storage system, and storage system - Google Patents

Metadata management method and apparatus in a storage system, and storage system

Info

Publication number
WO2022121274A1
WO2022121274A1 PCT/CN2021/100904 CN2021100904W WO2022121274A1 WO 2022121274 A1 WO2022121274 A1 WO 2022121274A1 CN 2021100904 W CN2021100904 W CN 2021100904W WO 2022121274 A1 WO2022121274 A1 WO 2022121274A1
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
metadata file
storage
storage layer
file
Prior art date
Application number
PCT/CN2021/100904
Other languages
English (en)
French (fr)
Inventor
高蒙
潘浩
王晨
任仁
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022121274A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/1805 - Append-only file systems, e.g. using logs or journals to store data
    • G06F 16/1815 - Journaling file systems
    • G06F 16/185 - Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2246 - Trees, e.g. B+trees

Definitions

  • the present application relates to the field of storage, and in particular, to a metadata management method and apparatus in a storage system, and to a storage system.
  • the log structured merge tree (LSM tree) is one of the common storage structures in log-based database systems.
  • the LSM tree is a multi-layer structure.
  • the top layer C0 is located in memory, and the following layers C1 to Ck are located on the hard disk.
  • the newly written metadata is first stored in the file in the top layer C0. After the amount of metadata in the top layer C0 reaches a certain level, the file becomes an unmodifiable write file, and the file in the top layer C0 is merged with the unmodifiable write file in the next layer C1 to obtain a new unmodifiable write file, which is then written sequentially to the C2 layer on the hard disk, and so on, so that old metadata is continuously deleted and new metadata is continuously written to disk. Since the merging of each layer is performed asynchronously, the writing of new metadata does not affect the writing speed of the LSM tree.
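  • The following is a minimal, hypothetical Python sketch (not from the patent) of the layered-merge idea described above: two sorted runs of key/version/value entries are merged into a new sorted run, with the newer version of a duplicated key winning, mimicking how an upper LSM layer is compacted into the layer below.

```python
# Hypothetical illustration of an LSM-style layer merge (not the patent's implementation).
# Each "run" is a list of (key, version, value) tuples sorted by key.
import heapq

def merge_runs(upper, lower):
    """Merge two sorted runs; for duplicate keys the higher version wins."""
    merged = {}
    for key, version, value in heapq.merge(lower, upper, key=lambda e: e[0]):
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return [(k, v[0], v[1]) for k, v in sorted(merged.items())]

c0 = [("x", 2, "meta_x_v2"), ("z", 1, "meta_z_v1")]   # newer, smaller layer
c1 = [("x", 1, "meta_x_v1"), ("y", 1, "meta_y_v1")]   # older, larger layer
print(merge_runs(c0, c1))
# [('x', 2, 'meta_x_v2'), ('y', 1, 'meta_y_v1'), ('z', 1, 'meta_z_v1')]
```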
  • the present application provides a metadata management method and apparatus in a storage system, and a storage system, which can solve the problems that the merge task of a storage system based on a log structure tree takes a long time and consumes a large amount of resources.
  • in a first aspect, a method for managing metadata in a storage system is provided. The method is executed by one or more processors in the storage system and includes the following steps: storing a first metadata file that cannot be modified and written in a first storage layer of the storage system, and storing a second metadata file that can be modified and written in a second storage layer; wherein the first storage layer manages the first metadata file based on a log structure tree, and the second storage layer manages the second metadata file based on a non-log structure tree.
  • the non-log structure tree may be a B-tree, a B+ tree, a trie (dictionary tree), a skip list, etc., which is not specifically limited in this application.
  • when the method described in the first aspect is implemented, the first metadata file that cannot be modified and written in the first storage layer is managed based on the log structure tree, and the second metadata file that can be modified and written in the second storage layer is managed based on the non-log structure tree, so that in a metadata write scenario the storage system can write the metadata of sequential write requests directly into the second storage layer, eliminating the system overhead caused by merging between storage layers. After the metadata of random write requests is written into the first storage layer, the second storage layer is modified in batches according to the data in the first storage layer, exchanging the smaller transaction overhead for the larger merge overhead, so as to solve the problem that the storage system merge task takes a long time and consumes many resources in a large-capacity scenario.
  • the first storage layer may include multiple metadata files.
  • the first storage layer also stores a third metadata file that cannot be modified and written, and the first storage layer manages the third metadata file based on the log structure tree. The method may further include the following steps: merging the first metadata file and the third metadata file to obtain a fourth metadata file, and writing the metadata stored in the fourth metadata file into the second metadata file.
  • the first storage layer may manage multiple metadata files in a hierarchical management manner.
  • the first storage layer may include a first storage sublayer and a second storage sublayer.
  • the first storage sublayer is used to store the first metadata file, the second storage sublayer is used to store the third metadata file, the multiple storage sublayers are merged in a top-to-bottom manner, and the data in the bottom storage sublayer is then written into the second metadata file. That is to say, the first metadata file of the first storage sublayer is first merged with the third metadata file of the second storage sublayer to obtain a fourth metadata file, and then the metadata stored in the fourth metadata file is written into the second metadata file.
  • further hierarchical management of the first storage layer may not be performed. That is, the multiple metadata files (ordered string table) in the first storage layer are merged according to the capacity of the first storage layer, and then written to the second storage layer.
  • when the first storage layer contains a plurality of metadata files, the metadata files can be merged, and the merged metadata file is then written into the second storage layer, which can reduce the number of times the second metadata file in the second storage layer is modified, thereby reducing the occurrence of the write amplification problem of the non-log-structured tree in the second storage layer.
  • the problem of write amplification refers to the fact that, when a non-log-structured tree is modified, each modification needs to traverse the entire non-log-structured tree to determine the leaf nodes, intermediate nodes, and root nodes that need to be modified; if the amount of data modified each time is small and modifications are frequent, resources are wasted.
  • the multiple metadata files in the first storage layer are merged and then written to the second storage layer, which can reduce the occurrence of the write amplification problem and avoid resource waste.
  • the fourth metadata file and the second metadata file may be sorted to determine the modification nodes of the second metadata file; the modification nodes are then modified, so that the metadata stored in the fourth metadata file is written into the second metadata file.
  • the above sorting may be a merge sort, for example, sorting different versions of the same metadata, combining the fourth metadata file and the second metadata file according to the sorting result to obtain a combined file, and then determining the modification nodes and the modification content according to the combined file.
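  • The following is a hypothetical Python sketch (names are illustrative, not the patent's implementation) of this step: the entries of the fourth metadata file are merge-sorted against those of the second metadata file, the newest version of each key is kept, and the keys that actually require a modification are recorded.

```python
# Hypothetical sketch: merge the fourth metadata file into the second metadata file
# and record which entries must be modified (the "modification content").
def plan_modifications(fourth, second):
    """fourth/second: dicts mapping key -> (version, value). Returns combined view and keys to modify."""
    combined = dict(second)
    to_modify = {}
    for key, (version, value) in sorted(fourth.items()):
        old = combined.get(key)
        if old is None or version > old[0]:        # new key or newer version
            combined[key] = (version, value)
            to_modify[key] = (version, value)      # this entry must be written
    return combined, to_modify

fourth = {"x": (2, "addr=0x200,len=4KB"), "w": (1, "addr=0x400,len=8KB")}
second = {"x": (1, "addr=0x100,len=4KB"), "y": (1, "addr=0x300,len=2KB")}
_, mods = plan_modifications(fourth, second)
print(mods)   # {'w': (1, ...), 'x': (2, ...)} -> the nodes holding w and x are the modification nodes
```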
  • the above implementation writes the metadata stored in the fourth metadata file into the second metadata file by modifying only part of the nodes in the second metadata file, whereas a storage system based on the LSM tree needs to merge the fourth metadata file and the second metadata file to obtain a new file and write the new file into the second storage layer in its entirety in order to write the metadata stored in the fourth metadata file into the second metadata file. Compared with the latter, the above implementation can eliminate the system overhead caused by the merging of storage layers and improve the efficiency of metadata persistence.
  • when the modification nodes are modified, the modification nodes may be modified in batches in the manner of transaction writing. For example, if it is determined according to the merged file that multiple modification nodes need to be modified, the multiple modification operations can be added to one transaction and performed in units of transactions, which ensures the consistency of the multiple modification operations and thereby completes the modification of the second metadata file.
  • the above implementation persists the metadata by modifying the second metadata file, exchanging the small overhead required for transaction writing for the huge overhead caused by storage layer merging, which can solve the problem that the storage system merge task takes a long time and consumes a large amount of resources.
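  • A minimal, hypothetical sketch of this transaction-batched node modification is shown below; the Transaction class, the tree representation and the rollback behaviour are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch of batching node modifications into one transaction.
class Transaction:
    def __init__(self, tree):
        self.tree, self.ops = tree, []
    def add(self, node_id, new_entries):
        self.ops.append((node_id, new_entries))
    def commit(self):
        # apply all node modifications as one unit: either all succeed or none do
        snapshot = {nid: dict(self.tree[nid]) for nid, _ in self.ops}
        try:
            for node_id, new_entries in self.ops:
                self.tree[node_id].update(new_entries)
        except Exception:
            self.tree.update(snapshot)   # roll back on failure
            raise

tree = {"leaf_1": {"x": "v1"}, "leaf_2": {"y": "v1"}}
txn = Transaction(tree)
txn.add("leaf_1", {"x": "v2"})           # modify an existing leaf
txn.add("leaf_2", {"w": "v1"})           # insert into another leaf
txn.commit()                             # both changes land together
```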
  • the storage system may include one or more storage devices for providing the first storage layer and the second storage layer, and the storage device may be a hard disk drive (HDD), a solid-state drive (SSD), a hybrid hard disk (solid state hybrid drive, SSHD), a storage class memory (SCM), etc., or any combination of the above storage media, which is not specifically limited in this application.
  • when the second metadata file is modified according to the fourth metadata file, if the storage device is an SCM, since the SCM is a storage medium whose read and write performance is close to that of memory, the second metadata file can be modified directly in a write-in-place manner to realize persistent storage of the metadata.
  • if the storage device is an SSD, garbage collection (GC) needs to be performed due to the inherent characteristics of the SSD; in order to reduce the write amplification caused by GC, the SSD can persist the metadata in a redirect-on-write (ROW) manner.
  • if the storage device is an HDD, because the random write performance of the HDD is poor, the HDD can convert multiple small random write requests into large sequential write requests in a ROW manner to realize persistent storage of the metadata.
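  • The sketch below is a toy, hypothetical illustration of choosing a persistence strategy per medium (write-in-place for SCM, redirect-on-write for SSD/HDD); the Device class and its method names are invented for the example and do not correspond to any real device API or to the patent's implementation.

```python
# Hypothetical sketch of media-specific persistence: SCM modifies in place,
# SSD/HDD append to a new location and remap (redirect-on-write).
class Device:
    def __init__(self):
        self.blocks, self.mapping, self.next_loc = {}, {}, 0
    def write_in_place(self, node_id, data):               # SCM: modify the node directly
        self.blocks[self.mapping.setdefault(node_id, node_id)] = data
    def append(self, data):                                 # SSD/HDD ROW: write to a fresh location
        loc = self.next_loc; self.next_loc += 1
        self.blocks[loc] = data
        return loc
    def remap(self, node_id, loc):                          # point the node at the new location
        self.mapping[node_id] = loc

def persist(medium, node_id, data, device):
    if medium == "SCM":
        device.write_in_place(node_id, data)
    else:                                                   # SSD or HDD: redirect-on-write
        device.remap(node_id, device.append(data))

dev = Device()
persist("SCM", "leaf_1", b"metadata v2", dev)
persist("SSD", "leaf_2", b"metadata v1", dev)
```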
  • the method may further include the following steps: receiving a sequential write request, where the sequential write request carries metadata; and writing the metadata carried in the sequential write request into the second metadata file.
  • when receiving a sequential write request, the storage system can also first determine whether a historical version of the metadata carried by the sequential write request exists in the memory and the first storage layer; if not, the metadata is written directly into the second metadata file, otherwise the metadata is written into the memory, so as to avoid the situation in which the metadata version in the memory is not the latest version when the metadata is queried, and improve the reliability of metadata reading.
  • in this way, the metadata in the sequential write request is written directly into the second metadata file, thereby eliminating the system overhead caused by merging of storage layers and improving the efficiency of metadata persistence.
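  • A minimal, hypothetical sketch of this sequential-write fast path follows; the container types and function name are assumptions made for illustration.

```python
# Hypothetical sketch of the sequential-write fast path: if no historical version
# exists in memory or the first storage layer, persist directly into the second
# metadata file; otherwise keep the newest version visible in memory first.
def handle_sequential_write(key, value, memtable, first_layer, second_file):
    has_history = key in memtable or any(key in sst for sst in first_layer)
    if not has_history:
        second_file[key] = value      # no older version anywhere: persist directly
    else:
        memtable[key] = value         # newest version must stay visible in memory

memtable, first_layer, second_file = {}, [{"a": 1}], {}
handle_sequential_write("b", "meta_b", memtable, first_layer, second_file)     # goes to second_file
handle_sequential_write("a", "meta_a_v2", memtable, first_layer, second_file)  # goes to memtable
```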
  • the method may further include the following steps: receiving a random write request, the random write request carrying metadata, and writing the metadata carried in the random write request into the memory.
  • the metadata carried in the random write request can be written into the memory table in the memory.
  • after the amount of data in the memory table reaches a threshold, the memory table is switched to an unmodifiable memory table, and the metadata is written to the first storage layer along with the unmodifiable memory table.
  • the metadata can be written to the second storage layer along with the merging of the metadata files in the first storage layer.
  • specifically, the second metadata file in the second storage layer can be modified, for example by modifying some leaf nodes, root nodes and intermediate nodes corresponding to the metadata; the modification operations are added to a transaction, and the modification nodes in the second metadata file are modified in batches by means of transaction writing, so as to achieve the purpose of persistently storing the metadata.
  • in this way, the second storage layer is modified in batches according to the data in the first storage layer, and the transaction overhead is exchanged for the merge overhead, so as to solve the problem that a storage system based on the log structure tree occupies a large amount of storage space and consumes many system resources in a large-capacity scenario.
  • when querying metadata, the storage system may query the memory, the first storage layer, and the second storage layer in sequence until the metadata is read. If the first storage layer includes multiple storage sub-layers, when querying the first storage layer the storage system queries the sub-layers in the order of merging. For example, if the first storage layer includes an L0 layer and an L1 layer, and the metadata file of the L0 layer is merged with the metadata file in the L1 layer when it reaches its upper limit, then when reading metadata, L0 is queried first and then L1, and so on, which is not described further here.
  • in this way, the memory, the first storage layer and the second storage layer are queried in sequence until the metadata is read, and the final second storage layer contains only one non-log structure tree.
  • compared with an LSM tree composed of multiple storage layers, the query efficiency of the above storage system during metadata query can be improved, and the user experience can be improved.
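  • The following is a minimal, hypothetical sketch of this lookup order: memory first, then the sub-layers of the first storage layer in merge order (L0 before L1), then the second storage layer; the function and variable names are illustrative assumptions.

```python
# Hypothetical sketch of the layered lookup order described above.
def get_metadata(key, memtable, immutable_memtable, l0, l1, second_file):
    for layer in (memtable, immutable_memtable, l0, l1, second_file):
        if key in layer:
            return layer[key]     # the first hit is the newest visible version
    return None                   # not found anywhere

layers = ({"a": "v3"}, {}, {"a": "v2", "b": "v1"}, {}, {"a": "v1", "c": "v1"})
print(get_metadata("a", *layers))   # 'v3' -- the newest version wins
print(get_metadata("c", *layers))   # 'v1' -- found in the second storage layer
```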
  • in a second aspect, a storage system is provided, including one or more processors and a storage device communicating with the one or more processors, where the storage device is used to provide a first storage layer and a second storage layer, and the one or more processors are used to: store a first metadata file that cannot be modified and written in the first storage layer of the storage system, and store a second metadata file that can be modified and written in the second storage layer, where the first storage layer manages the first metadata file based on a log structure tree and the second storage layer manages the second metadata file based on a non-log structure tree.
  • the storage device may be any one of HDD, SSD, SCM, etc., and may also be a combination of the above-mentioned various storage media, which is not specifically limited in this application.
  • when the storage system described in the second aspect is implemented, the first metadata file that cannot be modified and written in the first storage layer is managed based on the log structure tree, and the second metadata file that can be modified and written in the second storage layer is managed based on the non-log structure tree, so that the metadata of sequential write requests can be written directly into the second storage layer, and the second storage layer is modified in batches according to the data in the first storage layer, exchanging the transaction overhead for the merge overhead, so as to solve the problem that the storage system merge task takes a long time and consumes many resources in a large-capacity scenario.
  • the first storage layer also stores a third metadata file that cannot be modified and written, the first storage layer manages the third metadata file based on the log structure tree, and the one or more processors are further used to: merge the first metadata file and the third metadata file to obtain a fourth metadata file, and write the metadata stored in the fourth metadata file into the second metadata file.
  • the first storage layer may manage multiple metadata files in a hierarchical management manner.
  • the first storage layer may include a first storage sublayer and a second storage sublayer.
  • the first storage sublayer is used to store the first metadata file, the second storage sublayer is used to store the third metadata file, the multiple storage sublayers are merged in a top-to-bottom manner, and the data in the bottom storage sublayer is then written into the second metadata file. That is to say, the first metadata file of the first storage sublayer is first merged with the third metadata file of the second storage sublayer to obtain a fourth metadata file, and then the metadata stored in the fourth metadata file is written into the second metadata file.
  • further hierarchical management of the first storage layer may not be performed. That is, the multiple metadata files (ordered string table) in the first storage layer are merged according to the capacity of the first storage layer, and then written to the second storage layer.
  • the fourth metadata file and the second metadata file can be sorted to determine the modification nodes of the second metadata file, and the modification nodes are then modified to write the metadata stored in the fourth metadata file into the second metadata file.
  • the above sorting may be a merge sort, for example, sorting different versions of the same metadata, combining the fourth metadata file and the second metadata file according to the sorting result to obtain a combined file, and then determining the modification nodes and the modification content according to the combined file.
  • the modification nodes may be modified in batches in the manner of transaction writing. For example, if it is determined according to the merged file that multiple modification nodes need to be modified, the multiple modification operations can be added to one transaction and performed in units of transactions, which ensures the consistency of the multiple modification operations and thereby completes the modification of the second metadata file.
  • if the storage device is an SCM, since the SCM is a storage medium whose read and write performance is close to that of memory, the second metadata file can be modified directly in a write-in-place manner to realize persistent storage of the metadata.
  • if the storage device is an SSD, GC is required due to the inherent characteristics of the SSD; in order to reduce the write amplification caused by GC, the SSD can persist the metadata in a ROW manner.
  • if the storage device is an HDD, because the random write performance of the HDD is poor, the HDD can convert multiple small random write requests into large sequential write requests in a ROW manner to realize persistent storage of the metadata.
  • the above one or more processors are configured to receive a sequential write request, where the sequential write request carries metadata, and write the metadata carried in the sequential write request into the second metadata file.
  • when receiving a sequential write request, the storage system can also first determine whether a historical version of the metadata carried by the sequential write request exists in the memory and the first storage layer; if not, the metadata is written directly into the second metadata file, otherwise the metadata is written into the memory, so as to avoid the situation in which the metadata version in the memory is not the latest version when the metadata is queried, and improve the reliability of metadata reading.
  • the above-mentioned one or more processors are configured to receive a random write request, the random write request carries metadata, and write the metadata carried in the random write request into the memory.
  • the metadata carried in the random write request can be written into the memory table in the memory.
  • after the amount of data in the memory table reaches a threshold, the memory table is switched to an unmodifiable memory table, and the metadata is written to the first storage layer along with the unmodifiable memory table.
  • the metadata can be written to the second storage layer along with the merging of the metadata files in the first storage layer.
  • specifically, the second metadata file in the second storage layer can be modified, for example by modifying some leaf nodes, root nodes and intermediate nodes corresponding to the metadata; the modification operations are added to a transaction, and the modification nodes in the second metadata file are modified in batches by means of transaction writing, so as to achieve the purpose of persistently storing the metadata.
  • the above-mentioned one or more processors are configured to receive a query request for metadata, and then according to the query request, the memory, the first storage layer and the second storage layer can be sequentially queried until the metadata is read.
  • if the first storage layer includes multiple storage sub-layers, the storage system queries the first storage layer in the order of merging. For example, if the first storage layer includes an L0 layer and an L1 layer, and the metadata file of the L0 layer is merged with the metadata file in the L1 layer when it reaches its upper limit, then when reading metadata, L0 is queried first and then L1, and so on.
  • in a third aspect, a metadata management apparatus in a storage system is provided. The apparatus may include: a first storage unit configured to store a first metadata file that cannot be modified and written in a first storage layer of the storage system, and a second storage unit configured to store a second metadata file that can be modified and written in a second storage layer of the storage system, where the first storage layer manages the first metadata file based on the log structure tree and the second storage layer manages the second metadata file based on the non-log structure tree.
  • implementing the metadata management apparatus described in the third aspect, the first metadata file that cannot be modified and written in the first storage layer is managed based on the log-structured tree, and the second metadata file that can be modified and written in the second storage layer is managed based on the non-log-structured tree, so that in a metadata write scenario the storage system can write the metadata of sequential write requests directly into the second storage layer, eliminating the system overhead caused by the merging of storage layers.
  • after the metadata of random write requests is written into the first storage layer, the second storage layer is modified in batches according to the data in the first storage layer, and the transaction overhead is exchanged for the merge overhead, so as to solve the problem that the storage system merge task takes a long time and consumes a large amount of resources in a large-capacity scenario.
  • the first storage layer also stores a third metadata file that cannot be modified and written; the first storage layer manages the third metadata file based on the log structure tree; and the metadata management apparatus further includes: a merging unit configured to merge the first metadata file and the third metadata file to obtain a fourth metadata file, and a first writing unit configured to write the metadata stored in the fourth metadata file into the second metadata file.
  • the metadata management apparatus further includes: a receiving unit configured to receive a sequential write request, where the sequential write request carries metadata, and a second writing unit configured to write the metadata carried in the sequential write request into the second metadata file.
  • the second storage layer manages the second metadata file based on the B-tree.
  • the second storage layer manages the second metadata file based on the B+ tree.
  • the first writing unit is specifically configured to: sort the fourth metadata file and the second metadata file, determine the modification nodes of the second metadata file, and write the metadata stored in the fourth metadata file into the second metadata file by modifying the modification nodes.
  • the first writing unit is specifically configured to: modify the modification nodes in batches in the manner of transaction writing.
  • a storage array is provided, including a storage controller and at least one memory, where the at least one memory is used to provide a first storage layer and a second storage layer and is also used to store program code, and the storage controller can execute the program code to implement the method described in the first aspect.
  • a computer program product comprising a computer program that, when read and executed by a computing device, implements the method described in the first aspect.
  • a computer-readable storage medium comprising instructions that, when executed on a computing device, cause the computing device to implement the method as described in the first aspect.
  • FIG. 1 is a schematic diagram of the architecture of a storage system based on a log structure tree
  • FIG. 2 is a schematic diagram of the hardware structure of a storage system provided by the present application.
  • FIG. 3 is a schematic structural diagram of a structure tree area of a storage system provided by the present application.
  • FIG. 4 is a schematic flowchart of steps of a metadata management method in a storage system provided by the present application.
  • FIG. 5 is a schematic flowchart of a write request processing method provided by the present application.
  • FIG. 6 is a schematic structural diagram of a metadata management device provided by the present application.
  • FIG. 7 is a schematic diagram of a hardware structure of a storage array provided by the present application.
  • a log database is a kind of electronic filing cabinet, that is, a collection of large amounts of data that is organized, shared, and managed in a unified manner. Users can perform operations such as adding, querying, updating, and deleting files in the log database. After the database receives a user's operation request, a modification log is generated and stored in the log disk, and then the memory page where the file is stored in memory is modified. When the data in the memory page reaches a certain threshold, for example when the memory page is full, the data in the memory page is flushed to the hard disk for persistence. Therefore, the log database can avoid generating a large number of random write requests that need to be written to the hard disk every time data is modified, thereby reducing the number of hard disk write operations.
  • the method of combining data and writing to the hard disk converts multiple random write requests into sequential write requests to improve storage efficiency.
  • in addition, the files in the memory can be recovered according to the logs in the log disk, thereby improving the reliability of the database.
  • the read performance of a simple log database is poor.
  • the log-structured tree for managing the log-type database came into being.
  • the LSM tree, as a log-structured tree, can convert low-efficiency random write operations into high-efficiency sequential write operations; while improving write performance, it minimizes the impact on read performance, and it is therefore widely used in log databases.
  • FIG. 1 is a schematic structural diagram of a storage system based on a log structure tree.
  • the storage system 100 may include a memory 110 and a hard disk 120 .
  • the memory 110 includes a memory table (memtable) 111 and an immutable memtable (immutable memtable) 112, and the hard disk 120 includes an ordered string table (sorted string table, SStable) 123, a log storage area 121 and a data storage area 122.
  • the memory table 111 in the memory 110 can be used to store the metadata of the data.
  • the metadata refers to the data describing the data.
  • the metadata may include description information such as the storage address, length, and data type of the data.
  • the metadata in the memory table 111 can be modified, and the memory table 111 stores the metadata in order in the form of key-value pairs.
  • a key-value pair (key, value) is a string, where K represents the key, the key can be the name or identifier of the data, V represents the value corresponding to the key, and the value can be defined according to business requirements, for example the location information of the data; each key corresponds to at least one value. For example, K may be the name of data x, and V may be the metadata of data x.
  • the unmodifiable memory table 112 in the memory 110 is used to store unmodifiable metadata. After the metadata in the memory table 111 reaches a threshold, the memory table 111 can be switched to the unmodifiable memory table 112, and a new memory table 111 is then created; the unmodifiable memory table 112, which stores the most recently updated metadata, is flushed into the hard disk 120 in batches for persistence.
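  • The following is a toy, hypothetical Python sketch of this memory-table behaviour (the class, threshold and field names are assumptions for illustration): metadata is kept as key-value pairs, and once the table reaches a threshold it is frozen into an immutable, key-ordered snapshot ready to be flushed in batch.

```python
# Hypothetical sketch of a memory table that switches to an unmodifiable snapshot.
class MemTable:
    def __init__(self, threshold=4):
        self.entries, self.threshold = {}, threshold
    def put(self, key, value):
        self.entries[key] = value
        return len(self.entries) >= self.threshold   # True -> caller should switch tables
    def freeze(self):
        # return the entries as an immutable, key-ordered snapshot (the "unmodifiable memtable")
        return tuple(sorted(self.entries.items()))

memtable = MemTable(threshold=2)
memtable.put("data_x", "addr=0x100,len=4KB")
if memtable.put("data_y", "addr=0x200,len=2KB"):       # threshold reached
    immutable = memtable.freeze()                       # to be flushed to the hard disk in batch
    memtable = MemTable(threshold=2)                    # create a new memory table
```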
  • the log storage area 121 in the hard disk 120 is used for storing logs. It should be understood that, since the memory 110 is not reliable storage, data in the memory 110 will be lost if the power is turned off, so the reliability can be ensured through write-ahead logging (WAL).
  • WAL is a standard way to implement transaction logging. In the scenario of storing metadata based on the LSM tree, the core idea is that a modification of metadata must and can only occur after the modification has been logged. In other words, when metadata is modified, the modification operation is first recorded in the log file, and then the data is modified. Using WAL for data storage avoids flushing the metadata into the hard disk every time the metadata is modified, reducing the number of hard disk writes, and if the memory fails before the metadata has been flushed into the hard disk, the metadata can be recovered according to the WAL.
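  • A minimal, hypothetical sketch of write-ahead logging follows; the function names, the JSON record format and the file handling are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of WAL: the modification is appended to a log before the
# in-memory table is changed, and the table can be rebuilt by replaying the log.
import json

def wal_put(key, value, log_path, memtable):
    with open(log_path, "a") as log:
        log.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
        log.flush()                      # log record is durable before the change
    memtable[key] = value                # only now modify the in-memory metadata

def wal_recover(log_path):
    memtable = {}
    with open(log_path) as log:
        for line in log:                 # replay every logged modification in order
            record = json.loads(line)
            if record["op"] == "put":
                memtable[record["key"]] = record["value"]
    return memtable
```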
  • the data storage area 122 in the hard disk 120 is used for storing data, that is, an area where data is persistently stored.
  • the ordered string table 123 in the hard disk 120 manages the metadata based on a log structure tree.
  • the log structure tree can be an LSM tree.
  • the organizational structure of the LSM tree includes multiple storage layers (for example, the C0 layer, the C1 layer, the C2 layer and the C3 layer), where the storage space of an upper storage layer is smaller and the storage space of a lower storage layer is larger. When the unmodifiable memory table 112 is flushed into the hard disk, the top layer of the ordered string table 123 is written first.
  • after the ordered string tables in the top layer C0 are merged, the result is stored in the C1 layer; after the ordered string tables in the C1 layer are merged, an ordered string table is obtained and written to the C2 layer, and so on, so that old data can be continuously deleted and new data can be continuously written to the hard disk.
  • memory tables, non-modifiable memory tables, and ordered string tables are also referred to as metadata files.
  • when the storage system 100 shown in FIG. 1 receives a write request for data X, the system 100 can first write the data X of the current write request into the data storage area 122, then write the metadata information Y, such as the address of data X in the data storage area 122, into the log, store the log in the log storage area 121, and write the metadata Y of the data X into the memory table 111.
  • when the data X is subsequently modified, the storage system 100 can append the modified data X to the data storage area 122, append the metadata corresponding to the modified data X to the log, and modify the metadata in the memory table 111.
  • if multiple write requests are received, the storage system 100 can first merge the received data and then append-write it, which can reduce the number of disk writes during data modification and improve storage efficiency.
  • when the amount of data stored in the memory table 111 reaches the threshold, the memory table 111 is switched to the unmodifiable memory table 112, a new memory table 111 is created, and the unmodifiable memory table 112 is written in batches to the C0 layer of the hard disk 120.
  • the storage space of the C0 layer is 300MB
  • the storage space of the C1 layer is 3GB
  • the storage space of the C2 layer is 30GB
  • the storage space of the C3 layer is 100GB.
  • the unmodifiable memory tables in the C0 layer are merged to obtain an ordered string table, and the ordered string table is written into the C1 layer.
  • the ordered string tables in the C1 layer are merged to obtain a new ordered string table, and the new ordered string table is written into the C2 layer.
  • since the merging of each layer is performed asynchronously, the writing of new data is not affected, thereby maintaining the writing speed of the LSM tree.
  • however, a storage layer may include multiple log-structured trees, and executing the merge task between two adjacent layers takes time; the longer it takes, the more storage space is occupied and the more system resources are consumed.
  • the present application provides a storage system 200, the storage system includes a first storage layer and a second storage layer.
  • the first storage layer manages the metadata files that can be written without modification based on the log structure tree
  • the second storage layer manages the metadata files that can be modified and written based on the non-log structure tree.
  • in a metadata write scenario, the storage system can write the metadata of sequential write requests directly into the second storage layer, eliminating the system overhead caused by the merging of storage layers, and after the metadata of random write requests is written into the first storage layer, the second storage layer is modified in batches according to the data in the first storage layer, exchanging the transaction overhead for the merge overhead, thereby solving the problem that the storage system merge task takes a long time and consumes many resources in a large-capacity scenario.
  • the storage system 200 provided in the present application may be deployed in a storage system, and the storage system includes one or more processors.
  • the processor can be an X86 or an ARM processor and so on.
  • the storage system may be a centralized storage array or a distributed storage system, which is not specifically limited in this application.
  • the storage system 200 may include a memory 210, a hard disk 220, and one or more processors 230.
  • FIG. 2 takes one processor 230 as an example for description, and the present application does not limit the quantity of processors.
  • the memory 210, the hard disk 220 and the one or more processors 230 can be connected to each other through a bus, such as a peripheral component interconnect express (PCIe) bus or an extended industry standard architecture (EISA) bus, and can also communicate through other means such as wired transmission, for example Ethernet, which is not specifically limited in this application.
  • FIG. 2 shows only an exemplary division manner; each module unit may be merged or split into more or fewer module units, which is not specifically limited in this application, and the positional relationship between the system and the modules shown in FIG. 2 does not constitute any limitation either.
  • the memory 210 can be any one of read-only memory (ROM), random access memory (RAM), dynamic random-access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), storage class memory (SCM), etc., and the memory 210 may also be a combination of the above storage media, which is not specifically limited in this application. The memory 210 stores the unmodifiable memory table 212 and the memory table 211. It should be understood that for the description of the memory 210, the unmodifiable memory table 212, and the memory table 211, reference may be made to the memory 110, the unmodifiable memory table 112, and the memory table 111 in the embodiment of FIG. 1, and details are not repeated here.
  • the hard disk 220 can be any one of a hard disk drive (HDD), a solid-state drive (SSD), a hybrid hard disk (solid state hybrid drive, SSHD), an SCM, etc., and can also be a combination of the above storage media, which is not specifically limited in this application.
  • the hard disk 220 includes a log storage area 221 , a data storage area 222 and a structure tree area 223 .
  • the log storage area 221 , the data storage area 222 and the structure tree area 223 may be in the same hard disk or in different hard disks.
  • FIG. 2 is only an example, and this application does not limit the division manner.
  • for the description of the log storage area 221 and the data storage area 222, reference may be made to the log storage area 121 and the data storage area 122 in the embodiment of FIG. 1, and details are not repeated here.
  • the structure tree area 223 in the hard disk 220 includes a first storage layer 2231 and a second storage layer 2232. It should be understood that when the storage system 200 is deployed in a storage array or a distributed storage system, the first storage layer 2231 and the second storage layer 2232 may be in the same hard disk, or may be in different hard disks, which is not limited in this application.
  • the first storage layer 2231 can store the first metadata file that cannot be modified and written, and the first storage layer manages the first metadata file based on the log structure tree.
  • the log structure tree may be an LSM tree, and the first metadata file may be obtained after data in the unmodifiable memory table 212 is written into the first storage layer 2231 .
  • the first metadata file may also be an ordered string table obtained by merging.
  • the second storage layer 2232 stores a second metadata file that can be modified and written, and the second storage layer manages the second metadata file based on a non-log structured tree.
  • the non-log structure tree may be a B-tree, a B+ tree, a trie (dictionary tree), a skip list, etc., which is not specifically limited in this application.
  • the second metadata file may be used to store the metadata in the ordered string table merged by the first storage layer 2231 .
  • the storage system 200 may obtain an ordered string table by merging the unmodifiable metadata files (ordered string tables) in the first storage layer 2231, determine the modification nodes that need to be modified in the second storage layer 2232, modify them, and write the metadata in the merged ordered string table into the second metadata file in the second storage layer 2232, thereby avoiding the huge overhead caused by merging the second storage layer 2232 in a large-capacity scenario and reducing the resource consumption of the storage system 200.
  • the one or more processors 230 may be composed of at least one general-purpose processor, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip. The above hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof, and the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof, which is not specifically limited in this application.
  • the processor 230 may execute various types of program codes to enable the storage system 200 to implement various functions.
  • the one or more processors 230 may be configured to store the first metadata file that cannot be modified and written in the first storage layer 2231 of the storage system 200, and store the second metadata file that can be modified and written in the second storage layer 2232, where the first metadata file is managed based on a log structure tree at the first storage layer 2231, and the second metadata file is managed based on a non-log structure tree at the second storage layer 2232.
  • the first storage layer 2231 may contain multiple metadata files (ordered string tables).
  • the first storage layer may also store an unmodifiable third metadata file.
  • the third metadata file is managed based on the log structure tree.
  • the storage system 200 combines the first metadata file and the third metadata file to obtain a fourth metadata file, and modifies the second metadata file according to the fourth metadata file to complete persistent storage of metadata.
  • FIG. 3 shows an exemplary division of the structure tree area 223, taking as an example the case where the first storage layer 2231 includes two storage sub-layers L0 and L1; this application does not limit the number of sub-layers.
  • when the first metadata file in the L0 layer reaches its upper limit, the storage system 200 may merge the first metadata file with the third metadata file in the L1 layer to generate a fourth metadata file, delete the first metadata file and the third metadata file, and write the fourth metadata file into the L1 layer. When the fourth metadata file in the L1 layer reaches its upper limit, the second metadata file in the second storage layer 2232 is modified according to the fourth metadata file to complete the persistent storage of the metadata.
  • FIG. 3 is used for illustration, and is not specifically limited in the present application.
  • merging the metadata files in the first storage layer 2231 before modifying the second metadata file can reduce the number of times the second metadata file in the second storage layer 2232 is modified, thereby reducing the occurrence of the write amplification problem of the non-log-structured tree in the second storage layer 2232.
  • the problem of write amplification refers to the fact that, when a non-log-structured tree is modified, each modification needs to traverse the entire non-log-structured tree to determine the leaf nodes, intermediate nodes, and root nodes that need to be modified; if the amount of data modified each time is small and modifications are frequent, resources are wasted.
  • further hierarchical management of the first storage layer 2231 may not be performed. That is, the multiple metadata files (ordered string tables) in the first storage layer 2231 are merged according to the capacity of the first storage layer 2231, and then written to the second storage layer.
  • the fourth metadata file and the second metadata file may be merge-sorted to determine the modification nodes to be modified. For example, different versions of the same metadata are sorted: the values corresponding to metadata Y in the fourth metadata file and the second metadata file can be sorted, and the values corresponding to metadata X in the fourth metadata file and the second metadata file can be sorted, thereby determining the modification nodes to be modified, where the modification nodes can be some of the leaf nodes, intermediate nodes, root nodes, etc.
  • the above modification nodes are modified in batches in the manner of transaction writing, so as to complete the modification of the second metadata file. It can be understood that the storage system 200 exchanges a small amount of overhead required for transaction writing for a huge amount of overhead caused when a storage layer with a large amount of data is merged, thereby solving the problem that the storage system merge task takes a long time and consumes large resources.
  • if the hard disk 220 is an SCM, since the SCM is a storage medium whose read and write performance is close to that of the memory, the second metadata file can be modified directly in a write-in-place manner to realize persistent storage of the metadata.
  • if the hard disk 220 is an SSD, garbage collection (GC) is required due to the inherent characteristics of the SSD; in order to reduce the write amplification caused by GC, the SSD can persist the metadata in a redirect-on-write (ROW) manner.
  • if the hard disk 220 is an HDD, because the random write performance of the HDD is poor, the HDD can convert multiple small random write requests into large sequential write requests in a ROW manner to realize persistent storage of the metadata.
  • when querying metadata, the storage system 200 may query the memory 210, the first storage layer 2231 and the second storage layer 2232 in sequence until the metadata is read. If the first storage layer 2231 includes multiple storage sublayers, when querying the first storage layer 2231, the storage system 200 queries the sublayers in the order of merging. For example, in the example shown in FIG. 3, where the L0 layer is merged into the L1 layer, when reading metadata, L0 can be queried first and then L1, and so on. It can be understood that, compared with the storage system 100 shown in FIG. 1, since the storage system 200 has fewer storage layers, the query efficiency of the storage system 200 during metadata query can be improved, and the user experience can be improved.
  • when the storage system 200 receives a sequential write request, the metadata carried in the sequential write request can be written directly into the second metadata file, thereby reducing the resource consumption caused by merging between storage layers and improving the efficiency of metadata persistence.
  • during specific implementation, the storage system 200 may first determine whether a historical version of the metadata carried by the sequential write request exists in the memory 210 and the first storage layer 2231; if not, the metadata is written directly into the second metadata file, otherwise the metadata is written into the memory 210, so as to avoid the situation in which the metadata version in the memory is not the latest version when the metadata is queried, thereby improving the reliability of metadata reading.
  • when the storage system 200 receives a random write request, it can write the metadata Y carried in the random write request into the memory table 211 of the memory 210. In this way, when the amount of data in the memory table 211 reaches the threshold, the memory table 211 is switched to the unmodifiable memory table 212, and the metadata Y can be written into the first storage layer 2231 along with the unmodifiable memory table 212. When the amount of data in the first storage layer 2231 reaches its upper limit, the metadata Y can be written into the second storage layer 2232 along with the merging of the metadata files in the first storage layer 2231. Specifically, the second metadata file in the second storage layer 2232 can be modified, for example by modifying some leaf nodes, root nodes and intermediate nodes corresponding to the metadata Y, so as to achieve the purpose of persistently storing the metadata Y.
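  • A compact, hypothetical end-to-end sketch of this random-write path is given below; the thresholds, container types and class name are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: random writes flow memtable -> first storage layer -> batch
# modification of the second metadata file once the first layer reaches its limit.
class TieredStore:
    def __init__(self, memtable_limit=2, first_layer_limit=2):
        self.memtable, self.first_layer, self.second_file = {}, [], {}
        self.memtable_limit, self.first_layer_limit = memtable_limit, first_layer_limit
    def random_write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:          # switch to immutable and flush
            self.first_layer.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
        if len(self.first_layer) >= self.first_layer_limit:    # merge first layer into second layer
            merged = {}
            for sst in self.first_layer:
                merged.update(sst)                              # newer files overwrite older entries
            self.second_file.update(merged)                     # batch-modify the second metadata file
            self.first_layer.clear()

store = TieredStore()
for i in range(5):
    store.random_write(f"data_{i}", f"meta_{i}")
print(store.second_file)   # older metadata has been persisted into the second storage layer
```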
  • the storage system includes a first storage layer and a second storage layer.
  • the first storage layer manages files that cannot be modified and written based on a log-structured tree, and the second storage layer manages files that can be modified and written based on a non-log-structured tree.
  • the storage system with this structure can improve the efficiency of metadata reading due to the small number of storage layers.
  • when processing write requests, the metadata of sequential write requests can be written directly to the second storage layer, eliminating the system overhead caused by the merging of storage layers; after the metadata of random write requests is written into the first storage layer, the second storage layer can be modified in batches according to the data in the first storage layer, exchanging the transaction overhead for the merge overhead, so as to solve the problem that a storage system based on the log structure tree occupies a large amount of storage space and consumes many system resources in large-capacity scenarios.
  • FIG. 4 is a schematic flowchart of steps of a metadata management method in a storage system provided by the present application, wherein the management method can be applied to the storage system 200 described in FIG. 2 .
  • the method can be executed by one or more processors in the storage system 200, and the method may include the following steps:
  • S410 Store the unmodifiable and writable first metadata file in the first storage layer 2231 of the storage system 200 .
  • the first metadata file may be obtained after data in the unmodifiable memory table 212 in the memory 210 is written into the first storage layer 2231 .
  • the first storage layer 2231 manages the first metadata file based on the log structure tree.
  • the first storage layer 2231 may include a plurality of metadata files, for example, the first storage layer 2231 may also store a third metadata file that cannot be modified and written. Wherein, when the size of the first metadata file reaches the upper limit, the first metadata file can be merged with the third metadata file to obtain a fourth metadata file, and the fourth metadata file can be written into the second storage layer 2232 .
  • S420 Store the modifiable and writable second metadata file in the second storage layer 2232 of the storage system 200 .
  • the second metadata file may be obtained after the data in the first storage layer 2231 is written into the second storage layer 2232, or may be obtained by directly writing the metadata of a sequential write request into the second storage layer 2232 after the storage system 200 receives the sequential write request. It should be understood that, for the description of the second storage layer 2232 and the second metadata file, reference may be made to the foregoing embodiments in FIG. 2 to FIG. 3, and details are not repeated here.
  • the second storage layer 2232 manages the second metadata files based on a non-log structured tree.
  • the above-mentioned non-log structure tree may include a B-tree, a B+ tree, a trie, a skip list, etc., which is not specifically limited in this application.
  • the fourth metadata file obtained by merging the first metadata file and the third metadata file can be written into the second metadata file to complete the persistent storage of the metadata.
  • the fourth metadata file and the second metadata file can be merge-sorted: each piece of metadata in the fourth metadata file and the second metadata file is sorted, for example different versions of the same metadata are sorted according to the writing order; the modification nodes and modification content that need to be changed when the fourth metadata file is written into the second metadata file are then determined, and the modification nodes are modified in batches in the manner of transaction writing, so that the fourth metadata file is written into the second metadata file, where the modification nodes may be some leaf nodes, intermediate nodes, and root nodes of the non-log-structured tree, and so on.
  • specifically, the fourth metadata file and the second metadata file can be merged according to the sorting result to obtain a merged file, in which the data of the old versions is deleted and the data of the new versions is retained, and the modification nodes and modification content are then determined according to the merged file.
  • for example, after the fourth metadata file and the second metadata file are merge-sorted, the leaf node where metadata X1 is located in the second metadata file can be modified, the leaf node where metadata X2 is located is left unmodified, and a new leaf node is created to record metadata X3; the above multiple modification operations can then be added to one transaction, and the modification operations are performed in units of transactions, thereby completing the modification of the second metadata file.
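  • The following hypothetical sketch illustrates this example: after merge-sorting, each key from the fourth metadata file is classified as "modify an existing leaf" (X1) or "insert a new leaf" (X3), while an unchanged key such as X2 generates no operation; the resulting operation list is what would be submitted as one transaction. Names are illustrative assumptions.

```python
# Hypothetical sketch: classify merged entries into leaf modifications and insertions.
def plan_tree_ops(fourth, second):
    ops = []
    for key, value in sorted(fourth.items()):
        if key in second:
            if second[key] != value:
                ops.append(("modify_leaf", key, value))   # e.g. X1: a newer version exists
            # equal values (e.g. X2, absent from fourth) need no operation
        else:
            ops.append(("insert_leaf", key, value))       # e.g. X3: newly added metadata
    return ops

fourth = {"X1": "v2", "X3": "v1"}
second = {"X1": "v1", "X2": "v1"}
transaction = plan_tree_ops(fourth, second)
print(transaction)   # [('modify_leaf', 'X1', 'v2'), ('insert_leaf', 'X3', 'v1')]
```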
  • the storage system 200 may also receive a query request, and according to the query request, sequentially query the memory 210, the first storage layer 2231, and the second storage layer 2232, until the required metadata is read.
  • the storage system 100 shown in FIG. 1 needs to query the memory 110, the C0 layer, the C1 layer, the C2 layer, and the C3 layer in turn when querying data.
  • therefore, the query efficiency of the storage system shown in FIG. 1 is low, whereas the method provided in this application only needs to query the memory 210, the first storage layer 2231 and the second storage layer 2232 in turn to obtain the required metadata, which can improve query efficiency and reduce query latency.
  • the storage system 200 provided by the present application can manage the stored data based on the above steps S410 to S420.
  • the following describes the write operation process of the storage system 200 with reference to specific embodiments.
  • FIG. 5 is a schematic diagram of a step flow of a write operation process provided by the present application. As shown in FIG. 5 , the write operation process provided by the present application includes the following steps:
  • S510 Receive a write request, where the write request carries metadata X.
  • S520: Determine the write type of the write request, where the write type includes sequential write and random write. In the case that the write type is sequential write, steps S530 to S550 are performed; in the case that the write type is random write, step S560 is performed.
  • Sequential write requests refer to sequential requests for a large amount of data.
  • for example, sequential write requests may include requests for the database to perform write redo/undo log operations, streaming media service requests, and the like.
  • Random write requests refer to random requests for a small amount of data, such as world wide web (Web) service requests, mailbox (mail) service requests, and so on.
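  • The following is a small, hypothetical sketch of the routing decision in step S520; the step mapping comments and the callback-based design are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch of routing a write request by its write type (step S520).
def handle_write(write_type, key, value, sequential_path, random_path):
    if write_type == "sequential":        # e.g. redo/undo log writes, streaming media requests
        sequential_path(key, value)       # S530-S550: check history, then persist or buffer
    elif write_type == "random":          # e.g. web or mail service requests
        random_path(key, value)           # S560: write into the memory table (LSM path)
    else:
        raise ValueError(write_type)

second_file, memtable = {}, {}
handle_write("sequential", "x", "meta_x", second_file.__setitem__, memtable.__setitem__)
handle_write("random", "y", "meta_y", second_file.__setitem__, memtable.__setitem__)
```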
  • for a sequential write request, the metadata can be written directly to the second storage layer 2232 for persistence processing, which not only eliminates the system overhead caused by storage layer merging, but also reduces the write amplification problem of the non-log-structured tree in the second storage layer 2232.
  • for a random write request, the metadata can first be written into the first storage layer 2231 and managed by the log structure tree in the first storage layer 2231; after the metadata files in the first storage layer 2231 are merged, the metadata is persistently stored by modifying the second metadata file in the second storage layer 2232, which can effectively reduce the write amplification problem of the non-log structure tree in the second storage layer 2232.
  • S530: Query whether the memory 210 and the first storage layer 2231 store a historical version of the metadata X. In the case that the memory 210 and the first storage layer 2231 do not have a historical version of the metadata X, step S550 is performed; in the case that the memory 210 and/or the first storage layer 2231 includes a historical version of the metadata X, step S540 is performed.
  • this is because, when the storage system 200 receives a read request for the metadata X, the historical version of the metadata X in the memory 210 or the first storage layer 2231 would be queried first rather than the second storage layer 2232, even if the metadata X in the second storage layer 2232 were the latest version of the metadata.
  • the first storage layer 2231 can manage the metadata X based on the log structure tree. Specifically, the metadata X can first be written into the first metadata file in the first storage layer 2231; then, when the first metadata file reaches its upper limit, the storage system merges the first metadata file and the third metadata file to obtain the fourth metadata file, so that the metadata X is written into the fourth metadata file, and the fourth metadata file is written into the second metadata file in the second storage layer 2232, thereby realizing persistent storage of the metadata X.
  • It should be understood that, after step S560, the flow in which the storage system 200 processes the metadata X may refer to the foregoing embodiments in FIG. 2 to FIG. 4, and in particular to the description of how the first storage layer manages the first metadata file based on the log-structured tree; the related content is not repeated here.
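  • For the random-write path just described, the following sketch illustrates the first-storage-layer merge in isolation; the tuple-of-(key, version, value) file representation is an assumption made for illustration, not the application's actual on-disk format.

```python
# Minimal sketch of the first-storage-layer compaction step: merge two
# immutable metadata files (the first and third metadata files) into a new
# immutable fourth metadata file, keeping only the newest version of each key.

def compact(first_file, third_file):
    """Merge two immutable, key-ordered metadata files into a fourth one."""
    merged = {}
    # sort by (key, version) so that, per key, the highest version is seen last
    for key, version, value in sorted(first_file + third_file, key=lambda e: (e[0], e[1])):
        merged[key] = (version, value)   # the later (newer) version overwrites the older one
    # emit a key-ordered, write-once fourth metadata file
    return [(k, ver, val) for k, (ver, val) in sorted(merged.items())]

first_file = [("x", 2, "addr=0x10"), ("z", 1, "addr=0x30")]   # newer tier-1 file
third_file = [("x", 1, "addr=0x00"), ("y", 1, "addr=0x20")]   # older tier-1 file
fourth_file = compact(first_file, third_file)
# -> [('x', 2, 'addr=0x10'), ('y', 1, 'addr=0x20'), ('z', 1, 'addr=0x30')]
```

  • The fourth metadata file is then written into the second metadata file by modifying only the affected nodes, as described for the first writing unit further below.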
  • In summary, the metadata management method in the storage system provided by the present application manages, based on the log-structured tree, the first metadata file in the first storage layer 2231 that cannot be modified after being written, and manages, based on the non-log-structured tree, the second metadata file in the second storage layer 2232 that can be modified in place. When read requests are processed, the small number of storage layers improves the efficiency of metadata reading. When write requests are processed, the metadata of sequential write requests can be written directly into the second storage layer 2232, eliminating the system overhead caused by storage-layer merging; the metadata of random write requests can first be written into the first storage layer, after which the metadata in the metadata file of the second storage layer 2232 is modified in batches according to the metadata in the first storage layer 2231, trading transaction overhead for merge overhead and thereby solving the problem that a storage system based on a log-structured tree occupies storage space and consumes ever more system resources in large-capacity scenarios.
  • FIG. 6 shows a metadata management apparatus in a storage system provided by the present application. The metadata management apparatus may be the processor 230 described above, and includes a first storage unit 610 and a second storage unit 620.
  • The first storage unit 610 is configured to store, in the first storage layer of the storage system, the first metadata file that cannot be modified after being written, and the second storage unit 620 is configured to store, in the second storage layer of the storage system, the second metadata file that can be modified in place. The first storage layer manages the first metadata file based on the log-structured tree, and the second storage layer manages the second metadata file based on the non-log-structured tree.
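  • A minimal sketch of the two kinds of files these units manage is given below; the class names and fields are illustrative assumptions only. The point it shows is the contrast between write-once files in the first storage layer and a single in-place-modifiable file in the second storage layer.

```python
# Illustrative contrast (assumed names) between the file kinds held by the two
# storage layers: write-once files in the first layer, one modifiable file in
# the second layer that is updated in place instead of rewritten and merged.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ImmutableMetadataFile:
    """First storage layer: fixed, key-ordered entries, managed by a log-structured tree."""
    entries: tuple   # ((key, value), ...) fixed at creation time

@dataclass
class ModifiableMetadataFile:
    """Second storage layer: a single index updated in place (non-log-structured tree)."""
    index: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.index[key] = value   # in-place modification of the affected entry

first_metadata_file = ImmutableMetadataFile(entries=(("x", "addr=0x10"), ("y", "addr=0x20")))
second_metadata_file = ModifiableMetadataFile()
second_metadata_file.apply("x", "addr=0x10")
```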
  • Optionally, the metadata management apparatus further includes a merging unit 630 and a first writing unit 640. The first storage layer also stores a third metadata file that cannot be modified after being written, and the first storage layer manages the third metadata file based on the log-structured tree. The merging unit 630 is configured to merge the first metadata file and the third metadata file to obtain a fourth metadata file, and the first writing unit 640 is configured to write the metadata stored in the fourth metadata file into the second metadata file.
  • Optionally, the metadata management apparatus further includes a receiving unit 650 and a second writing unit 660. The receiving unit 650 is configured to receive a sequential write request, where the sequential write request carries metadata, and the second writing unit 660 is configured to write the metadata carried in the sequential write request into the second metadata file.
  • Optionally, the second storage layer manages the second metadata file based on a B-tree.
  • Optionally, the second storage layer manages the second metadata file based on a B+ tree.
  • Optionally, the first writing unit 640 is specifically configured to: sort the fourth metadata file and the second metadata file to determine the modification nodes of the second metadata file, and modify the modification nodes so that the metadata stored in the fourth metadata file is written into the second metadata file.
  • Optionally, the first writing unit 640 is specifically configured to modify the modification nodes in batches by means of transactional writes.
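  • As a rough illustration of the behavior just attributed to the first writing unit 640, the sketch below compares a flattened fourth metadata file against the second metadata file to work out which entries need changing, then applies all the changes in one batch; the dict-swap "commit" is an assumed stand-in for a real transactional write, not the application's actual mechanism.

```python
# Minimal sketch: determine the entries ("modification nodes") of the modifiable
# second metadata file that differ from the fourth metadata file, then apply all
# of them in a single batch, rather than rewriting the whole file as an LSM merge would.

def plan_modifications(fourth_file, second_index):
    """Compare the key-ordered fourth file against the second file's index."""
    changes = {}
    for key, value in fourth_file:
        if second_index.get(key) != value:   # entry missing or stale, so it needs modification
            changes[key] = value
    return changes

def apply_in_one_transaction(second_index, changes):
    """Apply every planned modification at a single commit point (all or nothing)."""
    staged = dict(second_index)              # stage the whole batch off to the side
    staged.update(changes)
    second_index.clear()
    second_index.update(staged)              # the swap is the sketch's "commit"

second_index = {"x": "addr=0x00", "y": "addr=0x20"}     # second metadata file
fourth_file = [("x", "addr=0x10"), ("z", "addr=0x30")]  # merged tier-1 output
apply_in_one_transaction(second_index, plan_modifications(fourth_file, second_index))
# second_index -> {'x': 'addr=0x10', 'y': 'addr=0x20', 'z': 'addr=0x30'}
```

  • Batching the node modifications like this is what allows the scheme to trade a small transaction-write cost for the much larger cost of merging entire storage layers.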
  • The metadata management apparatus of the embodiments of the present application may correspond to performing the methods described in the embodiments of FIG. 2 to FIG. 5 of the present application, and the modules and/or functions in the processor 230 respectively implement the corresponding flows of the methods in FIG. 2 to FIG. 5; for brevity, they are not repeated here.
  • In summary, the metadata management apparatus provided by the present application manages, based on the log-structured tree, the first metadata file in the first storage layer of the storage system that cannot be modified after being written, and manages, based on the non-log-structured tree, the second metadata file in the second storage layer of the storage system that can be modified in place. The metadata of sequential write requests can be written directly into the second storage layer, eliminating the system overhead caused by storage-layer merging; the metadata of random write requests can first be written into the first storage layer, after which the second storage layer is modified in batches according to the data in the first storage layer, trading transaction overhead for merge overhead and thereby solving the problem that a storage system based on a log-structured tree occupies storage space and consumes ever more system resources in large-capacity scenarios.
  • FIG. 7 is a schematic diagram of the hardware structure of a storage array 700 provided by the present application. The storage array 700 may be the storage system 200 described above. The storage array 700 includes a storage controller 710 and at least one memory 720, where the storage controller 710 and the at least one memory 720 are connected to each other through a bus 730 or a network.
  • The bus 730 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 730 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
  • The storage controller 710 may include the processor 230 and the metadata management apparatus 600 of the embodiment of FIG. 2. The storage controller 710 includes one or more processors, where a processor may be a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, or the like.
  • The at least one memory 720 may be the hard disk 220 and the memory 210 in the embodiment of FIG. 2, and may specifically be non-volatile memory, such as an HDD, an SSD or an SCM, or a combination of the above types of memory. For example, the storage array 700 may be composed of multiple HDDs or multiple SSDs, or the storage array 700 may be composed of multiple HDDs or multiple SCMs. The at least one memory 720 is combined in different ways under the control of the storage controller 710 to form a storage group, thereby providing higher storage performance than a single memory.
  • It should be noted that FIG. 7 is only a possible implementation of the embodiments of the present application; in practical applications, the storage array 700 may include more or fewer components, which is not limited here. The storage system may also be a distributed storage system, a single server implementing the above solution, or the like; the present invention does not limit the specific form of the storage system.
  • Embodiments of the present application further provide a computer-readable storage medium storing instructions which, when run on a processor, implement the method flows shown in FIG. 4 and FIG. 5. Embodiments of the present application further provide a computer program product which, when run on a processor, implements the method flows shown in FIG. 4 and FIG. 5.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes at least one computer instruction. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that contains at least one set of available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, digital video discs (DVDs)), or semiconductor media. The semiconductor media may be an SSD.

Abstract

The present application provides a metadata management method and apparatus in a storage system, and a storage system. The method is performed by one or more processors in the storage system and includes the following steps: storing, in a first storage layer of the storage system, a first metadata file that cannot be modified after being written, and storing, in a second storage layer, a second metadata file that can be modified in place, where the first storage layer manages the first metadata file based on a log-structured tree and the second storage layer manages the second metadata file based on a non-log-structured tree. With this metadata management method, the metadata of sequential write requests can be written directly into the second storage layer, eliminating the system overhead caused by storage-layer merging and solving the problem that the merge tasks of the storage system take a long time and consume a large amount of resources in large-capacity scenarios.

Description

一种存储系统中元数据管理方法、装置及存储系统 技术领域
本申请涉及存储领域,尤其涉及一种存储系统中元数据管理方法、装置及存储系统。
背景技术
随着科学技术的不断发展,信息爆炸时代产生的海量数据已经渗透到当今每一个行业和业务职能领域,信息爆炸使得存储系统的读写要求也越来越高。为了提高存储系统的读写能力,优化数据文件的组织方式,基于日志的数据库系统被广泛应用在各行各业。
结构化合并树(log structured merge tree,LSM tree)是基于日志(log)的数据库系统中的常用存储结构之一,LSM树是一个多层结构,最上层C 0位于内存,后面多层C 1~C k位于硬盘。在LSM树的用于管理元数据写入场景中,新写入的元数据会首先存储至最上层C 0的文件中,最上层C 0中元数据量达到一定程度后,该文件变成不可修改写文件,最上层C 0中的该文件将会与下一层C 1中的不可修改写文件进行合并(compaction),从而获得新的不可修改写文件,然后顺序写入硬盘中的C 2层,以此类推,使得旧元数据可以不断被删除,新元数据能够不断被写入至硬盘中。由于各层的合并是异步进行的,对新元数据的写不会产生影响,从而使得LSM树的写入速度得以提升。
但是,在大容量场景下,基于日志结构树的存储系统在经过多层合并之后,越底层的元数据量越大,使得越底层的合并任务执行耗时越久,系统资源消耗越大。
发明内容
本申请提供了一种存储系统中元数据管理方法、装置及存储系统,可以解决基于日志结构树的存储系统合并任务耗时久、资源消耗大的问题。
第一方面,提供了一种存储系统中元数据管理方法,该方法由存储系统中的一个或多个处理器执行,该方法包括以下步骤:在存储系统的第一存储层存储不可修改写的第一元数据文件,在第二存储层存储可修改写的第二元数据文件;其中,第一存储层基于日志结构树管理第一元数据文件;第二存储层基于非日志结构树管理第二元数据文件。
具体实现中,非日志结构树可以是B树、B+树、字典序列树、跳表等等,本申请不作具体限定。
实施第一方面描述的方法,基于日志结构树管理第一存储层中不可修改写的第一元数据文件,基于非日志结构树管理第二存储层中可修改写的第二元数据文件,使得存储系统在元数据写场景,可将顺序写请求的元数据直接写入第二存储层,消除存储层合并带来的系统开销,同时,随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解 决在大容量场景下存储系统合并任务耗时久、资源消耗大的问题。
在一种可能的实现方式中,第一存储层可以包含多个元数据文件,比如,第一存储层还存储不可修改写的第三元数据文件,第一存储层基于日志结构树管理第三元数据文件,该方法还可包括以下步骤:合并第一元数据文件和第三元数据文件得到第四元数据文件,将第四元数据文件中存储的元数据写入到第二元数据文件。
具体实现中,第一存储层可以通过分层管理的方式,对多个元数据文件进行管理,比如第一存储层可包括第一存储子层和第二存储子层,第一存储子层用于存储第一元数据文件,第二存储子层用于存储第三元数据文件,多个存储子层按照从上至下的方式进行合并,然后将底部的存储子层中的数据写入第二元数据文件。也就是说,第一存储子层的第一元数据文件先与第二存储子层的第三元数据文件进行合并,获得第四元数据文件,然后将第四元数据文件中存储的元数据写入到第二元数据文件。
可以理解,另一种实现,也可以不对第一存储层进行进一步分层管理。即在第一存储层中的多个元数据文件(有序字符串表),根据第一存储层的容量,对多个元数据文件进行合并,然后写入第二存储层。
上述实现方式,第一存储层中包含多个元数据文件,且元数据文件之间可进行合并,将合并后的元数据文件再写入第二存储层,可以减少对第二存储层中第二元数据文件的修改次数,从而降低第二存储层的非日志结构树写放大问题的出现。其中,写放大问题指的是在对非日志结构树进行修改时,每次修改都需要遍历整个非日志结构树确定所需修改的叶子节点、中间节点和根节点等等,若每次修改量较小,而修改次数又较为频繁,将会出现资源浪费的情况。而第一存储层中多个元数据文件之间合并后,再写入第二存储层,可以减少该写放大问题的出现,避免资源浪费。
在一种可能的实现方式中,将第四元数据文件中存储的元数据写入到第二元数据文件时,可将第四元数据文件和第二元数据文件进行排序,确定第二元数据文件的修改节点,然后对修改节点进行修改,将第四元数据文件中存储的元数据写入到第二元数据文件。
具体实现中,上述排序可以是归并排序,比如将同一元数据的不同版本进行排序,然后根据排序结果将第四元数据文件和第二元数据文件进行合并,获得合并文件,然后根据合并文件确定修改节点和修改内容。
上述实现方式,通过修改第二元数据文件中部分修改节点的方式,将第四元数据文件中存储的元数据写入到第二元数据文件,而基于LSM树的存储系统则需要将第四元数据文件和第二元数据文件进行合并获得新文件,并将新文件完整写入第二存储层,以实现将第四元数据文件中存储的元数据写入到第二元数据文件的目的,二者相比,上述实现方式可以消除存储层合并带来的系统开销,提高元数据持久化存储的效率。
在一种可能的实现方式中,对修改节点进行修改时,可以以事务写的方式,对修改节点进行批量修改。比如根据合并文件确定要对多个修改节点进行修改,那么可以将多个修改操作添加至一个事务中,以事务为单位进行本次修改操作,可以确保多个修改操作的一致性,从而完成对第二元数据文件的修改。
上述实现方式,通过修改第二元数据文件的方式对元数据进行持久化存储,并且以事务写所需的少量开销换取存储层合并时带来的巨额开销,可以解决存储系统合并 任务耗时久、资源消耗大的问题。
在一种可能的实现方式中,存储系统可包括一个或多个存储设备,该存储设备用于提供第一存储层和第二存储层,该存储设备可以是机械硬盘(hard disk drive,HDD)、固态硬盘(solid-state drive,SSD)、混合硬盘(solid state hybrid drive,SSHD)、存储级内存(storage class memory,SCM)等等中任意一种,还可以是上述各种存储介质的组合,本申请不作具体限定。
具体实现中,根据第四元数据文件对第二元数据文件进行修改时,若存储设备是SCM的情况下,由于SCM是读写性能接近内存的存储介质,可以直接以原地修改写(write-in-place)的方式,对第二元数据文件进行修改,实现元数据的持久化存储。若存储设备是SSD的情况下,由于SSD的固有特性导致其需要进行垃圾回收(garbage collection,GC),为了减少GC带来的写放大问题,SSD可以以写时重定向(redirect-on-write,ROW)方式将对元数据进行持久化处理。若存储设备是HDD的情况下,由于HDD的随机写性能较差,因此HDD可以以ROW方式,将多个随机写的小写请求转化为顺序大写请求,实现元数据的持久化存储。
在一种可能的实现方式中,该方法还可包括以下步骤:接收顺序写请求,该顺序写请求中携带元数据,将顺序写请求中携带的元数据写入该第二元数据文件中。
具体实现中,在接收到顺序写请求时,存储系统还可先确定内存和第一存储层中是否存在该顺序写请求所携带的元数据的历史版本,若不存在,再将该元数据直接写入第二元数据文件中,否则,将该元数据写入内存中,从而避免查询元数据时,内存中的元数据版本不是最新的元数据版本,提高元数据读取的可靠性。
上述实现方式,将顺序写请求中的元数据直接写入第二元数据文件中,从而消除存储层合并带来的系统开销,提高元数据持久化的效率。
在一可能的实现方式中,该方法还可包括以下步骤:接收随机写请求,该随机写请求中携带元数据,将随机写请求中携带的元数据写入内存中。
具体实现中,可将随机写请求中携带的元数据写入内存中的内存表,这样,当内存表中的数据量达到阈值时,内存表切换为不可修改内存表,元数据可随着不可修改内存表,被写入第一存储层中,当第一存储层中的数据量达到上限时,元数据可随着第一存储层中的元数据文件的合并,被写入第二存储层,具体可对第二存储层中的第二元数据文件进行修改,比如修改元数据对应的部分叶子节点、根节点和中间节点等等,并将修改操作添加入一个事务中,以事务写的方式批量修改第二元数据文件中的修改节点,从而实现将元数据进行持久化存储的目的。
上述实现方式,通过将随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解决基于日志结构树的存储系统在大容量场景下,占用存储空间且系统资源消耗越大的问题。
在一可能实现方式中,存储系统在接收到元数据的查询请求时,可依次查询内存、第一存储层以及第二存储层,直至读取到该元数据。若第一存储层包括多个存储子层,那么存储系统在查询第一存储层时,按照合并顺序对第一存储层进行查询,比如第一存储层包括L0层和L1层,且L0层中的元数据文件达到上限后,将会向L1层中的元数据文件进行合并,那么在读取元数据时,可先查询L0再查询L1,以此类推,这里 不一一举例说明。
上述实现方式,在处理元数据查询请求时,依次查询内存、第一存储层和第二存储层直至读取到该元数据,且最后的第二存储层只有一棵非日志结构树,因此在元数据查询时,相比于由多个存储层构成的LSM树,上述存储系统的查询效率可以提升,提高用户的使用体验。
第二方面,提供了一种存储系统,该存储系统包括含一个或多个处理器以及与一个或多个处理器通信的存储设备,该存储设备用于提供第一存储层和第二存储层,上述一个或多个处理器用于:在存储系统的第一存储层存储不可修改写的第一元数据文件,在第二存储层存储可修改写的第二元数据文件,在第一存储层基于日志结构树管理第一元数据文件,在第二存储层基于非日志结构树管理第二元数据文件。
具体实现中,该存储设备可以是HDD、SSD、SCM等等中任意一种,还可以是上述各种存储介质的组合,本申请不作具体限定。
实施第二方面描述的存储系统,可基于日志结构树管理第一存储层中不可修改写的第一元数据文件,基于非日志结构树管理第二存储层中可修改写的第二元数据文件,使得存储系统在元数据写场景,可将顺序写请求的元数据直接写入第二存储层,消除存储层合并带来的系统开销,同时,随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解决在大容量场景下存储系统合并任务耗时久、资源消耗大的问题。
可选地,第一存储层还存储不可修改写的第三元数据文件,第一存储层基于日志结构树管理第三元数据文件,一个或多个处理器还用于:合并第一元数据文件和第三元数据文件得到第四元数据文件,将第四元数据文件中存储的元数据写入到第二元数据文件。
具体实现中,第一存储层可以通过分层管理的方式,对多个元数据文件进行管理,比如第一存储层可包括第一存储子层和第二存储子层,第一存储子层用于存储第一元数据文件,第二存储子层用于存储第三元数据文件,多个存储子层按照从上至下的方式进行合并,然后将底部的存储子层中的数据写入第二元数据文件。也就是说,第一存储子层的第一元数据文件先与第二存储子层的第三元数据文件进行合并,获得第四元数据文件,然后将第四元数据文件中存储的元数据写入到第二元数据文件。
可以理解,另一种实现,也可以不对第一存储层进行进一步分层管理。即在第一存储层中的多个元数据文件(有序字符串表),根据第一存储层的容量,对多个元数据文件进行合并,然后写入第二存储层。
可选地,上述一个或多个处理器将第四元数据文件中存储的元数据写入到第二元数据文件时,可将第四元数据文件和第二元数据文件进行排序,确定第二元数据文件的修改节点,然后对修改节点进行修改,将第四元数据文件中存储的元数据写入到第二元数据文件。
具体实现中,上述排序可以是归并排序,比如将同一元数据的不同版本进行排序,然后根据排序结果将第四元数据文件和第二元数据文件进行合并,获得合并文件,然后根据合并文件确定修改节点和修改内容。
可选地,上述一个或多个处理器用于对修改节点进行修改时,可以以事务写的方 式,对修改节点进行批量修改。比如根据合并文件确定要对多个修改节点进行修改,那么可以将多个修改操作添加至一个事务中,以事务为单位进行本次修改操作,可以确保多个修改操作的一致性,从而完成对第二元数据文件的修改。
可选地,根据第四元数据文件对第二元数据文件进行修改时,若存储设备是SCM的情况下,由于SCM是读写性能接近内存的存储介质,可以直接以原地修改写的方式,对第二元数据文件进行修改,实现元数据的持久化存储。若存储设备是SSD的情况下,由于SSD的固有特性导致其需要进行GC,为了减少GC带来的写放大问题,SSD可以以ROW方式将对元数据进行持久化处理。若存储设备是HDD的情况下,由于HDD的随机写性能较差,因此HDD可以以ROW方式,将多个随机写的小写请求转化为顺序大写请求,实现元数据的持久化存储。
可选地,上述一个或多个处理器用于接收顺序写请求,该顺序写请求中携带元数据,将顺序写请求中携带的元数据写入该第二元数据文件中。
具体实现中,在接收到顺序写请求时,存储系统还可先确定内存和第一存储层中是否存在该顺序写请求所携带的元数据的历史版本,若不存在,再将该元数据直接写入第二元数据文件中,否则,将该元数据写入内存中,从而避免查询元数据时,内存中的元数据版本不是最新的元数据版本,提高元数据读取的可靠性。
可选地,上述一个或多个处理器用于接收随机写请求,该随机写请求中携带元数据,将随机写请求中携带的元数据写入内存中。
具体实现中,可将随机写请求中携带的元数据写入内存中的内存表,这样,当内存表中的数据量达到阈值时,内存表切换为不可修改内存表,元数据可随着不可修改内存表,被写入第一存储层中,当第一存储层中的数据量达到上限时,元数据可随着第一存储层中的元数据文件的合并,被写入第二存储层,具体可对第二存储层中的第二元数据文件进行修改,比如修改元数据对应的部分叶子节点、根节点和中间节点等等,并将修改操作添加入一个事务中,以事务写的方式批量修改第二元数据文件中的修改节点,从而实现将元数据进行持久化存储的目的。
可选地,上述一个或多个处理器用于接收到元数据的查询请求,然后根据该查询请求,可依次查询内存、第一存储层以及第二存储层,直至读取到该元数据。若第一存储层包括多个存储子层,那么存储系统在查询第一存储层时,按照合并顺序对第一存储层进行查询,比如第一存储层包括L0层和L1层,且L0层中的元数据文件达到上限后,将会向L1层中的元数据文件进行合并,那么在读取元数据时,可先查询L0再查询L1,以此类推,这里不一一举例说明。
第三方面,提供了一种存储系统中的元数据管理装置,该元数据管理装置可包括:第一存储单元,用于在存储系统的第一存储层存储不可修改写的第一元数据文件,第二存储单元,用于在存储系统的第二存储层存储可修改写的第二元数据文件,第一存储层基于日志结构树管理第一元数据文件,第二存储层基于非日志结构树管理第二元数据文件。
实施第三方面描述的元数据管理装置,可基于日志结构树管理第一存储层中不可修改写的第一元数据文件,基于非日志结构树管理第二存储层中可修改写的第二元数据文件,使得存储系统在元数据写场景,可将顺序写请求的元数据直接写入第二存储 层,消除存储层合并带来的系统开销,同时,随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解决在大容量场景下存储系统合并任务耗时久、资源消耗大的问题。
可选地,第一存储层还存储不可修改写的第三元数据文件;第一存储层基于日志结构树管理第三元数据文件;元数据管理装置还包括:合并单元,用于合并第一元数据文件和第三元数据文件得到第四元数据文件;第一写入单元,用于将第四元数据文件中存储的元数据写入到第二元数据文件。
可选地,元数据管理装置还包括:接收单元,用于接收顺序写请求;顺序写请求中携带元数据;第二写入单元,用于将顺序写请求中携带的元数据写入第二元数据文件中。
可选地,第二存储层基于B树管理第二元数据文件。
可选地,第二存储层基于B+树管理第二元数据文件。
可选地,第一写入单元具体用于:将第四元数据文件和第二元数据文件进行排序,确定第二元数据文件的修改节点;对修改节点进行修改,将第四元数据文件中存储的元数据写入到第二元数据文件。
可选地,第一写入单元具体用于:以事务写的方式,对修改节点进行批量修改。
第四方面,提供了一种存储阵列,包括存储控制器和至少一个存储器,其中,至少一个存储器用于提供第一存储层和第二存储层,还用于存储程序代码,存储控制器可执行该程序代码实现如第一方面描述的方法。
第五方面,提供了一种计算机程序产品,包括计算机程序,当计算机程序被计算设备读取并执行时,实现如第一方面所描述的方法。
第六方面,提供了一种计算机可读存储介质,包括指令,当指令在计算设备上运行时,使得计算设备实现如第一方面所描述的方法。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍。
图1是一种基于日志结构树的存储系统的架构示意图;
图2是本申请提供的一种存储系统的硬件结构示意图;
图3是本申请提供的一种存储系统的结构树区域的结构示意图;
图4是本申请提供的一种存储系统中元数据管理方法的步骤流程示意图;
图5是本申请提供的一种写请求处理方法的流程示意图;
图6是本申请提供的一种元数据管理装置的结构示意图;
图7是本申请提供的一种存储阵列的硬件结构示意图。
具体实施方式
为了便于理解本发明的技术方案,应理解,本申请的实施方式部分使用的术语仅用于对本申请的具体实施例进行解释,而非旨在限定本申请。
为了便于理解本申请实施例,首先,对本申请涉及的应用场景“日志结构树”进 行简要说明。
由于硬盘的存储特性,使得顺序写操作的效率比随机写操作的效率高很多,因此存储系统往往基于这一特性对数据库进行设计和优化,其中,日志型数据库通过追加(append)的方式将数据写入硬盘,成为优化写性能的最佳实现方式之一,被广泛应用在存储系统中。
日志型数据库是一种可视为电子化的文件柜,是一个有组织、可共享、统一管理的大量数据的集合。用户可对日志型数据库中的文件进行新增、查询、更新和删除等操作,数据库接收到用户的操作请求后,生成本次修改的日志存储于日志盘中,再在内存中对文件所在的内存页进行修改。当内存页中的数据达到一定阈值,比如内存页满了以后,内存页中的数据再刷入硬盘进行持久化处理。因此,日志型数据库可以免于在每次修改数据时,产生大量的需要写入硬盘的随机写请求,从而减少硬盘写操作的次数,并且,日志型数据库可以通过写日志以及将内存页中的数据合并后写入硬盘的方式,将多个随机写请求转化为顺序写请求,提高存储效率,同时,在内存发生故障时,可根据日志盘中的日志,恢复内存中的文件,提高数据库的可靠性。
但是,单纯的日志型数据库的读性能很差,在读数据场景中,可能要遍历全部日志记录才能找到一条数据。为优化日志型数据库,用于管理日志型数据库的日志结构树应运而生,其中,LSM树作为一种日志结构树,可以将效率较低的随机写操作转化为效率较高的顺序写操作,使得写性能提升的同时,读性能的影响降到最低,被广泛应用在日志型数据库中。
图1是一种基于日志结构树的存储系统的结构示意图,如图1所示,该存储系统110可包括内存110和硬盘120。其中,内存110包括内存表(memtable)111和不可修改内存表(immutable memtable)112,硬盘120包括有序字符串表(sorted string table,SStable)123、日志存储区域121以及数据存储区域122。
内存110中的内存表111可用于存储数据的元数据,元数据指的是描述数据的数据,该元数据可包括数据的存储地址、长度、数据类型等描述信息,具体实现中,内存表111中的元数据可以被修改,且内存表111按照键值对的形式将元数据进行有序存储。其中,键值对(key,value):键值对是一种字符串,K代表键(key),key可以是数据的名称或者标识,V代表该键对应的值(value),value可根据业务需求定义,比如可以是数据的位置信息等,每个key对应至少一个value。比如K=数据x,V=数据x的元数据。该种存储方式可通过单键查询、组合键查询以及范围键的查询,简单、快速地获得业务所需的数据。
内存110中的不可修改内存表112用于存储不可修改的元数据,其中,内存表111中的元数据达到阈值后,内存表111可切换为不可修改内存表112,然后创建新的内存表111用于存储最近更新的元数据,而不可修改内存表112将会被批量刷入硬盘120中进行持久化处理。
硬盘120中的日志存储区域121用于存储日志。应理解,由于内存110不是可靠存储,如果断电会导致内存110中的数据丢失,因此可通过预写式日志(write-ahead logging,WAL)来确保可靠性。WAL是一种实现事务日志的标准方法。在基于LSM存储元数据的场景下,其核心思想是对元数据的修改必须且只能发生在日志修改之后, 换句话说,修改元数据时,先修改操作记录在日志文件,然后再修改元数据。使用WAL进行数据存储,可以免于在每次修改元数据时,都将元数据冲刷进硬盘,减少硬盘写的次数,同时,由于WAL在硬盘中存储,当元数据先被写进内存,还未被冲刷进硬盘时,若内存发生故障,此时可根据WAL对元数据进行回恢复。
硬盘120中的数据存储区域121用于存储数据,也就是数据进行持久化存储的区域。硬盘120中的有序字符串表123基于日志结构树对元数据进行管理,该日志结构树可以是LSM树,LSM的组织结构包含多个存储层(例如图1中的C0层、C1层、C2层和C3层),其中,越上层的存储层其存储空间越小,越下层的存储层其存储空间越大。且不可修改内存表112被冲刷(flush)进硬盘时,先写入有序字符串表123的顶层,最上层C0中数据量达到一定程度后,最上层C0中的有序字符串表合并后会存储到C1层,C1层中的有序字符串表合并后得到有序字符串表写入C1层,以此类推,使得旧数据可以不断被删除,新数据能够不断被写入至硬盘中。在存储元数据场景中,内存表、不可修改内存表和有序字符串表也称为元数据文件。
因此,当图1所示的存储系统100接收数据X的写请求时,该系统100可先将本次写请求的数据X写入数据存储区域122,再将数据X在数据存储区域122中的地址等元数据信息Y写入日志中,将日志存储至日志存储区域121,最后将数据X的元数据Y写入内存表111。这样,当存储系统100再接收到数据X的写请求时,存储系统100可以将本次修改后的数据X追加写入数据存储区域122,将修改后的数据X对应的元数据追加写入日志中,对内存表111中的元数据进行修改。上述过程中,存储系统100可以对接收到的数据先进行合并后再追加写,可以减少数据修改时磁盘写入的次数,提高存储效率。
而当内存表111中存储的数据量达到阈值后,内存表111将会切换为不可修改内存表112,并创建新的内存表111,而不可修改内存表112将会批量写入硬盘120的C0层进行持久化处理,假设如图1所示的,C0层的存储空间大小为300MB,C1层的存储空间大小为3GB,C2层的存储空间大小为30GB,C3层的存储空间大小为100GB,那么当C0层中的数据量达到300MB时,C0层中的不可修改内存表合并得到有序字符串表,将有序字符串表写入C1层。同理,当C1层中的数据量达到3GB时,C1层中的有序字符串表合并得到新的有序字符串表,将新的有序字符串表写入C2层。同时,由于各层的合并是异步进行的,对新数据的写不会产生影响,从而使得LSM树的写入速度得以提升。
但是,基于日志结构树的存储系统在大容量场景下,越底层的存储层中的数据量越大,一层存储层可能包括多个日志结构树,相邻两层之间的合并任务执行耗时较久,占用存储空间且系统资源消耗越大。
为了解决上述存储系统的资源消耗大、合并任务耗时久等问题,本申请提供了一种存储系统200,该存储系统包括第一存储层和第二存储层。在管理元数据场景,第一存储层基于日志结构树管理可不修改写的元数据文件,第二存储层基于非日志结构树管理可修改写的元数据文件。在元数据写场景,该存储系统可将顺序写请求的元数据直接写入第二存储层,消除存储层合并带来的系统开销,同时,随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开 销的代价换取合并的开销,从而解决在大容量场景下存储系统合并任务耗时久、资源消耗大的问题。
其中,本申请提供的存储系统200可以部署于存储系统中,存储系统中包含一个或多个处理器。处理器可以是X86或ARM处理器等等。存储系统可以是集中式的存储阵列,还可以是分布式存储系统,本申请不作具体限定。
如图2所示,该存储系统200可包括内存210、硬盘220以及一个或多个处理器230,图2以1个处理器230为例进行了说明,具体实现中,本申请不对存储器230的数量进行限制。内存210、硬盘220以及一个或多个处理器230之间可以通过总线相互连接,比如快捷外围部件互连标准(peripheral component interconnect express,PCIe)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等,也可以通过有线传输等其他手段实现通信,比如以太网(Ethernet),本申请不作具体限定。应理解,图2仅为一种示例性的划分方式,各个模块单元之间可以合并或者拆分为更多或更少的模块单元,本申请不作具体限定,且图2中所示的系统和模块之间的位置关系也不构成任何限制。
内存210可以是只读存储器(read-only memory,ROM)、随机存储器(random access memory,RAM)、动态随机存储器(dynamic random-access memory,DRAM)、双倍速率同步动态随机存储器(double data rate SDRAM,DDR)、存储级内存(storage class memory,SCM)等等中任意一种,内存210还可以是上述各种存储介质的组合,本申请不作具体限定。其中,内存210存储不可修改内存表212和内存表211。应理解,关于内存210、不可修改内存表212以及内存表211的描述可以参考图1实施例中的内存110、不可修改内存表112以及内存表111,这里不重复赘述。
硬盘220可以是机械硬盘(hard disk drive,HDD)、固态硬盘(solid-state drive,SSD)、混合硬盘(solid state hybrid drive,SSHD)、SCM等等中任意一种,还可以是上述各种存储介质的组合,本申请不作具体限定。其中,硬盘220包括日志存储区域221、数据存储区域222以及结构树区域223。在存储系统200部署于存储阵列或者分布式存储系统时,日志存储区域221、数据存储区域222以及结构树区域223可以在同一个硬盘中,也可以在不同的硬盘中,图2是一种示例性的划分方式,本申请不对此进行限定。
应理解,日志存储区域221、数据存储区域222的描述可以参考图1实施例中的日志存储区域121、数据存储区域122,这里不重复赘述,下面主要对硬盘220中的结构树区域223进行解释说明。
如图2所示,硬盘220中的结构树区域223包括第一存储层2231和第二存储层2232,应理解,在存储系统200部署于存储阵列或者分布式存储系统时,第一存储层2231和第二存储层2232可以在同一个硬盘中,也可以在不同的硬盘中,本申请不对此进行限定。其中,第一存储层2231可存储不可修改写的第一元数据文件,第一存储层基于日志结构树管理该第一元数据文件。具体实现中,日志结构树可以是LSM树,该第一元数据文件可以是不可修改内存表212中的数据写入第一存储层2231后获得的。也就是说,内存表211中的数据在达到阈值后,内存表211将会切换为不可修改内存表212,并创建新的内存表211用于处理新数据,而不可修改内存表212中的数据可 写入第一存储层2231中,第一存储层2231中的数据即为第一元数据文件。该第一元数据文件也可以是合并得到的有序字符串表。
第二存储层2232存储可修改写的第二元数据文件,第二存储层基于非日志结构树管理该第二元数据文件。具体实现中,非日志结构树可以是B树(B tree)、B+树(B+tree)、字典序列树(trie tree)、跳表(skip list)等等,本申请不作具体限定。其中,第二元数据文件可以用于存储第一存储层2231合并得到的有序字符串表中的元数据。具体的,第一存储层2231中的数据达到上限时,存储系统200可根据第一存储层2231中的不可修改写的元数据文件(有序字符串表)合并得到有序字符串表,确定第二存储层2232所需修改的修改节点,然后对其进行修改,将该有序字符串表中的元数据写入第二存储层2232中第二元数据文件,从而避免在大容量场景下第二存储层2232进行合并带来的巨额开销,降低存储系统200的资源消耗。
一个或多个处理器230可以由至少一个通用处理器构成,例如中央处理器(central processing unit,CPU),或者CPU和硬件芯片的组合,上述硬件芯片可以是专用集成电路(applicaition specific integrated circuit,ASIC)、可编程逻辑器件(programmable logic device,PLD)或其组合,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合,本申请不作具体限定。处理器230可执行各种类型的程序代码,以使存储系统200实现各种功能。具体地,一个或多个处理器230可用于在存储系统100的第一存储层2231存储不可修改写的第一元数据文件,在第二存储层2232存储可修改写的第二元数据文件,在第一存储层2331基于日志结构树管理第一元数据文件,在第二存储层基于非日志结构树管理第二元数据文件。
在一可能的实现方式中,第一存储层2231可以包含多个元数据文件(有序字符串表),例如,第一存储层还可存储不可修改的第三元数据文件,第一存储层基于该日志结构树管理第三元数据文件。存储系统200将第一元数据文件和第三元数据文件进行合并获得第四元数据文件,根据第四元数据文件对第二元数据文件进行修改,完成元数据的持久化存储。
例如,图3是结构树区域223的一种示例性的划分方式,图3以第一存储层2231包括两个存储子层L0和L1为例进行了举例说明,具体实现中,本申请不对存储子层的数量进行限定。在图3所示的存储树区域233中,当L0层中的第一元数据文件达到上限时,存储系统200可将第一元数据文件与L1层的第三元数据文件进行合并,生成第四元数据文件,然后将第三元数据文件删除,第四元数据文件写入L1层,当L1层中的第四元数据文件达到上限时,根据第四元数据文件和第二存储层2232中的第二元数据文件,对第二元数据文件进行修改,完成元数据的持久化存储。应理解,图3用于举例说明,本申请不作具体限定。
可以理解的,通过将第一存储层2231进一步划分为多个存储子层,多个存储子层按照从上至下的方式进行合并,根据最底部的存储子层中的数据对第二元数据文件进行修改,可以减少对第二存储层2232中第二元数据文件的修改次数,从而降低第二存储层2332的非日志结构树写放大问题的出现。其中,写放大问题指的是在对非日志结 构树进行修改时,每次修改都需要遍历整个非日志结构树确定所需修改的叶子节点、中间节点和根节点等等,若每次修改量较小,而修改次数又较为频繁,将会出现资源浪费的情况。
可以理解,另一种实现,也可以不对第一存储层2231进行进一步分层管理。即在第一存储层2231中的多个元数据文件(有序字符串表),根据第一存储层2231的容量,对多个元数据文件进行合并,然后写入第二存储层。
在一可能的实现方式中,在根据第四元数据文件对第二元数据文件进行修改时,可先将第四元数据文件和第二元数据文件进行归并排序确定所需修改的修改节点,比如将相同的元数据的不同版本进行排序,举例来说,可以将元数据Y在第四元数据文件和第二元数据文件中对应的值进行排序,将元数据X在第四元数据文件和第二元数据文件中对应的值进行排序,从而确定所需修改的修改节点,其中,修改节点可以是非日志结构树的部分叶子节点、中间节点、根节点等等,本申请不作具体限定;最后以事务写的方式,批量修改上述修改节点,从而完成对第二元数据文件的修改。可以理解的,存储系统200以事务写所需的少量开销换取数据量较大的存储层进行合并时带来的巨额开销,从而解决存储系统合并任务耗时久、资源消耗大的问题。
根据第四元数据文件对第二元数据文件进行修改时,具体实现中,在硬盘220是SCM的情况下,由于SCM是读写性能接近内存的存储介质,可以直接以原地修改写(write-in-place)的方式,对第二元数据文件进行修改,实现元数据的持久化存储。在硬盘220是SSD的情况下,由于SSD的固有特性导致其需要进行垃圾回收(garbage collection,GC),为了减少GC带来的写放大问题,SSD可以以写时重定向(redirect-on-write,ROW)方式将对元数据进行持久化处理。在硬盘220是HDD的情况下,由于HDD的随机写性能较差,因此HDD可以以ROW方式,将多个随机写的小写请求转化为顺序大写请求,实现元数据的持久化存储。
在一可能实现方式中,存储系统200在接收到元数据的查询请求时,可依次查询内存210、第一存储层2231以及第二存储层2232,直至读取到该元数据。若第一存储层2231包括多个存储子层,那么存储系统200在查询第一存储层2231时,按照合并顺序对第一存储层2231进行查询,比如图3所述的例子中,L0层向L1层合并,那么在读取元数据时,可先查询L0再查询L1,以此类推,这里不一一举例说明。可以理解的,相比于图1所示的存储系统100,由于存储系统200的存储层数较少,因此在元数据查询时,存储系统200的查询效率可以提升,提高用户的使用体验,
在一可能的实现方式中,存储系统200在接收到顺序写请求时,由于第二存储层的第二元数据文件是可修改写的元数据文件,可将顺序写请求中携带的元数据直接写入第二元数据文件中,从而减少存储层之间合并带来的资源消耗,提高元数据持久化的效率。
具体实现中,在接收到顺序写请求时,存储系统200还可先确定内存210和第一存储层2231中是否存在该顺序写请求所携带的元数据的历史版本,若不存在,再将该元数据直接写入第二元数据文件中,否则,将该元数据写入内存210中,从而避免查询元数据时,内存中的元数据版本不是最新的元数据版本,提高元数据读取的可靠性。
在一可能的实现方式中,存储系统200在接收到随机写请求时,可将随机写请求 中携带的元数据Y写入内存210的内存表211中,这样,当内存表211中的数据量达到阈值时,内存表211切换为不可修改内存表212,元数据Y可随着不可修改内存表212,被写入第一存储层2231中,当第一存储层2231中的数据量达到上限时,元数据Y可随着第一存储层2231中的元数据文件的合并,被写入第二存储层2232,具体可对第二存储层2232中的第二元数据文件进行修改,比如修改元数据Y对应的部分叶子节点、根节点和中间节点等等,从而实现将元数据Y进行持久化存储的目的。
综上可知,本申请提供的存储系统包括第一存储层和第二存储层,第一存储层基于日志结构树管理不可修改写的文件,第二存储层基于非日志结构树管理可修改写的文件。该结构的存储系统在处理元数据读取请求时,由于存储层数较少,可以提升元数据读取的效率,在处理元数据写入请求时,可将顺序写请求的元数据直接写入第二存储层,消除存储层合并带来的系统开销,也可将随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解决基于日志结构树的存储系统在大容量场景下,占用存储空间且系统资源消耗越大的问题。
图4是本申请提供的一种存储系统中元数据管理方法的步骤流程示意图,其中,该管理方法可应用于图2所述的存储系统200中,如图4所示,该方法可由存储系统200中的一个或多个处理器执行,该方法可包括以下步骤:
S410:在存储系统200的第一存储层2231存储不可修改写的第一元数据文件。
具体实现中,第一元数据文件可以是内存210中的不可修改内存表212中的数据写入第一存储层2231后获得的。第一存储层2231和第一元数据文件的描述可以参考图2-图3实施例,这里不再重复赘述。第一存储层2231基于日志结构树管理第一元数据文件。
在一可能的实现方式中,第一存储层2231可包括多个元数据文件,例如,第一存储层2231还可存储不可修改写的第三元数据文件。其中,当第一元数据文件的大小达到上限时,第一元数据文件可与第三元数据文件进行合并,获得第四元数据文件,第四元数据文件可以被写入第二存储层2232。
S420:在存储系统200的第二存储层2232存储可修改写的第二元数据文件。
具体实现中,第二元数据文件可以是第一存储层2331中的数据写入第二存储层2232后获得的,也可以是存储系统200接收到顺序写请求后,将顺序写请求的元数据直接写入第二存储层2232存储的第二元数据文件。应理解,第二存储层2231和第二元数据文件的描述可以参考前述图2-图3实施例,这里不再重复赘述。第二存储层2232基于非日志结构树管理第二元数据文件。
具体实现中,上述非日志结构树可包括B树、B+树、字典序列树、跳表等等,本申请不作具体限定。
在一可能的实现方式中,在第一存储层存储有第三元数据文件时,第一元数据文件和第三元数据文件进行合并获得的第四元数据文件,可被写入至第二元数据文件,完成元数据的持久化存储。具体实现中,可将第四元数据文件和第二元数据文件进行归并排序,将第四元数据文件和第二元数据文件中的每个元数据进行排序,比如相同 的元数据的不同版本按写入顺序进行排序,然后确定将第四元数据文件写入第二元数据文件时所需修改的修改节点和修改内容,然后以事务写的方式,对修改节点进行批量修改,从而将第四元数据文件写入第二元数据文件,其中,修改节点可以是非日志结构树的部分叶子节点、中间节点以及根节点等等。
具体实现中,在对第四元数据文件和第二元数据文件进行归并排序后,可根据排序结果对第四元数据文件和第二元数据文件进行合并,获得合并文件,删除旧版本的数据,保留新版本的数据,然后根据合并文件确定修改节点和修改内容。举例来说,第四元数据文件包括元数据X1=A2,X3=C,第二元数据文件包括元数据X1=A1,X2=B,那么对第四元数据文件和第二元数据文件进行归并排序后,可以获得X1=A2,X1=A1,X2=B,X3=C,由于X1的最新版本为A2,那么第二元数据文件中的X1=A1可以被删除,那么合并文件可包括X1=A2,X2=B,X3=C。根据该合并文件,可以对第二元数据文件中X1所在的叶子节点进行修改,对X2所在的叶子节点不进行修改,并新增叶子节点记录元数据X3,具体可以将上述多个修改操作添加至一个事务中,以事务为单位进行本次修改操作,从而完成对第二元数据文件的修改。可以理解的,通过修改第二元数据文件的方式对元数据进行持久化存储,并且以事务写所需的少量开销换取数据量较大的存储层进行合并时带来的巨额开销,可以解决存储系统合并任务耗时久、资源消耗大的问题。
在一可能的实现方式中,步骤S420之后,存储系统200还可接收查询请求,根据该查询请求,依次查询内存210、第一存储层2231以及第二存储层2232,直至查询到所需的元数据。可以理解的,相比于图1实施例中基于LSM树的存储系统100,图1所示的存储系统100在数据查询时,需要依次查询内存110、C0层、C1层、C2层、C3层等等,在大容量场景下,图1所示的存储系统的查询效率非常低,而本申请提供的方法,可依次查询内存210、第一存储层2231以及第二存储层2232获得所需的元数据,能够提高查询效率,降低查询时延。
本申请提供的存储系统200可基于上述步骤S410~步骤S420对所存储的数据进行管理,下面结合具体的实施例,对存储系统200的写操作流程进行说明。
图5是本申请提供的一种写操作流程的步骤流程示意图,如图5所示,本申请提供的写操作流程包括以下步骤:
S510:接收写请求,其中,该写请求携带有元数据X。
S520:确定写请求的写入类型,其中,写入类型包括顺序写和随机写,在写入类型是顺序写的情况下,执行步骤S530~步骤S550,在写入类型是随机写的情况下,执行步骤S560。
应理解,顺序写请求和随机写请求的特性不同,顺序写请求指的是顺序请求大量数据,比如顺序写请求可以包括数据库执行大量的查询请求、流媒体服务请求等等,以数据库中的请求为例,顺序写请求可以包括对数据库执行写重做(redo)/撤销(undo)日志操作请求、流媒体服务请求等等。随机写请求指的是随机请求少量的数据,比如万维网(world wide web,Web)服务请求,邮箱(mail)服务请求等。上述对于顺序写和随机写的举例用于说明,本申请不作具体限定。
因此,对于顺序写请求来说,由于请求的数量较大,当第二存储层以上的存储元 都没有存储该顺序写请求的元数据的历史版本,可以直接将元数据写入第二存储层2232进行持久化处理,不仅消除存储层合并带来的系统开销,同时减少第二存储层2232中非日志结构树的写放大问题。对于随机写请求来说,可以先将随机写请求的元数据写入第一存储层2231,由第一存储层2231中的日志结构树对其进行管理,经过元数据文件的合并后,此时再通过对第二存储层2232中的第二元数据文件进行修改的方式,将元数据进行持久化存储,可以有效减少第二存储层2232中非日志结构树的写放大问题。
S530:查询内存和第一存储层2231是否存储有历史版本的元数据X。其中,在内存210和第一存储层2231没有历史版本的元数据X的情况下,执行步骤S550,在内存210和/或第一存储层2232包括历史版本的元数据X的情况下,执行步骤S540。
S540:将元数据X写入第一存储层2231。
可以理解的,在内存210和/或第一存储层2231包括历史版本的元数据X的情况下,如果将元数据X写入第二存储层2232,存储系统200接收到元数据X的读请求时,若按照内存210、第一存储层2231和第二存储层2232的顺序依次查询,将会先查询到内存210或者第一存储层2231中的历史版本的元数据X,但是第二存储层2232中的元数据X才是最新版本的元数据。而在将顺序请求写入第二存储层2232之前先判定一下内存210和第一存储层2231中是否包含该元数据,可以避免该问题的出现,提高元数据读取的准确性。
S550:将元数据X写入第二存储层2232。
S560:将元数据X写入第一存储层2231。
参考前述图2~图4实施例可知,元数据X写入第一存储层2231后,第一存储层2231可基于日志结构树对元数据X进行管理,具体的,元数据X可先被写入第一存储层2231中的第一元数据文件中,然后在第一元数据文件达到上限的情况下,存储系统将第一元数据文件和第三元数据文件进行合并,获得第四元数据文件,使得元数据X被写入第四元数据文件中,将第四元数据文件写入第二存储层2232中的第二元数据文件中,从而实现元数据X的持久化存储。应理解,步骤S560之后,存储系统200对元数据X进行处理的步骤流程可参考参考前述图2~图4实施例,具体可参考第一存储层基于日志结构树对第一元数据文件进行管理的相关内容,这里不再重复赘述。
综上可知,本申请提供的存储系统中元数据管理方法,基于日志结构树管理第一存储层2231中的不可修改写的第一元数据文件,基于非日志结构树管理第二存储层2232中的可修改写的第二元数据文件,可以在处理读请求时,由于存储层数较少,提升元数据读取的效率,而且在处理写请求时,可将顺序写请求的元数据直接写入第二存储层2232,消除存储层合并带来的系统开销,也可将随机写请求的元数据写入第一存储层后,再根据第一存储层2231中的元数据,批量修改第二存储层2232中的元数据文件中的元数据,以事务开销的代价换取合并的开销,从而解决基于日志结构树的存储系统在大容量场景下,占用存储空间且系统资源消耗越大的问题。
上面详细阐述了本申请实施例的方法,为了便于更好地实施本申请实施例上述方案,相应地,下面还提供用于配合上述实施方案的相关设备。
图6是本申请提供的一种存储系统中的元数据管理装置,该元数据管理装置可以 是前述内容中的处理器230,该元数据管理装置包括:第一存储单元610和第二存储单元620。
第一存储单元610用于在存储系统的第一存储层存储不可修改写的第一元数据文件,第二存储单元620,用于在存储系统的第二存储层存储可修改写的第二元数据文件,第一存储层基于日志结构树管理第一元数据文件,第二存储层基于非日志结构树管理第二元数据文件。
可选的,该元数据管理装置还包括合并单元630和第一写入单元640。第一存储层还存储不可修改写的第三元数据文件;第一存储层基于日志结构树管理第三元数据文件;合并单元630,用于合并第一元数据文件和第三元数据文件得到第四元数据文件;第一写入单元640,用于将第四元数据文件中存储的元数据写入到第二元数据文件。
可选的,该元数据管理装置还包括接收单元650以及第二写入单元660。接收单元650,用于接收顺序写请求;顺序写请求中携带元数据;第二写入单元660,用于将顺序写请求中携带的元数据写入第二元数据文件中。
可选地,第二存储层基于B树管理第二元数据文件。
可选地,第二存储层基于B+树管理第二元数据文件。
可选地,第一写入单元640具体用于:将第四元数据文件和第二元数据文件进行排序,确定第二元数据文件的修改节点;对修改节点进行修改,将第四元数据文件中存储的元数据写入到第二元数据文件。
可选地,第一写入单元640具体用于:以事务写的方式,对修改节点进行批量修改。
本申请实施例的元数据管理装置可对应于执行本申请图2至图5实施例中描述的方法,并且处理器230中的各个模块和/或功能分别为了实现图2至图5中的各个方法的相应流程,为了简洁,在此不再赘述。
综上可知,本申请提供的元数据管理装置,在存储系统的第一存储层基于日志结构树管理不可修改写的第一元数据文件,在存储系统的第二存储层基于非日志结构树管理可修改写的第二元数据文件。在处理元数据读取请求时,由于存储层数较少,可以提升元数据读取的效率,在处理元数据写入请求时,可将顺序写请求的元数据直接写入第二存储层,消除存储层合并带来的系统开销,也可将随机写请求的元数据写入第一存储层后,再根据第一存储层中的数据,批量修改第二存储层,以事务开销的代价换取合并的开销,从而解决基于日志结构树的存储系统在大容量场景下,占用存储空间且系统资源消耗越大的问题。
图7是本申请提供的一种存储阵列700的硬件结构示意图,如图7所示,存储阵列700可以是前述内容的存储系统200。其中,该存储阵列700包括存储控制器710和至少一个存储器720,其中,存储控制器710和至少一个存储器720通过总线730或网络相互连接。总线730可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线730可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储控制器710可以包含图2实施例中的处理器230和元数据管理装置600,存储控制器710包括一个或者多个处理器,其中处理器包括CPU、微处理器、微控制器、主处理器、控制器以及ASIC等等。
至少一个存储器720可以是图2实施例中的硬盘220和内存210,具体可以是非易失性存储器,例如HDD、SSD、SCM等等,还可以包括上述种类的存储器的组合。例如,存储阵列700可以是由多个HDD或者多个SDD组成,或者,存储阵列700可以是由多个HDD或者多个SCM组成。其中,至少一个存储器720在存储控制器710的控制下按不同的方式组合起来形成存储器组,从而提供比单个存储器更高的存储性能。
需要说明的,图7仅仅是本申请实施例的一种可能的实现方式,实际应用中,存储阵列700还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图1-图6实施例中的相关阐述,这里不再赘述。
存储系统还可以是分布式存储系统,或者具有实现上述方案的单台服务器等,本发明对存储系统的具体形态不作限定。
本申请实施例还提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在处理器上运行时,图4和图5所示的方法流程得以实现。
本申请实施例还提供一种计算机程序产品,当计算机程序产品在处理器上运行时,图4和图5所示的方法流程得以实现。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括至少一个计算机指令。在计算机上加载或执行计算机程序指令时,全部或部分地产生按照本发明实施例的流程或功能。计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含至少一个可用介质集合的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(Digital Video Disc,DVD)、或者半导体介质。半导体介质可以是SSD。
以上,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (17)

  1. A metadata management method in a storage system, wherein the method is performed by one or more processors in the storage system and comprises:
    storing, in a first storage layer of the storage system, a first metadata file that cannot be modified after being written; and
    storing, in the second storage layer, a second metadata file that can be modified in place; wherein the first storage layer manages the first metadata file based on a log-structured tree, and the second storage layer manages the second metadata file based on a non-log-structured tree.
  2. The method according to claim 1, wherein
    the first storage layer further stores a third metadata file that cannot be modified after being written, and the first storage layer manages the third metadata file based on the log-structured tree; and the method further comprises:
    merging the first metadata file and the third metadata file to obtain a fourth metadata file; and
    writing the metadata stored in the fourth metadata file into the second metadata file.
  3. The method according to claim 1 or 2, further comprising:
    receiving a sequential write request, wherein the sequential write request carries metadata; and
    writing the metadata carried in the sequential write request into the second metadata file.
  4. The method according to any one of claims 1 to 3, wherein
    the second storage layer manages the second metadata file based on a B-tree.
  5. The method according to any one of claims 1 to 3, wherein
    the second storage layer manages the second metadata file based on a B+ tree.
  6. The method according to claim 2, wherein writing the metadata stored in the fourth metadata file into the second metadata file comprises:
    sorting the fourth metadata file and the second metadata file to determine modification nodes of the second metadata file; and
    modifying the modification nodes to write the metadata stored in the fourth metadata file into the second metadata file.
  7. The method according to claim 6, wherein modifying the modification nodes comprises: modifying the modification nodes in batches by means of transactional writes.
  8. A storage system, wherein the storage system comprises one or more processors and a storage device in communication with the one or more processors; the storage device is configured to provide a first storage layer and a second storage layer; and the one or more processors are configured to:
    store, in the first storage layer of the storage system, a first metadata file that cannot be modified after being written; and
    store, in the second storage layer, a second metadata file that can be modified in place; wherein the first storage layer manages the first metadata file based on a log-structured tree, and the second storage layer manages the second metadata file based on a non-log-structured tree.
  9. The storage system according to claim 8, wherein the first storage layer further stores a third metadata file that cannot be modified after being written, and the first storage layer manages the third metadata file based on the log-structured tree; and
    the one or more processors are further configured to:
    merge the first metadata file and the third metadata file to obtain a fourth metadata file; and
    write the metadata stored in the fourth metadata file into the second metadata file.
  10. A computer-readable storage medium, wherein the computer-readable storage medium contains computer program instructions, and one or more processors in a storage system execute the program instructions to cause the storage system to implement the method according to any one of claims 1 to 7.
  11. A metadata management apparatus in a storage system, wherein the metadata management apparatus comprises:
    a first storage unit, configured to store, in a first storage layer of the storage system, a first metadata file that cannot be modified after being written; and
    a second storage unit, configured to store, in a second storage layer of the storage system, a second metadata file that can be modified in place; wherein the first storage layer manages the first metadata file based on a log-structured tree, and the second storage layer manages the second metadata file based on a non-log-structured tree.
  12. The metadata management apparatus according to claim 11, wherein the first storage layer further stores a third metadata file that cannot be modified after being written; the first storage layer manages the third metadata file based on the log-structured tree; and the metadata management apparatus further comprises:
    a merging unit, configured to merge the first metadata file and the third metadata file to obtain a fourth metadata file; and
    a first writing unit, configured to write the metadata stored in the fourth metadata file into the second metadata file.
  13. The metadata management apparatus according to claim 11 or 12, wherein the metadata management apparatus further comprises:
    a receiving unit, configured to receive a sequential write request, wherein the sequential write request carries metadata; and
    a second writing unit, configured to write the metadata carried in the sequential write request into the second metadata file.
  14. The metadata management apparatus according to any one of claims 11 to 13, wherein
    the second storage layer manages the second metadata file based on a B-tree.
  15. The metadata management apparatus according to any one of claims 11 to 13, wherein
    the second storage layer manages the second metadata file based on a B+ tree.
  16. The metadata management apparatus according to claim 12, wherein the first writing unit is specifically configured to:
    sort the fourth metadata file and the second metadata file to determine modification nodes of the second metadata file; and
    modify the modification nodes to write the metadata stored in the fourth metadata file into the second metadata file.
  17. The metadata management apparatus according to claim 16, wherein the first writing unit is specifically configured to: modify the modification nodes in batches by means of transactional writes.
PCT/CN2021/100904 2020-12-10 2021-06-18 一种存储系统中元数据管理方法、装置及存储系统 WO2022121274A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011454997 2020-12-10
CN202011454997.X 2020-12-10
CN202110245752.4A CN114625713A (zh) 2020-12-10 2021-03-05 一种存储系统中元数据管理方法、装置及存储系统
CN202110245752.4 2021-03-05

Publications (1)

Publication Number Publication Date
WO2022121274A1 true WO2022121274A1 (zh) 2022-06-16

Family

ID=81896889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100904 WO2022121274A1 (zh) 2020-12-10 2021-06-18 一种存储系统中元数据管理方法、装置及存储系统

Country Status (2)

Country Link
CN (1) CN114625713A (zh)
WO (1) WO2022121274A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722449A (zh) * 2012-05-24 2012-10-10 中国科学院计算技术研究所 基于SSD的Key-Value型本地存储方法及系统
US20160378653A1 (en) * 2015-06-25 2016-12-29 Vmware, Inc. Log-structured b-tree for handling random writes
CN110851434A (zh) * 2018-07-27 2020-02-28 阿里巴巴集团控股有限公司 一种数据存储方法、装置及设备
CN109933570A (zh) * 2019-03-15 2019-06-25 中山大学 一种元数据管理方法、系统及介质

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230333983A1 (en) * 2022-04-18 2023-10-19 Samsung Electronics Co., Ltd. Systems and methods for a cross-layer key-value store architecture with a computational storage device

Also Published As

Publication number Publication date
CN114625713A (zh) 2022-06-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901986

Country of ref document: EP

Kind code of ref document: A1