WO2022105585A1 - Data storage method, apparatus, device, and storage medium - Google Patents

Data storage method, apparatus, device, and storage medium

Info

Publication number
WO2022105585A1
WO2022105585A1 PCT/CN2021/127981 CN2021127981W WO2022105585A1 WO 2022105585 A1 WO2022105585 A1 WO 2022105585A1 CN 2021127981 W CN2021127981 W CN 2021127981W WO 2022105585 A1 WO2022105585 A1 WO 2022105585A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
disk
written
persistent memory
writing
Prior art date
Application number
PCT/CN2021/127981
Other languages
English (en)
French (fr)
Inventor
韩银俊
屠要峰
高洪
陈正华
田海东
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2022105585A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065 Replication mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device

Definitions

  • The embodiments of the present application relate to the field of communication technologies, and in particular to a data storage method, apparatus, device, and storage medium.
  • An embodiment of the present application provides a data storage method, including: aligning data to be written according to a minimum allocation unit; obtaining the aligned partial data and the unaligned partial data of the data to be written; writing the aligned partial data to a disk and writing the unaligned partial data to persistent memory; and writing the unaligned partial data from the persistent memory to the disk.
  • An embodiment of the present application further provides a data storage apparatus, including: an alignment module, configured to align the data to be written according to a minimum allocation unit; an acquisition module, configured to obtain the aligned partial data and the unaligned partial data of the data to be written; a first writing module, configured to write the aligned partial data to a disk and write the unaligned partial data to persistent memory; and a second writing module, configured to write the unaligned partial data from the persistent memory to the disk.
  • An embodiment of the present application further provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the data storage method described above.
  • Embodiments of the present application further provide a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the foregoing data storage method is implemented.
  • FIG. 1 is a schematic flowchart of a data storage method provided by a first embodiment of the present application.
  • FIG. 2 is a schematic diagram of the principle of the data storage method provided by the first embodiment of the present application.
  • FIG. 3 is an example diagram of the principle framework of the data storage method provided by the first embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data storage method provided by a second embodiment of the present application.
  • FIG. 5 is a schematic diagram of the principle of the data storage method provided by the second embodiment of the present application.
  • FIG. 6 is a schematic diagram of a module structure of a data storage device provided by a third embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by a fourth embodiment of the present application.
  • The main purpose of the embodiments of the present application is to provide a data storage method, apparatus, device, and storage medium that can reduce write amplification when writing data to a disk.
  • The first embodiment of the present application relates to a data storage method: the data to be written is aligned according to the minimum allocation unit, the aligned part and the unaligned part of the data are obtained, the aligned part is written to the disk, the unaligned part is written to persistent memory, and the unaligned part is then written from persistent memory to the disk. Because the aligned part can be written to the disk in whole blocks without causing other old data to be rewritten, write amplification during disk writes is reduced. At the same time, because the unaligned part is written to persistent memory first and then from persistent memory to the disk, the persistence of persistent memory preserves the unaligned part and ensures that it is eventually stored on the disk. In addition, writing the unaligned part to persistent memory before writing it to the disk increases the probability that unaligned data is merged in persistent memory, which further reduces write amplification when the data is written to the disk.
  • It should be noted that the execution body of the data storage method provided by the embodiment of the present application may be a storage system that combines a disk with persistent memory, and specifically may be a processor of a terminal or a server, where the server may be implemented as a single server or as a cluster of multiple servers.
  • The specific process of the data storage method provided by the embodiment of the present application is shown in FIG. 1 and includes the following steps:
  • S101 Align the data to be written according to the minimum allocation unit.
  • S102 Acquire the aligned partial data and the unaligned partial data of the data to be written.
  • S103 Write the aligned partial data to the disk, and write the unaligned partial data to the persistent memory.
  • S104 Write the unaligned partial data to the disk from the persistent memory.
  • The disk may be a solid-state drive (SSD) or a hard disk drive (HDD); preferably, the disk is a solid-state drive.
  • FIG. 2 is a schematic diagram of an exemplary principle of a data storage method provided by an embodiment of the present application.
  • Specifically, the data to be written is aligned according to the minimum allocation unit (Minimal Allocate Size, MAS), yielding aligned partial data in the middle and unaligned partial data at the head and tail. The aligned partial data in the middle is written directly to the disk, and the unaligned partial data at the head and tail is written to persistent memory. Finally, the unaligned partial data is written from persistent memory to the disk.
  • Optionally, before the data to be written is aligned according to the minimum allocation unit, its size can be checked. If the data to be written is smaller than the minimum allocation unit, it is written directly to persistent memory and then from persistent memory to the disk; if it is greater than or equal to the minimum allocation unit, it is aligned according to the minimum allocation unit.
  • Optionally, if the data to be written is equal to the minimum allocation unit, the alignment step is skipped and the data is written directly to the disk.
  • In a specific example, the aligned partial data is written to the raw device of the disk. Because a raw device is not buffered by the file system and is not directly managed by the operating system, this reduces the overhead that a file system or intermediate storage system adds to data storage and helps improve data throughput (IO) efficiency.
  • To make data easier to manage, the metadata of the data is usually managed as well.
  • In a specific example, the data storage method provided by the embodiment of the present application further includes: writing the metadata of the data to be written to persistent memory. Because persistent memory has low read and write latency, storing the metadata there makes metadata reads and writes convenient and improves the efficiency of data management.
  • Optionally, when the aligned part is written to the disk, the metadata of the aligned part is updated in persistent memory; when the unaligned part is written to persistent memory, the metadata of the unaligned part is updated in persistent memory; and when the unaligned part is written from persistent memory to the disk, the metadata of the unaligned part is updated in persistent memory again.
  • Preferably, writing the metadata of the data to be written to persistent memory includes: writing the metadata into a skip list in persistent memory, that is, managing the metadata in the form of a skip list.
  • Of course, metadata can also be managed in other forms in persistent memory, for example as a tree structure.
  • Compared with a tree structure, a skip list is better suited to concurrent access. When a tree is modified under concurrent access it usually has to be rebalanced, and rebalancing may affect a large part of the tree and require mutex locks on many tree nodes. An operation on a skip list affects only the node itself and the nodes immediately before and after it, so changing the skip list structure does not require locking and synchronizing the entire skip list, which gives better concurrency.
  • In addition, a skip list has the same query complexity, O(lg n), as a balanced binary tree, so it offers better concurrency at the same query performance.
  • In practice, skip lists include locked and lock-free variants; preferably, the skip list is a lock-free concurrent skip list.
  • A locked skip list can be locked and so offers better safety, but its structure is more complex and its read and write efficiency is lower; a lock-free skip list has a relatively simple structure and higher read and write efficiency, which is better for managing metadata.
  • FIG. 3 is an example diagram of a principle framework of a data storage method provided by an embodiment of the present application.
  • Specifically, the data to be written is received from the object store (ObjectStore) and aligned according to the minimum allocation unit. The aligned part on the left is written to the raw disk device, the unaligned part is written to persistent memory, and the metadata (meta) of the data to be written is written to persistent memory through the Persistent Memory Development Kit (PMDK).
  • When data is managed with the data storage method provided by the embodiment of the present application, reading and deletion may be performed as follows.
  • For a read, after a read request is received, the skip list in persistent memory is searched by the requested range to locate the data on the disk (aligned partial data, or unaligned partial data already dumped to the disk) or in persistent memory (unaligned partial data not yet dumped to the disk); the two parts are then merged and returned to the upper-layer application.
  • For a delete, the data index is deleted and the space index is released directly in the skip list in persistent memory, according to the range of the delete request. After the data index is deleted, the data does not need to be explicitly deleted from the disk and persistent memory, because the indexes that manage data space on both are stored in the skip list in persistent memory; once an index is deleted, the corresponding space is released.
  • In the data storage method provided by the embodiment of the present application, after the data to be written is aligned according to the minimum allocation unit, the aligned part is written to the disk, and the unaligned part is written to persistent memory first and then from persistent memory to the disk. Because the aligned part can be written to the disk in whole blocks without causing other old data to be rewritten, write amplification during disk writes is reduced. At the same time, the persistence of persistent memory preserves the unaligned part and ensures that it is eventually stored on the disk. In addition, staging the unaligned part in persistent memory increases the probability that unaligned data is merged there, which further reduces write amplification when the data is written to the disk.
  • The second embodiment of the present application relates to a data storage method. The second embodiment is substantially the same as the first; the main difference is that, in this embodiment, writing the unaligned partial data from persistent memory to the disk includes: writing the unaligned partial data from persistent memory to the disk when a preset time interval is reached or the used space of persistent memory reaches a preset threshold.
  • The specific process of the data storage method provided by the embodiment of the present application is shown in FIG. 4 and includes the following steps:
  • S201 Align the data to be written according to the minimum allocation unit.
  • S202 Acquire the aligned partial data and the unaligned partial data of the data to be written.
  • S203 Write the aligned partial data to the disk, and write the unaligned partial data to the persistent memory.
  • S204 When the preset time interval is reached or the used space of persistent memory reaches the preset threshold, write the unaligned partial data from the persistent memory to the disk.
  • S201-S203 are the same as S101-S103 in the first embodiment; see the description there, which is not repeated here.
  • The preset time interval can be set as needed and is not specifically limited here, provided that the persistent memory space is not fully occupied within the interval.
  • Optionally, while ensuring that persistent memory is not fully occupied, the preset time interval can be set relatively long, so that all the unaligned partial data is written to persistent memory before being written to the disk.
  • Writing the unaligned partial data from persistent memory to the disk only when the preset time interval is reached amounts to a lazy write strategy. It further increases the probability that unaligned partial data is merged in persistent memory, which further reduces write amplification when the data is written to the disk. Because data is merged in persistent memory, the amount of data read later is also reduced, and storage space in PMEM is saved.
  • The preset threshold can be set as needed, for example 50%, 60%, or 70%, and is not specifically limited here.
  • FIG. 5 is a schematic diagram of an exemplary principle of a data storage method provided by an embodiment of the present application.
  • Specifically, when the data to be written is obtained, it is aligned according to the minimum allocation unit. The aligned partial data is written directly to the disk, and the skip list index (that is, the metadata) in persistent memory is updated; the unaligned partial data is written directly to persistent memory, and the skip list index in persistent memory is updated. When the periodic timer expires or the capacity limit (used space of persistent memory) is exceeded, the unaligned partial data is dumped from persistent memory to the disk, and the skip list index in persistent memory is updated.
  • Because data on the disk is read and written in block-aligned units, dumping the unaligned partial data to the disk first reads the disk content in block-aligned units, merges it with the unaligned data, and then updates it on the disk in place.
  • In the data storage method provided by the embodiment of the present application, writing the unaligned partial data from persistent memory to the disk when the preset time interval is reached or the used space of persistent memory reaches the preset threshold increases the probability that unaligned partial data is merged in persistent memory. This in turn increases the probability of whole-block writes when the unaligned part is written to the disk, reduces write amplification for the unaligned part, and reduces the number of disk writes and data fragmentation. In addition, as write amplification is reduced, the jitter it caused is reduced as well. Because the unaligned part is first written to persistent memory, which can hold data directly without serialization, the amount of serialization and deserialization is also reduced, lowering processor (CPU) consumption.
  • The third embodiment of the present application relates to a data storage apparatus 300, as shown in FIG. 6, including: an alignment module 301, an acquisition module 302, a first writing module 303, and a second writing module 304. The functions of each module are described in detail as follows:
  • The alignment module 301 is configured to align the data to be written according to the minimum allocation unit.
  • The acquisition module 302 is configured to acquire the aligned partial data and the unaligned partial data of the data to be written.
  • The first writing module 303 is configured to write the aligned partial data to the disk and write the unaligned partial data to the persistent memory.
  • The second writing module 304 is configured to write the unaligned partial data from the persistent memory to the disk.
  • Further, the second writing module 304 is also configured to: write the unaligned partial data from persistent memory to the disk when the preset time interval is reached or the used space of persistent memory reaches the preset threshold.
  • Further, the first writing module 303 is also configured to: when the aligned partial data is written to the disk as an overwrite, use copy-on-write to write the aligned partial data to the disk.
  • Further, the first writing module 303 is also configured to: write the aligned partial data to the raw device of the disk.
  • Further, the data storage apparatus 300 provided in this embodiment of the present application also includes a third writing module, configured to: write the metadata of the data to be written to the persistent memory.
  • Further, the third writing module is also configured to: write the metadata of the data to be written into the skip list in persistent memory.
  • Further, the skip list is a lock-free concurrent skip list.
  • Further, the disk is a solid-state drive.
  • This embodiment is an apparatus embodiment corresponding to the foregoing embodiments, and it can be implemented in cooperation with them.
  • The related technical details mentioned in the foregoing embodiments remain valid in this embodiment and are not repeated here.
  • Correspondingly, the related technical details mentioned in this embodiment can also be applied to the foregoing embodiments.
  • It is worth mentioning that the modules involved in this embodiment are all logical modules. In practice, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • In addition, to highlight the innovative part of the present application, this embodiment does not introduce units that are not closely related to solving the technical problem raised by the present application, but this does not mean that no other units exist in this embodiment.
  • The fourth embodiment of the present application relates to an electronic device, as shown in FIG. 7, comprising: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can execute the above-mentioned data storage method.
  • The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and it connects the various circuits of the one or more processors and the memory.
  • The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein.
  • The bus interface provides the interface between the bus and the transceiver.
  • A transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • The data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.
  • The fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • The above method embodiments are implemented when the computer program is executed by the processor.
  • That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include: USB flash drive, removable hard disk, Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

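A data storage method, relating to the field of communication technologies. The method includes: aligning data to be written according to a minimum allocation unit (S101); obtaining the aligned partial data and the unaligned partial data of the data to be written (S102); writing the aligned partial data to a disk and writing the unaligned partial data to persistent memory (S103); and writing the unaligned partial data from the persistent memory to the disk (S104).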

Description

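Data storage method, apparatus, device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202011321690.2, filed on November 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of communication technologies, and in particular to a data storage method, apparatus, device, and storage medium.
Background
Traditional disks, namely SSDs (solid-state drives) and HDDs (hard disk drives), and HDDs in particular, show a large gap between random-access and sequential-access performance. To get the most out of a disk, random writes are often converted into sequential writes. An LSM-Tree, for example, buffers and sorts update operations in memory and then writes them to the disk sequentially in batches.
Although converting random writes into sequential writes exploits the disk's sequential-access advantage, writing data sequentially forces some old data to be rewritten to other locations on the disk, which causes write amplification.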
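Summary
An embodiment of the present application provides a data storage method, including: aligning data to be written according to a minimum allocation unit; obtaining the aligned partial data and the unaligned partial data of the data to be written; writing the aligned partial data to a disk and writing the unaligned partial data to persistent memory; and writing the unaligned partial data from the persistent memory to the disk.
An embodiment of the present application further provides a data storage apparatus, including: an alignment module, configured to align data to be written according to a minimum allocation unit; an acquisition module, configured to obtain the aligned partial data and the unaligned partial data of the data to be written; a first writing module, configured to write the aligned partial data to a disk and write the unaligned partial data to persistent memory; and a second writing module, configured to write the unaligned partial data from the persistent memory to the disk.
An embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above data storage method.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above data storage method is implemented.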
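Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments.
FIG. 1 is a schematic flowchart of the data storage method provided by the first embodiment of the present application;
FIG. 2 is an example diagram of the principle of the data storage method provided by the first embodiment of the present application;
FIG. 3 is an example diagram of the principle framework of the data storage method provided by the first embodiment of the present application;
FIG. 4 is a schematic flowchart of the data storage method provided by the second embodiment of the present application;
FIG. 5 is an example diagram of the principle of the data storage method provided by the second embodiment of the present application;
FIG. 6 is a schematic diagram of the module structure of the data storage apparatus provided by the third embodiment of the present application;
FIG. 7 is a schematic structural diagram of the electronic device provided by the fourth embodiment of the present application.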
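Detailed Description
The main purpose of the embodiments of the present application is to provide a data storage method, apparatus, device, and storage medium that can reduce write amplification when writing data to a disk.
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments to help the reader understand the present application; however, the claimed technical solutions can be realized even without these details and with various changes and modifications based on the following embodiments. The division into embodiments is for convenience of description and does not limit the specific implementation of the present application; the embodiments may be combined with and refer to one another where they do not conflict.
The first embodiment of the present application relates to a data storage method: the data to be written is aligned according to the minimum allocation unit, the aligned part and the unaligned part of the data are obtained, the aligned part is written to the disk, the unaligned part is written to persistent memory, and the unaligned part is then written from persistent memory to the disk. Because the aligned part can be written to the disk in whole blocks without causing other old data to be rewritten, write amplification during disk writes is reduced. At the same time, because the unaligned part is written to persistent memory first and then from persistent memory to the disk, the persistence of persistent memory preserves the unaligned part and ensures that it is eventually stored on the disk. In addition, writing the unaligned part to persistent memory before writing it to the disk increases the probability that unaligned data is merged in persistent memory, which further reduces write amplification when the data is written to the disk.
It should be noted that the execution body of the data storage method provided by the embodiments of the present application may be a storage system that combines a disk with persistent memory, and specifically may be a processor of a terminal or a server, where the server may be implemented as a single server or as a cluster of multiple servers.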
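The specific process of the data storage method provided by the embodiment of the present application is shown in FIG. 1 and includes the following steps:
S101: Align the data to be written according to the minimum allocation unit.
S102: Obtain the aligned partial data and the unaligned partial data of the data to be written.
S103: Write the aligned partial data to the disk, and write the unaligned partial data to persistent memory.
S104: Write the unaligned partial data from persistent memory to the disk.
S101-S104 are described in detail below.
The disk may be a solid-state drive (SSD) or a hard disk drive (HDD); preferably, the disk is a solid-state drive.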
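Refer to FIG. 2, an example diagram of the principle of the data storage method provided by the embodiment of the present application. Specifically, the data to be written is aligned according to the minimum allocation unit (Minimal Allocate Size, MAS), yielding aligned partial data in the middle and unaligned partial data at the head and tail. The aligned partial data in the middle is written directly to the disk, and the unaligned partial data at the head and tail is written to persistent memory. Finally, the unaligned partial data is written from persistent memory to the disk.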
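The head/middle/tail split described for FIG. 2 can be sketched as plain offset arithmetic. This is a minimal illustration, not the applicant's implementation: the function name `split_by_mas`, the `SplitWrite` tuple, and the choice to describe a write as a byte `(offset, length)` pair are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes; hypothetical representation


class SplitWrite(NamedTuple):
    head: Optional[Range]     # unaligned prefix, destined for persistent memory
    aligned: Optional[Range]  # whole allocation units, destined for the disk
    tail: Optional[Range]     # unaligned suffix, destined for persistent memory


def split_by_mas(offset: int, length: int, mas: int) -> SplitWrite:
    """Split a write into unaligned head, aligned middle, and unaligned tail."""
    if mas <= 0 or offset < 0 or length < 0:
        raise ValueError("mas must be positive; offset and length non-negative")
    end = offset + length
    first = -(-offset // mas) * mas  # first MAS boundary at or after offset
    last = (end // mas) * mas        # last MAS boundary at or before end
    if first >= last:
        # The write covers no whole allocation unit, so all of it is unaligned.
        return SplitWrite((offset, length) if length else None, None, None)
    head = (offset, first - offset) if first > offset else None
    tail = (last, end - last) if end > last else None
    return SplitWrite(head, (first, last - first), tail)
```

With a 4096-byte unit, a 10000-byte write at offset 1000 yields a 3096-byte head, one whole unit in the middle, and a 2808-byte tail; a write that covers no whole unit is returned entirely as unaligned data.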
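Optionally, before the data to be written is aligned according to the minimum allocation unit, its size can be checked. If the data to be written is smaller than the minimum allocation unit, it is written directly to persistent memory and then from persistent memory to the disk; if it is greater than or equal to the minimum allocation unit, it is aligned according to the minimum allocation unit. Optionally, if the data to be written is equal to the minimum allocation unit, the alignment step is skipped and the data is written directly to the disk.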
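There are two cases when the aligned partial data is written to the disk: the data is newly written, or it overwrites existing data. For a new write, the aligned partial data is simply written to the disk. When the aligned partial data is written as an overwrite, copy-on-write (COW) can optionally be used to write it to the disk, to guarantee data integrity and transactional consistency.
In a specific example, the aligned partial data is written to the raw device of the disk. Because a raw device is not buffered by the file system and is not directly managed by the operating system, this reduces the overhead that a file system or intermediate storage system adds to data storage and helps improve data throughput (IO) efficiency.
When the unaligned partial data is written to persistent memory, suitable instructions can be used for the write, for example ntstore or clwb: ntstore for relatively large pieces of data and clwb for relatively small pieces. It should be understood that the instructions used may differ between storage systems; ntstore and clwb are only examples and should not be taken as limiting.
To make data easier to manage, the metadata of the data is usually managed as well. In a specific example, the data storage method provided by the embodiment of the present application further includes: writing the metadata of the data to be written to persistent memory. Because persistent memory has low read and write latency, storing the metadata there makes metadata reads and writes convenient and improves the efficiency of data management. Optionally, when the aligned part is written to the disk, the metadata of the aligned part is updated in persistent memory; when the unaligned part is written to persistent memory, the metadata of the unaligned part is updated in persistent memory; and when the unaligned part is written from persistent memory to the disk, the metadata of the unaligned part is updated in persistent memory again.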
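Preferably, writing the metadata of the data to be written to persistent memory includes: writing the metadata into a skip list in persistent memory, that is, managing the metadata in the form of a skip list. Of course, metadata can also be managed in other forms in persistent memory, for example as a tree structure. Compared with a tree structure, a skip list is better suited to concurrent access. When a tree is modified under concurrent access it usually has to be rebalanced, and rebalancing may affect a large part of the tree and require mutex locks on many tree nodes. An operation on a skip list affects only the node itself and the nodes immediately before and after it, so changing the skip list structure does not require locking and synchronizing the entire skip list, which gives better concurrency. In addition, a skip list has the same query complexity, O(lg n), as a balanced binary tree, so it offers better concurrency at the same query performance.
In practice, skip lists include locked and lock-free variants; preferably, the skip list is a lock-free concurrent skip list. A locked skip list can be locked and so offers better safety, but its structure is more complex and its read and write efficiency is lower; a lock-free skip list has a relatively simple structure and higher read and write efficiency, which is better for managing metadata.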
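To make the locality argument concrete, the sketch below shows a basic skip list in which an insert or removal only rewires the forward pointers of the predecessor nodes found during the search. It is a single-threaded, in-DRAM illustration only: it is neither lock-free nor stored in persistent memory, and the class and method names (`SkipList`, `put`, `get`, `remove`) are assumptions for the example rather than anything specified in the application.

```python
import random
from typing import Any, List, Optional


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key: Any, value: Any, level: int) -> None:
        self.key = key
        self.value = value
        self.forward: List[Optional["_Node"]] = [None] * level


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_LEVEL)
        self._level = 1  # number of levels currently in use

    def _random_level(self) -> int:
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key: Any) -> List[_Node]:
        # For each level, find the last node whose key is smaller than `key`.
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def get(self, key: Any) -> Any:
        node = self._predecessors(key)[0].forward[0]
        return node.value if node is not None and node.key == key else None

    def put(self, key: Any, value: Any) -> None:
        update = self._predecessors(key)
        node = update[0].forward[0]
        if node is not None and node.key == key:
            node.value = value
            return
        level = self._random_level()
        self._level = max(self._level, level)
        new = _Node(key, value, level)
        for i in range(level):  # only the predecessors' pointers change
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def remove(self, key: Any) -> bool:
        update = self._predecessors(key)
        node = update[0].forward[0]
        if node is None or node.key != key:
            return False
        for i in range(len(node.forward)):
            if update[i].forward[i] is node:
                update[i].forward[i] = node.forward[i]
        return True

    def keys(self) -> List[Any]:
        out, node = [], self._head.forward[0]
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out
```

No rebalancing step appears anywhere: each mutation touches a handful of pointers next to the affected node, which is the property the text relies on when it prefers a skip list over a balanced tree for concurrent metadata updates.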
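Refer to FIG. 3, an example diagram of the principle framework of the data storage method provided by the embodiment of the present application. Specifically, the data to be written is received from the object store (ObjectStore) and aligned according to the minimum allocation unit. The aligned part on the left is written to the raw disk device, the unaligned part is written to persistent memory, and the metadata (meta) of the data to be written is written to persistent memory through the Persistent Memory Development Kit (PMDK).
When data is managed with the data storage method provided by the embodiment of the present application, reading and deletion may be performed as follows.
For a read, after a read request is received, the skip list in persistent memory is first searched by the requested range to find the exact location of the data on the disk (aligned partial data, or unaligned partial data already dumped to the disk) or in persistent memory (unaligned partial data not yet dumped to the disk); the two parts are then merged and returned to the upper-layer application.
For a delete, the data index is deleted and the space index is released directly in the skip list in persistent memory, according to the range of the delete request. After the data index is deleted, the data does not need to be explicitly deleted from the disk and persistent memory, because the indexes that manage data space on both are stored in the skip list in persistent memory; once an index is deleted, the corresponding space is released.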
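The read and delete paths above can be illustrated with a toy index that records, for each extent, whether its bytes live on the disk or in persistent memory. Everything here is an assumption for illustration: the `Extent` and `Store` names, the two `bytearray` objects standing in for the devices, and the sorted Python list standing in for the skip list. The delete is also simplified to drop whole extents that overlap the range instead of splitting partially covered ones.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Extent:
    offset: int  # logical offset of the extent within the object
    length: int
    device: str  # "disk" or "pmem"
    addr: int    # position of the bytes inside that device


class Store:
    def __init__(self, size: int = 64) -> None:
        self.devices: Dict[str, bytearray] = {
            "disk": bytearray(size),
            "pmem": bytearray(size),
        }
        self.index: List[Extent] = []  # stand-in for the skip list, sorted by offset

    def put(self, ext: Extent, data: bytes) -> None:
        if len(data) != ext.length:
            raise ValueError("data length must match extent length")
        self.devices[ext.device][ext.addr:ext.addr + ext.length] = data
        self.index.append(ext)
        self.index.sort(key=lambda e: e.offset)

    def read(self, offset: int, length: int) -> bytes:
        """Look up the range in the index and merge bytes from both devices."""
        out = bytearray(length)  # ranges with no extent read back as zeros
        end = offset + length
        for ext in self.index:
            lo = max(offset, ext.offset)
            hi = min(end, ext.offset + ext.length)
            if lo < hi:
                src = ext.addr + (lo - ext.offset)
                out[lo - offset:hi - offset] = self.devices[ext.device][src:src + hi - lo]
        return bytes(out)

    def delete(self, offset: int, length: int) -> int:
        """Drop index entries overlapping the range; device bytes are not touched."""
        end = offset + length
        keep = [e for e in self.index
                if e.offset + e.length <= offset or e.offset >= end]
        removed = len(self.index) - len(keep)
        self.index = keep
        return removed
```

After a delete, the bytes are still physically present on the device, but nothing in the index points to them any more, so the space is free to be reused; this mirrors the statement that no explicit deletion on the disk or in persistent memory is needed.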
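In the data storage method provided by the embodiment of the present application, after the data to be written is aligned according to the minimum allocation unit, the aligned part is written to the disk, and the unaligned part is written to persistent memory first and then from persistent memory to the disk. Because the aligned part can be written to the disk in whole blocks without causing other old data to be rewritten, write amplification during disk writes is reduced. At the same time, the persistence of persistent memory preserves the unaligned part and ensures that it is eventually stored on the disk. In addition, staging the unaligned part in persistent memory increases the probability that unaligned data is merged there, which further reduces write amplification when the data is written to the disk.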
本申请第二实施例涉及一种数据存储方法,第二实施例与第一实施例大致相同,主要区别在于:在本申请实施例中,将非对齐部分数据从持久性内存写入磁盘,包括:当达到预设时间间隔或持久性内存的已使用空间达到预设阈值时,将非对齐部分数据从持久性内存写入磁盘。
本申请实施例提供的数据存储方法的具体流程如图4所示,具体包括以下步骤:
S201:将待写入数据按照最小分配单元进行对齐。
S202:获取待写入数据的对齐部分数据和非对齐部分数据。
S203:将对齐部分数据写入磁盘,并将非对齐部分数据写入持久性内存。
S204:当达到预设时间间隔时或持久性内存的已使用空间达到预设阈值时,将非对齐部分数据从持久性内存写入磁盘。
其中,S201-S203与第一实施例中的S101-103相同,具体可以参见第一实施例中的相关描述,为了避免重复,这里不再赘述。
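S204 is described in detail as follows:
The preset time interval may be set according to actual needs and is not specifically limited here, as long as the space of the persistent memory is not fully occupied within the preset time interval. Optionally, while ensuring that the space of the persistent memory is not fully occupied, the preset time interval may be set relatively long, so that all of the unaligned data is written to the persistent memory before being written to the disk. It can be understood that writing the unaligned portion from the persistent memory to the disk when the preset time interval is reached amounts to a lazy write strategy. This further increases the chance that unaligned data is merged in the persistent memory and thus further reduces write amplification when the data to be written is written to the disk. Moreover, because data is merged in the persistent memory, the amount of data read subsequently is reduced and storage space in the persistent memory (PMEM) is saved.
The preset threshold may be set according to actual needs, for example 50%, 60% or 70%, and is not specifically limited here.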
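The trigger condition of S204 can be expressed as a small predicate. The function name `should_flush`, its parameters and the default threshold of 0.6 are assumptions chosen for illustration; the application leaves both the interval and the threshold open.

```python
def should_flush(now, last_flush, interval_s, used_bytes, capacity_bytes, threshold=0.6):
    """True when the timer has expired or persistent memory usage is too high."""
    timer_expired = (now - last_flush) >= interval_s
    space_pressure = used_bytes >= threshold * capacity_bytes
    return timer_expired or space_pressure
```

A longer interval leaves more time for fragments to merge in the persistent memory, while the usage threshold acts as a safety valve so that the persistent memory is never completely filled.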
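Please refer to FIG. 5, which is an example diagram of the principle of the data storage method provided by this embodiment. Specifically, when the data to be written is obtained, it is aligned by the minimum allocation unit. The aligned portion is written directly to the disk, and the skip-list index (that is, the metadata) in the persistent memory is updated. The unaligned portion is written directly to the persistent memory, and the skip-list index in the persistent memory is updated. When the periodic timer expires or the capacity limit (the used space of the persistent memory) is exceeded, the unaligned portion is flushed from the persistent memory to the disk and the skip-list index in the persistent memory is updated. Because data on the disk is always read and written in block-aligned units, flushing the unaligned portion to the disk first reads the affected disk contents block-aligned, merges the unaligned data into them, and then updates the result in place on the disk.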
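The block-aligned read, merge and in-place update described for the flush step can be sketched as follows. A `bytearray` stands in for the disk, and the name `flush_fragment` is hypothetical; the sketch assumes the disk size is a multiple of the block size.

```python
def flush_fragment(disk, offset, data, block_size):
    """Merge an unaligned fragment into the disk using whole-block reads and writes."""
    start = (offset // block_size) * block_size                # round down
    end = -(-(offset + len(data)) // block_size) * block_size  # round up
    buf = bytearray(disk[start:end])                           # block-aligned read
    rel = offset - start
    buf[rel:rel + len(data)] = data                            # merge the fragment
    disk[start:end] = buf                                      # in-place write back
    return start, end - start                                  # range actually rewritten
```

The returned range shows how much was physically rewritten, which is the quantity that merging fragments in the persistent memory beforehand is meant to keep small.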
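In the data storage method provided by this embodiment, the unaligned portion is written from the persistent memory to the disk when the preset time interval is reached or the used space of the persistent memory reaches the preset threshold. This increases the chance that unaligned data is merged in the persistent memory, and hence the chance that it can be written in whole blocks when it is flushed to the disk, which reduces write amplification for the unaligned portion, the number of disk writes, and data fragmentation. In addition, as write amplification is reduced, the jitter that it used to cause is reduced accordingly. Finally, because the persistent memory can store data directly without serialization, writing the unaligned portion to the persistent memory first also reduces the proportion of data that is serialized and deserialized, lowering processor (CPU) consumption.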
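The merging effect can be illustrated by coalescing buffered fragments before they are flushed. The function name `coalesce` and the representation of fragments as `(offset, length)` pairs are assumptions for this sketch; it tracks ranges only and ignores the data bytes themselves.

```python
def coalesce(ranges):
    """Merge overlapping or adjacent (offset, length) ranges."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous run
        else:
            merged.append([start, end])              # start a new run
    return [(s, e - s) for s, e in merged]
```

For instance, three fragments of 1024, 1024 and 2048 bytes that happen to be contiguous merge into a single 4096-byte range, which could then be written as one whole block instead of three read-modify-write cycles.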
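The third embodiment of the present application relates to a data storage apparatus 300 which, as shown in FIG. 6, includes an alignment module 301, an acquisition module 302, a first writing module 303 and a second writing module 304. The functions of the modules are described in detail as follows:
The alignment module 301 is configured to align the data to be written according to the minimum allocation unit;
the acquisition module 302 is configured to obtain the aligned portion and the unaligned portion of the data to be written;
the first writing module 303 is configured to write the aligned portion to the disk and write the unaligned portion to the persistent memory;
the second writing module 304 is configured to write the unaligned portion from the persistent memory to the disk.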
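Further, the second writing module 304 is also configured to:
write the unaligned portion from the persistent memory to the disk when the preset time interval is reached or the used space of the persistent memory reaches the preset threshold.
Further, the first writing module 303 is also configured to:
when the aligned portion is written to the disk in overwrite mode, write the aligned portion to the disk using copy-on-write.
Further, the first writing module 303 is also configured to:
write the aligned portion to a raw device of the disk.
Further, the data storage apparatus 300 provided by this embodiment also includes a third writing module, which is configured to write the metadata of the data to be written into the persistent memory.
Further, the third writing module is also configured to write the metadata of the data to be written into a skip list in the persistent memory.
Further, the skip list is a lock-free concurrent skip list.
Further, the disk is a solid-state drive.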
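It is easy to see that this embodiment is an apparatus embodiment corresponding to the foregoing embodiments and can be implemented in cooperation with them. The relevant technical details mentioned in the foregoing embodiments remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the foregoing embodiments.
It is worth mentioning that the modules involved in this embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present application, units that are not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.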
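The fourth embodiment of the present application relates to an electronic device which, as shown in FIG. 7, includes: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor 401 to perform the above data storage method.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and connects the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; further, the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions. The memory may be used to store data used by the processor when performing operations.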
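The fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the above method embodiments.
That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.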

Claims (11)

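  1. A data storage method, comprising:
    aligning data to be written according to a minimum allocation unit;
    obtaining an aligned portion and an unaligned portion of the data to be written;
    writing the aligned portion to a disk, and writing the unaligned portion to a persistent memory;
    writing the unaligned portion from the persistent memory to the disk.
  2. The data storage method according to claim 1, wherein writing the unaligned portion from the persistent memory to the disk comprises:
    writing the unaligned portion from the persistent memory to the disk when a preset time interval is reached or the used space of the persistent memory reaches a preset threshold.
  3. The data storage method according to claim 1 or 2, wherein writing the aligned portion to the disk comprises:
    when the aligned portion is written to the disk in overwrite mode, writing the aligned portion to the disk using copy-on-write.
  4. The data storage method according to any one of claims 1-3, wherein writing the aligned portion to the disk comprises:
    writing the aligned portion to a raw device of the disk.
  5. The data storage method according to any one of claims 1-4, further comprising:
    writing metadata of the data to be written to the persistent memory.
  6. The data storage method according to claim 5, wherein writing the metadata of the data to be written to the persistent memory comprises:
    writing the metadata of the data to be written to a skip list in the persistent memory.
  7. The data storage method according to claim 6, wherein the skip list is a lock-free concurrent skip list.
  8. The data storage method according to any one of claims 1-7, wherein the disk is a solid-state drive.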
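  9. A data storage apparatus, comprising:
    an alignment module, configured to align data to be written according to a minimum allocation unit;
    an acquisition module, configured to obtain an aligned portion and an unaligned portion of the data to be written;
    a first writing module, configured to write the aligned portion to a disk and write the unaligned portion to a persistent memory;
    a second writing module, configured to write the unaligned portion from the persistent memory to the disk.
  10. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the data storage method according to any one of claims 1 to 8.
  11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data storage method according to any one of claims 1 to 8.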
PCT/CN2021/127981 2020-11-23 2021-11-01 Data storage method, apparatus, device and storage medium WO2022105585A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011321690.2A CN113485635A (zh) 2020-11-23 2020-11-23 Data storage method, apparatus, device and storage medium
CN202011321690.2 2020-11-23

Publications (1)

Publication Number Publication Date
WO2022105585A1 true WO2022105585A1 (zh) 2022-05-27

Family

ID=77932590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127981 WO2022105585A1 (zh) 2020-11-23 2021-11-01 Data storage method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN113485635A (zh)
WO (1) WO2022105585A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
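CN113485635A (zh) * 2020-11-23 2021-10-08 中兴通讯股份有限公司 Data storage method, apparatus, device and storage medium
CN115951841B (zh) * 2023-02-27 2023-06-20 浪潮电子信息产业股份有限公司 Storage system and creation method, data processing method, apparatus, device and medium
CN116069267B (zh) * 2023-04-06 2023-07-14 苏州浪潮智能科技有限公司 Write caching method, system, device and storage medium for a RAID card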

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120066190A1 (en) * 2010-09-10 2012-03-15 International Business Machines Corporation Handling file operations with low persistent storage space
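CN109558457A (zh) * 2018-12-11 2019-04-02 浪潮(北京)电子信息产业有限公司 Data writing method, apparatus, device and storage medium
CN110018792A (zh) * 2019-04-10 2019-07-16 苏州浪潮智能科技有限公司 Method and apparatus for processing data to be flushed to disk, electronic device and storage medium
CN110209341A (zh) * 2018-03-23 2019-09-06 腾讯科技(深圳)有限公司 Data writing method, apparatus and storage device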
US20200363998A1 (en) * 2020-08-07 2020-11-19 Intel Corporation Controller and persistent memory shared between multiple storage devices
CN113485635A (zh) * 2020-11-23 2021-10-08 中兴通讯股份有限公司 Data storage method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN113485635A (zh) 2021-10-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893733

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21893733

Country of ref document: EP

Kind code of ref document: A1