WO2024051109A1 - Data storage method, device, system, equipment and medium - Google Patents
Data storage method, device, system, equipment and medium
- Publication number
- WO2024051109A1 (PCT/CN2023/078282)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: data, area, data block, block, memory
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
- G06F3/0631—Configuration or reconfiguration of storage systems by allocating resources to storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
Definitions
- This application relates to the field of distributed storage technology, and in particular to a data storage method, device, system, equipment and computer non-volatile readable storage medium.
- A distributed storage system usually consists of a main control server, storage servers and multiple clients. Its essence is to distribute a large number of files evenly across multiple storage servers, which gives it high scalability and high reliability. At the same time, distributed storage systems can be used in a variety of scenarios, and commercial demands place ever higher performance requirements on them.
- A file storage engine is commonly used as the back-end storage engine of a distributed storage system.
- The file storage engine manages the OSD (Object Storage Device, a process that returns specific data in response to client requests) data on the storage nodes through a file system.
- File storage engine technology is mature, but because the file storage engine is built on a journaling file system it suffers from a double-write problem.
- To guarantee data reliability, one write request to the file storage engine is converted into two write operations: the log is written synchronously first, and the data is then written asynchronously to the storage disk medium. Therefore, as the amount of data grows and the IO (Input/Output) pressure increases, storage system performance degrades severely.
- Many current journaling file systems use Nvme SSDs (non-volatile memory express solid-state drives) as the log device to improve storage IO performance. Research and field experience show that in scenarios with massive small-file IO, storage IO performance fluctuates heavily: writing massive small data blocks back to the back-end file system on the persistent disk drive is much slower than writing the log, and NVMe SSD utilization is extremely low. When small files are being written back to the slow disk for persistent storage and the write-back queue blocks because it is full, the log queue sits idle and the performance advantage of the Nvme SSD cannot be exploited.
- To address this, a block device storage engine has been proposed. Since this storage engine does not depend on a journaling file system and can manage block devices directly, it reduces the performance loss caused by log double writes. However, with massive small-file data, the block device storage engine has to rely on an embedded key-value store (RocksDB) to manage metadata and small data blocks. Metadata and small-data key-value pairs written by users are first written to a write-ahead log on disk and only then to the data itself, so RocksDB's logging scheme also causes a double-write problem.
- the purpose of the embodiments of this application is to provide a data storage method, device, system, equipment and computer non-volatile readable storage medium, which can reduce the IO delay of the storage system.
- a data storage method including:
- when IO data is acquired, the IO data is converted into the memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
- according to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; wherein the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- converting IO data into memory according to the set data structure area and set data block granularity includes:
- the data length and offset information of each IO data block are determined; where the offset information includes logical offset and actual offset;
- before dividing the IO data into IO data blocks according to the set data block granularity, the method further includes: setting the value of the data block granularity to 4KB.
- mapping IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
- when performing an overwrite write operation on IO data, if the IO data already has object data in the logical address space on the hardware storage device, the first IO data block, which belongs to the overlapping area and meets the data block granularity requirement, is written to a newly allocated kernel buffer;
- the previous data block, which is forward-adjacent to the first IO data block, is append-written to the free kernel buffer adjacent to the newly allocated kernel buffer;
- the subsequent data block, which does not meet the data block granularity requirement and is backward-adjacent to the first IO data block, is written to the log area;
- based on the data length and offset information corresponding to the subsequent data block, the subsequent data block is overwritten into the corresponding kernel buffer.
- writing the first IO data block that belongs to the overlapping area and meets the data block granularity requirement to the newly allocated kernel buffer includes: determining, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement; allocating, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the existing object data storage area for the first IO data block; and writing the first IO data block to the target kernel buffer at the byte granularity of the memory.
- writing the subsequent data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block to the log area includes: writing the subsequent data block to the log area through a set consistency transaction interface.
- after the subsequent data block is overwritten into the corresponding kernel buffer, the method further includes: judging whether the subsequent data block stored in the log area has reached a set duration; if it has reached the set duration, deleting the subsequent data block stored in the log area.
- mapping IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
- when performing an overwrite write operation on IO data, if the IO data has no object data in the logical address space on the hardware storage device, addressing is performed at the byte granularity of the memory to map the IO data to the kernel buffer corresponding to the logical address.
- mapping IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
- when performing an append write operation of IO data, a new kernel buffer is allocated for the IO data according to the information stored in the data structure area corresponding to the IO data, and addressing is performed at the byte granularity of the memory to write the IO data to the new kernel buffer;
- after the IO data is written to the new kernel buffer, the method further includes: updating the metadata through the consistency transaction interface.
- the IO data acquisition process includes:
- the protocol access interface includes object interface, block interface and file system interface; different protocol access interfaces have their corresponding data slicing methods;
- after the data to be processed is divided into IO data according to the corresponding slicing method, the method further includes: aggregating each piece of IO data and its corresponding copy data into the same group, and synchronously performing, for the data in the same group, the step of converting the IO data into the memory according to the set data structure area and the set data block granularity.
- aggregating each piece of IO data and its corresponding copy data into the same group includes: splitting the data into 4MB objects; mapping the objects to the same group through hash calculation.
- mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: calculating the write position of the IO data from the metadata information of the object, and writing the IO data, according to the minimum allocation unit of 4KB agreed for the non-volatile memory device, using the write method corresponding to the object at that position.
- the write methods include create-append write and overwrite-modify write.
- the data blocks to be written are A, C, C, and B, where the data blocks C, C, and B are the data to be written in the overlapping data area, and the data block A is the data to be written in the non-overlapping area.
- both data block C and data block A meet the data block granularity requirements
- data block B does not meet the data block granularity requirements.
- mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing the overwrite write operation on the IO data, writing the two data blocks C into a newly allocated kernel buffer; append-writing data block A into the free kernel buffer adjacent to the newly allocated kernel buffer; writing data block B into a log block in the log area; and, based on the data length and offset information corresponding to data block B, overwriting data block B into the corresponding kernel buffer.
- Embodiments of the present application also provide a data storage device, including a conversion unit and a mapping unit. The conversion unit is configured to, when IO data is acquired, convert the IO data into the memory according to the set data structure area and the set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- the mapping unit is configured to map the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein the user buffer and the kernel buffer share one piece of mapping data, and the user buffer is a buffer on the hardware storage device.
- the conversion unit includes a dividing subunit, a first writing subunit, a determining subunit, a second writing subunit and a third writing subunit;
- the dividing subunit is configured to divide IO data into IO data blocks according to the set data block granularity; each IO data block has its corresponding number information;
- the first writing subunit is configured to write metadata information of each IO data block into the first area
- the determination subunit is configured to determine the data length and offset information of each IO data block based on the metadata information of each IO data block; wherein the offset information includes a logical offset and an actual offset;
- the second writing subunit is configured to write the number information, data length and offset information corresponding to each IO data block into the second area;
- the third writing subunit is configured to write each IO data block into the third area.
- the mapping unit includes a new allocation writing sub-unit, an append writing sub-unit, a log area writing sub-unit and an overwriting writing sub-unit;
- the newly allocated write subunit is configured to, when performing an overwrite write operation on IO data, if the IO data already has object data in the logical address space on the hardware storage device, write the first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer;
- the append write subunit is configured to append write the previous data block forward adjacent to the first IO data block into the free kernel buffer adjacent to the newly allocated kernel buffer;
- the log area writing subunit is configured to write the subsequent data blocks that do not meet the data block granularity requirements and are adjacent to the first IO data block into the log area;
- the overwrite writing subunit is configured to overwrite and write the subsequent data block to the corresponding kernel buffer based on the data length and offset information corresponding to the subsequent data block.
- the newly allocated write subunit is configured to determine, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement; allocate, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the existing object data storage area for the first IO data block; and write the first IO data block to the target kernel buffer at the byte granularity of the memory.
- the log area writing subunit is configured to write the subsequent data blocks that do not meet the data block granularity requirements and are backward adjacent to the first IO data block into the log area through the set consistency transaction interface.
- it also includes a judgment unit and a deletion unit;
- the judgment unit is configured to judge, after the overwrite writing subunit overwrites the subsequent data block into the corresponding kernel buffer, whether the subsequent data block stored in the log area has reached a set duration;
- the deletion unit is configured to delete the subsequent data block stored in the log area if the subsequent data block stored in the log area has reached the set duration.
- the mapping unit is configured to, when performing an overwrite write operation on IO data, if the IO data has no object data in the logical address space on the hardware storage device, perform addressing at the byte granularity of the memory to map the IO data to the kernel buffer corresponding to the logical address.
- the mapping unit includes an allocation subunit and a writing subunit;
- the allocation subunit is configured to, when performing an append write operation of IO data, allocate a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data;
- the writing subunit is configured to perform addressing at the byte granularity of the memory to write the IO data to the new kernel buffer.
- for the IO data acquisition process, the device includes an acquisition subunit and a slicing subunit;
- the acquisition subunit is configured to obtain, according to a set protocol access interface, the data to be processed transmitted by the client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing methods;
- the slicing subunit is configured to segment the data to be processed according to the corresponding slicing method to obtain IO data.
- an aggregation unit is also included.
- the aggregation unit is configured to aggregate each IO data and its corresponding copy data into the same group;
- the conversion unit is configured to synchronously execute the steps of converting IO data to the memory according to the set data structure area and the set data block granularity for the data in the same group.
- Embodiments of the present application also provide a data storage system, including a storage management module, a transmission interface and a hardware storage device; the storage management module is connected to the hardware storage device through the transmission interface;
- the storage management module is configured to, when acquiring IO data, convert the IO data into the memory according to the set data structure area and the set data block granularity; wherein, the data structure area includes a block for storing metadata information. a first area, a second area used to store data description information, a third area used to store object data, and a log area;
- the storage management module is configured to map the IO data to the kernel buffer through the transmission interface according to the byte granularity of the memory and the information stored in the data structure area; wherein the user buffer and the kernel buffer share one piece of mapping data, and the user buffer is a buffer on the hardware storage device.
- the transmission interface includes a unit interface for transmitting data that meets the data block granularity requirement, and a consistency transaction interface for transmitting data that does not meet the data block granularity requirement.
- An embodiment of the present application also provides an electronic device, including:
- a memory, configured to store a computer program; and
- a processor, configured to execute the computer program to implement the steps of the above data storage method.
- Embodiments of the present application also provide a computer non-volatile readable storage medium.
- a computer program is stored on the computer non-volatile readable storage medium.
- when the computer program is executed by a processor, the steps of the above data storage method are implemented.
- It can be seen from the above technical solution that, when IO data is acquired, the IO data is converted into the memory according to the set data structure area and the set data block granularity; the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- According to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- By setting the data structure area, memory mapping of IO data is supported, which reduces repeated copying of data and shortens the IO path.
- The mapping process is based on the operating system's direct memory access copy: because the user buffer and the kernel buffer share one piece of mapped data, establishing the shared mapping removes the need to copy IO data from the kernel buffer to the user buffer, which further shortens the IO path.
- Managing the hardware storage device through memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive small files.
- Figure 1 is a schematic diagram of a hardware composition framework applicable to a data storage method provided by an embodiment of the present application
- Figure 2 is a flow chart of a data storage method provided by an embodiment of the present application.
- Figure 3 is an architectural diagram of a distributed storage system provided by an embodiment of the present application.
- Figure 4 is a schematic diagram of the interaction between a storage engine and a hardware storage device provided by an embodiment of the present application
- Figure 5 is a schematic flow chart of an overwriting operation provided by an embodiment of the present application.
- Figure 6 is a schematic structural diagram of a data storage device provided by an embodiment of the present application.
- Figure 7 is a schematic structural diagram of a data storage system provided by an embodiment of the present application.
- FIG. 1 is a schematic diagram of a hardware composition framework suitable for a data storage method provided by an embodiment of the present application.
- the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output interface 104, and a communication component 105.
- the processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps in the data storage method;
- the memory 102 is configured to store various types of data to support the operation of the electronic device 100; this data may include, for example, instructions for any application or method configured to operate on the electronic device 100, as well as application-related data.
- the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or one or more of magnetic disks or optical disks.
- SRAM static random access memory
- EEPROM Electrically erasable programmable read-only memory
- EPROM Erasable Programmable Read-Only Memory
- PROM Programmable Read-Only Memory
- ROM Read-Only Memory
- the memory 102 stores at least programs and/or data for implementing the following functions:
- when IO data is acquired, the IO data is converted into the memory according to the set data structure area and the set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
- according to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; wherein the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- Multimedia components 103 may include screen and audio components.
- the screen may be a touch screen, for example, and the audio component is used to output and/or input audio signals.
- the audio component may include a microphone for receiving external audio signals.
- the received audio signals may be further stored in memory 102 or sent via communication component 105 .
- the audio component also includes at least one speaker for outputting audio signals.
- the information input/information output interface 104 provides an interface between the processor 101 and other interface modules.
- the other interface modules may be keyboards, mice, buttons, etc. These buttons can be virtual buttons or physical buttons.
- Communication component 105 is configured to perform wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them, so the corresponding communication component 105 may include a Wi-Fi module, a Bluetooth module and an NFC module.
- the electronic device 100 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for performing the data storage method.
- ASIC Application Specific Integrated Circuit
- DSP Digital Signal Processor
- DSPD Digital Signal Processing Device
- PLD programmable logic Device
- FPGA Field Programmable Gate Array
- the structure of the electronic device 100 shown in FIG. 1 does not constitute a limitation on the electronic device in the embodiments of the present application; the electronic device 100 may include more or fewer components than those shown in FIG. 1, or combine certain components.
- Figure 2 is a flow chart of a data storage method provided by an embodiment of the present application. The method includes:
- the data structure area may include a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- IO data can be divided into IO data blocks according to the set data block granularity; each IO data block has its corresponding number information.
- the value of the data block granularity can be set based on actual requirements; for example, it can be set to 4KB. The reason is that solid-state flash devices show little difference between random and sequential IO performance, whereas the random IO performance of mechanical disks is much lower than their sequential IO performance.
- a 4KB allocation unit can reduce the number of requests to be processed in small-IO scenarios, thereby improving storage system performance.
- Each IO data block has its corresponding metadata information.
- the metadata information of each IO data block can be written into the first area.
- the data description information may include number information, data length and offset information corresponding to each IO data block.
- the data length and offset information of each IO data block can be determined based on the metadata information of each IO data block, where the offset information can include a logical offset and an actual offset; the number information, data length and offset information corresponding to each IO data block are written into the second area, and each IO data block is written into the third area.
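- Purely as an illustration, the following is a minimal C sketch of how the set data structure area and the per-block description information might be represented in memory; the type and field names (memstore_layout, block_desc and so on) are hypothetical and not taken from this application.

```c
#include <stdint.h>

#define BLOCK_GRANULARITY 4096          /* set data block granularity (4KB) */

/* Second area: one descriptor per IO data block (number, length, offsets). */
struct block_desc {
    uint64_t block_no;                  /* number information of the IO data block */
    uint64_t data_len;                  /* data length */
    uint64_t logical_off;               /* logical offset */
    uint64_t actual_off;                /* actual (physical) offset */
};

/* Hypothetical view of the set data structure area. */
struct memstore_layout {
    void              *first_area;      /* metadata information (superblock) */
    struct block_desc *second_area;     /* data description information */
    void              *third_area;      /* object data blocks */
    void              *log_area;        /* log blocks for transactional writes */
};
```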
- the data storage method provided by this application is suitable for distributed storage systems.
- the distributed storage system can obtain the data to be processed transmitted by the client according to the set protocol access interface; wherein, the protocol access interface can include Object interface, block interface and file system interface.
- Protocol access interfaces have their corresponding data slicing methods. Therefore, after the distributed storage system obtains the data to be processed through the protocol access interface, it will segment the data to be processed according to the corresponding slicing method to obtain IO data.
- each piece of IO data and its corresponding copy data can be aggregated into the same group, and the step of converting the IO data into the memory according to the set data structure area and the set data block granularity is performed synchronously for the data in the same group.
- for client IO data, the data is first split into 4MB objects, the objects are mapped to the same group through hash calculation, and the objects within the group are then kept synchronized, on which basis data is read and written.
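- A minimal sketch of the object-to-group mapping described above, assuming a generic hash: the 4MB object size comes from the text, while the hash function and the group count are placeholders and not part of this application.

```c
#include <stdint.h>

#define OBJECT_SIZE (4u * 1024 * 1024)   /* client data is split into 4MB objects */

/* Hypothetical group count; a real system would derive this from the cluster map. */
#define GROUP_COUNT 256u

/* FNV-1a hash over the object name; any stable hash would serve the same purpose. */
static uint64_t hash_name(const char *name)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* An object and all of its copies map to the same group. */
static uint32_t object_to_group(const char *object_name)
{
    return (uint32_t)(hash_name(object_name) % GROUP_COUNT);
}
```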
- FIG 3 is an architectural diagram of a distributed storage system provided by an embodiment of the present application.
- the distributed storage system provides three protocol access interfaces, object, block and file system, corresponding respectively to the object gateway service (RadosGW), the block device service (RADOS Block Device, RBD) and the file service (LibFS).
- RadosGW: object gateway service
- RBD: RADOS Block Device, block device service
- LibFS: file service
- RADOS: Reliable Autonomic Distributed Object Store, the basic storage system
- the file system protocol additionally requires a metadata cluster; monitoring processes maintain the cluster status. Data is stored in storage pools and is mapped through the storage engine onto hardware storage devices such as HDDs (Hard Disk Drives) or SSDs (Solid State Drives).
- HDD Hard Disk Drive, hard disk drive
- SSD Solid State Drives, solid state drive
- FIG 4 is a schematic diagram of the interaction between a storage engine and a hardware storage device provided by an embodiment of the present application.
- the storage engine (MemStore) can interact with the hardware storage device through a memory map (Memory Map) unit interface and a consistency transaction interface.
- when the MemStore storage engine obtains IO data, it can store the IO data according to the divided first area, second area, third area and log area.
- the superblock represents the first area
- the metadata area represents the second area
- the data area represents the third area.
- the MemStore storage engine is completely modular for client IO operations. Client IO requests do not need to be modified and have no impact on upper-layer interface adaptation.
- the superblock can be designed to be 4K, and the metadata, object data, and log blocks are all 4MB.
- the superblock is mainly used to store structural information of the system itself, that is, metadata such as the data structure that describes the overall system; the metadata area records the description information of the system's object data, such as the object number, its corresponding data area, the logical offset of the object data, the data length, and the actual physical offset and length on the Nvme SSD device.
- when adding metadata, following the principle that each metadata area has a fixed size, a new 16-byte pointer is set to point to the start address of the new metadata area, which keeps the write of new metadata atomic.
- the last area is the log area, which performs transaction processing on data written to Nvme SSD.
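- To make the fixed area sizes and the 16-byte pointer switch concrete, here is a speculative C sketch; the 4K superblock and the 4MB metadata, data and log blocks come from the text, while the pointer layout and the helper name are assumptions.

```c
#include <stdint.h>

#define SUPERBLOCK_SIZE (4u * 1024)          /* superblock designed as 4K */
#define AREA_BLOCK_SIZE (4u * 1024 * 1024)   /* metadata, object data and log blocks are 4MB */

/* Hypothetical 16-byte pointer to the start of the newest metadata area.
 * Because metadata areas have a fixed size, publishing new metadata only
 * requires switching this pointer, which keeps the append atomic. */
struct meta_ptr {
    uint64_t offset;   /* physical start address of the new metadata area */
    uint64_t seq;      /* sequence number used to pick the latest pointer */
};

static void publish_metadata(struct meta_ptr *slot, uint64_t new_off, uint64_t new_seq)
{
    struct meta_ptr p = { .offset = new_off, .seq = new_seq };
    /* In this sketch the pointer is switched with a plain 16-byte copy; a real
     * engine would use a CPU-level atomic or an ordered persist to the device. */
    *slot = p;
}
```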
- S202 Map the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area.
- the user buffer and the kernel buffer share a piece of mapping data;
- the user buffer is a buffer on the hardware storage device.
- the MemStore storage engine accesses the Nvme SSD device driver through Memory Map.
- the management and data IO of Nvme SSD devices are addressing operations at byte granularity.
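- The Memory Map access path could look roughly like the following user-space sketch, assuming the Nvme device is exposed at a hypothetical path such as /dev/nvme0n1; mmap() gives the engine byte-granularity addressing over a shared mapping, so no extra copy between user and kernel buffers is needed. This is only an illustration of the mechanism, not the engine's actual implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node; a real deployment would pick the actual Nvme namespace. */
    int fd = open("/dev/nvme0n1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t map_len = 4096 * 1024;            /* size of the region to manage */
    void *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Byte-granularity addressing: write object data at an arbitrary byte offset. */
    memcpy((char *)base + 12345, "object data", 11);

    msync(base, map_len, MS_SYNC);           /* flush the shared mapping to the device */
    munmap(base, map_len);
    close(fd);
    return 0;
}
```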
- when a distributed storage client requests a write of operation data to the MemStore storage engine, the MemStore storage engine calculates the location of the data area to be written from the metadata information of the Object, and writes according to the minimum allocation unit of 4KB agreed for the Nvme SSD.
- there are different ways to write Object objects, which can generally be divided into two types: create-append write and overwrite-modify write.
- for the overwrite failure scenario, the embodiment of this application designs a consistency transaction interface: data that does not fill a 4KB unit is first written into the log area, and the data in the log area is then transmitted to the Nvme SSD storage device through the consistency transaction interface. This guards against the data inconsistency that a failure during data flushing could otherwise cause, because the transaction's atomic operations keep the data consistent.
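- The consistency transaction interface is not specified in detail in the text; the following is a speculative sketch of how a block smaller than 4KB might be staged in the log area and only then applied, so that a failure during flushing cannot leave a half-applied overwrite. The record layout and helper name are assumptions, and persistence barriers (for example msync or device flushes between steps) are omitted for brevity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical log record for a block that does not meet the 4KB granularity. */
struct log_record {
    uint64_t target_off;     /* where the data will finally be overwritten */
    uint32_t len;            /* length of the small block */
    uint32_t committed;      /* 0 = staged, 1 = safe to apply or replay */
    uint8_t  payload[4096];
};

/* Stage, mark committed, then apply: if a crash happens before 'committed' is
 * set, the record is ignored on recovery and the old data stays consistent. */
static void tx_overwrite_small(struct log_record *log, uint8_t *device_base,
                               uint64_t off, const void *data, uint32_t len)
{
    log->target_off = off;
    log->len = len;
    memcpy(log->payload, data, len);
    log->committed = 1;                             /* transaction is now replayable */

    memcpy(device_base + off, log->payload, len);   /* apply to the mapped data area */
    log->committed = 0;                             /* retire the log record */
}
```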
- in the overwrite case, the first IO data block, which belongs to the overlapping area and meets the data block granularity requirement, is written to a newly allocated kernel buffer; the previous data block, forward-adjacent to the first IO data block, is append-written to the free kernel buffer adjacent to the newly allocated kernel buffer; the subsequent data block, which does not meet the data block granularity requirement and is backward-adjacent to the first IO data block, is written to the log area; and, based on the data length and offset information corresponding to the subsequent data block, the subsequent data block is overwritten into the corresponding kernel buffer.
- subsequent data blocks that do not meet the data block granularity requirements and are backward adjacent to the first IO data block can be written into the log area through the set consistency transaction interface.
- the process of newly allocating the kernel buffer for the first IO data may include determining from the IO data the first IO data block that belongs to the overlapping area and meets the data block granularity requirements; based on the information stored in the data structure area corresponding to the first IO data block , allocate the target kernel buffer adjacent to the existing object data storage area for the first IO data block; write the first IO data block into the target kernel buffer according to the byte granularity of the memory.
- a retention duration can be set for data written to the log area: it is judged whether the subsequent data block stored in the log area has reached the set duration; if it has, the data block has been stored for a long time and generally no longer has any use, so the subsequent data block stored in the log area can be deleted.
- when performing an overwrite write operation, if the IO data has no object data in the logical address space on the hardware storage device, addressing can be performed at the byte granularity of the memory to map the IO data directly to the kernel buffer corresponding to the logical address.
- Figure 5 is a schematic flow chart of an overwrite operation provided by an embodiment of the present application. Assume the data blocks to be written are A, C, C and B. When the Object is overwritten, its object data already exists in the Nvme SSD logical address space, and both the overlapping data area and the non-overlapping area have to be written: data blocks C, C and B are the data to be written into the overlapping data area, and data block A is the data to be written into the non-overlapping area. In the traditional approach, overwriting an overlapping data area requires reading the existing data out first and then writing the new data into the freed space.
- in this application, the IO data to be written is sliced according to the minimum allocation unit of 4KB of the Nvme SSD, and the two data blocks C are written into newly allocated space.
- data block A, which is less than one storage unit, is append-written into the space unit following the data blocks C; to overwrite data block B, data block B is first written to the log area, and the original location of data block B is then overwritten under data transaction consistency.
- the overall flow follows Figure 5: in step 1, a kernel buffer is newly allocated for the two data blocks C and the two data blocks C are written into it; in step 2, data block A is written into the free kernel buffer adjacent to the newly allocated kernel buffer; in step 3, data block B is written into a log block in the log area; and in step 4, based on the data length and offset information corresponding to data block B, data block B is overwritten into the corresponding kernel buffer.
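- A condensed sketch of the four steps just described for blocks A, C, C and B; the helper functions are illustrative stubs standing in for whatever buffer allocation, append, log and overwrite primitives the engine actually uses.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLK 4096

/* Illustrative stubs; a real engine would drive its memory-map interface here. */
static void *alloc_kernel_buffer(size_t len)            { return malloc(len); }
static void append_to_adjacent_free(const void *blk)    { (void)blk; puts("step 2: append A"); }
static void log_area_write(const void *blk, size_t len) { (void)blk; (void)len; puts("step 3: log B"); }
static void overwrite_at(uint64_t off, const void *blk, size_t len)
{
    (void)off; (void)blk; (void)len; puts("step 4: overwrite B");
}

static void overwrite_example(const void *c1, const void *c2,
                              const void *a, const void *b,
                              size_t b_len, uint64_t b_off)
{
    /* Step 1: the two full 4KB blocks C go into a newly allocated kernel buffer. */
    char *buf = alloc_kernel_buffer(2 * BLK);
    memcpy(buf, c1, BLK);
    memcpy(buf + BLK, c2, BLK);

    /* Step 2: block A is appended into the free buffer adjacent to the new one. */
    append_to_adjacent_free(a);

    /* Step 3: block B (< 4KB) is first staged in a log block. */
    log_area_write(b, b_len);

    /* Step 4: B is then overwritten in place using its length and offset. */
    overwrite_at(b_off, b, b_len);

    free(buf);
}
```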
- the above process does not use block-based addressing; instead, it uses byte-based addressing through the Nvme protocol of the Linux operating system.
- because the memory mapping in this application shares the address space, the step of first reading out the already-written data blocks before overwriting them is removed.
- when the storage system reaches a high water level, the storage device space is not affected by the 4K read overhead that overwriting would otherwise cause during data recovery and similar operations; this effectively improves the performance sustainability of the storage and reduces the performance degradation that comes with growing capacity.
- when performing an append write operation, a new kernel buffer can be allocated for the IO data based on the information stored in the data structure area corresponding to the IO data, and addressing is performed at the byte granularity of the memory to write the IO data to the new kernel buffer.
- the allocated storage unit is addressed by byte through Memory Map, which differs from the page-alignment approach in the existing technology and effectively reduces the space wasted by non-aligned IO writes.
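- A brief sketch of the append write path under the same assumptions: a new region is taken from bookkeeping kept in the data structure area and the data is copied in at byte granularity within the shared mapping, rather than being padded to a page boundary. The data_area structure is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical allocator state kept in the data structure area. */
struct data_area {
    uint8_t *base;        /* start of the mapped object data area */
    uint64_t next_free;   /* byte offset of the next free location */
};

/* Append write: allocate a new region for the IO data and copy it in at byte
 * granularity; no page-alignment padding is required in this model. */
static uint64_t append_write(struct data_area *da, const void *data, size_t len)
{
    uint64_t off = da->next_free;
    memcpy(da->base + off, data, len);
    da->next_free += len;
    return off;            /* actual offset to be recorded in the second area */
}
```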
- When IO data is acquired, the IO data is converted into the memory according to the set data structure area and the set data block granularity; the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- According to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- By setting the data structure area, memory mapping of IO data is supported, which reduces repeated copying of data and shortens the IO path.
- The mapping process is based on the operating system's direct memory access copy: because the user buffer and the kernel buffer share one piece of mapped data, establishing the shared mapping removes the need to copy IO data from the kernel buffer to the user buffer, which further shortens the IO path.
- Managing the hardware storage device through memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive small files.
- Figure 6 is a schematic structural diagram of a data storage device provided by an embodiment of the present application, including a conversion unit 61 and a mapping unit 62;
- the conversion unit 61 is configured to, when acquiring the IO data, convert the IO data into the memory according to the set data structure area and the set data block granularity; wherein the data structure area includes a block for storing metadata information. a first area, a second area used to store data description information, a third area used to store object data, and a log area;
- the mapping unit 62 is configured to map the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein the user buffer and the kernel buffer share one piece of mapping data, and the user buffer is a buffer on the hardware storage device.
- the conversion unit includes a dividing subunit, a first writing subunit, a determining subunit, a second writing subunit and a third writing subunit;
- the dividing subunit is configured to divide IO data into IO data blocks according to the set data block granularity; each IO data block has its corresponding number information;
- the first writing subunit is configured to write metadata information of each IO data block into the first area
- the determination subunit is configured to determine the data length and offset information of each IO data block based on the metadata information of each IO data block; wherein the offset information includes a logical offset and an actual offset;
- the second writing subunit is configured to write the number information, data length and offset information corresponding to each IO data block into the second area;
- the third writing subunit is configured to write each IO data block into the third area.
- the mapping unit includes a new allocation writing sub-unit, an append writing sub-unit, a log area writing sub-unit and an overwriting writing sub-unit;
- the newly allocated write subunit is configured to, when performing an overwrite write operation on IO data, if the IO data already has object data in the logical address space on the hardware storage device, write the first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer;
- the append write subunit is configured to append write the previous data block forward adjacent to the first IO data block into the free kernel buffer adjacent to the newly allocated kernel buffer;
- the log area writing subunit is configured to write the subsequent data blocks that do not meet the data block granularity requirements and are adjacent to the first IO data block into the log area;
- the overwrite writing subunit is configured to overwrite and write the subsequent data block to the corresponding kernel buffer based on the data length and offset information corresponding to the subsequent data block.
- the newly allocated write subunit is configured to determine, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement; allocate, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the existing object data storage area for the first IO data block; and write the first IO data block to the target kernel buffer at the byte granularity of the memory.
- the log area writing subunit is configured to write the subsequent data blocks that do not meet the data block granularity requirements and are backward adjacent to the first IO data block into the log area through the set consistency transaction interface.
- it also includes a judgment unit and a deletion unit;
- the judgment unit is configured to judge, after the overwrite writing subunit overwrites the subsequent data block into the corresponding kernel buffer, whether the subsequent data block stored in the log area has reached a set duration;
- the deletion unit is configured to delete the subsequent data block stored in the log area if the subsequent data block stored in the log area has reached the set duration.
- the mapping unit is configured to, when performing an overwrite write operation on IO data, if the IO data has no object data in the logical address space on the hardware storage device, perform addressing at the byte granularity of the memory to map the IO data to the kernel buffer corresponding to the logical address.
- the mapping unit includes an allocation subunit and a writing subunit;
- the allocation subunit is configured to, when performing an append write operation of IO data, allocate a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data;
- the writing subunit is configured to perform addressing at the byte granularity of the memory to write the IO data to the new kernel buffer.
- for the IO data acquisition process, the device includes an acquisition subunit and a slicing subunit;
- the acquisition subunit is configured to obtain, according to a set protocol access interface, the data to be processed transmitted by the client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing methods;
- the slicing subunit is configured to segment the data to be processed according to the corresponding slicing method to obtain IO data.
- an aggregation unit is also included.
- the aggregation unit is configured to aggregate each IO data and its corresponding copy data into the same group;
- the conversion unit is configured to synchronously execute the steps of converting IO data to the memory according to the set data structure area and the set data block granularity for the data in the same group.
- When IO data is acquired, the IO data is converted into the memory according to the set data structure area and the set data block granularity; the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- According to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- By setting the data structure area, memory mapping of IO data is supported, which reduces repeated copying of data and shortens the IO path.
- The mapping process is based on the operating system's direct memory access copy: because the user buffer and the kernel buffer share one piece of mapped data, establishing the shared mapping removes the need to copy IO data from the kernel buffer to the user buffer, which further shortens the IO path.
- Managing the hardware storage device through memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive small files.
- Figure 7 is a schematic structural diagram of a data storage system provided by an embodiment of the present application, including a storage management module 71, a transmission interface 72 and a hardware storage device 73; the storage management module 71 is connected to the hardware storage device 73 through the transmission interface 72;
- the storage management module 71 is configured to, when acquiring IO data, convert the IO data into the memory according to the set data structure area and the set data block granularity; wherein, the data structure area includes information for storing metadata The first area, the second area used to store data description information, the third area used to store object data and the log area;
- the storage management module 71 is configured to map the IO data to the kernel buffer through the transmission interface 72 according to the byte granularity of the memory and the information stored in the data structure area; wherein the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device 73.
- the transmission interface includes a unit interface for transmitting data that meets the data block granularity requirement, and a consistency transaction interface for transmitting data that does not meet the data block granularity requirement.
- When IO data is acquired, the IO data is converted into the memory according to the set data structure area and the set data block granularity; the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
- According to the byte granularity of the memory and the information stored in the data structure area, the IO data is mapped to the kernel buffer; the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- By setting the data structure area, memory mapping of IO data is supported, which reduces repeated copying of data and shortens the IO path.
- The mapping process is based on the operating system's direct memory access copy: because the user buffer and the kernel buffer share one piece of mapped data, establishing the shared mapping removes the need to copy IO data from the kernel buffer to the user buffer, which further shortens the IO path.
- Managing the hardware storage device through memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive small files.
- if the data storage method in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer non-volatile readable storage medium.
- the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product;
- the computer software product is stored in a non-volatile storage medium and, when executed, carries out all or part of the processes of the various embodiments of the present application.
- non-volatile storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, a magnetic disk or an optical disk, and other media that can store program code.
- embodiments of the present application also provide a computer non-volatile readable storage medium.
- a computer program is stored on the computer non-volatile readable storage medium.
- when executed by a processor, the computer program implements the steps of the above data storage method.
Abstract
This application relates to the field of distributed storage technology and discloses a data storage method, device, system, equipment and computer non-volatile readable storage medium. When IO data is acquired, the IO data is converted into the memory according to a set data structure area and a set data block granularity; the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area. By setting the data structure area, memory mapping of IO data is supported. The user buffer and the kernel buffer share one piece of mapped data, so IO data no longer needs to be copied from the kernel buffer to the user buffer, which reduces the IO latency of the storage system.
Description
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 6, 2022, with application number 202211081604.4 and entitled "Data storage method, device, system, equipment and medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of distributed storage technology, and in particular to a data storage method, device, system, equipment and computer non-volatile readable storage medium.
Against the background of digital transformation, massive amounts of data keep growing and distributed storage systems have come into widespread use. A distributed storage system usually consists of a main control server, storage servers and multiple clients; its essence is to distribute a large number of files evenly across multiple storage servers, which gives it high scalability and high reliability. At the same time, distributed storage systems can be applied to a variety of scenarios, and commercial demands place ever higher performance requirements on them.
Usually, a file storage engine is used as the back-end storage engine of a distributed storage system. The file storage engine manages the OSD (Object Storage Device, a process that returns specific data in response to client requests) data on the storage nodes through a file system. File storage engine technology is mature, but because the file storage engine is built on a journaling file system it suffers from a double-write problem. To guarantee data reliability, one write request to the file storage engine is converted into two write operations: the log is written synchronously first, and the data is then written asynchronously to the storage disk medium. Therefore, as the amount of data grows and the IO (Input/Output) pressure increases, storage system performance degrades severely.
Many current journaling file systems use Nvme SSDs (non-volatile memory express solid-state drives) as the log device to improve storage IO performance. Research and market experience show that in scenarios with massive small-file IO, storage IO performance fluctuates heavily, because writing massive small data blocks back to the back-end file system on the persistent disk drive is much slower than writing the log, and NVMe SSD utilization is extremely low. When small files are being written back to the slow disk for persistent storage and the write-back queue blocks because it is full, the log queue sits idle and the performance advantage of the Nvme SSD cannot be exploited.
In the field of research and optimization, a block device storage engine has been proposed. Since this storage engine does not depend on a journaling file system and can manage block devices directly, it reduces the performance loss caused by log double writes. However, with massive small-file data, the block device storage engine has to rely on an embedded key-value store (RocksDB) to manage metadata and small data blocks. Metadata and small-data key-value pairs written by users are first written to a write-ahead log on disk and only then to the data itself, so RocksDB's logging scheme also causes a double-write problem.
Whether it is the journaling file system of the file storage engine or the RocksDB database engine of the block device storage engine, both rely in principle on the generic Linux access interfaces to store data, so the overall storage IO path is long; even with high-speed storage media, a distributed storage system still suffers relatively high IO latency.
It can be seen that how to reduce the IO latency of the storage system is a problem that those skilled in the art need to solve.
Summary of the Invention
The purpose of the embodiments of this application is to provide a data storage method, device, system, equipment and computer non-volatile readable storage medium that can reduce the IO latency of the storage system.
To solve the above technical problem, an embodiment of this application provides a data storage method, including:
when IO data is acquired, converting the IO data into the memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein the user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
Optionally, converting the IO data into the memory according to the set data structure area and the set data block granularity includes:
dividing the IO data into IO data blocks according to the set data block granularity; wherein each IO data block has its corresponding number information;
writing the metadata information of each IO data block into the first area;
determining the data length and offset information of each IO data block according to the metadata information of each IO data block; wherein the offset information includes a logical offset and an actual offset;
writing the number information, data length and offset information corresponding to each IO data block into the second area;
writing each IO data block into the third area.
Optionally, before dividing the IO data into IO data blocks according to the set data block granularity, the method further includes: setting the value of the data block granularity to 4KB.
Optionally, mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
when performing an overwrite write operation on IO data, if the IO data already has object data in the logical address space on the hardware storage device, writing the first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer;
append-writing the previous data block, which is forward-adjacent to the first IO data block, into the free kernel buffer adjacent to the newly allocated kernel buffer;
writing the subsequent data block, which does not meet the data block granularity requirement and is backward-adjacent to the first IO data block, into the log area;
overwriting the subsequent data block into the corresponding kernel buffer according to the data length and offset information corresponding to the subsequent data block.
Optionally, writing the first IO data block that belongs to the overlapping area and meets the data block granularity requirement into the newly allocated kernel buffer includes:
determining, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement;
allocating, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the existing object data storage area for the first IO data block;
writing the first IO data block into the target kernel buffer at the byte granularity of the memory.
Optionally, writing the subsequent data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area includes:
writing the subsequent data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area through a set consistency transaction interface.
Optionally, after overwriting the subsequent data block into the corresponding kernel buffer, the method further includes:
judging whether the subsequent data block stored in the log area has reached a set duration;
if the subsequent data block stored in the log area has reached the set duration, deleting the subsequent data block stored in the log area.
Optionally, mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
when performing an overwrite write operation on IO data, if the IO data has no object data in the logical address space on the hardware storage device, performing addressing at the byte granularity of the memory to map the IO data to the kernel buffer corresponding to the logical address.
Optionally, mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes:
when performing an append write operation of IO data, allocating a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data;
performing addressing at the byte granularity of the memory to write the IO data to the new kernel buffer.
Optionally, after performing addressing at the byte granularity of the memory to write the IO data to the new kernel buffer, the method further includes: updating the metadata through the consistency transaction interface.
Optionally, the acquisition process of the IO data includes:
obtaining, according to a set protocol access interface, the data to be processed transmitted by the client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing methods;
slicing the data to be processed according to the corresponding slicing method to obtain the IO data.
Optionally, after slicing the data to be processed according to the corresponding slicing method to obtain the IO data, the method further includes:
aggregating each piece of IO data and its corresponding copy data into the same group;
synchronously performing, for the data in the same group, the step of converting the IO data into the memory according to the set data structure area and the set data block granularity.
Optionally, aggregating each piece of IO data and its corresponding copy data into the same group includes: splitting the data into 4MB objects; mapping the objects to the same group through hash calculation.
Optionally, mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: calculating the write position of the IO data from the metadata information of the object; writing the IO data, according to the minimum allocation unit of 4KB agreed for the non-volatile memory device, using the write method corresponding to the object at that position, wherein the write methods include create-append write and overwrite-modify write.
Optionally, the data blocks to be written are A, C, C and B, where data blocks C, C and B are the data to be written into the overlapping data area and data block A is the data to be written into the non-overlapping area; the two data blocks C and data block A both meet the data block granularity requirement, and data block B does not meet the data block granularity requirement. Mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing the overwrite write operation on the IO data, writing the two data blocks C into the newly allocated kernel buffer; append-writing data block A into the free kernel buffer adjacent to the newly allocated kernel buffer; writing data block B into a log block in the log area; and overwriting data block B into the corresponding kernel buffer according to the data length and offset information corresponding to data block B.
An embodiment of the present application also provides a data storage apparatus, including a conversion unit and a mapping unit;
the conversion unit is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
the mapping unit is configured to map the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device.
Optionally, the conversion unit includes a dividing subunit, a first writing subunit, a determining subunit, a second writing subunit and a third writing subunit;
the dividing subunit is configured to divide the IO data into IO data blocks according to the set data block granularity; wherein each IO data block has its corresponding numbering information;
the first writing subunit is configured to write the metadata information of each IO data block into the first area;
the determining subunit is configured to determine the data length and offset information of each IO data block according to the metadata information of each IO data block; wherein the offset information includes a logical offset and an actual offset;
the second writing subunit is configured to write the numbering information, data length and offset information corresponding to each IO data block into the second area;
the third writing subunit is configured to write each IO data block into the third area.
Optionally, the mapping unit includes a new-allocation writing subunit, an append writing subunit, a log-area writing subunit and an overwrite writing subunit;
the new-allocation writing subunit is configured to, when an overwrite operation of the IO data is performed, if object data exists in the logical address space of the IO data on the hardware storage device, write a first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer;
the append writing subunit is configured to append the preceding data block that is forward-adjacent to the first IO data block into a free kernel buffer adjacent to the newly allocated kernel buffer;
the log-area writing subunit is configured to write the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area;
the overwrite writing subunit is configured to overwrite the trailing data block into the corresponding kernel buffer according to the data length and offset information corresponding to the trailing data block.
Optionally, the new-allocation writing subunit is configured to: determine, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement;
allocate, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the storage area of the existing object data for the first IO data block; and
write the first IO data block into the target kernel buffer according to the byte granularity of the memory.
Optionally, the log-area writing subunit is configured to write the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area through a set consistency transaction interface.
Optionally, the apparatus further includes a judging unit and a deleting unit;
the judging unit is configured to, after the overwrite writing subunit overwrites the trailing data block into the corresponding kernel buffer, determine whether the trailing data block stored in the log area has reached a set time duration;
the deleting unit is configured to, if the trailing data block stored in the log area has reached the set time duration, delete the trailing data block stored in the log area.
Optionally, the mapping unit is configured to, when an overwrite operation of the IO data is performed, if no object data exists in the logical address space of the IO data on the hardware storage device, address according to the byte granularity of the memory so as to map the IO data to the kernel buffer corresponding to the logical address.
Optionally, the mapping unit includes an allocating subunit and a writing subunit;
the allocating subunit is configured to, when an append write operation of the IO data is performed, allocate a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data;
the writing subunit is configured to address according to the byte granularity of the memory so as to write the IO data into the new kernel buffer.
Optionally, for the process of obtaining the IO data, the apparatus includes an obtaining subunit and a slicing subunit;
the obtaining subunit is configured to obtain, according to a set protocol access interface, the to-be-processed data transmitted by a client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing modes;
the slicing subunit is configured to slice the to-be-processed data according to the corresponding slicing mode to obtain the IO data.
Optionally, the apparatus further includes a gathering unit;
the gathering unit is configured to gather each piece of IO data and its corresponding replica data into the same group;
the conversion unit is configured to synchronously perform, on the data in the same group, the step of converting the IO data into memory according to the set data structure area and the set data block granularity.
An embodiment of the present application also provides a data storage system, including a storage management module, a transmission interface and a hardware storage device; the storage management module is connected to the hardware storage device through the transmission interface;
the storage management module is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
the storage management module is configured to map the IO data to a kernel buffer through the transmission interface according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
Optionally, the transmission interface includes a unit interface for transmitting data that meets the data block granularity requirement and a consistency transaction interface for transmitting data that does not meet the data block granularity requirement.
An embodiment of the present application also provides an electronic device, including:
a memory configured to store a computer program;
a processor configured to execute the computer program to implement the steps of the above data storage method.
An embodiment of the present application also provides a computer non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above data storage method are implemented.
It can be seen from the above technical solution that, when IO data is obtained, the IO data is converted into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area. The IO data is mapped to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device. In this technical solution, by setting the data structure area, memory mapping of the IO data is supported, which reduces repeated copying of data and shortens the IO path. Moreover, the mapping process is based on direct memory access copying by the operating system: the user buffer and the kernel buffer share one piece of mapped data, and by establishing the shared mapping the IO data no longer needs to be copied from the kernel buffer to the user buffer, which further shortens the IO path. Managing the hardware storage device by means of memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive numbers of small files.
To explain the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Figure 1 is a schematic diagram of the hardware composition framework applicable to a data storage method provided by an embodiment of the present application;
Figure 2 is a flowchart of a data storage method provided by an embodiment of the present application;
Figure 3 is an architecture diagram of a distributed storage system provided by an embodiment of the present application;
Figure 4 is a schematic diagram of the interaction between a storage engine and a hardware storage device provided by an embodiment of the present application;
Figure 5 is a schematic flowchart of an overwrite operation provided by an embodiment of the present application;
Figure 6 is a schematic structural diagram of a data storage apparatus provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a data storage system provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The terms "including" and "having" and any variations thereof in the specification and claims of the present application and the above drawings are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.
In order to enable those skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments.
For ease of understanding, the hardware composition framework used by the solution corresponding to the data storage method provided by an embodiment of the present application is introduced first. Please refer to Figure 1, which is a schematic diagram of the hardware composition framework applicable to a data storage method provided by an embodiment of the present application. The electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output interface 104 and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps of the data storage method; the memory 102 is configured to store various types of data to support the operation of the electronic device 100, and such data may include, for example, instructions of any application program or method configured to operate on the electronic device 100, as well as data related to the application programs. The memory 102 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as one or more of static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. In this embodiment, the memory 102 stores at least the programs and/or data for implementing the following functions:
when IO data is obtained, converting the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
mapping the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 102 or sent through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The information input/information output interface 104 provides an interface between the processor 101 and other interface modules, and the other interface modules may be a keyboard, a mouse, buttons and the like; these buttons may be virtual buttons or physical buttons. The communication component 105 is configured for wired or wireless communication between the electronic device 100 and other devices. Wireless communication includes, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi component, a Bluetooth component and an NFC component.
The electronic device 100 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, and is configured to perform the data storage method.
Of course, the structure of the electronic device 100 shown in Figure 1 does not constitute a limitation on the electronic device in the embodiments of the present application. In practical applications, the electronic device 100 may include more or fewer components than those shown in Figure 1, or combine certain components.
Next, a data storage method provided by an embodiment of the present application is described in detail. Figure 2 is a flowchart of a data storage method provided by an embodiment of the present application, and the method includes:
S201: when IO data is obtained, converting the IO data into memory according to a set data structure area and a set data block granularity.
The data structure area may include a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area.
In practical applications, the IO data may be divided into IO data blocks according to the set data block granularity, wherein each IO data block has its corresponding numbering information.
The value of the data block granularity may be set according to actual requirements, for example 4KB. The reason is that for solid-state flash devices there is no difference between random and sequential IO performance, whereas the random IO performance of mechanical disks is far lower than their sequential IO performance; an allocation unit of 4KB reduces the number of request-processing operations in small-IO scenarios and thereby improves the performance of the storage system.
Each IO data block has its corresponding metadata information, and when the IO data is stored, the metadata information of each IO data block may be written into the first area.
The data description information may include the numbering information, data length and offset information corresponding to each IO data block. In an optional embodiment, the data length and offset information of each IO data block may be determined from the metadata information of each IO data block, wherein the offset information may include a logical offset and an actual offset; the numbering information, data length and offset information corresponding to each IO data block are written into the second area, and each IO data block is written into the third area.
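A hedged sketch of this conversion step, continuing the hypothetical structures from the earlier sketch: the IO data is cut into 4KB blocks, a descriptor (number, length, logical and actual offset) is recorded in the second area, and the payload is copied into the third area. The helper name, the per-block metadata handling and the flat in-memory layout are assumptions for illustration only.

```c
#include <string.h>

/* Convert one IO request into memory according to the data structure area.
 * `meta` stands in for whatever per-block metadata the engine keeps in the
 * first area; `next_actual_off` is where the payload lands in the third area. */
static size_t convert_to_memory(struct data_structure_area *a,
                                const void *io_data, size_t io_len,
                                uint64_t logical_off, uint64_t next_actual_off,
                                const void *meta, size_t meta_len,
                                size_t first_block_no)
{
    size_t nblocks = (io_len + BLOCK_GRANULARITY - 1) / BLOCK_GRANULARITY;

    for (size_t i = 0; i < nblocks; i++) {
        size_t off = i * BLOCK_GRANULARITY;
        size_t len = io_len - off < BLOCK_GRANULARITY ? io_len - off
                                                      : BLOCK_GRANULARITY;

        /* First area: metadata information of this IO data block. */
        memcpy((char *)a->first_area + (first_block_no + i) * meta_len,
               meta, meta_len);

        /* Second area: numbering, data length and offset information. */
        a->second_area[first_block_no + i] = (struct block_desc){
            .block_no    = first_block_no + i,
            .length      = len,
            .logical_off = logical_off + off,
            .actual_off  = next_actual_off + off,
        };

        /* Third area: the object data itself. */
        memcpy((char *)a->third_area + next_actual_off + off,
               (const char *)io_data + off, len);
    }
    return nblocks;
}
```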
The data storage method provided by the present application is applicable to a distributed storage system. For the process of obtaining the IO data, the distributed storage system may obtain, according to a set protocol access interface, the to-be-processed data transmitted by a client; the protocol access interfaces may include an object interface, a block interface and a file system interface.
Different protocol access interfaces have their corresponding data slicing modes. Therefore, after obtaining the to-be-processed data through a protocol access interface, the distributed storage system slices the to-be-processed data according to the corresponding slicing mode to obtain the IO data.
To improve data safety, corresponding replica data is set for each piece of data. To enable synchronized processing of data of the same kind, each piece of IO data and its corresponding replica data may be gathered into the same group, and the step of converting the IO data into memory according to the set data structure area and the set data block granularity is performed synchronously on the data within the same group.
In practical applications, when a client issues IO data, the data is first split into 4MB objects, the objects are mapped into the same group through hash calculation, and the objects within a group are then processed synchronously, so that the data can be read and written.
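As a rough illustration of this grouping, the sketch below splits a write into 4MB objects and maps each object to a group by hashing its identifier; the FNV-1a hash and the fixed group count are assumptions made for the example, not the hash actually used by the system.

```c
#include <stdint.h>
#include <stddef.h>

#define OBJECT_SIZE   (4u * 1024 * 1024)   /* data is split into 4MB objects */
#define GROUP_COUNT   128u                 /* assumed number of groups       */

/* FNV-1a over the object name: a stand-in for the real hash calculation;
 * an object and its replicas hash to the same group.                       */
static uint32_t group_of_object(const char *object_name)
{
    uint32_t h = 2166136261u;
    for (const char *p = object_name; *p; p++) {
        h ^= (uint8_t)*p;
        h *= 16777619u;
    }
    return h % GROUP_COUNT;
}

/* Number of 4MB objects a piece of client data is split into. */
static size_t object_count(size_t data_len)
{
    return (data_len + OBJECT_SIZE - 1) / OBJECT_SIZE;
}
```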
Figure 3 is an architecture diagram of a distributed storage system provided by an embodiment of the present application. The distributed storage system provides three protocol access interfaces, namely object, block and file system, corresponding respectively to the object gateway service (RadosGW), the block device service (RADOS Block Device, RBD) and the file service (LibFS). Rados (Reliable Autonomic Distributed Object Store, the underlying storage system) provides unified, self-managing and scalable distributed storage. The file system protocol additionally requires a metadata cluster, and cluster monitor processes maintain the cluster state. Data is placed in storage pools and mapped onto hardware storage devices, such as HDDs (Hard Disk Drives) or SSDs (Solid State Drives), through the storage engine.
Figure 4 is a schematic diagram of the interaction between a storage engine and a hardware storage device provided by an embodiment of the present application. The storage engine (MemStore) can interact with the hardware storage device through a memory map unit interface and a consistency transaction interface. When the MemStore storage engine obtains IO data, it can store the IO data according to the divided first area, second area, third area and log area. In Figure 4, the superblock represents the first area, the metadata area represents the second area, and the data area represents the third area.
The MemStore storage engine is fully modular with respect to client IO operations: client IO requests require no modification and the adaptation of upper-layer interfaces is unaffected. In the MemStore structure diagram in Figure 4, the superblock can be designed as 4KB, and the metadata, object data and log blocks as 4MB each. The superblock mainly stores the structural information of the system itself and metadata such as the data structures describing the overall system. The metadata area records the description information of the object data in the storage system, such as the object number and its corresponding data area, the logical offset of the object data, the data size/length, and the actual physical offset and length on the NVMe SSD device. When metadata is added, following the principle that the metadata area size is fixed, a new 16-byte pointer pointing to the start address of the new metadata area is appended, which guarantees the atomicity of appending new metadata. The last area is the log area, which performs transaction processing for the data written to the NVMe SSD.
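The region sizes described for Figure 4 can be captured with a few constants. The sketch below is an assumed encoding of that layout (a 4KB superblock followed by 4MB metadata, object data and log blocks, plus a 16-byte pointer appended when a new metadata area is created); it is not the exact on-device format.

```c
#include <stdint.h>

#define SUPERBLOCK_SIZE   (4u * 1024)          /* superblock: 4KB          */
#define METADATA_BLOCK_SZ (4u * 1024 * 1024)   /* metadata area: 4MB       */
#define DATA_BLOCK_SZ     (4u * 1024 * 1024)   /* object data area: 4MB    */
#define LOG_BLOCK_SZ      (4u * 1024 * 1024)   /* log block: 4MB           */

/* 16-byte pointer appended atomically when a new metadata area is added:
 * it records the start address (and, in this sketch, the length) of the
 * new metadata area, so metadata appends stay atomic.                     */
struct meta_area_ptr {
    uint64_t start_addr;
    uint64_t length;
};
```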
By designing a structure of written object number, logical offset, data length and actual offset for memory mapping, repeated copying of data is reduced and the IO path is shortened. In the present application, IO does not pass through the page cache of the Linux operating system but accesses the NVMe SSD device directly.
S202: mapping the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area.
A user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
In the embodiments of the present application, the main difference between the read/write flow of the MemStore storage engine and that of other distributed storage engines is that the MemStore storage engine accesses the NVMe SSD device driver by means of Memory Map. Without loss of generality, the NVMe SSD device is managed and its data IO is addressed at byte granularity.
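A minimal sketch of byte-granularity access through a shared memory mapping is given below. The device path /dev/example-nvme, the mapping sizes and the error handling are illustrative assumptions only; the point is that, once the shared mapping exists, a plain memcpy at an arbitrary byte offset is the write.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a region of the device and write at byte granularity through it.
 * map_off must be page-aligned; byte_off is relative to the mapping.     */
static int write_bytes_via_mmap(const char *dev_path, off_t map_off,
                                size_t map_len, size_t byte_off,
                                const void *buf, size_t len)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Shared mapping: user buffer and kernel buffer see the same data,
     * so no extra copy from kernel buffer to user buffer is needed.      */
    uint8_t *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, map_off);
    if (base == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(base + byte_off, buf, len);      /* byte-granularity addressing */
    msync(base + (byte_off & ~4095UL),      /* flush the touched pages     */
          len + (byte_off & 4095UL), MS_SYNC);

    munmap(base, map_len);
    close(fd);
    return 0;
}
```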
When a distributed storage client requests that data be written to the MemStore storage engine, the MemStore storage engine calculates the position in the data area where the data is to be written from the metadata information of the Object, and performs the write according to the 4KB minimum allocation unit agreed with the NVMe SSD. Depending on how the Object is written, writes generally fall into two types: create-append writes and overwrite-modify writes.
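One plausible way to picture the decision between the two write paths is sketched below: the object's descriptors (second area) are consulted, and whether object data already exists in the targeted logical address range determines the path taken. The lookup structure and function names are hypothetical and the overlap test is simplified.

```c
#include <stdbool.h>
#include <stddef.h>

enum write_mode { CREATE_APPEND_WRITE, OVERWRITE_MODIFY_WRITE };

/* Does any existing descriptor overlap [logical_off, logical_off + len)? */
static bool object_data_exists(const struct block_desc *descs, size_t n,
                               uint64_t logical_off, uint64_t len)
{
    for (size_t i = 0; i < n; i++)
        if (descs[i].logical_off < logical_off + len &&
            logical_off < descs[i].logical_off + descs[i].length)
            return true;
    return false;
}

static enum write_mode choose_write_mode(const struct block_desc *descs,
                                         size_t n, uint64_t logical_off,
                                         uint64_t len)
{
    return object_data_exists(descs, n, logical_off, len)
               ? OVERWRITE_MODIFY_WRITE
               : CREATE_APPEND_WRITE;
}
```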
According to the principle of locality, when 4 bytes of data are read during a Memory Map data exchange, not only these 4 bytes are brought in; a large amount of data following them is read in as well, typically up to 64KB. When reads and writes to the NVMe SSD exceed the minimum allocation unit, the data needs to be split and allocated to different units. When the system contains many overwrite writes and a failure occurs while the data is being flushed, data corruption can result. For this overwrite-failure scenario, the embodiments of the present application design a consistency transaction interface: data smaller than 4KB is first written into the log area, and the data in the log area is then transferred to the NVMe SSD storage device through the consistency transaction interface. This guards against data inconsistency caused by failures during flushing; the atomicity of the transaction avoids inconsistent data.
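A hedged sketch of this consistency-transaction idea for data smaller than 4KB: the fragment is first recorded in a log block together with its destination, then copied to its final place, and only then marked committed, so an uncommitted record can be replayed after a crash. The log record format and the commit flag are assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

struct log_record {
    uint64_t dest_off;                     /* where the data finally lands  */
    uint64_t length;                       /* < BLOCK_GRANULARITY           */
    uint8_t  payload[BLOCK_GRANULARITY];   /* the sub-4KB fragment          */
    uint8_t  committed;                    /* set once the apply completed  */
};

/* Write a sub-4KB fragment with journal-first (log area) semantics. */
static void write_small_with_log(uint8_t *data_area, struct log_record *log,
                                 uint64_t dest_off, const void *buf,
                                 uint64_t len)
{
    /* 1. Record the write in the log area (the "transaction"). */
    log->dest_off  = dest_off;
    log->length    = len;
    memcpy(log->payload, buf, len);
    log->committed = 0;

    /* 2. Apply the overwrite to its real position. */
    memcpy(data_area + dest_off, buf, len);

    /* 3. Mark the transaction complete; on recovery, uncommitted records
     *    can be replayed so a crash mid-apply does not corrupt the data.  */
    log->committed = 1;
}
```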
In the embodiments of the present application, when an overwrite operation of the IO data is performed, if object data already exists in the logical address space of the IO data on the hardware storage device, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement is written into a newly allocated kernel buffer. The preceding data block that is forward-adjacent to the first IO data block is appended into a free kernel buffer adjacent to the newly allocated kernel buffer; the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block is written into the log area; and the trailing data block is overwritten into the corresponding kernel buffer according to its data length and offset information.
In an optional embodiment, the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block may be written into the log area through a set consistency transaction interface.
The process of newly allocating a kernel buffer for the first IO data block may include: determining, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement; allocating, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the storage area of the existing object data for the first IO data block; and writing the first IO data block into the target kernel buffer according to the byte granularity of the memory.
To prevent data from occupying the log area for a long time, a time duration may be set for the data written into the log area. Whether the trailing data block stored in the log area has reached the set time duration is determined; if it has, the trailing data block has been stored for a long time and generally no longer has any value, and the trailing data block stored in the log area may then be deleted.
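The time-based cleanup of the log area can be sketched as below, continuing the log_record sketch above; the retention constant and the wrapper record are illustrative assumptions.

```c
#include <time.h>

#define LOG_RETENTION_SECONDS 300          /* assumed "set time duration"   */

struct timed_log_record {
    struct log_record rec;
    time_t            written_at;
    int               in_use;
};

/* Delete log records that have been kept longer than the set duration. */
static void expire_log_records(struct timed_log_record *records, size_t n)
{
    time_t now = time(NULL);
    for (size_t i = 0; i < n; i++)
        if (records[i].in_use &&
            now - records[i].written_at >= LOG_RETENTION_SECONDS)
            records[i].in_use = 0;          /* the trailing block is dropped */
}
```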
When an overwrite operation of the IO data is performed, if no object data exists in the logical address space of the IO data on the hardware storage device, addressing may be performed according to the byte granularity of the memory so that the IO data is mapped directly to the kernel buffer corresponding to the logical address.
Figure 5 is a schematic flowchart of an overwrite operation provided by an embodiment of the present application. Assume the data blocks to be written are A, C, C and B, where object data already exists in the logical address space on the hardware storage device corresponding to data blocks C, C and B. When an object is overwritten and Object data already exists in the NVMe SSD logical address space, the overwrite has an overlapping data area and a non-overlapping area. Data blocks C, C and B are the data to be written into the overlapping data area, and data block A is the data to be written into the non-overlapping area. In the traditional approach, when an overwrite involves an overlapping data area, the existing object data has to be read out first, and the new data is then written into the freed space.
The present application slices the IO data to be written according to the NVMe SSD minimum allocation unit of 4KB and writes the two data blocks C into newly allocated space. Data block A, which is less than one storage unit, is appended after the space units of the data blocks C. For the overwritten data block B, data block B is first written into the log area and is then updated, with transactional data consistency, to the original overwrite position of data block B. That is, the overall flow is: according to step ① in Figure 5, newly allocated kernel buffers are first allocated for the two data blocks C and the two data blocks C are written into them; according to step ②, data block A is written into a free kernel buffer adjacent to the newly allocated kernel buffers; according to step ③, data block B is written into a log block of the log area; and according to step ④, data block B is overwritten into the corresponding kernel buffer according to its data length and offset information.
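Under the same assumptions as the sketches above, the four steps for blocks A, C, C and B could look roughly like this; buffer allocation is simplified to taking the next free 4KB units, and all names are hypothetical.

```c
/* Steps ① to ④ of Figure 5, heavily simplified.
 * data_area  : the mapped object-data region (kernel buffers)
 * next_free  : byte offset of the next free 4KB unit
 * b_dest_off : original position of data block B (from its descriptor)    */
static void overwrite_accb(uint8_t *data_area, uint64_t *next_free,
                           struct log_record *log,
                           const uint8_t c1[BLOCK_GRANULARITY],
                           const uint8_t c2[BLOCK_GRANULARITY],
                           const uint8_t *a, uint64_t a_len,
                           const uint8_t *b, uint64_t b_len,
                           uint64_t b_dest_off)
{
    /* ① write the two C blocks into newly allocated kernel buffers */
    memcpy(data_area + *next_free, c1, BLOCK_GRANULARITY);
    memcpy(data_area + *next_free + BLOCK_GRANULARITY, c2, BLOCK_GRANULARITY);
    *next_free += 2 * BLOCK_GRANULARITY;

    /* ② append block A into the adjacent free kernel buffer */
    memcpy(data_area + *next_free, a, a_len);
    *next_free += BLOCK_GRANULARITY;        /* A occupies one 4KB unit      */

    /* ③ + ④ block B is smaller than 4KB: log it, then overwrite in place */
    write_small_with_log(data_area, log, b_dest_off, b, b_len);
}
```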
Compared with storage engines in the prior art, the above process does not use block-based addressing; it uses the byte-based addressing of the NVMe protocol under the Linux operating system. Through the shared address space, the memory mapping of the present application eliminates the need to first read the already-written data blocks during an overwrite. When the storage system reaches a high watermark and the storage device performs operations such as space reclamation, it is not affected by the 4KB read overhead brought by overwrites, so the sustained storage performance can be effectively improved and the problem of performance dropping as capacity grows is reduced.
For append writes, when an append write operation of the IO data is performed, a new kernel buffer may be allocated for the IO data according to the information stored in the data structure area corresponding to the IO data, and addressing is performed according to the byte granularity of the memory so that the IO data is written into the new kernel buffer.
Considering that distributed storage object IO needs to satisfy the ACID (Atomicity, Consistency, Isolation, Durability) properties, after a create-append write the metadata is updated through the consistency transaction interface. The create-append write mode does not need a journal for data persistence; it is completed through the established shared memory mapping space.
A create-append write goes into a newly allocated storage unit of the NVMe SSD storage device, and the allocated storage unit is based on the byte addressing of Memory Map. This differs from the page-alignment approach in the prior art and effectively reduces the space waste caused by non-aligned IO writes.
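A small sketch of the create-append path under the same assumptions: a new unit is taken, the data is written at byte granularity through the mapping, and the metadata is then updated (here reduced to appending a descriptor entry; in the engine this update goes through the consistency transaction interface).

```c
/* Create-append write: new unit, byte-addressed copy, metadata update. */
static void append_write(uint8_t *data_area, uint64_t *next_free,
                         struct block_desc *descs, size_t *ndescs,
                         const void *buf, uint64_t len, uint64_t logical_off)
{
    uint64_t units = (len + BLOCK_GRANULARITY - 1) / BLOCK_GRANULARITY;
    uint64_t dest  = *next_free;

    memcpy(data_area + dest, buf, len);     /* byte-granularity write        */
    *next_free += units * BLOCK_GRANULARITY;

    /* Metadata update; in the engine this goes through the consistency
     * transaction interface so the append satisfies ACID.                  */
    size_t idx = (*ndescs)++;
    descs[idx] = (struct block_desc){
        .block_no    = idx,
        .length      = len,
        .logical_off = logical_off,
        .actual_off  = dest,
    };
}
```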
In terms of the IO path by which data is written to the NVMe SSD storage device, the present application relies on the direct memory access copying of the Linux operating system: the user-space buffer and the kernel buffer share one piece of mapped data, and once the shared mapping is established, data is read from and written to the NVMe SSD directly. In traditional technology, data is written to a block device in user space using the Linux asynchronous notification (Linux aio) mode to perform IO operations directly on the raw device; a thread (aio_thread) checks whether the asynchronous input/output (aio) has completed, and after the direct write has landed on disk, the calling client is notified of write completion through a callback function (aio_callback). Compared with the traditional technology, the IO path of the present application eliminates the aio operations.
It can be seen from the above technical solution that, when IO data is obtained, the IO data is converted into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area. The IO data is mapped to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device. In this technical solution, by setting the data structure area, memory mapping of the IO data is supported, which reduces repeated copying of data and shortens the IO path. Moreover, the mapping process is based on direct memory access copying by the operating system: the user buffer and the kernel buffer share one piece of mapped data, and by establishing the shared mapping the IO data no longer needs to be copied from the kernel buffer to the user buffer, which further shortens the IO path. Managing the hardware storage device by means of memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive numbers of small files.
Figure 6 is a schematic structural diagram of a data storage apparatus provided by an embodiment of the present application, including a conversion unit 61 and a mapping unit 62;
the conversion unit 61 is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
the mapping unit 62 is configured to map the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device.
Optionally, the conversion unit includes a dividing subunit, a first writing subunit, a determining subunit, a second writing subunit and a third writing subunit;
the dividing subunit is configured to divide the IO data into IO data blocks according to the set data block granularity; wherein each IO data block has its corresponding numbering information;
the first writing subunit is configured to write the metadata information of each IO data block into the first area;
the determining subunit is configured to determine the data length and offset information of each IO data block according to the metadata information of each IO data block; wherein the offset information includes a logical offset and an actual offset;
the second writing subunit is configured to write the numbering information, data length and offset information corresponding to each IO data block into the second area;
the third writing subunit is configured to write each IO data block into the third area.
Optionally, the mapping unit includes a new-allocation writing subunit, an append writing subunit, a log-area writing subunit and an overwrite writing subunit;
the new-allocation writing subunit is configured to, when an overwrite operation of the IO data is performed, if object data exists in the logical address space of the IO data on the hardware storage device, write a first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer;
the append writing subunit is configured to append the preceding data block that is forward-adjacent to the first IO data block into a free kernel buffer adjacent to the newly allocated kernel buffer;
the log-area writing subunit is configured to write the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area;
the overwrite writing subunit is configured to overwrite the trailing data block into the corresponding kernel buffer according to the data length and offset information corresponding to the trailing data block.
Optionally, the new-allocation writing subunit is configured to: determine, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement;
allocate, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the storage area of the existing object data for the first IO data block; and
write the first IO data block into the target kernel buffer according to the byte granularity of the memory.
Optionally, the log-area writing subunit is configured to write the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area through a set consistency transaction interface.
Optionally, the apparatus further includes a judging unit and a deleting unit;
the judging unit is configured to, after the overwrite writing subunit overwrites the trailing data block into the corresponding kernel buffer, determine whether the trailing data block stored in the log area has reached a set time duration;
the deleting unit is configured to, if the trailing data block stored in the log area has reached the set time duration, delete the trailing data block stored in the log area.
Optionally, the mapping unit is configured to, when an overwrite operation of the IO data is performed, if no object data exists in the logical address space of the IO data on the hardware storage device, address according to the byte granularity of the memory so as to map the IO data to the kernel buffer corresponding to the logical address.
Optionally, the mapping unit includes an allocating subunit and a writing subunit;
the allocating subunit is configured to, when an append write operation of the IO data is performed, allocate a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data;
the writing subunit is configured to address according to the byte granularity of the memory so as to write the IO data into the new kernel buffer.
Optionally, for the process of obtaining the IO data, the apparatus includes an obtaining subunit and a slicing subunit;
the obtaining subunit is configured to obtain, according to a set protocol access interface, the to-be-processed data transmitted by a client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing modes;
the slicing subunit is configured to slice the to-be-processed data according to the corresponding slicing mode to obtain the IO data.
Optionally, the apparatus further includes a gathering unit;
the gathering unit is configured to gather each piece of IO data and its corresponding replica data into the same group;
the conversion unit is configured to synchronously perform, on the data in the same group, the step of converting the IO data into memory according to the set data structure area and the set data block granularity.
For the description of the features in the embodiment corresponding to Figure 6, reference may be made to the relevant description of the embodiment corresponding to Figure 2, which is not repeated here.
It can be seen from the above technical solution that, when IO data is obtained, the IO data is converted into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area. The IO data is mapped to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device. In this technical solution, by setting the data structure area, memory mapping of the IO data is supported, which reduces repeated copying of data and shortens the IO path. Moreover, the mapping process is based on direct memory access copying by the operating system: the user buffer and the kernel buffer share one piece of mapped data, and by establishing the shared mapping the IO data no longer needs to be copied from the kernel buffer to the user buffer, which further shortens the IO path. Managing the hardware storage device by means of memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive numbers of small files.
Figure 7 is a schematic structural diagram of a data storage system provided by an embodiment of the present application, including a storage management module 71, a transmission interface 72 and a hardware storage device 73; the storage management module 71 is connected to the hardware storage device 73 through the transmission interface 72;
the storage management module 71 is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area;
the storage management module 71 is configured to map the IO data to a kernel buffer through the transmission interface 72 according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device 73.
Optionally, the transmission interface includes a unit interface for transmitting data that meets the data block granularity requirement and a consistency transaction interface for transmitting data that does not meet the data block granularity requirement.
For the description of the features in the embodiment corresponding to Figure 7, reference may be made to the relevant description of the embodiment corresponding to Figure 2, which is not repeated here.
It can be seen from the above technical solution that, when IO data is obtained, the IO data is converted into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area. The IO data is mapped to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device. In this technical solution, by setting the data structure area, memory mapping of the IO data is supported, which reduces repeated copying of data and shortens the IO path. Moreover, the mapping process is based on direct memory access copying by the operating system: the user buffer and the kernel buffer share one piece of mapped data, and by establishing the shared mapping the IO data no longer needs to be copied from the kernel buffer to the user buffer, which further shortens the IO path. Managing the hardware storage device by means of memory mapping and byte addressing reduces the IO latency of the storage system in scenarios with massive numbers of small files.
It can be understood that, if the data storage method in the above embodiments is implemented in the form of software functional units and sold or used as an independent product, it may be stored in a computer non-volatile readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a non-volatile storage medium and executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned non-volatile storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic disk or an optical disk.
Based on this, an embodiment of the present application also provides a computer non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above data storage method are implemented.
The data storage method, apparatus, system, device and computer non-volatile readable storage medium provided by the embodiments of the present application have been introduced in detail above. Each embodiment in the specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made to one another. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively brief, and reference may be made to the description of the method part where relevant.
Professionals may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described generally in terms of their functions in the above description. Whether these functions are performed by hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The data storage method, apparatus, system, device and computer non-volatile readable storage medium provided by the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to the present application without departing from the principles of the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.
Claims (20)
- A data storage method, characterized in that it includes: when IO data is obtained, converting the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area; and mapping the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device.
- The data storage method according to claim 1, characterized in that converting the IO data into memory according to the set data structure area and the set data block granularity includes: dividing the IO data into IO data blocks according to the set data block granularity, wherein each IO data block has its corresponding numbering information; writing the metadata information of each of the IO data blocks into the first area; determining the data length and offset information of each of the IO data blocks according to the metadata information of each of the IO data blocks, wherein the offset information includes a logical offset and an actual offset; writing the numbering information, data length and offset information corresponding to each of the IO data blocks into the second area; and writing each of the IO data blocks into the third area.
- The data storage method according to claim 2, characterized in that before dividing the IO data into IO data blocks according to the set data block granularity, the method further includes: setting the value of the data block granularity to 4KB.
- The data storage method according to claim 2, characterized in that mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing an overwrite operation of the IO data, if object data exists in the logical address space of the IO data on the hardware storage device, writing a first IO data block that belongs to the overlapping area and meets the data block granularity requirement into a newly allocated kernel buffer; appending the preceding data block that is forward-adjacent to the first IO data block into a free kernel buffer adjacent to the newly allocated kernel buffer; writing the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area; and overwriting the trailing data block into the corresponding kernel buffer according to the data length and offset information corresponding to the trailing data block.
- The data storage method according to claim 4, characterized in that writing the first IO data block that belongs to the overlapping area and meets the data block granularity requirement into the newly allocated kernel buffer includes: determining, from the IO data, the first IO data block that belongs to the overlapping area and meets the data block granularity requirement; allocating, according to the information stored in the data structure area corresponding to the first IO data block, a target kernel buffer adjacent to the storage area of the existing object data for the first IO data block; and writing the first IO data block into the target kernel buffer according to the byte granularity of the memory.
- The data storage method according to claim 4, characterized in that writing the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area includes: writing the trailing data block that does not meet the data block granularity requirement and is backward-adjacent to the first IO data block into the log area through a set consistency transaction interface.
- The data storage method according to claim 4, characterized in that after overwriting the trailing data block into the corresponding kernel buffer, the method further includes: determining whether the trailing data block stored in the log area has reached a set time duration; and if the trailing data block stored in the log area has reached the set time duration, deleting the trailing data block stored in the log area.
- The data storage method according to claim 2, characterized in that mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing an overwrite operation of the IO data, if no object data exists in the logical address space of the IO data on the hardware storage device, addressing according to the byte granularity of the memory so as to map the IO data to the kernel buffer corresponding to the logical address.
- The data storage method according to claim 2, characterized in that mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing an append write operation of the IO data, allocating a new kernel buffer for the IO data according to the information stored in the data structure area corresponding to the IO data; and addressing according to the byte granularity of the memory so as to write the IO data into the new kernel buffer.
- The data storage method according to claim 2, characterized in that after addressing according to the byte granularity of the memory so as to write the IO data into the new kernel buffer, the method further includes: updating the metadata through a consistency transaction interface.
- The data storage method according to claim 1, characterized in that the process of obtaining the IO data includes: obtaining, according to a set protocol access interface, the to-be-processed data transmitted by a client; wherein the protocol access interfaces include an object interface, a block interface and a file system interface, and different protocol access interfaces have their corresponding data slicing modes; and slicing the to-be-processed data according to the corresponding slicing mode to obtain the IO data.
- The data storage method according to claim 11, characterized in that after slicing the to-be-processed data according to the corresponding slicing mode to obtain the IO data, the method further includes: gathering each piece of IO data and its corresponding replica data into the same group; and synchronously performing, on the data in the same group, the step of converting the IO data into memory according to the set data structure area and the set data block granularity.
- The data storage method according to claim 11, characterized in that gathering each piece of IO data and its corresponding replica data into the same group includes: dividing the data into 4MB objects; and mapping the objects into the same group through hash calculation.
- The data storage method according to claim 1, characterized in that mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: calculating the write position of the IO data from the metadata information of the object; and writing the IO data, according to the minimum allocation unit of 4KB agreed by the non-volatile memory device, using the write mode corresponding to the object at the position, wherein the write modes include create-append write and overwrite-modify write.
- The data storage method according to claim 1, characterized in that the data blocks to be written are A, C, C and B, wherein data blocks C, C and B are data to be written into the overlapping data area, data block A is data to be written into the non-overlapping area, the two data blocks C and data block A both meet the data block granularity requirement, and data block B does not meet the data block granularity requirement; wherein mapping the IO data to the kernel buffer according to the byte granularity of the memory and the information stored in the data structure area includes: when performing an overwrite operation of the IO data, writing the two data blocks C into a newly allocated kernel buffer; appending data block A into a free kernel buffer adjacent to the newly allocated kernel buffer; writing data block B into a log block of the log area; and overwriting data block B into the corresponding kernel buffer according to the data length and offset information corresponding to data block B.
- A data storage apparatus, characterized in that it includes a conversion unit and a mapping unit; the conversion unit is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area; and the mapping unit is configured to map the IO data to a kernel buffer according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on a hardware storage device.
- A data storage system, characterized in that it includes a storage management module, a transmission interface and a hardware storage device; the storage management module is connected to the hardware storage device through the transmission interface; the storage management module is configured to, when IO data is obtained, convert the IO data into memory according to a set data structure area and a set data block granularity; wherein the data structure area includes a first area for storing metadata information, a second area for storing data description information, a third area for storing object data, and a log area; and the storage management module is configured to map the IO data to a kernel buffer through the transmission interface according to the byte granularity of the memory and the information stored in the data structure area; wherein a user buffer and the kernel buffer share one piece of mapped data, and the user buffer is a buffer on the hardware storage device.
- The data storage system according to claim 17, characterized in that the transmission interface includes a unit interface for transmitting data that meets the data block granularity requirement and a consistency transaction interface for transmitting data that does not meet the data block granularity requirement.
- An electronic device, characterized in that it includes: a memory configured to store a computer program; and a processor configured to execute the computer program to implement the steps of the data storage method according to any one of claims 1 to 15.
- A computer non-volatile readable storage medium, characterized in that a computer program is stored on the computer non-volatile readable storage medium, and when the computer program is executed by a processor, the steps of the data storage method according to any one of claims 1 to 15 are implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211081604.4 | 2022-09-06 | ||
CN202211081604.4A CN115167786B (zh) | 2022-09-06 | 2022-09-06 | Data storage method, apparatus, system, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024051109A1 true WO2024051109A1 (zh) | 2024-03-14 |
Family
ID=83481844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/078282 WO2024051109A1 (zh) | 2022-09-06 | 2023-02-24 | Data storage method, apparatus, system, device and medium
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115167786B (zh) |
WO (1) | WO2024051109A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115167786B (zh) * | 2022-09-06 | 2023-01-24 | 浪潮电子信息产业股份有限公司 | Data storage method, apparatus, system, device and medium |
CN116069685B (zh) * | 2023-03-07 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Storage system write control method, apparatus, device and readable storage medium |
CN116561036B (zh) * | 2023-07-10 | 2024-04-02 | 牛芯半导体(深圳)有限公司 | Data access control method, apparatus, device and storage medium |
CN116795296B (zh) * | 2023-08-16 | 2023-11-21 | 中移(苏州)软件技术有限公司 | Data storage method, storage device and computer-readable storage medium |
CN118170323B (zh) * | 2024-05-11 | 2024-08-16 | 中移(苏州)软件技术有限公司 | Data read/write method and apparatus, electronic device, storage medium and program product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062253A (zh) * | 2017-12-11 | 2018-05-22 | 北京奇虎科技有限公司 | Communication method, apparatus and terminal between kernel mode and user mode |
US20200004670A1 (en) * | 2018-06-28 | 2020-01-02 | Seagate Technology Llc | Segregating map data among different die sets in a non-volatile memory |
CN111221776A (zh) * | 2019-12-30 | 2020-06-02 | 上海交通大学 | Implementation method, system and medium of a file system oriented to non-volatile memory |
CN111881104A (zh) * | 2020-07-29 | 2020-11-03 | 苏州浪潮智能科技有限公司 | NFS server and data writing method, apparatus and storage medium thereof |
CN115167786A (zh) * | 2022-09-06 | 2022-10-11 | 浪潮电子信息产业股份有限公司 | Data storage method, apparatus, system, device and medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20000065846A (ko) * | 1999-04-09 | 2000-11-15 | 구자홍 | Zero-copy method between kernel and user in an operating system |
US7290114B2 (en) * | 2004-11-17 | 2007-10-30 | Intel Corporation | Sharing data in a user virtual address range with a kernel virtual address range |
US9213501B2 (en) * | 2013-05-23 | 2015-12-15 | Netapp, Inc. | Efficient storage of small random changes to data on disk |
CN107894921A (zh) * | 2017-11-09 | 2018-04-10 | 郑州云海信息技术有限公司 | Implementation method and system for performance statistics of distributed block storage volumes |
CN107908365A (zh) * | 2017-11-14 | 2018-04-13 | 郑州云海信息技术有限公司 | Method, apparatus and device for data interaction of a user-mode storage system |
US11263122B2 (en) * | 2019-04-09 | 2022-03-01 | Vmware, Inc. | Implementing fine grain data coherency of a shared memory region |
CN113228576B (zh) * | 2019-08-06 | 2022-10-04 | 华为技术有限公司 | Method and apparatus for processing data in a network |
CN113254198B (zh) * | 2021-04-30 | 2022-08-05 | 南开大学 | Unified persistent memory management method integrating the Linux virtual memory system and file system |
CN114253713B (zh) * | 2021-12-07 | 2024-07-09 | 中信银行股份有限公司 | Reactor-based asynchronous batch processing method and system |
CN114610660A (zh) * | 2022-03-01 | 2022-06-10 | Oppo广东移动通信有限公司 | Method, apparatus and system for controlling interface data |
- 2022-09-06: CN application CN202211081604.4A issued as patent CN115167786B (zh) — status: active
- 2023-02-24: PCT application PCT/CN2023/078282 published as WO2024051109A1 (zh) — status: unknown
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062253A (zh) * | 2017-12-11 | 2018-05-22 | 北京奇虎科技有限公司 | Communication method, apparatus and terminal between kernel mode and user mode |
US20200004670A1 (en) * | 2018-06-28 | 2020-01-02 | Seagate Technology Llc | Segregating map data among different die sets in a non-volatile memory |
CN111221776A (zh) * | 2019-12-30 | 2020-06-02 | 上海交通大学 | Implementation method, system and medium of a file system oriented to non-volatile memory |
CN111881104A (zh) * | 2020-07-29 | 2020-11-03 | 苏州浪潮智能科技有限公司 | NFS server and data writing method, apparatus and storage medium thereof |
CN115167786A (zh) * | 2022-09-06 | 2022-10-11 | 浪潮电子信息产业股份有限公司 | Data storage method, apparatus, system, device and medium |
Also Published As
Publication number | Publication date |
---|---|
CN115167786B (zh) | 2023-01-24 |
CN115167786A (zh) | 2022-10-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024051109A1 (zh) | Data storage method, apparatus, system, device and medium | |
JP7329518B2 (ja) | 追加専用記憶デバイスを使用するデータベース管理のためのシステム及び方法 | |
US20210117441A1 (en) | Data replication system | |
WO2019141186A1 (zh) | Data processing method and apparatus | |
US11636089B2 (en) | Deferred reclamation of invalidated entries that are associated with a transaction log in a log-structured array | |
US20100199065A1 (en) | Methods and apparatus for performing efficient data deduplication by metadata grouping | |
US11487460B2 (en) | Deferred reclamation of invalidated entries associated with replication in a log-structured array | |
WO2014000300A1 (zh) | Data caching apparatus, data storage system and method | |
WO2023015866A1 (zh) | Data writing method, apparatus, system, electronic device and storage medium | |
GB2534956A (en) | Storage system and storage control method | |
US11240306B2 (en) | Scalable storage system | |
CN108595347B (zh) | 一种缓存控制方法、装置及计算机可读存储介质 | |
KR20190033122A (ko) | 멀티캐스트 통신 프로토콜에 따라 호스트와 통신하는 저장 장치 및 호스트의 통신 방법 | |
US11704053B1 (en) | Optimization for direct writes to raid stripes | |
WO2023246843A1 (zh) | Data processing method, apparatus and system | |
WO2024131379A1 (zh) | Data storage method, apparatus and system | |
WO2019089057A1 (en) | Scalable storage system | |
EP4044015A1 (en) | Data processing method and apparatus | |
KR20210075038A (ko) | 분산형 블록 저장시스템, 방법, 장치, 장비와 매체 | |
US11650920B1 (en) | Write cache management | |
US11775194B2 (en) | Data storage method and apparatus in distributed storage system, and computer program product | |
WO2022033269A1 (zh) | Data processing method, device and system | |
WO2024001863A1 (zh) | Data processing method and related device | |
US20240020014A1 (en) | Method for Writing Data to Solid-State Drive | |
WO2017054714A1 (zh) | Reading method and apparatus for disk array |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23861817 Country of ref document: EP Kind code of ref document: A1 |