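WO2023082556A1 - Hybrid data update method and storage system for in-memory key-value erasure codes - Google Patents

Hybrid data update method and storage system for in-memory key-value erasure codes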

Info

Publication number
WO2023082556A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
update
check
data
node
Prior art date
Application number
PCT/CN2022/089526
Other languages
English (en)
French (fr)
Inventor
胡燏翀
程良锋
徐佳
冯丹
Original Assignee
华中科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华中科技大学
Publication of WO2023082556A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1012 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using codes or arrangements adapted for a specific type of error
    • G06F11/1016 Error in accessing a memory location, i.e. addressing error
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature

Definitions

  • the invention belongs to the field of reliable computer data storage and, more specifically, relates to a hybrid data update method and storage system for in-memory key-value erasure codes.
  • in-place update
  • full-stripe update
  • parity logging: updates to parity (check) blocks are appended to the disk of a log node.
  • the in-place update method consumes relatively high network transmission resources
  • the full-stripe update method has high storage overhead
  • parity logging eliminates the overhead of reading the original parity block during an update, and its storage overhead is currently the lowest.
  • the present invention provides a hybrid data update method and device for in-memory key-value erasure codes, which applies parity logging to an in-memory key-value system to reduce storage overhead in memory and improve update performance when data changes.
  • the present invention provides a hybrid update storage system for an in-memory key-value erasure code system, including: several clients, a proxy server, k+1 memory nodes and r-1 log nodes; k is the number of data blocks; r is the number of parity (check) blocks;
  • the proxy server is connected to the clients, the memory nodes and the log nodes by two-way communication links;
  • the memory nodes store the data blocks and the XOR block (one of the parity blocks) and update them in place;
  • the log nodes store the remaining parity blocks and update them by parity logging;
  • the proxy server stores the stripe metadata, serves the access requests sent by the clients, and repairs node failures;
  • the client provides the external interface for memory access requests.
  • a queue in the proxy server acts as an encoding buffer that takes in the clients' write requests; the proxy server allocates k fixed-size data blocks for objects and encodes the k data blocks into r parity blocks of the same size; the proxy server keeps the stripe metadata needed to complete client requests, degraded reads and repair operations.
  • the metadata includes a stripe ID, an object index and a stripe index
  • the stripe ID distinguishes one stripe from another;
  • the object index stores the mapping between an object and its stripe ID, and the mapping between an object and the metadata of its specific data block;
  • the stripe index stores, for each stripe, its k data blocks and r parity blocks.
  • the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
  • the buffer accumulates multiple parity block deltas
  • when the buffer is full, all parity block deltas are first written to the disk in the order in which they were buffered; the deltas belonging to the same stripe are then merged periodically, and each merged delta is stored in the reserved space of its parity block.
  • the present invention provides a hybrid data update method for in-memory key-value erasure codes, comprising the following steps:
  • build a hybrid update storage system that includes multiple clients, a proxy server, k+1 memory nodes and r-1 log nodes; the data blocks and the XOR block (one of the parity, or check, blocks) are stored on the memory nodes; the remaining parity blocks are stored on the log nodes; k is the number of data blocks; r is the number of parity blocks;
  • based on the stripe ID of the object, the proxy server reads the old value of the object's data block and the XOR block of the stripe, and obtains the encoding coefficient of each parity block from the sequence number of the object;
  • the XOR block and the XOR block delta are merged by XOR to update the XOR block;
  • the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
  • parity logging proceeds as follows: multiple parity block deltas are stored in the buffer;
  • the parity block deltas of the same stripe are merged periodically, and each merged delta is stored in the reserved space of its parity block.
  • the hybrid data update method provided by the present invention builds a hybrid update storage system in which the data blocks and the XOR block are stored on memory nodes and the remaining parity blocks are stored on log nodes;
  • the data blocks and the XOR block on the memory nodes are updated in place, while the remaining parity blocks on the log nodes are updated by parity logging.
  • combining in-place update with parity logging strikes a good balance among storage cost, data update performance and single-failure repair performance.
  • because the XOR block resides on a memory node, it is read from the memory node for coding; updating the data blocks and the XOR block in place makes the method superior to current mainstream schemes in repair performance and memory storage overhead; updating the remaining parity blocks by parity logging ensures efficient update performance.
  • Figure 1(a) is a schematic diagram of the in-place update process provided by an embodiment of the present invention;
  • Figure 1(b) is a schematic diagram of the full-stripe update process provided by the embodiment of the present invention when the update frequency is high;
  • Figure 1(c) is a schematic diagram of the full-stripe update process provided by the embodiment of the present invention when the update frequency is low;
  • Figure 1(d) is a schematic diagram of the parity logging update process provided by the embodiment of the present invention;
  • Figure 2 is the storage architecture diagram of the hybrid update provided by the embodiment of the present invention when the parameters are (6,3);
  • Figure 3 is a schematic diagram of the update process in which D1 is updated to D1' when the parameters are (6,3);
  • Figure 4 is a schematic diagram of the merge-based buffer log for the parity block a+2b provided by the embodiment of the present invention;
  • Figure 5 is a schematic diagram of the PLM process provided by the embodiment of the present invention when the parameters are (2,2).
  • the present invention provides a hybrid data update method for an in-memory key-value erasure code system, comprising the following steps:
  • a hybrid update storage system is used to store the data blocks and the parity (check) blocks.
  • for an erasure code deployed with parameters (k, r), the architecture includes multiple clients, a proxy server, k+1 memory nodes and r-1 log nodes; data blocks are stored in memory nodes and updated in place, and the XOR block (the first parity block) is also placed in a memory node and updated in place, while the remaining parity blocks are stored in log nodes and updated by parity logging; each log node consists of a buffer and a disk, and the buffer is located between the proxy server and the disk.
  • during an update, the parity block delta is written to the buffer of the log node and then written to the disk of the log node in batches; k is the number of data blocks; r is the number of parity blocks;
  • an update operation means that the object value stored in an existing data block has changed to a new value, so the parity blocks of the corresponding stripe must be updated as well; to update the object value, the proxy server first obtains the stripe ID, sequence number, offset and length of the object from the object index metadata; the proxy server then reads the old value of the data object and the XOR block of the stripe (both located by the stripe ID); next, the proxy server computes the data block delta from the old and new values, and uses the data block delta and the encoding matrix coefficient of the parity block (determined by the sequence number) to compute the XOR block delta; merging the XOR block with the XOR block delta by XOR yields the new XOR block; the proxy server also sends the data block delta to each log node together with that log node's encoding coefficient, and the log node computes the delta of its parity block in a similar way; finally, the new data object and the XOR block are written to the memory nodes, and the parity block deltas, with offset and length, are written to the log nodes;
  • on the log nodes, log records are written using lazily merged parity logging (PLM): parity block deltas are first written in time order to an additional contiguous disk space, later read back together, merged per stripe, and the merged deltas are then stored;
  • parity blocks and their deltas on a log node are kept using parity logging with reserved space: storage space is reserved in advance on the log node for each parity block, and on an update the deltas merged by PLM are kept in the contiguous reserved space that follows the old parity block; this reduces random disk reads when the latest parity block is computed and improves degraded-read repair performance;
  • when a memory node fails, the traditional repair method uses all remaining memory nodes for degraded reads and decoding to recover the data of the failed node, but the surviving memory nodes must also keep serving requests through the proxy server, so transmitting repair data at the same time is likely to congest their network; the log-node-assisted repair of the present invention uses the relatively idle log nodes together with the memory nodes; for an erasure code deployed with parameters (k, r), recovering the block that a stripe held on the failed node requires reading one parity block from a log node and k-1 blocks from memory nodes, which reduces the network transmission overhead of the memory nodes compared with the traditional repair method.
  • with an erasure code of parameters (6,3), a stripe consists of 6 data blocks D1 to D6 and 3 parity blocks P1 to P3;
  • Figure 1(b) and Figure 1(c) show the full-stripe update process at two different update frequencies;
  • Figure 1(b) is the case of frequent updates
  • the data blocks D1 to D6 are all updated to D1' to D6'; the updated data blocks are re-encoded to obtain new parity blocks P4 to P6
  • the hybrid data update method of the embodiment of the present invention, for an in-memory key-value erasure code system, is further described below in conjunction with specific implementations and includes the following steps:
  • Figure 2 shows the hybrid update storage architecture when the parameters are (6,3); the architecture includes multiple clients, a proxy server, 7 memory nodes and 2 log nodes; taking one stripe as an example, the data blocks D1 to D6 are stored on 6 memory nodes and updated in place; the XOR parity block P1 is also stored on a memory node and updated in place; the remaining 2 parity blocks P2 and P3 are stored on log nodes and updated by parity logging; each log node consists of a buffer and a disk, and the buffer is located between the proxy server and the disk. During an update, the parity block delta is written to the buffer of the log node and then written asynchronously, in batches, to the disk of the log node;
  • the proxy server uses the object index metadata of the object being updated to obtain its stripe ID, sequence number, offset and length
  • the proxy server computes ΔD1 and P1'; D1 is updated in place to D1', P1 is updated in place to P1', and the proxy server sends ΔD1 to the nodes holding P2 and P3;
  • the nodes holding D1 and P1 send an acknowledgement to the proxy server once their writes complete; the nodes holding P2 and P3 compute ΔP2 and ΔP3 respectively and save them in their own buffers;
  • Figure 4 shows the buffer of the node holding the parity block whose content is a+2b; assume the incoming data stream is a, b, a', b', a''; here a and b are the original data, a' and a'' are successive updates of a, and b' is an update of b; the buffer holds the parity block deltas computed from the data block deltas, namely a'-a, 2(b'-b) and a''-a', three in total, which are merged into a single parity block delta (a''-a)+2(b'-b) before being stored;
  • Figure 5 is a schematic diagram of the PLM process on one log node in an erasure code architecture with parameters (2,2); assume that a and b, and c and d, are data blocks from two different stripes, and that the second parity block of each of these two stripes is stored on the log node shown in the figure; data arrives and is updated in the order a, b, c, d, a', c', c'', b', a'', b''; the corresponding block stream in the log node buffer is a+2b, c+2d, a'-a, c'-c, c''-c', 2(b'-b), a''-a', 2(b''-b'); under PLM these blocks are first stored directly in a contiguous disk space; they are later read back and merged, giving the blocks a''+2b'' and c''+2d, which are stored in a log file allocated in advance on disk, with the area reserved to the right of each parity block holding parity block deltas that may arrive in the future;

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

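The invention provides a hybrid data update method and storage system for in-memory key-value erasure codes, belonging to the field of reliable computer data storage. The method includes: building a hybrid update storage system; obtaining the stripe ID, sequence number, offset and length of an object from the object index metadata; reading the old value of the object's data block and the XOR value of the stripe, and obtaining the encoding coefficient of each parity block; taking the difference between the old value and the updated value of the data block to obtain the data block delta; computing the delta of each parity block from the data block delta and that parity block's encoding coefficient; merging the XOR block with the XOR block delta by XOR to update the XOR block; writing the object's data block and the updated XOR block to the memory nodes by in-place update; and writing the parity block deltas, with offset and length, to the log nodes by parity logging. The invention balances storage cost, data update performance and single-failure repair performance.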

Description

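Hybrid data update method and storage system for in-memory key-value erasure codes
[Technical Field]
The invention belongs to the field of reliable computer data storage and, more specifically, relates to a hybrid data update method and storage system for in-memory key-value erasure codes.
[Background]
Parallel file systems on modern HPC clusters commonly use in-memory key-value stores for efficient data analysis. To keep storage reliable, many distributed key-value stores place replicas of the data on different nodes. This achieves fault tolerance but greatly increases data redundancy. Recent research instead relies on erasure coding to make in-memory key-value stores reliable, which significantly reduces storage cost compared with replication. Erasure coding encodes the original data blocks into parity blocks; the data blocks and parity blocks together form a stripe. When data is corrupted or lost, it can be recovered from the remaining blocks of the stripe.
For big-data analytics workloads, data updates are heavy and unavoidable. In erasure-coded storage an update is very expensive, because updating any data block forces an update of every parity block. With wide stripes, each stripe contains more data blocks, which may be spread over tens or even hundreds of nodes, so the update overhead grows further.
Erasure-coded data updates follow three mainstream approaches: 1) in-place update, in which the original data blocks and parity blocks are overwritten in place by the updated ones; 2) full-stripe update, in which the updated data blocks are grouped together and encoded directly into a new stripe; 3) parity logging, in which parity block updates are appended to the disk of a log node. Of the three, in-place update consumes more network transmission resources, full-stripe update has high storage overhead, and parity logging avoids reading the original parity blocks during the update while having the lowest storage overhead to date.
However, in-memory key-value stores that deploy erasure coding today still use only in-place update or full-stripe update. Parity logging effectively reduces the network transmission overhead and storage overhead of erasure-coded updates, but the large performance gap between the disks of parity log nodes and the in-memory key-value store is hard to handle. As a result, no system has yet considered parity logging. Specifically, even though network transmission overhead drops, the log nodes still slow down updates; in addition, when a single failure occurs, the parity log sits on disk, so reading it back to reconstruct the latest parity block drags down degraded-read performance.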
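[Summary of the Invention]
In view of the shortcomings of the prior art, the invention provides a hybrid data update method and device for in-memory key-value erasure codes. Its purpose is to adopt a hybrid update storage architecture for data blocks and parity blocks and, on that basis, to apply parity logging in an in-memory key-value system so as to reduce storage overhead in memory and improve update performance when data changes.
To achieve this, in a first aspect the invention provides a hybrid update storage system for an in-memory key-value erasure code system, including: several clients, a proxy server, k+1 memory nodes and r-1 log nodes; k is the number of data blocks; r is the number of parity blocks;
The proxy server is connected to the clients, the memory nodes and the log nodes by two-way communication links;
The memory nodes store the data blocks and the XOR block (one of the parity blocks) and update them in place; the log nodes store the remaining parity blocks and update them by parity logging;
The proxy server acts as the front-end interface for the clients: it stores the stripe metadata, serves the memory access requests sent by the clients, and repairs node failures;
The client provides the external interface for memory access requests.
More specifically, a queue in the proxy server acts as an encoding buffer that takes in the clients' write requests; the proxy server allocates k fixed-size data blocks for objects and encodes the k data blocks into r parity blocks of the same size; the proxy server keeps the stripe metadata needed to complete client requests, degraded reads, repairs and similar operations.
Preferably, the metadata includes a stripe ID, an object index and a stripe index;
The stripe ID distinguishes one stripe from another; the object index stores the mapping between an object and its stripe ID, and the mapping between an object and the metadata of its specific data block; the stripe index stores, for each stripe, its k data blocks and r parity blocks.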
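The three kinds of metadata described above can be pictured as two small lookup tables keyed by object and by stripe ID. The following is only a sketch under stated assumptions: the class names `ObjectLocation` and `StripeMetadata`, their fields, and the use of plain dictionaries are hypothetical and are not specified by the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectLocation:
    """Hypothetical object-index entry: where one object lives."""
    stripe_id: int  # stripe that holds the object
    seq: int        # sequence number of the data block inside the stripe
    offset: int     # byte offset of the object inside that data block
    length: int     # object length in bytes


@dataclass
class StripeMetadata:
    """Hypothetical proxy-side metadata for a (k, r) erasure code."""
    k: int
    r: int
    object_index: dict = field(default_factory=dict)  # key -> ObjectLocation
    stripe_index: dict = field(default_factory=dict)  # stripe_id -> (data ids, parity ids)

    def add_stripe(self, stripe_id, data_blocks, parity_blocks):
        # Each stripe must list exactly k data blocks and r parity blocks.
        if len(data_blocks) != self.k or len(parity_blocks) != self.r:
            raise ValueError("a stripe needs k data blocks and r parity blocks")
        self.stripe_index[stripe_id] = (tuple(data_blocks), tuple(parity_blocks))

    def put(self, key, location):
        # An object can only be indexed against a stripe that is already known.
        if location.stripe_id not in self.stripe_index:
            raise KeyError(f"unknown stripe {location.stripe_id}")
        self.object_index[key] = location

    def locate(self, key):
        """Return the object's location, its data block id and the stripe's parity block ids."""
        location = self.object_index[key]
        data_blocks, parity_blocks = self.stripe_index[location.stripe_id]
        return location, data_blocks[location.seq], parity_blocks


meta = StripeMetadata(k=2, r=2)
meta.add_stripe(7, ["D0", "D1"], ["P0", "P1"])
meta.put("user:1", ObjectLocation(stripe_id=7, seq=1, offset=16, length=8))
location, data_block, parity_blocks = meta.locate("user:1")
```

A lookup by object key yields everything an update needs: the stripe ID, the sequence number that selects the encoding coefficients, and the offset and length of the object inside its data block.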
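Preferably, the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
The buffer accumulates multiple parity block deltas;
When the buffer is full, all parity block deltas are first written to the disk in the order in which they were buffered; the deltas belonging to the same stripe are then merged periodically, and each merged delta is stored in the reserved space of its parity block.
Preferably, when a memory node fails, one parity block is read from a log node and k-1 data blocks are read from the memory nodes to repair the failed memory node.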
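In a second aspect, the invention provides a hybrid data update method for in-memory key-value erasure codes, comprising the following steps:
Build a hybrid update storage system that includes multiple clients, a proxy server, k+1 memory nodes and r-1 log nodes; the data blocks and the XOR block (one of the parity blocks) are stored on the memory nodes; the remaining parity blocks are stored on the log nodes; k is the number of data blocks; r is the number of parity blocks;
Obtain the stripe ID, sequence number, offset and length of the object from the object index metadata;
Based on the stripe ID of the object, the proxy server reads the old value of the object's data block and the XOR block of the stripe, and obtains the encoding coefficient of each parity block from the sequence number of the object;
Take the difference between the old value and the updated value of the data block to obtain the data block delta;
Compute the delta of each parity block from the data block delta and the encoding matrix coefficient of that parity block;
Merge the XOR block with the XOR block delta by XOR to update the XOR block;
Write the object's data block and the updated XOR block to the memory nodes by in-place update;
Write the parity block deltas, with offset and length, to the log nodes by parity logging.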
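The delta computation in the steps above can be sketched as follows. This is a minimal illustration under assumptions the patent does not state: arithmetic is done in GF(2^8) with the reduction polynomial 0x11D, so "taking the difference" is a byte-wise XOR, the XOR block has coefficient 1 for every data block, and the encoding matrix and the helper names (`gf_mul`, `hybrid_update`, and so on) are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def scale(coeff: int, block: bytes) -> bytes:
    """Multiply every byte of a block by one encoding coefficient."""
    return bytes(gf_mul(coeff, byte) for byte in block)


def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR, which is both addition and subtraction in GF(2^8)."""
    return bytes(p ^ q for p, q in zip(x, y))


def encode(coeffs, data_blocks) -> bytes:
    """Compute one parity block from scratch as the sum of coeff * data block."""
    parity = bytes(len(data_blocks[0]))
    for coeff, block in zip(coeffs, data_blocks):
        parity = xor_blocks(parity, scale(coeff, block))
    return parity


def hybrid_update(old_data, new_data, xor_block, log_coeffs):
    """Return the new XOR block and one parity delta per log node."""
    data_delta = xor_blocks(old_data, new_data)         # data block delta
    new_xor_block = xor_blocks(xor_block, data_delta)   # in-place update of the XOR block
    parity_deltas = [scale(c, data_delta) for c in log_coeffs]  # appended to the logs
    return new_xor_block, parity_deltas


# Hypothetical stripe with 3 data blocks and 3 parity blocks; row 0 is the XOR block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
matrix = [[1, 1, 1], [1, 2, 4], [1, 3, 5]]
old_parity = [encode(row, data) for row in matrix]

seq = 1                  # sequence number of the updated data block
new_value = b"\x7f\x80"
new_xor, deltas = hybrid_update(
    data[seq], new_value, old_parity[0], [row[seq] for row in matrix[1:]]
)
```

Because encoding is linear, XOR-ing each logged delta into its old parity block gives the same result as re-encoding the whole stripe, which is why the log nodes never need to read the old parity block at update time.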
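Preferably, the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
Parity logging proceeds as follows: multiple parity block deltas are stored in the buffer;
When the buffer is full, all parity block deltas are first written to a contiguous disk space in the order in which they were buffered;
The parity block deltas of the same stripe are merged periodically, and each merged delta is stored in the reserved space of its parity block.
Preferably, when a memory node fails, one parity block is read from a log node and k-1 data blocks are read from the memory nodes to repair the failed memory node.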
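In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:
The hybrid data update method provided by the invention builds a hybrid update storage system in which the data blocks and the XOR block are stored on memory nodes and the remaining parity blocks are stored on log nodes. The data blocks and the XOR block on the memory nodes are updated in place, while the remaining parity blocks on the log nodes are updated by parity logging. Combining in-place update with parity logging strikes a good balance among storage cost, data update performance and single-failure repair performance.
In the prior art there is a large gap between disk performance and memory performance. If all parity blocks were on log nodes, a degraded read triggered by a single corrupted data block would have to read the parity block deltas and the parity block from disk and merge them into the latest parity block, and then decode the latest parity block with the remaining data blocks to recover the corrupted data block; disk reads make this process slow. For this reason, in the hybrid data update method provided by the invention, the XOR block resides on a memory node and is read from there for coding; updating the data blocks and the XOR block in place makes the method superior to current mainstream schemes in both repair performance and memory storage overhead; updating the remaining parity blocks by parity logging ensures efficient update performance.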
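[Brief Description of the Drawings]
Figure 1(a) is a schematic diagram of the in-place update process provided by an embodiment of the invention;
Figure 1(b) is a schematic diagram of the full-stripe update process provided by the embodiment of the invention when the update frequency is high;
Figure 1(c) is a schematic diagram of the full-stripe update process provided by the embodiment of the invention when the update frequency is low;
Figure 1(d) is a schematic diagram of the parity logging update process provided by the embodiment of the invention;
Figure 2 is the storage architecture diagram of the hybrid update provided by the embodiment of the invention when the parameters are (6,3);
Figure 3 is a schematic diagram of the update process in which D1 is updated to D1' when the parameters are (6,3);
Figure 4 is a schematic diagram of the merge-based buffer log for the parity block a+2b provided by the embodiment of the invention;
Figure 5 is a schematic diagram of the PLM process provided by the embodiment of the invention when the parameters are (2,2).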
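[Detailed Description]
To make the purpose, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it.
In one aspect, the invention provides a hybrid data update method for an in-memory key-value erasure code system, comprising the following steps:
(1) Hybrid update storage system
A hybrid update storage system is used to store the data blocks and parity blocks. For an erasure code deployed with parameters (k, r), the architecture includes multiple clients, a proxy server, k+1 memory nodes and r-1 log nodes; data blocks are stored in memory nodes and updated in place, and the XOR block (the first parity block) is also placed in a memory node and updated in place, while the remaining parity blocks are stored in log nodes and updated by parity logging; each log node consists of a buffer and a disk, and the buffer is located between the proxy server and the disk. During an update, the parity block delta is written to the buffer of the log node and then written to the disk of the log node in batches; k is the number of data blocks; r is the number of parity blocks;
(2) Steps for updating data
An update operation means that the object value stored in an existing data block has changed to a new value, so the parity blocks of the corresponding stripe must be updated as well. To update the object value, the proxy server first obtains the stripe ID, sequence number, offset and length of the object from the object index metadata; the proxy server then reads the old value of the data object and the XOR block of the stripe (both located by the stripe ID); next, the proxy server computes the data block delta from the old and new values, and uses the data block delta and the encoding matrix coefficient of the parity block (determined by the sequence number) to compute the XOR block delta; merging the XOR block with the XOR block delta by XOR yields the new XOR block; the proxy server also sends the data block delta to each log node together with that log node's encoding coefficient, and the log node computes the delta of its parity block in a similar way; finally, the new data object and the XOR block are written to the memory nodes, and the parity block deltas, with offset and length, are written to the log nodes;
(3) Merge-based buffer logging
The buffer of a log node usually holds several deltas for the same parity block; these deltas are merged into a single block before being stored, which reduces disk I/O on the log node during updates;
(4) Merge-based parity logging
On the log nodes, log records are written using lazily merged parity logging (PLM): parity block deltas are first written in time order to an additional contiguous disk space, later read back together, merged per stripe, and the merged deltas are then stored;
Parity blocks and their deltas on a log node are kept using parity logging with reserved space: storage space is reserved in advance on the log node for each parity block, and on an update the deltas merged by PLM are kept in the contiguous reserved space that follows the old parity block; this reduces random disk reads when the latest parity block is computed and improves degraded-read repair performance;
(5) Log-node-assisted node repair optimization
When a memory node fails, the traditional repair method uses all remaining memory nodes for degraded reads and decoding to recover the data of the failed node, but the surviving memory nodes must also keep serving requests through the proxy server, so transmitting repair data at the same time is likely to congest their network. The log-node-assisted repair of the invention uses the log nodes, which are relatively idle at that moment, together with the memory nodes. For an erasure code deployed with parameters (k, r), recovering the block that a stripe held on the failed node requires reading one parity block from a log node and k-1 blocks from memory nodes, which reduces the network transmission overhead of the memory nodes compared with the traditional repair method.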
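In another aspect, the invention provides a hybrid update storage system for an in-memory key-value erasure code system, including: several clients, a proxy server, k+1 memory nodes and r-1 log nodes; k is the number of data blocks; r is the number of parity blocks;
The proxy server is connected to the clients, the memory nodes and the log nodes by two-way communication links;
The memory nodes store the data blocks and the XOR block (one of the parity blocks) and update them in place; the log nodes store the remaining parity blocks and update them by parity logging;
The proxy server acts as the front-end interface for the clients: it stores the stripe metadata, serves the memory access requests sent by the clients, and repairs node failures;
The client provides the external interface for memory access requests.
More specifically, a queue in the proxy server acts as an encoding buffer that takes in the clients' write requests; the proxy server allocates k fixed-size data blocks for objects and encodes the k data blocks into r parity blocks of the same size; the proxy server keeps the stripe metadata needed to complete client requests, degraded reads, repairs and similar operations.
Preferably, the metadata includes a stripe ID, an object index and a stripe index;
The stripe ID distinguishes one stripe from another; the object index stores the mapping between an object and its stripe ID, and the mapping between an object and the metadata of its specific data block; the stripe index stores, for each stripe, its k data blocks and r parity blocks.
Preferably, the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
The buffer accumulates multiple parity block deltas;
When the buffer is full, all parity block deltas are first written to the disk in the order in which they were buffered; the deltas belonging to the same stripe are then merged periodically, and each merged delta is stored in the reserved space of its parity block.
Preferably, when a memory node fails, one parity block is read from a log node and k-1 data blocks are read from the memory nodes to repair the failed memory node.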
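Embodiment
Before the embodiment of the invention is described, the three erasure-code update methods (in-place update, full-stripe update and parity logging) are briefly introduced;
As shown in Figures 1(a) to 1(d), take an erasure code with parameters (6,3) as an example: a stripe consists of 6 data blocks D1 to D6 and 3 parity blocks P1 to P3. Figure 1(a) shows in-place update: D1 is updated to D1'; its node computes ΔD1 = D1 - D1' and sends it to the three nodes holding the parity blocks; each parity node reads its original parity block and combines it with ΔD1 to compute the updated P1' to P3'. This process involves 3 parity block reads, and 9 blocks are stored once the update completes. Figures 1(b) and 1(c) show full-stripe update at two different update frequencies. In Figure 1(b), updates are frequent: all data blocks D1 to D6 are updated to D1' to D6', and the updated data blocks are re-encoded to obtain new parity blocks P4 to P6; the old data blocks and old parity blocks are cleaned up by a later garbage collection pass. This process involves no parity block reads, and 18 blocks are stored once the update completes. In Figure 1(c), updates are infrequent: only D1 is updated to D1'; D1' is then encoded again together with D2 to D6 to obtain parity blocks P7 to P9. In this process 3 parity blocks are recomputed, and 13 blocks are stored once the update completes. Figure 1(d) shows parity logging: D1 is updated to D1'; its node computes ΔD1 = D1 - D1' and sends it to the three nodes holding the parity logs, which compute the parity block deltas ΔP1 to ΔP3 and save them in the log. This process involves no parity block reads, and 12 blocks are stored once the update completes;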
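The block counts quoted for Figure 1 follow from simple bookkeeping, reproduced below as a sketch. The function name and the closed-form expressions are assumptions that generalize the (6,3) figures in the text to other parameters; full-stripe counts are taken before garbage collection, as in the description.

```python
def cost_after_update(k: int, r: int, updated: int, scheme: str):
    """Return (blocks stored, parity blocks read) after `updated` data blocks change once."""
    original = k + r  # blocks of the stripe before any update
    if scheme == "in_place":
        # Blocks are overwritten; each update reads the r old parity blocks.
        return original, r * updated
    if scheme == "full_stripe":
        # New data blocks plus r freshly encoded parity blocks are kept next to
        # the old stripe until garbage collection runs.
        return original + updated + r, 0
    if scheme == "parity_logging":
        # Data is overwritten; each update appends one delta per parity block.
        return original + r * updated, 0
    raise ValueError(f"unknown scheme: {scheme}")


in_place = cost_after_update(6, 3, 1, "in_place")
full_frequent = cost_after_update(6, 3, 6, "full_stripe")
full_rare = cost_after_update(6, 3, 1, "full_stripe")
logging_once = cost_after_update(6, 3, 1, "parity_logging")
```

With these inputs the four results match Figures 1(a) to 1(d): 9 blocks with 3 parity reads, 18 blocks, 13 blocks, and 12 blocks, the last three with no parity reads.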
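The hybrid data update method of the embodiment of the invention, for an in-memory key-value erasure code system, is further described below in conjunction with specific implementations and includes the following steps:
(1) Hybrid update storage architecture
Figure 2 shows the hybrid update storage architecture when the parameters are (6,3). The architecture includes multiple clients, a proxy server, 7 memory nodes and 2 log nodes. Taking one stripe as an example, the data blocks D1 to D6 are stored on 6 memory nodes and updated in place; the XOR parity block P1 is also stored on a memory node and updated in place; the remaining 2 parity blocks P2 and P3 are stored on log nodes and updated by parity logging. Each log node consists of a buffer and a disk, and the buffer is located between the proxy server and the disk. During an update, the parity block delta is written to the buffer of the log node and then written asynchronously, in batches, to the disk of the log node;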
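The placement rule in Figure 2 can be written down in a few lines. This is only a sketch: the function name `place_stripe` and the block labels are hypothetical, and it assumes one block per node for a single stripe.

```python
def place_stripe(k: int, r: int) -> dict:
    """Split the blocks of one (k, r) stripe between memory nodes and log nodes."""
    if k < 1 or r < 1:
        raise ValueError("k and r must be positive")
    # k data blocks plus the XOR block P1 live in memory and are updated in place.
    memory = [f"D{i}" for i in range(1, k + 1)] + ["P1"]
    # The remaining r-1 parity blocks live on log nodes and are updated by parity logging.
    log = [f"P{j}" for j in range(2, r + 1)]
    return {"memory": memory, "log": log}


layout = place_stripe(6, 3)
```

For (6,3) this yields 7 blocks on memory nodes (D1 to D6 and P1) and 2 on log nodes (P2 and P3), matching the node counts k+1 and r-1.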
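(2) Steps for updating data
Figure 3 shows the workflow by which data block D1 is updated to D1' in an erasure code deployment with parameters (6,3);
(2.1) The proxy server uses the object index metadata of the object being updated to obtain its stripe ID, sequence number, offset and length;
(2.2) The nodes holding D1 and P1 send D1 and P1 to the proxy server;
(2.3) The proxy server computes ΔD1 and P1'; D1 is updated in place to D1', P1 is updated in place to P1', and the proxy server sends ΔD1 to the nodes holding P2 and P3;
(2.4) The nodes holding D1 and P1 send an acknowledgement to the proxy server once their writes complete; the nodes holding P2 and P3 compute ΔP2 and ΔP3 respectively and save them in their own buffers;
(2.5) The nodes holding P2 and P3 send an acknowledgement to the proxy server;
(2.6) Once the proxy server has received acknowledgements from the four nodes holding D1, P1, P2 and P3, it considers the update operation complete;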
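(3) Merge-based buffer logging
Figure 4 shows the buffer of the node holding the parity block whose content is a+2b. Assume the incoming data stream is a, b, a', b', a''; here a and b are the original data, a' and a'' are successive updates of a, and b' is an update of b. The buffer holds the parity block deltas computed from the data block deltas, namely a'-a, 2(b'-b) and a''-a', three in total, which are merged into a single parity block delta (a''-a)+2(b'-b) before being stored;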
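The merge in Figure 4 can be checked with a few numbers. The sketch below follows the figure's notation and uses ordinary integer arithmetic for readability; this is an assumption, since a real erasure code would do the same bookkeeping with finite-field arithmetic. The function name and the sample values are hypothetical.

```python
def merge_deltas(buffered):
    """Merge buffered (parity_block_id, delta) records into one delta per parity block."""
    merged = {}
    for parity_id, delta in buffered:
        merged[parity_id] = merged.get(parity_id, 0) + delta
    return merged


# Hypothetical values for the stream a, b, a', b', a'' of Figure 4.
a, b = 3, 5
a1, b1, a2 = 4, 9, 10  # a', b', a''

# Deltas of the parity block a+2b, in arrival order.
buffer = [
    ("a+2b", a1 - a),        # a' - a
    ("a+2b", 2 * (b1 - b)),  # 2(b' - b)
    ("a+2b", a2 - a1),       # a'' - a'
]
merged = merge_deltas(buffer)
latest_parity = (a + 2 * b) + merged["a+2b"]
```

The three buffered deltas collapse into the single value (a''-a)+2(b'-b), so one write to disk replaces three, and adding it to the old parity gives the parity of the latest data, a''+2b'.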
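(4) Merge-based parity logging
Figure 5 is a schematic diagram of the PLM process on one log node in an erasure code architecture with parameters (2,2). Assume that a and b, and c and d, are data blocks from two different stripes, and that the second parity block of each of these two stripes is stored on the log node shown in the figure. Data arrives and is updated in the order a, b, c, d, a', c', c'', b', a'', b''; the corresponding block stream in the log node buffer is a+2b, c+2d, a'-a, c'-c, c''-c', 2(b'-b), a''-a', 2(b''-b'). Under PLM these blocks are first stored directly in a contiguous disk space; they are later read back and merged, giving the blocks a''+2b'' and c''+2d, which are stored in a log file allocated in advance on disk, with the area reserved to the right of each parity block holding parity block deltas that may arrive in the future;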
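The Figure 5 walk-through can be modelled with a small in-memory stand-in for the log node's disk. This is a sketch under assumptions: the class `LogNode`, its fields, and the integer arithmetic are hypothetical simplifications, and the "reserved space" is modelled as a list that follows each parity block.

```python
class LogNode:
    """Toy model of lazily merged parity logging (PLM) with reserved space."""

    def __init__(self):
        self.append_area = []  # contiguous, time-ordered records: (stripe_id, value)
        self.slots = {}        # stripe_id -> {"parity": value, "reserved": [deltas]}

    def append(self, stripe_id, value):
        # Fast path: a sequential append, with no read of the old parity block.
        self.append_area.append((stripe_id, value))

    def lazy_merge(self):
        # Read the append area back and merge all records of the same stripe.
        totals = {}
        for stripe_id, value in self.append_area:
            totals[stripe_id] = totals.get(stripe_id, 0) + value
        for stripe_id, total in totals.items():
            slot = self.slots.setdefault(stripe_id, {"parity": None, "reserved": []})
            if slot["parity"] is None:
                slot["parity"] = total          # first merge yields the parity block
            else:
                slot["reserved"].append(total)  # later deltas go into the reserved space
        self.append_area.clear()

    def latest_parity(self, stripe_id):
        # The parity block and its deltas sit side by side, so one pass suffices.
        slot = self.slots[stripe_id]
        return slot["parity"] + sum(slot["reserved"])


# Hypothetical values for a, b, c, d and their updates.
a, b, c, d = 1, 2, 3, 4
a1, c1, c2, b1, a2, b2 = 5, 6, 7, 8, 9, 10  # a', c', c'', b', a'', b''

node = LogNode()
stream = [
    (0, a + 2 * b), (1, c + 2 * d), (0, a1 - a), (1, c1 - c),
    (1, c2 - c1), (0, 2 * (b1 - b)), (0, a2 - a1), (0, 2 * (b2 - b1)),
]
for stripe_id, value in stream:
    node.append(stripe_id, value)
node.lazy_merge()
```

After the merge the two slots hold a''+2b'' and c''+2d. A later update of d would be appended, merged, and placed in the reserved list of stripe 1, so a degraded read still finds the parity block and its deltas next to each other.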
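(5) Log-assisted node repair
For an erasure code deployed with parameters (6,3), recovering the block that a stripe held on the failed node requires, under log-assisted node repair, reading 1 parity block from a log node and 5 blocks from memory nodes, so fewer blocks are read from memory nodes than with the traditional repair method. Because not all memory nodes are used, data repair on the nodes can also be carried out in parallel by a suitable method.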
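One way the (6,3) repair could work is sketched below: the lost data block is solved from one logged parity block and the 5 surviving data blocks. The assumptions are labelled here and in the comments: GF(2^8) arithmetic with the polynomial 0x11D, a hypothetical coefficient row for P2, and a parity block that has already been merged with its logged deltas. The helper names are not from the patent.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), found by exhaustive search."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    for candidate in range(1, 256):
        if gf_mul(a, candidate) == 1:
            return candidate
    raise ValueError("no inverse found")


def scale(coeff: int, block: bytes) -> bytes:
    return bytes(gf_mul(coeff, byte) for byte in block)


def xor_blocks(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))


def repair_with_log_parity(failed_seq, survivors, log_parity, coeffs):
    """Recover one data block from a logged parity block and the k-1 surviving data blocks."""
    remainder = log_parity
    for seq, block in survivors.items():
        # Remove each surviving block's contribution from the parity block.
        remainder = xor_blocks(remainder, scale(coeffs[seq], block))
    # What is left is coeff[failed] * lost block; divide the coefficient out.
    return scale(gf_inv(coeffs[failed_seq]), remainder)


# Hypothetical (6,3) stripe: coefficient row of P2 and six 3-byte data blocks.
coeffs = [1, 2, 4, 8, 16, 32]
data = [bytes([i, i * 7 % 256, 255 - i]) for i in range(10, 16)]
p2 = bytes(3)
for coeff, block in zip(coeffs, data):
    p2 = xor_blocks(p2, scale(coeff, block))

failed = 0  # the memory node holding the first data block fails
survivors = {seq: blk for seq, blk in enumerate(data) if seq != failed}
recovered = repair_with_log_parity(failed, survivors, p2, coeffs)
```

Only 5 memory nodes are contacted, and the sixth block comes from the log node, which is the traffic shift away from memory nodes that the description claims.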
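In summary, compared with the prior art, the invention has the following advantages:
The hybrid data update method provided by the invention builds a hybrid update storage system in which the data blocks and the XOR block are stored on memory nodes and the remaining parity blocks are stored on log nodes. The data blocks and the XOR block on the memory nodes are updated in place, while the remaining parity blocks on the log nodes are updated by parity logging. Combining in-place update with parity logging strikes a good balance among storage cost, data update performance and single-failure repair performance.
In the prior art there is a large gap between disk performance and memory performance. If all parity blocks were on log nodes, a degraded read triggered by a single corrupted data block would have to read the parity block deltas and the parity block from disk and merge them into the latest parity block, and then decode the latest parity block with the remaining data blocks to recover the corrupted data block; disk reads make this process slow. For this reason, in the hybrid data update method provided by the invention, the XOR block resides on a memory node and is read from there for coding; updating the data blocks and the XOR block in place makes the method superior to current mainstream schemes in both repair performance and memory storage overhead; updating the remaining parity blocks by parity logging ensures efficient update performance.
Those skilled in the art will readily understand that the above is only a preferred embodiment of the invention and does not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.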

Claims (7)

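  1. A hybrid data update storage system for in-memory key-value erasure codes, characterized by comprising: several clients, a proxy server, memory nodes and log nodes;
    the proxy server is connected to the clients, the memory nodes and the log nodes by two-way communication links;
    there are k+1 memory nodes, which store the data blocks and the XOR block among the parity blocks and update them in place, k being the number of data blocks; there are r-1 log nodes, which store the remaining parity blocks and update them by parity logging, r being the number of parity blocks;
    the proxy server acts as the front-end interface for the clients, stores the stripe metadata, serves the memory access requests sent by the clients, and repairs node failures;
    the client provides the external interface for memory access requests.
  2. The hybrid update storage system according to claim 1, characterized in that the metadata includes a stripe ID, an object index and a stripe index;
    the stripe ID distinguishes one stripe from another; the object index stores the mapping between an object and its stripe ID, and the mapping between an object and the metadata of its specific data block; the stripe index stores, for each stripe, its k data blocks and r parity blocks.
  3. The hybrid update storage system according to claim 1 or 2, characterized in that the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
    the buffer accumulates multiple parity block deltas; when the buffer is full, the disk receives all parity block deltas in the order in which they were buffered, the deltas of the same stripe are merged, and each merged delta is stored in the reserved space of its parity block.
  4. The hybrid update storage system according to claim 1 or 2, characterized in that, when a memory node fails, one parity block is read from a log node and k-1 data blocks are read from the memory nodes to repair the failed memory node.
  5. A hybrid data update method for in-memory key-value erasure codes, characterized by comprising the following steps:
    building a hybrid update storage system, wherein the storage system includes multiple clients, a proxy server, k+1 memory nodes and r-1 log nodes; the data blocks and the XOR block among the parity blocks are stored on the memory nodes; the remaining parity blocks are stored on the log nodes; k is the number of data blocks; r is the number of parity blocks;
    obtaining the stripe ID, sequence number, offset and length of the object from the object index metadata;
    based on the stripe ID of the object, reading, by the proxy server, the old value of the object's data block and the XOR value of the stripe, and obtaining the encoding coefficient of each parity block from the sequence number of the object;
    taking the difference between the old value and the updated value of the data block to obtain the data block delta;
    computing the delta of each parity block from the data block delta and the encoding coefficient of that parity block;
    merging the XOR block with the XOR block delta by XOR to update the XOR block;
    writing the object's data block and the updated XOR block to the memory nodes by in-place update;
    writing the parity block deltas, with offset and length, to the log nodes by parity logging.
  6. The hybrid data update method according to claim 5, characterized in that the log node includes a buffer and a disk; the buffer is located between the proxy server and the disk;
    parity logging proceeds as follows:
    multiple parity block deltas are stored in the buffer; when the buffer is full, all parity block deltas are written to a contiguous disk space in the order in which they were buffered; the parity block deltas of the same stripe are merged periodically, and each merged delta is stored in the reserved space of its parity block.
  7. The hybrid data update method according to claim 5 or 6, characterized in that, when a memory node fails, one parity block is read from a log node and a total of k-1 data blocks and XOR blocks are read from the memory nodes to repair the failed memory node.
PCT/CN2022/089526 2021-11-09 2022-04-27 Hybrid data update method and storage system for in-memory key-value erasure codes WO2023082556A1 (zh)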
PCT/CN2022/089526 2021-11-09 2022-04-27 一种面向内存键值纠删码的混合数据更新方法及存储系统 WO2023082556A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111320055.7 2021-11-09
CN202111320055.7A CN114138526A (zh) 2021-11-09 2021-11-09 Hybrid data update method and storage system for in-memory key-value erasure codes

Publications (1)

Publication Number Publication Date
WO2023082556A1 true WO2023082556A1 (zh) 2023-05-19

Family

ID=80392562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089526 WO2023082556A1 (zh) 2021-11-09 2022-04-27 Hybrid data update method and storage system for in-memory key-value erasure codes

Country Status (2)

Country Link
CN (1) CN114138526A (zh)
WO (1) WO2023082556A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138526A (zh) * 2021-11-09 2022-03-04 华中科技大学 Hybrid data update method and storage system for in-memory key-value erasure codes

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206135A1 (en) * 2015-12-31 2017-07-20 Huawei Technologies Co., Ltd. Data Reconstruction Method in Distributed Storage System, Apparatus, and System
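CN107766170A (zh) * 2016-08-23 2018-03-06 北京计算机技术及应用研究所 Delta-log erasure code update method for a single storage pool
CN110442535A (zh) * 2019-07-19 2019-11-12 华中科技大学 Method and system for improving the reliability of a distributed solid-state-disk key-value cache system
CN113157715A (zh) * 2021-05-12 2021-07-23 厦门大学 Rack-coordinated update method for erasure-coded data centers
CN114138526A (zh) * 2021-11-09 2022-03-04 华中科技大学 Hybrid data update method and storage system for in-memory key-value erasure codes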

Also Published As

Publication number Publication date
CN114138526A (zh) 2022-03-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891368

Country of ref document: EP

Kind code of ref document: A1