CN114415976A - Distributed data storage system and method - Google Patents

Distributed data storage system and method

Info

Publication number
CN114415976A
CN114415976A
Authority
CN
China
Prior art keywords
data
write
cache
fragments
storage node
Prior art date
Legal status
Granted
Application number
CN202210313064.1A
Other languages
Chinese (zh)
Other versions
CN114415976B (en)
Inventor
徐言林
文刘飞
陈坚
Current Assignee
Shenzhen Sandstone Data Technology Co ltd
Original Assignee
Shenzhen Sandstone Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sandstone Data Technology Co ltd filed Critical Shenzhen Sandstone Data Technology Co ltd
Priority to CN202210313064.1A
Publication of CN114415976A
Application granted
Publication of CN114415976B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket

Abstract

A distributed data storage system comprises a network and at least 3 storage nodes. A service layer read-write unit receives a write data request; its service layer read-write module reads local metadata information according to the file name information associated with the request, assigns a new version number to each write request, and judges whether the current write is an overwrite. The service layer read-write unit sends the data fragments and metadata information, comprising file name information, data fragment numbers, and version numbers, to K+M storage nodes. Each storage node receives the metadata information and its data fragment and writes the fragment to an SSD cache disk as temporary cache data. After the service layer read-write unit has received write-cache-success messages for the K+M data fragments, it returns a data-write-success response. On receiving the write-success message, each storage node converts its temporary cache data into normal cache data.

Description

Distributed data storage system and method
Technical Field
The present application relates to the field of software technologies, and in particular, to a distributed data storage system and method.
Background
The noun explains: ceph: is a free software distributed file system designed specifically for the doctor paper by Sage Weil at Santa Cruz division, California university.
EC: erasure codes, a forward error correction (FEC) technology. Erasure codes are used mainly in network transmission to avoid packet loss; storage systems use them to improve storage reliability. Compared with multi-copy replication, erasure codes achieve higher data reliability with less data redundancy, but the encoding is complex and requires a large amount of calculation. Erasure codes can tolerate data loss but not data tampering. Erasure code technology is mainly applied in three forms in distributed storage systems: array codes (RAID 5, RAID 6, etc.), RS (Reed-Solomon) erasure codes, and LDPC (Low-Density Parity-Check) erasure codes.
RAID: Redundant Array of Independent Disks, a disk array with redundancy capability formed from independent disks; it is a special case of EC. Conventional RAID tolerates only a limited number of disk failures: RAID 5 supports one failed disk and RAID 6 two, whereas EC supports multiple disk failures. EC is used mainly in the fields of storage and digital encoding, such as disk array storage (RAID 5, RAID 6) and cloud storage (RS).
SSD: solid State Disk or Solid State Drive, also called Solid State Drive, is a hard Disk made of an array of Solid State electronic memory chips.
HDD: hard Disk Drive acronym, most basic computer memory.
OSD: the Object Storage Device is called fully, namely the process responsible for returning specific data in response to a client request. A Ceph cluster typically has many OSDs.
EC has the structure of k data blocks + m check blocks, where k and m may be set according to a chosen rule and satisfy the formula n = k + m. The variable k is the number of original data symbols; m is the number of extra, redundant symbols added to provide protection against failures; n is the total number of symbols produced by the erasure coding process. When at most m storage blocks (data blocks or check blocks) are damaged, the complete data can be recomputed from the data on the remaining storage blocks, so the data as a whole is not lost.
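As a concrete illustration of the formula n = k + m, the Python sketch below uses single-parity XOR, the RAID 5 special case with m = 1; the block contents are invented for the example, and this is a sketch of the principle, not an implementation taken from the patent.

    from functools import reduce

    def encode(data_blocks):
        """Compute one check block as the XOR of k equal-length data blocks (m = 1)."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_blocks)

    def recover(surviving, check):
        """Rebuild the single missing data block from the survivors plus the check block."""
        return encode(surviving + [check])

    blocks = [b"ABCD", b"EFGH"]        # k = 2 data blocks
    check = encode(blocks)             # m = 1 check block, so n = k + m = 3
    assert recover([blocks[1]], check) == blocks[0]   # one damaged block is tolerated

Reed-Solomon codes generalize the same relation to arbitrary m, at the cost of the heavier computation noted above.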
LBA: Logical Block Addressing, a common mechanism for indicating the location of data on PC storage devices; the device that most commonly uses LBA is the hard disk.
As shown in fig. 1, Ceph applies the principle of erasure coding as follows. Assume k = 2 and m = 1, and an object named cat.jpg is to be stored. After the client uploads cat.jpg to Ceph, the primary OSD invokes the corresponding erasure code algorithm to encode the data: the original content ABCDEFGH is split into two fragments, stripe fragment 1 with content ABCD and stripe fragment 2 with content EFGH, and a check stripe fragment 3 with content WXYZ is computed from them. The 3 fragments are then distributed to 3 different OSD nodes according to the addressing rules of the distributed file system, completing the storage of the object. In Ceph, if any one node is damaged, its data can be recovered from the other two.
In a large-scale distributed file system, a large file, such as a high-definition movie, is decomposed into many data fragments; after erasure code calculation the data grows larger and is distributed to many storage nodes. If one or more nodes fail during storage, the write fails, with some nodes possibly having written their data and others not.
Overwriting, particularly overwriting that spans multiple nodes, requires that under erasure coding all nodes commit the write successfully at the same time. If any node fails to write, every node must return to the previous version and restore the state before the overwrite, which raises the problems of coordinating the nodes and efficiently recovering the previous version.
Disclosure of Invention
The purpose of this application is to ensure that when a write fails in a distributed storage system during an overwrite, particularly when a node fails mid-overwrite, the data can be returned to the version before the failure, improving data consistency.
A distributed data storage system comprises a network and at least 3 storage nodes. Each storage node comprises a service layer read-write unit, a redirection cache unit, a storage unit, and a node read-write control unit; the storage nodes are connected through network signals. The redirection cache unit comprises an SSD cache disk and a cache management module; the storage unit comprises an HDD storage disk and an HDD storage management module. The service layer read-write unit comprises a service layer read-write module, a metadata database, a data fragmentation module, and an erasure code calculation module. The service layer read-write unit receives a write data request; the service layer read-write module reads local metadata information according to the file name information associated with the request, assigns a new version number to each write request, and judges whether the current write is an overwrite. The data fragmentation module splits the written data into K original data fragments, and the erasure code calculation module performs erasure code calculation on the data fragments to obtain M erasure code fragments. The service layer read-write unit sends the data fragments and metadata information to the K+M storage nodes respectively, the metadata information comprising file name information, data fragment numbers, and version numbers. Each storage node receives the metadata information and its data fragment; if the write is an overwrite, the storage node writes the fragment to the SSD cache disk as temporary cache data, generates a temporary control file, and returns a write-cache-success message. After the service layer read-write unit has received write-cache-success messages for the K+M data fragments, it returns a data-write-success response and sends a write-success message to the K+M storage nodes. On receiving the write-success message, each storage node converts its temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
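A minimal single-process Python sketch of this overwrite flow follows: assign a version, write the K+M fragments as temporary cache data, and convert to normal cache data only after every node acknowledges. All class and method names are invented for illustration; they are not the patent's actual interfaces.

    class StorageNode:
        def __init__(self):
            self.temp_cache = {}   # (filename, version) -> fragment, i.e. temporary cache data
            self.cache = {}        # filename -> committed fragment, i.e. normal cache data

        def write_temp(self, filename, version, fragment):
            self.temp_cache[(filename, version)] = fragment   # plus temporary control file
            return True                                       # "write cache success"

        def commit(self, filename, version):
            # convert temporary cache data to normal cache data, drop the control file
            self.cache[filename] = self.temp_cache.pop((filename, version))

    def overwrite(nodes, filename, version, fragments):
        acked = [n.write_temp(filename, version, f) for n, f in zip(nodes, fragments)]
        if all(acked):                        # all K+M temporary writes succeeded
            for node in nodes:                # send "write success" to every node
                node.commit(filename, version)
            return "write success"
        return "write failed"                 # rollback path, sketched further below

    nodes = [StorageNode() for _ in range(3)]             # K + M = 3
    print(overwrite(nodes, "cat.jpg", 1, [b"ABCD", b"EFGH", b"WXYZ"]))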
In the data storage system, the local metadata information is stored in the metadata database. The service layer read-write unit is any one of the K+M storage nodes, or the first node among the K+M storage nodes. The metadata information comprises the data's name information, data fragment numbers, and version numbers; the version number is a unique, sequentially increasing value. The name of the temporary control file combines file name information and version number information, and the file records the LBA disk offset, data length, and version number of the write request on that storage node. The temporary control file operation and the metadata database operation on a storage node are associated operations that succeed or fail together. Whether the current write is an overwrite is judged from the result of reading the local metadata information: if local metadata information is found, data already exists in the region targeted by the current write request, so the write is an overwrite. For an overwrite whose data length is not an integer multiple of K fragment lengths, part of the existing data fragments must be read back so that the assembled data length is an integer multiple of K fragment lengths, and the erasure code fragments are recalculated. The data fragmentation module splits the written data into K original data fragments, the erasure code calculation module computes M erasure code fragments, and the storage node writes normal cache data to the HDD disk according to the cache eviction policy.
The storage node receives the metadata information and the data fragment and, based on the file name and write offset of the write request, queries the metadata to obtain the corresponding disk LBA address and the offset of this write. If the write is an overwrite, the data, carrying its version number information, is written through the redirection cache to the SSD cache disk at that LBA address, and the LBA address, data length, and version number information are recorded in the corresponding temporary control file.
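A sketch of this per-node bookkeeping follows; the metadata layout (the base_lba field) and the dictionary shapes are assumptions invented for the example.

    metadata = {"cat.jpg": {"base_lba": 0x4000}}   # hypothetical per-node metadata

    def resolve_lba(filename, write_offset):
        """File name + write offset -> disk LBA address for this write."""
        return metadata[filename]["base_lba"] + write_offset

    temp_control = {}   # (filename, version) -> record, standing in for the control file

    def record_overwrite(filename, version, write_offset, length):
        lba = resolve_lba(filename, write_offset)
        # record the LBA address, data length, and version number in the control file
        temp_control[(filename, version)] = {"lba": lba, "length": length, "version": version}
        return lba

    print(hex(record_overwrite("cat.jpg", 7, 512, 4096)))   # -> 0x4200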
The service layer read-write unit receives a write data request; if no metadata information is found, the write is an initial write. The storage node receives the metadata information and the data fragment; for an initial write, it obtains the disk offset of the file from the metadata, combines it with the offset and length of the write request to calculate the LBA disk address for this write, and the redirection cache unit writes the data fragment to the SSD cache disk as normal cache data, while the storage node records the metadata information in the metadata database. The storage node returns a write-cache-success message, and after the service layer read-write unit has received write-cache-success messages for the K+M data fragments, it returns a data-write-success response.
If the service layer read-write unit receives a message that writing a data fragment to the SSD cache disk failed, or receives fewer than K write-cache-success messages, the service layer read-write unit returns a data-write-failure response;
the service layer read-write unit then sends a data rollback message, comprising file name information, fragment numbers, and version numbers, to the K+M storage nodes. On receiving the data rollback message, each storage node locates the corresponding temporary file and the temporary cache data information it stores via the file name and version number information, deletes the temporary data fragment on the SSD cache disk, and deletes the temporary control file.
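A sketch of the rollback handler on one storage node; the NodeState dictionaries stand in for the SSD cache disk contents and the temporary control files.

    class NodeState:
        def __init__(self):
            self.temp_cache = {}     # (filename, version) -> temporary data fragment
            self.temp_control = {}   # (filename, version) -> control-file record

    def rollback(node, filename, version):
        key = (filename, version)          # identified by the data rollback message
        node.temp_cache.pop(key, None)     # delete the temporary fragment on the SSD cache disk
        node.temp_control.pop(key, None)   # delete the temporary control file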
After the system restarts, the storage nodes read the metadata information and, from the write operation log recorded with the metadata on the K+M storage nodes, determine whether the write request in flight before the restart had succeeded. If it had, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is modified, and the temporary control file is deleted. If it had not, the system rolls the data back: each storage node deletes the temporary data fragment on the SSD cache disk and deletes the temporary control file.
A distributed data storage method comprises: step 10: the selected storage node receives a write data request, reads local metadata information according to the file name information associated with the request, assigns a new version number to each write request, and judges whether the current write is an overwrite; step 20: the selected storage node performs data fragmentation on the written data to obtain K original data fragments and performs erasure code calculation on the data fragments to obtain M erasure code fragments; step 30: the selected storage node sends the K+M data fragments and metadata information, comprising file name information, data fragment numbers, and version numbers, to the other storage nodes respectively; step 40: each storage node receives the metadata information and its data fragment and, if the write is an overwrite, writes the fragment to the SSD cache disk as temporary cache data; the storage node generates a temporary control file and returns a write-cache-success message; step 50: after the selected storage node has obtained write-cache-success messages for the K+M data fragments, it returns a data-write-success response; step 60: the selected storage node sends a write-success message to the other storage nodes; step 70: on receiving the write-success message, each storage node converts its temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
In the distributed data storage method, the local metadata information is stored in a metadata database. The selected storage node is any one of the K+M storage nodes, or the first node among the K+M storage nodes. The metadata information comprises the data's name information, data fragment numbers, and version numbers; the version number is a unique, sequentially increasing value. The name of the temporary control file combines file name information and version number information, and the file records the LBA disk offset, data length, and version number on the corresponding storage node. The temporary control file operation and the metadata database operation on a storage node are associated operations that succeed or fail together. Whether the current write is an overwrite is judged from the result of reading the local metadata information: if local metadata information is found, the current write is an overwrite. For an overwrite that is not a full overwrite, the previously written data fragments must be read back, the data fragments overwritten with the overwrite data, and the erasure code fragments recalculated.
The distributed data storage method further includes, after step 10, step 11: if the write is an initial write, the storage node obtains the disk offset of the file from the metadata, calculates the LBA disk address for this write from the offset and length of the write request, writes the data fragment to the SSD cache disk as normal cache data, and records the metadata information in the metadata database; the storage node returns a write-cache-success message; after the service layer read-write unit has received write-cache-success messages for the K data fragments, it returns a data-write-success response. Step 70 is followed by step 80: the storage node writes the normal cache data to the HDD disk according to the cache eviction policy.
If the system is restarted and recovered, the distributed data storage method includes step 90: after the restart, each storage node reads its local temporary control file, and whether the write request succeeded is confirmed by log comparison across the K+M storage nodes. If the write request succeeded, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is modified, and the temporary control file is deleted. If it did not, the system rolls the data back: the storage node deletes the temporary data fragment on the SSD cache disk and deletes the temporary control file.
The technical scheme of this application has at least the following beneficial effects: if a write fails, the temporary data is simply deleted, so the version can be rolled back quickly, greatly reducing the number of operations and improving system efficiency; the temporary control file governs writing on the storage node, and the old version data need not be read out, since on successful completion the temporary cache data is simply converted into normal cache data, greatly reducing the complexity of the write operation; the name of the temporary control file combines file name information and version number information, guaranteeing its uniqueness and allowing different versions to be written by multiple threads and processes; the temporary control file operation and the metadata database operation are associated operations, so data inconsistency cannot arise; and data recovery is driven by the log, avoiding inconsistency after a system restart.
Drawings
FIG. 1 is a schematic diagram of Ceph using erasure codes;
FIG. 2 is a schematic diagram of the data processing principle of a distributed storage system implemented with erasure codes;
FIG. 3 is a schematic diagram of erasure code recalculation in the overwrite mode;
FIG. 4 is a system block diagram of an embodiment of a distributed data storage system;
FIG. 5 is a schematic diagram of the internal functional modules of a service layer read-write unit;
FIG. 6 is a schematic diagram of the internal functional modules of a service layer read-write unit;
FIG. 7 is a schematic diagram of the overwrite process for writing data;
FIG. 8 is a schematic diagram of reading storage node data and recalculating the erasure code in the overwrite case;
FIG. 9 is a schematic diagram of clearing the temporary cache after a write failure;
FIG. 10 is a schematic diagram of reading data from each node.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings. The following description covers preferred embodiments of the invention and should not be construed as limiting it in any way; it is given merely to illustrate the general principles of the invention.
As shown in fig. 2, which illustrates the data processing principle of a distributed storage system implemented with erasure codes, the write data from an access node is first divided by the service layer read-write unit into K data fragments. The fragments may be preset fixed-length blocks of 128K, 256K, 512K, or 1M, generally an integer multiple of the physical track size, or of the data block erased per operation, of the SSD cache disk or HDD storage disk. The service layer read-write unit performs erasure code calculation on the data fragments to obtain K+M erasure code fragments; depending on the erasure code algorithm, the loss of one or more fragments does not prevent recovery and reading of the data. The K+M erasure code fragments are stored on K+M distributed storage nodes, so that if one or more storage nodes crash or suffer track damage, the remaining storage nodes can recover the damaged data. Distribution of the K+M erasure code fragments to the K+M distributed storage nodes is generally addressed by a hash algorithm; after distribution according to the hash algorithm, each distributed storage node stores almost the same number of erasure code fragments.
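The two mechanisms in this paragraph, fixed-length fragmentation and hash-based placement, can be sketched as follows; the 128K fragment size and the MD5 placement rule are assumptions chosen for illustration.

    import hashlib

    FRAGMENT = 128 * 1024   # one of the preset fixed lengths (128K/256K/512K/1M)

    def split(data):
        """Cut the written data into fixed-length data fragments."""
        return [data[i:i + FRAGMENT] for i in range(0, len(data), FRAGMENT)]

    def place(filename, fragment_no, node_count):
        """Hash-address a fragment to a storage node; over many fragments the
        per-node counts come out nearly equal."""
        digest = hashlib.md5(f"{filename}:{fragment_no}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % node_count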
When a distributed storage system implemented with erasure codes performs a partial overwrite, the fragments to be overwritten must be read out and the erasure code recalculated. As shown in fig. 3, when the contents of data fragment 2 and data fragment 3 are overwritten, the K data fragments must be read back, the M erasure code fragments recalculated, and the results rewritten to the distributed storage nodes. Data fragment 1 is not modified; with certain erasure code algorithms, erasure code fragment 1 equals data fragment 1, so data fragment 1 need not be rewritten, and only the modified data fragments and the M erasure code fragments are rewritten.
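A sketch of this partial-overwrite read-modify-write, again with single-parity XOR standing in for the erasure code; the fragment contents are invented.

    from functools import reduce

    def xor_parity(fragments):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)

    def partial_overwrite(stripe, patches):
        """Patch the overwritten data fragments, recompute the parity, and return
        only what must be rewritten: the changed fragments plus the parity."""
        new_stripe = list(stripe)
        for idx, data in patches.items():
            new_stripe[idx] = data
        changed = {idx: new_stripe[idx] for idx in patches}
        return changed, xor_parity(new_stripe)

    stripe = [b"ABCD", b"EFGH", b"IJKL"]             # K = 3 data fragments
    changed, parity = partial_overwrite(stripe, {1: b"QRST", 2: b"UVWX"})
    # fragment at index 0 is untouched and is not rewritten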
As shown in fig. 4, a distributed data storage system comprises a network and at least 3 storage nodes. Each storage node comprises a service layer read-write unit, a redirection cache unit, a storage unit, and a node read-write control unit; the storage nodes are connected through network signals. The redirection cache unit comprises an SSD cache disk and a cache management module; the storage unit comprises an HDD storage disk and an HDD storage management module. The service layer read-write unit comprises a service layer read-write module, a metadata database, a data fragmentation module, and an erasure code calculation module. The service layer read-write unit receives a write data request; the service layer read-write module reads local metadata information according to the file name information associated with the request, assigns a new version number to each write request, and judges whether the current write is an overwrite. The data fragmentation module splits the written data into K original data fragments, and the erasure code calculation module performs erasure code calculation on the data fragments to obtain M erasure code fragments. The service layer read-write unit sends the data fragments and metadata information to the K+M storage nodes respectively, the metadata information comprising file name information, data fragment numbers, and version numbers. Each storage node receives the metadata information and its data fragment; if the write is an overwrite, the storage node writes the fragment to the SSD cache disk as temporary cache data, generates a temporary control file, and returns a write-cache-success message. After the service layer read-write unit has received write-cache-success messages for the K+M data fragments, it returns a data-write-success response and sends a write-success message to the K+M storage nodes. On receiving the write-success message, each storage node converts its temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
As shown in fig. 4, the node read-write control unit may likewise be divided internally into a service layer read-write module and a metadata database. The metadata database may take various forms, including a distributed database; if each node retains only the metadata related to itself, the metadata of a given piece of data can be found through the distributed algorithm without using a distributed database, and as long as the owning node can be found, concurrent data processing is available.
As shown in fig. 4, the service layer read-write unit comprises a service layer read-write module, a metadata database, a data fragmentation module, and an erasure code calculation module; each may be separate software or a software module.
Because each write is assigned a version number, every piece of written data carries a version; in the overwrite state the data is written to the SSD cache disk as a temporary version, so temporary data is never written to the HDD hard disk. If the write fails, the temporary data is simply deleted and the version regresses quickly, greatly reducing the operations required and improving system efficiency. The temporary control file governs the write on the storage node, and the old version data need not be read out to it; once the write completes successfully, the temporary cache data is converted into normal cache data, greatly reducing the complexity of the write operation.
Referring to fig. 4, in the data storage system the local metadata information is stored in the metadata database; the service layer read-write unit is any one of the K+M storage nodes, or the first node among them; the metadata information comprises the data's name information, fragment numbers, and version numbers. Local metadata can be stored locally in various ways and managed by a database, which improves management efficiency. In a write operation, any node may serve as the access node, but the metadata of each file is stored on the node determined by the distributed algorithm, and that node's service layer read-write unit acts as the operating node. The K+M storage nodes are likewise selected by the algorithm and may be the successors of the first node: a lookup table can be built by hashing the nodes' IP addresses, and hashing the name of the written file then yields, via the table, the IP address of the first node.
The version number is a unique, sequentially increasing value; the name of the temporary control file combines file name information and version number information, and the file records the LBA disk offset, data length, and version number of the write request on that storage node. Sequentially increasing unique values guarantee the uniqueness of each written version; each rollback moves back one version, to the adjacent version number. Because the temporary control file's name carries the file name and version number, its uniqueness is guaranteed. Each temporary control file records the LBA disk offset, data length, and version number, i.e. information about the same corresponding position on the HDD disk, but several versions may be cached on the SSD cache disk, and once temporary cache data is converted into normal cache data, a file with the same file name can be written to the same SSD cache disk position.
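A sketch of the temporary control file's naming and contents; the ".v{version}.tmp" pattern and the JSON layout are assumptions for illustration.

    import json

    def temp_control_file(filename, version, lba, length):
        name = f"{filename}.v{version}.tmp"   # file name + version: unique per version
        body = json.dumps({"lba": lba, "length": length, "version": version})
        return name, body

    print(temp_control_file("cat.jpg", 7, 0x4000, 4096))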
Temporary control file operations and metadata database operations on a storage node are associated operations that succeed or fail together. Whether the current write is an overwrite is judged from the result of reading the local metadata information: if local metadata information is found, data already exists in the region targeted by the current write request, so the write is an overwrite. For an overwrite whose data length is not an integer multiple of K fragment lengths, part of the existing data fragments must be read back so that the assembled data length is an integer multiple of K fragment lengths, and the erasure code fragments are recalculated. The data fragmentation module splits the written data into K original data fragments, the erasure code calculation module computes M erasure code fragments, and the storage node writes normal cache data to the HDD disk according to the cache eviction policy.
The temporary control file operation and the metadata database operation are associated operations, which guarantees data consistency: both are recorded by writing an atomic-operation log file, the two operations are recorded in one log entry, and the recording succeeds only if both operations succeed, which guarantees that the two remain associated.
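A sketch of that association: both operations are captured in a single journal entry, so a torn or missing entry means neither took effect. The journal format is an assumption.

    import json

    journal = []   # stand-in for the atomic-operation log file

    def record_pair(temp_file_op, metadata_op):
        """One entry records both operations; the pair therefore succeeds or
        fails together, never one without the other."""
        journal.append(json.dumps({"temp": temp_file_op, "meta": metadata_op}))

    record_pair({"create": "cat.jpg.v7.tmp"},
                {"insert": {"file": "cat.jpg", "version": 7}})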
In the overwrite case, if the data is overwritten in its entirety, nothing need be recovered; if it is partially overwritten, the existing data must be read back, the original data overwritten with the new data, and the erasure code fragments recalculated.
The storage node receives the metadata information and the data fragment and, based on the file name and write offset of the write request, queries the metadata to obtain the corresponding disk LBA address and the offset of this write. If the write is an overwrite, the data, carrying version number information, is written through the redirection cache to the SSD cache disk at that LBA address, and the LBA address, data length, and version number information are recorded in the corresponding temporary control file.
A temporary control file is created for each overwrite request, so overwrites of several versions can be written to the SSD cache disk concurrently by multiple threads, a situation that arises frequently in a distributed storage system. Since a temporary control file is created per version, multiple versions of one file can be cached on the SSD cache disk, and after the writes succeed the temporary caches are promoted to formal caches in order.
The service layer read-write unit receives a write data request; if no metadata information is found, the write is an initial write. The storage node receives the metadata information and the data fragment; for an initial write, it obtains the disk offset of the file from the metadata, combines it with the offset and length of the write request to calculate the LBA disk address for this write, and the redirection cache unit writes the data fragment to the SSD cache disk as normal cache data, while the storage node records the metadata information in the metadata database. The storage node returns a write-cache-success message, and after the service layer read-write unit has received write-cache-success messages for the K+M data fragments, it returns a data-write-success response.
On a first write the HDD disk holds no data, so there is no question of version rollback or historical data recovery; the data can be written directly to the SSD cache disk as normal cache data. The same processing steps as for an overwrite may also be used, in which case a failed write deletes the temporary cache data on the SSD cache disk.
If the service layer read-write unit receives a message that writing a data fragment to the SSD cache disk failed, or receives fewer than K write-cache-success messages, the service layer read-write unit returns a data-write-failure response;
the service layer read-write unit then sends a data rollback message, comprising file name information, fragment numbers, and version numbers, to the K+M storage nodes. On receiving the data rollback message, each storage node locates the corresponding temporary file and the temporary cache data information it stores via the file name and version number information, deletes the temporary data fragment on the SSD cache disk, and deletes the temporary control file.
When K fragments have been written successfully, the data is in theory already recoverable, since the writing node also holds the M erasure code fragments and the system can recover the data even if a node fails; success can therefore be returned at once, giving a timely response. After the response, the writes to the remaining M nodes must still be completed.
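The early-response rule reduces to a simple predicate; the message strings below are illustrative.

    def respond(ack_count, k, m):
        """Ack the client once K of the K+M fragment writes have succeeded;
        the remaining writes are finished in the background afterwards."""
        if ack_count >= k:
            return f"write success ({k + m - ack_count} fragment writes still pending)"
        return "pending"

    print(respond(4, 4, 2))   # K=4, M=2: respond after 4 acks, 2 still in flight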
After the system restarts, the storage nodes read the metadata information and, from the write operation log recorded with the metadata on the K+M storage nodes, determine whether the write request in flight before the restart had succeeded. If it had, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is modified, and the temporary control file is deleted. If it had not, the system rolls the data back: the storage node deletes the temporary data fragment on the SSD cache disk and deletes the temporary control file.
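A sketch of the restart decision, with plain dictionaries standing in for the SSD cache, the temporary control files, and the write operation log recorded with the metadata.

    def recover_after_restart(temp_cache, temp_control, cache, write_log):
        """write_log holds the (filename, version) pairs whose write succeeded."""
        for key in list(temp_cache):
            filename, _version = key
            if key in write_log:                        # write had succeeded
                cache[filename] = temp_cache.pop(key)   # promote temp -> normal cache
            else:                                       # write had not succeeded
                temp_cache.pop(key)                     # roll back: drop the temp fragment
            temp_control.pop(key, None)                 # delete the temporary control file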
In a distributed file system, a crash may occur at any step and nodes may fail. After power is restored, the system must recover the data and guarantee the state that existed before the crash or power failure. If fewer than K nodes succeeded, the temporary control files on all nodes must be deleted, along with the temporary cache data.
As shown in fig. 7, a distributed data storage method comprises: step 10: the selected storage node, namely storage node 1 in the figure, receives a write data request, reads local metadata information according to the file name information associated with the request, assigns a new version number to each write request, and judges whether the current write is an overwrite; step 20: the selected storage node performs data fragmentation on the written data to obtain K original data fragments and performs erasure code calculation on them to obtain M erasure code fragments; step 30: the selected storage node sends the K+M data fragments and metadata information, comprising file name information, data fragment numbers, and version numbers, to the other storage nodes; step 40: each storage node receives the metadata information and its data fragment and writes the fragment to the SSD cache disk as temporary cache data; the storage node generates a temporary control file and returns a write-cache-success message; step 50: after the selected storage node has obtained write-cache-success messages for the K+M data fragments, it returns a data-write-success response; step 60: the selected storage node sends a write-success message to the other storage nodes; step 70: on receiving the write-success message, each storage node converts its temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
The service layer read-write unit may reside on any node, or on any of the K+M nodes. In fig. 7 it is located on storage node 1, i.e. storage node 1 is the selected node. After the read-write access node connects to the storage system, the selected node is first found from the file name of the written data or another encoding; if storage node 1 is found directly, some network transmission can be saved.
In the distributed data storage method, the local metadata information is stored in a metadata database. The selected storage node is any one of the K+M storage nodes, or the first node among the K+M storage nodes. The metadata information comprises the data's name information, data fragment numbers, and version numbers; the version number is a unique, sequentially increasing value. The name of the temporary control file combines file name information and version number information, and the file records the LBA disk offset, data length, and version number on the corresponding storage node. The temporary control file operation and the metadata database operation on a storage node are associated operations that succeed or fail together. Whether the current write is an overwrite is judged from the result of reading the local metadata information: if local metadata information is found, the current write is an overwrite. For an overwrite that is not a full overwrite, the previously written data fragments must be read back, the data fragments overwritten with the overwrite data, and the erasure code fragments recalculated.
As shown in fig. 8, metadata reading determines that the write is an overwrite; for a partial overwrite, the positions of the fragments must be calculated, the data of the K nodes re-read, the affected data fragments overwritten with the new data, and the erasure code recalculated over the data fragments.
The distributed data storage method further includes, after step 10, step 11: if the write is an initial write, the storage node obtains the disk offset of the file from the metadata, calculates the LBA disk address for this write from the offset and length of the write request, writes the data fragment to the SSD cache disk as normal cache data, and records the metadata information in the metadata database; the storage node returns a write-cache-success message; after the service layer read-write unit has received write-cache-success messages for the K data fragments, it returns a data-write-success response. Step 70 is followed by step 80: the storage node writes the normal cache data to the HDD disk according to the cache eviction policy.
As shown in fig. 9, if a failure occurs while distributing the data after version reading, data fragmentation, and erasure code calculation, then after the write failure is returned, a clear-temporary-cache message is sent to every node; each storage node clears its temporary version data and deletes the temporary control file. The data is restored without touching the HDD disk, no data inconsistency can arise, and the operation is simple and fast.
As shown in fig. 10, a normal read first locates the selected node, reads the metadata, and calculates the positions of the fragments, then reads the data fragments from the located storage nodes. If a fragment is found to be lost during the read, the erasure code fragments must be read and erasure calculation used to recover the data, after which the data is reassembled and returned.
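A sketch of this degraded-read path with single-parity XOR standing in for the erasure code; it tolerates the loss of one data fragment.

    from functools import reduce

    def xor(fragments):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)

    def read_stripe(data_fragments, parity):
        """Read the data fragments; if one is lost, rebuild it from the parity."""
        missing = [i for i, f in enumerate(data_fragments) if f is None]
        if missing:                                    # at most one loss with M = 1
            survivors = [f for f in data_fragments if f is not None]
            data_fragments[missing[0]] = xor(survivors + [parity])
        return b"".join(data_fragments)

    print(read_stripe([b"ABCD", None], xor([b"ABCD", b"EFGH"])))   # -> b'ABCDEFGH'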
During routine system maintenance, when some nodes are found to have failed, a repair program reads the data fragments, recovers the failed nodes' data, and redistributes the data.
If the system is restarted and recovered, the distributed data storage method includes step 90: after the restart, each storage node reads its local temporary control file, and whether the write request succeeded is confirmed by log comparison across the K+M storage nodes. If the write request succeeded, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is modified, and the temporary control file is deleted. If it did not, the system rolls the data back: the storage node deletes the temporary data fragment on the SSD cache disk and deletes the temporary control file.
While the invention has been illustrated and described in terms of a preferred embodiment and several alternatives, it is not limited by the specific description in this specification; additional, alternative, or equivalent components may also be used in practicing the present invention.

Claims (10)

1. A distributed data storage system, comprising: a network, at least 3 storage nodes;
the storage node comprises a service layer read-write unit, a redirection cache unit, a storage unit, and a node read-write control unit; the storage nodes are connected through network signals; the redirection cache unit comprises an SSD cache disk and a cache management module; the storage unit comprises an HDD storage disk and an HDD storage management module;
the service layer read-write unit comprises a service layer read-write module, a metadata database, a data fragmentation module, and an erasure code calculation module;
the service layer read-write unit receives a write data request; the service layer read-write module reads local metadata information according to the file name information associated with the write data request, assigns a new version number to each write request, and judges whether the current write is an overwrite;
the data fragmentation module performs data fragmentation on the written data to obtain K original data fragments, and the erasure code calculation module performs erasure code calculation on the data fragments to obtain M erasure code fragments;
the service layer read-write unit sends the data fragments and metadata information to the K+M storage nodes respectively, the metadata information comprising file name information, data fragment numbers, and version numbers;
the storage node receives the metadata information and the data fragment; if the write is an overwrite, the storage node writes the data fragment to the SSD cache disk as temporary cache data; the storage node generates a temporary control file and returns a write-cache-success message;
after the service layer read-write unit has received write-cache-success messages for the K+M data fragments, the service layer read-write unit returns a data-write-success response;
the service layer read-write unit sends a write-success message to the K+M storage nodes;
and on receiving the write-success message, the storage node converts the temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
2. The distributed data storage system of claim 1, wherein:
the local metadata information is stored in a metadata database;
the service layer read-write unit is any one of the K+M storage nodes; or the service layer read-write unit is the first node among the K+M storage nodes;
the metadata information comprises the name information, data fragment numbers, and version numbers of the data;
the version number is a unique, sequentially increasing value;
the name of the temporary control file comprises file name information and version number information, and the temporary control file records the LBA disk offset, data length, and version number information of the write request on the storage node;
the temporary control file operation and the metadata database operation on the storage node are associated operations and succeed or fail at the same time;
whether the current write is an overwrite is judged from the result of reading the local metadata information: if local metadata information is found, data already exists in the region corresponding to the current write request, and the current write is an overwrite; if the write is an overwrite and the length of the overwrite data is not an integer multiple of K fragment lengths, part of the existing data fragments must be read back so that the assembled data length is an integer multiple of K fragment lengths, and the erasure code fragments are recalculated;
the data fragmentation module performs data fragmentation on the written data to obtain K original data fragments, and the erasure code calculation module performs erasure code calculation on the data fragments to obtain M erasure code fragments;
and the storage node writes the normal cache data to the HDD disk according to the cache eviction policy.
3. The distributed data storage system of claim 2, wherein: the storage node receives the metadata information and the data fragment and, based on the file name and write offset of the write request, queries the metadata to obtain the corresponding disk LBA address and the offset of this write; if the write is an overwrite, the data, carrying version number information, is written through the redirection cache to the SSD cache disk at that LBA address; and the LBA address, data length, and version number information are recorded in the corresponding temporary control file.
4. The distributed data storage system according to claim 2, wherein the service layer read-write unit receives a write data request, and if no metadata information is found, the write is an initial write;
the storage node receives the metadata information and the data fragment; if the write data request is an initial write, the disk offset corresponding to the file is obtained through the metadata, the LBA disk address for this write is calculated from the offset and length of the write request, the redirection cache unit writes the data fragment to the SSD cache disk as normal cache data, and the storage node records the metadata information in the metadata database; the storage node returns a write-cache-success message;
and after the service layer read-write unit has received write-cache-success messages for the K+M data fragments, the service layer read-write unit returns a data-write-success response.
5. The distributed data storage system according to claim 1 or claim 2, wherein if the service layer read-write unit receives a message that writing a data fragment to the SSD cache disk failed, or receives fewer than K write-cache-success messages, the service layer read-write unit returns a data-write-failure response;
the service layer read-write unit sends a data rollback message, comprising file name information, fragment numbers, and version numbers, to the K+M storage nodes;
and on receiving the data rollback message, the storage node locates the corresponding temporary file and the temporary cache data information it stores via the file name and version number information, deletes the temporary data fragment on the SSD cache disk, and deletes the temporary control file.
6. The distributed data storage system according to claim 5, wherein after the system restarts, the storage nodes read the metadata information and confirm, from the write operation log recorded in the metadata information on the K+M storage nodes, whether the write request before the restart succeeded; if it succeeded, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is modified, and the temporary control file is deleted;
and if it did not succeed, the system rolls the data back: the storage node deletes the temporary data fragment on the SSD cache disk, and the storage node deletes the temporary control file.
7. A distributed data storage method, comprising:
step 10: the selected storage node receives a write data request, reads local metadata information according to the file name information associated with the write data request, assigns a new version number to each write request, and judges whether the current write is an overwrite;
step 20: the selected storage node performs data fragmentation on the written data to obtain K original data fragments, and the selected storage node performs erasure code calculation on the data fragments to obtain M erasure code fragments;
step 30: the selected storage node sends the K+M data fragments and metadata information to the other storage nodes respectively, the metadata information comprising file name information, data fragment numbers, and version numbers;
step 40: the storage node receives the metadata information and the data fragment; if the write is an overwrite, the storage node writes the data fragment to the SSD cache disk as temporary cache data; the storage node generates a temporary control file and returns a write-cache-success message;
step 50: after the selected storage node has obtained write-cache-success messages for the K+M data fragments, the selected storage node returns a data-write-success response;
step 60: the selected storage node sends a write-success message to the other storage nodes;
step 70: on receiving the write-success message, the storage node converts the temporary cache data into normal cache data, deletes the temporary control file, and returns a success response.
8. The distributed data storage method of claim 7, wherein:
the local metadata information is stored in a metadata database;
the selected storage node is any one of the K + M storage nodes, or the first node among the K + M storage nodes;
the metadata information comprises the name information, data fragment numbers and version numbers of the data;
the version number is a unique, monotonically increasing numerical value;
the name of the temporary control file comprises the file name information and version number information, and the temporary control file records the LBA disk offset, data length and version number corresponding to the storage node;
the temporary control file operation and the metadata database operation on a storage node are associated operations that succeed or fail together;
and whether the current write is an overwrite is judged from the result of reading the local metadata information: if querying the local metadata information shows that data already exists at the corresponding write position, the current write is an overwrite; if the current write is an overwrite but not a full overwrite, the already-written data fragments are read back, the overwrite data is written over them, and the erasure code fragments are recalculated.
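To make the partial-overwrite path concrete, here is a hedged Python sketch of the read-modify-write it describes, reusing the hypothetical shard() helper from the earlier sketch; the names and the flat-stripe layout are illustrative assumptions, not the claimed implementation.

    def merge_overwrite(old_fragments: list[bytes], new_data: bytes, offset: int) -> list[bytes]:
        # Partial overwrite: read back the already-written fragments,
        # reassemble the original stripe, splice the new bytes in at the
        # request offset, then re-shard so the caller can recompute the
        # erasure code fragments over the merged data.
        stripe = bytearray(b"".join(old_fragments))
        stripe[offset:offset + len(new_data)] = new_data
        return shard(bytes(stripe), len(old_fragments))

The caller would then rerun erasure_encode() over the merged fragments before fanning out the K + M writes, just as in the full-overwrite path.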
9. The distributed data storage method of claim 8, further comprising, after step 10,
step 11: if the write is an initial write, the storage node obtains the disk offset corresponding to the file from the metadata, calculates the target LBA disk address from the offset and the length of the write request, writes the data fragments to the SSD cache disk as normal cache data, and records the metadata information in the metadata database; the storage node returns a write-cache success message; after the service layer read-write unit has received write-cache success messages for the K data fragments, it returns a write-success response;
and, after step 70, step 80: the storage node writes the normal cache data to the HDD disk according to the cache eviction policy.
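Step 80 leaves the eviction policy open; as one hedged possibility (the cache and hdd objects and their methods are hypothetical), an LRU watermark policy for demoting normal cache data from the SSD cache disk to the HDD disk might look like this:

    def demote_cold_entries(cache, hdd, watermark: float = 0.8) -> None:
        # Flush least-recently-used *normal* cache entries to HDD once SSD
        # usage crosses the watermark. Temporary cache data is never demoted:
        # it is either committed (becoming normal cache data) or rolled back.
        while cache.used_ratio() > watermark:
            entry = cache.pop_least_recently_used()
            hdd.write(entry.lba_offset, entry.data)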
10. The distributed data storage method of claim 9, wherein, if the system is restarted and recovers, the method comprises
step 90: after the system restarts, each storage node reads its local temporary control file and confirms, by comparing the logs across the K + M storage nodes, whether the write request succeeded; if the write request succeeded, the data is not rolled back: the temporary cache data is converted into normal cache data, the metadata is updated, and the temporary control file is deleted; if the write request did not succeed, the system rolls the data back: the storage node deletes the temporary data fragments on the SSD cache disk and deletes the temporary control file.
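Finally, a hedged sketch of the step-90 recovery decision (the node and peer objects and their methods are assumptions made for illustration): on restart, a node classifies each leftover temporary control file as committed or not by comparing write logs across the K + M nodes, then either promotes the temporary cache data or rolls it back.

    def recover_after_restart(node, peers, k: int, m: int) -> None:
        for ctrl in node.list_temp_control_files():
            # Treat the write as successful only if all K + M nodes logged it.
            acked = sum(1 for p in peers if p.log_has(ctrl.filename, ctrl.version))
            if acked == k + m:
                node.promote_temp_cache(ctrl)    # temporary -> normal cache data
                node.update_metadata(ctrl)
            else:
                node.delete_temp_fragment(ctrl)  # roll back the uncommitted write
            node.delete_control_file(ctrl)       # removed in both cases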
CN202210313064.1A 2022-03-28 2022-03-28 Distributed data storage system and method Active CN114415976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210313064.1A CN114415976B (en) 2022-03-28 2022-03-28 Distributed data storage system and method


Publications (2)

Publication Number Publication Date
CN114415976A (en) 2022-04-29
CN114415976B (en) 2022-07-01

Family

ID=81263936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210313064.1A Active CN114415976B (en) 2022-03-28 2022-03-28 Distributed data storage system and method

Country Status (1)

Country Link
CN (1) CN114415976B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000040026A (en) * 1998-07-24 2000-02-08 Nec Corp Logical file management system in disk shared cluster system
US20130019062A1 (en) * 2011-07-12 2013-01-17 Violin Memory Inc. RAIDed MEMORY SYSTEM
CN105930103A (en) * 2016-05-10 2016-09-07 南京大学 Distributed storage CEPH based erasure correction code overwriting method
US20160357634A1 (en) * 2015-06-04 2016-12-08 Huawei Technologies Co.,Ltd. Data storage method, data recovery method, related apparatus, and system
CN106406760A (en) * 2016-09-14 2017-02-15 郑州云海信息技术有限公司 Direct erasure code optimization method and system based on cloud storage
CN110865903A (en) * 2019-11-06 2020-03-06 重庆紫光华山智安科技有限公司 Node abnormal reconnection multiplexing method and system based on erasure code distributed storage
CN112000627A (en) * 2020-08-14 2020-11-27 苏州浪潮智能科技有限公司 Data storage method, system, electronic equipment and storage medium
CN113868192A (en) * 2021-12-03 2021-12-31 深圳市杉岩数据技术有限公司 Data storage device and method and distributed data storage system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241350A1 (en) * 2022-06-17 2023-12-21 重庆紫光华山智安科技有限公司 Data processing method and device, data access end, and storage medium
CN115391093A (en) * 2022-08-18 2022-11-25 江苏安超云软件有限公司 Data processing method and system
CN115391093B (en) * 2022-08-18 2024-01-02 江苏安超云软件有限公司 Data processing method and system
CN115357199A (en) * 2022-10-19 2022-11-18 安超云软件有限公司 Data synchronization method, system and storage medium in distributed storage system
CN115357199B (en) * 2022-10-19 2023-02-10 安超云软件有限公司 Data synchronization method, system and storage medium in distributed storage system
CN115981570A (en) * 2023-01-10 2023-04-18 创云融达信息技术(天津)股份有限公司 Distributed object storage method and system based on KV database
CN115981570B (en) * 2023-01-10 2023-12-29 创云融达信息技术(天津)股份有限公司 Distributed object storage method and system based on KV database
CN115827789A (en) * 2023-02-21 2023-03-21 苏州浪潮智能科技有限公司 Method, system, equipment and storage medium for optimizing file type database upgrading
CN116644086A (en) * 2023-05-24 2023-08-25 上海沄熹科技有限公司 SST-based Insert SQL statement implementation method
CN116644086B (en) * 2023-05-24 2024-02-20 上海沄熹科技有限公司 SST-based Insert SQL statement implementation method
CN117240873A (en) * 2023-11-08 2023-12-15 阿里云计算有限公司 Cloud storage system, data reading and writing method, device and storage medium
CN117240873B (en) * 2023-11-08 2024-03-29 阿里云计算有限公司 Cloud storage system, data reading and writing method, device and storage medium

Also Published As

Publication number Publication date
CN114415976B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN114415976B (en) Distributed data storage system and method
US10437672B2 (en) Erasure coding and replication in storage clusters
US8286029B2 (en) Systems and methods for managing unavailable storage devices
US6766491B2 (en) Parity mirroring between controllers in an active-active controller pair
US10019317B2 (en) Parity protection for data chunks in an object storage system
US6513093B1 (en) High reliability, high performance disk array storage system
US7234024B1 (en) Application-assisted recovery from data corruption in parity RAID storage using successive re-reads
CN110442535B (en) Method and system for improving reliability of distributed solid-state disk key value cache system
US20220342541A1 (en) Data updating technology
US7882420B2 (en) Method and system for data replication
CN109445681B (en) Data storage method, device and storage system
US11429498B2 (en) System and methods of efficiently resyncing failed components without bitmap in an erasure-coded distributed object with log-structured disk layout
US7689877B2 (en) Method and system using checksums to repair data
US7716519B2 (en) Method and system for repairing partially damaged blocks
CN113377569A (en) Method, apparatus and computer program product for recovering data
US20040128582A1 (en) Method and apparatus for dynamic bad disk sector recovery
US11385806B1 (en) Methods and systems for efficient erasure-coded storage systems
CN114676000A (en) Data processing method and device, storage medium and computer program product
US10747610B2 (en) Leveraging distributed metadata to achieve file specific data scrubbing
WO2023125507A1 (en) Method and apparatus for generating block group, and device
US11494090B2 (en) Systems and methods of maintaining fault tolerance for new writes in degraded erasure coded distributed storage
CN117370067A (en) Data layout and coding method of large-scale object storage system
CN116414294A (en) Method, device and equipment for generating block group

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant