WO2024037104A1 - Data storage method, subsystem, distributed storage system and storage medium - Google Patents

Data storage method, subsystem, distributed storage system and storage medium

Info

Publication number
WO2024037104A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
node
logical node
logical
log
Prior art date
Application number
PCT/CN2023/097138
Other languages
English (en)
French (fr)
Inventor
袁东平
Original Assignee
重庆紫光华山智安科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 重庆紫光华山智安科技有限公司 filed Critical 重庆紫光华山智安科技有限公司
Publication of WO2024037104A1 publication Critical patent/WO2024037104A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0614 Improving the reliability of storage systems
    • G06F 3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1458 Management of the backup or restore process
    • G06F 11/1469 Backup restoration techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • This application relates to the field of data storage technology, specifically, to a data storage method, subsystem, distributed storage system and storage medium.
  • RS(N,M) erasure coding can be used to generate M pieces of verification data from N pieces of original data. When any M of the N+M pieces of data are lost, the lost pieces can be regenerated from the remaining N pieces, thereby achieving data recovery.
  • One of the purposes of this application is to provide a data storage method, subsystem, distributed storage system and storage medium that reduce the risk of data inconsistency or complete corruption as well as the complexity of data recovery.
  • In a first aspect, this application provides a data storage method applied to a data storage subsystem in a distributed storage system.
  • The data storage subsystem includes multiple logical nodes, and each logical node corresponds to a physical storage node.
  • The method includes: synchronizing the write operation log received by the master logical node to all slave logical nodes, where the master logical node is one of the logical nodes and the slave logical nodes are the logical nodes other than the master logical node; when it is determined that the master logical node has received a log synchronization failure message, determining the abnormal logical node corresponding to the message, where the log synchronization failure indicates that data is missing on the target physical storage node corresponding to the abnormal logical node; generating the missing data based on a preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, where a data log records the data written to the physical storage node corresponding to its logical node; and performing data recovery on the target physical storage node based on the missing data.
  • In a second aspect, this application provides a data storage subsystem.
  • The data storage subsystem includes multiple logical nodes, each corresponding to a physical storage node, and includes: a synchronization module, configured to synchronize the write operation log received by the master logical node to all slave logical nodes, where the master logical node is one of the logical nodes and the slave logical nodes are the logical nodes other than the master logical node; a determination module, configured to determine, when it is determined that the master logical node has received a log synchronization failure message, the abnormal logical node corresponding to the message, where the log synchronization failure indicates that data is missing on the target physical storage node corresponding to the abnormal logical node; a generation module, configured to generate the missing data based on a preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, where a data log records the data written to the physical storage node corresponding to its logical node; and a storage module, configured to perform data recovery on the target physical storage node based on the missing data.
  • In a third aspect, this application provides a distributed storage system.
  • The distributed storage system includes a data storage subsystem.
  • The data storage subsystem is composed of multiple logical nodes, and each logical node corresponds to a physical storage node.
  • The data storage subsystem is configured to perform the data storage method described in the first aspect.
  • In a fourth aspect, the present application provides a storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the data storage method described in the first aspect is implemented.
  • In the data storage method, subsystem, distributed storage system and storage medium provided by this application, after the master logical node receives the write operation log, it synchronizes the write operation log to the slave logical nodes. When a log synchronization failure message is fed back by a slave logical node, it indicates that data is missing on the physical storage node corresponding to that abnormal logical node. The missing data can then be generated based on the preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, and data recovery can be performed on the physical storage node corresponding to the abnormal logical node based on the missing data.
  • In other words, the data storage subsystem provided by this application can monitor whether any physical storage node is missing data; once such a node is found, the missing data is regenerated from the data recorded on the other logical nodes and the affected physical storage node is restored.
  • Throughout this process, anomaly detection is tied to data writing itself, so no service node is needed to schedule the recovery. This avoids losing recovery requests when a server goes down and effectively reduces the complexity of the recovery procedure.
  • Figure 1 is an example diagram of an existing data storage method.
  • Figure 2 is a system structure diagram of a distributed storage system provided by an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of the data storage subsystem provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of the data storage method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a scenario of data storage provided in the embodiment of the present application.
  • Figure 6 is a schematic flow chart of another data storage method provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of another scenario of data storage provided in the embodiment of the present application.
  • Figure 8 is a functional module diagram of the data storage subsystem provided by the embodiment of the present application.
  • RS(N,M) erasure code: generates M pieces of verification data from N pieces of original data; when any M of the N+M pieces of data are lost, they can be regenerated from the remaining N pieces.
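  • To make the recovery idea concrete, the hedged Python sketch below shows the simplest special case, a single XOR parity piece (M=1), in which any one lost piece can be rebuilt from the remaining pieces. The function names and the 4-piece example are illustrative assumptions; a real RS(N,M) code with M>1 uses Reed-Solomon arithmetic over GF(256) rather than plain XOR.

```python
# Minimal illustration of erasure-style recovery with a single XOR parity (M = 1).
# A production RS(N, M) code would use Reed-Solomon arithmetic over GF(256).

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(pieces):
    """Return the original pieces plus one parity piece."""
    return list(pieces) + [xor_blocks(pieces)]

def recover(pieces_with_parity, lost_index):
    """Rebuild the piece at lost_index from the surviving pieces."""
    survivors = [p for i, p in enumerate(pieces_with_parity) if i != lost_index]
    return xor_blocks(survivors)

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # N = 4 original pieces
    stored = encode(data)                          # N + 1 pieces on N + 1 nodes
    assert recover(stored, 2) == b"CCCC"           # any single loss is recoverable
```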
  • RAFT consensus algorithm: a solution to the distributed strong-consistency problem in which the master node replicates its log to the slave nodes. The algorithm treats the system as a state machine and each operation on the state machine as a log entry; because the master and the slaves execute the same operations from the same initial state, they reach the same target state, which guarantees consistency. Internally the algorithm maintains a log sequence and, through leader election and synchronous replication, ensures that the log is copied to every replica correctly and in order.
  • Figure 1 is an example diagram of an existing data storage method.
  • In the existing scheme, a data node must store both the stripe data and the stripe's version number.
  • A single write on the service side therefore produces two inputs and two outputs.
  • In addition, the data written by the client to the service side must be aligned to the stripe size.
  • Data alignment requires two additional read requests, namely reading the version number and reading the data, which significantly increases write latency when the cache is missed.
  • For example, suppose the data is divided into 4KB stripes.
  • Each 4KB stripe corresponds to an 8-byte version number, so every 64MB of data requires an additional 128KB of version numbers.
  • A single write by the service node must update both the version number and the data content, producing two inputs and two outputs that are written to different locations, which adds seek overhead.
  • After striping, the data written to a data node must be aligned to the stripe size, that is, the offset and length written to a data node must be aligned to 4KB, and the data written to the SDK must be aligned to 4KB*N.
  • The SDK-side write latency is MAX(N+M)(read version number) + MAX(N+M)(read data) + MAX(N+M)(write data) + MAX(N+M)(commit data).
  • The existing system uses NVM for write acceleration, so the MAX(N+M)(write data) + MAX(N+M)(commit data) portion of the latency is dominated by the network and is normally much smaller than MAX(N+M)(read version number) + MAX(N+M)(read data).
  • When the written data spans two stripe groups, both the first and the last stripes must be read into the SDK before the erasure data can be calculated.
  • In addition, after detecting a storage failure, the service node must report a data recovery request to the metadata server, which schedules the generation of the missing data.
  • The data recovery operation is actually executed on the data node side.
  • The data node side therefore receives the data to be resent for recovery and the data to be written from the service node at the same time, which increases the complexity of the system.
  • Moreover, because the data recovery request is reported by the service node, the request may be lost if the service node goes down; inconsistent data then has to be found by background scanning, whose cycle is long, and another failure during that window significantly increases the risk of system data becoming inconsistent or completely corrupted.
  • Figure 2 is a system structure diagram of the distributed storage system provided by an embodiment of the present application. It includes a client 201, a service node 202, data nodes 203 (DN), and a data storage subsystem 205 based on the RAFT consensus algorithm; the client 201, the service node 202 and the data nodes 203 are communicatively connected.
  • The distributed storage system may also include a metadata server (MS), which is omitted from the figure.
  • The service node 202 can in practice be a server or an SDK.
  • The client 201 and the service node 202 can, but are not limited to, exchange data through the iSCSI protocol.
  • The service node 202 and the data nodes 203 can establish an RPC connection, with the RPC communication timeout set to 5 seconds, for example.
  • The data storage subsystem 205 takes over the data-writing service and data recovery, and guarantees the consistency of the data written to the data nodes 203 through a strong-consistency algorithm combined with data recovery.
  • The data storage subsystem 205 is bound to a block storage erasure object (OBJ).
  • When the OBJ needs data writes or data recovery, the data storage subsystem 205 can be created in advance to guarantee the consistency of the OBJ's data.
  • When the data storage subsystem 205 has been idle for a preset duration, its destruction can be initiated.
  • During data writing, the data storage subsystem 205 provided by the embodiment of the present application removes the stripes and version numbers on the data node side. The data node side can then accept writes of any offset and granularity; the data alignment operation is deferred until the NVM is flushed to disk, and the overhead of writing version numbers on the data node side is eliminated.
  • For example, when block storage is exposed over the iSCSI protocol, the minimum write granularity is 512 bytes; choosing a suitable N (2, 4, 8, 16, etc.) such that 512 % N == 0 and setting the stripe size to 512/N lets the data written on the service side be split evenly across the N+M data nodes.
  • The service side can then send the data directly to the data nodes without reading data back from the data node side for stripe-group alignment.
  • The write latency becomes MAX(N+M)(write data) + MAX(N+M)(synchronize operation log). Taking an 8+2 erasure ratio as an example, the old system could eliminate the stripe-alignment reads by setting the stripe size to 64 bytes, but that greatly increases the number of version numbers: every 64 bytes of data needs an 8-byte version number, giving a space utilization of 64/(64+8) = 88.9%. After this scheme removes the stripe version numbers, the stripe size can be chosen freely and no space is consumed by version numbers.
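  • A minimal sketch of the arithmetic behind this paragraph, assuming the 512-byte iSCSI write granularity and the 8+2 example ratio used above; the variable names are illustrative only.

```python
# Stripe-size selection and version-number overhead, per the example above.

IO_GRANULARITY = 512          # bytes, minimum iSCSI write size
N, M = 8, 2                   # 8+2 erasure ratio

assert IO_GRANULARITY % N == 0            # pick N from {2, 4, 8, 16, ...}
stripe_size = IO_GRANULARITY // N         # 64-byte stripes split a write evenly
print(stripe_size)                        # 64

# Old scheme: every 64-byte stripe carries an 8-byte version number.
version_size = 8
utilization = stripe_size / (stripe_size + version_size)
print(f"{utilization:.1%}")               # 88.9% space utilization

# New scheme: no per-stripe version number, so the stripe size can be chosen
# freely and no space is consumed by version numbers.
```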
  • Figure 3 is a schematic structural diagram of the data storage subsystem provided by the embodiment of the present application.
  • The data storage subsystem 205 is composed of an algorithm core, logical nodes, operation logs and data logs, and uses LUN ID + OBJ ID as the identifier of the data storage subsystem 205.
  • For example, if the LUN ID is 1 and the OBJ ID is 0, the identifier of the data storage subsystem can be expressed as r-1-0.
  • The algorithm core is the RAFT algorithm, which consists of two parts: master election and log replication.
  • A slave logical node initiates the master election process when it does not receive the keep-alive of the master logical node.
  • In the original algorithm, a candidate node becomes the master logical node after receiving more than half of the votes.
  • In the embodiment of this application, however, the logical nodes must map one-to-one to the data logs; therefore a node can become the master logical node only when it receives at least N votes, where N is the total number of pieces of original data into which the data to be written is divided.
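  • The vote threshold described above can be expressed as a one-line rule. The sketch below contrasts it with the standard RAFT majority, under the stated assumption that N is the number of original data pieces; the function names are illustrative.

```python
# Election quorum: standard RAFT majority vs. the modified rule used here,
# where a candidate must gather at least N votes (N = original data pieces)
# so that the elected master can be mapped one-to-one to the data logs.

def raft_majority_quorum(votes: int, total_nodes: int) -> bool:
    return votes > total_nodes // 2

def modified_quorum(votes: int, n_original_pieces: int) -> bool:
    return votes >= n_original_pieces

# With a 4+2 layout (6 logical nodes, N = 4):
print(raft_majority_quorum(4, 6))   # True  (4 > 3)
print(modified_quorum(3, 4))        # False (3 < 4) - not enough in this scheme
print(modified_quorum(4, 4))        # True
```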
  • A logical node is a virtual node created by a data node at the request of the metadata server or the service node; according to its sequence number, it resides on the physical node where the corresponding data block is located.
  • Each logical node can be identified by the identifier of the data storage subsystem plus the identifier of the corresponding data node. For example, the logical node on data node DN-1 can be expressed as r-1-0-1, where r-1-0 is the identifier of the data storage subsystem and 1 is the identifier of data node DN-1.
  • Each logical node contains all logical node information of the data storage subsystem 205, mainly including the status of each logical node and the identification of the physical node where it is located.
  • Each logical node consists of an operation log and a data log, which are stored in the transaction log of the data node 203.
  • The operation log is divided into a write operation log and a full data recovery log.
  • The write operation log contains the offset within the OBJ (that is, the starting position of data writing), the data length, and the transaction ID; the full data recovery log records two states: started and finished.
  • The data log records the data of the current write that was distributed to this data node.
  • It is written by the service node into the transaction log of the data node.
  • Writing into the data store is controlled by the write operation log; when the write operation log is written to a data node, a check is performed to detect a lost data log and repair it. Logs whose log ID is less than or equal to the RAFT commit ID are applied to storage.
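  • A hedged sketch of the apply rule just described: an entry is applied to storage only if its log ID does not exceed the RAFT commit ID, and a missing data log is detected when the write operation log arrives. The classes, fields and the storage call are assumptions made for illustration, not the patent's actual interfaces.

```python
# Illustrative apply loop: a write operation log entry is applied to the data
# store only if its log ID is <= the RAFT commit ID, and only if the data log
# for its transaction is present; otherwise the data log must be repaired first.

from dataclasses import dataclass

@dataclass
class WriteOpLog:
    log_id: int
    txn_id: int
    offset: int      # offset within the OBJ (starting write position)
    length: int

def apply_committed(entries, commit_id, data_logs, storage):
    for entry in sorted(entries, key=lambda e: e.log_id):
        if entry.log_id > commit_id:
            break                                   # not committed yet
        if entry.txn_id not in data_logs:
            raise LookupError(f"data log for txn {entry.txn_id} missing; repair needed")
        storage.write(entry.offset, data_logs[entry.txn_id])   # assumed storage API
```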
  • The data storage subsystem provided by the embodiment of the present application is a virtual subsystem within the distributed storage system, which is the main difference from existing distributed storage systems.
  • The data storage subsystem is composed of the RAFT algorithm and the individual logical nodes. Each logical node maintains an operation log and a data log; the data log holds the data currently being written, so that a data node failure does not leave the data unrecoverable.
  • When a data write request exists, a data storage subsystem can be created on demand, and that subsystem then handles data writing, data recovery and related functions. This avoids reporting data-write or data-recovery requests through the server, and therefore avoids the adverse effects of server downtime. When no data write requests exist, the data storage subsystem can be deleted so that it does not occupy additional resources.
  • Figure 4 is a schematic flow chart of the data storage method provided by the embodiment of the present application. The method may include:
  • S401: Synchronize the write operation log received by the master logical node to all slave logical nodes.
  • The master logical node is one of all the logical nodes; a slave logical node is any logical node other than the master logical node.
  • S402: When it is determined that the master logical node has received a log synchronization failure message, determine the abnormal logical node corresponding to the message. The log synchronization failure indicates that data is missing on the target physical storage node corresponding to the abnormal logical node.
  • S403: Generate the missing data based on the preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node. The data log records the data written to the physical storage node corresponding to the logical node.
  • S404: Perform data recovery on the target physical storage node based on the missing data.
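  • The four steps can be read together as the simplified Python sketch below; the node, log and erasure-code interfaces are assumptions used only to show the control flow, not the actual implementation.

```python
# Simplified rendering of steps S401-S404 from the master logical node's view.
# All node/log interfaces here are assumed for illustration.

def handle_write_op_log(master, slaves, write_op_log, erasure_code):
    # S401: synchronize the write operation log to every slave logical node.
    results = {slave: slave.sync_log(write_op_log) for slave in slaves}

    for abnormal, ok in results.items():
        if ok:
            continue
        # S402: a failure reply identifies the abnormal logical node.
        donors = [s for s in slaves if s is not abnormal] + [master]

        # S403: regenerate the missing piece from the donors' data logs.
        pieces = {d.piece_index: d.data_log(write_op_log.txn_id) for d in donors}
        missing = erasure_code.reconstruct(pieces, lost=abnormal.piece_index)

        # S404: recover the target physical storage node, then resync the log.
        abnormal.physical_node.restore(write_op_log.txn_id, missing)
        master.resync_log(abnormal, write_op_log)
```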
  • According to this method, after the master logical node receives the write operation log, it synchronizes the write operation log to the slave logical nodes; when a log synchronization failure message is fed back by a slave logical node, it indicates that the data on the physical storage node corresponding to that abnormal logical node is missing.
  • The missing data can then be generated based on the preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, and, based on the missing data, data recovery can be performed on the physical storage node corresponding to the abnormal logical node.
  • In this way, the data storage subsystem provided by this application can monitor whether any physical storage node is missing data.
  • In step S401, the write operation log received by the master logical node is synchronized to all slave logical nodes.
  • The above write operation log is generated after the service node distributes the data to be written to the multiple physical storage nodes. That is, in an optional implementation, the operation log is generated in the following manner:
  • a1: Obtain the original data and the verification data from the data to be written that the service node has received.
  • In the embodiment of this application, the data to be written is divided according to the preset erasure code.
  • For example, suppose the preset erasure code is RS(N,M), where N is the total number of pieces of original data and M is the total number of pieces of verification data.
  • Erasure calculation is performed on the data slices: the original data is split by size into N pieces of original data, and M pieces of verification data are generated by erasure coding.
  • The offset and data length of each piece are both 1/N of the data length of the data to be written.
  • a2: Distribute the transaction ID generated by the service node, together with the original data and the verification data, to the multiple physical storage nodes, so that each physical storage node writes the received transaction ID and the received original data or verification data into its transaction log.
  • A physical storage node in the embodiment of this application is a data node shown in Figure 2. That is, before the operation log is generated, the service node first sends the original data and the verification data, together with the transaction ID, to the physical storage nodes; after receiving its piece of original data or verification data, each physical storage node writes the transaction ID and that data into the transaction log it maintains.
  • The transaction ID can take the maximum value of the existing transaction logs in the data storage subsystem plus 1 as its initial value, be incremented by 1 each time the service node uses it, and have LUNID-OBJID appended so that it is globally unique.
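  • A short sketch of the transaction-ID rule described above (maximum existing ID plus 1 as the initial value, incremented on each use, with LUNID-OBJID appended for global uniqueness); the allocator class and its exact string format are illustrative assumptions.

```python
# Transaction IDs: start from (max existing ID + 1), increment on every use,
# and append "LUNID-OBJID" so the value is globally unique.

class TxnIdAllocator:
    def __init__(self, existing_ids, lun_id, obj_id):
        self.next_id = (max(existing_ids) + 1) if existing_ids else 1
        self.suffix = f"{lun_id}-{obj_id}"

    def allocate(self) -> str:
        txn_id = f"{self.next_id}-{self.suffix}"
        self.next_id += 1
        return txn_id

alloc = TxnIdAllocator(existing_ids=[7, 12], lun_id=1, obj_id=0)
print(alloc.allocate())   # "13-1-0"
print(alloc.allocate())   # "14-1-0"
```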
  • Based on the N+M pieces of data, the service node can determine in advance the N+M physical storage nodes that will receive the original data, the verification data and the transaction ID. That is, in an optional implementation, the service node first sends a query about the data storage subsystem to the metadata server; after receiving the query, the metadata server looks up the OBJ information that has already been created, i.e. it determines the physical storage node on which each of the N+M data blocks (BLK) contained in the OBJ is located, and these serve as the N+M physical storage nodes.
  • While determining the N+M physical storage nodes, it can also be determined whether a logical node has already been created on each physical storage node. If not, a creation request can be sent to the physical storage node so that it creates a logical node; if one already exists, no creation is needed.
  • a3: Determine whether the number of data-send success messages received by the service node is greater than or equal to the total number of pieces of original data.
  • After receiving the original data or verification data and the transaction ID, a physical storage node feeds back a data-send success message or a data-send failure message to the service node.
  • The service node judges based on the returned results: when the number of data-send success messages is greater than or equal to N, it generates the write operation log and sends it to the master logical node (a4); otherwise it sends a rollback message to the physical storage nodes.
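  • The decision in a3/a4 amounts to a quorum check on the data-send acknowledgements, sketched below with illustrative names.

```python
# Service-node decision after distributing the N + M pieces: proceed when at
# least N nodes acknowledged, otherwise roll the transaction back.

def decide_after_distribution(acks, n_original, build_write_op_log, rollback):
    successes = sum(1 for ok in acks.values() if ok)
    if successes >= n_original:
        return build_write_op_log()          # a4: generate the write operation log
    for node in acks:
        rollback(node)                       # otherwise roll back on every node
    return None
```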
  • The write operation log in the embodiment of this application can, on the one hand, instruct a slave logical node to confirm whether the data to be written is present; on the other hand, applying the operation log writes the data to be written into the data store of the physical storage node.
  • The write operation log contains the transaction ID.
  • After receiving the write operation log, the master logical node writes it into the transaction log of its corresponding physical storage node and then initiates write-operation-log synchronization.
  • By synchronizing the write operation log, the original 2PC write mode is adjusted to "write data + synchronize operation log".
  • Synchronizing the operation log replaces committing a data version number, which avoids synchronizing the data content and increasing network pressure.
  • It can also be seen that, before log synchronization is performed, the service node first distributes the data to the physical storage nodes, and data-loss anomalies are then detected during the log synchronization process.
  • In other words, the embodiment of this application ties anomaly detection to data writing, so the service node does not need to detect and report anomalies, which reduces the data loss caused by service node downtime.
  • Figure 5 is a schematic diagram of a data storage scenario provided in an embodiment of the present application.
  • each OJB is 256MB.
  • the identifier is 0, and the data to be written needs to be stored on the six data nodes DN-0 to DN-5.
  • the created data storage subsystem is r-1-0, including r-1-0-0 To the six logical nodes r-1-0-5, each logical node corresponds to a data node.
  • After the client sends the data information of the data to be written to the SDK side, the SDK can divide the data to be written into 4 pieces and generate 2 pieces of verification data. The data information of the data to be written includes the data length, 512, and the starting write position (offset) of the data, 512; the data length of each piece is therefore 128 and its starting write position is 128.
  • The SDK first generates a transaction ID (id=1) and distributes the transaction ID and the 6 pieces of data to DN-0 through DN-5; each data node writes the received data and transaction ID into its transaction log. After the number of success results received by the SDK reaches at least 4, the SDK generates a write operation log (log ID log=2) based on the transaction ID, the starting write position (offset=512) and the data length (Length=512), sends it to the master logical node among r-1-0-0 to r-1-0-5, and the master logical node synchronizes it to the slave logical nodes.
  • For each data node, once its logical node applies the write operation log, the received data can be written into the data store; taking DN-0 as an example, the received data is written into the data block identified as 1-0-0, with starting write position offset=128.
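  • The numbers in this scenario follow directly from the 4+2 layout; a small sketch of the arithmetic, using the values from the example above:

```python
# Figure 5 example: a 512-byte write at offset 512 with a 4+2 erasure ratio.

N, M = 4, 2
length, offset = 512, 512

piece_length = length // N        # 128 bytes per piece
piece_offset = offset // N        # per-node starting write position 128
print(piece_length, piece_offset) # 128 128

# Six pieces (4 data + 2 parity) go to DN-0 ... DN-5; after each logical node
# applies the write operation log, DN-0 writes its piece into block 1-0-0 at
# offset 128.
```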
  • In step S402, when it is determined that the master logical node has received a log synchronization failure message, the abnormal logical node corresponding to the message is determined.
  • After a slave logical node receives the write operation log, it can query the transaction log of its corresponding physical storage node using the transaction ID in the write operation log: if the entry is found, a synchronization success message is returned; if it is not found, a synchronization failure message is returned. Therefore, in an optional implementation, after a slave logical node receives the log synchronization request, the following steps can be performed:
  • b1: Determine whether the transaction ID in the write operation log exists on the physical storage node corresponding to the slave logical node;
  • b2: If the transaction ID does not exist, the slave logical node feeds back a log synchronization failure message to the master logical node;
  • b3: If the transaction ID exists, the slave logical node feeds back a log synchronization success message to the master logical node.
  • For a slave logical node whose write-operation-log synchronization succeeded, the write operation log can be applied according to its transaction ID, so that the physical storage node where that slave logical node is located writes the original data or verification data recorded in the transaction log under that transaction ID into the data store.
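  • Steps b1 to b3 and the subsequent apply can be sketched as follows; the transaction-log and data-store calls are assumptions used only to illustrate the control flow.

```python
# Slave-side handling of a synchronized write operation log (steps b1-b3 and
# the subsequent apply). transaction_log is treated as a dict keyed by txn ID.

def on_sync_request(slave, write_op_log):
    # b1: does this node's transaction log already hold the transaction ID?
    record = slave.transaction_log.get(write_op_log.txn_id)
    if record is None:
        # b2: the data never arrived here - report synchronization failure.
        return {"ok": False, "node": slave.node_id}
    # b3: report success; applying the log then persists the buffered piece.
    slave.data_store.write(write_op_log.offset, record.payload)
    return {"ok": True, "node": slave.node_id}
```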
  • For a slave logical node whose synchronization failed, the master logical node marks the logical node corresponding to the failure message as an abnormal logical node and then executes steps S403 and S404 to perform data recovery.
  • In step S403, the missing data is generated based on the preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node; the data log records the data written to the physical storage node corresponding to the logical node.
  • The data log corresponding to each logical node records the original data or verification data to be written into that physical storage node.
  • When the physical storage node where the abnormal logical node is located is missing data, the original data or verification data recorded in the data logs of the other logical nodes can be obtained, and the missing data is generated from them through the preset erasure code.
  • In step S404, data recovery is performed on the target physical storage node based on the missing data.
  • The master logical node generates the missing data and sends it to the target physical storage node, so that the target physical storage node writes the missing data into its transaction log; the write operation log is then resynchronized to the abnormal logical node.
  • In order to determine whether the data to be written was written successfully, the embodiment of the present application also provides the following implementation of steps c1 to c3:
  • c1: Determine whether the number of log synchronization success messages received by the master logical node is greater than or equal to the total number of pieces of original data into which the data to be written was divided;
  • c2: If so, the master logical node feeds back a data-write success message to the service node;
  • c3: If not, the master logical node discards the operation log and feeds back a data-write failure message to the service node.
  • The service node can also forward the received write result to the client, so that the user is promptly informed of the data-write status.
  • Since there are multiple logical nodes in the data storage subsystem, in order to determine the master logical node and the slave logical nodes, the embodiment of this application also provides the following implementation of steps d1 to d3:
  • d1: Determine whether a first slave logical node has received keep-alive information from the master logical node, where the first slave logical node is any logical node other than the master logical node;
  • d2: If not, initiate the master logical node election process;
  • d3: When there is a target logical node whose received number of votes is greater than or equal to the total number of pieces of original data into which the data to be written was divided, determine that target logical node to be the master logical node.
  • Figure 6 is a schematic flow chart of another data storage method provided by an embodiment of the present application.
  • That is, the data storage method provided by the embodiment of this application may further include the following steps:
  • S405: When the master logical node determines that a slave logical node is offline and the offline slave logical node has not come back online within a preset time period, check whether the number of online slave logical nodes is greater than or equal to the total number of pieces of original data into which the data to be written was divided.
  • The node-offline determination time of the data storage subsystem can be set according to actual needs, for example to 1 minute. If a node does not come back online within the preset period after going offline, a new data block can be requested from the metadata server for data recovery; if no new data block can be obtained, the offline node is removed from the data storage subsystem and the logs of the remaining nodes are committed.
  • S406: If so, the master logical node sends a request for a new data block to the metadata server, so that the metadata server creates a new data block and creates a new logical node on the physical storage node where the new data block is located.
  • S407: Update the full data recovery log of the master logical node to the started state, and synchronize the full data recovery log to the slave logical nodes other than the offline slave logical node and to the new logical node.
  • After all slave logical nodes receive the started full-data-recovery log, they stop committing newly updated data logs to the data store; data-log committing resumes only when the full data recovery logs of all nodes are in the finished state.
  • S408: Generate the new missing data based on the preset erasure code and the data logs of the online slave logical nodes, and send the new missing data to the new logical node.
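  • The offline-node path S405 to S408 can be summarized by the sketch below; the metadata-server and node interfaces are illustrative assumptions rather than the actual implementation.

```python
# S405-S408: replace a slave logical node that stayed offline too long.

def replace_offline_node(master, online_slaves, offline, n_original,
                         metadata_server, erasure_code):
    # S405: enough online slaves must remain to rebuild the lost piece.
    if len(online_slaves) < n_original:
        return False

    # S406: ask the metadata server for a new data block and a new logical node.
    new_block = metadata_server.allocate_block()
    new_node = new_block.physical_node.create_logical_node()

    # S407: mark full data recovery as started and synchronize that state;
    # slaves pause committing new data logs until recovery finishes.
    master.full_recovery_log.start()
    master.sync_full_recovery_log(online_slaves + [new_node])

    # S408: rebuild the missing piece from the online data logs and ship it
    # to the new logical node, then mark the recovery finished.
    pieces = {s.piece_index: s.latest_data_log() for s in online_slaves}
    missing = erasure_code.reconstruct(pieces, lost=offline.piece_index)
    new_node.write_data_log(missing)
    master.full_recovery_log.finish()
    return True
```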
  • Figure 7 is a schematic diagram of another data storage scenario provided in an embodiment of the present application.
  • As shown in Figure 7, assume that data of length 128 was to be written at starting position 128 on data block blk=1-0-4 of data node DN-5; if DN-5 goes offline, the data sent to DN-5 will be lost.
  • In that case a new data block can be requested; assuming the new data block is on data node DN-6, a new logical node is first created on DN-6.
  • After the master logical node generates the new missing data, that data is written to the new data block requested on DN-6, with data length 128 and starting write position 128.
  • Figure 8 is a functional module diagram of the data storage subsystem provided by the embodiment of the present application, including:
  • The synchronization module 205-1 is used to synchronize the write operation log received by the master logical node to all slave logical nodes, where the master logical node is one of all the logical nodes and a slave logical node is a logical node other than the master logical node;
  • the determination module 205-2 is used to determine, when it is determined that the master logical node has received a log synchronization failure message, the abnormal logical node corresponding to the message, where the log synchronization failure indicates that data is missing on the target physical storage node corresponding to the abnormal logical node;
  • the generation module 205-3 is used to generate the missing data based on the preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, where the data log records the data written to the physical storage node corresponding to the logical node;
  • the storage module 205-4 is used to perform data recovery on the target physical storage node based on missing data.
  • the synchronization module 205-1, the determination module 205-2, the generation module 205-3 and the storage module 205-4 can cooperatively execute each step in Figure 4 to achieve corresponding technical effects.
  • the data storage subsystem may also include a sending module.
  • The determination module 205-2, the generation module 205-3 and the sending module may collaboratively execute steps S405 to S408 shown in Figure 6 to achieve the corresponding technical effects.
  • the generation module 205-3 is also used to perform steps a1 to a2 to achieve corresponding technical effects.
  • the determining module 205-2 and the sending module can also be used to perform steps b1 to b3 and steps c1 to c3 to achieve corresponding technical effects.
  • the data storage subsystem may also include an election module, and the determination module 205-2 and the election module may also collaboratively execute steps d1 to d3 to achieve corresponding technical effects.
  • In an optional implementation, the operation log is generated by: dividing, by the service node, the received data to be written to obtain the original data and the verification data;
  • distributing the transaction ID generated by the service node, together with the original data and the verification data, to the multiple physical storage nodes, so that each physical storage node writes the received transaction ID and the received original data or verification data into its transaction log; determining whether the number of data-send success messages received by the service node is greater than or equal to the total number of pieces of original data; and, if so, generating the write operation log through the service node.
  • The storage module 205-4 can also be used to: write the write operation log into the transaction log of the physical storage node corresponding to the master logical node; and apply the write operation log through the master logical node, so that the physical storage node where the master logical node is located writes the received original data or verification data into the data store.
  • Embodiments of the present application also provide a storage medium on which a computer program is stored.
  • The computer-readable storage medium may be, but is not limited to, a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic disk, an optical disk, or any other medium that can store program code.

Abstract

In the data storage method, subsystem, distributed storage system and storage medium provided by this application, the method includes: synchronizing the write operation log received by the master logical node to all slave logical nodes; when it is determined that the master logical node has received a log synchronization failure message, determining the abnormal logical node corresponding to the message, where the log synchronization failure indicates that data is missing on the target physical storage node corresponding to the abnormal logical node; generating the missing data based on a preset erasure code and the data logs of the remaining slave logical nodes other than the abnormal logical node and the master logical node, where a data log records the data written to the physical storage node corresponding to its logical node; and performing data recovery on the target physical storage node based on the missing data. In this application, data-loss detection is tied to data writing, no service node is needed to schedule the recovery, the loss of recovery data caused by server downtime is avoided, and the complexity of the recovery procedure is effectively reduced.

Description

数据存储方法、子系统、分布式存储系统及存储介质
相关申请的交叉引用
本申请要求于2022年08月19日提交中国国家知识产权局的申请号为202211003028.1、名称为“数据存储方法、子系统、分布式存储系统及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据存储技术领域,具体而言,涉及一种数据存储方法、子系统、分布式存储系统及存储介质。
背景技术
为了避免数据在写入过程中出现错误,可以采用RS(N,M)纠删编码通过N份原始数据生成M份校验数据,当N+M份数据中任意M份数据丢失后可通过剩下N份数据重新生成,从而实现恢复数据的效果。
在已有的分布式存储系统中,服务器侧在数据写入失败后,需要上报到元数据服务器,由元数据服务器调度恢复数据,恢复操作实际上在数据存储节点侧执行,由此可见,现有技术因恢复数据是服务器侧上报到元数据服务器之后才能被系统感知,当服务器宕机后,容易出现恢复数据丢失,此时数据存储节点侧的数据不一致无法被立即感知到,期间再次出现异常将显著增加系统数据不一致或彻底损坏的风险。
发明内容
本申请的目的之一在于提供一种数据存储方法、子系统、分布式存储系统及存储介质,用以降低数据不一致或彻底损坏的风险和数据恢复的复杂度。
第一方面,本申请提供一种数据存储方法,应用于分布式存储系统中的数据存储子系统,所述数据存储子系统包括多个逻辑节点,每个逻辑节点对应一个物理存储节点;所述方法包括:将主逻辑节点接收到的写操作日志同步给全部从逻辑节点;其中,所述主逻辑节点为全部所述逻辑节点其中一个;所述从逻辑节点为除所述主逻辑节点以外的逻辑节点;当确定所述主逻辑节点接收到日志同步失败的消息,确定所述消息对应的异常逻辑节点;其中,所述日志同步失败表征所述异常逻辑节点对应的目标物理存储节点上的数据缺失;基于预设纠删码、以及除所述异常逻辑节点和所述主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;所述数据日志用于记录写入所述逻辑节点对应的物理存储节点上的数据;根据所述缺失数据,对所述目标物理存储节点进行数据恢复。
第二方面,本申请提供一种数据存储子系统,所述数据存储子系统包括多个逻辑节点,每个逻辑节点对应一个物理存储节点,包括:同步模块,用于将主逻辑节点接收到的写操作日志同步给全部从逻辑节点;其中,所述主逻辑节点为全部所述逻辑节点其中一个;所 述从逻辑节点为除所述主逻辑节点以外的逻辑节点;确定模块,用于当确定所述主逻辑节点接收到日志同步失败的消息,确定所述消息对应的异常逻辑节点;其中,所述日志同步失败表征所述异常逻辑节点对应的目标物理存储节点上的数据缺失;生成模块,用于基于预设纠删码、以及除所述异常逻辑节点和所述主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;所述数据日志用于记录写入所述逻辑节点对应的物理存储节点上的数据;存储模块,用于根据所述缺失数据,对所述目标物理存储节点进行数据恢复。
第三方面,本申请提供一种分布式存储系统,所述分布式存储系统中包含数据存储子系统,所述数据存储子系统由多个逻辑节点构成,每个逻辑节点对应一个物理存储节点,所述数据存储子系统用于执行如第一方面所述的数据存储方法。
第四方面,本申请提供一种存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现如第一方面所述的数据存储方法。
本申请提供的数据存储方法、子系统、分布式存储系统及存储介质,包括:主逻辑节点接收到写操作日志之后,将写操作日志同步给从逻辑节点,当接收到从逻辑节点反馈的日志同步失败的消息之后,表明反馈该消息的异常逻辑节点对应的物理存储节点上的数据缺失,此时可以基于预设纠删码、以及除异常逻辑节点和主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据,并基于该缺失数据,对异常逻辑节点对应的物理存储节点进行数据恢复,本申请提供的数据存储子系统可以监测是否存在数据缺失的物理存储节点,一旦存在可以基于其他从逻辑节点上的记录的数据生成缺失数据,然后对存在数据缺失的物理存储节点进行恢复,整个过程使异常检测与数据写入存在依赖关系,不需要服务节点去调度和恢复,避免了服务器宕机出现恢复数据丢失的问题,有效降低恢复流程复杂度。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,应当理解,以下附图仅示出了本申请的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1为现有的一种数据存储方法的示例图;
图2为本申请实施例提供的分布式存储系统的系统结构图;
图3为本申请实施例提供的数据存储子系统的结构示意图;
图4为本申请实施例提供的数据存储方法的示意性流程图;
图5为本申请实施例中提供的数据存储的一种场景示意图;
图6为本申请实施例提供的另一种数据存储方法的示意性流程图;
图7为本申请实施例中提供的数据存储的另一种场景示意图;
图8为本申请实施例提供的数据存储子系统的功能模块图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本申请实施例的组件可以以各种不同的配置来布置和设计。
因此,以下对在附图中提供的本申请的实施例的详细描述并非旨在限制要求保护的本申请的范围,而是仅仅表示本申请的选定实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
在本申请的描述中,需要说明的是,若出现术语“上”、“下”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系,或者是该申请产品使用时惯常摆放的方位或位置关系,仅是为了便于描述本申请和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本申请的限制。
此外,若出现术语“第一”、“第二”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
需要说明的是,在不冲突的情况下,本申请的实施例中的特征可以相互结合。
下面先对本申请实施例中涉及的相关术语进行解释。
RS(N,M)纠删码:通过N份原始数据生成M份校验数据,当N+M份数据中任意M份数据丢失后可通过剩下N份数据重新生成。
RAFT一致性算法:一种通过主节点复制日志到从节点的分布式强一致性问题解决方案,该算法将系统视作一个状态机,将对状态机的操作视为一条日志,根据主从系统从相同的初始状态执行相同的操作得到相同目标状态来保证一致性。算法内部维护一个日志序列,通过主节点选择、同步复制来保证日志正确有序的复制到各个备份。
请参见图1,图1为现有的一种数据存储方法的示例图,如图1所示,在现有的数据存储方式中,数据节点需要同时存储条带的数据和版本号,服务侧一次写入将产生两次输入和两次输出,划分条带后由客户端写入到服务侧的数据需要对齐到条带大小写入,进行数据对齐需要产生两次额外的读请求,分别是读取版本号与读取数据,在未命中缓存时将显著增加写操作时延。
例如,将数据划分为4KB大小的条带,每一个4KB条带对应一个8Byte版本号,每64MB数据需要额外存储128KB的版本号,服务节点一次写入需要同时更新版本号与数据 内容,此时将产生两次输入和两次输出,这两次输入和两次输出写入不同位置,额外增加了寻道的开销,划分条带后写入到数据节点的数据需要对齐到条带大小写入,即写入数据节点的数据偏移与长度需要对齐到4KB大小,写入到SDK的数据需要对齐到4KB*N大小。
当由客户端写入到服务侧的数据不能对齐到4KB*N大小时需要先从DN侧读取4KB*(N+M)大小的数据进行覆盖后再次写入数据节点,极端情况下写入服务侧的数据首尾均需要做一次对齐操作,此时一次服务侧的写入将产生2*(N+M)*2次读与(N+M)*2次写入操作,在未命中缓存时将显著增加写操作时延,顺序写场景下命中缓存能够消除一半的读操作时延,此时SDK侧写时延为MAX(N+M)(读版本号)+MAX(N+M)(读数据)+MAX(N+M)(写数据)+MAX(N+M)(提交数据),现有系统使用了NVM进行写加速,MAX(N+M)(写数据)+MAX(N+M)(提交数据)部分时延主要受网络影响,一般情况下要远小于MAX(N+M)(读版本号)+MAX(N+M)(读数据),写入数据跨越了两个条带组,首尾条带均需要被读到SDK之后才能计算纠删数据。
另一方面,服务节点在检测到数据存储失败后,需要上报数据恢复请求到元数据服务器,由元数据服务器进行调度并生成缺失数据,数据恢复操作实际上在数据节点侧执行,此时数据节点侧将同时收到需要会发的数据以及来自服务节点的待写入数据,提高系统复杂度,而且,由于数据恢复请求是由服务节点上报的,一旦服务节点宕机,则可能导致数据恢复请求丢失,需要通过后台扫描发现不一致数据,后台扫描周期较长,期间再次出现异常将显著增加系统数据不一致或彻底损坏的风险。
为了解决上述问题,本申请实施例提供的一种改进的分布式存储系统,请参见图2,图2为本申请实施例提供的分布式存储系统的系统结构图,其中包括客户端201、服务节点202、数据节点203(DN)、以及基于RAFT一致性算法的数据存储子系统205,客户端201、服务节点202和数据节点203之间通信连接,上述分布式存储系统中还可以包括元数据服务器(MS)(此处省略)。
服务节点202实质可以为服务器,或者为SDK,客户端201和服务节点202之间可以但不限于通过iscsi协议进行数据交互,服务节点202和数据节点203之间可以建立RPC连接,设置RPC通信超时为5秒钟。
数据存储子系统205,用于接管写数据业务、数据恢复,通过强一致性算法配合数据恢复保证写入到数据节点103侧的数据一致性。
数据存储子系统205与块存储纠删对象(OBJ)进行绑定,当OBJ需要进行数据写、数据恢复时,可以预先创建数据存储子系统205,保证OBJ数据的一致性,当数据存储子系统205的空闲时长达到预设时长,可以启动对数据存储子系统205的销毁。
本申请实施例提供的数据存储子系统205在写入数据的过程中,可以移除数据节点侧的条带与版本号,此时数据节点侧可以接收任意偏移与粒度的写入,将数据对齐的操作延后到NVM下盘时执行并消除数据节点侧写版本号的开销。
例如,假设块存储通过iscsi协议对外提供服务时,最小写入粒度为512Byte,此时选取合适的N值(2,4,8,16等)使512%N==0并设置条带大小为512/N,可使服务侧写入的数据被均分到N+M个数据节点,此时服务侧可直接将数据到数据节点,不需要从数据节点侧读取数据进行条带组对齐,写入时延为MAX(N+M)(写数据)+MAX(N+M)(同步操作日志),以8+2纠删比为例,旧系统设置条带大小为64Byte时可消除条带对齐的读操作,但此时会大幅提升版本号数量,此时没64Byte数据需要8Byte版本号,空间利用率为64/(64+8)=88.9%,本方案消除条带版本号之后可任意设置条带大小,无版本号空间占用。
请参见图3,图3为本申请实施例提供的数据存储子系统的结构示意图,数据存储子系统205是由算法核心、逻辑节点、操作日志、数据日志构成,使用LUN ID+OBJ ID作为数据存储子系统205的标识,例如假设LUN ID为1,OBJ ID为0,那么数据存储子系统的标识可以表示为r-1-0。
算法核心为RAFT算法,有选主与日志复制两部分组成,从逻辑节点在没有收到主逻辑节点保活时发起选主流程,在原有算法中选举节点收到过半投票后成为主逻辑节点,但在本申请实施例中,需要与数据日志一一映射,因此,只有在收到大于或等于N份投票时才能成为主逻辑节点,其中N为待写入数据划分后的原始数据的总份数。
逻辑节点,是数据节点基于元数据服务器或服务节点的请求创建的虚拟节点,根据序号位于对应数据块所在物理节点上,每个逻辑节点的标识可以基于数据存储子系统的标识和对应的数据节点的标识进行标识,例如,针对数据节点DN-1上的逻辑节点,可以表示为r-1-0-1,其中,r-1-0为数据存储子系统的标识、1为数据节点DN-1的标识。每个逻辑节点包含数据存储子系统205所有逻辑节点信息,主要有每个逻辑节点的状态、所在物理节点的标识。每个逻辑节点由操作日志和数据日志构成,存储在数据节点203的事务日志中。
操作日志,分为写操作日志和全量数据恢复日志,写操作日志包含OBJ内偏移(即数据写入起始位置)、数据长度与事务ID;全量数据恢复日志记录启动与结束两种状态。
数据日志,记录此次写入分发到当前数据节点的数据,由服务节点写入到数据节点的事务日志中,通过写操作日志控制写入数据存储,写操作日志写入到数据节点时将进一次检测,用于识别数据日志丢失并进行修复。日志ID小于等于RAFT提交ID的日志将被应用到存储。
可以理解的是,本申请实施例提供的数据存储子系统是分别式存储系统中的一个虚拟子系统,是与现有的分布式存储系统最主要的区别,数据存储子系统是由RAFT算法、以 及各个逻辑节点构成,每个逻辑节点用于维护操作日志和数据日志,通过数据日志可以维护当前要写入的数据,避免因数据节点故障而遭成数据无法恢复的影响,当存在数据写入请求的时候,可以请求创建数据存储子系统,并由数据存储子系统来完成数据写入、数据恢复等功能,这样一来,就可以避免通过服务器来上报数据写入或者数据恢复等请求,可以避免服务器宕机所带来的不好的影响,同时,当不存在数据写入请求,还可以删除数据存储子系统,避免占用额外资源。
基于数据存储子系统205,本申请实施例提供了一种数据存储方法,请参见图4,图4为本申请实施例提供的数据存储方法的示意性流程图,该方法可以包括:
S401,将主逻辑节点接收到的写操作日志同步给全部从逻辑节点。
其中,主逻辑节点为全部逻辑节点其中一个;从逻辑节点为除主逻辑节点以外的逻辑节点。
S402,当确定主逻辑节点接收到日志同步失败的消息,确定消息对应的异常逻辑节点;
其中,日志同步失败表征异常逻辑节点对应的目标物理存储节点上的数据缺失;
S403,基于预设纠删码、以及除异常逻辑节点和主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;
数据日志用于记录写入逻辑节点对应的物理存储节点上的数据;
S404,根据缺失数据,对目标物理存储节点进行数据恢复。
根据本申请实施例提供的数据存储方法,主逻辑节点接收到写操作日志之后,将写操作日志同步给从逻辑节点,当接收到从逻辑节点反馈的日志同步失败的消息之后,表明反馈该消息的异常逻辑节点对应的物理存储节点上的数据缺失,此时可以基于预设纠删码、以及除异常逻辑节点和主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据,并基于该缺失数据,对异常逻辑节点对应的物理存储节点进行数据恢复,本申请提供的数据存储子系统可以监测是否存在数据缺失的物理存储节点,一旦存在可以基于其他从逻辑节点上的记录的数据生成缺失数据,然后对存在数据缺失的物理存储节点进行恢复,整个过程使异常检测与数据写入存在依赖关系,不需要服务节点去调度和恢复,避免了服务器宕机出现恢复数据丢失的问题,有效降低恢复流程复杂度。
下面结合附图4至附图7,对上述步骤S401至步骤S404进行详细介绍。
在步骤S401中、将主逻辑节点接收到的写操作日志同步给全部从逻辑节点。
在本申请实施例中,上述写操作日志是在服务节点将待写入数据分发给多个物理存储节点之后,生成的操作日志,即在可选的实施方式中,生成操作日志的方式如下:
a1,根据服务节点将接收到的待写入数据,得到原始数据和校验数据。
在本申请实施例中,按照预设纠删码对待写入数据进行划分,例如假设预设纠删码为 R(N,M),其中N为原始数据的总份数,M为校验数据的总份数,对数据切片进行纠删计算,将原始数据根据大小切分成N份原始数据,并通过纠删编码生成M份校验数据,每份数据对应的偏移与数据长度均为待写入数据的数据长度的1/N。
a2,将服务节点生成的事务ID,以及原始数据和校验数据分发给多个物理存储节点,以使每个物理存储节点将接收到的事务ID,以及原始数据或校验数据写入事务日志。
本申请实施例中的物理存储节点即为图2所示的数据节点,也就是说,在生成操作日志之前,先由服务节点先将原始数据和校验数据连同事务ID一同发送给物理存储节点,每个物理存储节点接收到原始数据或者校验数据之后,将事务ID以及原始数据和校验数据写入自身维护的事务日志中。
在本申请实施例中,事务ID可以基于数据存储子系统中已有事务日志最大值+1作为初始值,服务节点每次使用时+1,添加LUNID-OBJID作为全局唯一值。
在本申请实施例中,可以预先由服务节点基于N+M份数据,确定接收原始数据和检验数据以及事务ID的N+M个物理存储节点,即在可选的实施方式中,服务节点可以先向元数据服务器发送数据存储子系统的查询信息,元数据服务器接收到查询信息之后,查询已经创建的OBJ信息,即确定该OBJ所包含的N+M个数据块(BLK)各自所在的物理存储节点,作为这N+M个物理存储节点。
在上述确定N+M个物理存储节点的过程中,还可以确定每个物理存储节点中是否创建有逻辑节点,若没有则可以向物理存储节点发送创建请求,以使物理存储节点创建逻辑节点,若有则可以不进行创建。
a3,确定服务节点接收到的数据发送成功消息的个数是否大于或等于原始数据的总份数。
在本申请实施例中,物理存储节点在接收到原始数据或者校验数据以及事务ID之后,可以向服务节点反馈数据发送成功消息或者数据发送失败消息。服务节点可以根据返回的结果进行判断,数据发送成功消息的数量大于或等于N时,则生成写操作日志并向主逻辑节点发送写操作日志,否则向物理存储节点发送回滚消息。
a4,若是,则通过服务节点生成写操作日志。
本申请实施例中的写操作日志,一方面,可以指示从逻辑节点确认是否存在待写入数据,另一方面,还可以应用操作日志,将待写入数据写入到物理存储节点的数据存储中。
在本申请实施例中,写操作日志包含事物ID,主逻辑节点接收到写操作日志之后,可以将写操作日志写入主逻辑节点对应的物理存储节点的事务日志中,然后发起写操作日志同步。
在本申请实施例中,通过同步写操作日志,将原有2PC写入模式调整为写数据+同步操 作日志,通过同步操作日志替换提交数据版本号,避免同步数据内容增加网络压力。
还可以看出,本申请实施例在进行日志同步之前,先由服务节点将数据分发给物理存储节点,然后在日志同步的过程中可以进行数据丢失异常检测,也就是说,本申请实施可以使异常检测与数据写入具有依赖关系,而不需要由服务节点去检测异常并上报,减小了因服务节点宕机所造成的数据缺失损失。
为了方便理解上述内容,请参见图5,图5为本申请实施例中提供的数据存储的一种场景示意图。
如图5所示,假设将存储空间划分为64M大小BLK,使用4+2纠删比,每一个OJB大小为256MB,客户端可以先基于LUN=1,通过SDK向MS查询OBJ信息,确定OBJ的标识为0,需要将待写入数据存储到数据节点DN-0至DN-5这六个数据节点上,创建的数据存储子系统为r-1-0,包括r-1-0-0至r-1-0-5这六个逻辑节点,每个逻辑节点对应一个数据节点。
客户端将待写入数据的数据信息发送个SDK侧之后,SDK可以将待写入数据划分成4份,并生成2份校验数据,其中,待写入数据的数据信息包括待写入数据的数据长度为512,数据的起始写入位置offset为512,那么每份数据的数据长度为128,起始写入位置为128。SDK首先生成事务ID(即id=1),然后将事务ID以及6份数据分发到DN-0至DN-5上,每个数据节点收到数据和事务ID之后,可以写入事务日志中,SDK在收到的返回结果成功的数量大于或等于4之后,可以基于事务ID、数据的起始写入位置(offset=512)以及数据长度(即Length=512)等信息生成写操作日志,日志ID为log=2,并将生成的操作日志发送给r-1-0-0至r-1-0-5中的主逻辑节点,并由主逻辑节点同步给从逻辑节点。
还可以理解的是,针对每个数据节点,当逻辑节点应用写操作日志之后,即可将获得的数据写入数据存储,例如,以数据节点DN-0为例,将获得的数据写入标识为1-0-0的数据块中,数据起始写入位置offset=128。
在步骤S402中、当确定主逻辑节点接收到日志同步失败的消息,确定消息对应的异常逻辑节点。
在本申请实施例中,当从逻辑节点接收到写操作日志后,可以根据写操作日志中的事务ID查询所对应的物理存储节点事务日志,成功查询到事物日志则返回同步成功消息,没有查询到则返回同步失败消息。因此,在一种可选的实施方式中,从逻辑节点接收到日志同步请求之后可以执行如下步骤:
b1,确定从逻辑节点所对应的物理存储节点是否存在写操作日志中的事务ID;
b2,若不存在事务ID,则通过从逻辑节点向主逻辑节点反馈日志同步失败的消息;
b3,若存在事务ID,则通过从逻辑节点向主逻辑节点反馈日志同步成功的消息。
为了方便理解,请继续参见图5,以DN-0为例,DN-0在接收到事务ID(即id=1)以及原始数据或者校验数据之后,将事务ID(即id=1)写进事务日志中,DN-1对应的逻辑节点r-1-0-0收到同步过来的写操作日志中,可以根据写操作日志中的事务ID(即id=1)查询DN-0中是否存在该事务ID,若存在,则说明DN-0中已经存在要写入的原始数据或者校验数据,若不存在,则说明DN-0数据缺失。
针对写操作日志同步成功的从逻辑节点,可以根据写操作日志中的事务ID应用写操作日志,以使该从逻辑节点所在的物理存储节点将该事务ID对应的事务日志中的原始数据或者校验数据写入数据存储。
针对写操作日志同步失败的从逻辑节点,主逻辑节点在接收到同步失败消息之后,将该消息对应的逻辑节点确定为异常逻辑节点,然后可以执行步骤S403和步骤S404实现数据恢复。
在步骤S403中、基于预设纠删码、以及除异常逻辑节点和主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;数据日志用于记录写入逻辑节点对应的物理存储节点上的数据。
在本申请实施例中,每个逻辑节点对应的数据日志用来记录要写入物理存储节点内的原始数据或者校验数据,当异常逻辑节点所在的物理存储节点出现数据缺失,则可以获取其他逻辑节点数据日志中的原始数据或者校验数据,通过预设纠删码,生成缺失数据。
在步骤S404中、根据缺失数据,对目标物理存储节点进行数据恢复。
在本申请实施例中,主逻辑节点生成缺失数据,将缺失数据发送到目标物理存储节点、以使目标物理存储节点将该缺失数据写入事务日志,然后对异常逻辑节点重新同步写操作日志。
在可选的实施方式中,为了确定待写入数据写入成功还是写入失败,本申请实施例还给出了如下步骤c1至步骤c3的实施方式:
c1判断主逻辑节点收到的日志同步成功的消息数量是否大于或等于待写入数据划分后的原始数据的总份数;
c2,若是,则通过主逻辑节点向服务节点反馈数据写入成功的消息;
c3,若否,则通过主逻辑节点将操作日志丢弃,并向服务节点反馈数据写入失败的消息。
在本申请实施例中,服务节点还可以将接收到的写入结果反馈给客户端,已及时通知用户数据写入状态。
在可选的实施方式中,由于数据存储子系统中存在多个逻辑节点,为了确定主逻辑节点和从逻辑节点,本申请实施还给出了如下步骤d1至步骤d3的实施方式:
d1确定第一从逻辑节点是否收到来自主逻辑节点的保活信息;第一从逻辑节点为除主逻辑节点中以外的任意一个;
d2,若否,发起主逻辑节点选举流程;
d3,当存在一个目标逻辑节点收到的投票数大于或等于待写入数据划分后的原始数据的总份数,则将目标逻辑节点确定为主逻辑节点。
在可选的实施方式中,当数据节点离线,可能会导致数据缺失,因此本申请实施例提供了一种数据恢复方法,请参见图6,图6为本申请实施例提供的另一种数据存储方法的示意性流程图,即本申请实施例提供的数据存储方法还可以包括如下步骤:
S405,当主逻辑节点确定存在离线的从逻辑节点,且离线的从逻辑节点在预设时间段内未上线,检测在线从逻辑节点的数量是否大于或等于待写入数据划分后的原始数据的总份数。
在本申请实施例中,数据存储子系统的节点离线判定时间可以根据实际需求进行设置,例如设置为1分钟,离线后在预设时间段内未上线,则可以向元数据服务器申请新的数据块进行数据恢复,申请不到新的数据块时从数据存储子系统移除,提交剩余节点日志。
S406,若是,则通过主逻辑节点向元数据服务器发送申请新数据块的请求,以使元数据服务器创建新数据块,并在新数据块所在的物理存储节点上创建新逻辑节点。
S407,将主逻辑节点的全量数据恢复日志更新为启动状态,并将全量数据恢复日志同步到除离线的从逻辑节点以外的从逻辑节点和新逻辑节点;
需要说明的是,所有从逻辑节点在获得启动的全量恢复消息后,暂停提交更新的数据日志到数据存储,只有当所有节点的全量数据恢复日志处于结束状态后才可以恢复数据日志提交。
S408,基于预设纠删码以及在线从逻辑节点的数据日志,生成新的缺失数据,并将新的缺失数据发送给新逻辑节点。
为了方便理解上述内容,请参见图7,图5为本申请实施例中提供的数据存储的另一种场景示意图。如图7所示,假设本来要在DN-5的数据块blk=1-0-4上写入数据长度为128的数据,起始写入位置为128,当数据节点DN-5离线,那么发送到DN-5的数据将会丢失,此时可以先申请新的数据块,假设新数据块对应的数据节点DN-6,那么可以先在DN-6上创建新的逻辑节点,由主逻辑节点新的缺失数据之后,则可以在DN-6上申请的新的数据块上写入缺失数据,数据长度为128,起始写入位置为128。
基于相同的申请的构思,请参见图8,图8为本申请实施例提供的数据存储子系统的功能模块图,包括:
同步模块205-1,用于将主逻辑节点接收到的写操作日志同步给全部从逻辑节点;其中, 主逻辑节点为全部逻辑节点其中一个;从逻辑节点为除主逻辑节点以外的逻辑节点;
确定模块205-2,用于当确定主逻辑节点接收到日志同步失败的消息,确定消息对应的异常逻辑节点;其中,日志同步失败表征异常逻辑节点对应的目标物理存储节点上的数据缺失;
生成模块205-3,用于基于预设纠删码、以及除异常逻辑节点和主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;数据日志用于记录写入逻辑节点对应的物理存储节点上的数据;
存储模块205-4,用于根据缺失数据,对目标物理存储节点进行数据恢复。
可以理解的是,同步模块205-1、确定模块205-2、生成模块205-3以及存储模块205-4可以协同执行图4中的各个步骤以实现相应的技术效果。
在可选的实施方式中,数据存储子系统还可以包括发送模块,确定模块205-2、生成模块205-3和发送模块可以协同的执行图7中的步骤S405至步骤S408以实现相应的技术效果。
在可选的实施方式中,生成模块205-3还用于执行步骤a1至步骤a2以实现相应的技术效果。
在可选的实施方式中,确定模块205-2和发送模块还可以用来执行步骤b1至步骤b3、步骤c1至步骤c3以实现相应的技术效果。
在可选的实施方式中,数据存储子系统还可以包括选举模块,确定模块205-2和选举模块还可以协同的来执行步骤d1至步骤d3以实现相应的技术效果。
在可选的实施方式中,所述操作日志是通过如下方式生成的:将所述服务节点将接收到的待写入数据进行划分,得到所述原始数据和校验数据;将服务节点生成的事务ID,以及所述原始数据和所述校验数据分发给所述多个物理存储节点,以使每个所述物理存储节点将接收到的所述事务ID,以及所述原始数据或所述校验数据写入事务日志;确定所述服务节点接收到的数据发送成功消息的个数是否大于或等于所述原始数据的总份数;若是,则通过所述服务节点生成所述写操作日志。
在可选的实施方式中,存储模块205-4还可以用于:将所述写操作日志写入到所述主逻辑节点对应的物理存储节点的事务日志中;通过所述主逻辑节点应用所述写操作日志,以使所述主逻辑节点所在的物理存储节点将接收的原始数据或者校验数据写入数据存储。
本申请实施例还提供一种存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如前述实施方式中任一项的数据存储方法。该计算机可读存储介质可以是,但不限于,U盘、移动硬盘、ROM、RAM、PROM、EPROM、EEPROM、磁碟或者光盘等各种可以存储程序代码的介质。
以上仅为本申请的优选实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。

Claims (10)

  1. 一种数据存储方法,其特征在于,应用于分布式存储系统中的数据存储子系统,所述数据存储子系统包括多个逻辑节点,每个逻辑节点对应一个物理存储节点;所述方法包括:
    将主逻辑节点接收到的写操作日志同步给全部从逻辑节点;其中,所述主逻辑节点为全部所述逻辑节点其中一个;所述从逻辑节点为除所述主逻辑节点以外的逻辑节点;
    当确定所述主逻辑节点接收到日志同步失败的消息,确定所述消息对应的异常逻辑节点;其中,所述日志同步失败表征所述异常逻辑节点对应的目标物理存储节点上的数据缺失;
    基于预设纠删码、以及除所述异常逻辑节点和所述主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;所述数据日志用于记录写入所述逻辑节点对应的物理存储节点上的数据;
    根据所述缺失数据,对所述目标物理存储节点进行数据恢复。
  2. 根据权利要求1所述的数据存储方法,其特征在于,所述分布式存储系统中包括元数据服务器,所述元数据服务器与所述数据存储子系统进行数据交互,所述方法还包括:
    当所述主逻辑节点确定存在离线的从逻辑节点,且所述离线的从逻辑节点在预设时间段内未上线,检测在线从逻辑节点的数量是否大于或等于待写入数据划分后的原始数据的总份数;
    若是,则通过所述主逻辑节点向元数据服务器发送申请新数据块的请求,以使所述元数据服务器创建新数据块,并在所述新数据块所在的物理存储节点上创建新逻辑节点;
    通过所述主逻辑节点生成全量数据恢复日志,并将所述全量数据恢复日志同步到除所述离线的从逻辑节点以外的从逻辑节点和所述新逻辑节点;
    基于所述预设纠删码以及所述在线从逻辑节点的数据日志,生成新的缺失数据,并将所述新的缺失数据发送给所述新逻辑节点。
  3. 根据权利要求1所述的数据存储方法,其特征在于,所述分布式存储系统中包括元数据服务器,所述元数据服务器与所述数据存储子系统进行数据交互,所述方法还包括:
    判断所述主逻辑节点收到的日志同步成功的消息数量是否大于或等于待写入数据划分后的原始数据的总份数;
    若是,则通过所述主逻辑节点向服务节点反馈数据写入成功的消息;
    若否,则通过所述主逻辑节点将所述操作日志丢弃,并向所述服务节点反馈数据写入失败的消息。
  4. 根据权利要求1所述的数据存储方法,其特征在于,所述方法还包括:
    确定第一从逻辑节点是否收到来自主逻辑节点的保活信息;所述第一从逻辑节点为除所述主逻辑节点中以外的任意一个;
    若否,发起主逻辑节点选举流程;
    当存在一个目标逻辑节点收到的投票数大于或等于待写入数据划分后的原始数据的总份数,则将所述目标逻辑节点确定为所述主逻辑节点。
  5. 根据权利要求3所述的数据存储方法,其特征在于,所述操作日志是通过如下方式生成的:
    将所述服务节点将接收到的待写入数据进行划分,得到所述原始数据和校验数据;
    将服务节点生成的事务ID,以及所述原始数据和所述校验数据分发给多个所述物理存储节点,以使每个所述物理存储节点将接收到的所述事务ID,以及所述原始数据或所述校验数据写入事务日志;
    确定所述服务节点接收到的数据发送成功消息的个数是否大于或等于所述原始数据的总份数;
    若是,则通过所述服务节点生成所述写操作日志。
  6. 根据权利要求1所述的数据存储方法,其特征在于,在当确定所述主逻辑节点接收到日志同步失败的消息,确定所述消息对应的异常逻辑节点之前,所述方法还包括:
    确定所述从逻辑节点所对应的物理存储节点是否存在所述写操作日志中的事务ID;
    若不存在所述事务ID,则通过所述从逻辑节点向所述主逻辑节点反馈日志同步失败的消息;
    若存在所述事务ID,则通过所述从逻辑节点向所述主逻辑节点反馈日志同步成功的消息。
  7. 根据权利要求1所述的数据存储方法,其特征在于,所述方法还包括:
    将所述写操作日志写入到所述主逻辑节点对应的物理存储节点的事务日志中;
    通过所述主逻辑节点应用所述写操作日志,以使所述主逻辑节点所在的物理存储节点将接收的原始数据或者校验数据写入数据存储。
  8. 一种数据存储子系统,其特征在于,所述数据存储子系统包括多个逻辑节点,每个逻辑节点对应一个物理存储节点,包括:
    同步模块,用于将主逻辑节点接收到的写操作日志同步给全部从逻辑节点;其中,所述主逻辑节点为全部所述逻辑节点其中一个;所述从逻辑节点为除所述主逻辑节点以外的逻辑节点;
    确定模块,用于当确定所述主逻辑节点接收到日志同步失败的消息,确定所述消息对应的异常逻辑节点;其中,所述日志同步失败表征所述异常逻辑节点对应的目标物理存储 节点上的数据缺失;
    生成模块,用于基于预设纠删码、以及除所述异常逻辑节点和所述主逻辑节点以外的剩余从逻辑节点的数据日志,生成缺失数据;所述数据日志用于记录写入所述逻辑节点对应的物理存储节点上的数据;
    存储模块,用于根据所述缺失数据,对所述目标物理存储节点进行数据恢复。
  9. 一种分布式存储系统,其特征在于,所述分布式存储系统中包含数据存储子系统,所述数据存储子系统由多个逻辑节点构成,每个逻辑节点对应一个物理存储节点,所述数据存储子系统用于执行如权利要求1-7任意一项所述的数据存储方法。
  10. 一种存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1-7任意一项所述的数据存储方法
PCT/CN2023/097138 2022-08-19 2023-05-30 数据存储方法、子系统、分布式存储系统及存储介质 WO2024037104A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211003028.1A CN115344211A (zh) 2022-08-19 2022-08-19 数据存储方法、子系统、分布式存储系统及存储介质
CN202211003028.1 2022-08-19

Publications (1)

Publication Number Publication Date
WO2024037104A1 (zh) 2024-02-22

Family

ID=83953190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097138 WO2024037104A1 (zh) 2022-08-19 2023-05-30 数据存储方法、子系统、分布式存储系统及存储介质

Country Status (2)

Country Link
CN (1) CN115344211A (zh)
WO (1) WO2024037104A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344211A (zh) * 2022-08-19 2022-11-15 重庆紫光华山智安科技有限公司 数据存储方法、子系统、分布式存储系统及存储介质


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9361306B1 (en) * 2012-12-27 2016-06-07 Emc Corporation Managing concurrent write operations to a file system transaction log
CN106662983A (zh) * 2015-12-31 2017-05-10 华为技术有限公司 分布式存储系统中的数据重建的方法、装置和系统
CN110865903A (zh) * 2019-11-06 2020-03-06 重庆紫光华山智安科技有限公司 基于纠删码分布式存储的节点异常重连复用方法及系统
CN112559140A (zh) * 2020-12-17 2021-03-26 江苏满运物流信息有限公司 数据一致性的事务控制方法、系统、设备及存储介质
CN114880165A (zh) * 2022-03-31 2022-08-09 重庆紫光华山智安科技有限公司 数据恢复方法及相关装置
CN115344211A (zh) * 2022-08-19 2022-11-15 重庆紫光华山智安科技有限公司 数据存储方法、子系统、分布式存储系统及存储介质

Also Published As

Publication number Publication date
CN115344211A (zh) 2022-11-15

Similar Documents

Publication Publication Date Title
US7278049B2 (en) Method, system, and program for recovery from a failure in an asynchronous data copying system
US5657440A (en) Asynchronous remote data copying using subsystem to subsystem communication
US8706700B1 (en) Creating consistent snapshots across several storage arrays or file systems
US9740572B1 (en) Replication of xcopy command
US8977593B1 (en) Virtualized CG
US9563517B1 (en) Cloud snapshots
US7627775B2 (en) Managing failures in mirrored systems
US9582382B1 (en) Snapshot hardening
US9535801B1 (en) Xcopy in journal based replication
US8935498B1 (en) Splitter based hot migration
US8464101B1 (en) CAS command network replication
RU2449358C1 (ru) Распределенная файловая система и способ управления согласованностью блоков данных в такой системе
US8060714B1 (en) Initializing volumes in a replication system
US7882286B1 (en) Synchronizing volumes for replication
US7398354B2 (en) Achieving data consistency with point-in-time copy operations in a parallel I/O environment
WO2018098972A1 (zh) 一种日志恢复方法、存储装置和存储节点
JP2006209775A (ja) データ追跡を有するストレージ複製システム
KR20150035507A (ko) 데이터 송신 방법, 데이터 수신 방법, 및 저장 장치
US10592128B1 (en) Abstraction layer
JP2007183930A (ja) 異なるコピー技術を用いてデータをミラーリングするときの整合性の維持
WO2024037104A1 (zh) 数据存储方法、子系统、分布式存储系统及存储介质
US7685386B2 (en) Data storage resynchronization using application features
JP2008299789A (ja) リモートコピーシステム及びリモートコピーの制御方法
CN103544081B (zh) 双元数据服务器的管理方法和装置
US10296517B1 (en) Taking a back-up software agnostic consistent backup during asynchronous replication

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854010

Country of ref document: EP

Kind code of ref document: A1