WO2018040589A1 - A data processing method and storage device based on a distributed storage system - Google Patents

A data processing method and storage device based on a distributed storage system

Info

Publication number
WO2018040589A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage node
primary storage
data object
write
target data
Application number
PCT/CN2017/081339
Other languages
English (en)
French (fr)
Inventor
冯永刚 (FENG Yonggang)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2018040589A1
Priority to US 16/293,098 (published as US11614867B2)

Classifications

    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F 3/0611 Improving I/O performance in relation to response time
    • G06F 3/0613 Improving I/O performance in relation to throughput
    • G06F 3/0623 Securing storage systems in relation to content
    • G06F 3/064 Management of blocks
    • G06F 3/0643 Management of files
    • G06F 3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H04L 67/1051 Group master selection mechanisms
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • The invention belongs to the field of data read/write technology, and in particular relates to a data processing method and a storage device based on a distributed storage system.
  • In a distributed storage system with redundant data, redundant data objects are stored as multiple copies on different storage devices. At any one time, however, the multiple copies of a data object can be used only for reading or only for writing.
  • The Quorum mechanism is a voting algorithm used in distributed systems to ensure data redundancy and eventual consistency.
  • This mechanism has three key values, N, R and W, and can therefore also be called the NRW mechanism: N is the number of copies the data has, R is the minimum number of copies that must be read for a read operation to complete, and W is the minimum number of copies that must be written for a write operation to complete.
  • As long as R + W > N is guaranteed, strong consistency is provided, because the set of storage nodes that is read necessarily overlaps the set of storage nodes that is written synchronously.
  • For example, with N = 5, W = 3 and R = 3, a write operation must complete on at least 3 copies before the system returns a write-success status, and a read operation must successfully read at least 3 copies before the system returns a read-success status.
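  • As a rough illustration of the NRW rule above, the following sketch checks the overlap condition; it is a minimal sketch under the stated definitions, and the function name quorum_ok is hypothetical, not from the patent.

```python
def quorum_ok(n: int, r: int, w: int) -> bool:
    """Strong consistency holds when every read set of R copies must
    overlap every write set of W copies, i.e. when R + W > N."""
    return r + w > n

# With N = 5 copies, W = 3 and R = 3 satisfy the rule: any 3 copies read
# must intersect any 3 copies written, so at least one read copy is current.
assert quorum_ok(5, 3, 3)
assert not quorum_ok(5, 2, 3)  # R + W = N: read and write sets may be disjoint
```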
  • In view of this, the present invention provides a data processing method and a storage device based on a distributed storage system, which can solve the technical problems of the prior art that a read operation requires a large number of successfully read copies, incurs a large read-success latency, and therefore has poor read performance.
  • In a first aspect, the present invention provides a data processing method based on a distributed storage system. When data is read, the primary storage node receives a read IO request and then determines whether the target data object stored on the primary storage node is trusted. If the target data object stored on the primary storage node is trusted, only the target data object stored on the primary storage node is read, and the target data object is sent directly to the initiator that initiated the read IO request; the target data object does not need to be read from the other storage nodes of the logical partition where the primary storage node is located. The read IO request is used to request reading of a target data object on the logical partition where the primary storage node is located; each logical partition includes multiple storage nodes, each storage node stores data objects, and one storage node in each logical partition is the primary storage node.
  • The benefit of the data processing method provided by the first aspect is that, when data is read and the target data object stored on the primary storage node is trusted, the target data object only needs to be read from the primary storage node and is returned directly to the initiator of the read IO request; no data needs to be read from the other storage nodes of the current logical partition. Compared with the Quorum mechanism, the number of copies involved in a read operation is greatly reduced, which reduces the latency of the read operation and improves the performance of the read operation.
  • In a first possible implementation, the determining, by the primary storage node, whether the target data object stored on the primary storage node is trusted includes: determining the state of the primary storage node, where the state of the primary storage node includes a trusted state and an untrusted state; if the state of the primary storage node is the trusted state, determining that the target data object stored on the primary storage node is trusted; if the state of the primary storage node is the untrusted state, obtaining the blacklist on the primary storage node and determining whether the blacklist is complete, where the blacklist stores the data objects that failed to be written on the primary storage node; if the blacklist is incomplete, determining that the target data object on the primary storage node is not trusted; if the blacklist is complete, determining whether the blacklist contains the target data object: if it contains the target data object, determining that the target data object on the primary storage node is not trusted; if it does not contain the target data object, determining that the target data object on the primary storage node is trusted.
  • In a second possible implementation, the method further includes: the primary storage node receives a primary-storage-node determination message, where the message contains identification information of the primary storage node; when the primary storage node determines, according to the message, that it is the primary storage node, it collects object degraded-write logs from all storage nodes in the logical partition where the primary storage node is located and marks the primary storage node as being in the untrusted state. An object degraded-write log records, for a data object, the storage node on which the write failed, and it is recorded on all storage nodes in the logical partition on which the data object was written successfully. If the collected object degraded-write logs contain data objects that failed to be written on the primary storage node, all data objects that failed to be written on the primary storage node are selected from the logs to obtain a blacklist; if the logs contain no data object that failed to be written on the primary storage node, the primary storage node is marked as being in the trusted state.
  • In a third possible implementation, the determining whether the blacklist is complete includes: the primary storage node obtains the state of the blacklist, where the state of the blacklist includes a complete state and an incomplete state; while the primary storage node is collecting the object degraded-write logs, the blacklist is in the incomplete state, and once the object degraded-write logs of all storage nodes in the logical partition where the primary storage node is located have been collected, the state of the blacklist is the complete state. When the state obtained by the primary storage node is the complete state, the blacklist is determined to be complete; when the state obtained is the incomplete state, the blacklist is determined to be incomplete.
  • In a fourth possible implementation, if the blacklist contains data objects that failed to be written on the primary storage node, the method further includes: the primary storage node reconstructs, one by one, the data objects in the blacklist that failed to be written, and deletes from the blacklist the degraded-write log entry corresponding to each successfully reconstructed data object; after all data objects in the blacklist have been reconstructed successfully, the primary storage node is marked as being in the trusted state.
  • In a second aspect, the present invention provides a data processing method based on a distributed storage system, including: a primary storage node receives a write IO request, where the write IO request is used to request writing of a target data object to the logical partition where the primary storage node is located; the logical partition includes multiple storage nodes, and one storage node in each logical partition is the primary storage node. When the primary storage node fails to write the target data object, it directly returns a write-failure response message to the initiator of the write IO request. When the primary storage node writes the target data object successfully, it copies the target data object to the other storage nodes of the logical partition where it is located. When the primary storage node receives write-success response messages from a preset number of storage nodes in its logical partition, it returns a write-success response message to the initiator of the write IO request; the preset number is determined according to the number of storage nodes in the logical partition where the primary storage node is located and the Quorum mechanism.
  • With the data processing method provided by the second aspect, when data is written, if the primary storage node fails to write the data, a write-failure response message is returned directly to the initiator of the write IO request. If the primary storage node writes successfully, the write must also succeed on a preset number of other storage nodes of the current logical partition before a write-success response message is returned to the initiator of the write IO request; only in this way can it be guaranteed that the data object on the current primary storage node is trusted data.
  • In a third aspect, the present invention provides a storage device, where the storage device is the primary storage node of a logical partition in a distributed storage system, and the storage device includes: a first receiving module, configured to receive a read IO request, where the read IO request is used to request reading of a target data object on the logical partition where the primary storage node is located; a determining module, configured to determine whether the target data object stored in the storage device is trusted; a reading module, configured to read the target data object in the storage device when the determining module determines that the target data object is trusted; and a sending module, configured to send the target data object read by the reading module to the initiator of the read IO request.
  • In a first possible implementation of the third aspect, the determining module includes: a first determining submodule, configured to determine the state of the storage device, where the state of the storage device includes a trusted state and an untrusted state, and to determine that the target data object stored on the primary storage node is trusted if the state of the primary storage node is the trusted state; a second determining submodule, configured to, when the first determining submodule determines that the state of the primary storage node is the untrusted state, obtain the blacklist on the primary storage node and determine whether the blacklist is complete, where the blacklist stores the data objects that failed to be written on the primary storage node, and to determine that the target data object on the primary storage node is not trusted if the blacklist is incomplete; and a third determining submodule, configured to, when the second determining submodule determines that the blacklist is complete, determine whether the blacklist contains the target data object: if it contains the target data object, the target data object on the primary storage node is determined not to be trusted; if it does not contain the target data object, the target data object on the primary storage node is determined to be trusted.
  • In a second possible implementation of the third aspect, the storage device further includes: a second receiving module, configured to receive a primary-storage-node determination message and to determine, according to the message, that the device is the primary storage node, where the message contains identification information of the primary storage node; a collecting module, configured to collect object degraded-write logs from all storage nodes in the logical partition where the primary storage node is located and to mark the primary storage node as being in the untrusted state, where an object degraded-write log records, for a data object, the storage node on which the write failed and is recorded on all storage nodes in the logical partition on which the data object was written successfully; a blacklist building module, configured to select, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain a blacklist; and a trusted marking module, configured to mark the primary storage node as being in the trusted state when the object degraded-write logs contain no data object that failed to be written on the primary storage node.
  • In a third possible implementation of the third aspect, the second determining submodule is specifically configured to: obtain the state of the blacklist, where the state of the blacklist includes a complete state and an incomplete state; while the primary storage node is collecting the object degraded-write logs, the blacklist is in the incomplete state, and once the object degraded-write logs of all storage nodes in the logical partition where the primary storage node is located have been collected, the state of the blacklist is the complete state; when the state of the blacklist is the complete state, determine that the blacklist is complete; when the state of the blacklist is the incomplete state, determine that the blacklist is incomplete.
  • In a fourth possible implementation of the third aspect, if the blacklist contains data objects that failed to be written on the primary storage node, the storage device further includes: a data reconstruction module, configured to reconstruct, one by one, the data objects in the blacklist that failed to be written and to delete from the blacklist the degraded-write log entry corresponding to each reconstructed data object; and a state modification module, configured to mark the primary storage node as being in the trusted state after all data objects in the blacklist have been reconstructed successfully.
  • In a fifth possible implementation of the third aspect, the storage device further includes: a third receiving module, configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition where the primary storage node is located; a write data module, configured to write the target data object into the corresponding storage space of the storage device according to the write IO request; a first returning module, configured to return a write-failure response message directly to the initiator of the write IO request when the target data object fails to be written; a copying module, configured to copy the target data object to the other storage nodes of the logical partition where the primary storage node is located when the target data object is written successfully; and a second returning module, configured to return a write-success response message to the initiator of the write IO request when the primary storage node receives write-success response messages from a preset number of storage nodes in its logical partition, where the preset number is determined according to the number of storage nodes in the logical partition where the primary storage node is located and the Quorum mechanism.
  • In a fourth aspect, the present invention provides a storage device, where the storage device is the primary storage node of a logical partition in a distributed storage system, and the storage device includes: a receiver, configured to receive a read IO request, where the read IO request is used to request reading of a target data object on the logical partition where the primary storage node is located; a processor, configured to determine whether the target data object stored in the storage device is trusted and, if the target data object is determined to be trusted, to read the target data object in the storage device; and a transmitter, configured to send the read target data object to the initiator of the read IO request.
  • In a first possible implementation of the fourth aspect, the receiver is further configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition where the primary storage node is located. The processor is further configured to write the target data object into the corresponding storage space of the storage device according to the write IO request; if the target data object fails to be written, to return a write-failure response message directly to the initiator of the write IO request; and if the target data object is written successfully, to copy the target data object to the other storage nodes of the logical partition where the primary storage node is located and to receive the write-success response messages returned by those storage nodes. The transmitter is further configured to return a write-success response message to the initiator of the write IO request when a preset number of write-success response messages returned by the other storage nodes have been received, where the preset number is determined according to the number of storage nodes in the logical partition where the primary storage node is located and the Quorum mechanism.
  • With the data processing method based on a distributed storage system provided by the present invention, one storage node is first selected as the primary storage node for each logical partition (Partition, PT) in the distributed storage system. The primary storage node receives a read IO request, where the read IO request is used to read a target data object in the logical partition where the primary storage node is located. After receiving the read IO request, the primary storage node first determines whether the target data object stored on the primary storage node is trusted; if it is trusted, the target data object is read only from the primary storage node and returned to the initiator that sent the read IO request.
  • FIG. 1 is a schematic diagram of a distributed storage system according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a logical partition according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a data processing method based on a distributed storage system according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a method for constructing a blacklist of a primary storage node according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a process for determining whether a target data object stored by a primary storage node is trusted according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of another data processing method based on a distributed storage system according to an embodiment of the present invention.
  • FIG. 7 is a block diagram of a storage device according to an embodiment of the present invention.
  • FIG. 8 is a block diagram of a determining module according to an embodiment of the present invention.
  • FIG. 9 is a block diagram of another storage device according to an embodiment of the present invention.
  • FIG. 10 is a block diagram of still another storage device according to an embodiment of the present invention.
  • FIG. 11 is a block diagram of still another storage device in accordance with an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a data processing method based on a distributed storage system according to an embodiment of the present invention.
  • This embodiment uses the PT partition shown in FIG. 2 as an example. Assume that the primary storage node in the PT1 partition is disk1.
  • The data processing method includes the following steps:
  • S110: disk1 receives the primary-storage-node determination message and determines that it is the primary storage node of the current logical partition.
  • The cluster management module determines one storage node (for example, disk1) from PT1 as the primary storage node of the PT1 partition and sends a primary-storage-node determination message to the determined primary storage node (disk1), where the message carries identification information of the primary storage node (for example, a device unique identifier). The primary storage node determines itself to be the primary storage node of the logical partition according to the identification information in the message.
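  • As a rough illustration of this handshake, the sketch below shows how a storage node might confirm itself as primary from such a message; the message fields and names are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class PrimaryDetermination:
    """Hypothetical primary-storage-node determination message."""
    partition_id: str   # logical partition, e.g. "PT1"
    node_id: str        # device unique identifier of the chosen primary

class StorageNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.is_primary = False

    def on_primary_determination(self, msg: PrimaryDetermination) -> None:
        # The node becomes primary only if the identifier carried in the
        # message matches its own device unique identifier.
        if msg.node_id == self.node_id:
            self.is_primary = True
```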
  • S120: disk1 receives a read IO request.
  • The initiator of the IO request (which may be, for example, a PC client or a smart-terminal client) sends the IO request to the data cluster server in the distributed storage system, and the data cluster server then sends the IO request to the primary storage node, i.e., disk1 in this embodiment.
  • disk1 determines the type of the IO request. IO requests include read IO requests and write IO requests: a read IO request is used to read a target data object from the storage device, and a write IO request is used to write a target data object into the storage device.
  • The IO request carries the object identifier of the data object requested to be read or written.
  • S130: disk1 determines whether the target data object stored on itself that the read IO request requests to read is trusted; if it is trusted, S140 is executed; if it is not, S150 is executed.
  • Specifically, disk1 determines, according to the object identifier carried in the read IO request, whether the object corresponding to that object identifier stored on itself is trusted.
  • S140: disk1 reads the target data object stored on itself and returns the target data object to the initiator of the read IO request.
  • Because the target data object on disk1 is trusted, the target data object is read directly from disk1 and returned to the initiator of the IO request.
  • S150: if the target data object on disk1 is not trusted, the read operation is performed according to the traditional Quorum mechanism.
  • PT1 includes seven storage nodes; according to the rule R + W > N with N = 7, R is 4 and W is 4, that is, a read operation is considered successful only if at least 4 copies are read successfully.
  • In this case, the read IO request may be forwarded by disk1 to the other storage nodes in PT1; for example, disk1 forwards the read IO request to disk2 through disk7.
  • disk1 collects the read response results returned by each storage node; once it has collected four read-success results, it determines that the read operation is successful. According to the Quorum mechanism, the four read-success results are guaranteed to contain the latest target data object. If the version numbers of the target data object in the four read-success results differ, the target data object with the latest version number is selected and sent to the initiator of the IO request.
  • In addition, if the cluster management module detects that disk1 has failed, the cluster management module reselects a storage node from the current logical partition as the primary storage node, and the read IO request is then resent to the new primary storage node.
  • In the data processing method based on a distributed storage system provided by this embodiment, one storage node is selected as the primary storage node for each logical partition in the distributed storage system. After a read IO request is received, it is first determined whether the target data object requested by the read IO request is trusted on the primary storage node of the corresponding logical partition; if it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the traditional Quorum mechanism.
  • In the common case this application scenario only needs to read data from a single storage node; compared with the Quorum mechanism, the number of copies involved in a read operation is greatly reduced, which reduces the latency of the read operation and improves the performance of the read operation.
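  • The read path of S120–S150 can be summarized in the following sketch. It is illustrative only: the helper names (is_trusted, read_local, try_read) and the version-comparison detail are assumptions layered on the description above, not an implementation from the source.

```python
from typing import Optional

def handle_read(primary, partition, object_id: str) -> Optional[bytes]:
    """Serve a read IO request arriving at the primary storage node:
    answer from the primary alone if its copy is trusted (S140),
    otherwise fall back to a traditional Quorum read (S150)."""
    if primary.is_trusted(object_id):
        return primary.read_local(object_id)      # single-copy read
    return quorum_read(partition, object_id)

def quorum_read(partition, object_id: str) -> Optional[bytes]:
    """Traditional Quorum read: with N = 7 nodes, R = 4 (so R + W > N
    holds with W = 4). The request is forwarded to disk1..disk7."""
    required = len(partition.nodes) // 2 + 1      # 4 of 7 in this example
    replies = []
    for node in partition.nodes:
        reply = node.try_read(object_id)          # (version, data) or None
        if reply is not None:
            replies.append(reply)
        if len(replies) >= required:
            break
    if len(replies) < required:
        return None                               # read failed
    # The R successful replies are guaranteed to contain the latest copy;
    # if version numbers differ, pick the highest one.
    return max(replies, key=lambda r: r[0])[1]
```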
  • FIG. 4 is a schematic flowchart of a method for constructing a blacklist of a primary storage node according to an embodiment of the present invention.
  • The PT partitions shown in FIG. 2 are used as an example for description; this embodiment describes the blacklist of the primary storage node in detail.
  • The blacklist is a list of the data objects that failed to be written on a storage node before that storage node became the primary storage node of the current logical partition; that is, the data objects recorded in the blacklist on that storage node are not trusted.
  • The method includes the following steps, performed between S110 and S120 of the embodiment shown in FIG. 3:
  • S210: disk1 receives the primary-storage-node determination message and determines that it is the primary storage node of the current logical partition.
  • S220: disk1 collects the object degraded-write logs from the other storage nodes in the current logical partition and marks disk1 as being in the untrusted state.
  • An object degraded-write log is a log of a data object's write failure on a storage node; it is recorded on all storage nodes in the current logical partition on which the write succeeded.
  • An object degraded-write log records the object ID of the data object whose write failed, the disk on which the write failed, the logical partition to which it belongs, and the time of the write failure. For example, if data object 1 fails to be written on disk2 at time T1 while the other storage nodes write it successfully, the failure of data object 1 on disk2 is recorded in the logs of disk1 and of disk3 through disk7.
  • disk1 only needs to collect the object degraded-write logs recorded by all the other storage nodes in the current logical partition before it became the primary storage node. disk1 sends a request for collecting the degraded-write logs to the other nodes in the current logical partition and receives the object degraded-write logs returned by each storage node. While the object degraded-write logs are being collected, disk1 is marked as untrusted; the untrusted state indicates that the data objects on disk1 cannot yet be guaranteed to be complete.
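  • To make the log format concrete, here is a minimal sketch of a degraded-write-log entry and of the blacklist selection step, based on the fields named above; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DegradedWriteLogEntry:
    """One record of a failed object write, kept on every node in the
    partition that wrote the object successfully."""
    object_id: str      # object whose write failed, e.g. "data object 1"
    failed_disk: str    # disk on which the write failed, e.g. "disk2"
    partition_id: str   # logical partition to which it belongs, e.g. "PT1"
    failed_at: float    # time of the write failure, e.g. T1

def build_blacklist(logs: list[DegradedWriteLogEntry], my_disk: str) -> set[str]:
    """Select, from the collected logs, the objects that failed on this disk."""
    return {e.object_id for e in logs if e.failed_disk == my_disk}
```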
  • S230: disk1 determines whether the collected object degraded-write logs include any data object that failed to be written on disk1; if they do, S240 is executed; if none of the collected object degraded-write logs include such a data object, S270 is performed.
  • disk1 can make this determination while it is still collecting the degraded-write logs.
  • While the blacklist is in the incomplete state, the blacklist is still being built, and it is not yet known whether a data object to be read is in the blacklist.
  • S260: reconstruct the data objects in the blacklist that failed to be written, one by one, and delete from the blacklist the degraded-write log entry corresponding to each successfully restored data object; once all the data objects in the blacklist have been reconstructed successfully, S270 is executed.
  • Data reconstruction restores the data object on the storage node where its write failed according to the content of that data object on a storage node where it was written successfully.
  • The primary storage node (disk1) can actively read the object content corresponding to a data object to be restored from a storage node on which that data object was written successfully.
  • For example, if the data objects in the blacklist that failed to be written are data object 1, data object 2 and data object 3, then data object 1 is copied to disk1 from a storage node on which data object 1 was written successfully; similarly, data object 2 is copied to disk1 from a storage node on which data object 2 was written successfully, and data object 3 is copied to disk1 from a storage node on which data object 3 was written successfully.
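  • A minimal sketch of this reconstruction loop follows, reusing the hypothetical blacklist set from the earlier sketch; copy_from_healthy_replica is an assumed helper, not an API from the patent.

```python
def rebuild_blacklist(primary, partition) -> None:
    """Reconstruct failed objects one by one (S260), then mark the
    primary as trusted once the blacklist is empty (S270)."""
    for object_id in sorted(primary.blacklist):
        data = copy_from_healthy_replica(partition, primary, object_id)
        primary.write_local(object_id, data)      # restore the local copy
        primary.blacklist.remove(object_id)       # drop its log entry
    primary.state = "trusted"

def copy_from_healthy_replica(partition, primary, object_id: str) -> bytes:
    """Read the object content from a replica whose write succeeded."""
    for node in partition.nodes:
        if node is primary:
            continue
        reply = node.try_read(object_id)          # (version, data) or None
        if reply is not None:
            return reply[1]
    raise RuntimeError(f"no healthy replica holds {object_id}")
```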
  • If the cluster management module determines that the primary storage node disk1 in PT1 has failed for some reason, the cluster management module reselects a storage node from PT1 as the new primary storage node (selected according to the balancing principle that each storage node may serve as the primary storage node).
  • After determining itself to be the primary storage node, the new primary storage node performs the method flow of S210 to S270 above.
  • In the data processing method provided by this embodiment, after disk1 determines that it is the primary storage node of the current logical partition, it collects the object degraded-write logs from the other storage nodes in the current logical partition and selects from them the data objects that failed to be written on itself before it became the primary storage node, obtaining a blacklist. During this period disk1 is marked as untrusted, and it is necessary to further determine whether a target data object on disk1 is trusted.
  • FIG. 5 is a schematic flowchart of a process in which a primary storage node determines whether a target data object stored on it is trusted according to an embodiment of the present invention; whether the target data object on the primary storage node is trusted is determined according to the blacklist constructed in the embodiment shown in FIG. 4. This embodiment still uses the logical partition shown in FIG. 2 as an example, with disk1 as the primary storage node.
  • The method may include the following steps:
  • S310: disk1 receives a read IO request.
  • S320: disk1 determines its own state, where the state of the primary storage node includes a trusted state and an untrusted state; if disk1 is in the untrusted state, S330 is performed; if disk1 is in the trusted state, S350 is performed.
  • S330: disk1 determines whether its own blacklist is complete; if it is complete, S340 is executed; if not, S360 is executed.
  • S340: disk1 determines whether the blacklist contains the target data object that the read IO request requests to read; if it does, S360 is executed; if it does not, S350 is executed.
  • S350: disk1 determines that the target data object stored on itself is trusted.
  • S360: disk1 determines that the target data object stored on itself is not trusted.
  • In the method provided by this embodiment, the state of the primary storage node is determined first; if the primary storage node is in the trusted state, the target data object on the primary storage node is determined to be trusted. If the primary storage node is in the untrusted state, it is determined whether the blacklist of the primary storage node is complete; if the blacklist is incomplete, the target data object on the primary storage node is determined not to be trusted. If the blacklist is complete, it is further determined whether the blacklist contains the target data object: if it does, the target data object on the primary storage node is determined not to be trusted; if it does not, the target data object on the primary storage node is determined to be trusted. After the target data object on the primary storage node is determined to be trusted, the target data object only needs to be read from the primary storage node, which greatly reduces the latency of the read operation and improves read performance.
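  • The decision flow S320–S360 reduces to a small function. The sketch below assumes the node keeps a state flag, a blacklist_complete flag set once log collection finishes, and the blacklist set from the earlier sketches; all of these names are hypothetical.

```python
def is_trusted(primary, object_id: str) -> bool:
    """S320-S360: decide whether the locally stored copy can be served alone."""
    if primary.state == "trusted":                # S320 -> S350
        return True
    if not primary.blacklist_complete:            # S330 -> S360
        return False                              # blacklist still being built
    # S340: a complete blacklist lists exactly the writes that failed here.
    return object_id not in primary.blacklist     # listed -> S360, else S350
```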
  • As described above, an embodiment of the present invention selects one storage node from the multiple storage nodes of a logical partition as the primary storage node, and when the target data object on the primary storage node is trusted, the target data object only needs to be read from the primary storage node. Correspondingly, when data is written, it must be ensured that the data object is written successfully on the primary storage node. The process of writing data is described in detail below with reference to FIG. 6.
  • FIG. 6 is a flowchart of a method for writing data based on a distributed storage system according to an embodiment of the present invention; this embodiment still uses the logical partition shown in FIG. 2 as an example. As shown in FIG. 6, the method for writing data may include the following steps:
  • disk1 receives a write IO request, where the write IO request carries the target data object to be written.
  • disk1 writes the target data object locally.
  • If the target data object fails to be written on disk1, disk1 returns a write-failure response message to the initiator of the write IO request.
  • In this case, the cluster management module reselects another storage node from the current logical partition as the primary storage node and resends the write IO request to the new primary storage node for writing; that is, a retry request is initiated and the write IO request is reinitiated.
  • After receiving write-success response results from at least three other storage nodes, disk1 returns a write-success response message to the initiator of the write IO request.
  • The write operation still complies with the Quorum mechanism, and the primary storage node must be among the at least 4 storage nodes on which the write succeeds; that is, the successfully written storage nodes must include the primary storage node and at least three other storage nodes. Only then is the write operation considered successful and a write-success response result returned to the initiator of the write IO request.
  • In the method for writing data provided by this embodiment, the primary storage node receives the write IO request and writes locally. If the primary storage node writes successfully, the data must also be written on other storage nodes, and the number of successful writes, including the primary storage node, must meet the requirement of the Quorum mechanism; only then is the write operation considered successful, and the primary storage node returns a write-success response result to the initiator of the write IO request. If the write operation on the primary storage node fails, a write-failure response result is returned directly from the primary storage node to the initiator of the write IO request. Ensuring that the primary storage node writes successfully maximizes the probability that data can later be read successfully from the primary storage node alone.
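  • The write path can be sketched as follows; with N = 7 nodes the Quorum threshold W = 4 must include the primary itself. The helper names are hypothetical, and error handling is reduced to a returned status string.

```python
def handle_write(primary, partition, object_id: str, data: bytes) -> str:
    """Serve a write IO request at the primary storage node: fail fast if
    the primary's local write fails, otherwise replicate and require W
    successful copies (primary included), per the Quorum mechanism."""
    w_required = len(partition.nodes) // 2 + 1    # W = 4 when N = 7
    if not primary.write_local(object_id, data):
        return "write-failure"                    # reported immediately

    ok_nodes, failed_nodes = [primary], []
    for node in partition.nodes:
        if node is primary:
            continue
        if node.try_write(object_id, data):       # copy to the replicas
            ok_nodes.append(node)
        else:
            failed_nodes.append(node)

    # Degraded-write logs: every node that wrote the object successfully
    # records which nodes failed to write it.
    for ok in ok_nodes:
        for failed in failed_nodes:
            ok.record_degraded_write(object_id, failed_node=failed.node_id)

    return "write-success" if len(ok_nodes) >= w_required else "write-failure"
```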
  • Corresponding to the method embodiments above, the present invention also provides embodiments of a storage device.
  • FIG. 7 is a block diagram of a storage device according to an embodiment of the present invention; the storage device is the primary storage node of a logical partition in a distributed storage system. The storage device includes a first receiving module 110, a determining module 120, a reading module 130 and a sending module 140.
  • The first receiving module 110 receives a read IO request, and the determining module 120 determines whether the target data object stored in the storage device is trusted, where the target data object is the data object that the read IO request requests to read. If it is trusted, the reading module 130 reads the target data object in the storage device, and the sending module 140 returns the target data object read by the reading module 130 to the initiator of the read IO request.
  • The determining module 120 includes a first determining submodule 121, a second determining submodule 122 and a third determining submodule 123.
  • The first determining submodule 121 is configured to determine the state of the storage device, where the state of the storage device includes a trusted state and an untrusted state; if the state of the primary storage node is the trusted state, it determines that the target data object stored on the primary storage node is trusted.
  • The second determining submodule 122 is configured to, when the first determining submodule determines that the state of the primary storage node is the untrusted state, obtain the blacklist on the primary storage node and determine whether the blacklist is complete, where the blacklist stores the data objects that failed to be written on the primary storage node; if the blacklist is incomplete, it determines that the target data object on the primary storage node is not trusted.
  • The second determining submodule is specifically configured to: obtain the state of the blacklist, where the state of the blacklist includes a complete state and an incomplete state; while the primary storage node is collecting the object degraded-write logs, the blacklist is in the incomplete state, and once the object degraded-write logs of all storage nodes in the logical partition where the primary storage node is located have been collected, the blacklist is in the complete state; when the state of the blacklist is the complete state, the blacklist is determined to be complete; when the state of the blacklist is the incomplete state, the blacklist is determined to be incomplete.
  • The third determining submodule 123 is configured to, when the second determining submodule determines that the blacklist is complete, determine whether the blacklist contains the target data object; if it contains the target data object, it determines that the target data object on the primary storage node is not trusted; if it does not contain the target data object, it determines that the target data object on the primary storage node is trusted.
  • After receiving a read IO request, the storage device provided by this embodiment first determines whether the target data object requested by the read IO request is trusted on the primary storage node of the corresponding logical partition; if it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the traditional Quorum mechanism.
  • With this storage device, when data is read and the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node, i.e., from a single storage node; compared with the Quorum mechanism, the number of copies involved in a read operation is greatly reduced, which reduces the latency of the read operation and improves the performance of the read operation.
  • The storage device further includes a second receiving module 210, a collecting module 220 and a blacklist building module 230.
  • The second receiving module 210 is configured to receive a primary-storage-node determination message and to determine, according to the primary-storage-node determination message, that the device is the primary storage node.
  • The primary-storage-node determination message contains identification information of the primary storage node.
  • It should be noted that the storage device may receive a read IO request at any time: the primary storage node may receive a read IO request while the blacklist is being created, after the blacklist has been completed, or after the data objects in the blacklist have been reconstructed.
  • The collecting module 220 is configured to collect object degraded-write logs from all storage nodes in the logical partition where the primary storage node is located and to mark the primary storage node as being in the untrusted state; an object degraded-write log records, for a data object, the storage node on which the write failed and is recorded on all storage nodes in the logical partition on which the data object was written successfully.
  • The blacklist building module 230 is configured to select, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain a blacklist.
  • The data reconstruction module 240 is configured to reconstruct, one by one, the data objects in the blacklist that failed to be written and to delete from the blacklist the degraded-write log entry corresponding to each reconstructed data object.
  • The state modification module 250 is configured to mark the primary storage node as being in the trusted state after all data objects in the blacklist have been reconstructed successfully.
  • The trusted marking module 260 is configured to mark the primary storage node as being in the trusted state when the object degraded-write logs contain no data object that failed to be written on the primary storage node.
  • After the storage device provided by this embodiment determines that it is the primary storage node of the current logical partition, it collects the object degraded-write logs from the other storage nodes in the current logical partition and selects from them the data objects that failed to be written on itself before it became the primary storage node, obtaining a blacklist. During this period it marks itself as untrusted, and it must further determine whether a target data object stored on itself is trusted.
  • On the basis of the embodiment shown in FIG. 7, the storage device further includes a second receiving module 210, a third receiving module 310, a write data module 320, a first returning module 330, a copying module 340 and a second returning module 350.
  • The second receiving module 210 is configured to receive a primary-storage-node determination message and to determine, according to the primary-storage-node determination message, that the device is the primary storage node.
  • The primary-storage-node determination message contains identification information of the primary storage node.
  • The third receiving module 310 is configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition where the primary storage node is located.
  • The write data module 320 is configured to write the target data object into the corresponding storage space of the storage device according to the write IO request.
  • The first returning module 330 is configured to return a write-failure response message directly to the initiator of the write IO request when the target data object fails to be written.
  • The copying module 340 is configured to copy the target data object to the other storage nodes of the logical partition where the primary storage node is located when the target data object is written successfully.
  • The second returning module 350 is configured to return a write-success response message to the initiator of the write IO request when the primary storage node receives write-success response messages from a preset number of storage nodes in the logical partition where the primary storage node is located.
  • The preset number is determined according to the number of storage nodes in the logical partition where the primary storage node is located and the Quorum mechanism.
  • With this storage device, the primary storage node receives the write IO request and writes locally. If the primary storage node writes successfully, the data must also be written on other storage nodes, and the number of successful writes, including the primary storage node, must meet the requirement of the Quorum mechanism; only then is the write operation considered successful, and the primary storage node returns a write-success response result to the initiator of the write IO request. If the write operation on the primary storage node fails, the primary storage node directly returns a write-failure response result to the initiator of the write IO request. Ensuring that the primary storage node writes successfully maximizes the probability that data can later be read successfully from the primary storage node alone.
  • FIG. 11 is a block diagram of still another storage device according to an embodiment of the present invention; the storage device is the primary storage node of a logical partition in a distributed storage system.
  • The storage device includes a processor 410, and a receiver 420 and a transmitter 430 that are coupled to the processor 410.
  • The receiver 420 is configured to receive an IO request and provide it to the processor 410.
  • The type of the IO request includes a read IO request and a write IO request. A read IO request is used to request reading of a target data object on the logical partition where the primary storage node is located; a write IO request is used to request writing of a target data object to the logical partition where the primary storage node is located.
  • The processor 410 is configured to execute the methods in the embodiments shown in FIG. 3 to FIG. 6.
  • The transmitter 430 is configured to send the read target data object to the initiator of the read IO request, or, when the processor 410 receives a preset number of write-success response messages returned by the other storage nodes, to return a write-success response message to the initiator of the write IO request.
  • After receiving a read IO request, the storage device provided by this embodiment first determines whether the target data object requested by the read IO request is trusted on the primary storage node of the corresponding logical partition; if it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the traditional Quorum mechanism.
  • With this storage device, when data is read and the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node, i.e., from a single storage node; compared with the Quorum mechanism, the number of copies involved in a read operation is greatly reduced, which reduces the latency of the read operation and improves the performance of the read operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method based on a distributed storage system, and a storage device. For each logical partition in the distributed storage system, one storage node is selected as the primary storage node. When data is read, after the primary storage node receives a read IO request, it determines whether the target data object on the primary storage node that the read IO request requests to read is trusted; if it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. With this method, if the target data object stored on the primary storage node is trusted, the data only needs to be read from one storage node, and the data object does not need to be read from the other storage nodes. Compared with the Quorum mechanism, the number of copies involved in a read operation is greatly reduced, which reduces the latency of the read operation and improves the performance of the read operation.

Description

A data processing method and storage device based on a distributed storage system
Technical Field
The present invention belongs to the field of data read/write technology, and in particular relates to a data processing method and a storage device based on a distributed storage system.
Background Art
In a distributed storage system with redundant data, redundant data objects are stored as multiple copies on different storage devices. At any one time, however, the multiple copies of a data object can be used only for reading or only for writing.
The Quorum mechanism is a voting algorithm used in distributed systems to ensure data redundancy and eventual consistency. This mechanism has three key values, N, R and W, and can therefore also be called the NRW mechanism, where N is the number of copies the data has, R is the minimum number of copies that must be read for a read operation to complete, and W is the minimum number of copies that must be written for a write operation to complete. With this mechanism, as long as R + W > N is guaranteed, strong consistency is provided, because the set of storage nodes read from overlaps the set of storage nodes written to synchronously. For example, N = 5, W = 3, R = 3 means the data in the system has 5 different copies; for a write operation, at least 3 copies must complete the write before the system returns a write-success status; for a read operation, at least 3 copies must complete the read before the system returns a read-success status.
In theory, when a data object in a distributed system is read, only one of its copies could be read. With the Quorum mechanism, however, the number of copies a read operation must read successfully equals the number of copies a write operation must write successfully. Moreover, to obtain the required number of read-success copies as quickly as possible, read IO requests usually need to be sent to the storage nodes holding all copies; the read IO requests are therefore numerous, the read-success latency increases, and read performance is greatly reduced.
Summary of the Invention
In view of this, the present invention provides a data processing method and a storage device based on a distributed storage system, which can solve the technical problems of the prior art that a read operation requires a large number of successfully read copies, incurs a large read-success latency, and has poor read performance.
第一方面,本发明提供一种基于分布式存储系统的数据处理方法,读取数据时,主存储节点接收读IO请求,接着,主存储节点判断所述主存储节点上存储的目标数据对象是否可信;如果主存储节点上存储的所述目标数据对象可信,则只需读取主存储节点上存储的目标数据对象,并将该目标数据对象直接发送给发起该读IO请求的发起端,不需要再从主存储节点所在逻辑分区的其它存储节点上读取目标数据对象。该读IO请求用于请求读取主存储节点所在逻辑分区上的目标数据对象,每个所述逻辑分区包括多个存储节点,每个存储节点存储有数据对象,且每个逻辑分区中有一个存储节点是主存储节点。
第一方面提供的该数据处理方法的有益效果在于:在读取数据时,当主存储节点上存储的目标数据对象可信时,只需从主存储节点中读取目标数据对象,并直接将读取的目标数据对象返回给发起读IO请求的发起端,不需要再从当前逻辑分区的其它存储节点上读取数据。与Quorum机制相比,大大减少了读操作的副本数量,从而减少了读操作的时延,提高了读操作的性能。
在第一种可能的实现方式中,所述主存储节点判断所述主存储节点上存储的目标数据对象是否可信,包括:判断所述主存储节点的状态,所述主存储节点的状态包括可信状态和不可信状态;如果所述主存储节点的状态为可信状态,则确定所述主存储节点上存储的目标数据对象可信;如果所述主存储节点的状态为不可信状态,则获取所述主存储节点上的黑名单,并判断所述黑名单是否完整,所述黑名单中存储所述主存储节点上写失败的数 据对象;如果所述黑名单不完整,则确定所述主存储节点上的目标数据对象不可信;如果所述黑名单完整,则判断所述黑名单中是否包含所述目标数据对象,如果包含所述目标数据对象,则确定所述主存储节点上的目标数据对象不可信;如果不包含所述目标数据对象,则确定所述主存储节点上的目标数据对象可信。
在第二种可能的实现方式中,所述方法还包括:所述主存储节点接收主存储节点确定消息,所述主存储节点确定消息包含主存储节点的标识信息;当所述主存储节点根据所述主存储节点确定消息确定自身为主存储节点时,向所述主存储节点所在逻辑分区内的全部存储节点收集对象降级写日志,并标记所述主存储节点为不可信状态;所述对象降级写日志用于记录数据对象写失败的存储节点的日志,且记录在所述逻辑分区内所述数据对象写成功的全部存储节点上;如果所述对象降级写日志中包含所述主存储节点上写失败的数据对象,则从所述对象降级写日志中,选取所述主存储节点上写失败的全部数据对象获得黑名单;如果所述对象降级写日志中不包含所述主存储节点上写失败的数据对象,则标记所述主存储节点为可信状态。
在第三种可能的实现方式中,所述判断所述黑名单是否完整,包括:所述主存储节点获取所述黑名单的状态,所述黑名单的状态包括完成状态和未完成状态,所述主存储节点收集所述对象降级写日志的过程中,所述黑名单为未完成状态,直到收集完所述主存储节点所在逻辑分区内所有存储节点的对象降级写日志,所述黑名单的状态为完成状态;当所述主存储节点获得的所述状态为完成状态时,确定所述黑名单完整;当所述主存储节点获得的所述状态为未完成状态时,确定所述黑名单不完整。
在第四种可能的实现方式中,若所述黑名单中包含在所述主存储节点上写失败的数据对象,则所述方法还包括:所述主存储节点逐个重构所述黑名单中写失败的数据对象,并从所述黑名单中删除重构成功的数据对象所对应的降级写日志;当所述黑名单中的全部数据对象都重构成功后,标记所述主存储节点为可信状态。
According to a second aspect, the present invention provides a data processing method based on a distributed storage system, including: receiving, by a primary storage node, a write IO request, where the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs, the logical partition includes multiple storage nodes, and one storage node in each logical partition is the primary storage node; when the primary storage node fails to write the target data object, returning a write-failure response message directly to the initiator of the write IO request; when the primary storage node writes the target data object successfully, replicating the target data object to the other storage nodes in the logical partition to which the primary storage node belongs; and when the primary storage node receives write-success response messages returned by a preset number of storage nodes in the logical partition, returning a write-success response message to the initiator of the write IO request, where the preset number is determined according to the number of storage nodes in the logical partition and the Quorum mechanism.
With the data processing method based on a distributed storage system provided in the second aspect, when data is written and the primary storage node fails to write the data, a write-failure response message is returned directly to the initiator of the write IO request. If the write on the primary storage node succeeds, the write must also succeed on a preset number of other storage nodes in the current logical partition before a write-success response message is returned to the initiator of the write IO request; only in this way can the data object on the current primary storage node be guaranteed to be trusted data.
According to a third aspect, the present invention provides a storage device, where the storage device is the primary storage node of a logical partition in a distributed storage system. The storage device includes: a first receiving module, configured to receive a read IO request, where the read IO request is used to request reading of a target data object in the logical partition to which the primary storage node belongs; a judging module, configured to determine whether the target data object stored in the storage device is trusted; a reading module, configured to read the target data object in the storage device when the judging module determines that the target data object is trusted; and a sending module, configured to send the target data object read by the reading module to the initiator of the read IO request.
In a first possible implementation of the third aspect, the judging module includes: a first judging submodule, configured to determine the state of the storage device, where the state of the storage device is either a trusted state or an untrusted state, and if the state of the primary storage node is the trusted state, to determine that the target data object stored on the primary storage node is trusted; a second judging submodule, configured to: when the first judging submodule determines that the state of the primary storage node is the untrusted state, obtain a blacklist on the primary storage node and determine whether the blacklist is complete, where the blacklist stores the data objects that failed to be written on the primary storage node, and if the blacklist is incomplete, determine that the target data object on the primary storage node is not trusted; and a third judging submodule, configured to: when the second judging submodule determines that the blacklist is complete, determine whether the blacklist contains the target data object, and if it does, determine that the target data object on the primary storage node is not trusted, or if it does not, determine that the target data object on the primary storage node is trusted.
In a second possible implementation of the third aspect, the storage device further includes: a second receiving module, configured to receive a primary storage node determination message and determine, according to the message, that the storage device itself is the primary storage node, where the message contains identification information of the primary storage node; a collecting module, configured to collect object degraded-write logs from all storage nodes in the logical partition to which the primary storage node belongs and mark the primary storage node as the untrusted state, where an object degraded-write log records the storage node on which a data object failed to be written and is recorded on all storage nodes in the logical partition on which that data object was written successfully; a blacklist constructing module, configured to select, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain a blacklist; and a trusted marking module, configured to mark the primary storage node as the trusted state when the object degraded-write logs do not contain any data object that failed to be written on the primary storage node.
In a third possible implementation of the third aspect, the second judging submodule is specifically configured to: obtain the state of the blacklist, where the state of the blacklist is either a completed state or an uncompleted state; the blacklist is in the uncompleted state while the primary storage node is collecting the object degraded-write logs, and is in the completed state once the object degraded-write logs of all storage nodes in the logical partition to which the primary storage node belongs have been collected; when the state of the blacklist is the completed state, determine that the blacklist is complete; and when the state of the blacklist is the uncompleted state, determine that the blacklist is incomplete.
In a fourth possible implementation of the third aspect, if the blacklist contains data objects that failed to be written on the primary storage node, the storage device further includes: a data reconstruction module, configured to reconstruct the failed data objects in the blacklist one by one and delete from the blacklist the degraded-write logs corresponding to the successfully reconstructed data objects; and a state modification module, configured to mark the primary storage node as the trusted state after all data objects in the blacklist have been reconstructed successfully.
In a fifth possible implementation of the third aspect, the storage device further includes: a third receiving module, configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs; a data writing module, configured to write the target data object into the corresponding storage space of the storage device according to the write IO request; a first returning module, configured to return a write-failure response message directly to the initiator of the write IO request when the target data object fails to be written; a replication module, configured to replicate the target data object to the other storage nodes in the logical partition to which the primary storage node belongs when the target data object is written successfully; and a second returning module, configured to return a write-success response message to the initiator of the write IO request when the primary storage node receives write-success response messages returned by a preset number of storage nodes in the logical partition, where the preset number is determined according to the number of storage nodes in the logical partition and the Quorum mechanism.
According to a fourth aspect, the present invention provides a storage device, where the storage device is the primary storage node of a logical partition in a distributed storage system. The storage device includes: a receiver, configured to receive a read IO request, where the read IO request is used to request reading of a target data object in the logical partition to which the primary storage node belongs; a processor, configured to determine whether the target data object stored in the storage device is trusted and, if the target data object is determined to be trusted, to read the target data object from the storage device; and a transmitter, configured to send the read target data object to the initiator of the read IO request.
In a first possible implementation of the fourth aspect, the receiver is further configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs; the processor is further configured to write the target data object into the corresponding storage space of the storage device according to the write IO request, to return a write-failure response message directly to the initiator of the write IO request if the target data object fails to be written, and, if the target data object is written successfully, to replicate the target data object to the other storage nodes in the logical partition to which the primary storage node belongs and receive the write-success response messages returned by the other storage nodes; and the transmitter is further configured to return a write-success response message to the initiator of the write IO request upon receiving a preset number of write-success response messages returned by the other storage nodes, where the preset number is determined according to the number of storage nodes in the logical partition to which the primary storage node belongs and the Quorum mechanism.
With the data processing method based on a distributed storage system provided by the present invention, for any logical partition (Partition, PT) in the distributed storage system, one storage node is first selected as the primary storage node. The primary storage node receives a read IO request, where the read IO request is used to read a target data object in the logical partition to which the primary storage node belongs. After receiving the read IO request, the primary storage node first determines whether the target data object stored on the primary storage node is trusted; if it is trusted, the target data object is read from the primary storage node only and returned to the initiator that sent the read IO request. With this method, when data is read and the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node and does not need to be read from the other storage nodes. Compared with the Quorum mechanism, this greatly reduces the number of IO requests issued by a read operation, thereby reducing read latency and improving read performance.
Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a schematic diagram of a distributed storage system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a logical partition according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a data processing method based on a distributed storage system according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a method for constructing a blacklist by a primary storage node according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a process in which a primary storage node determines whether a target data object stored on the node itself is trusted according to an embodiment of the present invention;
FIG. 6 is a flowchart of another data processing method based on a distributed storage system according to an embodiment of the present invention;
FIG. 7 is a block diagram of a storage device according to an embodiment of the present invention;
FIG. 8 is a block diagram of a judging module according to an embodiment of the present invention;
FIG. 9 is a block diagram of another storage device according to an embodiment of the present invention;
FIG. 10 is a block diagram of still another storage device according to an embodiment of the present invention;
FIG. 11 is a block diagram of yet another storage device according to an embodiment of the present invention.
Description of Embodiments
FIG. 3 is a schematic flowchart of a data processing method based on a distributed storage system according to an embodiment of the present invention. This embodiment is described using the PT partitions shown in FIG. 2 as an example, assuming that the primary storage node of partition PT1 is disk1. The data processing method includes the following steps:
S110: disk1 receives a primary storage node determination message and determines that it is the primary storage node of the current logical partition.
A cluster management module selects one storage node (for example, disk1) from PT1 as the primary storage node of partition PT1 and sends a primary storage node determination message to the selected primary storage node (disk1), where the primary storage node determination message carries identification information of the primary storage node (for example, a unique device identifier). The primary storage node determines, according to the identification information in the message, that it is the primary storage node of the logical partition.
S120: disk1 receives a read IO request.
An IO request is issued by an initiator (for example, a PC client or an intelligent terminal client). The initiator sends the IO request to a data cluster server in the distributed storage system, and the data cluster server forwards the IO request to the primary storage node of the corresponding PT partition (that is, disk1 in this embodiment). disk1 determines the type of the IO request. IO requests include read IO requests and write IO requests: a read IO request is used to read a target data object from a storage device, and a write IO request is used to write a target data object to a storage device. An IO request carries the object identifier of the data object to be read or written.
S130: disk1 determines whether the target data object that the read IO request requests to read and that is stored on disk1 itself is trusted; if it is trusted, S140 is performed; if it is not trusted, S150 is performed.
disk1 determines, according to the object identifier carried in the read IO request, whether the object corresponding to that object identifier and stored on disk1 is trusted.
S140: disk1 reads the target data object stored on itself and returns the target data object to the initiator of the read IO request.
If the target data object on disk1 is trusted, the target data object is read directly from disk1 and returned to the initiator of the IO request.
S150: the target data object is read according to the Quorum mechanism.
If the target data object on disk1 is not trusted, the read operation is performed according to the conventional Quorum mechanism.
In this embodiment, PT1 includes seven storage nodes. According to the principle R+W>7, R is 4 and W is 4; that is, the read operation is considered successful only if at least four replicas are read successfully.
In some application scenarios of the present invention, if disk1 is not faulty and only the target data object stored on disk1 is not trusted, disk1 may forward the read IO request to the other storage nodes in PT1. To collect at least four read-success responses as quickly as possible, the read IO request usually needs to be sent to the other six storage nodes; as shown in FIG. 3, disk1 forwards the read IO request to disk2 through disk7.
disk1 then collects the read responses returned by the storage nodes and determines that the read operation is successful once four read-success results have been collected. According to the Quorum mechanism, the four read-success results are guaranteed to include one latest target data object. If the version numbers of the target data object in the four collected read-success results differ, the target data object with the latest version number is selected and sent to the initiator of the IO request.
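Purely as a hedged sketch of this version-selection step (the ReadResult type and function name are ours, not part of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class ReadResult:
    version: int      # replica version number
    payload: bytes    # content of the target data object

def pick_latest(results: list) -> ReadResult:
    """Among the R = 4 read-success results, the quorum overlap guarantees
    that the highest version number identifies the latest data object."""
    return max(results, key=lambda r: r.version)

replies = [ReadResult(3, b"v3"), ReadResult(5, b"v5"),
           ReadResult(5, b"v5"), ReadResult(4, b"v4")]
assert pick_latest(replies).payload == b"v5"
```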
In other application scenarios of the present invention, if the cluster management module detects that disk1 is faulty, the cluster management module reselects a storage node from the current logical partition as the primary storage node and re-sends the read IO request to the new primary storage node.
With the data processing method based on a distributed storage system provided in this embodiment, one storage node is selected as the primary storage node for each logical partition in the distributed storage system. After a read IO request is received, it is first determined whether the target data object that the read IO request requests to read and that is stored on the primary storage node of the corresponding logical partition is trusted. If it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the conventional Quorum mechanism. With this method, when the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node, that is, from a single storage node. Compared with the Quorum mechanism, this greatly reduces the number of replicas involved in a read operation, thereby reducing read latency and improving read performance.
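The overall read path of this embodiment can be sketched as follows; this is an illustrative simplification under our own assumptions (dict-based nodes, a boolean trust flag), not the patented implementation:

```python
from typing import Dict, List, Optional, Tuple

Replica = Tuple[int, bytes]  # (version, payload) of the target data object

def handle_read(primary: Dict[str, Replica],
                others: List[Dict[str, Replica]],
                object_id: str, primary_trusted: bool,
                read_quorum: int = 4) -> Optional[Replica]:
    """S130-S150: serve from the primary alone when its copy is trusted;
    otherwise forward to the other nodes and perform a quorum read."""
    if primary_trusted:
        return primary.get(object_id)          # single-replica read
    hits = [node[object_id] for node in others if object_id in node]
    if len(hits) < read_quorum:
        return None                            # read quorum not reached
    return max(hits[:read_quorum])             # newest version wins

# Seven-node partition PT1: disk1 (primary) plus disk2..disk7.
disk1 = {"obj": (5, b"primary copy")}
others = [{"obj": (4, b"older copy")} for _ in range(6)]
assert handle_read(disk1, others, "obj", primary_trusted=True) == (5, b"primary copy")
assert handle_read(disk1, others, "obj", primary_trusted=False) == (4, b"older copy")
```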
FIG. 4 is a schematic flowchart of a method for constructing a blacklist by a primary storage node according to an embodiment of the present invention. This embodiment is still described using the PT partitions shown in FIG. 2 as an example and describes in detail the process in which the primary storage node constructs the blacklist. The blacklist is the list of data objects that failed to be written on a storage node before that storage node became the primary storage node of the current logical partition; that is, the data objects in the blacklist are not trusted on that storage node. Between S110 and S120 of the embodiment shown in FIG. 3, the method includes the following steps:
S210: disk1 receives a primary storage node determination message and determines that it is the primary storage node of the current logical partition.
S220: disk1 collects the object degraded-write logs from the other storage nodes in the current logical partition and marks disk1 as the untrusted state.
An object degraded-write log is a log of a data object failing to be written on a storage node, and it is recorded on all storage nodes in the current logical partition on which the write succeeded. An object degraded-write log records the object identifier of the data object whose write failed, the disk on which the write failed, the logical partition to which it belongs, and the time at which the write failure occurred. For example, if data object 1 fails to be written on disk2 at time T1 while the write succeeds on all other storage nodes, the log of data object 1 failing on disk2 is recorded in the logs of disk1 and disk3 through disk7.
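A minimal sketch of one such log entry follows; the field names are ours, chosen to mirror the recorded information listed above, and are not part of the disclosure:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class DegradedWriteLog:
    """One object degraded-write log entry: which object failed to be
    written, on which disk, in which logical partition, and when."""
    object_id: str
    failed_disk: str
    partition: str
    timestamp: float

# The example above: at T1, data object 1 fails on disk2; this entry is
# recorded on disk1 and disk3..disk7, where the write succeeded.
entry = DegradedWriteLog("object1", "disk2", "PT1", time.time())
```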
disk1 only needs to collect the object degraded-write logs that were recorded by all other storage nodes in the current logical partition before disk1 became the primary storage node. disk1 sends a request for collecting object degraded-write logs to each of the other nodes in the current logical partition and receives the object degraded-write logs returned by each storage node. While the object degraded-write logs are being collected, disk1 is marked as the untrusted state, which indicates that it is not known whether the data objects on disk1 are complete.
S230: disk1 determines whether the collected object degraded-write logs contain data objects that failed to be written on disk1; if they do, S240 is performed; if none of the collected object degraded-write logs contains such a data object, S270 is performed.
While collecting the object degraded-write logs, disk1 determines whether the logs collected so far contain data objects that failed to be written on disk1.
S240: the data objects contained in the object degraded-write logs that failed to be written on disk1 are added to the blacklist, and the blacklist is marked as the uncompleted state.
When the blacklist is in the uncompleted state, the blacklist is not yet complete, and it is not known whether the data object to be read is in the blacklist.
S250: after disk1 has collected all object degraded-write logs and obtained the complete blacklist, the blacklist is marked as the completed state.
S260: the data objects in the blacklist whose writes failed are reconstructed one by one, and the degraded-write logs corresponding to the successfully recovered data objects are deleted from the blacklist; once all data objects in the blacklist have been reconstructed successfully, S270 is performed.
Data reconstruction means restoring, on the storage node where the write of a data object failed, that data object from its content on the storage nodes where the write succeeded. Specifically, the primary storage node (disk1) may actively read, from the storage nodes on which the data object was written successfully, the object content corresponding to the data object to be recovered.
For example, if the failed data objects in the blacklist include data object 1, data object 2, and data object 3, then data object 1 is copied to disk1 from a storage node on which data object 1 was written successfully; likewise, data object 2 is copied to disk1 from a storage node on which data object 2 was written successfully, and data object 3 is copied to disk1 from a storage node on which data object 3 was written successfully.
S270: disk1 is marked as the trusted state.
After all data objects in the blacklist have been reconstructed, disk1 is marked as the trusted state again. If none of the object degraded-write logs collected by disk1 contains a data object that failed to be written on disk1, it is determined that no failed data object exists on disk1, that is, all data objects on disk1 are trusted, and disk1 may be marked as the trusted state.
In other application scenarios of the present invention, when the cluster management module determines that the primary storage node disk1 in PT1 is faulty for some reason, the cluster management module reselects a storage node from PT1 as the new primary storage node (selected according to a balancing principle under which every storage node may serve as the primary storage node). After the new primary storage node determines that it is the primary storage node, it repeats the method flow of S210 to S270 described above.
With the data processing method provided in this embodiment, after disk1 determines that it is the primary storage node of the current logical partition, it collects the object degraded-write logs from the other storage nodes in the current logical partition and selects from them the data objects that failed to be written on disk1 before it became the primary storage node, thereby obtaining the blacklist. During this period disk1 is marked as the untrusted state, and whether the target data object on disk1 is trusted needs to be determined further.
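As a hedged illustration of S210 to S270 (simplified to tuples and a callback of our own choosing; not the patented implementation):

```python
def build_blacklist(new_primary: str, collected_logs: list) -> set:
    """S230/S240: from the degraded-write logs collected from the other
    nodes, keep the IDs of objects whose writes failed on the new primary.
    Each log entry is reduced to (object_id, failed_disk) for brevity."""
    return {obj for (obj, disk) in collected_logs if disk == new_primary}

def reconstruct_all(blacklist: set, fetch_good_copy) -> bool:
    """S260: rebuild each blacklisted object from a node on which it was
    written successfully; its degraded-write log is dropped on success.
    Returns True when the blacklist is empty, i.e. the node may be
    marked as the trusted state (S270)."""
    for obj in sorted(blacklist):
        if fetch_good_copy(obj) is not None:   # copy from a good replica
            blacklist.discard(obj)
    return not blacklist

logs = [("obj1", "disk1"), ("obj2", "disk3"), ("obj3", "disk1")]
bl = build_blacklist("disk1", logs)            # {"obj1", "obj3"}
assert reconstruct_all(bl, lambda obj: b"recovered data")
```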
FIG. 5 is a schematic flowchart of a process in which a primary storage node determines whether a target data object stored on the node itself is trusted according to an embodiment of the present invention. The method determines whether the target data object on the primary storage node is trusted based on the blacklist constructed in the embodiment shown in FIG. 4. This embodiment is still described using the logical partitions shown in FIG. 2 as an example, with disk1 being the primary storage node of partition PT1. As shown in FIG. 5, the method may include the following steps:
S310: disk1 receives a read IO request.
S320: disk1 determines its own state, where the state of the primary storage node is either the trusted state or the untrusted state; if disk1 is in the untrusted state, S330 is performed; if disk1 is in the trusted state, S350 is performed.
S330: disk1 determines whether its blacklist is complete; if it is complete, S340 is performed; if it is incomplete, S360 is performed.
S340: disk1 determines whether the blacklist contains the target data object that the read IO request requests to read; if it does, S360 is performed; if it does not, S350 is performed.
S350: disk1 determines that the target data object stored on itself is trusted.
S360: disk1 determines that the target data object stored on itself is not trusted.
In the process of determining whether the target data object on the primary storage node is trusted provided in this embodiment, the state of the primary storage node is determined first. If the primary storage node is in the trusted state, the target data object on the primary storage node is determined to be trusted. If the primary storage node is in the untrusted state, it is determined whether the blacklist of the primary storage node is complete. If the blacklist is complete, it is further determined whether the blacklist contains the target data object: if it does, the target data object on the primary storage node is determined to be not trusted; if the blacklist does not contain the target data object, the target data object on the primary storage node is determined to be trusted. If the blacklist is incomplete, the target data object on the primary storage node is determined to be not trusted, consistent with S330 and S360 above. Once the target data object on the primary storage node is determined to be trusted, it only needs to be read from the primary storage node, which greatly reduces read latency and improves read performance.
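The decision procedure of S320 to S360 can be condensed into the following sketch (our own names and types, offered only as an illustration):

```python
from enum import Enum

class NodeState(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"

def object_trusted(state: NodeState, blacklist_complete: bool,
                   blacklist: set, object_id: str) -> bool:
    """Trusted node => trusted object (S350). Untrusted node: the object
    is trusted only if the blacklist is complete and does not contain it
    (S330/S340); an incomplete blacklist means not trusted (S360)."""
    if state is NodeState.TRUSTED:
        return True
    if not blacklist_complete:
        return False
    return object_id not in blacklist

assert object_trusted(NodeState.UNTRUSTED, True, {"obj7"}, "obj1")
assert not object_trusted(NodeState.UNTRUSTED, True, {"obj7"}, "obj7")
assert not object_trusted(NodeState.UNTRUSTED, False, set(), "obj1")
```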
To reduce the latency of reading data, this embodiment of the present invention selects one storage node from the multiple storage nodes of a logical partition as the primary storage node and, provided that the target data object on the primary storage node is trusted, reads the target data object from the primary storage node only. Correspondingly, when data is written, the write of the data object on the primary storage node must succeed. The data writing process is described in detail below with reference to FIG. 6.
FIG. 6 is a flowchart of a data writing method based on a distributed storage system according to an embodiment of the present invention. The method is still described using the logical partitions shown in FIG. 2 as an example. As shown in FIG. 6, the data writing method may include the following steps:
S410: disk1 receives a write IO request, where the write IO request carries the target data object to be written.
S420: disk1 writes the target data object locally.
S430: if the write of the target data object on disk1 fails, disk1 returns a write-failure response message to the initiator of the write IO request.
In some application scenarios of the present invention, if the write operation fails because disk1 is faulty, the cluster management module reselects another storage node from the current logical partition as the primary storage node and re-sends the write IO request to the new primary storage node to perform the write operation.
In other application scenarios of the present invention, if disk1 is not faulty, a retry is initiated, that is, the write IO request is initiated again.
S440: if the write of the target data object on disk1 succeeds, the target data object is replicated directly to the other storage nodes.
S450: after disk1 receives at least three write-success responses returned by the other storage nodes, disk1 returns a write-success response message to the initiator of the write IO request.
The write operation still follows the Quorum mechanism, and the primary storage node must be among the at least four storage nodes on which the write succeeded; that is, the storage nodes on which the write succeeded must include the primary storage node and at least three other storage nodes. Only then is the write operation considered successful, and a write-success response is returned to the initiator of the write IO request.
With the data writing method based on a distributed storage system provided in this embodiment, the primary storage node receives a write IO request and writes the data locally on the primary storage node. If the write on the primary storage node succeeds, the data must also be written on other storage nodes, and the number of successful writes, including the write on the primary storage node, must satisfy the requirement of the Quorum mechanism; only then is the write operation considered successful, and the primary storage node returns a write-success response to the initiator of the write IO request. If the write operation on the primary storage node fails, the primary storage node directly returns a write-failure response to the initiator of the write IO request. Guaranteeing that the write on the primary storage node succeeds maximizes the probability that data can be read successfully from the primary storage node alone.
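Finally, the write path of S410 to S450 for the seven-node partition can be sketched as follows (a simplification under our own assumptions; in particular, the handling of an unreached quorum is reduced to a failure response):

```python
def handle_write(primary_write_ok: bool, replicate_to_others,
                 preset_number: int = 4) -> str:
    """S430-S450: a primary-side write failure is reported immediately;
    otherwise the write succeeds once preset_number nodes in total
    (the primary plus at least three others, per R = W = 4) report
    success. replicate_to_others() returns that count of other nodes."""
    if not primary_write_ok:
        return "write-failure"                 # returned directly (S430)
    successes = 1 + replicate_to_others()      # primary counts as one
    return "write-success" if successes >= preset_number else "write-failure"

assert handle_write(True, lambda: 3) == "write-success"   # 1 + 3 >= 4
assert handle_write(True, lambda: 2) == "write-failure"   # quorum missed
assert handle_write(False, lambda: 6) == "write-failure"  # primary failed
```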
For brevity, each of the foregoing method embodiments is described as a series of action combinations. However, a person skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention, some steps may be performed in other orders or simultaneously. In addition, a person skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the foregoing data processing method embodiments, the present invention further provides corresponding storage device embodiments.
Referring to FIG. 7, a block diagram of a storage device according to an embodiment of the present invention is shown. The storage device is the primary storage node of a logical partition in a distributed storage system. As shown in FIG. 7, the storage device includes: a first receiving module 110, a judging module 120, a reading module 130, and a sending module 140.
The first receiving module 110 receives a read IO request; the judging module 120 determines whether the target data object in the storage device is trusted, where the target data object is the data object that the read IO request requests to read; if it is trusted, the reading module 130 reads the target data object from the storage device; and the sending module 140 returns the target data object read by the reading module 130 to the initiator of the read IO request.
In some embodiments of the present invention, as shown in FIG. 8, the judging module 120 includes a first judging submodule 121, a second judging submodule 122, and a third judging submodule 123.
The first judging submodule 121 is configured to determine the state of the storage device, where the state of the storage device is either the trusted state or the untrusted state; if the state of the primary storage node is the trusted state, it is determined that the target data object stored on the primary storage node is trusted.
The second judging submodule 122 is configured to: when the first judging submodule determines that the state of the primary storage node is the untrusted state, obtain the blacklist on the primary storage node and determine whether the blacklist is complete, where the blacklist stores the data objects that failed to be written on the primary storage node; if the blacklist is incomplete, determine that the target data object on the primary storage node is not trusted.
The second judging submodule is specifically configured to: obtain the state of the blacklist, where the state of the blacklist is either the completed state or the uncompleted state; the blacklist is in the uncompleted state while the primary storage node is collecting the object degraded-write logs, and is in the completed state once the object degraded-write logs of all storage nodes in the logical partition to which the primary storage node belongs have been collected; when the state of the blacklist is the completed state, determine that the blacklist is complete; when the state of the blacklist is the uncompleted state, determine that the blacklist is incomplete.
The third judging submodule 123 is configured to: when the second judging submodule determines that the blacklist is complete, determine whether the blacklist contains the target data object; if it does, determine that the target data object on the primary storage node is not trusted; if it does not, determine that the target data object on the primary storage node is trusted.
With the storage device provided in this embodiment, after a read IO request is received, it is first determined whether the target data object that the read IO request requests to read and that is stored on the primary storage node of the corresponding logical partition is trusted. If it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the conventional Quorum mechanism. With this storage device, when the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node, that is, from a single storage node. Compared with the Quorum mechanism, this greatly reduces the number of replicas involved in a read operation, thereby reducing read latency and improving read performance.
Referring to FIG. 9, a block diagram of another storage device according to an embodiment of the present invention is shown. Based on the embodiment shown in FIG. 7, the storage device further includes: a second receiving module 210, a collecting module 220, a blacklist constructing module 230, a data reconstruction module 240, a state modification module 250, and a trusted marking module 260.
The second receiving module 210 is configured to receive a primary storage node determination message and determine, according to the primary storage node determination message, that the storage device itself is the primary storage node.
The primary storage node determination message contains identification information of the primary storage node.
It should be noted that after the storage device receives the primary storage node determination message, a read IO request may arrive at any time. For example, the read IO request may be received while the primary storage node is creating the blacklist, after the blacklist has been created, or after the data objects in the blacklist have been reconstructed.
The collecting module 220 is configured to collect the object degraded-write logs from all storage nodes in the logical partition to which the primary storage node belongs and mark the primary storage node as the untrusted state, where an object degraded-write log records the storage node on which a data object failed to be written and is recorded on all storage nodes in the logical partition on which the data object was written successfully.
The blacklist constructing module 230 is configured to select, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain the blacklist.
The data reconstruction module 240 is configured to reconstruct the failed data objects in the blacklist one by one and delete, from the blacklist, the degraded-write logs corresponding to the successfully reconstructed data objects.
The state modification module 250 is configured to mark the primary storage node as the trusted state after all data objects in the blacklist have been reconstructed successfully.
The trusted marking module 260 is configured to mark the primary storage node as the trusted state when the object degraded-write logs do not contain any data object that failed to be written on the primary storage node.
With the storage device provided in this embodiment, after the storage device determines that it is the primary storage node of the current logical partition, it collects the object degraded-write logs from the other storage nodes in the current logical partition and selects from them the data objects that failed to be written on itself before it became the primary storage node, thereby obtaining the blacklist. During this period the storage device marks itself as the untrusted state, and whether the target data object stored on it is trusted needs to be determined further.
Referring to FIG. 10, a block diagram of still another storage device according to an embodiment of the present invention is shown. Based on the embodiment shown in FIG. 7, the storage device further includes: a second receiving module 210, a third receiving module 310, a data writing module 320, a first returning module 330, a replication module 340, and a second returning module 350.
The second receiving module 210 is configured to receive a primary storage node determination message and determine, according to the primary storage node determination message, that the storage device itself is the primary storage node.
The primary storage node determination message contains identification information of the primary storage node.
The third receiving module 310 is configured to receive a write IO request, where the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs.
The data writing module 320 is configured to write the target data object into the corresponding storage space of the storage device according to the write IO request.
The first returning module 330 is configured to return a write-failure response message directly to the initiator of the write IO request when the target data object fails to be written.
The replication module 340 is configured to replicate the target data object to the other storage nodes in the logical partition to which the primary storage node belongs when the target data object is written successfully.
The second returning module 350 is configured to return a write-success response message to the initiator of the write IO request when the primary storage node receives write-success response messages returned by a preset number of storage nodes in the logical partition to which the primary storage node belongs.
The preset number is determined according to the number of storage nodes in the logical partition to which the primary storage node belongs and the Quorum mechanism.
With the storage device provided in this embodiment, the primary storage node receives a write IO request and writes the data locally on the primary storage node. If the write on the primary storage node succeeds, the data must also be written on other storage nodes, and the number of successful writes, including the write on the primary storage node, must satisfy the requirement of the Quorum mechanism; only then is the write operation considered successful, and the primary storage node returns a write-success response to the initiator of the write IO request. If the write operation on the primary storage node fails, the primary storage node directly returns a write-failure response to the initiator of the write IO request. Guaranteeing that the write on the primary storage node succeeds maximizes the probability that data can be read successfully from the primary storage node alone.
Referring to FIG. 11, a block diagram of yet another storage device according to an embodiment of the present invention is shown. The storage device is the primary storage node of a logical partition in a distributed storage system.
As shown in FIG. 11, the storage device includes a processor 410, and a receiver 420 and a transmitter 430 connected to the processor 410.
The receiver 420 is configured to receive IO requests and provide them to the processor 410.
IO requests include read IO requests and write IO requests. A read IO request is used to read a target data object in the logical partition to which the primary storage node belongs; a write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs.
The processor 410 is configured to perform the methods in the embodiments shown in FIG. 3 to FIG. 6.
The transmitter 430 is configured to send the read target data object to the initiator of the read IO request, or, when the processor 410 receives a preset number of write-success response messages returned by the other storage nodes, to return a write-success response message to the initiator of the write IO request.
With the storage device provided in this embodiment, after a read IO request is received, it is first determined whether the target data object that the read IO request requests to read and that is stored on the primary storage node of the corresponding logical partition is trusted. If it is trusted, the target data object is read directly from the primary storage node and returned to the initiator that sent the read IO request. If the target data object on the primary storage node is not trusted, the read operation is performed according to the conventional Quorum mechanism. With this storage device, when the target data object on the primary storage node is trusted, the data only needs to be read from the primary storage node, that is, from a single storage node. Compared with the Quorum mechanism, this greatly reduces the number of replicas involved in a read operation, thereby reducing read latency and improving read performance.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. The device embodiments are basically similar to the method embodiments and are therefore described briefly; for relevant parts, reference may be made to the descriptions in the method embodiments.
Finally, it should also be noted that in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by the phrase "including a ..." does not preclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but shall conform to the widest scope consistent with the principles and novel features disclosed herein.
The foregoing descriptions are merely preferred implementations of the present invention. It should be noted that a person of ordinary skill in the art may further make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (14)

  1. A data processing method based on a distributed storage system, comprising:
    receiving, by a primary storage node, a read IO request, wherein the read IO request is used to request reading of a target data object in a logical partition to which the primary storage node belongs, each logical partition comprises a plurality of storage nodes, each storage node stores data objects, and one storage node in each logical partition is a primary storage node;
    determining, by the primary storage node, whether the target data object stored on the primary storage node is trusted; and
    if the target data object stored on the primary storage node is trusted, reading the target data object stored on the primary storage node, and sending the target data object directly to an initiator that initiated the read IO request.
  2. The method according to claim 1, wherein the determining, by the primary storage node, whether the target data object stored on the primary storage node is trusted comprises:
    determining a state of the primary storage node, wherein the state of the primary storage node comprises a trusted state and an untrusted state;
    if the state of the primary storage node is the trusted state, determining that the target data object stored on the primary storage node is trusted;
    if the state of the primary storage node is the untrusted state, obtaining a blacklist on the primary storage node, and determining whether the blacklist is complete, wherein the blacklist stores data objects that failed to be written on the primary storage node;
    if the blacklist is incomplete, determining that the target data object on the primary storage node is not trusted; and
    if the blacklist is complete, determining whether the blacklist contains the target data object, and if the blacklist contains the target data object, determining that the target data object on the primary storage node is not trusted, or if the blacklist does not contain the target data object, determining that the target data object on the primary storage node is trusted.
  3. The method according to claim 2, wherein the method further comprises:
    receiving, by the primary storage node, a primary storage node determination message, wherein the primary storage node determination message contains identification information of the primary storage node;
    when the primary storage node determines, according to the primary storage node determination message, that it is the primary storage node, collecting object degraded-write logs from all storage nodes in the logical partition to which the primary storage node belongs, and marking the primary storage node as the untrusted state, wherein an object degraded-write log records a storage node on which a data object failed to be written, and is recorded on all storage nodes in the logical partition on which the data object was written successfully;
    if the object degraded-write logs contain data objects that failed to be written on the primary storage node, selecting, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain a blacklist; and
    if the object degraded-write logs do not contain any data object that failed to be written on the primary storage node, marking the primary storage node as the trusted state.
  4. The method according to claim 2, wherein the determining whether the blacklist is complete comprises:
    obtaining, by the primary storage node, a state of the blacklist, wherein the state of the blacklist comprises a completed state and an uncompleted state, the blacklist is in the uncompleted state while the primary storage node is collecting the object degraded-write logs, and the blacklist is in the completed state once the object degraded-write logs of all storage nodes in the logical partition to which the primary storage node belongs have been collected;
    when the state obtained by the primary storage node is the completed state, determining that the blacklist is complete; and
    when the state obtained by the primary storage node is the uncompleted state, determining that the blacklist is incomplete.
  5. The method according to claim 2, wherein, if the blacklist contains data objects that failed to be written on the primary storage node, the method further comprises:
    reconstructing, by the primary storage node, the failed data objects in the blacklist one by one, and deleting, from the blacklist, the degraded-write logs corresponding to the successfully reconstructed data objects; and
    after all data objects in the blacklist have been reconstructed successfully, marking the primary storage node as the trusted state.
  6. A data processing method based on a distributed storage system, comprising:
    receiving, by a primary storage node, a write IO request, wherein the write IO request is used to request writing of a target data object to a logical partition to which the primary storage node belongs, the logical partition comprises a plurality of storage nodes, and one storage node in each logical partition is a primary storage node;
    when the primary storage node fails to write the target data object, returning a write-failure response message directly to an initiator of the write IO request;
    when the primary storage node writes the target data object successfully, replicating the target data object to other storage nodes in the logical partition to which the primary storage node belongs; and
    when the primary storage node receives write-success response messages returned by a preset number of storage nodes in the logical partition to which the primary storage node belongs, returning a write-success response message to the initiator of the write IO request, wherein the preset number is determined according to the number of storage nodes in the logical partition to which the primary storage node belongs and a Quorum mechanism.
  7. A storage device, wherein the storage device is a primary storage node of a logical partition in a distributed storage system, and the storage device comprises:
    a first receiving module, configured to receive a read IO request, wherein the read IO request is used to request reading of a target data object in the logical partition to which the primary storage node belongs;
    a judging module, configured to determine whether the target data object stored in the storage device is trusted;
    a reading module, configured to read the target data object in the storage device when the judging module determines that the target data object is trusted; and
    a sending module, configured to send the target data object read by the reading module to an initiator of the read IO request.
  8. The storage device according to claim 7, wherein the judging module comprises:
    a first judging submodule, configured to determine a state of the storage device, wherein the state of the storage device comprises a trusted state and an untrusted state, and if the state of the primary storage node is the trusted state, determine that the target data object stored on the primary storage node is trusted;
    a second judging submodule, configured to: when the first judging submodule determines that the state of the primary storage node is the untrusted state, obtain a blacklist on the primary storage node and determine whether the blacklist is complete, wherein the blacklist stores data objects that failed to be written on the primary storage node, and if the blacklist is incomplete, determine that the target data object on the primary storage node is not trusted; and
    a third judging submodule, configured to: when the second judging submodule determines that the blacklist is complete, determine whether the blacklist contains the target data object, and if the blacklist contains the target data object, determine that the target data object on the primary storage node is not trusted, or if the blacklist does not contain the target data object, determine that the target data object on the primary storage node is trusted.
  9. The storage device according to claim 8, wherein the storage device further comprises:
    a second receiving module, configured to receive a primary storage node determination message and determine, according to the primary storage node determination message, that the storage device itself is the primary storage node, wherein the primary storage node determination message contains identification information of the primary storage node;
    a collecting module, configured to collect object degraded-write logs from all storage nodes in the logical partition to which the primary storage node belongs and mark the primary storage node as the untrusted state, wherein an object degraded-write log records a storage node on which a data object failed to be written, and is recorded on all storage nodes in the logical partition on which the data object was written successfully;
    a blacklist constructing module, configured to select, from the object degraded-write logs, all data objects that failed to be written on the primary storage node to obtain a blacklist; and
    a trusted marking module, configured to mark the primary storage node as the trusted state when the object degraded-write logs do not contain any data object that failed to be written on the primary storage node.
  10. The storage device according to claim 8, wherein the second judging submodule is specifically configured to:
    obtain a state of the blacklist, wherein the state of the blacklist comprises a completed state and an uncompleted state, the blacklist is in the uncompleted state while the primary storage node is collecting the object degraded-write logs, and the blacklist is in the completed state once the object degraded-write logs of all storage nodes in the logical partition to which the primary storage node belongs have been collected;
    when the state of the blacklist is the completed state, determine that the blacklist is complete; and
    when the state of the blacklist is the uncompleted state, determine that the blacklist is incomplete.
  11. The storage device according to claim 8, wherein, if the blacklist contains data objects that failed to be written on the primary storage node, the storage device further comprises:
    a data reconstruction module, configured to reconstruct the failed data objects in the blacklist one by one and delete, from the blacklist, the degraded-write logs corresponding to the successfully reconstructed data objects; and
    a state modification module, configured to mark the primary storage node as the trusted state after all data objects in the blacklist have been reconstructed successfully.
  12. The storage device according to claim 7, further comprising:
    a third receiving module, configured to receive a write IO request, wherein the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs;
    a data writing module, configured to write the target data object into corresponding storage space of the storage device according to the write IO request;
    a first returning module, configured to return a write-failure response message directly to an initiator of the write IO request when the target data object fails to be written;
    a replication module, configured to replicate the target data object to other storage nodes in the logical partition to which the primary storage node belongs when the target data object is written successfully; and
    a second returning module, configured to return a write-success response message to the initiator of the write IO request when the primary storage node receives write-success response messages returned by a preset number of storage nodes in the logical partition to which the primary storage node belongs, wherein the preset number is determined according to the number of storage nodes in the logical partition to which the primary storage node belongs and a Quorum mechanism.
  13. A storage device, wherein the storage device is a primary storage node of a logical partition in a distributed storage system, and the storage device comprises:
    a receiver, configured to receive a read IO request, wherein the read IO request is used to request reading of a target data object in the logical partition to which the primary storage node belongs;
    a processor, configured to determine whether the target data object stored in the storage device is trusted, and if it is determined that the target data object is trusted, read the target data object in the storage device; and
    a transmitter, configured to send the read target data object to an initiator of the read IO request.
  14. The storage device according to claim 13, wherein:
    the receiver is further configured to receive a write IO request, wherein the write IO request is used to request writing of a target data object to the logical partition to which the primary storage node belongs;
    the processor is further configured to write the target data object into corresponding storage space of the storage device according to the write IO request, return a write-failure response message directly to an initiator of the write IO request if the target data object fails to be written, and, if the target data object is written successfully, replicate the target data object to other storage nodes in the logical partition to which the primary storage node belongs and receive write-success response messages returned by the other storage nodes; and
    the transmitter is further configured to return a write-success response message to the initiator of the write IO request upon receiving a preset number of write-success response messages returned by the other storage nodes, wherein the preset number is determined according to the number of storage nodes in the logical partition to which the primary storage node belongs and the Quorum mechanism.
PCT/CN2017/081339 2016-09-05 2017-04-21 Data processing method based on distributed storage system, and storage device WO2018040589A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/293,098 US11614867B2 (en) 2016-09-05 2019-03-05 Distributed storage system-based data processing method and storage device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610807454.9A 2016-09-05 2016-09-05 Data processing method based on distributed storage system, and storage device
CN201610807454.9 2016-09-05

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/293,098 Continuation US11614867B2 (en) 2016-09-05 2019-03-05 Distributed storage system-based data processing method and storage device

Publications (1)

Publication Number Publication Date
WO2018040589A1

Family

ID=57999583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081339 WO2018040589A1 (zh) 2016-09-05 2017-04-21 一种基于分布式存储系统的数据处理方法及存储设备

Country Status (3)

Country Link
US (1) US11614867B2 (zh)
CN (1) CN106406758B (zh)
WO (1) WO2018040589A1 (zh)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106406758B (zh) 2016-09-05 2019-06-18 Huawei Technologies Co., Ltd. Data processing method based on distributed storage system, and storage device
SG11201901608VA (en) 2017-03-29 2019-03-28 Huawei Tech Co Ltd Method for accessing distributed storage system, related apparatus, and related system
CN110546620B * 2017-04-14 2022-05-17 Huawei Technologies Co., Ltd. Data processing method, storage system, and switching device
CN108235751B 2017-12-18 2020-04-14 Huawei Technologies Co., Ltd. Method and apparatus for identifying subhealth of object storage device, and data storage system
CN108780386B * 2017-12-20 2020-09-04 Huawei Technologies Co., Ltd. Data storage method, apparatus, and system
CN109002256B * 2018-05-04 2022-12-06 China Information Security Research Institute Co., Ltd. Storage system for a trusted computing environment
CN108959513A * 2018-06-28 2018-12-07 Zhengzhou Yunhai Information Technology Co., Ltd. Method for reading data in a distributed storage system and data processing apparatus therefor
CN110825309B * 2018-08-08 2021-06-29 Huawei Technologies Co., Ltd. Data reading method, apparatus, and system, and distributed system
CN111385327B * 2018-12-28 2022-06-14 Alibaba Group Holding Ltd. Data processing method and system
CN109981741A * 2019-02-26 2019-07-05 Qidi Cloud Computing Co., Ltd. Maintenance method for a distributed storage system
US11334279B2 (en) * 2019-11-14 2022-05-17 Western Digital Technologies, Inc. Hierarchical blacklisting of storage system components
US11314431B2 (en) 2019-11-14 2022-04-26 Western Digital Technologies, Inc. Distributed data blocks using storage path cost values
CN111399761B * 2019-11-19 2023-06-30 Hangzhou Hikvision System Technology Co., Ltd. Storage resource allocation method, apparatus, device, and storage medium
CN110989934B * 2019-12-05 2023-08-25 CloudMinds Robotics Co., Ltd. Blockchain node data storage method, blockchain system, and blockchain node
CN111176900A * 2019-12-30 2020-05-19 Inspur Electronic Information Industry Co., Ltd. Distributed storage system and data recovery method, apparatus, and medium therefor
CN111970520B * 2020-08-13 2022-04-08 Beijing Zhongdian Xingfa Technology Co., Ltd. Distributed storage method for streaming data on heterogeneous nodes
US11487432B2 (en) * 2020-10-08 2022-11-01 EMC IP Holding Company LLC Direct response to IO request in storage system with remote replication
CN113535656B * 2021-06-25 2022-08-09 Renmin University of China Data access method, apparatus, device, and storage medium
CN113721849B * 2021-08-23 2024-04-12 Shenzhen SandStone Data Technology Co., Ltd. Data replication offloading method based on distributed storage, and terminal device
JP2023037883A * 2021-09-06 2023-03-16 Kioxia Corporation Information processing apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5617568A (en) * 1994-12-14 1997-04-01 International Business Machines Corporation System and method for supporting file attributes on a distributed file system without native support therefor
CN102567438A * 2010-09-28 2012-07-11 Metaswitch Networks Ltd. Method for accessing data items in a distributed storage system
CN102937964A * 2012-09-28 2013-02-20 Wuxi Jiangnan Institute of Computing Technology Intelligent data service method based on a distributed system
CN105278877A * 2015-09-30 2016-01-27 Chengdu Huawei Technologies Co., Ltd. Object storage method and apparatus
CN106406758A * 2016-09-05 2017-02-15 Huawei Technologies Co., Ltd. Data processing method based on distributed storage system, and storage device

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219807B1 (en) * 2004-12-17 2012-07-10 Novell, Inc. Fine grained access control for linux services
US8132259B2 (en) * 2007-01-04 2012-03-06 International Business Machines Corporation System and method for security planning with soft security constraints
US8336108B2 (en) * 2007-06-22 2012-12-18 Red Hat, Inc. Method and system for collaboration involving enterprise nodes
US8370902B2 (en) * 2010-01-29 2013-02-05 Microsoft Corporation Rescuing trusted nodes from filtering of untrusted network entities
US20140372607A1 (en) * 2010-03-15 2014-12-18 Cleversafe, Inc. Adjusting allocation of dispersed storage network resources
US8683119B2 (en) * 2010-03-15 2014-03-25 Cleversafe, Inc. Access control in a dispersed storage network
US8925101B2 (en) * 2010-07-28 2014-12-30 Mcafee, Inc. System and method for local protection against malicious software
US8849877B2 (en) * 2010-08-31 2014-09-30 Datadirect Networks, Inc. Object file system
US8793328B2 (en) * 2010-12-17 2014-07-29 Facebook, Inc. Distributed storage system
CN102981976B * 2012-12-05 2016-05-25 Tsinghua University Access control method for data storage
US10701148B2 (en) * 2012-12-13 2020-06-30 Level 3 Communications, Llc Content delivery framework having storage services
US20150127607A1 (en) * 2013-01-10 2015-05-07 Unicom Systems, Inc. Distributed data system with document management and access control
US9313274B2 (en) * 2013-09-05 2016-04-12 Google Inc. Isolating clients of distributed storage systems
US10157021B2 (en) * 2016-06-29 2018-12-18 International Business Machines Corporation Processing incomplete data access transactions

Also Published As

Publication number Publication date
US20190196728A1 (en) 2019-06-27
US11614867B2 (en) 2023-03-28
CN106406758A (zh) 2017-02-15
CN106406758B (zh) 2019-06-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17844897

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17844897

Country of ref document: EP

Kind code of ref document: A1