CN112052124B - Data redundancy method and distributed storage cluster

Info

Publication number: CN112052124B
Application number: CN202011025578.4A
Authority: CN (China)
Prior art keywords: sub, data, disk, management module, block
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN112052124A
Inventor: 苏伟
Current/Original Assignee: Macrosan Technologies Co Ltd
Application filed by Macrosan Technologies Co Ltd
Priority to CN202011025578.4A

Classifications

    • G06F11/1469 Backup restoration techniques (error detection or correction of data by redundancy in operation; point-in-time backing up or restoration of persistent data; management of the backup or restore process)
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/064 Management of blocks (organizing or formatting or addressing of data)
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

All of the above fall under G06F (GPHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING).


Abstract

The application provides a data redundancy method and a distributed storage cluster. The distributed storage cluster writes data in copy (replica) mode so as to preserve its write performance; once the data has cooled, the data stored in copy mode is converted to erasure-coded storage, reducing disk consumption and improving the disk utilization of the distributed storage cluster.

Description

Data redundancy method and distributed storage cluster
Technical Field
The present application relates to the field of storage technologies, and in particular, to a data redundancy method and a distributed storage cluster.
Background
In distributed storage clusters, data redundancy techniques are typically employed to ensure the reliability of data. Current data redundancy techniques mainly include copy (replica) technology and erasure coding (EC) technology.
Copy technology duplicates the same data into multiple copies and stores the copies in different fault domains of the distributed storage cluster. The technique is simple to implement, but the disk utilization of the system is low. Taking 2 copies as an example (i.e., 2 copies of the same data are stored in the distributed storage cluster), the disk utilization is only 50%.
EC technology divides data into a number of data fragments, computes parity (check) fragments from those data fragments, and then writes each data fragment and each check fragment to a different fault domain. The data fragments that participate in the same parity calculation, together with the resulting check fragments, form an EC stripe. When data is written, this technique must perform the parity calculation, which increases system overhead and degrades write performance.
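To make the parity relationship concrete, the sketch below builds one EC stripe from three data fragments using a single XOR parity fragment; the XOR scheme and all names are illustrative assumptions, since the description does not fix a particular erasure-code algorithm.

```python
# Illustrative only: assumes the simplest erasure code, one XOR parity
# fragment over equally sized data fragments (comparable to RAID-5).
from functools import reduce

def xor_parity(fragments: list[bytes]) -> bytes:
    """Compute one parity fragment from equally sized data fragments."""
    assert len({len(f) for f in fragments}) == 1, "fragments must be the same size"
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*fragments))

# An EC stripe = the data fragments plus the parity computed from them.
stripe_data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = stripe_data + [xor_parity(stripe_data)]
# If any single fragment is lost, it can be rebuilt by XOR-ing the rest.
```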
Disclosure of Invention
In view of this, the present application provides a data redundancy method and a distributed storage cluster, which are used to reduce disk consumption and improve the disk utilization of the distributed storage cluster while preserving write performance.
To achieve the above objects, the present application provides the following technical solutions:
in a first aspect, the present application provides a data redundancy method, applied to a distributed storage cluster, where the distributed storage cluster includes at least one cluster node, each cluster node includes at least one disk for storing data, each disk is divided into a plurality of blocks according to a preset Block size, the distributed storage cluster is configured with at least one LUN, each LUN is divided into a plurality of logical sections according to a preset Segment size, each logical section is divided into a plurality of sub-logical sections according to a preset Block size, each cluster node deploys a corresponding disk management module for each disk on the node, each Segment corresponds to a write bitmap, each bit in the write bitmap is used to identify whether the corresponding sub-logical section has been written with data, each Segment also corresponds to a storage bitmap, each bit in the storage bitmap is used to identify a storage mode of the data corresponding to the sub-logical section, and the distributed storage cluster adopts an N copy mode, where N is greater than or equal to 2, and the method includes:
when a target cluster node in the at least one cluster node monitors that the duration for which a target Segment has not been accessed reaches a preset duration, acquiring a write bitmap and a storage bitmap corresponding to the target Segment;
the target cluster node traverses the write bitmap and the storage bitmap to find N first sub-logical sections to which data has been written and whose storage mode is the copy mode;
the target cluster node selects a target disk from N first disks for storing data corresponding to the target Segment;
for each first sub-logic section, the target cluster node sends a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk;
the target cluster node calculates check data according to the data of each first sub-logic interval returned by the target disk management module;
the target cluster node sends a write command for indicating to write the verification data into a second disk to a second disk management module corresponding to the second disk, wherein the second disk is a pre-designated disk for storing the verification data corresponding to the target Segment;
the second disk management module allocates a first Block for the check data from the second disk and writes the check data into the first Block;
and, for each first disk, the target cluster node sends a deletion command, instructing deletion of the specified data copies, to the first disk management module corresponding to the first disk, so that the data corresponding to the N first sub-logical sections are respectively stored in the N first disks.
Optionally, after the target cluster node sends a deletion command for indicating to delete the specified data copy to the first disk management module corresponding to the first disk for each first disk, the method further includes:
and for each first sub-logic interval, the target cluster node updates the storage mode identified by the corresponding bit of the first sub-logic interval in the storage bitmap to be an erasure code mode.
Optionally, the method further comprises:
when the target cluster node receives a write request which needs to be written into the target Segment, determining each second sub-logic section in the target Segment related to the write request, and splitting the write request into sub-write requests aiming at each second sub-logic section;
the following is performed for each second sub-logic section:
the target cluster node inquires a writing bitmap of the target Segment and determines whether data is written in a second sub-logic section;
If the second sub-logic section is written with data, the target cluster node queries a storage bitmap of the target Segment and determines a storage mode of data corresponding to the second sub-logic section;
if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, the target cluster node respectively sends a sub-write request aiming at the second sub-logic section to each first disk management module, wherein the sub-write request carries the data to be written into the second sub-logic section and an erasure code mark;
when the first disk management module determines that the sub-write request carries the erasure code mark, a second Block is allocated to the second sub-logic section, data is written into the second Block, and the mapping relation between the second sub-logic section and the second Block is recorded;
and the target cluster node updates the storage mode identified by the corresponding bit of the second sub-logic section in the storage bitmap to be a copy mode.
Optionally, after the target cluster node sends, for each first disk, a deletion command for instructing deletion of the specified data copies to the first disk management module corresponding to the first disk, so that the data corresponding to the N first sub-logical sections are respectively stored in the N first disks, the method further includes:
The first disk management module records the mapping relation between a first sub-logic section to which the deleted data copy belongs and a Block to which the undeleted data copy belongs, and adds an association mark for the mapping relation;
when the target cluster node receives a read request for reading the target Segment, determining each third sub-logic section in the target Segment related to the read request, and splitting the read request into sub-read requests for each third sub-logic section;
the following processing is performed for each third sub-logic section:
the target cluster node queries the storage bitmap and determines a storage mode of data corresponding to a third sub-logic interval;
if the storage mode of the data corresponding to the third sub-logic section is an erasure code mode, the target cluster node respectively sends a first sub-read request aiming at the third sub-logic section to each first disk management module, wherein the first sub-read request does not comprise an association mark;
and when the first disk management module determines that the mapping relation between the third sub-logic section and the third Block of the local record does not have the association mark, reading corresponding data from the third Block and returning the corresponding data to the target cluster node.
Optionally, the write command includes a start address of each first sub-logical section, the second disk management module allocates a first Block for the check data from the second disk, and after writing the check data into the first Block, the method further includes:
the second disk management module records the mapping relation between each first sub-logic section and the first Block, and adds an association mark;
after the target cluster node sends the first sub-read requests aiming at the third sub-logic section to each first disk management module respectively, the method further comprises the following steps:
if the data of the third sub-logic section is not read, the target cluster node sends a second sub-read request aiming at the third sub-logic section to each first disk management module, wherein the second sub-read request carries an association mark;
when the first disk management module determines that the mapping relation between the third sub-logic section and the third Block of the local record has an association mark, corresponding data is read from the third Block and returned to the target cluster node;
the target cluster node sends a third sub-read request aiming at the third sub-logic section to the second disk management module;
The second disk management module reads check data from a fourth Block and returns the check data to the target cluster node according to the mapping relation between the third sub-logic section and the fourth Block recorded locally;
and the target cluster node performs verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module to obtain the data corresponding to the third sub-logic section.
Optionally, the method further comprises:
when the target cluster node determines that a failed disk exists among the N first disks, the following processing is executed for each fourth sub-logical section of the target Segment to which data has been written:
the target cluster node reads the data corresponding to the fourth sub-logic interval;
the target cluster node sends, to each third disk management module, a write command instructing it to write the data corresponding to the fourth sub-logical section, where the third disk management modules are the disk management modules corresponding to the disks, among the first disks and the second disk, other than the failed disk and the disk that returned the data corresponding to the fourth sub-logical section;
when the third disk management module determines that the mapping relation between the fourth sub-logic section and the fifth Block of the local record has an associated mark, a sixth Block is allocated to the fourth sub-logic section, and data corresponding to the fourth sub-logic section is written into the sixth Block;
The third disk management module updates the mapping relation between the fourth sub-logic section and the fifth Block into the mapping relation between the fourth sub-logic section and the sixth Block, and removes the corresponding association mark.
Optionally, after the third disk management module updates the mapping relationship between the fourth sub-logic section and the fifth Block to the mapping relationship between the fourth sub-logic section and the sixth Block, the method further includes:
the third disk management module judges whether a mapping relation comprising the fifth Block exists or not;
and if not, recycling the fifth Block.
In a second aspect, the present application provides a distributed storage cluster, where the distributed storage cluster includes at least one cluster node, each cluster node includes at least one disk for storing data, each disk is divided into a plurality of blocks according to a preset Block size, the distributed storage cluster is configured with at least one LUN, each LUN is divided into a plurality of logical sections according to a preset Segment size, each logical section is divided into a plurality of sub-logical sections according to a preset Block size, each cluster node deploys a corresponding disk management module for each disk on the present node, each Segment corresponds to a write bitmap, each bit in the write bitmap is used to identify whether the corresponding sub-logical section has been written with data, each Segment also corresponds to a storage bitmap, each bit in the storage bitmap is used to identify a storage manner of the data of the corresponding sub-logical section, and the distributed storage cluster is written in an N-copy manner, where N is greater than or equal to 2;
The target cluster node in the at least one cluster node is used for acquiring a writing bitmap and a storage bitmap corresponding to a target Segment when the duration of the non-accessed target Segment is monitored to reach a preset duration; traversing the writing bitmap and the storage bitmap to find N first sub-logic intervals in which the data are written and the storage mode is a copy mode; selecting a target disk from N first disks for storing data corresponding to the target Segment; for each first sub-logic section, sending a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk; calculating check data according to the data of each first sub-logic section returned by the target disk management module; sending a write command for indicating to write the verification data into a second disk to a second disk management module corresponding to the second disk, wherein the second disk is a preassigned disk for storing the verification data corresponding to the target Segment;
the second disk management module is configured to allocate a first Block for the check data from the second disk, and write the check data into the first Block;
The target cluster node is further configured to send, to a first disk management module corresponding to each first disk, a deletion command for instructing to delete a copy of specified data, so that data corresponding to the N first sub-logical intervals are respectively stored in the N first disks.
Optionally, the target cluster node is further configured to update, for each first sub-logical section, a storage mode identified by a corresponding bit of the first sub-logical section in the storage bitmap to be an erasure coding mode.
Optionally, the target cluster node is further configured to determine each second sub-logical section in the target Segment related to the write request when receiving the write request for writing the target Segment, and split the write request into sub-write requests for each second sub-logical section; inquiring the writing bitmap of the target Segment for each second sub-logic interval, and determining whether the second sub-logic interval is written with data; if the second sub-logic section is written with data, inquiring a storage bitmap of the target Segment, and determining a storage mode of data corresponding to the second sub-logic section; if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, respectively sending a sub-write request aiming at the second sub-logic section to each first disk management module, wherein the sub-write request carries the data to be written into the second sub-logic section and erasure code marks;
The first disk management module is further configured to allocate a second Block to the second sub-logical section when determining that the sub-write request carries the erasure code flag, write data into the second Block, and record a mapping relationship between the second sub-logical section and the second Block;
and the target cluster node is further configured to update the storage mode identified by the bit corresponding to the second sub-logical section in the storage bitmap to the copy mode.
Optionally, the first disk management module is further configured to record a mapping relationship between a first sub-logical section to which the deleted data copy belongs and a Block to which the undeleted data copy belongs, and add an association flag to the mapping relationship;
the target cluster node is further configured to determine each third sub-logical section in the target Segment related to the read request when receiving the read request for reading the target Segment, and split the read request into sub-read requests for each third sub-logical section; inquiring the storage bitmap aiming at each third sub-logic interval, and determining a storage mode of data corresponding to the third sub-logic interval; if the storage mode of the data corresponding to the third sub-logic section is an erasure code mode, respectively sending a first sub-read request aiming at the third sub-logic section to each first disk management module, wherein the first sub-read request does not comprise an association mark;
And the first disk management module is further configured to read corresponding data from the third Block and return the corresponding data to the target cluster node when it is determined that the mapping relationship between the third sub-logical section and the third Block of the local record does not have an association mark.
Optionally, the second disk management module is further configured to record a mapping relationship between each first sub-logical section and the first Block, and add an association flag;
the target cluster node is further configured to send a second sub-read request for the third sub-logical section to each first disk management module if the data in the third sub-logical section is not read, where the second sub-read request carries an association flag;
the first disk management module is further configured to, when determining that a mapping relationship between the third sub-logical section and a third Block of the local record has an association tag, read corresponding data from the third Block and return the corresponding data to the target cluster node;
the target cluster node is further configured to send a third sub-read request for the third sub-logical section to the second disk management module;
the second disk management module is further configured to read check data from a fourth Block and return the check data to the target cluster node according to a mapping relationship between the third sub-logical section and the fourth Block, where the mapping relationship is recorded locally;
And the target cluster node is also used for performing verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module to obtain the data corresponding to the third sub-logic section.
Optionally, the target cluster node is further configured to, when determining that a failed disk exists among the N first disks, read, for each fourth sub-logical section of the target Segment to which data has been written, the data corresponding to the fourth sub-logical section; and send, to each third disk management module, a write command instructing it to write the data corresponding to the fourth sub-logical section, where the third disk management modules are the disk management modules corresponding to the disks, among the first disks and the second disk, other than the failed disk and the disk that returned the data corresponding to the fourth sub-logical section;
the third disk management module is configured to allocate a sixth Block to the fourth sub-logical section when determining that the mapping relationship between the fourth sub-logical section and the fifth Block of the local record has an association flag, and write data corresponding to the fourth sub-logical section into the sixth Block; and updating the mapping relation between the fourth sub-logic section and the fifth Block into the mapping relation between the fourth sub-logic section and the sixth Block, and removing the corresponding association mark.
Optionally, the third disk management module is further configured to determine whether a mapping relationship including the fifth Block still exists; and if not, recycling the fifth Block.
As can be seen from the above description, in the embodiments of the present application the distributed storage cluster writes data in copy mode, so as to preserve its write performance; once the data has cooled, the data stored in copy mode is converted to erasure-coded storage, reducing disk consumption and improving the disk utilization of the distributed storage cluster.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an exemplary distributed storage cluster employing copy technology;
FIG. 2 is a schematic diagram of an exemplary distributed storage cluster employing EC technology;
FIG. 3 is a flow chart of a data redundancy method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application;
FIG. 6 is a flow diagram illustrating a write request process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application;
FIG. 8 is a flow chart of a read request process shown in an embodiment of the application;
FIG. 9 is a fault handling flow shown in an embodiment of the present application;
FIG. 10 is a flow chart of data reconstruction according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application;
FIG. 13 is a schematic diagram of the mapping relationship of Seg1 in a distributed storage cluster according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the application. As used in this embodiment of the application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of the embodiments of the present application, the negotiation information may also be referred to as second information, and similarly, the second information may also be referred to as negotiation information. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
Distributed storage clusters typically include multiple servers (also referred to as cluster nodes). Each cluster node includes at least one disk (also referred to as a data disk) for storing data. In the following description, magnetic disks refer to data disks unless otherwise specified.
To ensure reliability of data, data is typically stored using data redundancy techniques. The existing data redundancy technology mainly comprises: replica technology and EC technology.
Referring to FIG. 1, a schematic diagram of a distributed storage cluster employing copy technology is illustratively shown. As can be seen from FIG. 1, the same data corresponds to 3 data copies (data copy 1 to data copy 3) in the distributed storage cluster, stored respectively in cluster nodes Server1 to Server3. This storage mode is simple to implement, but the disk utilization is low; for the 3 copies shown in FIG. 1, the disk utilization is only 33%.
Referring to FIG. 2, a schematic diagram of a distributed storage cluster employing EC technology is illustratively shown. As can be seen from FIG. 2, the original data is split into 3 data fragments, stored respectively in cluster nodes Server1 to Server3; meanwhile, a parity calculation is performed over the 3 data fragments to obtain 1 check fragment, which is stored in cluster node Server4. Because the parity calculation is required, this storage mode increases system overhead and affects write performance.
In view of the above problems, an embodiment of the present application provides a data redundancy method in which data is written in copy mode, ensuring write performance. After the data has cooled, the data stored in copy mode is converted to erasure code (EC) storage, reducing disk consumption and improving the disk utilization of the distributed storage cluster.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of the embodiments of the present application is performed in conjunction with the accompanying drawings and specific embodiments:
referring to fig. 3, a flow chart of a data redundancy method is shown in an embodiment of the present application. The flow applies to distributed storage clusters.
The distributed storage cluster includes at least one cluster node. Each cluster node includes at least one disk for storing data. Each disk is divided into a plurality of blocks (abbreviated as Blk) according to a preset Block size (e.g., 64 KB).
The distributed storage cluster is configured with at least one LUN, each LUN being divided into a plurality of logical intervals according to a preset Segment size (e.g., 256 MB), such as [ 0, 256MB ], [ 256MB,512MB ], [ 512MB,768MB ], and so on.
Each logical section is divided into a plurality of sub-logical sections according to a preset Block size (e.g., 64 KB), such as [ 0, 64KB ], [ 64KB,128KB ], [ 128KB,192KB ], and so on.
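As a minimal sketch of this address layout (the 256 MB Segment size and 64 KB Block size are the example values used in this description; the function name is an assumption), a LUN byte offset can be decomposed into its Segment index and sub-logical-section index as follows:

```python
SEGMENT_SIZE = 256 * 1024 * 1024   # 256 MB logical section (Segment)
BLOCK_SIZE = 64 * 1024             # 64 KB Block / sub-logical section

def locate(lun_offset: int) -> tuple[int, int, int]:
    """Map a byte offset within a LUN to (segment index, sub-section index,
    start address of the sub-logical section within its Segment)."""
    seg_index = lun_offset // SEGMENT_SIZE
    offset_in_seg = lun_offset % SEGMENT_SIZE
    sub_index = offset_in_seg // BLOCK_SIZE
    sub_start = sub_index * BLOCK_SIZE
    return seg_index, sub_index, sub_start

# Example: offset 128 KB inside the first Segment falls in sub-logical
# section [128KB, 192KB), i.e. segment 0, sub-section 2, start 131072.
assert locate(128 * 1024) == (0, 2, 131072)
```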
During the data writing process, the distributed storage cluster establishes a mapping of logical resources to physical resources. The mapping mainly comprises: mapping of logical intervals to segments (which may be abbreviated as Seg), segment to disk, and sub-logical interval to Block.
Referring to FIG. 4, a schematic diagram of mapping logical resources to physical resources is shown according to an embodiment of the present application. The distributed storage cluster shown in FIG. 4 includes 4 cluster nodes (Server1 to Server4), each of which includes one disk for storing data. Of course, the embodiments do not limit the number of disks in a cluster node; each cluster node is shown with a single disk only for simplicity of illustration.
In FIG. 4, data block A, data block B, and data block C have been written into the logical section [ 0, 256MB ] of LUN1, and each data block has a size of 64KB (a 64KB data block is taken as an example for convenience of description). Data block A has been written into the sub-logical section [ 0, 64KB ]; data block B into the sub-logical section [ 128KB,192KB ]; and data block C into the sub-logical section [ 192KB,256KB ].
In FIG. 4, the logical interval [ 0, 256MB ] has been mapped to Seg1, and the mapping table for the current LUN1 may be expressed as:
LUN1→[0:Seg1]
wherein "0" is the start address of the logical interval [ 0, 256MB ] for identifying the logical interval.
To ensure data write performance, the embodiments of the present application write data in copy mode. As shown in FIG. 4, if the disks OSD1 to OSD3 are designated to store the data corresponding to Seg1, the mapping relationship between Seg1 and the disks may be expressed as:
Seg1→[OSD1;OSD2;OSD3]
Since data block A, data block B, and data block C are all located in Seg1, the 3 data copies corresponding to each data block are stored in OSD1 to OSD3, respectively. As shown in FIG. 4, the copies of data blocks A, B, and C are denoted A1, B1, C1 in OSD1; A2, B2, C2 in OSD2; and A3, B3, C3 in OSD3.
The storage position of each data copy in the disk is determined by the corresponding disk management module of each disk. Here, the disk management module generally refers to a service process of a disk. The cluster node deploys a corresponding disk management module for each disk on the cluster node.
Taking OSD1 as an example, the disk management module corresponding to OSD1 allocates corresponding blocks for the sub-logic intervals to which the data copies A1, B1 and C1 belong respectively, and establishes a mapping relationship between the sub-logic intervals and the blocks, which can be expressed as:
Seg1→[0:Blk20;128KB:Blk10;192KB:Blk100]
the "0", "128KB" and "192KB" are the start addresses of the sub-logic sections [ 0, 64KB ], [ 128KB,192KB ] and [ 192KB,256KB ] of the data copies A1, B1 and C1 respectively, and are used for identifying the corresponding sub-logic sections. Blk20 is the Block mapped to by the sub-logic interval [ 0, 64KB ] in OSD 1; blk10 is the Block to which the sub-logic interval [ 128KB,192KB ] maps in OSD 1; blk100 is the Block to which the sub-logic interval [ 192KB,256KB ] maps in OSD 1.
Similarly, the mapping relationship of the sub-logic intervals [ 0, 64KB ], the sub-logic intervals [ 128KB,192KB ], the sub-logic intervals [ 192KB,256KB ] of the data copies A2, B2 and C2 in the OSD2 can be expressed as follows:
Seg1→[0:Blk10;128KB:Blk100;192KB:Blk40]
the mapping relationship of the sub-logic intervals [ 0, 64KB ], the sub-logic intervals [ 128KB,192KB ], the sub-logic intervals [ 192KB,256KB ] of the data copies A3, B3 and C3 in the OSD3 can be expressed as follows:
Seg1→[0:Blk20;128KB:Blk100;192KB:Blk60]
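The per-disk mapping tables shown above can be pictured as a small lookup structure held by each disk management module. The sketch below is illustrative only; the dictionary layout and field names are assumptions, not taken from the patent:

```python
# Hypothetical in-memory form of OSD1's mapping table for Seg1 shown above.
# Keys are the start addresses of sub-logical sections; values name the Block
# holding that section's data and whether the entry only carries an
# association (the "R" mark used later in this description).
osd1_seg1_map = {
    0:            {"block": "Blk20",  "associated": False},  # copy A1
    128 * 1024:   {"block": "Blk10",  "associated": False},  # copy B1
    192 * 1024:   {"block": "Blk100", "associated": False},  # copy C1
}

def lookup(seg_map: dict, sub_start: int) -> str:
    """Return the Block that a sub-logical section maps to on this disk."""
    return seg_map[sub_start]["block"]

assert lookup(osd1_seg1_map, 128 * 1024) == "Blk10"
```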
based on the storage structure and the mapping relationship of the distributed storage clusters, the data redundancy flow of the embodiment of the application is described below. As shown in fig. 3, the process may include the steps of:
step 301, when the target cluster node monitors that the duration of the non-access target Segment reaches the preset duration, the target cluster node acquires a write bitmap and a storage bitmap corresponding to the target Segment.
Here, the target cluster node may be any cluster node in the distributed storage cluster. Each cluster node monitors segments of LUNs accessible to the node. Still taking the distributed storage cluster illustrated in fig. 4 as an example, server1 is configured to process the access request of LUN1, and Server1 can monitor each Segment included in LUN 1.
The segments currently monitored by the target cluster node are referred to herein as target segments.
It should be understood that the naming of the target cluster node and the target Segment is only for convenience of distinction, and is not limited thereto.
When the target cluster node observes that the target Segment has not been accessed for a long time, the data corresponding to the target Segment has cooled; performing the subsequent operations on that data at this point has minimal impact on the distributed storage cluster.
And the target cluster node acquires the write bitmap and the storage bitmap corresponding to the target Segment. Here, it should be noted that each Segment corresponds to a write bitmap and a storage bitmap. Each bit in the write bitmap and the storage bitmap corresponds to a sub-logical section of Segment. Each bit of the writing bitmap is used for identifying whether data is written in a corresponding sub-logic section or not; each bit of the storage bitmap is used to identify the storage manner of the corresponding sub-logical section. The storage mode can be a copy mode or an erasure code mode.
Taking Seg1 shown in FIG. 4 as an example, data has already been written to the sub-logical sections [ 0, 64KB ], [ 128KB,192KB ], and [ 192KB,256KB ] of Seg1, so the write bitmap (denoted Wmap) currently corresponding to Seg1 can be expressed as:
Seg1→[Wmap:10110000……000]
the write bitmap identifies, from left to right, whether or not data has been written in the sub-logical intervals [ 0, 64KB ], [ 64KB,128KB ], [ 128KB,192KB ], [ 192KB,256KB ], … …, [ 256MB-64KB,256MB ]. Wherein, "1" indicates that data has been written in the corresponding sub-logic section; "0" indicates that no data is written in the corresponding sub-logic section.
The storage bitmap (denoted as Smap) corresponding to current Seg1 can be expressed as:
Seg1→[Smap:00000000……000]
The storage bitmap identifies, from left to right, the storage mode of the data corresponding to the sub-logical sections [ 0, 64KB ], [ 64KB,128KB ], [ 128KB,192KB ], [ 192KB,256KB ], … …, [ 256MB-64KB,256MB ]. Here, "0" indicates that the data corresponding to the sub-logical section is stored in copy mode, and "1" indicates that it is stored in erasure coding mode. By default, the data corresponding to a Segment is stored in copy mode.
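A minimal sketch of how the two bitmaps could be held in memory, one bit per sub-logical section (the class and helper names are assumptions for illustration):

```python
SUBSECTIONS_PER_SEGMENT = (256 * 1024 * 1024) // (64 * 1024)  # 4096 bits

class SegmentBitmaps:
    """One write bitmap and one storage bitmap per Segment.
    wmap bit = 1: data has been written to that sub-logical section.
    smap bit = 1: that section's data is stored in erasure-coded mode
    (0 = copy mode, the default)."""
    def __init__(self) -> None:
        self.wmap = bytearray(SUBSECTIONS_PER_SEGMENT // 8)
        self.smap = bytearray(SUBSECTIONS_PER_SEGMENT // 8)

    @staticmethod
    def _get(bm: bytearray, i: int) -> int:
        return (bm[i // 8] >> (7 - i % 8)) & 1

    @staticmethod
    def _set(bm: bytearray, i: int, v: int) -> None:
        if v:
            bm[i // 8] |= 1 << (7 - i % 8)
        else:
            bm[i // 8] &= ~(1 << (7 - i % 8))

# Seg1 in the example: sections 0, 2 and 3 written, all still in copy mode.
seg1 = SegmentBitmaps()
for idx in (0, 2, 3):
    SegmentBitmaps._set(seg1.wmap, idx, 1)
```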
Step 302, the target cluster node traverses the write bitmap and the storage bitmap of the target Segment to find N first sub-logical intervals in which the data is written and the storage mode is a copy mode.
Here, N is the number of copies supported by the distributed storage cluster, where N is greater than or equal to 2. For example, N=2 indicates that the cluster supports 2 copies; N=3 indicates that the cluster supports 3 copies.
Here, a first sub-logical section refers to a sub-logical section to which data has been written and whose storage mode is the copy mode. It is to be understood that the first sub-logical section is named so only for convenience of distinction and is not intended to be limiting.
Still taking the distributed storage cluster illustrated in fig. 4 as an example, the distributed storage cluster supports 3 copies. The current Seg1 corresponding write bitmap is:
Seg1→[Wmap:10110000……000]
the storage bitmap corresponding to Seg1 is:
Seg1→[Smap:00000000……000]
the write bitmap and the storage bitmap of the Seg1 are traversed by the Server1, so that the sub-logical intervals [ 0, 64KB ], the sub-logical intervals [ 128KB,192KB ], the sub-logical intervals [ 192KB,256KB ] of the Seg1 are all copy modes.
In step 303, the target cluster node selects a target disk from N first disks for storing data corresponding to the target Segment.
Here, a disk for storing data corresponding to the target Segment is referred to as a first disk. It will be appreciated that the first disk is named for convenience of distinction and is not intended to be limiting.
As described above, when the cluster designates the disks OSD1 to OSD3 to store the Seg1 corresponding data, the disks OSD1 to OSD3 are all the first disks for storing the Seg1 corresponding data.
Since each first disk stores the target Segment's data in copy mode, any first disk can be selected as the target disk. Of course, as one embodiment, the first disk on the cluster node with the lower load may be selected as the target disk. It should be understood that the term "target disk" is used for convenience of distinction and is not intended to be limiting.
Still taking Seg1, whose first disks are OSD1 to OSD3, as an example, Server1 may select OSD1 as the target disk.
Step 304, for each first sub-logic section, the target cluster node sends a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk.
Here, the disk management module corresponding to the target disk is referred to as a target disk management module. It will be appreciated that the term "target disk management module" is used for convenience of distinction and is not intended to be limiting.
In this step, the target cluster node sends a read command to the target disk management module for each first sub-logical section.
For example, Server1 sends read commands to the disk management module corresponding to OSD1 for the sub-logical sections [ 0, 64KB ], [ 128KB,192KB ], and [ 192KB,256KB ] of Seg1. The disk management module corresponding to OSD1, according to the locally recorded mapping between sub-logical sections and Blocks:
Seg1→[0:Blk20;128KB:Blk10;192KB:Blk100]
reads data copy A1 from Blk20, which is mapped to the sub-logical section [ 0, 64KB ]; reads data copy B1 from Blk10, mapped to [ 128KB,192KB ]; and reads data copy C1 from Blk100, mapped to [ 192KB,256KB ].
In step 305, the target cluster node calculates check data according to the data of each first sub-logic section returned by the target disk management module.
Taking the data copies A1 (corresponding to data block A), B1 (corresponding to data block B), and C1 (corresponding to data block C) returned by the disk management module of OSD1 as an example, Server1 performs a parity calculation over data blocks A, B, and C according to a preset erasure code algorithm to obtain check data P.
Step 306, the target cluster node sends a write command, instructing that the check data be written into the second disk, to the second disk management module corresponding to the second disk. The write command includes the start address of each first sub-logical section.
Here, the second disk is a disk designated in advance for storing the check data corresponding to the target Segment. For example, OSD4 is designated to store the check data corresponding to Seg1.
Here, the disk management module corresponding to the second disk is referred to as a second disk management module.
It should be understood that the second disk and the second disk management module are named for convenience of distinction, and are not limited thereto.
Taking OSD4 as the second disk, Server1 sends a write command, instructing that the check data P be written into OSD4, to the disk management module corresponding to OSD4. The write command includes the start address of the sub-logical section [ 0, 64KB ] to which data block A belongs, the start address of the sub-logical section [ 128KB,192KB ] to which data block B belongs, and the start address of the sub-logical section [ 192KB,256KB ] to which data block C belongs, these being the sections that participated in the calculation of the check data.
In step 307, the second disk management module allocates a first Block for the check data from the second disk, and writes the check data into the first Block.
Here, the first Block is named for convenience of distinction, and is not limited thereto.
After the verification data is written, the second disk management module records the mapping relation between each first sub-logic section and the first Block, and adds an association mark for identifying that the first Block stores data associated with the data corresponding to the first sub-logic section.
Taking OSD4 as an example, the disk management module corresponding to OSD4 allocates Blk4 for the check data P, writes the check data P into Blk4 of OSD4, records the mapping relationships between the sub-logical sections [ 0, 64KB ], [ 128KB,192KB ], [ 192KB,256KB ] and Blk4, and adds an association mark, which can be expressed as:
Seg1→[0:Blk4:R;128KB:Blk4:R;192KB:Blk4:R]
wherein R is an association tag.
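A sketch of what the parity disk's mapping table looks like after this step; the dictionary layout is an assumption, used only to illustrate that all three sub-logical sections point at the same Block with the association mark R:

```python
# Hypothetical mapping table of OSD4 (the disk holding check data) for Seg1
# after check data P has been written: all three sub-logical sections point
# at the same Block, and the "R" flag records that Blk4 holds data
# *associated* with those sections (the check data), not their own data.
osd4_seg1_map = {
    0:          {"block": "Blk4", "associated": True},   # R
    128 * 1024: {"block": "Blk4", "associated": True},   # R
    192 * 1024: {"block": "Blk4", "associated": True},   # R
}
```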
Step 308, for each first disk, the target cluster node sends a deletion command, instructing deletion of the specified data copies, to the first disk management module corresponding to the first disk, so that the data corresponding to the N first sub-logical sections ends up distributed across the N first disks, one sub-logical section per disk.
Here, the disk management module corresponding to the first disk is referred to as a first disk management module. It should be understood that the first disk management module is named for convenience of distinction, and is not limited thereto.
Taking OSD1, OSD2, and OSD3 shown in FIG. 4 as an example, Server1 may issue to OSD1 a delete command for the sub-logical section [ 128KB,192KB ] and a delete command for the sub-logical section [ 192KB,256KB ]. The disk management module corresponding to OSD1, according to the locally recorded mapping between sub-logical sections and Blocks:
Seg1→[0:Blk20;128KB:Blk10;192KB:Blk100]
deletes data copy B1 in Blk10, which is mapped to the sub-logical section [ 128KB,192KB ], and deletes data copy C1 in Blk100, which is mapped to the sub-logical section [ 192KB,256KB ], and then updates the mapping table of Seg1 to:
Seg1→[0:Blk20;128KB:Blk20:R;192KB:Blk20:R]
As can be seen from the mapping table, the sub-logical section [ 128KB,192KB ] to which the deleted data copy B1 belonged and the sub-logical section [ 192KB,256KB ] to which the deleted data copy C1 belonged are now mapped to Blk20, which holds the undeleted data copy A1, and the association mark R indicates that Blk20 stores data associated with the data corresponding to those two sub-logical sections. In the embodiments of the present application, "associated" means participating in the same parity calculation, i.e. belonging to the same EC stripe.
Similarly, Server1 may issue to OSD2 a delete command for the sub-logical section [ 0, 64KB ] and a delete command for the sub-logical section [ 192KB,256KB ]. The disk management module corresponding to OSD2, according to the locally recorded mapping between sub-logical sections and Blocks:
Seg1→[0:Blk10;128KB:Blk100;192KB:Blk40]
deletes data copy A2 in Blk10, mapped to the sub-logical section [ 0, 64KB ], and deletes data copy C2 in Blk40, mapped to the sub-logical section [ 192KB,256KB ], and then updates the mapping table of Seg1 to:
Seg1→[0:Blk100:R;128KB:Blk100;192KB:Blk100:R]
That is, the sub-logical section [ 0, 64KB ] to which the deleted data copy A2 belonged and the sub-logical section [ 192KB,256KB ] to which the deleted data copy C2 belonged are now mapped to Blk100, which holds the undeleted data copy B2, and the association mark R indicates that Blk100 stores data associated with the data corresponding to those two sub-logical sections.
Server1 may likewise issue to OSD3 a delete command for the sub-logical section [ 0, 64KB ] and a delete command for the sub-logical section [ 128KB,192KB ]. The disk management module corresponding to OSD3, according to the locally recorded mapping between sub-logical sections and Blocks:
Seg1→[0:Blk20;128KB:Blk100;192KB:Blk60]
deletes data copy A3 in Blk20, mapped to the sub-logical section [ 0, 64KB ], and deletes data copy B3 in Blk100, mapped to the sub-logical section [ 128KB,192KB ], and then updates the mapping table of Seg1 to:
Seg1→[0:Blk60:R;128KB:Blk60:R;192KB:Blk60]
That is, the sub-logical section [ 0, 64KB ] to which the deleted data copy A3 belonged and the sub-logical section [ 128KB,192KB ] to which the deleted data copy B3 belonged are now mapped to Blk60, which holds the undeleted data copy C3, and the association mark R indicates that Blk60 stores data associated with the data corresponding to those two sub-logical sections.
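A sketch of how a first disk management module might apply such a delete command, re-pointing the deleted sections at the surviving copy's Block with the association mark R (function and field names are assumptions):

```python
# Hypothetical handling of "delete the copy for sub-section X" on a data disk:
# deleting the on-disk copy itself is elided; the deleted section is then
# re-pointed, with the association flag R, at the Block that still holds
# this disk's surviving copy.
def apply_delete(seg_map: dict, deleted_start: int, surviving_start: int) -> None:
    surviving_block = seg_map[surviving_start]["block"]
    seg_map[deleted_start] = {"block": surviving_block, "associated": True}  # R

# OSD1 before: A1 in Blk20, B1 in Blk10, C1 in Blk100 (all copy data).
osd1_seg1_map = {
    0:          {"block": "Blk20",  "associated": False},
    128 * 1024: {"block": "Blk10",  "associated": False},
    192 * 1024: {"block": "Blk100", "associated": False},
}
apply_delete(osd1_seg1_map, 128 * 1024, surviving_start=0)  # delete B1
apply_delete(osd1_seg1_map, 192 * 1024, surviving_start=0)  # delete C1
# Result matches Seg1 -> [0:Blk20; 128KB:Blk20:R; 192KB:Blk20:R].
```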
At this time, the mapping relationship of Seg1 in the distributed storage cluster is shown in fig. 5. In fig. 5, data copy A1 (corresponding to data block a), data copy B2 (corresponding to data block B), data copy C3 (corresponding to data block C), and verification data P constitute one EC stripe. That is, the data blocks A, B, C in Seg1 have been converted from copy mode storage to Erasure Code (EC) mode storage, saving storage space of the distributed storage cluster.
In addition, it should be noted that after the conversion to erasure-coded storage, the target cluster node needs to update, for each first sub-logical section, the storage mode identified by that section's corresponding bit in the storage bitmap of the target Segment to the erasure coding mode.
Taking the sub-logical intervals [ 0, 64KB ], the sub-logical intervals [ 128KB,192KB ] and the sub-logical intervals [ 192KB,256KB ] of Seg1 as examples, the data blocks A, B, C in the 3 sub-logical intervals are all converted into erasure codes for storage, and then the storage bitmap of Seg1 is updated as follows:
Seg1→[Smap:10110000……000]
wherein, 1 indicates that the storage mode of the data corresponding to the sub-logic section is an erasure code mode.
Thus, the flow shown in fig. 3 is completed.
As can be seen from the flow shown in FIG. 3, in the embodiments of the present application data is written in copy mode, ensuring write performance; after the data has cooled, the data stored in copy mode is converted to erasure-coded storage, reducing disk consumption and improving the disk utilization of the distributed storage cluster.
Based on the above storage scheme, the following describes how the distributed storage cluster processes a write request. Referring to FIG. 6, a write request processing flow is shown according to an embodiment of the present application.
As shown in fig. 6, the process may include the steps of:
In step 601, when the target cluster node receives a write request that writes to the target Segment, it determines each second sub-logical section of the target Segment involved in the write request and splits the write request into sub-write requests for each second sub-logical section.
Here, the second sub-logic section is named for convenience of distinction, and is not intended to be limiting.
Still taking Server1 processing a write request for Seg1 as an example, where the write range of the request within Seg1 is [ 0, 256KB ], the sub-logical sections (second sub-logical sections) of Seg1 involved in the write request are [ 0, 64KB ], [ 64KB,128KB ], [ 128KB,192KB ], and [ 192KB,256KB ]. Server1 splits the write request for Seg1 into sub-write requests for each of these sub-logical sections.
For each second sub-logic section, a subsequent process is performed.
Step 602, the target cluster node queries the write bitmap of the target Segment to determine whether the second sub-logical section has data written therein.
If no data has been written to the second sub-logical section, i.e. this is the first write to it, the data is simply written in copy mode, which is not described further here. If data has already been written to the second sub-logical section, step 603 is entered.
Taking Seg1 as an example, the current write bitmap for Seg1 is:
Seg1→[Wmap:10110000……000]
If the second sub-logical section currently being processed is [ 64KB,128KB ], Server1 learns by querying the write bitmap that the bit corresponding to [ 64KB,128KB ] is 0, meaning no data has been written there, so the data is stored directly in copy mode.
If the second sub-logical section to be processed is [ 0, 64KB ], Server1 learns that the bit corresponding to [ 0, 64KB ] in the write bitmap is 1, indicating that data has been written to [ 0, 64KB ], and the flow proceeds to step 603.
Step 603, if the second sub-logical section has data written therein, the target cluster node queries the storage bitmap of the target Segment, and determines the storage mode of the data corresponding to the second sub-logical section.
Taking the second sub-logical section [ 0, 64KB ] as an example, having determined in step 602 that data has been written to it, Server1 continues to query the storage bitmap of Seg1:
Seg1→[Smap:10110000……000]
As can be seen from the storage bitmap, the value of the corresponding bit in the storage bitmap of the second sub-logic section [ 0, 64KB ] is 1, which indicates that the storage mode of the data corresponding to the second sub-logic section [ 0, 64KB ] is erasure coding mode.
Here, it should be noted that if the storage bitmap is queried to determine that the storage mode of the data corresponding to the second sub-logical section is a copy mode, the existing copy mode is directly adopted for writing, which is not described herein again. If it is determined that the storage mode of the data corresponding to the second sub-logic section is the erasure coding mode, step 604 is performed.
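The per-sub-section decision made in steps 602 to 604 can be summarized by the small helper below (a sketch only; the function name and return strings are assumptions):

```python
def choose_write_path(written: bool, erasure_coded: bool) -> str:
    """Decide how a sub-write request is handled, per steps 602-604.
    written        -- this sub-logical section's bit in the write bitmap
    erasure_coded  -- this sub-logical section's bit in the storage bitmap"""
    if not written:
        return "first write: store in copy mode"
    if not erasure_coded:
        return "overwrite existing copy-mode data in copy mode"
    return "send sub-write requests carrying the erasure code flag"
```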
In step 604, if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, the target cluster node sends a sub-write request for the second sub-logic section to each first disk management module, where the sub-write request carries the data to be written into the second sub-logic section and an erasure code flag.
Taking the second sub-logical section [ 0, 64KB ] as an example, Server1 has determined through steps 602 and 603 that data has been written to this section and that its data is stored in erasure coding mode, so it sends sub-write requests for the section [ 0, 64KB ] to the disk management modules corresponding to OSD1 to OSD3, which store the data corresponding to Seg1; each sub-write request carries the data block D to be written and the erasure code flag.
In step 605, when determining that the sub-write request carries the erasure code flag, the first disk management module allocates a second Block for the second sub-logical section, writes data into the second Block, and records a mapping relationship between the second sub-logical section and the second Block.
When the first disk management module determines that the sub-write request carries the erasure code flag, it allocates a new Block to the second sub-logical section, referred to here as the second Block (the name is used only for ease of distinction and is not limiting), and records the mapping relationship between the second sub-logical section and the newly allocated second Block.
Taking the second sub-logical section [0, 64KB] as an example: when the disk management module of OSD1 receives the sub-write request for this section and determines that it carries the erasure code flag, it allocates a new Block, denoted Block1, for the section [0, 64KB], writes the data block D into Block1, and updates the mapping relationship between the section [0, 64KB] and its Block in Seg1. The updated mapping table of Seg1 is:
Seg1→[0:Blk1;128KB:Blk20:R;192KB:Blk20:R]
By allocating a new Block to the sub-logical section [0, 64KB], the data copy A1 in Blk20, to which this section was originally mapped, is not overwritten. Thus, when the data of the sub-logical section [128KB, 192KB] or [192KB, 256KB] later needs to be recovered, the data copy A1 can still be found in Blk20 via the association relationship. The specific procedure is described below and not repeated here.
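The disk-side handling of such a sub-write can be sketched as follows: on seeing the erasure code flag, the module allocates a fresh Block instead of overwriting the Block that still backs the associated copy (Blk20 holding A1 in this example). `MappingEntry`, `alloc_block`, and `handle_sub_write` are hypothetical names; this is an illustration of the idea, not the patented module's actual code.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MappingEntry:
    block: str        # Block the sub-logical section maps to, e.g. "Blk20"
    associated: bool  # True when the table entry only carries the ":R" association flag

# OSD1's mapping table for Seg1 before the overwrite of [0, 64KB]
seg1_map: Dict[int, MappingEntry] = {
    0:          MappingEntry("Blk20", associated=False),  # copy A1 lives in Blk20
    128 * 1024: MappingEntry("Blk20", associated=True),   # 128KB:Blk20:R
    192 * 1024: MappingEntry("Blk20", associated=True),   # 192KB:Blk20:R
}

_counter = 0
def alloc_block() -> str:
    """Stand-in for allocating a free Block on the disk."""
    global _counter
    _counter += 1
    return f"Blk{_counter}"

def handle_sub_write(offset: int, data: bytes, erasure_flag: bool) -> None:
    if erasure_flag:
        # Do not overwrite in place: the old Block may still back other sections' recovery.
        new_block = alloc_block()
        # ... write `data` into new_block on disk (omitted) ...
        seg1_map[offset] = MappingEntry(new_block, associated=False)
    # without the flag, the existing copy-mode write path would be used instead

handle_sub_write(0, b"D" * (64 * 1024), erasure_flag=True)
print(seg1_map[0].block)            # "Blk1": a new Block; Blk20 and copy A1 survive
print(seg1_map[128 * 1024].block)   # still "Blk20", still marked as associated
```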
Similarly, when the disk management module of OSD2 receives the sub-write request for the sub-logical section [0, 64KB] and determines that it carries the erasure code flag, it allocates a new Block, denoted Block2, for the section [0, 64KB], writes the data block D into Block2, and updates the mapping relationship between the section [0, 64KB] and its Block in Seg1. The updated mapping table of Seg1 is:
Seg1→[0:Blk2;128KB:Blk100;192KB:Blk100:R]
Here, since the data of the sub-logical section [0, 64KB] is stored in Blk2 itself, the mapping relationship between [0, 64KB] and Blk2 does not need an association flag.
Similarly, when the disk management module of OSD3 receives the sub-write request for the sub-logical section [0, 64KB] and determines that it carries the erasure code flag, it allocates a new Block, denoted Block3, for the section [0, 64KB], writes the data block D into Block3, and updates the mapping relationship between the section [0, 64KB] and its Block in Seg1. The updated mapping table of Seg1 is:
Seg1→[0:Blk3;128KB:Blk60:R;192KB:Blk60]
Likewise, since the data of the sub-logical section [0, 64KB] is stored in Blk3, the mapping relationship between [0, 64KB] and Blk3 does not need an association flag.
At this point, the mapping relationships of Seg1 across the distributed storage cluster are shown in fig. 7, where D1 is the copy of data block D in OSD1, D2 is the copy in OSD2, and D3 is the copy in OSD3. In other words, the data block D has been written in copy mode.
In step 606, the target cluster node updates the storage mode identified by the bit corresponding to the second sub-logical section in the storage bitmap of the target Segment to copy mode.
Taking Seg1 as an example, Server1 updates the bit corresponding to the sub-logical section [0, 64KB] in the storage bitmap to 0 (copy mode). The updated storage bitmap of Seg1 can be expressed as:
Seg1→[Smap:00110000……000]
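A short sketch of step 606, under the same illustrative bitmap representation as above: the bit for the overwritten section is cleared back to 0 so the section is again treated as copy-mode data. The helper name `set_storage_mode` is an assumption for illustration only.

```python
SUB_SIZE = 64 * 1024

def set_storage_mode(storage_bitmap: str, offset: int, erasure: bool) -> str:
    """Return a new storage bitmap with the bit for `offset` set to 1 (erasure) or 0 (copy)."""
    bits = list(storage_bitmap)
    bits[offset // SUB_SIZE] = "1" if erasure else "0"
    return "".join(bits)

seg1_smap = "1011" + "0" * 12                       # before the overwrite of [0, 64KB]
seg1_smap = set_storage_mode(seg1_smap, 0, erasure=False)
print(seg1_smap)                                    # "0011000000000000", matching the bitmap above
```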
Thus, the flow shown in fig. 6 is complete; it implements the processing of a write request.
The processing of a read request received by a distributed storage cluster is described below. Referring to fig. 8, a read request processing flow is shown in an embodiment of the present application.
As shown in fig. 8, the process may include the steps of:
In step 801, when the target cluster node receives a read request for reading a target Segment, it determines each third sub-logical section in the target Segment involved in the read request and splits the read request into sub-read requests for each third sub-logical section.
Here, the third sub-logic section is named for convenience of distinction, and is not intended to be limiting.
Still taking Server1 processing a read request for Seg1 as an example, where the read range of the request within Seg1 is [0, 256KB], the sub-logical sections (third sub-logical sections) of Seg1 involved in the read request are [0, 64KB], [64KB, 128KB], [128KB, 192KB], and [192KB, 256KB]. Server1 splits the read request for Seg1 into sub-read requests for each of these sections.
The subsequent processing is performed for each third sub-logic section.
Step 802, the target cluster node queries a storage bitmap of the target Segment, and determines a storage mode of data corresponding to the third sub-logical section.
If the data of the third sub-logical section is stored in copy mode, the sub-read request for that section can be sent to any one first disk management module; if it is stored in erasure coding mode, the flow proceeds to step 803.
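The dispatch decision of step 802 can be sketched as follows: a copy-mode section is served by any single replica, while an erasure-coded section is requested from every first disk with the association flag withheld. The function and disk names are illustrative assumptions.

```python
SUB_SIZE = 64 * 1024

def dispatch_sub_read(storage_bitmap: str, offset: int, first_disks: list) -> list:
    """Return the disks that should receive the sub-read request (association flag omitted)."""
    erasure_mode = storage_bitmap[offset // SUB_SIZE] == "1"
    if not erasure_mode:
        # Copy mode: any one replica can serve the whole section.
        return [first_disks[0]]
    # Erasure coding mode: ask every first disk; only the disk whose mapping entry
    # has no association flag will answer, the others stay silent (steps 803-804).
    return list(first_disks)

print(dispatch_sub_read("0011" + "0" * 12, 0,          ["OSD1", "OSD2", "OSD3"]))  # ['OSD1']
print(dispatch_sub_read("0011" + "0" * 12, 128 * 1024, ["OSD1", "OSD2", "OSD3"]))  # all three
```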
Taking a read of the sub-logical section [0, 64KB] as an example, the current storage bitmap of Seg1 is:
Seg1→[Smap:00110000……000]
By querying the storage bitmap, Server1 finds that the bit corresponding to the sub-logical section [0, 64KB] is 0, i.e., the data of this section is stored in copy mode. Server1 sends a sub-read request for this section to any one of OSD1 to OSD3 (for example, OSD1), which store the data of Seg1. The disk management module of OSD1 then consults its locally recorded mapping table of Seg1:
Seg1→[0:Blk1;128KB:Blk20:R;192KB:Blk20:R]
It then reads the data copy D1 from Blk1, which is mapped to the sub-logical section [0, 64KB], and returns it to Server1.
Taking a read of the sub-logical section [128KB, 192KB] as an example, the current storage bitmap of Seg1 is:
Seg1→[Smap:00110000……000]
By querying the storage bitmap, Server1 finds that the bit corresponding to the sub-logical section [128KB, 192KB] is 1, indicating that the data of this section is stored in erasure coding mode, so the flow proceeds to step 803.
In step 803, if the storage mode of the data corresponding to the third sub-logical section is an erasure code mode, the target cluster node sends first sub-read requests for the third sub-logical section to each first disk management module, where the first sub-read requests do not include an association flag.
The first sub-read request is named here for ease of distinction and is not intended to be limiting.
Taking a read of the sub-logical section [128KB, 192KB] as an example: having determined in step 802 that the data of this section is stored in erasure coding mode, Server1 sends sub-read requests for the section [128KB, 192KB] to the disk management modules of OSD1 to OSD3.
It should be noted that by sending a sub-read request that does not include the association flag, the target cluster node instructs the first disk management modules not to return associated data (i.e., data that is merely associated with the data of the third sub-logical section).
In step 804, when the first disk management module determines that the locally recorded mapping relationship between the third sub-logical section and the third Block does not carry the association flag, it reads the corresponding data from the third Block and returns it to the target cluster node.
Here, the third Block is named for convenience of distinction, and is not limited thereto.
If the mapping relationship between the third sub-logical section and the third Block carries no association flag, the data stored in the third Block is the data of the third sub-logical section itself, so the corresponding data can be read directly from the third Block and returned to the target cluster node.
If the mapping relationship does carry an association flag, the third Block only stores data associated with the data of the third sub-logical section, and the first disk management module does not read data from the third Block.
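The disk-side decision of step 804 (and its mirror image in step 902 of the fault flow below) can be sketched like this: a disk management module returns data only when the association state of its mapping entry matches what the request asks for. `MappingEntry` and `handle_sub_read` are the same kind of hypothetical names used in the write sketch above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class MappingEntry:
    block: str
    associated: bool  # True when the entry carries the ":R" flag

def handle_sub_read(mapping: Dict[int, MappingEntry], offset: int,
                    want_associated: bool, read_block) -> Optional[bytes]:
    """want_associated=False models the first sub-read request (step 803): real data only.
    want_associated=True models the second sub-read request (step 901): associated data only."""
    entry = mapping.get(offset)
    if entry is None or entry.associated != want_associated:
        return None                 # nothing to return for this request
    return read_block(entry.block)  # read the Block's contents from disk

# OSD2's entry for Seg1 after the overwrite: 128KB maps directly to Blk100 (no ":R")
osd2_map = {128 * 1024: MappingEntry("Blk100", associated=False)}
disk = {"Blk100": b"B2"}
print(handle_sub_read(osd2_map, 128 * 1024, False, disk.get))  # b'B2'
print(handle_sub_read(osd2_map, 128 * 1024, True,  disk.get))  # None
```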
The following describes how the disk management modules of OSD1 to OSD3 each handle the first sub-read request:
After the disk management module of OSD1 receives the sub-read request (without the association flag) for the sub-logical section [128KB, 192KB], it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk1;128KB:Blk20:R;192KB:Blk20:R]
According to the mapping table, the sub-logical section [128KB, 192KB] is only associated with Blk20 (the entry carries the association flag R), and the data stored in Blk20 is not the data of [128KB, 192KB], so the disk management module of OSD1 cannot return the data of the sub-logical section [128KB, 192KB] to Server1.
After the disk management module of OSD2 receives the sub-read request (without the association flag) for the sub-logical section [128KB, 192KB], it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk2;128KB:Blk100;192KB:Blk100:R]
According to the mapping table, the sub-logical section [128KB, 192KB] has a direct mapping to Blk100 (no association flag R), so the disk management module of OSD2 reads the data copy B2 from Blk100 of OSD2 and returns it to Server1.
After the disk management module of OSD3 receives the sub-read request (without the association flag) for the sub-logical section [128KB, 192KB], it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk3;128KB:Blk60:R;192KB:Blk60]
According to the mapping table, the sub-logical section [128KB, 192KB] is only associated with Blk60 (the entry carries the association flag R), and the data stored in Blk60 is not the data of [128KB, 192KB], so the disk management module of OSD3 cannot return the data of the sub-logical section [128KB, 192KB] to Server1.
Thus, the flow shown in fig. 8 is completed. Processing of the read request is achieved by the flow shown in fig. 8.
As an embodiment, after step 803, if the target cluster node does not successfully read the data in the third sub-logical section, the fault processing flow shown in fig. 9 may be executed.
As shown in fig. 9, the process may include the steps of:
In step 901, the target cluster node sends a second sub-read request for the third sub-logical section to each first disk management module, where the second sub-read request carries the association flag.
The second sub-read request is named here for ease of distinction only and is not intended to be limiting.
By sending a second sub-read request carrying the association flag, the target cluster node instructs the first disk management modules to return the data associated with the data of the third sub-logical section.
In step 902, when the first disk management module determines that the locally recorded mapping relationship between the third sub-logical section and the third Block carries the association flag, it reads the corresponding data from the third Block and returns it to the target cluster node.
Taking the sub-logical section [128KB, 192KB] as an example: since the data of this section could not be read via the flow shown in fig. 8 (OSD2 has failed), Server1 sends sub-read requests carrying the association flag to the disk management modules of OSD1 to OSD3.
After the disk management module of OSD1 receives the sub-read request (with the association flag) for the sub-logical section [128KB, 192KB], it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk1;128KB:Blk20:R;192KB:Blk20:R]
According to the mapping table, the sub-logical section [128KB, 192KB] is associated with Blk20 (the entry carries the association flag R), i.e., Blk20 stores data associated with the data of [128KB, 192KB]. The disk management module of OSD1 therefore reads the data (A1) from Blk20 and returns it to Server1.
OSD2 has failed, so its disk management module cannot return data.
After the disk management module of OSD3 receives the sub-read request (with the association flag) for the sub-logical section [128KB, 192KB], it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk3;128KB:Blk60:R;192KB:Blk60]
According to the mapping table, the sub-logical section [128KB, 192KB] is associated with Blk60 (the entry carries the association flag R), i.e., Blk60 stores data associated with the data of [128KB, 192KB]. The disk management module of OSD3 therefore reads the associated data (C3) from Blk60 and returns it to Server1.
In step 903, the target cluster node sends a third sub-read request for the third sub-logical section to the second disk management module, where the third sub-read request includes an association flag.
Here, the third sub read request is named for convenience of description, and is not meant to be limiting.
As described above, the second disk management module is a disk management module corresponding to the second disk for storing the verification data corresponding to the target Segment.
By sending the third sub-read request to the second disk management module, the target cluster node obtains the check data associated with the data of the third sub-logical section.
In step 904, based on the locally recorded mapping relationship between the third sub-logical section and the fourth Block, the second disk management module reads the check data from the fourth Block and returns it to the target cluster node.
Here, the fourth Block is named for convenience of description, and is not limited thereto.
Taking as an example Server1 sending a sub-read request for the sub-logical section [128KB, 192KB] to the disk management module of OSD4, the disk management module of OSD4 queries its locally recorded mapping relationships between sub-logical sections and Blocks:
Seg1→[0:Blk4:R;128KB:Blk4:R;192KB:Blk4:R]
It reads the check data P from Blk4, which is mapped to the sub-logical section [128KB, 192KB], and returns it to Server1.
In step 905, the target cluster node performs verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module, so as to obtain the data corresponding to the third sub-logic section.
For example, Server1 performs the check calculation using A1 returned by the disk management module of OSD1, C3 returned by the disk management module of OSD3, and P returned by the disk management module of OSD4, thereby recovering the data B2 of the sub-logical section [128KB, 192KB] that was stored on the failed disk OSD2.
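With a single parity disk, as in the running N+1 example, the check calculation of step 905 reduces to an XOR over the surviving data copies and the parity block. The sketch below illustrates that assumption with toy 4-byte blocks; it is not asserted to be the patent's required parity scheme.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR together equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Toy stand-ins for the 64 KB data copies A1, B2, C3
a1, b2, c3 = b"\x11\x11\x11\x11", b"\x22\x22\x22\x22", b"\x33\x33\x33\x33"
p = xor_blocks(a1, b2, c3)        # check data P stored on the second disk (OSD4)

# OSD2 fails: recover B2 from the associated data A1 and C3 plus the check data P
recovered = xor_blocks(a1, c3, p)
assert recovered == b2
print(recovered.hex())            # 22222222
```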
Thus, the flow shown in fig. 9 is completed, and the data can still be read when the disk fails.
Based on the storage structure in the embodiment of the present application, a data reconstruction process after a disk failure is described below.
When the target cluster node determines that a failed disk exists among the N first disks, it executes the data reconstruction flow shown in fig. 10 for each fourth sub-logical section of the target Segment to which data has been written. Here, the fourth sub-logical section is named for convenience of distinction and is not intended to be limiting.
As shown in fig. 10, the process may include the steps of:
In step 1001, the target cluster node reads the data corresponding to the fourth sub-logical section.
The process of reading the data corresponding to the fourth sub-logic section by the target cluster node may refer to the foregoing read request processing flow, which is not described herein again.
Taking the distributed storage cluster shown in fig. 7 as an example, if OSD2 fails, data needs to be reconstructed for each written sub-logical section of Seg1, namely [0, 64KB], [128KB, 192KB], and [192KB, 256KB].
Taking the sub-logic interval [ 0, 64KB ] as an example, server1 can directly read the data D1 corresponding to the sub-logic interval from OSD 1.
Taking the sub-logic section [ 128KB,192KB ] as an example, because the OSD2 fails, the data corresponding to the sub-logic section cannot be directly read, the Server1 can perform verification calculation by reading A1, C3 and P to obtain the data B2 corresponding to the sub-logic section [ 128KB,192KB ].
In step 1002, the target cluster node sends, to each third disk management module, a write command instructing it to write the data corresponding to the fourth sub-logical section.
Here, a third disk management module is the disk management module of any disk among the first disks and the second disk other than the failed disk and the disk that returned the data corresponding to the fourth sub-logical section.
For example, when Server1 reads the data D1 corresponding to the sub-logical section [0, 64KB] from OSD1, it sends a write command for writing the data of [0, 64KB] to the disk management modules of OSD3 and OSD4, i.e., all disks except OSD1 (the data source) and OSD2 (the failed disk).
In step 1003, when the third disk management module determines that the locally recorded mapping relationship between the fourth sub-logical section and the fifth Block carries the association flag, it allocates a sixth Block to the fourth sub-logical section and writes the data corresponding to the fourth sub-logical section into the sixth Block.
Here, the fifth Block and the sixth Block are named for convenience of distinction, and are not limited thereto.
When the third disk management module determines that the locally recorded mapping relationship between the fourth sub-logical section and the fifth Block carries the association flag, the corresponding disk holds only data associated with the data of the fourth sub-logical section (in the fifth Block), not that data itself. The third disk management module therefore allocates a new Block (the sixth Block) to the fourth sub-logical section, stores the corresponding data in the sixth Block, and proceeds to step 1004.
Conversely, when the mapping relationship between the fourth sub-logical section and the fifth Block carries no association flag, the data of the fourth sub-logical section is already stored in the fifth Block, so the reconstruction write can be skipped.
In step 1004, the third disk management module updates the mapping relationship between the fourth sub-logical section and the fifth Block to the mapping relationship between the fourth sub-logical section and the sixth Block, and removes the corresponding association flag.
That is, a mapping relationship is established between the fourth sub-logical section and the Block (the sixth Block) that now stores its data. Because a real mapping relationship now exists, the data of the fourth sub-logical section no longer needs to be recovered from associated data, so the original association flag is removed.
It should also be noted that after updating the mapping relationship, the third disk management module may check whether any locally recorded mapping relationship still references the fifth Block; if not, no data needs to be recovered from the fifth Block any more, and the fifth Block can be recycled to save storage resources.
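Steps 1003 and 1004, together with the Block-recycling check, can be sketched as follows; `MappingEntry` and the callback parameters are the same kind of illustrative helpers used above, not names from the patent.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MappingEntry:
    block: str
    associated: bool  # True when the entry carries the ":R" flag

def handle_rebuild_write(mapping: Dict[int, MappingEntry], offset: int, data: bytes,
                         alloc_block, write_block, free_block) -> None:
    entry = mapping.get(offset)
    if entry is None or not entry.associated:
        # No ":R" flag: the section's real data is already on this disk, so skip (step 1003).
        return
    old_block = entry.block
    new_block = alloc_block()                                    # the "sixth Block"
    write_block(new_block, data)
    mapping[offset] = MappingEntry(new_block, associated=False)  # drop the ":R" flag (step 1004)
    # Recycle the old Block once no mapping entry references it any more.
    if all(e.block != old_block for e in mapping.values()):
        free_block(old_block)

# OSD4 before rebuilding [0, 64KB]: every section only carries the parity association to Blk4
osd4_map = {0: MappingEntry("Blk4", True),
            128 * 1024: MappingEntry("Blk4", True),
            192 * 1024: MappingEntry("Blk4", True)}
handle_rebuild_write(osd4_map, 0, b"D", alloc_block=lambda: "Blk70",
                     write_block=lambda blk, d: None, free_block=lambda blk: None)
print(osd4_map[0])  # MappingEntry(block='Blk70', associated=False); Blk4 kept, still referenced
```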
Take as an example Server1 sending write commands for writing the data block D of the sub-logical section [0, 64KB] to the disk management modules of OSD3 and OSD4:
After the disk management module of OSD3 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk3;128KB:Blk60:R;192KB:Blk60]
The mapping relationship between the sub-logical section [0, 64KB] and Blk3 carries no association flag R, i.e., the data of [0, 64KB] is already stored in Blk3, so the write is skipped.
After the disk management module of OSD4 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk4:R;128KB:Blk4:R;192KB:Blk4:R]
Since the mapping relationship between the sub-logical section [0, 64KB] and Blk4 carries the association flag R, the disk management module of OSD4 allocates a new Block, denoted Blk70, for the section [0, 64KB] and writes the corresponding data into Blk70. The data of the section [0, 64KB] in OSD4 is denoted D4 (a copy of data block D). The disk management module of OSD4 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk70;128KB:Blk4:R;192KB:Blk4:R]
At this time, the mapping relationship of Seg1 in the distributed storage cluster is shown in fig. 11.
Starting from the state shown in fig. 11, Server1 sends write commands for writing data block B of the sub-logical section [128KB, 192KB] to the disk management modules of OSD1, OSD3, and OSD4:
After the disk management module of OSD1 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk1;128KB:Blk20:R;192KB:Blk20:R]
Since the mapping relationship between the sub-logical section [128KB, 192KB] and Blk20 carries the association flag R, the disk management module of OSD1 allocates a new Block, denoted Blk10, for the section [128KB, 192KB] and writes the corresponding data into Blk10. The data of the section [128KB, 192KB] in OSD1 is denoted B1 (a copy of data block B). The disk management module of OSD1 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk1;128KB:Blk10;192KB:Blk20:R]
After the disk management module of OSD3 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk3;128KB:Blk60:R;192KB:Blk60]
Since the mapping relationship between the sub-logical section [128KB, 192KB] and Blk60 carries the association flag R, the disk management module of OSD3 allocates a new Block, denoted Blk20, for the section [128KB, 192KB] and writes the corresponding data into Blk20. The data of the section [128KB, 192KB] in OSD3 is denoted B3 (a copy of data block B). The disk management module of OSD3 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk3;128KB:Blk20;192KB:Blk60]
After the disk management module of OSD4 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk70;128KB:Blk4:R;192KB:Blk4:R]
Since the mapping relationship between the sub-logical section [128KB, 192KB] and Blk4 carries the association flag R, the disk management module of OSD4 allocates a new Block, denoted Blk100, for the section [128KB, 192KB] and writes the corresponding data into Blk100. The data of the section [128KB, 192KB] in OSD4 is denoted B4 (a copy of data block B). The disk management module of OSD4 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk70;128KB:Blk100;192KB:Blk4:R]
At this time, the mapping relationships of Seg1 in the distributed storage cluster are shown in fig. 12.
Starting from the state shown in fig. 12, Server1 sends write commands for writing data block C of the sub-logical section [192KB, 256KB] to the disk management modules of OSD1 and OSD4:
After the disk management module of OSD1 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk1;128KB:Blk10;192KB:Blk20:R]
Since the mapping relationship between the sub-logical section [192KB, 256KB] and Blk20 carries the association flag R, the disk management module of OSD1 allocates a new Block, denoted Blk100, for the section [192KB, 256KB] and writes the corresponding data into Blk100. The data of the section [192KB, 256KB] in OSD1 is denoted C1 (a copy of data block C). The disk management module of OSD1 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk1;128KB:Blk10;192KB:Blk100]
It should be noted that after updating 192KB:Blk20:R to 192KB:Blk100, the disk management module of OSD1 can determine that no locally recorded mapping relationship references Blk20 any more, i.e., the data A1 in Blk20 is no longer needed, so Blk20 can be recycled.
After the disk management module of OSD4 receives the write command, it queries its locally recorded mapping table of Seg1:
Seg1→[0:Blk70;128KB:Blk100;192KB:Blk4:R]
Since the mapping relationship between the sub-logical section [192KB, 256KB] and Blk4 carries the association flag R, the disk management module of OSD4 allocates a new Block, denoted Blk20, for the section [192KB, 256KB] and writes the corresponding data into Blk20. The data of the section [192KB, 256KB] in OSD4 is denoted C4 (a copy of data block C). The disk management module of OSD4 then updates the mapping table of Seg1 as follows:
Seg1→[0:Blk70;128KB:Blk100;192KB:Blk20]
It should be noted that after updating 192KB:Blk4:R to 192KB:Blk20, the disk management module of OSD4 can determine that no locally recorded mapping relationship references Blk4 any more, i.e., the check data P in Blk4 is no longer needed, so Blk4 can be recycled.
At this time, the mapping relationship of Seg1 in the distributed storage cluster is shown in fig. 13.
Thus, the flow shown in fig. 10 is completed. The reconstruction of data can be achieved by the flow shown in fig. 10.
The method provided by the embodiment of the application is described above, and the distributed storage cluster provided by the embodiment of the application is described below:
the distributed storage cluster comprises at least one cluster node, each cluster node comprises at least one disk for storing data, each disk is divided into a plurality of blocks according to a preset Block size, the distributed storage cluster is configured with at least one LUN, each LUN is divided into a plurality of logic sections according to a preset Segment size, each logic section is divided into a plurality of sub-logic sections according to a preset Block size, each cluster node deploys a corresponding disk management module for each disk on the node, each Segment corresponds to a write bitmap, each bit in the write bitmap is used for identifying whether the corresponding sub-logic section is written with data, each Segment also corresponds to a storage bitmap, each bit in the storage bitmap is used for identifying the storage mode of the data of the corresponding sub-logic section, and the distributed storage cluster is written in an N copy mode, wherein N is more than or equal to 2;
The target cluster node in the at least one cluster node is used for acquiring a writing bitmap and a storage bitmap corresponding to a target Segment when the duration of the non-accessed target Segment is monitored to reach a preset duration; traversing the writing bitmap and the storage bitmap to find N first sub-logic intervals in which the data are written and the storage mode is a copy mode; selecting a target disk from N first disks for storing data corresponding to the target Segment; for each first sub-logic section, sending a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk; calculating check data according to the data of each first sub-logic section returned by the target disk management module; sending a write command for indicating to write the verification data into a second disk to a second disk management module corresponding to the second disk, wherein the second disk is a preassigned disk for storing the verification data corresponding to the target Segment;
the second disk management module is configured to allocate a first Block for the check data from the second disk, and write the check data into the first Block;
The target cluster node is further configured to send, to a first disk management module corresponding to each first disk, a deletion command for instructing to delete a copy of specified data, so that data corresponding to the N first sub-logical intervals are respectively stored in the N first disks.
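Read end-to-end, the cold-data conversion amounts to: read the N written copy-mode sections from one complete replica, compute check data onto the second disk, then have each first disk delete all but one section's copy so that the N sections end up spread across the N first disks. The sketch below assumes a single XOR parity block and uses invented callback names; the bitmap updates and association-flag bookkeeping described above are omitted for brevity, so this is an illustration of the idea rather than the claimed implementation.

```python
from typing import Callable, List

def xor_blocks(blocks: List[bytes]) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def convert_segment(sections: List[int],                       # offsets of the N written sections
                    first_disks: List[str],                    # the N first disks holding full copies
                    read_copy: Callable[[str, int], bytes],    # read_copy(disk, offset) -> bytes
                    write_parity: Callable[[bytes], None],     # writes check data to the second disk
                    delete_copies: Callable[[str, int], None]  # delete_copies(disk, offset_to_keep)
                    ) -> None:
    """Convert one Segment's copy-mode data to the erasure-coded layout (single parity assumed)."""
    source = first_disks[0]                                    # any replica holds every section
    data = [read_copy(source, off) for off in sections]
    write_parity(xor_blocks(data))                             # check data onto the second disk
    for disk, keep in zip(first_disks, sections):
        # Each first disk keeps exactly one section's copy and deletes the others,
        # so the N sections end up stored on the N first disks respectively.
        delete_copies(disk, keep)
```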
As an embodiment, the target cluster node is further configured to update, for each first sub-logical section, a storage manner identified by a corresponding bit of the first sub-logical section in the storage bitmap to be an erasure code manner.
As an embodiment, the target cluster node is further configured to determine each second sub-logical section in the target Segment related to the write request when receiving the write request for writing the target Segment, and split the write request into sub-write requests for each second sub-logical section; inquiring the writing bitmap of the target Segment for each second sub-logic interval, and determining whether the second sub-logic interval is written with data; if the second sub-logic section is written with data, inquiring a storage bitmap of the target Segment, and determining a storage mode of data corresponding to the second sub-logic section; if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, respectively sending a sub-write request aiming at the second sub-logic section to each first disk management module, wherein the sub-write request carries the data to be written into the second sub-logic section and erasure code marks;
The first disk management module is further configured to allocate a second Block to the second sub-logical section when determining that the sub-write request carries the erasure code flag, write data into the second Block, and record a mapping relationship between the second sub-logical section and the second Block;
and the target cluster node is further configured to update the storage mode identified by the bit corresponding to the second sub-logical section in the storage bitmap to copy mode.
As an embodiment, the first disk management module is further configured to record a mapping relationship between a first sub-logical section to which the deleted data copy belongs and a Block to which the undeleted data copy belongs, and add an association flag to the mapping relationship;
the target cluster node is further configured to determine each third sub-logical section in the target Segment related to the read request when receiving the read request for reading the target Segment, and split the read request into sub-read requests for each third sub-logical section; inquiring the storage bitmap aiming at each third sub-logic interval, and determining a storage mode of data corresponding to the third sub-logic interval; if the storage mode of the data corresponding to the third sub-logic section is an erasure code mode, respectively sending a first sub-read request aiming at the third sub-logic section to each first disk management module, wherein the first sub-read request does not comprise an association mark;
And the first disk management module is further configured to read corresponding data from the third Block and return the corresponding data to the target cluster node when it is determined that the mapping relationship between the third sub-logical section and the third Block of the local record does not have an association mark.
As an embodiment, the second disk management module is further configured to record a mapping relationship between each first sub-logical section and the first Block, and add an association flag;
the target cluster node is further configured to send a second sub-read request for the third sub-logical section to each first disk management module if the data in the third sub-logical section is not read, where the second sub-read request carries an association flag;
the first disk management module is further configured to, when determining that a mapping relationship between the third sub-logical section and a third Block of the local record has an association tag, read corresponding data from the third Block and return the corresponding data to the target cluster node;
the target cluster node is further configured to send a third sub-read request for the third sub-logical section to the second disk management module;
the second disk management module is further configured to read check data from a fourth Block and return the check data to the target cluster node according to a mapping relationship between the third sub-logical section and the fourth Block, where the mapping relationship is recorded locally;
And the target cluster node is also used for performing verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module to obtain the data corresponding to the third sub-logic section.
As an embodiment, the target cluster node is further configured to, when determining that there is a failed disk in the N first disks, read, for each fourth sub-logical section of the data written in the target Segment, data corresponding to the fourth sub-logical section; sending write commands for indicating the data corresponding to the fourth sub-logic section to each third disk management module, wherein the third disk management modules refer to disk management modules corresponding to disks except for the fault disk and the disk returning the data corresponding to the fourth sub-logic section in the first disk and the second disk;
the third disk management module is configured to allocate a sixth Block to the fourth sub-logical section when determining that the mapping relationship between the fourth sub-logical section and the fifth Block of the local record has an association flag, and write data corresponding to the fourth sub-logical section into the sixth Block; and updating the mapping relation between the fourth sub-logic section and the fifth Block into the mapping relation between the fourth sub-logic section and the sixth Block, and removing the corresponding association mark.
As an embodiment, the third disk management module is further configured to determine whether a mapping relationship including the fifth Block still exists; and if not, recycling the fifth Block.
As can be seen from the above description, in the embodiments of the present application the distributed storage cluster writes data in copy mode, which preserves the cluster's write performance; when the data becomes cold, data stored in copy mode is converted to erasure-coded storage, which reduces disk consumption and improves the usable disk capacity of the distributed storage cluster.
The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the embodiments of the present application.

Claims (14)

1. A data redundancy method, applied to a distributed storage cluster, wherein the distributed storage cluster includes at least one cluster node, each cluster node includes at least one disk for storing data, each disk is divided into a plurality of Blocks according to a preset Block size, the distributed storage cluster is configured with at least one LUN, each LUN is divided into a plurality of logical sections according to a preset Segment size, each logical section is divided into a plurality of sub-logical sections according to the preset Block size, each cluster node deploys a corresponding disk management module for each disk on the node, each Segment corresponds to a write bitmap in which each bit identifies whether data has been written to the corresponding sub-logical section, each Segment also corresponds to a storage bitmap in which each bit identifies the storage mode of the data of the corresponding sub-logical section, and the distributed storage cluster adopts an N-copy write mode, where N is greater than or equal to 2, the method comprising:
When the target cluster node in the at least one cluster node monitors that the duration of the non-access target Segment reaches the preset duration, acquiring a writing bitmap and a storage bitmap corresponding to the target Segment;
the target cluster node traverses the writing bitmap and the storage bitmap to find N first sub-logic intervals in which data are written and the storage mode is a copy mode;
the target cluster node selects a target disk from N first disks for storing data corresponding to the target Segment;
for each first sub-logic section, the target cluster node sends a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk;
the target cluster node calculates check data according to the data of each first sub-logic interval returned by the target disk management module;
the target cluster node sends a write command for indicating to write the verification data into a second disk to a second disk management module corresponding to the second disk, wherein the second disk is a pre-designated disk for storing the verification data corresponding to the target Segment;
the second disk management module allocates a first Block for the check data from the second disk and writes the check data into the first Block;
And for each first disk, the target cluster node sends a deleting command for indicating to delete the specified data copy to the first disk management module corresponding to the first disk, so that the data corresponding to the N first sub-logic sections are respectively stored in the N first disks.
2. The method of claim 1, wherein, for each first disk, after the target cluster node sends a delete command to the first disk management module corresponding to the first disk to instruct deletion of the specified data copy, the method further comprises:
and for each first sub-logic interval, the target cluster node updates the storage mode identified by the corresponding bit of the first sub-logic interval in the storage bitmap to be an erasure code mode.
3. The method of claim 1, wherein the method further comprises:
when the target cluster node receives a write request which needs to be written into the target Segment, determining each second sub-logic section in the target Segment related to the write request, and splitting the write request into sub-write requests aiming at each second sub-logic section;
the following is performed for each second sub-logic section:
The target cluster node inquires a writing bitmap of the target Segment and determines whether data is written in a second sub-logic section;
if the second sub-logic section is written with data, the target cluster node queries a storage bitmap of the target Segment and determines a storage mode of data corresponding to the second sub-logic section;
if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, the target cluster node respectively sends a sub-write request aiming at the second sub-logic section to each first disk management module, wherein the sub-write request carries the data to be written into the second sub-logic section and erasure code marks;
when the first disk management module determines that the sub-write request carries the erasure code mark, a second Block is allocated to the second sub-logic section, data is written into the second Block, and the mapping relation between the second sub-logic section and the second Block is recorded;
and the target cluster node updates the storage mode identified by the corresponding bit of the second sub-logic section in the storage bitmap to be a copy mode.
4. The method of claim 1, wherein, for each first disk, the target cluster node sends a delete command to the first disk management module corresponding to the first disk for instructing to delete the specified data copy, so that the N first sub-logical section corresponding data are stored in N first disks, respectively, and the method further comprises:
The first disk management module records the mapping relation between a first sub-logic section to which the deleted data copy belongs and a Block to which the undeleted data copy belongs, and adds an association mark for the mapping relation;
when the target cluster node receives a read request for reading the target Segment, determining each third sub-logic section in the target Segment related to the read request, and splitting the read request into sub-read requests for each third sub-logic section;
the following processing is performed for each third sub-logic section:
the target cluster node queries the storage bitmap and determines a storage mode of data corresponding to a third sub-logic interval;
if the storage mode of the data corresponding to the third sub-logic section is an erasure code mode, the target cluster node respectively sends a first sub-read request aiming at the third sub-logic section to each first disk management module, wherein the first sub-read request does not comprise an association mark;
and when the first disk management module determines that the mapping relation between the third sub-logic section and the third Block of the local record does not have the association mark, reading corresponding data from the third Block and returning the corresponding data to the target cluster node.
5. The method of claim 4, wherein the write command includes a start address of each first sub-logical section, the second disk management module allocates a first Block for the parity data from the second disk, and after writing the parity data to the first Block, the method further comprises:
the second disk management module records the mapping relation between each first sub-logic section and the first Block, and adds an association mark;
after the target cluster node sends the first sub-read requests aiming at the third sub-logic section to each first disk management module respectively, the method further comprises the following steps:
if the data of the third sub-logic section is not read, the target cluster node sends a second sub-read request aiming at the third sub-logic section to each first disk management module, wherein the second sub-read request carries an association mark;
when the first disk management module determines that the mapping relation between the third sub-logic section and the third Block of the local record has an association mark, corresponding data is read from the third Block and returned to the target cluster node;
the target cluster node sends a third sub-read request aiming at the third sub-logic section to the second disk management module;
The second disk management module reads check data from a fourth Block and returns the check data to the target cluster node according to the mapping relation between the third sub-logic section and the fourth Block recorded locally;
and the target cluster node performs verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module to obtain the data corresponding to the third sub-logic section.
6. The method of claim 1, wherein the method further comprises:
when the target cluster node determines that the fault disk exists in the N first magnetic disks, the following processing is executed for each fourth sub-logic section of the written data in the target Segment:
the target cluster node reads the data corresponding to the fourth sub-logic interval;
the target cluster node sends write commands for indicating to write the data corresponding to the fourth sub-logic section to each third disk management module, wherein the third disk management modules refer to disk management modules corresponding to disks except for the fault disk and the disk returning the data corresponding to the fourth sub-logic section in the first disk and the second disk;
when the third disk management module determines that the mapping relation between the fourth sub-logic section and the fifth Block of the local record has an associated mark, a sixth Block is allocated to the fourth sub-logic section, and data corresponding to the fourth sub-logic section is written into the sixth Block;
The third disk management module updates the mapping relation between the fourth sub-logic section and the fifth Block into the mapping relation between the fourth sub-logic section and the sixth Block, and removes the corresponding association mark.
7. The method of claim 6, wherein after the third disk management module updates the mapping relationship between the fourth sub-logical interval and the fifth Block to the mapping relationship between the fourth sub-logical interval and the sixth Block, the method further comprises:
the third disk management module judges whether a mapping relation comprising the fifth Block exists or not;
and if not, recycling the fifth Block.
8. The distributed storage cluster is characterized by comprising at least one cluster node, wherein each cluster node comprises at least one disk for storing data, each disk is divided into a plurality of blocks according to a preset Block size, the distributed storage cluster is configured with at least one LUN, each LUN is divided into a plurality of logic sections according to a preset Segment size, each logic section is divided into a plurality of sub-logic sections according to a preset Block size, each cluster node deploys a corresponding disk management module for each disk on the node, each Segment corresponds to a write bitmap, each bit in the write bitmap is used for identifying whether the corresponding sub-logic section has written data, each Segment also corresponds to a storage bitmap, each bit in the storage bitmap is used for identifying the storage mode of the data corresponding to the sub-logic section, and the distributed storage cluster is written in an N copy mode, wherein N is more than or equal to 2;
The target cluster node in the at least one cluster node is used for acquiring a writing bitmap and a storage bitmap corresponding to a target Segment when the duration of the non-accessed target Segment is monitored to reach a preset duration; traversing the writing bitmap and the storage bitmap to find N first sub-logic intervals in which the data are written and the storage mode is a copy mode; selecting a target disk from N first disks for storing data corresponding to the target Segment; for each first sub-logic section, sending a read command for reading data corresponding to the first sub-logic section to a target disk management module corresponding to the target disk; calculating check data according to the data of each first sub-logic section returned by the target disk management module; sending a write command for indicating to write the verification data into a second disk to a second disk management module corresponding to the second disk, wherein the second disk is a preassigned disk for storing the verification data corresponding to the target Segment;
the second disk management module is configured to allocate a first Block for the check data from the second disk, and write the check data into the first Block;
The target cluster node is further configured to send, to a first disk management module corresponding to each first disk, a deletion command for instructing to delete a copy of specified data, so that data corresponding to the N first sub-logical intervals are respectively stored in the N first disks.
9. The cluster of claim 8, wherein:
the target cluster node is further configured to update, for each first sub-logical section, a storage mode identified by a corresponding bit of the first sub-logical section in the storage bitmap to be an erasure code mode.
10. The cluster of claim 8, wherein:
the target cluster node is further configured to determine each second sub-logical section in the target Segment related to the write request when receiving the write request for writing the target Segment, and split the write request into sub-write requests for each second sub-logical section; inquiring the writing bitmap of the target Segment for each second sub-logic interval, and determining whether the second sub-logic interval is written with data; if the second sub-logic section is written with data, inquiring a storage bitmap of the target Segment, and determining a storage mode of data corresponding to the second sub-logic section; if the storage mode of the data corresponding to the second sub-logic section is an erasure code mode, respectively sending a sub-write request aiming at the second sub-logic section to each first disk management module, wherein the sub-write request carries the data to be written into the second sub-logic section and erasure code marks;
The first disk management module is further configured to allocate a second Block to the second sub-logical section when determining that the sub-write request carries the erasure code flag, write data into the second Block, and record a mapping relationship between the second sub-logical section and the second Block;
and the target cluster node is further configured to update the storage mode identified by the bit corresponding to the second sub-logical section in the storage bitmap to copy mode.
11. The cluster of claim 8, wherein:
the first disk management module is further configured to record a mapping relationship between a first sub-logical section to which the deleted data copy belongs and a Block to which the undeleted data copy belongs, and add an association flag to the mapping relationship;
the target cluster node is further configured to determine each third sub-logical section in the target Segment related to the read request when receiving the read request for reading the target Segment, and split the read request into sub-read requests for each third sub-logical section; inquiring the storage bitmap aiming at each third sub-logic interval, and determining a storage mode of data corresponding to the third sub-logic interval; if the storage mode of the data corresponding to the third sub-logic section is an erasure code mode, respectively sending a first sub-read request aiming at the third sub-logic section to each first disk management module, wherein the first sub-read request does not comprise an association mark;
And the first disk management module is further configured to read corresponding data from the third Block and return the corresponding data to the target cluster node when it is determined that the mapping relationship between the third sub-logical section and the third Block of the local record does not have an association mark.
12. The cluster of claim 11, wherein:
the second disk management module is further configured to record a mapping relationship between each first sub-logical section and the first Block, and add an association flag;
the target cluster node is further configured to send a second sub-read request for the third sub-logical section to each first disk management module if the data in the third sub-logical section is not read, where the second sub-read request carries an association flag;
the first disk management module is further configured to, when determining that a mapping relationship between the third sub-logical section and a third Block of the local record has an association tag, read corresponding data from the third Block and return the corresponding data to the target cluster node;
the target cluster node is further configured to send a third sub-read request for the third sub-logical section to the second disk management module;
the second disk management module is further configured to read check data from a fourth Block and return the check data to the target cluster node according to a mapping relationship between the third sub-logical section and the fourth Block, where the mapping relationship is recorded locally;
And the target cluster node is also used for performing verification calculation according to the data returned by the first disk management module and the verification data returned by the second disk management module to obtain the data corresponding to the third sub-logic section.
13. The cluster of claim 8, wherein:
the target cluster node is further configured to, when determining that a failed disk exists in the N first disks, read, for each fourth sub-logical section of the data written in the target Segment, data corresponding to the fourth sub-logical section; sending write commands for indicating the data corresponding to the fourth sub-logic section to each third disk management module, wherein the third disk management modules refer to disk management modules corresponding to disks except for the fault disk and the disk returning the data corresponding to the fourth sub-logic section in the first disk and the second disk;
the third disk management module is configured to allocate a sixth Block to the fourth sub-logical section when determining that the mapping relationship between the fourth sub-logical section and the fifth Block of the local record has an association flag, and write data corresponding to the fourth sub-logical section into the sixth Block; and updating the mapping relation between the fourth sub-logic section and the fifth Block into the mapping relation between the fourth sub-logic section and the sixth Block, and removing the corresponding association mark.
14. The cluster of claim 13, wherein:
the third disk management module is further configured to determine whether a mapping relationship including the fifth Block exists; and if not, recycling the fifth Block.
CN202011025578.4A 2020-09-25 2020-09-25 Data redundancy method and distributed storage cluster Active CN112052124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011025578.4A CN112052124B (en) 2020-09-25 2020-09-25 Data redundancy method and distributed storage cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011025578.4A CN112052124B (en) 2020-09-25 2020-09-25 Data redundancy method and distributed storage cluster

Publications (2)

Publication Number Publication Date
CN112052124A CN112052124A (en) 2020-12-08
CN112052124B true CN112052124B (en) 2023-09-22

Family

ID=73604833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011025578.4A Active CN112052124B (en) 2020-09-25 2020-09-25 Data redundancy method and distributed storage cluster

Country Status (1)

Country Link
CN (1) CN112052124B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7343467B2 (en) * 2004-12-20 2008-03-11 Emc Corporation Method to perform parallel data migration in a clustered storage environment
US10613944B2 (en) * 2017-04-18 2020-04-07 Netapp, Inc. Systems and methods for backup and restore of distributed master-slave database clusters

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981927A (en) * 2011-09-06 2013-03-20 阿里巴巴集团控股有限公司 Distribution type independent redundant disk array storage method and distribution type cluster storage system
CN102867035A (en) * 2012-08-28 2013-01-09 浪潮(北京)电子信息产业有限公司 High-availability method and device of distributed document system cluster
CN109783016A (en) * 2018-12-25 2019-05-21 西安交通大学 A kind of elastic various dimensions redundancy approach in distributed memory system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Data Redundancy in Distributed Systems; Liu Guihua (刘桂华); Computer Knowledge and Technology (电脑知识与技术), Issue 18; full text *

Also Published As

Publication number Publication date
CN112052124A (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN107943867B (en) High-performance hierarchical storage system supporting heterogeneous storage
US5778411A (en) Method for virtual to physical mapping in a mapped compressed virtual storage subsystem
US9588892B2 (en) Data access method in a storage architecture
US6532527B2 (en) Using current recovery mechanisms to implement dynamic mapping operations
CN106951375B (en) Method and device for deleting snapshot volume in storage system
KR101678868B1 (en) Apparatus for flash address translation apparatus and method thereof
US20060010290A1 (en) Logical disk management method and apparatus
US20030236944A1 (en) System and method for reorganizing data in a raid storage system
CN107924291B (en) Storage system
US20060156059A1 (en) Method and apparatus for reconstructing data in object-based storage arrays
CN113868192B (en) Data storage device and method and distributed data storage system
US20050163014A1 (en) Duplicate data storing system, duplicate data storing method, and duplicate data storing program for storage device
CN111026329B (en) Key value storage system based on host management tile record disk and data processing method
JPH1196686A (en) Rotary storage device
JPH08123629A (en) Method for monitoring of data loss of hierarchical data storage device
CN104620230A (en) Method of managing memory
CN114356246B (en) Storage management method and device for SSD internal data, storage medium and SSD device
US6658528B2 (en) System and method for improving file system transfer through the use of an intelligent geometry engine
CN112181299B (en) Data restoration method and distributed storage cluster
CN112052218B (en) Snapshot implementation method and distributed storage cluster
CN112052124B (en) Data redundancy method and distributed storage cluster
CN105068896B (en) Data processing method and device based on RAID backup
JPH07152498A (en) Information processing system
EP3182267B1 (en) Method and device for isolating disk regions
CN113050891B (en) Method and device for protecting deduplication data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant