CN114895856B - Distributed storage system based on high-density storage hardware - Google Patents

Distributed storage system based on high-density storage hardware

Info

Publication number
CN114895856B
CN114895856B (application CN202210814643.4A)
Authority
CN
China
Prior art keywords
storage
disk
sas
chunk
data
Prior art date
Legal status
Active
Application number
CN202210814643.4A
Other languages
Chinese (zh)
Other versions
CN114895856A (en)
Inventor
张颖
李铁
Current Assignee
Chuangyun Rongda Information Technology Tianjin Co ltd
Original Assignee
Chuangyun Rongda Information Technology Tianjin Co ltd
Priority date
Filing date
Publication date
Application filed by Chuangyun Rongda Information Technology Tianjin Co ltd filed Critical Chuangyun Rongda Information Technology Tianjin Co ltd
Priority to CN202210814643.4A priority Critical patent/CN114895856B/en
Publication of CN114895856A publication Critical patent/CN114895856A/en
Application granted granted Critical
Publication of CN114895856B publication Critical patent/CN114895856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0676 Magnetic disk device (via G06F 3/0668 Adopting a particular infrastructure, G06F 3/0671 In-line storage system, G06F 3/0673 Single storage device, G06F 3/0674 Disk device)
    • G06F 3/061 Improving I/O performance (via G06F 3/0602 Specifically adapted to achieve a particular effect)
    • G06F 3/064 Management of blocks (via G06F 3/0628 Making use of a particular technique, G06F 3/0638 Organizing or formatting or addressing of data)
    • G06F 3/0689 Disk arrays, e.g. RAID, JBOD (via G06F 3/0668 Adopting a particular infrastructure, G06F 3/0671 In-line storage system, G06F 3/0683 Plurality of storage devices)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed storage system based on high-density storage hardware, which comprises two nodes and an SAS disk cage. Each node comprises a CPU processor, a memory, an internal SAS adapter, a system disk and an external SAS HBA card. The SAS disk cage comprises an SAS controller and two storage components; each storage component is managed by only one node and is logically subordinate to it, the node and the storage component being connected by a connection line. The disks in the SAS disk cage are grouped through system configuration and then distributed to the two different storage components. Each storage component comprises a plurality of Chunk Groups and a plurality of disks, and each Chunk Group comprises a plurality of physical chunks. The invention enables the two nodes to provide storage service in a dual-active mode while tolerating the failure of either node, makes full use of the high-density disk cage, solves the single-point-of-failure problem of the disk cage, and achieves high availability, so it has wide application prospects and is suitable for popularization and application.

Description

Distributed storage system based on high-density storage hardware
Technical Field
The invention relates to the technical field of storage equipment, in particular to a distributed storage system based on high-density storage hardware.
Background
At present, traditional distributed storage systems are usually built on standard rack servers, in which pluggable hard disks can only be installed on the front panel or part of the rear panel. The chassis space cannot be fully exploited, so the storage utilization of rack space is low. To increase storage utilization in the rack, each rack server is typically connected through an SAS cable to a disk cage that is fully populated with hard disks. This approach increases utilization, but it introduces a single-point failure domain: once the SAS cable fails, none of the disks in the disk cage can be accessed. Therefore, a distributed storage system based on high-density storage hardware is urgently needed to solve the above technical problems.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a distributed storage system based on high-density storage hardware in which two nodes provide storage service in a dual-active mode and the failure of either node can be tolerated. The system thereby makes full use of a high-storage-density disk cage, solves the single-point-of-failure problem of the disk cage, and achieves high availability, giving it wide application prospects and making it suitable for popularization and application.
To achieve the above object, the present invention provides a distributed storage system based on high-density storage hardware, comprising two nodes and an SAS disk cage. Each node comprises a CPU processor, a memory, an internal SAS adapter, a system disk and an external SAS HBA card. The SAS disk cage comprises an SAS controller and two storage components; each storage component is managed by only one node and is logically subordinate to it, the two being connected by a connection line. The disks in the SAS disk cage are grouped through system configuration and then distributed to the two different storage components. Each storage component comprises several Chunk Groups and several disks, and each Chunk Group comprises several physical chunks, the number of which equals the number of disks; each disk contributes one physical chunk. The physical chunks of a Chunk Group form an erasure group in an XD + YP manner, that is, data is stored as X data chunks plus Y erasure-code chunks. Using a Chunk Group, data blocks are read and written in parallel on multiple disks; after the read/write IO completes, the erasure codes are calculated and written into the erasure stripes, thereby achieving data protection.
Preferably, a storage component comprises either an entire disk in the SAS disk cage or only part of the physical storage area on a disk; if two storage components contain the same disk, they select different storage areas on that disk.
Preferably, the storage component stores data on multiple disks in an XD + YP erasure-coding manner, where XD + YP indicates that one erasure group contains X data blocks and Y erasure-code blocks. Such an erasure group tolerates the loss or corruption of up to Y blocks and recovers the lost or corrupted blocks using the remaining data blocks and erasure-code blocks in the group.
Preferably, the storage system exposes a set of logical-volume-based management interfaces and read-write IO access interfaces to upper-layer applications. A logical volume has a maximum capacity value and, after creation, is gradually expanded in a thin-provisioning manner. The basic addressing unit within a logical volume is a fixed-length logical block; several consecutive logical blocks form a fixed-length logical Chunk, whose length equals that of a physical Chunk. The logical chunks of a logical volume are distributed to different storage components and are stored within the storage components in a log-structured storage mode.
Preferably, the SAS disk cage further includes a system data area formed by two SSD disks, and the SSD disks perform data protection in a RAID 0 manner.
Preferably, the disk is a large-capacity disk.
The distributed storage system based on the high-density storage hardware has the following beneficial effects.
The invention enables two nodes to provide storage service in a dual-active mode while tolerating the failure of either node, thereby making full use of a high-storage-density disk cage, solving the single-point-of-failure problem of the disk cage, and achieving high availability; it therefore has wide application prospects and is suitable for popularization and application.
Drawings
FIG. 1 is a schematic structural diagram of a distributed storage system based on high-density storage hardware according to the present invention;
FIG. 2-1 is a storage component mapping table;
FIG. 2-2 is a node state table;
FIG. 2-3 is a disk state table;
FIG. 3 is a logical volume management table;
FIG. 4-1 is a logical volume;
FIG. 4-2 is a logical volume static mapping table;
FIG. 5 is a diagram of the expansion process of a logical volume;
FIG. 6-1 is a log-structured space allocation table;
FIG. 6-2 is a physical Chunk reverse mapping table;
FIG. 6-3 is a bad stripe index table;
FIG. 7 shows the new data block table, old data block table, and free data block table;
FIG. 8-1 is a schematic diagram of the space allocation order within a physical Chunk;
FIG. 8-2 is a diagram illustrating the actual write order of data within a physical Chunk;
FIG. 9-1 is a pending queue;
FIG. 9-2 is a dirty data queue;
FIG. 9-3 is an erasure code queue;
FIG. 9-4 is a Cache bitmap table;
FIG. 9-5 is a Clean Cache queue;
FIG. 9-6 is a free Cache queue;
FIG. 10 is an IO decomposition flow chart of the storage system;
FIG. 11 is a flow chart of write IO sub-request processing;
FIG. 12 is a flow chart of read IO sub-request processing;
FIG. 13 is a data stripe destaging flow chart;
FIG. 14 is an erasure code stripe destaging flow chart;
FIG. 15 is a bad stripe processing flow chart;
FIG. 16 is a physical Chunk recovery flow chart;
FIG. 17 is a disk recovery flow chart.
In the figure:
101. node; 1011. CPU processor; 1012. memory; 1013. internal SAS adapter; 1014. system disk; 1015. external SAS HBA card; 1016. connection line; 103. SAS disk cage; 1031. SAS controller; 1032. storage component; 1033. disk; 1035. Chunk Group; 1036. system data area; 401. logical volume; 402. logical Chunk; 405. physical Chunk; 406. logical block.
Detailed Description
The present invention will be further described with reference to the following specific embodiments and accompanying drawings to assist in understanding the contents of the invention.
Fig. 1 is a schematic structural diagram of a distributed storage system based on high-density storage hardware according to the present invention. The system comprises two nodes 101 and an SAS disk cage 103. After a standard Linux operating system is installed on the two nodes 101, each node 101 can mount all the disks 1033 in the SAS disk cage 103 at the block device layer of its Linux operating system. The storage system software is then installed on the Linux operating systems of both nodes 101. The invention provides external applications or software with a management interface based on logical volumes and a read-write IO access RPC interface. After an IO access request is received, it is split across the storage components 1032, and the disk access operations are completed in parallel on all the disks 1033 inside each storage component 1032, thereby improving storage IO performance. A node 101 comprises a CPU processor 1011, a memory 1012, an internal SAS adapter 1013, a system disk 1014 and an external SAS HBA card 1015. The SAS disk cage 103 comprises an SAS controller 1031 and two storage components 1032, and each storage component 1032 comprises several disks 1033 in the SAS disk cage 103. Preferably, the disks 1033 are large-capacity disks. The storage system groups the disks 1033 in the high-density SAS disk cage 103 through system configuration and then allocates the groups to different storage components 1032. A storage component 1032 comprises either an entire disk 1033 within the SAS disk cage 103 or only part of the physical storage area on a disk 1033; if two storage components 1032 contain the same disk 1033, they must select different storage areas on that disk 1033. Each storage component 1032 is managed by only one node 101 and is logically subordinate to it, which is represented by connection line 1016. Because a storage component 1032 belongs to only one node 101, the two nodes 101 never access the same storage area of a disk 1033 simultaneously, so no lock mechanism needs to be designed on the IO path. A storage component 1032 maintains data on multiple disks 1033 in an XD + YP erasure-coding manner. XD + YP denotes that one erasure group contains X data blocks and Y erasure-code blocks. Such an erasure group can tolerate the loss or corruption of up to Y blocks and can recover them from the remaining data blocks and erasure-code blocks in the group. A storage component 1032 can be viewed as a container holding several Chunk Groups 1035. A Chunk Group 1035 contains as many physical chunks as there are disks 1033 in the storage component 1032: each disk 1033 contributes one physical chunk, and the physical chunks form an erasure group in the XD + YP manner, that is, data is stored as X data chunks plus Y erasure-code chunks. Using a Chunk Group 1035, data blocks can be read and written in parallel on the multiple disks 1033; after the read/write IO completes, the erasure codes are calculated and written into the erasure stripes, thereby achieving data protection. The storage system constructs a set of logical-volume-based management interfaces and read-write IO access interfaces for upper-layer applications.
A logical volume has a maximum capacity value and is progressively expanded in a thin-provisioned manner after creation. The basic addressing unit in a logical volume is a fixed-length logical block. Several consecutive logical blocks constitute a fixed-length logical Chunk, whose length equals that of the physical Chunk described above. The logical chunks of a logical volume are distributed across different storage components 1032 and are maintained in log-structured storage mode within the storage components 1032. The SAS disk cage 103 further includes a system data area 1036 formed by two SSD disks; preferably, the two SSD disks perform data protection in a RAID 0 manner.
As shown in figs. 2-1, 2-2 and 2-3, the configuration and management of storage components by the storage system is described. The storage component mapping table contains the configuration information of each storage component, including: the storage component number, the Owner node, the numbers of data disks and erasure-code disks in the storage component, and the physical storage area on each disk (starting LBA and block count). The storage component mapping table is determined by the configuration file at system installation. After the system starts, the CPU processor on each node loads the storage component mapping table into memory and determines, from the Owner-node column, which storage components this node manages. When a node fails, the CPU processor reassigns the storage components managed by the failed node to the other node. Each node monitors the health of the other node through a node state table: the CPU processor periodically sends heartbeat requests to probe the other node's state, and if an error occurs, it increments the error count of the corresponding node in the node state table. When the error count reaches the WARNING threshold, the node state is set to WARNING; when it reaches the FAIL threshold, the node state is set to FAILED. When the CPU processor of one node finds that the other node's state is FAILED, it updates the storage component mapping table and the node state table and takes over all storage components managed by the failed node. The disk state table maintains the state of every disk in the SAS disk cage, including the disk number, the block count (number of sectors), the type (SSD, SATA, NL-SAS, etc.) and the error count. The CPU processor increments the error count by 1 whenever a disk failure occurs in the read/write operations described later. When the error count reaches the WARNING threshold, the disk state is set to WARNING; when it reaches the FAIL threshold, the disk state is set to FAILED. The storage component mapping table, the node state table and the disk state table are all stored in the system data area, loaded into memory by the CPU processor of each node, and written back to disk whenever their state changes.
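By way of illustration only, the error-count and takeover logic above can be sketched as follows. This is a minimal sketch, not the patented implementation: the threshold values and the names HealthTable and take_over_if_failed are assumptions introduced for this example.

    # Minimal sketch of node/disk health tracking and storage-component takeover.
    # Threshold values and field names are illustrative assumptions.
    WARNING_THRESHOLD = 3
    FAIL_THRESHOLD = 10

    class HealthTable:
        """Tracks an error count and a derived state per node or disk."""
        def __init__(self, ids):
            self.entries = {i: {"errors": 0, "state": "OK"} for i in ids}

        def record_error(self, entry_id):
            e = self.entries[entry_id]
            e["errors"] += 1
            if e["errors"] >= FAIL_THRESHOLD:
                e["state"] = "FAILED"
            elif e["errors"] >= WARNING_THRESHOLD:
                e["state"] = "WARNING"
            return e["state"]

    def take_over_if_failed(node_table, component_map, local_node, peer_node):
        """Reassign the peer's storage components to this node once the peer is FAILED."""
        if node_table.entries[peer_node]["state"] == "FAILED":
            for comp in component_map:
                if comp["owner"] == peer_node:
                    comp["owner"] = local_node
        return component_map

    # Usage: repeated heartbeat errors against node 2 eventually trigger takeover by node 1.
    nodes = HealthTable([1, 2])
    components = [{"id": 0, "owner": 1}, {"id": 1, "owner": 2}]
    for _ in range(10):
        nodes.record_error(2)
    components = take_over_if_failed(nodes, components, local_node=1, peer_node=2)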
As shown in fig. 3, it is described how the storage system manages logical volume information. A logical volume can be viewed as a logical block device. The basic management information of a logical volume is maintained in the logical volume management table, which includes: the volume number, the target size of the volume (total chunk count), the expanded size (number of chunks already expanded), and user information. The block storage system provides its external access interface on the basis of logical volumes. Logical volume management is provided through the management REST interface of the storage system and includes, but is not limited to: creating a logical volume, i.e., creating a new row in the logical volume management table, allocating a logical volume number, and filling in the total chunk count and user information; expanding a logical volume, i.e., changing the total-chunk-count column of the corresponding volume; deleting a logical volume, i.e., deleting its row from the table; and changing the user of a logical volume, i.e., changing the user-information column of the corresponding volume. The logical volume management table is maintained in memory by the CPU processor of each node, and an image of it is stored in the system data area.
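The management operations listed above can be illustrated with a small sketch. A plain in-memory dictionary stands in for the logical volume management table; the function names are assumptions made for this example and are not defined by the patent.

    # Sketch of the logical-volume management table operations described above.
    volume_table = {}   # volume number -> row of the management table

    def create_volume(vol_no, total_chunks, user):
        volume_table[vol_no] = {"total_chunks": total_chunks,    # target size
                                "expanded_chunks": 0,            # thin-provisioned so far
                                "user": user}

    def expand_volume(vol_no, new_total_chunks):
        volume_table[vol_no]["total_chunks"] = new_total_chunks

    def delete_volume(vol_no):
        del volume_table[vol_no]

    def change_user(vol_no, user):
        volume_table[vol_no]["user"] = user

    create_volume(1, total_chunks=1024, user="tenant-a")
    expand_volume(1, new_total_chunks=2048)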
As shown in figs. 4-1 and 4-2, the composition of the logical volume 401 and the mapping of logical blocks 406 to storage components are illustrated. A logical volume 401 is a container holding several fixed-length logical chunks 402, and each logical Chunk 402 is a sub-container holding several fixed-length logical blocks 406. Logical blocks 406 are the basic data-addressing units in the logical volume 401; one logical block 406 is an integer multiple of 512 bytes (typically 512 or 4096 bytes). The logical volume 401 is thin-provisioned: when it is created, only the total-chunk-count column is filled in the logical volume management table, and no physical storage space is actually allocated. During the write IO process described later, the logical volume 401 is gradually expanded; at that point the mapping between the logical volume 401 and the storage components 1032 is stored in the logical volume static mapping table, and the expanded chunk count in the logical volume management table is updated. Because a storage component 1032 internally holds data in log-structured mode, the logical volume static mapping table records only the mapping from a logical Chunk 402 to a storage component 1032. When the logical Chunk 402 is allocated, its specific location inside the storage component 1032 is not yet known and is determined only while write IO requests are processed. A storage component 1032 is internally composed of a number of Chunk Groups 1035. A Chunk Group 1035 contains one physical Chunk 405 from each disk 1033 of the storage component 1032, and these physical chunks 405 form an erasure group in the XD + YP manner, i.e., a Chunk Group 1035 contains X data chunks and Y erasure-code chunks. When a write completes, the logical Chunk 402 in the logical volume 401 and the physical Chunk 405 in the Chunk Group 1035 establish a one-to-one mapping, which is recorded in the log-structured mapping table described later. The logical volume static mapping table is maintained in memory by the CPU processor of each node, and an image of it is stored in the system data area.
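To make the addressing concrete, the sketch below resolves a logical block address to its logical Chunk and looks the Chunk up in the static mapping table. The chunk and block sizes are illustrative assumptions; the patent only requires a logical block to be an integer multiple of 512 bytes and a logical Chunk to match the physical Chunk length.

    # Sketch of logical-block addressing inside a thin-provisioned logical volume.
    LOGICAL_BLOCK_BYTES = 4096        # assumed block size
    BLOCKS_PER_CHUNK = 16384          # assumed chunk geometry (64 MiB at 4 KiB blocks)

    def locate(lba):
        """Return (logical Chunk index, block offset within that Chunk)."""
        return lba // BLOCKS_PER_CHUNK, lba % BLOCKS_PER_CHUNK

    # Static mapping table: (volume, logical Chunk index) -> storage component.
    # The exact location inside the component is fixed only at write time,
    # by the log-structured allocation described later.
    static_map = {(1, 0): {"component": 0}, (1, 1): {"component": 1}}

    chunk_idx, offset = locate(20000)
    component = static_map.get((1, chunk_idx), {}).get("component")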
As shown in fig. 5, the expansion process of the logical volume 401 is shown, i.e., the mapping between a logical chunk of the logical volume 401 and a storage component 1032 is established and recorded in the logical volume static mapping table. The logical volume 401 performs the expansion process during initial setup, or when it is discovered during write IO that no new physical Chunk can be allocated in the current Chunk Group. During expansion, the CPU processor first looks up the storage component mapping table to find all available storage components 1032 for which this node is responsible. If no storage component 1032 is currently available, a new-storage-component allocation process is performed and several new storage components 1032 are allocated and taken into this node's ownership. The CPU processor then randomly selects an available storage component 1032, allocates the logical Chunk of the logical volume 401 to that storage component 1032, and records the mapping relationship in the logical volume static mapping table. When this is done, the expanded chunk count in the logical volume management table is updated. Data is stored in log-structured mode within the storage component 1032: when an application modifies the data of a logical block, new space is allocated for that logical block inside the storage component 1032 instead of overwriting the old data with the new data. As described later, by storing data in log-structured mode, the storage system can write data sequentially and compute erasure codes in a pipeline-like manner, thereby improving the IO performance of the storage system.
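A minimal sketch of this expansion step is shown below: an available storage component owned by this node is chosen at random, the mapping is recorded, and the expanded chunk count is updated. The table layouts and the function name are assumptions made for illustration.

    # Sketch of the logical-volume expansion flow described above.
    import random

    def expand_chunk(vol_no, chunk_idx, local_node, component_map,
                     static_map, volume_table):
        candidates = [c for c in component_map
                      if c["owner"] == local_node and c["available"]]
        if not candidates:
            # corresponds to allocating new storage components first
            raise RuntimeError("no available storage component owned by this node")
        chosen = random.choice(candidates)                        # random selection
        static_map[(vol_no, chunk_idx)] = {"component": chosen["id"]}
        volume_table[vol_no]["expanded_chunks"] += 1              # update management table
        return chosen["id"]

    comp_map = [{"id": 0, "owner": 1, "available": True}]
    smap, vtab = {}, {1: {"expanded_chunks": 0}}
    expand_chunk(1, chunk_idx=7, local_node=1, component_map=comp_map,
                 static_map=smap, volume_table=vtab)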
As shown in figs. 6-1, 6-2 and 6-3, the storage system maintains a log-structured space allocation table, which records for each logical block its physical block address within a physical Chunk of the storage component, and a reverse mapping table, which records which logical block each block of a physical Chunk corresponds to. The arrows in the figures represent pointers. The latest data of every logical block is always stored; the old data of a logical block can also be retained in the structure shown in fig. 5, so that an update history of the logical volume can be maintained in the storage system through a version management mechanism. The CPU processor of each node maintains in memory the structures represented by the log-structured space allocation table and the reverse mapping table for each storage component, and persists them in the system data area.
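The forward and reverse mappings can be sketched as two dictionaries that are updated together on every append write; old versions are simply left in place, which is what makes the version history mentioned above possible. The structure names are assumptions for this example.

    # Sketch of the log-structured space allocation table and its reverse mapping.
    forward_map = {}      # (volume, lba) -> (physical Chunk id, physical block)
    reverse_map = {}      # (physical Chunk id, physical block) -> (volume, lba)

    def append_write(volume, lba, chunk_id, next_free_block):
        """Allocate the next free physical block instead of overwriting in place."""
        superseded = forward_map.get((volume, lba))    # old version stays on disk
        forward_map[(volume, lba)] = (chunk_id, next_free_block)
        reverse_map[(chunk_id, next_free_block)] = (volume, lba)
        return superseded

    append_write(volume=1, lba=100, chunk_id=5, next_free_block=0)
    append_write(volume=1, lba=100, chunk_id=5, next_free_block=1)  # update -> new block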
As shown in fig. 7, the system tracks, in the log-structured storage mode, whether each block of a Chunk Group holds the latest data (701-A), i.e., the most recently written blocks; old data (701-B), i.e., blocks that have since been updated; or free data (701-C), i.e., old data blocks that have been reclaimed and whose space can be reallocated. With the log-structured storage mode in place, a write to the logical volume, whether it writes a new logical block or updates an old one, becomes an append write to the current Chunk Group of the storage component, and the performance of append writes is much higher than that of random writes on large-capacity mechanical disk media. The CPU processor maintains the structures shown at 701-A to 701-C for each storage component in the memory of each node and persists them in the system data area.
As shown in figs. 8-1 and 8-2, in the case of normal space allocation the CPU processor completes the append write to the current Chunk Group in a pipelined manner. The append write executes multiple erasure stripes in parallel to improve write performance; this parallelism makes each erasure stripe behave like a pipeline. Fig. 8-1 shows the space allocation order within a physical Chunk: space is allocated in units of whole physical chunks, with the logical blocks of the logical volume mapped one-to-one onto them, so the logical block address order inside the logical volume is identical to the physical block order inside the physical Chunk. This mapping improves data locality. When the CPU processor writes the logical Chunk data to disk, however, it does not write the physical Chunk sequentially as a unit; instead, the physical Chunk is decomposed into several fixed-length stripes, and the stripes at the same position on different chunks form an erasure stripe. Each erasure stripe serves as a pipeline, and multiple pipelines write data to disk simultaneously. Erasure codes are likewise calculated in units of stripes and then written to disk.
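The stripe geometry can be sketched as follows: a physical Chunk is cut into fixed-length stripes, and stripe n of every chunk in a Chunk Group together forms erasure stripe n, which is written as one pipeline. The stripe size and the 3D+2P group below are assumptions made for illustration.

    # Sketch of decomposing a Chunk Group into erasure stripes for pipelined writes.
    STRIPE_BLOCKS = 256            # assumed number of blocks per stripe

    def stripe_of(block_offset):
        """Which stripe (pipeline) a block offset inside a physical Chunk falls in."""
        return block_offset // STRIPE_BLOCKS

    def erasure_stripe(chunk_group_disks, stripe_no):
        """Erasure stripe = stripe `stripe_no` of every chunk in the group
        (X members carry data, Y carry erasure codes)."""
        return [(disk, stripe_no) for disk in chunk_group_disks]

    # Space is allocated chunk by chunk for locality (FIG. 8-1), but written
    # stripe by stripe so several pipelines are in flight at once (FIG. 8-2).
    group = ["disk0", "disk1", "disk2", "parity0", "parity1"]   # 3D+2P example
    writes = [erasure_stripe(group, s) for s in range(4)]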
As shown in figs. 9-1 to 9-6, the Cache information in memory is shown. Each node maintains its own Cache information tables in its memory. A pending queue is maintained for each logical volume: read-write IO requests received from the network interface are placed into the corresponding pending queue according to their logical addresses; the entries in the queues represent pending requests, arrows represent pointers, and solid circles represent head-of-queue pointers. The dirty data queue is indexed by storage component number, with one queue maintained for each erasure stripe of a storage component. Each entry in the queue is a structure containing a pointer to some of the logical blocks of a row in the Cache bitmap table, the physical Chunk information of the stripe, the context information of the request, and so on. The erasure code queue is indexed by storage component number, with one queue maintained for each stripe of a storage component; each entry is a structure containing a pointer to a row in the Cache bitmap table, the physical Chunk information of the stripe, and so on. The Clean Cache queue holds the stripe Cache pointers (pointers into the Cache bitmap table) whose write operations have completed; because the elements in the Clean Cache queue are fully consistent with the data on disk, this queue can serve as a read Cache. The free Cache queue holds Cache from which space can be allocated; when allocatable Cache space is insufficient, data is evicted from the Clean Cache queue and the pages are moved to the free Cache queue. Each record of the Cache bitmap table describes the allocation and description information of Cache pages in memory, including: the logical address information (logical volume number, logical block start address, block count); a pointer to the Cache pages; a dirty bitmap, a bit string in which each bit corresponds to a logical block, a set bit indicating that the corresponding logical block has new data in the Cache that has not yet been written to disk; and a staging bitmap, likewise a bit string in which each bit corresponds to a logical block, a set bit indicating that the corresponding logical block is currently being written to disk.
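A single record of the Cache bitmap table can be sketched as a small structure with per-block dirty and staging bits. The class and field names are assumptions mirroring the description above, not identifiers defined by the patent.

    # Sketch of one record in the Cache bitmap table.
    class CacheBitmapRecord:
        def __init__(self, volume, start_lba, block_count):
            self.volume = volume
            self.start_lba = start_lba
            self.block_count = block_count
            self.pages = bytearray()                  # stand-in for the Cache pages
            self.dirty = [False] * block_count        # new data not yet on disk
            self.staging = [False] * block_count      # currently being written to disk

        def mark_written(self, lba):                  # write IO landed in the Cache
            self.dirty[lba - self.start_lba] = True

        def begin_destage(self, lba):                 # stripe destage started
            self.staging[lba - self.start_lba] = True

        def destage_done(self, lba):                  # data is now on disk
            i = lba - self.start_lba
            self.dirty[i] = False
            self.staging[i] = False

    rec = CacheBitmapRecord(volume=1, start_lba=0, block_count=8)
    rec.mark_written(3)
    rec.begin_destage(3)
    rec.destage_done(3)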
Fig. 10 is a flowchart illustrating the IO decomposition of the storage system. All network write requests first enter the pending queue. The CPU processor takes requests from the pending queue in order (S1001). Each request is then split into several sub-requests according to the address range of the requested logical blocks, where each sub-request corresponds to one logical Chunk and may contain some or all of the logical blocks inside that logical Chunk (S1002). Next, the processor encapsulates each sub-request into an asynchronous IO task, issues all IO tasks to the request thread pool, executes the tasks asynchronously, and tracks the execution state of every IO task (S1003). The processor maintains an endless loop in a thread that waits to receive completion or timeout events for the IO tasks (S1004). If a sub-request times out or a timeout event is received (S1006-Y), the sub-request is retried a certain number of times; if the retries keep failing (S1008-Y), the system regards the operation as failed (S1013). If a sub-request returns an error message (S1009-N), it is likewise retried a certain number of times, and if the retries keep failing (S1008-Y), the system regards the operation as failed (S1013). If all sub-requests return success (S1011-Y), the write request operation is considered successful. During this multi-threaded processing, the CPU processor maintains an IO tracker for each sub-request; the IO tracker records the processing state of every logical block in the sub-request. The CPU processor checks the IO tracker and confirms the success or failure of the sub-request only after determining that all logical blocks have successfully been written to disk or that a failure has been confirmed.
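The decomposition step S1002 amounts to cutting the requested block range at logical Chunk boundaries. The sketch below shows this calculation under an assumed chunk geometry; each resulting sub-request would then be wrapped in an asynchronous IO task and tracked by an IO tracker.

    # Sketch of splitting a request into per-Chunk sub-requests (S1002).
    BLOCKS_PER_CHUNK = 16384          # assumed geometry

    def split_request(start_lba, block_count):
        """Yield (chunk_index, first_lba, blocks) tuples, one per sub-request."""
        lba, remaining = start_lba, block_count
        while remaining > 0:
            chunk_idx = lba // BLOCKS_PER_CHUNK
            room_in_chunk = (chunk_idx + 1) * BLOCKS_PER_CHUNK - lba
            take = min(room_in_chunk, remaining)
            yield (chunk_idx, lba, take)
            lba += take
            remaining -= take

    # A request that straddles a Chunk boundary becomes two sub-requests.
    subs = list(split_request(start_lba=16380, block_count=10))
    # -> [(0, 16380, 4), (1, 16384, 6)]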
Fig. 11 is a flow chart illustrating the processing of a write IO sub-request. The CPU processor uses a single thread to process each sub-request. It first determines whether the logical Chunk corresponding to the sub-request has been expanded: the CPU processor queries the logical volume static mapping table to see whether the requested logical volume and logical Chunk exist (S1101). If not (S1102-N), the expansion process described above for fig. 5 is performed (S1103) to assign the logical Chunk to a storage component. If the mapping already exists (S1102-Y), the processor determines whether the assigned storage component belongs to this node; if not, the sub-request is redirected to the other node (S1106). If it belongs to this node (S1104-Y), the space pre-allocation flow is performed (S1105). The space pre-allocation flow mounts the logical blocks of the logical Chunk 402 corresponding to the request onto the tails of the erasure stripes in the dirty data queue. The stripes of one erasure stripe belong to the same physical Chunk in the longitudinal direction, so mounting is performed in the order shown in fig. 8-1, in which one physical Chunk is filled preferentially in the longitudinal direction. The space pre-allocation flow ends when all logical blocks have been mounted onto the dirty data queue. During space pre-allocation (S1105), it may happen that the end of the current Chunk Group is reached while some logical blocks remain unallocated; in that case a new Chunk Group must be started and allocation continues. If the space within the group is insufficient (S1107-N), an intra-group space cleaning process is executed (S1110); if the space is still insufficient after cleaning (S1111-N), the write-Cache-failure state is recorded in the IO tracker (S1112). If the space within the group is sufficient (S1107-Y), the data is written to the Cache (S1109). After writing, the Cache bitmap table is updated and the dirty-bitmap bits corresponding to the written logical blocks are set. The write-Cache-success state is then recorded in the IO tracker (S1113).
Fig. 12 is a flow chart of a read IO sub-request. The CPU processor uses a single thread to process each read IO sub-request. The processor queries the logical volume static mapping table with the requested target logical address and determines the owner node of the logical Chunk (S1201). If the Chunk does not belong to this node, the sub-request is redirected to the other node (S1204), and the sub-request is deemed successful (S1210) or failed (S1206) according to the processing state returned by that node. If it belongs to this node (S1202-Y), the Cache bitmap table is queried to check whether all requested logical blocks are in the Cache. If the Cache misses (S1203-N), the CPU processor must reserve Cache space (S1207): it first tries to take Cache space from the free Cache queue; if the space in the free Cache queue is insufficient, a Cache cleaning process is executed on entries of the Clean Cache queue and the cleaned Cache pages are moved into the free Cache queue, after which the CPU processor takes Cache space from the free Cache queue. The CPU processor then reads the data from disk into the reserved Cache space (S1208) and returns the Cache data (S1209), completing the sub-request (S1210). If the Cache hits directly (S1203-Y), the Cache data is returned immediately (S1209) and the sub-request is complete (S1210). For stripe destaging, the CPU processor creates two threads for each erasure stripe in the storage component: one handles destaging of the data stripes, and the other handles destaging of the erasure code stripes. Multiple threads execute in parallel to increase processing speed.
Fig. 13 shows the data stripe destaging flow chart. The CPU processor takes records from the head of the dirty data queue and assembles a complete stripe (S1301); the physical address range of each record in the dirty data queue does not exceed one stripe, so several consecutive records may have to be pieced together into a complete stripe. The stripe data is then written to disk using asynchronous IO (S1302). Thanks to the asynchronous IO mode, the CPU processor can quickly submit all data stripes within one erasure stripe to disk (S1303-Y) and then wait for the IO completion events to be processed (S1304). If an IO failure event is received (S1305-N), indicating that the disk IO of one stripe has failed, the bad stripe processing flow is entered (S1310) and the number of failed stripes is then counted. The bad stripe processing flow, described later, reallocates the logical blocks corresponding to the bad stripe to other positions in the storage component. If the number of failed stripes exceeds the Quorum permitted by the storage component (for example, in the XD+YP erasure mode, more than Y stripe IOs have failed), the whole storage component is considered faulty and the storage component exception handling flow is entered (S1311). If the number of failed stripes does not exceed the Quorum permitted by the storage component, the erasure stripe can still be destaged; in that case the erasure codes of the erasure stripe are calculated (S1307). Optionally, the erasure code calculation may use Reed-Solomon coding. For example, let X = 3 and Y = 2 in the XD+YP erasure pattern of the storage component, and let the three data stripes be X1, X2 and X3; then:
Y1 = X1 xor X2 xor X3
Y2 = (X1 * a) xor (X2 * b) xor (X3 * c)
Y1 and Y2 are the calculated erasure code stripes and are written into the erasure code queue corresponding to this erasure stripe.
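For the 3D+2P example above, the computation of Y1 and Y2 can be sketched byte-wise as follows. The sketch assumes arithmetic in GF(2^8) with the common 0x11D reducing polynomial and picks illustrative coefficients a, b, c; the patent does not fix these choices, and a production Reed-Solomon encoder would derive its coefficients from a generator matrix.

    # Sketch of the erasure-code calculation Y1 = X1 xor X2 xor X3 and
    # Y2 = (X1*a) xor (X2*b) xor (X3*c), with * taken in GF(2^8).
    def gf_mul(x, y):
        """Multiply two bytes in GF(2^8) using the 0x11D polynomial."""
        result = 0
        while y:
            if y & 1:
                result ^= x
            x <<= 1
            if x & 0x100:
                x ^= 0x11D
            y >>= 1
        return result

    def encode_stripe(data_stripes, coeffs):
        """Return the two erasure code stripes for the given data stripes."""
        length = len(data_stripes[0])
        y1, y2 = bytearray(length), bytearray(length)
        for i in range(length):
            for stripe, coef in zip(data_stripes, coeffs):
                y1[i] ^= stripe[i]                      # plain XOR parity
                y2[i] ^= gf_mul(stripe[i], coef)        # coefficient-weighted parity
        return bytes(y1), bytes(y2)

    X1, X2, X3 = b"\x10\x20", b"\x01\x02", b"\xff\x00"
    Y1, Y2 = encode_stripe([X1, X2, X3], coeffs=[1, 2, 3])  # a=1, b=2, c=3 (illustrative)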
If some of the stripes involved in the calculation are bad, but the total number of bad stripes is within the range permitted by the Quorum, the bad stripes are treated as all zeros when the erasure codes are calculated; in that case the bad stripes contain no valid data, and their positions are recorded in the control information block of the Chunk Group. Subsequently, the CPU processor updates the dirty bitmap and the staging bitmap in the Cache bitmap table and moves the stripe record from the dirty data queue to the end of the Clean Cache queue. The IO tracker is then updated to mark the logical blocks corresponding to the successfully destaged stripe as successful (S1308). If there are more records in the dirty data queue (S1309-Y), the flow returns to S1301; otherwise the flow ends.
Fig. 14 shows the erasure code stripe destaging flow chart. The CPU processor fetches a stripe from the erasure code queue (S1401) and writes it to disk using asynchronous IO (S1402). The CPU processor decides how many stripes to fetch from the erasure code queue according to the erasure pattern (XD+YP) of the storage component; for example, if the storage component uses a 10D+3P mode, three erasure code stripes are taken from the queue in turn for destaging. After the CPU processor has submitted all erasure code stripes (S1403-Y), it waits for the IO completion events returned by the asynchronous IO tasks (S1404). If an IO failure event is received (S1405-N), the bad stripe processing flow is entered first (S1409), and it is then determined whether the number of failed IOs exceeds the Quorum (S1410); here the IO failure count is accumulated together with the number of bad data stripes already recorded for the same erasure stripe. If the number of failed stripes exceeds the Quorum permitted by the storage component (for example, in the XD+YP erasure mode, more than Y stripe IOs have failed), the whole storage component 1032 is considered faulty and the exception handling flow of the storage component 1032 is entered (S1411). If the Quorum is not exceeded (S1410-Y), the processor continues waiting until all IO events are collected (S1406-Y) and then updates the Cache state (S1407): because the Cache pages pointed to by the erasure code queue are not mapped to any logical block, they can be moved directly to the free Cache queue. This completes one batch of erasure code stripe destaging; if there is more data in the erasure code queue, the flow returns to S1401 for the next batch.
Fig. 15 shows the bad stripe processing flow chart. The processor first records the state of the bad stripe in the reverse mapping table of the storage component 1032 (S1501). It then registers the disk failure count and saves it in the disk state table (S1502). After that, the CPU processor writes the data to a temporary disk (S1503) and records an entry in the temporary disk index table. This completes the bad stripe processing.
Fig. 16 shows the physical Chunk recovery flow chart. The CPU processor first allocates a temporary Cache space whose number of pages is twice the number of stripes to be recovered (S1601). It then reads the stripes of the erasure stripe in order, starting from the first, placing each into a temporary Cache page (S1602), performing the erasure operation with the previously accumulated temporary Cache page, and saving the result (S1603). Once a Quorum of stripes has been processed (S1604-Y), the lost stripe has been recovered in the temporary Cache page that holds the result of the erasure operation. The CPU processor writes the recovered data to the temporary disk (S1605), then updates the log-structured space allocation table and the reverse mapping table of the storage component 1032 with the physical block address on the temporary disk (S1606), and records the entry in the bad stripe index table (S1607).
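For the single-failure case, the accumulation described above reduces to XOR-ing the surviving stripes into a temporary Cache page, as sketched below; recovering from more than one loss (Y > 1) would instead require Reed-Solomon decoding. The function name and data are illustrative assumptions.

    # Sketch of recovering one lost stripe from the surviving members of an erasure stripe.
    def recover_lost_stripe(surviving_stripes):
        """XOR-accumulate surviving data and parity stripes to rebuild the lost one."""
        acc = bytearray(len(surviving_stripes[0]))   # temporary Cache page
        for stripe in surviving_stripes:             # read surviving stripes in order
            for i, b in enumerate(stripe):
                acc[i] ^= b                          # erasure operation with the
        return bytes(acc)                            # previously accumulated page

    X1, X3 = b"\x10\x20", b"\xff\x00"
    Y1 = bytes(a ^ b ^ c for a, b, c in zip(X1, b"\x01\x02", X3))  # XOR parity of X1, X2, X3
    X2_recovered = recover_lost_stripe([X1, X3, Y1])               # equals b"\x01\x02"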
Fig. 17 shows the disk recovery flow chart. The processor tracks the disk state table in a separate system task to identify bad disks; a disk is deemed bad when its failure count exceeds the failure threshold (S1701). The CPU processor then uses the storage component mapping table to determine the affected storage components and performs data recovery on them in turn. If the number of disks lost by a damaged storage component exceeds the Quorum (S1703-Y), the storage component exception handling flow is entered. If the Quorum is not exceeded (S1703-N), the CPU processor selects an unused disk from the disk cage to replace the bad disk in the storage component, and the storage component mapping table is updated accordingly. The CPU then scans the bad stripe index table and recovers in turn the bad stripes belonging to the damaged storage components (S1705): it reads the data from the corresponding location on the temporary disk and writes it to the replacement disk (S1706). The flow ends when all stripes in the bad stripe index table have been processed.
The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.

Claims (6)

1. A distributed storage system based on high-density storage hardware, characterized by comprising two nodes (101) and an SAS disk cage (103), wherein each node (101) comprises a CPU processor (1011), a memory (1012), an internal SAS adapter (1013), a system disk (1014) and an external SAS HBA card (1015); the SAS disk cage (103) comprises an SAS controller (1031) and two storage components (1032); each storage component (1032) is managed by only one node (101), the two storage components are logically in a subordination relationship, and the storage components (1032) are connected through a connection line (1016); the disks (1033) in the SAS disk cage (103) are grouped through system configuration and then distributed to the two different storage components (1032); each storage component (1032) comprises a plurality of Chunk Groups (1035) and a plurality of disks (1033), and each Chunk Group (1035) comprises a plurality of physical chunks, the number of the physical chunks being consistent with the number of the disks (1033); each disk (1033) is allocated one physical Chunk, and the physical chunks form an erasure group in an XD + YP manner, storing data as X data chunks plus Y erasure-code chunks; data blocks are read and written in parallel on the plurality of disks (1033) using a Chunk Group (1035), and after the read/write IO is finished the erasure codes are calculated and written into the erasure stripes, so as to achieve the purpose of data protection.
2. The distributed storage system based on high-density storage hardware as claimed in claim 1, characterized in that a storage component (1032) comprises an entire disk (1033) in the SAS disk cage (103), or comprises only part of the physical storage area on a disk (1033), and if two storage components (1032) comprise the same disk (1033), different storage areas on the disk (1033) are selected.
3. The distributed storage system based on high-density storage hardware according to claim 2, wherein the storage component (1032) maintains data on the multiple disks (1033) in an XD + YP erasure-coding manner, XD + YP indicating that X data blocks and Y erasure-code blocks are included in an erasure group; the erasure group tolerates the loss or corruption of up to Y data blocks and recovers the lost or corrupted data blocks using the remaining data blocks or erasure-code blocks in the group.
4. The distributed storage system based on high-density storage hardware according to claim 3, wherein the storage system constructs a set of logical-volume-based management interfaces and read-write IO access interfaces for upper-layer applications; the logical volume has a maximum capacity value and is gradually expanded in a thin-provisioning manner after being created; the basic addressing unit in the logical volume is a fixed-length logical block, several continuous logical blocks form a fixed-length logical Chunk, and the logical Chunk is consistent with the physical Chunk in length; the logical chunks in the logical volume are distributed to different storage components (1032) and are stored in a log-structured storage mode inside the storage components (1032).
5. The distributed storage system based on high-density storage hardware according to claim 4, characterized in that, in the SAS disk cage (103), it further includes a system data area (1036) formed by two SSD disks, and the SSD disks are data protected in RAID 0 mode.
6. The distributed storage system based on high-density storage hardware as claimed in claim 5, wherein said magnetic disk (1033) is a large-capacity magnetic disk.
CN202210814643.4A 2022-07-12 2022-07-12 Distributed storage system based on high-density storage hardware Active CN114895856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210814643.4A CN114895856B (en) 2022-07-12 2022-07-12 Distributed storage system based on high-density storage hardware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210814643.4A CN114895856B (en) 2022-07-12 2022-07-12 Distributed storage system based on high-density storage hardware

Publications (2)

Publication Number Publication Date
CN114895856A CN114895856A (en) 2022-08-12
CN114895856B true CN114895856B (en) 2022-09-16

Family

ID=82729744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210814643.4A Active CN114895856B (en) 2022-07-12 2022-07-12 Distributed storage system based on high-density storage hardware

Country Status (1)

Country Link
CN (1) CN114895856B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860858B (en) * 2023-09-01 2023-11-17 北京四维纵横数据技术有限公司 IO tracking method, device, equipment and medium for database operation level

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049225A (en) * 2013-01-05 2013-04-17 浪潮电子信息产业股份有限公司 Double-controller active-active storage system
JP2019128960A (en) * 2018-01-19 2019-08-01 三星電子株式会社Samsung Electronics Co.,Ltd. Data storage system, and method for accessing objects of key-value pair
CN110431531A (en) * 2017-03-16 2019-11-08 华为技术有限公司 Storage control, data processing chip and data processing method
CN111587420A (en) * 2017-11-13 2020-08-25 维卡艾欧有限公司 Method and system for rapid failure recovery of distributed storage system
CN111897486A (en) * 2020-06-08 2020-11-06 华北电力大学 Intelligent unified storage system based on software definition
CN113268374A (en) * 2020-01-29 2021-08-17 三星电子株式会社 Method for storing data, storage device and data storage system
CN113568580A (en) * 2021-07-29 2021-10-29 广州市品高软件股份有限公司 Method, device and medium for realizing distributed storage system and storage system

Also Published As

Publication number Publication date
CN114895856A (en) 2022-08-12

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant