WO2021017782A1 - Procédé d'accès à un système de stockage distribué, client et produit programme d'ordinateur - Google Patents

Procédé d'accès à un système de stockage distribué, client et produit programme d'ordinateur

Info

Publication number
WO2021017782A1
Authority
WO
WIPO (PCT)
Prior art keywords
hard disk
client
partition
data
Prior art date
Application number
PCT/CN2020/100814
Other languages
English (en)
Chinese (zh)
Inventor
杨瑞
陈虎
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2021017782A1 publication Critical patent/WO2021017782A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • the present invention relates to the field of information technology, in particular to a distributed storage system access method, a client and a computer program product.
  • the distributed storage system contains multiple storage nodes.
  • When the client writes data to the distributed storage system according to the write request sent by the host, the data is stored in striped form on the hard disks of the storage nodes in the partition to which the stripe belongs.
  • The partition includes hard disks from multiple storage nodes, and usually one storage node provides one hard disk for a given partition.
  • For example, when the partition uses an Erasure Coding (EC) mechanism, the number of data strips in a stripe is N and the number of check strips is M, so the stripe length is N+M, where N and M are both positive integers.
  • the client divides the data to be stored into data blocks of N data strips, and generates the check data of M check strips according to the EC algorithm, that is, the check block.
  • The client writes the striped data of length N+M to the hard disks of the corresponding N+M storage nodes.
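  • As an illustration of the striping described above, the following is a minimal sketch of dividing data into N data blocks and generating a single check block with simple XOR parity (i.e. M = 1); the function names, the fixed strip size, and the use of XOR instead of a full EC code such as Reed-Solomon are assumptions for the example, not the implementation prescribed here.

```python
# Minimal EC striping sketch: N data strips + 1 XOR parity strip (M = 1).
# Names, the strip size, and the XOR-only parity are illustrative assumptions;
# the text does not prescribe a particular EC algorithm.

STRIP_SIZE = 8 * 1024  # bytes per strip (assumed)

def make_stripe(data: bytes, n_data: int) -> list[bytes]:
    """Split data into n_data strip-sized data blocks and append one parity block."""
    padded = data.ljust(n_data * STRIP_SIZE, b"\x00")   # pad the last stripe with zeros
    blocks = [padded[i * STRIP_SIZE:(i + 1) * STRIP_SIZE] for i in range(n_data)]
    parity = bytearray(STRIP_SIZE)
    for block in blocks:                                 # XOR all data blocks together
        for i, b in enumerate(block):
            parity[i] ^= b
    return blocks + [bytes(parity)]                      # stripe of length N + M (M = 1)

# Example: a 3+1 stripe, one block per storage node hard disk.
stripe = make_stripe(b"host data to be stored", n_data=3)
assert len(stripe) == 4
```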
  • the distributed storage system records the correspondence between the partition and the hard disks of the storage nodes contained in the partition.
  • a partition will contain multiple strips.
  • the distributed storage system writes data, the strips are allocated according to the partition where the data is located.
  • the distributed storage system reads data, it determines the partition where the read data is located according to the storage address of the read data, and reads the data from the corresponding partition.
  • the above-mentioned data read and write access methods of the distributed storage system involve multiple hard disks.
  • If slow disks appear among the multiple hard disks, the access delay will be increased by the slow disks, thereby affecting the access performance of the distributed storage system.
  • A prior art distributed storage system isolates a slow disk only after it detects and confirms that the disk is a slow disk; that is, the slow disk is then deleted from the partition and the client no longer accesses it. During the detection and confirmation period, however, the slow disk continues to be accessed, which affects the access performance of the distributed storage system.
  • This application provides a distributed storage system access method, client and computer program product, which avoids the problem of increased access delay caused by slow disk detection in the distributed storage system, and improves the access performance of the distributed storage system.
  • the present invention provides a distributed storage system.
  • The distributed storage system includes a client and a plurality of storage nodes, and each storage node includes a hard disk; the distributed storage system includes partitions, and a partition includes multiple hard disks located in different storage nodes. The method includes: the client detects the performance of the multiple hard disks in the partition; when the performance of a first hard disk is abnormal but it has not been confirmed that the first hard disk is a slow disk, the client sends an access request to one or more hard disks in the partition other than the first hard disk; the first hard disk is the hard disk on a first storage node in the partition.
  • With this solution, when abnormal hard disk performance (such as increased access delay) is detected but the hard disk has not yet been confirmed to be a slow disk, the abnormal hard disk is no longer accessed. This avoids the problem of increased access delay caused by slow disk detection in the distributed storage system and improves the access performance of the distributed storage system.
  • the method further includes: the client detecting whether the first hard disk with abnormal performance is a slow disk.
  • the client detects whether the first hard disk with abnormal performance is a slow disk.
  • the detection by the client of whether the first hard disk with abnormal performance is a slow disk specifically includes: the client sends an access detection request to the first hard disk for determining whether the first hard disk is a slow disk .
  • the partition uses a multiple copy redundancy mechanism; the client sends an access request to one or more hard disks in the partition except the first hard disk, which specifically includes: The client sends a read request to the second hard disk in the second storage node in the partition; the second hard disk stores a copy of the data of the first hard disk.
  • the partition uses a multiple copy redundancy mechanism.
  • the client accesses the data in the second hard disk in the partition. Because the partition uses multiple copy redundancy, the first hard disk and the second hard disk in the partition store the same data copy.
  • the partition uses an erasure code mechanism; the client sends an access request to one or more hard disks in the partition except the first hard disk, which specifically includes: The client sends a read request to other hard disks in the partition except the first hard disk; the client restores the data in the first hard disk according to the erasure code mechanism and the data read from the other hard disks data.
  • the partition uses the erasure code mechanism.
  • the client accesses the data in other hard disks in the partition, according to the erasure code mechanism and from the The data read by other hard disks restores the data in the first hard disk, thereby avoiding access delay caused by accessing the first hard disk and improving the access performance of the distributed storage system.
  • the partition uses a multiple copy redundancy mechanism; the client sends an access request to one or more hard disks in the partition except the first hard disk, which specifically includes: The client respectively sends write requests carrying the same data to other hard disks in the partition.
  • the partition uses a multi-copy redundancy mechanism.
  • the client sends a write request carrying the same data to the hard disks of other storage nodes in the partition, thereby It realizes that other hard disks in the partition store data copies, avoids writing data to the first hard disk, thereby avoiding access delay caused by accessing the first hard disk, and improving the access performance of the distributed storage system.
  • The partition uses an erasure code mechanism; the client sending an access request to the hard disks in one or more storage nodes other than the first hard disk in the partition specifically includes: the client determines the data blocks according to the number of check blocks in the erasure code mechanism and the number of the other hard disks in the partition; the client calculates the check block based on the number of check blocks in the erasure code mechanism and the data blocks; the client sends a write request carrying the check block to the hard disk storing the check block among the other hard disks, and sends write requests carrying the data blocks to the hard disks storing the data blocks among the other hard disks.
  • The detection by the client of the performance of the multiple hard disks in the partition specifically includes: the client obtains the delays of the access requests sent to the multiple storage nodes; the client compares the delays of the access requests sent to the multiple storage nodes to determine the performance of the multiple hard disks in the partition.
  • the client side judges the performance of at least two hard drives in the partition by comparing the latency of access requests sent to multiple storage nodes.
  • the present invention provides a client in a distributed storage system
  • the distributed storage system includes a client and a plurality of storage nodes, each storage node includes a hard disk;
  • the distributed storage system includes partitions, The partition includes a plurality of hard disks; the plurality of hard disks are respectively located in different storage nodes, and the client includes a unit that implements the solution of the first aspect and any one of the possible implementation manners of the first aspect.
  • the present invention provides a client in a distributed storage system.
  • the distributed storage system includes a client and multiple storage nodes, each storage node includes a hard disk; the distributed storage system includes partitions, The partition includes multiple hard disks; the multiple hard disks are located in different storage nodes, the client includes an interface and a processor, and the interface communicates with the processor; the processor is used to implement the first aspect and Any possible implementation scheme of the first aspect.
  • The present invention provides a computer program product containing computer instructions. The computer program product can be applied to a client in a distributed storage system, where the distributed storage system contains the client and multiple storage nodes, and each storage node includes a hard disk; the distributed storage system includes a partition, and the partition includes a plurality of hard disks located in different storage nodes. The processor of the client executes the computer instructions to implement the first aspect and any possible implementation solution of the first aspect.
  • the present invention provides a computer-readable storage medium containing computer instructions, and the computer-readable storage medium can be applied to a client in a distributed storage system.
  • the storage system includes the client and multiple storage nodes, each storage node includes a hard disk; the distributed storage system includes partitions, and the partition includes multiple hard disks; the multiple hard disks are located in different storage nodes;
  • the processor of the client executes the computer instructions to implement the first aspect and any possible implementation solution of the first aspect.
  • Figure 1 is a schematic diagram of a distributed storage system according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a server structure in a distributed block storage system according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a partition view of a distributed block storage system according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the relationship between stripes and storage nodes in a distributed block storage system according to an embodiment of the present invention
  • FIG. 5 is a flowchart of a method for a client to write data in a distributed block storage system according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of a client side determining a partition in a distributed block storage system according to an embodiment of the present invention
  • FIG. 7 is a flowchart of a method for a storage node to write striped data according to an embodiment of the present invention
  • FIG. 8 is a schematic flowchart of an access method according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a client structure in a distributed storage system according to an embodiment of the present invention.
  • The distributed storage system in the embodiment of the present invention may be, for example, a distributed storage system of a particular product series, as shown in Figure 1.
  • the distributed storage system includes multiple servers, such as server 1, server 2, server 3, server 4, server 5, and server 6, and the servers communicate with each other through InfiniBand or Ethernet.
  • the server in the distributed storage system is also called a storage node.
  • the number of servers in the distributed storage system can be increased according to actual needs, which is not limited in the embodiment of the present invention.
  • each server in the distributed storage system contains the structure shown in Figure 2.
  • each server in the distributed storage system includes a central processing unit (CPU) 201, a memory 202, an interface 203, a hard disk 1, a hard disk 2, and a hard disk 3.
  • the memory 202 stores computer instructions.
  • the CPU 201 executes computer instructions in the memory 202 to perform corresponding operations.
  • the interface 203 may be a hardware interface, such as a network interface card (NIC) or a host bus adaptor (HBA), etc., or a program interface module, etc.
  • Hard disks include solid state disks (Solid State Disk, SSD) and/or mechanical hard disks (Hard Disk Drive, HDD).
  • The hard disk interface can be a Serial Advanced Technology Attachment (SATA) interface, a Serial Attached Small Computer System Interface (SAS) interface, a Fibre Channel (FC) interface, a Peripheral Component Interconnect Express (PCIe) interface, and so on.
  • the CPU 201 can be replaced by a Field Programmable Gate Array (FPGA) or other hardware, or the FPGA or other hardware and the CPU 201 can perform the aforementioned corresponding operations together.
  • the embodiment of the present invention refers to the CPU 201 or the hardware replacing the CPU 201 as a processor, or the combination of the CPU 201 and other hardware as a processor.
  • the client in the distributed storage system writes data to the distributed storage system according to the host write request or reads data from the distributed storage system according to the host read request.
  • the client provides an access interface to the host, for example, provides at least one of a block access interface, a file access interface, and an object access interface.
  • the client provides block resources, such as logical unit (LU), for the host to provide data access operations for the host.
  • the logical unit is also called logical unit number (Logical Unit Number, LUN).
  • the client provides file resources for the host
  • a distributed object storage system the client provides object resources for the host.
  • The client in the embodiment of the present invention may run on the server shown in FIG. 2.
  • the specific form of the host in the embodiment of the present invention may be a server, a virtual machine (VM), a terminal device, etc., which is not limited in the embodiment of the present invention.
  • the client of the distributed storage system provides the storage resources of the distributed storage system for the host.
  • the embodiment of the present invention takes a distributed block storage system as an example.
  • The client provides a block protocol access interface, so that the client provides a distributed block storage access point service, and the host can access, through the client, the storage resources in the storage resource pool of the distributed block storage system.
  • this block protocol access interface is used to provide LUNs to the host.
  • The server runs the distributed block storage system program so that the server containing the hard disks serves as a storage node for storing the data of the host.
  • The hash space (such as 0 to 2^32) is divided into N equal parts, each equal part is a partition (Partition), and the N equal parts are evenly distributed according to the number of hard disks.
  • N defaults to 3600, that is, the partitions are P1, P2, P3...P3600.
  • each storage node carries 200 partitions.
  • Each partition contains several strips.
  • Partition P contains M storage nodes N_j, and the correspondence between the partition and the storage space provided by the hard disks in the storage nodes N_j is also called the partition view. As shown in Figure 3, taking a partition whose 4 storage nodes N_j each provide a hard disk D_j as an example, the partition view is "P2-(N_1-D_1)-(N_2-D_2)-(N_3-D_3)-(N_4-D_4)", where j takes each integer value from 1 to M. The partition view is allocated when the distributed block storage system is initialized, and is adjusted as the number of hard disks in the distributed block storage system changes.
  • "P2-(N_1-D_1)-(N_2-D_2)-(N_3-D_3)-(N_4-D_4)" describes that partition P2 includes 4 hard disks, D_1, D_2, D_3 and D_4, which come from storage node N_1, storage node N_2, storage node N_3 and storage node N_4 respectively.
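  • A partition view such as the one above can be pictured as a simple lookup table; the sketch below is illustrative only (the field names are assumptions), showing how a client might resolve partition P2 to the storage node and hard disk backing each strip.

```python
# Illustrative in-memory form of the partition view
# "P2-(N1-D1)-(N2-D2)-(N3-D3)-(N4-D4)"; field names are assumptions.
partition_view = {
    "P2": [
        {"node": "N1", "disk": "D1"},
        {"node": "N2", "disk": "D2"},
        {"node": "N3", "disk": "D3"},
        {"node": "N4", "disk": "D4"},
    ]
}

def disks_in_partition(partition_id: str) -> list[str]:
    """Return the hard disks that back a partition, one per storage node."""
    return [member["disk"] for member in partition_view[partition_id]]

assert disks_in_partition("P2") == ["D1", "D2", "D3", "D4"]
```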
  • Partitions can use an erasure coding (EC) mechanism to improve data reliability, for example the 3+1 mode, in which 3 data strips (a strip is also called a strip unit) and 1 check strip constitute a stripe in the partition.
  • The partition stores data in the form of stripes; a partition contains R stripes S_i, where i takes each integer value from 1 to R.
  • Take partition P2 as an example.
  • the distributed block storage system manages the hard disk slices in units of 8 kilobytes (KB), and records the allocation information of each 8KB slice in the metadata management area of the hard disk.
  • the slices of the hard disk form a storage resource pool.
  • the distributed block storage system includes a striping server, and the specific implementation can be that the striping management program runs on one or more servers in the distributed block storage system.
  • the striping server assigns strips to the partition.
  • The striping server allocates storage addresses, that is, storage space, for the strips SU_ij of the stripe S_i of partition P2 from the hard disks D_j of the storage nodes N_j corresponding to the partition. This specifically includes: allocating a storage address for SU_i1 from the hard disk D_1 of storage node N_1, allocating a storage address for SU_i2 from the hard disk D_2 of storage node N_2, allocating a storage address for SU_i3 from the hard disk D_3 of storage node N_3, and allocating a storage address for SU_i4 from the hard disk D_4 of storage node N_4.
  • In another implementation, the striping server does not allocate in advance, for the stripe S_i of partition P2, the storage addresses of the strips SU_ij from the hard disks D_j of the storage nodes N_j corresponding to partition P2; instead, when the client writes data to the storage nodes, the storage address of the strip SU_ij is allocated from the hard disk D_j of the storage node N_j.
  • The storage address allocated to the strip SU_ij may be a logical address of the hard disk D_j of the storage node N_j, such as a logical block address (Logical Block Address, LBA); in another implementation, for an SSD that supports Open-channel, the storage address that the striping server allocates to the strip SU_ij from the hard disk D_j of the storage node N_j may be a physical address of the SSD.
  • the hard disk in the storage node N j is a LUN, that is, a LUN mounted on the storage node.
  • The striping metadata records the mapping relationship between the stripe identifier and the strip identifiers, that is, the correspondence between the stripe S_i and the strips SU_ij allocated from the hard disks D_j of the storage nodes N_j. According to this correspondence, the strips SU_ij contained in the stripe S_i can be found. Further, the striping server also records the correspondence between each strip SU_ij and its stripe S_i; according to this correspondence, the stripe S_i can be found from a strip SU_ij, so as to query the stripe information, for example, all the strips SU_ij contained in S_i.
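  • The stripe-to-strip correspondence described above can be held in two maps; the sketch below is a simplified assumption of how a striping server might record it so that a stripe can be resolved to its strips and a strip resolved back to its stripe.

```python
# Simplified sketch of striping-server metadata (structure and names are assumptions):
# one map from stripe identifier to its strip identifiers, plus the reverse map.
stripe_to_strips: dict[str, list[str]] = {}
strip_to_stripe: dict[str, str] = {}

def record_stripe(stripe_id: str, strip_ids: list[str]) -> None:
    """Record that the stripe is composed of the given strips."""
    stripe_to_strips[stripe_id] = list(strip_ids)
    for strip_id in strip_ids:
        strip_to_stripe[strip_id] = stripe_id

record_stripe("S1", ["SU11", "SU12", "SU13", "SU14"])
assert strip_to_stripe["SU13"] == "S1"        # strip -> stripe lookup
assert stripe_to_strips["S1"][0] == "SU11"    # stripe -> strips lookup
```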
  • In one implementation, the striping server assigns a version number to the strip identifiers in a stripe. When a stripe is released, the version numbers of the strip identifiers of the strips in the released stripe are updated, so that they can serve as strip identifiers of strips in a new stripe.
  • the stripe server allocates strips SU ij to the stripe S i in advance, which can reduce the waiting time when the client writes data, thereby improving the write performance of the distributed block storage system.
  • the stripe SU ij in the stripe S i has a unique identifier in the distributed block storage system.
  • the strip SU ij in the stripe S i is only a section of storage space before the client writes data.
  • When the client receives data from the host, it generates, for the data, striped data containing M strips SU_Nj based on the erasure code (EC) algorithm, where the M strips SU_Nj include L data strips SU_Nx and (M-L) check strips.
  • L and M are both positive integers, and L is less than M.
  • the length of the check strip is the same as the length of the data strip.
  • The client divides the data of the host into data blocks of the L data strips, and generates, based on the EC algorithm, the check data of the (M-L) check strips for the data of the L data strips, that is, the check blocks.
  • the storage node storing the data block of the data strip is called the data storage node
  • the storage node storing the check block of the check strip is called the check storage node.
  • the client provides the host with a logical unit allocated by the distributed block storage system, and the host mounts the LUN provided by the client, thereby providing the host with a data access operation.
  • the access address of the host is the LBA of the LUN.
  • the client performs the following steps:
  • Step 501 The client receives a write request sent by the host.
  • the client receives the write request sent by the host, and the write request contains the data of the host and the host access address of the data.
  • the host accesses the LBA of the LUN.
  • Step 502 The client determines the partition P where the data of the host is located.
  • partition P2 is taken as an example.
  • the client stores a partition view of the distributed block storage system. As shown in Figure 6, according to the partition view, determine the partition where the data is located.
  • The client generates a key according to the LBA of the LUN of the data, calculates the hash value of the key according to a hash algorithm, and determines the partition corresponding to the hash value, thereby determining that the LBA of the LUN is distributed in partition P2; this is also called the data being distributed in partition P2.
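  • As a concrete illustration of this step, the sketch below derives a key from the LUN and LBA and maps its hash onto one of the 3600 partitions mentioned above; the key layout and the choice of hash function are assumptions for the example, not the scheme prescribed here.

```python
import hashlib

NUM_PARTITIONS = 3600  # default partition count mentioned above

def partition_for(lun_id: str, lba: int) -> int:
    """Map (LUN, LBA) to a partition index; key format and hash are assumed."""
    key = f"{lun_id}:{lba}".encode()
    hash_value = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return hash_value % NUM_PARTITIONS  # e.g. the data is distributed in partition P2 if this is 2

print(partition_for("LUN-7", 0x1000))
```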
  • Step 503: The client obtains a stripe S_N from the R stripes, where N is a value from 1 to R.
  • The striping server manages the correspondence between the partitions and the stripes, and records the mapping between each stripe and its strips and the mapping between the strips in a stripe and the hard disks of the storage nodes.
  • the client obtains a stripe SN from R strips.
  • the client determines that the LBA of the LUN is distributed in the partition P2.
  • The client queries the striping server to obtain one stripe S_N of the R stripes contained in partition P2. Because the LBA of the LUN is the address at which the client writes data in the distributed block storage system, the distribution of the LBA of the LUN in the partition P has the same meaning as the distribution of the data to be stored in the partition P.
  • In one implementation, the client obtains the stripe S_N from the striping server from among the R stripes; in another implementation, the client obtains the stripe S_N from stripes among the R stripes that the striping server has already allocated to the client.
  • Step 504: The client divides the data of the host into data blocks of the data strips SU_Nj of the stripe S_N.
  • the stripe S N is composed of strips.
  • The client divides the data of the host according to the size of the strips in the stripe; for example, the client divides the data of the host according to the length of the strips in the stripe to obtain data blocks of the strip size.
  • In one implementation, the LBA of the LUN of a strip-sized data block is taken modulo the number M of storage nodes in the partition (e.g. 4), thereby determining the position of the strip-sized data block in the stripe, that is, the strip SU_Nj to which the data block belongs.
  • The strip SU_Nj corresponds to the hard disk D_j of the storage node N_j, so that the data blocks of a given LBA of the LUN are distributed on the hard disk D_j of the same storage node.
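  • The placement just described can be sketched as follows; the strip size, the divisor of 4, and the disk list are assumptions used only to illustrate that a strip-sized block at a given LBA always maps to the same strip SU_Nj and therefore to the hard disk D_j of the same storage node.

```python
STRIP_SIZE = 8 * 1024                # assumed strip size in bytes
DISKS = ["D1", "D2", "D3", "D4"]     # hard disks of partition P2, one per storage node

def strip_slot(lba_bytes: int, num_strips: int = 4) -> int:
    """Position of a strip-sized data block within the stripe (illustrative modulo rule)."""
    return (lba_bytes // STRIP_SIZE) % num_strips

def disk_for(lba_bytes: int) -> str:
    """Hard disk D_j corresponding to the strip SU_Nj that holds the block."""
    return DISKS[strip_slot(lba_bytes)]

# A given LBA always lands on the same disk; consecutive strip-aligned LBAs rotate over the disks.
assert disk_for(0) == "D1" and disk_for(STRIP_SIZE) == "D2"
```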
  • The data of the host is divided into data blocks of one or more strips SU_Nj.
  • For example, the stripe S_N includes 4 strips, namely SU_N1, SU_N2, SU_N3 and SU_N4, where SU_N1, SU_N2 and SU_N3 are data strips and SU_N4 is a check strip.
  • The check block of the check strip SU_N4 is generated from the data blocks of the data strips SU_N1, SU_N2 and SU_N3.
  • The check data block of the check strip SU_N4 is also referred to as the check block.
  • The data blocks of the data strips and the check block of the check strip are both referred to as the data of the strips SU_Nj.
  • Step 505: The client sends the data of the strips SU_Nj to the hard disks D_j of the corresponding storage nodes N_j.
  • The data of a strip SU_Nj further includes metadata, for example, the identifier of the strip SU_Nj and the host access address of the data block of the data strip SU_Nj.
  • the LUN here refers to the LUN mounted on the host.
  • the storage node N j performs the following steps:
  • Step 701: The storage node N_j receives the data of the strip SU_Nj in the stripe S_N sent by the client.
  • For example, data storage node N_1 receives the data block of data strip SU_N1 sent by the client, data storage node N_2 receives the data block of data strip SU_N2 sent by the client, data storage node N_3 receives the data block of data strip SU_N3 sent by the client, and check storage node N_4 receives the check block of check strip SU_N4 sent by the client.
  • Step 702: The storage node N_j stores the data of the strip SU_Nj to a storage address of the hard disk D_j; the data storage node N_x establishes a mapping among the identifier of the data strip SU_Nx, the host access address of the data block of SU_Nx, and the storage address on the hard disk D_x.
  • the value of X is an integer from 1 to L, where L is an integer and L is less than M.
  • L is the number of data strips in the stripe S N
  • M is the number of strips in the stripe S N.
  • the host access address of the data block of the data strip SU Nx is the LBA of the LUN mounted by the host.
  • The storage node N_j stores the data of the strip SU_Nj to the storage address of the hard disk D_j; specifically, to the storage address allocated by the hard disk D_j of the storage node N_j for the strip SU_Nj.
  • the storage address of the hard disk D j is the logical address of the hard disk; in the other implementation, in the SSD that supports Open-channel, it is the physical address of the SSD.
  • the hard disk D j can also be a LUN in a storage array.
  • For example, the storage address allocated by hard disk D_1 of data storage node N_1 for the data block of data strip SU_N1 is Ad1, the storage address allocated by hard disk D_2 of data storage node N_2 for the data block of data strip SU_N2 is Ad2, the storage address allocated by hard disk D_3 of data storage node N_3 for the data block of data strip SU_N3 is Ad3, and the storage address allocated by hard disk D_4 of check storage node N_4 for the check block of check strip SU_N4 is Ad4.
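  • The mapping established in step 702 can be pictured as a small per-node index; the sketch below is an assumed, simplified form of that record, reusing the addresses Ad1 to Ad3 from the example (the host LBA values are invented for illustration).

```python
# Assumed per-node index for step 702: data strip id -> (host access address, disk, storage address).
# The host_lba values are invented for illustration; Ad1..Ad3 reuse the example above.
strip_index = {
    "SU_N1": {"host_lba": 0x2000, "disk": "D1", "addr": "Ad1"},
    "SU_N2": {"host_lba": 0x4000, "disk": "D2", "addr": "Ad2"},
    "SU_N3": {"host_lba": 0x6000, "disk": "D3", "addr": "Ad3"},
}

def locate(strip_id: str) -> tuple[str, str]:
    """Resolve a data strip to the hard disk and storage address holding its data block."""
    entry = strip_index[strip_id]
    return entry["disk"], entry["addr"]

assert locate("SU_N2") == ("D2", "Ad2")
```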
  • the storage node in the partition view includes the main storage node.
  • the client writes data to the storage node according to the partition view
  • The data is first sent to the main storage node, and the main storage node sends the data to the other storage nodes.
  • the data block and the check block are sent to the hard disk of the storage node where the corresponding strip is located.
  • the host initiates a read request, and the client reads data from the hard disk D j of the corresponding storage node N j in the partition view according to the read request of the host and the partition view and the address carried in the read request.
  • the read request of the host includes the LBA of the LUN storing the read data.
  • the LUN here refers to the LUN mounted on the host.
  • The stripes in the partition may also include stripes composed of multiple-copy strips.
  • In that case, the strips SU_ij in the stripe are all data strips, and the data of the strips SU_ij is the same; that is, the partition can use either the EC mechanism or the multiple copy redundancy mechanism.
  • The embodiments of the present invention are also applicable to a distributed storage system that does not perform stripe management based on partitions.
  • Multiple clients may access data of the same strip stored in the same storage node, such as data blocks of a data strip; or multiple hosts can mount the same LUN, and the multiple hosts can access data blocks of the same data strip stored on the same storage node.
  • the client sends the striped data to the storage node.
  • If a hard disk of a storage node becomes a slow disk, the access performance of the distributed storage system will decrease.
  • the client of the distributed storage system will check the hard disk of the storage node to determine whether it is a slow disk.
  • Slow disks are hard disks with consistently low performance. It should be noted that the hard disks belonging to the same partition originally have similar delays, but due to hard disk failure or other reasons, the delay of an individual hard disk may increase significantly; such a hard disk with a continuous increase in delay is called a slow disk.
  • the present invention provides an embodiment as shown in FIG. 8, including:
  • Step 801 The client detects the performance of multiple hard disks in the partition.
  • the client obtains the delay of access requests sent to multiple storage nodes, and compares the delays of access requests sent to multiple storage nodes to determine the performance of the multiple hard disks.
  • The client side compares the access request latency of multiple storage nodes (an access request to a storage node is specifically an access request to a hard disk, so the latency of a storage node's access request can describe the access delay of the hard disk in that storage node) to detect the performance of at least two hard disks in the partition.
  • the access delay of hard disk A is 10ms
  • the access delay of hard disk B is 500ms
  • Since there is a significant difference between the two, it can be concluded that the performance of hard disk B is abnormal.
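  • A minimal sketch of this comparison is shown below; the use of the median and the factor-of-10 threshold are assumptions chosen only to illustrate flagging a hard disk whose delay is far above that of its peers in the same partition, pending a separate slow-disk check.

```python
from statistics import median

def abnormal_disks(latencies_ms: dict[str, float], factor: float = 10.0) -> list[str]:
    """Flag disks whose access delay greatly exceeds the partition's median delay.

    latencies_ms maps each hard disk in the partition to the observed delay of access
    requests sent to its storage node. The factor-of-10 rule is an illustrative assumption.
    """
    baseline = median(latencies_ms.values())
    return [disk for disk, latency in latencies_ms.items() if latency > factor * baseline]

# Example echoing the text: ~10 ms peers versus one disk at 500 ms.
print(abnormal_disks({"A": 10, "B": 500, "C": 11, "D": 9}))  # -> ['B']
```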
  • the embodiment of the present invention takes the partition P2 shown in FIG. 3 as an example, and the client detects the performance of the hard disks D 1 , D 2 , D 3 and D 4 .
  • Step 802 When the performance of the first hard disk in the partition is abnormal but it is not confirmed that the first hard disk is a slow disk, the client sends an access request to one or more hard disks in the partition except the first hard disk.
  • the first hard disk is D 1 .
  • The abnormal performance of D_1 can be a sudden increase in access delay, a decrease in processing bandwidth, or the like.
  • the embodiment of the present invention does not limit this.
  • When abnormal performance of a hard disk in the partition is detected, one or more other hard disks in the partition are accessed instead, and the hard disk with abnormal performance is not accessed. It is no longer necessary to first detect and confirm that the hard disk with abnormal performance is a slow disk before stopping access to it; access to the hard disk is stopped as soon as the abnormal performance is detected, thereby avoiding accesses to the hard disk with abnormal performance and improving the performance of the distributed storage system.
  • the hard disk with abnormal performance may or may not be a slow disk.
  • Access to the hard disk with abnormal performance is stopped, and only the hard disks in the partition whose performance is not abnormal are accessed.
  • the embodiment of the present invention can further detect whether the first hard disk with abnormal performance is a slow disk. Because the embodiment of the present invention detects that the performance of the first hard disk is abnormal, it no longer sends an access request to the first hard disk.
  • the client may send a detection request to the first hard disk to determine whether the first hard disk is a slow disk.
  • the slow disk is isolated from the distributed storage system and is no longer allocated to partitions for use. For example: when partitioning, the slow disk is "invisible" to the storage system.
  • a new hard disk can be added to a partition to replace the slow disk, that is, the first hard disk.
  • the partition removes the slow disk from the partition, that is, reduces the number of hard disks in the partition.
  • In a scenario where the access request is a read request and the performance of the first hard disk D_1 is abnormal, the client sending an access request to one or more hard disks in the partition other than the first hard disk specifically includes: the client sends a read request to the second hard disk in the second storage node in the partition.
  • the second storage node is N 2
  • the second hard disk is D 2 .
  • the client can send a read request to D 2 to avoid access to D 1 , but still obtain the same data copy as the data stored in D 1 , thereby avoiding the access delay caused by access to D 1 , Improve the performance of the distributed storage system.
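  • A sketch of this read path under the multi-copy mechanism follows; the function names and the read interface are assumptions, and the point illustrated is simply that the client skips the hard disk marked abnormal and reads the same data copy from another hard disk in the partition.

```python
def read_with_redirect(partition_disks, abnormal, read_block, block_id):
    """Read one data copy, skipping any hard disk whose performance is abnormal.

    partition_disks: ordered hard disks of the partition, e.g. ["D1", "D2", "D3"]
    abnormal: set of disks currently flagged as having abnormal performance
    read_block: assumed storage-node read interface, callable (disk, block_id) -> bytes
    """
    for disk in partition_disks:
        if disk in abnormal:
            continue                      # e.g. skip D1 and read the copy stored on D2
        return read_block(disk, block_id)
    raise IOError("no healthy copy available in the partition")

# Usage sketch: data = read_with_redirect(["D1", "D2", "D3"], {"D1"}, read_block, 42)
```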
  • the access request is a write request
  • The client detects that the performance of D_1 is abnormal and sends write requests carrying the same data to the hard disks of the other storage nodes in the partition, namely to hard disks D_2, D_3 and D_4 (and does not send a write request to D_1), so that multiple copies of the data are stored on multiple hard disks.
  • This avoids first detecting and confirming that the hard disk with abnormal performance is a slow disk before stopping access to it, and therefore avoids the access delay caused by sending a write request to hard disk D_1, thereby improving the performance of the distributed storage system.
  • Each stripe in the partition P2 includes data strips and a check strip.
  • The client sending an access request to one or more hard disks in the partition other than the first hard disk specifically includes: the client sends read requests to the other hard disks in the partition except the first hard disk; the client recovers the data in the first hard disk according to the erasure code mechanism and the data read from the other hard disks.
  • Taking D_1 as the first hard disk as an example: when abnormal performance of D_1 is detected, the read request is no longer sent to D_1; the client sends read requests to D_2, D_3 and D_4 and recovers the data of D_1 according to the read data and the erasure code mechanism. The client thus stops sending read requests to D_1 before D_1 is confirmed to be a slow disk, avoiding the increased access delay caused by accessing D_1 and improving the access performance of the distributed storage system.
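  • For the 3+1 example, if the check block is assumed to be a simple XOR of the three data blocks (any linear EC code permits an analogous reconstruction), the block on D_1 can be rebuilt from the blocks read from D_2, D_3 and D_4, as sketched below.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# 3+1 stripe with XOR parity (illustrative assumption): parity = d1 ^ d2 ^ d3.
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d1, d2, d3)

# D1 is skipped because its performance is abnormal: read d2, d3 and the parity
# from D2, D3 and D4, then recover D1's block without ever accessing D1.
recovered_d1 = xor_blocks(d2, d3, parity)
assert recovered_d1 == d1
```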
  • The client sending an access request to one or more hard disks in the partition other than the first hard disk specifically includes: the client determines the data blocks according to the number of check blocks in the erasure code mechanism and the number of hard disks of the other storage nodes in the partition; the client calculates the check block based on the number of check blocks in the erasure code mechanism and the data blocks; the client sends a write request carrying the check block to the hard disk storing the check block among the other hard disks, and sends write requests carrying the data blocks to the hard disks storing the data blocks among the other hard disks.
  • Still take D_1 as the first hard disk as an example.
  • When the performance of the first hard disk is normal, the partition contains 4 hard disks: D_1, D_2, D_3 and D_4.
  • the partition uses an erasure code mechanism.
  • the strip in the partition contains 3 data strips and 1 check strip, and one strip can store 3 data blocks and 1 check block.
  • D 1 stores data block 1
  • D 2 stores data block 2
  • D 3 stores data block 3
  • D 4 stores check blocks.
  • the check block is the check data of data block 1, data block 2, and data block 3.
  • When the performance of D_1 is abnormal, the check block sent by the client to D_4 is the check data of data block 2 and data block 3 only.
  • The embodiment of the present invention ensures that the number of check blocks in the partition is not reduced and only reduces the number of data blocks in the partition, that is, only reduces the number of data strips in a stripe without reducing the number of check strips, thereby ensuring the reliability of the distributed storage system.
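  • The degraded write just described can be sketched as follows, again using XOR parity as a stand-in for the unspecified EC algorithm and assumed interface names: with D_1 excluded, the client forms only data block 2 and data block 3, computes the check block over those two blocks, and still writes the check block to D_4.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (stand-in for the EC check computation)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def degraded_stripe_write(data_blocks, healthy_data_disks, parity_disk, write_block):
    """Write a stripe while one data disk (e.g. D1) is excluded for abnormal performance.

    data_blocks: reduced set of data blocks (here 2 instead of 3)
    healthy_data_disks / parity_disk: remaining disks of the partition, e.g. ["D2", "D3"] and "D4"
    write_block: assumed storage-node write interface, callable (disk, block) -> None
    """
    check_block = xor_blocks(*data_blocks)             # check data of data blocks 2 and 3 only
    for disk, block in zip(healthy_data_disks, data_blocks):
        write_block(disk, block)                       # data blocks go to D2 and D3
    write_block(parity_disk, check_block)              # the check block still goes to D4

# Usage sketch: degraded_stripe_write([block2, block3], ["D2", "D3"], "D4", write_block)
```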
  • the embodiment of the present invention further provides a client in a distributed storage system as shown in FIG. 9.
  • The distributed storage system includes a client and a plurality of storage nodes, and each storage node includes a hard disk; the distributed storage system includes partitions, and a partition includes multiple hard disks located in different storage nodes. The client includes a detection unit 901 and a sending unit 902, wherein the detection unit 901 is used to detect the performance of the multiple hard disks in the partition, and the sending unit 902 is configured to send an access request to one or more hard disks in the partition other than the first hard disk when the performance of the first hard disk is abnormal but the first hard disk has not been confirmed to be a slow disk; the first hard disk is a hard disk on a first storage node in the partition.
  • the detection unit 901 of the client shown in FIG. 9 is also used to detect whether the first hard disk with abnormal performance is a slow disk.
  • the detecting unit 901 is specifically configured to send an access detection request to the first hard disk for determining whether the first hard disk is a slow disk.
  • the partition uses a multiple copy redundancy mechanism
  • the sending unit 902 is specifically configured to send a read request to the second hard disk in the second storage node in the partition; the second hard disk storage There is a copy of the data of the first hard disk.
  • the partition uses a multiple-copy redundancy mechanism
  • the sending unit 902 is specifically configured to send write requests carrying the same data to other hard disks in the partition.
  • the partition uses an erasure code mechanism
  • The sending unit 902 is specifically configured to send read requests to the other hard disks in the partition except the first hard disk; the client also includes a recovery unit, which is used to restore the data in the first hard disk according to the erasure code mechanism and the data read from the other hard disks.
  • the partition uses an erasure code mechanism
  • The client also includes a determination unit and a calculation unit, wherein the determination unit is used to determine the data blocks according to the number of check blocks in the erasure code mechanism and the number of the other hard disks in the partition; the calculation unit is configured to calculate the check block based on the number of check blocks in the erasure code mechanism and the data blocks; the sending unit 902 is specifically configured to send a write request carrying the check block to the hard disk storing the check block among the other hard disks, and to send write requests carrying the data blocks to the hard disks storing the data blocks among the other hard disks.
  • the detection unit 901 is specifically configured to obtain the time delay of the access request sent to the multiple storage nodes, and compare the time delay of the access request sent to the multiple storage nodes to determine the The performance of multiple hard drives.
  • Another implementation manner of the client in the distributed storage system provided by the embodiment of the present invention includes an interface and a processor, and the interface communicates with the processor, and the processor is used to implement various solutions executed by the client in the embodiment of the present invention.
  • the identifiers used in the description of the stripe, data stripe, check stripe, and storage node in the embodiment of the present invention are only used to describe the embodiment of the present invention more clearly.
  • The actual product implementation does not require similar identifiers, so the identifiers used in the embodiments of the present invention to describe stripes, data strips, check strips, and storage nodes do not limit the present invention.
  • the embodiments of the present invention also provide a computer-readable storage medium and a computer program product.
  • the computer-readable storage medium and the computer program product contain computer instructions for implementing various solutions described in the embodiments of the present invention.
  • the disclosed device and method can be implemented in other ways.
  • The division of the units described in the device embodiments above is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for accessing a distributed storage system. A client detects the performance of a plurality of hard disks in a partition, and if the performance of a first hard disk in the partition is abnormal, the client sends an access request to one or more hard disks in the partition other than the first hard disk. Thus, the client does not need to determine that the first hard disk is a slow disk before it stops accessing the first hard disk, which avoids the increase in access delay caused by the first hard disk whose access performance is abnormal and improves the access performance of the distributed storage system.
PCT/CN2020/100814 2019-07-30 2020-07-08 Procédé d'accès à un système de stockage distribué, client et produit programme d'ordinateur WO2021017782A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910696998.6A CN110554839A (zh) 2019-07-30 2019-07-30 分布式存储系统访问方法、客户端及计算机程序产品
CN201910696998.6 2019-07-30

Publications (1)

Publication Number Publication Date
WO2021017782A1 true WO2021017782A1 (fr) 2021-02-04

Family

ID=68737202

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100814 WO2021017782A1 (fr) 2019-07-30 2020-07-08 Procédé d'accès à un système de stockage distribué, client et produit programme d'ordinateur

Country Status (2)

Country Link
CN (1) CN110554839A (fr)
WO (1) WO2021017782A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110554839A (zh) * 2019-07-30 2019-12-10 华为技术有限公司 分布式存储系统访问方法、客户端及计算机程序产品
CN111400083B (zh) * 2020-03-17 2024-02-23 上海七牛信息技术有限公司 数据存储方法及系统、存储介质
CN113608701A (zh) * 2021-08-18 2021-11-05 合肥大唐存储科技有限公司 一种存储系统中数据管理方法和固态硬盘

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294314A1 (en) * 2006-06-16 2007-12-20 Michael Padovano Bitmap based synchronization
CN107908370A (zh) * 2017-11-30 2018-04-13 新华三技术有限公司 数据存储方法及装置
CN108984107A (zh) * 2017-06-02 2018-12-11 伊姆西Ip控股有限责任公司 提高存储系统的可用性
CN109582514A (zh) * 2018-12-03 2019-04-05 郑州云海信息技术有限公司 一种硬盘筛选方法、装置、设备及可读存储介质
CN110554839A (zh) * 2019-07-30 2019-12-10 华为技术有限公司 分布式存储系统访问方法、客户端及计算机程序产品

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7984324B2 (en) * 2008-03-27 2011-07-19 Emc Corporation Systems and methods for managing stalled storage devices
CN103218273A (zh) * 2012-01-20 2013-07-24 深圳市腾讯计算机系统有限公司 硬盘数据恢复方法、服务器及分布式存储系统
US9992298B2 (en) * 2014-08-14 2018-06-05 International Business Machines Corporation Relationship-based WAN caching for object stores
CN107656695B (zh) * 2016-07-25 2020-12-25 杭州海康威视数字技术股份有限公司 一种数据存储、删除方法、装置及分布式存储系统
CN106778369A (zh) * 2016-11-09 2017-05-31 百望金赋科技有限公司 一种硬盘数据访问方法、税控服务器
CN106911802B (zh) * 2017-04-18 2018-07-03 北京华云网际科技有限公司 分布式块存储系统的管理平台的部署方法和装置
US10484413B2 (en) * 2017-08-21 2019-11-19 Cognizant Technology Solutions India Pvt. Ltd. System and a method for detecting anomalous activities in a blockchain network
CN109426592A (zh) * 2017-08-24 2019-03-05 上海交通大学 一种磁盘检测方法
CN107577441B (zh) * 2017-10-17 2020-08-21 苏州浪潮智能科技有限公司 一种osd慢盘处理方法、系统、装置及计算机存储介质
CN109104299B (zh) * 2018-07-11 2021-12-07 新华三技术有限公司成都分公司 降低集群震荡的方法及装置
CN109274544B (zh) * 2018-12-11 2021-06-29 浪潮(北京)电子信息产业有限公司 一种分布式存储系统的故障检测方法及装置
CN109992220B (zh) * 2019-04-12 2022-07-08 苏州浪潮智能科技有限公司 一种锁释放方法、装置、设备及介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294314A1 (en) * 2006-06-16 2007-12-20 Michael Padovano Bitmap based synchronization
CN108984107A (zh) * 2017-06-02 2018-12-11 伊姆西Ip控股有限责任公司 提高存储系统的可用性
CN107908370A (zh) * 2017-11-30 2018-04-13 新华三技术有限公司 数据存储方法及装置
CN109582514A (zh) * 2018-12-03 2019-04-05 郑州云海信息技术有限公司 一种硬盘筛选方法、装置、设备及可读存储介质
CN110554839A (zh) * 2019-07-30 2019-12-10 华为技术有限公司 分布式存储系统访问方法、客户端及计算机程序产品

Also Published As

Publication number Publication date
CN110554839A (zh) 2019-12-10

Similar Documents

Publication Publication Date Title
US11070479B2 (en) Dynamic resource allocation based upon network flow control
EP3188449B1 (fr) Procédé et système de partage de ressources de stockage
WO2021017782A1 (fr) Procédé d'accès à un système de stockage distribué, client et produit programme d'ordinateur
US11709603B2 (en) Multi-tier write allocation
US11899533B2 (en) Stripe reassembling method in storage system and stripe server
CN111949210A (zh) 分布式存储系统中元数据存储方法、系统及存储介质
US11340829B1 (en) Techniques for log space management involving storing a plurality of page descriptor (PDESC) page block (PB) pairs in the log
US20190114076A1 (en) Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium
US20210318826A1 (en) Data Storage Method and Apparatus in Distributed Storage System, and Computer Program Product
US11100008B2 (en) Efficient memory usage for snapshots
US11068299B1 (en) Managing file system metadata using persistent cache
WO2021088586A1 (fr) Procédé et appareil de gestion de métadonnées dans un système de stockage
US11593182B2 (en) Storage system
EP3889778B1 (fr) Système de stockage distribué et produit-programme d'ordinateur
US12032849B2 (en) Distributed storage system and computer program product
US9778850B1 (en) Techniques for zeroing non-user data areas on allocation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20846322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20846322

Country of ref document: EP

Kind code of ref document: A1