WO2019000949A1 - Metadata storage method and system in distributed storage system, and storage medium - Google Patents

Metadata storage method and system in distributed storage system, and storage medium


Publication number
WO2019000949A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage node
metadata
data storage
node
stripe
Prior art date
Application number
PCT/CN2018/075077
Other languages
French (fr)
Chinese (zh)
Inventor
饶蓉
魏明昌
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019000949A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1464 Management of the backup or restore process for networked environments

Definitions

  • the present invention relates to the field of data storage technologies, and in particular, to a metadata storage method, system, and storage medium in a distributed storage system.
  • Metadata such as a logical address, a physical address, and the like of the recorded data are generated, and the metadata is also stored in the storage node.
  • A common metadata storage method is to scatter the blocks of a metadata stripe across the storage nodes. To read the metadata, the blocks of the stripe must be read from each storage node and pieced together into the metadata stripe. This requires a large amount of data forwarding between storage nodes, which affects performance. Another method stores multiple copies of the metadata on the storage nodes, but this increases storage overhead.
  • An embodiment of the present invention provides a metadata storage solution in a distributed storage system, where the distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partitioned view of the metadata stripe;
  • the partitioned view of the metadata stripe includes a primary data storage node DS A , a data storage node DS i , and a check storage node CS r ;
  • N is a natural number not less than 2, and M is a natural number not less than 1
  • A is one of natural numbers 1 to N
  • i is each of natural numbers 1 to N except A
  • r is each of natural numbers 1 to M
  • the management node determines, according to the partitioned view of the metadata stripe, the primary data storage node DS A , the data storage node DS i , and the verification storage node CS r for the metadata striping;
  • the metadata stripe includes metadata blocks D A and D i and a check block C r
  • The primary data storage node DS A backs up the other metadata blocks D i in the metadata stripe. Only the metadata blocks D i on the data storage nodes DS i need to be backed up on the primary data storage node DS A , and the check blocks need no copy, so storage space is reduced compared with the prior-art approach of keeping multiple copies of all metadata blocks.
  • At the same time, when a client accesses the metadata, all metadata blocks can be accessed from the primary data storage node DS A , which improves the speed of metadata access.
  • The distributed storage system of the present solution can be distributed file system storage, distributed object storage, or distributed block device storage.
  • The management node determining, according to the partitioned view of the metadata stripe, the primary data storage node DS A , the data storage node DS i , and the check storage node CS r for the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generates the metadata in the metadata stripe; the management node then queries the partitioned view of the metadata stripe according to that partition to determine the primary data storage node DS A , the data storage node DS i , and the check storage node CS r .
  • the management node determines, according to the address carried by the write request, a partition corresponding to the metadata stripe.
  • The check storage node CS r storing C r specifically includes: the check storage node CS r allocates a fragment S r to C r and establishes a mapping relationship between the identifier of C r and the fragment S r ;
  • the data storage node DS i storing D i specifically includes: the data storage node DS i allocates a fragment SD i to D i and establishes a mapping relationship between the identifier of D i and the fragment SD i ;
  • the primary data storage node DS A storing D A and D i specifically includes: the primary data storage node DS A allocates a fragment SD A to D A and establishes a mapping relationship between the identifier of D A and the fragment SD A , and allocates a fragment SD i to D i and establishes a mapping relationship between the identifier of D i and the fragment SD i .
  • the management node establishes a mapping relationship between the identifier of D i and the data storage node DS i and the primary data storage node DS A .
  • The management node can reclaim the metadata blocks on the data storage node and the primary data storage node according to the mapping relationship between the identifiers of the metadata blocks in the metadata stripe and the storage nodes, which improves the efficiency of metadata reclamation.
  • The embodiment of the present invention further provides a distributed storage system, where the distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partitioned view of the metadata stripe; the partitioned view of the metadata stripe includes a primary data storage node DS A , a data storage node DS i , and a check storage node CS r ; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of natural numbers 1 to N, i is each of natural numbers 1 to N except A, and r is each of natural numbers 1 to M;
  • the distributed storage system is used to implement various implementations of the first aspect.
  • The present invention also provides a non-volatile computer readable storage medium and a computer program product, which contain computer program instructions that are loaded into the memory of a storage device provided by an embodiment of the present invention.
  • The computer program instructions operate in a distributed storage system comprising a management node and (M+N) storage nodes, where the management node and the (M+N) storage nodes each store a partitioned view of the metadata stripe; the partitioned view of the metadata stripe includes a primary data storage node DS A , a data storage node DS i , and a check storage node CS r ; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of natural numbers 1 to N, i is each of natural numbers 1 to N except A, and r is each of natural numbers 1 to M; when one or more computers execute the computer program instructions, they act as the management node, the primary data storage node DS A , the data storage node DS i , and the check storage node CS r of the distributed storage system to implement the various implementations of the first aspect.
  • the metadata storage scheme in the various distributed storage systems disclosed in the first aspect can also be applied to the storage of data corresponding to the metadata. Accordingly, the distributed storage system of the second aspect and the non-transitory computer readable storage medium and computer program product of the third aspect are equally applicable to data storage.
  • FIG. 1 is a schematic diagram of a storage structure of a distributed block device according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a server in a distributed block device according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a relationship between a data stripe and a partition view according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of data striping according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a partition view according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a relationship between a metadata stripe and a partition view according to an embodiment of the present invention.
  • FIG. 7 is a flowchart of metadata storage according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of metadata striping according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of metadata storage according to an embodiment of the present invention.
  • Distributed storage systems mainly include distributed file system storage, distributed object storage, and distributed block device storage, such as of Series products.
  • the embodiment of the present invention is described by taking a distributed block device storage as an example.
  • The distributed block device storage includes a plurality of servers (server 1, server 2, server 3, server 4, server 5, and server 6), and the servers communicate with each other.
  • the number of servers in the distributed block device storage may be increased according to actual requirements, which is not limited by the embodiment of the present invention.
  • Each server in the distributed block device storage has the structure shown in FIG. 2.
  • each server in the distributed block device storage includes a central processing unit (CPU) 201, a memory 202, a hard disk 1, a hard disk 2, and a hard disk 3.
  • The memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform the corresponding operations.
  • the hard disk can be at least one of a mechanical hard disk and a solid state hard disk.
  • a Field Programmable Gate Array (FPGA) or other hardware may also be used for the corresponding operations of the CPU 201, or the FPGA or other hardware may perform the corresponding operations together with the CPU 201.
  • For ease of description, the embodiments of the present invention refer to the CPU and such hardware collectively as a processor that implements the corresponding operations described above.
  • When an application is loaded in the memory 202 and the CPU 201 executes the application instructions in the memory 202, the server serves as a client.
  • the application can be a virtual machine (VM) or a specific application, such as office software.
  • The client writes data to or reads data from the distributed block device storage.
  • When the storage management program is loaded in the memory 202 and the CPU 201 executes the instructions of the storage management program (a virtual block storage management program) in the memory 202, the server acts as a management node, which is responsible for managing the volume metadata and provides a block protocol access interface to the client.
  • The management node also provides a distributed storage access point service for the client, so that the client can access the storage resources of the distributed block device storage through the management node.
  • the storage object program is loaded in the memory 202, and the CPU 201 executes the storage object program instruction in the memory 202, and the server functions as a storage node for performing a specific input/output (I/O) operation.
  • By default, one storage object program process runs for each hard disk.
  • Each storage object program process manages one hard disk, and each storage object program process that the server runs acts as a storage node.
  • the embodiment of the present invention describes a case where a storage object program process manages a hard disk.
  • When the distributed block device storage is initialized, each storage object program process manages its hard disk in units of 1 MB fragments and records the allocation information of each 1 MB fragment in the metadata management area of the hard disk.
  • Storage resource pool: the storage management program communicates with the storage object programs of all the resource pools that it can access; that is, the management node communicates with all the storage nodes of the resource pools it can access, so that the management node can concurrently access all the hard disks of those resource pools.
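  • The 1 MB fragment bookkeeping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `FragmentAllocator`, the in-memory list standing in for the on-disk metadata management area, and the first-free allocation policy are all assumptions.

```python
FRAGMENT_SIZE = 1 << 20  # 1 MB, the management unit named in the text


class FragmentAllocator:
    """Tracks which 1 MB fragments of one hard disk are allocated (hypothetical helper)."""

    def __init__(self, disk_size_bytes: int) -> None:
        self.total = disk_size_bytes // FRAGMENT_SIZE
        # Stand-in for the allocation records kept in the disk's metadata management area.
        self.allocated = [False] * self.total

    def allocate(self) -> int:
        """Return the index of the first free fragment and mark it allocated."""
        for index, used in enumerate(self.allocated):
            if not used:
                self.allocated[index] = True
                return index
        raise RuntimeError("no free fragment on this disk")

    def release(self, index: int) -> None:
        """Mark a fragment as free again."""
        self.allocated[index] = False

    def offset(self, index: int) -> int:
        """Byte offset of a fragment on the disk."""
        return index * FRAGMENT_SIZE
```

  • A storage node would call `allocate()` each time it receives a block to store, and `release()` when the block is reclaimed.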
  • The hash space (such as 0 to 2^32) is divided into N equal parts, each of which is a partition, and the N partitions are distributed evenly according to the number of hard disks.
  • In the distributed block device storage, N defaults to 3600; that is, the partitions are P1, P2, P3, ..., P3600.
  • For example, with 18 hard disks (storage nodes), as in the 6 servers with 3 hard disks each described above, each storage node carries 200 partitions (3600 / 18). The correspondence between partitions and storage nodes, that is, the partition view, is allocated when the distributed block device storage is initialized and is subsequently adjusted as the number of hard disks in the distributed block device storage changes.
  • Each server in the distributed block device storage saves the partition view in the memory 202, and the management node uses the partition view for fast routing.
  • Each storage node also stores the complete partition view of the distributed block device storage system, that is, the correspondence between each partition and the storage nodes.
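  • As a rough illustration of the partition view, the sketch below spreads the default 3600 partitions over the hard disks of 6 servers with 3 disks each. The round-robin assignment and the node names are assumptions made for the example; the text only states that the partitions are divided evenly by the number of hard disks.

```python
from __future__ import annotations

NUM_PARTITIONS = 3600  # default number of partitions from the text


def build_partition_view(storage_nodes: list[str]) -> dict[int, str]:
    """Assign partitions P1..P3600 to storage nodes round-robin (assumed policy)."""
    return {
        p: storage_nodes[(p - 1) % len(storage_nodes)]
        for p in range(1, NUM_PARTITIONS + 1)
    }


def partitions_per_node(view: dict[int, str]) -> dict[str, int]:
    """Count how many partitions each storage node carries."""
    counts: dict[str, int] = {}
    for node in view.values():
        counts[node] = counts.get(node, 0) + 1
    return counts


# One storage node (storage object program process) per hard disk; names are hypothetical.
nodes = [f"server{s}-disk{d}" for s in range(1, 7) for d in range(1, 4)]
view = build_partition_view(nodes)
```

  • With these 18 storage nodes, `partitions_per_node(view)` reports 200 partitions for every node.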
  • In the distributed block device storage, an Erasure Coding (EC) algorithm can be used to improve data reliability, for example in 3+1 mode, in which 3 data blocks and 1 check block form a data stripe.
  • With the EC algorithm, the partition view has the form "Partition - Primary Data Storage Node - Data Storage Node 1 - Data Storage Node 2 - Check Storage Node"; an example partition view is shown in FIG. 5.
  • The partition view indicates, for a partition, the primary data storage node, the data storage node 1, and the data storage node 2, which store the data blocks of the data stripe, and the check storage node, which stores the check data.
  • The backup storage node for the data blocks stored on the data storage node 1 and the data storage node 2 is the primary data storage node.
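  • To make the 3+1 mode concrete, the sketch below builds a stripe of 3 data blocks and 1 check block and rebuilds a lost block. It assumes a simple XOR parity as the check block; the text does not name the specific EC code, so this is only the simplest single-parity stand-in.

```python
from __future__ import annotations

from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: list[bytes]) -> list[bytes]:
    """Return the data blocks followed by one XOR check block (3+1 when given 3 blocks)."""
    return data_blocks + [xor_blocks(data_blocks)]


def recover(stripe: list[bytes], lost_index: int) -> bytes:
    """Rebuild the block at lost_index from the remaining blocks of the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
```

  • Any single block of `stripe`, data or check, can be rebuilt with `recover` from the other three.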
  • the distributed block device storage will logically slice each logical unit number (LUN) by 1MB. For example, a 1GB LUN will be sliced into 1024*1MB fragments.
  • The client sends a write request, such as a Small Computer System Interface (SCSI) write request, that carries the LUN identity (ID), the logical block address (LBA) ID, and the data to be written. The management node where the client is located receives the write request and forms a key from the LUN ID and the LBA ID.
  • The key contains the result of rounding the LBA ID to 1 MB.
  • An integer (within 0 to 2^32) is calculated from the key by the Distributed Hash Table (DHT) hash and falls in a specific partition; the management node where the client is located then determines the primary data storage node, the data storage node 1, the data storage node 2, and the check storage node according to the partitioned view recorded in the memory 202.
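  • The routing from a write request to a partition can be sketched as follows. Several details are assumptions for illustration: the LBA is treated as a byte offset, the key format is invented, and `zlib.crc32` merely stands in for the unspecified DHT hash because it also yields a value in the 0 to 2^32 range.

```python
import zlib

FRAGMENT_SIZE = 1 << 20   # LUNs are sliced into 1 MB fragments
NUM_PARTITIONS = 3600     # default partition count from the text
HASH_SPACE = 1 << 32      # hash space 0 to 2^32


def make_key(lun_id: int, lba: int) -> str:
    """Form a key from the LUN ID and the LBA rounded down to a 1 MB boundary.

    Assumption: `lba` is a byte offset within the LUN.
    """
    return f"{lun_id}:{lba // FRAGMENT_SIZE}"


def partition_of(key: str) -> int:
    """Hash the key and map the result to one of the equal parts P1..P3600.

    zlib.crc32 is a placeholder for the DHT hash, which the text does not specify.
    """
    h = zlib.crc32(key.encode("utf-8"))  # unsigned 32-bit integer in Python 3
    return h * NUM_PARTITIONS // HASH_SPACE + 1
```

  • Two addresses inside the same 1 MB fragment produce the same key and therefore the same partition, so the management node looks up the same row of the partition view for both.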
  • The management node sends the data block 1, the data block 2, the data block 3, and the check block 1 in the EC data stripe to the primary data storage node, the data storage node 1, the data storage node 2, and the check storage node, respectively.
  • The primary data storage node stores the data block 1, the data storage node 1 stores the data block 2, the data storage node 2 stores the data block 3, and the check storage node stores the check block 1.
  • the data storage nodes 1 and 2 respectively determine the primary data storage node according to the partition view, the data storage node 1 backs up the data block 2 to the primary data storage node, and the data storage node 2 backs up the data block 3 to the primary data storage node, the primary data storage node Data block 2 and data block 3 are stored separately.
  • The primary data storage node allocates fragment 1 to the data block 1 from the hard disk it manages and establishes a mapping relationship between the identifier of the data block 1 and fragment 1;
  • the data storage node 1 allocates fragment 2 to the data block 2 from the hard disk it manages and establishes a mapping relationship between the identifier of the data block 2 and fragment 2;
  • the data storage node 2 allocates fragment 3 to the data block 3 from the hard disk it manages and establishes a mapping relationship between the identifier of the data block 3 and fragment 3;
  • the check storage node allocates fragment 4 to the check block 1 from the hard disk it manages and establishes a mapping relationship between the identifier of the check block 1 and fragment 4.
  • The primary data storage node receives the data block 2 sent by the data storage node 1 and the data block 3 sent by the data storage node 2, allocates fragment 5 and fragment 6 from the hard disk it manages, and establishes a mapping relationship between the identifier of the data block 2 and fragment 5 and a mapping relationship between the identifier of the data block 3 and fragment 6.
  • Taking the case where one process of the storage object program corresponds to one hard disk as an example, the mapping relationship between the identifier of a data block and a fragment is the mapping relationship between the identifier of the data block and the physical address of the fragment.
  • When one process of the storage object program corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping relationship between the identifier of a data block and a fragment includes a mapping from the identifier of the data block to the hard disk storing the data block, and a mapping from that hard disk to the fragment storing the data block.
  • The data block 2 is thus stored in fragment 2 and fragment 5, and the data block 3 is stored in fragment 3 and fragment 6. The management node establishes and saves the mapping relationship between the identifier of the data block 2 and the data storage node 1 and the primary data storage node, and the mapping relationship between the identifier of the data block 3 and the data storage node 2 and the primary data storage node.
  • Further, the data storage node 1 saves the mapping relationship between the identifier of the data block 2 and the data storage node 1 and the primary data storage node, and the data storage node 2 saves the mapping relationship between the identifier of the data block 3 and the data storage node 2 and the primary data storage node.
  • When garbage collection is performed on the data stripe, the management node can reclaim the data blocks on the data storage node and the primary data storage node according to the mapping relationship between the identifiers of the data blocks in the data stripe and the storage nodes, thereby improving the efficiency of data reclamation.
  • the metadata storage corresponding to the data uses the same EC algorithm as the data storage.
  • The metadata stripe based on the EC algorithm has the same partitioned view as the data stripe based on the EC algorithm described above, as shown in FIG. 6.
  • The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partitioned view of the metadata stripe;
  • the partitioned view of the metadata stripe includes a primary data storage node DS A , a data storage node DS i , and a check storage node CS r ; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M; the flow shown in FIG. 7 is executed in the distributed storage system:
  • Step 701: The management node determines, according to the partitioned view of the metadata stripe, the primary data storage node DS A , the data storage node DS i , and the check storage node CS r for the metadata stripe; the metadata stripe includes metadata blocks D A and D i and a check block C r .
  • The management node determining, according to the partitioned view of the metadata stripe, the primary data storage node DS A , the data storage node DS i , and the check storage node CS r for the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generates the metadata in the metadata stripe; the management node then queries the partitioned view of the metadata stripe according to that partition to determine the primary data storage node DS A , the data storage node DS i , and the check storage node CS r .
  • the management node determines, according to the address carried by the write request, a partition corresponding to the metadata stripe.
  • For details, refer to the scheme described above in which the distributed block device storage handles the write request sent by the client; the details are not repeated here.
  • Step 702: The management node sends D i to the data storage node DS i , sends D A to the primary data storage node DS A , and sends C r to the check storage node CS r .
  • Step 703: The check storage node CS r receives and stores C r .
  • Step 704: The data storage node DS i receives and stores D i , and sends D i to the primary data storage node DS A according to the partitioned view of the metadata stripe.
  • Step 705: The primary data storage node DS A receives and stores D A and D i .
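  • Steps 701 to 705 can be traced with a small in-memory simulation. The sketch below is illustrative only: the classes `StorageNode` and `PartitionView` and the function `write_metadata_stripe` are hypothetical names, network transfer is replaced by direct method calls, and step 701 is represented by passing in an already resolved partition view.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block identifier -> payload

    def store(self, block_id: str, payload: bytes) -> None:
        self.blocks[block_id] = payload


@dataclass
class PartitionView:
    primary: StorageNode   # DS_A
    data_nodes: list       # DS_i for every i other than A
    check_nodes: list      # CS_r


def write_metadata_stripe(view, d_a, d_i, c_r):
    """d_a is (id, payload); d_i and c_r are lists of (id, payload) ordered like the view."""
    # Step 702: the management node sends each block to its node.
    # Step 705 (first part): DS_A stores D_A.
    view.primary.store(*d_a)
    # Step 703: each CS_r stores its check block C_r.
    for node, (block_id, payload) in zip(view.check_nodes, c_r):
        node.store(block_id, payload)
    # Step 704: each DS_i stores D_i, then forwards it to DS_A.
    for node, (block_id, payload) in zip(view.data_nodes, d_i):
        node.store(block_id, payload)
        # Step 705 (second part): DS_A stores the backup copy of D_i.
        view.primary.store(block_id, payload)


view = PartitionView(
    primary=StorageNode("DS1"),
    data_nodes=[StorageNode("DS2"), StorageNode("DS3")],
    check_nodes=[StorageNode("CS1")],
)
write_metadata_stripe(
    view, ("D1", b"m1"), [("D2", b"m2"), ("D3", b"m3")], [("C1", b"p1")]
)
```

  • After the call, the primary node holds D1, D2, and D3, so a reader needs only that one node, while the check block exists only on the check node.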
  • The check storage node CS r storing C r specifically includes: the check storage node CS r allocates a fragment S r to C r and establishes a mapping relationship between the identifier of C r and the fragment S r ;
  • the data storage node DS i storing D i specifically includes: the data storage node DS i allocates a fragment SD i to D i and establishes a mapping relationship between the identifier of D i and the fragment SD i ;
  • the primary data storage node DS A storing D A and D i specifically includes: the primary data storage node DS A allocates a fragment SD A to D A and establishes a mapping relationship between the identifier of D A and the fragment SD A , and allocates a fragment SD i to D i and establishes a mapping relationship between the identifier of D i and the fragment SD i .
  • The management node establishes a mapping relationship between the identifier of D i and the data storage node DS i and the primary data storage node DS A . Further, the data storage node DS i saves the mapping relationship between the identifier of D i and the data storage node DS i and the primary data storage node DS A .
  • The management node can reclaim the metadata blocks on the data storage node and the primary data storage node according to the mapping relationship between the identifiers of the metadata blocks in the metadata stripe and the storage nodes, which improves the efficiency of metadata reclamation.
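  • The benefit of the identifier-to-node mapping can be sketched as follows: because the management node records both nodes that hold a block, it can go directly to those two nodes when the block is no longer needed. The class `ManagementNode`, its method names, and the dictionary-based nodes are hypothetical and only illustrate the lookup.

```python
class ManagementNode:
    """Keeps, for each block identifier, the nodes that hold a copy (illustrative only)."""

    def __init__(self) -> None:
        self.locations = {}  # block identifier -> [data node name, primary node name]

    def record(self, block_id: str, data_node: str, primary_node: str) -> None:
        self.locations[block_id] = [data_node, primary_node]

    def reclaim(self, block_id: str, nodes: dict) -> list:
        """Delete the block from every node recorded for it and return those node names."""
        holders = self.locations.pop(block_id, [])
        for name in holders:
            nodes[name].pop(block_id, None)
        return holders


# Each storage node is modelled as a dict of block identifier -> payload.
nodes = {
    "DS1": {"D1": b"m1", "D2": b"m2", "D3": b"m3"},  # primary data storage node
    "DS2": {"D2": b"m2"},
    "DS3": {"D3": b"m3"},
}
manager = ManagementNode()
manager.record("D2", data_node="DS2", primary_node="DS1")
manager.record("D3", data_node="DS3", primary_node="DS1")
visited = manager.reclaim("D2", nodes)
```

  • Here only DS2 and DS1 are contacted to remove D2; DS3 is never touched.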
  • For example, the metadata blocks in the metadata stripe using the EC algorithm are D 1 , D 2 , and D 3 , and the check block is C 1 .
  • The management node where the client is located determines the primary data storage node, the data storage node 1, the data storage node 2, and the check storage node according to the partitioned view "Partition - Primary Data Storage Node - Data Storage Node 1 - Data Storage Node 2 - Check Storage Node" recorded in the memory 202.
  • The partitioned view indicates, for a partition, the primary data storage node, the data storage node 1, and the data storage node 2, which store the metadata blocks of the metadata stripe, and the check storage node, which stores the check data.
  • The backup storage node for the metadata blocks stored on the data storage node 1 and the data storage node 2 is the primary data storage node.
  • The management node sends D 1 , D 2 , D 3 , and C 1 in the metadata stripe based on the EC algorithm to the primary data storage node, the data storage node 1, the data storage node 2, and the check storage node, respectively.
  • The primary data storage node receives and stores D 1 , the data storage node 1 receives and stores D 2 , the data storage node 2 receives and stores D 3 , and the check storage node receives and stores C 1 .
  • the data storage nodes 1 and 2 respectively determine the primary data storage node according to the partition view, the data storage node 1 backs up D 2 to the primary data storage node, and the data storage node 2 backs up D 3 to the primary data storage node, and the primary data storage node receives and Store D 2 and D 3 .
  • As shown in FIG. 9, the primary data storage node allocates fragment 7 to D 1 from the hard disk it manages and establishes a mapping relationship between the identifier of D 1 and fragment 7; the data storage node 1 allocates fragment 8 to D 2 from the hard disk it manages and establishes a mapping relationship between the identifier of D 2 and fragment 8; the data storage node 2 allocates fragment 9 to D 3 from the hard disk it manages and establishes a mapping relationship between the identifier of D 3 and fragment 9; the check storage node allocates fragment 10 to C 1 from the hard disk it manages and establishes a mapping relationship between the identifier of C 1 and fragment 10.
  • The primary data storage node receives D 2 sent by the data storage node 1 and D 3 sent by the data storage node 2, allocates fragment 11 and fragment 12 from the hard disk it manages, and establishes a mapping relationship between the identifier of D 2 and fragment 11 and a mapping relationship between the identifier of D 3 and fragment 12.
  • Taking the case where one process of the storage object program corresponds to one hard disk as an example, the mapping relationship between the identifier of a metadata block and a fragment is the mapping relationship between the identifier of the metadata block and the physical address of the fragment.
  • When one process of the storage object program corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping relationship between the identifier of a metadata block and a fragment includes a mapping from the identifier of the metadata block to the hard disk storing the metadata block, and a mapping from that hard disk to the fragment storing the metadata block.
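  • The two forms of the identifier-to-fragment mapping can be compared in a short sketch. The record type `FragmentLocation`, the disk names, and the `locate` helper are hypothetical; the fragment number 8 for D 2 follows the example above.

```python
from typing import NamedTuple, Optional


class FragmentLocation(NamedTuple):
    fragment: int               # fragment index (stands in for its physical address)
    disk: Optional[str] = None  # set only when the storage node manages several disks


# One process per hard disk: the identifier maps straight to the fragment.
single_disk_map = {"D2": FragmentLocation(fragment=8)}

# One process managing several hard disks: identifier -> hard disk -> fragment.
multi_disk_map = {"D2": FragmentLocation(fragment=8, disk="disk2")}


def locate(mapping: dict, block_id: str, default_disk: str = "disk1") -> tuple:
    """Return (disk, fragment) for a block, falling back to the node's only disk."""
    location = mapping[block_id]
    return (location.disk or default_disk, location.fragment)
```

  • In the single-disk case the disk is implicit, so only the fragment needs to be recorded; in the multi-disk case the extra hop through the disk is what the text describes as the two-part mapping.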
  • D_2 is stored in slice 8 and slice 11
  • D_3 is stored in slice 9 and slice 12
  • the management node establishes and saves a mapping between the identifier of D_2 and data storage node 1 and the primary data storage node, and establishes and saves a mapping between the identifier of D_3 and data storage node 2 and the primary data storage node.
  • data storage node 1 saves the mapping between the identifier of D_2 and data storage node 1 and the primary data storage node
  • data storage node 2 saves the mapping between the identifier of D_3 and data storage node 2 and the primary data storage node.
  • the management node may reclaim the copies of a metadata block on both the data storage node and the primary data storage node according to the mapping relationship between the identifiers of the metadata blocks in the metadata stripe and the storage nodes, which improves the efficiency of metadata reclamation.
  • the primary data storage node backs up the other metadata blocks in the metadata stripe; because the metadata blocks on the data storage nodes only need to be backed up on the primary data storage node, storage space is reduced compared with the prior-art approach of keeping multiple copies of all metadata blocks, and when a client accesses the metadata, all metadata blocks can be read from the primary data storage node alone, which improves metadata access speed.
  • the embodiment of the present invention further provides a non-volatile computer-readable storage medium and a computer program product; when the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product are loaded into memory and executed by the CPU, they implement the functions corresponding to the management node and the storage nodes (the primary data storage node, the data storage nodes, and the check storage node) in the implementations of the present invention.
  • the slice in the embodiment of the present invention may be a physical block or the like in the hard disk.
  • the hard disk in the embodiment of the present invention may be at least one of a mechanical disk and a solid state hard disk as described above.
  • the hard disk corresponding to a storage object program process in the embodiment of the present invention may also be a storage array or the like, which is not limited in the embodiment of the present invention.
  • the disclosed apparatus and method may be implemented in other manners.
  • the division of the units described in the device embodiments above is only a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a metadata storage method in a distributed storage system. In a distributed storage system, in a scenario where a metadata stripe composed of an EC algorithm realizes data reliability, a main data storage node backs up other metadata blocks in the metadata stripe. Since metadata blocks on a data storage node only need to be backed up on the main data storage node, in comparison with multiple replicas of all the metadata blocks, the storage space is reduced, and when a client accesses metadata, it is only necessary to access all the metadata blocks from the main data storage node, thus increasing the metadata access speed.

Description

Metadata storage method, system, and storage medium in a distributed storage system
Technical field
The present invention relates to the field of data storage technologies, and in particular to a metadata storage method, system, and storage medium in a distributed storage system.
Background
In a distributed storage system, after a management node stores user data on storage nodes, metadata recording the logical address, physical address, and so on of the data is generated, and this metadata must also be stored on the storage nodes. A common approach is to scatter the blocks of a metadata stripe across the storage nodes; reading the metadata then requires reading the blocks of the stripe from each storage node and reassembling the stripe, which causes a large volume of data forwarding between storage nodes and hurts performance. Another approach stores the metadata on the storage nodes as multiple copies, which increases storage space overhead.
Summary of the invention
In a first aspect, an embodiment of the present invention provides a metadata storage scheme in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store the partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. In the storage scheme: the management node determines the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe, where the metadata stripe includes metadata blocks D_A and D_i and check blocks C_r; the management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r; the check storage node CS_r receives and stores C_r; the data storage node DS_i receives and stores D_i and, according to the partition view of the metadata stripe, sends D_i to the primary data storage node DS_A; and the primary data storage node DS_A receives and stores D_A and D_i. In this scheme, with the metadata protected by erasure coding (EC), the primary data storage node DS_A backs up the other metadata blocks D_i of the metadata stripe. Because only the metadata blocks D_i on the data storage nodes DS_i need to be backed up on the primary data storage node DS_A, no copies of the check blocks are needed, which reduces storage space compared with the prior-art approach of keeping multiple copies of all metadata blocks; in addition, when a client accesses the metadata, all metadata blocks can be accessed from the primary data storage node DS_A, which improves metadata access speed. The distributed storage system of this scheme may be a distributed file system, a distributed object storage system, or distributed block device storage.
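The placement described in the first aspect can be illustrated with a minimal Python sketch. It is not the patented implementation: the `StorageNode` class, the `store_metadata_stripe` function, and the block identifiers such as `"D_1"` are hypothetical names chosen for illustration, and the nodes are modeled as in-memory dictionaries rather than networked servers.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical in-memory stand-in for a storage node."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> payload

    def store(self, block_id, payload):
        self.blocks[block_id] = payload


def store_metadata_stripe(data_blocks, check_blocks, a):
    """Place one metadata stripe as described in the first aspect.

    data_blocks:  {i: payload} for i in 1..N (metadata blocks D_i)
    check_blocks: {r: payload} for r in 1..M (check blocks C_r)
    a:            index A of the primary data storage node DS_A
    """
    n, m = len(data_blocks), len(check_blocks)
    if n < 2 or m < 1 or a not in data_blocks:
        raise ValueError("need N >= 2, M >= 1 and 1 <= A <= N")

    ds = {i: StorageNode(f"DS_{i}") for i in data_blocks}
    cs = {r: StorageNode(f"CS_{r}") for r in check_blocks}

    # Management node: send every block to its own node.
    for i, payload in data_blocks.items():
        ds[i].store(f"D_{i}", payload)
    for r, payload in check_blocks.items():
        cs[r].store(f"C_{r}", payload)

    # Each DS_i (i != A) forwards its block to the primary DS_A as a backup.
    for i, node in ds.items():
        if i != a:
            ds[a].store(f"D_{i}", node.blocks[f"D_{i}"])
    return ds, cs
```

With N = 3, M = 1, and A = 1, the sketch leaves all three metadata blocks on DS_1, one block on each of DS_2 and DS_3, and the check block only on CS_1, that is, N + (N - 1) + M = 6 stored block copies in total.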
Optionally, the management node determining the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r.
Optionally, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request.
Optionally, the check storage node CS_r storing C_r specifically includes: the check storage node CS_r allocates a slice S_r to C_r and establishes a mapping relationship between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i specifically includes: the data storage node DS_i allocates a slice SD_i to D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i specifically includes: the primary data storage node DS_A allocates a slice SD_A to D_A and establishes a mapping relationship between the identifier of D_A and the slice SD_A, and allocates a slice SD_i to D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i.
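The slice allocation and identifier-to-slice mapping on a single storage node might look like the following sketch. It assumes one storage object program process managing one hard disk; the `SliceStore` class, its fields, and the first-free allocation policy are illustrative assumptions, not details given in the source.

```python
class SliceStore:
    """Hypothetical storage node that manages one hard disk split into slices."""

    def __init__(self, node_name, slice_count):
        self.node_name = node_name
        self.free_slices = list(range(slice_count))  # unallocated slice numbers
        self.block_to_slice = {}   # block identifier -> slice number
        self.slices = {}           # slice number -> stored payload

    def store(self, block_id, payload):
        """Allocate a slice for the block, record the mapping, write the payload."""
        if block_id in self.block_to_slice:
            slice_no = self.block_to_slice[block_id]  # overwrite in place
        else:
            if not self.free_slices:
                raise RuntimeError(f"{self.node_name}: no free slice")
            slice_no = self.free_slices.pop(0)        # first-free policy (assumption)
            self.block_to_slice[block_id] = slice_no
        self.slices[slice_no] = payload
        return slice_no

    def read(self, block_id):
        """Look up the slice through the mapping and return its payload."""
        return self.slices[self.block_to_slice[block_id]]
```

On the primary data storage node, calling `store` once for D_A and once for each backed-up D_i yields a separate slice and a separate mapping entry per block, which is the behavior the paragraph above describes.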
Further, the management node establishes a mapping relationship between the identifier of D_i and the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on the metadata stripe, the management node can, according to the mapping relationship between the identifiers of the metadata blocks in the metadata stripe and the storage nodes, reclaim the copies of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
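The garbage-collection use of this mapping can be sketched as follows. The function name `reclaim_stripe` and the dictionary layout are hypothetical; the storage nodes are again modeled as plain dictionaries, and the sketch assumes the management node drops its own mapping entry once the copies are released.

```python
def reclaim_stripe(block_locations, node_blocks, stripe_block_ids):
    """Remove every copy of the given metadata blocks.

    block_locations:  management-node map, block id -> set of node names
    node_blocks:      node name -> {block id: payload} (stand-in for the nodes)
    stripe_block_ids: identifiers of the blocks in the stripe being reclaimed
    Returns the number of block copies reclaimed.
    """
    reclaimed = 0
    for block_id in stripe_block_ids:
        # pop() also drops the management-node entry for this block
        for node_name in block_locations.pop(block_id, set()):
            if node_blocks[node_name].pop(block_id, None) is not None:
                reclaimed += 1
    return reclaimed
```

Because the map lists both the data storage node and the primary data storage node for each block, one lookup is enough to find and release both copies.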
In a second aspect, correspondingly, an embodiment of the present invention further provides a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store the partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system is used to implement the various implementations of the first aspect.
Correspondingly, the present invention further provides a non-volatile computer-readable storage medium and a computer program product. When the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product are loaded into the memory of the storage device provided in the embodiments of the present invention, the computer program instructions can run in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store the partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. When one or more computers execute the computer program instructions, they act respectively as the management node, the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r in the distributed storage system to implement the various implementations of the first aspect.
The metadata storage schemes in the distributed storage systems disclosed in the first aspect are also applicable to storing the data to which the metadata corresponds. Correspondingly, the distributed storage system of the second aspect and the non-volatile computer-readable storage medium and computer program product of the third aspect are likewise applicable to data storage.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of a distributed block device storage architecture according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a server in distributed block device storage according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the relationship between a data stripe and a partition view according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data stripe according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a partition view according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the relationship between a metadata stripe and a partition view according to an embodiment of the present invention;
FIG. 7 is a flowchart of metadata storage according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a metadata stripe according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of metadata storage according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings in the embodiments of the present invention.
Distributed storage systems mainly take the form of distributed file system storage, distributed object storage, and distributed block device storage, for example the [Figure PCTCN2018075077-appb-000002] series products of [Figure PCTCN2018075077-appb-000001]. The embodiments of the present invention are described using distributed block device storage as an example. By way of example, as shown in FIG. 1, the distributed block device storage includes multiple servers (server 1, server 2, server 3, server 4, server 5, and server 6) that communicate with one another. In practice, the number of servers in the distributed block device storage may be increased as required, which is not limited in the embodiments of the present invention. Each server in the distributed block device storage has the structure shown in FIG. 2.
As shown in FIG. 2, each server in the distributed block device storage includes a central processing unit (CPU) 201, a memory 202, hard disk 1, hard disk 2, and hard disk 3. The memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform the corresponding operations. The hard disks may be at least one of mechanical hard disks and solid-state drives. In addition, to save the computing resources of the CPU 201, a field programmable gate array (FPGA) or other hardware may perform the above operations in place of the CPU 201, or the FPGA or other hardware may perform them together with the CPU 201. For ease of description, the embodiments of the present invention refer uniformly to a processor that implements the above operations.
In the structure shown in FIG. 2, when an application program is loaded into the memory 202 and the CPU 201 executes the application program instructions in the memory 202, the server acts as a client. The application may be a virtual machine (VM) or a specific application such as office software. The client writes data to or reads data from the distributed block device storage. When a storage management program is loaded into the memory 202 and the CPU 201 executes the instructions of the storage management program, which serves as a virtual block storage management program, the server acts as a management node. The management node is responsible for managing volume metadata, provides a block protocol access interface to the client, and provides a distributed storage access point service so that the client can access the storage resources of the distributed block device storage through the management node. When a storage object program is loaded into the memory 202 and the CPU 201 executes the storage object program instructions in the memory 202, the server acts as a storage node that performs the actual input/output (I/O) operations. Multiple storage object program processes can run on each server. By way of example, one storage object program process runs for each hard disk by default and manages that hard disk, so each storage object program process running on the server acts as one storage node. In a specific implementation, a single storage object program process on a server may instead correspond to all the hard disks on that server. The embodiments of the present invention are described using the case in which one storage object program process manages one hard disk. When the distributed block device storage is initialized, each storage object program process divides its hard disk into slices of 1 MB and records the allocation information of each 1 MB slice in the metadata management area of the hard disk; the slices of the hard disks form a storage resource pool. The storage management program communicates point-to-point with the processes of all the storage object programs in the resource pool it can access, that is, the management node communicates with all the storage nodes of the resource pool it can access, so the management node can access all the hard disks of the resource pool concurrently.
When the distributed block device storage is initialized, the hash space (for example, 0 to 2^32) is divided into N equal parts, each of which is one partition, and the N partitions are distributed evenly over the hard disks. For example, N defaults to 3600 in the distributed block device storage, so the partitions are P1, P2, P3, ..., P3600. As shown in FIG. 3, assuming the distributed block device storage currently has 18 hard disks (storage nodes), each storage node carries 200 partitions. The correspondence between partitions and storage nodes, that is, the partition view, is assigned when the distributed block device storage is initialized and is adjusted later as the number of hard disks in the distributed block device storage changes. Each server of the distributed block device storage keeps the partition view in the memory 202, and the management node uses the partition view for fast routing. Each storage node also keeps all the partition views of the distributed block device storage system, that is, the correspondence between every partition and the storage nodes. In addition, depending on the reliability requirements of the distributed block device storage, an erasure coding (EC) algorithm can be used to improve data reliability, for example in 3+1 mode, in which three data blocks and one check block form a data stripe, as shown in FIG. 4. The partition view then takes the form "partition - primary data storage node - data storage node 1 - data storage node 2 - check storage node"; an example partition view is shown in FIG. 5. The partition view indicates, for each partition, the primary data storage node, data storage node 1 and data storage node 2 that store the other data blocks of the data stripe, and the check storage node that stores the check data; the primary data storage node is the backup node for the data blocks stored on data storage node 1 and data storage node 2.
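As a rough illustration of such a partition view in 3+1 mode, the sketch below builds one entry per partition. The source does not say how the four nodes of a partition are chosen, so the round-robin rotation used here, the `build_partition_view` name, and the node names `node1` to `node18` are all assumptions made for the example.

```python
def build_partition_view(partition_count, node_names, data_nodes=3, check_nodes=1):
    """Return {partition name: {"primary": ..., "data": [...], "check": [...]}}."""
    width = data_nodes + check_nodes
    if len(node_names) < width:
        raise ValueError("not enough storage nodes for one stripe")
    view = {}
    for p in range(partition_count):
        # Rotate the starting node so that primaries are spread evenly (assumption).
        members = [node_names[(p + k) % len(node_names)] for k in range(width)]
        view[f"P{p + 1}"] = {
            "primary": members[0],            # primary data storage node
            "data": members[1:data_nodes],    # data storage nodes 1 and 2
            "check": members[data_nodes:],    # check storage node(s)
        }
    return view
```

With 3600 partitions and 18 storage nodes, this rotation makes each node the primary data storage node for exactly 200 partitions, matching the 200-partitions-per-node figure in the example above under the assumption that the figure counts primaries.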
The distributed block device storage logically divides each logical unit number (LUN) into slices of 1 MB; for example, a 1 GB LUN is divided into 1024 slices of 1 MB. As shown in FIG. 3, when a client sends a write request to a LUN through the management node, the Small Computer System Interface (SCSI) command carries the LUN identifier (ID), the logical block address (LBA) ID, and the data to be written. The management node where the client resides receives the write request and forms a key from the LUN ID and the LBA ID; the key contains the result of rounding the LBA ID to 1 MB. A distributed hash table (DHT) hash of the key yields an integer in the range 0 to 2^32, which falls into a particular partition. The management node where the client resides determines the primary data storage node, data storage node 1, data storage node 2, and the check storage node according to the partition view recorded in the memory 202, and sends data block 1, data block 2, data block 3, and check block 1 of the EC data stripe to the primary data storage node, data storage node 1, data storage node 2, and the check storage node, respectively. The primary data storage node stores data block 1, data storage node 1 stores data block 2, data storage node 2 stores data block 3, and the check storage node stores check block 1. Data storage nodes 1 and 2 each determine the primary data storage node according to the partition view; data storage node 1 backs up data block 2 to the primary data storage node, data storage node 2 backs up data block 3 to the primary data storage node, and the primary data storage node stores data block 2 and data block 3. In a specific implementation, the primary data storage node allocates slice 1 to data block 1 from the hard disk it manages and establishes a mapping relationship between the identifier of data block 1 and slice 1; data storage node 1 allocates slice 2 to data block 2 from the hard disk it manages and establishes a mapping relationship between the identifier of data block 2 and slice 2; data storage node 2 allocates slice 3 to data block 3 from the hard disk it manages and establishes a mapping relationship between the identifier of data block 3 and slice 3; and the check storage node allocates slice 4 to check block 1 from the hard disk it manages and establishes a mapping relationship between the identifier of check block 1 and slice 4. The primary data storage node receives data block 2 sent by data storage node 1 and data block 3 sent by data storage node 2, allocates slice 5 and slice 6 from the hard disk it manages, and establishes a mapping relationship between the identifier of data block 2 and slice 5 and a mapping relationship between the identifier of data block 3 and slice 6. In the embodiments of the present invention, taking the mapping relationship between the identifier of a data block and a slice as an example: when one storage object program process corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping relationship between the identifier of the data block and the slice is a mapping between the identifier of the data block and the physical address of the slice; when one storage object program process corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping relationship between the identifier of the data block and the slice comprises a mapping between the identifier of the data block and the hard disk storing the data block, and a mapping from that hard disk to the slice. Further, data block 2 is stored in slice 2 and slice 5, and data block 3 is stored in slice 3 and slice 6; the management node establishes and saves a mapping between the identifier of data block 2 and data storage node 1 and the primary data storage node, and establishes and saves a mapping between the identifier of data block 3 and data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping between the identifier of data block 2 and data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of data block 3 and data storage node 2 and the primary data storage node. When garbage collection is performed on the data stripe, the management node can, according to the mapping relationship between the identifiers of the data blocks in the data stripe and the storage nodes, reclaim the copies of a data block on both the data storage node and the primary data storage node, which improves the efficiency of data reclamation.
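The routing from a write address to a partition can be sketched as below. The source names neither the hash function nor the key layout, so `zlib.crc32` is used purely as a stand-in 32-bit hash, the `"lun:slice"` key format is invented for the example, and the LBA is treated as a byte offset; all three are assumptions.

```python
import zlib

SLICE_SIZE = 1 << 20        # 1 MB logical slice, as in the description
HASH_SPACE = 1 << 32        # hash values fall in [0, 2^32)
PARTITION_COUNT = 3600      # default partition count from the description


def partition_for(lun_id, lba_bytes, partition_count=PARTITION_COUNT):
    """Map a write address to a partition name such as 'P17'."""
    slice_index = lba_bytes // SLICE_SIZE      # round the address down to 1 MB
    key = f"{lun_id}:{slice_index}".encode()   # hypothetical key layout
    h = zlib.crc32(key)                        # stand-in for the DHT hash
    # Equal division of the hash space into partition_count ranges.
    return f"P{h * partition_count // HASH_SPACE + 1}"
```

All addresses inside the same 1 MB slice of a LUN produce the same key and therefore the same partition, so the management node can look up the primary data storage node, the data storage nodes, and the check storage node for that partition in the partition view.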
In the embodiments of the present invention, when a client sends a write request to the distributed block device storage to write data, metadata is generated to record the logical address, physical address, and so on of the data. The metadata corresponding to the data is stored using the same EC algorithm as the data. A metadata stripe formed using the EC algorithm has the same partition view as the data stripe formed using the EC algorithm described above, as shown in FIG. 6.
Metadata is stored in a distributed storage system that includes a management node and (M+N) storage nodes, where the management node and the (M+N) storage nodes each store the partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system performs the flow shown in FIG. 7:
Step 701: The management node determines, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe. The metadata stripe comprises metadata blocks D_A and D_i and parity blocks C_r.
Specifically, determining the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe comprises: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r.
Specifically, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request. For details, refer to the scheme used when the distributed block device storage stores a write request sent by a client; they are not repeated here.
Step 702: The management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the parity storage node CS_r.
Step 703: The parity storage node CS_r receives and stores C_r.
Step 704: The data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe.
Step 705: The primary data storage node DS_A receives and stores D_A and D_i.
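Steps 701 to 705 can be walked through with an in-memory sketch. Several parts are assumptions and not taken from the source: the `StorageNode` class, the block names `D_A`, `D_1`, `C_1`, and in particular the rule that maps an address to a partition (a hash modulo the partition count), since the description only says the partition is determined from the address carried in the write request. The parity payload is a placeholder; no EC encoding is performed.

```python
import hashlib


class StorageNode:
    """In-memory stand-in for a storage node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> payload

    def store(self, block_id, payload):
        self.blocks[block_id] = payload


def partition_for_address(address, num_partitions):
    """Assumed rule for step 701: hash the write address modulo the partition count."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def write_metadata_stripe(views, address, data_blocks, parity_blocks):
    """Run steps 701-705 for one metadata stripe.

    views: list indexed by partition id; each entry is
           {"primary": node, "data": [nodes], "parity": [nodes]}
    data_blocks: payloads with D_A first, then one D_i per data storage node
    parity_blocks: one payload per parity storage node
    """
    # Step 701: pick the partition from the address, then read the roles from its view.
    view = views[partition_for_address(address, len(views))]
    primary = view["primary"]

    # Steps 702 and 705: D_A is sent to DS_A, which stores it.
    primary.store("D_A", data_blocks[0])

    # Steps 702 and 703: each C_r is sent to CS_r, which stores it.
    for r, (node, payload) in enumerate(zip(view["parity"], parity_blocks), start=1):
        node.store(f"C_{r}", payload)

    # Steps 702 and 704: each D_i is sent to DS_i, which stores it and forwards it to DS_A.
    for i, (node, payload) in enumerate(zip(view["data"], data_blocks[1:]), start=1):
        node.store(f"D_{i}", payload)
        primary.store(f"D_{i}", payload)  # step 705: DS_A stores the forwarded copy
    return view


nodes = [StorageNode(f"node-{k}") for k in range(4)]
views = [{"primary": nodes[0], "data": [nodes[1], nodes[2]], "parity": [nodes[3]]}]
view = write_metadata_stripe(views, "volume-1:lba-4096",
                             [b"meta-a", b"meta-1", b"meta-2"], [b"parity-placeholder"])
print(sorted(view["primary"].blocks))  # ['D_1', 'D_2', 'D_A']
```

After the write, the primary holds every metadata block of the stripe, while each data storage node holds only its own block and the parity node holds only parity.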
Specifically, the parity storage node CS_r storing C_r comprises: the parity storage node CS_r allocates a slice S_r for C_r and establishes a mapping relationship between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i comprises: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i comprises: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping relationship between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i. Further, the management node establishes a mapping relationship between the identifier of D_i and the data storage node DS_i and the primary data storage node DS_A. Further, the data storage node DS_i saves the mapping relationship between the identifier of D_i and the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the metadata stripe and the storage nodes to reclaim the copies of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
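The slice allocation and the garbage collection described above can be sketched as follows. The classes `SliceStore` and `ManagementNode`, the free-list allocation policy, and the slice names are hypothetical; the source does not say how a node chooses a slice. Only the D_i case is shown, because that is the mapping the description says the management node records.

```python
class SliceStore:
    """One storage node: allocates slices and keeps block id -> slice."""

    def __init__(self, free_slices):
        self.free = list(free_slices)
        self.block_to_slice = {}

    def store(self, block_id):
        slice_id = self.free.pop(0)               # allocate a slice for the block
        self.block_to_slice[block_id] = slice_id  # record identifier -> slice
        return slice_id

    def reclaim(self, block_id):
        # Drop the mapping and return the slice to the free list.
        self.free.append(self.block_to_slice.pop(block_id))


class ManagementNode:
    """Keeps block id -> the nodes that hold a copy (DS_i and DS_A for each D_i)."""

    def __init__(self):
        self.block_to_nodes = {}

    def record(self, block_id, *nodes):
        self.block_to_nodes[block_id] = list(nodes)

    def garbage_collect(self, block_ids):
        # One lookup per block finds every copy, so both are reclaimed together.
        for block_id in block_ids:
            for node in self.block_to_nodes.pop(block_id):
                node.reclaim(block_id)


ds_i = SliceStore(["slice-i-0", "slice-i-1"])
ds_a = SliceStore(["slice-a-0", "slice-a-1"])
mgmt = ManagementNode()

ds_i.store("D_i")
ds_a.store("D_i")               # backup copy forwarded to the primary
mgmt.record("D_i", ds_i, ds_a)

mgmt.garbage_collect(["D_i"])
print(ds_i.block_to_slice, ds_a.block_to_slice)  # {} {}
```

Without the block-to-nodes mapping, the management node would have to ask every node whether it holds a copy; with it, reclamation is a direct lookup.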
In the embodiments of the present invention, in combination with the distributed block device storage and the data storage manner described above, as shown in FIG. 8, the metadata blocks in a metadata stripe that uses the EC algorithm are D_1, D_2, and D_3, and the parity block is C_1. The management node where the client is located determines the primary data storage node, data storage node 1, data storage node 2, and the parity storage node according to the partition view "partition - primary data storage node - data storage node 1 - data storage node 2 - parity storage node" recorded in the memory 202. This partition view indicates that the partition corresponds to the primary data storage node, to data storage node 1 and data storage node 2, which store the other metadata blocks of the metadata stripe, and to the parity storage node, which stores the parity data; the backup node for the metadata blocks stored on data storage node 1 and data storage node 2 is the primary data storage node. The management node sends D_1, D_2, D_3, and C_1 of the EC-based metadata stripe to the primary data storage node, data storage node 1, data storage node 2, and the parity storage node, respectively. The primary data storage node receives and stores D_1, data storage node 1 receives and stores D_2, data storage node 2 receives and stores D_3, and the parity storage node receives and stores C_1. Data storage nodes 1 and 2 each determine the primary data storage node according to the partition view; data storage node 1 backs up D_2 to the primary data storage node, data storage node 2 backs up D_3 to the primary data storage node, and the primary data storage node receives and stores D_2 and D_3. In a specific implementation, as shown in FIG. 9, the primary data storage node allocates slice 7 for D_1 from the hard disks it manages and establishes a mapping relationship between the identifier of D_1 and slice 7; data storage node 1 allocates slice 8 for D_2 from the hard disks it manages and establishes a mapping relationship between the identifier of D_2 and slice 8; data storage node 2 allocates slice 9 for D_3 from the hard disks it manages and establishes a mapping relationship between the identifier of D_3 and slice 9; and the parity storage node allocates slice 10 for C_1 from the hard disks it manages and establishes a mapping relationship between the identifier of C_1 and slice 10. The primary data storage node receives D_2 sent by data storage node 1 and D_3 sent by data storage node 2, allocates slice 11 and slice 12 from the hard disks it manages, and establishes a mapping relationship between the identifier of D_2 and slice 11 and a mapping relationship between the identifier of D_3 and slice 12. Taking the mapping relationship between the identifier of a metadata block and a slice as an example: when one process of the storage object program corresponds to one hard disk, that is, when the storage node is the hard disk itself, the mapping relationship between the identifier of the metadata block and the slice is the mapping between the identifier of the metadata block and the physical address of the slice; when one process of the storage object program corresponds to multiple hard disks, that is, when the storage node manages multiple hard disks, the mapping relationship between the identifier of the metadata block and the slice comprises a mapping between the identifier of the metadata block and the hard disk that stores the metadata block, and a mapping from that hard disk to the slice. Further, D_2 is stored in slice 8 and slice 11, and D_3 is stored in slice 9 and slice 12. The management node establishes and saves the mapping between the identifier of D_2 and data storage node 1 and the primary data storage node, and the mapping between the identifier of D_3 and data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping between the identifier of D_2 and data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of D_3 and data storage node 2 and the primary data storage node. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the metadata stripe and the storage nodes to reclaim the copies of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
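The layout that results from this example can be written out as plain data. The slice numbers 7 to 12 and the block names come from the example; the dictionary keys used as node labels and the helper `replicas` are hypothetical.

```python
# Block id -> slice id on each node, following the FIG. 9 example.
layout = {
    "primary": {"D1": 7, "D2": 11, "D3": 12},
    "data_node_1": {"D2": 8},
    "data_node_2": {"D3": 9},
    "parity_node": {"C1": 10},
}

# Mapping kept by the management node: block id -> nodes holding a copy.
block_to_nodes = {
    "D2": ["data_node_1", "primary"],
    "D3": ["data_node_2", "primary"],
}


def replicas(block_id):
    """Return (node, slice) for every copy of a block recorded by the management node."""
    return [(node, layout[node][block_id]) for node in block_to_nodes[block_id]]


print(replicas("D2"))             # [('data_node_1', 8), ('primary', 11)]
print(sorted(layout["primary"]))  # ['D1', 'D2', 'D3']
```

The second line of output shows that the primary data storage node alone holds all three metadata blocks of the stripe.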
Therefore, in a scenario where a metadata stripe formed with the EC algorithm provides data reliability, the primary data storage node backs up the other metadata blocks of the metadata stripe. Because only the metadata blocks on the data storage nodes need to be backed up on the primary data storage node, less storage space is used than in the prior art, in which every metadata block is kept in multiple copies. In addition, when a client accesses the metadata, it only needs to access the primary data storage node to obtain all metadata blocks, which improves metadata access speed.
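The storage saving can be made concrete by counting stored blocks. The sketch below rests on an assumption that the source does not spell out: the prior-art baseline is modeled as keeping a fixed number of full copies of each of the N metadata blocks, and the value of three copies is chosen only for illustration. The result depends on N, M, and the copy count.

```python
def blocks_stored_ec_with_primary_backup(n: int, m: int) -> int:
    """N metadata blocks, M parity blocks, and N - 1 backups held by the primary."""
    return n + m + (n - 1)


def blocks_stored_full_replication(n: int, copies: int) -> int:
    """Assumed baseline: every metadata block is kept in `copies` full copies."""
    return n * copies


n, m = 3, 1  # the D_1, D_2, D_3 and C_1 example
print(blocks_stored_ec_with_primary_backup(n, m))   # 6
print(blocks_stored_full_replication(n, copies=3))  # 9
```

For the three-block example, the scheme stores six blocks (slices 7 to 12), against nine under the assumed three-copy baseline.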
An embodiment of the present invention further provides a non-volatile computer-readable storage medium and a computer program product. The non-volatile computer-readable storage medium and the computer program product contain computer program instructions; a CPU executes the computer program instructions loaded in memory to implement the functions of the management node and the storage nodes (the primary data storage node, the data storage nodes, and the parity storage nodes) in the embodiments of the present invention.
The foregoing are exemplary descriptions given in the embodiments of the present invention. The labels "slice 1", "slice 2", ..., "slice 12" in the embodiments of the present invention are not intended to strictly define an order; they are used only to distinguish different slices. A slice in the embodiments of the present invention may be a physical block on a hard disk or the like. As described above, the hard disk in the embodiments of the present invention may be at least one of a mechanical disk and a solid-state disk. The hard disk corresponding to a process of the storage object program may also be a storage array or the like, which is not limited in the embodiments of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the division into units in the apparatus embodiments described above is merely a division by logical function; other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

Claims (10)

  1. A metadata storage method in a distributed storage system, wherein the distributed storage system comprises a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe comprises a primary data storage node DS_A, data storage nodes DS_i, and parity storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M; the method comprises: the management node determines, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe, the metadata stripe comprising metadata blocks D_A and D_i and parity blocks C_r;
    the management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the parity storage node CS_r;
    the parity storage node CS_r receives and stores C_r;
    the data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe; and
    the primary data storage node DS_A receives and stores D_A and D_i.
  2. The method according to claim 1, wherein the management node determining, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe specifically comprises:
    the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and
    the management node queries the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r.
  3. The method according to claim 2, wherein the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request.
  4. The method according to claim 1, wherein the parity storage node CS_r storing C_r specifically comprises: the parity storage node CS_r allocates a slice S_r for C_r and establishes a mapping relationship between the identifier of C_r and the slice S_r;
    the data storage node DS_i storing D_i specifically comprises: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i; and
    the primary data storage node DS_A storing D_A and D_i specifically comprises: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping relationship between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i.
  5. A distributed storage system, wherein the distributed storage system comprises a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe comprises a primary data storage node DS_A, data storage nodes DS_i, and parity storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M;
    the management node is configured to determine, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe, the metadata stripe comprising metadata blocks D_A and D_i and parity blocks C_r, and to send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the parity storage node CS_r;
    the parity storage node CS_r is configured to receive and store C_r;
    the data storage node DS_i is configured to receive and store D_i, and to send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe; and
    the primary data storage node DS_A is configured to receive and store D_A and D_i.
  6. The system according to claim 5, wherein the management node is specifically configured to determine the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r.
  7. The system according to claim 6, wherein the management node is further configured to determine the partition corresponding to the metadata stripe according to the address carried in the write request.
  8. The system according to claim 5, wherein the parity storage node CS_r is specifically configured to allocate a slice S_r for C_r and establish a mapping relationship between the identifier of C_r and the slice S_r;
    the data storage node DS_i is specifically configured to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i; and
    the primary data storage node DS_A is specifically configured to allocate a slice SD_A for D_A and establish a mapping relationship between the identifier of D_A and the slice SD_A, and to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i.
  9. A non-volatile readable storage medium, wherein the non-volatile readable storage medium contains computer program instructions that can run in a distributed storage system; the distributed storage system comprises a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe comprises a primary data storage node DS_A, data storage nodes DS_i, and parity storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M; when one or more computers execute the computer program instructions, the one or more computers, acting as the management node, are configured to determine, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r for the metadata stripe, the metadata stripe comprising metadata blocks D_A and D_i and parity blocks C_r, and to send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the parity storage node CS_r; the one or more computers, acting as the parity storage node CS_r, are configured to receive and store C_r;
    the one or more computers, acting as the data storage node DS_i, are configured to receive and store D_i, and to send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe; and
    the one or more computers, acting as the primary data storage node DS_A, are configured to receive and store D_A and D_i.
  10. The storage medium according to claim 9, further comprising computer program instructions that cause the one or more computers, acting as the management node, to determine the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage nodes DS_i, and the parity storage nodes CS_r.
PCT/CN2018/075077 2017-06-28 2018-02-02 Metadata storage method and system in distributed storage system, and storage medium WO2019000949A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710508014.8A CN109144406B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system
CN201710508014.8 2017-06-28

Publications (1)

Publication Number Publication Date
WO2019000949A1 (en) 2019-01-03

Family

ID=64740945

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/075077 WO2019000949A1 (en) 2017-06-28 2018-02-02 Metadata storage method and system in distributed storage system, and storage medium

Country Status (2)

Country Link
CN (2) CN109144406B (en)
WO (1) WO2019000949A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7270755B2 (en) * 2019-03-04 2023-05-10 ヒタチ ヴァンタラ エルエルシー Metadata routing in distributed systems
EP3971701A4 (en) * 2019-09-09 2022-06-15 Huawei Cloud Computing Technologies Co., Ltd. Data processing method in storage system, device, and storage system
CN111444274B (en) * 2020-03-26 2021-04-30 上海依图网络科技有限公司 Data synchronization method, data synchronization system, and apparatus, medium, and system thereof
CN116490847A (en) * 2020-11-05 2023-07-25 阿里巴巴集团控股有限公司 Virtual data replication supporting garbage collection in a distributed file system
CN112947864B (en) * 2021-03-29 2024-03-08 南方电网数字平台科技(广东)有限公司 Metadata storage method, apparatus, device and storage medium
CN115904794A (en) * 2021-08-18 2023-04-04 华为技术有限公司 Data processing method and device
CN115268801B (en) * 2022-09-30 2023-01-10 天津卓朗昆仑云软件技术有限公司 Backup system and method for block device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411637A (en) * 2011-12-30 2012-04-11 创新科软件技术(深圳)有限公司 Metadata management method of distributed file system
CN103699494A (en) * 2013-12-06 2014-04-02 北京奇虎科技有限公司 Data storage method, data storage equipment and distributed storage system
US20140310489A1 (en) * 2013-04-16 2014-10-16 International Business Machines Corporation Managing metadata and data for a logical volume in a distributed and declustered system
CN106599308A (en) * 2016-12-29 2017-04-26 郭晓凤 Distributed metadata management method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7051155B2 (en) * 2002-08-05 2006-05-23 Sun Microsystems, Inc. Method and system for striping data to accommodate integrity metadata
CN103399823B (en) * 2011-12-31 2016-03-30 华为数字技术(成都)有限公司 The storage means of business datum, equipment and system
US8914668B2 (en) * 2012-09-06 2014-12-16 International Business Machines Corporation Asynchronous raid stripe writes to enable response to media errors
CN102937964B (en) * 2012-09-28 2015-02-11 无锡江南计算技术研究所 Intelligent data service method based on distributed system
CN103729436A (en) * 2013-12-27 2014-04-16 中国科学院信息工程研究所 Distributed metadata management method and system
US9772787B2 (en) * 2014-03-31 2017-09-26 Amazon Technologies, Inc. File storage using variable stripe sizes
WO2015188014A1 (en) * 2014-06-04 2015-12-10 Pure Storage, Inc. Automatically reconfiguring a storage memory topology
WO2017113276A1 (en) * 2015-12-31 2017-07-06 华为技术有限公司 Data reconstruction method, apparatus and system in distributed storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, BO.: "Research on the metadata management of multinamenodes based on HDFS", CHINA MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE), vol. 2014, no. 5, 15 May 2014 (2014-05-15), ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN109144406A (en) 2019-01-04
CN109144406B (en) 2020-08-07
CN111949210A (en) 2020-11-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18825500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18825500

Country of ref document: EP

Kind code of ref document: A1