WO2019071595A1 - Data storage method and apparatus in distributed block storage system, and computer readable storage medium - Google Patents

Data storage method and apparatus in distributed block storage system, and computer readable storage medium

Info

Publication number
WO2019071595A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
strip
stripe
client
storage node
Prior art date
Application number
PCT/CN2017/106147
Other languages
English (en)
French (fr)
Inventor
魏明昌
饶蓉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201780002700.6A (CN110325958B)
Priority to PCT/CN2017/106147
Priority to EP17890845.5A (EP3495939B1)
Priority to US16/172,264 (US20190114076A1)
Publication of WO2019071595A1

Classifications

    • G06F Electric digital data processing
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/061 Improving I/O performance
    • G06F3/064 Management of blocks
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/065 Replication mechanisms
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Definitions

  • the present invention relates to the field of information technology, and in particular, to a data storage method, apparatus, and computer readable storage medium in a distributed block storage system.
  • In the prior art, the distributed block storage system includes a partition, and the partition includes storage nodes and stripes. Each stripe in the partition includes a plurality of strips, and each strip in a stripe corresponds to one storage node in the partition.
  • The partition includes a primary storage node (storage node 1). The primary storage node receives the data sent by the client, selects a stripe, and divides the data into strip data.
  • The primary storage node then sends the strip data to be stored by the other storage nodes to the corresponding storage nodes (storage node 2, storage node 3, and storage node 4).
  • This makes the primary storage node a data write bottleneck and increases data interaction between storage nodes, which reduces the write performance of the distributed block storage system.
  • the present application provides a distributed block storage system data storage method and apparatus, which does not require a primary storage node, reduces data interaction between storage nodes, and improves write performance of a distributed block storage system.
  • a first aspect of the present application provides a data storage method in a distributed block storage system.
  • The distributed block storage system includes a partition P, where the partition P includes M storage nodes N j and R stripes S i , and each stripe includes a strip SU ij , where j is each of the integers 1 through M and i is each of the integers 1 through R. In the method, the first client receives a first write request, where the first write request includes the first data and a logical address; the first client determines that the logical address is distributed in the partition P, and obtains the stripe S N from the R stripes included in the partition P, where N is a value of the integers 1 to R; the first client divides the first data to obtain data of one or more strips SU Nj in the stripe S N , and sends the data of the one or more strips SU Nj to the storage node N j .
  • Thus, the client obtains a stripe according to the partition, divides the data into strip data, and sends the strip data to the corresponding storage nodes. No primary storage node is needed, which reduces data interaction between storage nodes; the strip data obtained by dividing the data is written concurrently to the corresponding storage nodes, which also improves the write performance of the distributed block storage system.
  • The physical address of the strip SU ij of each stripe at the storage node N j may be pre-allocated by the stripe metadata server.
  • The stripe may be a stripe generated according to an erasure coding (EC) algorithm or a stripe generated by a multi-copy algorithm.
  • When the stripe is generated by the EC algorithm, the strips SU ij in the stripe include data strips and check strips; when the stripe is generated by the multi-copy algorithm, the strips SU ij in the stripe are all data strips, and the data in the data strips is the same.
  • The data of the data strip SU Nj further includes metadata, such as the data strip SU Nj identifier and the logical address of the data of the data strip SU Nj .
  • Further, the first client receives a second write request, where the second write request includes the second data and the logical address; that is, the logical address of the first data and the logical address of the second data are the same. The first client determines that the logical address is distributed in the partition P, and obtains the stripe S Y from the R stripes included in the partition P, where Y is a value of the integers 1 to R and N is different from Y. The first client divides the second data to obtain data of one or more strips SU Yj in the stripe S Y , and sends the data of the one or more strips SU Yj to the storage node N j .
  • The data of the data strip SU Yj also includes metadata, such as the data strip SU Yj identifier and the logical address of the data of the data strip SU Yj .
  • Further, the second client receives a third write request, where the third write request includes the third data and the logical address; that is, the logical address of the first data and the logical address of the third data are the same.
  • The second client determines that the logical address is distributed in the partition P, and obtains the stripe S K from the R stripes included in the partition P, where K is a value of the integers 1 to R and N is different from K.
  • The second client divides the third data to obtain data of one or more strips SU Kj in the stripe S K , and sends the data of the one or more strips SU Kj to the storage node N j .
  • The data of the data strip SU Kj also includes metadata, such as the data strip SU Kj identifier and the logical address of the data of the data strip SU Kj .
  • the first client and the second client can access the same logical address.
  • Further, the data of each of the one or more strips SU Nj includes at least one of a first client identifier and a timestamp TP N at which the first client obtained the stripe S N .
  • The storage nodes of the distributed block storage system may determine, according to the first client identifier in the data of the strip SU Nj , that the strip was written by the first client, and may determine, according to the timestamp TP N in the data of the strip SU Nj , the order in which the first client wrote the stripes.
  • Further, the data of each of the one or more strips SU Yj includes at least one of the first client identifier and a timestamp TP Y at which the first client obtained the stripe S Y .
  • The storage nodes of the distributed block storage system may determine, according to the first client identifier in the data of the strip SU Yj , that the strip was written by the first client, and may determine, according to the timestamp TP Y in the data of the strip SU Yj , the order in which the first client wrote the stripes.
  • Further, the data of each of the one or more strips SU Kj includes at least one of a second client identifier and a timestamp TP K at which the second client obtained the stripe S K .
  • The storage nodes of the distributed block storage system may determine, according to the second client identifier in the data of the strip SU Kj , that the strip was written by the second client, and may determine, according to the timestamp TP K in the data of the strip SU Kj , the order in which the second client wrote the stripes.
  • Further, the strip SU ij in the stripe S i is allocated from the storage node N j by the stripe metadata server according to the mapping between the partition P and the storage nodes N j included in the partition P.
  • The stripe metadata server pre-allocates the physical storage address from the storage node N j for the strip SU ij in the stripe S i , which can reduce the waiting time when the client writes data, thereby improving the write performance of the distributed block storage system.
  • Further, the data of each of the one or more strips SU Nj includes data strip state information, which identifies whether each data strip of the stripe S N is empty, so that all-0 data does not need to be written to the storage nodes in place of the data of empty strips, reducing the amount of data written to the distributed block storage system.
  • A second aspect of the present application further provides a data storage method in a distributed block storage system. The distributed block storage system includes a partition P, where the partition P includes M storage nodes N j and R stripes S i , and each stripe includes a strip SU ij , where j is each of the integers 1 through M and i is each of the integers 1 through R. In the method, the storage node N j receives the data of the strip SU Nj in the stripe S N sent by the first client, where the data of the strip SU Nj is obtained by the first client dividing the first data, and the first data is obtained by the first client receiving a first write request; the first write request includes the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P. The storage node N j stores the data of the strip SU Nj to a first physical address according to a mapping between the strip SU Nj identifier and the first physical address of the storage node N j .
  • The logical address is the address at which the data written by the client is stored in the distributed block storage system; therefore, the logical address being distributed in the partition P has the same meaning as the first data being distributed in the partition P.
  • The storage node N j only receives the data of the strip SU Nj sent by the client. Therefore, the distributed block storage system does not need a primary storage node, which reduces data interaction between storage nodes; the strip data obtained by division is written concurrently to the corresponding storage nodes, which also improves the write performance of the distributed block storage system.
  • The physical address of the strip SU ij of each stripe at the storage node N j may be pre-allocated by the stripe metadata server; therefore, the first physical address of the strip SU Nj at the storage node N j is also pre-allocated by the stripe metadata server.
  • The stripe may be a stripe generated according to the EC algorithm or a stripe generated by the multi-copy algorithm.
  • When the stripe is generated by the EC algorithm, the strips SU ij in the stripe include data strips and check strips; when the stripe is generated by the multi-copy algorithm, the strips SU ij in the stripe are all data strips, and the data in the data strips is the same.
  • The data of the data strip SU Nj further includes metadata, such as the data strip SU Nj identifier and the logical address of the data of the data strip SU Nj .
  • Further, the method includes: the storage node N j assigns a timestamp TP Nj to the data of the strip SU Nj .
  • The timestamp TP Nj can be used as a reference timestamp for recovering the data of the strips in the stripe S N after another storage node fails.
  • Further, the method includes: the storage node N j stores the correspondence between the logical address of the data of the strip SU Nj and the strip SU Nj identifier, so that the client can access the data of the strip SU Nj stored by the storage node N j in the distributed block storage system using the logical address.
  • Further, the data of the strip SU Nj includes at least one of the first client identifier and the timestamp TP N at which the first client obtained the stripe S N .
  • The storage node N j may determine, according to the first client identifier in the data of the strip SU Nj , that the strip was written by the first client, and may determine, according to the timestamp TP N in the data of the strip SU Nj , the order in which the first client wrote the stripes.
  • Further, the method includes: the storage node N j receives the data of the strip SU Yj in the stripe S Y sent by the first client, where the data of the strip SU Yj is obtained by the first client dividing the second data, and the second data is obtained by the first client receiving a second write request.
  • The second write request includes the second data and the logical address; the logical address is used to determine that the second data is distributed in the partition P; that is, the logical address of the first data and the logical address of the second data are the same. The storage node N j stores the data of the strip SU Yj to a second physical address according to a mapping between the strip SU Yj identifier and the second physical address of the storage node N j .
  • The data of the data strip SU Yj also includes metadata, such as the data strip SU Yj identifier and the logical address of the data of the data strip SU Yj .
  • Further, the storage node N j assigns a timestamp TP Yj to the data of the strip SU Yj ; the timestamp TP Yj can be used as a reference timestamp for recovering the data of the strips in the stripe S Y after another storage node fails.
  • Further, the method includes: the storage node N j stores the correspondence between the logical address of the data of the strip SU Yj and the strip SU Yj identifier, so that the client can access the data of the strip SU Yj stored by the storage node N j in the distributed block storage system using the logical address.
  • Further, the data of the strip SU Yj includes at least one of the first client identifier and the timestamp TP Y at which the first client obtained the stripe S Y .
  • The storage node N j may determine, according to the first client identifier in the data of the strip SU Yj , that the strip was written by the first client, and may determine, according to the timestamp TP Y in the data of the strip SU Yj , the order in which the first client wrote the stripes.
  • Further, the method includes: the storage node N j receives the data of the strip SU Kj in the stripe S K sent by the second client, where the data of the strip SU Kj is obtained by the second client dividing the third data, and the third data is obtained by the second client receiving a third write request. The third write request includes the third data and the logical address; the logical address is used to determine that the third data is distributed in the partition P; that is, the logical address of the first data and the logical address of the third data are the same. The storage node N j stores the data of the strip SU Kj to a third physical address according to a mapping between the strip SU Kj identifier and the third physical address of the storage node N j .
  • The data of the data strip SU Kj also includes metadata, such as the data strip SU Kj identifier and the logical address of the data of the data strip SU Kj .
  • Further, the method includes: the storage node N j assigns a timestamp TP Kj to the data of the strip SU Kj .
  • The timestamp TP Kj can be used as a reference timestamp for recovering the data of the strips in the stripe S K after another storage node fails.
  • Further, the method includes: the storage node N j stores the correspondence between the logical address of the data of the strip SU Kj and the strip SU Kj identifier, so that the client can access the data of the strip SU Kj stored by the storage node N j in the distributed block storage system using the logical address.
  • Further, the data of the strip SU Kj includes at least one of the second client identifier and the timestamp TP K at which the second client obtained the stripe S K .
  • The storage node N j may determine, according to the second client identifier in the data of the strip SU Kj , that the strip was written by the second client, and may determine, according to the timestamp TP K in the data of the strip SU Kj , the order in which the second client wrote the stripes.
  • Further, the strip SU ij in the stripe S i is allocated from the storage node N j by the stripe metadata server according to the mapping between the partition P and the storage nodes N j included in the partition P.
  • The stripe metadata server pre-allocates the physical storage address from the storage node N j for the strip SU ij in the stripe S i , which can reduce the waiting time when the client writes data, thereby improving the write performance of the distributed block storage system.
  • Further, the data of each strip also includes data strip state information, which identifies whether each data strip of the stripe S N is empty, so that all-0 data does not need to be written to the storage nodes in place of the data of empty strips, reducing the amount of data written to the distributed block storage system.
  • Further, when a new storage node recovers the data of the strip SU Nj and the data of the strip SU Kj according to the stripes S N and S K respectively, the new storage node obtains the timestamp TP NX of the data of the strip SU NX in the storage node N X as the reference timestamp of the data of the strip SU Nj , and obtains the timestamp TP KX of the data of the strip SU KX in the storage node N X as the reference timestamp of the data of the strip SU Kj ; according to the timestamp TP NX and the timestamp TP KX , the new storage node eliminates from the cache the data of whichever of the strips SU Nj and SU Kj is earlier in time, where X is any one of the integers 1 to M except j. The latest strip data is retained in the storage system, thereby saving cache space.
  • Further, when a new storage node recovers the data of the strip SU Nj and the data of the strip SU Yj according to the stripes S N and S Y respectively, the data of the strip SU Nj contains the timestamp TP N and the data of the strip SU Yj contains the timestamp TP Y ; according to the timestamp TP N and the timestamp TP Y , the new storage node eliminates from the cache the data of whichever of the strips SU Nj and SU Yj is earlier in time. The latest strip of the same client is retained in the storage system, thereby saving cache space.
  • A third aspect of the present application further provides a data writing device in a distributed block storage system.
  • The data writing device comprises a plurality of units for performing the first aspect of the present application or any of the first to seventh possible implementations of the first aspect.
  • A fourth aspect of the present application further provides a data storage device in a distributed block storage system.
  • The data storage device comprises a plurality of units for performing the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect.
  • A fifth aspect of the present application further provides a distributed block storage system, where the storage node N j in the distributed block storage system is configured to perform the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect.
  • A sixth aspect of the present application further provides a client, applied to the distributed block storage system of the first aspect of the present application or any of the first to seventh possible implementations of the first aspect, where the client includes a processor and an interface, the processor is in communication with the interface, and the processor is configured to perform the first aspect of the present application or any of the first to seventh possible implementations of the first aspect.
  • A seventh aspect of the present application further provides a storage node, applied to the distributed block storage system of the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect. The storage node, serving as the storage node N j , includes a processor and an interface, the processor is in communication with the interface, and the processor is configured to perform the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect.
  • An eighth aspect of the present application further provides a computer readable storage medium, applicable to the distributed block storage system of the first aspect of the present application or any of the first to seventh possible implementations of the first aspect. The computer readable storage medium contains computer instructions for causing the client to perform the first aspect of the present application or any of the first to seventh possible implementations of the first aspect.
  • A ninth aspect of the present application further provides a computer readable storage medium, applicable to the distributed block storage system of the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect. The computer readable storage medium contains computer instructions for causing the storage node to perform the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect.
  • A tenth aspect of the present application further provides a computer program product, applied to the distributed block storage system of the first aspect of the present application or any of the first to seventh possible implementations of the first aspect. The computer program product comprises computer instructions for causing a client to perform the first aspect of the present application or any of the first to seventh possible implementations of the first aspect.
  • An eleventh aspect of the present application further provides a computer program product, applied to the distributed block storage system of the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect. The computer program product comprises computer instructions for causing a storage node to perform the second aspect of the present application or any of the first to fifteenth possible implementations of the second aspect.
  • FIG. 1 is a schematic diagram of data storage of a distributed block storage system in the prior art
  • FIG. 2 is a schematic diagram of a distributed block storage system according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a server in a distributed block storage system according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a partitioned view of a distributed block storage system according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a relationship between a stripe and a storage node in a distributed block storage system according to an embodiment of the present invention
  • FIG. 6 is a flowchart of a method for a client to write data in a distributed block storage system according to an embodiment of the present invention
  • FIG. 7 is a schematic diagram of determining a partition of a client in a distributed block storage system according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of a method for a storage node to store data in a distributed block storage system according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram of storage node storage and stripe in a distributed block storage system according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of storage node storage and stripe in a distributed block storage system according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a data writing apparatus in a distributed block storage system according to an embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a data storage device in a distributed block storage system according to an embodiment of the present invention.
  • A distributed block storage system in an embodiment of the present invention is shown in FIG. 2.
  • the distributed block storage system includes a plurality of servers, such as a server 1, a server 2, a server 3, a server 4, a server 5, and a server 6, and the servers communicate with each other through InfiniBand or an Ethernet network.
  • the number of servers in the distributed block storage system may be increased according to actual requirements, which is not limited by the embodiment of the present invention.
  • The server of the distributed block storage system includes the structure shown in FIG. 3. As shown in FIG. 3, each server in the distributed block storage system includes a central processing unit (CPU) 301, a memory 302, an interface 303, a hard disk 1, a hard disk 2, and a hard disk 3.
  • The memory 302 stores computer instructions.
  • the CPU 301 executes the program instructions in the memory 302 to perform the corresponding operations.
  • the interface 303 can be a hardware interface, such as a network interface card (NIC) or a host bus adapter (HBA), or a program interface module.
  • The hard disk may be a solid state disk (SSD), a mechanical hard disk, or a hybrid hard disk; a mechanical hard disk is, for example, an HDD (hard disk drive).
  • Alternatively, a field programmable gate array (FPGA) or other hardware may perform the above operations instead of the CPU 301, or the FPGA or other hardware may perform the above operations together with the CPU 301.
  • For convenience, the embodiment of the present invention collectively refers to the combination of the CPU 301 and the memory 302, the FPGA or other hardware that replaces the CPU 301, and the combination of the FPGA or other hardware and the CPU 301, as a processor.
  • an application is loaded in the memory 302, and the CPU 301 executes an application instruction in the memory 302, and the server serves as a client.
  • the client can also be a device independent of the server shown in FIG. 2.
  • the application can be a virtual machine (VM) or a specific application, such as office software.
  • the client writes data to or reads data from the distributed block device storage.
  • the structure of the client can be referred to FIG. 3 and related description.
  • the distributed storage system program is loaded in the memory 302.
  • The CPU 301 executes the distributed block storage system program in the memory 302, provides a block protocol access interface to the client, and provides a distributed block storage access point service for the client, enabling the client to access storage resources in the storage resource pool of the distributed block storage system.
  • the block protocol access interface is used to provide logical units to clients.
  • the server runs the distributed block storage system program to make the server containing the hard disk a storage node for storing client data.
  • By default, the server may treat each hard disk as a storage node; that is, when the server contains multiple hard disks, it can serve as multiple storage nodes. In another implementation, the server as a whole serves as one storage node running the distributed block storage system program. This is not limited by the embodiment of the present invention. Therefore, for the structure of the storage node, reference may be made to FIG. 3 and the related description.
  • The hash space (such as 0 to 2^32) is divided into N equal parts, each of which is a partition (Partition), and the N equal parts are evenly distributed according to the number of hard disks.
  • For example, N defaults to 3600; that is, the partitions are P1, P2, P3...P3600, respectively.
  • each storage node carries 200 partitions.
  • The partition P includes M storage nodes N j . The mapping relationship between the partition and the storage nodes, that is, the mapping between the partition and the storage nodes N j included in the partition, is also referred to as a partition view, as shown in FIG. 4.
  • When the partition includes four storage nodes N j , the partition view is "P2 - storage node N 1 - storage node N 2 - storage node N 3 - storage node N 4 ", where j is each of the integers 1 through M.
  • The partition view is allocated when the distributed block storage system is initialized, and is subsequently adjusted as the number of hard disks in the distributed block storage system changes. The client saves the partition view, for example as sketched below.
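  • As a concrete illustration, a partition view can be modeled as a mapping from a partition identifier to the list of storage nodes that back it. The following is a minimal Python sketch of that structure; the names and layout are assumptions, not from the patent text.

```python
# Minimal sketch of a partition view, assuming a plain dict keyed by
# partition identifier; structure and names are illustrative.
PARTITION_VIEW: dict[str, list[str]] = {
    "P1": ["N1", "N2", "N3", "N4"],
    "P2": ["N1", "N2", "N3", "N4"],  # the example partition used below
    # ... up to "P3600"
}

def nodes_of(partition_id: str) -> list[str]:
    """Return the M storage nodes N_j that back a partition."""
    return PARTITION_VIEW[partition_id]
```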
  • To improve data reliability, the erasure coding (EC) algorithm can be used, such as a 3+1 mode, that is, 3 data strips and 1 check strip per stripe.
  • The partition stores data in the form of stripes, one partition containing R stripes S i , where i is each of the integers 1 through R.
  • the embodiment of the present invention takes P2 as an example for description.
  • The distributed block storage system manages the hard disk in slices of 4 KB, records the allocation information of each 4 KB slice in the metadata management area of the hard disk, and the slices of the hard disks constitute the storage resource pool.
  • The distributed block storage system includes a stripe metadata server; in a specific implementation, a stripe metadata management program may run on one or more servers in the distributed block storage system.
  • The stripe metadata server allocates stripes to partitions according to the partition view shown in FIG. 4.
  • Allocating a physical storage address, that is, storage space, specifically includes: allocating a physical storage address from the storage node N 1 for SU i1 , a physical storage address from the storage node N 2 for SU i2 , a physical storage address from the storage node N 3 for SU i3 , and a physical storage address from the storage node N 4 for SU i4 .
  • The storage node N j records the mapping of the strip SU ij identifier to the physical storage address.
  • the stripe metadata server allocates physical addresses from the storage nodes for strips in the stripe, which may be pre-allocated when the distributed block storage system is initialized, or pre-allocated before the client sends data to the storage node.
  • the strip SU ij in the stripe S i is only a section of storage space before the client writes data.
  • The data of the strip SU ij is obtained by the client dividing data according to the size of the strip SU ij ; that is, the strip SU ij contained in the stripe S i stores the data obtained by the client through division, namely the data of the strip SU ij .
  • The stripe metadata server assigns a version number to the strip identifier of each strip in a stripe.
  • When a stripe is released, the version number of the strip identifiers of the strips in the released stripe is updated, and the updated strip identifiers are used as the strip identifiers of the strips in a new stripe.
  • the stripe metadata server pre-allocates the physical storage address from the storage node N j for the strip SU ij in the stripe S i , which can reduce the waiting time when the client writes data, thereby improving the write performance of the distributed block storage system.
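  • To make the pre-allocation concrete, the following Python sketch reserves a physical extent on each storage node N j for the strip SU ij of every stripe S i and records the strip-identifier-to-physical-address mapping on the node. The free-space cursor, identifier format, and strip size are assumptions for illustration only.

```python
STRIP_SIZE = 4 * 1024  # illustrative strip length (the hard disks are managed in 4 KB slices)

class StorageNode:
    """Toy storage node keeping the strip-identifier -> physical-address map."""

    def __init__(self, name: str):
        self.name = name
        self.next_free = 0  # toy free-space cursor standing in for real allocation
        self.strip_to_addr: dict[str, int] = {}  # mapping recorded in node metadata

    def allocate(self, strip_id: str) -> int:
        addr = self.next_free
        self.next_free += STRIP_SIZE
        self.strip_to_addr[strip_id] = addr
        return addr

def preallocate(nodes: list[StorageNode], partition: str, stripe_count: int) -> None:
    """Reserve a physical extent on node N_j for the strip SU_ij of every stripe S_i."""
    for i in range(1, stripe_count + 1):
        for j, node in enumerate(nodes, start=1):
            node.allocate(f"{partition}/S{i}/SU_{i}{j}")
```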
  • the client mounts the logical unit allocated by the distributed block storage system, thereby performing a data access operation.
  • the logical unit is also called a logical unit number (LUN).
  • one logical unit can be mounted to only one client; one logical unit can also be mounted to multiple clients, that is, multiple clients share one logical unit.
  • the logical unit is provided by the storage resource pool shown in FIG. 2.
  • the first client performs the following steps:
  • Step 601 The first client receives the first write request; the first write request includes the first data and the logical address.
  • the first client can be a VM, or a server.
  • the application is run in the first client, and the application accesses the logical unit mounted by the first client, for example, sends a first write request to the logical unit.
  • the first write request contains the first data and a logical address, which is also referred to as a Logical Block Address (LBA).
  • the logical address is used to indicate the write location of the first data in the logical unit.
  • Step 602 The first client determines that the logical address is distributed in the partition P.
  • the partition P2 is taken as an example.
  • The first client stores the partition view of the distributed block storage system. As shown in FIG. 7, the first client determines, according to the partition view, the partition in which the logical address included in the first write request is located. In one implementation, the first client generates a key according to the logical address, calculates the hash value of the key according to a hash algorithm, and determines the partition corresponding to the hash value, thereby determining that the logical address is distributed in the partition P2; this is also referred to as the first data being distributed in the partition P2.
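  • A minimal sketch of this partition lookup follows; the key derivation (decimal LBA string) and the hash algorithm (SHA-1) are stand-ins, since the text does not specify either.

```python
import hashlib

PARTITION_COUNT = 3600  # N in the text; partitions P1..P3600

def partition_of(logical_address: int) -> str:
    """Map a logical address to its partition via key generation and hashing."""
    key = str(logical_address).encode()          # assumed key derivation
    h = int(hashlib.sha1(key).hexdigest(), 16)   # assumed hash algorithm
    return f"P{h % PARTITION_COUNT + 1}"
```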
  • Step 603 The first client obtains the stripe S N from the R stripes; wherein N is a value of the integers 1 to R.
  • The stripe metadata server manages the correspondence between partitions and stripes, and the correspondence between the strips in a stripe and storage nodes.
  • In one implementation of obtaining the stripe S N , the first client determines that the logical address is distributed in the partition P2, and queries the stripe metadata server to obtain one stripe S N among the R stripes included in the partition P2.
  • Since the logical address is the address at which the data written by the client is stored in the distributed block storage system, the logical address being distributed in the partition P has the same meaning as the first data being distributed in the partition P.
  • In another implementation, the first client may obtain the stripe S N from the stripes among the R stripes that have already been assigned to the first client.
  • Step 604 The first client divides the first data into data of one or more strips SU Nj in the stripe S N .
  • The stripe S N is composed of the strips SU Nj .
  • The first client receives the first write request, caches the first data included in the first write request, and divides the cached data according to the size of the strips in the stripe.
  • Specifically, the first client obtains strip-size data according to the length of the strips in the stripe, and takes the logical address of the strip-size data modulo the number of storage nodes M (for example, 4) in the partition, thereby determining the position of the strip-size data in the stripe, that is, the corresponding strip SU Nj ; the first client further determines, according to the partition view, that the strip SU Nj corresponds to the storage node N j , so that strip data of the same logical address is distributed on the same storage node.
  • In this way, the first data is divided into data of one or more strips SU Nj .
  • the stripe S N includes four strips, which are SU N1 , SU N2 , SU N3 , and SU N4 , respectively .
  • Take the data of dividing the first data into two strips as an example, that is, data of SU N1 and data of SU N2 , respectively.
  • The data of the strip SU N3 may be obtained by dividing the data in other write requests sent by the first client; reference may be made to the description of the first write request. Further, the data of the check strip SU N4 is generated based on the data of SU N1 , the data of SU N2 , and the data of SU N3 ; the data of the check strip SU N4 is also referred to as check data. For how to generate the data of the check strip from the data of the data strips in the stripe, reference may be made to existing stripe implementation algorithms, which are not described in detail in the embodiment of the present invention; a simplified sketch follows.
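  • The division and check-strip generation can be sketched as follows; the strip length, the addressing granularity of the modulo step, and the use of XOR as a stand-in for the EC code are all assumptions for illustration.

```python
STRIP_SIZE = 4 * 1024  # illustrative strip length
M = 4                  # storage nodes in the partition (3+1 EC in the example)

def place_chunks(data: bytes, base_lba: int) -> dict[int, bytes]:
    """Divide cached write data into strip-size chunks and place each chunk.

    Per the text, the chunk's logical address modulo M fixes its position in
    the stripe, so data at the same logical address always lands on the same
    storage node. Assumes at most one stripe's worth of data per call.
    """
    placed: dict[int, bytes] = {}
    for off in range(0, len(data), STRIP_SIZE):
        lba = base_lba + off
        j = (lba // STRIP_SIZE) % M            # position j of strip SU_Nj
        placed[j] = data[off:off + STRIP_SIZE]
    return placed

def check_strip(data_strips: list[bytes]) -> bytes:
    """Toy check data: XOR of the data strips (a stand-in for the EC code)."""
    parity = bytearray(STRIP_SIZE)
    for strip in data_strips:
        padded = strip.ljust(STRIP_SIZE, b"\x00")
        for k in range(STRIP_SIZE):
            parity[k] ^= padded[k]
    return bytes(parity)
```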
  • The stripe S N includes 4 strips, that is, 3 data strips and 1 check strip.
  • When the first client has cached data for more than a certain period of time and needs to write the data to the storage nodes, but cannot fill all the data strips, for example, only the data of the strip SU N1 and the data of the strip SU N2 obtained by dividing the first data are available, the check strip is generated based on the data of SU N1 and the data of SU N2 .
  • The data of a valid data strip SU Nj includes the data strip state information of the stripe S N , where a valid data strip SU Nj refers to a data strip whose data is not empty.
  • Both the data of the valid data strip SU N1 and the data of SU N2 include the data strip state information of the stripe S N , which identifies whether each data strip of the stripe S N is empty. If 1 denotes a non-empty data strip and 0 denotes an empty data strip, the data strip state information in the data of SU N1 is 110, and the data strip state information in the data of SU N2 is 110, indicating that SU N1 is not empty, SU N2 is not empty, and SU N3 is empty.
  • the data of the check strip SU N4 generated based on the data of SU N1 and the data of SU N2 includes the check data of the data strip state information.
  • In this way, the first client does not need to replace the data of SU N3 with all-0 data and write it to the storage node N 3 , which reduces the amount of data written.
  • When the first client reads the stripe S N , it determines that the data of SU N3 is empty according to the data strip state information of the stripe S N included in the data of the data strip SU N1 or SU N2 .
  • When none of the data strips is empty, the data of SU N1 , the data of SU N2 , and the data of SU N3 each include the data strip state information 111, and the data of the check strip SU N4 generated according to the data of SU N1 , the data of SU N2 , and the data of SU N3 contains the check data of the data strip state information.
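  • The data strip state information can be modeled as a simple bitmap, as in the following sketch (names illustrative).

```python
def stripe_state(filled: list[bool]) -> str:
    """Encode whether each data strip is non-empty: 1 = has data, 0 = empty."""
    return "".join("1" if f else "0" for f in filled)

state = stripe_state([True, True, False])  # SU_N1, SU_N2 written; SU_N3 empty
assert state == "110"
# A reader that sees "110" skips SU_N3 instead of fetching all-0 filler, and
# the writer never sends the all-0 substitute data to storage node N_3.
```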
  • Further, the data of the data strip SU Nj includes at least one of the first client identifier and the timestamp TP N at which the first client obtained the stripe S N , that is, any one or a combination of the first client identifier and the timestamp TP N .
  • Correspondingly, the data of the check strip also includes the first client identifier and the timestamp TP N at which the first client obtained the stripe S N .
  • Further, the data of the data strip SU Nj includes metadata, such as the data strip SU Nj identifier and the logical address of the data of the data strip SU Nj .
  • Step 605 The first client sends data of one or more strips SU Nj to the storage node N j .
  • the first client sends the data of the SU N1 obtained by the first data division to the storage node N 1 , and sends the data of the SU N2 obtained by the first data division to the storage node N 2 .
  • The first client can concurrently send the data of the strips SU Nj of the stripe S N to the storage nodes N j without a primary storage node, which reduces data interaction between storage nodes and improves write concurrency, thereby improving the write performance of the distributed block storage system.
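  • A minimal sketch of this concurrent dispatch, assuming a thread pool and a placeholder RPC, is shown below; no primary storage node sits between the client and the storage nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def send_strip(node: str, strip_id: str, data: bytes) -> None:
    """Placeholder for the RPC that writes one strip to one storage node."""
    ...

def write_stripe_concurrently(placement: dict[str, tuple[str, bytes]]) -> None:
    """Send every strip SU_Nj of stripe S_N to its node N_j in parallel.

    The client talks to each storage node directly, which is the concurrency
    the text credits for the improved write performance. Names illustrative.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(send_strip, node, sid, data)
                   for node, (sid, data) in placement.items()]
        for f in futures:
            f.result()  # propagate any send failure
```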
  • Further, the first client receives a second write request, where the second write request includes the second data and the logical address described in FIG. 6. The first client determines, according to the algorithm described in the flow of FIG. 6, that the logical address is distributed in the partition P2, obtains the stripe S Y from the R stripes, and divides the second data into data of one or more strips SU Yj in the stripe S Y , such as the data of SU Y1 and the data of SU Y2 .
  • The first client sends the data of the one or more strips SU Yj to the storage node N j , that is, sends the data of SU Y1 to the storage node N 1 and the data of SU Y2 to the storage node N 2 .
  • The data of a valid data strip SU Yj includes the data strip state information of the stripe S Y .
  • The data of the data strip SU Yj further includes at least one of the first client identifier and the timestamp TP Y at which the first client obtained the stripe S Y .
  • Further, the data of the data strip SU Yj includes metadata, such as the data strip SU Yj identifier and the logical address of the data of the data strip SU Yj . For further description, reference may be made to the description of the first client in FIG. 6; details are not described herein again.
  • For the meaning of the first client obtaining the stripe S Y from the R stripes, reference may be made to the meaning of the first client obtaining the stripe S N from the R stripes; details are not described herein again.
  • Further, the second client receives a third write request, where the third write request includes the third data and the logical address described in FIG. 6.
  • The second client determines, according to the algorithm described in the flow of FIG. 6, that the logical address is distributed in the partition P2.
  • The second client obtains the stripe S K from the R stripes, and divides the third data into data of one or more strips SU Kj in the stripe S K , such as the data of SU K1 and the data of SU K2 .
  • The second client sends the data of the one or more strips SU Kj to the storage node N j , that is, sends the data of SU K1 to the storage node N 1 and the data of SU K2 to the storage node N 2 .
  • Here, K is a value of the integers 1 to R, and N is different from K.
  • The logical address being distributed in the partition P has the same meaning as the third data being distributed in the partition P.
  • For the meaning of the second client obtaining the stripe S K from the R stripes, reference may be made to the meaning of the first client obtaining the stripe S N from the R stripes; details are not described herein again.
  • The data of a valid data strip SU Kj includes the data strip state information of the stripe S K .
  • The data of the data strip SU Kj further includes at least one of the second client identifier and the timestamp TP K at which the second client obtained the stripe S K .
  • The data of the data strip SU Kj also includes metadata, such as the data strip SU Kj identifier and the logical address of the data of the data strip SU Kj .
  • In the prior art, the client needs to send data to the primary storage node first; the primary storage node divides the data into strip data and sends the strip data to be stored by the other storage nodes to the corresponding storage nodes. Therefore, the primary storage node becomes a data storage bottleneck in the distributed block storage system, and data interaction between storage nodes is increased.
  • In the embodiment of the present invention, the client divides the data into strip data and sends the strip data to the corresponding storage nodes, which does not require a primary storage node, reduces the pressure on a primary storage node, reduces data interaction between storage nodes, and concurrently writes the strip data obtained by division to the corresponding storage nodes, improving the write performance of the distributed block storage system.
  • The storage node N j performs the following steps:
  • Step 801 The storage node N j receives the data of the strip SU Nj in the stripe S N sent by the first client.
  • For example, the storage node N 1 receives the data of SU N1 sent by the first client, and the storage node N 2 receives the data of SU N2 sent by the first client.
  • Step 802 The storage node N j stores the data of the strip SU Nj to the first physical address according to the mapping between the strip SU Nj identifier and the first physical address of the storage node N j .
  • Specifically, the stripe metadata server pre-allocates, according to the partition view, the first physical address in the storage node N j for the strip SU Nj of the stripe S N in the partition; the mapping between the strip SU Nj identifier and the first physical address is stored in the metadata of the storage node N j , and when the storage node N j receives the data of the strip SU Nj , it stores the data of the strip SU Nj to the first physical address according to the mapping.
  • For example, the storage node N 1 receives the data of SU N1 sent by the first client and stores it to the first physical address of N 1 , and the storage node N 2 receives the data of SU N2 sent by the first client and stores it to the first physical address of N 2 .
  • In the prior art, the primary storage node receives the data sent by the client, divides the data into the data strips of the stripe, forms the data of the check strip according to the data strips, and sends the data of the strips stored by the other storage nodes to the corresponding storage nodes.
  • In the embodiment of the present invention, the storage node N j only receives the data of the strip SU Nj sent by the client, and no primary storage node is needed, thereby reducing data interaction between storage nodes; the strip data is concurrently written to the corresponding storage nodes, which improves the write performance of the distributed block storage system.
  • Since the data of the strip SU Nj is obtained by dividing the first data, and the first write request includes the logical address of the first data, the data of the strip SU Nj , as a part of the first data, also has a corresponding logical address.
  • The storage node N j establishes a mapping between the logical address of the data of the strip SU Nj and the strip SU Nj identifier.
  • Thus, the first client can still use the logical address to access the data of the strip SU Nj .
  • When reading data, the first client takes the logical address of the data of the strip SU Nj modulo the number of storage nodes M (for example, 4) in the partition P, determines that the strip SU Nj is distributed on the storage node N j , and sends a read request carrying the logical address of the data of the strip SU Nj to the storage node N j ; the storage node N j obtains the strip SU Nj identifier from the mapping between the logical address and the strip SU Nj identifier, obtains the first physical address from the mapping between the strip SU Nj identifier and the first physical address of the storage node N j , and reads the data of the strip SU Nj .
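  • The read path therefore chains two lookups on the storage node, as the following sketch shows; the attribute names lba_to_strip, strip_to_addr, and media_read are assumptions for illustration.

```python
STRIP_SIZE = 4 * 1024  # must match the write-side strip length

def read_strip(lba: int, nodes: list) -> bytes:
    """Client-plus-node read sketch following the text's two lookups.

    nodes[j] is assumed to expose lba_to_strip (logical address -> strip
    SU_Nj identifier), strip_to_addr (identifier -> first physical address),
    and media_read (physical read); all three names are illustrative.
    """
    j = (lba // STRIP_SIZE) % len(nodes)      # LBA mod M picks storage node N_j
    node = nodes[j]
    strip_id = node.lba_to_strip[lba]         # first mapping kept by N_j
    phys = node.strip_to_addr[strip_id]       # second mapping kept by N_j
    return node.media_read(phys, STRIP_SIZE)  # hypothetical device read
```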
  • Similarly, the storage node N j receives the data of the strip SU Yj in the stripe S Y sent by the first client; for example, the storage node N 1 receives the data of SU Y1 sent by the first client, and the storage node N 2 receives the data of SU Y2 sent by the first client.
  • The storage node N j stores the data of the strip SU Yj to the second physical address according to the mapping between the strip SU Yj identifier and the second physical address of the storage node N j ; for example, the data of SU Y1 is stored to the second physical address of N 1 , and the data of SU Y2 is stored to the second physical address of N 2 .
  • The storage node N j establishes a mapping between the logical address of the data of the strip SU Yj and the strip SU Yj identifier.
  • The first client can still use the logical address to access the data of the strip SU Yj .
  • The data of the strip SU Yj has the same logical address as the data of the strip SU Nj .
  • Further, when a logical unit is mounted to both the first client and the second client, the storage node N j receives the data of the strip SU Kj in the stripe S K sent by the second client.
  • For example, the storage node N 1 receives the data of SU K1 sent by the second client, and the storage node N 2 receives the data of SU K2 sent by the second client.
  • The storage node N j stores the data of the strip SU Kj to the third physical address according to the mapping between the strip SU Kj identifier and the third physical address of the storage node N j ; for example, the data of SU K1 is stored to the third physical address of N 1 , and the data of SU K2 is stored to the third physical address of N 2 .
  • The data of the strip SU Kj , as a part of the third data, also has a corresponding logical address; therefore, the storage node N j establishes a mapping between the logical address of the data of the strip SU Kj and the strip SU Kj identifier.
  • The second client can still use the logical address to access the data of the strip SU Kj .
  • The data of the strip SU Kj has the same logical address as the data of the strip SU Nj .
  • Further, the storage node N j assigns a timestamp TP Nj to the data of the strip SU Nj , assigns a timestamp TP Kj to the data of the strip SU Kj , and assigns a timestamp TP Yj to the data of the strip SU Yj .
  • Based on the timestamps, the storage node N j can eliminate from the cache the data of the earlier strip with the same logical address and retain the latest strip data, thereby saving cache space.
  • In one implementation, the data of the strip SU Nj sent by the first client to the storage node N j includes the timestamp TP N at which the first client obtained the stripe S N , and the data of the strip SU Yj sent by the first client to the storage node N j includes the timestamp TP Y at which the first client obtained the stripe S Y . As shown in FIG. 9, none of the data strips of the stripe S N is empty, and the data of SU N1 , the data of SU N2 , and the data of SU N3 each include the timestamp TP N at which the first client obtained the stripe S N ; the data of the check strip SU N4 of the stripe S N includes the check data TP Np of the timestamp TP N . Likewise, none of the data strips of the stripe S Y is empty, and the data of SU Y1 , the data of SU Y2 , and the data of SU Y3 each include the timestamp TP Y at which the first client obtained the stripe S Y .
  • The new storage node eliminates the data of the earlier strip from the cache according to the timestamp TP N and the timestamp TP Y .
  • The new storage node may be the storage node N j after recovery from a fault, or a storage node newly added to the partition of the stripe in the distributed block storage system.
  • For example, the cache of the new storage node contains the data of the strip SU N1 and the data of SU Y1 , where the data of SU N1 includes the timestamp TP N , the data of SU Y1 includes the timestamp TP Y , and the timestamp TP N precedes the timestamp TP Y ; the new storage node therefore eliminates the data of the strip SU N1 from the cache, retaining the latest strip data in the storage system and saving cache space.
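  • This elimination rule can be sketched as follows, assuming each cache entry carries the client timestamp TP recorded when the stripe was obtained.

```python
def keep_latest(cache: dict[int, dict], lba: int, entry: dict) -> None:
    """Retain only the newest strip data cached for a logical address.

    An entry is assumed to look like {"data": bytes, "ts": int}, "ts" being
    the timestamp TP the client recorded when it obtained the stripe; the
    earlier entry (e.g. SU_N1 with TP_N vs. SU_Y1 with TP_Y) is eliminated.
    """
    old = cache.get(lba)
    if old is None or old["ts"] < entry["ts"]:
        cache[lba] = entry  # newer strip wins; earlier strip data is dropped
```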
  • Likewise, according to the timestamps assigned by the same client, the storage node N j can eliminate from the cache the data of the earlier strip from that client with the same logical address and retain the latest strip data, thereby saving cache space.
  • In another implementation, the storage node N j assigns a timestamp TP Nj to the data of the strip SU Nj sent by the first client, and assigns a timestamp TP Kj to the data of the strip SU Kj sent by the second client. As shown in FIG. 10, none of the data strips of the stripe S N is empty: the timestamp assigned by the storage node N 1 to the data of the strip SU N1 is TP N1 , the timestamp assigned by the storage node N 2 to the data of the strip SU N2 is TP N2 , the timestamp assigned by the storage node N 3 to the data of the strip SU N3 is TP N3 , and the timestamp assigned by the storage node N 4 to the data of the strip SU N4 is TP N4 . Likewise, none of the data strips of the stripe S K is empty: the timestamp assigned by the storage node N 1 to the data of the strip SU K1 is TP K1 , the timestamp assigned by the storage node N 2 to the data of the strip SU K2 is TP K2 , the timestamp assigned by the storage node N 3 to the data of the strip SU K3 is TP K3 , and the timestamp assigned by the storage node N 4 to the data of the strip SU K4 is TP K4 .
  • When a new storage node recovers the data of the strip SU Nj of the stripe S N and the data of the strip SU Kj of the stripe S K in the storage node N j : if the data of the strip SU Nj includes the timestamp TP N assigned by the first client, the data of the strip SU Kj includes the timestamp TP K assigned by the second client, and TP N and TP K are assigned by the same timestamp server, then the timestamp TP N in the data of the strip SU Nj and the timestamp TP K in the data of SU Kj can be compared directly, and the new storage node eliminates the data of the earlier strip from the cache according to the timestamp TP N and the timestamp TP K .
  • If the data of the strip SU Nj does not include the timestamp TP N assigned by the first client and/or the data of the strip SU Kj does not include the timestamp TP K assigned by the second client, or the timestamps TP N and TP K are not comparable, and the cache of the new storage node contains the data of the strip SU Nj and the data of SU Kj , the new storage node may query a storage node N X for the timestamps of the strips of the stripes S N and S K . For example, the new storage node obtains the timestamp TP NX assigned by the storage node N X to the data of the strip SU NX as the reference timestamp of the data of SU Nj , and obtains the timestamp TP KX assigned by the storage node N X to the data of the strip SU KX as the reference timestamp of the data of SU Kj ; according to the timestamp TP NX and the timestamp TP KX , the new storage node eliminates from the cache the data of whichever of the strips SU Nj and SU Kj is earlier in time, where X is any one of the integers 1 to M except j.
  • In this embodiment, taking the failure of storage node N1 as an example, the cache of the new storage node contains the data of the strip SUN1 and the data of SUK1.
  • The new storage node obtains the timestamp TPN2 that storage node N2 assigned to the data of SUN2 as the reference timestamp of the data of SUN1, and the timestamp TPK2 that storage node N2 assigned to the data of SUK2 as the reference timestamp of the data of SUK1; the timestamp TPN2 precedes the timestamp TPK2.
  • The new storage node therefore evicts the data of the strip SUN1 from the cache, keeping the latest strip data in the storage system and saving cache space.
  • In this embodiment, the storage node Nj likewise assigns a timestamp TPYj to the data of the strip SUYj.
  • The timestamp assigned by the storage node Nj may come from a timestamp server or may be generated by the storage node Nj itself.
  • Further, the first client identifier included in the data of the data strip SUNj, the timestamp at which the first client obtained the stripe SN, the data strip SUNj identifier, the logical address of the data of the data strip SUNj, and the data strip status information may be stored in an extended address of the physical address that the storage node Nj assigned to the data strip SUNj, thereby saving physical address space on the storage node Nj.
  • An extended address of a physical address is a physical address that is invisible beyond the effective physical address capacity of the storage node Nj; when the storage node Nj receives a read request accessing a physical address, it reads out the data in the extended address of that physical address by default.
  • The second client identifier included in the data of the data strip SUKj, the timestamp at which the second client obtained the stripe SK, the data strip SUKj identifier, the logical address of the data of the data strip SUKj, and the data strip status information may likewise be stored in an extended address of the physical address assigned by the storage node Nj to the data strip SUKj.
  • Similarly, the first client identifier included in the data strip SUYj, the timestamp at which the first client obtained the stripe SY, the data strip SUYj identifier, the logical address of the data of the data strip SUYj, and the data strip status information may also be stored in an extended address of the physical address assigned by the storage node Nj to the data strip SUYj.
  • Further, the timestamp TPNj assigned by the storage node Nj to the data of the strip SUNj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUNj.
  • The timestamp TPKj assigned by the storage node Nj to the data of the strip SUKj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUKj.
  • The timestamp TPYj assigned by the storage node Nj to the data of the strip SUYj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUYj.
  • With reference to the various implementations of the embodiments of the present invention, an embodiment of the present invention provides a data writing apparatus 11, applied to the distributed block storage system of the embodiments of the present invention.
  • As shown in Figure 11, the data writing apparatus 11 includes a receiving unit 111, a determining unit 112, an obtaining unit 113, a dividing unit 114, and a sending unit 115.
  • The data writing apparatus 11 can be a software module that runs on the client, so that the client completes the various implementations described in the embodiments of the present invention.
  • The data writing apparatus 11 can also be a hardware device; specifically, referring to the structure shown in Figure 3, the units of the data writing apparatus 11 can be implemented by the processor of the server described in Figure 3. Therefore, for a detailed description of the data writing apparatus 11, refer to the description of the client in the embodiments of the present invention.
  • An embodiment of the present invention further provides a data storage apparatus 12, applied to the distributed block storage system of the embodiments of the present invention.
  • As shown in Figure 12, the data storage apparatus 12 includes a receiving unit 121 and a storage unit 122.
  • The receiving unit 121 is configured to receive the data of the strip SUNj of the stripe SN sent by the first client, where the data of the strip SUNj is obtained by the first client dividing first data; the first data is obtained by the first client receiving a first write request; the first write request includes the first data and a logical address; and the logical address is used to determine that the first data is distributed in the partition P. The storage unit 122 is configured to store the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
  • The data storage apparatus 12 further includes an allocating unit, configured to assign a timestamp TPNj to the data of the strip SUNj.
  • The data storage apparatus 12 further includes an establishing unit, configured to establish a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  • The receiving unit 121 is further configured to receive the data of the strip SUYj of the stripe SY sent by the first client, where the data of the strip SUYj is obtained by the first client dividing second data; the second data is obtained by the first client receiving a second write request; the second write request includes the second data and the logical address; and the logical address is used to determine that the second data is distributed in the partition P.
  • The storage unit 122 is further configured to store the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj.
  • The allocating unit is further configured to assign a timestamp TPYj to the data of the strip SUYj.
  • The establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier. Further, the data of SUYj includes at least one of the first client identifier and the timestamp TPY at which the first client obtained the stripe SY.
  • The receiving unit 121 is further configured to receive the data of the strip SUKj of the stripe SK sent by the second client, where the data of the strip SUKj is obtained by the second client dividing third data; the third data is obtained by the second client receiving a third write request; the third write request includes the third data and the logical address; and the logical address is used to determine that the third data is distributed in the partition P. The storage unit 122 is further configured to store the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj. Further, the allocating unit is further configured to assign a timestamp TPKj to the data of the strip SUKj.
  • The establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  • The data storage apparatus 12 further includes a recovery unit, configured to recover the data of the strip SUNj according to the stripe SN and the data of the strip SUKj according to the stripe SK after the storage node Nj fails.
  • The data storage apparatus 12 further includes an obtaining unit, configured to obtain the timestamp TPNX of the data of the strip SUNX on a storage node NX as the reference timestamp of the data of the strip SUNj, and the timestamp TPKX of the data of the strip SUKX on the storage node NX as the reference timestamp of the data of the strip SUKj; and an eviction unit, configured to evict from the cache of the new storage node, according to the timestamp TPNX and the timestamp TPKX, the earlier of the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M except j.
  • The data storage apparatus 12 can be a software module that runs on a server, so that the storage node completes the various implementations described in the embodiments of the present invention.
  • The data storage apparatus 12 can also be a hardware device; specifically, referring to the structure shown in Figure 3, the units of the data storage apparatus 12 can be implemented by the processor of the server described in Figure 3. Therefore, for a detailed description of the data storage apparatus 12, refer to the description of the storage node in the embodiments of the present invention.
  • Besides the stripe generated according to the EC algorithm described above, the stripe may also be a stripe generated by a multi-copy algorithm.
  • When the stripe is generated by the EC algorithm, the strips SUij of the stripe include data strips and parity strips.
  • When the stripe is generated by the multi-copy algorithm, the strips SUij of the stripe are all data strips, and the data of the strips SUij is the same.
  • Correspondingly, the embodiments of the present invention further provide a computer readable storage medium and a computer program product; the computer readable storage medium and the computer program product contain computer instructions for implementing the various solutions described in the embodiments of the present invention.
  • In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners.
  • For example, the division of units described in the apparatus embodiments above is only one kind of logical function division; there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • The mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

Abstract

A data storage method for a distributed block storage system, including: a client generates the data of a stripe and sends the data of the strips of the stripe concurrently to the storage nodes corresponding to the strips. The method reduces data exchange between storage nodes and increases write concurrency, thereby improving the write performance of the distributed block storage system.

Description

Data storage method and apparatus in a distributed block storage system, and computer readable storage medium. Technical Field
The present invention relates to the field of information technology, and in particular, to a data storage method and apparatus in a distributed block storage system and a computer readable storage medium.
Background
A distributed block storage system includes partitions; a partition includes storage nodes and stripes; each stripe of a partition includes multiple strips; and one storage node of a partition corresponds to one strip of a stripe, that is, one storage node of the partition provides storage space for one strip of the stripe. Usually, as shown in Figure 1, a partition includes a primary storage node (storage node 1). The primary storage node receives the data sent by a client, then selects a stripe, divides the data into strip data, and sends the data of the strips stored on other storage nodes to the corresponding storage nodes (storage node 2, storage node 3, and storage node 4). These operations easily make the primary storage node a data-write bottleneck and increase data exchange between storage nodes, degrading the write performance of the distributed block storage system.
Summary
This application provides a data storage method and apparatus for a distributed block storage system that require no primary storage node, reduce data exchange between storage nodes, and improve the write performance of the distributed block storage system.
A first aspect of this application provides a data storage method in a distributed block storage system. The distributed block storage system includes a partition P; the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R. In the method, a first client receives a first write request, where the first write request includes first data and a logical address; the first client determines that the logical address is distributed in the partition P and obtains a stripe SN from the R stripes included in the partition P, where N is one integer value from 1 to R; the first client divides the first data into the data of one or more strips SUNj of the stripe SN and sends the data of the one or more strips SUNj to the storage nodes Nj. The client obtains the stripe according to the partition, divides the data into the strip data of the stripe, and sends the strip data to the corresponding storage nodes; no primary storage node is needed, which reduces data exchange between storage nodes, and the strip data of the stripe is written concurrently to the corresponding storage nodes, which also improves the write performance of the distributed block storage system. Further, the physical address of each strip SUij on the storage node Nj may be pre-allocated by a stripe metadata server. A stripe may be generated according to an EC algorithm or by a multi-copy algorithm. When the stripe is generated by the EC algorithm, the strips SUij of the stripe include data strips and parity strips; when the stripe is generated by the multi-copy algorithm, the strips SUij of the stripe are all data strips and their data is identical. The data of a data strip SUNj further includes metadata, for example, the data strip SUNj identifier and the logical address of the data of the data strip SUNj.
With reference to the first aspect of this application, in a first possible implementation of the first aspect, the first client receives a second write request, where the second write request includes second data and the logical address, that is, the logical address of the first data is the same as the logical address of the second data; the first client determines that the logical address is distributed in the partition P and obtains a stripe SY from the R stripes included in the partition P, where Y is one integer value from 1 to R and N is different from Y; the first client divides the second data into the data of one or more strips SUYj of the stripe SY and sends the data of the one or more strips SUYj to the storage nodes Nj. The data of a data strip SUYj further includes metadata, for example, the data strip SUYj identifier and the logical address of the data of the data strip SUYj.
With reference to the first aspect of this application, in a second possible implementation of the first aspect, a second client receives a third write request, where the third write request includes third data and the logical address, that is, the logical address of the first data is the same as the logical address of the third data; the second client determines that the logical address is distributed in the partition P and obtains a stripe SK from the R stripes included in the partition P, where K is one integer value from 1 to R and N is different from K; the second client divides the third data into the data of one or more strips SUKj of the stripe SK and sends the data of the one or more strips SUKj to the storage nodes Nj. The data of a data strip SUKj further includes metadata, for example, the data strip SUKj identifier and the logical address of the data of the data strip SUKj. In the distributed block storage system, the first client and the second client can access the same logical address.
With reference to the first aspect of this application, in a third possible implementation of the first aspect, each of the data of the one or more strips SUNj includes at least one of the first client identifier and the timestamp TPN at which the first client obtained the stripe SN. A storage node of the distributed block storage system can determine, from the first client identifier in the data of the strip SUNj, that the strip was written by the first client, and can determine, from the timestamp TPN in the data of the strip SUNj, the order in which the first client wrote strips.
With reference to the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, each of the data of the one or more strips SUYj includes at least one of the first client identifier and the timestamp TPY at which the first client obtained the stripe SY. A storage node of the distributed block storage system can determine, from the first client identifier in the data of the strip SUYj, that the strip was written by the first client, and can determine, from the timestamp TPY, the order in which the first client wrote strips.
With reference to the second possible implementation of the first aspect, in a fifth possible implementation of the first aspect, each of the data of the one or more strips SUKj includes at least one of the second client identifier and the timestamp TPK at which the second client obtained the stripe SK. A storage node of the distributed block storage system can determine, from the second client identifier in the data of the strip SUKj, that the strip was written by the second client, and can determine, from the timestamp TPK, the order in which the second client wrote strips.
With reference to the first aspect of this application, in a sixth possible implementation of the first aspect, the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to the mapping between the partition P and the storage nodes Nj included in the partition. Pre-allocating physical storage addresses from the storage nodes Nj for the strips SUij of the stripes Si reduces the wait time when a client writes data, thereby improving the write performance of the distributed block storage system.
With reference to the first aspect of this application or any one of the first to sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, each of the data of the one or more strips SUNj further includes data strip status information used to indicate whether each data strip of the stripe SN is empty, so that all-zero data does not need to be written to the storage nodes in place of the data of empty strips, which reduces the amount of data written in the distributed block storage system.
A second aspect of this application further provides a data storage method in a distributed block storage system. The distributed block storage system includes a partition P; the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R. In the method, a storage node Nj receives the data of a strip SUNj of a stripe SN sent by a first client, where the data of the strip SUNj is obtained by the first client dividing first data; the first data is obtained by the first client receiving a first write request; the first write request includes the first data and a logical address; and the logical address is used to determine that the first data is distributed in the partition P. The storage node Nj stores the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj. Because the logical address is the address at which the distributed block storage system stores the data written by clients, "the logical address is distributed in the partition P" and "the first data is distributed in the partition P" have the same meaning. The storage node Nj only receives the data of the strip SUNj sent by the client; therefore, the distributed block storage system needs no primary storage node, which reduces data exchange between storage nodes, and the strip data of the stripe is written concurrently to the corresponding storage nodes, which also improves the write performance of the distributed block storage system. Further, the physical address of each strip SUij on the storage node Nj may be pre-allocated by a stripe metadata server; therefore, the first physical address of the strip SUNj on the storage node Nj is also pre-allocated by the stripe metadata server. A stripe may be generated according to an EC algorithm or by a multi-copy algorithm. When the stripe is generated by the EC algorithm, the strips SUij of the stripe include data strips and parity strips; when the stripe is generated by the multi-copy algorithm, the strips SUij of the stripe are all data strips and their data is identical. The data of the data strip SUNj further includes metadata, for example, the data strip SUNj identifier and the logical address of the data of the data strip SUNj.
With reference to the second aspect of this application, in a first possible implementation of the second aspect, the method further includes:
the storage node Nj assigning a timestamp TPNj to the data of the strip SUNj, where the timestamp TPNj can serve as a reference timestamp for recovering the data of the strips of the stripe SN after another storage node fails.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes:
the storage node Nj establishing a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier, so that a client can access, by logical address, the data of the strip SUNj stored on the storage node Nj in the distributed block storage system.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the data of SUNj includes at least one of the first client identifier and the timestamp TPN at which the first client obtained the stripe SN. The storage node Nj can determine, from the first client identifier in the data of the strip SUNj, that the strip was written by the first client, and can determine, from the timestamp TPN, the order in which the first client wrote strips.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the method further includes: the storage node Nj receives the data of the strip SUYj of the stripe SY sent by the first client; the data of the strip SUYj is obtained by the first client dividing second data; the second data is obtained by the first client receiving a second write request; the second write request includes the second data and the logical address; the logical address is used to determine that the second data is distributed in the partition P, that is, the logical address of the first data is the same as the logical address of the second data; and the storage node Nj stores the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj. Because the logical address is the address at which the distributed block storage system stores the data written by clients, "the logical address is distributed in the partition P" and "the second data is distributed in the partition P" have the same meaning. The data of the data strip SUYj further includes metadata, for example, the data strip SUYj identifier and the logical address of the data of the data strip SUYj.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the method further includes: the storage node Nj assigns a timestamp TPYj to the data of the strip SUYj; the timestamp TPYj can serve as a reference timestamp for recovering the data of the strips of the stripe SY after another storage node fails.
With reference to the fourth or fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the method further includes: the storage node Nj establishes a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier, so that a client can access, by logical address, the data of the strip SUYj stored on the storage node Nj in the distributed block storage system.
With reference to any one of the fourth to sixth possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the data of SUYj includes at least one of the first client identifier and the timestamp TPY at which the first client obtained the stripe SY. The storage node Nj can determine, from the first client identifier in the data of the strip SUYj, that the strip was written by the first client, and can determine, from the timestamp TPY, the order in which the first client wrote strips.
With reference to the second aspect or the first or second possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the method further includes: the storage node Nj receives the data of the strip SUKj of the stripe SK sent by a second client; the data of the strip SUKj is obtained by the second client dividing third data; the third data is obtained by the second client receiving a third write request; the third write request includes the third data and the logical address; the logical address is used to determine that the third data is distributed in the partition P, that is, the logical address of the first data is the same as the logical address of the third data; and the storage node Nj stores the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj. Because the logical address is the address at which the distributed block storage system stores the data written by clients, "the logical address is distributed in the partition P" and "the third data is distributed in the partition P" have the same meaning. In the distributed block storage system, the first client and the second client can access the same logical address. The data of the data strip SUKj further includes metadata, for example, the data strip SUKj identifier and the logical address of the data of the data strip SUKj.
With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the method further includes: the storage node Nj assigns a timestamp TPKj to the data of the strip SUKj; the timestamp TPKj can serve as a reference timestamp for recovering the data of the strips of the stripe SK after another storage node fails.
With reference to the eighth or ninth possible implementation of the second aspect, in a tenth possible implementation of the second aspect, the method further includes: the storage node Nj establishes a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier, so that a client can access, by logical address, the data of the strip SUKj stored on the storage node Nj in the distributed block storage system.
With reference to any one of the eighth to tenth possible implementations of the second aspect, in an eleventh possible implementation of the second aspect, the data of SUKj includes at least one of the second client identifier and the timestamp TPK at which the second client obtained the stripe SK. The storage node Nj can determine, from the second client identifier in the data of the strip SUKj, that the strip was written by the second client, and can determine, from the timestamp TPK, the order in which the second client wrote strips.
With reference to the second aspect of this application, in a twelfth possible implementation of the second aspect, the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to the mapping between the partition P and the storage nodes Nj included in the partition P. Pre-allocating physical storage addresses from the storage nodes Nj for the strips SUij of the stripes Si reduces the wait time when a client writes data, thereby improving the write performance of the distributed block storage system.
With reference to the second aspect or any one of the first to twelfth possible implementations of the second aspect, in a thirteenth possible implementation of the second aspect, each of the data of the one or more strips SUNj further includes data strip status information used to indicate whether each data strip of the stripe SN is empty, so that all-zero data does not need to be written to the storage nodes in place of the data of empty strips, which reduces the amount of data written in the distributed block storage system.
With reference to the ninth possible implementation of the second aspect, in a fourteenth possible implementation of the second aspect, after the storage node Nj fails, a new storage node recovers the data of the strip SUNj and the data of SUKj according to the stripes SN and SK respectively; the new storage node obtains the timestamp TPNX of the data of the strip SUNX on a storage node NX as the reference timestamp of the data of the strip SUNj, and obtains the timestamp TPKX of the data of the strip SUKX on the storage node NX as the reference timestamp of the data of SUKj; and the new storage node evicts from the cache the earlier of the data of the strip SUNj and the data of SUKj according to the timestamp TPNX and the timestamp TPKX, where X is any integer from 1 to M except j; the latest strip data is kept in the storage system, thereby saving cache space.
With reference to the seventh possible implementation of the second aspect, in a fifteenth possible implementation of the second aspect, after the storage node Nj fails, a new storage node recovers the data of the strip SUNj and the data of SUYj according to the stripes SN and SY respectively; the data of the strip SUNj contains the timestamp TPN and the data of the strip SUYj contains the timestamp TPY; the new storage node evicts from the cache the earlier of the data of the strip SUNj and the data of SUYj according to the timestamp TPN and the timestamp TPY; the latest strip of the same client is kept in the storage system, thereby saving cache space.
With reference to the distributed block storage system of the first aspect or any one of the first to seventh possible implementations of the first aspect, a third aspect of this application further provides a data writing apparatus in a distributed block storage system, where the data writing apparatus includes multiple units configured to perform the first aspect or any one of the first to seventh possible implementations of the first aspect.
With reference to the distributed block storage system of the second aspect or any one of the first to fifteenth possible implementations of the second aspect, a fourth aspect of this application further provides a data storage apparatus in a distributed block storage system, where the data storage apparatus includes multiple units configured to perform the second aspect or any one of the first to fifteenth possible implementations of the second aspect.
A fifth aspect of this application further provides the distributed block storage system of the second aspect or any one of the first to fifteenth possible implementations of the second aspect, where the storage node Nj in the distributed block storage system is configured to perform the second aspect or any one of the first to fifteenth possible implementations of the second aspect.
A sixth aspect of this application further provides a client, applied to the distributed block storage system of the first aspect or any one of the first to seventh possible implementations of the first aspect, where the client includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform the first aspect or any one of the first to seventh possible implementations of the first aspect.
A seventh aspect of this application further provides a storage node, applied to the distributed block storage system of the second aspect or any one of the first to fifteenth possible implementations of the second aspect, where the storage node, acting as the storage node Nj, includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform the second aspect or any one of the first to fifteenth possible implementations of the second aspect.
An eighth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system of the first aspect or any one of the first to seventh possible implementations of the first aspect, where the computer readable storage medium contains computer instructions used to cause a client to perform the first aspect or any one of the first to seventh possible implementations of the first aspect.
A ninth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system of the second aspect or any one of the first to fifteenth possible implementations of the second aspect, where the computer readable storage medium contains computer instructions used to cause a storage node to perform the second aspect or any one of the first to fifteenth possible implementations of the second aspect.
A tenth aspect of this application further provides a computer program product, applied to the distributed block storage system of the first aspect or any one of the first to seventh possible implementations of the first aspect, where the computer program product contains computer instructions used to cause a client to perform the first aspect or any one of the first to seventh possible implementations of the first aspect.
An eleventh aspect of this application further provides a computer program product, applied to the distributed block storage system of the second aspect or any one of the first to fifteenth possible implementations of the second aspect, where the computer program product contains computer instructions used to cause a storage node to perform the second aspect or any one of the first to fifteenth possible implementations of the second aspect.
Brief Description of Drawings
Figure 1 is a schematic diagram of data storage in a prior-art distributed block storage system;
Figure 2 is a schematic diagram of a distributed block storage system according to an embodiment of the present invention;
Figure 3 is a schematic structural diagram of a server in the distributed block storage system according to an embodiment of the present invention;
Figure 4 is a schematic diagram of a partition view of the distributed block storage system according to an embodiment of the present invention;
Figure 5 is a schematic diagram of the relationship between strips and storage nodes in the distributed block storage system according to an embodiment of the present invention;
Figure 6 is a flowchart of a method for a client to write data in the distributed block storage system according to an embodiment of the present invention;
Figure 7 is a schematic diagram of a client determining a partition in the distributed block storage system according to an embodiment of the present invention;
Figure 8 is a flowchart of a method for a storage node to store data in the distributed block storage system according to an embodiment of the present invention;
Figure 9 is a schematic diagram of storage nodes storing a stripe in the distributed block storage system according to an embodiment of the present invention;
Figure 10 is a schematic diagram of storage nodes storing a stripe in the distributed block storage system according to an embodiment of the present invention;
Figure 11 is a schematic structural diagram of a data writing apparatus in the distributed block storage system according to an embodiment of the present invention;
Figure 12 is a schematic structural diagram of a data storage apparatus in the distributed block storage system according to an embodiment of the present invention.
Embodiments of the Present Invention
The distributed block storage system in the embodiments of the present invention is, for example, of the product series shown in the embedded images (PCTCN2017106147-appb-000001 and PCTCN2017106147-appb-000002). Exemplarily, as shown in Figure 2, the distributed block storage system includes multiple servers, such as server 1, server 2, server 3, server 4, server 5, and server 6, and the servers communicate with each other over InfiniBand, Ethernet, or the like. In practical applications, the number of servers in the distributed block storage system may be increased according to actual requirements; this is not limited in the embodiments of the present invention.
Each server in the distributed block storage system has the structure shown in Figure 3. As shown in Figure 3, each server in the distributed block storage system includes a central processing unit (CPU) 301, a memory 302, an interface 303, hard disk 1, hard disk 2, and hard disk 3. The memory 302 stores computer instructions, and the CPU 301 executes the program instructions in the memory 302 to perform the corresponding operations. The interface 303 may be a hardware interface, such as a network interface card (NIC) or a host bus adapter (HBA), or a program interface module. A hard disk includes a solid state disk (SSD), a mechanical hard disk such as an HDD (Hard Disk Drive), or a hybrid hard disk. In addition, to save computing resources of the CPU 301, a field programmable gate array (FPGA) or other hardware may perform the corresponding operations in place of the CPU 301, or the FPGA or other hardware may perform them together with the CPU 301. For ease of description, in the embodiments of the present invention, the combination of the CPU 301 and the memory 302, the FPGA and other hardware replacing the CPU 301, or the FPGA and other hardware together with the CPU 301, is collectively referred to as a processor.
In the structure shown in Figure 3, when an application program is loaded in the memory 302 and the CPU 301 executes the application program instructions in the memory 302, the server acts as a client. A client may also be a device independent of the servers shown in Figure 2. The application program may be a virtual machine (VM) or a particular application such as office software. The client writes data to, or reads data from, the distributed block storage system. For the structure of the client, refer to Figure 3 and the related description. When the distributed block storage system program is loaded in the memory 302 and the CPU 301 executes it, the server provides a block protocol access interface to clients and provides a distributed block storage access point service, so that clients access storage resources in the storage resource pool of the distributed block storage system. Typically, the block protocol access interface is used to provide logical units to clients. A server that runs the distributed block storage system program and contains hard disks acts as a storage node for storing client data. Exemplarily, a server may by default treat one hard disk as one storage node, that is, a server containing multiple hard disks can act as multiple storage nodes; alternatively, a server running the distributed block storage system program acts as one storage node; this is not limited in the embodiments of the present invention. For the structure of a storage node, refer to Figure 3 and the related description. When the distributed block storage system is initialized, the hash space (for example, 0 to 2^32) is divided into N equal parts, each part being one partition, and the N parts are divided evenly according to the number of hard disks. For example, N defaults to 3600 in the distributed block storage system, that is, the partitions are P1, P2, P3, ..., P3600. Assuming the current distributed block storage system has 18 hard disks (storage nodes), each storage node carries 200 partitions. A partition P contains M storage nodes Nj; the correspondence between partitions and storage nodes, that is, the mapping between a partition and the storage nodes Nj contained in the partition, is also called a partition view. As shown in Figure 4, taking a partition containing 4 storage nodes Nj as an example, the partition view is "P2-storage node N1-storage node N2-storage node N3-storage node N4", where j takes each integer value from 1 to M. The partition view is established when the distributed block storage system is initialized and is adjusted later as the number of hard disks in the system changes. The client saves the partition view.
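As an editorial illustration only (this sketch is not part of the original application), the partition routing described above can be expressed in a few lines of Python. The hash function, the key derivation from the logical address, and the node names are assumptions chosen to mirror the example of 3600 partitions and a partition view mapping each partition to M storage nodes:

```python
import hashlib

N_PARTITIONS = 3600  # the document's default partition count

# A partition view maps a partition id to the storage nodes serving it, as in
# "P2 - N1 - N2 - N3 - N4" of Figure 4. Real views differ per partition; here
# every partition gets the same four placeholder nodes for brevity.
partition_view = {p: [f"N{k}" for k in range(1, 5)] for p in range(N_PARTITIONS)}

def partition_of(logical_address: int) -> int:
    """Hash a key derived from the logical address into the partition space."""
    key = str(logical_address).encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")  # value in [0, 2^32)
    return h % N_PARTITIONS

def nodes_for(logical_address: int) -> list[str]:
    """Resolve the storage nodes that hold the strips of stripes for this address."""
    return partition_view[partition_of(logical_address)]

# Example: route a write at LBA 0x1000 to its partition's nodes.
print(partition_of(0x1000), nodes_for(0x1000))
```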
According to the reliability requirements of the distributed block storage system, an erasure coding (EC) algorithm can be used to improve data reliability, for example, a 3+1 mode in which 3 data strips and 1 parity strip form a stripe. In the embodiments of the present invention, a partition stores data in the form of stripes; one partition contains R stripes Si, where i takes each integer value from 1 to R. The embodiments of the present invention use P2 as an example for description.
The distributed block storage system manages hard disks in slices of 4 KB and records the allocation information of each 4 KB slice in the metadata management area of the hard disk; the slices of the hard disks form the storage resource pool. The distributed block storage system contains a stripe metadata server, which in a specific implementation may be a stripe metadata management program running on one or more servers of the distributed block storage system. The stripe metadata server allocates stripes to partitions. Still taking the partition view in Figure 4 as an example, as shown in Figure 5, the stripe metadata server allocates, according to the partition view, the physical storage addresses (that is, storage space) of the strips SUij of the stripe Si of the partition P2 from the storage nodes Nj corresponding to the partition, specifically: allocating a physical storage address for SUi1 from storage node N1, for SUi2 from storage node N2, for SUi3 from storage node N3, and for SUi4 from storage node N4. The storage node Nj records the mapping between the strip SUij identifier and the physical storage address. The physical addresses that the stripe metadata server allocates from the storage nodes for the strips of a stripe may be pre-allocated when the distributed block storage system is initialized, or pre-allocated before a client sends data to the storage nodes. In the embodiments of the present invention, a strip SUij of the stripe Si is merely a segment of storage space before a client writes data; when the client receives data, it divides the data by the strip size of the stripe Si to obtain the data of the strips SUij, that is, a strip SUij contained in the stripe Si is used to store the strip SUij data obtained by the client dividing the data. To reduce the number of strip identifiers managed by the stripe metadata server, the stripe metadata server assigns version numbers to the strip identifiers of a stripe: when a stripe is released, the version numbers of the strip identifiers of the released stripe are updated so that they serve as the strip identifiers of a new stripe. Pre-allocating physical storage addresses from the storage nodes Nj for the strips SUij of the stripe Si reduces the wait time when a client writes data, thereby improving the write performance of the distributed block storage system.
In the embodiments of the present invention, a client mounts a logical unit allocated by the distributed block storage system and performs data access operations on it. The logical unit is also called a logical unit number (LUN). In the distributed block storage system, a logical unit may be mounted to only one client, or one logical unit may be mounted to multiple clients, that is, multiple clients share one logical unit. The logical unit is provided by the storage resource pool shown in Figure 2.
In this embodiment of the present invention, as shown in Figure 6, the first client performs the following steps:
Step 601: The first client receives a first write request, where the first write request includes first data and a logical address.
In the distributed block storage system, the first client may be a VM or a server. An application program runs on the first client and accesses the logical unit mounted on the first client, for example, sends the first write request to the logical unit. The first write request includes the first data and the logical address; the logical address is also called a logical block address (LBA) and is used to indicate the write position of the first data in the logical unit.
Step 602: The first client determines that the logical address is distributed in the partition P.
The embodiments of the present invention use the partition P2 as an example. With reference to Figure 4, the first client stores the partition view of the distributed block storage system. As shown in Figure 7, the first client determines, according to the partition view, the partition in which the logical address contained in the first write request is located. In one implementation, the first client generates a key from the logical address, computes the hash value of the key according to a hash algorithm, and determines the partition corresponding to the hash value, thereby determining that the logical address is distributed in the partition P2; this is also described as the first data being distributed in the partition P2.
Step 603: The first client obtains a stripe SN from the R stripes, where N is one integer value from 1 to R.
The stripe metadata server manages the correspondence between partitions and stripes, as well as the relationship between the strips of a stripe and the storage nodes. In one implementation of obtaining the stripe SN from the R stripes, the first client determines that the logical address is distributed in the partition P2 and queries the stripe metadata server to obtain one stripe SN of the R stripes contained in the partition P2. Because the logical address is the address at which the distributed block storage system stores the data written by clients, "the logical address is distributed in the partition P" and "the first data is distributed in the partition P" have the same meaning. In another implementation, the first client obtains the stripe SN from those of the R stripes that have already been assigned to the first client.
Step 604: The first client divides the first data into the data of one or more strips SUNj of the stripe SN.
A stripe SN consists of strips. The first client receives the first write request, caches the first data contained in it, and divides the cached data according to the size of the strips of the stripe. Exemplarily, the first client divides the data into strip-sized pieces according to the strip length, takes the logical address of each strip-sized piece modulo the number of storage nodes M in the partition (for example, 4) to determine the position of that piece within the stripe, that is, the corresponding strip SUNj, and then determines from the partition view the storage node Nj corresponding to the strip SUNj; in this way, strip data with the same logical address is distributed on the same storage node. For example, the first data is divided into the data of one or more strips SUNj. The embodiments of the present invention use P2 as an example; with reference to Figure 5, the stripe SN contains 4 strips: SUN1, SUN2, SUN3, and SUN4. Taking the division of the first data into the data of 2 strips as an example, that is, the data of SUN1 and the data of SUN2, the data of the strip SUN3 may be obtained by dividing the data of other write requests sent by the first client; for details, refer to the description of the first request. The data of the parity strip SUN4, also called parity data, is then generated from the data of SUN1, the data of SUN2, and the data of SUN3. For how the data of the data strips of a stripe generates the data of the parity strip, refer to existing stripe implementation algorithms; details are not described again in the embodiments of the present invention.
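As an editorial illustration of the dividing step (not the patent's implementation), the following Python sketch splits a write into strip-sized pieces placed by logical address modulo M and computes a stand-in parity. The 4 KB strip size follows the document's slice size, while the XOR parity merely stands in for whichever EC code the system actually uses:

```python
STRIP_SIZE = 4096   # assumed strip length; the document manages disks in 4 KB slices
M = 4               # nodes per partition in the running example: 3 data strips + 1 parity

def place_pieces(data: bytes, base_lba: int) -> dict[int, bytes]:
    """Divide a write into strip-size pieces; each piece's LBA modulo M fixes its
    position in the stripe, so equal logical addresses land on the same node."""
    pieces: dict[int, bytes] = {}
    for off in range(0, len(data), STRIP_SIZE):
        lba = base_lba + off
        j = (lba // STRIP_SIZE) % M          # strip index within the stripe
        pieces[j] = data[off:off + STRIP_SIZE].ljust(STRIP_SIZE, b"\0")
    return pieces

def xor_parity(pieces: dict[int, bytes]) -> bytes:
    """Stand-in for the EC computation: XOR of the present data strips."""
    out = bytearray(STRIP_SIZE)
    for piece in pieces.values():
        for k, byte in enumerate(piece):
            out[k] ^= byte
    return bytes(out)

pieces = place_pieces(b"x" * 8192, base_lba=0)   # fills strips SU_N1 and SU_N2
parity = xor_parity(pieces)                      # data of the parity strip SU_N4
print(sorted(pieces.keys()), len(parity))
```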
In the embodiments of the present invention, the stripe SN contains 4 strips: 3 data strips and 1 parity strip. When the first client has cached data longer than a certain time and needs to write it to the storage nodes but cannot fill all the data strips, for example, when there are only the data of the strips SUN1 and SUN2 obtained by dividing the first data, the parity strip is generated from the data of SUN1 and the data of SUN2. The data of a valid data strip SUNj contains the data strip status information of the stripe SN, where a valid data strip is a strip whose data is not empty. In the embodiments of the present invention, the data of the valid data strips SUN1 and SUN2 both contain the data strip status information of the stripe SN, which is used to indicate whether each data strip of the stripe SN is empty. For example, with 1 indicating that a data strip is not empty and 0 indicating that it is empty, the data strip status information contained in the data of SUN1 is 110 and that contained in the data of SUN2 is 110, meaning that SUN1 is not empty, SUN2 is not empty, and SUN3 is empty. The data of the parity strip SUN4 generated from the data of SUN1 and SUN2 contains the parity data of the data strip status information. Because SUN3 is empty, the first client does not need to write all-zero data to storage node N3 in place of the data of SUN3, which reduces the amount of data written. When reading the stripe SN, the first client determines that the data of SUN3 is empty according to the data strip status information of the stripe SN contained in the data of SUN1 or SUN2.
When SUN3 is not empty, the data strip status information contained in the data of SUN1, SUN2, and SUN3 in this embodiment is 111, and the data of the parity strip SUN4 generated from the data of SUN1, SUN2, and SUN3 contains the parity data of the data strip status information.
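A possible encoding of the data strip status information is a small bitmap, as in this hedged sketch (the field width and bit order are assumptions, not the patent's format):

```python
def encode_status(non_empty: list[bool]) -> int:
    """Pack per-data-strip flags into a bitmap, e.g. [True, True, False] -> 0b110."""
    bits = 0
    for flag in non_empty:
        bits = (bits << 1) | int(flag)
    return bits

def is_empty(status: int, idx: int, n_data_strips: int = 3) -> bool:
    """Check whether data strip idx (0-based, left to right) is empty."""
    return not (status >> (n_data_strips - 1 - idx)) & 1

status = encode_status([True, True, False])   # SU_N1 and SU_N2 written, SU_N3 empty
assert status == 0b110 and is_empty(status, 2)
print(bin(status))
```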
Further, in the embodiments of the present invention, the data of a data strip SUNj also includes at least one of the first client identifier and the timestamp TPN at which the first client obtained the stripe SN, that is, any one or a combination of the first client identifier and the timestamp TPN. When the data of the data strips generates the data of the parity strip, the data of the parity strip also contains the parity data of at least one of the first client identifier and the timestamp TPN at which the first client obtained the stripe SN.
In the embodiments of the present invention, the data of a data strip SUNj further includes metadata, for example, the data strip SUNj identifier and the logical address of the data of the data strip SUNj.
Step 605: The first client sends the data of the one or more strips SUNj to the storage nodes Nj.
In the embodiments of the present invention, the first client sends the data of SUN1 obtained by dividing the first data to storage node N1, and sends the data of SUN2 obtained by dividing the first data to storage node N2. The first client can send the data of the strips SUNj of the stripe SN to the storage nodes Nj concurrently; no primary storage node is needed, which reduces data exchange between storage nodes and increases write concurrency, thereby improving the write performance of the distributed block storage system.
Further, when a logical unit is mounted only to the first client, the first client receives a second write request, where the second write request includes second data and the logical address described in Figure 6. The first client determines, according to the algorithm described in the Figure 6 flow, that the logical address is distributed in the partition P2, obtains a stripe SY from the R stripes, and divides the second data into the data of one or more strips SUYj of the stripe SY, such as the data of SUY1 and the data of SUY2. The first client sends the data of the one or more strips SUYj to the storage nodes Nj, that is, sends the data of SUY1 to storage node N1 and the data of SUY2 to storage node N2, where Y is one integer value from 1 to R and N is different from Y. In the embodiments of the present invention, "the logical address is distributed in the partition P" and "the second data is distributed in the partition P" have the same meaning. Further, the data of a valid data strip SUYj contains the data strip status information of the stripe SY. Further, the data of a data strip SUYj also includes at least one of the first client identifier and the timestamp TPY at which the first client obtained the stripe SY. Further, the data of a data strip SUYj also contains its metadata, for example, the strip SUYj identifier and the logical address of the data of the strip SUYj. For further description, refer to the description of the first client in Figure 6; details are not repeated here. For the meaning of the first client obtaining the stripe SY from the R stripes, refer to the meaning of the first client obtaining the stripe SN from the R stripes; details are not repeated here.
Further, when a logical unit is mounted to multiple clients, for example, to the first client and a second client, the second client receives a third write request, where the third write request includes third data and the logical address described in Figure 6. The second client determines, according to the algorithm described in the Figure 6 flow, that the logical address is distributed in the partition P2, obtains a stripe SK from the R stripes, and divides the third data into the data of one or more strips SUKj of the stripe SK, such as the data of SUK1 and the data of SUK2. The second client sends the data of the one or more strips SUKj to the storage nodes Nj, that is, sends the data of SUK1 to storage node N1 and the data of SUK2 to storage node N2, where K is one integer value from 1 to R and N is different from K. "The logical address is distributed in the partition P" and "the third data is distributed in the partition P" have the same meaning. For the meaning of the second client obtaining the stripe SK from the R stripes, refer to the meaning of the first client obtaining the stripe SN from the R stripes; details are not repeated here. Further, the data of a valid data strip SUKj contains the data strip status information of the stripe SK. Further, the data of a data strip SUKj also includes at least one of the second client identifier and the timestamp TPK at which the second client obtained the stripe SK. Further, the data of a data strip SUKj also contains metadata, for example, the data strip SUKj identifier and the logical address of the data of the data strip SUKj. For further description of the second client, refer to the description of the first client in Figure 6; details are not repeated here.
In the prior art, a client needs to send its data first to a primary storage node, which divides the data into strip data and sends the data of the strips other than those stored on the primary storage node to the corresponding storage nodes; the primary storage node thus becomes the data storage bottleneck of the distributed block storage system, and data exchange between storage nodes increases. In the embodiment shown in Figure 6, the client divides the data into strip data and sends the strip data to the corresponding storage nodes; no primary storage node is needed, which relieves the pressure on a primary storage node and reduces data exchange between storage nodes, and the strip data of the stripe is written concurrently to the corresponding storage nodes, which also improves the write performance of the distributed block storage system.
Corresponding to the first-client embodiment shown in Figure 6, as shown in Figure 8, the storage node Nj performs the following steps:
Step 801: The storage node Nj receives the data of the strip SUNj of the stripe SN sent by the first client.
With reference to the embodiment shown in Figure 6, storage node N1 receives the data of SUN1 sent by the first client, and storage node N2 receives the data of SUN2 sent by the first client.
Step 802: The storage node Nj stores the data of SUNj to a first physical address according to the mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
The stripe metadata server pre-allocated, according to the partition view, the first physical address on the storage node Nj for the strip SUNj of the stripe SN of the partition, and the metadata of the storage node Nj stores the mapping between the strip SUNj identifier and the first physical address of the storage node Nj. When the storage node Nj receives the data of the strip SUNj, it stores the data of the strip SUNj to the first physical address according to this mapping. For example, storage node N1 receives the data of SUN1 sent by the first client and stores it to the first physical address of N1, and storage node N2 receives the data of SUN2 sent by the first client and stores it to the first physical address of N2.
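A minimal sketch of this write path, assuming the strip-to-physical-address mapping has already been installed by the stripe metadata server (the class, names, and address values are illustrative and not the patent's implementation):

```python
class StorageNode:
    """Toy storage node: strips are pre-allocated, so a write is a lookup plus a store."""

    def __init__(self, name: str, strip_to_paddr: dict[str, int]):
        self.name = name
        self.strip_to_paddr = strip_to_paddr   # mapping installed by the stripe metadata server
        self.media: dict[int, bytes] = {}      # physical address -> stored data

    def write_strip(self, strip_id: str, data: bytes) -> int:
        paddr = self.strip_to_paddr[strip_id]  # pre-allocated physical address
        self.media[paddr] = data
        return paddr

n1 = StorageNode("N1", {"SU_N1": 0x10_0000})   # address chosen arbitrarily
print(hex(n1.write_strip("SU_N1", b"strip payload")))
```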
In the prior art, the primary storage node receives the data sent by the client, divides the data into the data of the data strips of a stripe, forms the data of the parity strip from the data of the data strips, and sends the data of the strips stored on other storage nodes to the corresponding storage nodes. In the embodiments of the present invention, the storage node Nj only receives the data of the strip SUNj sent by the client; no primary storage node is needed, which reduces data exchange between storage nodes, and the strip data is written concurrently to the corresponding storage nodes, improving the write performance of the distributed block storage system.
Further, the data of the strip SUNj is obtained by dividing the first data, the first write request contains the logical address of the first data, and the data of the strip SUNj, as part of the first data, also has a corresponding logical address. Therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUNj and the strip SUNj identifier, so that the first client can still access the data of the strip SUNj using the logical address. For example, when the first client accesses the data of the strip SUNj, it takes the logical address of the data of the strip SUNj modulo the number of storage nodes M in the partition P (for example, 4) to determine that the strip SUNj is distributed on the storage node Nj, and sends a read request carrying the logical address of the data of the strip SUNj to the storage node Nj; the storage node Nj obtains the strip SUNj identifier from the mapping between the logical address of the data of the strip SUNj and the strip SUNj identifier, and then obtains the data of the strip SUNj from the mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
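The two mappings of the read path can be sketched as follows; the dictionaries and values are illustrative assumptions, not the actual metadata structures:

```python
class ReadableNode:
    """Read path sketch: logical address -> strip id -> physical address -> data."""

    def __init__(self):
        self.lba_to_strip: dict[int, str] = {}    # built when strip data is written
        self.strip_to_paddr: dict[str, int] = {}  # installed by the stripe metadata server
        self.media: dict[int, bytes] = {}

    def read(self, lba: int) -> bytes:
        strip_id = self.lba_to_strip[lba]         # first mapping
        paddr = self.strip_to_paddr[strip_id]     # second mapping
        return self.media[paddr]

node = ReadableNode()
node.lba_to_strip[4096] = "SU_N2"
node.strip_to_paddr["SU_N2"] = 0x20_0000
node.media[0x20_0000] = b"payload"
print(node.read(4096))   # the client located this node via lba // strip_size % M
```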
With reference to the embodiment shown in Figure 6 and the related description, further, the storage node Nj receives the data of the strip SUYj of the stripe SY sent by the first client; for example, storage node N1 receives the data of SUY1 sent by the first client, and storage node N2 receives the data of SUY2 sent by the first client. The storage node Nj stores the data of SUYj to a second physical address according to the mapping between the strip SUYj identifier and the second physical address of the storage node Nj, for example, stores the data of SUY1 to the second physical address of N1 and the data of SUY2 to the second physical address of N2. The data of the strip SUYj, as part of the second data, also has a corresponding logical address; therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUYj and the strip SUYj identifier, so that the first client can still access the data of the strip SUYj using the logical address. The data of the strip SUYj has the same logical address as the data of the strip SUNj.
With reference to the embodiment shown in Figure 6 and the related description, when a logical unit is mounted to the first client and the second client, further, the storage node Nj receives the data of the strip SUKj of the stripe SK sent by the second client; for example, storage node N1 receives the data of SUK1 sent by the second client, and storage node N2 receives the data of SUK2 sent by the second client. The storage node Nj stores the data of SUKj to a third physical address according to the mapping between the strip SUKj identifier and the third physical address of the storage node Nj, for example, stores the data of SUK1 to the third physical address of N1 and the data of SUK2 to the third physical address of N2. The data of the strip SUKj, as part of the third data, also has a corresponding logical address; therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUKj and the strip SUKj identifier, so that the second client can still access the data of the strip SUKj using the logical address. The data of the strip SUKj has the same logical address as the data of the strip SUNj.
Further, the storage node Nj assigns a timestamp TPNj to the data of the strip SUNj, a timestamp TPKj to the data of the strip SUKj, and a timestamp TPYj to the data of the strip SUYj. The storage node Nj can evict from the cache, according to the timestamps, the earlier data of strips having the same logical address and keep the latest strip data, thereby saving cache space.
With reference to the embodiment shown in Figure 6 and the related description, when a logical unit is mounted only to the first client, the data of the strip SUNj sent by the first client to the storage node Nj contains the timestamp TPN at which the first client obtained the stripe SN, and the data of the strip SUYj sent by the first client to the storage node Nj contains the timestamp TPY at which the first client obtained the stripe SY. As shown in Figure 9, none of the data strips of the stripe SN is empty: the data of SUN1, SUN2, and SUN3 all contain the timestamp TPN at which the first client obtained the stripe SN, and the data of the parity strip SUN4 contains the parity data TPNp of the timestamp TPN; none of the data strips of the stripe SY is empty: the data of SUY1, SUY2, and SUY3 all contain the timestamp TPY at which the first client obtained the stripe SY, and the data of the parity strip SUY4 contains the parity data TPYp of the timestamp TPY. Therefore, after a storage node storing data strips fails, the distributed block storage system recovers on a new storage node, according to the stripes and the partition view, the data of the strip SUNj of the stripe SN on the faulty storage node Nj and the data of the strip SUYj of the stripe SY on the faulty storage node Nj, so the cache of the new storage node contains the data of the strips SUNj and SUYj. The data of SUNj contains the timestamp TPN and the data of SUYj contains the timestamp TPY; because the timestamps TPN and TPY were both assigned by the first client or by the same timestamp server, their order can be compared. The new storage node evicts the data of the earlier strip from the cache according to the timestamp TPN and the timestamp TPY. The new storage node may be the storage node obtained after the faulty storage node Nj recovers, or a storage node, newly added to the distributed block storage system, of the partition in which the stripes are distributed. In the embodiments of the present invention, taking the failure of storage node N1 as an example, the cache of the new storage node contains the data of the strip SUN1 and the data of SUY1; the data of SUN1 contains the timestamp TPN, the data of SUY1 contains the timestamp TPY, and the timestamp TPN precedes the timestamp TPY, so the new storage node evicts the data of the strip SUN1 from the cache and keeps the latest strip data in the storage system, thereby saving cache space. The storage node Nj can evict from the cache, according to the timestamps assigned by the same client, the earlier strip data that came from that client and has the same logical address, keeping the latest strip data and saving cache space.
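A compact sketch of this eviction rule, assuming the recovered strips carry comparable timestamps assigned by the same client or the same timestamp server (names and values are placeholders):

```python
from dataclasses import dataclass

@dataclass
class CachedStrip:
    strip_id: str
    lba: int
    timestamp: int   # assigned by the writing client or a shared timestamp server
    data: bytes

def evict_stale(cache: list[CachedStrip]) -> list[CachedStrip]:
    """Keep only the newest strip per logical address; earlier ones are evicted."""
    newest: dict[int, CachedStrip] = {}
    for strip in cache:
        cur = newest.get(strip.lba)
        if cur is None or strip.timestamp > cur.timestamp:
            newest[strip.lba] = strip
    return list(newest.values())

cache = [CachedStrip("SU_N1", 0, 100, b"old"), CachedStrip("SU_Y1", 0, 200, b"new")]
print([s.strip_id for s in evict_stale(cache)])   # ['SU_Y1']: TP_N precedes TP_Y
```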
With reference to the embodiment shown in Figure 6 and the related description, when a logical unit is mounted to the first client and the second client, the storage node Nj assigns a timestamp TPNj to the data of the strip SUNj sent by the first client, and assigns a timestamp TPKj to the data of the strip SUKj sent by the second client. As shown in Figure 10, none of the data strips of the stripe SN is empty: the timestamp assigned by storage node N1 to the data of the strip SUN1 is TPN1, by storage node N2 to the data of SUN2 is TPN2, by storage node N3 to the data of SUN3 is TPN3, and by storage node N4 to the data of SUN4 is TPN4; none of the data strips of the stripe SK is empty: the timestamp assigned by storage node N1 to the data of SUK1 is TPK1, by storage node N2 to the data of SUK2 is TPK2, by storage node N3 to the data of SUK3 is TPK3, and by storage node N4 to the data of SUK4 is TPK4. Therefore, after a storage node storing data strips fails, the distributed block storage system recovers, according to the stripes and the partition view, the data of the strip SUNj of the stripe SN on the faulty storage node Nj and the data of the strip SUKj of the stripe SK on the faulty storage node Nj, and the cache of the new storage node contains the data of the strips SUNj and SUKj. In one implementation, when the data of the strip SUNj contains the timestamp TPN assigned by the first client, the data of the strip SUKj contains the timestamp TPK assigned by the second client, and TPN and TPK were assigned by the same timestamp server, the timestamp TPN of the data of the strip SUNj and the timestamp TPK of the data of SUKj can be compared directly, and the new storage node evicts the data of the earlier strip from the cache according to the timestamp TPN and the timestamp TPK. When the data of the strip SUNj does not contain the timestamp TPN assigned by the first client and/or the data of the strip SUKj does not contain the timestamp TPK assigned by the second client, or the timestamps TPN and TPK do not come from the same timestamp server, the cache of the new storage node contains the data of the strips SUNj and SUKj, and the new storage node may query a storage node NX for the timestamps of the data of the strips of the stripes SN and SK: for example, the new storage node obtains the timestamp TPNX that the storage node NX assigned to the data of the strip SUNX and uses TPNX as the reference timestamp of the data of SUNj, obtains the timestamp TPKX that the storage node NX assigned to the data of the strip SUKX and uses TPKX as the reference timestamp of the data of SUKj, and evicts from the cache the earlier of the data of the strip SUNj and the data of SUKj according to the timestamp TPNX and the timestamp TPKX, where X is any integer from 1 to M except j. In the embodiments of the present invention, taking the failure of storage node N1 as an example, the cache of the new storage node contains the data of the strip SUN1 and the data of SUK1; the new storage node obtains the timestamp TPN2 that storage node N2 assigned to the data of SUN2 as the reference timestamp of the data of SUN1, and the timestamp TPK2 that storage node N2 assigned to the data of SUK2 as the reference timestamp of the data of SUK1; the timestamp TPN2 precedes the timestamp TPK2, so the new storage node evicts the data of the strip SUN1 from the cache and keeps the latest strip data in the storage system, thereby saving cache space. In the embodiments of the present invention, the storage node Nj also assigns a timestamp TPYj to the data of the strip SUYj.
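The fallback with reference timestamps can be sketched as below; the strip names and timestamp values are placeholders, and a real system would query the surviving node NX rather than receive a dictionary:

```python
def evict_with_reference(new_node_cache: dict[str, bytes],
                         surviving_ts: dict[str, int]) -> str:
    """When recovered strips carry no directly comparable client timestamps, use
    the timestamps a surviving node N_X assigned to the peer strips SU_NX / SU_KX
    of the same stripes as reference timestamps, and evict the earlier strip."""
    tp_nx = surviving_ts["SU_NX"]   # reference timestamp for SU_Nj
    tp_kx = surviving_ts["SU_KX"]   # reference timestamp for SU_Kj
    victim = "SU_Nj" if tp_nx < tp_kx else "SU_Kj"
    new_node_cache.pop(victim)
    return victim

cache = {"SU_Nj": b"recovered-1", "SU_Kj": b"recovered-2"}
print(evict_with_reference(cache, {"SU_NX": 100, "SU_KX": 250}))  # evicts SU_Nj
```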
The timestamps assigned by the storage node Nj may come from a timestamp server or may be generated by the storage node Nj itself.
Further, the first client identifier contained in the data of the data strip SUNj, the timestamp at which the first client obtained the stripe SN, the data strip SUNj identifier, the logical address of the data of the data strip SUNj, and the data strip status information may be stored in an extended address of the physical address that the storage node Nj assigned to the data strip SUNj, thereby saving physical address space on the storage node Nj. An extended address of a physical address is a physical address that is invisible beyond the effective physical address capacity of the storage node Nj; when the storage node Nj receives a read request accessing a physical address, it reads out the data in the extended address of that physical address by default. The second client identifier contained in the data of the data strip SUKj, the timestamp at which the second client obtained the stripe SK, the data strip SUKj identifier, the logical address of the data of the data strip SUKj, and the data strip status information may likewise be stored in an extended address of the physical address assigned by the storage node Nj to the data strip SUKj. Similarly, the first client identifier contained in the data strip SUYj, the timestamp at which the first client obtained the stripe SY, the data strip SUYj identifier, the logical address of the data of the data strip SUYj, and the data strip status information may also be stored in an extended address of the physical address assigned by the storage node Nj to the data strip SUYj.
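One conceivable byte layout for such an extended area is sketched below; the field order, the field widths, and the format string are pure assumptions for illustration and do not come from the application:

```python
import struct

# Hypothetical extended-area layout for one strip's physical address:
# client id, stripe timestamp, strip id, logical address, status bitmap.
EXT_FMT = ">16s Q 16s Q B"   # big-endian: two 16-byte ids, two u64 fields, one status byte

def pack_ext(client_id: bytes, stripe_ts: int, strip_id: bytes,
             lba: int, status: int) -> bytes:
    """Serialize the strip metadata that would live in the extended address."""
    return struct.pack(EXT_FMT, client_id.ljust(16, b"\0"), stripe_ts,
                       strip_id.ljust(16, b"\0"), lba, status)

blob = pack_ext(b"client-1", 100, b"SU_N1", 0x1000, 0b110)
print(len(blob), struct.calcsize(EXT_FMT))   # both 49
```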
Further, the timestamp TPNj assigned by the storage node Nj to the data of the strip SUNj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUNj; the timestamp TPKj assigned by the storage node Nj to the data of the strip SUKj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUKj; and the timestamp TPYj assigned by the storage node Nj to the data of the strip SUYj may also be stored in the extended address of the physical address assigned by the storage node Nj to the data strip SUYj.
With reference to the various implementations of the embodiments of the present invention, an embodiment of the present invention provides a data writing apparatus 11, applied to the distributed block storage system of the embodiments of the present invention. As shown in Figure 11, the data writing apparatus 11 includes a receiving unit 111, a determining unit 112, an obtaining unit 113, a dividing unit 114, and a sending unit 115. The receiving unit 111 is configured to receive a first write request, where the first write request includes first data and a logical address; the determining unit 112 is configured to determine that the logical address is distributed in the partition P; the obtaining unit 113 is configured to obtain a stripe SN from the R stripes, where N is one integer value from 1 to R; the dividing unit 114 is configured to divide the first data into the data of one or more strips SUNj of the stripe SN; and the sending unit 115 is configured to send the data of the one or more strips SUNj to the storage nodes Nj. Further, the receiving unit 111 is further configured to receive a second write request, where the second write request includes second data and the logical address, and the logical address of the second data is the same as that of the first data; the determining unit 112 is further configured to determine that the logical address is distributed in the partition P; the obtaining unit 113 is further configured to obtain a stripe SY from the R stripes, where Y is one integer value from 1 to R and N is different from Y; the dividing unit 114 is further configured to divide the second data into the data of one or more strips SUYj of the stripe SY; and the sending unit 115 is further configured to send the data of the one or more strips SUYj to the storage nodes Nj. For the implementation of the data writing apparatus 11 of this embodiment, refer to the clients in the embodiments of the present invention, for example, the first client and the second client. Specifically, the data writing apparatus 11 may be a software module that runs on a client, so that the client completes the various implementations described in the embodiments of the present invention; the data writing apparatus 11 may also be a hardware device, in which case, referring to the structure shown in Figure 3, the units of the data writing apparatus 11 may be implemented by the processor of the server described in Figure 3. Therefore, for a detailed description of the data writing apparatus 11, refer to the description of the clients in the embodiments of the present invention.
With reference to the various implementations of the embodiments of the present invention, an embodiment of the present invention provides a data storage apparatus 12, applied to the distributed block storage system of the embodiments of the present invention. As shown in Figure 12, the data storage apparatus 12 includes a receiving unit 121 and a storage unit 122. The receiving unit 121 is configured to receive the data of the strip SUNj of the stripe SN sent by the first client, where the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request includes the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P; the storage unit 122 is configured to store the data of SUNj to a first physical address according to the mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
With reference to Figure 12, the data storage apparatus 12 further includes an allocating unit, configured to assign a timestamp TPNj to the data of the strip SUNj.
Further, with reference to Figure 12, the data storage apparatus 12 further includes an establishing unit, configured to establish a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
Further, with reference to Figure 12, the receiving unit 121 is further configured to receive the data of the strip SUYj of the stripe SY sent by the first client, where the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request includes the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P; the storage unit 122 is further configured to store the data of SUYj to a second physical address according to the mapping between the strip SUYj identifier and the second physical address of the storage node Nj. Further, the allocating unit is further configured to assign a timestamp TPYj to the data of the strip SUYj.
Further, with reference to Figure 12, the establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier. Further, the data of SUYj includes at least one of the first client identifier and the timestamp TPY at which the first client obtained the stripe SY.
Further, with reference to Figure 12, the receiving unit 121 is further configured to receive the data of the strip SUKj of the stripe SK sent by the second client, where the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request includes the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P; the storage unit 122 is further configured to store the data of SUKj to a third physical address according to the mapping between the strip SUKj identifier and the third physical address of the storage node Nj. Further, the allocating unit is further configured to assign a timestamp TPKj to the data of the strip SUKj. Further, the establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier. Further, the data storage apparatus 12 further includes a recovery unit, configured to recover the data of the strip SUNj according to the stripe SN and the data of the strip SUKj according to the stripe SK after the storage node Nj fails; the data storage apparatus 12 further includes an obtaining unit, configured to obtain the timestamp TPNX of the data of the strip SUNX on a storage node NX as the reference timestamp of the data of the strip SUNj, and the timestamp TPKX of the data of the strip SUKX on the storage node NX as the reference timestamp of the data of the strip SUKj; and the data storage apparatus 12 further includes an eviction unit, configured to evict from the cache of the new storage node, according to the timestamp TPNX and the timestamp TPKX, the earlier of the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M except j.
For the implementation of the data storage apparatus 12 of this embodiment, refer to the storage nodes in the embodiments of the present invention, for example, the storage node Nj. Specifically, the data storage apparatus 12 may be a software module that runs on a server, so that the storage node completes the various implementations described in the embodiments of the present invention; the data storage apparatus 12 may also be a hardware device, in which case, referring to the structure shown in Figure 3, the units of the data storage apparatus 12 may be implemented by the processor of the server described in Figure 3. Therefore, for a detailed description of the data storage apparatus 12, refer to the description of the storage nodes in the embodiments of the present invention.
In the embodiments of the present invention, besides the stripe generated according to the EC algorithm described above, a stripe may also be generated by a multi-copy algorithm. When the stripe is generated by the EC algorithm, the strips SUij of the stripe include data strips and parity strips; when the stripe is generated by the multi-copy algorithm, the strips SUij of the stripe are all data strips, and the data of the strips SUij is the same.
Correspondingly, the embodiments of the present invention further provide a computer readable storage medium and a computer program product; the computer readable storage medium and the computer program product contain computer instructions for implementing the various solutions described in the embodiments of the present invention.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the division of the units described in the apparatus embodiments above is only one kind of logical function division; there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

Claims (88)

  1. A data storage method in a distributed block storage system, wherein the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the method comprises:
    a first client receiving a first write request, wherein the first write request comprises first data and a logical address;
    the first client determining that the logical address is distributed in the partition P;
    the first client obtaining a stripe SN from the R stripes, wherein N is one integer value from 1 to R;
    the first client dividing the first data into data of one or more strips SUNj of the stripe SN; and
    the first client sending the data of the one or more strips SUNj to storage nodes Nj.
  2. The method according to claim 1, wherein the method further comprises:
    the first client receiving a second write request, wherein the second write request comprises second data and the logical address;
    the first client determining that the logical address is distributed in the partition P;
    the first client obtaining a stripe SY from the R stripes, wherein Y is one integer value from 1 to R, and N is different from Y;
    the first client dividing the second data into data of one or more strips SUYj of the stripe SY; and
    the first client sending the data of the one or more strips SUYj to the storage nodes Nj.
  3. The method according to claim 1, wherein the method further comprises:
    a second client receiving a third write request, wherein the third write request comprises third data and the logical address;
    the second client determining that the logical address is distributed in the partition P;
    the second client obtaining a stripe SK from the R stripes, wherein K is one integer value from 1 to R, and N is different from K;
    the second client dividing the third data into data of one or more strips SUKj of the stripe SK; and
    the second client sending the data of the one or more strips SUKj to the storage nodes Nj.
  4. The method according to claim 1, wherein each of the data of the one or more strips SUNj comprises at least one of the first client identifier and a timestamp TPN at which the first client obtained the stripe SN.
  5. The method according to claim 2, wherein each of the data of the one or more strips SUYj comprises at least one of the first client identifier and a timestamp TPY at which the first client obtained the stripe SY.
  6. The method according to claim 3, wherein each of the data of the one or more strips SUKj comprises at least one of the second client identifier and a timestamp TPK at which the second client obtained the stripe SK.
  7. The method according to claim 1, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  8. The method according to any one of claims 1 to 7, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  9. A data storage method in a distributed block storage system, wherein the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the method comprises:
    a storage node Nj receiving data of a strip SUNj of a stripe SN sent by a first client, wherein the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request comprises the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P; and
    the storage node Nj storing the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
  10. The method according to claim 9, wherein the method further comprises:
    the storage node Nj assigning a timestamp TPNj to the data of the strip SUNj.
  11. The method according to claim 9 or 10, wherein the method further comprises:
    the storage node Nj establishing a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  12. The method according to any one of claims 9 to 11, wherein the data of SUNj comprises at least one of the first client identifier and a timestamp TPN at which the first client obtained the stripe SN.
  13. The method according to any one of claims 9 to 12, wherein the method further comprises:
    the storage node Nj receiving data of a strip SUYj of a stripe SY sent by the first client, wherein the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request comprises the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P; and
    the storage node Nj storing the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj.
  14. The method according to claim 13, wherein the method further comprises:
    the storage node Nj assigning a timestamp TPYj to the data of the strip SUYj.
  15. The method according to claim 13 or 14, wherein the method further comprises:
    the storage node Nj establishing a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier.
  16. The method according to any one of claims 13 to 15, wherein the data of SUYj comprises at least one of the first client identifier and a timestamp TPY at which the first client obtained the stripe SY.
  17. The method according to any one of claims 9 to 12, wherein the method further comprises:
    the storage node Nj receiving data of a strip SUKj of a stripe SK sent by a second client, wherein the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request comprises the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P; and
    the storage node Nj storing the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj.
  18. The method according to claim 17, wherein the method further comprises:
    the storage node Nj assigning a timestamp TPKj to the data of the strip SUKj.
  19. The method according to claim 17 or 18, wherein the method further comprises:
    the storage node Nj establishing a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  20. The method according to any one of claims 17 to 19, wherein the data of SUKj comprises at least one of the second client identifier and a timestamp TPK at which the second client obtained the stripe SK.
  21. The method according to claim 9, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  22. The method according to any one of claims 9 to 21, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  23. The method according to claim 18, wherein after the storage node Nj fails, a new storage node recovers the data of the strip SUNj according to the stripe SN and recovers the data of SUKj according to the stripe SK; the new storage node obtains a timestamp TPNX of data of a strip SUNX on a storage node NX as a reference timestamp of the data of the strip SUNj, and obtains a timestamp TPKX of data of a strip SUKX on the storage node NX as a reference timestamp of the data of the strip SUKj; and the new storage node evicts from a cache the earlier of the data of the strip SUNj and the data of SUKj according to the timestamp TPNX and the timestamp TPKX, wherein X is any integer from 1 to M except j.
  24. A data writing apparatus in a distributed block storage system, wherein the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the data writing apparatus comprises:
    a receiving unit, configured to receive a first write request, wherein the first write request comprises first data and a logical address;
    a determining unit, configured to determine that the logical address is distributed in the partition P;
    an obtaining unit, configured to obtain a stripe SN from the R stripes, wherein N is one integer value from 1 to R;
    a dividing unit, configured to divide the first data into data of one or more strips SUNj of the stripe SN; and
    a sending unit, configured to send the data of the one or more strips SUNj to storage nodes Nj.
  25. The data writing apparatus according to claim 24, wherein the receiving unit is further configured to receive a second write request, the second write request comprising second data and the logical address;
    the determining unit is further configured to determine that the logical address is distributed in the partition P;
    the obtaining unit is further configured to obtain a stripe SY from the R stripes, wherein Y is one integer value from 1 to R, and N is different from Y;
    the dividing unit is further configured to divide the second data into data of one or more strips SUYj of the stripe SY; and
    the sending unit is further configured to send the data of the one or more strips SUYj to the storage nodes Nj.
  26. The data writing apparatus according to claim 24, wherein each of the data of the one or more strips SUNj comprises at least one of the data writing apparatus identifier and a timestamp TPN at which the data writing apparatus obtained the stripe SN.
  27. The data writing apparatus according to claim 25, wherein each of the data of the one or more strips SUYj comprises at least one of the data writing apparatus identifier and a timestamp TPY at which the data writing apparatus obtained the stripe SY.
  28. The data writing apparatus according to claim 24, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  29. The data writing apparatus according to any one of claims 24 to 28, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  30. A data storage apparatus in a distributed block storage system, wherein the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the data storage apparatus comprises:
    a receiving unit, configured to receive data of a strip SUNj of a stripe SN sent by a first client, wherein the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request comprises the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P; and
    a storage unit, configured to store the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
  31. The data storage apparatus according to claim 30, further comprising:
    an allocating unit, configured to assign a timestamp TPNj to the data of the strip SUNj.
  32. The data storage apparatus according to claim 30 or 31, further comprising:
    an establishing unit, configured to establish a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  33. The data storage apparatus according to any one of claims 30 to 32, wherein the data of SUNj comprises at least one of the first client identifier and a timestamp TPN at which the first client obtained the stripe SN.
  34. The data storage apparatus according to any one of claims 30 to 33, wherein the receiving unit is further configured to receive data of a strip SUYj of a stripe SY sent by the first client, wherein the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request comprises the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P; and
    the storage unit is further configured to store the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj.
  35. The data storage apparatus according to claim 34, wherein the allocating unit is further configured to assign a timestamp TPYj to the data of the strip SUYj.
  36. The data storage apparatus according to claim 34 or 35, wherein the establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier.
  37. The data storage apparatus according to any one of claims 34 to 36, wherein the data of SUYj comprises at least one of the first client identifier and a timestamp TPY at which the first client obtained the stripe SY.
  38. The data storage apparatus according to any one of claims 30 to 33, wherein the receiving unit is further configured to receive data of a strip SUKj of a stripe SK sent by a second client, wherein the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request comprises the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P; and
    the storage unit is further configured to store the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj.
  39. The data storage apparatus according to claim 38, wherein the allocating unit is further configured to assign a timestamp TPKj to the data of the strip SUKj.
  40. The data storage apparatus according to claim 38 or 39, wherein the establishing unit is further configured to establish a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  41. The data storage apparatus according to any one of claims 38 to 40, wherein the data of SUKj comprises at least one of the second client identifier and a timestamp TPK at which the second client obtained the stripe SK.
  42. The data storage apparatus according to claim 30, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  43. The data storage apparatus according to any one of claims 30 to 42, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  44. The data storage apparatus according to claim 39, wherein the data storage apparatus further comprises a recovery unit, configured to: after the storage node Nj fails, recover the data of the strip SUNj on a new storage node according to the stripe SN, and recover the data of SUKj on the new storage node according to the stripe SK; the data storage apparatus further comprises an obtaining unit, configured to obtain a timestamp TPNX of data of a strip SUNX on a storage node NX as a reference timestamp of the data of the strip SUNj, and obtain a timestamp TPKX of data of a strip SUKX on the storage node NX as a reference timestamp of the data of the strip SUKj; and the data storage apparatus further comprises an eviction unit, configured to evict from the cache of the new storage node the earlier of the data of the strip SUNj and the data of SUKj according to the timestamp TPNX and the timestamp TPKX, wherein X is any integer from 1 to M except j.
  45. A distributed block storage system, wherein the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; and
    a storage node Nj is configured to receive data of a strip SUNj of a stripe SN sent by a first client, and store the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj, wherein the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request comprises the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition.
  46. The system according to claim 45, wherein the storage node Nj is further configured to assign a timestamp TPNj to the data of the strip SUNj.
  47. The system according to claim 45 or 46, wherein the storage node Nj is further configured to establish a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  48. The system according to any one of claims 45 to 47, wherein the data of SUNj comprises at least one of the first client identifier and a timestamp TPN at which the first client obtained the stripe SN.
  49. The system according to any one of claims 45 to 48, wherein the storage node Nj is further configured to receive data of a strip SUYj of a stripe SY sent by the first client, and store the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj, wherein the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request comprises the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P.
  50. The system according to claim 49, wherein the storage node Nj is further configured to assign a timestamp TPYj to the data of the strip SUYj.
  51. The system according to claim 49 or 50, wherein the storage node Nj is further configured to establish a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier.
  52. The system according to any one of claims 49 to 51, wherein the data of SUYj comprises at least one of the first client identifier and a timestamp TPY at which the first client obtained the stripe SY.
  53. The system according to any one of claims 45 to 48, wherein the storage node Nj is further configured to receive data of a strip SUKj of a stripe SK sent by a second client, and store the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj, wherein the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request comprises the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P.
  54. The system according to claim 53, wherein the storage node Nj is further configured to assign a timestamp TPKj to the data of the strip SUKj.
  55. The system according to claim 53 or 54, wherein the storage node Nj is further configured to establish a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  56. The system according to any one of claims 53 to 55, wherein the data of SUKj comprises at least one of the second client identifier and a timestamp TPK at which the second client obtained the stripe SK.
  57. The system according to claim 45, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  58. The system according to any one of claims 45 to 57, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  59. The system according to claim 54, wherein the distributed block storage system further comprises a new storage node after the storage node Nj fails; and
    the new storage node is configured to recover the data of the strip SUNj according to the stripe SN, recover the data of the strip SUKj according to the stripe SK, obtain a timestamp TPNX of data of a strip SUNX on a storage node NX as a reference timestamp of the data of the strip SUNj, obtain a timestamp TPKX of data of a strip SUKX on the storage node NX as a reference timestamp of the data of the strip SUKj, and evict from the cache of the new storage node the earlier of the data of the strip SUNj and the data of the strip SUKj according to the timestamp TPNX and the timestamp TPKX, wherein X is any integer from 1 to M except j.
  60. A client, wherein the client is applied to a distributed block storage system, the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the client comprises a processor and an interface, the processor communicates with the interface, and the processor is configured to:
    receive a first write request, wherein the first write request comprises first data and a logical address;
    determine that the logical address is distributed in the partition P;
    obtain a stripe SN from the R stripes, wherein N is one integer value from 1 to R;
    divide the first data into data of one or more strips SUNj of the stripe SN; and
    send the data of the one or more strips SUNj to storage nodes Nj.
  61. The client according to claim 60, wherein the processor is further configured to:
    receive a second write request, wherein the second write request comprises second data and the logical address;
    determine that the logical address is distributed in the partition P;
    obtain a stripe SY from the R stripes, wherein Y is one integer value from 1 to R, and N is different from Y;
    divide the second data into data of one or more strips SUYj of the stripe SY; and
    send the data of the one or more strips SUYj to the storage nodes Nj.
  62. The client according to claim 60, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  63. The client according to any one of claims 60 to 62, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  64. A storage node, wherein the storage node is applied to a distributed block storage system, the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the storage node, acting as a storage node Nj, comprises a processor and an interface, the processor communicates with the interface, and the processor is configured to:
    receive data of a strip SUNj of a stripe SN sent by a first client, wherein the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request comprises the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P; and
    store the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
  65. The storage node according to claim 64, wherein the processor is further configured to assign a timestamp TPNj to the data of the strip SUNj.
  66. The storage node according to claim 64 or 65, wherein the processor is further configured to establish a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  67. The storage node according to any one of claims 64 to 66, wherein the processor is further configured to:
    receive data of a strip SUYj of a stripe SY sent by the first client, wherein the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request comprises the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P; and
    store the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj.
  68. The storage node according to claim 67, wherein the processor is further configured to assign a timestamp TPYj to the data of the strip SUYj.
  69. The storage node according to claim 67 or 68, wherein the processor is further configured to establish a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier.
  70. The storage node according to any one of claims 64 to 66, wherein the processor is further configured to:
    receive data of a strip SUKj of a stripe SK sent by a second client, wherein the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request comprises the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P; and
    store the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj.
  71. The storage node according to claim 70, wherein the processor is further configured to assign a timestamp TPKj to the data of the strip SUKj.
  72. The storage node according to claim 70 or 71, wherein the processor is further configured to establish a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  73. The storage node according to claim 64, wherein the strips SUij of the stripe Si are allocated from the storage nodes Nj by a stripe metadata server according to a mapping between the partition P and the storage nodes Nj comprised in the partition P.
  74. The storage node according to any one of claims 64 to 73, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  75. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions applied to a distributed block storage system, the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the computer readable storage medium contains first computer instructions used to cause a first client to perform the following operations:
    receiving a first write request, wherein the first write request comprises first data and a logical address;
    determining that the logical address is distributed in the partition P;
    obtaining a stripe SN from the R stripes, wherein N is one integer value from 1 to R;
    dividing the first data into data of one or more strips SUNj of the stripe SN; and
    sending the data of the one or more strips SUNj to storage nodes Nj.
  76. The computer readable storage medium according to claim 75, wherein the computer readable storage medium further contains second computer instructions used to cause the first client to perform the following operations:
    receiving a second write request, wherein the second write request comprises second data and the logical address;
    determining that the logical address is distributed in the partition P;
    obtaining a stripe SY from the R stripes, wherein Y is one integer value from 1 to R, and N is different from Y;
    dividing the second data into data of one or more strips SUYj of the stripe SY; and
    sending the data of the one or more strips SUYj to the storage nodes Nj.
  77. The computer readable storage medium according to claim 75, wherein the computer readable storage medium further contains third computer instructions used to cause a second client to perform the following operations:
    receiving a third write request, wherein the third write request comprises third data and the logical address;
    determining that the logical address is distributed in the partition P;
    obtaining a stripe SK from the R stripes, wherein K is one integer value from 1 to R, and N is different from K;
    dividing the third data into data of one or more strips SUKj of the stripe SK; and
    sending the data of the one or more strips SUKj to the storage nodes Nj.
  78. The computer readable storage medium according to any one of claims 75 to 77, wherein each of the data of the one or more strips SUNj further comprises data strip status information, and the data strip status information is used to indicate whether each data strip of the stripe SN is empty.
  79. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions applied to a distributed block storage system, the distributed block storage system comprises a partition P, the partition P comprises M storage nodes Nj and R stripes Si, and each stripe comprises strips SUij, where j takes each integer value from 1 to M and i takes each integer value from 1 to R; the computer readable storage medium contains first computer instructions used to cause a storage node Nj to perform the following operations:
    receiving data of a strip SUNj of a stripe SN sent by a first client, wherein the data of the strip SUNj is obtained by the first client dividing first data, the first data is obtained by the first client receiving a first write request, the first write request comprises the first data and a logical address, and the logical address is used to determine that the first data is distributed in the partition P; and
    storing the data of SUNj to a first physical address according to a mapping between the strip SUNj identifier and the first physical address of the storage node Nj.
  80. The computer readable storage medium according to claim 79, wherein the computer readable storage medium contains second computer instructions used to cause the storage node Nj to perform the following operation:
    assigning a timestamp TPNj to the data of the strip SUNj.
  81. The computer readable storage medium according to claim 79 or 80, wherein the computer readable storage medium contains third computer instructions used to cause the storage node Nj to perform the following operation:
    establishing a correspondence between the logical address of the data of the strip SUNj and the strip SUNj identifier.
  82. The computer readable storage medium according to any one of claims 79 to 81, wherein the computer readable storage medium contains fourth computer instructions used to cause the storage node Nj to perform the following operations:
    receiving data of a strip SUYj of a stripe SY sent by the first client, wherein the data of the strip SUYj is obtained by the first client dividing second data, the second data is obtained by the first client receiving a second write request, the second write request comprises the second data and the logical address, and the logical address is used to determine that the second data is distributed in the partition P; and
    storing the data of SUYj to a second physical address according to a mapping between the strip SUYj identifier and the second physical address of the storage node Nj.
  83. The computer readable storage medium according to claim 82, wherein the computer readable storage medium contains fifth computer instructions used to cause the storage node Nj to perform the following operation:
    assigning a timestamp TPYj to the data of the strip SUYj.
  84. The computer readable storage medium according to claim 82 or 83, wherein the computer readable storage medium contains sixth computer instructions used to cause the storage node Nj to perform the following operation:
    establishing a correspondence between the logical address of the data of the strip SUYj and the strip SUYj identifier.
  85. The computer readable storage medium according to any one of claims 79 to 81, wherein the computer readable storage medium contains seventh computer instructions used to cause the storage node Nj to perform the following operations:
    receiving data of a strip SUKj of a stripe SK sent by a second client, wherein the data of the strip SUKj is obtained by the second client dividing third data, the third data is obtained by the second client receiving a third write request, the third write request comprises the third data and the logical address, and the logical address is used to determine that the third data is distributed in the partition P; and
    storing the data of SUKj to a third physical address according to a mapping between the strip SUKj identifier and the third physical address of the storage node Nj.
  86. The computer readable storage medium according to claim 85, wherein the computer readable storage medium contains eighth computer instructions used to cause the storage node Nj to perform the following operation:
    assigning a timestamp TPKj to the data of the strip SUKj.
  87. The computer readable storage medium according to claim 85 or 86, wherein the computer readable storage medium contains ninth computer instructions used to cause the storage node Nj to perform the following operation:
    establishing a correspondence between the logical address of the data of the strip SUKj and the strip SUKj identifier.
  88. The computer readable storage medium according to claim 86, wherein the computer readable storage medium contains tenth computer instructions used to cause, after the storage node Nj fails, a new storage node to perform the following operations:
    recovering the data of the strip SUNj according to the stripe SN and recovering the data of the strip SUKj according to the stripe SK; obtaining a timestamp TPNX of data of a strip SUNX on a storage node NX as a reference timestamp of the data of the strip SUNj; obtaining a timestamp TPKX of data of a strip SUKX on the storage node NX as a reference timestamp of the data of the strip SUKj; and evicting from the cache of the new storage node the earlier of the data of the strip SUNj and the data of SUKj according to the timestamp TPNX and the timestamp TPKX, wherein X is any integer from 1 to M except j.
PCT/CN2017/106147 2017-10-13 2017-10-13 Data storage method and apparatus in distributed block storage system, and computer readable storage medium WO2019071595A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201780002700.6A CN110325958B (zh) 2017-10-13 2017-10-13 Data storage method and apparatus in distributed block storage system, and computer readable storage medium
PCT/CN2017/106147 WO2019071595A1 (zh) 2017-10-13 2017-10-13 Data storage method and apparatus in distributed block storage system, and computer readable storage medium
EP17890845.5A EP3495939B1 (en) 2017-10-13 2017-10-13 Method and device for storing data in distributed block storage system, and computer readable storage medium
US16/172,264 US20190114076A1 (en) 2017-10-13 2018-10-26 Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/106147 WO2019071595A1 (zh) 2017-10-13 2017-10-13 Data storage method and apparatus in distributed block storage system, and computer readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/172,264 Continuation US20190114076A1 (en) 2017-10-13 2018-10-26 Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium

Publications (1)

Publication Number Publication Date
WO2019071595A1 true WO2019071595A1 (zh) 2019-04-18

Family

ID=66096484

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106147 WO2019071595A1 (zh) 2017-10-13 2017-10-13 Data storage method and apparatus in distributed block storage system, and computer readable storage medium

Country Status (4)

Country Link
US (1) US20190114076A1 (zh)
EP (1) EP3495939B1 (zh)
CN (1) CN110325958B (zh)
WO (1) WO2019071595A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109814805B (zh) * 2018-12-25 2020-08-25 Huawei Technologies Co., Ltd. Stripe reassembly method in storage system and stripe server
CN112241320B (zh) 2019-07-17 2023-11-10 Huawei Technologies Co., Ltd. Resource allocation method, storage device, and storage system
CN111399766B (zh) * 2020-01-08 2021-10-22 Huawei Technologies Co., Ltd. Data storage method, data reading method, apparatus, and system in a storage system
CN113204520B (zh) * 2021-04-28 2023-04-07 Wuhan University Fast concurrent read/write method for remote sensing data based on a distributed file system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103368870A (zh) * 2013-06-26 2013-10-23 National Supercomputing Center in Shenzhen (Shenzhen Cloud Computing Center) Control method and system for parallel loads in a cluster storage network
CN103458023A (zh) * 2013-08-30 2013-12-18 Tsinghua University Distributed flash storage system
CN105242879A (zh) * 2015-09-30 2016-01-13 Huawei Technologies Co., Ltd. Data storage method and protocol server
CN105404469A (zh) * 2015-10-22 2016-03-16 Zhejiang Uniview Technologies Co., Ltd. Video data storage method and system
US20170031988A1 (en) * 2015-07-30 2017-02-02 Futurewei Technologies, Inc. Data placement control for distributed computing environment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08329021A (ja) * 1995-03-30 1996-12-13 Mitsubishi Electric Corp Client-server system
EP1377906B1 (en) * 2001-03-13 2012-10-17 Oracle America, Inc. Method and arrangements for node recovery
US7702850B2 (en) * 2005-03-14 2010-04-20 Thomas Earl Ludwig Topology independent storage arrays and methods
US7325111B1 (en) * 2005-11-01 2008-01-29 Network Appliance, Inc. Method and system for single pass volume scanning for multiple destination mirroring
US20080140724A1 (en) * 2006-12-06 2008-06-12 David Flynn Apparatus, system, and method for servicing object requests within a storage controller
US8825789B2 (en) * 2008-12-16 2014-09-02 Netapp, Inc. Method and apparatus to implement a hierarchical cache system with pNFS
WO2013082764A1 (zh) * 2011-12-06 2013-06-13 Yulong Computer Telecommunication Scientific (Shenzhen) Co., Ltd. Information processing method and terminal
US9128826B2 (en) * 2012-10-17 2015-09-08 Datadirect Networks, Inc. Data storage architecture and system for high performance computing hash on metadata in reference to storage request in nonvolatile memory (NVM) location
WO2014101218A1 (zh) * 2012-12-31 2014-07-03 Huawei Technologies Co., Ltd. Cluster system with integrated computing and storage
CN103984607A (zh) * 2013-02-08 2014-08-13 Huawei Technologies Co., Ltd. Distributed storage method, apparatus, and system
CA2867589A1 (en) * 2013-10-15 2015-04-15 Coho Data Inc. Systems, methods and devices for implementing data management in a distributed data storage system
EP2933733A4 (en) * 2013-12-31 2016-05-11 Huawei Tech Co Ltd DATA PROCESSING METHOD AND DEVICE IN A DISTRIBUTED FILE STORAGE SYSTEM
US9519546B2 (en) * 2014-03-17 2016-12-13 Dell Products L.P. Striping cache blocks with logical block address scrambling
US9594632B2 (en) * 2014-07-09 2017-03-14 Qualcomm Incorporated Systems and methods for reliably storing data using liquid distributed storage
CN105095013B (zh) * 2015-06-04 2017-11-21 Huawei Technologies Co., Ltd. Data storage method, recovery method, related apparatus, and system
CN105389127B (zh) * 2015-11-04 2018-06-26 Huawei Technologies Co., Ltd. Method and apparatus for transmitting messages in a storage system, storage system, and controller
CN107172222A (zh) * 2017-07-27 2017-09-15 Zhengzhou Yunhai Information Technology Co., Ltd. Data storage method and apparatus based on a distributed storage system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103368870A (zh) * 2013-06-26 2013-10-23 National Supercomputing Center in Shenzhen (Shenzhen Cloud Computing Center) Control method and system for parallel loads in a cluster storage network
CN103458023A (zh) * 2013-08-30 2013-12-18 Tsinghua University Distributed flash storage system
US20170031988A1 (en) * 2015-07-30 2017-02-02 Futurewei Technologies, Inc. Data placement control for distributed computing environment
CN105242879A (zh) * 2015-09-30 2016-01-13 Huawei Technologies Co., Ltd. Data storage method and protocol server
CN105404469A (zh) * 2015-10-22 2016-03-16 Zhejiang Uniview Technologies Co., Ltd. Video data storage method and system

Also Published As

Publication number Publication date
EP3495939B1 (en) 2021-06-30
EP3495939A1 (en) 2019-06-12
US20190114076A1 (en) 2019-04-18
EP3495939A4 (en) 2019-06-12
CN110325958B (zh) 2021-09-17
CN110325958A (zh) 2019-10-11

Similar Documents

Publication Publication Date Title
US11409705B2 (en) Log-structured storage device format
CN109144406B (zh) 分布式存储系统中元数据存储方法、系统及存储介质
WO2019127018A1 (zh) 存储系统访问方法及装置
US20190114076A1 (en) Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium
US8954706B2 (en) Storage apparatus, computer system, and control method for storage apparatus
WO2021017782A1 (zh) 分布式存储系统访问方法、客户端及计算机程序产品
CN112615917B (zh) 存储系统中存储设备的管理方法及存储系统
US11899533B2 (en) Stripe reassembling method in storage system and stripe server
US11775194B2 (en) Data storage method and apparatus in distributed storage system, and computer program product
CN110199270B (zh) 存储系统中存储设备的管理方法及装置
US12032849B2 (en) Distributed storage system and computer program product
US20210240399A1 (en) Storage system with continuous data verification for synchronous replication of logical storage volumes
US9501290B1 (en) Techniques for generating unique identifiers
US20240176528A1 (en) Unmapping logical block addresses of remote asynchronously replicated volumes
CN115495010A (zh) 一种数据访问方法、装置和存储系统

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2017890845

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017890845

Country of ref document: EP

Effective date: 20180816

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17890845

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE