WO2019184012A1 - Data writing method, client server and system - Google Patents

Data writing method, client server and system

Info

Publication number
WO2019184012A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
block
client server
stripe
check
Prior art date
Application number
PCT/CN2018/083107
Other languages
English (en)
French (fr)
Inventor
罗四维
张雷
王锋
方新
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP18912193.2A (published as EP3779705A4)
Priority to CN201880002799.4A (published as CN110557964B)
Publication of WO2019184012A1
Priority to US17/029,285 (published as US11579777B2)

Classifications

    • G06F3/0608 Saving storage space on storage systems
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F3/064 Management of blocks
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/0676 Magnetic disk device
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD
    • G06F2211/1035 Keeping track, i.e. keeping track of data and parity changes
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors

Definitions

  • the present application relates to the field of information technology and, more particularly, to a method of data writing, a client server and a system.
  • For example, in erasure coding (EC) 4+2 mode, a stripe includes 4 data fragments and 2 check fragments, which store data blocks and check blocks, respectively. If the length of the data is sufficient for a full-stripe write (that is, the data can be divided into four complete blocks of the fragment size), the data to be written is first split into blocks of a fixed size equal to the fragment size, yielding 4 complete data blocks; the other 2 check blocks are then calculated by an XOR operation; finally, the 6 blocks are written to the designated nodes to persist them, completing a full-stripe write.
  • If the data is not long enough for a full-stripe write, the missing data blocks are completed by a zero-padding ('0'-filling) operation, the check blocks are calculated from the four completed blocks, and all data blocks and check blocks are written to the designated nodes. However, this zero-padding wastes hard disk space and increases the total number of stripes the data requires.
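  • As an illustration of the prior-art baseline just described, the sketch below performs a full-stripe EC 4+2 write with zero-padding. Plain XOR stands in for a real erasure code such as Reed-Solomon, and the rotated-XOR second check block is only a placeholder, so the parity math is a simplification rather than the patent's actual algorithm.

```python
BLOCK_SIZE = 512 * 1024   # fragment ("slice") size: 512 KB
N_DATA = 4                # data fragments per stripe

def split_with_padding(data: bytes) -> list[bytes]:
    """Split data into N_DATA blocks, zero-padding any shortfall."""
    padded = data.ljust(N_DATA * BLOCK_SIZE, b"\x00")
    return [padded[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(N_DATA)]

def xor_blocks(blocks: list[bytes]) -> bytes:
    acc = 0
    for b in blocks:
        acc ^= int.from_bytes(b, "big")
    return acc.to_bytes(BLOCK_SIZE, "big")

def rotate(b: bytes, k: int) -> bytes:
    return b[k:] + b[:k]

def full_stripe_write(data: bytes) -> list[bytes]:
    assert len(data) <= N_DATA * BLOCK_SIZE
    data_blocks = split_with_padding(data)   # padded blocks are stored too
    p = xor_blocks(data_blocks)              # first check block
    q = xor_blocks([rotate(b, i) for i, b in enumerate(data_blocks)])
    return data_blocks + [p, q]              # 6 blocks, one per storage node
```

  • Note that the zero-padded blocks are persisted like real data here, which is exactly the hard-disk overhead the application aims to remove.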
  • The present application provides a data writing method, a client server and a system, which can reduce the total number of stripes required for data.
  • a data writing method comprising:
  • the client server receives the first data
  • the write location information indicates the positions, within the target stripe, of the data blocks already written to it, wherein the target stripe comprises a plurality of fragments, each fragment corresponds to a storage node, the storage nodes communicate with the client server, and the target stripe is not full;
  • the client server sends a write request to the storage node of the fragment corresponding to each of the one or more data blocks of the first data, where the write request is used to store the data blocks of the first data into the corresponding fragments.
  • In this technical solution, data blocks are written based on the write location information, and in the case where the data does not satisfy the full-stripe condition (the fragments allocated to the data are not completely filled) no zero-padding operation is needed, which saves hard disk space and reduces the total number of stripes required for the data.
  • Reducing the total number of stripes can: (1) reduce the complexity of stripe management; (2) speed up stripe lookup; and (3) reduce the number of stripes involved in fault recovery, which speeds up recovery from faults.
  • Optionally, the write location information indicates the last position in the target stripe to which a data block has been written.
  • Generating the one or more data blocks of the first data based on the write location information specifically includes one of the following situations:
  • the first data is split according to the fragment size to generate the one or more data blocks of the first data, wherein if the size of the first data is smaller than the fragment size, the entire first data becomes a single data block; or
  • a data block is first cut from the first data according to the size of the unwritten portion of a specific (partially written) fragment, so that the cut data block corresponds to the unwritten portion of that fragment; the remaining portion of the first data is then split according to the fragment size, each data block cut from the remaining portion corresponding to a blank fragment; here, if the size of the first data is smaller than the unwritten portion of the specific fragment, the entire first data becomes a single data block.
  • Optionally, when the first data satisfies a first condition, the one or more data blocks of the first data are generated based on the write location information; or, when the first data satisfies a second condition, the one or more data blocks of the first data are generated based on the fragment size.
  • Optionally, the write location information includes the offset of the written data blocks in the target stripe and the node number of the last written data block.
  • Optionally, the target stripe already has a check block for verifying the written data blocks. After obtaining at least one data block of the first data, the client server calculates a check block covering both the written data blocks and the one or more data blocks, and the calculated check block is stored into the check fragment of the target stripe.
  • Optionally, the cache of the client server holds a check block for verifying the written data blocks. After obtaining at least one data block of the first data, the client server calculates a check block covering the written data blocks and the one or more data blocks, stores the calculated check block in its cache, and, after the target stripe is full, stores the check blocks corresponding to the target stripe from its cache into the check fragments of the target stripe.
  • Optionally, the storage nodes include check nodes for storing check blocks. In addition to sending the write request, the client server also sends the one or more data blocks to one or more check nodes for backup; after all data fragments of the target stripe are full, the one or more check nodes are instructed to generate check blocks from all the backed-up data blocks of the target stripe and to store the generated check blocks into the check fragments of the target stripe.
  • Optionally, the client server acquiring the one or more data blocks of the first data includes one of the following situations: the client server splits the first data to generate the one or more data blocks of the first data; or the client server sends the write location information and the first data to an application server and then acquires the one or more data blocks of the first data from the application server.
  • Optionally, before the client server receives the first data, the method further includes: the client server receives second data, writes data blocks of the second data into the target stripe, and records the positions in the target stripe to which the data blocks of the second data were written as the write location information.
  • the write location information is stored in the memory of the client server; or the write location information is stored in the metadata server.
  • The write location information may include an identifier of the node where the last data block written in the stripe is located, and the offset of the written data relative to the start position of the stripe.
  • the metadata server in the storage system may store the mapping relationship between the stripe and the node, and send the stripe and the information of the data node and the check node corresponding to the stripe to the client server.
  • New data blocks can continue to be written in append-only mode.
  • On the one hand, this data writing method uses hard disk space as efficiently as possible, avoiding idle space before and after the data; on the other hand, it better suits flash-type storage media, improving read/write performance, balancing cell wear, and extending media life.
  • the stripe can be marked as full in the metadata server when the stripe is full.
  • a client server comprising means for performing the method of the first aspect or any of the possible implementations of the first aspect.
  • a client server comprising a processor and a memory, where the memory stores instructions and the processor is configured to execute the instructions stored in the memory to perform the method in the first aspect or any possible implementation of the first aspect.
  • a system comprising: the client server of the second aspect or the third aspect, and a plurality of nodes; wherein the plurality of nodes are configured to store data to be written by the client server.
  • a computer readable medium for storing a computer program, the computer program comprising instructions for performing the method of the first aspect or any of the possible implementations of the first aspect.
  • FIG. 1 is a schematic diagram of a scenario in which the technical solution of the embodiment of the present application is applicable.
  • FIG. 2 is a schematic diagram of full-stripe data writing.
  • FIG. 3 is a schematic diagram of non-full-stripe data writing.
  • FIG. 4 is a schematic flowchart of a data writing method in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of data partitioning according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of data blocking of another embodiment of the present application.
  • FIG. 7 is a schematic diagram of a data writing manner according to an embodiment of the present application.
  • FIG. 8 is a flow chart of a data writing method according to an embodiment of the present application.
  • FIG. 9 is a flow chart of a data writing method according to another embodiment of the present application.
  • FIG. 10 is a flowchart of a data writing method according to another embodiment of the present application.
  • FIG. 11 is a flow chart of a data writing method according to another embodiment of the present application.
  • FIG. 12 is a schematic block diagram of a client server according to an embodiment of the present application.
  • FIG. 13 is a schematic block diagram of a client server of another embodiment of the present application.
  • the technical solutions of the embodiments of the present application can be applied to various storage systems.
  • the technical solution of the embodiment of the present application is described by taking a distributed storage system as an example, but the embodiment of the present application does not limit this.
  • In a distributed storage system, data such as files and objects is spread across multiple storage devices that share the storage load. This storage method improves the reliability, availability, and access efficiency of the system, and is also easy to expand.
  • the storage device is, for example, a server or a combination of a storage controller and a storage medium.
  • In the embodiments of the present application, the client server receives the first data and obtains the write location information of a target stripe into which data blocks of other data have already been written; the write location information indicates the positions of those written data blocks in the target stripe. Based on the write location information, the client server appends data blocks of the new data to the target stripe, which is equivalent to writing data blocks of different data into the same stripe at different times, thereby reducing the total number of stripes required for the data.
  • FIG. 1 is a schematic diagram of a scenario in which the technical solution of the embodiment of the present application is applicable.
  • the storage system 100 includes a switch 103 and a plurality of storage nodes (or simply "nodes") 104 and the like.
  • the storage node 104 may also be simply referred to as a node, which is a storage device; the switch 103 is an optional device.
  • the application server 102 may also be part of the storage system 100.
  • Each node 104 may include multiple disks or other types of storage media (e.g., solid state drives, floppy disks, or shingled magnetic recording) for storing data; for convenience, the following description uses hard disk drives (HDD) as an example.
  • the nodes 104 can be divided into data nodes, check nodes, and metadata servers according to specific functions.
  • the data node is used for storing data blocks of data
  • the check node is used for storing check blocks of data
  • the metadata server can be used for storing metadata of data
  • the client server 101 sends a request to the application server 102, which carries the data to be written.
  • the application server 102 splits the data to generate data blocks, generates check blocks from the data blocks, and returns the resulting blocks to the client server 101; the client server 101 then sends the blocks to the nodes 104 through the switch 103.
  • Each block corresponds to a node 104, which stores the block and returns a successful write response to the client server 101.
  • the application server 102 and the client server 101 can also be combined.
  • the functions of the application server 102 and the client server 101 can both be implemented by the client server 101.
  • the functions of the application server 102 are integrated into the client server 101, and the information interaction between the application server 102 and the client server 101 can be converted into an internal operation or some operations can be adaptively cancelled.
  • the routing manner between the client server 101 and the node 104 may be a distributed hash table (DHT), but the embodiment of the present application is not limited thereto. That is to say, in the technical solution of the embodiment of the present application, various possible routing manners in the storage system may be adopted.
  • the data storage may be performed by using an Erasure Code (EC) technology in the distributed storage system 100.
  • the EC 4+2 mode is taken as an example, that is, one stripe includes four data fragments and two check fragments, but the embodiment of the present application is not limited thereto. Fragments are sometimes referred to as strips or stripe units.
  • As shown in FIG. 2, the application server 102 splits the data according to the fragment size (for example, 512 KB) into four complete blocks of the fragment size (a case said to satisfy the full-stripe condition), and then calculates the other two check blocks by an exclusive-OR operation.
  • If the data does not satisfy the full-stripe condition, that is, the size of the data is not enough to be divided into 4 complete blocks: as shown in FIG. 3, stripe 2 has 4 data fragments and can store a total of 4 complete blocks, but the actual data is only large enough to be divided into two complete blocks 8 and 9. A zero-padding operation is then required to obtain blocks 10 and 11, that is, the data in those two blocks is set to all 0s, and the zero-padded data blocks 10 and 11 are combined with data blocks 8 and 9 to calculate the check blocks P2 and Q2. The zero-padded data blocks 10 and 11 carry no useful information, yet they are still stored on the data nodes, which reduces hard disk space utilization.
  • The embodiment of the present application provides a technical solution that can improve hard disk space utilization in the case where the data does not satisfy the full-stripe condition, and reduce the total number of stripes required for the data.
  • a stripe is a logical space corresponding to a physical space.
  • A stripe is composed of fragments; each fragment corresponds to a physical storage space on a hard disk, the correspondence being a mapping between the fragment and a logical block address (LBA) of the hard disk.
  • this correspondence can be stored in the storage server or stored in other locations.
  • Optionally, physical space is not allocated to the stripe before data is actually stored in it; only when a data block really needs to be stored is physical space allocated (from the hard disk). After the physical space corresponding to a fragment stores a data block, that data block can be considered "written".
  • The N fragments of a stripe can correspond to N nodes, that is, each node stores only one fragment. In other embodiments, the N fragments may correspond to fewer than N nodes, as long as the N fragments correspond to N storage media.
  • Node_id indicates the number of a server (node) holding a hard disk in a specific stripe. The value range is [1, N]; node_id values 1 to n identify data nodes, and values n+1 to N identify check nodes.
  • Alternatively, node_id may be omitted and other tags used as node IDs.
  • The set of stripe-to-node mappings <stripe_id, node_id> can be saved on the metadata server, and one node may correspond to multiple stripes.
  • Client_id indicates the number of the client server. After a stripe is allocated to a client server, the stripe's stripe_id_owner (described below) is set to that client server's client_id; the stripe is then written only by that client server, and other client servers can only read it. The client_id is saved in the client server.
  • Stripe_id indicates the stripe number, which is saved in the metadata server.
  • Stripe_full indicates whether the stripe is full. The default value is FALSE. It is saved in the metadata server.
  • Block_size indicates the configured (predetermined) fragment size, whose length is fixed.
  • Stripe_id_owner indicates which client server writes data to the stripe, that is, the owner of the stripe. Its initial value is set to an illegal value, indicating that the stripe has not been allocated. It is stored in the metadata server.
  • Offset indicates the offset from the start position of the stripe; the minimum offset is 0 and the maximum is block_size*n (n being the number of data fragments in the EC mode).
  • Location is the write location information, indicating the last position in the stripe to which a data block has been written. For example, it can consist of the node_id of the node holding the last data block written in the current round, together with the offset, expressed as (node_id, offset).
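  • The sketch below gathers the fields just defined into one per-stripe record, assuming EC 4+2. The class, its method names, and the illegal-value sentinel are illustrative assumptions rather than structures given in the patent.

```python
from dataclasses import dataclass

BLOCK_SIZE = 512 * 1024   # block_size: fixed fragment size
N_DATA = 4                # number of data fragments per stripe
UNALLOCATED = -1          # "illegal value" marking an unallocated stripe

@dataclass
class StripeMeta:
    stripe_id: int
    node_ids: list[int]                  # <stripe_id, node_id> mapping
    stripe_id_owner: int = UNALLOCATED   # client_id of the owner, once allocated
    stripe_full: bool = False            # marked TRUE when the stripe fills
    location: tuple[int, int] = (1, 0)   # (node_id, offset) write location

    def allocate(self, client_id: int) -> None:
        assert self.stripe_id_owner == UNALLOCATED, "stripe already owned"
        self.stripe_id_owner = client_id

    def record_write(self, node_id: int, written: int) -> None:
        _, offset = self.location
        offset += written
        self.location = (node_id, offset)
        self.stripe_full = offset >= BLOCK_SIZE * N_DATA
```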
  • FIG. 4 shows a schematic flow chart of a data writing method 400 of an embodiment of the present application.
  • The scenario to which the method 400 applies includes a client server and a plurality of storage nodes.
  • the plurality of storage nodes may include a data node and a check node.
  • the method 400 can be applied to the scenario shown in FIG. 1, but the embodiment of the present application does not limit this.
  • the method 400 can be performed by a client server, such as the client server 101 of FIG.
  • the first data is data to be written, such as files, objects, and the like that need to be written.
  • After receiving the first data, the client server prepares to write the first data to the corresponding stripe (the target stripe).
  • 420: Obtain the write location information of the target stripe, where the write location information indicates the positions, in the target stripe, of the data blocks already written to it; the target stripe includes multiple fragments, each fragment corresponds to a storage node, the storage nodes communicate with the client server, and the target stripe is not full.
  • "The stripe is not full" means that data has been written in the stripe, but part of the stripe's space is still idle (stores no data); that is, some data has been written in the stripe, but the stripe's space is not completely filled.
  • The write location information indicates the positions within the stripe at which data blocks have been written; that is, it describes how data has been written in the stripe. When the stripe is not full, the write location information indicates how far the data in the stripe has been written.
  • Optionally, the write location information indicates the last position in the stripe to which a data block has been written, for example, up to which node (fragment) the data has been written, the size of the data already written, and so on.
  • the write location information may include node information where the last data block that has been written in the stripe is located and a size of the data that has been written in the stripe.
  • For example, the write location information may be (2, 768 KB), where 2 indicates that the last written data block is on data node 2 (2 being the node_id of data node 2), and 768 KB indicates that the size of the written data in the stripe is 768 KB; the latter can also be called the offset, that is, the offset of the written data relative to the start position of the stripe.
  • Since the node number corresponds to the fragment number, the write location information may equivalently include the offset of the written data blocks in the target stripe together with the node number of the last written data block. The write location information may also be represented in other forms, which is not limited in this embodiment of the present application.
  • Alternatively, the above offset may instead record the size of the data written on the last written node. Since the fragment size is fixed, the total size of the written data in the stripe can still be derived. For example, in this form, after the second data block (256 KB) in FIG. 5 is written to data node 2, the write location information can be expressed as (2, 256 KB); since the data block written to data node 1 is a fixed 512 KB (the fragment size), the total written size of 768 KB can still be obtained from the write location information.
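  • The two encodings are interchangeable, as the small sketch below shows for the numbers in this example (fragment size 512 KB, node_ids starting at 1); the helper name is illustrative.

```python
BLOCK_SIZE = 512 * 1024

def total_offset(node_id: int, bytes_on_last_node: int) -> int:
    """Convert (node_id, bytes written on that node) into the offset
    relative to the start of the stripe."""
    return (node_id - 1) * BLOCK_SIZE + bytes_on_last_node

# the per-node form (2, 256 KB) corresponds to a stripe offset of 768 KB
assert total_offset(2, 256 * 1024) == 768 * 1024
```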
  • the write location information may be saved in the memory of the client server, or may be saved in other nodes, for example, may be saved in the metadata server.
  • the client server may directly obtain the write location information saved in the memory.
  • the client server may acquire the write location information from the metadata server.
  • Next, one or more data blocks of the first data are acquired, where each data block corresponds to free space in one of the plurality of fragments, and the one or more data blocks of the first data are generated either based on the write location information or based on the fragment size. The data blocks of the first data are produced by a split operation so that they can be written into the corresponding data fragments.
  • The application server may perform the split operation; where no separate application server is deployed, for example when the application server is integrated with the client server, the client server performs the split operation itself.
  • the first data is segmented by the client server to generate one or more data blocks of the first data.
  • Alternatively, the client server may send the write location information and the first data to the application server, and then acquire the one or more data blocks of the first data from the application server.
  • Generating the one or more data blocks of the first data based on the write location information may specifically include one of the following situations:
  • If the write position is at the start of a fragment, the first data is split according to the fragment size (for example, the 512 KB described above) to generate the one or more data blocks of the first data, wherein if the size of the first data is smaller than the fragment size, the entire first data becomes one data block.
  • If the write position is inside a partially written fragment, a data block is first cut from the first data according to the size of the unwritten portion of that specific fragment, so that the cut block fills the specific fragment together with the data block already in it; the remaining portion of the first data is then split according to the fragment size, each block cut from the remaining portion corresponding to a blank fragment. If the size of the first data is smaller than the unwritten portion of the specific fragment, the entire first data becomes one data block.
  • In other words, when the current round of writing starts at the beginning of a fragment, the first data is simply split at the fragment size, with any remainder smaller than one fragment forming the last data block. When the current round does not start at the beginning of a fragment, a block is first cut so that it pads the partially written fragment up to the fragment size, and the rest is then split at the fragment size; a sketch follows below.
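  • The following sketch implements that write-location-based segmentation under the 512 KB fragment-size assumption; `offset` is the stripe-relative offset from the write location information, and the function name is illustrative.

```python
BLOCK_SIZE = 512 * 1024

def split_by_location(data: bytes, offset: int) -> list[bytes]:
    """Split `data` into blocks to be appended at `offset` in the stripe."""
    blocks = []
    gap = (-offset) % BLOCK_SIZE    # unwritten bytes left in the current fragment
    if gap:                         # writing resumes mid-fragment:
        blocks.append(data[:gap])   # the first block tops up that fragment
        data = data[gap:]
    # the rest is cut at the fragment size; the final block may be short
    blocks += [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [b for b in blocks if b]

# FIG. 5 example: 768 KB written into a fresh stripe (offset 0) yields one
# full 512 KB block plus one 256 KB block, giving location (2, 768 KB);
# writing another 768 KB at offset 768 KB first cuts a 256 KB block to
# fill fragment 2, then one full 512 KB block.
assert [len(b) for b in split_by_location(b"x" * 768 * 1024, 0)] == [512 * 1024, 256 * 1024]
assert [len(b) for b in split_by_location(b"y" * 768 * 1024, 768 * 1024)] == [256 * 1024, 512 * 1024]
```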
  • In this way, no padding is written for the missing data portion, which saves hard disk space compared with the prior art and reduces the total number of stripes required for the data.
  • Alternatively, the one or more data blocks of the first data are generated based on the fragment size. Specifically, the first data is split according to the fragment size to generate the one or more data blocks of the first data, wherein if the size of the first data is smaller than the fragment size, the entire first data becomes one data block. Each data block of the first data then corresponds to a blank fragment; accordingly, the writing process of the data to be written in this round starts from the start position of the next blank fragment.
  • A selection may be made between the two segmentation modes: when the first data satisfies a first condition, the one or more data blocks of the first data are generated based on the write location information; or, when the first data satisfies a second condition, the one or more data blocks of the first data are generated based on the fragment size.
  • For example, when the size of the first data is less than a predetermined threshold (e.g., the fragment size is used as the threshold), the first segmentation mode is used, that is, the one or more data blocks of the first data are generated based on the write location information; when the size of the first data is not less than the predetermined threshold, the second segmentation mode is used, that is, the one or more data blocks of the first data are generated based on the fragment size.
  • Alternatively, the segmentation mode may be selected according to the file type of the first data: for example, if the first data is a log file, the first segmentation mode is used; if the first data is a video file, the second segmentation mode is used. The segmentation mode may also be selected according to the QoS of the first data: for example, if the first data requires low latency, the first segmentation mode is used; if its latency requirement is less strict, the second segmentation mode is used.
  • After obtaining the data blocks of the first data in the foregoing step, the client server next writes the data blocks of the first data into the corresponding data fragments. Since a fragment corresponds to storage space on a storage node, the process of writing data blocks into fragments can be regarded as a process in which the data is stored to the storage nodes.
  • The client server sends a write request to the storage node of the fragment corresponding to each data block; the write request may carry the data block to be written and information such as the numbers of the stripe and the node (or fragment). After receiving the write request, the storage node stores the data block into the corresponding fragment.
  • The process in which the client server sends a write request to the storage node and the storage node stores the data according to the write request may be referred to as the client server writing data, that is, the client server writing the block to the node.
  • Specifically, the client server may write the data blocks of the first data to the corresponding data nodes in append-only fashion: based on the write location information, the last position of the written data blocks in the stripe is determined, and the client server continues writing data blocks from that position in append mode.
  • When an unused stripe is written for the first time, data is written from the beginning of the first fragment, so the data to be written can be split directly according to the fragment size, without considering any existing data in the stripe. For example, if second data is written into an unused stripe, the corresponding process may be: the client server receives the second data; obtains an unused stripe as the target stripe, the size of the second data being smaller than the size of the target stripe; acquires at least one data block of the second data, where the data blocks of the second data are generated based on the fragment size; determines the fragments corresponding to the data blocks of the second data; writes the data blocks of the second data into the corresponding fragments; and records the positions in the target stripe to which the data blocks of the second data were written as the write location information.
  • In the example of FIG. 5, data to be written (the second data) is written into an unused stripe; the fragment size is 512 KB, and the size of the data to be written is 768 KB. The writing process starts from the beginning of the first data fragment, so the data to be written is split at the fragment size of 512 KB to obtain a 512 KB data block 501 and a 256 KB data block 502. The client server writes the data from the beginning of the stripe, that is, data block 501 is written into the first fragment of the stripe and data block 502 into the second fragment, and the write location information (2, 768 KB) is recorded.
  • The check blocks may be stored to the nodes each time data is stored to the nodes; alternatively, while the stripe is not full, the check blocks may be withheld and stored to the nodes only after the stripe is full, as explained below.
  • In the first case, check blocks are written each time data blocks are stored: the client server also obtains the check blocks of the stripe and writes the check blocks of the stripe into the check nodes corresponding to the stripe.
  • Specifically, a new check block may be generated from the data blocks newly generated in the current round and the check block generated in the previous round for the stripe. For example, for a stripe, the newly generated data blocks of this round are XORed with the check block generated in the previous round (during the XOR, the shorter operand is padded with 0s), and the result of the XOR operation is used as the new check block.
  • Alternatively, the check block is generated from the data blocks of the current round alone. For example, as in the prior art, the data generated in the current round is zero-padded and the check block is then generated according to the EC algorithm; the difference from the prior art is that the zero padding serves only the check-block calculation and is never sent to the nodes for storage.
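  • Below is a hedged sketch of the incremental check-block update described first: each round's data blocks are XORed into the previous round's check block, padding the shorter operand with zeros. Plain XOR stands in for the real EC algorithm, and block alignment within the fragment is ignored for simplicity.

```python
BLOCK_SIZE = 512 * 1024

def xor_into(parity: bytes, block: bytes) -> bytes:
    """XOR `block` into `parity`, treating missing bytes as 0."""
    p = int.from_bytes(parity, "big")
    b = int.from_bytes(block.ljust(BLOCK_SIZE, b"\x00"), "big")
    return (p ^ b).to_bytes(BLOCK_SIZE, "big")

def update_check_block(prev_check: bytes, new_blocks: list[bytes]) -> bytes:
    check = prev_check
    for blk in new_blocks:
        check = xor_into(check, blk)
    return check

# Round 1 writes blocks 501/502, round 2 writes 601/602; updating the
# round-1 check block with the round-2 data gives the same result as
# computing the check block over all four blocks at once.
b501, b502 = b"\x11" * BLOCK_SIZE, b"\x22" * (256 * 1024)
b601, b602 = b"\x33" * (256 * 1024), b"\x44" * BLOCK_SIZE
p1 = update_check_block(bytes(BLOCK_SIZE), [b501, b502])
p2 = update_check_block(p1, [b601, b602])
assert p2 == update_check_block(bytes(BLOCK_SIZE), [b501, b502, b601, b602])
```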
  • That is, the target stripe already has a check block for verifying the written data blocks, and the client server may further calculate, from the one or more data blocks and the check block for verifying the written data blocks, a check block covering both the written data blocks and the one or more data blocks, and store the calculated check block into the check fragment of the target stripe.
  • In the case where the client server performs the split operation, the client server generates the new check block from the data blocks of the first data and the check block already written into the stripe. In the case where the application server performs the split operation, the application server generates the new check block from the data blocks of the first data and the check block already written into the stripe, and sends it to the client server.
  • In FIG. 5, the check block 503 and the check block 504 may be obtained by performing an exclusive-OR operation on the 512 KB data block 501 and the 256 KB data block 502.
  • In FIG. 6, an XOR operation over the newly generated 256 KB data block 601 and 512 KB data block 602, together with the check blocks 503 and 504 generated when data was last stored, yields two new check blocks 603 and 604.
  • After obtaining the data blocks 601 and 602 and the check blocks 603 and 604 of the first data, the client server writes the data blocks and check blocks of the first data into the corresponding data nodes and check nodes, respectively.
  • the metadata server may store the mapping relationship between the stripe and the node, and send the stripe and the information of the data node and the check node corresponding to the stripe to the client server.
  • the metadata server can store the mapping relationship between the stripe_id of the stripe and the node_id of the node, and which client server writes the stripe (stripe_id_owner).
  • the client server queries the metadata server for the stripe_id of the stripe to be written and the mapping relationship between the stripe and the node.
  • the metadata server may select a stripe whose value of stripe_id_owner is an illegal value and assign it to the client server for data writing, and update the value of stripe_id_owner to the client_id number of the client server.
  • the initial value of stripe_id_owner is an illegal value, indicating that the stripe is not allocated.
  • the metadata server sends the stripe_id of the allocated stripe and the corresponding node_id to the client server. Based on this, the client server can know the assigned stripe and the data node and check node corresponding to the stripe.
  • At this time the write location information is (1, 0 KB), or the write location information is empty, or no write location information exists yet; the client server therefore writes data from the start position of the stripe, that is, data block 501 is written to data node 1 corresponding to the stripe, and data block 502 is written to data node 2 corresponding to the stripe.
  • Specifically, the client server can send a data write request containing data block 501 and carrying the parameters (stripe_id 5, node_id 1, offset 0 KB, length 512 KB), where stripe_id 5 indicates that the data write request targets stripe 5, node_id 1 indicates data node 1 of the stripe, offset 0 KB indicates that the write position is offset 0 KB from the start position of the stripe (that is, writing starts at the beginning of the stripe), and length 512 KB indicates that the size of the written data is 512 KB.
  • data node 1 persists data block 501 to the hard disk and then returns a successful write response to the client server.
  • the client server can send a data write request, including data partition 502, and carry parameters (stripe_id 5, node_id 2, offset 512 KB, length 256 KB).
  • data node 2 persists data block 502 to the hard disk and returns a successful write response to the client server.
  • The client server sends check block 503 and check block 504 to check node 1 and check node 2, respectively; check node 1 and check node 2 persist check block 503 and check block 504 to their hard disks, respectively, and return successful write responses to the client server.
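  • The write requests from this example can be pictured as in the sketch below; the WriteRequest type and its field layout are illustrative assumptions built from the parameters named in the text.

```python
from dataclasses import dataclass

KB = 1024

@dataclass
class WriteRequest:
    stripe_id: int
    node_id: int
    offset: int      # bytes from the start position of the stripe
    length: int      # size of the payload in bytes
    payload: bytes

block_501 = b"a" * 512 * KB
block_502 = b"b" * 256 * KB

requests = [
    WriteRequest(stripe_id=5, node_id=1, offset=0, length=512 * KB, payload=block_501),
    WriteRequest(stripe_id=5, node_id=2, offset=512 * KB, length=256 * KB, payload=block_502),
]
# Each request goes to the storage node owning the target fragment; the node
# persists the payload and acknowledges, after which the client server can
# record the write location information (2, 768 KB).
```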
  • Persisting a block to the hard disk means saving the block on the hard disk; the process of saving a block is essentially the process of writing that block.
  • In the case where the write location information is saved in the memory of the client server, the client server updates it after the data blocks of the first data are written into the corresponding data nodes. The successful write response returned by a data node may include the last position to which that node has written a data block, or the data node may send that position to the client server alongside the successful write response. The client server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and then update the write location information saved in its memory. In this example, the client server determines that the last written position returned by data node 2 is the last position of the written data blocks in the stripe, and updates the write location information saved in its memory to (2, 768 KB).
  • Alternatively, the metadata server updates the write location information. In this case, each data node may also send the metadata server the last position to which it has written a data block; the metadata server compares the last written positions of the data nodes to determine the last position of the written data blocks in the stripe, and then updates the write location information it holds. In this example, based on the last written positions returned by data nodes 1 and 2, the metadata server determines that the position returned by data node 2 is the last position in the stripe to which a data block has been written, and updates the write location information it holds to (2, 768 KB).
  • In FIG. 6, data blocks 601 and 602 are newly generated, and check blocks 603 and 604 are generated based on the write location information. The client server determines from the write location information (2, 768 KB) where to resume writing; it therefore sends data block 601 to data node 2 and data block 602 to data node 3.
  • Data node 2 and data node 3 respectively persist data chunk 601 and data chunk 602 onto the hard disk and return a successful write response to the client server.
  • the data chunk of the new IO request may continue to be written in an append-only/append manner.
  • the starting position of the subsequent write data block is the last position of the previously successfully written data block.
  • On the one hand, this data block writing method uses hard disk space as efficiently as possible, avoiding idle space before and after the data; on the other hand, it better suits flash-type storage media such as solid state drives (SSD) and storage class memory (SCM), improving read/write performance, balancing cell wear, and extending media life.
  • The client server sends check block 603 and check block 604 to check node 1 and check node 2, respectively; check node 1 and check node 2 persist check block 603 and check block 604 to their own hard disks, respectively, and return successful write responses to the client server.
  • At this point, the last position of the written data blocks in the stripe is the end of data node 3 (equivalently, the start of data node 4), so the write location information can be updated to (3, 1536 KB) or (4, 1536 KB).
  • the above process can be executed cyclically.
  • When the stripe is full, it can be marked as full in the metadata server, thereby completing one full-stripe data writing process.
  • Specifically, if the client server maintains the write location information, an indication that the stripe is full may be sent to the metadata server, and the metadata server marks the stripe as full according to the indication, for example by marking stripe_full as TRUE. If the metadata server maintains the write location information, then after updating the write location information and determining that the stripe is full, the metadata server itself marks the stripe as full (stripe_full = TRUE). The metadata server can also send the client server an indication that the stripe is full, so that the client server no longer writes new data to the filled stripe and continues with another stripe.
  • In another embodiment, while the stripe is not full the check blocks are not stored to the nodes; instead, the check block obtained in each round is kept in the cache, and the final check block is stored to the node once the stripe is full. In this case the cache of the client server holds a check block for verifying the data blocks already written to the nodes. The client server may further calculate, from (1) the one or more data blocks and (2) the cached check block for verifying the written data blocks, a check block covering both the written data blocks and the one or more data blocks, and store the calculated check block in the cache of the client server. After the target stripe is full, the check blocks corresponding to the target stripe in the client server's cache are stored into the check fragments of the target stripe. Except that the check block is kept in the cache, the processing is similar to the previous embodiment and, for brevity, is not detailed again. Alternatively, the check block may not be calculated or stored at all while the stripe is not full, and be calculated and stored only when the stripe is full.
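  • A sketch of this cached-parity variant follows: parity is refreshed in client-side cache each round and flushed to the check node only when the stripe fills. XOR again stands in for the real EC computation, and the class and callback names are illustrative.

```python
BLOCK_SIZE = 512 * 1024
N_DATA = 4

class CachedParityWriter:
    def __init__(self, store_block, store_check):
        self.store_block = store_block    # callback: persist a data block on a node
        self.store_check = store_check    # callback: persist the final check block
        self.cached_check = bytes(BLOCK_SIZE)
        self.written = 0                  # bytes written into the stripe so far

    def write_round(self, blocks: list[bytes]) -> None:
        for blk in blocks:
            self.store_block(blk)         # data blocks go to the nodes immediately
            padded = int.from_bytes(blk.ljust(BLOCK_SIZE, b"\x00"), "big")
            acc = int.from_bytes(self.cached_check, "big") ^ padded
            self.cached_check = acc.to_bytes(BLOCK_SIZE, "big")
            self.written += len(blk)
        if self.written >= BLOCK_SIZE * N_DATA:    # stripe full:
            self.store_check(self.cached_check)    # flush the parity once
```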
  • In another embodiment, the storage nodes include check nodes for storing check blocks. In addition to sending the write request, the client server also sends the one or more data blocks to one or more check nodes for backup; after all data fragments of the target stripe are full of data blocks, the one or more check nodes are instructed to generate check blocks from all the backed-up data blocks of the target stripe and to store the generated check blocks into the check fragments of the target stripe.
  • That is, the client server not only writes the data blocks of the first data into the data nodes, but also sends the data blocks of the first data to the check nodes corresponding to the stripe for backup; when the stripe is full, indication information is sent to the check nodes, instructing the check nodes to generate and store the check blocks of the stripe from all the data blocks of the stripe.
  • Specifically, after writing the data blocks of the first data to the data nodes, the client server further sends the data blocks of the first data to the check nodes corresponding to the stripe, and the check nodes cache the data blocks. The client server can continue to store the next round of data in a similar manner (e.g., the data to be written carried by the next write IO request from the host). When the stripe becomes full, the client server sends the check nodes the indication that the stripe is full; after receiving it, each check node generates and stores the check block of the stripe from all the cached data blocks of the stripe. The check node can then delete all the cached data blocks of the stripe and return a response to the client server indicating that the check block was written successfully.
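  • The check-node side of this variant might look like the sketch below: every backed-up block is cached per stripe, and the check block is computed only on the stripe-full indication. The method names and the XOR parity are assumptions.

```python
BLOCK_SIZE = 512 * 1024

class CheckNode:
    def __init__(self):
        self.backup: dict[int, list[bytes]] = {}   # stripe_id -> cached blocks

    def backup_block(self, stripe_id: int, block: bytes) -> None:
        """Cache a data block sent by the client server for backup."""
        self.backup.setdefault(stripe_id, []).append(block)

    def on_stripe_full(self, stripe_id: int) -> bytes:
        """Generate and return the check block, then drop the cached blocks."""
        acc = 0
        for blk in self.backup.pop(stripe_id):
            acc ^= int.from_bytes(blk.ljust(BLOCK_SIZE, b"\x00"), "big")
        check_block = acc.to_bytes(BLOCK_SIZE, "big")
        # a real node would persist check_block to its hard disk here and
        # acknowledge the client server
        return check_block
```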
  • How the client server determines whether the stripe is full is similar to the previous embodiment. For example, if the client server updates the write location information, it may determine whether the stripe is full after updating the write location information; if the metadata server updates the write location information, then after updating it and determining that the stripe is full, the metadata server sends the client server an indication that the stripe is full, and the client server may determine whether the stripe is full based on whether that indication has been received.
  • In the technical solution of the embodiments of the present application, data blocks are written based on the write location information, so no zero-padding operation is required when the data does not satisfy the full-stripe condition. This saves hard disk space and reduces the total number of stripes required for the data, and as a result: (1) stripe management becomes simpler and faster, reducing the complexity of stripe management; (2) stripe lookup becomes faster, since when a stripe must be found, the required stripe is located sooner; and (3) fewer stripes are involved in fault recovery, which speeds up recovery from faults. For example, when a hard disk fails, data must be recovered for all stripes involving the faulty disk; when the total number of stripes is reduced, fewer stripes need to be restored and the recovery time is shortened.
  • FIG. 7 is a schematic diagram showing a data writing manner of an embodiment of the present application.
  • T1-T4 represents the time axis. Taking EC 4+2 mode as an example, the fragment size is 512 KB. N1-N4 represents the node where the data fragment is located, and P and Q represent the nodes where the verification fragment is located.
  • the case where the write location information is stored in the metadata server (not shown) and the write location information is stored in the client server is similar, and will not be described again for the sake of brevity.
  • At T1, data block 1 needs to be written. Data blocks 2, 3, and 4 are substituted with all-'0' blocks, parity blocks p1 and q1 are calculated from data blocks 1, 2, 3, and 4, and the routing information addresses the specified nodes (for example, by DHT). Data block 1 and check blocks p1 and q1 are persisted to the hard disks (data blocks 2, 3, and 4 need not be persistently stored), and the last position of data block writing at this time is updated to the metadata server.
  • At T2, data block 2 needs to be written. The position at which to write data block 2 is obtained by querying the metadata server; check blocks p2 and q2 are calculated from data block 2 and check blocks p1 and q1; data block 2 and check blocks p2 and q2 are persisted to the hard disks; the last position of data block writing at this time is updated to the metadata server; and p1 and q1 are then deleted.
  • At T3, data block 3 needs to be written. The position at which to write data block 3 is obtained by querying the metadata server; check blocks p3 and q3 are calculated from data block 3 and check blocks p2 and q2; data block 3 and check blocks p3 and q3 are persisted to the hard disks; the last position of data block writing at this time is updated to the metadata server; and p2 and q2 are then deleted.
  • At T4, data block 4 needs to be written. The final check blocks p and q are calculated from data block 4 and check blocks p3 and q3 and persisted; p3 and q3 are then deleted, and the stripe is marked as completed. One data writing process thus ends.
  • Data blocks 1, 2, 3, and 4 can each come from different write data requests from the host.
  • the subsequent data writing process can repeat the steps of T1-T4.
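  • The loop below condenses the T1-T4 timeline under the same assumptions as the earlier sketches: each round queries the write location, appends one block, derives the new parity from the old one, and updates the metadata. XOR stands in for EC, only one of the two check blocks is tracked (q is maintained the same way), and all names are illustrative.

```python
BLOCK_SIZE = 512 * 1024
N_DATA = 4

def xor(a: bytes, b: bytes) -> bytes:
    return (int.from_bytes(a, "big") ^
            int.from_bytes(b, "big")).to_bytes(BLOCK_SIZE, "big")

meta = {"location": 0, "stripe_full": False}   # held by the metadata server
parity = bytes(BLOCK_SIZE)                     # p1, p2, ... replaced each round

for label in (b"1", b"2", b"3", b"4"):         # rounds T1..T4
    block = label * BLOCK_SIZE
    offset = meta["location"]                  # query the metadata server
    parity = xor(parity, block)                # p(t) from p(t-1) and the block
    # persist `block` at `offset` and the new parity, then delete the old
    # parity (p1/q1 after T2, p2/q2 after T3, ...)
    meta["location"] = offset + len(block)
    meta["stripe_full"] = meta["location"] >= N_DATA * BLOCK_SIZE

assert meta["stripe_full"]                     # the stripe is marked complete
```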
  • FIG. 8 shows a flow chart of a data writing method of one embodiment of the present application.
  • the storage system includes a client server, an application server, a metadata server, a data node, and a check node.
  • the data node and the check node can store data chunks and check chunks in accordance with the EC mode.
  • The number of data nodes and check nodes corresponding to the stripe in FIG. 8 is determined by the adopted EC mode. For example, in EC 4+2 mode, one stripe corresponds to four data nodes and two check nodes.
  • the write location information is saved in the memory of the client server.
  • The following steps 801-813 describe how the client server persists the first data, carried by the first IO request, to the nodes; the following steps 814-825 describe how the client server persists the data carried by the second IO request to the nodes.
  • the IO request is, for example, a write IO request received from a host or other device, or a write IO request generated by the client server itself.
  • the client server sends a data write request to the application server.
  • When the client server needs to store data (the first data), it sends a data write request to the application server; the data write request includes the data to be written, for example, files or objects that need to be written.
  • the application server performs a segmentation operation.
  • After receiving the data write request from the client server, the application server splits the data to be written according to the configured EC redundancy ratio and fragment size, and calculates the check blocks. Since no write location information exists at this time, the application server can split purely according to the configured fragment size.
  • the application server returns a data block and a check block to the client server.
  • the client server queries the metadata server for information about the stripe written by the data.
  • the client server queries the metadata server for the stripe_id of the stripe to be written and the mapping relationship between the stripe and the node.
  • the metadata server allocates strips.
  • the metadata server can randomly select a stripe whose stripe_id_owner is an illegal value, allocate it to the client server for data writing, and update the value of stripe_id_owner to the client server's number client_id.
  • stripe_id_owner being an illegal value means that this stripe is a blank stripe to which no data block has yet been stored, so there is no write location information (or the value of the write location information is 0, or is an illegal value).
  • the metadata server sends the allocated stripe and the information of the corresponding node to the client server.
  • the metadata server can send the stripe_id of the allocated stripe and the corresponding node_id to the client server.
  • the client server writes data blocks.
  • since there is no write location information at this time, the client server writes data starting from the beginning of the allocated stripe, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the corresponding data node that receives the data blocks persists them to the hard disk.
  • the data node sends the client server a write-success response and the last location to which it has written data blocks.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the response may include the last location to which the data node has written data blocks, or the data node may send that last location to the client server while returning the write-success response.
  • the client server writes a check block.
  • the client server sends the check block to the corresponding check node.
  • the check node persists the check block.
  • the check node persists the check block to the hard disk.
  • the check node sends a write-success response to the client server.
  • after the client server receives success responses for this round's data blocks and check blocks, this round of data is written successfully.
  • the client server records the write location information.
  • the client server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then record the write location information in its memory (if the stored write location was 0 or an illegal value, the "recording" of this step can equally be understood as "updating").
  • the client server sends a data write request to the application server.
  • when there is new data (the second data) to be stored, the client server sends a data write request to the application server; the data write request includes the data to be written and the write location information.
  • the write location information indicates the last location in the stripe to which the previous round's data blocks were written.
  • the application server generates data chunks.
  • after receiving the data write request from the client server, the application server generates the data blocks of the data to be written based on the write location information.
  • the application server reads the check block written in the previous round of the stripe.
  • the application server generates a check block.
  • the application server generates a new check block according to the newly generated data block of the current round and the check block written in the previous round of the stripe.
  • the application server returns a data block and a check block to the client server.
  • the client server writes data blocks.
  • based on the write location information, the client server starts writing data from the last location in the stripe to which the previous round's data blocks were written, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk; for example, new data blocks can continue to be written in append-only mode.
  • the data node sends the client server a write-success response and the last location to which it has written data blocks.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the response may include the last location to which the data node has written data blocks, or the data node may send that last location to the client server while returning the write-success response.
  • the client server writes a check block.
  • the client server sends the check block to the corresponding check node.
  • the check node persists the check block.
  • the check node persists the check block to the hard disk.
  • the check node sends a write-success response to the client server.
  • after the client server receives success responses for this round's data blocks and check blocks, this round of data is written successfully.
  • the client server updates the write location information.
  • the client server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information saved in its memory.
  • if the stripe is full, the client server sends the metadata server an indication that the stripe is full.
  • the metadata server marks the stripe as full according to the indication information, for example by setting the stripe-full flag to TRUE, thereby completing one full-stripe data writing process.
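The following sketch illustrates one way the client server of FIG. 8 might keep the write location (node_id, offset) in its own memory and update it from the data nodes' responses, as in the recording/updating steps above. The class and method names are assumptions for illustration, not taken from the document.

```python
# Sketch of client-side tracking of the write location in FIG. 8.

class StripeCursor:
    def __init__(self, stripe_id: int):
        self.stripe_id = stripe_id
        self.node_id = 0      # 0 / illegal value: blank stripe
        self.offset = 0       # bytes written relative to the stripe start

    def record(self, responses: list[tuple[int, int]]) -> None:
        """responses: (node_id, last_offset) pairs returned by the data nodes.

        The node reporting the largest offset holds the last written
        location; 'recording' degenerates to 'updating' once a value exists.
        """
        node_id, offset = max(responses, key=lambda r: r[1])
        self.node_id, self.offset = node_id, offset

cursor = StripeCursor(stripe_id=5)
cursor.record([(1, 512 * 1024), (2, 768 * 1024)])  # after FIG. 5's first round
print(cursor.node_id, cursor.offset)               # -> 2 786432
```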
  • FIG. 9 is a flow chart showing a data writing method of another embodiment of the present application.
  • the client server sends a data write request to the application server.
  • when the client server needs to store data, it sends a data write request to the application server; the data write request includes the data to be written, for example files or objects that need to be written.
  • the application server performs a segmentation operation.
  • after receiving the data write request from the client server, the application server splits the data to be written according to the configured EC redundancy ratio and fragment size, and calculates the check blocks. Since there is no write location information at this time, the application server can split the data into blocks according to the configured fragment size.
  • the application server returns a data block and a check block to the client server.
  • the client server queries the metadata server for the stripe information and the write location information for the data write.
  • the client server queries the metadata server for the stripe_id of the stripe to be written, the mapping relationship between the stripe and the nodes, and the write location information.
  • the metadata server allocates the stripe.
  • the metadata server can randomly select a stripe whose stripe_id_owner is an illegal value, allocate it to the client server for data writing, and update the value of stripe_id_owner to the client server's number client_id. Since this is a new allocation, there is no write location information; alternatively, the write location information can be the starting position of the stripe.
  • the metadata server sends the allocated stripe and the information of the corresponding node to the client server.
  • the metadata server can send the stripe_id of the allocated stripe and the corresponding node_id to the client server.
  • in addition, if there is write location information, the metadata server also sends the write location information to the client server.
  • the client server writes data chunks.
  • for a newly allocated stripe, the client server writes data starting from the beginning of the stripe, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk.
  • the data node sends a response to the successful write to the client server.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the data node sends the metadata server the last location to which it has written data blocks.
  • the metadata server updates the write location information.
  • the metadata server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information.
  • the client server writes a check block.
  • the client server sends the check block to the corresponding check node.
  • the check node persists the check block.
  • the check node persists the check block to the hard disk.
  • the check node sends a write-success response to the client server.
  • after the client server receives success responses for this round's data blocks and check blocks, this round of data is written successfully.
  • the client server queries the metadata server for the write location information.
  • when there is new data to be stored, the client server first queries the metadata server for the write location information, that is, the last location in the stripe to which the previous round's data blocks were written.
  • the metadata server returns write location information to the client server.
  • the client server sends a data write request to the application server.
  • the client server sends a data write request to the application server, the data write request including data to be written and the write location information.
  • the application server generates data chunks.
  • after receiving the data write request from the client server, the application server generates the data blocks of the data to be written based on the write location information.
  • the application server reads the check block written in the previous round of the stripe.
  • the application server generates the check blocks.
  • the application server generates a new check block according to the newly generated data block of the current round and the check block written in the previous round of the stripe.
  • the application server returns a data block and a check block to the client server.
  • the client server writes data chunks.
  • based on the write location information, the client server starts writing data from the last location in the stripe to which the previous round's data blocks were written, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk; for example, new data blocks can continue to be written in append-only mode.
  • the data node sends a response to the successful write to the client server.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the data node sends to the metadata server the last location where the data node has written the data block.
  • the metadata server updates the write location information.
  • the metadata server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information.
  • the client server writes a check block.
  • the client server sends the check block to the corresponding check node.
  • the check node persists the check block.
  • the check node persists the check block to the hard disk.
  • the check node sends a write-success response to the client server.
  • after the client server receives success responses for this round's data blocks and check blocks, this round of data is written successfully.
  • the metadata server determines whether the stripe is full.
  • the metadata server can determine whether the stripe is full based on the latest write location information. If the stripe is not full, it informs the client server accordingly; if the stripe is full, it sends a stripe-full indication and marks the stripe as full, as described below.
  • the metadata server sends, to the client server, the indication that the stripe is not full.
  • the client server determines, according to the indication information, that the stripe is not full, and continues to repeat the per-round write process described above.
  • the metadata server sends, to the client server, the indication that the stripe is full.
  • the client server determines, according to the indication information, that the stripe is full, and continues by writing to a next stripe.
  • the metadata server sets the stripe-full flag to TRUE, thereby completing one full-stripe data writing process.
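A minimal sketch of the metadata server's bookkeeping in FIG. 9 follows: data nodes report the last location they have written, the server keeps the stripe-wide maximum, and the stripe is considered full once the offset reaches block_size × n. All names are illustrative, and the stripe-full test assumes the stripe-relative offset convention described earlier.

```python
# Sketch of the metadata server's role in FIG. 9 (hypothetical names).

BLOCK_SIZE = 512 * 1024
N_DATA = 4  # EC 4+2: four data fragments per stripe

class MetadataServer:
    def __init__(self):
        self.location = {}     # stripe_id -> (node_id, offset)
        self.stripe_full = {}  # stripe_id -> bool (default FALSE)

    def report(self, stripe_id: int, node_id: int, offset: int) -> None:
        # Keep the largest offset reported by any data node of the stripe.
        _, best = self.location.get(stripe_id, (0, 0))
        if offset > best:
            self.location[stripe_id] = (node_id, offset)

    def check_full(self, stripe_id: int) -> bool:
        # Full once the written offset spans all data fragments.
        _, offset = self.location.get(stripe_id, (0, 0))
        full = offset == BLOCK_SIZE * N_DATA
        if full:
            self.stripe_full[stripe_id] = True  # mark stripe_full = TRUE
        return full
```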
  • FIG. 10 is a flow chart showing a data writing method of another embodiment of the present application.
  • the client server sends a data write request to the application server.
  • when the client server needs to store data, it sends a data write request to the application server; the data write request includes the data to be written, for example files or objects that need to be written.
  • the application server performs a segmentation operation.
  • after receiving the data write request from the client server, the application server splits the data to be written according to the configured EC redundancy ratio and fragment size. Since there is no write location information at this time, the application server can split the data into blocks according to the configured fragment size.
  • the application server returns a data block to the client server.
  • the client server queries the metadata server for information about the stripe to which the data is to be written.
  • the client server queries the metadata server for the stripe_id of the stripe to be written and the mapping relationship between the stripe and the node.
  • the metadata server allocates strips.
  • the metadata server can randomly select a stripe whose stripe_id_owner is an illegal value, allocate it to the client server for data writing, and update the value of stripe_id_owner to the client server's number client_id.
  • the metadata server sends the allocated stripe and the information of the corresponding node to the client server.
  • the metadata server can send the stripe_id of the allocated stripe and the corresponding node_id to the client server.
  • the client server writes data chunks.
  • since there is no write location information at this time, the client server writes data starting from the beginning of the allocated stripe, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk.
  • the data node sends the client server a write-success response and the last location to which it has written data blocks.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the response may include the last location to which the data node has written data blocks, or the data node may send that last location to the client server while returning the write-success response.
  • the client server writes a data block to the check node.
  • the client server sends the data block to the check node corresponding to the stripe.
  • the check node caches the data block.
  • the check node caches the received data.
  • the check node sends a write-success response to the client server.
  • the client server updates the write location information.
  • the client server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information in its memory.
  • the client server sends a data write request to the application server.
  • when there is new data to be stored, the client server sends a data write request to the application server; the data write request includes the data to be written and the write location information.
  • the write location information indicates the last location in the stripe to which the previous round's data blocks were written.
  • the application server generates data chunks.
  • after receiving the data write request from the client server, the application server generates the data blocks of the data to be written based on the write location information.
  • the application server returns a data block to the client server.
  • the client server writes data chunks.
  • based on the write location information, the client server starts writing data from the last location in the stripe to which the previous round's data blocks were written, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk; for example, new data blocks can continue to be written in append-only mode.
  • the data node sends the client server a write-success response and the last location to which it has written data blocks.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the response may include the last location to which the data node has written data blocks, or the data node may send that last location to the client server while returning the write-success response.
  • the client server writes a data block to the check node.
  • the client server sends the data blocks to the check nodes corresponding to the stripe.
  • the check node caches the data block.
  • the check node caches the received data.
  • the check node sends a write-success response to the client server.
  • the client server updates the write location information.
  • the client server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information saved in its memory.
  • the client server sends an indication that the stripe is full to the check node.
  • the check node calculates and stores the check block.
  • the check node generates and stores the check blocks of the stripe according to all the cached data blocks of the stripe. The check node can then delete all the cached data blocks of the stripe.
  • the check node returns to the client server a response indicating that the check blocks were written successfully.
  • the client server sends an indication that the stripe is full to the metadata server.
  • the metadata server marks the stripe as full according to the indication information, for example by setting the stripe-full flag to TRUE, thereby completing one full-stripe data writing process.
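The sketch below models the check-node behaviour of FIG. 10: data blocks are only cached per stripe, and the check block is computed and persisted once the client signals that the stripe is full, after which the cache is dropped. Plain XOR stands in for the real check-block calculation, and the interface names are assumptions.

```python
# Sketch of the FIG. 10 check-node behaviour (illustrative names).

from functools import reduce

class CheckNode:
    def __init__(self):
        self.cache: dict[int, list[bytes]] = {}  # stripe_id -> cached data blocks
        self.disk: dict[int, bytes] = {}         # stripe_id -> persisted check block

    def on_data(self, stripe_id: int, block: bytes) -> str:
        # Backup only: no check block is calculated while the stripe fills.
        self.cache.setdefault(stripe_id, []).append(block)
        return "ok"

    def on_stripe_full(self, stripe_id: int) -> str:
        # Generate the check block from all cached blocks, persist, drop cache.
        blocks = self.cache.pop(stripe_id)
        size = max(len(b) for b in blocks)
        self.disk[stripe_id] = reduce(
            lambda p, b: bytes(x ^ y for x, y in zip(p, b.ljust(size, b"\x00"))),
            blocks, bytes(size))
        return "check block persisted"
```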
  • FIG. 11 is a flow chart showing a data writing method of another embodiment of the present application.
  • the client server sends a data write request to the application server.
  • when the client server needs to store data, it sends a data write request to the application server; the data write request includes the data to be written, for example files or objects that need to be written.
  • the application server performs a segmentation operation.
  • after receiving the data write request from the client server, the application server splits the data to be written according to the configured EC redundancy ratio and fragment size. Since there is no write location information at this time, the application server can split the data into blocks according to the configured fragment size.
  • the application server returns a data block to the client server.
  • the client server queries the metadata server for the stripe information and the write location information for the data write.
  • the client server queries the metadata server for the stripe_id of the stripe to be written, the mapping relationship between the stripe and the nodes, and the write location information.
  • the metadata server allocates strips.
  • the metadata server can randomly select a stripe whose stripe_id_owner is an illegal value, allocate it to the client server for data writing, and update the value of stripe_id_owner to the client server's number client_id. Since this is a new allocation, there is no write location information; alternatively, the write location information can be the starting position of the stripe.
  • the metadata server sends the allocated stripe and the information of the corresponding node to the client server.
  • the metadata server can send the stripe_id of the allocated stripe and the corresponding node_id to the client server.
  • in addition, if there is write location information, the metadata server also sends the write location information to the client server.
  • the client server writes data blocks.
  • for a newly allocated stripe, the client server writes data starting from the beginning of the stripe, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk.
  • the data node sends a response to the successful write to the client server.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the data node sends to the metadata server the last location where the data node has written the data block.
  • the metadata server updates the write location information.
  • the metadata server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information.
  • the client server writes a data block to the check node.
  • the client server sends the data block to the check node corresponding to the stripe.
  • the check node caches the data block.
  • the check node caches the received data.
  • the check node sends a response to the successful write to the client server.
  • the client server queries the metadata server for the write location information.
  • when there is new data to be stored, the client server first queries the metadata server for the write location information, that is, the last location in the stripe to which the previous round's data blocks were written.
  • the metadata server returns write location information to the client server.
  • the client server sends a data write request to the application server.
  • the client server sends a data write request to the application server, the data write request including data to be written and the write location information.
  • the application server generates data chunks.
  • after receiving the data write request from the client server, the application server generates the data blocks of the data to be written based on the write location information.
  • the application server returns a data block to the client server.
  • the client server writes a data block.
  • based on the write location information, the client server starts writing data from the last location in the stripe to which the previous round's data blocks were written, sending the data blocks to the corresponding data nodes.
  • the data node persists the data blocks.
  • the data node that receives the data blocks persists them to the hard disk; for example, new data blocks can continue to be written in append-only mode.
  • the data node sends a response to the successful write to the client server.
  • after the data node has successfully persisted the data blocks sent by the client server, it returns a write-success response to the client server.
  • the data node sends to the metadata server the last location where the data node has written the data block.
  • the metadata server updates the write location information.
  • the metadata server can compare the last locations to which the data nodes have written data blocks to determine the last location of the written data blocks in the stripe, and then update the write location information.
  • the client server writes a data block to the check node.
  • the client server sends the data blocks to the check nodes corresponding to the stripe.
  • the check node caches the data block.
  • the check node caches the received data.
  • the check node sends a write-success response to the client server.
  • the metadata server determines whether the stripe is full.
  • the metadata server can determine whether the stripe is full based on the latest write location information, and then informs the client server of the result, as described below.
  • the metadata server sends, to the client server, the indication that the stripe is not full.
  • the client server determines, according to the indication information, that the stripe is not full, and continues to repeat the per-round write process described above.
  • the metadata server sends, to the client server, the indication that the stripe is full.
  • the client server sends, to the check node, the indication that the stripe is full.
  • the check node calculates and stores the check block.
  • the check node generates and stores the check blocks of the stripe according to all the cached data blocks of the stripe. The check node can then delete all the cached data blocks of the stripe.
  • the check node returns to the client server a response indicating that the check blocks were written successfully.
  • the metadata server marks this stripe as full.
  • the metadata server sets the stripe-full flag to TRUE, thereby completing one full-stripe data writing process.
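Tying the FIG. 11 pieces together, the sketch below shows one write round from the client server's perspective, reusing the stub interfaces assumed in the earlier sketches (on_data, on_stripe_full, check_full). It is an illustrative composition under those assumptions, not a prescribed flow.

```python
# One write round of the FIG. 11 style flow (illustrative stubs as parameters).

def write_round(meta, data_nodes, check_nodes, stripe_id: int, blocks):
    """blocks: list of (node_id, data_block) pairs already split for this round."""
    for nid, block in blocks:
        data_nodes[nid].persist(stripe_id, block)  # data nodes report locations to meta
    for cn in check_nodes:
        for _, block in blocks:
            cn.on_data(stripe_id, block)           # backup to check nodes, no parity yet
    if meta.check_full(stripe_id):                 # the metadata server decides
        for cn in check_nodes:
            cn.on_stripe_full(stripe_id)           # compute + persist the check blocks
        return "stripe full"
    return "continue"
```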
  • when data does not satisfy the full-stripe condition, the technical solution of the embodiments of the present application generates no '0'-padding data. This reduces the total number of stripes required by the data, reduces the complexity of stripe management, improves the speed of looking up stripes, and accelerates fault recovery; it also avoids the transmission and persistence of the '0' data, reduces write amplification on the network and the hard disks, and reduces the movement of invalid data, thereby improving the storage efficiency of the storage system.
  • the sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of the present application.
  • the client server in the embodiments of the present application may perform the various methods in the foregoing embodiments; for the specific working processes of the products described below, reference may be made to the corresponding processes in the foregoing method embodiments.
  • FIG. 12 shows a schematic block diagram of a client server 1200 of an embodiment of the present application.
  • the client server 1200 can include a receiving module 1210, an obtaining module 1220, and a writing module 1230.
  • the receiving module 1210 is configured to receive first data.
  • the obtaining module 1220 is configured to obtain write location information of a target stripe, where the write location information indicates the locations, in the target stripe, of data blocks already written to the target stripe, the target stripe includes a plurality of fragments, each fragment corresponds to a storage node, the storage nodes communicate with the client server, and the target stripe is not full; and to acquire one or more data blocks of first data, each data block corresponding to one fragment, among the plurality of fragments, that has free space, where the one or more data blocks of the first data are generated based on the write location information, or the one or more data blocks of the first data are generated based on the fragment size;
  • the writing module 1230 is configured to send write requests to the storage nodes of the fragments corresponding to the one or more data blocks of the first data, where the write requests are used to store the one or more data blocks of the first data into the corresponding fragments.
  • optionally, the write location information is used to indicate the last location in the target stripe to which a data block has been written;
  • the one or more data blocks of the first data are generated by one of the following methods:
  • when the last location of the written data blocks in the target stripe is the last location of a particular fragment, the first data is split according to the fragment size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the fragment size, the first data forms one data block;
  • when the last location of the written data blocks in the target stripe is not the last location of the particular fragment, one data block is first cut from the first data according to the size of the unwritten portion of the particular fragment, the cut-out data block corresponding to the unwritten portion of the particular fragment; the remaining portion of the first data is then split according to the fragment size, each data block cut from the remaining portion corresponding to one blank fragment; where, if the size of the first data is smaller than the size of the unwritten portion of the particular fragment, the first data forms one data block.
  • optionally, the one or more data blocks of the first data are generated based on the fragment size: the first data is split according to the fragment size, where, if the size of the first data is smaller than the fragment size, the first data forms one data block;
  • optionally, when the first data satisfies a first condition, the one or more data blocks of the first data are generated based on the write location information; or, when the first data satisfies a second condition, they are generated based on the fragment size.
  • the write location information includes: the offset, within the target stripe, of the written data blocks, and the node number of the written data blocks.
  • optionally, the target stripe has a check block for verifying the written data blocks;
  • the obtaining module 1220 is further configured to calculate, according to the one or more data blocks and the check block for verifying the written data blocks, a check block common to the written data blocks and the one or more data blocks;
  • the writing module 1230 is further configured to store the calculated check block into the check fragment of the target stripe.
  • optionally, the cache of the client server has a check block for verifying the written data blocks;
  • the obtaining module 1220 is further configured to calculate, according to the one or more data blocks and the check block for verifying the written data blocks, a check block common to the written data blocks and the one or more data blocks;
  • the writing module 1230 is further configured to store the calculated check block into the cache of the client server; and, when all data fragments of the target stripe are filled with data blocks, to store the check block in the cache of the client server corresponding to the target stripe into the check fragment of the target stripe.
  • optionally, the storage nodes include check nodes for storing check blocks;
  • the writing module 1230 is further configured to send the one or more data blocks to one or more check nodes for backup; and, when all data fragments of the target stripe are filled with data blocks, to instruct the one or more check nodes to generate check blocks according to all backed-up data blocks of the target stripe and to store the generated check blocks into the check fragments of the target stripe.
  • the obtaining module 1220 is specifically used in one of the following situations:
  • splitting the first data to generate the one or more data blocks of the first data; or
  • sending the write location information and the first data to an application server, and then obtaining the one or more data blocks of the first data from the application server.
  • the receiving module 1210 is further configured to receive second data.
  • the obtaining module 1220 is further configured to obtain an unused stripe as the target stripe, the size of the second data being smaller than the size of the target stripe; to acquire at least one data block of the second data, where the data blocks of the second data are generated based on the fragment size; and to determine the fragments corresponding to the data blocks of the second data;
  • the writing module 1230 is further configured to write the data blocks of the second data into the corresponding fragments, and to record the write locations of the data blocks of the second data in the target stripe as the write location information.
  • the write location information is saved in a memory of the client server; or the write location information is saved in a metadata server.
  • the metadata server is configured to store a mapping relationship between the stripe and the node, and send the stripe and the information of the data node and the check node corresponding to the stripe to the client server.
  • the obtaining module 1220 is further configured to obtain, from the metadata server, the stripe and the information of the data node and the check node corresponding to the stripe.
  • the metadata server is further configured to mark that the stripe is full when the stripe is full.
  • the client server 1200 in the embodiment of the present application may perform the corresponding processes in the foregoing method embodiments.
  • FIG. 13 is a schematic structural diagram of a client server provided by another embodiment of the present application.
  • the client server includes at least one processor 1302 (e.g., a CPU), at least one network interface 1305 or other communication interface, and a memory 1306; these components communicate with one another.
  • the processor 1302 is configured to execute executable modules, such as computer programs, stored in the memory 1306.
  • a communication connection with at least one other network element is achieved by at least one network interface 1305 (which may be wired or wireless).
  • the memory 1306 stores a program 13061
  • the processor 1302 executes the program 13061 for performing the methods in the various embodiments of the foregoing application.
  • the embodiment of the present application further provides a computer readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the methods in the foregoing various embodiments of the present application.
  • the embodiment of the present application further provides a system, which may include the client server and the plurality of nodes in the foregoing various embodiments.
  • the plurality of nodes may include a data node, a check node, and a metadata server.
  • the system can also include an application server.
  • the present application also provides a computer program product comprising instructions that, when run on a client server, cause the client server to implement the functions described above, for example to perform the steps performed by the client server in the various embodiments described above.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium (such as a Solid State Disk (SSD)) or the like.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical functional division; in actual implementation there may be other division manners, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solution of the present application, or the part of it that is essential or contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A data writing method, a client server and a system. The method includes: a client server receives first data and obtains write location information of a target stripe to which data blocks of other data have already been written, the write location information indicating the locations, in the target stripe, of the data blocks already written; according to those locations, data blocks of further data are appended to the target stripe, thereby reducing the total number of stripes required by the data.

Description

Data writing method, client server and system

Technical Field

This application relates to the field of information technology, and more specifically, to a data writing method, a client server and a system.

Background

In distributed storage system scenarios, storage nodes are widely distributed across multiple regions. With the rise of application services such as artificial intelligence, image storage and retrieval, social networking, and map navigation, the volume of data generated grows exponentially, placing higher demands on the processors that process the data and on the storage media that store it. To store massive amounts of data, enterprise users and data center infrastructures require large numbers of hard disks. Massive data also brings problems of data reliability.

To ensure high reliability of data reads and writes, erasure coding (EC) technology can be introduced into a distributed storage system. In EC technology, data is split to obtain data blocks (data fragments), check blocks (parity fragments) are then calculated, and the individual blocks are stored on different nodes.

Taking the EC 4+2 mode as an example, one stripe includes 4 data fragments and 2 check fragments, used to store data blocks and check blocks respectively. If the length of the data is sufficient for a full-stripe write operation (i.e., the data can be split into 4 complete blocks of the fragment size), the data to be written is first split, each block having a fixed size (equal to the fragment size), yielding 4 complete data blocks; the other 2 check blocks are then calculated by XOR operations; finally, the 6 blocks are written to the designated nodes respectively, completing the persistence of the blocks, and one full-stripe write operation is done. If the data does not satisfy the full-stripe condition, i.e., the data is not enough to be split into 4 complete blocks of the fragment size, a '0'-padding operation is used to make up 4 complete blocks, the check blocks of the 4 complete blocks are calculated, and all data blocks and check blocks are then written to the designated nodes. However, this '0'-padding approach brings ineffective hard disk space overhead and increases the total number of stripes required by the data.

Summary

This application provides a data writing method, a client server and a system, which can reduce the total number of stripes required by data.

According to a first aspect, a data writing method is provided, the method including:

a client server receives first data;

obtains write location information of a target stripe, the write location information indicating the locations, in the target stripe, of data blocks already written to the target stripe, where the target stripe includes a plurality of fragments, each fragment corresponds to one storage node, the storage nodes communicate with the client server, and the target stripe is not full;

the client server acquires one or more data blocks of the first data, each data block corresponding to one fragment, among the plurality of fragments, that has free space, where the one or more data blocks of the first data are generated based on the write location information, or the one or more data blocks of the first data are generated based on the fragment size;

the client server sends write requests to the storage nodes of the fragments corresponding to the one or more data blocks of the first data, the write requests being used to store the one or more data blocks of the first data into the corresponding fragments.

In the technical solution of the embodiments of this application, data blocks of data are written based on the write location information. When data does not satisfy the full-stripe condition (the stripe allocated to the data is not filled by the data), no '0'-padding operation is needed, which saves hard disk space and reduces the total number of stripes required by the data; correspondingly, because the total number of stripes is reduced: (1) the complexity of stripe management is lowered, (2) the speed of looking up stripes is improved, and (3) the number of stripes involved in fault recovery becomes smaller, accelerating fault recovery. By contrast, in the prior art, when data does not satisfy the full-stripe condition, a full stripe is made up by '0'-padding on the one hand; on the other hand, after the client server receives a new write request, it cannot write the new data into a stripe that is not yet full, causing stripes to be wasted.
In some possible implementations, the write location information is used to indicate the last location in the target stripe to which a data block has been written; and the one or more data blocks of the first data being generated based on the write location information specifically includes one of the following cases:

when the last location of the written data blocks in the target stripe is the last location of a particular fragment, the first data is split according to the fragment size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the fragment size, the first data forms one data block;

when the last location of the written data blocks in the target stripe is not the last location of the particular fragment, one data block is first cut from the first data according to the size of the unwritten portion of the particular fragment, the cut-out data block corresponding to the unwritten portion of the particular fragment, and the remaining portion of the first data is then split according to the fragment size, each data block cut from the remaining portion corresponding to one blank fragment; where, if the size of the first data is smaller than the size of the unwritten portion of the particular fragment, the first data forms one data block.

Generating the one or more data blocks of the first data based on the write location information allows the generated data blocks to correspond to the unwritten portion of the target stripe, so that the data blocks can be written into the unwritten portion of the target stripe, saving hard disk space and reducing the total number of stripes required by the data.

In some possible implementations, the one or more data blocks of the first data being generated based on the fragment size specifically includes: splitting the first data according to the fragment size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the fragment size, the first data forms one data block.

In some possible implementations, when the first data satisfies a first condition, the one or more data blocks of the first data are generated based on the write location information; or, when the first data satisfies a second condition, the one or more data blocks of the first data are generated based on the fragment size.

In some possible implementations, the write location information includes: the offset, within the target stripe, of the written data blocks, and the node number of the written data blocks.

In some possible implementations, the target stripe has a check block for verifying the written data blocks, and after the client server acquires the at least one data block of the first data, the method further includes:

calculating, according to the one or more data blocks and the check block for verifying the written data blocks, a check block common to the written data blocks and the one or more data blocks;

storing the calculated check block into the check fragment of the target stripe.

In some possible implementations, the cache of the client server holds a check block for verifying the written data blocks, and after the client server acquires the at least one data block of the first data, the method further includes:

calculating, according to the one or more data blocks and the check block for verifying the written data blocks, a check block common to the written data blocks and the one or more data blocks;

storing the calculated check block into the cache of the client server;

when all data fragments of the target stripe are filled with data blocks, storing the check block corresponding to the target stripe in the cache of the client server into the check fragment of the target stripe.

In some possible implementations, the storage nodes include check nodes for storing check blocks, and in addition to sending the write requests, the client server further:

sends the one or more data blocks to one or more check nodes for backup;

when all data fragments of the target stripe are filled with data blocks, instructs the one or more check nodes to generate check blocks according to all backed-up data blocks of the target stripe and to store the generated check blocks into the check fragments of the target stripe.

In some possible implementations, the client server acquiring the one or more data blocks of the first data includes one of the following cases:

the client server splits the first data to generate the one or more data blocks of the first data;

the client server sends the write location information and the first data to an application server, and then obtains the one or more data blocks of the first data from the application server.

In some possible implementations, before the client server receives the first data, the method further includes:

the client server receives second data;

obtains an unused stripe as the target stripe, the size of the second data being smaller than the size of the target stripe;

acquires at least one data block of the second data, where the data blocks of the second data are generated based on the fragment size;

determines the fragments corresponding to the data blocks of the second data;

writes the data blocks of the second data into the corresponding fragments;

records the write locations of the data blocks of the second data in the target stripe as the write location information.

In some possible implementations, the write location information is saved in the memory of the client server; or, the write location information is saved in a metadata server.

In some possible implementations, the write location information may include the identifier of the node where the last data fragment written in the stripe is located, and the offset of the data written in the stripe relative to the start of the stripe.

In some possible implementations, a metadata server in the storage system may store the mapping relationship between stripes and nodes, and send the stripe and the information of the data nodes and check nodes corresponding to the stripe to the client server.

In some possible implementations, new data blocks may continue to be written in append mode.

This data writing manner can, on the one hand, use hard disk space as efficiently as possible and avoid free space between earlier and later data; on the other hand, it is better suited to flash-type storage media, improving read/write performance and balancing cell wear to extend media lifetime.

In some possible implementations, when a stripe is full, the stripe may be marked as full in the metadata server.

In the technical solution of the embodiments of this application, since no '0'-padding data is needed, the transmission and persistence of '0'-padding data are avoided, write amplification and the movement of invalid data are reduced, and the storage efficiency of the storage system can therefore be improved.

According to a second aspect, a client server is provided, including modules for performing the method in the first aspect or any possible implementation of the first aspect.

According to a third aspect, a client server is provided, including a processor and a memory, where the memory is configured to store instructions and the processor is configured to execute the instructions stored in the memory, so as to perform the method in the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, a system is provided, including the client server of the second aspect or the third aspect and a plurality of nodes, where the plurality of nodes are configured to store the data to be written by the client server.

According to a fifth aspect, a computer-readable medium is provided for storing a computer program, the computer program including instructions for performing the method in the first aspect or any possible implementation of the first aspect.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of this application are applicable.

FIG. 2 is a schematic diagram of full-stripe data writing.

FIG. 3 is a schematic diagram of non-full-stripe data writing.

FIG. 4 is a schematic flowchart of a data writing method according to an embodiment of this application.

FIG. 5 is a schematic diagram of data blocks according to an embodiment of this application.

FIG. 6 is a schematic diagram of data blocks according to another embodiment of this application.

FIG. 7 is a schematic diagram of a data writing manner according to an embodiment of this application.

FIG. 8 is a flowchart of a data writing method according to an embodiment of this application.

FIG. 9 is a flowchart of a data writing method according to another embodiment of this application.

FIG. 10 is a flowchart of a data writing method according to another embodiment of this application.

FIG. 11 is a flowchart of a data writing method according to another embodiment of this application.

FIG. 12 is a schematic block diagram of a client server according to an embodiment of this application.

FIG. 13 is a schematic block diagram of a client server according to another embodiment of this application.
Detailed Description

The technical solutions of this application are described below with reference to the accompanying drawings.

The technical solutions of the embodiments of this application can be applied to various storage systems. They are described below using a distributed storage system as an example, but the embodiments of this application are not limited thereto. In a distributed storage system, data (such as files and objects) is stored dispersedly on multiple storage devices, which share the storage load. This storage manner not only improves the reliability, availability and access efficiency of the system, but is also easy to scale. A storage device is, for example, a server, or a combination of a storage controller and storage media.

In the embodiments of the present invention: a client server receives first data; obtains write location information of a target stripe to which data blocks of other data have already been written, the write location information indicating the locations, in the target stripe, of the data blocks already written; and, according to those locations, appends data blocks of further data to the target stripe. This is equivalent to writing data blocks of multiple pieces of data into the same stripe at different times, thereby reducing the total number of stripes required by the data.

FIG. 1 is a schematic diagram of a scenario to which the technical solutions of the embodiments of this application are applicable.

As shown in FIG. 1, a client server 101 and an application server 102 communicate with a storage system 100; the storage system 100 includes a switch 103 and multiple storage nodes (or simply "nodes") 104. The storage node 104, also simply called a node, is a storage device; the switch 103 is optional. In other embodiments, the application server 102 may also be inside the storage system 100.

Each node 104 may include multiple magnetic disks or other types of storage media (such as solid-state drives, floppy disks or shingled magnetic recording) for storing data; for convenience, hard disk drives (HDD) are used as the example below. According to their specific functions, nodes 104 can be classified into data nodes, check nodes and metadata servers. Data nodes store the data blocks of data, check nodes store the check blocks of data, and the metadata server may store the metadata of data, and may also store the metadata of data blocks and the metadata of metadata blocks.

The client server 101 sends a request carrying the data to be written to the application server 102. The application server 102 splits the data to generate data blocks, generates check blocks from the data blocks, and returns the resulting blocks to the client server 101. The client server 101 sends each block through the switch 103 to the node 104 corresponding to that block, and the node 104 returns a write-success response to the client server 101 after storing the block.

It should be understood that the application server 102 and the client server 101 may also be merged; for example, the functions of both may be implemented by the client server 101. After the merge, the functions of the application server 102 are integrated into the client server 101, and the information exchange between them can be converted into internal operations, or some operations can be cancelled accordingly.

In the distributed storage system 100, the routing manner between the client server 101 and the nodes 104 may be a distributed hash table (DHT) manner, but the embodiments of this application are not limited thereto; any possible routing manner in a storage system may be used in the technical solutions of the embodiments of this application.

To ensure high reliability of data reads and writes, erasure coding (EC) may be used in the distributed storage system 100 for data storage. Herein, the EC 4+2 mode is used as an example for description, i.e., one stripe includes 4 data fragments and 2 check fragments, but the embodiments of this application are not limited thereto. A fragment is sometimes also called a strip or a stripe unit. As shown in FIG. 2, in the EC 4+2 mode, the application server 102 splits the data according to the fragment size (e.g., 512 KB) into 4 complete blocks of the fragment size (this case is said to satisfy the full-stripe condition), and then calculates the other 2 check blocks by XOR operations. If the data does not satisfy the full-stripe condition, i.e., the size of the data is not enough for 4 complete blocks — as shown in FIG. 3, stripe 2 has 4 data fragments and can thus store 4 complete blocks in total, while the actual size of the data is only enough for 2 complete blocks 8 and 9 — then a '0'-padding operation is required to obtain blocks 10 and 11, i.e., the data in these two blocks is set entirely to 0, and check blocks P2 and Q2 are calculated from the zero-padded data blocks 10 and 11 together with data blocks 8 and 9. The zero-padded data blocks 10 and 11 carry no useful information but are nevertheless stored on data nodes, which lowers hard disk space utilization.

In view of the above problem, the embodiments of this application provide a technical solution that can improve hard disk space utilization when data does not satisfy the full-stripe condition and reduce the total number of stripes required by the data.

Some terms and parameters used in the embodiments of this application are described first below.

A stripe is a segment of logical space with corresponding physical space. Specifically, a stripe consists of fragments, and each fragment corresponds to a segment of physical storage space on a hard disk; this correspondence is described by a mapping between fragments and logical block addresses (LBA) of the hard disk, and the correspondence may be stored in the storage server or elsewhere. Note in particular that with thin provisioning, no physical space is allocated to a stripe before data is actually stored into it; only when the storage server actually needs to store a data block does the storage server allocate physical space (from the hard disk) to the stripe. After the physical space corresponding to a fragment stores a data block, the data block can be considered to have been "written into" the fragment.

EC mode/redundancy ratio: denoted n+m, n+m=N, where n is the number of data fragments, m is the number of parity (check) fragments, and N is the total number of fragments of a stripe.

The n data fragments of a stripe correspond to n data nodes respectively, and the m check fragments correspond to m check nodes respectively; the n data nodes are the data nodes corresponding to the stripe, and the m check nodes are the check nodes corresponding to the stripe. Note that in this embodiment the N fragments may correspond to N nodes, i.e., each node stores only one block. In other embodiments, the N fragments may also correspond to fewer than N nodes, as long as the N fragments correspond to N storage media.

node_id: the number of a server (node) with hard disks within a specific stripe, in the range [1, N]; node_id from 1 to n denotes the IDs of data nodes, and node_id from n+1 to N denotes the IDs of check nodes. Of course, other labels may be used as node IDs in other embodiments.

The set of stripe-to-node mappings <stripe_id, node_id> may be saved on the metadata server, and one node may correspond to multiple stripes.

client_id: the number of a client server. After a stripe is allocated to a client server, the stripe's stripe_id_owner (introduced below) is that client server's client_id. The stripe is written by that client server; other client servers can only read it and cannot write to it. The client_id is saved in the client server.

stripe_id: the stripe number, saved in the metadata server. When data needs to be written, the metadata server allocates a previously unwritten stripe (whose stripe_id_owner is an illegal value) to the client server.

stripe_full: indicates that the stripe is full; the default value is FALSE, saved in the metadata server.

block_size: the configured/predetermined fragment size, which is of fixed length.

stripe_id_owner: indicates which client server writes data to the stripe, i.e., the owner of the stripe. The initial value of stripe_id_owner is set to an illegal value, indicating that the stripe is unallocated; it is saved in the metadata server.

offset: the offset relative to the start of the stripe, where the minimum offset is 0 and the maximum may be block_size*n (n being the number of data fragments of the EC mode).

location: the write location information, indicating the last location in the stripe to which a data block has been written. For example, it may consist of the node id of the last data block written in the current round together with the offset, expressed as (node_id, offset).
FIG. 4 shows a schematic flowchart of a data writing method 400 according to an embodiment of this application.

The scenario to which the method 400 applies includes a client server and multiple storage nodes. The multiple storage nodes may include data nodes and check nodes. For example, the method 400 may be applied to the scenario shown in FIG. 1, but the embodiments of this application are not limited thereto. The method 400 may be performed by a client server, for example the client server 101 in FIG. 1.

410: receive first data.

The first data is the data to be written, such as a file or an object that needs to be written. After receiving the first data, the client server prepares to write it to the corresponding stripe (the target stripe).

420: obtain write location information of the target stripe, the write location information indicating the locations, in the target stripe, of data blocks already written to the target stripe, where the target stripe includes a plurality of fragments, each fragment corresponds to one storage node, the storage nodes communicate with the client server, and the target stripe is not full.

In the embodiments of this application, "the stripe is not full" means that data has been written to the stripe, but the stripe still has space in an idle state (storing no data). That is, part of the stripe has been written, but its space is not filled.

In the embodiments of this application, the write location information indicates the locations, in the stripe, of data blocks already written to the stripe. In other words, the write location information can reflect how far data has been written in the stripe. When the stripe is not full, the write location information can indicate up to which position data has been written.

Optionally, in one embodiment of this application, the write location information indicates the last location in the stripe to which a data block has been written, for example, to which node (fragment) data has been written and the size of the written data.

Optionally, in one embodiment of this application, the write location information may include information on the node where the last data block written in the stripe is located, and the size of the data already written to the stripe.

For example, as shown in FIG. 5, if 768 KB of data has been written to the stripe — i.e., the first data block (512 KB) has been written to data node 1 and the second data block (256 KB) to data node 2 — the write location information may be (2, 768KB), where "2" indicates that the last written data block is on data node 2 (2 being the node_id of data node 2) and 768 KB is the size of the data written to the stripe; the latter may also be called the offset, i.e., the offset of the written data relative to the start of the stripe.

It should be understood that, since fragments correspond to nodes and node numbers correspond to fragment numbers, the write location information may also include the offset, within the target stripe, of the written data blocks, and the node number of the written data blocks.

It should be understood that the write location information may also be expressed in other forms, which are not limited by the embodiments of this application. For example, the above offset could instead be the size of the data written within the last written node; since the fragment size is fixed, the size of the data written to the stripe can also be derived from this. With this form, after the second data block (256 KB) in FIG. 5 is written to data node 2, the write location information can be expressed as (2, 256KB). Since the size of the data block written to data node 1 is a fixed 512 KB (the fragment size), the size of the data written to the stripe, 768 KB, can also be derived from this write location information.
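A small sketch of the two equivalent encodings just described follows; the fixed fragment size makes them interconvertible, and the helper names are illustrative:

```python
# Converting between (node_id, stripe_offset) and (node_id, offset_within_node).

BLOCK_SIZE = 512 * 1024  # fixed fragment size

def to_stripe_offset(node_id: int, local_offset: int) -> int:
    """(2, 256KB) -> 768KB: full fragments on nodes 1..node_id-1 plus the tail."""
    return (node_id - 1) * BLOCK_SIZE + local_offset

def to_local_offset(stripe_offset: int) -> tuple[int, int]:
    """768KB -> (2, 256KB); a multiple of block_size means the fragment is full."""
    node_id, local = divmod(stripe_offset, BLOCK_SIZE)
    if local == 0 and stripe_offset > 0:
        return node_id, BLOCK_SIZE  # last fragment exactly full
    return node_id + 1, local

assert to_stripe_offset(2, 256 * 1024) == 768 * 1024
assert to_local_offset(768 * 1024) == (2, 256 * 1024)
```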
Optionally, the write location information may be saved in the memory of the client server, or on another node, for example in the metadata server.

Optionally, when the write location information is saved in the memory of the client server, the client server may directly obtain it from its memory.

Optionally, when the write location information is saved in the metadata server, the client server may obtain it from the metadata server.

430: acquire one or more data blocks of the first data, each data block corresponding to one fragment, among the plurality of fragments, that has free space, where the one or more data blocks of the first data are generated based on the write location information, or the one or more data blocks of the first data are generated based on the fragment size.

In the embodiments of this application, the data blocks of the first data are generated by a splitting operation, to be written into the corresponding data fragments. When an application server is provided, the splitting may be performed by the application server; when no application server is provided, for example when the application server and the client server are integrated, the splitting may be performed by the client server.

Optionally, in one embodiment of this application, the client server splits the first data to generate the one or more data blocks of the first data.

Optionally, in another embodiment, where the application server performs the splitting, the client server may send the write location information and the first data to the application server, and then obtain the one or more data blocks of the first data from the application server.

Optionally, in one embodiment of this application, the one or more data blocks of the first data are generated based on the write location information, which may specifically include one of the following cases:

when the last location of the written data blocks in the target stripe is the last location of a particular fragment, the first data is split according to the fragment size (e.g., the aforementioned 512 KB) to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the fragment size, the first data forms one data block;

when the last location of the written data blocks in the target stripe is not the last location of the particular fragment (e.g., some position in the middle of that fragment), one data block is first cut from the first data according to the size of the unwritten portion of the particular fragment; the cut-out data block corresponds to the unwritten portion of the particular fragment, so that it fills the particular fragment together with the existing data blocks in it; the remaining portion of the first data is then split according to the fragment size, each resulting data block corresponding to one blank fragment; where, if the size of the first data is smaller than the size of the unwritten portion of the particular fragment, the first data forms one data block.

Specifically, if the last location of the written data blocks in the target stripe is the last location of a particular fragment, the writing of this round's data (the first data) starts from the beginning of the next fragment; the first data is then split according to the fragment size, with the remaining data smaller than one fragment forming the last data block; if the size of the first data is smaller than the fragment size, the whole first data forms one data block. The data blocks of the first data are obtained in this manner.

If the last location of the written data blocks in the stripe is not the last location of the particular fragment, the writing of this round's data (the first data) does not start from the beginning of a fragment; a block must first be cut out to make up one full fragment together with the blocks already written in the particular fragment, and the data is then split according to the fragment size; if the size of the first data is smaller than the size of the unwritten portion of the particular fragment, the whole first data forms one data block. The data blocks of the first data are obtained in this manner.

For example, as shown in FIG. 6, 768 KB of data has been written to the stripe and another 768 KB needs to be written in this round. From the write location information (2, 768K) it can be determined that the last location of the written data blocks is on data node 2 (corresponding to data fragment 2) and is not the last location of data fragment 2; therefore a data block 601 is first cut out according to the 256 KB size of the unwritten portion of data fragment 2, and the rest is split according to the 512 KB fragment size, yielding for this round a 256 KB data block 601 and a 512 KB data block 602 of the data to be written.

Generating the one or more data blocks of the first data based on the write location information allows the generated data blocks to correspond to the unwritten portion of the target stripe, so that the data blocks can be written into that unwritten portion; compared with the prior art, this saves hard disk space and reduces the total number of stripes required by the data.

Optionally, in one embodiment of this application, the one or more data blocks of the first data are generated based on the fragment size.

Specifically, in this embodiment, regardless of whether the last location of the written data blocks in the target stripe is the last location of a particular fragment, the first data is split according to the fragment size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the fragment size, the first data forms one data block. Each data block of the first data corresponds to one blank fragment. Correspondingly, the writing of this round's data (the first data) always starts from the beginning of the next fragment.

Optionally, a choice can be made between the two splitting manners: when the first data satisfies a first condition, the one or more data blocks of the first data are generated based on the write location information; or, when the first data satisfies a second condition, they are generated based on the fragment size. For example, which of the two manners to use may be determined by the size of the first data: when the size of the first data is smaller than a predetermined threshold (for example, the fragment size may serve as the threshold), the first manner — generation based on the write location information — may be used; when the size of the first data is not smaller than the threshold, the second manner — generation based on the fragment size — may be used. Alternatively, the splitting manner may be chosen according to the file type of the first data, for example the first manner for a log file and the second manner for a video file. Optionally, it may also be chosen according to the QoS of the first data, for example the first manner when the first data has a low latency requirement and the second manner when it has a high latency requirement.
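The two splitting strategies described above can be sketched as follows; block_size is the fragment size, offset is the stripe-relative write location, and the FIG. 6 numbers serve as a check. Helper names are illustrative, not from the document:

```python
# Minimal sketch of the two splitting strategies.

BLOCK_SIZE = 512 * 1024

def split_by_fragment(data: bytes) -> list[bytes]:
    """Second strategy: cut at fragment boundaries; a short piece stays one block."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def split_by_location(data: bytes, offset: int) -> list[bytes]:
    """First strategy: first top up the partially written fragment, then
    continue at fragment granularity (the FIG. 6 case)."""
    gap = -offset % BLOCK_SIZE   # unwritten bytes left in the current fragment
    if gap == 0:                 # last write ended exactly on a fragment boundary
        return split_by_fragment(data)
    head, rest = data[:gap], data[gap:]
    return [head] + (split_by_fragment(rest) if rest else [])

# FIG. 6: 768 KB already written, another 768 KB arrives.
blocks = split_by_location(b"x" * (768 * 1024), offset=768 * 1024)
print([len(b) // 1024 for b in blocks])  # -> [256, 512] (KB)
```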
440: send write requests to the storage nodes of the fragments corresponding to the one or more data blocks of the first data, the write requests being used to store the one or more data blocks of the first data into the corresponding fragments.

After the data blocks of the first data are obtained in the preceding steps, the client server next writes them into the corresponding data fragments. Since fragments correspond to storage space on storage nodes, the process of writing data (specifically, data blocks) into data fragments can be regarded as the process of storing data onto storage nodes. The client server sends a write request to the storage node of the fragment corresponding to each data block; the write request may carry the data block to be written and information such as the stripe and the node (or fragment) number. After receiving the write request, the storage node stores the data block into the corresponding fragment.

It should be understood that the process in which the client server sends write requests and the storage nodes store data according to them can be called the process of the client server writing data, i.e., the process of the client server writing blocks to the nodes.

Optionally, as an embodiment of this application, the client server may write the data blocks of the first data to the corresponding data nodes in append mode.

Specifically, the last location of the written data blocks in the stripe can be determined from the write location information, and the client server can continue writing data blocks onward from that location by appending.

When data is written to an unused stripe for the first time, writing starts from the beginning of a fragment, so the data to be written can be split directly according to the fragment size without considering existing data in the stripe. Taking writing second data into an unused stripe as an example, the corresponding flow may be:

the client server receives the second data; obtains an unused stripe as the target stripe, the size of the second data being smaller than the size of the target stripe; acquires at least one data block of the second data, where the data blocks of the second data are generated based on the fragment size; determines the fragments corresponding to the data blocks of the second data; writes the data blocks of the second data into the corresponding fragments; and records the write locations of the data blocks of the second data in the target stripe as the write location information.

For example, as shown in FIG. 5, data to be written (the second data) is written to an unused stripe; the fragment size is 512 KB and the size of the data to be written is 768 KB. The writing of the data starts from the beginning of the first data fragment, so the data is split according to the 512 KB fragment size into a 512 KB data block 501 and a 256 KB data block 502. The client server then writes data from the start of the stripe: data block 501 is written to the first fragment of the stripe and data block 502 to the second fragment, and the write location information (2, 768KB) can be recorded.

In the embodiments of this application, as for the check blocks, the check blocks may be stored to nodes each time data blocks are stored to nodes; alternatively, while the stripe is not full, no check blocks are stored, and the check blocks are stored to nodes only after the stripe is full. These are described separately below.

Optionally, in one embodiment of this application, check blocks are stored each time data blocks are stored. In this case, the client server also obtains the check blocks of the stripe, and writes them to the check nodes corresponding to the stripe.

As for the check blocks, new check blocks can be generated from the data blocks newly generated in this round and the check blocks generated for this stripe in the previous round. For example, for a given stripe, an XOR operation is performed between this round's newly generated data blocks and the check blocks generated for the stripe in the previous round (in the XOR, the missing portion is padded with 0s), and the result of the XOR serves as the new check block. If no data blocks had been written to the stripe before this round's data blocks are written, the check blocks are generated from this round's data blocks only, for example by zero-padding this round's data blocks as in the prior art and then generating the check blocks by the EC algorithm; the difference from the prior art is that the padded 0s serve mainly for calculating the check blocks and will not be sent to the nodes for storage later.

Optionally, the target stripe has a check block for verifying the written data blocks; after the client server acquires the at least one data block of the first data, the client server may further calculate, according to the one or more data blocks and the check block for verifying the written data blocks, a check block common to the written data blocks and the one or more data blocks, and store the calculated check block into the check fragment of the target stripe.

Optionally, where the client server performs the splitting, the client server generates the new check blocks from the data blocks of the first data and the check blocks already written to the stripe.

Optionally, where the application server performs the splitting, the application server generates the new check blocks from the data blocks of the first data and the check blocks already written to the stripe, and sends them to the client server.

For example, when storing the data shown in FIG. 5, check blocks 503 and 504 can be obtained by XOR operations from the 512 KB data block 501 and the 256 KB data block 502. When storing the data shown in FIG. 6, two new check blocks 603 and 604 can be obtained by XOR operations from the newly generated 256 KB data block 601 and 512 KB data block 602 together with the check blocks 503 and 504 generated at the previous storing.

After the data blocks 601, 602 and check blocks 603, 604 of the first data are obtained, the client server writes the data blocks and check blocks of the first data to the corresponding data nodes and check nodes respectively.

Optionally, the metadata server may store the mapping relationship between stripes and nodes, and send the stripe and the information of the data nodes and check nodes corresponding to the stripe to the client server.

For example, the metadata server may store the mapping between a stripe's stripe_id and the nodes' node_ids, and which client server writes the stripe (stripe_id_owner). When newly writing a stripe, the client server queries the metadata server for the stripe_id of the stripe to be written and the stripe-to-node mapping. The metadata server may select a stripe whose stripe_id_owner is an illegal value, allocate it to the client server for data writing, and update the value of stripe_id_owner to the client server's number client_id; the initial value of stripe_id_owner is an illegal value, indicating that the stripe is unallocated. The metadata server sends the stripe_id of the allocated stripe and the corresponding node_ids to the client server, from which the client server learns the allocated stripe and the data nodes and check nodes corresponding to that stripe.

For example, when storing the data shown in FIG. 5, since no data has yet been written to the stripe, the write location information is (1, 0KB), or is empty, or does not yet exist; the client server writes data from the start of the stripe, i.e., data block 501 is written to data node 1 corresponding to the stripe and data block 502 to data node 2. Assuming the stripe's stripe_id is 5, for data block 501 the client server may send a data write request that includes data block 501 and carries the parameters (stripe_id 5, node_id 1, offset 0KB, length 512KB), where stripe_id 5 indicates the write request belongs to stripe 5, node_id 1 indicates data node 1 of the stripe, offset 0KB indicates the write location's offset of 0 KB relative to the start of the stripe (i.e., writing starts at the start of the stripe), and length 512KB indicates the size of the written data. According to this request, data node 1 persists data block 501 to its hard disk and then returns a write-success response to the client server. Similarly, for data block 502 the client server may send a data write request including data block 502 with the parameters (stripe_id 5, node_id 2, offset 512KB, length 256KB); data node 2 persists data block 502 to its hard disk according to the request and returns a write-success response. Meanwhile, the client server sends check blocks 503 and 504 to check node 1 and check node 2 respectively; check node 1 and check node 2 persist check blocks 503 and 504 to their hard disks and return write-success responses to the client server. Persisting a block to a hard disk is in fact saving the block to the hard disk; the process of saving a block is in essence the persistence of the block.
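The write request for data block 501 in this example might be assembled as below, mirroring the parameters (stripe_id, node_id, offset, length); the dict layout is an assumption, as the document does not prescribe a wire format:

```python
# Assembling the FIG. 5 write request for data block 501 (illustrative layout).

def make_write_request(stripe_id: int, node_id: int, offset: int, block: bytes) -> dict:
    return {
        "stripe_id": stripe_id,  # which stripe the request belongs to
        "node_id": node_id,      # the data node of that stripe
        "offset": offset,        # bytes relative to the stripe start
        "length": len(block),    # size of the data written
        "payload": block,
    }

req = make_write_request(5, 1, 0, b"\x00" * 512 * 1024)
print(req["stripe_id"], req["node_id"], req["offset"], req["length"] // 1024)
# -> 5 1 0 512
```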
Optionally, where the write location information is saved in the memory of the client server, the client server updates the write location information saved in its memory after writing the data blocks of the first data to the corresponding data nodes. Optionally, the write-success response returned by each data node may include the last location to which that data node has written data blocks, or the data node may send that last location to the client server while returning the write-success response. The client server can compare the last locations reported by the data nodes to determine the last location of the written data blocks in the stripe, and then update the write location information saved in its memory.

For example, after the data shown in FIG. 5 has been stored, the client server can determine that the last location returned by data node 2 is the last location of the written data blocks in the stripe, and thus update the write location information saved in its memory to (2, 768KB).

Optionally, in another embodiment of this application, where the write location information is saved in the metadata server, the metadata server updates the write location information after the data blocks of the first data are written to the corresponding data nodes. Optionally, after successfully persisting the data blocks, each data node may also send to the metadata server the last location to which it has written data blocks; the metadata server can compare the data nodes' last written locations to determine the last location of the written data blocks in the stripe, and then update the write location information it holds.

For example, after the data shown in FIG. 5 has been stored, the metadata server can determine, from the last locations returned by data nodes 1 and 2, that the last location returned by data node 2 is the last location of the written data blocks in the stripe, and thus update the write location information it holds to (2, 768KB).

After the data shown in FIG. 5 has been stored, as shown in FIG. 6, in the next round of data writing, data blocks 601 and 602 and check blocks 603 and 604 are newly generated according to the write location information. The client server determines from the write location information that writing starts at (2, 768KB); it therefore sends data block 601 to data node 2 and data block 602 to data node 3; data nodes 2 and 3 persist data blocks 601 and 602 to their hard disks respectively and return write-success responses to the client server.

Optionally, for a stripe to which data blocks of IO requests have already been written, data blocks of new IO requests may continue to be written in append-only mode. In append-only mode, the start location of subsequently written data blocks is the last location of the previously successfully written data blocks. This manner of writing data blocks can, on the one hand, use hard disk space as efficiently as possible and avoid free space between earlier and later data; on the other hand, it is better suited to flash-type storage media, improving read/write performance and balancing cell wear to extend media lifetime, as with solid state drives (SSD) and storage class memory (SCM).

Meanwhile, the client server sends check blocks 603 and 604 to check node 1 and check node 2 respectively; check node 1 and check node 2 persist check blocks 603 and 604 to their own hard disks and return write-success responses to the client server.

After the data shown in FIG. 6 has been stored, the last location of the written data blocks in the stripe is the last location of data node 3 or the start location of data node 4, so the write location information can be updated to (3, 1536KB) or (4, 1536KB).

While the stripe is not full, the above flow can be executed in a loop; when the stripe is full, the stripe can be marked as full in the metadata server, completing one full-stripe data writing process.

For example, where the client server updates the write location information, if after updating it the client server determines that the stripe is full, it may send the metadata server an indication that the stripe is full, and the metadata server marks the stripe as full according to the indication, e.g., setting the stripe's stripe_full flag to TRUE.

Where the metadata server updates the write location information, if after updating it the metadata server determines that the stripe is full, it marks the stripe as full, e.g., setting the stripe-full flag to TRUE. The metadata server may also send the client server an indication that the stripe is full, so that the client server no longer writes new data to the full stripe and continues with the next stripe.

Optionally, in another embodiment of this application, while the stripe is not full, the check blocks are not stored to nodes; instead, the check blocks obtained in each round are stored in a cache, and the final check blocks are stored to nodes once the stripe is full.

In this embodiment, the cache of the client server holds check blocks for verifying the data blocks written to the nodes. After the client server acquires the at least one data block of the first data, the client server may further calculate, from (1) the one or more data blocks and (2) the check blocks for verifying the written data blocks, the check blocks common to the written data blocks and the one or more data blocks, and store the calculated check blocks into its cache. After all data fragments of the target stripe are filled with data blocks (i.e., a "full-stripe write"), the check blocks in the client server's cache corresponding to the target stripe are stored into the check fragments of the target stripe.

In this embodiment, apart from storing the check blocks in the cache while the stripe is not full, the other processing is similar to the foregoing embodiments; refer to the corresponding descriptions there, which are not repeated here for brevity.
Optionally, in another embodiment of this application, while the stripe is not full, check fragments are neither calculated nor stored; they are calculated and stored only once the stripe is full.

In this embodiment, the storage nodes include check nodes for storing check blocks. In addition to sending the write requests, the client server also sends the one or more data blocks to one or more check nodes for backup; after all data fragments of the target stripe are filled with data blocks, it instructs the one or more check nodes to generate check blocks from all backed-up data blocks of the target stripe and to store the generated check blocks into the check fragments of the target stripe.

In this embodiment, besides writing the data blocks of the first data to the data nodes, the client server also sends the data blocks of the first data to the check nodes corresponding to the stripe for backup; when the stripe is full, it sends the check nodes an indication that the stripe is full, the indication instructing the check nodes to generate and store the stripe's check blocks from all data blocks of the stripe.

In this embodiment, except as described below, the other processing — for example, the client server writing the data blocks of the first data to the data nodes, and the use and updating of the write location information — is similar to the foregoing embodiments; refer to the corresponding descriptions there, which are not repeated here for brevity.

After writing the data blocks of the first data to the data nodes, the client server also sends the data blocks of the first data to the check nodes corresponding to the stripe, and the check nodes cache the data blocks. While the stripe is not full, the client server can continue storing the next round of data in a similar manner (for example, the data to be written carried by the next write IO request from the host). When the stripe is full, the client server sends the check nodes an indication that the stripe is full; upon receiving it, the check nodes generate and store the stripe's check blocks from all data blocks of the stripe. The check nodes can then delete all cached data blocks of the stripe and may return to the client server a response indicating that the check blocks were written successfully.

The manner in which the client server determines whether the stripe is full is similar to the foregoing embodiments. For example, where the client server updates the write location information, it can determine whether the stripe is full after updating it; where the metadata server updates the write location information, if the metadata server determines after updating that the stripe is full, it can send the client server an indication that the stripe is full, and the client server can determine whether the stripe is full according to whether it receives this indication.

In the technical solution of the embodiments of this application, data blocks of data are written based on the write location information; when data does not satisfy the full-stripe condition, no '0'-padding operation is needed, saving hard disk space and reducing the total number of stripes required by the data, so that: (1) the management of stripes becomes simpler and faster, reducing the complexity of stripe management; (2) the speed of looking up stripes is improved — when a stripe needs to be found, the required stripe can be found faster; (3) the number of stripes involved in fault recovery becomes smaller, accelerating fault recovery — for example, when a hard disk fails, data recovery must be performed for all stripes involving the failed disk; with a reduced total number of stripes, fewer stripes need to be recovered and the recovery time is shortened.

In addition, in the technical solution of the embodiments of this application, since no '0'-padding data is needed, the transmission and persistence of '0'-padding data are avoided, and write amplification and the movement of invalid data are reduced, which can improve the storage efficiency of the storage system.

The embodiments of this application are described in detail below with reference to specific examples. It should be noted that these examples are merely intended to help those skilled in the art better understand the embodiments of this application, not to limit their scope.
FIG. 7 shows a schematic diagram of a data writing manner according to an embodiment of this application.

In FIG. 7, T1-T4 denote the time axis; taking the EC 4+2 mode as an example, the fragment size is 512 KB, N1-N4 denote the nodes where the data fragments are located, and P and Q denote the nodes where the check fragments are located. The write location information is stored in a metadata server (not shown); the case where the write location information is stored in the client server is similar and is not repeated here for brevity.

At time T1, data block 1 needs to be written. Data blocks 2, 3 and 4 are substituted with all-'0' data, check blocks p1 and q1 are calculated from data blocks 1, 2, 3 and 4, the designated nodes are then addressed via routing information (for example, by DHT), data block 1 and check blocks p1 and q1 are persisted to the hard disks (data blocks 2, 3 and 4 need no persistent storage), and the last location of the data blocks written at this time is updated to the metadata server.

At time T2, data block 2 needs to be written. The location for writing data block 2 is obtained by querying the metadata server, check blocks p2 and q2 are calculated from data block 2 and check blocks p1 and q1, data block 2 and check blocks p2 and q2 are persisted to the hard disks, the last location of the data blocks written at this time is updated to the metadata server, and p1 and q1 are then deleted.

At time T3, data block 3 needs to be written. The location for writing data block 3 is obtained by querying the metadata server, check blocks p3 and q3 are calculated from data block 3 and check blocks p2 and q2, data block 3 and check blocks p3 and q3 are persisted to the hard disks, the last location of the data blocks written at this time is updated to the metadata server, and p2 and q2 are then deleted.

At time T4, data block 4 needs to be written; the full-stripe write condition is now satisfied (total size of the data blocks = 512KB*4 = total size of the fragments storing data blocks). The final check blocks p and q are calculated from data block 4 and check blocks p3 and q3 and persisted, p3 and q3 are then deleted, and the stripe is marked as having completed the full-stripe write process. At this point, one full-stripe data writing process ends. Data blocks 1, 2, 3 and 4 may each come from different write requests of the host. Subsequent data writing can repeat the steps of T1-T4.
FIG. 8 is a flowchart of a data writing method according to an embodiment of this application.
In FIG. 8, the storage system includes a client server, an application server, a metadata server, data nodes, and parity nodes. The data nodes and parity nodes can store data blocks and parity blocks according to the EC mode. The number of data nodes and parity nodes corresponding to a stripe in FIG. 8 is determined by the EC mode in use; for example, for EC 4+2, one stripe corresponds to four data nodes and two parity nodes. In FIG. 8, the write position information is kept in the memory of the client server.
It should be understood that, where the application server and the client server are deployed as one, the client server can implement the corresponding functions of the application server; this is not repeated here for brevity.
Steps 801-813 below describe how the client server persists the first data, carried by the first IO request, to the nodes; steps 814-825 below describe how the client server persists the second data, carried by the second IO request, to the nodes. An IO request is, for example, a write IO request received from a host or another device, or a write IO request generated by the client server itself.
801: The client server sends a data write request to the application server.
When the client server needs to store data (the first data), it sends the application server a data write request that includes the data to be written, for example a file or an object.
802: The application server performs the splitting operation.
After receiving the client server's data write request, the application server splits the data to be written according to the configured EC redundancy ratio and strip size, and computes the parity blocks. Since there is no write position information at this point, the application server can split the data into blocks according to the configured strip size.
803: The application server returns the data blocks and parity blocks to the client server.
804: The client server queries the metadata server for information about the stripe to write to.
The client server queries the metadata server for the stripe_id of the stripe to be written and the mapping between the stripe and nodes.
805: The metadata server allocates a stripe.
The metadata server can randomly select a stripe whose stripe_id_owner is an invalid value, allocate it to the client server for data writing, and update stripe_id_owner to the client server's identifier client_id. An invalid stripe_id_owner means the stripe is blank: no data block has been stored in it yet, so there is no write position information (or its value is 0, or its value is an invalid value).
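A minimal sketch of this allocation step; the in-memory stripe table and the INVALID_OWNER sentinel are assumptions made for illustration.

```python
# Minimal sketch (assumed structures): the metadata server randomly
# picks a blank stripe (stripe_id_owner still holds an invalid value),
# assigns it to the requesting client, and records the ownership.

import random

INVALID_OWNER = -1

def allocate_stripe(stripes: dict, client_id: int) -> int:
    """stripes: dict mapping stripe_id -> {'stripe_id_owner': int, ...}"""
    blanks = [sid for sid, meta in stripes.items()
              if meta['stripe_id_owner'] == INVALID_OWNER]
    if not blanks:
        raise RuntimeError("no blank stripe available")
    stripe_id = random.choice(blanks)              # "randomly select" per the text
    stripes[stripe_id]['stripe_id_owner'] = client_id
    return stripe_id
```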
806: The metadata server sends the allocated stripe and the information of the corresponding nodes to the client server.
For example, the metadata server can send the client server the stripe_id of the allocated stripe and the corresponding node_id values.
807: The client server writes the data blocks.
Since there is no write position information at this point, the client server writes data starting from the start position of the allocated stripe. The client server sends the data blocks to the corresponding data nodes.
808: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk.
809: The data nodes send the client server a write-success response and the last positions of the written data blocks.
After successfully persisting the data block sent by the client server, a data node returns a write-success response. The response may include the last position of the data blocks the node has written, or the node may send that position to the client server alongside the write-success response.
810: The client server writes the parity blocks.
The client server sends the parity blocks to the corresponding parity nodes.
811: The parity nodes persist the parity blocks.
A parity node persists the parity block to a hard disk.
812: The parity nodes send a write-success response to the client server.
Once the client server has received write-success responses for both the data blocks and the parity blocks of this round, this round of data writing has succeeded.
813: The client server records the write position information.
The client server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and record the write position information in the client server's memory (if the write position was 0 or an invalid value, "record" in this step can be understood as "update").
814: The client server sends a data write request to the application server.
When there is new data (the second data) to store, the client server sends the application server a data write request that includes the data to be written and the write position information; the write position information indicates the last position of the data blocks written to the stripe in the previous round.
815: The application server generates data blocks.
After receiving the client server's data write request, the application server generates the data blocks of the data to be written based on the write position information. For the specific generation method, refer to the preceding embodiments; it is not repeated here for brevity.
816: The application server reads the parity blocks written to the stripe in the previous round.
817: The application server generates parity blocks.
The application server generates new parity blocks from the data blocks newly generated in this round and the parity blocks written to the stripe in the previous round.
818: The application server returns the data blocks and parity blocks to the client server.
819: The client server writes the data blocks.
Based on the write position information, the client server writes data starting from the last position of the data blocks written to the stripe in the previous round. The client server sends the data blocks to the corresponding data nodes.
820: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk; for example, new data blocks can continue to be written in the append-only manner.
821: The data nodes send the client server a write-success response and the last positions of the written data blocks.
After successfully persisting the data block sent by the client server, a data node returns a write-success response. The response may include the last position of the data blocks the node has written, or the node may send that position to the client server alongside the write-success response.
822: The client server writes the parity blocks.
The client server sends the parity blocks to the corresponding parity nodes.
823: The parity nodes persist the parity blocks.
A parity node persists the parity block to a hard disk.
824: The parity nodes send a write-success response to the client server.
Once the client server has received write-success responses for both the data blocks and the parity blocks of this round, this round of data writing has succeeded.
825: The client server updates the write position information.
The client server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information kept in the client server's memory.
If the stripe is not yet full, steps 814-825 above are executed in a loop; if the stripe is full, the following steps can be executed.
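This write loop and its termination can be summarized by the following client-side sketch; `client` and `metadata_server` are assumed interfaces whose method names (next_io_data, make_blocks, write_round, mark_full) are invented for illustration and are not from this application.

```python
# Minimal sketch: repeat the 814-825 round while the stripe has room,
# then report the full stripe to the metadata server (steps 826-827).

from dataclasses import dataclass

@dataclass
class StripeState:
    stripe_id: int
    capacity: int          # total bytes of all data strips in the stripe
    write_pos: int = 0     # stripe-relative last written offset

def write_rounds(client, metadata_server, stripe: StripeState):
    while stripe.write_pos < stripe.capacity:
        data = client.next_io_data()                         # next write IO payload
        blocks = client.make_blocks(data, stripe.write_pos)  # step 815
        stripe.write_pos = client.write_round(blocks)        # steps 819-825
    metadata_server.mark_full(stripe.stripe_id)              # steps 826-827
```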
826: The client server sends the metadata server an indication that the stripe is full.
827: The metadata server marks the stripe as full.
The metadata server marks the stripe as full according to the indication, for example by setting the stripe's full flag to TRUE, thereby completing one full-stripe write process.
FIG. 9 is a flowchart of a data writing method according to another embodiment of this application.
FIG. 9 differs from the embodiment of FIG. 8 in that, in FIG. 9, the write position information is kept in the metadata server. Apart from that, refer to the embodiment of FIG. 8 for the other related descriptions, which are not repeated here for brevity.
901: The client server sends a data write request to the application server.
When the client server needs to store data, it sends the application server a data write request that includes the data to be written, for example a file or an object.
902: The application server performs the splitting operation.
After receiving the client server's data write request, the application server splits the data to be written according to the configured EC redundancy ratio and strip size, and computes the parity blocks. Since there is no write position information at this point, the application server can split the data into blocks according to the configured strip size.
903: The application server returns the data blocks and parity blocks to the client server.
904: The client server queries the metadata server for the information of the stripe to write to and the write position information.
The client server queries the metadata server for the stripe_id of the stripe to be written, the mapping between the stripe and nodes, and the write position information.
905: The metadata server allocates a stripe.
The metadata server can randomly select a stripe whose stripe_id_owner is an invalid value, allocate it to the client server for data writing, and update stripe_id_owner to the client server's identifier client_id. Since this is a newly allocated stripe, there is no write position information, or the write position information can be the start position of the stripe.
906: The metadata server sends the allocated stripe and the information of the corresponding nodes to the client server.
For example, the metadata server can send the client server the stripe_id of the allocated stripe and the corresponding node_id values. In addition, if there is write position information, the metadata server also sends it to the client server.
907: The client server writes the data blocks.
For a newly allocated stripe, the client server writes data starting from the start position of the stripe. The client server sends the data blocks to the corresponding data nodes.
908: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk.
909: The data nodes send a write-success response to the client server.
After successfully persisting the data block sent by the client server, a data node returns a write-success response.
910: The data nodes send the metadata server the last positions of the data blocks they have written.
911: The metadata server updates the write position information.
The metadata server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information.
912: The client server writes the parity blocks.
The client server sends the parity blocks to the corresponding parity nodes.
913: The parity nodes persist the parity blocks.
A parity node persists the parity block to a hard disk.
914: The parity nodes send a write-success response to the client server.
Once the client server has received write-success responses for both the data blocks and the parity blocks of this round, this round of data writing has succeeded.
915: The client server queries the metadata server for the write position information.
When there is new data to store, the client server first queries the metadata server for the write position information, that is, the last position of the data blocks written to the stripe in the previous round.
916: The metadata server returns the write position information to the client server.
917: The client server sends a data write request to the application server.
The client server sends the application server a data write request that includes the data to be written and the write position information.
918: The application server generates data blocks.
After receiving the client server's data write request, the application server generates the data blocks of the data to be written based on the write position information.
919: The application server reads the parity blocks written to the stripe in the previous round.
920: The application server generates parity blocks.
The application server generates new parity blocks from the data blocks newly generated in this round and the parity blocks written to the stripe in the previous round.
921: The application server returns the data blocks and parity blocks to the client server.
922: The client server writes the data blocks.
Based on the write position information, the client server writes data starting from the last position of the data blocks written to the stripe in the previous round. The client server sends the data blocks to the corresponding data nodes.
923: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk; for example, new data blocks can continue to be written in the append-only manner.
924: The data nodes send a write-success response to the client server.
After successfully persisting the data block sent by the client server, a data node returns a write-success response.
925: The data nodes send the metadata server the last positions of the data blocks they have written.
926: The metadata server updates the write position information.
The metadata server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information.
927: The client server writes the parity blocks.
The client server sends the parity blocks to the corresponding parity nodes.
928: The parity nodes persist the parity blocks.
A parity node persists the parity block to a hard disk.
929: The parity nodes send a write-success response to the client server.
Once the client server has received write-success responses for both the data blocks and the parity blocks of this round, this round of data writing has succeeded.
930: The metadata server determines whether the stripe is full.
The metadata server can determine from the latest write position information whether the stripe is full. If the stripe is not yet full, step 931 is executed; if the stripe is full, steps 932 and 933 are executed.
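With stripe-relative offsets (an assumption consistent with the (3, 1536KB) example above), this check reduces to comparing the last written offset against the total size of the data strips; a minimal sketch follows.

```python
# Minimal sketch: the stripe is full exactly when the last written
# position has reached the end of the last data strip.

def stripe_is_full(last_offset_bytes: int,
                   strip_size: int, num_data_strips: int) -> bool:
    return last_offset_bytes >= strip_size * num_data_strips

# EC 4+2 with 512KB strips: the stripe is full at 2048KB.
assert stripe_is_full(4 * 512 * 1024, 512 * 1024, 4)
assert not stripe_is_full(1536 * 1024, 512 * 1024, 4)
```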
931: The metadata server sends the client server an indication that the stripe is not yet full.
The client server determines from this indication that the stripe is not full and continues executing steps 915-931 above in a loop.
932: The metadata server sends the client server an indication that the stripe is full.
The client server determines from this indication that the stripe is full and moves on to writing the next stripe.
933: The metadata server marks the stripe as full.
For example, the metadata server sets the stripe's full flag to TRUE, thereby completing one full-stripe write process.
FIG. 10 is a flowchart of a data writing method according to another embodiment of this application.
FIG. 10 differs from the embodiment of FIG. 8 in that, in FIG. 10, the parity blocks are not stored while the stripe is not yet full; they are stored only once the stripe is full. Apart from that, refer to the embodiment of FIG. 8 for the other related descriptions, which are not repeated here for brevity.
1001: The client server sends a data write request to the application server.
When the client server needs to store data, it sends the application server a data write request that includes the data to be written, for example a file or an object.
1002: The application server performs the splitting operation.
After receiving the client server's data write request, the application server splits the data to be written according to the configured EC redundancy ratio and strip size. Since there is no write position information at this point, the application server can split the data into blocks according to the configured strip size.
1003: The application server returns the data blocks to the client server.
1004: The client server queries the metadata server for information about the stripe to write to.
The client server queries the metadata server for the stripe_id of the stripe to be written and the mapping between the stripe and nodes.
1005: The metadata server allocates a stripe.
The metadata server can randomly select a stripe whose stripe_id_owner is an invalid value, allocate it to the client server for data writing, and update stripe_id_owner to the client server's identifier client_id.
1006: The metadata server sends the allocated stripe and the information of the corresponding nodes to the client server.
For example, the metadata server can send the client server the stripe_id of the allocated stripe and the corresponding node_id values.
1007: The client server writes the data blocks.
Since there is no write position information at this point, the client server writes data starting from the start position of the allocated stripe. The client server sends the data blocks to the corresponding data nodes.
1008: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk.
1009: The data nodes send the client server a write-success response and the last positions of the written data blocks.
After successfully persisting the data block sent by the client server, a data node returns a write-success response. The response may include the last position of the data blocks the node has written, or the node may send that position to the client server alongside the write-success response.
1010: The client server writes the data blocks to the parity nodes.
The client server sends the data blocks to the parity nodes corresponding to the stripe.
1011: The parity nodes cache the data blocks.
A parity node caches the data blocks it receives.
1012: The parity nodes send a write-success response to the client server.
1013: The client server updates the write position information.
The client server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information in the client server's memory.
1014: The client server sends a data write request to the application server.
When there is new data to store, the client server sends the application server a data write request that includes the data to be written and the write position information; the write position information indicates the last position of the data blocks written to the stripe in the previous round.
1015: The application server generates data blocks.
After receiving the client server's data write request, the application server generates the data blocks of the data to be written based on the write position information. For the specific generation method, refer to the preceding embodiments; it is not repeated here for brevity.
1016: The application server returns the data blocks to the client server.
1017: The client server writes the data blocks.
Based on the write position information, the client server writes data starting from the last position of the data blocks written to the stripe in the previous round. The client server sends the data blocks to the corresponding data nodes.
1018: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk; for example, new data blocks can continue to be written in the append-only manner.
1019: The data nodes send the client server a write-success response and the last positions of the written data blocks.
After successfully persisting the data block sent by the client server, a data node returns a write-success response. The response may include the last position of the data blocks the node has written, or the node may send that position to the client server alongside the write-success response.
1020: The client server writes the data blocks to the parity nodes.
The client server sends the data blocks to the parity nodes corresponding to the stripe.
1021: The parity nodes cache the data blocks.
A parity node caches the data blocks it receives.
1022: The parity nodes send a write-success response to the client server.
1023: The client server updates the write position information.
The client server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information kept in the client server's memory.
If the stripe is not yet full, steps 1014-1023 above are executed in a loop; if the stripe is full, the following steps can be executed.
1024: The client server sends the parity nodes an indication that the stripe is full.
1025: The parity nodes compute and store the parity blocks.
A parity node generates and stores the stripe's parity block from all cached data blocks of the stripe. The parity node can then delete all cached data blocks of the stripe.
1026: The parity nodes return a parity-write-success response to the client server.
1027: The client server sends the metadata server an indication that the stripe is full.
1028: The metadata server marks the stripe as full.
The metadata server marks the stripe as full according to the indication, for example by setting the stripe's full flag to TRUE, thereby completing one full-stripe write process.
FIG. 11 is a flowchart of a data writing method according to another embodiment of this application.
FIG. 11 differs from the embodiment of FIG. 9 in that, in FIG. 11, the parity blocks are not stored while the stripe is not yet full; they are stored only once the stripe is full. Apart from that, refer to the embodiment of FIG. 9 for the other related descriptions, which are not repeated here for brevity.
1101: The client server sends a data write request to the application server.
When the client server needs to store data, it sends the application server a data write request that includes the data to be written, for example a file or an object.
1102: The application server performs the splitting operation.
After receiving the client server's data write request, the application server splits the data to be written according to the configured EC redundancy ratio and strip size. Since there is no write position information at this point, the application server can split the data into blocks according to the configured strip size.
1103: The application server returns the data blocks to the client server.
1104: The client server queries the metadata server for the information of the stripe to write to and the write position information.
The client server queries the metadata server for the stripe_id of the stripe to be written, the mapping between the stripe and nodes, and the write position information.
1105: The metadata server allocates a stripe.
The metadata server can randomly select a stripe whose stripe_id_owner is an invalid value, allocate it to the client server for data writing, and update stripe_id_owner to the client server's identifier client_id. Since this is a newly allocated stripe, there is no write position information, or the write position information can be the start position of the stripe.
1106: The metadata server sends the allocated stripe and the information of the corresponding nodes to the client server.
For example, the metadata server can send the client server the stripe_id of the allocated stripe and the corresponding node_id values. In addition, if there is write position information, the metadata server also sends it to the client server.
1107: The client server writes the data blocks.
For a newly allocated stripe, the client server writes data starting from the start position of the stripe. The client server sends the data blocks to the corresponding data nodes.
1108: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk.
1109: The data nodes send a write-success response to the client server.
After successfully persisting the data block sent by the client server, a data node returns a write-success response.
1110: The data nodes send the metadata server the last positions of the data blocks they have written.
1111: The metadata server updates the write position information.
The metadata server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information.
1112: The client server writes the data blocks to the parity nodes.
The client server sends the data blocks to the parity nodes corresponding to the stripe.
1113: The parity nodes cache the data blocks.
A parity node caches the data blocks it receives.
1114: The parity nodes send a write-success response to the client server.
1115: The client server queries the metadata server for the write position information.
When there is new data to store, the client server first queries the metadata server for the write position information, that is, the last position of the data blocks written to the stripe in the previous round.
1116: The metadata server returns the write position information to the client server.
1117: The client server sends a data write request to the application server.
The client server sends the application server a data write request that includes the data to be written and the write position information.
1118: The application server generates data blocks.
After receiving the client server's data write request, the application server generates the data blocks of the data to be written based on the write position information.
1119: The application server returns the data blocks to the client server.
1120: The client server writes the data blocks.
Based on the write position information, the client server writes data starting from the last position of the data blocks written to the stripe in the previous round. The client server sends the data blocks to the corresponding data nodes.
1121: The data nodes persist the data blocks.
Each data node that receives a data block persists it to a hard disk; for example, new data blocks can continue to be written in the append-only manner.
1122: The data nodes send a write-success response to the client server.
After successfully persisting the data block sent by the client server, a data node returns a write-success response.
1123: The data nodes send the metadata server the last positions of the data blocks they have written.
1124: The metadata server updates the write position information.
The metadata server can compare the last written positions reported by the data nodes to determine the last position of the written data blocks in the stripe, and update the write position information.
1125: The client server writes the data blocks to the parity nodes.
The client server sends the data blocks to the parity nodes corresponding to the stripe.
1126: The parity nodes cache the data blocks.
A parity node caches the data blocks it receives.
1127: The parity nodes send a write-success response to the client server.
1128: The metadata server determines whether the stripe is full.
The metadata server can determine from the latest write position information whether the stripe is full. If the stripe is not yet full, step 1129 is executed; if the stripe is full, step 1130 is executed.
1129: The metadata server sends the client server an indication that the stripe is not yet full.
The client server determines from this indication that the stripe is not full and continues executing steps 1115-1129 above in a loop.
1130: The metadata server sends the client server an indication that the stripe is full.
1131: The client server sends the parity nodes an indication that the stripe is full.
1132: The parity nodes compute and store the parity blocks.
A parity node generates and stores the stripe's parity block from all cached data blocks of the stripe. The parity node can then delete all cached data blocks of the stripe.
1133: The parity nodes return a parity-write-success response to the client server.
1134: The metadata server marks the stripe as full.
For example, the metadata server sets the stripe's full flag to TRUE, thereby completing one full-stripe write process.
In the technical solutions of the embodiments of this application, no zero-padding data is produced when the data does not fill a full stripe. This reduces the total number of stripes the data requires, lowers the complexity of stripe management, speeds up stripe lookup, and accelerates fault recovery; it also avoids the transmission and persistence of padding data, reduces write amplification on the network and hard disks, and reduces the movement of invalid data, thereby improving the storage efficiency of the storage system.
It should be understood that, in the various embodiments of this application, the magnitudes of the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
It should also be understood that the specific examples in the embodiments of this application are merely intended to help those skilled in the art better understand the embodiments, not to limit their scope.
The data writing methods of the embodiments of this application have been described in detail above; the client server of the embodiments of this application is described below. It should be understood that the client server of the embodiments of this application can perform the various methods of the embodiments described above; that is, for the specific working processes of the following products, refer to the corresponding processes in the preceding method embodiments.
FIG. 12 is a schematic block diagram of a client server 1200 according to an embodiment of this application.
As shown in FIG. 12, the client server 1200 can include a receiving module 1210, an obtaining module 1220, and a writing module 1230.
The receiving module 1210 is configured to receive first data.
The obtaining module 1220 is configured to obtain write position information of a target stripe, where the write position information indicates the position, within the target stripe, of the data blocks already written to the target stripe, the target stripe includes multiple strips, each strip corresponds to one storage node, the storage nodes communicate with the client server, and the target stripe is not full; and to obtain one or more data blocks of the first data, where each data block corresponds to one of the multiple strips that has free space, and the one or more data blocks of the first data are generated based on the write position information or, alternatively, based on the strip size.
The writing module 1230 is configured to send a write request to the storage nodes of the strips corresponding to the one or more data blocks of the first data, where the write request is used to store the one or more data blocks of the first data into the corresponding strips.
Optionally, in an embodiment of this application, the write position information indicates the last position of the written data blocks in the target stripe,
where the one or more data blocks of the first data are generated in one of the following ways (see the sketch after these two cases):
when the last position of the written data blocks in the target stripe is the last position of a particular strip, the first data is split according to the strip size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the strip size, the first data is taken as one data block;
when the last position of the written data blocks in the target stripe is not the last position of the particular strip, one data block is first cut from the first data according to the size of the unwritten part of the particular strip, the cut data block corresponding to the unwritten part of that strip, and the remainder of the first data is then split according to the strip size, each data block cut from the remainder corresponding to one blank strip; where, if the size of the first data is smaller than the unwritten part of the particular strip, the first data is taken as one data block.
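A minimal Python sketch of these two chunking cases, assuming stripe-relative offsets; the function name is illustrative only. When the last written position falls on a strip boundary, the unwritten gap is zero and the data is cut directly into strip-size blocks; otherwise the first block is sized to fill the partially written strip.

```python
# Minimal sketch: split the first data based on the write position.

def split_first_data(data: bytes, last_offset: int, strip_size: int):
    blocks = []
    gap = (-last_offset) % strip_size      # unwritten part of the current strip
    if gap and data:
        blocks.append(data[:gap])          # fills the partially written strip
        data = data[gap:]
    while data:
        blocks.append(data[:strip_size])   # one block per blank strip
        data = data[strip_size:]
    return blocks

# Last position (2, 768KB) with 512KB strips: the first block is 256KB to
# complete node 2's strip, then 512KB blocks follow.
blocks = split_first_data(b'x' * (768 * 1024), 768 * 1024, 512 * 1024)
assert [len(b) for b in blocks] == [256 * 1024, 512 * 1024]
```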
Optionally, in an embodiment of this application, the one or more data blocks of the first data are generated as follows:
the first data is split according to the strip size to generate the one or more data blocks of the first data, where, if the size of the first data is smaller than the strip size, the first data is taken as one data block.
Optionally, in an embodiment of this application, the one or more data blocks of the first data are generated as follows:
when the first data meets a first condition, the one or more data blocks of the first data are generated based on the write position information; or, when the first data meets a second condition, the one or more data blocks of the first data are generated based on the strip size.
Optionally, in an embodiment of this application, the write position information includes the offset of the written data blocks in the target stripe and the node number of the written data blocks.
Optionally, in an embodiment of this application, the target stripe holds a parity block for protecting the written data blocks, and the obtaining module 1220 is further configured to:
compute, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
and the writing module 1230 is further configured to store the computed parity block to the parity strip of the target stripe.
Optionally, in an embodiment of this application, the client server's cache holds a parity block for protecting the written data blocks, and the obtaining module 1220 is further configured to:
compute, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
and the writing module 1230 is further configured to store the computed parity block in the client server's cache, and, once all data strips of the target stripe are filled with data blocks, to store the parity block in the client server's cache that corresponds to the target stripe to the parity strip of the target stripe.
Optionally, in an embodiment of this application, the storage nodes include parity nodes for storing parity blocks, and the writing module 1230 is further configured to:
send the one or more data blocks to one or more parity nodes for backup;
and, once all data strips of the target stripe are filled with data blocks, instruct the one or more parity nodes to generate parity blocks from all backed-up data blocks of the target stripe and to store the generated parity blocks to the parity strips of the target stripe.
Optionally, in an embodiment of this application, the obtaining module 1220 is specifically configured for one of the following:
splitting the first data to generate the one or more data blocks of the first data;
sending the write position information and the first data to an application server, and then obtaining the one or more data blocks of the first data from the application server.
Optionally, in an embodiment of this application, the receiving module 1210 is further configured to receive second data;
the obtaining module 1220 is further configured to obtain an unused stripe as the target stripe, where the size of the second data is smaller than the size of the target stripe; to obtain at least one data block of the second data, where the data blocks of the second data are generated based on the strip size; and to determine the strips corresponding to the data blocks of the second data;
and the writing module 1230 is further configured to write the data blocks of the second data to the corresponding strips, and to record the write positions of the data blocks of the second data in the target stripe as the write position information.
Optionally, in an embodiment of this application, the write position information is kept in the memory of the client server; alternatively, the write position information is kept in a metadata server.
Optionally, in an embodiment of this application, a metadata server is configured to store the mapping between stripes and nodes and to send the client server the stripe and the information of the data nodes and parity nodes corresponding to the stripe. The obtaining module 1220 is further configured to obtain, from the metadata server, the stripe and the information of the data nodes and parity nodes corresponding to the stripe.
Optionally, in an embodiment of this application, the metadata server is further configured to mark the stripe as full when the stripe is full.
The client server 1200 of this embodiment of this application can perform the corresponding procedures in the preceding method embodiments; for the corresponding specific descriptions, refer to the preceding embodiments, which are not repeated here for brevity.
FIG. 13 is a schematic structural diagram of a client server according to yet another embodiment of this application. The client server includes at least one processor 1302 (for example, a CPU), at least one network interface 1305 or other communication interface, and a memory 1306, which are communicatively connected to one another. The processor 1302 is configured to execute executable modules, such as computer programs, stored in the memory 1306. Communication with at least one other network element is implemented through the at least one network interface 1305 (which may be wired or wireless).
In some implementations, the memory 1306 stores a program 13061, and the processor 1302 executes the program 13061 to perform the methods in the various embodiments of this application described above.
An embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods in the various embodiments of this application described above.
An embodiment of this application further provides a system that can include the client server of the preceding embodiments and multiple nodes. The multiple nodes can include data nodes, parity nodes, and a metadata server. The system can further include an application server.
This application further provides a computer program product containing instructions that, when run on a client server, cause the client server to have the above functions, for example to perform the steps performed by the client server in the preceding embodiments.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive (Solid State Disk, SSD)).
It should be understood that, in the embodiments of this application, terms such as "first" merely refer to objects and do not indicate an order among the corresponding objects.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the preceding method embodiments; they are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The above are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by those skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

  1. A data writing method, comprising:
    receiving, by a client server, first data;
    obtaining write position information of a target stripe, wherein the write position information indicates a position, within the target stripe, of data blocks already written to the target stripe, the target stripe comprises a plurality of strips, each strip corresponds to one storage node, the storage nodes communicate with the client server, and the target stripe is not full;
    obtaining, by the client server, one or more data blocks of the first data, wherein each data block corresponds to one strip, among the plurality of strips, that has free space, and the one or more data blocks of the first data are generated based on the write position information; or, the one or more data blocks of the first data are generated based on a strip size;
    sending, by the client server, a write request to the storage nodes of the strips corresponding to the one or more data blocks of the first data, wherein the write request is used to store the one or more data blocks of the first data into the corresponding strips.
  2. The data writing method according to claim 1, wherein the write position information indicates a last position of the written data blocks in the target stripe;
    wherein that the one or more data blocks of the first data are generated based on the write position information comprises one of the following cases:
    when the last position of the written data blocks in the target stripe is a last position of a particular strip, splitting the first data according to the strip size to generate the one or more data blocks of the first data, wherein, if a size of the first data is smaller than the strip size, the first data is taken as one data block;
    when the last position of the written data blocks in the target stripe is not the last position of the particular strip, first cutting one data block from the first data according to a size of an unwritten part of the particular strip, the cut data block corresponding to the unwritten part of the particular strip, and then splitting a remainder of the first data according to the strip size, each data block cut from the remainder corresponding to one blank strip; wherein, if the size of the first data is smaller than the size of the unwritten part of the particular strip, the first data is taken as one data block.
  3. The data writing method according to claim 1, wherein that the one or more data blocks of the first data are generated based on the strip size comprises:
    splitting the first data according to the strip size to generate the one or more data blocks of the first data, wherein, if a size of the first data is smaller than the strip size, the first data is taken as one data block.
  4. The data writing method according to any one of claims 1 to 3, wherein, when the first data meets a first condition, the one or more data blocks of the first data are generated based on the write position information; or, when the first data meets a second condition, the one or more data blocks of the first data are generated based on the strip size.
  5. The data writing method according to any one of claims 1 to 4, wherein the write position information comprises:
    an offset of the written data blocks in the target stripe, and a node number of the written data blocks.
  6. The data writing method according to any one of claims 1 to 5, wherein the target stripe holds a parity block for protecting the written data blocks, and after the client server obtains the at least one data block of the first data, the method further comprises:
    computing, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
    storing the computed parity block to a parity strip of the target stripe.
  7. The data writing method according to any one of claims 1 to 5, wherein a cache of the client server holds a parity block for protecting the written data blocks, and after the client server obtains the at least one data block of the first data, the method further comprises:
    computing, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
    storing the computed parity block in the cache of the client server;
    after all data strips of the target stripe are filled with data blocks, storing the parity block in the cache of the client server that corresponds to the target stripe to the parity strip of the target stripe.
  8. The data writing method according to any one of claims 1 to 5, wherein the storage nodes comprise parity nodes for storing parity blocks, and, in addition to sending the write request, the client server further performs:
    sending the one or more data blocks to one or more parity nodes for backup;
    after all data strips of the target stripe are filled with data blocks, instructing the one or more parity nodes to generate parity blocks from all backed-up data blocks of the target stripe, and to store the generated parity blocks to the parity strips of the target stripe.
  9. The data writing method according to any one of claims 1 to 8, wherein the obtaining, by the client server, of the one or more data blocks of the first data comprises one of the following cases:
    splitting, by the client server, the first data to generate the one or more data blocks of the first data;
    sending, by the client server, the write position information and the first data to an application server, and then obtaining the one or more data blocks of the first data from the application server.
  10. The data writing method according to any one of claims 1 to 9, further comprising, before the client server receives the first data:
    receiving, by the client server, second data;
    obtaining an unused stripe as the target stripe, wherein a size of the second data is smaller than a size of the target stripe;
    obtaining at least one data block of the second data, wherein the data blocks of the second data are generated based on the strip size;
    determining the strips corresponding to the data blocks of the second data;
    writing the data blocks of the second data to the corresponding strips;
    recording write positions of the data blocks of the second data in the target stripe as the write position information.
  11. The data writing method according to any one of claims 1 to 10, wherein the write position information is kept in a memory of the client server; or, the write position information is kept in a metadata server.
  12. A client server, comprising:
    a receiving module, configured to receive first data;
    an obtaining module, configured to obtain write position information of a target stripe, wherein the write position information indicates a position, within the target stripe, of data blocks already written to the target stripe, the target stripe comprises a plurality of strips, each strip corresponds to one storage node, the storage nodes communicate with the client server, and the target stripe is not full; and to obtain one or more data blocks of the first data, wherein each data block corresponds to one strip, among the plurality of strips, that has free space, and the one or more data blocks of the first data are generated based on the write position information; or, the one or more data blocks of the first data are generated based on a strip size;
    a writing module, configured to send a write request to the storage nodes of the strips corresponding to the one or more data blocks of the first data, wherein the write request is used to store the one or more data blocks of the first data into the corresponding strips.
  13. The client server according to claim 12, wherein the write position information indicates a last position of the written data blocks in the target stripe;
    wherein the one or more data blocks of the first data are generated in one of the following ways:
    when the last position of the written data blocks in the target stripe is a last position of a particular strip, splitting the first data according to the strip size to generate the one or more data blocks of the first data, wherein, if a size of the first data is smaller than the strip size, the first data is taken as one data block;
    when the last position of the written data blocks in the target stripe is not the last position of the particular strip, first cutting one data block from the first data according to a size of an unwritten part of the particular strip, the cut data block corresponding to the unwritten part of the particular strip, and then splitting a remainder of the first data according to the strip size, each data block cut from the remainder corresponding to one blank strip; wherein, if the size of the first data is smaller than the size of the unwritten part of the particular strip, the first data is taken as one data block.
  14. The client server according to claim 12, wherein the one or more data blocks of the first data are generated as follows:
    splitting the first data according to the strip size to generate the one or more data blocks of the first data, wherein, if a size of the first data is smaller than the strip size, the first data is taken as one data block.
  15. The client server according to any one of claims 12 to 14, wherein the one or more data blocks of the first data are generated as follows:
    when the first data meets a first condition, generating the one or more data blocks of the first data based on the write position information; or, when the first data meets a second condition, generating the one or more data blocks of the first data based on the strip size.
  16. The client server according to any one of claims 12 to 15, wherein the write position information comprises:
    an offset of the written data blocks in the target stripe, and a node number of the written data blocks.
  17. The client server according to any one of claims 12 to 16, wherein the target stripe holds a parity block for protecting the written data blocks, and the obtaining module is further configured to:
    compute, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
    and the writing module is further configured to store the computed parity block to a parity strip of the target stripe.
  18. The client server according to any one of claims 12 to 16, wherein a cache of the client server holds a parity block for protecting the written data blocks, and the obtaining module is further configured to:
    compute, from the one or more data blocks and the parity block for protecting the written data blocks, a parity block covering both the written data blocks and the one or more data blocks;
    and the writing module is further configured to store the computed parity block in the cache of the client server; and, after all data strips of the target stripe are filled with data blocks, to store the parity block in the cache of the client server that corresponds to the target stripe to the parity strip of the target stripe.
  19. The client server according to any one of claims 12 to 16, wherein the storage nodes comprise parity nodes for storing parity blocks, and the writing module is further configured to:
    send the one or more data blocks to one or more parity nodes for backup;
    and, after all data strips of the target stripe are filled with data blocks, instruct the one or more parity nodes to generate parity blocks from all backed-up data blocks of the target stripe, and to store the generated parity blocks to the parity strips of the target stripe.
  20. The client server according to any one of claims 12 to 19, wherein the obtaining module is specifically configured for one of the following cases:
    splitting the first data to generate the one or more data blocks of the first data;
    sending the write position information and the first data to an application server, and then obtaining the one or more data blocks of the first data from the application server.
  21. The client server according to any one of claims 12 to 20, wherein the receiving module is further configured to receive second data;
    the obtaining module is further configured to obtain an unused stripe as the target stripe, wherein a size of the second data is smaller than a size of the target stripe; to obtain at least one data block of the second data, wherein the data blocks of the second data are generated based on the strip size; and to determine the strips corresponding to the data blocks of the second data;
    and the writing module is further configured to write the data blocks of the second data to the corresponding strips, and to record write positions of the data blocks of the second data in the target stripe as the write position information.
  22. The client server according to any one of claims 12 to 21, wherein the write position information is kept in a memory of the client server; or, the write position information is kept in a metadata server.
  23. A client server, comprising a processor and a memory;
    wherein the memory is configured to store instructions;
    and the processor is configured to execute the instructions stored in the memory to perform the method according to any one of claims 1 to 11.
  24. A system, comprising:
    the client server according to any one of claims 12 to 23, and a plurality of nodes;
    wherein the plurality of nodes are configured to store data to be written by the client server.
  25. A computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 11.