CN109407975B - Data writing method, computing node and distributed storage system - Google Patents

Data writing method, computing node and distributed storage system

Info

Publication number
CN109407975B
CN109407975B
Authority
CN
China
Prior art keywords
hard disk
target
data
target data
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811095443.8A
Other languages
Chinese (zh)
Other versions
CN109407975A (en)
Inventor
何春雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201811095443.8A
Publication of CN109407975A
Application granted
Publication of CN109407975B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G06F 3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0673 Single storage device
    • G06F 3/0674 Disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In a distributed storage system, a computing node receives a write command that carries target data and a tag. The computing node queries the target partition corresponding to the tag, and then queries the IDs of the multiple hard disks corresponding to the target partition, where those hard disks include hard disks in a selected state and hard disks in an alternative state. The target data is stored on the hard disks in the selected state, but not on the hard disks in the alternative state.

Description

Data writing method, computing node and distributed storage system
Technical Field
The invention relates to the field of storage, in particular to a distributed storage technology.
Background
In the prior art, a distributed storage system is partitioned into multiple fault domains (typically at server or rack granularity). At most one copy of a piece of data is stored in any one fault domain, so the failure of a single fault domain cannot make multiple copies inaccessible; consequently, the number of fault domains required by the system must be greater than or equal to the number of copies. For example, in a three-copy scenario, at least three fault domains are required to ensure that no two copies are placed in the same fault domain. When a fault domain fails, the data in it can be rebuilt in other fault domains to keep the number of copies intact.
Under prior-art solutions, once a storage pool is created it is difficult to adjust its copy count afterwards. For example, a three-copy distributed storage system can only store data with three copies; it cannot increase or decrease the number of copies.
If the number of copies must be forcibly reduced, the fault domains have to be removed one by one until the number of fault domains equals the number of copies, and then one more fault domain is forcibly removed; because copies are mutually exclusive across fault domains, the removed copy cannot be rebuilt, which achieves the reduction in copy count. Taking the reduction from three copies to two copies as an example, the procedure is: first reduce the total number of fault domains of the distributed storage system to three; then store the data with three copies; then forcibly remove one fault domain so that only two remain. In this way two fault domains are retained, each holding one copy, and the system effectively goes from three copies to two. Clearly this procedure is complex to operate and highly restrictive, because a fault domain must be forcibly removed.
Disclosure of Invention
In a first aspect, the present invention provides an embodiment of a data writing method used in a distributed storage system for storing data, where the distributed storage system includes a computing node and a plurality of storage nodes, and each storage node includes hard disks. The method includes: the computing node receives a write command, where the write command carries target data and a target tag, and the target tag is used to identify the target data; the computing node queries the target partition corresponding to the target tag; the computing node queries the IDs of the multiple hard disks corresponding to the target partition, where the hard disks corresponding to the target partition include hard disks in a selected state and hard disks in an alternative state, the hard disks in the selected state can be used to store copies of the target data and include a master hard disk and slave hard disks, and the hard disks in the alternative state are not used to store copies of the target data; the computing node sends the target data and a selected-hard-disk list to the storage node where the master hard disk is located, where the selected-hard-disk list includes the slave hard disk IDs and does not include alternative hard disk IDs; the storage node where the master hard disk is located stores the target data on the master hard disk and sends the target data to the storage nodes where the slave hard disks are located according to the slave hard disk IDs; and the storage nodes where the slave hard disks are located store the target data on their local slave hard disks.
With this scheme, the number of copies stored for the target data no longer has to equal the number of hard disks corresponding to the target partition, which improves flexibility.
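As an illustration only, the following Python sketch walks through the first-aspect write path, with in-process function calls standing in for the network hops between the computing node, the storage node holding the master hard disk, and the storage nodes holding the slave hard disks; the routing table, disk names and function names are assumptions made for the example, not interfaces defined by this embodiment.

```python
# Illustrative sketch of the first-aspect write path (names are assumptions).
SELECTED, ALTERNATIVE = "selected", "alternative"

# Partition routing: the first selected disk acts as the master hard disk,
# the remaining selected disks as slave hard disks; disk241 is alternative.
ROUTING = {
    "P1": [("disk211", SELECTED), ("disk221", SELECTED),
           ("disk231", SELECTED), ("disk241", ALTERNATIVE)],
}

def partition_of(tag: str) -> str:
    return "P1"  # stands in for hash(tag) -> target partition lookup

def write_to_disk(disk_id: str, data: bytes) -> None:
    print(f"{len(data)} bytes written to {disk_id}")

def slave_node_store(slave_id: str, data: bytes) -> None:
    write_to_disk(slave_id, data)          # one copy per selected slave disk

def primary_node_store(master_id: str, slave_ids: list, data: bytes) -> None:
    write_to_disk(master_id, data)         # copy on the master hard disk
    for slave_id in slave_ids:             # forward only to selected slaves
        slave_node_store(slave_id, data)

def compute_node_write(tag: str, data: bytes) -> None:
    disks = ROUTING[partition_of(tag)]
    selected = [d for d, s in disks if s == SELECTED]
    # the selected-hard-disk list sent out never contains the alternative disk
    primary_node_store(selected[0], selected[1:], data)

compute_node_write("LUN7+LBA0x2000", b"target data")
```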
In a first possible implementation manner of the first aspect, after the storage node where a slave hard disk is located stores the target data on the local slave hard disk, the method further includes: updating the state of an alternative hard disk corresponding to the target partition to the selected state; and obtaining the target data from the master hard disk or from another hard disk that already stores the target data, and storing the obtained target data on the newly selected hard disk.
With this scheme, the ratio of selected hard disks to alternative hard disks can be adjusted, thereby adjusting the number of copies.
In a second possible implementation manner of the first aspect, the target data may be a data block, and the target tag is a combination of a logical unit number identifier (LUN ID) and a logical block address (LBA).
This scheme provides a specific example of target data and target tags.
In a third possible implementation manner of the first aspect, the target data may be a key-value pair, and the target tag is the key.
This scheme provides a specific example of target data and target tags.
In a fourth possible implementation manner of the first aspect, querying the target partition corresponding to the target tag specifically includes: calculating a hash value of the target tag, and querying the target partition corresponding to the target tag according to the correspondence between hash values and partitions.
This provides a specific way of querying the target partition.
In a fifth possible implementation manner of the first aspect, the states of the multiple hard disks corresponding to the target partition are defined relative to the target partition.
This explains that the state of a hard disk is defined relative to a partition: for the same hard disk, the state may differ between two partitions.
In a second aspect, an embodiment of a data writing method is provided for writing data into a distributed storage system, where the distributed storage system includes a computing node and a plurality of storage nodes, and each storage node includes hard disks. The method includes:
the computing node receives a write command, where the write command carries target data and a target tag, and the target tag is used to identify the target data; the computing node queries the target partition corresponding to the target tag; the computing node queries the IDs of the multiple hard disks corresponding to the target partition, where the hard disks corresponding to the target partition include hard disks in a selected state and hard disks in an alternative state, the hard disks in the selected state can be used to store copies of the target data, and the hard disks in the alternative state are not used to store copies of the target data; and the computing node sends the target data to the storage nodes where the hard disks in the selected state are located, according to the IDs of the hard disks in the selected state among the hard disk IDs corresponding to the target partition.
With this scheme, the number of copies stored for the target data no longer has to equal the number of hard disks corresponding to the target partition, which improves flexibility. Here, "the computing node sends the target data to the storage nodes where the hard disks in the selected state are located" covers two cases. In the first case, similar to the first aspect, the computing server sends the target data to the storage server where the master hard disk is located, that storage server forwards the target data to the storage servers where the slave hard disks are located, and those servers then store it. In the second case, no distinction is made between master and slave hard disks, and the computing server sends the target data directly to the storage server of each selected hard disk corresponding to the target partition for storage.
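A minimal sketch of the second case, in which the computing node fans the target data out directly to every hard disk in the selected state, is given below; the routing table and function names are illustrative assumptions.

```python
# Second case of the second aspect: direct fan-out to every selected disk
# (no master/slave distinction); routing table and names are assumptions.
ROUTING = {"P1": [("disk211", "selected"), ("disk221", "selected"),
                  ("disk241", "alternative")]}

def send_to_storage_node(disk_id: str, data: bytes) -> None:
    print(f"sending {len(data)} bytes to the node holding {disk_id}")

def compute_node_write_direct(partition_id: str, data: bytes) -> None:
    for disk_id, state in ROUTING[partition_id]:
        if state == "selected":            # alternative disks are skipped
            send_to_storage_node(disk_id, data)

compute_node_write_direct("P1", b"target data")
```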
In a first possible implementation manner of the second aspect, after the storage node where a slave hard disk is located stores the target data on the local slave hard disk, the method further includes: updating the state of an alternative hard disk corresponding to the target partition to the selected state; and obtaining the target data from the master hard disk or from another hard disk that already stores the target data, and storing the obtained target data on the newly selected hard disk.
With this scheme, the ratio of selected hard disks to alternative hard disks can be updated.
The invention also provides embodiments of a distributed storage system and a computing node, which achieve the effects of the corresponding methods.
Drawings
FIG. 1 is a topology diagram of an embodiment of a distributed storage system;
FIG. 2 is a schematic diagram of an embodiment of a partition-to-hard disk state correspondence;
FIG. 3 is a schematic diagram of another embodiment of a partition-to-hard disk state correspondence;
FIG. 4 is a schematic diagram of yet another embodiment of a partition to hard disk state correspondence;
FIG. 5 is a flow diagram of an embodiment of writing data to a distributed storage system;
FIG. 6 is a schematic diagram of another embodiment of the partition-to-hard disk state correspondence.
Detailed Description
In a distributed storage system, the same data is stored repeatedly on different storage servers; the data on each storage server is called a copy, and this way of protecting data is called multi-copy, also known as mirroring. The data here is, for example, a data block or an object. A data block is the unit of data in block storage, such as a storage area network (SAN); an object is the unit of data in object storage (object store), such as a cloud object store or a key-value (KV) store.
Referring to FIG. 1, a host 11 communicates with a distributed storage system that includes a storage server 12, a storage server 13, a storage server 14 and a storage server 15, as well as a compute server 16, a compute server 17 and a metadata server 18; there may be more servers of each kind (not shown). A compute server receives data sent from the host 11 and computes the partition corresponding to the data. In the embodiment of the present invention, the metadata server 18 additionally stores the state of each hard disk corresponding to each partition: only hard disks in the selected state are used to store data, and hard disks in the alternative state are not. The metadata server maintains the table that maps partitions to hard disks, together with the hard disk states. By querying the metadata server 18, the compute server 16 learns which storage servers the data must be sent to (the storage servers where the selected hard disks are located). A storage server receives, directly or indirectly, the data sent by a compute server and stores it on a local selected hard disk. The data stored on each selected hard disk is a copy of the data sent by the host.
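A minimal model of the per-partition hard disk state table kept by the metadata server 18 might look as follows; the class and field names are assumptions made for illustration, not the embodiment's actual data structures.

```python
# Minimal model of the metadata server's partition -> hard disk state table.
from dataclasses import dataclass

@dataclass
class DiskEntry:
    disk_id: str
    server_addr: str   # address of the storage server holding this disk
    state: str         # "selected" or "alternative"

class MetadataServer:
    def __init__(self) -> None:
        self.table: dict[str, list[DiskEntry]] = {}

    def register(self, partition: str, entries: list[DiskEntry]) -> None:
        self.table[partition] = entries

    def selected_disks(self, partition: str) -> list[DiskEntry]:
        """Only these disks receive a copy when data is written."""
        return [e for e in self.table[partition] if e.state == "selected"]

meta = MetadataServer()
meta.register("P1", [DiskEntry("disk211", "server21", "selected"),
                     DiskEntry("disk221", "server22", "selected"),
                     DiskEntry("disk241", "server24", "alternative")])
print([e.disk_id for e in meta.selected_disks("P1")])  # ['disk211', 'disk221']
```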
Both compute server 16 and compute server 17 may receive data blocks from the host. When a compute server receives a data block sent by the host 11, it queries the IDs of the hard disks that should store the block. According to the result of the query, the data block is stored on hard disks of storage server 12, storage server 13 and storage server 14. The data blocks stored on these storage servers are identical, and each is called a replica (replica 121, replica 131 and replica 141). Since the total number of copies is 3, this storage mode is called 3-copy.
It should be noted that in other embodiments, two or all three of the storage server, compute server and metadata server may be integrated; for example, the same server may act as both a storage server and a compute server. Since this does not change the technical substance, such cases are not described separately in the embodiments of the present invention.
In the multi-copy data protection mode, a single storage server can serve as the minimum unit for storing copies, as in FIG. 1; that is, each storage server is a fault domain. For the same data block, each storage server stores at most one copy, so the failure of any storage server only affects the copy it stores and does not affect the copies on other storage servers. Besides a storage server, a hard disk, a rack, an equipment room or a data center can also serve as the minimum unit for storing copies.
The embodiment of the invention reduces the number of copies according to the current business requirement; the reduced copy count still satisfies the business requirement, the reduction causes no data-reconstruction overhead and therefore no performance impact, and data reconstruction, when needed, is efficient. A scheme for increasing the number of copies is also provided. Specifically, the embodiment establishes the correspondence between the data to be stored and the hard disks that store it through partitions. In this embodiment there is a maximum number of copies, and different states are set, through tags or in other ways, for the hard disks corresponding to a partition; the states are the selected state and the alternative state. Within the limit of the maximum copy count, the number of copies of the data can be increased or decreased by changing the ratio of selected hard disks to alternative hard disks. When all hard disks corresponding to a partition are selected (there are no alternative hard disks), the maximum copy count of the partition equals the number of its selected hard disks.
Specifically, each partition in the distributed storage system has a maximum number of copies. A mapping relation is established between each partition and its hard disks and recorded in a mapping table, also called the partition routing table. The mapping table is stored on the metadata server 18, itself protected with multiple copies. In the embodiment of the invention, a hard disk can be marked with a state, which is either selected or alternative. A hard disk in the selected state can store a copy; a hard disk in the alternative state cannot. The number of hard disks in the selected state plus the number of hard disks in the alternative state equals the maximum copy count. The state of a hard disk can be recorded in the mapping table together with the mapping relation, or recorded separately.
For the multiple hard disks corresponding to the same partition, a precedence relationship can be set among them, and the order in which hard disk states are changed is determined by that precedence. For example, the hard disks corresponding to the same partition can record their precedence in a linked list, where the head of the list is the master hard disk and the remaining hard disks are slave hard disks. Taking FIG. 2 as an example, each partition corresponds to 4 hard disks, so when all of them are in the selected state (not shown) each piece of data is stored with 4 copies. For example, partition 1 corresponds to hard disk 211, hard disk 221, hard disk 231 and hard disk 241, and the positions of these 4 hard disks in the linked list are drawn as arrows in FIG. 2. For partition 1, hard disk 211 is at the head of the list and hard disk 241 at the tail; hard disk 211 is the master hard disk, while hard disks 221, 231 and 241 are slave hard disks. Similarly, partition 2 corresponds to hard disks 251, 261, 271 and 211, and partition 3 corresponds to hard disks 291, 301, 311 and 211. As FIG. 2 also shows, hard disk 211 is associated with three partitions: for partition 1 it is in the selected state, while for partitions 2 and 3 it is in the alternative state. In other words, the state of a hard disk is not an attribute of the hard disk itself but a parameter relative to a partition; it describes whether the hard disk is selected or alternative for that particular partition.
As shown in FIG. 2, when the number of copies needs to be reduced to 3: for partition 1, the hard disk 241 at the tail of the linked list, located on server 24, can be set to the alternative state while the remaining hard disks stay in the selected state; for partition 2, the hard disk 211 at the tail of the linked list can be set to the alternative state; and for partition 3, the hard disk 211 at the tail of the linked list can be set to the alternative state. Conversely, to expand the number of copies, for example from 3 to 4, one hard disk in the alternative state can be updated to the selected state. For example, by changing hard disk 241 of partition 1 in FIG. 2 from the alternative state to the selected state, partition 1 can again support storage of 4 copies. In the distributed system, the hard disk states are changed for all partitions in the same way, so the number of copies of every partition stays consistent.
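The ordering and state changes described above can be pictured with the following sketch, which models partition 1 of FIG. 2 as an ordered chain whose head is the master hard disk; demoting from the tail and promoting back are shown as two helper functions. The data layout and function names are assumptions made for illustration.

```python
# Partition 1 of FIG. 2 as an ordered chain; the head is the master hard disk.
partition1 = [
    {"disk": "211", "state": "selected"},
    {"disk": "221", "state": "selected"},
    {"disk": "231", "state": "selected"},
    {"disk": "241", "state": "selected"},
]

def reduce_copies(chain):
    """Demote the last selected disk (tail side) to the alternative state."""
    for entry in reversed(chain):
        if entry["state"] == "selected":
            entry["state"] = "alternative"
            return entry["disk"]

def expand_copies(chain):
    """Promote the first alternative disk back to the selected state."""
    for entry in chain:
        if entry["state"] == "alternative":
            entry["state"] = "selected"
            return entry["disk"]

print(reduce_copies(partition1))  # 241: four copies become three
print(expand_copies(partition1))  # 241: back to four copies
```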
In the technical solution provided by this embodiment, the number of copies supported by the distributed storage system can be adjusted flexibly by changing hard disk states. Because fewer copies mean higher storage-resource utilization and more copies mean higher data reliability, this embodiment makes it possible to find a better balance between storage utilization and data reliability.
Referring to FIG. 3, a specific implementation of writing data is described below. The method may be applied to the distributed storage system shown in FIG. 1.
In step S31, the operating system (OS) of the host 11 sends a write request (also called a write IO request) over the Small Computer System Interface (SCSI) or the Internet Small Computer System Interface (iSCSI) to any compute server of the distributed storage system, here the first compute server (the target compute server). For convenience of description, the write request issued in this step is called the first write request, or the target write request. The first write request carries the data to be written into the distributed storage system, called the first data (the target data). It also carries a tag of the first data, which distinguishes different data. Note that different data usually have different tags, though in a few cases different data are allowed to share the same tag.
For object storage, the tag may be the name of the object. For key-value storage, the first data is the value, and the key may serve as the tag.
For block storage, the write address may serve as the tag. The write address may be: logical unit number identifier (LUN ID) + logical block address (LBA). The LUN ID describes which LUN the first data needs to be written to; the LBA describes the specific location within the first LUN (the target LUN) at which the first data is written, also called the offset. For convenience, unless otherwise stated the invention is described using block storage as the example, and key-value storage is not described separately.
In this step, the host 11 may be a device outside the distributed storage system or a device inside it. For example, the host 11 may be integrated with any one of the storage servers shown in FIG. 1; in that scenario the host may be a combination of a physical host and a storage server, or a storage server with virtual machine functionality. Further, if the host and the storage server are integrated, the first data and the write address may be generated by the storage server itself, and the act of "sending the first write request to the distributed storage system" may be omitted. Accordingly, the operation of "receiving the SCSI command" in step S32 is not needed, and the storage server holding the first write request can assemble a key directly from the LUN ID and the LBA.
At step S32, the first compute server receives the SCSI write request by running the virtual block storage (VBS) management software (as mentioned above, the first compute server may be integrated into a storage server, in which case the device receiving the SCSI command may be called a storage server, but it plays the combined role of compute server and storage server). It assembles a key from the LUN ID and the LBA of the first data carried in the received SCSI command, and then performs a hash operation on the key to obtain a hash value. A hash function is a one-way function that turns an input of arbitrary length into an output of fixed length, and for which it is computationally infeasible to find two different inputs that produce the same output. Virtual block storage is also referred to as a virtual block system (VBS).
The combination of LUN ID and LBA uniquely identifies a data block, so the two can be concatenated as the tag of the block. For example, the LUN ID and LBA are spliced as follows: the LUN ID is placed in front and the LBA behind, and together they form the key. For key-value storage, the key of the key-value pair can be used directly as the tag.
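For illustration, a possible way to splice the LUN ID and LBA into a key and hash it is sketched below; the field widths and the choice of SHA-256 truncated to 32 bits are assumptions, since the embodiment only requires a fixed-length one-way hash.

```python
import hashlib

def make_key(lun_id: int, lba: int) -> str:
    # LUN ID spliced in front, LBA behind; the field widths are assumptions
    return f"{lun_id:08x}{lba:016x}"

def key_hash(key: str) -> int:
    # any fixed-length one-way hash would do; SHA-256 truncated to 32 bits here
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

key = make_key(lun_id=7, lba=0x2000)
print(key, key_hash(key))
```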
The VBS software module performs volume metadata management. VBS provides the distributed storage access-point service through a SCSI or iSCSI interface, so the first compute server can access the distributed storage resources through VBS. VBS communicates point-to-point with the storage servers, which allows it to access the storage servers' hard disks concurrently. A VBS process can be deployed on each storage server, and the VBS instances on multiple nodes form a VBS cluster. IO performance can also be enhanced by deploying multiple VBS instances on a storage server.
Step S33: the first compute server determines the partition ID corresponding to the first data according to the correspondence between the result of the hash operation and the partitions. For convenience of description, the partition ID corresponding to the first data (the target data) is called the target partition ID.
Based on a distributed hash table (DHT), the distributed storage system divides the hash space (0 to 2^32) into N parts (for example, N equal parts); each part is one partition, each partition covers many hash values, and each partition has a unique ID (the partition ID). Each partition corresponds to multiple hard disks. That is, through the hash function, the hash value of the key maps to a partition, and the partition maps to multiple hard disks. The correspondence between partitions and hard disks is stored in a mapping table, which may be stored on the metadata server. The multiple hard disks may be divided into a master hard disk and slave hard disks, or they may be peers with no master/slave distinction.
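A sketch of mapping a 32-bit hash value onto one of N equal partitions follows; the value N = 3600 is an assumption, as the embodiment only states that the hash space is divided into N parts.

```python
N_PARTITIONS = 3600   # assumed value of N; the embodiment only says N parts
HASH_SPACE = 2 ** 32

def partition_id(hash_value: int) -> int:
    # equal split: each partition covers HASH_SPACE // N_PARTITIONS hash values
    return hash_value * N_PARTITIONS // HASH_SPACE

print(partition_id(0))               # partition 0
print(partition_id(HASH_SPACE - 1))  # partition 3599
```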
Step S34: according to the target partition ID, the first compute server queries the mapping table for the IDs of the hard disks corresponding to the target partition and for their states. The mapping table records, for each partition ID, the corresponding hard disk IDs and the hard disk states. The states are: the selected state and the alternative state, and they may be configured by an administrator. A hard disk in the selected state is called a selected hard disk and one in the alternative state an alternative hard disk; data can be written to a selected hard disk but not to an alternative hard disk. Note that the hard disk state here is not a physical state of the disk but a logical marking of whether, when data is written, the disk will receive a copy of that data. A hard disk ID corresponds to a storage server address, so once the hard disk IDs corresponding to the target partition are obtained, the corresponding storage server addresses can be obtained as well. The hard disk states are maintained on the metadata server.
In step S35, the VBS module of the first compute server sends to the storage server where the master hard disk is located: the first data and the IDs of the hard disks in the selected state (the list may include both the master hard disk ID and the slave hard disk IDs, or only the slave hard disk IDs). After receiving the first data, the storage server where the master hard disk is located writes the first data to the master hard disk according to the master hard disk ID, and sends the first data to the storage servers where the selected slave hard disks are located according to the slave hard disk IDs. Note that the first compute server does not send the first data to the storage servers where the hard disks in the alternative state are located.
Taking partition 1 of FIG. 3 as an example, the storage server 21 where the master hard disk 211 is located performs the following operations: it writes the first data to the master hard disk 211; it sends the first data to the storage server 22 where hard disk 221 is located, the request carrying the address of storage server 22 and the ID of hard disk 221; and it sends the first data to the storage server 23 where hard disk 231 is located, the request carrying the address of storage server 23 and the ID of hard disk 231. Although hard disk 241 also corresponds to partition 1, it is in the alternative state, so storage server 21 does not send the first data to storage server 24 where hard disk 241 is located.
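The forwarding decision made by storage server 21 in this example can be sketched as follows: requests are built only for the selected slave disks, each carrying the target server address and hard disk ID, so server 24 never receives one. The disk-to-server map and function name are assumptions made for illustration.

```python
# Forwarding on the server of the master hard disk; map and names are assumed.
DISK_TO_SERVER = {"211": "server21", "221": "server22",
                  "231": "server23", "241": "server24"}

def build_forward_requests(selected_slave_ids, data):
    # disk 241 is absent from selected_slave_ids, so server24 gets no request
    return [{"server": DISK_TO_SERVER[disk_id], "disk_id": disk_id,
             "payload": data}
            for disk_id in selected_slave_ids]

for req in build_forward_requests(["221", "231"], b"first data"):
    print(req["server"], req["disk_id"])
```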
Step S36: the storage servers where the slave hard disks are located receive the first data and the slave hard disk IDs from the server where the master hard disk is located, and store the first data on their slave hard disks according to those IDs. That is, storage server 22 writes the first data to hard disk 221, and storage server 23 writes the first data to hard disk 231. Storage server 24 does not receive the first data and does not write it to hard disk 241.
After receiving the first data, storage server 22 and storage server 23 send write-success responses to storage server 21, and storage server 21 sends a write-success response to the host, so the host knows the first data has been written successfully.
Step S37: the metadata server sets one or more of the alternative hard disks corresponding to the target partition as selected hard disks. The metadata server then instructs that the copy be replicated, from a server that already stores one, onto the hard disk whose state was updated from alternative to selected. Specifically, the metadata server may instruct a server that holds the copy to send it to the server where the newly selected hard disk is located, which stores the copy on the newly selected hard disk; or the metadata server may instruct the server where the newly selected hard disk is located to fetch a copy from a server that already stores the target data and store it on the locally added hard disk.
Referring to FIG. 4, hard disk 241 corresponding to partition 1 is changed from the alternative state to the selected state. The first data must be copied, from hard disk 211 or from the copy on hard disk 221 or 231, to hard disk 241, so that the first data is again maintained in the distributed storage system with four copies.
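A sketch of step S37's promotion and rebuild, using an in-memory stand-in for the copies, is shown below; the data structures and function name are illustrative assumptions.

```python
# Step S37 sketch: promote one alternative disk and rebuild the copy onto it.
def promote_and_rebuild(entries, copies):
    source = next(e for e in entries if e["state"] == "selected")
    for entry in entries:
        if entry["state"] == "alternative":
            entry["state"] = "selected"                     # alternative -> selected
            copies[entry["disk"]] = copies[source["disk"]]  # replicate existing copy
            return entry["disk"]

partition1 = [{"disk": "211", "state": "selected"},
              {"disk": "221", "state": "selected"},
              {"disk": "231", "state": "selected"},
              {"disk": "241", "state": "alternative"}]
copies = {"211": b"first data", "221": b"first data", "231": b"first data"}
print(promote_and_rebuild(partition1, copies))  # 241
print(sorted(copies))                           # ['211', '221', '231', '241']
```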
Setting an alternative hard disk of a partition as selected, or a selected hard disk as alternative, can be done manually by an administrator at the metadata server, or automatically by the metadata server according to a policy. For example, when a storage server of the distributed system detects that the data has become hotter, an alternative hard disk is set as selected, increasing the number of copies; otherwise, the number of copies is reduced.
Steps S38-S43: after one or more of the alternative hard disks corresponding to the target partition have been set as selected hard disks, the host sends a second write request to the distributed storage system. The second write command carries second data and a second tag, and the second tag identifies the second data. A second compute server of the distributed storage system receives the second write command and stores the second data. Steps S38-S43 are similar to steps S31-S36, so refer to those steps; they are not elaborated further. The differences are: the write request has changed, so the data being stored has changed; and after step S37 the number of alternative hard disks has decreased (to 0 at the minimum) while the number of selected hard disks has increased, so the number of copies that the data must be stored with has increased. For example, hard disk 241 of partition 1 in FIG. 4 has changed from the alternative state to the selected state. Then, in addition to writing the second data to the master hard disk 211, storage server 21 must send the second data and the ID of hard disk 221 to storage server 22, the second data and the ID of hard disk 231 to storage server 23, and the second data and the ID of hard disk 241 to storage server 24. The second data ends up stored on hard disks 211, 221, 231 and 241. The compute server in steps S38-S43 and the one in steps S31-S36 may be the same compute server or different ones; to distinguish their roles, the compute server of steps S31-S36 is called the first compute server and that of steps S38-S43 the second compute server.
In other embodiments there is an alternative to steps S38-S43, in which the metadata server sets a selected hard disk of the target partition to the alternative state. Correspondingly, after one or more selected hard disks of the target partition are set as alternative hard disks, the servers where those now-alternative hard disks are located are notified to delete their local copies. Referring to FIG. 5, hard disk 231 of partition 1 has changed from the selected state to the alternative state relative to FIG. 2, so the copy of the first data on hard disk 231 may be deleted. A new write command received while the selected hard disk of partition 1 is set to the alternative state may be called the third write command, to distinguish it from the first and second write commands. The third write command carries third data and a third tag, and the number of copies of the third data to be stored corresponds to the number of selected hard disks; that is, the third data is stored on hard disk 211 and hard disk 221. Since the storage process for the third data is similar to steps S31-S36, it is not described in detail.
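The opposite adjustment, demoting a selected hard disk and deleting its now-surplus copy, can be sketched in the same style; again the names are assumptions made for illustration.

```python
# Opposite adjustment: demote a selected disk and delete its local copy.
def demote_and_delete(entries, copies):
    for entry in reversed(entries):               # demote from the chain tail
        if entry["state"] == "selected":
            entry["state"] = "alternative"
            copies.pop(entry["disk"], None)       # drop the now-surplus copy
            return entry["disk"]

partition1 = [{"disk": "211", "state": "selected"},
              {"disk": "221", "state": "selected"},
              {"disk": "231", "state": "selected"}]
copies = {"211": b"first data", "221": b"first data", "231": b"first data"}
print(demote_and_delete(partition1, copies))  # 231
print(sorted(copies))                         # ['211', '221']: two copies remain
```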
For all partitions in the same distributed storage system, the number of copies, that is, the ratio of selected hard disks to alternative hard disks, can be kept consistent.

Claims (12)

1. A data writing method for writing data into storage nodes of a distributed storage system, the distributed storage system comprising computing nodes and the storage nodes, each storage node comprising a hard disk, the method comprising:
the computing node receives a write command, wherein the write command carries target data and a target label, and the target label is used for identifying the target data;
the computing node inquires a target partition corresponding to the target label;
the computing node inquires a plurality of hard disk IDs corresponding to the target partitions, the plurality of hard disks corresponding to the target partitions comprise hard disks with selected states and hard disks with alternative states, the hard disks with the selected states can be used for storing copies of the target data, the hard disks with the selected states comprise a master hard disk and a slave hard disk, and the hard disks with the alternative states are not used for storing the copies of the target data;
the computing node sends the target data and a selected hard disk list to a storage node where the main hard disk is located, wherein the selected hard disk list comprises a slave hard disk ID and does not comprise an alternative hard disk ID;
the storage node where the master hard disk is located stores the target data into the master hard disk, and sends the target data to the storage node where the slave hard disk is located according to the ID of the slave hard disk;
and the storage node where the slave hard disk is located stores the target data into the local slave hard disk.
2. The method of claim 1, further comprising, after the target data is stored in the local slave hard disk by the storage node where the slave hard disk is located:
updating the state of the hard disk with the alternative state corresponding to the target partition into the selected state;
and acquiring the target data from the main hard disk or the hard disk which stores the target data, and storing the acquired target data in a new selected hard disk.
3. The method of claim 1, further comprising:
the target data is a data chunk;
the target tag is a combination of a logical unit number identification LUN ID and a logical block address LBA.
4. The method according to claim 1, wherein querying the target partition corresponding to the target tag specifically comprises:
and calculating the hash value of the target label, and inquiring the target partition corresponding to the target label according to the corresponding relation between the hash value and the target partition.
5. The method of claim 1, wherein:
and the states of the plurality of hard disks corresponding to the target partition are related to the target partition.
6. A method of writing data for writing data to a distributed storage system, the distributed storage system comprising a compute node and a plurality of storage nodes, each storage node comprising a hard disk, the method comprising:
the computing node receives a write command, wherein the write command carries target data and a target label, and the target label is used for identifying the target data;
the computing node inquires a target partition corresponding to the target label;
the computing node inquires a plurality of hard disk IDs corresponding to the target partitions, the plurality of hard disks corresponding to the target partitions comprise hard disks with selected states and hard disks with alternative states, and the hard disks with the selected states comprise a master hard disk and a slave hard disk; the hard disk in the selected state can be used for storing the copy of the target data, and the hard disk in the alternative state is not used for storing the copy of the target data;
and the computing node sends the target data to the storage node where the hard disk with the selected state is located according to the hard disk ID with the selected state in the plurality of hard disk IDs corresponding to the target partition.
7. The method of claim 6, after sending the target data to the storage node where the hard disk with the selected state is located, further comprising:
updating the state of the alternative hard disk corresponding to the target partition to the selected state;
and acquiring the target data from the main hard disk or the hard disk which stores the target data, and storing the acquired target data in a new selected hard disk.
8. A distributed storage system comprising compute nodes, metadata nodes, and a plurality of storage nodes, each storage node comprising a hard disk, wherein:
the compute node is to:
receiving a write command, wherein the write command carries target data and a target tag, and the target tag is used for identifying the target data;
inquiring a target partition corresponding to the target label;
querying a plurality of hard disk IDs corresponding to the target partition from a metadata node, wherein the plurality of hard disks corresponding to the target partition comprise a hard disk with a selected state and a hard disk with an alternative state, the hard disk with the selected state can be used for storing a copy of the target data, the hard disks with the selected state comprise a master hard disk and a slave hard disk, and the hard disks with the alternative states are not used for storing the copy of the target data;
sending the target data and a selected hard disk list to a storage node where the main hard disk is located, wherein the selected hard disk list comprises a slave hard disk ID and does not comprise an alternative hard disk ID;
the storage node where the main hard disk is located is used for:
storing the target data into the master hard disk, and sending the target data to a storage node where the slave hard disk is located according to the ID of the slave hard disk;
the storage node where the slave hard disk is located is used for:
and storing the target data into a local slave hard disk.
9. The distributed storage system of claim 8, the metadata node further to:
updating the state of the alternative hard disk to the selected state, thereby increasing the number of selected hard disks corresponding to the target partition;
and instructing to copy the target data from the master hard disk or the slave hard disk which stores the target data to a new selected hard disk.
10. The distributed storage system of claim 8, wherein:
the target data is a data chunk;
the target tag is a combination of a logical unit number identification LUN ID and a logical block address LBA.
11. The distributed storage system of claim 8, wherein:
and the states of the plurality of hard disks corresponding to the target partition are related to the target partition.
12. A compute node, the compute node and a plurality of storage nodes belonging to a distributed storage system, each storage node comprising a hard disk, the compute node comprising a processor, a memory and an interface, the memory having stored therein a computer program, the compute node executing, via the computer program:
receiving a write command through the interface, wherein the write command carries target data and a target tag, and the target tag is used for identifying the target data;
inquiring a target partition corresponding to the target label;
querying a plurality of hard disk IDs corresponding to the target partition, wherein the plurality of hard disks corresponding to the target partition comprise a hard disk with a selected state and a hard disk with an alternative state, the hard disk with the selected state can be used for storing a copy of the target data, and the hard disk with the alternative state is not used for storing the copy of the target data;
and sending the target data to the storage node where the hard disk with the selected state is located through the interface according to the hard disk ID with the selected state in the plurality of hard disk IDs corresponding to the target partition.
CN201811095443.8A 2018-09-19 2018-09-19 Data writing method, computing node and distributed storage system Active CN109407975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811095443.8A CN109407975B (en) 2018-09-19 2018-09-19 Data writing method, computing node and distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811095443.8A CN109407975B (en) 2018-09-19 2018-09-19 Data writing method, computing node and distributed storage system

Publications (2)

Publication Number Publication Date
CN109407975A CN109407975A (en) 2019-03-01
CN109407975B (en) 2020-08-25

Family

ID=65465017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811095443.8A Active CN109407975B (en) 2018-09-19 2018-09-19 Data writing method, computing node and distributed storage system

Country Status (1)

Country Link
CN (1) CN109407975B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241320B (en) * 2019-07-17 2023-11-10 华为技术有限公司 Resource allocation method, storage device and storage system
CN111488124A (en) * 2020-04-08 2020-08-04 深信服科技股份有限公司 Data updating method and device, electronic equipment and storage medium
CN112015820A (en) * 2020-09-01 2020-12-01 杭州欧若数网科技有限公司 Method, system, electronic device and storage medium for implementing distributed graph database
CN113626217A (en) * 2021-07-28 2021-11-09 北京达佳互联信息技术有限公司 Asynchronous message processing method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595267B2 (en) * 2011-06-27 2013-11-26 Amazon Technologies, Inc. System and method for implementing a scalable data storage service
US9069835B2 (en) * 2012-05-21 2015-06-30 Google Inc. Organizing data in a distributed storage system
US9154298B2 (en) * 2012-08-31 2015-10-06 Cleversafe, Inc. Securely storing data in a dispersed storage network
CN104144127A (en) * 2013-05-08 2014-11-12 华为软件技术有限公司 Load balancing method and device
CN103645859B (en) * 2013-11-19 2016-04-13 华中科技大学 A kind of magnetic disk array buffer storage method of virtual SSD and SSD isomery mirror image
US9336091B2 (en) * 2014-03-06 2016-05-10 International Business Machines Corporation Reliability enhancement in a distributed storage system

Also Published As

Publication number Publication date
CN109407975A (en) 2019-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant