WO2017113280A1 - 分布式存储系统及管理元数据的方法 - Google Patents

分布式存储系统及管理元数据的方法 Download PDF

Info

Publication number
WO2017113280A1
Authority
WO
WIPO (PCT)
Prior art keywords
mdc
resource pool
standby
metadata
node
Prior art date
Application number
PCT/CN2015/100088
Other languages
English (en)
French (fr)
Inventor
谢会云
陈钟平
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201580070472.7A priority Critical patent/CN107211003B/zh
Priority to PCT/CN2015/100088 priority patent/WO2017113280A1/zh
Publication of WO2017113280A1 publication Critical patent/WO2017113280A1/zh

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications

Definitions

  • Embodiments of the present invention relate to the field of computers and, more particularly, to distributed storage systems and methods of managing metadata.
  • the basic architecture of a typical distributed storage system includes a ZooKeeper (ZK) cluster, a metadata controller (MDC) cluster, resource pools, and a client cluster.
  • the MDC cluster is deployed in a one-primary, multiple-standby mode.
  • the primary MDC is responsible for services such as metadata computation, reading and writing, and pool fault handling.
  • the metadata storage nodes in the ZK cluster are divided into data nodes and temporary (Ephemeral) nodes.
  • the data under the data nodes is modified by the primary MDC and can be read by the other MDCs.
  • the Ephemeral nodes include a primary temporary node and standby temporary nodes.
  • the primary temporary node stores the identification information of the primary MDC.
  • each standby temporary node stores the identification information of one standby MDC.
  • the primary MDC monitors the standby temporary nodes to determine whether the status of the standby MDCs is normal.
  • the standby MDCs monitor the primary temporary node to determine whether the status of the primary MDC is normal. Once the primary MDC fails, all standby MDCs receive a ZK event notification and enter the contention-for-primary process.
  • the new primary MDC thus produced reads the metadata from the ZK cluster and provides services externally after initialization.
  • because the metadata is read and written only by the primary MDC, only resource pool services of a certain scale can be supported, and expansion in the resource pool dimension is not supported; the standby MDCs do not process services, which wastes system resources; during the period after the primary MDC fails and before the new primary MDC provides services, the services of the entire distributed storage system are affected; and the existing distributed storage system cannot support dynamic expansion and reduction of the metadata control cluster.
  • the invention provides a distributed storage system and a method for managing metadata, which can manage a larger-scale storage cluster and can implement fault domain isolation.
  • a first aspect provides a distributed storage system, including: a metadata database, multiple metadata controllers (MDCs), and multiple resource pools; the metadata database is configured to store metadata corresponding to the multiple resource pools; a primary MDC among the multiple MDCs is configured to manage the mapping relationships, stored in the metadata database, between the standby MDCs among the multiple MDCs and the resource pools; and a standby MDC among the multiple MDCs is configured to manage the metadata, stored in the metadata database, corresponding to the resource pools that have a mapping relationship with the standby MDC.
  • the primary MDC in the distributed storage system in the embodiments of the present invention is used to manage the mapping relationships between MDCs and resource pools.
  • the standby MDC is used to manage the metadata, stored in the metadata database, corresponding to the resource pools that have a mapping relationship with the standby MDC.
  • the standby MDC is thus responsible for services such as the computation, reading and writing, and resource pool fault handling of the metadata of the resource pools that have a mapping relationship with it. Therefore, the distributed storage system can manage a larger-scale storage cluster and implement fault domain isolation.
  • the metadata storage nodes in the metadata database include a public node, a private node, and a temporary node; the metadata stored in the public node is modified by the primary MDC; the private node stores the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that resource pool; the temporary node stores the identification information of each of the multiple MDCs.
  • thus, each of the multiple MDCs can participate in the management of the metadata, a larger storage cluster can be managed, and each MDC manages only the metadata of the resource pools corresponding to it, thereby enabling fault domain isolation.
  • the primary MDC is specifically configured to: when a resource pool needs to be created, determine among the multiple MDCs the home MDC of the resource pool to be created, and write the mapping relationship between the resource pool to be created and the home MDC of the resource pool to be created into the public node; the home MDC of the resource pool to be created is used to read, according to the mapping relationship, stored in the public node, between the resource pool to be created and its home MDC, the topology information of the resource pool to be created from the private node.
  • the metadata may be stored in the metadata database in the form of a multi-level directory, that is, the metadata may be stored in multiple levels of nodes.
  • for example, the public node may be used as a root node, with different types of metadata stored in different leaf nodes under the root node; the private node may be used as a root node, with the metadata corresponding to each resource pool stored in one leaf node.
  • the primary MDC is further configured to: receive a create resource pool request sent by a user, where the create resource pool request carries the topology information; and write the topology information into the private node.
  • the home MDC of the resource pool that needs to be created is further configured to: generate, according to the topology information stored in the private node, the metadata corresponding to the resource pool that needs to be created; and write that metadata into the private node.
  • the number of resource pools can be increased according to the service requirements of the user, and the storage capacity of the system can be improved to better meet the service requirements of the user.
  • the primary MDC is further configured to: delete the mapping relationship, stored in the public node, between a first resource pool among the multiple resource pools and a first standby MDC among the standby MDCs.
  • the first standby MDC is configured to stop managing the metadata corresponding to the first resource pool when the mapping relationship between the first resource pool and the first standby MDC is deleted.
  • the primary MDC is specifically configured to: when determining that the first standby MDC is faulty, delete the mapping relationship between the first resource pool and the first standby MDC.
  • when, owing to a service change, the user no longer needs a resource pool, the mapping relationship between that resource pool and its standby MDC can be deleted; the standby MDC that had the mapping relationship then stops managing the metadata corresponding to that resource pool and can, when needed, manage the metadata of other resource pools, thereby improving the utilization of system resources.
  • the primary MDC is further configured to: when determining that the first standby MDC is faulty, determine a second standby MDC among the multiple MDCs and write the mapping relationship between the first resource pool and the second standby MDC into the public node; the second standby MDC is configured to read the metadata of the first resource pool from the private node according to the mapping relationship, stored in the public node, between the first resource pool and the second standby MDC.
  • the primary MDC is specifically configured to: determine, as the second standby MDC, one of the standby MDCs among the multiple MDCs whose load is less than a preset threshold.
  • when a standby MDC is faulty, another standby MDC may be newly designated to manage the metadata corresponding to the resource pools managed by the failed standby MDC, thereby improving system reliability.
  • the primary MDC is further configured to: receive a create MDC request sent by the user; and write the identification information of the MDC requested to be created into the public node.
  • the distributed storage system of the embodiments of the present invention can thus add MDCs online according to the user's request, so as to improve the processing capability and reliability of the system.
  • the standby MDCs among the multiple MDCs are further configured to initiate a contention-for-primary process when determining that the primary MDC is faulty; one of the multiple MDCs then acts as a new primary MDC, used to load the metadata in the public node.
  • the primary MDC is further configured to: determine that the view state of a client has changed; and update the metadata in the public node corresponding to that client's view.
  • the standby MDC among the multiple MDCs is further configured to: when determining that the state of a resource pool that has a mapping relationship with the standby MDC has changed, update the metadata of the resource pool that has a mapping relationship with the standby MDC.
  • a second aspect provides a method for managing metadata in a distributed storage system, including: a first metadata controller (MDC) receives a mapping relationship query request sent by a client, where the mapping relationship query request is used to query which MDC manages the metadata of the resource pool corresponding to a user request; the first MDC sends mapping relationship indication information to the client, where the mapping relationship indication information indicates that the MDC managing the resource pool corresponding to the user request is a second MDC; and the client reads the metadata of the resource pool corresponding to the user request from the second MDC.
  • before the first MDC sends the mapping relationship indication information to the client, the method further includes: the first MDC reads a mapping relationship list stored in the metadata database, where the metadata storage nodes in the metadata database include a public node, a private node, and a temporary node; the metadata stored in the public node is modified by the primary MDC; the private node stores the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that resource pool; and the temporary node stores the identification information of each of the multiple MDCs. The first MDC then determines, according to the mapping relationship list, the second MDC as the MDC that manages the resource pool corresponding to the user request.
  • in this way, the primary MDC manages the public metadata, each standby MDC manages the metadata of the resource pools that have a mapping relationship with it, and the identification information of each MDC is stored in the temporary node so that the primary MDC and the standby MDCs monitor each other's state. Since each MDC can participate in the management of the metadata, a larger storage cluster can be managed, and because each MDC manages only the metadata of its corresponding resource pools, fault domain isolation can be implemented.
  • the primary MDC determines that a resource pool needs to be created; the primary MDC determines the home MDC of the resource pool to be created; the primary MDC writes the mapping relationship between the resource pool to be created and its home MDC into the public node; and the home MDC of the resource pool to be created reads the topology information of the resource pool to be created from the private node, according to the mapping relationship stored in the public node.
  • the primary MDC may determine, at system initialization and according to the user's needs, that a resource pool needs to be created; or the primary MDC may determine that a resource pool needs to be created when the user determines that the storage capacity of the existing resource pools cannot meet the user's service requirements; or the primary MDC may determine that a resource pool needs to be created upon receiving a create resource pool request sent by the user, where the create resource pool request carries the topology information of the resource pool to be created, and upon receiving the request the primary MDC writes the topology information into the private node in the metadata database.
  • the method further includes: the primary MDC receives a create resource pool request sent by the user, where the create resource pool request carries the topology information; and the primary MDC writes the topology information into the private node.
  • the method further includes: the home MDC of the resource pool to be created generates, according to the topology information, the metadata corresponding to the resource pool to be created; and the home MDC of the resource pool to be created writes that metadata into the private node.
  • the method further includes: the primary MDC deletes the mapping relationship, stored in the public node, between the first resource pool and the first standby MDC.
  • when the first standby MDC determines that the mapping relationship between the first resource pool and the first standby MDC has been deleted, it stops managing the metadata corresponding to the first resource pool.
  • the method further includes: the primary MDC determines a second standby MDC when determining that the first standby MDC is faulty; the primary MDC writes the mapping relationship between the first resource pool and the second standby MDC into the public node; and the second standby MDC reads the metadata of the first resource pool from the private node, according to the mapping relationship, stored in the public node, between the first resource pool and the second standby MDC.
  • the method further includes: the primary MDC receives a create MDC request sent by a user; and the primary MDC writes the information of the MDC requested to be created into the public node.
  • the method further includes: the standby MDCs initiate a contention-for-primary process when determining that the primary MDC is faulty; one standby MDC then acts as the new primary MDC and loads the metadata in the public node.
  • the method further includes: when the standby MDC determines that the state of a resource pool that has a mapping relationship with the standby MDC has changed, it updates the metadata of the resource pool that has a mapping relationship with the standby MDC.
  • a metadata controller (MDC) is provided for performing the method in the foregoing second aspect or any possible implementation of the second aspect; specifically, the MDC includes units for performing the method in the second aspect or any possible implementation of the second aspect.
  • a data service device is provided, comprising a processor and a memory connected by a bus system, where the memory is used to store metadata and the processor is used to manage the metadata stored in the memory, so that the data service device performs the method in the foregoing second aspect or any possible implementation of the second aspect.
  • a computer readable medium is provided for storing a computer program comprising instructions for performing the method in the second aspect or any possible implementation of the second aspect.
  • FIG. 1 is an architectural diagram of a distributed storage system in accordance with an embodiment of the present invention
  • FIG. 2 is an architectural diagram of a distributed storage system in accordance with a specific embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a method for dividing nodes in a metadata database according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart of a method for managing metadata in a distributed storage system according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of a method for establishing a mapping relationship between a POOL and an MDC according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of a mapping relationship according to an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a method for creating a POOL by a home MDC according to an embodiment of the present invention
  • FIG. 8 is a schematic flowchart of a method for deleting a mapping relationship according to an embodiment of the present invention.
  • FIG. 9 is a schematic flowchart of a method for reconstructing a mapping relationship between an MDC and a POOL according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a method of adding an MDC according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of a method of selecting a new primary MDC according to an embodiment of the present invention.
  • FIG. 12 is a schematic flowchart of a method for processing a client view change by a primary MDC according to an embodiment of the present invention
  • FIG. 13 is a schematic flowchart of a method for a home MDC to process a resource pool state change according to an embodiment of the present invention
  • FIG. 14 is a schematic block diagram of a data service device in accordance with an embodiment of the present invention.
  • the distributed storage system 10 includes a metadata database 11, multiple metadata controllers (Metadata Controller, "MDC") 12, and multiple resource pools 13;
  • the metadata database 11 is configured to store metadata corresponding to the plurality of resource pools 13;
  • the primary MDC 121 of the plurality of MDCs is configured to manage the mapping relationships, stored in the metadata database 11, between the standby MDCs 122 of the plurality of MDCs and the resource pools;
  • the standby MDC 122 of the plurality of MDCs is configured to manage metadata corresponding to the resource pools that are stored in the metadata database and have a mapping relationship with the standby MDC 122.
  • the metadata database may be a distributed, open-source coordination service for distributed applications, such as ZooKeeper ("ZK") or Google Chubby. Multiple MDCs may form an MDC cluster, and there is a one-to-one correspondence between the standby MDCs and the resource pools in the MDC cluster.
  • the distributed storage system of the embodiment of the present invention may be the distributed storage system shown in FIG. 2.
  • the distributed storage system includes: a ZooKeeper cluster, an MDC cluster, a plurality of resource pools (POOLs), and a client cluster.
  • the MDC cluster includes one primary MDC and multiple standby MDCs.
  • the primary MDC is responsible for the control of global resources, such as the management of the standby MDC and POOL mapping relationships, the monitoring of the standby MDCs' health status, MDC cluster expansion and reduction, and the creation and deletion of POOLs.
  • the standby MDC is responsible for services such as the computation, reading and writing of the metadata (such as views) of the POOLs that have a mapping relationship with the standby MDC, and POOL fault handling.
  • the metadata storage nodes in the ZK are divided into three types: public (Public) nodes, private (Private) nodes, and temporary (Ephemeral) nodes. The Public node and the Private node are data nodes used to store metadata.
  • the metadata stored in the Public node may be referred to as public data.
  • the metadata stored in the Private node may be referred to as private data.
  • the metadata under the Public node may include the resource pool mapping relationships, the MDC list, the client views, and the like; the metadata under the Public node can only be modified by the primary MDC, while the other MDCs can read it. The metadata under the Private node corresponds to a specific POOL, and the metadata corresponding to one POOL can only be read and modified by the MDC that has a mapping relationship with that POOL. The Ephemeral node stores the identification information of each MDC and can further include a primary temporary node and standby temporary nodes: the primary temporary node stores the identification information of the primary MDC, and each standby temporary node stores the identification information of one standby MDC. The primary MDC monitors the standby temporary nodes to determine whether the status of the standby MDCs is normal, and the standby MDCs monitor the primary temporary node to determine whether the status of the primary MDC is normal.
  • the identification information of an MDC may include an Internet Protocol ("IP") address and/or an identifier ("ID").
  • the metadata may be stored in the metadata database in the form of a multi-level directory, that is, the metadata may be stored in multiple levels of nodes.
  • for example, the public node may be used as a root node, with different types of metadata stored in different leaf nodes under the root node: one leaf node under the public node is used to store the resource pool mapping relationships, one leaf node is used to store the MDC list, and one leaf node is used to store the client views.
  • a private node may be used as a root node, with the metadata corresponding to each resource pool stored in one leaf node. As shown in FIG. 3, one leaf node under the private node is used to store the metadata corresponding to resource pool 0, and each next-level leaf node under that leaf node may store a different type of metadata; for example, one next-level leaf node stores the topology information of resource pool 0, and another next-level leaf node stores the view information of resource pool 0.
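As a concrete illustration of this layout, the following minimal sketch uses the kazoo ZooKeeper client for Python to build such a node tree; all paths and value formats here are illustrative assumptions, not naming taken from the patent.

```python
# A minimal sketch of the node layout described above, using the kazoo
# ZooKeeper client for Python. All paths and value formats here are
# illustrative assumptions, not naming taken from the patent.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Public node: one leaf each for the pool mapping, the MDC list,
# and the client views.
zk.ensure_path("/public/pool_mapping")
zk.ensure_path("/public/mdc_list")
zk.ensure_path("/public/client_view")

# Private node: one leaf per resource pool, with next-level leaves
# for the different metadata types (topology, views).
zk.ensure_path("/private/pool0/topology")
zk.ensure_path("/private/pool0/views")

# Ephemeral nodes: one per MDC, holding its identification information.
# An ephemeral znode vanishes when its owner's session dies, which is
# what lets the primary and standby MDCs monitor each other's liveness.
zk.create("/ephemeral/primary", b"mdc1:10.0.0.1", ephemeral=True, makepath=True)
zk.create("/ephemeral/standby_mdc2", b"mdc2:10.0.0.2", ephemeral=True, makepath=True)
```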
  • the client cluster provides distributed storage volume services externally and needs to interact with the MDC cluster and POOL.
  • the client finds the specific location of the data in the POOL through the view information in the MDC cluster, and then requests the POOL to complete the user's data read and write request.
  • a POOL is a collection of resources of object-based storage devices (OSD) and needs to interact with the MDC cluster and the Client cluster.
  • the distribution of data in POOL depends on the metadata view in the MDC cluster.
  • FIG. 4 shows a method for managing metadata in a distributed storage system according to an embodiment of the present invention.
  • the method 100 includes:
  • the first metadata controller MDC receives a mapping relationship query request sent by the client, where the mapping relationship query request is used to query an MDC that manages metadata of a resource pool corresponding to the user request;
  • the first MDC sends mapping relationship indication information to the client, where the mapping relationship indication information indicates that the MDC of the resource pool corresponding to the user request is the second MDC;
  • the client reads metadata of the resource pool corresponding to the user request from the second MDC.
  • specifically, a client in the client cluster receives a read/write request sent by the user and parses it. When the client cannot determine, from the locally stored partition view (Partition View), where the data requested by the read/write request is stored,
  • the client may request any MDC in the MDC cluster to query the mapping relationship, that is, to query the MDC corresponding to the resource pool in which the data requested by the read/write request resides.
  • the MDC receiving the query request can read the mapping from the metadata database and return it to the client.
  • after determining the MDC that manages the resource pool in which the requested data resides, the client requests the latest Partition View from that MDC, finds the specific location of the data in the resource pool, and then requests the resource pool to complete the read/write request on the user's data.
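This lookup flow can be illustrated with a small self-contained sketch; the dictionaries below stand in for the metabase content, and every name in it is invented for illustration.

```python
# A toy, self-contained simulation of the lookup flow in method 100.
# The dictionaries stand in for the metabase content; every name here
# is invented for illustration.
POOL_MAPPING = {"pool1": "mdc1", "pool2": "mdc2"}   # public node content
PARTITION_VIEWS = {                                  # private node content
    "pool1": {"keyA": "osd-3"},
    "pool2": {"keyB": "osd-7"},
}

def query_mapping(pool_id):
    # S110/S120: any MDC can answer, since it just reads the mapping
    # from the metabase and returns it.
    return POOL_MAPPING[pool_id]

def read_data(pool_id, key):
    owner_mdc = query_mapping(pool_id)
    view = PARTITION_VIEWS[pool_id]   # latest Partition View from owner_mdc
    location = view[key]              # S130: specific location in the pool
    return f"read {key} from {location} via {owner_mdc}"

print(read_data("pool1", "keyA"))     # -> read keyA from osd-3 via mdc1
```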
  • the primary MDC 121 is specifically configured to: when a resource pool needs to be created, determine among the multiple MDCs the home MDC of the resource pool to be created, and write the mapping relationship between the resource pool to be created and the home MDC of the resource pool to be created into the public node; the home MDC of the resource pool to be created is used to read, according to that mapping relationship stored in the public node, the topology information of the resource pool to be created from the private node.
  • the home MDC of the resource pool to be created reads the mapping relationships stored in the public node and, upon confirming by comparison with the previously read mapping relationships that a mapping relationship with the resource pool to be created should be established, reads the topology information of the resource pool to be created from the private node.
  • the primary MDC may determine that the resource pool needs to be created according to the requirements of the user when the system is initialized; or the primary MDC determines that the resource pool needs to be created when the user determines that the storage capacity of the existing resource pools cannot meet the user's service requirements; or
  • the primary MDC determines that the resource pool needs to be created upon receiving a create resource pool request sent by the user. The create resource pool request sent by the user carries the topology information of the resource pool to be created; upon receiving the request, the primary MDC writes the topology information into the Private node in the metadata database.
  • one of the standby MDCs in the non-faulty state may be determined as the home MDC of the resource pool to be created; preferably, the primary MDC determines the most lightly loaded of the non-faulty standby MDCs as the home MDC of the resource pool to be created. Moreover, when all the standby MDCs are in a fault state, the primary MDC can also determine itself as the home MDC of the resource pool that needs to be created.
  • when the primary MDC writes the mapping relationship between the resource pool to be created and the home MDC of the resource pool to be created into the public node, the primary MDC may rewrite all existing mappings to ensure data consistency.
  • the primary MDC modifies the content in the public node so as to record the mapping between the resource pool to be created and the home MDC of the resource pool to be created.
  • for example, suppose the following mapping relationships already exist in the public node: POOL1->MDC1, POOL2->MDC2; the newly created POOL is POOL3, and the home MDC of the resource pool to be created is MDC3, that is, the mapping relationship between POOL3 and MDC3 needs to be established.
  • when modifying the content in the public node, the primary MDC can add the mapping relationship between POOL3 and MDC3 by appending POOL3->MDC3 to the original content, or it can write the full mapping to the public node by writing POOL1->MDC1, POOL2->MDC2, and POOL3->MDC3, as sketched below.
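Both write strategies can be expressed as a short sketch; the one-pair-per-line serialization of the mapping node is an assumption made only for this illustration.

```python
# Sketch of the two write strategies described above for adding
# POOL3 -> MDC3 to the pool-mapping node. The one-pair-per-line
# serialization is an assumption made only for this illustration.
def add_mapping_append(existing: str, pool: str, mdc: str) -> str:
    # Strategy 1: append the new pair to the original content.
    return existing + f"\n{pool}->{mdc}"

def add_mapping_rewrite(mappings: dict, pool: str, mdc: str) -> str:
    # Strategy 2: rewrite the full mapping list, new pair included.
    mappings = dict(mappings, **{pool: mdc})
    return "\n".join(f"{p}->{m}" for p, m in mappings.items())

existing = "POOL1->MDC1\nPOOL2->MDC2"
print(add_mapping_append(existing, "POOL3", "MDC3"))
print(add_mapping_rewrite({"POOL1": "MDC1", "POOL2": "MDC2"}, "POOL3", "MDC3"))
```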
  • method 200 includes:
  • the mapping between the standby MDC and the resource pool is saved in the resource pool mapping node.
  • the resource pool mapping node may also maintain the state (whether it is in a fault state) of the MDC that has a mapping relationship with each resource pool.
  • the primary MDC modifies the content of the POOL Mapping node.
  • when determining that a new resource pool needs to be created, the primary MDC first selects a standby MDC as the home MDC of the resource pool to be created according to the method described above, creates a node corresponding to the resource pool to be created under the Private node in the ZK, and writes the topology information of the resource pool to be created into that node in the ZK.
  • suppose the POOL to be created is POOL3 and the home MDC of the POOL determined by the primary MDC is MDC3; the primary MDC therefore needs to write the topology information of POOL3 into the ZK and add the mapping relationship between POOL3 and MDC3 to the POOL Mapping node.
  • the ZK returns modification success information to the primary MDC.
  • the MDC3 receives the node event notification: when the content of the POOL Mapping node is modified, the ZK is triggered to send a node event notification to all standby MDCs.
  • the MDC3 After receiving the node event notification, the MDC3 obtains the content in the POOL Mapping node.
  • the MDC3 determines, according to the content of the POOL Mapping node, a mapping relationship between the newly added POOL3 and the MDC3.
  • the MDC3 After reading the contents of the POOL Mapping node, the MDC3 compares the contents read twice before and after, and determines whether to establish a mapping relationship with the POOL according to the contents read twice.
  • the MDC3 reads the related service data of the POOL3 from the ZK.
  • MDC3 When creating a POOL, MDC3 reads the topology information of POOL3 from ZK.
  • after acquiring the topology information, the home MDC of the resource pool to be created may generate the metadata corresponding to the resource pool to be created according to the acquired topology information; the home MDC of the resource pool to be created then writes the metadata information into the private node.
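A standby MDC's detection of a newly added mapping could look like the following sketch, which assumes the kazoo client and the illustrative paths used earlier: it watches the POOL Mapping node, diffs the content against the previous read, and loads the topology of any pool newly mapped to it.

```python
# Sketch, assuming kazoo and the illustrative paths above, of a standby
# MDC detecting a newly added mapping by diffing two reads of the
# POOL Mapping node.
from kazoo.client import KazooClient

MY_ID = "MDC3"  # illustrative
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

last_seen = set()

@zk.DataWatch("/public/pool_mapping")
def on_mapping_change(data, stat):
    global last_seen
    pairs = set(data.decode().splitlines()) if data else set()
    for entry in pairs - last_seen:            # compare the two reads
        pool, mdc = entry.split("->")
        if mdc == MY_ID:
            topo, _ = zk.get(f"/private/{pool.lower()}/topology")
            print(f"{MY_ID}: new mapping for {pool}, topology={topo!r}")
    last_seen = pairs
```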
  • FIG. 7 is a schematic flowchart of a method for creating a POOL by a home MDC according to an embodiment of the present invention. As shown in FIG. 7, the method 300 includes:
  • the home MDC generates data information according to the topology information.
  • the data information may include an object-based storage device ("OSD") view and a Partition View.
  • the home MDC sends the topology information, the data information, and the like to the ZK through an interface on the ZK.
  • the ZK After receiving the information sent by the home MDC, the ZK creates a node corresponding to the resource pool to be created, so that the home MDC writes topology information, data information, and the like into the node.
  • the ZK returns, to the home MDC, indication information that the writing is successful.
  • the home MDC determines that the POOL is successfully created and waits for the OSD to provide the service.
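The creation step can be sketched as follows; the derivation of the OSD view and Partition View from the topology is a toy stand-in, since the patent does not define the view formats.

```python
# Toy sketch of "the home MDC generates data information according to the
# topology information": derive an OSD view and a Partition View from a
# topology description and write them under the pool's private node. The
# view contents are invented; the patent does not define their formats.
import json

def create_pool(zk, pool, topology):
    osd_view = {osd: "UP" for osd in topology["osds"]}
    partition_view = {f"p{i}": osd
                      for i, osd in enumerate(topology["osds"])}
    base = f"/private/{pool}"
    zk.ensure_path(base)
    zk.create(f"{base}/topology", json.dumps(topology).encode())
    zk.create(f"{base}/osd_view", json.dumps(osd_view).encode())
    zk.create(f"{base}/partition_view", json.dumps(partition_view).encode())
    # The home MDC can now report success and wait for the OSDs to serve.
```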
  • the primary MDC 121 is further configured to: delete a mapping relationship between the first resource pool of the multiple resource pools stored in the public node and the first standby MDC in the standby MDC;
  • the first standby MDC is configured to stop management of metadata of the first resource pool when it is determined that the mapping relationship between the first resource pool and the first standby MDC is deleted.
  • when, owing to a service change, the user no longer needs a certain POOL, the POOL may be deleted: before deleting the POOL, the mapping relationship between the resource pool and the MDC is deleted first, and then the Private node corresponding to the POOL in the metadata database is deleted.
  • for example, the mapping relationship between the first resource pool and the first standby MDC may be deleted.
  • the method 400 includes:
  • the primary MDC modifies the content of the POOL Mapping node.
  • suppose the following mapping relationships exist in the POOL Mapping node: POOL1->MDC1, POOL2->MDC2, POOL3->MDC3, and the primary MDC deletes the mapping relationship between MDC2 and POOL2 stored in the POOL Mapping node.
  • when modifying the content of the POOL Mapping node, the primary MDC can delete the mapping between POOL2 and MDC2 by removing POOL2->MDC2 from the original content, or it can modify the content of the POOL Mapping node by rewriting only POOL1->MDC1 and POOL3->MDC3.
  • the ZK returns modification success information to the primary MDC.
  • the MDC2 receives the node event notification: when the content of the POOL Mapping node is modified, the ZK is triggered to send a node event notification to all standby MDCs.
  • after receiving the node event notification, the MDC2 obtains the content in the POOL Mapping node.
  • the ZK returns the content of the POOL Mapping node to the MDC2;
  • the MDC2 determines to delete the mapping relationship between the MDC2 and the POOL2 according to the content in the POOL Mapping node.
  • the MDC2 After reading the content of the POOL Mapping node, the MDC2 compares the content read twice before and after, and determines whether to delete the mapping according to the content read twice.
  • the MDC2 unloads the resources related to the POOL2 and stops managing the metadata corresponding to the POOL2.
  • the primary MDC confirms whether the MDC2 has unloaded the resources.
  • the primary MDC confirms that the mapping relationship between the MDC2 and the POOL2 is successfully deleted.
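The deletion side is the mirror image of the watch sketch above: an entry that disappears between two reads of the mapping node tells the affected MDC to unload the pool. A toy sketch:

```python
# Counterpart to the watch sketch above, for the deletion case: if an
# entry mapping a pool to this MDC disappears between two reads of the
# POOL Mapping node, the MDC unloads that pool's resources and stops
# managing its metadata. Names are illustrative.
def handle_removed_mappings(previous: set, current: set, my_id: str):
    for entry in previous - current:          # compare the two reads
        pool, mdc = entry.split("->")
        if mdc == my_id:
            print(f"{my_id}: mapping for {pool} deleted, unloading resources")

handle_removed_mappings({"POOL1->MDC1", "POOL2->MDC2"},
                        {"POOL1->MDC1"}, "MDC2")
```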
  • the primary MDC 121 is further configured to: when determining that the first standby MDC is faulty, determine a second standby MDC among the multiple MDCs and write the mapping relationship between the first resource pool and the second standby MDC into the public node;
  • the second standby MDC is configured to read metadata of the first resource pool from the private node according to the mapping relationship between the first resource pool and the second standby MDC stored in the public node.
  • the primary MDC may determine one of the standby MDCs whose load is less than the preset threshold as the second standby MDC; for example, the primary MDC may determine the most lightly loaded MDC as the second standby MDC.
  • the primary MDC can periodically query the standby temporary nodes created by the standby MDCs in the metadata database; when a standby temporary node is found to be invalid, the primary MDC determines that the standby MDC corresponding to that standby temporary node is faulty and sets the state of that standby MDC to the fault state.
  • alternatively, each standby MDC can periodically report its own health status to the primary MDC; when a standby MDC is faulty or fails to report, the primary MDC sets the state of that standby MDC to the fault state.
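A minimal sketch of this periodic fault check, assuming the kazoo client and the illustrative ephemeral paths used earlier:

```python
# Minimal sketch of the periodic fault check: a standby whose ephemeral
# node has vanished (its ZK session expired) is marked as faulty.
# Paths and state labels are illustrative assumptions.
def check_standbys(zk, standby_ids, state):
    for mdc_id in standby_ids:
        if zk.exists(f"/ephemeral/standby_{mdc_id}") is None:
            state[mdc_id] = "FAULT"
    return state
```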
  • method 500 includes:
  • when determining that the standby MDC2 is faulty, the primary MDC establishes a new mapping relationship for the POOL2 originally mapped to the MDC2.
  • the primary MDC selects the least-loaded MDC3 to establish a mapping relationship with POOL2, deletes the mapping relationship between MDC2 and POOL2 in the POOL Mapping node in the ZK, and adds the mapping relationship between MDC3 and POOL2 to the POOL Mapping node.
  • S503 The MDC3 receives the POOL Mapping node change event notification.
  • S504 The MDC3 reads the content of the POOL Mapping node.
  • the ZK returns the content of the POOL Mapping node to the MDC3.
  • the MDC3 determines, according to the content of the POOL Mapping node, the newly added mapping relationship between the MDC3 and the POOL2.
  • the MDC3 reads the related service data of the POOL2 from the ZK;
  • MDC3 determines that the mapping relationship with POOL2 is successfully established.
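The remapping decision itself can be sketched as a small pure function; the load values are illustrative pool counts per MDC.

```python
# Toy sketch of the failover remapping in method 500: pick the
# least-loaded non-faulty standby MDC and move the failed MDC's pool
# to it. Loads here are illustrative pool counts per MDC.
def remap_pool(pool, failed_mdc, mapping, loads, faulty):
    candidates = {m: l for m, l in loads.items()
                  if m not in faulty and m != failed_mdc}
    new_mdc = min(candidates, key=candidates.get)   # lightest load
    mapping[pool] = new_mdc                          # rewrite public node
    return new_mdc

mapping = {"POOL1": "MDC1", "POOL2": "MDC2"}
print(remap_pool("POOL2", "MDC2", mapping, {"MDC1": 1, "MDC3": 0}, {"MDC2"}))
# -> MDC3; mapping is now {"POOL1": "MDC1", "POOL2": "MDC3"}
```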
  • the primary MDC 121 is further configured to: receive a create MDC request sent by the user, and write the identification information of the MDC requested to be created into the public node.
  • the identification information of the MDC may be an IP address or a port number of the MDC, but the present invention is not limited thereto.
  • the distributed storage system of the embodiment of the present invention can dynamically increase or decrease the number of resource pools, and can also perform on-line smooth expansion and reduction of the MDC, thereby improving the processing capability and reliability of the system.
  • method 600 includes:
  • S601 The user sends a request for adding an MDC to the primary MDC.
  • the primary MDC verifies that the IP address, port, and other information of the MDC3 requested to be added are legal.
  • the primary MDC updates the MDC list and adds monitoring of the MDC3.
  • the operation result is returned to the user.
  • the MDC3 creates a temporary node in the ZK.
  • the MDC3 initializes, loads the necessary data, and starts normal work.
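A sketch of the admission and registration steps in method 600 follows, assuming kazoo and the illustrative paths used earlier; the JSON encoding of the MDC list is an assumption.

```python
# Sketch of the admission and registration steps in method 600,
# assuming kazoo and the illustrative paths used earlier; the JSON
# encoding of the MDC list is an assumption.
import json

def admit_mdc(zk, mdc_id, ip, port):
    # Primary MDC: after validation, record the new MDC in the public list.
    data, _ = zk.get("/public/mdc_list")
    mdcs = json.loads(data) if data else []
    mdcs.append({"id": mdc_id, "ip": ip, "port": port})
    zk.set("/public/mdc_list", json.dumps(mdcs).encode())

def register_self(zk, mdc_id, ip):
    # New MDC: create its own ephemeral node, then start normal work.
    zk.create(f"/ephemeral/standby_{mdc_id}", ip.encode(),
              ephemeral=True, makepath=True)
```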
  • the standby MDCs among the multiple MDCs are further used to initiate a contention-for-primary process when determining that the primary MDC is faulty;
  • one standby MDC among the multiple MDCs acts as the new primary MDC, used to load the metadata in the public node.
  • the standby MDC that first creates the primary temporary node in the metadata database can be determined as the new primary MDC. Alternatively, after a standby MDC creates the primary temporary node, it can determine the load of the other standby MDCs from the mapping relationships between MDCs and POOLs stored in the metadata database; if it determines that its own load is higher than the load of some of the other standby MDCs, it can invalidate the primary temporary node it created, so that another standby MDC can create the primary temporary node in the metadata database. In this way, the most lightly loaded standby MDC can be promoted to the new primary MDC, which provides services after loading the metadata of the public node in the metadata database, as sketched below.
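The race itself reduces to attempting to create the primary ephemeral node; the creator wins. A minimal sketch assuming kazoo (the optional load-based yielding described above is omitted for brevity):

```python
# Minimal sketch of the contention-for-primary race, assuming kazoo:
# each standby tries to create the primary ephemeral node, and the
# creator wins. Load-based yielding is omitted for brevity.
from kazoo.exceptions import NodeExistsError

def try_become_primary(zk, my_id):
    try:
        zk.create("/ephemeral/primary", my_id.encode(),
                  ephemeral=True, makepath=True)
        return True    # this MDC is the new primary; load public metadata
    except NodeExistsError:
        return False   # another standby won the race
```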
  • Method 700 includes:
  • the primary MDC registers the primary temporary node in the ZK.
  • the standby MDC monitors the primary temporary node.
  • the failure of the primary MDC causes the primary temporary node to become invalid.
  • the standby MDC receives the primary temporary node change notification.
  • the standby MDC3 is promoted to the new primary MDC.
  • the primary MDC 121 is further configured to: determine that the view of a client has changed; and update the metadata, stored in the public node, corresponding to the view of that client.
  • each client can periodically report a heartbeat to the primary MDC.
  • if the primary MDC does not receive the heartbeat sent by a client within a certain period of time, it can confirm that the view of that client has changed; further, the primary MDC modifies the metadata, stored in the metadata database, corresponding to that client's view.
  • method 800 includes:
  • the primary MDC confirms that the client view (Client View) has changed.
  • the primary MDC modifies data in the Client View node in the ZK.
  • FIG. 12 is only an example of the Client View.
  • changes to other public data can likewise only be processed by the primary MDC, so write conflicts do not occur; the processing flow for other public data is similar to the method 800.
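The heartbeat timeout check behind this flow can be sketched as follows; the timeout value and data structure are illustrative.

```python
# Toy sketch of the heartbeat-based view change detection: a client whose
# last heartbeat is older than the timeout is treated as having a changed
# view, so the primary rewrites its Client View metadata. The timeout and
# data structure are illustrative.
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative

def expired_clients(last_heartbeat, now=None):
    now = time.time() if now is None else now
    return [c for c, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

print(expired_clients({"client1": 0.0, "client2": 95.0}, now=100.0))
# -> ['client1']
```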
  • the standby MDC among the multiple MDCs is further configured to: when determining that the state of a resource pool that has a mapping relationship with the standby MDC has changed, update the metadata of the resource pool that has a mapping relationship with the standby MDC.
  • the state change of the resource pool includes changes in the OSD view and changes in the Partition View.
  • method 900 includes:
  • S901 The home MDC determines that the OSD view is changed.
  • the home MDC modifies data of the OSD View node in the ZK.
  • the home MDC returns a processing result to the OSD.
  • FIG. 13 is only an example of the OSD View.
  • changes to other private metadata of a resource pool can only be processed by the home MDC that has a mapping relationship with that resource pool.
  • the processing flow for other private data is similar to the method 900. Since the private metadata of a resource pool can only be processed by the MDC that has a mapping relationship with that resource pool, multiple POOL services can be processed concurrently at the service level without conflicting with each other.
  • the distributed storage system and the method for managing metadata according to the embodiments of the present invention can manage a larger-scale storage cluster and can implement fault domain isolation.
  • FIG. 14 shows a data service device 100 in accordance with an embodiment of the present invention, including a plurality of processors 101, a memory 102, and a bus system 103; the processors 101 and the memory 102 are connected by the bus system 103.
  • the memory 102 is configured to store metadata corresponding to a plurality of resource pools
  • a main processor of the plurality of processors 101 is configured to manage the mapping relationships, stored in the memory 102, between the standby processors of the plurality of processors and the resource pools; the standby processors of the plurality of processors are configured to manage the metadata corresponding to the resource pools in the memory that are mapped to the standby processors.
  • a first processor of the plurality of processors 101 is configured to receive a mapping relationship query request sent by a client, where the mapping relationship query request is used to query which processor manages the metadata of the resource pool corresponding to the user request;
  • the first processor is further configured to send mapping relationship indication information to the client, where the mapping relationship indication information indicates that the processor that manages the resource pool corresponding to the user request is a second processor, so that the client reads the metadata of the resource pool corresponding to the user request from the second processor.
  • the data service device of the embodiment of the present invention can manage a larger-scale storage cluster and can implement fault domain isolation.
  • the processor 101 may be a central processing unit (CPU), or the processor 101 may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 102 can include read only memory and random access memory and provides instructions and data to the processor 101.
  • a portion of the memory 102 may also include a non-volatile random access memory.
  • the memory 102 can also store information of the device type.
  • the bus system 103 may include a power bus, a control bus, a status signal bus, and the like in addition to the data bus. However, for clarity of description, various buses are labeled as the bus system 103 in the figure.
  • each step of the above method may be completed by an integrated logic circuit of hardware in the processor 101 or by instructions in the form of software.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented as a hardware processor, or may be performed by a combination of hardware and software modules in the processor.
  • the software modules can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory 102, and the processor 101 reads the metadata in the memory 102 and performs the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • the metadata storage nodes in the memory 102 include a public node, a private node, and a temporary node; the metadata stored in the public node is modified by the main processor; the private node stores the metadata corresponding to each of the plurality of resource pools, and the metadata corresponding to each resource pool is read and modified by the standby processor that manages that resource pool; the temporary node stores the identification information of each of the plurality of processors.
  • the main processor is specifically configured to: when a resource pool needs to be created, determine, among the multiple processors, the home processor of the resource pool to be created, and write the mapping relationship between the resource pool to be created and the home processor of the resource pool to be created into the memory;
  • the home processor of the resource pool to be created is used to read, according to the mapping relationship, stored in the memory, between the resource pool to be created and its home processor, the topology information of the resource pool to be created from the private node.
  • the main processor is further configured to: receive a create resource pool request sent by the user, where the create resource pool request carries the topology information; and write the topology information into the memory.
  • the home processor of the resource pool that needs to be created is further configured to: generate, according to the topology information stored in the memory, the metadata corresponding to the resource pool that needs to be created; and write that metadata into the memory.
  • the main processor is further configured to: delete a mapping relationship between a first resource pool of the multiple resource pools stored in the memory and a first standby processor in the standby processor;
  • the first standby processor is configured to stop management of the metadata corresponding to the first resource pool when the mapping relationship between the first resource pool and the first standby processor is deleted.
  • the main processor is further configured to: when determining that the first standby processor is faulty, determine a second standby processor among the multiple processors, and the first resource pool and the The mapping relationship of the second standby processor is written in the memory; the second standby processor is configured to read from the memory according to the mapping relationship between the first resource pool and the second standby processor stored in the memory Take the metadata of the first resource pool.
  • the main processor is further configured to: receive a create processor request sent by the user; and write the identification information of the processor requested to be created into the memory.
  • the standby processors of the multiple processors are further configured to initiate a contention-for-primary process when determining that the main processor is faulty; one of the multiple processors then acts as a new main processor for loading the metadata in the memory.
  • the standby processor of the multiple processors is further configured to: when the state of a resource pool that has a mapping relationship with the standby processor changes, update the metadata of the resource pool that has a mapping relationship with the standby processor.
  • the main processor is configured to manage the mapping relationships between the processors and the resource pools, and the standby processors are configured to manage the metadata corresponding to the resource pools that are stored in the memory and have a mapping relationship with the standby processors. Therefore, each processor can participate in the management of metadata, a larger storage cluster can be managed, and each processor manages only the metadata of its corresponding resource pools, which can implement fault domain isolation.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solution of the present invention, or the part that is essential or contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a distributed storage system and a method for managing metadata. The distributed storage system includes a metadata database, multiple metadata controllers (MDCs), and multiple resource pools. The metadata database is configured to store metadata corresponding to the multiple resource pools; a primary MDC among the multiple MDCs is configured to manage the mapping relationships, stored in the metadata database, between the standby MDCs among the multiple MDCs and the resource pools; a standby MDC among the multiple MDCs is configured to manage the metadata, stored in the metadata database, corresponding to the resource pools that have a mapping relationship with that standby MDC. Therefore, the distributed storage system of the embodiments of the present invention can manage a larger-scale storage cluster and implement fault domain isolation.

Description

Distributed storage system and method for managing metadata
Technical Field
Embodiments of the present invention relate to the field of computers and, more particularly, to a distributed storage system and a method for managing metadata.
Background
The basic architecture of a typical distributed storage system includes a ZooKeeper (ZK) cluster, a metadata controller (Metadata Controller, "MDC") cluster, resource pools (Pools), and a client (Client) cluster. The MDC cluster is deployed in a one-primary, multiple-standby mode, and the primary MDC is responsible for services such as metadata computation, reading and writing, and Pool fault handling. The metadata storage nodes in the ZK cluster are divided into data (Data) nodes and temporary (Ephemeral) nodes. The data under the Data nodes is modified by the primary MDC and can be read by the other MDCs. The Ephemeral nodes include a primary temporary node and standby temporary nodes: the primary temporary node stores the identification information of the primary MDC, and each standby temporary node stores the identification information of one standby MDC. The primary MDC monitors the standby temporary nodes to determine whether the status of the standby MDCs is normal, and the standby MDCs monitor the primary temporary node to determine whether the status of the primary MDC is normal. Once the primary MDC fails, all standby MDCs receive a ZK event notification and enter a contention-for-primary process; the new primary MDC thus produced reads the metadata from the ZK cluster and, after completing initialization, provides services externally.
Because all metadata reads and writes are handled by the primary MDC, only resource pool services of a certain scale can be supported, and expansion in the resource pool dimension is not supported. The standby MDCs do not process services, which wastes system resources. During the period after the primary MDC fails and before the new primary MDC provides services, the services of the entire distributed storage system are affected. Moreover, the existing distributed storage system cannot support dynamic expansion and reduction of the metadata control cluster.
Summary
The present invention provides a distributed storage system and a method for managing metadata, which can manage a larger-scale storage cluster and can implement fault domain isolation.
According to a first aspect, a distributed storage system is provided, including a metadata database, multiple metadata controllers (MDCs), and multiple resource pools. The metadata database is configured to store metadata corresponding to the multiple resource pools; a primary MDC among the multiple MDCs is configured to manage the mapping relationships, stored in the metadata database, between the standby MDCs among the multiple MDCs and the resource pools; and a standby MDC among the multiple MDCs is configured to manage the metadata, stored in the metadata database, corresponding to the resource pools that have a mapping relationship with that standby MDC.
In the distributed storage system of the embodiments of the present invention, the primary MDC is used to manage the mapping relationships between MDCs and resource pools, and a standby MDC is used to manage the metadata, stored in the metadata database, corresponding to the resource pools that have a mapping relationship with that standby MDC, including services such as the computation, reading and writing of the metadata of those resource pools and resource pool fault handling. The distributed storage system can therefore manage a larger-scale storage cluster and implement fault domain isolation.
With reference to the first aspect, in a first possible implementation of the first aspect, the metadata storage nodes in the metadata database include a public node, private nodes, and temporary nodes, where the metadata stored in the public node is modified by the primary MDC; the private nodes store the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that resource pool; and the temporary nodes store the identification information of each of the multiple MDCs.
In other words, different types of metadata storage nodes can be created in the metadata database to implement the primary MDC's management of the public metadata, each standby MDC's management of the metadata of the resource pools mapped to it, and mutual state monitoring between the primary MDC and the standby MDCs. Thus, each of the multiple MDCs can participate in metadata management, a larger storage cluster can be managed, and because each MDC manages only the metadata of its corresponding resource pools, fault domain isolation can be achieved.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the primary MDC is specifically configured to: when a resource pool needs to be created, determine, among the multiple MDCs, the home MDC of the resource pool to be created, and write the mapping relationship between the resource pool to be created and its home MDC into the public node; the home MDC of the resource pool to be created is configured to read the topology information of the resource pool to be created from the private node, according to the mapping relationship, stored in the public node, between the resource pool to be created and its home MDC.
Optionally, the metadata database may store metadata in the form of a multi-level directory, that is, metadata may be stored in multiple levels of nodes. For example, the public node may serve as a root node, with different types of metadata stored in different leaf nodes under the root node; a private node may serve as a root node, with the metadata corresponding to each resource pool stored in one leaf node.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the primary MDC is further configured to: receive a create resource pool request sent by a user, where the create resource pool request carries the topology information; and write the topology information into the private node.
With reference to the second or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the home MDC of the resource pool to be created is further configured to: generate, according to the topology information stored in the private node, the metadata corresponding to the resource pool to be created; and write that metadata into the private node.
In this way, the number of resource pools can be increased according to the user's service needs, improving the storage capacity of the system and better meeting the user's service requirements.
With reference to any of the foregoing possible implementations, in a fifth possible implementation of the first aspect, the primary MDC is further configured to delete the mapping relationship, stored in the public node, between a first resource pool among the multiple resource pools and a first standby MDC among the standby MDCs; the first standby MDC is configured to stop managing the metadata corresponding to the first resource pool when it determines that the mapping relationship between the first resource pool and the first standby MDC has been deleted.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the primary MDC is specifically configured to delete the mapping relationship between the first resource pool and the first standby MDC when determining that the first standby MDC is faulty.
When, owing to a service change, the user no longer needs a resource pool, the mapping relationship between that resource pool and its standby MDC can be deleted; the standby MDC that had the mapping relationship then stops managing the metadata corresponding to that resource pool and can, when needed, manage the metadata of other resource pools, thereby improving the utilization of system resources.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the primary MDC is further configured to: when determining that the first standby MDC is faulty, determine a second standby MDC among the multiple MDCs and write the mapping relationship between the first resource pool and the second standby MDC into the public node; the second standby MDC is configured to read the metadata of the first resource pool from the private node according to the mapping relationship, stored in the public node, between the first resource pool and the second standby MDC.
With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the primary MDC is specifically configured to determine, as the second standby MDC, one of the standby MDCs among the multiple MDCs whose load is less than a preset threshold.
In the embodiments of the present invention, when a standby MDC fails, another standby MDC can be newly designated to manage the metadata corresponding to the resource pools managed by the failed standby MDC, thereby improving the reliability of the system.
With reference to any possible implementation of the first aspect, in a ninth possible implementation of the first aspect, the primary MDC is further configured to: receive a create MDC request sent by a user; and write the identification information of the MDC requested to be created into the public node.
Furthermore, the distributed storage system of the embodiments of the present invention can add MDCs online at the user's request, to improve the processing capability and reliability of the system.
With reference to any possible implementation of the first aspect, in a tenth possible implementation of the first aspect, the standby MDCs among the multiple MDCs are further configured to initiate a contention-for-primary process when determining that the primary MDC is faulty; one standby MDC among the multiple MDCs acts as the new primary MDC and loads the metadata in the public node.
With reference to any possible implementation of the first aspect, in an eleventh possible implementation of the first aspect, the primary MDC is further configured to: determine that the view state of a client has changed; and update the metadata in the public node corresponding to that client's view.
With reference to any possible implementation of the first aspect, in a twelfth possible implementation of the first aspect, a standby MDC among the multiple MDCs is further configured to: when determining that the state of a resource pool that has a mapping relationship with the standby MDC has changed, update the metadata of the resource pool that has a mapping relationship with the standby MDC.
According to a second aspect, a method for managing metadata in a distributed storage system is provided, including: a first metadata controller (MDC) receives a mapping relationship query request sent by a client, where the mapping relationship query request is used to query which MDC manages the metadata of the resource pool corresponding to a user request; the first MDC sends mapping relationship indication information to the client, where the mapping relationship indication information indicates that the MDC managing the resource pool corresponding to the user request is a second MDC; and the client reads the metadata of the resource pool corresponding to the user request from the second MDC.
With reference to the second aspect, in a first possible implementation of the second aspect, before the first MDC sends the mapping relationship indication information to the client, the method further includes: the first MDC reads a mapping relationship list stored in the metadata database, where the metadata storage nodes in the metadata database include a public node, private nodes, and temporary nodes; the metadata stored in the public node is modified by the primary MDC; the private nodes store the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that resource pool; and the temporary nodes store the identification information of each of the multiple MDCs. The first MDC determines, according to the mapping relationship list, the second MDC as the MDC that manages the resource pool corresponding to the user request.
In the embodiments of the present invention, the primary MDC manages the public metadata, a standby MDC manages the metadata of the resource pools that have a mapping relationship with it, and the temporary nodes store the identification information of each MDC so that the primary MDC and the standby MDCs monitor each other's state. Since every MDC can participate in metadata management, a larger storage cluster can be managed, and because each MDC manages only the metadata of its corresponding resource pools, fault domain isolation can be achieved.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect: the primary MDC determines that a resource pool needs to be created; the primary MDC determines the home MDC of the resource pool to be created; the primary MDC writes the mapping relationship between the resource pool to be created and its home MDC into the public node; and the home MDC of the resource pool to be created reads the topology information of the resource pool to be created from the private node, according to the mapping relationship, stored in the public node, between the resource pool to be created and its home MDC.
In the embodiments of the present invention, the primary MDC may determine, at system initialization and according to the user's needs, that a resource pool needs to be created; or the primary MDC may determine that a resource pool needs to be created when the user determines that the storage capacity of the existing resource pools cannot meet the user's service needs; or the primary MDC may determine that a resource pool needs to be created upon receiving a create resource pool request sent by the user, where the create resource pool request carries the topology information of the resource pool to be created, and upon receiving the request the primary MDC writes that topology information under the private (Private) node in the metadata database.
结合第二方面的第二种可能的实现方式,在第二方面的第三种可能的实现方式中,该方法还包括:该主MDC接收用户发送的创建资源池请求,该创建资源池请求携带该拓扑信息;该主MDC将该拓扑信息写入该私有节点中。
结合第二方面的第三种可能的实现方式,在第二方面的第四种可能的实现方式中,该方法还包括:该需要创建的资源池的归属MDC根据该拓扑信息,生成与该需要创建的资源池相对应的元数据;该需要创建的资源池的归属MDC将该与该需要创建的资源池相对应的元数据写入该私有节点中。
结合上述任一可能的实现方式,在第二方面的第五种可能的实现方式中,该方法还包括:该主MDC删除该公有节点中存储的第一资源池与第一备MDC的映射关系;该第一备MDC在确定该第一资源池与该第一备MDC的映射关系被删除时,停止对该第一资源池对应的元数据的管理。
结合第二方面的第五种可能的实现方式,在第二方面的第六种可能的实现方式中,该方法还包括:该主MDC在确定该第一备MDC故障时,确定 第二备MDC;该主MDC将该第一资源池与该第二备MDC的映射关系写入该公有节点中;该第二备MDC根据该公有节点中存储的该第一资源池与该第二备MDC的映射关系,从该私有节点中读取该第一资源池的元数据。
结合第二方面的上述任一可能的实现方式,在第二方面的第七种可能的实现方式中,该方法还包括:该主MDC接收用户发送的创建MDC请求;该主MDC将该创建MDC请求所请求创建的MDC的相关信息写入该公有节点中。
结合第二方面的上述任一可能的实现方式,在第二方面的第八种可能的实现方式中,该方法还包括:备MDC在确定主MDC故障时,发起竞争主流程;该备MDC中的一个备MDC作为新的主MDC,加载该公有节点中的元数据。
结合第二方面的上述任一可能的实现方式,在第二方面的第九种可能的实现方式中,该方法还包括:备MDC在确定与该备MDC具有映射关系的资源池的状态发生变化时,更新与该备MDC具有映射关系的资源池的元数据。
According to a third aspect, a metadata controller MDC is provided, configured to perform the method in the second aspect or any possible implementation of the second aspect; specifically, the MDC includes units configured to perform that method.
According to a fourth aspect, a data service device is provided, including a processor and a memory connected by a bus system, where the memory is configured to store metadata and the processor is configured to manage the metadata stored in the memory, so that the data service device performs the method in the second aspect or any possible implementation of the second aspect.
According to a fifth aspect, a computer-readable medium is provided, configured to store a computer program, where the computer program includes instructions for performing the method in the second aspect or any possible implementation of the second aspect.
BRIEF DESCRIPTION OF DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the embodiments. Apparently, the drawings described below show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
FIG. 1 is an architectural diagram of a distributed storage system according to an embodiment of the present invention;
FIG. 2 is an architectural diagram of a distributed storage system according to a specific embodiment of the present invention;
FIG. 3 is a schematic diagram of how the nodes in the metadata database are divided according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a method for managing metadata in a distributed storage system according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a method for establishing a mapping between a POOL and an MDC according to a specific embodiment of the present invention;
FIG. 6 is a schematic diagram of a mapping according to an embodiment of the present invention;
FIG. 7 is a schematic flowchart of a method in which an owning MDC creates a POOL according to an embodiment of the present invention;
FIG. 8 is a schematic flowchart of a method for deleting a mapping according to a specific embodiment of the present invention;
FIG. 9 is a schematic flowchart of a method for re-establishing a mapping between an MDC and a POOL according to a specific embodiment of the present invention;
FIG. 10 is a schematic flowchart of a method for adding an MDC according to a specific embodiment of the present invention;
FIG. 11 is a schematic flowchart of a method for selecting a new master MDC according to a specific embodiment of the present invention;
FIG. 12 is a schematic flowchart of a method in which the master MDC handles a client view change according to a specific embodiment of the present invention;
FIG. 13 is a schematic flowchart of a method in which an owning MDC handles a resource pool state change according to a specific embodiment of the present invention;
FIG. 14 is a schematic block diagram of a data service device according to an embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a schematic architectural diagram of a distributed storage system according to an embodiment of the present invention. As shown in FIG. 1, the distributed storage system 10 includes a metadata database 11, multiple metadata controllers (Metadata Controller, "MDC") 12, and multiple resource pools 13.
The metadata database 11 is configured to store metadata corresponding to the multiple resource pools 13.
The master MDC 121 among the multiple MDCs is configured to manage the mappings, stored in the metadata database 11, between the standby MDCs 122 among the multiple MDCs and the resource pools.
A standby MDC 122 among the multiple MDCs is configured to manage the metadata, stored in the metadata database, corresponding to the resource pool mapped to the standby MDC 122.
In this embodiment of the present invention, optionally, the metadata database may be a distributed, open-source coordination service for distributed applications, such as ZooKeeper ("ZK") or Google Chubby. The multiple MDCs may form an MDC cluster, in which the standby MDCs are in one-to-one correspondence with the resource pools.
In general, the distributed storage system of this embodiment of the present invention may be the one shown in FIG. 2. As shown in FIG. 2, the distributed storage system includes a ZooKeeper cluster, an MDC cluster, multiple resource pools (POOLs), and a client (Client) cluster. The MDC cluster includes one master MDC and multiple standby MDCs. The master MDC is responsible for global resource control: managing the mappings between standby MDCs and POOLs, monitoring the health of the standby MDCs, expanding and shrinking the MDC cluster, and creating and deleting POOLs. A standby MDC is responsible for the services of the POOLs mapped to it, such as computing, reading, and writing their metadata (for example, views) and handling POOL faults.
As shown in FIG. 3, the metadata storage nodes in ZK are divided into three types: public (Public) nodes, private (Private) nodes, and temporary (Ephemeral) nodes. The Public and Private nodes are data nodes used to store metadata; the metadata stored in a Public node may be called public data, and the metadata stored in a Private node may be called private data. In general, the metadata under the Public node may include the resource pool mappings, the MDC list, the client views, and so on, and may be modified only by the master MDC, while the other MDCs can read it. The metadata under the Private node corresponds to specific POOLs, and the metadata corresponding to a POOL can be read and modified only by the MDC mapped to that POOL. The Ephemeral nodes store the identification information of each MDC and may further include a master temporary node and standby temporary nodes: the master temporary node stores the identification information of the master MDC, and each standby temporary node stores the identification information of one standby MDC. The master MDC monitors the standby temporary nodes to determine whether the standby MDCs are in a normal state, and the standby MDCs monitor the master temporary node to determine whether the master MDC is in a normal state. The identification information of an MDC may include the MDC's Internet Protocol ("IP") address and/or identity (Identification, "ID").
Optionally, as an example shown in FIG. 3, the metadata database may store metadata in the form of a multi-level directory, that is, in multi-level nodes. For example, the public node may serve as a root node, with different types of metadata stored in different leaf nodes under it: in FIG. 3, one leaf node under the public node stores the resource pool mappings, one stores the MDC list, and one stores the client views. Likewise, the private node may serve as a root node, with the metadata corresponding to each resource pool stored in its own leaf node: in FIG. 3, one leaf node under the private node stores the metadata corresponding to resource pool 0, and each of its lower-level leaf nodes may store a different type of metadata, for example, one storing the topology information of resource pool 0 and another storing its view information.
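To make this layout concrete, the following is a minimal sketch of how such a node tree might be created, assuming a ZooKeeper-backed metadata database and the kazoo Python client; the address and all path names are illustrative choices, not taken from the patent:

```python
from kazoo.client import KazooClient

# Connect to the ZK ensemble backing the metadata database
# (the address is a placeholder).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Public subtree: one leaf per type of public metadata,
# writable only by the master MDC.
zk.ensure_path("/public/pool_mapping")
zk.ensure_path("/public/mdc_list")
zk.ensure_path("/public/client_view")

# Private subtree: one leaf per resource pool, with lower-level
# leaves for the pool's different metadata types.
zk.ensure_path("/private/pool0/topology")
zk.ensure_path("/private/pool0/view")

# Ephemeral subtree: one ephemeral znode per live MDC, which ZK
# removes automatically when that MDC's session ends.
zk.create("/ephemeral/mdc1", b"10.0.0.1:6666",
          ephemeral=True, makepath=True)
```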
The Client cluster provides the distributed storage volume service externally and needs to interact with the MDC cluster and the POOLs. A Client uses the view information held by the MDC cluster to find the specific location of data in a POOL, and then requests the POOL to complete the user's data read/write request. A POOL is a resource collection of object-based storage devices (Object-based Storage Device, "OSD") and needs to interact with the MDC cluster and the Client cluster. The distribution of data in a POOL depends on the metadata views in the MDC cluster.
Correspondingly, FIG. 4 shows a method for managing metadata in a distributed storage system according to an embodiment of the present invention. As shown in FIG. 4, the method 100 includes:
S110: A first metadata controller MDC receives a mapping query request sent by a client, where the mapping query request is used to query the MDC that manages the metadata of the resource pool corresponding to a user request.
S120: The first MDC sends mapping indication information to the client, where the mapping indication information indicates that the MDC managing the resource pool corresponding to the user request is a second MDC.
S130: The client reads, from the second MDC, the metadata of the resource pool corresponding to the user request.
Specifically, a client in the client cluster receives a read/write request sent by a user and parses it. When the client cannot determine, from its locally stored partition view (Partition View), the specific storage location of the requested data, it may ask any MDC in the MDC cluster to query the mapping, that is, to query which MDC corresponds to the resource pool where the requested data resides. The MDC that receives the query reads the mapping from the metadata database and returns it to the client. After determining the MDC corresponding to that resource pool, the client requests the latest Partition View from that MDC, finds the specific location of the data in the resource pool, and then requests the resource pool to complete the user's data read/write request.
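A rough sketch of this client-side lookup follows; the helper names (parse_pool, query_mapping, get_partition_view, locate, read, write) are hypothetical stand-ins for the RPC layer, which the patent leaves unspecified:

```python
def handle_user_request(client, any_mdc, request):
    # Determine which pool the requested data lives in; when the
    # local Partition View cannot resolve it, ask any MDC for the
    # mapping (the mapping query request of S110).
    pool_id = client.parse_pool(request)
    owner_mdc = any_mdc.query_mapping(pool_id)

    # Fetch the latest Partition View from the owning MDC, locate
    # the data inside the pool, then complete the read/write there.
    partition_view = owner_mdc.get_partition_view(pool_id)
    osd = partition_view.locate(request.key)
    if request.is_read:
        return osd.read(request.key)
    return osd.write(request.key, request.value)
```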
In this embodiment of the present invention, optionally, the master MDC 121 is specifically configured to: when a resource pool needs to be created, determine, among the multiple MDCs, an owning MDC for the resource pool to be created, and write the mapping between the resource pool to be created and its owning MDC into the public node; and the owning MDC is configured to read the topology information of the pool from the private node according to that mapping stored in the public node. The owning MDC reads the mappings stored in the public node and compares the mappings it currently reads with those it read previously; when it confirms that a mapping with the resource pool to be created needs to be established, it reads the topology information of that pool from the private node.
In this embodiment of the present invention, when a resource pool is added, a mapping between the resource pool and an MDC must first be established. Optionally, the master MDC may determine at system initialization, according to user requirements, that a resource pool needs to be created; or it may do so when the user determines that the storage capacity of the existing pools cannot meet the user's service needs; or it may do so upon receiving a resource pool creation request sent by the user, where the request carries the topology information of the pool to be created, and on receiving the request the master MDC writes that topology information under the Private node in the metadata database.
Optionally, when determining, among the multiple MDCs, the owning MDC of the resource pool to be created, the master MDC may designate one of the standby MDCs that are in a non-faulty state as the owning MDC; preferably, it designates the most lightly loaded non-faulty standby MDC. Moreover, when all standby MDCs are in a faulty state, the master MDC may designate itself as the owning MDC of the resource pool to be created.
Optionally, when the master MDC writes the mapping between the resource pool to be created and its owning MDC into the public node, it may, to guarantee data consistency, modify the content of the public node by rewriting all existing mappings together with the mapping between the resource pool to be created and its owning MDC.
For example, suppose the public node already contains the mappings POOL1->MDC1 and POOL2->MDC2, the POOL to be newly created is POOL3, and its owning MDC is MDC3; that is, a mapping between POOL3 and MDC3 needs to be established. When modifying the content of the public node, the master MDC may either append POOL3->MDC3 to the existing content, or write POOL1->MDC1, POOL2->MDC2, and POOL3->MDC3 in full to put the POOL3-to-MDC3 mapping into the public node.
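The rewrite-everything variant maps naturally onto an ordinary compare-and-set on the mapping node. A sketch, again assuming kazoo and a hypothetical JSON encoding of the mapping list:

```python
import json

def add_mapping(zk, pool, mdc, path="/public/pool_mapping"):
    # Read the current mappings together with the znode version.
    data, stat = zk.get(path)
    mappings = json.loads(data) if data else {}  # e.g. {"POOL1": "MDC1", ...}
    mappings[pool] = mdc                         # add POOL3 -> MDC3

    # Write the full list back; the version check makes the update
    # fail with BadVersionError if the node was modified concurrently,
    # which is what keeps the rewrite consistent.
    zk.set(path, json.dumps(mappings).encode(), version=stat.version)
```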
The following describes in detail, with reference to FIG. 5, a method for establishing a mapping between a POOL and an MDC according to a specific embodiment of the present invention. As shown in FIG. 5, the method 200 includes:
S201: All standby MDCs monitor the resource pool mapping (POOL Mapping) node in ZK.
The POOL Mapping node holds the mappings between all standby MDCs and resource pools; optionally, it may also hold the state (faulty or not) of each MDC that is mapped to a resource pool.
S202: The master MDC modifies the content of the POOL Mapping node.
When the master MDC determines that a new resource pool needs to be created, it first selects, by the method described above, a standby MDC as the owning MDC of the pool, creates under the Private node in ZK a node corresponding to the pool, and writes the pool's topology information into that node.
As shown in FIG. 6, the POOL to be created is POOL3 and the owning MDC determined by the master MDC is MDC3, so the master MDC writes POOL3's topology information into ZK and adds the POOL3-to-MDC3 mapping to the POOL Mapping node.
S203: ZK returns a modification-success message to the master MDC.
S204: MDC3 receives a node event notification.
Because the content of the POOL Mapping node has been modified, ZK is triggered to send a node event notification to all standby MDCs.
S205: After receiving the node event notification, MDC3 fetches the content of the POOL Mapping node.
S206: ZK returns the content read by MDC3.
S207: MDC3 determines, from the content of the POOL Mapping node, that a mapping between POOL3 and MDC3 has been added.
After reading the content of the POOL Mapping node, MDC3 compares the content of the two most recent reads and decides, based on the comparison, whether a mapping with a POOL needs to be established.
S208: MDC3 reads the service data related to POOL3 from ZK.
When the POOL is being created, what MDC3 reads from ZK is POOL3's topology information.
S209: ZK returns a data-read-success message to MDC3.
S210: MDC3 determines that the mapping with POOL3 has been established successfully.
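The standby side of method 200 — watching the POOL Mapping node and diffing successive reads (S204 to S207) — could be sketched as follows; the JSON encoding and the callback name are assumptions:

```python
import json

def watch_pool_mapping(zk, my_mdc_id, on_new_pool,
                       path="/public/pool_mapping"):
    state = {"prev": {}}

    @zk.DataWatch(path)
    def on_change(data, stat):
        # Fired whenever the POOL Mapping node is modified (S204).
        current = json.loads(data) if data else {}
        for pool, mdc in current.items():
            # A pool newly mapped to this MDC: load its service
            # data, e.g. its topology information (S207/S208).
            if mdc == my_mdc_id and state["prev"].get(pool) != mdc:
                on_new_pool(pool)
        state["prev"] = current
```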
In this embodiment of the present invention, optionally, after the owning MDC of the resource pool to be created reads the pool's topology information, it may generate, from that topology information, the metadata corresponding to the pool, and then write the metadata into the private node.
Specifically, FIG. 7 is a schematic flowchart of a method in which an owning MDC creates a POOL according to a specific embodiment of the present invention. As shown in FIG. 7, the method 300 includes:
S301: The owning MDC generates data information from the topology information.
The data information may include the OSD view (OSD View) and the Partition View.
S302: The owning MDC sends the topology information, the data information, and so on to ZK through ZK's interface.
S303: After receiving the information sent by the owning MDC, ZK creates a node corresponding to the resource pool to be created, so that the owning MDC can write the topology information, the data information, and so on into that node.
S304: ZK returns a write-success indication to the owning MDC.
S305: The owning MDC determines that the POOL has been created successfully and waits for the OSDs to provide services.
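As an illustration of S301 to S303, the sketch below derives placeholder OSD View and Partition View structures from the stored topology and writes them under the pool's private node; the view formats here are invented for the example and are not the patent's actual metadata layout:

```python
import json

def create_pool(zk, pool_id, partitions=16):
    # S208/S301: read the topology the master MDC wrote earlier.
    topo, _ = zk.get(f"/private/{pool_id}/topology")
    osds = json.loads(topo)["osds"]

    # S301: generate the data information from the topology
    # (placeholder structures for illustration only).
    osd_view = {osd: "UP" for osd in osds}
    partition_view = {str(p): osds[p % len(osds)]
                      for p in range(partitions)}

    # S302/S303: write the generated metadata under the private node.
    zk.create(f"/private/{pool_id}/osd_view",
              json.dumps(osd_view).encode(), makepath=True)
    zk.create(f"/private/{pool_id}/partition_view",
              json.dumps(partition_view).encode(), makepath=True)
```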
In this embodiment of the present invention, optionally, the master MDC 121 is further configured to delete the mapping, stored in the public node, between a first resource pool among the multiple resource pools and a first standby MDC among the standby MDCs.
The first standby MDC is configured to stop managing the metadata of the first resource pool when it determines that the mapping between the first resource pool and the first standby MDC has been deleted.
Specifically, when a user's service changes and a current POOL is no longer needed, the POOL can be deleted. Before deleting the POOL, the mapping between the resource pool and its MDC must first be deleted, after which the Private node corresponding to the POOL in the metadata database can be removed. Alternatively, when the master MDC determines that the first standby MDC has failed, it may delete the mapping between the first resource pool and the first standby MDC.
For example, suppose the mapping between MDC2 and POOL2 needs to be deleted; this may be done by the method shown in FIG. 8. As shown in FIG. 8, the method 400 includes:
S401: MDC2 monitors the POOL Mapping node in ZK.
All MDCs need to monitor the POOL Mapping node in ZK.
S402: The master MDC modifies the content of the POOL Mapping node.
The master MDC deletes the mapping between MDC2 and POOL2 stored in POOL Mapping. Suppose POOL Mapping already contains the mappings POOL1->MDC1, POOL2->MDC2, and POOL3->MDC3; when modifying the content of the POOL Mapping node, the master MDC may either delete POOL2->MDC2 from the existing content, or rewrite only POOL1->MDC1 and POOL3->MDC3.
S403: ZK returns a modification-success message to the master MDC.
S404: MDC2 receives a node event notification.
Because the content of the POOL Mapping node has been modified, ZK is triggered to send a node event notification to all standby MDCs.
S405: After receiving the node event notification, MDC2 fetches the content of the POOL Mapping node.
S406: ZK returns the content read by MDC2.
S407: MDC2 determines, from the content of the POOL Mapping node, that the mapping between MDC2 and POOL2 has been deleted.
After reading the content of the POOL Mapping node, MDC2 compares the content of the two most recent reads and decides, based on the comparison, whether a mapping needs to be deleted.
S408: MDC2 unloads the resources related to POOL2 and stops managing the metadata corresponding to POOL2.
S409: The master MDC checks whether MDC2 has finished unloading.
S410: MDC2 returns an unload-complete confirmation.
S411: The master MDC confirms that the mapping between MDC2 and POOL2 has been deleted successfully.
In this embodiment of the present invention, optionally, the master MDC 121 is further configured to: when it determines that the first standby MDC has failed, determine a second standby MDC among the multiple MDCs, and write the mapping between the first resource pool and the second standby MDC into the public node.
The second standby MDC is configured to read the metadata of the first resource pool from the private node according to the mapping, stored in the public node, between the first resource pool and the second standby MDC.
When determining the second standby MDC among the multiple MDCs, the master MDC may designate one of the standby MDCs whose load is below a preset threshold as the second standby MDC; for example, it may designate the most lightly loaded MDC.
In other words, when a standby MDC fails, the mapping between the POOL and a standby MDC can be rebuilt. Optionally, the master MDC may periodically query the standby temporary nodes created by the standby MDCs in the metadata database; when it finds that a standby temporary node has expired, it determines that the standby MDC corresponding to that node has failed and sets the MDC's state to faulty. Alternatively, the standby MDCs may periodically report their health to the master MDC; when a standby MDC is faulty or fails to report in time, the master MDC sets that MDC's state to faulty.
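The ephemeral-node variant of this failure detection is straightforward to sketch (the znode paths and the remap callback are assumptions):

```python
def check_standby_mdcs(zk, mappings, remap):
    # mappings: e.g. {"POOL2": "MDC2", ...}. A standby MDC whose
    # ephemeral znode is gone has lost its ZK session and is
    # treated as failed.
    for pool, mdc in list(mappings.items()):
        if zk.exists(f"/ephemeral/{mdc}") is None:
            # e.g. remap the pool to a standby MDC whose load is
            # below the preset threshold.
            remap(pool, mdc)
```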
The following describes in detail, with reference to FIG. 9, a method for re-establishing a mapping between a POOL and standby MDC3 when one standby MDC (standby MDC2) fails. As shown in FIG. 9, the method 500 includes:
S501: The master MDC queries the state of the standby temporary nodes created in ZK by standby MDC2 and standby MDC3.
S502: When the master MDC determines that standby MDC2 has failed, it establishes a new mapping for POOL2, which was originally mapped to MDC2.
For example, the master MDC selects the most lightly loaded MDC3 to establish a mapping with POOL2, deletes the MDC2-to-POOL2 mapping from the POOL Mapping node in ZK, and adds the MDC3-to-POOL2 mapping to the POOL Mapping node.
S503: MDC3 receives a POOL Mapping node change event notification.
S504: MDC3 reads the content of the POOL Mapping node.
S505: ZK returns the content of the POOL Mapping node to MDC3.
S506: MDC3 determines, from the content of the POOL Mapping node, that a mapping between MDC3 and POOL2 has been added.
S507: MDC3 reads the service data related to POOL2 from ZK.
S508: ZK returns POOL2's service data to MDC3.
S509: MDC3 determines that the mapping with POOL2 has been established successfully.
In this embodiment of the present invention, optionally, the master MDC 121 is further configured to receive an MDC creation request sent by a user and write the identification information of the MDC to be created into the public node. The identification information of the MDC may be the MDC's IP address or port number, but the present invention is not limited thereto.
Therefore, the distributed storage system of this embodiment of the present invention can dynamically increase or decrease the number of resource pools and can also smoothly expand or shrink the MDC cluster online, which improves the processing capacity and reliability of the system.
The following describes in detail, with reference to FIG. 10, a method for adding an MDC according to an embodiment of the present invention. As shown in FIG. 10, the method 600 includes:
S601: The user sends an add-MDC request to the master MDC.
Suppose the MDC that the user's request asks to add is MDC3.
S602: The master MDC verifies the validity of the request.
For example, the master MDC verifies whether the IP address and port of the requested MDC3 are valid.
S603: The master MDC updates the MDC list and adds monitoring for MDC3.
S604: ZK updates the MDC list (MDC list) under the Public node.
S605: ZK returns the update result to the master MDC.
S606: The master MDC returns to the user that MDC3 has been added successfully.
S607: The user requests the agent (Agent) to start the new MDC process.
S608: The Agent starts the MDC process.
S609: The Agent returns the operation result to the user.
S610: The master MDC establishes a persistent connection with MDC3.
S611: MDC3 creates a temporary node in ZK.
S612: ZK returns the creation result to MDC3.
S613: MDC3 initializes, loads the necessary data, and starts working normally.
在本发明实施例中,可选地,在主MDC故障时,需要重新确定主MDC,因此,多个MDC中的备MDC还用于在确定主MDC故障时,发起竞争主流程;该多个MDC中的一个备MDC作为新的主MDC,用于加载该公有节点中的元数据。
具体来说,元数据库中同一时刻只允许一个备MDC创建主临时节点,可以将最先在元数据库中创建主临时节点的备MDC确定为新的主MDC,或者在一个备MDC创建主临时节点后,可以通过元数据库中存储的MDC与POOL的映射关系,确定其他备MDC的负载,如果该备MDC确定自身负载高于其他备MDC中的某些备MDC的负载,则该备MDC可以使自己创建的主临时节点失效,以此让其他备MDC在元数据库中创建主临时节点,按照这个方法,可以使负载最轻的备MDC升为新的主MDC,并在加载元数据库中的公有节点的元数据后提供服务。
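A sketch of this contention procedure, including the optional yield-to-a-lighter-peer step (the load bookkeeping passed in as peer_loads is an assumption):

```python
from kazoo.exceptions import NodeExistsError

def contend_for_master(zk, my_id, my_load, peer_loads):
    # Every standby races to create the master temporary node; ZK
    # guarantees that only one creation succeeds.
    try:
        zk.create("/ephemeral/master", my_id.encode(), ephemeral=True)
    except NodeExistsError:
        return False

    # Optional: if some peer is more lightly loaded, invalidate our
    # own master node so that peer can win a later round.
    if any(load < my_load for load in peer_loads.values()):
        zk.delete("/ephemeral/master")
        return False

    # Load the public-node metadata and start serving (S707).
    return True
```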
The following describes in detail, with reference to FIG. 11, a method for reselecting the master MDC according to an embodiment of the present invention. As shown in FIG. 11, suppose the MDC cluster in the distributed storage system includes one master MDC and three standby MDCs: standby MDC1, standby MDC2, and standby MDC3. The method 700 includes:
S701: The master MDC registers the master temporary node in ZK.
S702: The standby MDCs monitor the master temporary node.
S703: The master MDC fails, invalidating the master temporary node.
S704: The standby MDCs receive a master-temporary-node change notification.
S705: The standby MDCs initiate the contention procedure.
S706: The standby MDC that first creates the master temporary node in ZK is promoted to the new master MDC.
For example, as shown in FIG. 11, standby MDC3 is promoted to the new master MDC.
S707: MDC3 (the new master MDC) provides services after loading the data in the Public node in ZK.
In this embodiment of the present invention, optionally, the master MDC 121 is further configured to: determine that a client's view has changed; and update the metadata, stored in the public node, corresponding to the client's view.
Optionally, the clients may periodically report heartbeats to the master MDC; if the master MDC receives no heartbeat from a client within a certain period, it can confirm that the client's view has changed and accordingly modifies the metadata, stored in the metadata database, corresponding to that client's view.
The following describes in detail, with reference to FIG. 12, a method in which the master MDC handles a client view change according to an embodiment of the present invention. As shown in FIG. 12, the method 800 includes:
S801: The master MDC confirms that the client view (Client View) has changed.
S802: The master MDC modifies the data in the Client View node in ZK.
S803: ZK returns a modification-success message to the master MDC.
S804: The master MDC returns the processing result to the client.
It should be understood that FIG. 12 uses the Client View merely as an example. Changes to other public data can likewise be handled only by the master MDC, so write conflicts do not occur; the processing flow for other public data is similar to method 800.
In this embodiment of the present invention, optionally, a standby MDC among the multiple MDCs is further configured to update the metadata of a resource pool mapped to the standby MDC when it determines that the state of that resource pool has changed. State changes of a resource pool include changes to the OSD view and changes to the Partition View.
The following describes in detail, with reference to FIG. 13, a method in which a standby MDC handles a resource pool state change according to an embodiment of the present invention. As shown in FIG. 13, the method 900 includes:
S901: The owning MDC determines that the OSD view has changed.
S902: The owning MDC modifies the data of the OSD View node in ZK.
S903: ZK returns a modification-success message to the owning MDC.
S904: The owning MDC returns the processing result to the OSD.
It should be understood that FIG. 13 uses the OSD View merely as an example. Changes to a resource pool's other private metadata can likewise be handled only by the owning MDC mapped to that pool, and the processing flow for other private data is similar to method 900. Because a resource pool's private metadata can be handled only by the MDC mapped to that pool, the services of multiple POOLs can be processed concurrently at the service level without conflicting with one another.
Therefore, the distributed storage system and the method for managing metadata in the embodiments of the present invention can manage a larger-scale storage cluster and can achieve fault-domain isolation.
FIG. 14 shows a data service device 100 according to an embodiment of the present invention. The data service device 100 includes multiple processors 101, a memory 102, and a bus system 103; the multiple processors 101 and the memory 102 are connected by the bus system 103. The memory 102 is configured to store metadata corresponding to multiple resource pools. A master processor among the multiple processors 101 is configured to manage the mappings, stored in the memory 102, between standby processors among the multiple processors and the resource pools; a standby processor among the multiple processors is configured to manage the metadata, stored in the memory, corresponding to the resource pool mapped to the standby processor.
A first processor among the multiple processors 101 is configured to receive a mapping query request sent by a client, where the mapping query request is used to request to query the processor that manages the metadata of the resource pool corresponding to a user request; the first processor is further configured to send mapping indication information to the client, where the mapping indication information indicates that the processor managing the resource pool corresponding to the user request is a second processor, so that the client reads, from the second processor, the metadata of the resource pool corresponding to the user request.
The data service device of this embodiment of the present invention can manage a larger-scale storage cluster and can achieve fault-domain isolation.
It should be understood that, in this embodiment of the present invention, optionally, the processor 101 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or may be any conventional processor.
The memory 102 may include a read-only memory and a random access memory, and provides instructions and data to the processor 101. Part of the memory 102 may also include a nonvolatile random access memory; for example, the memory 102 may also store device type information.
In addition to a data bus, the bus system 103 may include a power bus, a control bus, a status signal bus, and so on. For clarity of description, however, the various buses are all marked as the bus system 103 in the figure.
During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 101 or by instructions in the form of software. The steps of the methods disclosed in the embodiments of the present invention may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 102; the processor 101 reads the metadata in the memory 102 and completes the steps of the foregoing methods in combination with its hardware. To avoid repetition, details are not described here again.
Optionally, in an embodiment, the metadata storage nodes in the memory 102 include a public node, a private node, and temporary nodes, where the metadata stored in the public node is modified by the master processor; the private node stores the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby processor that manages that pool; and the temporary nodes store the identification information of each of the multiple processors.
Optionally, in an embodiment, the master processor is specifically configured to: when a resource pool needs to be created, determine, among the multiple processors, an owning processor for the resource pool to be created, and write the mapping between the resource pool to be created and its owning processor into the memory.
The owning processor of the resource pool to be created is configured to read the topology information of the resource pool to be created from the private node according to the mapping, stored in the memory, between the resource pool to be created and its owning processor.
Optionally, in an embodiment, the master processor is further configured to: receive a resource pool creation request sent by a user, where the request carries the topology information; and write the topology information into the memory.
Optionally, in an embodiment, the owning processor of the resource pool to be created is further configured to: generate, according to the topology information stored in the memory, metadata corresponding to the resource pool to be created; and write that metadata into the memory.
Optionally, in an embodiment, the master processor is further configured to delete the mapping, stored in the memory, between a first resource pool among the multiple resource pools and a first standby processor among the standby processors; the first standby processor is configured to stop managing the metadata corresponding to the first resource pool when it determines that the mapping between the first resource pool and the first standby processor has been deleted.
Optionally, in an embodiment, the master processor is further configured to: when it determines that the first standby processor has failed, determine a second standby processor among the multiple processors, and write the mapping between the first resource pool and the second standby processor into the memory; the second standby processor is configured to read the metadata of the first resource pool from the memory according to the mapping, stored in the memory, between the first resource pool and the second standby processor.
Optionally, in an embodiment, the master processor is further configured to: receive a processor creation request sent by a user; and write the identification information of the processor to be created into the memory.
Optionally, in an embodiment, the standby processors among the multiple processors are further configured to initiate a master contention procedure when they determine that the master processor has failed; one standby processor among the multiple processors, acting as the new master processor, is configured to load the metadata in the memory.
Optionally, in an embodiment, a standby processor among the multiple processors is further configured to update the metadata of a resource pool mapped to the standby processor when it determines that the state of that resource pool has changed.
In the data service device of this embodiment of the present invention, the master processor manages the mappings between processors and resource pools, and each standby processor manages the metadata, stored in the memory, corresponding to the resource pools mapped to it. Thus every processor can participate in metadata management, a larger storage cluster can be managed, and because each processor manages only the metadata of its own resource pools, fault-domain isolation can be achieved.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
It may be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the described apparatus embodiments are merely examples: the division into units is merely a logical functional division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

  1. A distributed storage system, comprising: a metadata database, multiple metadata controllers (MDCs), and multiple resource pools; wherein
    the metadata database is configured to store metadata corresponding to the multiple resource pools;
    a master MDC among the multiple MDCs is configured to manage mappings, stored in the metadata database, between standby MDCs among the multiple MDCs and resource pools; and
    a standby MDC among the multiple MDCs is configured to manage metadata, stored in the metadata database, corresponding to the resource pool mapped to the standby MDC.
  2. The distributed storage system according to claim 1, wherein the metadata storage nodes in the metadata database comprise a public node, a private node, and temporary nodes;
    wherein the metadata stored in the public node is modified by the master MDC; the private node stores the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that pool; and the temporary nodes store identification information of each of the multiple MDCs.
  3. The distributed storage system according to claim 2, wherein the master MDC is specifically configured to: when a resource pool needs to be created, determine, among the multiple MDCs, an owning MDC for the resource pool to be created, and write a mapping between the resource pool to be created and its owning MDC into the public node; and
    the owning MDC of the resource pool to be created is configured to read topology information of the resource pool to be created from the private node according to the mapping, stored in the public node, between the resource pool to be created and its owning MDC.
  4. The distributed storage system according to claim 3, wherein the master MDC is further configured to:
    receive a resource pool creation request sent by a user, wherein the resource pool creation request carries the topology information; and
    write the topology information into the private node.
  5. The distributed storage system according to claim 4, wherein the owning MDC of the resource pool to be created is further configured to:
    generate, according to the topology information stored in the private node, metadata corresponding to the resource pool to be created; and
    write the metadata corresponding to the resource pool to be created into the private node.
  6. The distributed storage system according to any one of claims 2 to 5, wherein the master MDC is further configured to delete a mapping, stored in the public node, between a first resource pool among the multiple resource pools and a first standby MDC among the standby MDCs; and
    the first standby MDC is configured to stop managing the metadata corresponding to the first resource pool when determining that the mapping between the first resource pool and the first standby MDC has been deleted.
  7. The distributed storage system according to claim 6, wherein the master MDC is further configured to: when determining that the first standby MDC has failed, determine a second standby MDC among the multiple MDCs, and write a mapping between the first resource pool and the second standby MDC into the public node; and
    the second standby MDC is configured to read the metadata of the first resource pool from the private node according to the mapping, stored in the public node, between the first resource pool and the second standby MDC.
  8. The distributed storage system according to any one of claims 2 to 7, wherein the master MDC is further configured to:
    receive an MDC creation request sent by a user; and
    write identification information of the MDC requested to be created into the public node.
  9. The distributed storage system according to any one of claims 2 to 8, wherein the standby MDCs among the multiple MDCs are further configured to initiate a master contention procedure when determining that the master MDC has failed; and
    one standby MDC among the multiple MDCs, acting as a new master MDC, is configured to load the metadata in the public node.
  10. The distributed storage system according to any one of claims 2 to 9, wherein a standby MDC among the multiple MDCs is further configured to:
    update, when determining that the state of a resource pool mapped to the standby MDC has changed, the metadata of the resource pool mapped to the standby MDC.
  11. A method for managing metadata in a distributed storage system, wherein the method comprises:
    receiving, by a first metadata controller MDC, a mapping query request sent by a client, wherein the mapping query request is used to request to query the MDC that manages metadata of a resource pool corresponding to a user request;
    sending, by the first MDC, mapping indication information to the client, wherein the mapping indication information indicates that the MDC managing the resource pool corresponding to the user request is a second MDC; and
    reading, by the client from the second MDC, the metadata of the resource pool corresponding to the user request.
  12. The method according to claim 11, wherein before the first MDC sends the mapping indication information to the client, the method further comprises:
    reading, by the first MDC, a mapping list stored in a metadata database, wherein the metadata storage nodes in the metadata database comprise a public node, a private node, and temporary nodes; the metadata stored in the public node is modified by a master MDC; the private node stores the metadata corresponding to each of the multiple resource pools, and the metadata corresponding to each resource pool is read and modified by the standby MDC that manages that pool; and the temporary nodes store identification information of each of multiple MDCs; and
    determining, by the first MDC according to the mapping list, the second MDC as the MDC managing the resource pool corresponding to the user request.
  13. The method according to claim 12, wherein the method further comprises:
    determining, by the master MDC, that a resource pool needs to be created;
    determining, by the master MDC, an owning MDC for the resource pool to be created;
    writing, by the master MDC, a mapping between the resource pool to be created and its owning MDC into the public node; and
    reading, by the owning MDC of the resource pool to be created, topology information of the resource pool to be created from the private node according to the mapping, stored in the public node, between the resource pool to be created and its owning MDC.
  14. The method according to claim 13, wherein the method further comprises:
    receiving, by the master MDC, a resource pool creation request sent by a user, wherein the resource pool creation request carries the topology information; and
    writing, by the master MDC, the topology information into the private node.
  15. The method according to claim 14, wherein the method further comprises:
    generating, by the owning MDC of the resource pool to be created according to the topology information, metadata corresponding to the resource pool to be created; and
    writing, by the owning MDC of the resource pool to be created, the metadata corresponding to the resource pool to be created into the private node.
  16. The method according to any one of claims 12 to 15, wherein the method further comprises:
    deleting, by the master MDC, a mapping between a first resource pool and a first standby MDC stored in the public node; and
    stopping, by the first standby MDC when determining that the mapping between the first resource pool and the first standby MDC has been deleted, managing the metadata corresponding to the first resource pool.
  17. The method according to claim 16, wherein the method further comprises:
    determining, by the master MDC, a second standby MDC when determining that the first standby MDC has failed;
    writing, by the master MDC, a mapping between the first resource pool and the second standby MDC into the public node; and
    reading, by the second standby MDC, the metadata of the first resource pool from the private node according to the mapping, stored in the public node, between the first resource pool and the second standby MDC.
  18. The method according to any one of claims 12 to 17, wherein the method further comprises:
    receiving, by the master MDC, an MDC creation request sent by a user; and
    writing, by the master MDC, identification information of the MDC requested to be created into the public node.
  19. The method according to any one of claims 12 to 18, wherein the method further comprises:
    initiating, by a standby MDC, a master contention procedure when determining that the master MDC has failed; and
    loading, by one of the standby MDCs acting as a new master MDC, the metadata in the public node.
  20. The method according to any one of claims 12 to 19, wherein the method further comprises:
    updating, by a standby MDC when determining that the state of a resource pool mapped to the standby MDC has changed, the metadata of the resource pool mapped to the standby MDC.