WO2019119212A1 - Method, apparatus and data storage system for identifying OSD sub-health

Method, apparatus and data storage system for identifying OSD sub-health

Info

Publication number
WO2019119212A1
Authority
WO
WIPO (PCT)
Prior art keywords
osd
health
data request
written
write data
Prior art date
Application number
PCT/CN2017/116951
Other languages
English (en)
French (fr)
Inventor
谢会云
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP17935811.4A (EP3620905B1)
Priority to PCT/CN2017/116951 (WO2019119212A1)
Priority to CN201780003315.3A (CN108235751B)
Publication of WO2019119212A1
Priority to US16/903,762 (US11320991B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/064 Management of blocks
    • G06F3/0605 Improving or facilitating administration, e.g. storage management, by facilitating the interaction with a user or administrator
    • G06F3/0634 Configuration or reconfiguration of storage systems by changing the state or mode of one or more devices
    • G06F3/0635 Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/0653 Monitoring storage devices or systems
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/065 Replication mechanisms
    • G06F3/0665 Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes

Definitions

  • Embodiments of the present invention relate to storage technologies, and in particular, to methods, apparatus, and data storage systems for identifying OSD sub-health.
  • Sub-health problems in a storage node in a distributed storage system can seriously affect the availability of the entire distributed storage system.
  • the traditional solution may be that the storage node periodically reports a heartbeat to the management device. If the storage node loses the heartbeat due to a sub-health failure, the node is isolated and taken offline.
  • alternatively, a fault detection module may be embedded in the storage node itself. When the input/output (I/O) delay is greater than a predetermined threshold, the fault is reported to the management node, and the management node isolates the storage node offline.
  • Traditional distributed storage systems have a large delay in monitoring and processing sub-health of storage nodes. Sub-health failures have a long-term impact on distributed storage systems, and the availability of distributed storage systems is greatly reduced.
  • the present application provides a method, apparatus, and system for identifying sub-health of an OSD.
  • a first aspect of the present application provides a data storage system, the system comprising a management node and a plurality of storage nodes, wherein a plurality of OSDs are deployed in the system, the plurality of OSDs being located on the plurality of storage nodes
  • the plurality of OSDs include a first OSD and a second OSD.
  • the first OSD is configured to receive a first write data request, where the write data request includes a data block to be written and a corresponding partition to be written; to determine, according to the partition allocation view, that the standby OSD of the partition to be written is the second OSD; to copy the first write data request to the second OSD; and to send a first report message to the management node after the data block is copied to the second OSD.
  • the first report message includes an identifier of the first OSD, an identifier of the second OSD, and health status information of the second OSD.
  • the management node is configured to receive the first report message, update an OSD health status record saved on the management node according to the first report message, and determine, according to the OSD health status record, that the second OSD is a sub-health OSD.
  • the OSD health status record includes health status information of the second OSD reported by the other OSDs.
  • the system provided by the present application can more comprehensively detect the fault condition of nodes in the system, thereby improving the accuracy of identifying the sub-health OSD.
  • a method of identifying a sub-health OSD is provided, the method being applied to the data storage system provided by the first aspect above.
  • the method comprises the following steps:
  • the first OSD receives a first write data request, where the write data request includes a data block to be written to a partition managed by the first OSD and the corresponding partition to be written. The first OSD determines, according to the partition allocation view, that the standby OSD of the partition to be written is the second OSD, copies the write data request to the second OSD, and, after obtaining the time taken for the data to be copied to the second OSD, sends a first report message to the management node. The first report message includes an identifier of the first OSD, an identifier of the second OSD, and health status information of the second OSD.
  • the management node updates the saved OSD health status record according to the first report message and determines, according to the OSD health status record, that the second OSD is a sub-health OSD, where the OSD health status record includes health status information of the second OSD reported by other OSDs.
  • the method provided by the present application can more comprehensively detect the fault condition of nodes in the system, thereby improving the accuracy of identifying the sub-health OSD.
  • a third aspect of the present application provides a Virtual Block System (VBS) for implementing the functions of the computing node in the above system or method.
  • the VBS includes an access interface, a service module, a client, and a reporting module.
  • the access interface is configured to receive a first write data request, where the first write data request includes data to be written, a write position of the data to be written, a data length of the data to be written, and block device information of the data to be written.
  • the service module is configured to divide the data included in the first write data request into data blocks, and to calculate, according to the write position, offset, and block device information of each data block, the partition to which each data block is to be written.
  • the client is configured to find the primary OSD corresponding to the partition according to an I/O view, send a second write data request to the primary OSD, and obtain the time taken for the second write data request to be sent to the primary OSD, where the second write data request includes a data block to be written and the partition to which the data block is to be written.
  • the reporting module is configured to send a first report message to the management node, where the first report message includes an identifier of the VBS, an identifier of the primary OSD, and health status information of the primary OSD.
  • the VBS thus detects whether there is an abnormality on the path over which the write data request is sent to the primary OSD, and reports the result to the management node, so that the management node can more comprehensively detect the health status of the nodes in the system.
  • a fourth aspect of the present application provides an Object Storage Device (OSD) involved in the above system or method.
  • the OSD includes a write data module, a copy module, and a report module.
  • the OSD receives a write data request from the computing node, the write data request including a data block to be written and the partition to which the data block is to be written.
  • the copying module is configured to receive the write data request, copy the write data request to the standby OSD corresponding to the partition to be written included in the write request, obtain the time taken for the to-be-written data block to be copied to the standby OSD, and send the write data request to the write data module.
  • the write data module is configured to receive the write data request, and write the to-be-written data included in the write data request into the persistent storage resource corresponding to the to-be-written partition.
  • the reporting module is configured to send a first report message to the management node, where the first report message includes an identifier of the OSD, an identifier of the standby OSD, and health status information of the standby OSD.
  • a fifth aspect of the present application provides a Meta Data Controller (MDC).
  • the MDC is used to implement the functions of the management node in the above system or method.
  • the MDC includes a management module and a receiving module, where the receiving module is configured to receive a report message reported by a computing node or an OSD in a data storage system, where the report message includes an identifier of the reporter, an identifier of the reported OSD, and the health status information of the reported OSD.
  • the management module is configured to update the saved OSD health status record according to the received report message, and determine, according to the updated OSD health status record, that the one or more OSDs in the reported OSD are sub-health OSDs.
  • a sixth aspect of the present application provides a method for identifying a sub-health OSD, the method being applied to a data storage system, where the data storage system includes a management node, a plurality of storage nodes, and a plurality of computing nodes, where A plurality of OSDs are deployed in the data storage system, and the plurality of OSDs are located on the plurality of storage nodes.
  • the method is performed by the management node and includes the following steps: receiving a first report message reported by at least one of the plurality of storage nodes when processing a write data request, where the first report message includes an identifier of the reporting OSD, an identifier of the reported OSD, and health status information of the reported OSD; the reporting OSD and the reported OSD are each one of the plurality of OSDs, and the reporting OSD is not the reported OSD.
  • the first reported OSD is determined to be a sub-health OSD according to the received first report message.
  • the management node receives a second report message reported by at least one of the plurality of computing nodes when processing a read/write data request, the second report message including an identifier of the reporting computing node, an identifier of the reported OSD, and health status information of the reported OSD.
  • the management node determines, from the plurality of OSDs, a switching OSD for the sub-health OSD, establishes a correspondence between the sub-health OSD and the switching OSD, and updates the partition allocation view according to that correspondence. The updated partition allocation view includes an updated I/O view; the updated I/O view is sent to the plurality of computing nodes, and the updated partition allocation view is sent to the switching OSD and to the OSDs that have an active/standby relationship with the sub-health OSD.
  • the management node receives a third report message about the sub-health OSD sent by the switching OSD, where the third report message includes an identifier of the switching OSD, an identifier of the sub-health OSD, and third health status information of the sub-health OSD. The third health status information is sent after the switching OSD synchronizes the received write data request to the sub-health OSD based on the updated partition allocation view, and is based on the write data response returned by the sub-health OSD.
  • the management node determines, according to the received third report message, that the sub-health OSD has returned to normal.
  • the partition allocation view includes a partition, a correspondence between a primary OSD of the partition and a standby OSD of the partition.
  • updating the partition allocation view according to the correspondence between the sub-health OSD and the switching OSD includes: if the sub-health OSD is the primary OSD of a partition, changing the original primary OSD to a standby OSD, changing the original standby OSD to the primary OSD, and associating the switching OSD with the changed standby OSD (a sketch of this rule follows below); if the sub-health OSD is a standby OSD of the partition, associating the switching OSD with that standby OSD.
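  • As a rough illustration of this view-update rule, the following Python sketch swaps the primary and standby roles when the sub-health OSD was the primary, and attaches the switching OSD in place of the sub-health standby. All names (PartitionEntry, switch_out, and so on) are illustrative assumptions, not taken from the patent.

        # Minimal sketch of the view-update rule described above; names are assumptions.
        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class PartitionEntry:
            partition_id: int
            primary_osd: str
            standby_osd: str
            switching_osd: Optional[str] = None  # OSD temporarily standing in for a sub-health OSD

        def switch_out(entry: PartitionEntry, sub_health_osd: str, switching_osd: str) -> None:
            if entry.primary_osd == sub_health_osd:
                # Sub-health OSD was the primary: demote it to standby, promote the
                # original standby, and associate the switching OSD with the new standby.
                entry.primary_osd, entry.standby_osd = entry.standby_osd, sub_health_osd
                entry.switching_osd = switching_osd
            elif entry.standby_osd == sub_health_osd:
                # Sub-health OSD was a standby: associate the switching OSD with it.
                entry.switching_osd = switching_osd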
  • the method further includes: the management node disassociating the sub-health OSD from the switching OSD, updating the partition allocation view, and sending the updated partition allocation view to the plurality of computing nodes, the switching OSD, and the OSDs having an active/standby relationship with the sub-health node.
  • the health status information involved in the above embodiment includes indication information that the OSD is in sub-health.
  • a seventh aspect of the present application provides a method for reporting health status information, which may be performed by a VBS in the system. The method includes: receiving a first write data request, where the first write data request includes data to be written, a write position of the data to be written, a data length of the data to be written, and block device information of the data to be written.
  • the data included in the first write data request is divided into data blocks, and the partition to which each data block is to be written is calculated according to the write position, offset, and block device information of each data block.
  • the primary OSD corresponding to each partition is found according to the I/O view, and a second write data request is sent to the primary OSD, where the second write data request includes a data block to be written and the partition to which the data block is to be written.
  • the method further includes receiving a write data response returned by the primary OSD, and obtaining the time taken for the second write data request to be sent to the primary OSD by comparing the time when the second write data request was sent with the time when the write data response was received.
  • the first report message is sent when the time taken for the second write data request to be sent to the primary OSD exceeds a threshold, for example as in the sketch below.
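  • For instance, the round-trip measurement and threshold check on the VBS side could look roughly like the following Python sketch; the helper names (send_write, wait_for_write_response, send_report) and the threshold value are assumptions made only for illustration.

        import time

        SUB_HEALTH_THRESHOLD_S = 0.5  # assumed threshold; the patent leaves the value open

        def write_to_primary_and_report(client, reporter, vbs_id, primary_osd_id, block, partition_id):
            """Send the second write data request to the primary OSD, time the round trip,
            and report the primary OSD as sub-health if the round trip exceeds the threshold."""
            start = time.monotonic()
            client.send_write(primary_osd_id, block, partition_id)   # second write data request
            client.wait_for_write_response(primary_osd_id)           # write data response
            elapsed = time.monotonic() - start
            if elapsed > SUB_HEALTH_THRESHOLD_S:
                # First report message: reporter (VBS) id, reported OSD id, health status info.
                reporter.send_report(reporter_id=vbs_id, reported_osd=primary_osd_id,
                                     status="sub-health", elapsed=elapsed)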
  • the eighth aspect of the present application provides another method for reporting health status information, which may be performed by an OSD in the system. The method includes: receiving a write data request, where the write data request includes a data block to be written and the partition to which the data is to be written; copying the write data request to the standby OSD corresponding to the partition to be written included in the write request; obtaining the time taken for the to-be-written data block to be copied to the standby OSD; and sending the write data request to the write data module.
  • Sending a first report message to the management node where the first report message includes an identifier of the OSD, an identifier of the standby OSD, and health status information of the standby OSD.
  • the method further includes: when the data to be written included in the write data request is written into the persistent storage resource corresponding to the to-be-written partition, obtaining the length of time taken to write the data block into that persistent storage resource, and sending a second report message to the management node, where the second report message includes an identifier of the OSD and health status information of the OSD.
  • when it is determined that the time taken to copy the to-be-written data block to the standby OSD exceeds a threshold, the sub-health status information of the standby OSD is sent to the reporting module and the first report message is sent; or, when it is determined that the time taken to write the to-be-written data block into the persistent storage resource corresponding to the to-be-written partition exceeds the threshold, the second report message is sent.
  • the method further includes: receiving a copy response returned by the standby OSD, and obtaining the time taken for the to-be-written data block to be copied to the standby OSD by comparing the time when the write data request was copied to the standby OSD with the time when the copy response was received.
  • a ninth aspect of the present application provides an apparatus for implementing the sixth, seventh, and eighth aspects, the apparatus comprising a processor and a memory connected by a bus. The memory is used for storing computer operation instructions and may specifically be a high-speed RAM or a non-volatile memory.
  • the processor is configured to execute a computer operation instruction stored in a memory.
  • the processor may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The processor performs the method of the sixth, seventh, or eighth aspect by executing the computer operation instructions stored in the memory, so as to achieve the above functions of the MDC, VBS, or OSD in the data storage system.
  • a tenth aspect of the present application provides a storage medium for storing the computer operation instructions mentioned in the above ninth aspect.
  • when these operation instructions are executed by a computer, the methods of the above sixth, seventh, or eighth aspects may be performed.
  • FIG. 1 is a schematic structural diagram of a distributed storage system
  • FIG. 2 is a schematic structural diagram of a data storage system according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method for identifying a sub-health OSD according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart of switching a sub-health OSD according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention.
  • the embodiment of the invention relates to a data storage system.
  • the data storage system is composed of two or more server nodes, each of which can provide computing resources and/or storage resources.
  • the data storage system is used to provide computing resources and storage resources to different applications, which may be virtual machines or databases, and the like.
  • the above-mentioned server nodes can communicate with each other through a computing network (for example, a high-speed data exchange network).
  • a distributed storage controller is running on the data storage system, and the distributed storage controller is a collective term for storage control function modules running on each server node.
  • the distributed storage controller uses a distributed algorithm to virtualize storage resources (persistent storage resources, cache resources) in each server node into a shared storage resource pool for application sharing in the data storage system.
  • FIG. 1 is only an example of the data storage system, and in actual applications, more server nodes can be deployed.
  • the distributed storage controller in the embodiment of the present invention may be implemented by a software module installed on a hardware device of the server.
  • the distributed storage controller can be functionally divided into the following parts:
  • a Meta Data Controller (MDC) is mainly used to allocate, for each Object Storage Device (OSD) in the data storage system, the partitions corresponding to the physical storage resources managed by that OSD, and to establish a partition allocation view. The MDC is also used to update the partition allocation view when the correspondence between OSDs and partitions in the data storage system changes.
  • the partition refers to a logical mapping of physical storage resources of the storage layer.
  • the physical storage resources here usually refer to persistent storage resources, which can be provided by a mechanical hard disk drive (HDD), a solid state drive (SSD), or a Storage Class Memory (SCM) device.
  • the partition allocation view includes the correspondence between each partition and the OSDs to which the partition belongs. To meet availability requirements, data storage systems typically store multiple copies of data, in which case one partition corresponds to multiple OSDs: one of them acts as the primary OSD for the partition, and the remaining OSDs act as standby OSDs for the partition.
  • the partition allocation view therefore includes, for each partition, the correspondence between the partition, its primary OSD, and its standby OSDs.
  • the partition allocation view includes an I/O view, and it can be understood that the I/O view is a sub-table of the partition allocation view.
  • the I/O view is used to record the correspondence between a partition and the primary OSD of the partition, as sketched below.
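  • As a minimal way to picture the two views (the Python field names are illustrative, not from the patent), the partition allocation view maps each partition to its primary and standby OSDs, while the I/O view keeps only the partition-to-primary mapping:

        # Partition allocation view: partition -> primary/standby OSDs (held by the MDC and OSDs).
        partition_allocation_view = {
            1: {"primary": "OSD1", "standby": ["OSD2"]},
            2: {"primary": "OSD2", "standby": ["OSD3"]},
        }

        # I/O view: the sub-table used by the VBS, recording only partition -> primary OSD.
        io_view = {p: entry["primary"] for p, entry in partition_allocation_view.items()}
        assert io_view == {1: "OSD1", 2: "OSD2"}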
  • a virtual block system (VBS), as a storage driver layer, is used to provide a block access interface to an application accessing the data storage system, and completes the read and write logic of the block storage data.
  • the I/O allocation view is saved on the VBS, and the I/O allocation view includes all the partitions and the corresponding relationship with the main OSD corresponding to the partition. Based on the I/O view, the data is forwarded to the corresponding OSD or the data is read from the corresponding OSD.
  • the OSD is configured to receive a read/write data request, and read or write data from the persistent storage resource corresponding to the partition managed by the OSD according to the read/write data request.
  • the partition allocation view is saved on the OSD, and the partition allocation view includes the correspondence between all the partitions in the data storage system and their corresponding active and standby OSDs. According to the partition allocation view, the OSD can find the standby OSD corresponding to the data to be read or the partition to which the data is to be written.
  • the specific deployment mode of the data storage system may select different configuration files according to user needs. The configuration file includes the deployment strategy of the foregoing functional modules, the partition specification of the data storage system (that is, how many partitions each hard disk is divided into), and the mutual communication address information between the different server nodes (including the address information of the MDC, VBS, and OSD).
  • each server node runs functional modules of the distributed storage controller according to the deployment policy in the configuration file; that is, according to the deployment policy, different functional modules of the distributed storage controller can run on different server nodes, and a server node may run all of the functional modules of the distributed storage controller or only some of them.
  • the foregoing MDC may be deployed on only one server node of the data storage system, the foregoing VBS may be deployed on each server node having computing resources in the data storage system, and the OSD may be deployed on each server node in the cluster system that has storage resources. Depending on actual needs, one OSD or multiple OSDs can be deployed on a single server node. A hypothetical configuration is sketched below.
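  • In spirit, such a configuration file might contain entries like the following; this layout, expressed as a Python dict, is a hypothetical illustration and is not prescribed by the patent.

        # Hypothetical deployment configuration (illustrative values only).
        config = {
            "deployment": {
                "mdc_nodes": ["server-1"],                    # MDC on one server node (management node)
                "vbs_nodes": ["server-1", "server-2"],        # VBS on every node with compute resources
                "osd_nodes": {"server-1": 1, "server-2": 1, "server-3": 2},  # OSD count per storage node
            },
            "partition_spec": {"partitions_per_disk": 64},    # how many partitions each disk is divided into
            "addresses": {
                "mdc": "192.168.0.1:6000",
                "vbs": {"server-1": "192.168.0.1:6100", "server-2": "192.168.0.2:6100"},
                "osd": {"server-1": "192.168.0.1:6200", "server-2": "192.168.0.2:6200"},
            },
        }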
  • a server node on which the MDC is deployed is referred to as a management node.
  • a server node capable of providing computing resources is referred to as a computing node
  • a server node capable of providing a storage resource is referred to as a storage node.
  • the OSDs are deployed on the storage node
  • the VBSs are deployed on the computing nodes.
  • the computing node, the storage node, and the management node are logical concepts. Physically, a single server node can simultaneously serve as a computing node, a storage node, and a management node.
  • the user imports the configuration file into the data storage system through the management end of the system, and the MDC establishes the partition allocation view of the data storage system according to the imported configuration file, where the partition allocation view includes the correspondence between the OSDs and the partitions in the data storage system.
  • the partition allocation view includes a partition, a correspondence between the primary OSD of the partition and the standby OSD of the partition; and the I/O view includes a correspondence between the partition and the primary OSD of the partition.
  • when the OSD on a server node is activated, the OSD requests the partition allocation view from the MDC, and the MDC, according to the request, sends the already allocated partition allocation view to the OSD.
  • after the VBS on a server node is activated, the VBS requests an I/O view from the MDC, and the MDC, according to the request, sends the already allocated I/O view to the corresponding VBS.
  • alternatively, the MDC may actively send the partition allocation view to the OSD and the I/O view to the VBS after generating the partition allocation view, and, after the partition allocation view is updated, send the updated partition allocation view to the OSD and the updated I/O view to the VBS.
  • an embodiment of the present invention provides a data storage system comprising a management node and a plurality of storage nodes, wherein a plurality of OSDs are deployed in the system, the plurality of OSDs being located on the plurality of storage nodes; the plurality of OSDs comprise a first OSD and a second OSD.
  • the first OSD is configured to receive a first write data request, where the write data request includes a data block to be written and a corresponding partition to be written; to determine, according to the partition allocation view, that the standby OSD of the partition to be written is the second OSD; to copy the first write data request to the second OSD; and to send a first report message to the management node after the data block is copied to the second OSD. The first report message includes an identifier of the first OSD, an identifier of the second OSD, and health status information of the second OSD.
  • the management node is configured to receive the first report message, update an OSD health status record saved on the management node according to the first report message, and determine, according to the OSD health status record, that the second OSD is a sub-health OSD, where the OSD health status record includes health status information of the second OSD reported by the other OSDs.
  • the other OSDs mentioned here are OSDs other than the first OSD and the second OSD among the plurality of OSDs.
  • the computing node is configured to receive a second write data request, divide the data to be written included in the second write data request into at least one data block to be written, determine the partition to be written by each of the at least one data block, determine, according to the I/O view, that the primary OSD that processes the to-be-written data block includes the first OSD, send the first write data request to the first OSD, obtain the time taken to send the first write data request to the first OSD, and then send a second report message to the management node, where the second report message includes an identifier of the computing node, an identifier of the first OSD, and the health status information of the first OSD.
  • the management node is further configured to update the OSD health status record recorded on the management node according to the second report message, and determine, according to the OSD health status record, that the first OSD is a sub-health OSD, where the OSD health status record includes health status information of the first OSD reported by other OSDs.
  • the computing node is configured to receive a first read data request, determine the partition where each data block to be read by the first read data request is located, determine, according to the I/O view, that the primary OSD that processes the data block to be read includes the first OSD, send a second read data request to the first OSD, obtain the time taken to send the second read data request to the first OSD, and send a second report message to the management node, where the second report message includes an identifier of the computing node, an identifier of the first OSD, and health status information of the first OSD, and where the second read data request includes the partition where the data block to be read is located.
  • the management node is further configured to receive the second report message, update an OSD health status record saved on the management node according to the second report message, and determine that the first OSD is a sub-health OSD according to an OSD health status record.
  • the OSD health status record includes health status information of the first OSD reported by other OSDs.
  • the first OSD is further configured to write the data block to be written to the partition managed by the first OSD into the persistent storage resource corresponding to the partition to be written, obtain the time taken for the data block to be written into the persistent storage resource, and then send a third report message to the management node, where the third report message includes an identifier of the first OSD and health status information of the first OSD.
  • the management node is configured to determine, according to the third report information, that the first OSD is a sub-health OSD.
  • the management node is configured to determine, from the plurality of OSDs, a switching OSD for the sub-health OSD, establish a correspondence between the sub-health OSD and the switching OSD, and update the partition allocation view according to the correspondence between the sub-health OSD and the switching OSD, where the updated partition allocation view includes an updated I/O view. The management node sends the updated I/O view to the plurality of computing nodes, and sends the updated partition allocation view to the switching OSD and to the OSDs having an active/standby relationship with the sub-health OSD, where the switching OSD is different from the first OSD and the second OSD.
  • the computing node is configured to receive a third write data request, divide the data to be written included in the third write data request into at least one data block to be written, determine the partition to be written by each of the at least one data block, determine, according to the I/O view, that the primary OSD that processes the at least one data block to be written includes a third OSD, and send a fourth write data request to the third OSD. The fourth write data request includes a data block to be written to a partition managed by the third OSD and the corresponding partition to be written, where the third OSD is one of the plurality of OSDs and is different from the switching OSD.
  • the third OSD copies the received write data request to the switching OSD according to the updated partition allocation view, where, in the updated partition allocation view, the standby OSD corresponding to the partition to be written included in that write data request is the sub-health node.
  • the switching OSD is further configured to synchronize the received third write data request to the sub-health OSD based on the updated partition allocation view.
  • the switching OSD is further configured to send a third report message to the management node after obtaining the time taken to synchronize the third write data request to the sub-health OSD. The third report message includes the identifier of the switching OSD, the identifier of the sub-health OSD, and the third health status information of the sub-health OSD.
  • the management node is further configured to update an OSD health status record recorded on the management node according to the third report message, and determine that the sub-health OSD returns to normal according to the OSD health status record, where the OSD health status record includes Health status information of the sub-health OSD reported by other OSDs.
  • the first OSD is configured to receive a replication response returned by the second OSD, and to obtain the time taken for the data block to be copied to the second OSD by comparing the time when the first write data request was sent with the time when the replication response was received.
  • the time taken to synchronize the third write data request to the sub-health OSD may likewise be obtained, after the synchronization response returned by the sub-health OSD is received, as the difference between the time when the third write data request was sent and the time when the synchronization response was received.
  • the computing node is further configured to receive a first write data response returned by the first OSD, and to obtain the time taken to send the first write data request to the first OSD by comparing the time when the first write data request was sent with the time when the first write data response was received.
  • the computing node is further configured to receive a read data response returned by the first OSD for the second read data request, and to obtain the time taken to send the second read data request to the first OSD by comparing the time when the second read data request was sent with the time when the read data response was received.
  • the above health status information may include indication information that the OSD is in a sub-health state, or it may be the time measurements described above.
  • the first report message, the second report message, and the third report message mentioned above may be heartbeat messages.
  • the MDC includes a receiving module and a management module, where the receiving module is configured to receive a report message reported by a computing node or an OSD in the data storage system, where the report message includes an identifier of the reporter, an identifier of the reported OSD, and the health status information of the reported OSD.
  • the management module is configured to update the saved OSD health status record according to the received report message, and to determine, according to the updated OSD health status record, that one or more of the reported OSDs are sub-health OSDs.
  • the management module is further configured to determine, from the data storage system, a switching OSD for the sub-health OSD, establish a correspondence between the sub-health OSD and the switching OSD, and update the partition allocation view according to the correspondence between the sub-health OSD and the switching OSD, where the updated partition allocation view includes an updated I/O view. The management module sends the updated I/O view to the computing node, and sends the updated partition allocation view to the switching OSD and to the OSDs having an active/standby relationship with the sub-health OSD.
  • the receiving module is further configured to receive a first report message about the sub-health OSD sent by the switching OSD, where the first report message includes an identifier of the switching OSD, an identifier of the sub-health OSD, and first health status information of the sub-health OSD. The first health status information is sent after the switching OSD synchronizes the received write data request to the sub-health OSD based on the updated partition allocation view, and is based on the write data response returned by the sub-health OSD.
  • the management module updates the saved OSD health status record according to the received first report message about the sub-health OSD, and determines, according to the updated OSD health status record, that the sub-health OSD has returned to normal. A sketch of this bookkeeping follows below.
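  • One way to picture the management module's bookkeeping is a health record keyed by the reported OSD, as in the Python sketch below; the class name, field names, and the two-reporter decision rule are illustrative assumptions rather than anything the patent specifies.

        from collections import defaultdict

        class HealthRecord:
            """Tracks sub-health reports per reported OSD, as received from computing nodes and OSDs."""
            def __init__(self, min_reporters: int = 2):   # assumed decision rule
                self.reports = defaultdict(set)           # reported OSD id -> set of reporter ids
                self.min_reporters = min_reporters

            def update(self, reporter_id: str, reported_osd_id: str, status: str) -> None:
                if status == "sub-health":
                    self.reports[reported_osd_id].add(reporter_id)
                else:                                     # a "normal" report clears earlier ones
                    self.reports.pop(reported_osd_id, None)

            def sub_health_osds(self) -> list:
                return [osd for osd, reporters in self.reports.items()
                        if len(reporters) >= self.min_reporters]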
  • the VBS serves as a storage driving layer for providing a block access interface to an application accessing the data storage system, and simultaneously completing the read and write logic of the block storage data.
  • the I/O allocation view is saved on the VBS, and the I/O allocation view includes all the partitions and the corresponding relationship with the main OSD corresponding to the partition. Based on the I/O view, the data is forwarded to the corresponding OSD or the data is read from the corresponding OSD.
  • the VBS includes an access interface, a service module, a client, and a reporting module, where the access interface is configured to receive a first write data request, and the first write data request includes the data to be written, a write position of the data to be written, a data length of the data to be written, and block device information of the data to be written.
  • the service module is configured to divide the data included in the first write data request into data blocks, and to calculate, according to the write position, offset, and block device information of each data block, the partition to which each data block is to be written.
  • the client is configured to find the primary OSD corresponding to the partition according to an I/O view, send a second write data request to the primary OSD, and obtain the time taken for the second write data request to be sent to the primary OSD, where the second write data request includes a data block to be written and the partition to which the data block is to be written.
  • the reporting module is configured to send a first report message to the management node, where the first report message includes an identifier of the VBS, an identifier of the primary OSD, and health status information of the primary OSD.
  • the client is further configured to receive a write data response returned by the primary OSD, and to obtain the time taken for the second write data request to be sent to the primary OSD by comparing the time when the second write data request was sent with the time when the write data response was received.
  • the VBS further includes a determining module (not shown), where the determining module is configured to determine that the time taken for the second write data request to be sent to the primary OSD exceeds a threshold, and in that case to send the sub-health status information of the primary OSD to the reporting module.
  • the access interface is further configured to receive a read data request, where the read data request includes a start position of the data to be read, a data length of the data to be read, and block device information of the data to be read.
  • the service module is further configured to divide the data to be read into data blocks, and to calculate the partition where each data block is located according to the start position, offset, and block device information of each data block.
  • the client is configured to find the primary OSD corresponding to the partition according to the I/O view, and to send a read data request to the primary OSD, where the read data request includes the partition where the data block to be read is located.
  • the access interface may be a block device access interface based on a Small Computer System Interface (SCSI), that is, a SCSI interface.
  • the reporting module can be a heartbeat module.
  • the I/O view can be actively sent to the VBS by the MDC, or it can be requested from the MDC by the VBS. Alternatively, the MDC may issue the partition allocation view to the VBS, and the VBS generates the I/O view from the received partition allocation view.
  • the OSD may include a replication module, a write data module, and a reporting module, where the replication module is configured to receive a write data request, where the write data request includes a data block to be written and the partition to which the data is to be written; to copy the write data request to the standby OSD corresponding to the partition to be written included in the write request; to obtain the time taken for the to-be-written data block to be copied to the standby OSD; and to send the write data request to the write data module.
  • the write data module is configured to receive the write data request, and write the to-be-written data included in the write data request into the persistent storage resource corresponding to the to-be-written partition.
  • the reporting module is configured to send a first report message to the management node, where the first report message includes an identifier of the OSD, an identifier of the standby OSD, and health status information of the standby OSD.
  • the write data module is further configured to obtain a time length of writing the to-be-written data block to the persistent storage resource corresponding to the to-be-written partition.
  • the reporting module is further configured to send a second report message to the management node, where the second report message includes an identifier of the OSD and health status information of the OSD.
  • the OSD further includes a determining module (not shown), where the determining module is configured to determine that the time taken to copy the to-be-written data block to the standby OSD exceeds a threshold, or that the time taken to write the to-be-written data block into the persistent storage resource corresponding to the to-be-written partition exceeds the threshold, and in either case to send the corresponding sub-health status information to the reporting module.
  • the replication module is further configured to receive a replication response returned by the standby OSD, and to obtain the time taken for the to-be-written data block to be copied to the standby OSD from the difference between the time when the write data request was copied to the standby OSD and the time when the replication response was received.
  • the foregoing write data module may be an asynchronous I/O module
  • the replication module may be a Replicated State Machine (RSM)
  • the reporting module may be a heartbeat module.
  • the receiving module mentioned in the above embodiment may also be a heartbeat module.
  • the heartbeat module involved in the present invention refers to a module for receiving and transmitting a heartbeat message. It can be understood that when the reporting module is a heartbeat module, the information sent by the reporting module is carried in the heartbeat message. Correspondingly, the heartbeat message is also received by the heartbeat module in the MDC.
  • FIG. 2 shows a logical structure diagram of a data storage system 100 in accordance with an embodiment of the present invention.
  • the data storage system 100 includes: compute nodes 1, 2, storage nodes 1, 2, 3, and a management node.
  • in this example, each partition corresponds to one primary OSD and one standby OSD, and one OSD runs on each storage node.
  • for partition 1, the primary OSD is OSD1 on storage node 1 and the standby OSD is OSD2 on storage node 2.
  • for partition 2, the primary OSD is OSD2 on storage node 2 and the standby OSD is OSD3 on storage node 3.
  • the remaining partitions in the system follow the same pattern and are not expanded here.
  • the correspondence between the partition 1, the OSD1, and the OSD2, and the correspondence between the partition 2, the OSD2, and the OSD3 are stored on the OSD.
  • the correspondence between partition 1 and OSD1 and the correspondence between partition 2 and OSD2 are saved in the I/O allocation view saved on the VBS.
  • a method for identifying a sub-health OSD is provided in the distributed storage system shown in FIG. 2. Referring to FIG. 3, the method includes the following steps:
  • the VBS in Node 1 receives a write data request.
  • the write data request includes data to be written, a write position of the data to be written, a data length of the data to be written, and block device information of the data to be written.
  • the write data request here can be a write I/O request.
  • the VBS determines, according to the write position of the data to be written, the data length of the data to be written, and the block device information of the data to be written in the write data request, the primary OSD that is to process the data to be written.
  • the specific process of this step may include: the VBS divides the data to be written into multiple data blocks of a preset length, calculates the partition corresponding to each data block by using a consistent hashing algorithm, and then finds the corresponding primary OSD for each data block according to the saved I/O view. It can be understood that, if the data to be written is shorter than the preset length, the VBS divides it into a single data block.
  • the data to be written may be divided into one or more data blocks, and therefore, the main OSD corresponding to the data to be written may also be one or more.
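  • The splitting and partition lookup can be pictured roughly as in the following Python sketch; the block size, the number of partitions, and the simple hash-to-partition mapping standing in for the consistent hashing algorithm are assumptions made for illustration only.

        import hashlib

        PARTITION_COUNT = 4096          # assumed number of partitions in the system
        BLOCK_SIZE = 1024 * 1024        # assumed preset block length (1 MiB)

        def split_into_blocks(data: bytes, write_position: int):
            """Split the data to be written into fixed-length blocks (the last block may be shorter)."""
            return [(write_position + off, data[off:off + BLOCK_SIZE])
                    for off in range(0, len(data), BLOCK_SIZE)]

        def partition_of(block_device: str, position: int) -> int:
            """Map a block's (device, position) key onto a partition with a hash."""
            key = "{}:{}".format(block_device, position // BLOCK_SIZE).encode()
            return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % PARTITION_COUNT

        def primary_osd_for(block_device: str, position: int, io_view: dict) -> str:
            """Look up the primary OSD of the block's partition in the I/O view saved on the VBS."""
            return io_view[partition_of(block_device, position)]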
  • data to be written is divided into two data blocks, data block 1 and data block 2.
  • the partition corresponding to the data block 1 is partition 1 according to the consistency hash algorithm, and the partition corresponding to the data block 2 is partition 2.
  • the primary OSD corresponding to partition 1 is OSD1
  • the primary OSD corresponding to partition 2 is OSD2.
  • the VBS sends a write data request to the primary OSD corresponding to each of the data blocks, where the write data request includes the to-be-written data block and the partition to which the data block is to be written, and the VBS records the transmission time of each write data request.
  • specifically, the VBS sends a write data request containing data block 1 and the identifier of partition 1 to OSD1 on storage node 1, and a write data request containing data block 2 and the identifier of partition 2 to OSD2 on storage node 2.
  • the specific processing procedure of the embodiment of the present invention is described below by taking the processing procedure of the data block 1 as an example.
  • after receiving the write data request including data block 1, OSD1 on storage node 1 returns a write data response to the VBS that sent the write data request.
  • OSD1 calls the system call interface of the operating system (OS) running on storage node 1 to write data block 1 to the persistent storage resource corresponding to partition 1 managed by OSD1.
  • the OS writes data block 1 to the persistent storage resource managed by OSD1 and returns a write response to OSD1.
  • when OSD1 receives the write response returned by the OS, it compares the time at which data block 1 was written with the time at which the corresponding write response was received, and obtains the length of time taken to write data block 1 into the persistent storage resource managed by OSD1.
  • after receiving the write data request including data block 1, OSD1 also copies the write data request to the standby OSD of partition 1 according to the partition allocation view, and records the transmission time of the write data request.
  • the primary OSD corresponding to the partition 1 is the OSD1 on the storage node 1
  • the standby OSD corresponding to the partition 1 is the OSD2 on the storage node 2. Therefore, after receiving the write data request including the data block 1, the OSD 1 on the storage node 1 copies the write data request to the OSD 2, and records the transmission time.
  • After receiving the copied write data request, OSD2 writes data block 1 into the persistent storage resource managed by OSD2 and, after persistence is complete, returns a copy response to OSD1. OSD2 can also record the length of time it took to write data block 1 to the persistent storage resource it manages.
  • OSD1 receives the copy response returned by OSD2, compares the recorded time at which the write data request was sent to OSD2 with the time at which the copy response was received, and so obtains the length of time taken to copy the data to the standby OSD.
  • OSD1 may then report the ID of OSD2 and the health status information of OSD2 to the management node. The health status information of OSD2 is used to reflect the health status of OSD2.
  • In one possible implementation, the health status information of OSD2 is the length of time taken to copy the data to the standby storage node.
  • In another possible implementation, the health status information of OSD2 is indication information that OSD2 is in a sub-health state. In this implementation, OSD1 determines whether OSD2 is in a sub-health state or a healthy state according to the length of time taken to copy the data to the standby storage node; for example, when the length of time taken to copy data block 1 to the standby storage node exceeds a certain threshold, OSD2 is considered to be in a sub-health state.
  • When OSD2 is in the sub-health state, the heartbeat message that storage node 1 (where OSD1 is located) reports to the management node may include the ID of OSD2 and the indication information that OSD2 is in the sub-health state. The indication information of the sub-health state may be a sub-health type; the sub-health type is also sometimes called a fault level.
  • Generally, the message header of the heartbeat message reported to the management node carries the identifier of the reporter, OSD1, and the following message fields carry the ID of OSD2 and the indication information that OSD2 is in the sub-health state:
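  • The original-language description gives the heartbeat request structure below (reproduced as-is; dsw_u16, dsw_u8 and dsw_u32 are the patent's own field types, and the comments restate the field meanings given in the text):

typedef struct PACKFLAG osd_heartbeat_request
{
    required dsw_u16 osd_num;      /* number of reported OSDs               */
    required dsw_u8  type;         /* sub-health type of the reported OSDs  */
    required dsw_u32 osd_array[0]; /* list of reported OSD IDs              */
}osd_heartbeat_request_t;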
  • Here, osd_num is the number of reported OSDs, type indicates the sub-health type of the reported OSDs, and osd_array is the list of reported OSD IDs.
  • It can be understood that, after OSD1 obtains the length of time taken to write data block 1 to the persistent storage resource managed by OSD1, OSD1 can report the ID of OSD1 and the health status information of OSD1 to the management node. The health status information of OSD1 reflects the health status of OSD1; it may be the above-mentioned length of time, or it may be indication information that OSD1 is in a sub-health state.
  • OSD1 can determine its own health status based on the length of time it took to write data block 1 to the persistent storage resource managed by OSD1. For example, when that length of time exceeds a certain threshold, OSD1 is determined to be in a sub-health state. When OSD1 is in the sub-health state, the heartbeat message reported to the management node by storage node 1 (where OSD1 is located) includes the ID of OSD1 and the indication information that OSD1 is in the sub-health state.
  • In the same way, after OSD2 obtains the length of time taken to write data block 1 to the persistent storage resource managed by OSD2, OSD2 can also report the ID of OSD2 and the health status information of OSD2 to the management node. OSD1 and OSD2 can report the health status information of an OSD using the existing heartbeat message, as in the example above.
  • Of course, the health status information of an OSD can also be reported in another message; the present invention does not limit this.
  • After receiving the write data response sent by OSD1, VBS1 compares the time at which VBS1 sent the write data request with the time at which the write data response was received, and so obtains the length of time taken to send the write data request to the primary OSD.
  • After obtaining this length of time, VBS1 may report the ID of OSD1 and the health status information of OSD1 to the management node. The health status information of OSD1 reflects the health status of OSD1; it may be the length of time taken to send the write data request to the primary OSD, or it may be indication information that OSD1 is in a sub-health state.
  • The health status of OSD1 can be judged according to the length of time taken to send the write data request to the primary OSD. For example, when that length of time exceeds a certain threshold, OSD1 is determined to be in a sub-health state. When OSD1 is in the sub-health state, the heartbeat message reported to the management node by computing node 1 (where VBS1 is located) includes the ID of OSD1 and the indication information that OSD1 is in the sub-health state.
  • Generally, the header of the heartbeat message reported to the management node carries the identifier of the reporter, and the following message fields carry the ID of OSD1 and the indication information that OSD1 is in the sub-health state:
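  • The original-language description gives this request structure for the computing-node report (reproduced as-is, with the same field types as above):

typedef struct PACKFLAG unhealthy_osd_list_req_s
{
    required dsw_u16 osd_num;      /* number of reported OSDs               */
    required dsw_u8  type;         /* sub-health type of the reported OSDs  */
    required dsw_u32 osd_array[0]; /* list of reported OSD IDs              */
}unhealthy_osd_list_req_t;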
  • As before, osd_num is the number of reported OSDs, type indicates the sub-health type of the reported OSDs, and osd_array is the list of reported OSD IDs. It can be understood that there is no strict time order between the heartbeat messages reported by the OSDs and those reported by the computing nodes.
  • The management node receives the health status information reported by the storage nodes and/or the computing nodes, determines the health status of the OSDs in the data storage system according to the received health status information, and performs corresponding processing.
  • It can be understood that there may be multiple computing nodes and storage nodes in the data storage system, and each computing node and storage node may, when processing a write data request or a read data request, report health status information of the OSDs involved in that request. Moreover, in an actual deployment, one OSD can serve as the primary OSD of many partitions and as the standby OSD of many other partitions.
  • Therefore, the management node receives multiple heartbeat messages. A health status record table may be set on the management node to record the OSDs in the data storage system that are in the sub-health state; it may record the correspondence between the identifier of a sub-health OSD and the identifier of the reporter that reported it, and the present invention does not limit the record format.
  • Depending on the sender, the received reports can be classified into three types: health status information of OSDs reported by computing nodes, health status information of OSDs on other storage nodes reported by a storage node, and health status information of a storage node's own OSDs reported by that storage node. As stated above, the reported health status information may be indication information indicating that an OSD is in a sub-health state, or it may be the delay in processing a read or write data request.
  • Here, the delay in processing a write data request may include the time taken by an OSD to write data into the persistent storage resource it manages, the time taken by a primary OSD to copy data to a standby OSD, and the time taken by a VBS to send the write data request to the primary OSD.
  • It can be understood that, when data is read from a persistent storage resource, the delay in processing the read data request may include the time taken by an OSD to read the data from the persistent storage resource it manages and the time taken by a VBS to send the read data request to the primary OSD. (An illustrative grouping of these delay components is sketched below.)
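  • As an illustration only (the patent does not define such a structure), the delay components above could be grouped in a report as follows; the field names are assumptions:

#include <stdint.h>

/* Delay components, in milliseconds, that a report message might carry. */
typedef struct {
    uint64_t local_write_ms;   /* OSD writing to its persistent storage   */
    uint64_t replicate_ms;     /* primary OSD copying to the standby OSD  */
    uint64_t dispatch_ms;      /* VBS sending the request to the primary  */
} io_delay_report_t;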
  • After receiving the health status information of OSDs reported by the nodes, the management node can manage the OSDs according to a preset management policy. In the following example, the health status information reported by each node is indication information indicating that an OSD is in a sub-health state, and the management node handles the above three types of reports separately.
  • For the health status information of OSDs reported by computing nodes, the management node performs the following steps:
  • S322-11: Obtain the number of computing nodes in the data storage system that have processed read or write data requests within a certain period of time, and record this number as X.
  • S322-12: Among the n primary OSDs involved in those read and write data requests, count how many computing nodes reported each primary OSD as sub-healthy, and record the number of computing nodes reporting the i-th primary OSD as Yi, where i is an integer from 1 to n. Alternatively, only the primary OSD reported as sub-healthy by the most computing nodes may be counted, with the number of computing nodes reporting it recorded as Y.
  • S322-13: For each primary OSD, check whether the ratio of computing nodes reporting it as sub-healthy (Yi/X) falls within a preset range; if so, determine that the primary OSD is in a sub-health state and, according to its fault level, isolate it permanently or temporarily isolate it online. The fault level may be determined according to factors such as the delay in processing read and write data requests and the ratio of computing nodes affected by the delay; the present invention does not limit this. Alternatively, if Y was counted in S322-12, the computing node ratio may be Y/X. (A minimal sketch of this ratio check is given below.)
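  • The following is a minimal sketch, in C, of the ratio check described above; the preset range and the way reports are counted are illustrative assumptions, not the patent's actual policy.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative preset range for the reporter ratio. */
#define SUBHEALTH_RATIO_LOW   0.3
#define SUBHEALTH_RATIO_HIGH  1.0

/* reporters_x: computing nodes that processed requests in the period (X)
 * reporters_y: computing nodes that reported this OSD as sub-healthy (Yi) */
static bool osd_is_subhealthy(uint32_t reporters_y, uint32_t reporters_x)
{
    if (reporters_x == 0)
        return false;
    double ratio = (double)reporters_y / (double)reporters_x;
    return ratio >= SUBHEALTH_RATIO_LOW && ratio <= SUBHEALTH_RATIO_HIGH;
}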
  • For the health status information of OSDs on other storage nodes reported by storage nodes, the management node performs the following steps:
  • S322-21: Obtain the number of primary storage nodes that have copied write data requests to standby OSDs on other storage nodes within a certain period of time, and record this number as X'. The primary storage node here is the storage node where a primary OSD is located. It can be understood that a primary OSD corresponds to a partition, and the primary OSD of one partition may also be the standby OSD of another partition; therefore, the same storage node can serve both as a primary storage node and as a standby storage node.
  • S322-22: Among the standby OSDs involved in those copy operations, count how many other storage nodes reported each standby OSD as sub-healthy, and record the number of other storage nodes reporting the i-th standby OSD as Y'i. Alternatively, only the standby OSD reported as sub-healthy by the most other storage nodes may be counted, with the number of reporting storage nodes recorded as Y'.
  • S322-23: For each standby OSD, check whether the ratio of storage nodes reporting it as sub-healthy (Y'i/X') falls within a preset range; if so, determine that the standby OSD is sub-healthy and, according to its fault level, isolate it permanently or temporarily isolate it online. The fault level may be determined according to factors such as the delay in processing read and write data requests and the ratio of storage nodes affected by the delay; the present invention does not limit this.
  • For a storage node reporting the health status information of its own OSDs, the management node isolates the reported OSD permanently or temporarily isolates it online according to its fault level.
  • Alternatively, the health status information received by the management node may be the delay in processing read and write data requests. In that case, after receiving such health status information, the management node determines according to a certain policy which OSDs in the data storage system are in a sub-health state and need temporary or permanent isolation. The specific policy is not limited by the present invention.
  • In the above embodiment, by detecting the delays that may occur along the path a read or write data request travels, the health status of each node in the data storage system is detected more comprehensively and accurately, and the sub-health OSDs are handled accordingly based on the detection result.
  • In the above data storage system, each OSD can be the primary OSD of several partitions and the standby OSD of other partitions. Take OSD2 as an example: if, in the actual deployment, OSD2 is the standby OSD of X partitions and also the primary OSD of Y partitions, then those X+Y partitions are the partitions managed by OSD2. When the method of the above embodiment determines that OSD2 is in a sub-health state, another OSD is required to take over the partitions managed by OSD2. Thus, when data is to be written to a partition managed by OSD2, the write data request carrying that data is sent to the OSD that takes over from OSD2.
  • For convenience of description, an OSD in a sub-health state is referred to below as a sub-health OSD, and an OSD used to take over the partitions managed by a sub-health OSD is referred to as a switching OSD.
  • The partitions managed by OSD2 can be assigned to multiple switching OSDs or to a single switching OSD. The specific assignment can be determined according to factors such as the load of the other OSDs in the storage system, or according to preset policies.
  • A method of isolating a sub-health OSD according to an embodiment of the present invention is described below with reference to FIG. 4, taking as an example the case where the standby OSD of partition 1 (that is, OSD2) is determined to be in a sub-health state and the partitions on OSD2 are assigned to one switching OSD.
  • The management node allocates a switching OSD to OSD2; the switching OSD takes over the partitions managed by the sub-health OSD and processes subsequent write data requests in its place. The management node may use the following algorithm to select a switching node for the sub-health storage node.
  • First, an OSD that is not in the same conflict domain as the original OSD is preferred; a data storage system generally has preset conflict domains.
  • Second, on the premise that the conflict-domain constraint is satisfied, an OSD in a storage node with lower capacity usage and fault-free persistent storage resources is preferentially selected as the switching OSD, to keep the storage of the cluster balanced and dispersed.
  • For example, when the cabinet is the conflict domain, among the cabinets with no conflict, select the cabinet with the least used capacity; under that cabinet, select the server with the least used capacity; and within that server, select a storage node with the least used capacity and fault-free persistent storage resources.
  • In this embodiment, taking the cabinet as the conflict domain, OSD4 is determined as the switching OSD of OSD2 according to the above method. It can be understood that multiple switching OSDs may take over the partitions managed by OSD2; the present invention does not limit this.
  • Because the management node selects the switching OSD (OSD4) to take over the partitions managed by the sub-health OSD (OSD2), OSD4 becomes the standby OSD of the X partitions originally managed by OSD2. Since OSD2 is also the primary OSD of the Y partitions, before OSD4 takes over the Y partitions, OSD2 is demoted to the standby OSD of the Y partitions and the original standby OSDs of the Y partitions are promoted to primary OSDs; OSD4 then replaces OSD2 as the standby OSD of the Y partitions. It can be understood that the MDC will update the partition allocation view of the data storage system according to these changes and send the updated partition allocation view to the OSDs in the storage nodes. Because a switching node has joined, the updated partition allocation view may further include the correspondence between the sub-health node and the switching node. Referring to FIG. 4, the switching process includes the following steps (some steps are not shown in the figure). (A minimal sketch of the view update is given below.)
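  • The following is a minimal sketch, in C, of how a partition allocation view entry might be updated when a switching OSD takes over for a sub-health OSD; the structure layout and function name are illustrative assumptions, not the patent's actual data structures.

#include <stdint.h>

typedef struct {
    uint32_t partition_id;
    uint32_t primary_osd;
    uint32_t standby_osd;
    uint32_t isolated_osd;   /* sub-health OSD the standby stands in for; 0 if none */
} partition_view_entry_t;

/* Update one view entry when `subhealth` is isolated and `switching` takes over. */
static void take_over_partition(partition_view_entry_t *e,
                                uint32_t subhealth, uint32_t switching)
{
    if (e->primary_osd == subhealth) {
        /* demote the sub-health primary and promote the original standby */
        e->primary_osd = e->standby_osd;
        e->standby_osd = subhealth;
    }
    if (e->standby_osd == subhealth) {
        /* the switching OSD replaces the sub-health OSD as standby, and the
         * correspondence between the two is recorded in the view            */
        e->standby_osd  = switching;
        e->isolated_osd = subhealth;
    }
}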
  • The management node sends the updated partition allocation view to all computing nodes in the storage system, to the switching node, and to the OSDs related to the partitions taken over by the switching OSD.
  • After the switch, OSD4 becomes the standby OSD of the X+Y partitions, so the current primary OSDs of those X+Y partitions can be called the OSDs related to the partitions taken over by the switching OSD. Because the partition allocation view has been updated, the management node sends the updated partition allocation view to the relevant nodes in the storage system.
  • After receiving the updated partition allocation view, the computing nodes in the storage system refresh their local I/O views, and the OSDs related to the partitions taken over by the switching OSD update their local partition allocation views.
  • After receiving a write data request, the VBS in computing node 1 determines, according to the write position, the data length, and the block device information of the data to be written carried in the request, the primary OSD that will process the data to be written.
  • Similar to the step described above, the VBS divides the data to be written carried in the write data request into multiple data blocks of a preset length, calculates the partition corresponding to each data block by using the consistent hashing algorithm, and finds the corresponding primary OSD in the saved I/O view.
  • In this embodiment, assume the data to be written is divided into two data blocks, data block 1 and data block 2. According to the consistent hashing algorithm, the partition corresponding to data block 1 is partition 1 and the partition corresponding to data block 2 is partition 2. Referring to the example in the previous embodiment, the I/O view shows that the primary OSD corresponding to partition 1 is OSD1 and the primary OSD corresponding to partition 2 is OSD2.
  • If, before the switch, the standby OSD of partition 1 was OSD2 and the standby OSD of partition 2 was OSD3, then after the switch the primary OSD of partition 1 is still OSD1, the standby OSD of partition 1 is OSD4, the primary OSD of partition 2 is OSD3, and the standby OSD of partition 2 is OSD4.
  • According to the updated I/O view, the VBS sends a write data request to OSD1 that includes data block 1 and the identifier of partition 1. The VBS also sends, according to the updated I/O view, a write data request to OSD3 that includes data block 2 and the identifier of partition 2.
  • When OSD1 receives the write data request containing data block 1, it copies the received write data request to OSD4 according to the updated partition allocation view. At this point, OSD4 has replaced OSD2 as the standby OSD of partition 1.
  • After receiving the write data request containing data block 1 and the identifier of partition 1, OSD4 learns from its saved partition allocation view that it is the standby OSD of partition 1 and the switching OSD of OSD2. OSD4 writes data block 1 to the persistent storage resource it manages and obtains the length of time taken to write data block 1 to that persistent storage resource.
  • OSD4 also passes the write data request containing data block 1 to a background handoff thread; the handoff thread sends the write data request to OSD2 asynchronously, and the time at which the request is sent is recorded. (A minimal sketch of this asynchronous handoff is given below.)
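  • The following is a minimal sketch, in C with POSIX threads, of queueing a write data request for a background handoff thread that pushes it to the isolated OSD asynchronously; the queue layout and the helper functions now_ms and send_to_osd2 are illustrative assumptions.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

extern uint64_t now_ms(void);              /* assumed monotonic clock helper    */
extern void send_to_osd2(void *write_req); /* assumed send to the isolated OSD  */

typedef struct handoff_item {
    void                *write_req;  /* write data request to replay          */
    uint64_t             sent_ms;    /* recorded when pushed to the OSD       */
    struct handoff_item *next;
} handoff_item_t;

static handoff_item_t *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* Called on the normal write path: enqueue the request and return at once. */
void handoff_enqueue(void *write_req)
{
    handoff_item_t *item = calloc(1, sizeof(*item));
    if (item == NULL)
        return;
    item->write_req = write_req;
    pthread_mutex_lock(&queue_lock);
    item->next = queue_head;
    queue_head = item;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Background handoff thread: drain the queue, push each request to the
 * isolated OSD, and record the sending time for later latency measurement. */
void *handoff_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        handoff_item_t *item = queue_head;
        queue_head = item->next;
        pthread_mutex_unlock(&queue_lock);

        item->sent_ms = now_ms();
        send_to_osd2(item->write_req);
        free(item);
    }
    return NULL;
}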
  • After receiving the write response returned by OSD2, OSD4 determines the length of time needed to synchronize the data to the sub-health OSD (OSD2) according to the recorded time at which the write data request was sent and the time at which the write response was received.
  • OSD4 can decide, according to a preset policy, whether to report the health status information of OSD2 to the management node; for example, when the time needed to synchronize to OSD2 exceeds a certain threshold, the health status information of OSD2 is sent to the management node. For the method of reporting the health status information of OSD2, refer to the description of the above embodiment; details are not repeated here.
  • The management node determines the health status of the sub-health OSD according to the received health status information. For the method by which the management node determines the health status of an OSD, refer to the previous embodiment; details are not repeated here.
  • As noted above, the VBS sends a write data request containing data block 2 and the identifier of partition 2 to OSD3. The processing procedure after OSD3 receives that write data request is described below.
  • After receiving the write data request containing data block 2 and the identifier of partition 2, OSD3 learns from the updated partition allocation view that it is the primary OSD of partition 2. OSD3 writes data block 2 to the persistent storage resource it manages and obtains the length of time taken to write data block 2 to that persistent storage resource. OSD3 also learns from the updated partition allocation view that the standby OSD of partition 2 is a sub-health node and that the switching node of that sub-health node is OSD4, so OSD3 copies the write data request to the switching node OSD4.
  • After receiving the write data request containing data block 2 and the identifier of partition 2, OSD4 writes data block 2 into the persistent storage resource managed by OSD4 and obtains the length of time taken to write data block 2 to that persistent storage resource.
  • OSD4 learns from the locally saved, updated partition allocation view that it is the switching node of the sub-health OSD (OSD2) and that OSD2 is the standby OSD of partition 2. OSD4 therefore passes the write data request containing data block 2 to the background handoff thread; the handoff thread sends the write data request to OSD2 asynchronously, and the time at which the request is sent is recorded.
  • After receiving the write response returned by OSD2, OSD4 determines the length of time needed to synchronize the data to the OSD being replaced (OSD2) according to the recorded time at which the write data request was sent and the time at which the write response was received.
  • OSD4 can decide, according to the preset policy, whether to report the health status information of OSD2 to the management node; for example, when the time needed to synchronize to OSD2 exceeds a certain threshold, the health status information of OSD2 is sent to the management node. For the method of reporting the health status information of OSD2, refer to the description of the above embodiment; details are not repeated here.
  • It can be understood that the data collected on the management node is refreshed over time. If, at some moment, the management node determines that OSD2 has returned to normal, it updates the partition allocation view accordingly, that is, the correspondence between OSD2 and the switching node is deleted from the partition allocation view. The updated partition allocation view is then sent to all computing nodes in the storage system and to the OSDs related to the partitions taken over by the switching OSD; after receiving it, these nodes update their saved I/O views or partition allocation views.
  • As stated above, OSD2 is the primary OSD of some partitions. Before the switch, OSD2 was demoted to the standby OSD of those partitions and their original standby OSDs were promoted to primary OSDs. When OSD2 becomes healthy again, the primary and standby OSDs of those partitions can also be switched back. One way to switch back is for the management node, after determining that OSD2 has recovered, to update the partition allocation view again; in the updated partition allocation view, the primary and standby roles of those partitions are swapped back.
  • Because the write data requests received while OSD2 was isolated online in the sub-health state were asynchronously pushed to OSD2 by the switching node through the handoff thread, OSD2 maintains data consistency with OSD1 for partition 1 and with OSD3 for partition 2. Once the fault on OSD2 is removed, OSD2 can be put back into use directly.
  • Conversely, if OSD2 remains in the sub-health state for a sustained period of time, the management node can remove it from the data storage system.
  • The VBS, OSD, and management node (MDC) mentioned in the above embodiments can be implemented by software modules installed on the hardware of a server. Referring to FIG. 5, each of the VBS, the OSD, and the management node may include a processor 501 and a memory 502, where the processor 501 and the memory 502 communicate with each other over a bus.
  • The memory 502 is configured to store computer operation instructions; it may specifically be a high-speed RAM memory or a non-volatile memory.
  • The processor 501 is configured to execute the computer operation instructions stored in the memory. The processor 501 may specifically be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • The processor 501 implements the functions of the VBS, OSD, and MDC in the above embodiments by executing the different computer operation instructions stored in the memory 502.
  • Another embodiment of the present invention provides a storage medium for storing the computer operation instructions mentioned in the above embodiments. When the operation instructions stored in the storage medium are executed by a computer, the methods in the above embodiments may be performed to implement the functions of the above MDC, VBS, or OSD in the data storage system.

Abstract

A method, apparatus, and system for identifying a sub-health OSD. In a data storage system, the primary OSD that processes read and write data requests reports to the management node in the system the delay with which the standby OSD processes those requests, and the management node determines, according to the received report messages, which OSDs in the system are in a sub-health state. Because the collected data is more comprehensive, the management node can judge the sub-health OSDs in the system more accurately.

Description

识别OSD亚健康的方法、装置和数据存储系统 技术领域
本发明实施例涉及存储技术,特别是识别OSD亚健康的方法、装置和数据存储系统。
背景技术
分布式存储系统中某个存储节点出现亚健康问题会严重影响整个分布式存储系统可用性。传统的解决方案可以是存储节点定期向管理设备上报心跳,如果存储节点发生亚健康故障导致心跳丢失,则对该节点进行离线隔离处理。或者也可以是存储节点自身植入故障检测模块,检测到输入/输出(简称I/O)延迟大于预定的阈值,则向管理节点上报故障,管理节点对其进行离线隔离。传统的分布式存储系统对存储节点亚健康的监控与处理延迟很大,亚健康故障对分布式存储系统的影响时间长,分布式存储系统的可用性大打折扣。
发明内容
有鉴于此,本申请提供了一种识别OSD亚健康的方法、装置和系统。
本申请的第一方面提供了一种数据存储系统,所述系统包括管理节点和多个存储节点,其中所述系统中部署了多个OSD,所述多个OSD位于所述多个存储节点上;所述多个OSD包括第一OSD和第二OSD。其中,所述第一OSD用于接收第一写数据请求,所述写数据请求中包括待写入数据块以及相应的待写入的分区,根据分区分配视图确定所述待写入的分区的备OSD为所述第二OSD,将所述第一写数据请求复制给所述第二OSD,获得所述数据块复制到所述第二OSD所耗时长之后向所述管理节点发送第一报告消息,所述第一报告消息中包括所述第一OSD的标识、所述第二OSD的标识以及所述第二OSD的健康状态信息。所述管理节点用于接收所述第一报告消息,根据所述第一报告消息更新所述管理节点上保存的OSD健康状态记录,根据所述OSD健康状态记录确定所述第二OSD为亚健康OSD,所述OSD健康状态记录包括所述其他OSD上报的所述第二OSD的健康状态信息。
由于第一OSD将待写入数据复制到第二OSD的时候,会记录其消耗的时长来作为判断该第二OSD是否处于亚健康的参考。因此,相比于现有技术,本申请提供的系统能够更加全面地检测系统中节点的故障情况,从而提高识别亚健康OSD的准确性。
本申请的第二方面,提供了一种识别亚健康OSD的方法,所述方法应用于上述第一方面所提供的数据存储系统中。该方法包括如下步骤:
所述第一OSD接收第一写数据请求,所述写数据请求中包括待写入所述第一OSD所管理的分区的数据块以及相应的待写入的分区,根据分区分配视图确定所述待写入的分区的备OSD为所述第二OSD,将所述写数据请求复制给所述第二OSD,获得所述数据复制到所述第二OSD所耗时长之后向所述管理节点发送第一报告消息,所述第一报告消息中包括所述第一OSD的标识、所述第二OSD的标识以及所述第二OSD的健康状态信息。
所述管理节点接收所述第一报告消息,根据所述第一报告消息更新所述管理节点上 保存的OSD健康状态记录,根据所述OSD健康状态记录确定所述第二OSD为亚健康OSD,所述OSD健康状态记录包括所述其他OSD上报的所述第二OSD的健康状态信息。
基于同样的理由,本申请提供的方法能够更加全面地检测系统中节点的故障情况,从而提高识别亚健康OSD的准确性。
本申请第三方面提供了一种虚拟块系统(Virtual Block System,VBS)VBS,用来实现上述系统或方法中所述计算节点的功能。所述VBS包括访问接口、业务模块、客户端、上报模块。其中,所述访问接口,用于接收第一写数据请求,所述第一写数据请求中包括待写入数据、所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息。所述业务模块,用于将所述第一写数据请求中包含的数据分割成数据块,并根据每个数据块的写入位置、偏移和所述块设备信息计算与所述每个数据块要写入的分区。所述客户端,用于根据I/O视图,找到与所述分区对应的主OSD,向所述主OSD发送第二写数据请求,获得所述第二写数据请求发送到所述主OSD所耗时长,所述第二写数据请求中包括待写入的数据块以及该待写入数据块要写入的分区。所述上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述VBS的标识、所述主OSD的标识以及所述主OSD的健康状态信息。
VBS检测写数据请求下发到主OSD的路径上是否有异常,并将此上报给管理节点,使得管理节点可以更加全面的检测系统中节点的健康状态。
本申请的第四方面提供了一种上述系统或方法中所涉及的对象存储设备(Object Storage Device,OSD)。所述OSD包括写数据模块、复制模块和上报模块。作为主OSD时,该OSD会从计算节点接收到写数据请求,所述写数据请求中包括待写入数据块及所述待写入数据要写入的分区。具体到OSD内部,所述复制模块用于接收所述写数据请求,将所述写数据请求复制给写请求中包括的待写入的分区对应的备OSD,获得所述待写入数据块复制到所述备OSD所耗时长,并且将所述写数据请求发送给所述写数据模块。所述写数据模块,用于接收所述写数据请求,将写数据请求中包括的待写入数据写入到对应的待写入分区对应的持久化存储资源中。所述上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述OSD的标识、所述备OSD的标识以及所述备OSD的健康状态信息。
由于主OSD将备OSD的状态及时反馈给了管理节点,为管理节点的管理提供了更全面的数据。
本申请的第五方面,提供了一种元数据控制器(Meta Data Controller,MDC)。所述MDC用来实现上述系统或方法中的管理节点的功能。所述MDC包括管理模块和接收模块,其中,所述接收模块用于接收数据存储系统中的计算节点或者OSD上报的报告消息,所述报告消息中包括上报者的标识,被上报OSD的标识以及被上报者的健康状态信息。所述管理模块用于根据接收到的报告消息更新保存的OSD健康状态记录,根据所述更新后的OSD健康状态记录确定所述被上报OSD中的一个或多个OSD为亚健康OSD。
本申请的第六方面,提供了又一种识别亚健康OSD的方法,所述方法应用于数据存储系统中,所述数据存储系统包括管理节点、多个存储节点和多个计算节点,其中,所述的数据存储系统中部署了多个OSD,所述多个OSD位于所述多个存储节点上。所述的方法由所述管理节点执行,该方法包括如下步骤:接收所述多个存储节点中的至少一个存储节点在处理写数据请求时上报的第一报告消息,所述第一报告消息中包括上报OSD的 标识、被上报者的标识以及被上报OSD的健康状态信息,所述被上报OSD和所述上报OSD为所述多个OSD中的一个,且,所述上报OSD不是所述被上报OSD。根据接收到的第一报告消息判断第一被上报OSD为亚健康OSD。
本申请第六方面的一种可能的实现中,所述管理节点接收所述多个计算节点中的至少一个计算节点在处理读写数据请求时上报的第二报告消息,所述第二报告消息中包括上报者的标识、被上报OSD的标识以及被上报OSD的健康状态信息,所述被上报OSD为所述多个OSD中的一个,且所述上报者为上报所述第二报告消息的计算节点。所述管理节点根据接收到的第二报告消息判断第二被上报OSD为亚健康OSD。
本申请第六方面的另一种可能的实现中,或者结合第六方面的第一种实现,在第五方面的第二种实现中,所述管理节点从所述多个OSD中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O视图,将所述更新后的I/O视图发送给所述多个计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD。所述管理节点接收所述切换OSD发送的所述亚健康OSD的第三报告消息,所述第三报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第三健康状态信息,所述亚健康OSD的第三健康状态信息是所述切换OSD基于更新后的分区分配视图将接收到的写数据请求同步给所述亚健康OSD之后,根据所述亚健康OSD返回的写数据响应发送的。所述管理节点根据接收到所述亚健康OSD的第三报告消息判断确定所述亚健康OSD恢复正常。
可选的,所述分区分配视图包括分区,该分区的主OSD和该分区的备OSD的对应关系。且所述根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图包括:如果所述亚健康OSD为所述分区的主OSD,则将所述原主OSD更改为备OSD,所述原备OSD更改为主OSD,将所述切换OSD关联至所述更改后的备OSD。如果所述亚健康OSD为所述分区的备OSD,则将所述切换OSD关联至所述备OSD。
可选的,亚健康OSD恢复正常后,该方法还包括:所述管理节点解除所述亚健康OSD与所述切换OSD之间的对应关系,并更新分区分配视图,将所述更新后的分区分配视图发送给所述多个计算节点、所述切换OSD以及与所述亚健康节点有主备关系的OSD。
上述实施例中所涉及的健康状态信息包括OSD处于亚健康的指示信息。
本申请第七方面提供一种上报健康状态信息的方法,该方法可以由系统中的VBS执行,所述方法包括:接收第一写数据请求,所述第一写数据请求中包括待写入数据、所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息。将所述第一写数据请求中包含的数据分割成数据块,并根据每个数据块的写入位置、偏移和所述块设备信息计算与所述每个数据块要写入的分区。根据I/O视图,找到与所述分区对应的主OSD,向所述主OSD发送第二写数据请求,获得所述第二写数据请求发送到所述主OSD所耗时长,所述第二写数据请求中包括待写入的数据块以及该待写入数据块要写入的分区。向所述管理节点发送第一报告消息,所述第一报告消息中包括所述VBS的标识、所述主OSD的标识以及所述主OSD的健康状态信息。
可选的,向所述主OSD发送第二写数据请求之后,还包括接收所述主OSD返回的写数据响应,通过比较发送所述第二写数据请求的时间和接收到所述写数据响应的时间的差值获得所述第二写数据请求发送到所述主OSD所耗时长。
可选的,所述第二写数据请求发送到所述主OSD所耗时长超过阈值时,发送所述第一报告消息。
本申请第八方面提供另一种上报健康状态信息的方法,该方法可以由系统中的OSD执行,所述方法包括:接收写数据请求,所述写数据请求中包括待写入数据块及所述待写入数据要写入的分区,将所述写数据请求复制给写请求中包括的待写入的分区对应的备OSD,获得所述待写入数据块复制到所述备OSD所耗时长,并且将所述写数据请求发送给所述写数据模块。将写数据请求中包括的待写入数据写入到对应的待写入分区对应的持久化存储资源中。向所述管理节点发送第一报告消息,所述第一报告消息中包括所述OSD的标识、所述备OSD的标识以及所述备OSD的健康状态信息。
可选的,所述的方法还包括:在将写数据请求中包括的待写入数据写入到对应的待写入分区对应的持久化存储资源中时,获得将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长。向所述管理节点发送第二报告消息,所述第二报告消息中包括所述OSD的标识以及所述OSD的健康状态信息。
可选的,确定所述待写入数据块复制到所述备OSD所耗时长超过阈值时或者确定将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长超过所述阈值时,将所述备OSD的亚健康状态信息发送给所述上报模块。换句话说,确定所述待写入数据块复制到所述备OSD所耗时长超过阈值时发送所述第一报告消息或者确定将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长超过所述阈值时,发送第二报告消息。
可选的,将所述写数据请求复制所述备OSD之后,该方法还包括:接收所述备OSD返回的复制响应,通过将所述写数据请求复制给所述备OSD的时间和接收到所述复制响应的时间的差值获得所述待写入数据块复制到所述备OSD所耗时长。
本申请的第九方面提供了用以实现上述第六方面、第七方面、和第八方面的装置,该装置包括处理器和存储器,所述的处理器和存储器用总线连接,其中,所述存储器,用于存放计算机操作指令。具体可以是高速RAM存储器,也可以是非易失性存储器(non-volatile memory)。所述处理器,用于执行存储器中存放的计算机操作指令。处理器具体可以是中央处理器(central processing unit,CPU),或者是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。其中,处理器通过执行该存储器中存储的计算机操作指令以执行所述六方面或第七方面或第八方面的方法。以实现上述MDC或VBS或OSD在数据存储系统中的功能。
本申请的第十方面提供了一种存储介质,用以存储上述第九方面提到的计算机操作指令。当这些操作指令被计算机执行时,可以执行上述六方面或第七方面或第八方面的方法。以实现上述MDC或VBS或OSD在数据存储系统中的功能。
附图说明
为了更清楚地说明本发明实施例的技术方案,下面将对实施例中所需要使用的附图 作简单地介绍。
图1是一种分布式存储系统的结构示意图;
图2是一种本发明实施例中涉及的数据存储系统的结构示意图;
图3是本发明实施例涉及的识别亚健康OSD的方法流程示意图;
图4是本发明实施例涉及的切换亚健康OSD的流程示意图;
图5是本发明实施例涉及的装置结构示意图。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。
另外,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
本发明实施例中涉及到一种数据存储系统,如图1所示,该数据存储系统由两个或两个以上的服务器节点组成,每个服务器节点能够提供计算资源和/或存储资源。所述数据存储系统用于将计算资源和存储资源提供给不同的应用程序使用,该应用程序可以是虚拟机或者数据库等。上述的服务器节点之间可以通过计算网络(比如,高速数据交换网络)进行数据通信。该数据存储系统上运行了分布式存储控制器,所述分布式存储控制器是对每台服务器节点上运行的存储控制功能模块的统称。所述分布式存储控制器使用分布式算法,将各个服务器节点中的存储资源(持久化存储资源、缓存资源)虚拟化为共享存储资源池,供数据存储系统中的应用程序共享使用。当所述数据存储系统上运行应用程序时,所述应用程序的相关数据可以存储在所述数据存储系统的存储资源上,或者,从所述数据存储系统的存储资源上读取所述应用程序的相关数据。需要说明的是:图1仅仅为该数据存储系统的一种示例,实际应用中,既可以部署更多的服务器节点。
在本发明实施例中所述分布式存储控制器可以通过安装在服务器的硬件设备上的软件模块来实现。具体地,所述分布式存储控制器从功能上可以划分为如下几个部分:
元数据控制器(Meta Data Controller,MDC),主要用于为数据存储系统中的每个对象存储设备(Object Storage Device,OSD)分配与该OSD所管理的物理存储资源对应的分区(Partition),建立分区分配视图。当所述数据存储系统中的OSD跟分区的对应关系发生变化时,所述的MDC还用于更新分区分配视图。其中,分区是指对存储层的物理存储资源的逻辑映射。这里的物理存储资源通常是指持久化存储资源,可以由机械硬盘,如Hard disk drive,HDD,固态硬盘SSD或存储类内存(Storage Class Memory,SCM)设备等来提供持久化存储资源。分区分配视图包括分区和该分区所归属的OSD之间的对应关系。为满足用户可用性需求,数据存储系统通常采用多副本存储。在实际应用中可能是三副本,或者其他多副本的方式。在多副本存储的场景下,一个分区跟多个OSD有对应关系,并且,所述多个OSD 中的一个作为该分区的主OSD,而其余的OSD则作为该分区的备OSD。在这种场景下,分区分配视图包括分区、该分区的主OSD和该分区的备OSD之间的对应关系。其中,该分区分配视图包括I/O视图,可以理解为所述I/O视图是分区分配视图的子表。I/O视图用于记录存储分区与OSD的对应关系。虚拟块系统(Virtual Block System,VBS),作为存储的驱动层,用于向访问所述数据存储系统的应用程序提供块访问接口,同时完成块存储数据的读写逻辑。VBS上保存了I/O分配视图,该I/O分配视图包含了所有分区以及跟该分区对应的主OSD的对应关系。根据所述I/O视图,将数据转发到相应的OSD上,或者从相应的OSD上读取数据。
OSD用于接收读写数据请求,根据所述读写数据请求从所述OSD所管理的分区对应的持久化存储资源中读取数据或者写入数据。OSD上保存着分区分配视图,该分区分配视图包含了所述数据存储系统中所有分区及其对应的主备OSD的对应关系。根据分区分配视图,OSD可以找到待读取数据或者待写入数据的分区对应的备OSD。
通常,数据存储系统的具体部署方式可以依据用户需要选择不同的配置文件,该配置文件中包括上述功能模块的部署策略、数据存储系统的分区规格(即,把每个硬盘分为多少份)以及不同服务器节点间的互相通信地址信息(包括MDC、VBS和OSD的地址信息)等。在实际部署中,每台服务器节点根据配置文件中的部署策略运行分布式存储控制器的不同的功能模块,也就是说,根据该部署策略,可以在不同的服务器节点上运行分布式存储控制器的不同的功能模块,每台服务器节点可以运行分布式存储控制器所有的功能模块,也可以运行分布式存储控制器部分的功能模块。比如,上述MDC可以只部署在数据存储系统的某个服务器节点上,上述的VBS可以部署在所述数据存储系统中具有计算资源的每个服务器节点中;上述OSD可以部署在集群系统中的具有存储资源的每个服务器节点上。根据实际需要,可以在一个服务器节点上部署一个OSD或者多个OSD。
为便于描述,在下文中,将部署有所述MDC的服务器节点称之为管理节点。另外,将能够提供计算资源的服务器节点称之为计算节点,而将能够提供存储资源的服务器节点称之存储节点。可以理解的是,以上述举例中的部署策略,所述的OSD都部署在存储节点上,而所述的VBS都部署在计算节点上。这里所说的计算节点、存储节点、管理节点是逻辑概念,物理上,一个服务器节点既可以是计算节点,也可以是存储节点,还可以是管理节点。
初始化阶段,用户通过系统的管理端将配置文件导入数据存储系统,MDC根据导入的配置文件建立数据存储系统的分区分配视图,所述分区分配视图包括该数据存储系统中的OSD跟分区之间的映射关系。在多副本存储的场景下,分区分配视图包括分区、该分区的主OSD和该分区的备OSD之间的对应关系;I/O视图则包括分区与该分区的主OSD之间的对应关系。
当服务器节点上的OSD被激活后,所述OSD向MDC请求分区分配视图,根据该请求,MDC将已经分配好的分区分配视图发送给该OSD。当服务器节点上的VBS被激活后,所述VBS向MDC请求I/O视图,根据该请求,MDC将已经分配好的I/O视图发送给对应的VBS。可以理解的是,也可以是MDC在生成分区分配视图之后将该分区分配视图发送给所述OSD并将其中的I/O视图发送给所述VBS,或者在分区分配视图更新后,将更新后的分区分配视图发送给所述OSD以及将更新后的I/O视图发送给所述VBS。
本发明实施例就是在前面的基础上来实现的。参考图2,本发明实施例提供的一种数据 存储系统,所述系统包括管理节点和多个存储节点,其中所述系统中部署了多个OSD,所述多个OSD位于所述多个存储节点上;所述多个OSD包括第一OSD和第二OSD。其中,所述第一OSD用于接收第一写数据请求,所述写数据请求中包括待写入数据块以及相应的待写入的分区,根据分区分配视图确定所述待写入的分区的备OSD为所述第二OSD,将所述第一写数据请求复制给所述第二OSD,获得所述数据块复制到所述第二OSD所耗时长之后向所述管理节点发送第一报告消息,所述第一报告消息中包括所述第一OSD的标识、所述第二OSD的标识以及所述第二OSD的健康状态信息。所述管理节点用于接收所述第一报告消息,根据所述第一报告消息更新所述管理节点上保存的OSD健康状态记录,根据所述OSD健康状态记录确定所述第二OSD为亚健康OSD,所述OSD健康状态记录包括所述其他OSD上报的所述第二OSD的健康状态信息。这里说的其他OSD是多个OSD中除第一OSD和第二OSD之外的OSD。
可选的,所述计算节点用于接收第二写数据请求,将所述第二写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个数据块要写入的分区,根据I/O视图确定出处理所述待写入数据块的主OSD包括所述第一OSD,向所述第一OSD发送所述第一写数据请求,获得发送所述第一写数据请求到所述第一OSD所耗时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息。所述管理节点还用于根据所述第二报告消息更新所述管理节点上记录的OSD健康状态记录,根据所述OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第二OSD的健康状态信息。
可选的,所述计算节点用于接收第一读数据请求,确定所述第一读数据请求所要读取的每个待读取数据块所在的分区,根据I/O视图确定出处理所述待读取数据块的主OSD包括所述第一OSD,向所述第一OSD发送第二读数据请求,获得发送所述第二读数据请求到所述第一OSD所耗时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息,所述第二读数据请求中包括待读取数据块所在的分区。所述管理节点还用于接收所述第二报告消息,根据所述第二报告消息更新所述管理节点上保存的OSD健康状态记录,根据OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第一OSD的健康状态信息。
可选的,所述第一OSD还用于将所述待写入所述第一OSD所管理的分区的数据块写入到相应的待写入的分区对应的持久化存储资源中,获得将所述数据块写入到所述持久化存储资源所耗时长后,向所述管理节点发送第三报告消息,所述第三报告消息包括所述第一OSD的标识以及所述第一OSD的健康状态信息。所述管理节点用于根据所述第三报告信息确定所述第一OSD为亚健康OSD。
可选的,所述管理节点用于从所述多个OSD中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O视图,将所述更新后的I/O视图发送给所述多个计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD,所述切换OSD不同 于所述第一OSD及所述第二OSD。所述计算节点用于接收第三写数据请求,将所述第三写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个待写入数据块要写入的分区,根据I/O视图确定出处理所述至少一个待写入数据块的主OSD包括所述第三OSD,向所述第三OSD发送第四写数据请求,所述第四写数据请求中包括待写入所述第三OSD所管理的分区的数据块以及相应的待写入的分区,其中,所述第三OSD为所述多个OSD中一个,且不同于所述切换OSD。所述第三OSD接收所述第三写数据请求后,根据更新后的分区分配视图将所述第三写数据请求复制给所述切换OSD,其中,所述更新后的分区分配视图中所述第三写数据请求中包括的待写入的分区对应的备OSD为所述亚健康节点。所述切换OSD还用于基于更新后的分区分配视图将接收到的第三写数据请求同步给所述亚健康OSD。
可选的,所述切换OSD还用于在在获得将所述第三写数据请求同步给所述亚健康OSD所耗时长后向所述管理节点发送第三报告消息,所述第三报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第三健康状态信息。所述管理节点还用于根据所述第三报告消息更新所述管理节点上记录的OSD健康状态记录,根据所述OSD健康状态记录确定所述亚健康OSD恢复正常,所述OSD健康状态记录包括其他OSD上报的所述亚健康OSD的健康状态信息。
可选的,所述第一OSD用于接收所述第二OSD返回的复制响应,通过比较发送所述第一写数据请求的时间和接收到所述复制响应的时间的差值获得所述数据块复制到所述第二OSD所耗时长。
上述将第三写数据请求同步给所述亚健康OSD所耗时长也可以是接收到亚健康OSD返回的同步响应之后,通过比较发送第三写数据请求的时间和接收到同步响应的时间的差值获得。
可选的,所述计算节点还用于接收所述第一OSD返回的第一写数据响应,通过比较发送所述第一写数据请求的时间和接收到所述第一写数据响应的时间的差值获得所述发送所述第一写数据请求到所述第一OSD所耗时长。
可选的,所述计算节点还用于接收所述第一OSD返回的针对所述第二读数据请求的读数据响应,通过比较发送所述第二读数据请求和接收所述读数据响应的时间的差值获得发送所述第二读数据请求到所述第一OSD所耗时长。
可以理解的是,上述健康状态信息可以包括OSD处于亚健康的指示信息。也可以是上述所耗时长。上述提到的第一报告消息、第二报告消息、第三报告消息可以是心跳消息。
上述的管理节点的功能可以由MDC实现的。参考图2,在一种可能的实现中,MDC包括接收模块和管理模块;其中,所述接收模块用于接收数据存储系统中的计算节点或者OSD上报的报告消息,所述报告消息中包括上报者的标识,被上报OSD的标识以及被上报者的健康状态信息;所述管理模块用于根据接收到的报告消息更新保存的OSD健康状态记录,根据所述更新后的OSD健康状态记录确定所述被上报OSD中的一个或多个OSD为亚健康OSD。
可选的,所述管理模块用于从所述数据存储系统中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O 视图,将所述更新后的I/O视图发送给计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD。
可选的,所述接收模块还用于接收所述切换OSD发送的所述亚健康OSD的第一报告消息,所述第一报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第一健康状态信息,所述亚健康OSD的第一健康状态信息是所述切换OSD基于更新后的分区分配视图将接收到的写数据请求同步给所述亚健康OSD之后,根据所述亚健康OSD返回的写数据响应发送的。所述管理模块根据接收到所述亚健康OSD的第一报告消息更新保存的OSD健康状态记录,根据更新后的所述OSD健康状态记录判断确定所述亚健康OSD恢复正常。
上述的计算节点的功能由虚拟块系统(Virtual Block System,VBS)实现。参考图2,所述VBS作为存储的驱动层,用于向访问所述数据存储系统的应用程序提供块访问接口,同时完成块存储数据的读写逻辑。VBS上保存了I/O分配视图,该I/O分配视图包含了所有分区以及跟该分区对应的主OSD的对应关系。根据所述I/O视图,将数据转发到相应的OSD上,或者从相应的OSD上读取数据。
参考图2,在一种可能的实现中,所述VBS包括访问接口、业务模块、客户端、上报模块,其中,所述访问接口,用于接收第一写数据请求,所述第一写数据请求中包括待写入数据、所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息。所述业务模块,用于将所述第一写数据请求中包含的数据分割成数据块,并根据每个数据块的写入位置、偏移和所述块设备信息计算与所述每个数据块要写入的分区。所述客户端,用于根据I/O视图,找到与所述分区对应的主OSD,向所述主OSD发送第二写数据请求,获得所述第二写数据请求发送到所述主OSD所耗时长,所述第二写数据请求中包括待写入的数据块以及该待写入数据块要写入的分区。所述上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述VBS的标识、所述主OSD的标识以及所述主OSD的健康状态信息。
可选的,所述客户端还用于接收所述主OSD返回的写数据响应,通过比较发送所述第二写数据请求的时间和接收到所述写数据响应的时间的差值获得所述第二写数据请求发送到所述主OSD所耗时长。
可选的,所述VBS还包括判断模块(图中未示出),所述判断模块,用于确定所述第二写数据请求发送到所述主OSD所耗时长超过阈值时,将所述备OSD的亚健康状态信息发送给所述上报模块。
如果应用发送的是读数据请求,则所述访问接口还用于接收所述读数据请求,所述读数据请求中包括所述待读取数据的起始位置、所述待读取数据的数据长度和所述待读取数据的块设备信息。所述业务模块还用于根据将待读取数据分成数据块,并根据每个数据块的起始位置、偏移和所述块设备信息计算与所述每个数据块所在的分区。所述客户端,用于根据I/O视图,找到与所述分区对应的主OSD,从所述主OSD上发送读数据请求,所述读数据请求中包括待读取数据块所在的分区。
其中,访问接口可以是基于小型计算机系统接口(Small Computer System Interface,简称SCSI)的块设备访问接口,即SCSI接口。上报模块可以是心跳模块。
其中,I/O视图可以由MDC主动下发给VBS,也可以由VBS主动从MDC上获取。可替 代的,也可以是MDC下发所述分区分配视图给所述VBS,而所述VBS根据接收到的分区分配视图生成所述的I/O视图。
参考图2,在一种可能的实现中,上述OSD可以包括复制模块、写数据模块和上报模块,其中,所述复制模块用于接收写数据请求,所述写数据请求中包括待写入数据块及所述待写入数据要写入的分区,将所述写数据请求复制给写请求中包括的待写入的分区对应的备OSD,获得所述待写入数据块复制到所述备OSD所耗时长,并且将所述写数据请求发送给所述写数据模块。所述写数据模块,用于接收所述写数据请求,将写数据请求中包括的待写入数据写入到对应的待写入分区对应的持久化存储资源中。所述上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述OSD的标识、所述备OSD的标识以及所述备OSD的健康状态信息。
可选的,所述写数据模块还用于获得将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长。所述上报模块,还用于向所述管理节点发送第二报告消息,所述第二报告消息中包括所述OSD的标识以及所述OSD的健康状态信息。
可选的,所述OSD还包括判断模块(图中未示出),所述判断模块,用于确定所述待写入数据块复制到所述备OSD所耗时长超过阈值时或者确定将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长超过所述阈值时,将所述备OSD的亚健康状态信息发送给所述上报模块。
可选的,所述复制模块还用于接收所述备OSD返回的复制响应,通过将所述写数据请求复制给所述备OSD的时间和接收到所述复制响应的时间的差值获得所述待写入数据块复制到所述备OSD所耗时长。
可选的,上述写数据模块可以是异步I/O模块,复制模块可以是复制协议状态机(Replicated State Machine,RSM),上报模块可以是心跳模块。
可以理解的是,上述实施例中所提到的接收模块也可以是心跳模块。本发明所涉及到的心跳模块是指用来接收和发送心跳消息的模块。可以理解的,当上报模块是心跳模块的时候,上报模块所发送的信息是携带在心跳消息中上报的。相应的,MDC中也是由心跳模块来接收心跳消息的。
图2示出了本发明实施例的一个数据存储系统100的逻辑结构图。如图2所示,该数据存储系统100包括:计算节点1、2,存储节点1、2、3,以及管理节点。为便于描述,在下面的方法实施例中,以每个分区对应一个主OSD和一个备OSD,且每个存储节点上有一个OSD为例。并且,对于分区1而言,主OSD是存储节点1上的OSD1,备OSD是存储节点2上的OSD2。而对于分区2而言,主OSD是存储节点2上的OSD2,备OSD是存储节点3上的OSD3。系统中还有其他分区,为简洁,此处不再展开。具体到本实施例中,OSD上保存着分区1、OSD1和OSD2之间的对应关系,以及分区2、OSD2和OSD3之间的对应关系。VBS上保存的I/O分配视图中保存了分区1和OSD1的对应关系,以及分区2和OSD2的对应关系。可以理解的是,上述的系统实施例以及装置实施例中涉及的实现细节也可以参考下述的方法实施例。
本发明实施例提供的一种识别亚健康OSD的方法,应用于图2所示的分布式存储系统中,参考图3,该方法包括如下步骤:
S302,当运行在所述服务器集群系统上的任意一个应用程序发起写数据操作之后,计算 节点1中的VBS接收到写数据请求。所述写数据请求包括待写入数据、所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息。此处的写数据请求可以是写I/O请求。
S304:所述VBS根据所述写数据请求包括的所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息确定处理所述待写入数据的主OSD。
本步骤的具体过程可以包括:所述VBS可以根据预设长度将所述待写入数据分割为多个数据块,通过一致性哈希算法计算出每个数据块对应的分区信息,然后根据保存的I/O视图为所述每个数据块找到对应的主OSD。可以理解的是,待写入数据也可以是小于预设长度的,该VBS根据预设长度将该待写入数据划为一个数据块。待写入数据可以划分为一个或多个数据块,因此,对应于该待写入数据的主OSD也可以是一个或者多个。
本实施例中,假设待写入数据被划分为2个数据块,数据块1和数据块2。其中,根据一致性哈希算法计算出数据块1对应的分区为分区1,数据块2对应的分区为分区2。根据I/O视图可知,分区1对应的主OSD为OSD1,而分区2对应的主OSD为OSD2。
S306:所述VBS分别向所述每个数据块对应的主OSD发送写数据请求,该写数据请求中包括要写入该主OSD的待写入数据块以及该待写入数据块要写入的分区。并且,记录所述写数据请求的发送时间。
也就是说,VBS分别向存储节点1上的OSD1发送包括数据块1和分区1的标识的写数据请求,而向存储节点2上的OSD发送包括数据块2和分区2的标识的写数据请求。下文以数据块1的处理过程为例来说明本发明实施例的具体处理过程。
S308,所述存储节点1上的OSD1接收到包括有数据块1的写数据请求之后,向发送该写数据请求的VBS1返回一个写数据响应。
S310,OSD1在接收到包含数据块1和分区1的写数据请求之后,调用存储节点1上运行的操作系统(Operating System,OS)的系统调用接口向OSD1所管理的分区1对应的持久化资源写入数据块1,记录数据块1的写入时间。所述的OSD将数据块1写入到所述OSD管理的持久化存储资源中,向OSD1返回写入响应。OSD1接收所述OS返回的写入响应时,通过比较数据块1的写入时间以及收到相应的写入响应的时间,获取数据块1写入到OSD1所管理的持久化存储资源上所耗费的时长。
S312,OSD1接收到包括有数据块1的写数据请求之后,根据分区分配视图将所述写数据请求复制给分区1的备OSD。并且记录将该写数据请求的发送时间。
由于本实施例中,采用的多副本存储,根据OSD1中存储的分区分配视图可知,分区1对应的主OSD为存储节点1上OSD1,而分区1对应的备OSD为存储节点2上的OSD2。因此,存储节点1上的OSD1收到包括有数据块1的写数据请求之后,将所述写数据请求复制给OSD2,将该并记录发送时间。S314,OSD2接收到所述的写数据请求后,将其中的数据块1写入OSD2所管理的持久化存储资源中,持久化完成之后,向OSD1返回复制响应。同时,所述OSD2也可以记录OSD2将数据块1写入到OSD2所管理的持久化存储资源上的时长。
S316,OSD1接收OSD2返回的复制响应,比较记录的发送所述写数据请求给所述OSD2的时间以及接收所述复制响应的时间,获得数据复制到备OSD的所耗时长。
S318,OSD1可以将所述OSD2的ID和OSD2的健康状态信息上报给管理节点。其中,所述OSD2的健康状态信息用以反映OSD2的健康状态。
在一种可能的实现中,OSD2的健康状态信息是指所述数据复制到备存储节点所耗的时长。
在另外一种可能的实现中,OSD2的健康状态信息是指该OSD2处于亚健康的指示信息。在这种实现中,OSD1根据所述数据复制到备存储节点所耗的时长判断所述OSD2的是处于亚健康状态还是处于健康状态。比如,当数据块1复制到备存储节点所耗的时长超过一定阈值时,则认为OSD2处于亚健康状态。当所述OSD2处于亚健康状态时,可以在OSD1所在存储节点1上报给管理节点的心跳消息中包括所述的OSD2的ID以及该OSD2处于亚健康状态的指示信息。其中该亚健康状态的指示信息可以是指亚健康类型。也有人将亚健康类型称为故障等级。
通常,在上报给管理节点的心跳消息的报文头中携带上报者OSD1的标识,通过下面的报文字段携带OSD2的ID以及OSD2处于亚健康状态的指示信息:
typedef struct PACKFLAG osd_heartbeat_request
{
required dsw_u16 osd_num;
required dsw_u8 type;
required dsw_u32 osd_array[0];
}osd_heartbeat_request_t;
其中,osd_num是指被上报的OSD的数量,type是指被上报的OSD亚健康;
osd_array则是指被上报的OSD的列表。
可以理解的是,上述S310中,当OSD1获取数据块1写入到OSD1所管理的持久化存储资源上所耗费的时长之后,OSD1可以OSD1的ID和OSD1的健康状态信息上报给管理节点。其中,所述OSD1的健康状态信息用以反映OSD1的健康状态。其中,OSD的健康状态信息可以是指所述数据复制到备存储节点所耗的时长,也可以是指该OSD1处于亚健康状态的指示信息。OSD1可以根据数据块1写入到OSD1所管理的持久化存储资源上所耗费的时长来判断OSD1的健康状态。比如,当OSD1将数据块1写入到OSD1所管理的持久化存储资源上所耗费的时长超过一定阈值时,判断该OSD1处于亚健康状态。当所述OSD1处于亚健康状态时,可以在OSD1所在存储节点1上报给管理节点的心跳消息中包括所述的OSD1的ID以及该OSD1处于亚健康状态的指示信息。
同样的方法,上述S314中,当OSD2获取数据块1写入到OSD2所管理的持久化存储资源上所耗费的时长之后,也可以将OSD2的ID和OSD2的健康状态信息上报给管理节点。OSD1和OSD2可以参考上述的例子,利用现有的心跳消息上报本OSD的健康状态信息。
当然上述的OSD的健康状态信息也可以通过另外的消息来上报,本发明不作限制。
S320,所述VBS1接收到OSD1发送的写数据响应之后,比较所述VBS1发送写数据请求的时间和接收到所述写数据响应的时间,获得发送该写数据请求到主OSD的时长。
VBS1获得的发送该写数据请求到主OSD的时长之后,可以将OSD1的ID和OSD1的健康状态信息上报给管理节点。所述OSD1的健康状态信息用以反映OSD1的健康状态。其中,OSD1的健康状态信息可以是发送该写数据请求到主OSD的时长,也可以是该OSD1处于亚健康状态的指示信息。OSD1可以根据发送该写数据请求到主OSD来判断OSD1的健康 状态。比如,当发送该写数据请求到主OSD的时长超过一定阈值时,判断该OSD1处于亚健康状态。当所述OSD1处于亚健康状态时,可以在VBS1所在计算节点1上报给管理节点的心跳消息中包括所述的OSD1的ID以及该OSD1处于亚健康状态的指示信息。
通常,在上报给管理节点的心跳消息的报文头中携带上报者的标识,通过下面的报文字段携带OSD1的ID以及OSD1处于亚健康状态的指示信息:
typedef struct PACKFLAG unhealthy_osd_list_req_s
{
required dsw_u16 osd_num;
required dsw_u8 type;
required dsw_u32 osd_array[0];
}unhealthy_osd_list_req_t;
其中,osd_num是指被上报的OSD的数量,type是指被上报的OSD亚健康;
osd_array则是指被上报的OSD的列表。
可以理解的是,OSD和计算节点上报心跳消息的时间并无严格的时间顺序。
S322,管理节点接收上述的存储节点和/或计算节点上报的存储节点的健康状态信息,并根据接收到的健康状态信息确定数据存储系统中的OSD的健康状态,并进行相应的处理。
可以理解的是,数据存储系统中的计算节点和存储节点可以是多个的,其中,每个计算节点和存储节点在处理写数据请求或者读数据请求的时候,都可能上报该写数据请求或者读数据请求所涉及到OSD的健康状态信息。而且,在实际的部署中,一个OSD可以作为很多分区的主OSD,也可以作为很多其他分区的备OSD。
因此,管理节点会接收到多个心跳消息,管理节点上可以设置健康状态记录表,用以记录该数据存储系统中处于亚健康状态的OSD。可以是记录亚健康OSD的标识以及上报该亚健康OSD的上报者的标识之间的对应关系,记录的格式本发明不做限定。根据发送者的不同,可以将收到的上报信息可以分为下面三类:计算节点上报的OSD的健康状态信息、存储节点上报的其他存储节点上的OSD的健康状态信息、存储节点上报本节点中的OSD的健康状态信息。正如前面所说的,上报的健康状态信息可以是指示OSD处于亚健康状态的指示信息,也可以是处理读写数据请求的时延。
这里的处理读写数据请求的时延可以包括OSD将数据写入自身所管理的持久化存储资源中所耗时长、主OSD将数据复制到备OSD所耗时长、VBS将读写数据请求发送到主OSD所耗时长。
可以理解的是,当从持久化存储资源上读取数据中,所述处理读写数据请求的时延可以包括OSD从自身管理的持久化存储资源中读数据所耗时长、VBS将读写数据请求发送到主OSD所耗时长。
管理节点接收到各节点上报的OSD的健康状态信息之后,可以根据设定好的管理策略对OSD进行管理。下面的例子中,各节点上报的健康装信息为指示OSD处于亚健康状态的指示信息,管理节点对上述三类信息分别管理。
针对计算节点上报的OSD的健康状态信息,管理节点执行下述的步骤:
S322-11,获取某一段时间段内,该数据存储系统中处理过读写数据请求的计算节点的数量,将这些计算节点的数量记为X。
S322-12,在上述所有读写数据请求所涉及到的n个主OSD中,统计每个主OSD被多少个计算节点上报为亚健康状态,并记录上报每个主OSD的计算节点数量Yi,其中,i为1-n的整数。
可替代地,也可以在上述所有读写数据请求所涉及到的存储节点中,统计出被最多的计算节点上报为亚健康的那个主OSD,将上报该主OSD的计算节点数记为Y。
S322-13,针对每个主OSD,计算上报该主OSD亚健康状态的计算节点比率(Yi/X)是否落在预先设定的范围内,如果是,则判定该主OSD处于亚健康状态,根据其故障等级,将其永久隔离或在线临时隔离。其中,故障等级可以根据处理读写数据请求的时延大小、该时延影响的计算节点比率等因素确定的,本发明不作限制。
可替代地,如果S322-12中上报的是Y,上述的计算节点比率也可以是Y/X。
针对存储节点上报其他存储节点上的OSD的健康状态信息,管理节点执行下述的步骤:
S322-21,获取某一段时间内,向其他存储节点上的备OSD复制过写数据请求的主存储节点的数量,将这些主存储节点的数量记为X’。此处的主存储节点是指主OSD所在的存储节点。可以理解的是,主OSD是与分区相对应的,某一分区的主OSD也可能是另外一个分区的备OSD。因此,同一个存储节点,既可以作为主存储节点,也可以作为备存储节点。
S322-22,在上述复制操作所涉及到的n个备OSD中,统计每个备OSD被多少个其他存储节点上报为亚健康状态,并记录上报每个OSD的其他存储节点的数量Y’i,其中,i为1-n的整数。
可替代的,在上述复制操作所涉及到的n个备OSD中,统计出被最多的其他存储节点上报为亚健康的那个备OSD,将上报该备OSD的其他存储节点数记为Y。
S322-23,针对每个备OSD,计算上报备OSD亚健康状态的存储节点比率(Y’i/X’)是否落在预先设定的范围内,如果是,则判定该存储节点亚健康,根据其故障等级,将其永久隔离或在线临时隔离。其中,故障等级可以根据处理读写数据请求的时延大小、该时延影响的存储节点比率等因素确定的,本发明不作限制。
针对存储节点上报本存储节点上的OSD的健康状态信息,管理节点执行下述的步骤:
根据故障等级将其永久隔离或在线临时隔离。
可替代地,管理节点接收到的健康状态信息是指处理读写数据请求的时延。那么管理节点在接收到这样的健康状态信息之后,根据一定的策略确定数据存储系统中的哪个OSD是亚健康状态需要临时隔离或者永久隔离。具体的策略,本发明不作限制。
在上述的实施例中,通过检测处理读写数据请求所经过的路径中可能遇到的时延,更加全面和准确的检测数据存储系统中各节点的健康状态,根据该检测结果对亚健康OSD进行相应处理。
在上述的数据存储系统中,每个OSD既可以多个分区的主OSD,也可以另外一些分区的备OSD。以OSD2为例,假如在实际的部署中,OSD2是X个分区的备OSD,同时也是Y个分区的主OSD。那么,X+Y分区就是所述OSD2所管理的分区。当通过上述实施例的方法确定OSD2处于亚健康状态时,需要有其他的OSD来接管所述OSD2所管理的分区。这样,当一个数据要写入OSD2所管理的分区时,携带该数据的写数据请求会被分配到接管OSD2的OSD中。为便于描述,下文中,处于亚健康状态的OSD称为亚健康OSD;用来接管亚健康OSD所管理的分区的OSD被称为切换OSD。OSD2管理的分区可以分配给多个切换OSD, 也可以分配给一个切换OSD,具体的分配可以依据存储系统中其他OSD的负载等因素来确定,也可以是依据某些预设的策略来确定。
下面结合附图4,以分区1的备OSD(即OSD2)被确定为处于亚健康状态,并将OSD2上的分区分配给一个切换OSD为例,来说明本发明实施例的一种隔离亚健康OSD的方法。
S402,管理节点为所述OSD2分配切换OSD,由该切换OSD接管所述亚健康OSD所管理的分区,来代替所述亚健康OSD处理后续的写数据请求。其中,管理节点可以采用如下的算法来为所述亚健康存储节点选择一个切换节点。
首先,优先选择与原OSD不在一个冲突域里的其他OSD。数据存储系统一般都会预设冲突域。
其次,在满足冲突域的前提下,优先选择容量更低且持久化存储资源无故障的存储节点中的OSD作为切换OSD,以保证集群的存储均衡分散。
比如说,当机柜为冲突域时,在没有冲突的机柜中,选一个容量占用最少的机柜。在该机柜下选择容量占用最少的服务器;并且在该服务器中选择一个容量占用最少,持久化存储资源无故障的存储节点。
本实施例中,以冲突域设置为机柜为例,依据上述的方法将OSD4确定为OSD2的切换OSD。可以理解的是,可以是有多个切换OSD来接管OSD2所管理的分区,本发明不作限定。
由于管理节点选择切换节点OSD4来接管亚健康节点OSD2所管理的分区,OSD4成为原来由OSD2所管理的所述X个分区的备OSD。由于OSD2也是所述Y个分区的主OSD,因此,OSD4在接管所述Y个分区之前,将原来的OSD2降为所述Y个分区的备OSD,并且将原来作为所述Y个分区的备OSD升为主OSD。然后,由OSD4代替OSD2作为所述Y个分区的备OSD。可以理解的是,MDC将根据这些变化更新数据存储系统的分区分配视图,并且将更新后的分区分配视图发送给存储节点中的OSD。由于切换节点的加入,更新后的分区分配视图还可以包括亚健康节点以和切换节点之间的对应关系。参考图4,切换的过程包括如下步骤(有些步骤在图中未示出):
S404,管理节点将更新后的分区分配视图发送给存储系统中所有的计算节点、所述切换节点以及与所述切换OSD所接管的分区相关的OSD上。
切换后,OSD4成为所述X+Y个分区的备OSD,那么这X+Y个分区的当前的主OSD可以称之为切换OSD所接管的分区相关的OSD。由于分区分配视图发生了更新,因此,管理节点将该更新后的分区分配视图发送给存储系统中相关的节点。
S406,存储系统中的计算节点接收到更新的分区分配视图之后,刷新本地的I/O分配视图。和所述切换OSD所接管的分区相关的OSD接收到分区分配视图之后,更新本地的分区分配视图。
S408,计算节点1中的VBS接收到写数据请求后,根据所述写数据请求包括的待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息确定处理所述待写入数据的主OSD。
跟上述步骤304相似,VBS根据预设长度将所述写数据请求中携带的待写入数据分割为多个数据块,通过一致性哈希算法计算出每个数据块对应的分区信息,根据保存的I/O视图找到对应的主OSD。本实施例中,假设待写入数据被划分为2个数据块,数据块1和数据块2。其中,根据一致性哈希算法计算出数据块1对应的分区为分区1,数据块2对应的分区为 分区2。参考上一实施例中的举例,根据I/O视图可知,分区1对应的主OSD为OSD1,而分区2对应的主OSD为OSD2。如果切换前分区1对应的备OSD为OSD2,而分区2的备OSD为OSD3,那么切换后,分区1对应的主OSD还是OSD1,分区1对应的备OSD为OSD4;分区2的主OSD为OSD3,而分区2的备OSD为OSD4。
VBS根据更新后的I/O视图,将向OSD1发送写数据请求,发送给OSD1的写数据请求中包括数据块1和分区1的标识。另外,VBS也会根据更新后的I/O视图,向OSD3发送写数据请求,发送给OSD3的写数据请求中包括数据块2以及分区2的标识。
S410,OSD1接收到包含有数据块1的写数据请求时,根据更新后的分区分配视图,将接收到的写数据请求复制给所述OSD4。此时,OSD4代替了OSD2成为分区1的备OSD。
S412,OSD4接收到包含数据块1和分区1的标识的写数据请求后,根据保存的分区分配视图,得知自己为分区1的备OSD,且为OSD2的切换OSD。所述OSD4将数据块1写入其所管理的持久化存储资源上,并获取数据块1写入到该持久化存储资源所耗费的时长。
S414,OSD4还将该包含有数据块1的写数据请求发送给后台的切换(handoff)线程,由handoff线程将该写数据请求给异步发送给OSD2,并且记录发送该写数据请求的时间。
S416,OSD4接收到OSD2返回的写入响应后,根据步骤414中记录的发送写数据请求的时间以及接收到写入响应的时间确定数据同步到亚健康OSD(OSD2)所需的时长。
OSD4可以根据设定好的策略决定是否将该OSD2的健康状态信息上报给管理节点。比如,当同步到OSD2所需的时长超过一定阈值时,将OSD2的健康状态信息发送给管理节点。上报OSD2的健康状态信息的方法参考上面实施例的描述,此处不再赘述。
S416,管理节点根据接收到的健康状态信息,确定亚健康OSD的健康状态。管理节点判断OSD健康状态的方法参考上一实施例,此处不再赘述。
上述S408中VBS将包含数据块2和分区2的标识的写数据请求发送给了OSD3。下面描述OSD3接收到包含数据块2和分区2的标识的写数据请求后的处理过程。
S418,OSD3接收到包含数据块2和分区2的标识的写数据请求后,根据更新后的分区分配视图,得知自己为分区2的主OSD。所述OSD3将数据块2写入其所管理的持久化存储资源上,并获取数据块2写入到该持久化存储资源所耗费的时长,并且OSD3根据更新后的分区分配视图得知所述分区2的备OSD为亚健康节点,且该亚健康节点的切换节点为OSD4,则将写数据请求复制给切换节点OSD4。
S420,OSD4接收到包含数据块2和分区2的标识的写数据请求之后,将数据块2写入其所管理的持久化存储资源中,并获取数据块2写入到持久化存储资源的时长。
S422,OSD4根据本地保存的更新后的分区分配视图,得知自己是亚健康OSD(OSD2)的切换节点,且OSD2为分区2的备OSD。则OSD4将该包含有数据块2的写数据请求发送给后台的切换(handoff)线程,由handoff线程将该写数据请求给异步发送给OSD2,并且记录发送该写数据请求的时间。
S424,OSD4接收到OSD2返回的写入响应后,根据步骤422中记录的发送写数据请求的时间以及接收到写入响应的时间确定数据同步到切换前的OSD(OSD2)所需的时长。
OSD4可以根据设定好的策略决定是否将该OSD2的健康状态信息上报给管理节点。比如,当同步到OSD2所需的时长超过一定阈值时,将OSD2的健康状态信息发送给管理节点。上报OSD2的健康状态信息的方法参考上面实施例的描述,此处不再赘述。
可以理解的是,随着时间的推移,管理节点上收集的数据有刷新,若在某个时刻,管理节点确定OSD2已恢复正常,据此更新分区分配视图,也就是把分区分配视图中,OSD2和切换节点之间的对应关系删除。将更新后的分区分配视图再发送给存储系统中所有的计算节点以及跟所述切换OSD所接管的分区相关的OSD接收到该广播后,更新其保存的I/O分配视图或者分区分配视图。此时,也可以根据保存的最优分区分配,确定是否需要刷新主备OSD的身份。如上文所言,OSD2是某些分区的主OSD,因此,在OSD2切换前,会将OSD2降为这些分区的备OSD,而将原来的备OSD升为这些分区的主OSD。那么当OSD2恢复健康后,也可以将这些分区的主备OSD切换回来。切换回来的方式可以是在管理节点确定OSD2恢复健康后,更新分区分配视图,在更新后的分区分配视图中,针对这些分区,将主备OSD的身份切换回来。
由于,处于亚健康状态的OSD2被在线隔离之后,切换节点将接收到的写数据请求通过handoff线程异步推送给了OSD2,因此,对于分区1而言,OSD2跟OSD1保持着数据一致性。对于分区2而言,OSD2和OSD3保持着数据一致性。一旦OSD2故障解除,OSD2可以直接投入使用。
相反,如果在持续一段时间内,OSD2一直处于亚健康状态,管理节点可以将它清除出数据存储系统。
上述实施例提到的VBS、SOD和管理节点(MDC)可以通过安装在服务器的硬件设备上的软件模块来实现。参考图5,所述的VBS、OSD和管理节点都可以分别包括处理器501和存储器502,其中,所述处理器501和存储器502总线完成相互间的通信。
其中,所述存储器502,用于存放计算机操作指令。具体可以是高速RAM存储器,也可以是非易失性存储器(non-volatile memory)。
所述处理器501,用于执行存储器中存放的计算机操作指令。处理器501具体可以是中央处理器(central processing unit,CPU),或者是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
其中,处理器501通过执行该存储器502中存储的不同计算机操作指令以执行所述VBS、SOD和MDC在上述实施例中的动作,实现其功能。
本发明的另外一个实施例提供了一种存储介质,用以存储上述实施例中提到的计算机操作指令。当该存储介质的存储的操作指令被计算机执行时,可以执行上述实施例中的方法,实现上述MDC或VBS或OSD在数据存储系统中的功能。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的保护范围。

Claims (29)

  1. 一种数据存储系统,所述系统包括管理节点和多个存储节点,其中所述系统中部署了多个OSD,所述多个OSD位于所述多个存储节点上;所述多个OSD包括第一OSD和第二OSD,其特征在于,
    所述第一OSD用于接收第一写数据请求,所述写数据请求中包括待写入数据块以及相应的待写入的分区,根据分区分配视图确定所述待写入的分区的备OSD为所述第二OSD,将所述第一写数据请求复制给所述第二OSD,获得所述数据块复制到所述第二OSD所耗时长之后向所述管理节点发送第一报告消息,所述第一报告消息中包括所述第一OSD的标识、所述第二OSD的标识以及所述第二OSD的健康状态信息;
    所述管理节点用于接收所述第一报告消息,根据所述第一报告消息更新所述管理节点上保存的OSD健康状态记录,根据所述OSD健康状态记录确定所述第二OSD为亚健康OSD,所述OSD健康状态记录包括所述多个OSD中的其他OSD上报的所述第二OSD的健康状态信息。
  2. 如权利要求1所述的系统,所述系统还包括多个计算节点,其特征在于,
    所述计算节点用于接收第二写数据请求,将所述第二写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个数据块要写入的分区,根据I/O视图确定出处理所述待写入数据块的主OSD包括所述第一OSD,向所述第一OSD发送所述第一写数据请求,获得发送所述第一写数据请求到所述第一OSD所耗时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息;
    所述管理节点还用于根据所述第二报告消息更新所述管理节点上记录的OSD健康状态记录,根据所述OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第二OSD的健康状态信息。
  3. 如权利要求1所述的系统,所述系统还包括多个计算节点,其特征在于,
    所述计算节点用于接收第一读数据请求,确定所述第一读数据请求所要读取的每个待读取数据块所在的分区,根据I/O视图确定出处理所述待读取数据块的主OSD包括所述第一OSD,向所述第一OSD发送第二读数据请求,获得发送所述第二读数据请求到所述第一OSD所耗时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息,所述第二读数据请求中包括待读取数据块所在的分区;
    所述管理节点还用于接收所述第二报告消息,根据所述第二报告消息更新所述管理节点上保存的OSD健康状态记录,根据OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第一OSD的健康状态信息。
  4. 如权利要求1所述的系统,其特征在于,
    所述第一OSD还用于将所述待写入所述第一OSD所管理的分区的数据块写入到相应的待写入的分区对应的持久化存储资源中,获得将所述数据块写入到所述持久化存储资源所耗时长后,向所述管理节点发送第三报告消息,所述第三报告消息包括所述第一OSD的标识以及所述第一OSD的健康状态信息;
    所述管理节点用于根据所述第三报告信息确定所述第一OSD为亚健康OSD。
  5. 如权利要求2-4任意一项所述的系统,其特征在于,
    所述管理节点用于从所述多个OSD中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O视图,将所述更新后的I/O视图发送给所述多个计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD,所述切换OSD不同于所述第一OSD及所述第二OSD;
    所述计算节点用于接收第三写数据请求,将所述第三写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个待写入数据块要写入的分区,根据I/O视图确定出处理所述至少一个待写入数据块的主OSD包括所述第三OSD,向所述第三OSD发送第四写数据请求,所述第四写数据请求中包括待写入所述第三OSD所管理的分区的数据块以及相应的待写入的分区,其中,所述第三OSD为所述多个OSD中一个,且不同于所述切换OSD;
    所述第三OSD接收所述第三写数据请求后,根据更新后的分区分配视图将所述第三写数据请求复制给所述切换OSD,其中,所述更新后的分区分配视图中所述第三写数据请求中包括的待写入的分区对应的备OSD为所述亚健康节点;
    所述切换OSD还用于基于更新后的分区分配视图将接收到的第三写数据请求同步给所述亚健康OSD。
  6. 如权利要求4所述的系统,其特征在于,
    所述切换OSD还用于在获得将所述第三写数据请求同步给所述亚健康OSD所耗时长后向所述管理节点发送第三报告消息,所述第三报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第三健康状态信息;
    所述管理节点还用于根据所述第三报告消息更新所述管理节点上记录的OSD健康状态记录,根据更新后的OSD健康状态记录确定所述亚健康OSD恢复正常,所述OSD健康状态记录包括其他OSD上报的所述亚健康OSD的健康状态信息。
  7. 如权利要求1-4所述的系统,其特征在于,
    所述第一OSD用于接收所述第二OSD返回的复制响应,通过比较发送所述第一写数据请求的时间和接收到所述复制响应的时间的差值获得所述数据块复制到所述第二OSD所耗时长。
  8. 如权利要求2所述的系统,其特征在于,
    所述计算节点还用于接收所述第一OSD返回的第一写数据响应,通过比较发送所述第一写数据请求的时间和接收到所述第一写数据响应的时间的差值获得所述发送所述第一写数据请求到所述第一OSD所耗时长。
  9. 如权利要求3所述的系统,其特征在于,
    所述计算节点还用于接收所述第一OSD返回的针对所述第二读数据请求的读数据响应,通过比较发送所述第二读数据请求和接收所述读数据响应的时间的差值获得发送所述第二读数据请求到所述第一OSD所耗时长。
  10. 如权利要求1-4任意一项所述的系统,其特征在于,所述健康状态信息包括OSD处于亚健康的指示信息。
  11. 一种识别亚健康OSD的方法,所述方法应用于数据存储系统中,所述数据存 储系统包括管理节点、多个存储节点,其中所述系统中部署了多个OSD,所述多个OSD位于所述多个存储节点上,所述多个OSD包括第一OSD和第二OSD,其特征在于,
    所述第一OSD接收第一写数据请求,所述写数据请求中包括待写入所述第一OSD所管理的分区的数据块以及相应的待写入的分区,根据分区分配视图确定所述待写入的分区的备OSD为所述第二OSD,将所述写数据请求复制给所述第二OSD,获得所述数据复制到所述第二OSD所耗时长之后向所述管理节点发送第一报告消息,所述第一报告消息中包括所述第一OSD的标识、所述第二OSD的标识以及所述第二OSD的健康状态信息;
    所述管理节点接收所述第一报告消息,根据所述第一报告消息更新所述管理节点上保存的OSD健康状态记录,根据所述OSD健康状态记录确定所述第二OSD为亚健康OSD,所述OSD健康状态记录包括所述其他OSD上报的所述第二OSD的健康状态信息。
  12. 如权利要求11所述的方法,所述的系统还包括多个计算节点,其特征在于,所述第一OSD接收第一写数据请求之前包括:
    所述计算节点接收第二写数据请求,将所述第二写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个数据块要写入的分区,根据I/O视图确定出处理所述待写入数据块的主OSD包括所述第一OSD,向所述第一OSD发送所述第一写数据请求;
    所述方法还包括:
    获得发送所述第一写数据请求到所述第一OSD所耗时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息;
    所述管理节点根据所述第二报告消息更新所述管理节点上记录的OSD健康状态记录,根据所述OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第二OSD的健康状态信息。
  13. 如权利要求11所述的方法,所述系统还包括多个计算节点,其特征在于,所述方法还包括:
    所述计算节点接收第一读数据请求,确定所述第一读数据请求所要读取的每个待读取数据块所在的分区,根据I/O视图确定出处理所述待读取数据块的主OSD包括所述第一OSD,向所述第一OSD发送第二读数据请求,获得发送所述第二读数据请求到所述第一OSD的时长之后向所述管理节点发送第二报告消息,所述第二报告消息中包括所述计算节点的标识、所述第一OSD的标识以及所述第一OSD的健康状态信息,所述第二读数据请求中包括待读取数据块所在的分区;
    所述管理节点接收所述第二报告消息,根据所述第二报告消息更新所述管理节点上保存的OSD健康状态记录,根据OSD健康状态记录确定所述第一OSD为亚健康OSD,所述OSD健康状态记录包括其他OSD上报的所述第一OSD的健康状态信息。
  14. 如权利要求11所述的方法,其特征在于,所述第一OSD接收所述第一写数据请求之后,所述方法还包括:
    所述第一OSD将所述待写入所述第一OSD所管理的分区的数据块写入到所述OSD所管理的分区对应的持久化存储资源中,获得写入到所述持久化存储资源所耗时长后, 向所述管理节点发送第三报告消息,所述第三报告消息包括所述第一OSD的标识以及所述第一OSD的健康状态信息;
    所述管理节点根据所述第三报告信息确定所述第一OSD为亚健康OSD。
  15. 如权利要求11-14任意一项所述的方法,其特征在于,所述管理节点确定亚健康OSD之后,所述方法还包括:
    所述管理节点从所述多个OSD中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O视图,将所述更新后的I/O视图发送给所述多个计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD,所述切换OSD不同于所述第一OSD及所述第二OSD;
    所述计算节点接收第三写数据请求,将所述第三写数据请求中包括的待写入数据分成至少一个待写入数据块,确定所述至少一个数据块中每个待写入数据块要写入的分区,根据I/O视图确定出处理所述至少一个待写入数据块的主OSD包括所述第三OSD,向所述第三OSD发送第四写数据请求,所述第四写数据请求中包括待写入所述第三OSD所管理的分区的数据块以及相应的待写入的分区,其中,所述第三OSD为所述多个OSD中一个,且不同于所述切换OSD;
    所述第三OSD接收所述第三写数据请求后,根据更新后的分区分配视图将所述第三写数据请求复制给所述切换OSD,其中,所述更新后的分区分配视图中所述第三写数据请求中包括的待写入的分区对应的备OSD为所述亚健康节点;
    所述切换OSD基于更新后的分区分配视图将接收到的第三写数据请求同步给所述亚健康OSD。
  16. 如权利要求15所述的方法,其特征在于,所述第三OSD接收所述第三写数据请求之后还包括:
    所述切换OSD还用于在获得将所述第三写数据请求同步给所述亚健康OSD所耗时长后向所述管理节点发送第三报告消息,所述第三报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第三健康状态信息;
    所述管理节点根据所述第三报告消息更新所述管理节点上记录的OSD健康状态记录,根据所述OSD健康状态记录确定所述亚健康OSD恢复正常,所述OSD健康状态记录包括其他OSD上报的所述亚健康OSD的健康状态信息。
  17. 如权利要求11-14所述的方法,其特征在于,将所述写数据请求复制给所述第二OSD之后,所述方法还包括:
    所述第一OSD接收所述第二OSD返回的复制响应,通过比较发送所述第一写数据请求的时间和接收到所述复制响应的时间的差值获得所述数据复制到所述第二OSD所耗时长。
  18. 如权利要求12所述的方法,其特征在于,向所述第一OSD发送所述第一写数据请求之后,所述方法还包括:
    所述计算节点接收所述第一OSD返回的第一写数据响应,通过比较发送所述第一写数据请求的时间和接收到所述第一写数据响应的时间的差值获得所述发送所述第一写数据请求到所述第一OSD所耗时长。
  19. 如权利要求13所述的方法,其特征在于,向所述第一OSD发送第二读数据请求之后,所述方法包括:
    所述计算节点接收所述第一OSD返回的针对所述第二读数据请求的读数据响应,通过比较发送所述第二读数据请求和接收所述读数据响应的时间差值获得发送所述第二读数据请求到所述第一OSD所耗时长。
  20. 如权利要求11-14任意一项所述的方法,其特征在于,所述健康状态信息包括OSD处于亚健康的指示信息。
  21. 一种对象存储设备(Object Storage Device,OSD),其特征在于,所述OSD包括写数据模块、复制模块和上报模块,其中,
    所述复制模块用于接收写数据请求,所述写数据请求中包括待写入数据块及所述待写入数据要写入的分区,将所述写数据请求复制给写请求中包括的待写入的分区对应的备OSD,获得所述待写入数据块复制到所述备OSD所耗时长,并且将所述写数据请求发送给所述写数据模块;
    写数据模块,用于将写数据请求中包括的待写入数据写入到对应的待写入分区对应的持久化存储资源中;
    上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述OSD的标识、所述备OSD的标识以及所述备OSD的健康状态信息。
  22. 如权利要求21所述的OSD,其特征在于,所述写数据模块还用于获得将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长;
    所述上报模块,还用于向所述管理节点发送第二报告消息,所述第二报告消息中包括所述OSD的标识以及所述OSD的健康状态信息。
  23. 如权利要求21或22所述的OSD,其特征在于,所述OSD还包括判断模块,
    所述判断模块,用于确定所述待写入数据块复制到所述备OSD所耗时长超过阈值时或者确定将所述待写入数据块写入到所述待写入分区对应的持久化存储资源中所耗时长超过所述阈值时,将所述备OSD的亚健康状态信息发送给所述上报模块。
  24. 如权利要求21所述的OSD,其特征在于,
    所述复制模块还用于接收所述备OSD返回的复制响应,通过将所述写数据请求复制给所述备OSD的时间和接收到所述复制响应的时间的差值获得所述待写入数据块复制到所述备OSD所耗时长。
  25. 一种虚拟块系统(Virtual Block System,VBS),其特征在于,所述VBS包括访问接口、业务模块、客户端、上报模块,其中,
    所述访问接口,用于接收第一写数据请求,所述第一写数据请求中包括待写入数据、所述待写入数据的写入位置、所述待写入数据的数据长度和所述待写入数据的块设备信息;
    所述业务模块,用于将所述第一写数据请求中包含的数据分割成数据块,并根据每个数据块的写入位置、偏移和所述块设备信息计算与所述每个数据块要写入的分区;
    所述客户端,用于根据I/O视图,找到与所述分区对应的主OSD,向所述主OSD发送第二写数据请求,获得所述第二写数据请求发送到所述主OSD所耗时长,所述第二写数据请求中包括待写入的数据块以及该待写入数据块要写入的分区;
    所述上报模块,用于向所述管理节点发送第一报告消息,所述第一报告消息中包括所述VBS的标识、所述主OSD的标识以及所述主OSD的健康状态信息。
  26. 如权利要求25所述的VBS,其特征在于,
    所述客户端还用于接收所述主OSD返回的写数据响应,通过比较发送所述第二写数据请求的时间和接收到所述写数据响应的时间的差值获得所述第二写数据请求发送到所述主OSD所耗时长。
  27. 如权利要求25或26所述VBS,其特征在于所述VBS还包括判断模块,
    所述判断模块,用于确定所述第二写数据请求发送到所述主OSD所耗时长超过阈值时,将所述备OSD的亚健康状态信息发送给所述上报模块。
  28. 一种元数据控制器(Meta Data Controller,MDC),其特征在于,所述MDC包括管理模块和接收模块,
    所述接收模块用于接收数据存储系统中的计算节点或者OSD上报的报告消息,所述报告消息中包括上报者的标识,被上报OSD的标识以及被上报者的健康状态信息;
    所述管理模块用于根据接收到的报告消息更新保存的OSD健康状态记录,根据所述更新后的OSD健康状态记录确定所述被上报OSD中的一个或多个OSD为亚健康OSD。
  29. 如权利要求28所述的MDC,其特征在于,
    所述管理模块用于从所述数据存储系统中确定所述亚健康OSD的切换OSD,建立所述亚健康OSD与所述切换OSD之间的对应关系,根据所述亚健康OSD与所述切换OSD之间的对应关系更新分区分配视图,所述更新后的分区分配视图包括更新后的I/O视图,将所述更新后的I/O视图发送给计算节点、将所述更新后的分区分配视图发送给所述切换OSD以及与所述亚健康OSD有主备关系的OSD;
    所述接收模块还用于接收所述切换OSD发送的所述亚健康OSD的第一报告消息,所述第一报告消息包括所述切换OSD的标识、所述亚健康OSD的标识以及所述亚健康OSD的第一健康状态信息,所述亚健康OSD的第一健康状态信息是所述切换OSD基于更新后的分区分配视图将接收到的写数据请求同步给所述亚健康OSD之后,根据所述亚健康OSD返回的写数据响应发送的;
    所述管理模块根据接收到所述亚健康OSD的第一报告消息更新保存的OSD健康状态记录,根据更新后的所述OSD健康状态记录判断确定所述亚健康OSD恢复正常。
PCT/CN2017/116951 2017-12-18 2017-12-18 识别osd亚健康的方法、装置和数据存储系统 WO2019119212A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17935811.4A EP3620905B1 (en) 2017-12-18 2017-12-18 Method and device for identifying osd sub-health, and data storage system
PCT/CN2017/116951 WO2019119212A1 (zh) 2017-12-18 2017-12-18 Method and apparatus for identifying OSD sub-health, and data storage system
CN201780003315.3A CN108235751B (zh) 2017-12-18 2017-12-18 Method and apparatus for identifying sub-health of an object storage device, and data storage system
US16/903,762 US11320991B2 (en) 2017-12-18 2020-06-17 Identifying sub-health object storage devices in a data storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/116951 WO2019119212A1 (zh) 2017-12-18 2017-12-18 Method and apparatus for identifying OSD sub-health, and data storage system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/903,762 Continuation US11320991B2 (en) 2017-12-18 2020-06-17 Identifying sub-health object storage devices in a data storage system

Publications (1)

Publication Number Publication Date
WO2019119212A1 (zh)

Family

ID=62645533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/116951 WO2019119212A1 (zh) 2017-12-18 2017-12-18 Method and apparatus for identifying OSD sub-health, and data storage system

Country Status (4)

Country Link
US (1) US11320991B2 (zh)
EP (1) EP3620905B1 (zh)
CN (1) CN108235751B (zh)
WO (1) WO2019119212A1 (zh)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120556B (zh) * 2018-08-21 2019-07-09 广州市品高软件股份有限公司 Method and system for a cloud host to access an object storage server
CN109189738A (zh) * 2018-09-18 2019-01-11 郑州云海信息技术有限公司 Method, apparatus and system for selecting a primary OSD in a distributed file system
CN110955382A (zh) * 2018-09-26 2020-04-03 华为技术有限公司 Method and apparatus for writing data in a distributed system
CN109451029A (zh) * 2018-11-16 2019-03-08 浪潮电子信息产业股份有限公司 Data caching method, apparatus, device and medium for distributed object storage
CN109614276B (zh) * 2018-11-28 2021-09-21 平安科技(深圳)有限公司 Fault handling method and apparatus, distributed storage system, and storage medium
CN109656895B (zh) * 2018-11-28 2024-03-12 平安科技(深圳)有限公司 Distributed storage system, data writing method, apparatus, and storage medium
CN110049091A (zh) * 2019-01-10 2019-07-23 阿里巴巴集团控股有限公司 Data storage method and apparatus, electronic device, and storage medium
WO2020181478A1 (zh) * 2019-03-12 2020-09-17 华为技术有限公司 Method and apparatus for managing sub-health nodes
CN111404980B (zh) * 2019-09-29 2023-04-18 杭州海康威视系统技术有限公司 Data storage method and object storage system
CN111064613B (zh) * 2019-12-13 2022-03-22 新华三大数据技术有限公司 Network fault detection method and apparatus
CN111142801B (zh) * 2019-12-26 2021-05-04 星辰天合(北京)数据科技有限公司 Method and apparatus for detecting network sub-health in a distributed storage system
CN111240899B (zh) * 2020-01-10 2023-07-25 北京百度网讯科技有限公司 State machine replication method, apparatus, system, and storage medium
CN111510338B (zh) * 2020-03-09 2022-04-26 苏州浪潮智能科技有限公司 Sub-health test method and apparatus for a distributed block storage network, and storage medium
CN112000500A (zh) * 2020-07-29 2020-11-27 新华三大数据技术有限公司 Communication fault determining method, handling method, and storage device
CN112363980B (zh) * 2020-11-03 2024-07-02 网宿科技股份有限公司 Data processing method and apparatus for a distributed system
CN112306815B (zh) * 2020-11-16 2023-07-25 新华三大数据技术有限公司 Method, apparatus, device and medium for monitoring I/O information between primary and secondary OSDs in Ceph
CN112306781B (zh) * 2020-11-20 2022-08-19 新华三大数据技术有限公司 Thread fault handling method, apparatus, medium, and device
JP2023031907A (ja) * 2021-08-26 2023-03-09 キヤノン株式会社 Information processing apparatus and method for controlling the same
CN117891615B (zh) * 2024-03-15 2024-06-18 杭州涂鸦信息技术有限公司 Method and apparatus for determining a device state

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7461302B2 (en) * 2004-08-13 2008-12-02 Panasas, Inc. System and method for I/O error recovery
CN101715575A (zh) * 2006-12-06 2010-05-26 弗森多系统公司(dba弗森-艾奥) Apparatus, system, and method for managing data using a data pipeline
SE533007C2 (sv) * 2008-10-24 2010-06-08 Ilt Productions Ab Distributed data storage
CN102368222A (zh) * 2011-10-25 2012-03-07 曙光信息产业(北京)有限公司 Method for online repair of a multi-replica storage system
CN102385537B (zh) * 2011-10-25 2014-12-03 曙光信息产业(北京)有限公司 Disk fault handling method for a multi-replica storage system
CN103503414B (zh) * 2012-12-31 2016-03-09 华为技术有限公司 Cluster system with converged computing and storage
US9304815B1 (en) * 2013-06-13 2016-04-05 Amazon Technologies, Inc. Dynamic replica failure detection and healing
SG11201703220SA (en) * 2014-11-06 2017-05-30 Huawei Tech Co Ltd Distributed storage and replication system and method
US9575828B2 (en) * 2015-07-08 2017-02-21 Cisco Technology, Inc. Correctly identifying potential anomalies in a distributed storage system
CN106406758B (zh) * 2016-09-05 2019-06-18 华为技术有限公司 Data processing method based on a distributed storage system, and storage device
US11232000B1 (en) * 2017-02-24 2022-01-25 Amazon Technologies, Inc. Moving database partitions from replica nodes
CN106980468A (zh) * 2017-03-03 2017-07-25 杭州宏杉科技股份有限公司 Method and apparatus for triggering RAID array rebuilding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706805A (zh) * 2009-10-30 2010-05-12 中国科学院计算技术研究所 Object storage method and system thereof
CN102023816A (zh) * 2010-11-04 2011-04-20 天津曙光计算机产业有限公司 Object placement policy and access method for an object storage system
CN103797770A (zh) * 2012-12-31 2014-05-14 华为技术有限公司 Method and system for sharing storage resources

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3620905A4 *

Also Published As

Publication number Publication date
CN108235751B (zh) 2020-04-14
US20200310660A1 (en) 2020-10-01
US11320991B2 (en) 2022-05-03
CN108235751A (zh) 2018-06-29
EP3620905B1 (en) 2022-10-19
EP3620905A4 (en) 2020-07-08
EP3620905A1 (en) 2020-03-11

Similar Documents

Publication Publication Date Title
WO2019119212A1 (zh) Method and apparatus for identifying OSD sub-health, and data storage system
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
EP3518110B1 (en) Designation of a standby node
US20210004355A1 (en) Distributed storage system, distributed storage system control method, and storage medium
CN108509153B (zh) Osd选择方法、数据写入和读取方法、监控器和服务器集群
US11218418B2 (en) Scalable leadership election in a multi-processing computing environment
CN102355369B (zh) 虚拟化集群系统及其处理方法和设备
WO2016070375A1 (zh) 一种分布式存储复制系统和方法
CN109005045B (zh) 主备服务系统及主节点故障恢复方法
US10831612B2 (en) Primary node-standby node data transmission method, control node, and database system
US20100023564A1 (en) Synchronous replication for fault tolerance
EP3528431A1 (en) Paxos protocol-based methods and apparatuses for online capacity expansion and reduction of distributed consistency system
CN111049928B (zh) 数据同步方法、系统、电子设备及计算机可读存储介质
US9733835B2 (en) Data storage method and storage server
TWI617924B (zh) 記憶體資料分版技術
TW201824030A (zh) 主備資料庫的管理方法、系統及其設備
WO2024169612A1 (zh) 一种io处理方法、装置、设备及存储介质
WO2015196692A1 (zh) 一种云计算系统以及云计算系统的处理方法和装置
JP6720250B2 (ja) ストレージシステム及び構成情報制御方法
CN116561217A (zh) 元数据管理系统及方法
CN110928943B (zh) 一种分布式数据库及数据写入方法
US10809939B2 (en) Disk synchronization
CN111400098A (zh) 一种副本管理方法、装置、电子设备及存储介质
CN115328880B (zh) 分布式文件在线恢复方法、系统、计算机设备及存储介质
CN117555493B (zh) 数据处理方法、系统、装置、存储介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17935811

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017935811

Country of ref document: EP

Effective date: 20191204

NENP Non-entry into the national phase

Ref country code: DE