WO2024148854A1 - Database fault handling method and apparatus based on monitoring service, and distributed cluster - Google Patents

Database fault handling method and apparatus based on monitoring service, and distributed cluster

Info

Publication number
WO2024148854A1
Authority
WO
WIPO (PCT)
Prior art keywords
database
monitoring
fault
node
scenario
Prior art date
Application number
PCT/CN2023/121334
Other languages
English (en)
French (fr)
Inventor
赵文达
Original Assignee
苏州元脑智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024148854A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/30: Monitoring
    • G06F 11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of distributed storage technology, and in particular to a database fault handling method, device and distributed cluster based on monitoring service.
  • In a distributed storage system, the Monitor service is deployed on different physical servers. It is responsible for monitoring, maintaining, and querying the database operation status of other services in the cluster, such as the Object-based Storage Device (OSD), and for reporting alarms when abnormalities occur. It is one of the most important and critical components of the underlying layer of the cluster. Since the Monitor process needs to monitor and maintain the status of other services in the cluster, the Monitor database (DataBase, DB) needs to save a large amount of information related to those services. This information is saved in the form of various structures, such as OSDmap, PGmap and other cluster information. By maintaining, querying and updating the information saved in the DB, the Monitor can monitor the status of each service in the cluster.
  • the present application provides a database fault handling method, device and distributed cluster based on monitoring service, which are used to solve the defect of low fault handling efficiency in the prior art.
  • the present application provides a database fault handling method based on a monitoring service, including: determining the database fault type based on the alarm message fed back by the management node;
  • when it is determined that the database fault type is damage to the monitoring service database, the database damage scenario is determined based on the cluster status and the database operation status of each monitoring node;
  • a repair strategy library is used to match a first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy;
  • the alarm message is a notification message generated by the management node when it determines that the database operation status of the monitoring service database, as monitored by the monitoring node, matches a preset database fault status, and carries the fault type information corresponding to the database fault status; the cluster status is obtained by evaluating the monitoring services deployed in the distributed cluster through voting decisions among the monitoring nodes.
  • According to the database fault handling method based on a monitoring service provided by the present application, after determining the database fault type, the method also includes:
  • when it is determined that the database fault type is monitoring service database overload, the database overload scenario is determined based on the disk space status of the target monitoring node;
  • the repair strategy library is used to match a second repair strategy corresponding to the database overload scenario, so that the monitoring node can repair the database overload scenario according to the second repair strategy;
  • the target monitoring node is a monitoring node whose database operation status is abnormal.
  • the database damage scenario matches a first target scenario identification code;
  • the first target scenario identification code is unique identification information used to distinguish the first repair strategy in the repair strategy library.
  • the database overload scenario matches a second target scenario identification code;
  • the second target scenario identification code is unique identification information used to distinguish the second repair strategy in the repair strategy library.
  • determining the database damage scenario based on the cluster status and the database operation status of each monitoring node includes: when it is determined that the database operation status of all monitoring nodes is abnormal, setting the first target scenario identification code to a first scenario identification code;
  • the first repair strategy corresponding to the first scenario identification code is:
  • the monitoring service database of all monitoring nodes is rebuilt through the cluster information stored in the database of the object storage device.
  • determining the database damage scenario based on the cluster state and the database operation state of each monitoring node further includes: when it is determined that the cluster state is ERROR and the database operation status of at least one monitoring node is normal, setting the first target scenario identification code to a second scenario identification code;
  • the first repair strategy corresponding to the second scenario identification code is:
  • the monitoring service database of the monitoring node whose database operation status is abnormal is replaced with a copy of the monitoring service database of a monitoring node whose database operation status is normal.
  • determining the database damage scenario based on the cluster state and the database operation state of each monitoring node further includes: when it is determined that the cluster state is WARN and there are at least three monitoring nodes whose database operation status is normal, setting the first target scenario identification code to a third scenario identification code;
  • the first repair strategy corresponding to the third scenario identification code is:
  • the monitoring service database of the monitoring node whose database operation status is abnormal is replaced with a copy of the monitoring service database of a monitoring node whose database operation status is normal.
  • determining the database damage scenario based on the cluster state and the database operation state of each monitoring node further includes: when it is determined that the cluster state is WARN and there are two or fewer monitoring nodes whose database operation status is normal, setting the first target scenario identification code to a fourth scenario identification code;
  • the first repair strategy corresponding to the fourth scenario identification code is:
  • the monitoring service of the faulty monitoring node is scaled down and then expanded again, so as to restore its monitoring service database.
  • determining a database overload scenario based on the disk space status of a target monitoring node includes: when it is determined that the monitoring service database of the target monitoring node is deployed in an independent partition pre-divided for it, setting the second target scenario identification code to a fifth scenario identification code;
  • the second repair strategy corresponding to the fifth scenario identification code is:
  • the monitoring service database of the target monitoring node is compressed directly within the independent partition where it is deployed.
  • determining a database overload scenario based on the disk space status of a target monitoring node further includes: when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-allocated for it, and the disk space of the target monitoring node meets the migration condition, setting the second target scenario identification code to a sixth scenario identification code;
  • the second repair strategy corresponding to the sixth scenario identification code is:
  • the monitoring service database of the target monitoring node is migrated to the fast disk partition and then compressed;
  • the migration condition is that the disk space of the target monitoring node has a fast disk partition, and the disk space capacity of the target monitoring node is greater than the capacity of the monitoring service database.
  • determining a database overload scenario based on the disk space status of a target monitoring node further includes: when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-allocated for it, and the disk space of the target monitoring node does not meet the migration condition, setting the second target scenario identification code to a seventh scenario identification code;
  • the second repair strategy corresponding to the seventh scenario identification code is:
  • the monitoring service database of the target monitoring node is compressed in its current deployment location.
  • the present application also provides a database fault handling device based on monitoring service, comprising:
  • a fault detection module is used to determine the type of database fault based on the alarm message fed back by the management node;
  • a first fault identification module is used to determine the database damage scenario based on the cluster state and the database operation state of each monitoring node when it is determined that the database fault type is the damage of the monitoring service database;
  • a first fault repair module used to match a first repair strategy corresponding to a database damage scenario using a repair strategy library, so that the monitoring node repairs the database damage scenario according to the first repair strategy;
  • the alarm message is a notification message generated by the management node when it determines that the database operation status monitored by the monitoring node of the monitoring service database matches the preset database fault status, and carries the fault type information corresponding to the database fault status; the cluster status is obtained by evaluating the monitoring service deployed by the distributed storage system through voting decisions among the monitoring nodes.
  • a database fault processing device based on monitoring service also includes:
  • a second fault identification module is used to determine the database overload scenario based on the disk space status of the target monitoring node when determining that the database fault type is the monitoring service database overload;
  • a second fault repair module used to match a second repair strategy corresponding to the database overload scenario using the repair strategy library, so that the monitoring node repairs the database overload scenario according to the second repair strategy;
  • the target monitoring node is the cluster computing node where the abnormal monitoring node is deployed.
  • the present application also provides a distributed cluster, comprising at least n monitoring nodes deploying monitoring services on cluster computing nodes, and at least 1 management node deploying software management services on cluster computing nodes, each monitoring node is used to implement any of the above database fault handling methods based on monitoring services;
  • the monitoring node is used to monitor the monitoring service database deployed by itself and feed back the acquired database operation status to the management node;
  • the management node is used to match the database operation status with the preset database fault status, generate an alarm message carrying the fault type information corresponding to the database fault status, and send the alarm message to the monitoring node whose database operation status is abnormal;
  • n is an odd number greater than 1, and the total number of cluster computing nodes is greater than the total number of monitoring nodes.
  • the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the program, the database fault handling method based on monitoring services described above is implemented.
  • the present application also provides a non-transitory computer-readable storage medium having a computer program stored thereon.
  • When executed by a processor, the computer program implements any of the above-mentioned database fault handling methods based on monitoring services.
  • the database fault handling method, device and distributed cluster based on monitoring service identify the type of database fault based on the alarm message fed back by the management node.
  • the decision is made based on the cluster status and the database operation status of each monitoring node to determine the degree of damage to the monitoring service database of each monitoring node in the cluster, and thereby locate the fault to a specific database damage scenario, and then select the corresponding first repair strategy according to the database damage scenario to realize the complete process of automated DB fault detection, identification and repair.
  • FIG. 1 is a flow chart of a database fault handling method based on a monitoring service provided by the present application;
  • FIG. 2 is the first schematic diagram of a fault repair process in the database fault handling method based on a monitoring service provided by the present application;
  • FIG. 3 is the second schematic diagram of a fault repair process in the database fault handling method based on a monitoring service provided by the present application;
  • FIG. 4 is the first partial flow diagram of the database fault handling method based on a monitoring service provided by the present application;
  • FIG. 5 is the second partial flow diagram of the database fault handling method based on a monitoring service provided by the present application;
  • FIG. 6 is a schematic diagram of the structure of a database fault handling device based on a monitoring service provided by the present application;
  • FIG. 7 is a schematic diagram of the structure of a distributed cluster provided by the present application;
  • FIG. 8 is a schematic diagram of the structure of an electronic device provided by the present application.
  • first, second, etc. in this application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by "first”, “second”, etc. are generally of the same type, and the number of objects is not limited.
  • the first object can be one or more.
  • Fig. 1 is a flow chart of a database fault handling method based on monitoring service provided by the present application.
  • the database fault handling method based on monitoring service provided by the embodiment of the present application includes: Step 101, determining the database fault type based on the alarm message fed back by the management node.
  • the alarm message is a notification message carrying fault type information corresponding to the database fault state, generated by the management node when it determines that the database operation state monitored by the monitoring node for the monitoring service database matches the preset database fault state.
  • the executor of the database fault handling method based on monitoring service is a database fault handling device based on monitoring service, which can be integrated into the processor of the physical server in the distributed cluster in the form of electronic chip, central processing unit (CPU), microcontroller unit (MCU), field programmable gate array (FPGA), etc.
  • Before step 101, it is necessary to add, on the monitoring node (monitor) in the distributed cluster and on the management software side of the management node, an alarm item for the fault status of the monitoring service database (monitor DataBase, monitor DB).
  • the monitoring node monitors its own monitor DB in real time according to the pre-set alarm items, and uploads the collected database operation status to the management node, which is processed by the management software in the management node.
  • the management software in the management node compares and matches the database operation status uploaded by the monitoring node with the database fault statuses pre-configured in the alarm items. If no match is found, it means that the monitor DB of the monitoring node is running normally and no alarm item is triggered; otherwise, it means that the monitor DB of the monitoring node is abnormal.
  • In that case, a notification message carrying the fault type information characterizing the database fault status is sent to the monitoring node, so as to report the alarm for the monitor DB.
  • the alarm items mainly include whether the monitor DB is too large or whether the monitor DB is damaged.
  • In step 101, the database fault handling device based on the monitoring service, upon receiving the alarm message fed back by the management node when an alarm item is triggered, outputs the fault type information carried in the alarm message as the database fault type.
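  • As a hedged illustration of the alarm flow just described (the status values, class names and helper functions below are assumptions for the sketch, not terms from this publication), the matching on the management-node side and the fault-type extraction of step 101 could look like this:

```python
# Minimal sketch (not the patented implementation) of the alarm flow: the
# management node matches a reported monitor DB status against preset fault
# statuses, and the fault handling device extracts the fault type from the
# resulting alarm message. All names are illustrative assumptions.
from dataclasses import dataclass

PRESET_FAULT_STATUSES = {"DB_DAMAGED", "DB_OVERSIZED"}  # preset alarm items

@dataclass
class AlarmMessage:
    node_id: str
    fault_type: str  # fault type information carried by the notification

def check_alarm_item(node_id: str, db_operation_status: str) -> AlarmMessage | None:
    """Management-node side: trigger an alarm when the reported DB operation
    status matches one of the preset database fault statuses."""
    if db_operation_status in PRESET_FAULT_STATUSES:
        return AlarmMessage(node_id=node_id, fault_type=db_operation_status)
    return None  # DB is running normally, no alarm item triggered

def determine_fault_type(alarm: AlarmMessage) -> str:
    """Step 101 on the fault handling device: output the fault type carried
    in the alarm message as the database fault type."""
    return alarm.fault_type

# Example: a monitor node reports that its monitor DB cannot be opened.
alarm = check_alarm_item("mon-2", "DB_DAMAGED")
if alarm is not None:
    print(determine_fault_type(alarm))  # -> "DB_DAMAGED"
```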
  • Step 102: when it is determined that the database failure type is damage to the monitoring service database, a database damage scenario is determined based on the cluster state and the database operation state of each monitoring node.
  • the cluster status is obtained by evaluating the monitoring services deployed in the distributed cluster through voting decisions among the monitoring nodes.
  • the Paxos protocol can ensure that data still maintains consistency in the distributed system when abnormal situations such as node failures occur.
  • the Paxos protocol is a voting method. Each node casts its own vote. When the number of votes is greater than half of the total number of nodes, it means that a consensus has been reached in the distributed system, and this proposal will take effect. When the number of votes is equal to or less than half of the total number of nodes, it means that no consensus has been reached, and this proposal will not take effect.
  • an odd number of monitoring nodes must be deployed in a distributed cluster.
  • For example, if the monitor service is deployed on three nodes, any one node is allowed to fail, and the remaining two nodes can still reach a consensus through Paxos, so the cluster provides normal services, that is, the cluster status is normal. If the monitor service is deployed on an even number of nodes, such as 2, then when any one node fails, the vote of the remaining node cannot exceed half, and the service is unavailable.
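  • The majority rule behind this voting can be captured in a few lines. The following is a minimal sketch under the assumptions stated in the comments, not the protocol implementation itself:

```python
# Hedged sketch of the majority rule behind the Paxos-style voting described
# above: a proposal (or the cluster's monitor quorum) is only valid when more
# than half of all monitor nodes agree. The function name is illustrative.
def has_quorum(votes_in_favor: int, total_monitors: int) -> bool:
    """More than half of the nodes must vote in favor for consensus."""
    return votes_in_favor > total_monitors // 2

# 3 monitors deployed, 1 fails: the remaining 2 still form a majority.
print(has_quorum(2, 3))  # True  -> cluster status normal
# 2 monitors deployed, 1 fails: the single remaining vote is not a majority.
print(has_quorum(1, 2))  # False -> monitor service unavailable
```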
  • In step 102, if the database fault handling device based on the monitoring service determines that the alarm content corresponding to the database fault type obtained in step 101 is damage to the monitoring service database, the cluster status of the distributed cluster and the database operation status of each monitoring node are used to judge the degree of damage of the resolution committee (quorum) composed of all monitoring nodes in the current cluster, and to further identify which specific database damage scenario the fault belongs to under the fault classification.
  • Step 103: use the repair strategy library to match a first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy.
  • Before step 103, it is necessary to classify the possible damage and failure conditions of the monitor DB, formulate corresponding solutions for each type of damage and failure in advance as first repair strategies, store them in the repair strategy library, and maintain and update the library in real time according to the design requirements of the distributed cluster or changes in fault types.
  • the database fault handling device based on the monitoring service matches the database damage scenario identified in step 102 with the repair strategy library. If the match is successful, it means that the solution strategy corresponding to the database damage scenario is pre-deployed in the repair strategy library. Then, the first repair strategy corresponding to the scenario is adopted to repair the current database damage scenario and restore the monitor service of the cluster.
  • the embodiment of the present application identifies the type of database failure based on the alarm message fed back by the management node.
  • the decision is made to determine the degree of damage to the monitoring service database of each monitoring node in the cluster based on the cluster status and the database operation status of each monitoring node, and thereby locate the failure to a specific database damage scenario, and then select the corresponding first repair strategy according to the database damage scenario to achieve a complete process of automated DB fault detection, identification, and repair.
  • After determining the database fault type, the method also includes: when it is determined that the database fault type is monitoring service database overload, determining the database overload scenario based on the disk space status of the target monitoring node.
  • the target monitoring node is a monitoring node whose database operation status is abnormal.
  • the target monitoring node refers to a monitoring node that is judged to be in an abnormal state after the monitored operation status of its monitor DB triggers the alarm item of the management node.
  • If the database fault handling device based on the monitoring service determines that the alarm content corresponding to the database fault type obtained in step 101 is a monitoring service database overload, it is necessary to judge the disk space status characterizing the space occupancy of the monitor DB on the target monitoring node that triggered the alarm, and to further identify which specific database overload scenario the fault belongs to under the fault classification.
  • the repair strategy library is used to match a second repair strategy corresponding to the database overload scenario, so that the monitoring node can repair the database overload scenario according to the second repair strategy.
  • the database fault handling device based on the monitoring service matches the identified database overload scenario with the repair strategy library. If the match is successful, it means that the solution strategy corresponding to the database overload scenario is pre-deployed in the repair strategy library. Then, the second repair strategy corresponding to the scenario is adopted to repair the fault of the current database overload scenario to alleviate the cluster overload.
  • first repair strategy and the second repair strategy can be maintained in one repair strategy library at the same time, or can be maintained relatively independently in different databases.
  • The embodiment of the present application identifies the type of database fault based on the alarm message fed back by the management node. When the database fault type is determined to be an overload of the monitoring service database, the fault is located to a specific database overload scenario based on the disk space status of the target monitoring node that triggered the alarm, and the corresponding second repair strategy is then selected according to the database overload scenario, realizing the complete process of automated DB fault detection, identification and repair.
  • the database damage scenario is matched with the first target scenario identification code.
  • the first target scene identification code is unique identification information used to distinguish the first repair strategy in the repair strategy library.
  • a first target scenario identification code for distinguishing from other database damage scenarios can be allocated to different database damage scenarios in the repair strategy library, and the first repair strategy uniquely corresponding to the database damage scenario is stored using the first target scenario identification code as an index.
  • the database fault processing device based on the monitoring service uses the first target scenario identification code corresponding to the identified database damage scenario to query the repair strategy library, and sends the first repair strategy uniquely corresponding thereto to the monitoring node.
  • the monitoring node After receiving the first repair strategy, the monitoring node performs relevant processing of fault repair according to the process carried by the first repair strategy.
  • the embodiment of the present application utilizes the first target scenario identification code that matches the database damage scenario, and realizes the complete process of automated DB fault detection, identification and repair after querying the corresponding first repair strategy in the repair strategy library. It can directly use the existing damage fault conditions of monitor DB for mapping, avoiding the process of re-locating the cause of the fault caused by monitor DB damage when similar faults occur in the future, and improving the efficiency of fault handling related to database damage.
  • the database overload scenario is matched with the second target scenario identification code.
  • the second target scene identification code is unique identification information used to distinguish the second repair strategy in the repair strategy library.
  • the database fault processing device based on the monitoring service uses the second target scenario identification code corresponding to the identified database overload scenario to query in the repair strategy library, and sends the second repair strategy uniquely corresponding thereto to the monitoring node.
  • the monitoring node After receiving the second repair strategy, the monitoring node performs relevant processing of fault repair according to the process carried by the second repair strategy.
  • the embodiment of the present application utilizes the second target scenario identification code that matches the database overload scenario, and realizes the complete process of automated DB fault detection, identification and repair after querying the corresponding second repair strategy in the repair strategy library. It can directly use the overload fault situation that has already existed in the monitor DB for mapping, avoiding the process of re-locating the cause of the fault caused by the damage of the monitor DB when similar faults occur in the future, and improving the efficiency of fault handling related to database overload.
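  • A repair strategy library indexed by such identification codes can be sketched as a simple lookup table. The codes below follow the examples given later in this document (11 to 14 for damage scenarios, 21 to 23 for overload scenarios); the strategy texts and function names are illustrative assumptions:

```python
# Illustrative sketch of a repair strategy library indexed by scenario
# identification codes. The strategy descriptions paraphrase this document;
# the structure and names are assumptions, not the patent's implementation.
REPAIR_STRATEGY_LIBRARY = {
    14: "rebuild monitor DB of all nodes from cluster information in the OSD DB",
    12: "replace faulty node's monitor DB with a copy from a normal monitor node",
    11: "replace faulty node's monitor DB with a copy from a normal monitor node",
    13: "scale the faulty node's monitor service down and then up again",
    21: "compress the monitor DB in its existing independent partition",
    22: "migrate the monitor DB to a fast disk partition, then compress it",
    23: "compress the monitor DB in its current (default) location",
}

def match_repair_strategy(scenario_code: int) -> str | None:
    """Query the library with the target scenario identification code and
    return the uniquely corresponding repair strategy, if one is deployed."""
    return REPAIR_STRATEGY_LIBRARY.get(scenario_code)

strategy = match_repair_strategy(12)
if strategy is not None:
    print(f"send strategy to monitoring node: {strategy}")
```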
  • the database corruption scenario is determined based on the cluster status and the database operation status of each monitoring node, including: when it is determined that the database operation status of all monitoring nodes is abnormal, the first target scenario identification code is set to the first scenario identification code.
  • the first repair strategy corresponding to the first scene identification code is:
  • the monitoring service database of all monitoring nodes is rebuilt through the cluster information stored in the database of the object storage device.
  • In step 102, if the database fault handling device based on the monitoring service determines that the database operation status of each monitoring node has triggered an alarm and is judged to be an abnormal state, it means that all monitoring nodes in the cluster are damaged, and the first target scenario identification code of this database damage scenario is set and reflected as the first scenario identification code.
  • the first repair strategy queried in the repair strategy library using the first scenario identification code requires rebuilding the monitor DB of all monitoring nodes through the cluster information stored in the database of the object storage device (Object-based Storage Device DataBase, OSD DB).
  • When the embodiment of the present application determines that the database operation status of all monitoring nodes is abnormal, it decides to set the first target scenario identification code to the first scenario identification code, controls the monitoring node to execute the first repair strategy corresponding thereto, and uses the cluster information stored in the OSD DB to rebuild the monitor DB of all monitoring nodes, thereby realizing a complete process of automated DB fault detection, identification, and repair. It is possible to automate the handling of DB damage faults across all monitor nodes of the cluster, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database damage.
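  • A high-level sketch of this rebuild strategy is given below. Every helper is a hypothetical placeholder standing in for the cluster's actual recovery tooling; it is not the patented implementation:

```python
# High-level sketch of the first repair strategy (scenario identification code
# 14): when the monitor DB of every monitoring node is damaged, the monitor
# stores are rebuilt from the cluster information (OSDmap, PGmap, ...) that the
# OSDs still hold. Every helper below is a hypothetical placeholder.
from typing import Dict, List

def collect_cluster_maps_from_osd(osd_host: str) -> Dict[str, bytes]:
    # Placeholder: export cluster maps stored in this OSD's database.
    return {"osdmap": b"...", "pgmap": b"..."}

def rebuild_monitor_store(mon_host: str, cluster_maps: Dict[str, bytes]) -> None:
    # Placeholder: write a fresh monitor DB for mon_host from the merged maps.
    print(f"rebuilding monitor DB on {mon_host} from {sorted(cluster_maps)}")

def repair_scenario_14(osd_hosts: List[str], mon_hosts: List[str]) -> None:
    merged: Dict[str, bytes] = {}
    for osd in osd_hosts:
        merged.update(collect_cluster_maps_from_osd(osd))
    for mon in mon_hosts:
        rebuild_monitor_store(mon, merged)

repair_scenario_14(["osd-1", "osd-2"], ["mon-1", "mon-2", "mon-3"])
```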
  • determining the database damage scenario based on the cluster status and the database operation status of each monitoring node also includes: when it is determined that the cluster status is ERROR and the database operation status of at least one monitoring node is normal, setting the first target scenario identification code to the second scenario identification code.
  • the first repair strategy corresponding to the second scenario identification code is:
  • the monitoring service database of the monitoring node whose database operation status is abnormal is replaced with a copy of the monitoring service database of a monitoring node whose database operation status is normal.
  • In step 102, when the database fault handling device based on the monitoring service determines that not all monitoring nodes in the cluster are damaged, if it is determined that the cluster status fed back by the quorum is ERROR, and the number of monitoring nodes that can provide monitoring services normally is greater than or equal to 1, it means that the cluster may be displayed as an abnormal ERROR status due to improper operation or other reasons, but in fact there are still monitoring nodes that can provide services normally.
  • In this case, the first target scenario identification code of this database damage scenario is set and reflected as the second scenario identification code.
  • the first repair strategy queried in the repair strategy library using the second scenario identification code is to replace the monitor DB of the faulty monitoring node with a copy of the monitor DB of a monitoring node that can currently provide services normally, restoring the monitoring service by means of a DB copy.
  • When the embodiment of the present application determines that the current cluster status is ERROR and the database operation status of at least one monitoring node is normal, it decides to set the first target scenario identification code to the second scenario identification code, controls the monitoring node to execute the first repair strategy corresponding thereto, and uses the DB copy method to restore the monitoring service, thereby realizing the complete process of automated DB fault detection, identification and repair. It is possible to automate the handling of faults where the cluster status is ERROR but some monitoring nodes still provide normal services, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database damage.
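  • The DB copy recovery used here (and in the following WARN scenario) can be sketched as stopping the faulty monitor, copying a healthy peer's store over it, and restarting it. The paths and service names below are assumptions for illustration only:

```python
# Sketch of the DB copy recovery used for scenario codes 11 and 12: the store
# directory of a healthy monitor replaces the damaged node's store. The paths
# and service units are assumptions; a real deployment would also fix
# ownership/identity as required by the storage system before restarting.
import shutil
import subprocess

def restore_monitor_db_from_copy(healthy_store: str, faulty_store: str,
                                 faulty_service: str) -> None:
    subprocess.run(["systemctl", "stop", faulty_service], check=True)
    shutil.rmtree(faulty_store, ignore_errors=True)   # drop the damaged store
    shutil.copytree(healthy_store, faulty_store)      # copy the healthy DB over
    subprocess.run(["systemctl", "start", faulty_service], check=True)

# Example (hypothetical paths and unit names):
# restore_monitor_db_from_copy("/var/lib/mon-1/store.db",
#                              "/var/lib/mon-3/store.db",
#                              "monitor@mon-3.service")
```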
  • determining the database damage scenario based on the cluster status and the database operation status of each monitoring node also includes: when it is determined that the cluster status is WARN and there are at least three monitoring nodes with normal database operation status, setting the first target scenario identification code to the third scenario identification code.
  • the first repair strategy corresponding to the third scenario identification code is:
  • the monitoring service database of the monitoring node whose database operation status is abnormal is replaced with a copy of the monitoring service database of a monitoring node whose database operation status is normal.
  • In step 102, when the database fault handling device based on the monitoring service determines that not all monitoring nodes in the cluster are damaged, if it is determined that the cluster status fed back by the quorum is WARN, and the number of monitoring nodes that can normally provide monitoring services is greater than or equal to 3, that is, although the cluster can read and write normally it still retains its decision-making function, then the first target scenario identification code of this database damage scenario is set and reflected as the third scenario identification code.
  • the first repair strategy queried in the repair strategy library using the third scenario identification code is to replace the monitor DB of the faulty monitoring node with a copy of the monitor DB of a monitoring node that can currently provide services normally, using the DB copy method to restore the monitoring service.
  • When the embodiment of the present application determines that the current cluster status is WARN and the database operation status of at least three monitoring nodes is normal, it decides to set the first target scenario identification code to the third scenario identification code, controls the monitoring node to execute the first repair strategy corresponding thereto, and uses the DB copy method to restore the monitoring service, thereby realizing the complete process of automated DB fault detection, identification and repair. It is possible to automate the handling of faults where the cluster status is WARN but the cluster still has the conditions to make a quorum decision, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database damage.
  • determining the database damage scenario based on the cluster state and the database operation state of each monitoring node also includes: when it is determined that the cluster state is WARN and there are two or fewer monitoring nodes whose database operation status is normal, setting the first target scenario identification code to the fourth scenario identification code.
  • the first repair strategy corresponding to the fourth scenario identification code is:
  • the monitoring service of the faulty monitoring node is scaled down and then expanded again, so as to restore its monitoring service database.
  • In step 102, when the database fault handling device based on the monitoring service determines that not all monitoring nodes in the cluster are damaged, if it is determined that the cluster status fed back by the quorum is WARN, and the number of monitoring nodes that can normally provide monitoring services is less than or equal to 2, it means that although the cluster can read and write normally, it remains in the WARN state because it cannot make quorum decisions. Then the first target scenario identification code of this database damage scenario is set and reflected as the fourth scenario identification code.
  • the first repair strategy queried in the repair strategy library using the fourth scenario identification code is to first scale down the monitor service of the faulty monitoring node and then expand it again; restoring the monitor DB of the faulty node in this way is equivalent to redeploying the monitor service of that node.
  • FIG2 is one of the schematic diagrams of the fault repair process in the database fault handling method based on the monitoring service provided by the present application.
  • the embodiment of the present application provides a specific implementation process for repairing the DB damage scenarios that occur in the monitor DB; the specific implementation steps are shown in FIG. 2:
  • the first scenario identification code can be defined as code 14; in this case the monitor DB of all nodes needs to be rebuilt using the cluster information stored in the OSD DB.
  • the second scenario identification code can be defined as code 12; in this case the faulty node's DB is replaced with a monitor DB copy from a currently normal monitor node for recovery.
  • the third scenario identification code can be defined as code 11; in this case the faulty node's DB is likewise replaced with a monitor DB copy from a currently normal monitor node for recovery.
  • the fourth scenario identification code can be defined as code 13; in this case the monitor service of the faulty monitor node is first scaled down and then scaled up again, restoring the monitor DB of the faulty node, which is equivalent to redeploying the monitor service of that node.
  • When the embodiment of the present application determines that the current cluster status is WARN and there are two or fewer monitoring nodes whose database operation status is normal, it decides to set the first target scenario identification code to the fourth scenario identification code, controls the monitoring node to execute the first repair strategy corresponding thereto, and uses the monitor scaling method to restore the monitoring service, thereby realizing the complete process of automated DB fault detection, identification, and repair. It is possible to automate the handling of faults where the cluster status is WARN and the conditions for quorum decision-making are not met, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database damage.
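  • Taken together, the damage scenarios above amount to a small classification function. The following sketch maps the cluster status and the number of monitors still providing normal service to the scenario identification codes 11 to 14 defined above; the thresholds come from this document, everything else is an illustrative assumption:

```python
# Condensed sketch of the DB damage classification described above. The
# function name and the string status values are assumptions for the sketch.
def classify_db_damage(cluster_status: str, normal_monitors: int) -> int:
    if normal_monitors == 0:
        return 14  # all monitor DBs damaged: rebuild from the OSD DB
    if cluster_status == "ERROR":
        return 12  # recover the faulty DBs from a healthy monitor's DB copy
    if cluster_status == "WARN" and normal_monitors > 2:
        return 11  # quorum still possible: recover via DB copy
    return 13      # WARN with 2 or fewer normal monitors: scale down, then up

print(classify_db_damage("WARN", 3))   # -> 11
print(classify_db_damage("WARN", 1))   # -> 13
print(classify_db_damage("ERROR", 2))  # -> 12
```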
  • the database overload scenario is determined based on the disk space status of the target monitoring node, including: when it is determined from the disk space status of the target monitoring node that the monitoring service database is deployed in an independent partition pre-divided for it, setting the second target scenario identification code to the fifth scenario identification code.
  • the second repair strategy corresponding to the fifth scenario identification code is:
  • the monitoring service database of the target monitoring node is compressed directly within the independent partition where it is deployed.
  • When the database fault handling device based on the monitoring service determines that the disk space status of the target monitoring node that triggered the alarm indicates that its monitor DB has been deployed in a pre-divided independent partition, it means that the deployment location of the monitor DB is correct and the DB only needs to be compressed in its original deployment location.
  • In this case, the second target scenario identification code of this database overload scenario is set and reflected as the fifth scenario identification code.
  • the second repair strategy queried in the repair strategy library using the fifth scenario identification code is to directly compress the monitor DB of the faulty monitoring node within the independent partition where it is deployed.
  • When the embodiment of the present application determines that the monitoring service database of the target monitoring node is deployed in an independent partition pre-divided for it, it decides to set the second target scenario identification code to the fifth scenario identification code, controls the monitoring node to execute the corresponding second repair strategy, and compresses the DB within its original deployment space, thereby realizing a complete process of automated DB fault detection, identification, and repair. It is possible to automate the handling of faults where the deployment location of the monitor DB is correct but its stored content is overloaded, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database overload.
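  • For a Ceph-style cluster, in-place compaction of an oversized monitor store can typically be requested from the monitor daemon itself; the exact command below is an assumption to be verified for the target environment:

```python
# Sketch of the fifth repair strategy (code 21): compact the oversized monitor
# DB in place. Assuming a Ceph-style cluster, on-demand compaction of a
# monitor's store is commonly requested with "ceph tell mon.<id> compact";
# treat the command and the monitor id as assumptions for your environment.
import subprocess

def compact_monitor_db(mon_id: str) -> None:
    """Ask the monitor daemon to compact its database in its current partition."""
    subprocess.run(["ceph", "tell", f"mon.{mon_id}", "compact"], check=True)

# compact_monitor_db("node-3")  # run on a node with cluster admin access
```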
  • determining the database overload scenario based on the disk space status of the target monitoring node also includes: when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-allocated for it, and the disk space of the target monitoring node meets the migration conditions, setting the second target scenario identification code to the sixth scenario identification code.
  • the second repair strategy corresponding to the sixth scenario identification code is:
  • the monitoring service database of the target monitoring node is migrated to the fast disk partition and then compressed.
  • the migration condition is that the disk space of the target monitoring node has a fast disk partition, and the disk space capacity of the target monitoring node is greater than the capacity of the monitoring service database.
  • When the database fault handling device based on the monitoring service determines that the disk space status of the target monitoring node that triggered the alarm indicates that its monitor DB is not deployed in a pre-divided independent partition, it means that the monitor DB is deployed in an incorrect location, and it is necessary to further determine whether the current node meets the monitor DB migration conditions.
  • If the migration conditions are met, the second target scenario identification code of this database overload scenario is set and reflected as the sixth scenario identification code.
  • the migration condition refers to whether the physical disk space of the current node is sufficient to divide the monitor DB into independent partitions.
  • the migration condition may be whether an NVMe or SSD disk is configured, and whether there is enough space on the disk to partition the monitor.
  • the second repair strategy queried in the repair strategy library using the sixth scenario identification code is to first migrate the monitor DB of the node from the system disk to the designated fast disk partition and then compress the DB.
  • When it is determined that the monitor DB of the target monitoring node is not deployed in a pre-divided independent partition and the disk space of the node meets the migration conditions, the decision is made to set the second target scenario identification code to the sixth scenario identification code, and the monitoring node is controlled to execute the corresponding second repair strategy, which migrates the monitor DB to the correct deployment location and then compresses it, thereby realizing the complete process of automated DB fault detection, identification, and repair. It is possible to automate the handling of overload faults where the monitor DB deployment location is incorrect but the migration conditions are met, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database overload.
  • determining the database overload scenario based on the disk space status of the target monitoring node also includes: when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-allocated for it, and the disk space of the target monitoring node does not meet the migration conditions, setting the second target scenario identification code to the seventh scenario identification code.
  • the second repair strategy corresponding to the seventh scenario identification code is:
  • the monitoring service database of the target monitoring node is compressed in its current deployment location.
  • When the database fault handling device based on the monitoring service determines that the disk space status of the target monitoring node that triggered the alarm indicates that its monitor DB is not deployed in a pre-divided independent partition, it means that the monitor DB is deployed in an incorrect location, and it is necessary to further determine whether the current node meets the monitor DB migration conditions.
  • If the migration conditions are not met, the second target scenario identification code of this database overload scenario is set and reflected as the seventh scenario identification code.
  • the second repair strategy queried in the repair strategy library using the seventh scenario identification code is that since the disk space of the node does not meet the migration conditions, the monitor DB of the faulty monitoring node can only be compressed in the current deployment location.
  • FIG3 is a second schematic diagram of a fault repair process in a database fault handling method based on a monitoring service provided by the present application.
  • the present application embodiment provides a specific implementation process for repairing a DB overload scenario that occurs in a monitor DB:
  • the fifth scenario identification code can be defined as code 21; in this case the monitor DB of the node is directly compressed.
  • the sixth scenario identification code can be defined as code 22; in this case the monitor DB of the node is first migrated from the system disk to the specified fast disk partition, and then the DB is compressed.
  • the seventh scenario identification code can be defined as code 23; in this case, since the migration conditions are not met, the monitor DB of the node is directly compressed in its current location.
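  • The overload codes above likewise reduce to a small classification function combining the deployment-location check and the migration condition. The disk-type heuristic below is an illustrative assumption and will differ between environments:

```python
# Sketch of the overload classification (codes 21-23): if the monitor DB
# already lives in its own pre-divided partition it is compressed in place
# (21); otherwise the node is checked for a fast disk (NVMe/SSD) with enough
# free space to host the DB, leading to migrate-then-compress (22) or
# compress-in-place (23). Device detection is a placeholder assumption.
import shutil

def is_fast_disk(device: str) -> bool:
    # Placeholder heuristic: treat NVMe devices as fast disks.
    return device.startswith("/dev/nvme")

def classify_db_overload(db_in_own_partition: bool, fast_disk_device: str | None,
                         fast_disk_mountpoint: str | None, db_size_bytes: int) -> int:
    if db_in_own_partition:
        return 21  # correct location: compress in place
    if (fast_disk_device and fast_disk_mountpoint
            and is_fast_disk(fast_disk_device)
            and shutil.disk_usage(fast_disk_mountpoint).free > db_size_bytes):
        return 22  # migration conditions met: migrate to fast disk, then compress
    return 23      # no suitable fast disk: compress in the current location

print(classify_db_overload(True, None, None, 0))            # -> 21
print(classify_db_overload(False, "/dev/sda", "/", 10**9))  # -> 23
```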
  • FIG4 is a partial flow diagram of a database fault handling method based on a monitoring service provided by the present application.
  • FIG5 is a partial flow diagram of a database fault handling method based on a monitoring service provided by the present application.
  • the embodiments of the present application respectively provide a specific implementation process of a database fault handling method based on a monitoring service:
  • the software side of the monitoring node and the management node adds detection of the monitor DB and corresponding alarm items, including whether the DB is too large or damaged.
  • the two fault scenarios are the inspection items of the monitor itself:
  • When a monitor node fails to open its monitor DB, the monitor node records the abnormal situation in the database operation status, so that the software side of the management node sends an alarm message with the fault type information indicating that the database is damaged to the monitor node.
  • the software side displays and reports the alarm on the interface platform based on the alarm message triggered by the monitoring node.
  • If the monitor DB is deployed on the system disk by default and is not partitioned separately, it is necessary to determine whether the current node meets the monitor DB migration conditions: whether the current node is configured with an NVMe or SSD disk, and whether there is enough space on that disk to create a partition for the monitor. If the conditions are met, code 22 is returned; otherwise, code 23 is returned.
  • If the monitor DB of some nodes in the cluster is damaged but there are still monitor processes in the cluster that can provide services normally, it is necessary to determine the number of monitors that can provide services normally:
  • If the cluster status is WARN and the number of monitor nodes in the cluster that are currently providing normal monitor services is greater than 2, code 11 is returned.
  • If the cluster status is ERROR and at least one monitor node can still provide services normally, code 12 is returned.
  • When the embodiment of the present application determines that the monitor DB of the target monitoring node is not deployed in the independent partition pre-divided for the monitoring service database and the disk space of the target monitoring node does not meet the migration conditions, it decides to set the second target scenario identification code to the seventh scenario identification code, and controls the monitoring node to execute the corresponding second repair strategy, compressing the monitor DB in its default deployment position, thereby realizing a complete process of automated DB fault detection, identification, and repair. It is possible to automate the handling of overload faults where the monitor DB deployment location is incorrect and the migration conditions are not met, and to complete the identification and repair of the fault as soon as it occurs, improving the efficiency of fault handling related to database overload.
  • FIG6 is a schematic diagram of the structure of a database fault handling device based on a monitoring service provided by the present application.
  • the database fault handling device based on a monitoring service provided by the embodiment of the present application includes a fault detection module 610, a first fault identification module 620 and a first fault repair module 630, wherein:
  • the fault detection module 610 is used to determine the database fault type based on the alarm message fed back by the management node.
  • the first fault identification module 620 is used to determine the database damage scenario based on the cluster state and the database operation state of each monitoring node when it is determined that the database fault type is the damage of the monitoring service database.
  • the first fault repair module 630 is used to match a first repair strategy corresponding to a database damage scenario using a repair strategy library, so that the monitoring node can repair the database damage scenario according to the first repair strategy.
  • the alarm message is a notification message generated by the management node when it determines that the database operation status monitored by the monitoring node matches the preset database fault status.
  • the notification message carries the fault type information corresponding to the database fault status.
  • the cluster status is obtained by evaluating the monitoring service deployed by the distributed storage system through voting decisions between the monitoring nodes.
  • the fault detection module 610, the first fault identification module 620, and the first fault repair module 630 are electrically connected in sequence.
  • When receiving the alarm message fed back by the management node when an alarm item is triggered, the fault detection module 610 outputs the fault type information carried in the alarm message as the database fault type.
  • When the first fault identification module 620 determines that the alarm content corresponding to the database fault type obtained by the fault detection module 610 is damage to the monitoring service database, it uses the cluster status of the distributed cluster and the database operation status of each monitoring node to judge the degree of damage of the resolution committee (quorum) composed of all monitoring nodes in the current cluster, and further identifies which specific database damage scenario the fault belongs to under the fault classification.
  • the first fault repair module 630 matches the database corruption scenario identified by the first fault identification module 620 with the repair strategy library. If the match is successful, it means that the solution strategy corresponding to the database corruption scenario has been pre-deployed in the repair strategy library. In this case, the first repair strategy corresponding to the scenario is adopted to repair the current database corruption scenario and restore the monitor service of the cluster.
  • the database damage scenario matches the first target scenario identification code.
  • the first target scene identification code is unique identification information used to distinguish the first repair strategy in the repair strategy library.
  • the first fault identification module 620 is specifically configured to set the first target scenario identification code to the first scenario identification code when it is determined that the database operation status of all monitoring nodes is an abnormal status.
  • the first repair strategy corresponding to the first scene identification code is:
  • the monitoring service database of all monitoring nodes is rebuilt through the cluster information stored in the database of the object storage device.
  • the first fault identification module 620 is further specifically configured to set the first target scenario identification code to the second scenario identification code when it is determined that the cluster state is ERROR and the database operation state of at least one monitoring node is normal;
  • the first repair strategy corresponding to the second scene identification code is:
  • the monitoring node whose database operation status is abnormal is replaced with the monitoring service database copy of the monitoring node whose database operation status is normal.
  • the first fault identification module 620 is further specifically configured to set the first target scenario identification code to a third scenario identification code when it is determined that the cluster state is WARN and there are at least three monitoring nodes whose database operation states are normal;
  • the first repair strategy corresponding to the third scene identification code is:
  • the monitoring node whose database operation status is abnormal is replaced with the monitoring service database copy of the monitoring node whose database operation status is normal.
  • the first fault identification module 620 is further specifically configured to set the first target scenario identification code to the fourth scenario identification code when it is determined that the cluster state is WARN and there are two or fewer monitoring nodes whose database operation state is normal;
  • the first repair strategy corresponding to the fourth scenario identification code is: scaling down the monitoring service of the faulty monitoring node and then expanding it again, so as to restore its monitoring service database.
  • the database fault handling device based on monitoring service provided in the embodiment of the present application is used to execute the above-mentioned database fault handling method based on monitoring service of the present application. Its implementation method is consistent with the implementation method of the database fault handling method based on monitoring service provided in the present application, and can achieve the same beneficial effects, which will not be repeated here.
  • the embodiment of the present application identifies the type of database fault based on the alarm message fed back by the management node.
  • the decision is made based on the cluster status and the database operation status of each monitoring node to determine the degree of damage to the monitoring service database of each monitoring node in the cluster, and thereby locate the fault to a specific database damage scenario, and then select the corresponding first repair strategy according to the database damage scenario to achieve a complete process of automated DB fault detection, identification and repair.
  • the device further includes a second fault identification module and a second fault repair module, wherein:
  • the second fault identification module is used to determine the database overload scenario based on the disk space status of the target monitoring node when it is determined that the database fault type is the monitoring service database overload.
  • the second fault repair module is used to use the repair strategy library to match a second repair strategy corresponding to the database overload scenario, so that the monitoring node can repair the database overload scenario according to the second repair strategy.
  • the target monitoring node is the cluster computing node where the abnormal monitoring node is deployed.
  • the database overload scenario matches the second target scenario identification code.
  • the second target scene identification code is unique identification information used to distinguish the second repair strategy in the repair strategy library.
  • the second fault identification module is specifically configured to set the second target scenario identification code to the fifth scenario identification code when it is determined that the monitoring service database of the target monitoring node is deployed in an independent partition pre-divided for it;
  • the second repair strategy corresponding to the fifth scenario identification code is: compressing the monitoring service database of the target monitoring node directly within the independent partition where it is deployed;
  • the second fault identification module is further specifically used to set the second target scenario identification code to the sixth scenario identification code when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-divided for it and the disk space of the target monitoring node meets the migration condition;
  • the second repair strategy corresponding to the sixth scenario identification code is: migrating the monitoring service database of the target monitoring node to the fast disk partition and then compressing it;
  • the migration condition is that the disk space of the target monitoring node has a fast disk partition, and the disk space capacity of the target monitoring node is greater than the capacity of the monitoring service database.
  • the second fault identification module is further specifically used to set the second target scenario identification code to the seventh scenario identification code when it is determined that the monitoring service database of the target monitoring node is not deployed in an independent partition pre-divided for it and the disk space of the target monitoring node does not meet the migration condition;
  • the second repair strategy corresponding to the seventh scenario identification code is: compressing the monitoring service database of the target monitoring node in its current deployment location.
  • the database fault handling device based on a monitoring service provided in the embodiment of the present application is used to execute the above-mentioned database fault handling method based on a monitoring service of the present application.
  • Its implementation is consistent with the implementation of the database fault handling method based on a monitoring service provided in this application, and can achieve the same beneficial effects, so it will not be repeated here.
  • the embodiment of the present application identifies the type of database fault based on the alarm message fed back by the management node.
  • when the database fault type is determined to be an overload of the monitoring service database, the fault is located to a specific database overload scenario based on the disk space status of the target monitoring node that triggered the alarm, and the corresponding second repair strategy is then selected according to that scenario, realizing a complete automated process of DB fault detection, identification and repair.
  • FIG. 7 is a schematic diagram of the structure of the distributed cluster provided by the present application.
  • the distributed cluster provided by the embodiment of the present application includes at least n monitoring nodes 710 that deploy monitoring services on cluster computing nodes, and at least 1 management node 720 that deploys software management services on cluster computing nodes.
  • Each monitoring node 710 is used to implement the above database fault handling method based on monitoring services.
  • the monitoring node 710 is used to monitor the monitoring service database deployed by itself and feed back the acquired database operation status to the management node.
  • the management node 720 is used to match the database operation status with the preset database fault status, generate an alarm message carrying fault type information corresponding to the database fault status, and send the alarm message to the monitoring node whose database operation status is abnormal.
  • n is an odd number greater than 1, and the total number of cluster computing nodes is greater than the total number of monitoring nodes 710 .
  • the distributed cluster is composed of an odd number of monitoring nodes 710 that deploy monitoring services on cluster computing nodes, and at least one management node 720 that deploys software management services on cluster computing nodes.
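  • The requirement that the number of monitoring nodes be an odd number greater than one follows from the majority-vote decision between monitors; a minimal sketch of that majority check is given below, for illustration only.

```python
# Illustrative majority check behind the odd-number-of-monitors requirement:
# a proposal only takes effect when it gathers more than half of all votes.

def has_majority(votes_for: int, total_monitors: int) -> bool:
    return votes_for > total_monitors / 2


if __name__ == "__main__":
    # With 3 monitors one failure is tolerated (2 > 1.5); with 2 monitors a
    # single failure already loses the quorum (1 > 1.0 is false).
    print(has_majority(2, 3), has_majority(1, 2))  # True False
```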
  • the monitoring node 710 monitors the operation of the monitor DB deployed by itself, and feeds back the monitored database operation status to the management node 720.
  • the management node 720 matches the received database operation status with the alarm item configured for the database failure status, and feeds back to the monitoring node 710 an alarm message carrying the fault type information corresponding to the database failure status of the monitoring node 710 when the alarm item is triggered.
  • the monitoring node 710 receives the alarm message fed back by the management node 720, identifies the fault scenario and cluster conditions that generate the alarm, and then calls the relevant commands to repair the DB failure.
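  • A compact sketch of this alarm-and-repair loop is given below; all names (the alarm items, fault types and repair commands) are assumptions introduced for illustration, not the interfaces of any particular management software.

```python
# Illustrative end-to-end loop: the management node matches the reported DB
# status against its alarm items and generates an alarm; the monitoring node
# then dispatches the repair command matching the fault type in that alarm.

ALARM_ITEMS = {
    "db_open_failed": "monitor_db_damaged",           # DB cannot be opened
    "db_size_over_threshold": "monitor_db_overload",  # DB larger than its allocated space
}

REPAIR_COMMANDS = {
    "monitor_db_damaged": lambda node: print(f"{node}: run DB damage repair strategy"),
    "monitor_db_overload": lambda node: print(f"{node}: run DB overload repair strategy"),
}


def management_node_check(node: str, reported_status: str):
    """Return an alarm message when the reported status triggers an alarm item."""
    fault_type = ALARM_ITEMS.get(reported_status)
    return {"node": node, "fault_type": fault_type} if fault_type else None


def monitoring_node_handle(alarm: dict) -> None:
    """Dispatch the repair command that matches the fault type in the alarm."""
    REPAIR_COMMANDS[alarm["fault_type"]](alarm["node"])


if __name__ == "__main__":
    alarm = management_node_check("mon-b", "db_open_failed")
    if alarm:
        monitoring_node_handle(alarm)
```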
  • the embodiment of the present application identifies the type of database fault based on the alarm message fed back by the management node.
  • when the database fault type is determined to be damage to the monitoring service database, the degree of damage to the monitoring service database of each monitoring node in the cluster is determined based on the cluster status and the database operation status of each monitoring node; the fault is thereby located to a specific database damage scenario, and the corresponding first repair strategy is then selected according to that scenario, realizing a complete automated process of DB fault detection, identification and repair.
  • Figure 8 illustrates a schematic diagram of the physical structure of an electronic device.
  • the electronic device may include: a processor (processor) 810, a communication interface (Communications Interface) 820, a memory (memory) 830 and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other through the communication bus 840.
  • the processor 810 can call the logic instructions in the memory 830 to execute a database fault handling method based on a monitoring service, the method comprising: determining the database fault type based on the alarm message fed back by the management node; when it is determined that the database fault type is damage to the monitoring service database, determining the database damage scenario based on the cluster status and the database operation status of each monitoring node; using the repair strategy library to match the first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy; wherein the alarm message is a notification message carrying fault type information corresponding to the database fault status generated by the management node when it is determined that the database operation status monitored by the monitoring node for the monitoring service database matches the preset database fault status; the cluster status is obtained by evaluating the monitoring service deployed in the distributed cluster through voting decisions between the monitoring nodes.
  • the logic instructions in the memory 830 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or other media capable of storing program code.
  • the present application also provides a computer program product, which includes a computer program.
  • the computer program can be stored on a non-transitory computer-readable storage medium.
  • when the computer program is executed by a processor, the computer can execute the database fault handling method based on the monitoring service provided by the above-mentioned methods, the method including: determining the database fault type based on the alarm message fed back by the management node; when it is determined that the database fault type is damage to the monitoring service database, determining the database damage scenario based on the cluster status and the database operation status of each monitoring node; using the repair strategy library to match the first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy; wherein the alarm message is a notification message carrying fault type information corresponding to the database fault status generated by the management node when it is determined that the database operation status monitored by the monitoring node for the monitoring service database matches the preset database fault status; the cluster status is obtained by evaluating the monitoring service deployed in the distributed cluster through voting decisions between the monitoring nodes.
  • the present application also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, is implemented to execute the database fault handling method based on the monitoring service provided by the above-mentioned methods, the method comprising: determining the database fault type based on the alarm message fed back by the management node; when it is determined that the database fault type is damage to the monitoring service database, determining the database damage scenario based on the cluster status and the database operation status of each monitoring node; using the repair strategy library to match a first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy; wherein the alarm message is a notification message carrying fault type information corresponding to the database fault status generated by the management node when it is determined that the database operation status monitored by the monitoring node for the monitoring service database matches the preset database fault status; the cluster status is obtained by evaluating the monitoring service deployed in the distributed cluster through voting decisions between the monitoring nodes.
  • the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art may understand and implement it without creative effort.
  • from the description of the above implementations, those skilled in the art can clearly understand that each implementation can be realized by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware.
  • based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, including a number of instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of each embodiment or some parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application provides a database fault handling method and device based on a monitoring service, and a distributed cluster, belonging to the field of distributed storage technology. The method includes: determining a database fault type based on an alarm message fed back by a management node; when the database fault type is determined to be damage to the monitoring service database, determining a database damage scenario based on the cluster status and the database operation status of each monitoring node; and matching, from a repair strategy library, a first repair strategy corresponding to the database damage scenario, so that the monitoring node can repair the database damage scenario according to the first repair strategy. The database fault handling method and device based on a monitoring service and the distributed cluster provided by the present application overcome the defect that existing implementations must repeatedly invest a large amount of manpower and time in the same DB-induced faults, and improve the efficiency and completeness of fault handling related to database damage.

Description

基于监控服务的数据库故障处理方法、装置及分布式集群
本申请要求于2023年1月9日提交中国专利局、申请号为202310027120.X、发明名称为“基于监控服务的数据库故障处理方法、装置及分布式集群”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及分布式存储技术领域,尤其涉及一种基于监控服务的数据库故障处理方法、装置及分布式集群。
背景技术
在分布式存储系统中,监控(Monitor)服务会部署在不同的物理服务器上,负责监控、维护和查询集群中各对象存储设备(Object-based Storage Device,OSD)等其他服务的数据库运行状态,并在产生异常进行告警的上报,是集群底层最为重要和关键的组成部分之一。由于Monitor进程需要对集群的其他各服务状态进行监控和维护,因此Monitor数据库(DataBase,DB)中需要保存许多其他服务相关的信息,这些信息以各类结构体的形式进行保存,如OSDmap、PGmap等集群信息。通过对DB中保存信息的维护、查询和更新,monitor进而可以实现对集群各服务状态的监控。
现有技术背景下,在某一节点监控服务的数据库出现异常时,虽然集群中其他节点上的监控服务不会受到影响,但本节点的服务是不会进行自动的识别和修复的。并且,对于所有监控服务的数据库均发生异常所导致的集群业务阻塞的情况,也只将故障定位至监控服务存在异常,后续还需专业的工作人员手动进行故障的逐一排查和修复。所以,现有技术中对于监控服务所发生的故障无法进行精准定位,导致故障处理效率低下,容易带来业务被长时间阻塞的问题和风险。
发明内容
本申请提供一种基于监控服务的数据库故障处理方法、装置及分布式集群,用以解决现有技术中故障处理效率低下的缺陷。
本申请提供一种基于监控服务的数据库故障处理方法,包括:
基于管理节点反馈的告警消息,确定数据库故障类型;
在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;
利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复;
其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;集群状态是各监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
根据本申请提供的一种基于监控服务的数据库故障处理方法,在确定数据库故障类型之后,还包括:
在确定数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景;
利用修复策略库匹配到与数据库过载场景对应的第二修复策略,以供监控节点根据第二修复策略对数据库过载场景进行修复;
其中,目标监控节点为数据库运行状态为异常状态的监控节点。
根据本申请提供的一种基于监控服务的数据库故障处理方法,数据库损坏场景与第一目标场景识别码相匹配;
其中,第一目标场景识别码为用于在修复策略库中区分第一修复策略的唯一标识信息。
根据本申请提供的一种基于监控服务的数据库故障处理方法,数据库过载场景与第二目标场景识别码相匹配;
其中,第二目标场景识别码为用于在修复策略库中区分第二修复策略的唯一标识信息。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,包括:
在确定所有监控节点的数据库运行状态为异常状态的情况下,将第一目标场景识别码设置为第一场景识别码;
与第一场景识别码对应的第一修复策略为:
通过对象存储设备的数据库中所保存的集群信息来对所有监控节点的监控服务数据库进行重建。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
在确定集群状态为ERROR,且至少存在一个监控节点的数据库运行状态为正常状态的情况下,将第一目标场景识别码设置为第二场景识别码;
与第二场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
在确定集群状态为WARN,且至少存在三个数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第三场景识别码;
与第三场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
在确定集群状态为WARN,且存在两个或两个以下数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第四场景识别码;
与第四场景识别码对应的第一修复策略为:
对数据库运行状态为异常状态的监控节点的监控服务进行重新部署。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于目标监控节点的磁盘空间状态,确定数据库过载场景,包括:
在确定目标监控节点的磁盘空间状态为部署在为监控服务数据库预先划分的独立分区的情况下,将第二目标场景识别码设置为第五场景识别码;
与第五场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:
在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间满足迁移条件的情况下,将第二目标场景识别码设置为第六场景识别码;
与第六场景识别码对应的第二修复策略为:
先将目标监控节点的监控服务数据库从系统盘上迁移至指定的快速盘分区上,再对迁移后的监控服务数据库进行压缩;
其中,迁移条件为目标监控节点的磁盘空间存在快速盘分区,且目标监控节点的磁盘空间容量大于监控服务数据库容量。
根据本申请提供的一种基于监控服务的数据库故障处理方法,基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:
在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间不满足迁移条件的情况下,将第二目标场景识别码设置为第七场景识别码;
与第七场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
本申请还提供一种基于监控服务的数据库故障处理装置,包括:
故障检测模块,用于基于管理节点反馈的告警消息,确定数据库故障类型;
第一故障识别模块,用于在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;
第一故障修复模块,用于利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复;
其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;集群状态是各监控节点之间通过投票决策的方式对分布式存储系统所部署的监控服务进行评估得到的。
根据本申请提供的一种基于监控服务的数据库故障处理装置,还包括:
第二故障识别模块,用于在确定数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景;
第二故障修复模块,用于利用修复策略库匹配到与数据库过载场景对应的第二修复策略,以供监控节点根据第二修复策略对数据库过载场景进行修复;
其中,目标监控节点为异常监控节点所部署的集群计算节点。
本申请还提供一种分布式集群,包括至少n个在集群计算节点上部署监控服务的监控节点,以及至少1个在集群计算节点上部署软件管理服务的管理节点,每一监控节点用于实现如上任一的基于监控服务的数据库故障处理方法;
监控节点,用于对其自身部署的监控服务数据库进行监控,将获取到的数据库运行状态反馈至管理节点;
管理节点,用于根据数据库运行状态与预设的数据库故障状态进行匹配,生成携带有与数据库故障状态对应的故障类型信息的告警消息,并将告警消息下发至数据库运行状态为异常状态的监控节点;
其中,n为大于1的奇数,集群计算节点的总数量大于监控节点的总数量。
本申请还提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现如上述任一种基于监控服务的数据库故障处理方法。
本申请还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如上述任一种基于监控服务的数据库故障处理方法。
本申请提供的基于监控服务的数据库故障处理方法、装置及分布式集群,基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库损坏时,决策根据集群状态和各监控节点的数据库运行状态,确定集群各监控节点的监控服务数据库的损坏程度,并以此将故障定位至具体的数据库损坏场景,进而根据数据库损坏场景选择对应的第一修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的损坏故障情况进行场景分离,再按照相应策略实现了当monitor DB发生损坏时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障 重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
附图说明
为了更清楚地说明本申请或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请提供的基于监控服务的数据库故障处理方法的流程示意图;
图2是本申请提供的基于监控服务的数据库故障处理方法中的故障修复流程示意图之一;
图3是本申请提供的基于监控服务的数据库故障处理方法中的故障修复流程示意图之二;
图4是本申请提供的基于监控服务的数据库故障处理方法的部分流程示意图之一;
图5是本申请提供的基于监控服务的数据库故障处理方法的部分流程示意图之二;
图6是本申请提供的基于监控服务的数据库故障处理装置的结构示意图;
图7是本申请提供的分布式集群的结构示意图;
图8是本申请提供的电子设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请中的附图,对本申请中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请中的术语“第一”、“第二”等是用于区别类似的对象,而不用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施,且“第一”、“第二”等所区分的对象通常为一类,并不限定对象的个数,例如第一对象可以是一个,也可以是多个。
应当理解,在本申请说明书中所使用的术语仅仅是出于描述特定实施例的目的而并不意在限制本申请。如在本申请中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。
术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
图1是本申请提供的基于监控服务的数据库故障处理方法的流程示意图。如图1所示,本申请实施例提供的基于监控服务的数据库故障处理方法,包括:步骤101、基于管理节点反馈的告警消息,确定数据库故障类型。
其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息。
需要说明的是,本申请实施例提供的基于监控服务的数据库故障处理方法的执行主体为基于监控服务的数据库故障处理装置,该装置可以以电子芯片、中央处理器(Central Processing Unit,CPU)、微控制器单元(Micro Control Unit,MCU)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)等形式集成在分布式集群中的物理服务器中的处理器。
需要说明的是,在步骤101之前,需要通过在分布式集群中的监控节点(monitor),以及管理节点的管理软件侧增加关于监控服务的数据库(monitor Database,monitor DB)故障状态的告警项。
紧接着,由监控节点根据预先设置好的告警项,实时对其自身的monitor DB进行监控,将所采集到的数据库运行状态上传至管理节点,由管理节点中的管理软件对其进行处理。
管理节点中的管理软件将监控节点上传的数据库运行状态与在告警项中预先配置的数据库故障状态进行对比匹配,若二者一致,则说明该监控节点的monitor DB正常运行,并未触发告警项。反之,则说明该监控节点的monitor DB发生异常,触发告警项的同时,向该监控节点下发携带有用于表征其数据库故障状态的故障类型信息的通知消息,以实现对monitor DB的告警进行监控和上报。
其中,告警项主要包括monitor DB是否过大,monitor DB是否损坏。
具体地,在步骤101中,基于监控服务的数据库故障处理装置在接收到管理节点在触发告警项时所反馈的告警消息,将告警消息中所携带的故障类型信息作为数据库故障类型输出。
步骤102、在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景。
其中,集群状态是各监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
需要说明的是,在分布式集群中,存在着节点故障、网络故障、网络延时等异常情况,通过paxos协议能够保证发生上述异常情况时数据依然能够在分布式系统中保持一致性,paxos协议是通过投票的方式,每个节点投出自己的一票,当得票数大于总节点数的一半,表示在分布式系统中达成了一致,则本次提议生效。当得票数等于或者小于总节点数的一半时,则表示没有达成一致,本次提议不生效。
根据paxos协议的原理,必须在分布式集群中部署奇数个监控节点。当部署3个monitor服务时,允许任意一个节点故障,剩余的两个节点仍然可以通过paxos协商一致,则集群提供正常服务,即集群状态为正常。如果monitor服务部署成偶数,比如2个,则任意一个节点故障,剩余一个节点的投票始终无法超过半数,则服务还是无法使用。
具体地,在步骤102中,基于监控服务的数据库故障处理装置若确定步骤101得到的数据库故障类型所对应的告警内容为监控服务数据库损坏,则利用分布式集群的集群状态,以及各监控节点的数据库运行状态对当前集群中由所有监控节点构成的决议委员会(quorum)的损坏程度进行判断,进一步识别出其故障隶属于该故障分类下哪一种具体的数据库损坏场景。
其中,数据库损坏场景主要包括两种,一是quorum中所有的监控节点均发生损坏。二是quorum中存在部分监控节点的monitor DB发生损坏,但仍有可以正常提供服务的monitor进程存在,则需要再依据可以正常提供服务的monitor数量进行更具体的判断和识别。
步骤103、利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复。
需要说明的是,在步骤103之前,需要将monitor DB可能存在的损坏故障情况进行分类,并将预先为每一种损坏故障制定相应的解决措施作为第一修复策略,存储在修复策略库中,并根据分布式集群的设计需求或者故障类型的改变而进行实时的维护和更新。
具体地,在步骤103中,基于监控服务的数据库故障处理装置根据步骤102识别到的数据库损坏场景与修复策略库进行匹配,若匹配成功,即说明该修复策略库中预先部署好该数据库损坏场景所对应的解决策略,则采取与该场景对应的第一修复策略对当前所面临的数据库损坏场景进行故障修复,恢复集群的monitor服务。
反之,即说明该修复策略库中预先并未枚举到该数据库损坏场景,所以并未部署好所对应的解决策略,则需要为其制定新的第一修复策略去解决该故障,并利用新的第一修复策略更新修复策略库。
本申请实施例基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库损坏时,决策根据集群状态和各监控节点的数据库运行状态,确定集群各监控节点的监控服务数据库的损坏程度,并以此将故障定位至具体的数据库损坏场景,进而根据数据库损坏场景选择对应的第一修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的损坏故障情况进行场景分离,再按照相应策略实现了当monitor DB发生损坏时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
在上述任一实施例的基础上,在确定数据库故障类型之后,还包括:在确定数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景。
其中,目标监控节点为数据库运行状态为异常状态的监控节点。
需要说明的是,目标监控节点,是指其自身对其所属的monitor DB所监控到的数据库运行状态触发了管理节点的告警项后,被判定为异常状态的监控节点。
具体地,在步骤101之后,基于监控服务的数据库故障处理装置若确定步骤101得到的数据库故障类型所对应的告警内容为监控服务数据库过载,则需要利用触发告警的目标监控节点的monitor DB在当前节点中用于表征空间占用情况的磁盘空间状态进行判断,进一步识别出其故障隶属于该故障分类下哪一种具体的数据库过载场景。
其中,数据库过载场景主要包括两种,一是当前目标监控节点的monitor DB已经处于为其单独划分的快速盘分区上。二是当前目标监控节点的monitor DB未处于为其单独划分的快速盘分区上,则需要进一步根据该节点的磁盘空间状态对其是否有条件配置独立的磁盘分区进行判断。
利用修复策略库匹配到与数据库过载场景对应的第二修复策略,以供监控节点根据第二修复策略对数据库过载场景进行修复。
具体地,基于监控服务的数据库故障处理装置根据识别到的数据库过载场景与修复策略库进行匹配,若匹配成功,即说明该修复策略库中预先部署好该数据库过载场景所对应的解决策略,则采取与该场景对应的第二修复策略对当前所面临的数据库过载场景进行故障修复,缓解集群过载。
反之,即说明该修复策略库中预先并未枚举到该数据库过载场景,所以并未部署好所对应的解决策略,则需要为其制定新的第二修复策略去解决该故障,并利用新的第二修复策略更新修复策略库。
可以理解的是,第一修复策略和第二修复策略可以同时维护在一个修复策略库中,也可以相对独立的维护在不同的数据库中。
本申请实施例基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库过载时,决策根据触发告警的目标监控节点的磁盘空间状态,将故障定位至具体的数据库过载场景,进而根据数据库过载场景选择对应的第二修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的过载故障情况进行场景分离,再按照相应策略实现了当monitor DB发生过载时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
在上述任一实施例的基础上,数据库损坏场景与第一目标场景识别码相匹配。
其中,第一目标场景识别码为用于在修复策略库中区分第一修复策略的唯一标识信息。
需要说明的是,在步骤103之前,可以在修复策略库中为不同的数据库损坏场景,分配一个用于区分与其他数据库损坏场景的第一目标场景识别码,并以第一目标场景识别码为索引,存储着与该数据库损坏场景唯一对应的第一修复策略。
具体地,在步骤103中,基于监控服务的数据库故障处理装置利用识别出的数据库损坏场景所对应的第一目标场景识别码在修复策略库进行查询,将与之唯一对应的第一修复策略下发至监控节点。
监控节点接收到第一修复策略之后,便按照其所携带的流程进行故障修复的相关处理。
本申请实施例利用与数据库损坏场景相匹配的第一目标场景识别码,在修复策略库查询出与之对应的第一修复策略后实现自动化的DB故障检测、识别和修复的完整流程。能够直接利用monitor DB已经存在过的损坏故障情况进行映射,避免了后续发生类似故障时对由monitor DB损坏引起的故障原因的重新定位过程,提高数据库损坏相关的故障处理效率。
在上述任一实施例的基础上,数据库过载场景与第二目标场景识别码相匹配。
其中,第二目标场景识别码为用于在修复策略库中区分第二修复策略的唯一标识信息。
需要说明的是,在进行过载故障修复之前,可以在修复策略库中为不同的数据库过载场景,分配一个用于区分与其他数据库过载场景的第二目标场景识别码,并以第二目标场景识别码为索引,存储着与该数据库过载场景唯一对应的第二修复策略。
具体地,基于监控服务的数据库故障处理装置利用识别出的数据库过载场景所对应的第二目标场景识别码在修复策略库进行查询,将与之唯一对应的第二修复策略下发至监控节点。
监控节点接收到第二修复策略之后,便按照其所携带的流程进行故障修复的相关处理。
本申请实施例利用与数据库过载场景相匹配的第二目标场景识别码,在修复策略库查询出与之对应的第二修复策略后实现自动化的DB故障检测、识别和修复的完整流程。能够直接利用monitor DB已经存在过的过载故障情况进行映射,避免了后续发生类似故障时对由monitor DB损坏引起的故障原因的重新定位过程,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,包括:在确定所有监控节点的数据库运行状态为异常状态的情况下,将第一目标场景识别码设置为第一场景识别码。
与第一场景识别码对应的第一修复策略为:
通过对象存储设备的数据库中所保存的集群信息来对所有监控节点的监控服务数据库进行重建。
具体地,在步骤102中,基于监控服务的数据库故障处理装置若确定各监控节点的数据库运行状态均触发告警被判定为异常状态时,即说明集群的所有监控节点均发生损坏,则将此种数据库损坏场景的第一目标场景识别码设置并体现为第一场景识别码。
利用第一场景识别码在修复策略库所查询到的第一修复策略为需要通过对象存储设备的数据库(Object-based Storage Device DataBase,OSD DB)中保存的集群信息来对所有监控节点的monitor DB进行重建。
本申请实施例在确定所有监控节点的数据库运行状态为异常状态时,决策将第一目标场景识别码设置为第一场景识别码,并控制监控节点执行与之对应的第一修复策略,利用OSD DB中所保存的集群信息来对所有监控节点的monitor DB进行重建,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于集群所有monitor节点均发生DB损坏的故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:在确定集群状态为ERROR,且至少存在一个监控节点的数据库运行状态为正常状态的情况下,将第一目标场景识别码设置为第二场景识别码。
与第二场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
具体地,在步骤102中,基于监控服务的数据库故障处理装置在确定集群中并非所有监控节点均发生损坏时,若确定quorum所反馈的集群状态为ERROR,且能够正常提供监控服务的监控节点的数量大于或者等于1,即说明可能由于操作不当等原因造成集群所有的monitor数据库显示为均发生异常的ERROR状态,但实际上却存在能够正常提供服务的监控节点,则将此种数据库损坏场景的第一目标场景识别码设置并体现为第二场景识别码。
利用第二场景识别码在修复策略库所查询到的第一修复策略为需要将当前能够正常提供服务的监控节点的monitor DB拷贝替换发生故障的监控节点的monitor DB,采用DB拷贝的方法重新恢复监控服务。
本申请实施例在确定当前集群状态为ERROR,且至少存在一个监控节点的数据库运行状态为正常状态时,决策将第一目标场景识别码设置为第二场景识别码,并控制监控节点执行与之对应的第一修复策略,利用DB拷贝的方法重新恢复监控服务,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于集群状态为ERROR但仍存在提供正常服务的监控节点的故障及处理实现自动化,可以在该故障发生的第一 时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:在确定集群状态为WARN,且至少存在三个数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第三场景识别码。
与第三场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
具体地,在步骤102中,基于监控服务的数据库故障处理装置在确定集群中并非所有监控节点均发生损坏时,若确定quorum所反馈的集群状态为WARN,且能够正常提供监控服务的监控节点的数量大于或者等于3,即说明集群虽然能正常读写,但仍具有决策功能,则将此种数据库损坏场景的第一目标场景识别码设置并体现为第三场景识别码。
利用第三场景识别码在修复策略库所查询到的第一修复策略为需要将当前能够正常提供服务的监控节点的monitor DB拷贝替换发生故障的监控节点的monitor DB,采用DB拷贝的方法重新恢复监控服务。
本申请实施例在确定当前集群状态为WARN,且至少存在三个监控节点的数据库运行状态为正常状态时,决策将第一目标场景识别码设置为第三场景识别码,并控制监控节点执行与之对应的第一修复策略,利用DB拷贝的方法重新恢复监控服务,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于集群状态为WARN但仍有条件进行quorum决策的故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
在确定集群状态为WARN,且存在两个或两个以下数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第四场景识别码。
与第四场景识别码对应的第一修复策略为:
对数据库运行状态为异常状态的监控节点的监控服务进行重新部署。
具体地,在步骤102中,基于监控服务的数据库故障处理装置在确定集群中并非所有监控节点均发生损坏时,若确定quorum所反馈的集群状态为WARN,且能够正常提供监控服务的监控节点的数量小于或者等于2,即说明集群虽然能正常读写,但是由于无法决策集群会处在WARN状态,则将此种数据库损坏场景的第一目标场景识别码设置并体现为第四场景识别码。
利用第四场景识别码在修复策略库所查询到的第一修复策略为需要将故障监控节点的monitor服务先缩容掉再重新扩容的方法,恢复故障节点的monitor DB,相当于对该节点的monitor服务重新进行部署。
示例性地,图2是本申请提供的基于监控服务的数据库故障处理方法中的故障修复流程示意图之一。如图2所示,本申请实施例给出一种修复monitor DB所发生的DB损坏场景时的具体实施过程:
(1)可以将第一场景识别码定义为编码14,需要通过OSD DB中保存的信息来对所有节点的monitor DB进行重建,其具体实施步骤如下:
a、结合OSD DB中保存的信息修复monitor DB中的OSDMap、AuthMap数据。
b、通过OSD上报的消息修复monitor DB中的PGMap数据。
c、修复PGMap_meta信息,使得通过修复后的DB可以实现monitor的正常启动。
d、完成monitor DB的修复,重新启动monitor进程。
(2)可以将第二场景识别码定义为编码12,则通过将当前正常monitor节点的monitor DB拷贝替换故障节点的DB进行恢复。
(3)可以将第三场景识别码定义为编码11,则通过将当前正常monitor节点的monitor DB拷贝替换故障节点的DB进行恢复。
(4)可以将第四场景识别码定义为编码13,则通过将故障monitor节点的monitor服务先缩容掉再重新扩容的方法,恢复故障节点的monitor DB,相当于对该节点的monitor服务重新进行部署。
可以理解的是,以上步骤的代码实现最后都将封装成维护命令的形式,以方便管理软件在收到monitor上报的告警后根据策略进行调用,达到实现自动对monitor DB进行修复的目的。
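To illustrate how the steps a to d above could be wrapped into a single maintenance command for scenario code 14, a minimal Python sketch follows; every function name here is a placeholder introduced for illustration and does not refer to an existing tool or API.

```python
# Illustrative wrapper for the rebuild steps a-d above (scenario code 14);
# every helper below is a placeholder introduced for illustration only.

def restore_osdmap_and_authmap(node):       # step a: recover OSDMap / AuthMap data
    print(f"{node}: restore OSDMap and AuthMap from the OSD DB")

def restore_pgmap_from_osd_reports(node):   # step b: recover PGMap from OSD-reported messages
    print(f"{node}: restore PGMap from OSD reports")

def repair_pgmap_meta(node):                # step c: fix PGMap_meta so the monitor can start
    print(f"{node}: repair PGMap_meta")

def restart_monitor_process(node):          # step d: restart the monitor process
    print(f"{node}: restart monitor process")


def rebuild_monitor_db_from_osd(node: str) -> None:
    """Maintenance command rebuilding a monitor DB from information kept in the OSD DBs."""
    restore_osdmap_and_authmap(node)
    restore_pgmap_from_osd_reports(node)
    repair_pgmap_meta(node)
    restart_monitor_process(node)


if __name__ == "__main__":
    rebuild_monitor_db_from_osd("mon-a")
```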
本申请实施例在确定当前集群状态为WARN,且存在两个或两个以下监控节点的数据库运行状态为正常状态时,决策将第一目标场景识别码设置为第四场景识别码,并控制监控节点执行与之对应的第一修复策略,利用monitor缩扩容的方法重新恢复监控服务,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于集群状态为WARN但没有条件进行quorum决策的故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于目标监控节点的磁盘空间状态,确定数据库过载场景,包括:在确定目标监控节点的磁盘空间状态为部署在为监控服务数据库预先划分的独立分区的情况下,将第二目标场景识别码设置为第五场景识别码。
与第五场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
具体地,基于监控服务的数据库故障处理装置在确定触发告警的目标监控节点的磁盘空间状态指示其monitor DB已经部署在预先划分出的独立分区时,即说明monitor DB部署位置正确,只需在原有部署位置的基础上进行压缩,则将此种数据库损坏场景的第二目标场景识别码设置并体现为第五场景识别码。
利用第五场景识别码在修复策略库所查询到的第二修复策略为需要直接将故障监控节点的monitor DB在其所部署的独立分区内进行压缩。
本申请实施例在确定目标监控节点的磁盘空间状态为部署在为监控服务数据库预先划分的独立分区时,决策将第二目标场景识别码设置为第五场景识别码,并控制监控节点执行与之对应的第二修复策略,利用在原有部署空间中进行压缩的方法对空间进行压缩,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于monitor DB部署位置正确但存储内容过载的故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间满足迁移条件的情况下,将第二目标场景识别码设置为第六场景识别码。
与第六场景识别码对应的第二修复策略为:
先将目标监控节点的监控服务数据库从系统盘上迁移至指定的快速盘分区上,再对迁移后的监控服务数据库进行压缩。
其中,迁移条件为目标监控节点的磁盘空间存在快速盘分区,且目标监控节点的磁盘空间容量大于监控服务数据库容量。
具体地,基于监控服务的数据库故障处理装置在确定触发告警的目标监控节点的磁盘空间状态指示其monitor DB未部署在预先划分出的独立分区时,即说明monitor DB部署位置错误,需要进一步判断当前节点是否满足monitor DB迁移条件,则将此种数据库损坏场景的第二目标场景识别码设置并体现为第六场景识别码。
其中,迁移条件是针对当前节点的物理磁盘空间是否具备单独为monitor DB划分独立分区所设置的条件。
示例性地,迁移条件可以为是否配置有nvme或ssd盘,并且盘上是否还有足够的空间来给monitor划分分区。
利用第六场景识别码在修复策略库所查询到的第二修复策略为需要先将该节点的monitor DB从系统盘上迁移至指定的快速盘分区上,再对DB进行压缩。
本申请实施例在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区, 且目标监控节点的磁盘空间满足迁移条件时,决策将第二目标场景识别码设置为第六场景识别码,并控制监控节点执行与之对应的第二修复策略,利用将monitor DB迁移至正确的部署位置后进行压缩的方法对空间进行压缩和利用,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于monitor DB部署位置错误且具备迁移条件的过载故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
在上述任一实施例的基础上,基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间不满足迁移条件的情况下,将第二目标场景识别码设置为第七场景识别码。
与第七场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
具体地,基于监控服务的数据库故障处理装置在确定触发告警的目标监控节点的磁盘空间状态指示其monitor DB未部署在预先划分出的独立分区时,即说明monitor DB部署位置错误,需要进一步判断当前节点是否满足monitor DB迁移条件,则将此种数据库损坏场景的第二目标场景识别码设置并体现为第七场景识别码。
利用第七场景识别码在修复策略库所查询到的第二修复策略为由于该节点的磁盘空间不具备迁移条件,所以只能将故障监控节点的monitor DB在当前的部署位置中进行压缩。
示例性地,图3是本申请提供的基于监控服务的数据库故障处理方法中的故障修复流程示意图之二。如图3所示,本申请实施例给出一种修复monitor DB所发生的DB过载场景时的具体实施过程:
(1)可以将第五场景识别码定义为编码21,则直接对该节点monitor DB进行压缩。
(2)可以将第六场景识别码定义为编码22,则先将该节点的monitor DB从系统盘上迁移至指定的快速盘分区上,再对DB进行压缩;否则直接对DB进行压缩。
(3)可以将第七场景识别码定义为编码23,则直接对该节点monitor DB进行压缩。
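For illustration, the three overload repairs above can be dispatched by their codes as in the following sketch; migrate_monitor_db and compact_monitor_db are stand-ins for the maintenance commands mentioned above and are assumptions, not existing tool names.

```python
# Illustrative dispatch of the overload repairs by code 21/22/23; the two
# helpers are placeholders for the actual maintenance commands.

def migrate_monitor_db(node):
    print(f"{node}: migrate monitor DB from the system disk to the fast-disk partition")

def compact_monitor_db(node):
    print(f"{node}: compact monitor DB")


def repair_overload(code: int, node: str) -> None:
    if code == 22:
        migrate_monitor_db(node)  # code 22 migrates first
    compact_monitor_db(node)      # codes 21, 22 and 23 all finish with a compaction


if __name__ == "__main__":
    repair_overload(22, "mon-c")
```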
示例性地,图4是本申请提供的基于监控服务的数据库故障处理方法的部分流程示意图之一。图5是本申请提供的基于监控服务的数据库故障处理方法的部分流程示意图之二。如图4和图5所示,本申请实施例分别给出一种基于监控服务的数据库故障处理方法的具体实施过程:
如图4所示,首先在监控节点和管理节点中的软件侧增加对monitor DB的检测和相应的告警项,包括DB是否过大,DB是否损坏。两种故障场景为monitor自身的检查项:
(1)当根据数据库运行状态中记载的DB大于集群部署时分配空间的某个阈值时,管理节点的软件侧将携带有故障类型信息为数据库过载的告警消息下发至监控节点。
(2)当存在某监控节点monitor DB打开失败时,监控节点将该异常情况记录在数据库运行状态中,以使得管理节点的软件侧将携带有故障类型信息为数据库损坏的告警消息下发至监控节点
(3)软件侧根据监控节点触发的告警消息在界面平台上进行告警的前端显示和上报。
紧接着,如图5所示,根据管理节点触发的告警项,对故障场景和集群条件进行识别:
(1)若告警消息指示监控节点monitor DB过载,则判断当前节点monitor DB是否已经在单独划分的快速盘分区上。反之,则判断该节点是否有条件配置独立的磁盘分区:
a、若部署时已经划分了monitor DB的独立分区,并将DB部署在该分区上,则返回编号21。
b、若部署时默认monitor DB部署在了系统盘上,并未单独划分分区,则需要判断当前节点是否满足monitor DB迁移条件:当前节点是否配置有nvme或ssd盘,并且盘上是否还有足够的空间来给monitor划分分区。若满足条件,则返回编号22。否则返回编号23。
(2)若告警消息指示监控节点monitor DB损坏,则需要对当前集群quorum的损坏程度进行判断:
a、若集群存在部分节点的monitor DB发生损坏,但集群仍有可以正常提供服务的monitor进程存在,则需要对可以正常提供服务的monitor数量进行判断:
若集群状态为WARN且集群当前monitor服务正常的monitor节点数量>2,则返回编号11。
若集群状态已经为ERROR且还有可以正常提供服务的monitor节点存在,则返回编号12。
若集群状态为WARN且集群当前monitor服务正常的monitor节点数量<=2,则返回编号13。
b、若集群的所有monitor节点均发生损坏,则返回编号14。
本申请实施例在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间不满足迁移条件时,决策将第二目标场景识别码设置为第七场景识别码,并控制监控节点执行与之对应的第二修复策略,利用将monitor DB在默认的部署位置中进行压缩的方法对空间进行压缩,实现自动化的DB故障检测、识别和修复的完整流程。能够对关于monitor DB部署位置错误且不具备迁移条件的过载故障及处理实现自动化,可以在该故障发生的第一时间就完成对故障的识别和修复,提高数据库过载相关的故障处理效率。
图6是本申请提供的基于监控服务的数据库故障处理装置的结构示意图。在上述任一实施例的基础上,如图6所示,本申请实施例提供的基于监控服务的数据库故障处理装置包括故障检测模块610、第一故障识别模块620和第一故障修复模块630,其中:
故障检测模块610,用于基于管理节点反馈的告警消息,确定数据库故障类型。
第一故障识别模块620,用于在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景。
第一故障修复模块630,用于利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复。
其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息。集群状态是各监控节点之间通过投票决策的方式对分布式存储系统所部署的监控服务进行评估得到的。
具体地,故障检测模块610、第一故障识别模块620和第一故障修复模块630顺次电连接。
故障检测模块610在接收到管理节点在触发告警项时所反馈的告警消息,将告警消息中所携带的故障类型信息作为数据库故障类型输出。
第一故障识别模块620若确定故障检测模块610得到的数据库故障类型所对应的告警内容为监控服务数据库损坏,则利用分布式集群的集群状态,以及各监控节点的数据库运行状态对当前集群中由所有监控节点构成的决议委员会(quorum)的损坏程度进行判断,进一步识别出其故障隶属于该故障分类下哪一种具体的数据库损坏场景。
第一故障修复模块630根据第一故障识别模块620识别到的数据库损坏场景与修复策略库进行匹配,若匹配成功,即说明该修复策略库中预先部署好该数据库损坏场景所对应的解决策略,则采取与该场景对应的第一修复策略对当前所面临的数据库损坏场景进行故障修复,恢复集群的monitor服务。
可选地,数据库损坏场景与第一目标场景识别码相匹配。
其中,第一目标场景识别码为用于在修复策略库中区分第一修复策略的唯一标识信息。
可选地,第一故障识别模块620,具体用于在确定所有监控节点的数据库运行状态为异常状态的情况下,将第一目标场景识别码设置为第一场景识别码。
与第一场景识别码对应的第一修复策略为:
通过对象存储设备的数据库中所保存的集群信息来对所有监控节点的监控服务数据库进行重建。
可选地,第一故障识别模块620,还具体用于在确定集群状态为ERROR,且至少存在一个监控节点的数据库运行状态为正常状态的情况下,将第一目标场景识别码设置为第二场景识别码;
与第二场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
可选地,第一故障识别模块620,还具体用于在确定集群状态为WARN,且至少存在三个数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第三场景识别码;
与第三场景识别码对应的第一修复策略为:
将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
可选地,第一故障识别模块620,还具体用于在确定集群状态为WARN,且存在两个或两个以下数据库运行状态为正常的监控节点的情况下,将第一目标场景识别码设置为第四场景识别码;
与第四场景识别码对应的第一修复策略为:
对数据库运行状态为异常状态的监控节点的监控服务进行重新部署。
本申请实施例提供的基于监控服务的数据库故障处理装置,用于执行本申请上述基于监控服务的数据库故障处理方法,其实施方式与本申请提供的基于监控服务的数据库故障处理方法的实施方式一致,且可以达到相同的有益效果,此处不再赘述。
本申请实施例基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库损坏时,决策根据集群状态和各监控节点的数据库运行状态,确定集群各监控节点的监控服务数据库的损坏程度,并以此将故障定位至具体的数据库损坏场景,进而根据数据库损坏场景选择对应的第一修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的损坏故障情况进行场景分离,再按照相应策略实现了当monitor DB发生损坏时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
在上述任一实施例的基础上,该装置还包括第二故障识别模块和第二故障修复模块,其中:
第二故障识别模块,用于在确定数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景。
第二故障修复模块,用于利用修复策略库匹配到与数据库过载场景对应的第二修复策略,以供监控节点根据第二修复策略对数据库过载场景进行修复。
其中,目标监控节点为异常监控节点所部署的集群计算节点。
可选地,数据库过载场景与第二目标场景识别码相匹配。
其中,第二目标场景识别码为用于在修复策略库中区分第二修复策略的唯一标识信息。
可选地,第二故障识别模块,具体用于在确定目标监控节点的磁盘空间状态为部署在为监控服务数据库预先划分的独立分区的情况下,将第二目标场景识别码设置为第五场景识别码;
与第五场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
可选地,第二故障识别模块,还具体用于在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间满足迁移条件的情况下,将第二目标场景识别码设置为第六场景识别码;
与第六场景识别码对应的第二修复策略为:
先将目标监控节点的监控服务数据库从系统盘上迁移至指定的快速盘分区上,再对迁移后的监控服务数据库进行压缩;
其中,迁移条件为目标监控节点的磁盘空间存在快速盘分区,且目标监控节点的磁盘空间容量大于监控服务数据库容量。
可选地,第二故障识别模块,还具体用于在确定目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间不满足迁移条件的情况下,将第二目标场景识别码设置为第七场景识别码;
与第七场景识别码对应的第二修复策略为:
对目标监控节点的监控服务数据库进行压缩。
本申请实施例提供的基于监控服务的数据库故障处理装置,用于执行本申请上述基于监控服务的数据库故障处理方法,其实施方式与本申请提供的基于监控服务的数据库故障处理方法的实施方式一致,且可以达到相同的有益效果,此处不再赘述。
本申请实施例基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库过载时,决策根据触发告警的目标监控节点的磁盘空间状态,将故障定位至具体的数据库过载场景,进而根据数据库过载场景选择对应的第二修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的过载故障情况进行场景分离,再按照相应策略实现了当monitor DB发生过载时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
图7是本申请提供的分布式集群的结构示意图。在上述任一实施例的基础上,如图7所示,本申请实施例提供的分布式集群包括至少n个在集群计算节点上部署监控服务的监控节点710,以及至少1个在集群计算节点上部署软件管理服务的管理节点720。每一监控节点710用于实现如上的基于监控服务的数据库故障处理方法。
监控节点710,用于对其自身部署的监控服务数据库进行监控,将获取到的数据库运行状态反馈至管理节点。
管理节点720,用于根据数据库运行状态与预设的数据库故障状态进行匹配,生成携带有与数据库故障状态对应的故障类型信息的告警消息,并将告警消息下发至数据库运行状态为异常状态的监控节点。
其中,n为大于1的奇数,集群计算节点的总数量大于监控节点710的总数量。
具体地,分布式集群由奇数个在集群计算节点上部署监控服务的监控节点710,以及至少1个在集群计算节点上部署软件管理服务的管理节点720构成。
首先,由监控节点710对其自身部署的monitor DB的运行情况进行监控,将所监控到的数据库运行状态反馈至管理节点720。紧接着,由管理节点720将接收到的数据库运行状态与针对数据库故障状态所配置的告警项进行匹配,在触发告警项时向监控节点710反馈携带有与该监控节点710所发生的数据库故障状态对应的故障类型信息的告警消息。最后,再由监控节点710接收管理节点720所反馈的告警消息,对产生告警的故障场景和集群条件进行识别,进而调用相关的命令实现对DB故障的修复。
本申请实施例基于管理节点反馈的告警消息进行数据库故障类型的识别,在确定数据库故障类型为监控服务数据库损坏时,决策根据集群状态和各监控节点的数据库运行状态,确定集群各监控节点的监控服务数据库的损坏程度,并以此将故障定位至具体的数据库损坏场景,进而根据数据库损坏场景选择对应的第一修复策略来实现自动化的DB故障检测、识别和修复的完整流程。能够通过将monitor DB可能存在的损坏故障情况进行场景分离,再按照相应策略实现了当monitor DB发生损坏时的自动识别和修复,解决了现有实现方式每次都需要对DB导致的相同故障重复投入大量人力和时间,提高数据库损坏相关的故障处理效率和完成度。
图8示例了一种电子设备的实体结构示意图,如图8所示,该电子设备可以包括:处理器(processor)810、通信接口(Communications Interface)820、存储器(memory)830和通信总线840,其中,处理器810,通信接口820,存储器830通过通信总线840完成相互间的通信。处理器810可以调用存储器830中的逻辑指令,以执行基于监控服务的数据库故障处理方法,该方法包括:基于管理节点反馈的告警消息,确定数据库故障类型;在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复;其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;集群状态是各监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
此外,上述的存储器830中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现 有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
另一方面,本申请还提供一种计算机程序产品,计算机程序产品包括计算机程序,计算机程序可存储在非暂态计算机可读存储介质上,计算机程序被处理器执行时,计算机能够执行上述各方法所提供的基于监控服务的数据库故障处理方法,该方法包括:基于管理节点反馈的告警消息,确定数据库故障类型;在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复;其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;集群状态是各监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
又一方面,本申请还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各方法提供的基于监控服务的数据库故障处理方法,该方法包括:基于管理节点反馈的告警消息,确定数据库故障类型;在确定数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;利用修复策略库匹配到与数据库损坏场景对应的第一修复策略,以供监控节点根据第一修复策略对数据库损坏场景进行修复;其中,告警消息是管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;集群状态是各监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分的方法。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (20)

  1. 一种基于监控服务的数据库故障处理方法,其特征在于,包括:
    基于管理节点反馈的告警消息,确定数据库故障类型;
    在确定所述数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;
    利用修复策略库匹配到与所述数据库损坏场景对应的第一修复策略,以供监控节点根据所述第一修复策略对所述数据库损坏场景进行修复;
    其中,所述告警消息是所述管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;所述集群状态是各所述监控节点之间通过投票决策的方式对分布式集群所部署的监控服务进行评估得到的。
  2. 根据权利要求1所述的基于监控服务的数据库故障处理方法,其特征在于,在所述确定数据库故障类型之后,还包括:
    在确定所述数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景;
    利用修复策略库匹配到与所述数据库过载场景对应的第二修复策略,以供监控节点根据所述第二修复策略对所述数据库过载场景进行修复;
    其中,所述目标监控节点为所述数据库运行状态为异常状态的监控节点。
  3. 根据权利要求1所述的基于监控服务的数据库故障处理方法,其特征在于,所述数据库损坏场景与第一目标场景识别码相匹配;
    其中,所述第一目标场景识别码为用于在所述修复策略库中区分第一修复策略的唯一标识信息。
  4. 根据权利要求2所述的基于监控服务的数据库故障处理方法,其特征在于,所述数据库过载场景与第二目标场景识别码相匹配;
    其中,所述第二目标场景识别码为用于在所述修复策略库中区分第二修复策略的唯一标识信息。
  5. 根据权利要求3所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,包括:
    在确定所有监控节点的数据库运行状态为异常状态的情况下,将所述第一目标场景识别码设置为第一场景识别码;
    与所述第一场景识别码对应的第一修复策略为:
    通过对象存储设备的数据库中所保存的集群信息来对所有监控节点的监控服务数据库进行重建。
  6. 根据权利要求5所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
    在确定所述集群状态为ERROR,且至少存在一个监控节点的数据库运行状态为正常状态的情况下,将所述第一目标场景识别码设置为第二场景识别码;
    与所述第二场景识别码对应的第一修复策略为:
    将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
  7. 根据权利要求6所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
    在确定所述集群状态为WARN,且至少存在三个数据库运行状态为正常的监控节点的情况下,将所述第一目标场景识别码设置为第三场景识别码;
    与所述第三场景识别码对应的第一修复策略为:
    将数据库运行状态为正常状态的监控节点的监控服务数据库拷贝替换数据库运行状态为异常状态的监控节点。
  8. 根据权利要求7所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景,还包括:
    在确定所述集群状态为WARN,且存在两个或两个以下数据库运行状态为正常的监控节点的情况下,将所述第一目标场景识别码设置为第四场景识别码;
    与所述第四场景识别码对应的第一修复策略为:
    对数据库运行状态为异常状态的监控节点的监控服务进行重新部署。
  9. 根据权利要求4所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于目标监控节点的磁盘空间状态,确定数据库过载场景,包括:
    在确定所述目标监控节点的磁盘空间状态为部署在为监控服务数据库预先划分的独立分区的情况下,将所述第二目标场景识别码设置为第五场景识别码;
    与所述第五场景识别码对应的第二修复策略为:
    对所述目标监控节点的监控服务数据库进行压缩。
  10. 根据权利要求9所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:
    在确定所述目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间满足迁移条件的情况下,将所述第二目标场景识别码设置为第六场景识别码;
    与所述第六场景识别码对应的第二修复策略为:
    先将所述目标监控节点的监控服务数据库从系统盘上迁移至指定的快速盘分区上,再对迁移后的监控服务数据库进行压缩;
    其中,所述迁移条件为所述目标监控节点的磁盘空间存在快速盘分区,且所述目标监控节点的磁盘空间容量大于监控服务数据库容量。
  11. 根据权利要求10所述的基于监控服务的数据库故障处理方法,其特征在于,所述基于目标监控节点的磁盘空间状态,确定数据库过载场景,还包括:
    在确定所述目标监控节点的磁盘空间状态为未部署在为监控服务数据库预先划分的独立分区,且目标监控节点的磁盘空间不满足迁移条件的情况下,将所述第二目标场景识别码设置为第七场景识别码;
    与所述第七场景识别码对应的第二修复策略为:
    对所述目标监控节点的监控服务数据库进行压缩。
  12. 一种基于监控服务的数据库故障处理装置,其特征在于,包括:
    故障检测模块,用于基于管理节点反馈的告警消息,确定数据库故障类型;
    第一故障识别模块,用于在确定所述数据库故障类型为监控服务数据库损坏的情况下,基于集群状态和各监控节点的数据库运行状态,确定数据库损坏场景;
    第一故障修复模块,用于利用修复策略库匹配到与所述数据库损坏场景对应的第一修复策略,以供监控节点根据所述第一修复策略对所述数据库损坏场景进行修复;
    其中,所述告警消息是所述管理节点在确定监控节点对监控服务数据库所监控到的数据库运行状态与预设的数据库故障状态匹配的情况下,所生成的携带有与数据库故障状态对应故障类型信息的通知消息;所述集群状态是各所述监控节点之间通过投票决策的方式对分布式存储系统所部署的监控服务进行评估得到的。
  13. 根据权利要求12所述的基于监控服务的数据库故障处理装置,其特征在于,还包括:
    第二故障识别模块,用于在确定所述数据库故障类型为监控服务数据库过载的情况下,基于目标监控节点的磁盘空间状态,确定数据库过载场景;
    第二故障修复模块,用于利用修复策略库匹配到与所述数据库过载场景对应的第二修复策略,以供监控节点根据所述第二修复策略对所述数据库过载场景进行修复;
    其中,所述目标监控节点为异常监控节点所部署的集群计算节点。
  14. 一种分布式集群,包括至少n个在集群计算节点上部署监控服务的监控节点,以及至少1个在集群计算节点上部署软件管理服务的管理节点,其特征在于,每一所述监控节点用于实现如权利要求1-11任一所述的基于监控服务的数据库故障处理方法;
    所述监控节点,用于对其自身部署的监控服务数据库进行监控,将获取到的数据库运行状态反馈至所述管理节点;
    所述管理节点,用于根据所述数据库运行状态与预设的数据库故障状态进行匹配,生成携带有与数据库故障状态对应的故障类型信息的告警消息,并将所述告警消息下发至数据库运行状态为异常状态的监控节点;
    其中,所述n为大于1的奇数,所述集群计算节点的总数量大于所述监控节点的总数量。
  15. 一种电子设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述程序时实现如权利要求1至11任一项所述基于监控服务的数据库故障处理方法。
  16. 一种非暂态计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至11任一项所述基于监控服务的数据库故障处理方法。
  17. 根据权利要求1所述的基于监控服务的数据库故障处理方法,其特征在于,若所述数据库损坏场景为由所有监控节点构成的决议委员会确定所有的监控节点均发生损坏,则直接根据所述集群状态确定数据库损坏;若所述数据库损坏场景为由所有监控节点构成的决议委员会确定存在部分监控节点的monitor DB发生损坏,但仍有可以正常提供服务的monitor进程存在,此时所述集群状态指示当前未发生故障,则需要再依据各监控节点的数据库运行状态中指示的可以正常提供服务的monitor数量,将所述数据库损坏场景具体定位至monitor DB发生损坏的监控节点。
  18. 根据权利要求12所述的基于监控服务的数据库故障处理装置,其特征在于,若所述数据库损坏场景为由所有监控节点构成的决议委员会确定所有的监控节点均发生损坏,则直接根据所述集群状态确定数据库损坏;若所述数据库损坏场景为由所有监控节点构成的决议委员会确定存在部分监控节点的monitor DB发生损坏,但仍有可以正常提供服务的monitor进程存在,此时所述集群状态指示当前未发生故障,则需要再依据各监控节点的数据库运行状态中指示的可以正常提供服务的monitor数量,将所述数据库损坏场景具体定位至monitor DB发生损坏的监控节点。
  19. 根据权利要求15所述的电子设备,其特征在于,若所述数据库损坏场景为由所有监控节点构成的决议委员会确定所有的监控节点均发生损坏,则直接根据所述集群状态确定数据库损坏;若所述数据库损坏场景为由所有监控节点构成的决议委员会确定存在部分监控节点的monitor DB发生损坏,但仍有可以正常提供服务的monitor进程存在,此时所述集群状态指示当前未发生故障,则需要再依据各监控节点的数据库运行状态中指示的可以正常提供服务的monitor数量,将所述数据库损坏场景具体定位至monitor DB发生损坏的监控节点。
  20. 根据权利要求16所述的非暂态计算机可读存储介质,其特征在于,若所述数据库损坏场景为由所有监控节点构成的决议委员会确定所有的监控节点均发生损坏,则直接根据所述集群状态确定数据库损坏;若所述数据库损坏场景为由所有监控节点构成的决议委员会确定存在部分监控节点的monitor DB发生损坏,但仍有可以正常提供服务的monitor进程存在,此时所述集群状态指示当前未发生故障,则需要再依据各监控节点的数据库运行状态中指示的可以正常提供服务的monitor数量,将所述数据库损坏场景具体定位至monitor DB发生损坏的监控节点。
PCT/CN2023/121334 2023-01-09 2023-09-26 基于监控服务的数据库故障处理方法、装置及分布式集群 WO2024148854A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310027120.X 2023-01-09
CN202310027120.XA CN115994044B (zh) 2023-01-09 2023-01-09 基于监控服务的数据库故障处理方法、装置及分布式集群

Publications (1)

Publication Number Publication Date
WO2024148854A1 true WO2024148854A1 (zh) 2024-07-18

Family

ID=85989996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121334 WO2024148854A1 (zh) 2023-01-09 2023-09-26 基于监控服务的数据库故障处理方法、装置及分布式集群

Country Status (2)

Country Link
CN (1) CN115994044B (zh)
WO (1) WO2024148854A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994044B (zh) * 2023-01-09 2023-06-13 苏州浪潮智能科技有限公司 基于监控服务的数据库故障处理方法、装置及分布式集群
CN116662059B (zh) * 2023-07-24 2023-10-24 上海爱可生信息技术股份有限公司 MySQL数据库CPU故障诊断及自愈方法及可读存储介质
CN117170985B (zh) * 2023-11-02 2024-01-12 武汉大学 面向开放式地理信息网络服务的分布式监测方法及系统

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719841A (zh) * 2009-11-13 2010-06-02 曙光信息产业(北京)有限公司 分布式集群监控系统及方法
CN105337765A (zh) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 一种分布式hadoop集群故障自动诊断修复系统
KR20180076172A (ko) * 2016-12-27 2018-07-05 주식회사 씨에스리 데이터베이스 시스템의 이상을 탐지하는 장치 및 방법
CN108599996A (zh) * 2018-04-03 2018-09-28 武汉斗鱼网络科技有限公司 数据库集群的故障处理方法、装置及终端
CN108833131A (zh) * 2018-04-25 2018-11-16 北京百度网讯科技有限公司 分布式数据库云服务的系统、方法、设备和计算机存储介质
CN109614283A (zh) * 2018-10-24 2019-04-12 世纪龙信息网络有限责任公司 分布式数据库集群的监控系统
CN111371599A (zh) * 2020-02-26 2020-07-03 山东汇贸电子口岸有限公司 一种基于etcd的集群容灾管理系统
CN115994044A (zh) * 2023-01-09 2023-04-21 苏州浪潮智能科技有限公司 基于监控服务的数据库故障处理方法、装置及分布式集群

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240937B2 (en) * 2011-03-31 2016-01-19 Microsoft Technology Licensing, Llc Fault detection and recovery as a service
CN103684817B (zh) * 2012-09-06 2017-11-17 百度在线网络技术(北京)有限公司 数据中心的监控方法及系统
CN103559108B (zh) * 2013-11-11 2017-05-17 中国科学院信息工程研究所 一种基于虚拟化实现主备故障自动恢复的方法及系统
US9747183B2 (en) * 2013-12-31 2017-08-29 Ciena Corporation Method and system for intelligent distributed health monitoring in switching system equipment
CN104052634B (zh) * 2014-05-30 2015-09-02 国家电网公司 信息安全监控系统及方法
CN106933693A (zh) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 一种数据库集群节点故障自动修复方法及系统
CN109343987A (zh) * 2018-08-20 2019-02-15 科大国创软件股份有限公司 It系统故障诊断及修复方法、装置、设备、存储介质
CN109522287B (zh) * 2018-09-18 2023-08-18 平安科技(深圳)有限公司 分布式文件存储集群的监控方法、系统、设备及介质
CN109783307A (zh) * 2018-12-03 2019-05-21 日照钢铁控股集团有限公司 一种数据库集中监管方法及终端
CN111444032A (zh) * 2020-03-04 2020-07-24 无锡华云数据技术服务有限公司 一种计算机系统故障修复方法、系统及设备
CN115422010A (zh) * 2022-09-19 2022-12-02 Oppo广东移动通信有限公司 数据集群中的节点管理方法、装置及存储介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719841A (zh) * 2009-11-13 2010-06-02 曙光信息产业(北京)有限公司 分布式集群监控系统及方法
CN105337765A (zh) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 一种分布式hadoop集群故障自动诊断修复系统
KR20180076172A (ko) * 2016-12-27 2018-07-05 주식회사 씨에스리 데이터베이스 시스템의 이상을 탐지하는 장치 및 방법
CN108599996A (zh) * 2018-04-03 2018-09-28 武汉斗鱼网络科技有限公司 数据库集群的故障处理方法、装置及终端
CN108833131A (zh) * 2018-04-25 2018-11-16 北京百度网讯科技有限公司 分布式数据库云服务的系统、方法、设备和计算机存储介质
CN109614283A (zh) * 2018-10-24 2019-04-12 世纪龙信息网络有限责任公司 分布式数据库集群的监控系统
CN111371599A (zh) * 2020-02-26 2020-07-03 山东汇贸电子口岸有限公司 一种基于etcd的集群容灾管理系统
CN115994044A (zh) * 2023-01-09 2023-04-21 苏州浪潮智能科技有限公司 基于监控服务的数据库故障处理方法、装置及分布式集群

Also Published As

Publication number Publication date
CN115994044A (zh) 2023-04-21
CN115994044B (zh) 2023-06-13

Similar Documents

Publication Publication Date Title
WO2024148854A1 (zh) 基于监控服务的数据库故障处理方法、装置及分布式集群
US11614943B2 (en) Determining problem dependencies in application dependency discovery, reporting, and management tool
US12111752B2 (en) Intelligent services for application dependency discovery, reporting, and management tool
US11966324B2 (en) Discovery crawler for application dependency discovery, reporting, and management tool
US11663055B2 (en) Dependency analyzer in application dependency discovery, reporting, and management tool
US20230251955A1 (en) Intelligent services and training agent for application dependency discovery, reporting, and management tool
US12099438B2 (en) Testing agent for application dependency discovery, reporting, and management tool
CN108234170A (zh) 一种服务器集群的监控方法和装置
CN110532278B (zh) 声明式的MySQL数据库系统高可用方法
WO2017220013A1 (zh) 业务处理方法及装置、存储介质
CN111294845A (zh) 节点切换方法、装置、计算机设备和存储介质
CN112272113B (zh) 基于多种区块链节点的监控及自动切换的方法及系统
CN108769170A (zh) 一种集群网络故障自检系统及方法
CN108763312B (zh) 一种基于负载的从数据节点筛选方法
CN114020509A (zh) 工作负载集群的修复方法、装置、设备及可读存储介质
CN117313012A (zh) 服务编排系统的故障管理方法、装置、设备及存储介质
CN116360687A (zh) 一种集群分布式存储的方法、装置、设备及介质
CN110502496A (zh) 一种分布式文件系统修复方法、系统、终端及存储介质
CN115686368A (zh) 区块链网络的节点的存储扩容的方法、系统、装置和介质
CN112231142B (zh) 系统备份恢复方法、装置、计算机设备和存储介质
CN105550094B (zh) 一种高可用系统状态自动监控方法
CN113656358A (zh) 一种数据库日志文件处理方法及系统
CN113946474A (zh) 一种储存系统高效容灾保护方法及容灾处理系统
CN108897645B (zh) 一种基于备用心跳磁盘的数据库集群容灾方法和系统
CN107707402B (zh) 一种分布式系统中服务仲裁的管理系统及其管理方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23915612

Country of ref document: EP

Kind code of ref document: A1