CN116048859B - Distributed database fault diagnosis method and device, electronic equipment and storage medium
- Publication number
- CN116048859B
- Authority
- CN
- China
- Prior art keywords
- node
- fault
- diagnosed
- determining
- distributed database
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/80—Database-specific techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computer Hardware Design (AREA)
- Test And Diagnosis Of Digital Computers (AREA)
Abstract
The application provides a distributed database fault diagnosis method, a distributed database fault diagnosis device, electronic equipment and a storage medium, and belongs to the technical field of databases. The application obtains fault error reporting information of the distributed database and determines a fault diagnosis strategy corresponding to the distributed database; determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database; and diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result. Therefore, the application realizes the automatic diagnosis of the faults of the distributed database, and compared with the current manual diagnosis of each node, the application saves the labor cost and improves the diagnosis efficiency.
Description
Technical Field
The present application relates to the field of database technologies, and in particular, to a distributed database fault diagnosis method, device, electronic apparatus, and storage medium.
Background
At present, the distributed database is widely applied to the field of computers due to high reliability and high security. The distributed database adopts an extensible system structure, and data are stored on a plurality of independent devices in a scattered manner, so that the reliability, availability and access efficiency of the system are improved. However, the distributed database is relatively large compared to the centralized database, and if a fault occurs during operation, it is not easy to diagnose the faulty device. In the prior art, when a distributed database fails, a developer usually manually checks each node in the distributed database one by one, so as to diagnose a failed device or further diagnose a specific failed module in the failed device.
However, this method of performing fault diagnosis by manual inspection is labor-intensive and has low diagnosis efficiency.
Disclosure of Invention
In order to solve the technical problems of high labor consumption and low diagnosis efficiency in the fault diagnosis mode through manual investigation, the application provides a distributed database fault diagnosis method, a distributed database fault diagnosis device, electronic equipment and a storage medium.
In a first aspect, a distributed database fault diagnosis method is provided, the method including:
acquiring fault error reporting information of a distributed database, and determining a fault diagnosis strategy corresponding to the distributed database;
determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database;
and diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result.
In one possible implementation manner, the determining the fault diagnosis policy corresponding to the distributed database includes:
obtaining a fault tree template corresponding to the distributed database and cluster attribute information corresponding to the distributed database;
initializing the fault tree template by utilizing the cluster attribute information to obtain an execution fault tree corresponding to the distributed database, and taking the execution fault tree as the fault diagnosis strategy.
In one possible implementation manner, the diagnosing at least one node to be diagnosed based on the fault diagnosis policy, to obtain a diagnosis result, includes:
determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes;
and sending the fault diagnosis strategy to the management node so that the management node can acquire corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In a possible implementation manner, the determining a management node in at least one node to be diagnosed includes:
determining the node priority of each node to be diagnosed;
and determining the node to be diagnosed with the highest priority of the corresponding node as the management node.
In one possible implementation manner, the determining the node priority of each node to be diagnosed includes:
and determining the system resource utilization rate of each node to be diagnosed, and determining the corresponding node priority based on the system resource utilization rate, wherein the lower the system resource utilization rate is, the higher the corresponding node priority is.
In one possible implementation manner, the diagnosing at least one node to be diagnosed based on the fault diagnosis policy, to obtain a diagnosis result, includes:
determining the number of nodes corresponding to at least one node to be diagnosed;
and under the condition that the number of the nodes is smaller than a preset threshold value, acquiring all pieces of information to be diagnosed corresponding to the nodes to be diagnosed based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In one possible implementation manner, the determining the type of the fault component based on the fault error-reporting information includes:
and searching a component type corresponding to the fault error information in a preset fault comparison library, and determining the component type as the fault component type.
In a second aspect, there is provided a distributed database fault diagnosis apparatus, the apparatus comprising:
the acquisition module is used for acquiring fault error reporting information of the distributed database and determining a fault diagnosis strategy corresponding to the distributed database;
the determining module is used for determining a fault component type based on the fault error reporting information and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database;
and the diagnosis module is used for diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result.
In one possible implementation manner, the acquiring module is specifically configured to:
obtaining a fault tree template corresponding to the distributed database and cluster attribute information corresponding to the distributed database;
initializing the fault tree template by utilizing the cluster attribute information to obtain an execution fault tree corresponding to the distributed database, and taking the execution fault tree as the fault diagnosis strategy.
In one possible embodiment, the diagnostic module is specifically configured to:
determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes;
and sending the fault diagnosis strategy to the management node so that the management node can acquire corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In one possible embodiment, the diagnostic module is further configured to:
determining the node priority of each node to be diagnosed;
and determining the node to be diagnosed with the highest priority of the corresponding node as the management node.
In one possible embodiment, the diagnostic module is further configured to:
and determining the system resource utilization rate of each node to be diagnosed, and determining the corresponding node priority based on the system resource utilization rate, wherein the lower the system resource utilization rate is, the higher the corresponding node priority is.
In one possible embodiment, the diagnostic module is further configured to:
determining the number of nodes corresponding to at least one node to be diagnosed;
and under the condition that the number of the nodes is smaller than a preset threshold value, acquiring all pieces of information to be diagnosed corresponding to the nodes to be diagnosed based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In a possible implementation manner, the determining module is specifically configured to:
and searching a component type corresponding to the fault error information in a preset fault comparison library, and determining the component type as the fault component type.
In a third aspect, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of the first aspects when executing a program stored on a memory.
In a fourth aspect, a computer-readable storage medium is provided, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of the first aspects.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above described distributed database fault diagnosis methods.
The embodiment of the application has the beneficial effects that:
The embodiment of the application provides a distributed database fault diagnosis method, a device, electronic equipment and a storage medium. The fault diagnosis of the distributed database is thereby carried out automatically, which saves labor cost and improves diagnosis efficiency compared with the conventional manual diagnosis of each node.
Of course, it is not necessary for any one product or method of practicing the application to achieve all of the advantages set forth above at the same time.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a distributed database fault diagnosis method according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for diagnosing a distributed database fault according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for diagnosing a distributed database fault according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a distributed database fault diagnosis device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the prior art, when a distributed database fails, a developer usually manually checks each node in the distributed database one by one, so as to diagnose the failed device or further diagnose a specific failed module in the failed device. However, this method of performing fault diagnosis by manual inspection is labor-intensive and has low diagnosis efficiency. Therefore, the embodiment of the application provides a distributed database fault diagnosis method which can be applied to a fault diagnosis platform, wherein the fault diagnosis platform comprises a Controller end and an Agent end.
The Controller end is a management platform of the Web end and is used for acquiring fault error reporting information of the distributed database and determining a fault diagnosis strategy corresponding to the distributed database; determining the type of a fault component based on the fault error reporting information, determining at least one node to be diagnosed based on the type of the fault component in all nodes of the distributed database, scheduling an Agent end for data acquisition operation according to a fault diagnosis strategy, and performing fault diagnosis operation according to data acquired by the Agent end, or sending the fault diagnosis strategy to the Agent end to instruct the Agent end to perform fault diagnosis operation.
The Agent end is deployed on each node in the distributed database and is used for carrying out data acquisition operation according to the indication of the Controller end and carrying out fault diagnosis operation based on the fault diagnosis strategy when the fault diagnosis strategy sent by the Controller end is received.
The distributed database fault diagnosis method provided by the application is explained in the following with specific embodiments in combination with the accompanying drawings, and the embodiments do not limit the embodiments of the application.
Referring to fig. 1, a flowchart of an embodiment of a distributed database fault diagnosis method is provided in an embodiment of the present application. As shown in fig. 1, the process may include the steps of:
S101, acquiring fault error reporting information of a distributed database, and determining a fault diagnosis strategy corresponding to the distributed database.
The fault error reporting information refers to error reporting information generated when the distributed database fails, such as error number, error information, alarm code, alarm information, message number (UUID), and the like.
In the embodiment of the application, when the distributed database fails, a user can send the fault error reporting information generated when the distributed database fails to the Controller.
The fault diagnosis strategy refers to a strategy for diagnosing the fault cause of the fault of the distributed database.
Specifically, the Controller end may determine the fault diagnosis policy corresponding to the distributed database through the following steps: obtaining a fault tree template corresponding to the distributed database and cluster attribute information corresponding to the distributed database, initializing the fault tree template by utilizing the cluster attribute information to obtain an execution fault tree corresponding to the distributed database, and taking the execution fault tree as the fault diagnosis strategy.
The fault tree template is generated in advance by using a fault tree analysis method (Fault Tree Analysis, FTA) based on a preset generation strategy.
In practical application, the generation strategy can be obtained by learning according to sample data based on a machine learning algorithm in advance. The generation policy may also be built empirically by a developer.
Next, the corresponding cluster attribute information (for example, architecture information, physical distribution information, component type, component number, installation user name, installation path, service ip, service port, etc.) of the distributed database is obtained from the metadata of the distributed database, and the fault tree template is initialized by using the cluster attribute information, so as to obtain an execution fault tree (i.e., fault diagnosis strategy) suitable for performing fault diagnosis on the distributed database.
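Initializing a fault tree template with cluster attribute information can be sketched as follows. The `FaultTreeNode` structure, the `$placeholder` syntax, and the attribute names and values are assumptions for illustration; the description does not specify how the template is represented.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class FaultTreeNode:
    """One node of a fault tree: a check description plus child checks."""
    name: str
    check: str  # in a template, may contain $placeholders
    children: list["FaultTreeNode"] = field(default_factory=list)


def instantiate(template: FaultTreeNode, cluster_attrs: dict[str, str]) -> FaultTreeNode:
    """Return an execution fault tree with every placeholder filled in."""
    return FaultTreeNode(
        name=template.name,
        check=Template(template.check).substitute(cluster_attrs),
        children=[instantiate(child, cluster_attrs) for child in template.children],
    )


# Hypothetical template: a top event and two candidate causes.
template = FaultTreeNode(
    name="connection refused",
    check="probe $service_ip:$service_port",
    children=[
        FaultTreeNode("process down", "look for the process owned by $install_user"),
        FaultTreeNode("disk full", "check free space under $install_path"),
    ],
)

# Hypothetical cluster attribute information read from the database metadata.
cluster_attrs = {
    "service_ip": "10.0.0.5",
    "service_port": "5432",
    "install_user": "dbadmin",
    "install_path": "/opt/db",
}

execution_tree = instantiate(template, cluster_attrs)
print(execution_tree.check)  # probe 10.0.0.5:5432
```

The template itself is left unchanged, so the same template can be initialized again for another cluster.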
S102, determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database.
The fault component type refers to the type of node that may have failed, e.g., compute node, storage node, management node, etc.
In an embodiment, a fault comparison library is preset in the Controller, and fault error reporting information and a component type corresponding to the fault error reporting information are stored in the fault comparison library, wherein the fault error reporting information and the component type are generally in a many-to-one relationship.
Based on this, a specific implementation of determining the fault component type based on the fault error reporting information may include: searching a preset fault comparison library for the component type corresponding to the fault error reporting information, and determining that component type as the fault component type. Thus, the fault component type is determined quickly.
Specifically, in a preset fault comparison library, fuzzy query is performed based on fault error reporting information to obtain at least one component type and a matching degree corresponding to each component type, and the component type with the highest matching degree is determined as the component type corresponding to the fault error reporting information, wherein the corresponding matching degree is higher than a preset threshold.
In practical application, the fault error reporting information itself may be wrong (for example, the user sent the wrong information), so that no corresponding fault component type can be matched in the preset fault comparison library (for example, the matching degree of every component type is smaller than the preset threshold). In this case, a prompt message can be generated based on the fault error reporting information and fed back to the user equipment (such as a user computer or a mobile phone), so that the user can modify the fault error reporting information based on the prompt message. Upon detecting that the user has completed the modification or re-entered the fault error reporting information, step S101 is re-executed based on the new fault error reporting information.
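The fuzzy query and the threshold check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the library contents, the threshold value of 0.6, the function name `lookup_component_type`, and the use of `difflib.SequenceMatcher` as the matching-degree measure are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical fault comparison library: error text -> component type.
# Several error texts map to one component type (many-to-one).
FAULT_LIBRARY = {
    "storage node disk write timeout": "storage node",
    "storage node replica out of sync": "storage node",
    "compute node query execution out of memory": "compute node",
    "management node heartbeat lost": "management node",
}

MATCH_THRESHOLD = 0.6  # assumed value for the preset threshold


def lookup_component_type(error_text: str) -> Optional[str]:
    """Return the best-matching component type, or None if nothing matches well."""
    best_score: dict[str, float] = {}
    for known_text, component_type in FAULT_LIBRARY.items():
        # Similarity ratio in [0, 1] stands in for the "matching degree".
        score = SequenceMatcher(None, error_text.lower(), known_text.lower()).ratio()
        best_score[component_type] = max(score, best_score.get(component_type, 0.0))
    component_type, score = max(best_score.items(), key=lambda item: item[1])
    # Below the threshold, the caller should prompt the user to correct the input.
    return component_type if score > MATCH_THRESHOLD else None


print(lookup_component_type("storage node disk write timed out"))  # storage node
print(lookup_component_type("zzz"))  # None
```

A `None` result corresponds to the case where a prompt message is fed back to the user equipment and S101 is re-executed with the corrected information.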
In another embodiment, a classification model may be pre-trained to provide the classification model with the ability to identify the corresponding type of failed component via fault-reporting information.
Based on this, a specific implementation of determining the fault component type based on the fault error reporting information may include: inputting the fault error reporting information into the trained classification model, so that the classification model outputs the corresponding fault component type.
The node to be diagnosed refers to a node needing diagnosis in the distributed database, namely, a node possibly causing faults.
In an embodiment, a specific implementation of determining at least one node to be diagnosed based on the fault component type among all nodes of the distributed database may include: determining, among all nodes of the distributed database, at least one type node corresponding to the fault component type, and determining each type node as a node to be diagnosed.
For example, if the type of the fault component is "compute node", all nodes with the type of "compute node" in the database are determined as nodes to be diagnosed.
It should be noted that, when a type node has child nodes, a child node may also be the node causing the fault; therefore, the child nodes of a type node may also be determined as nodes to be diagnosed.
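The node selection in S102 can be illustrated with a short sketch. The `Node` class, its field names, and the example cluster are hypothetical; the sketch assumes that child nodes are always included and that each node has a unique name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    component_type: str
    children: list["Node"] = field(default_factory=list)


def nodes_to_diagnose(all_nodes: list[Node], fault_component_type: str) -> list[Node]:
    """Select every node of the fault component type, plus its descendants."""
    selected: list[Node] = []
    seen: set[str] = set()

    def add(node: Node) -> None:
        if node.name in seen:  # avoid listing a node twice
            return
        seen.add(node.name)
        selected.append(node)
        for child in node.children:
            add(child)

    for node in all_nodes:
        if node.component_type == fault_component_type:
            add(node)
    return selected


# Hypothetical cluster: compute node c1 has a storage child s1.
s1 = Node("s1", "storage node")
s2 = Node("s2", "storage node")
c1 = Node("c1", "compute node", children=[s1])
c2 = Node("c2", "compute node")
cluster = [c1, c2, s1, s2]

print([n.name for n in nodes_to_diagnose(cluster, "compute node")])  # ['c1', 's1', 'c2']
```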
S103, diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result.
In the embodiment of the application, after the fault diagnosis strategy corresponding to the distributed database and all the nodes to be diagnosed are determined, each node to be diagnosed can be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result. Thus, the fault diagnosis of the distributed database is carried out automatically.
How to diagnose at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result is explained in detail through different embodiments hereinafter, and is not described in detail here.
In the embodiment of the application, firstly, fault error reporting information of a distributed database is obtained, a fault diagnosis strategy corresponding to the distributed database is determined, then, the type of a fault component is determined based on the fault error reporting information, at least one node to be diagnosed is determined in all nodes of the distributed database based on the type of the fault component, and finally, the diagnosis is performed on the at least one node to be diagnosed based on the fault diagnosis strategy, so that a diagnosis result is obtained. Therefore, the fault diagnosis of the distributed database is automatically realized, compared with the conventional manual diagnosis of each node, the labor cost is saved, and the diagnosis efficiency is improved.
Referring to fig. 2, a flowchart of an embodiment of another distributed database fault diagnosis method is provided in an embodiment of the present application. The flow shown in fig. 2 describes how to diagnose at least one node to be diagnosed based on the fault diagnosis policy on the basis of the flow shown in fig. 1, so as to obtain a diagnosis result. As shown in fig. 2, the process may include the steps of:
S201, determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes.
The management node is used for uniformly managing the management node and all following nodes.
In an embodiment, a specific implementation of determining a management node among at least one node to be diagnosed may include: randomly selecting one node to be diagnosed from the at least one node to be diagnosed as the management node. In this way, the management node can be determined quickly.
In another embodiment, a specific implementation of determining a management node among at least one node to be diagnosed may further include: determining the node priority of each node to be diagnosed, and determining the node to be diagnosed with the highest node priority as the management node. Thereby, the management node is determined according to the node priority of each node to be diagnosed.
As a possible implementation manner, the user may preset the priority of each node, and a specific implementation of determining the node priority of each node to be diagnosed may include: for each node to be diagnosed, determining the priority preset by the user as the corresponding node priority. Thus, the user can flexibly set the priority according to actual demand.
As another possible implementation, determining the node priority of each node to be diagnosed may include: and determining the system resource utilization rate of each node to be diagnosed, and determining the corresponding node priority based on the system resource utilization rate, wherein the lower the system resource utilization rate is, the higher the corresponding node priority is. That is, the priority of the node to be diagnosed with the lowest system resource usage is highest, whereby the node to be diagnosed with the lowest system resource usage can be determined as the management node.
The system resource utilization rate may be any one of the CPU utilization rate, the disk throughput utilization rate, the memory utilization rate and the network throughput utilization rate; it may also be the maximum of these four values; or it may be a weighted combination of them, for example: system resource utilization rate = X1 × CPU utilization rate + X2 × disk throughput utilization rate + X3 × memory utilization rate + X4 × network throughput utilization rate, where X1, X2, X3 and X4 are weight coefficients and X1 + X2 + X3 + X4 = 1.
It will be appreciated that when there is only one node to be diagnosed, then the unique node to be diagnosed may be determined directly as the management node, where there is no following node.
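The priority-based selection of the management node can be sketched as follows, using the weighted form of the system resource utilization rate. The weight values, the `Usage` class, and the function names are assumptions made for illustration; the description only requires that the four weights sum to 1 and that the lowest utilization rate yields the highest priority.

```python
from dataclasses import dataclass

# Assumed weight coefficients X1..X4 (CPU, disk, memory, network); they sum to 1.
WEIGHTS = (0.4, 0.2, 0.3, 0.1)


@dataclass(frozen=True)
class Usage:
    cpu: float      # each value is a utilization rate in [0, 1]
    disk: float
    memory: float
    network: float


def weighted_usage(u: Usage, weights: tuple = WEIGHTS) -> float:
    x1, x2, x3, x4 = weights
    return x1 * u.cpu + x2 * u.disk + x3 * u.memory + x4 * u.network


def elect_management_node(usages: dict[str, Usage]) -> tuple[str, list[str]]:
    """Return (management node, following nodes); lowest usage wins."""
    if not usages:
        raise ValueError("at least one node to be diagnosed is required")
    manager = min(usages, key=lambda name: weighted_usage(usages[name]))
    followers = [name for name in usages if name != manager]
    return manager, followers


usages = {
    "node-a": Usage(cpu=0.9, disk=0.5, memory=0.8, network=0.3),
    "node-b": Usage(cpu=0.2, disk=0.4, memory=0.3, network=0.1),
    "node-c": Usage(cpu=0.5, disk=0.5, memory=0.5, network=0.5),
}
print(elect_management_node(usages))  # ('node-b', ['node-a', 'node-c'])
```

With a single node to be diagnosed, the same function returns that node as the management node and an empty list of following nodes.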
S202, the fault diagnosis strategy is sent to the management node, so that the management node collects corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosis is carried out based on the information to be diagnosed, and the diagnosis result is obtained.
In the embodiment of the application, after the management node and the following nodes are determined, the fault diagnosis strategy can be sent to the Agent of the management node, the Agent of the management node is promoted to Leader, and the Agent of each following node is set as Follower, so that the management node has the capability of uniformly managing itself and the following nodes.
Based on the above, the Leader Agent of the management node can determine the information items to be collected according to the fault diagnosis strategy, collect its own information (i.e. information to be diagnosed) according to the information items, and generate an information collection instruction. It then sends the information collection instruction to the Follower Agent of each following node; each Follower Agent collects its own information (i.e. information to be diagnosed) according to the instruction and feeds it back to the Leader Agent. Finally, the Leader Agent diagnoses according to the fault diagnosis strategy and all the collected information to obtain a diagnosis result, and feeds the diagnosis result back to the Controller end.
In practical application, a node to be diagnosed may be unable to find the information corresponding to an information item. For example, if the information item is log data from 7 days ago and that log data has already been cleared from the node to be diagnosed, the corresponding information cannot be collected. Based on this, in another embodiment, when any node to be diagnosed (including the management node itself and all following nodes) does not feed back the collected information, the Leader Agent may generate error reporting information and feed it back to the Controller end, so as to prompt the user that information is missing.
By the flow shown in fig. 2, it is realized that one management node is determined among the nodes to be diagnosed, the nodes to be diagnosed except the management node are determined as following nodes, and then, the fault diagnosis strategy is sent to the management node, so that the management node collects the corresponding information to be diagnosed in itself and all following nodes based on the fault diagnosis strategy, and performs diagnosis based on the information to be diagnosed, thereby obtaining a diagnosis result. Therefore, the node to be diagnosed in the cluster can carry out self-checking according to the fault diagnosis strategy, so that the calculation pressure of the Controller end is reduced.
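The collection and diagnosis performed by the Leader Agent can be sketched in a single process as follows. Network transport between Agents is replaced by plain function calls, and the `Collector` type, the report format, and the `find_slow_disks` rule are hypothetical. The sketch also assumes that diagnosis is skipped when any information is missing, which the description does not specify.

```python
from typing import Callable, Optional

# A collector stands in for one node's Agent: given an information item, it
# returns the collected data, or None when the data is unavailable.
Collector = Callable[[str], Optional[str]]


def leader_diagnose(
    items: list[str],
    collectors: dict[str, Collector],
    diagnose: Callable[[dict[str, dict[str, str]]], str],
) -> dict:
    """Collect every item from every node, then diagnose or report gaps."""
    collected: dict[str, dict[str, str]] = {}
    missing: list[tuple[str, str]] = []
    for node_name, collect in collectors.items():
        collected[node_name] = {}
        for item in items:
            value = collect(item)
            if value is None:
                missing.append((node_name, item))
            else:
                collected[node_name][item] = value
    if missing:
        # Error report fed back to the Controller end (assumed format).
        return {"status": "error", "missing": missing}
    return {"status": "ok", "result": diagnose(collected)}


def find_slow_disks(collected: dict[str, dict[str, str]]) -> str:
    # Hypothetical diagnosis rule standing in for the execution fault tree.
    slow = [n for n, info in collected.items() if "latency" in info["error_log"]]
    return "slow disk on: " + ", ".join(slow) if slow else "no fault found"


logs = {
    "leader": {"error_log": "disk latency 900 ms"},
    "follower-1": {"error_log": "ok"},
    "follower-2": {},  # log already cleared on this node
}
collectors = {name: data.get for name, data in logs.items()}
report = leader_diagnose(["error_log"], collectors, find_slow_disks)
print(report)  # {'status': 'error', 'missing': [('follower-2', 'error_log')]}
```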
Referring to fig. 3, a flowchart of another embodiment of the distributed database fault diagnosis method is provided in an embodiment of the present application. On the basis of the flow shown in fig. 1, the flow shown in fig. 3 describes how to diagnose at least one node to be diagnosed based on the fault diagnosis policy to obtain a diagnosis result. As shown in fig. 3, the process may include the following steps:
S301, determining the number of nodes corresponding to the at least one node to be diagnosed;
S302, under the condition that the number of nodes is smaller than a preset threshold, collecting the information to be diagnosed of all the nodes to be diagnosed based on the fault diagnosis policy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
S301 and S302 are collectively described below:
In this embodiment of the application, the number of nodes to be diagnosed is determined first. If the number of nodes is smaller than a preset threshold, there are few nodes to be diagnosed, so the load on the Controller end of receiving the information to be diagnosed is small and, optionally, smaller than the load of sending the fault diagnosis policy. The Controller end can therefore generate an information collection instruction from the information items indicated by the fault diagnosis policy and send it to the Agent of each node to be diagnosed. Each Agent collects its own information (i.e., the information to be diagnosed) according to the instruction and feeds it back to the Controller end, which then performs diagnosis according to the fault diagnosis policy and all the collected information to obtain a diagnosis result.
It should be noted that, the preset threshold may be set by the user according to the actual situation.
Through the flow shown in fig. 3, when the number of nodes is smaller than the preset threshold, the information to be diagnosed of all the nodes to be diagnosed is collected based on the fault diagnosis policy, and diagnosis is performed based on that information to obtain the diagnosis result. The Controller end therefore does not need to send the fault diagnosis policy to the nodes to be diagnosed or receive a diagnosis result fed back by them, which reduces data transmission and transmission load.
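The threshold decision can be illustrated with a minimal sketch. The function name `choose_diagnosis_mode` and the mode labels are hypothetical; falling back to the management-node flow of fig. 2 when the threshold is not met is an assumption of this sketch, since the text above only states what happens when the count is below the threshold.

```python
def choose_diagnosis_mode(node_count: int, threshold: int) -> str:
    """Pick where diagnosis runs, based on the number of nodes to be diagnosed."""
    if node_count < 1:
        raise ValueError("at least one node to be diagnosed is required")
    if node_count < threshold:
        # Few nodes: the Controller end collects the information and diagnoses.
        return "controller"
    # Assumed fallback: delegate to a management node as in the fig. 2 flow.
    return "management_node"
```

With a user-chosen threshold of 5, for example, a group of 3 nodes would be diagnosed by the Controller end, while a group of 5 or more would be delegated.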
Based on the same technical concept, the embodiment of the application also provides a distributed database fault diagnosis device, as shown in fig. 4, which comprises:
the acquiring module 401 is configured to acquire fault reporting information of a distributed database, and determine a fault diagnosis policy corresponding to the distributed database;
a determining module 402, configured to determine a fault component type based on the fault error reporting information, and determine at least one node to be diagnosed based on the fault component type in all nodes of the distributed database;
and the diagnosis module 403 is configured to diagnose at least one node to be diagnosed based on the fault diagnosis policy, so as to obtain a diagnosis result.
In one possible implementation manner, the acquiring module is specifically configured to:
obtaining a fault tree template corresponding to the distributed database and cluster attribute information corresponding to the distributed database;
initializing the fault tree template by utilizing the cluster attribute information to obtain an execution fault tree corresponding to the distributed database, and taking the execution fault tree as the fault diagnosis strategy.
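One way to picture the initialization of a fault tree template is placeholder substitution. The following sketch assumes, purely for illustration, that the template is a nested structure of dictionaries, lists, and strings containing `${name}` placeholders, and that the cluster attribute information is a flat mapping; the application does not specify either format, and the function name `instantiate` is hypothetical.

```python
from string import Template
from typing import Any, Dict


def instantiate(template: Any, cluster_attrs: Dict[str, str]) -> Any:
    """Fill ${name} placeholders in a nested template with cluster attributes.

    Raises KeyError if the template references an attribute that is missing.
    """
    if isinstance(template, str):
        return Template(template).substitute(cluster_attrs)
    if isinstance(template, dict):
        return {key: instantiate(value, cluster_attrs) for key, value in template.items()}
    if isinstance(template, list):
        return [instantiate(value, cluster_attrs) for value in template]
    return template  # numbers, booleans, None are left unchanged
```

The returned structure corresponds to the execution fault tree that is then used as the fault diagnosis policy.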
In one possible embodiment, the diagnostic module is specifically configured to:
determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes;
and sending the fault diagnosis strategy to the management node so that the management node can acquire corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In one possible embodiment, the diagnostic module is further configured to:
determining the node priority of each node to be diagnosed;
and determining the node to be diagnosed that has the highest node priority as the management node.
In one possible embodiment, the diagnostic module is further configured to:
and determining the system resource utilization rate of each node to be diagnosed, and determining the corresponding node priority based on the system resource utilization rate, wherein the lower the system resource utilization rate is, the higher the corresponding node priority is.
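The selection rule above (lower system resource utilization means higher node priority) can be sketched as follows. The function name `pick_management_node`, the use of a single utilization figure per node, and the tie-break by node name are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def pick_management_node(utilization: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return (management node, following nodes) from per-node utilization.

    The node with the lowest utilization has the highest priority; ties are
    broken by node name so the result is deterministic.
    """
    if not utilization:
        raise ValueError("at least one node to be diagnosed is required")
    names = sorted(utilization)
    # min() returns the first minimal element, so name order breaks ties.
    manager = min(names, key=lambda name: utilization[name])
    followers = [name for name in names if name != manager]
    return manager, followers
```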
In one possible embodiment, the diagnostic module is further configured to:
determining the number of nodes corresponding to at least one node to be diagnosed;
and under the condition that the number of nodes is smaller than a preset threshold, collecting the information to be diagnosed of all the nodes to be diagnosed based on the fault diagnosis policy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
In a possible implementation manner, the determining module is specifically configured to:
searching a component type corresponding to fault error reporting information in a preset fault comparison library;
and determining the component type as the failed component type.
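The lookup in the preset fault comparison library, followed by the selection of nodes to be diagnosed, can be sketched as below. The error codes, component type names, and function names are hypothetical placeholders; the application does not describe the library's contents or storage format.

```python
from typing import Dict, List, Optional

# Hypothetical fault comparison library: error code -> component type.
FAULT_LIBRARY: Dict[str, str] = {
    "E1001": "storage",
    "E2001": "compute",
}


def failed_component_type(error_code: str, library: Dict[str, str]) -> Optional[str]:
    """Look up the component type for a fault error code (None if unknown)."""
    return library.get(error_code)


def nodes_to_diagnose(component_type: str, node_types: Dict[str, str]) -> List[str]:
    """Select all nodes of the distributed database whose type matches."""
    return sorted(name for name, kind in node_types.items() if kind == component_type)
```

Because the nodes are selected by component type, the resulting nodes to be diagnosed are all of the same type.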
In this embodiment of the application, fault error reporting information of a distributed database is first obtained and a fault diagnosis policy corresponding to the distributed database is determined. The fault component type is then determined based on the fault error reporting information, and at least one node to be diagnosed is determined among all nodes of the distributed database based on the fault component type. Finally, the at least one node to be diagnosed is diagnosed based on the fault diagnosis policy to obtain a diagnosis result. Fault diagnosis of the distributed database is thereby performed automatically, which saves labor cost and improves diagnosis efficiency compared with conventional manual diagnosis of each node.
Based on the same technical concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, including a processor 111, a communication interface 112, a memory 113 and a communication bus 114, where the processor 111, the communication interface 112 and the memory 113 communicate with each other through the communication bus 114;
a memory 113 for storing a computer program;
the processor 111 is configured to execute a program stored in the memory 113, and implement the following steps:
acquiring fault error reporting information of a distributed database, and determining a fault diagnosis strategy corresponding to the distributed database;
determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database;
and diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result.
The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by a single bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of any of the above-described distributed database fault diagnosis methods.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the distributed database fault diagnosis method of any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A distributed database fault diagnosis method, characterized in that the method is applied to a fault diagnosis platform, the fault diagnosis platform comprises a Controller end, the Controller end is a management platform of a Web end, and the method comprises the following steps:
acquiring fault error reporting information of a distributed database, and determining a fault diagnosis strategy corresponding to the distributed database;
determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database, wherein at least one node to be diagnosed is a node of the same type;
diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result;
the diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result includes:
determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes;
the Controller end sends the fault diagnosis strategy to the management node so that the management node can collect corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosis is carried out based on the information to be diagnosed to obtain the diagnosis result.
2. The method of claim 1, wherein the determining a fault diagnosis policy for the distributed database comprises:
obtaining a fault tree template corresponding to the distributed database and cluster attribute information corresponding to the distributed database;
initializing the fault tree template by utilizing the cluster attribute information to obtain an execution fault tree corresponding to the distributed database, and taking the execution fault tree as the fault diagnosis strategy.
3. The method according to claim 1, wherein said determining a management node among at least one of said nodes to be diagnosed comprises:
determining the node priority of each node to be diagnosed;
and determining the node to be diagnosed that has the highest node priority as the management node.
4. A method according to claim 3, wherein said determining a node priority for each node to be diagnosed comprises:
and determining the system resource utilization rate of each node to be diagnosed, and determining the corresponding node priority based on the system resource utilization rate, wherein the lower the system resource utilization rate is, the higher the corresponding node priority is.
5. The method according to claim 1, wherein diagnosing at least one node to be diagnosed based on the fault diagnosis policy, to obtain a diagnosis result, comprises:
determining the number of nodes corresponding to at least one node to be diagnosed;
and under the condition that the number of nodes is smaller than a preset threshold, collecting the information to be diagnosed of all the nodes to be diagnosed based on the fault diagnosis strategy, and diagnosing based on the information to be diagnosed to obtain the diagnosis result.
6. The method of claim 1, wherein the determining a failed component type based on the fault-reporting information comprises:
and searching a component type corresponding to the fault error information in a preset fault comparison library, and determining the component type as the fault component type.
7. A distributed database fault diagnosis device, which is characterized in that the device is applied to a fault diagnosis platform, the fault diagnosis platform comprises a Controller end, the Controller end is a management platform of a Web end, and the device comprises:
the acquisition module is used for acquiring fault error reporting information of the distributed database and determining a fault diagnosis strategy corresponding to the distributed database;
the determining module is used for determining a fault component type based on the fault error reporting information, and determining at least one node to be diagnosed based on the fault component type in all nodes of the distributed database, wherein at least one node to be diagnosed is a node of the same type;
the diagnosis module is used for diagnosing at least one node to be diagnosed based on the fault diagnosis strategy to obtain a diagnosis result;
wherein, the diagnosis module is specifically used for:
determining a management node in at least one node to be diagnosed, and determining the nodes to be diagnosed except the management node as following nodes;
the Controller end sends the fault diagnosis strategy to the management node so that the management node can collect corresponding information to be diagnosed in the management node and all following nodes based on the fault diagnosis strategy, and diagnosis is carried out based on the information to be diagnosed to obtain the diagnosis result.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-6 when executing a program stored on a memory.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310042844.1A CN116048859B (en) | 2023-01-28 | 2023-01-28 | Distributed database fault diagnosis method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310042844.1A CN116048859B (en) | 2023-01-28 | 2023-01-28 | Distributed database fault diagnosis method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116048859A (en) | 2023-05-02 |
CN116048859B (en) | 2023-08-25 |
Family
ID=86119754
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310042844.1A Active CN116048859B (en) | 2023-01-28 | 2023-01-28 | Distributed database fault diagnosis method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116048859B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9274902B1 (en) * | 2013-08-07 | 2016-03-01 | Amazon Technologies, Inc. | Distributed computing fault management |
CN108334427A (en) * | 2018-02-24 | 2018-07-27 | 腾讯科技(深圳)有限公司 | Method for diagnosing faults in storage system and device |
CN109614289A (en) * | 2018-12-10 | 2019-04-12 | 浪潮(北京)电子信息产业有限公司 | A kind of memory node monitoring method, system, equipment and computer storage medium |
CN111796956A (en) * | 2020-06-22 | 2020-10-20 | 深圳壹账通智能科技有限公司 | Distributed system fault diagnosis method, device, equipment and storage medium |
CN111913133A (en) * | 2020-06-30 | 2020-11-10 | 北京航天测控技术有限公司 | Distributed fault diagnosis and maintenance method, device, equipment and computer readable medium |
CN113448303A (en) * | 2020-03-27 | 2021-09-28 | 广州汽车集团股份有限公司 | Vehicle fault diagnosis method and system |
CN113626236A (en) * | 2021-07-09 | 2021-11-09 | 浪潮电子信息产业股份有限公司 | Fault diagnosis method, device, equipment and medium for distributed file system |
Also Published As
Publication number | Publication date |
---|---|
CN116048859A (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110519365A (en) | A kind of method and business change system changing appliance services | |
CN111897724A (en) | Automatic testing method and device suitable for cloud platform | |
CN112905323B (en) | Data processing method, device, electronic equipment and storage medium | |
CN111859047A (en) | Fault solving method and device | |
CN110851471A (en) | Distributed log data processing method, device and system | |
CN114528175A (en) | Micro-service application system root cause positioning method, device, medium and equipment | |
JP2009294837A (en) | Failure monitoring system and device, monitoring apparatus, and failure monitoring method | |
CN117093465B (en) | Server log collection method, device, communication equipment and storage medium | |
CN110875832A (en) | Abnormal service monitoring method, device and system and computer readable storage medium | |
CN116048859B (en) | Distributed database fault diagnosis method and device, electronic equipment and storage medium | |
CN113518367A (en) | Fault diagnosis method and system based on service characteristics under 5G network slice | |
CN113553243A (en) | Remote error detection method | |
CN114138522A (en) | Micro-service fault recovery method and device, electronic equipment and medium | |
CN115037653B (en) | Service flow monitoring method, device, electronic equipment and storage medium | |
CN111835566A (en) | System fault management method, device and system | |
CN113282496B (en) | Automatic interface testing method, device, equipment and storage medium | |
CN112131077B (en) | Positioning method and positioning device for fault node and database cluster system | |
CN112291302B (en) | Internet of things equipment behavior data analysis method and processing system | |
CN110932926B (en) | Container cluster monitoring method, system and device | |
CN112272206B (en) | Management method and system of load balancing equipment | |
CN114598588B (en) | Server fault determination method and device and terminal equipment | |
CN116233103B (en) | Interface adaptation method, device, communication equipment and storage medium | |
CN114089725B (en) | Test method and device for control software of chemical mechanical polishing equipment and electronic equipment | |
CN111324846B (en) | Information processing method, information processing device, electronic equipment and computer readable storage medium | |
WO2023103627A1 (en) | Network inspection method and apparatus, electronic device and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |