CN113535474A - Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault - Google Patents

Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault

Info

Publication number
CN113535474A
Authority
CN
China
Prior art keywords
fault
node
cluster
recovery
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110741169.2A
Other languages
Chinese (zh)
Other versions
CN113535474B (en)
Inventor
Wang Yu (王宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Unisinsight Technology Co Ltd
Original Assignee
Chongqing Unisinsight Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Unisinsight Technology Co Ltd filed Critical Chongqing Unisinsight Technology Co Ltd
Priority to CN202110741169.2A
Publication of CN113535474A
Application granted
Publication of CN113535474B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1402 - Saving, restoring, recovering or retrying
    • G06F11/1415 - Saving, restoring, recovering or retrying at system level
    • G06F11/1435 - Saving, restoring, recovering or retrying at system level using file system or storage system metadata
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1402 - Saving, restoring, recovering or retrying
    • G06F11/1471 - Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention provides a method, a system, a medium and a terminal for automatically repairing heterogeneous cloud storage cluster faults. The method comprises: obtaining information data of all cluster nodes, performing information detection, and obtaining the fault rate of each node; if a node has a high fault rate, marking it as a fault node; obtaining the fault grade of the cluster according to the fault marks; executing fault recovery, and if the fault recovery is not yet completed, marking the node as recovering and stopping information detection on it; if the fault recovery is completed, marking the node as healthy and sending a fault-recovery-completion notification. According to the invention, node health is comprehensively checked through node fault detection, and a corresponding recovery scheme is then adopted according to the fault grade.

Description

Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault
Technical Field
The invention relates to the field of computer application, in particular to a method, a system, a medium and a terminal for automatically repairing a fault of a heterogeneous cloud storage cluster.
Background
The heterogeneous cloud storage service is a middleware product that sits between cloud storage and external third-party standard storage devices. It is selected flexibly according to the type of third-party storage device and the service scenario, and provides a highly reliable, high-performance unified-view cloud storage service. Its greatest advantages are that it is environment-friendly and has low requirements on server hardware, so it can still provide a highly reliable, high-performance view cloud storage service even when deployed on relatively old servers.
However, once server hardware ages severely, the failure frequency increases, which makes maintenance and recovery of a heterogeneous cluster difficult. At present, the cluster fault recovery means of cloud storage are limited to partial failures of cluster nodes. Moreover, storage cluster fault recovery mainly depends on the cluster's backup data; when all cluster nodes fail and no backup data exists, the cluster cannot be recovered. In addition, many conventional recovery means require manual intervention, so recovery efficiency is low.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention provides a method, a system, a medium, and a terminal for automatically repairing a failure of a heterogeneous cloud storage cluster, so as to solve the above technical problems.
The invention provides a method for automatically repairing a fault of a heterogeneous cloud storage cluster, which comprises the following steps:
acquiring information data of all cluster nodes, performing information detection, and acquiring the fault rate of each cluster node;
if the fault rate of a node is higher than a preset comparison threshold, judging that the node has a high fault rate and marking it as a fault node;
dividing fault grades in advance, and acquiring the fault grade of the cluster according to the fault mark of each node in the cluster;
initiating recovery request information for a fault node to be recovered, executing fault recovery according to the recovery request information, applying a secondary mark to the fault node to be recovered, and representing the state of the fault node through the secondary mark;
if the fault recovery is not yet finished, marking the state of the fault node as recovering through the secondary mark, and stopping information detection on the fault node;
and if the fault recovery is finished, marking the state of the fault node as healthy through the secondary mark, and sending fault-recovery-completion notification information.
In an embodiment of the present invention, the information detection includes multiple detection items for node working states, specifically including one or a combination of several of a baseboard management controller network, a service network, a server running state, a background access condition, a memory usage rate, a CPU usage rate, a system space, a core process, a configuration file, a library file, and a network fluctuation.
In one embodiment of the invention, weights are assigned to the different detection items according to their fault severity, information data detection is performed periodically, and the fault rate of the node is obtained using a preset normal-distribution conformance criterion; the criterion is set according to the skewness and kurtosis of the data and the random variable, expected value and standard deviation of the normal distribution. If the fault rate of a node is lower than the preset comparison threshold, the node is judged to have a low fault rate, and adaptive recovery is performed on it.
In an embodiment of the present invention, the fault marks of all nodes of the cluster are acquired, and the cluster is assigned one of several fault grades representing different degrees of failure according to the ratio of fault nodes to cluster nodes; the fault grades include a minor fault, a major fault, and an emergency fault, the emergency fault indicating that the cluster can no longer read and write data.
In an embodiment of the present invention, the fault recovery of the emergency fault includes:
stopping reading and writing of the upper-layer service of the cluster, and acquiring interaction information between the upper-layer service and the cluster and third-party configuration information;
concurrently installing an operating system through the pre-execution environment nodes and finishing version deployment;
accessing the cluster to third-party equipment through the third-party configuration information;
and after the cluster ID and the configuration information of each fault node are recovered by the third-party equipment, starting an upper-layer service.
In an embodiment of the present invention, the fault recovery of the major fault includes:
stopping service reading and writing of the fault node in the cluster, and concurrently installing an operating system through the pre-execution environment node;
judging whether the fault node is an operation and maintenance node or not, and if the fault node is the operation and maintenance node, installing a product complete package;
acquiring a cluster ID through a healthy node in a cluster, adding the fault node into the cluster, and modifying the cluster ID of the fault node to be consistent with the acquired cluster ID;
acquiring third party configuration information, accessing the third party configuration information to third party equipment, and recovering the configuration information of the fault node by reading data in the third party equipment;
and restarting the service reading and writing of the fault node in the cluster after the recovery is completed.
In an embodiment of the present invention, the fault recovery of the minor fault includes:
concurrently installing an operating system through the pre-execution environment nodes;
judging whether the fault node is an operation and maintenance node or not, and if the fault node is the operation and maintenance node, installing a product complete package;
acquiring a cluster ID through a healthy node in a cluster, adding the fault node into the cluster, and modifying the cluster ID of the fault node to be consistent with the acquired cluster ID;
and acquiring third party configuration information, accessing the third party configuration information into third party equipment, and recovering the configuration information of the fault node by reading data in the third party equipment.
In an embodiment of the present invention, when the cluster accesses the third party device,
stopping the storage process of the fault node, and judging the access type:
if the interface is a small computer system interface, acquiring a drive letter; if the file system is a network file system, acquiring a mark file;
and reading corresponding data content on the third-party equipment according to the written data of the cluster, and further acquiring cluster ID and configuration information.
The invention also provides an automatic fault repairing system for the heterogeneous cloud storage cluster, which comprises the following steps: a fault detection module, a fault grade confirmation module and a fault recovery module,
acquiring information data of all cluster nodes, and performing information detection through a fault detection module to acquire fault rates of all cluster nodes;
if the fault rate of one node is higher than a preset comparison threshold, judging that the node is high in fault rate, and marking the node as a fault node;
dividing fault grades in advance, and acquiring the fault grades of the clusters by a fault grade confirmation module according to the fault marks of each node in the clusters;
initiating recovery request information aiming at a fault node to be recovered, executing fault recovery according to the recovery request information, carrying out secondary marking on the fault node to be recovered, and representing the state of the fault node to be recovered through the secondary marking;
if the fault recovery is not finished, marking the state of the fault node to be recovered as recovery through secondary marking, and stopping information detection on the fault node to be recovered;
and if the fault recovery is finished, marking the state of the fault node to be recovered as healthy through secondary marking, and sending fault recovery finishing notification information.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
The present invention also provides an electronic terminal, comprising: a processor and a memory;
the memory is adapted to store a computer program and the processor is adapted to execute the computer program stored by the memory to cause the terminal to perform the method as defined in any one of the above.
The invention has the beneficial effects that: according to the method, the system, the medium and the terminal for automatically repairing the heterogeneous cloud storage cluster fault, the node health is comprehensively checked through node fault detection, and the corresponding recovery scheme is adopted according to different node fault rates.
Drawings
Fig. 1 is a schematic flow diagram of a method for automatically repairing a fault of a heterogeneous cloud storage cluster in an embodiment of the present invention.
Fig. 2 is a schematic diagram of interaction between modules of an automatic fault repair system for a heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a fault detection flow in the method for automatically repairing the fault of the heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 4 is a schematic view of a level confirmation flow of the method for automatically repairing a failure of a heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 5 is a schematic fault recovery flow diagram of a method for automatically repairing a fault of a heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 6 is a schematic emergency failure recovery flow diagram of a method for automatically repairing a failure of a heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 7 is a schematic serious failure recovery flow diagram of a method for automatically repairing a failure of a heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 8 is a schematic diagram of a slight fault recovery flow of the method for automatically repairing the fault of the heterogeneous cloud storage cluster in the embodiment of the present invention.
Fig. 9 is a schematic flow chart illustrating a process of reading stored data to recover cluster configuration information according to the method for automatically repairing a fault of a heterogeneous cloud storage cluster in the embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In the following description, numerous details are set forth to provide a more thorough explanation of embodiments of the present invention, however, it will be apparent to one skilled in the art that embodiments of the present invention may be practiced without these specific details, and in other embodiments, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present invention.
As shown in fig. 1, the method for automatically repairing a failure of a heterogeneous cloud storage cluster in this embodiment includes:
S1, acquiring information data of all cluster nodes, performing information detection, and acquiring the fault rate of each cluster node;
S2, if the fault rate of a node is higher than a preset comparison threshold, judging that the node has a high fault rate and marking it as a fault node;
S3, dividing fault grades in advance, and acquiring the fault grade of the cluster according to the fault mark of each node in the cluster;
S4, initiating recovery request information for a fault node to be recovered, executing fault recovery according to the recovery request information, applying a secondary mark to the fault node to be recovered, and representing the state of the fault node through the secondary mark;
S5, if the fault recovery is not yet finished, marking the state of the fault node as recovering through the secondary mark, and stopping information detection on the fault node;
S6, if the fault recovery is finished, marking the state of the fault node as healthy through the secondary mark, and sending fault-recovery-completion notification information.
In this embodiment, information data of all cluster nodes is acquired, the modules that affect node operation are checked for faults, and the detection data and results are collected; the node fault rate is then calculated from the collected information data. If the fault rate is low, adaptive repair is performed directly; if the fault rate is too high, the node is marked as a fault node and can be recovered by the fault recovery module. The fault mark of each node of the cluster is acquired, the cluster fault grade is calculated, and it is determined whether to start fault recovery. If fault recovery is started, a recovery request message is initiated and the fault recovery process begins; at the same time a fault handling signal is sent to indicate that the node is under fault recovery. After the signal is received, the corresponding node is marked as under recovery and fault detection is no longer performed on it. When fault recovery is completed, a task completion message is sent, the node is marked as healthy, and the next round of detection continues. The specific flow is shown in fig. 1 and fig. 2.
In this embodiment, node information is first collected and the fault rate is calculated from it; for example, when the fault rate is greater than 0.5 the node is considered faulty and is marked as a fault node, otherwise the node is marked as normal after self-recovery. The fault mark of each node of the cluster is acquired, the fault grade is calculated, and notification information is then sent so that fault recovery can be performed. After the notification is received, the fault recovery process starts and a fault handling signal is sent to indicate that the node is under recovery; the fault mark of the corresponding node is set to recovering and fault detection is no longer performed on it. When fault recovery is finished, a message is sent, the mark is restored to healthy, and the next round of detection continues.
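For illustration, the detection and marking loop described above can be sketched as follows. This is a minimal sketch only: the Node class and the helper names are assumptions, not part of the patented product; only the example 0.5 threshold and the fault/recovering/healthy marks come from this embodiment.

```python
# Minimal sketch of the detect -> mark -> recover loop of this embodiment.
# The Node class and all function names are illustrative assumptions; only the
# 0.5 example threshold and the primary/secondary marks come from the text.
from dataclasses import dataclass

FAULT_THRESHOLD = 0.5  # example comparison threshold used in this embodiment

@dataclass
class Node:
    ip: str
    fault_rate: float = 0.0
    fault_mark: bool = False   # primary mark: is this a fault node?
    state: str = "healthy"     # secondary mark: "healthy" or "recovering"

def detection_round(nodes):
    """One periodic round: mark fault nodes, skip nodes already recovering."""
    to_recover = []
    for node in nodes:
        if node.state == "recovering":   # detection is stopped during recovery
            continue
        if node.fault_rate > FAULT_THRESHOLD:
            node.fault_mark = True       # high fault rate -> fault node
            to_recover.append(node)
        else:
            node.fault_mark = False      # low fault rate -> adaptive self-repair
    return to_recover

def start_recovery(nodes):
    for node in nodes:
        node.state = "recovering"        # secondary mark while under recovery

def finish_recovery(node):
    node.state = "healthy"               # secondary mark back to healthy
    node.fault_mark = False
    print(f"fault recovery finished for {node.ip}")  # completion notification

# usage example
cluster = [Node("10.0.0.1", fault_rate=0.8), Node("10.0.0.2", fault_rate=0.1)]
faulty = detection_round(cluster)        # marks 10.0.0.1 as a fault node
start_recovery(faulty)
finish_recovery(faulty[0])
```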
In this embodiment, the information detection includes multiple detection items for the node working state, specifically a Baseboard Management Controller (BMC) network, a service network, the server running state, background access, memory usage, CPU usage, system space, core processes, configuration files, library files, and network fluctuation. The collected data is processed and analyzed to calculate the fault rate, and the relevant measures are then started according to the value of the fault rate, as shown in fig. 3. For example, the acquired information includes whether the BMC network, the service network and the server are running, whether the background can be accessed, the specific values of memory and CPU usage, whether key configuration files are damaged, whether version library files are damaged, network stability fluctuation, system space usage, and so on. Optionally, the collection period T is 0.5 hour and one group of data is collected every 10 seconds, so the number of data groups collected in each period is T x 3600 / 10 = 0.5 x 3600 / 10 = 180.
In this embodiment, the weights of the different detection items are assigned according to their fault severity, information data detection is performed periodically, and the fault rate of the node is obtained using a preset normal-distribution conformance criterion; the criterion is set according to the skewness and kurtosis of the data and the random variable, expected value and standard deviation of the normal distribution.
When the random variable X conforms to the N(mu, sigma^2) distribution, g(x) = 1; otherwise g(x) = 0. Whether the N(mu, sigma^2) distribution is satisfied is judged from the skewness and kurtosis of the collected data. (The conformance-test formulas and the full expression for the fault rate F(g(x)) are given as images in the original publication and are not reproduced here.)
The fault rate F(g(x)) is calculated from the weighted detection items, with the weights updated each period as:
r(x_i, t) = r(t-1) + p_(t-1) + x'_i
s(x_j, t) = s(t-1) + p_(t-1) + x'_j
where r(x_i, t) is the weight of the check items with a high fault severity (BMC network, service network, server running state, background access); optionally, in this embodiment its initial value is 0.6. s(x_j, t) is the weight of the check items with a low fault severity (memory, CPU usage, system space, core processes, relevant configuration files, library files, and network fluctuation); optionally, in this embodiment its initial value is 0.4. When t = 0, p_t = 0.
When all g(x) are 0, F(g(x)) = 0.
When F(g(x)) is less than 0.5, adaptive recovery is performed; otherwise, the node is marked as a fault node and the fault recovery module is notified to carry out fault recovery.
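A purely illustrative sketch of how such a weighted fault rate might be computed is given below. The patent's exact conformance test and F(g(x)) formula are available only as images, so the per-item check and the normalization used here are assumptions; only the two item groups, their initial weights of 0.6 and 0.4, the rule that F(g(x)) = 0 when every g(x) is 0, and the 0.5 threshold for adaptive recovery come from the text.

```python
# Illustrative sketch only: the per-item deviation check (a stand-in for the
# skewness/kurtosis normality test) and the weighting scheme are assumptions.
from statistics import mean, stdev

HIGH_ITEMS = ["bmc_net", "service_net", "server_state", "background_access"]
LOW_ITEMS = ["memory", "cpu", "system_space", "core_process",
             "config_file", "library_file", "network_fluctuation"]
R_WEIGHT, S_WEIGHT = 0.6, 0.4   # initial weights stated in this embodiment

def g(samples, k=3.0):
    """Return 1 if the latest sample deviates strongly from the earlier ones
    (stand-in for the patent's N(mu, sigma^2) conformance check)."""
    mu, sigma = mean(samples[:-1]), stdev(samples[:-1]) or 1e-9
    return 1 if abs(samples[-1] - mu) > k * sigma else 0

def fault_rate(item_samples):
    """item_samples maps detection-item name -> list of periodic measurements."""
    high = [g(item_samples[i]) for i in HIGH_ITEMS if i in item_samples]
    low = [g(item_samples[i]) for i in LOW_ITEMS if i in item_samples]
    if not any(high + low):
        return 0.0                       # all g(x) are 0 -> F(g(x)) = 0
    f = 0.0
    if high:
        f += R_WEIGHT * sum(high) / len(high)
    if low:
        f += S_WEIGHT * sum(low) / len(low)
    return f

# example: the BMC-network metric jumps sharply in the latest of 180 samples
samples = {"bmc_net": [1.0] * 179 + [25.0], "cpu": [40.0] * 180}
print(fault_rate(samples))   # 0.6 > 0.5, so the node would be marked as a fault node
```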
In this embodiment, as shown in fig. 4, after the fault marks of all cluster nodes have been collected, the fault grade of the cluster is calculated from the ratio of fault nodes to cluster nodes. The fault grades are health (healthy, no fault), minor (slight fault), major (serious fault), and emergency (the cluster can no longer read and write data). (The piecewise formula that maps the fault-node ratio to these grades is given as an image in the original publication.)
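A short sketch of this grading step follows. The exact ratio thresholds are shown only as an image in the original publication, so the cut-off values here are assumptions; only the grade names and the later remark that a major fault corresponds to more than half of the nodes failing come from the text.

```python
# Sketch of grading the cluster from its fault marks. The threshold values are
# assumptions; only the grade names and the "more than half of the nodes"
# remark for a major fault come from the text of this embodiment.
def cluster_fault_grade(fault_marks):
    """fault_marks: one boolean per cluster node (True = marked as a fault node)."""
    total = len(fault_marks)
    failed = sum(fault_marks)
    if failed == 0:
        return "health"        # healthy, no fault
    if failed == total:
        return "emergency"     # assumed: cluster can no longer read or write
    if failed > total / 2:
        return "major"         # more than half of the nodes failed
    return "minor"

print(cluster_fault_grade([False, True, False, False]))  # -> minor
```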
In this embodiment, the fault grade and the IP of the fault node are confirmed, fault recovery is started, and the node is marked as under fault recovery so that fault detection is no longer performed on it; this avoids receiving another fault recovery message during the recovery process, as shown in fig. 5.
In the present embodiment, the recovery procedure for an emergency failure is shown in fig. 6:
S101, at this fault grade the cluster cannot read or write data; to prevent read and write requests from being sent to the cluster during recovery, reading and writing of the upper-layer service are stopped.
S102, information about the interaction between the upper-layer service and the cluster is read: (1) the CM VIP of the cluster is acquired from the upper-layer application VMS; (2) the third-party configuration information provided by the administrator is acquired (accessible via a path provided by the administrator, who confirms the data path when recovery starts): the IP and port for iSCSI, and the IP and absolute path for NFS.
S103, the operating system is installed automatically through PXE (Preboot Execution Environment); the nodes install the operating system concurrently.
S104, the DHCP server automatically allocates a dynamic IP; after the dynamic IP is obtained, the server is logged in to and its IP address is changed to the IP address planned for the cluster.
S105, after the operating systems of all nodes are installed successfully and the IP addresses are modified, any node is selected to install the version package.
S106, the deployment of the version is completed automatically through the web.
S107, the third-party configuration is accessed through the third-party configuration information provided by the administrator.
S108, the data on the accessed third-party device is read and parsed, and the relevant configuration files of each node are recovered, including but not limited to the cluster ID, the correspondence between dncode and port, the host name, and so on.
S109, the video storage type is judged through the web interface by automatically logging in to the VMS; if it is transfer storage, no related operation is performed; if it is direct storage, the direct-storage service is selected to install the OSS and STDU services.
S110, the related processes of the cluster are started using the version command.
S111, an auxiliary tool is used to read and check the data on all nodes; if recovery is judged successful, a message is returned to the fault detection module, which then resets the fault mark.
S112, the service switch is turned on through the web interface by automatically logging in to the VMS, and video storage proceeds.
S113, fault recovery information is sent to the administrator.
In the present embodiment, the recovery flow for a serious failure is shown in fig. 7:
S201, if more than half of the nodes in the cluster fail, actual service reading and writing is strongly affected, mainly because the cluster capacity may no longer be sufficient for writing, so service reading and writing must be suspended first.
S202, the operating system is installed automatically through PXE; the nodes install the operating system concurrently.
S203, the DHCP server automatically allocates a dynamic IP; after the dynamic IP is obtained, the server is logged in to and its IP address is changed to the IP address planned for the cluster.
S204, the storage cluster web interface is logged in to, whether operation-and-maintenance high availability is configured is judged, and whether the fault node is an operation-and-maintenance node is checked; if it is an operation-and-maintenance node, the complete product package is installed, otherwise only part of the software packages are installed.
S205, the cluster ID is acquired from a healthy node of the cluster, the node is added to the cluster, and the node name is modified to be consistent with the record in the cluster.
S206, the third-party device is accessed after the third-party configuration information is acquired; the third-party device in this embodiment is a third-party storage device.
S207, the third-party data is read and parsed, and the important configuration information of the fault node is recovered, including but not limited to the correspondence between dncode and port, the node capacity provided by the node, and so on.
S208, the storage process of the node is started.
S209, whether the storage service type is direct storage is judged through the web interface by automatically logging in to the VMS; if it is direct storage, installation of the streaming media service is issued on the VMS direct-storage management interface, and whether the process has been pulled up is checked.
S210, the service switch is turned on automatically through the web.
S211, the recovery result is checked with an auxiliary tool of the cluster to ensure that the fault recovery has succeeded.
S212, a recovery completion signal is sent after all fault nodes have been recovered.
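To illustrate the branch in S204-S205, a small hypothetical sketch is given below; the package names, the data structures and the install helper are illustrative assumptions rather than the product's actual interfaces.

```python
# Hypothetical sketch of S204-S205: an operation-and-maintenance node receives
# the complete product package, other nodes only a partial package, and the
# rebuilt node takes its cluster ID and recorded name from a healthy node.
def rebuild_node(node, healthy_node, is_om_node):
    packages = ["product-complete"] if is_om_node else ["storage-partial"]
    for pkg in packages:
        print(f"installing {pkg} on {node['name']}")          # placeholder install
    node["cluster_id"] = healthy_node["cluster_id"]            # align cluster ID
    node["name"] = healthy_node["records"].get(node["name"], node["name"])
    return node

healthy = {"cluster_id": "c-7f3a", "records": {"node-3": "node-3"}}
print(rebuild_node({"name": "node-3"}, healthy, is_om_node=True))
```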
In this embodiment, the flow for the minor fault recovery is shown in fig. 8:
S301, the operating system is installed automatically through PXE; the nodes install the operating system concurrently.
S302, the DHCP server automatically allocates a dynamic IP; after the dynamic IP is obtained, the server is logged in to and its IP address is changed to the IP address planned for the cluster.
S303, the storage cluster web interface is logged in to, whether operation-and-maintenance high availability is configured is judged, and whether the fault node is an operation-and-maintenance node is checked; if it is an operation-and-maintenance node, the complete product package is installed, otherwise only part of the software packages are installed.
S304, the cluster ID is acquired from a healthy node of the cluster, the node is added to the cluster, and the node name is modified to be consistent with the record in the cluster.
S305, the third-party device is accessed after the third-party configuration information is acquired.
S306, the third-party data is read and parsed, and the important configuration information of the fault node is recovered, including but not limited to the correspondence between dncode and port, the node capacity provided by the node, and so on.
S307, the storage process of the node is started.
S308, whether the storage service type is direct storage is judged through the web interface by automatically logging in to the VMS; if it is direct storage, installation of the streaming media service is issued on the VMS direct-storage management interface, and whether the process has been pulled up is checked.
S309, the service switch is turned on automatically through the web.
S310, the recovery result is checked with an auxiliary tool of the cluster to ensure that the fault recovery has succeeded.
S311, a recovery completion signal is sent after all fault nodes have been recovered.
In this embodiment, the flow of reading the third-party device data recovery node configuration is shown in fig. 9.
S411, the storage process is stopped.
S412, the access type is judged: if access is via the iSCSI protocol, the drive letter is found; if access is via the NFS protocol, the marker file is found.
S413, specific bytes on the device are read according to the particular layout and purpose of the data written by the cluster; the data is in binary form.
S414, the binary data is converted to obtain key configuration information, such as the cluster ID.
S415, the device capacity information is acquired.
S416, the key configuration files of the node are restored according to the information obtained.
S417, the related processes of the node are restarted.
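The following sketch illustrates S411-S417 in miniature; the byte layout (a 4-byte cluster ID followed by an 8-byte capacity field) and the device and marker-file paths are invented for the example, since the real offsets and paths are product-specific.

```python
# Sketch of recovering key configuration by reading the raw bytes the cluster
# previously wrote to the third-party device. The 4-byte cluster ID + 8-byte
# capacity layout and the example paths are assumptions, not the real format.
import struct

def locate_data_source(access_type, iscsi_device="/dev/sdb",
                       nfs_marker="/mnt/nfs/.cluster_marker"):
    """iSCSI access -> block device (drive letter); NFS access -> marker file."""
    return iscsi_device if access_type == "iscsi" else nfs_marker

def read_cluster_config(path):
    with open(path, "rb") as f:
        header = f.read(12)                       # binary data written by the cluster
    cluster_id, capacity_bytes = struct.unpack("<I8s", header)
    return {"cluster_id": cluster_id,
            "capacity": int.from_bytes(capacity_bytes, "little")}

# usage example against a file standing in for the device
with open("/tmp/fake_device", "wb") as f:
    f.write(struct.pack("<I8s", 4242, (1 << 40).to_bytes(8, "little")))
cfg = read_cluster_config(locate_data_source("nfs", nfs_marker="/tmp/fake_device"))
print(cfg)   # {'cluster_id': 4242, 'capacity': 1099511627776}
```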
Correspondingly, the invention also provides an automatic fault repairing system for the heterogeneous cloud storage cluster, which comprises the following steps: a fault detection module, a fault grade confirmation module and a fault recovery module,
acquiring information data of all cluster nodes, and performing information detection through a fault detection module to acquire fault rates of all cluster nodes;
if the fault rate of one node is higher than a preset comparison threshold, judging that the node is high in fault rate, and marking the node as a fault node;
dividing fault grades in advance, and acquiring the fault grades of the clusters by a fault grade confirmation module according to the fault marks of each node in the clusters;
initiating recovery request information aiming at a fault node to be recovered, executing fault recovery according to the recovery request information, carrying out secondary marking on the fault node to be recovered, and representing the state of the fault node to be recovered through the secondary marking;
if the fault recovery is not finished, marking the state of the fault node to be recovered as recovery through secondary marking, and stopping information detection on the fault node to be recovered;
and if the fault recovery is finished, marking the state of the fault node to be recovered as healthy through secondary marking, and sending fault recovery finishing notification information.
In this embodiment, the fault detection module mainly performs fault detection on the modules that affect node operation and collects the detection data and results, then calculates the node fault rate from the collected information data; if the fault rate is low, adaptive repair is performed directly, and if the fault rate is too high, the node is marked as a fault node and waits for the fault recovery module to recover it. The grade confirmation module acquires the fault mark of each cluster node, calculates the cluster fault grade, notifies the fault recovery module to perform fault recovery, and raises an alarm that requires the administrator to inspect the hardware and determine whether to start fault recovery. After receiving the message from the fault grade confirmation module and the administrator's confirmation of recovery, the fault recovery module enters the fault recovery process and sends a fault handling signal to the fault detection module to indicate that the node is under fault recovery; on receiving it, the fault detection module marks the corresponding node as under recovery and no longer performs fault detection on it. When fault recovery is completed, a message is sent to the fault detection module and to the administrator; the fault detection module restores the mark to healthy and continues the next round of detection.
In this embodiment, the fault detection module consists of a message receiving module, a node information collection module, a data processing module, and an adaptive recovery module. The detection items include the BMC network, the service network, the server running state, background access, memory, CPU usage, system space, core processes, relevant configuration files, library files, and network fluctuation. The data processing module mainly processes and analyzes the data collected by the information collection module, calculates the fault rate, and starts the relevant measures according to the value of the fault rate, as shown in fig. 3.
In this embodiment, the grade confirmation module mainly obtains the fault marks from the fault detection module, determines whether the node should start fault recovery, calculates the cluster fault grade after collecting the fault marks of all cluster nodes, sends the cluster fault grade to the administrator as an alarm, and also reports the fault grade and the IPs of the fault nodes to the fault recovery module.
In this embodiment, after the fault recovery module receives the grade-confirmed fault message (the fault grade and the IPs of the fault nodes) and the administrator's confirmation of recovery, it starts fault recovery and sends an "under fault recovery" message to the fault detection module; on receiving it, the fault detection module marks the node as under fault recovery and no longer performs fault detection, and resumes fault detection patrols after recovery is completed. This avoids receiving another fault recovery message during the recovery process, as shown in fig. 5. The three modules in this embodiment exchange information to confirm one another's progress, so the method of this embodiment executes fault recovery in an orderly manner and avoids interference or repeated repair during the recovery process.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment further provides an electronic terminal, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic terminal provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for completing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program so that the electronic terminal can execute the steps of the method.
In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, reference in the specification to "the embodiment," "an embodiment," "another embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment.
In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (11)

1. A method for automatically repairing a fault of a heterogeneous cloud storage cluster is characterized by comprising the following steps:
acquiring information data of all cluster nodes, detecting the information, and acquiring the fault rates of all the cluster nodes;
if the fault rate of one node is higher than a preset comparison threshold, judging that the node is high in fault rate, and marking the node as a fault node;
dividing fault grades in advance, and acquiring the fault grades of the clusters according to the fault marks of each node in the clusters;
initiating recovery request information aiming at a fault node to be recovered, executing fault recovery according to the recovery request information, carrying out secondary marking on the fault node to be recovered, and representing the state of the fault node to be recovered through the secondary marking;
if the fault recovery is not finished, marking the state of the fault node to be recovered as recovery through secondary marking, and stopping information detection on the fault node to be recovered;
and if the fault recovery is finished, marking the state of the fault node to be recovered as healthy through secondary marking, and sending fault recovery finishing notification information.
2. The method according to claim 1, wherein the information detection includes a plurality of detection items for node working states, specifically including one or a combination of several of a baseboard management controller network, a service network, a server running state, a background access condition, a memory usage rate, a CPU usage rate, a system space, a core process, a configuration file, a library file, and a network fluctuation.
3. The method for automatically repairing the fault of the heterogeneous cloud storage cluster according to claim 2, wherein weights of different detection items are distributed according to fault degrees, information data are periodically detected, the fault rate of the node is obtained by utilizing a preset judgment conforming condition of normal distribution, and the judgment conforming condition is set according to skewness and kurtosis of data and a random variable, an expected value and a standard deviation in the normal distribution;
and if the fault rate of one node is lower than a preset comparison threshold, judging that the node is low in fault rate, and performing self-adaptive recovery on the node judged to be low in fault rate.
4. The method for automatically repairing the fault of the heterogeneous cloud storage cluster according to claim 1, wherein the fault flags of all nodes of the cluster are acquired, the cluster is divided into a plurality of fault levels representing different fault degrees according to the ratio of the fault nodes to the cluster nodes, and the fault levels include a slight fault, a serious fault and an emergency fault for representing that the fault degree is that the cluster cannot read and write data.
5. The method for fault automatic repair of the heterogeneous cloud storage cluster according to claim 4, wherein the fault recovery of the emergency fault comprises:
stopping reading and writing of the upper-layer service of the cluster, and acquiring interaction information between the upper-layer service and the cluster and third-party configuration information;
concurrently installing an operating system through the pre-execution environment nodes and finishing version deployment;
accessing the cluster to third-party equipment through the third-party configuration information;
and after the cluster ID and the configuration information of each fault node are recovered by the third-party equipment, starting an upper-layer service.
6. The method for fault automatic repair of the heterogeneous cloud storage cluster according to claim 4, wherein the fault recovery of the serious fault comprises:
stopping service reading and writing of the fault node in the cluster, and concurrently installing an operating system through the pre-execution environment node;
judging whether the fault node is an operation and maintenance node or not, and if the fault node is the operation and maintenance node, installing a product complete package;
acquiring a cluster ID through a healthy node in a cluster, adding the fault node into the cluster, and modifying the cluster ID of the fault node to be consistent with the acquired cluster ID;
acquiring third party configuration information, accessing the third party configuration information to third party equipment, and recovering the configuration information of the fault node by reading data in the third party equipment;
and restarting the service reading and writing of the fault node in the cluster after the recovery is completed.
7. The method for fault automatic repair of the heterogeneous cloud storage cluster according to claim 4, wherein the fault recovery of the slight fault comprises:
concurrently installing an operating system through the pre-execution environment nodes;
judging whether the fault node is an operation and maintenance node or not, and if the fault node is the operation and maintenance node, installing a product complete package;
acquiring a cluster ID through a healthy node in a cluster, adding the fault node into the cluster, and modifying the cluster ID of the fault node to be consistent with the acquired cluster ID;
and acquiring third party configuration information, accessing the third party configuration information into third party equipment, and recovering the configuration information of the fault node by reading data in the third party equipment.
8. The method for automatically repairing the fault of the heterogeneous cloud storage cluster according to any one of claims 5, 6 and 7, wherein when the cluster accesses the third-party device,
stopping the storage process of the fault node, and judging the access type:
if the interface is a small computer system interface, acquiring a drive letter; if the file system is a network file system, acquiring a mark file;
and reading corresponding data content on the third-party equipment according to the written data of the cluster, and further acquiring cluster ID and configuration information.
9. A heterogeneous cloud storage cluster fault automatic repair system is characterized by comprising: a fault detection module, a fault grade confirmation module and a fault recovery module,
acquiring information data of all cluster nodes, and performing information detection through a fault detection module to acquire fault rates of all cluster nodes;
if the fault rate of one node is higher than a preset comparison threshold, judging that the node is high in fault rate, and marking the node as a fault node;
dividing fault grades in advance, and acquiring the fault grades of the clusters by a fault grade confirmation module according to the fault marks of each node in the clusters;
initiating recovery request information aiming at a fault node to be recovered, executing fault recovery according to the recovery request information, carrying out secondary marking on the fault node to be recovered, and representing the state of the fault node to be recovered through the secondary marking;
if the fault recovery is not finished, marking the state of the fault node to be recovered as recovery through secondary marking, and stopping information detection on the fault node to be recovered;
and if the fault recovery is finished, marking the state of the fault node to be recovered as healthy through secondary marking, and sending fault recovery finishing notification information.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
11. An electronic terminal, comprising: a processor and a memory;
the memory is for storing a computer program and the processor is for executing the computer program stored by the memory to cause the terminal to perform the method of any of claims 1 to 8.
CN202110741169.2A 2021-06-30 2021-06-30 Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault Active CN113535474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110741169.2A CN113535474B (en) 2021-06-30 2021-06-30 Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110741169.2A CN113535474B (en) 2021-06-30 2021-06-30 Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault

Publications (2)

Publication Number Publication Date
CN113535474A true CN113535474A (en) 2021-10-22
CN113535474B CN113535474B (en) 2022-11-11

Family

ID=78126416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110741169.2A Active CN113535474B (en) 2021-06-30 2021-06-30 Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault

Country Status (1)

Country Link
CN (1) CN113535474B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002033551A1 (en) * 2000-10-18 2002-04-25 Tricord Systems, Inc. Controller fault recovery system for a distributed file system
US20030147352A1 (en) * 2002-02-06 2003-08-07 Nec Corporation Path establishment method for establishing paths of different fault recovery types in a communications network
US20060184939A1 (en) * 2005-02-15 2006-08-17 International Business Machines Corporation Method for using a priority queue to perform job scheduling on a cluster based on node rank and performance
JP2008310591A (en) * 2007-06-14 2008-12-25 Nomura Research Institute Ltd Cluster system, computer, and failure recovery method
US9201887B1 (en) * 2012-03-30 2015-12-01 Emc Corporation Cluster file server proxy server for backup and recovery
CN104813276A (en) * 2012-11-26 2015-07-29 亚马逊科技公司 Streaming restore of a database from a backup system
CN103838637A (en) * 2014-03-03 2014-06-04 江苏智联天地科技有限公司 Terminal automatic fault diagnosis and restoration method on basis of data mining
CN105302667A (en) * 2015-10-12 2016-02-03 国家计算机网络与信息安全管理中心 Cluster architecture based high-reliability data backup and recovery method
US10467115B1 (en) * 2017-11-03 2019-11-05 Nutanix, Inc. Data consistency management in large computing clusters
CN107705054A (en) * 2017-11-23 2018-02-16 国网山东省电力公司电力科学研究院 Meet the new energy grid-connected power remote measuring and diagnosing platform and method of complex data
CN108521339A (en) * 2018-03-13 2018-09-11 广州西麦科技股份有限公司 A kind of reaction type node failure processing method and system based on cluster daily record
CN111309524A (en) * 2020-02-14 2020-06-19 苏州浪潮智能科技有限公司 Distributed storage system fault recovery method, device, terminal and storage medium
CN112084072A (en) * 2020-09-11 2020-12-15 重庆紫光华山智安科技有限公司 Method, system, medium and terminal for improving disaster tolerance capability of PostgreSQL cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BASHIR MOHAMMED等: "Failover strategy for fault tolerance in cloud", 《SOFTWARE:PRACTICE AND EXPERIENCE》 *
崔毅 (CUI, Yi): "Design and Implementation of a Power Operation and Maintenance System Based on State Assessment", China Master's Theses Full-text Database, Engineering Science and Technology II *
贾伟 (JIA, Wei): "Design and Implementation of Disaster Recovery and Backup for a Big Data Center", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114043994A (en) * 2021-11-17 2022-02-15 国汽智控(北京)科技有限公司 Vehicle fault processing method, device, equipment and storage medium
CN114513401A (en) * 2022-01-28 2022-05-17 上海云轴信息科技有限公司 Automatic operation and maintenance repair method and device for private cloud and computer readable medium

Also Published As

Publication number Publication date
CN113535474B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN113535474B (en) Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault
US20120174112A1 (en) Application resource switchover systems and methods
EP3221795A1 (en) Service addressing in distributed environment
JP5692414B2 (en) Detection device, detection program, and detection method
US20170041207A1 (en) Dynamic discovery of applications, external dependencies, and relationships
CN111770002B (en) Test data forwarding control method and device, readable storage medium and electronic equipment
CN110784515B (en) Data storage method based on distributed cluster and related equipment thereof
CN104718533A (en) Robust hardware fault management system, method and framework for enterprise devices
JP2014522052A (en) Reduce hardware failure
US10255124B1 (en) Determining abnormal conditions of host state from log files through Markov modeling
CN109271172B (en) Host performance expansion method and device of sweep cluster
CN115277566B (en) Load balancing method and device for data access, computer equipment and medium
US11836067B2 (en) Hyper-converged infrastructure (HCI) log system
CN103559124A (en) Fast fault detection method and device
CN105446792A (en) Deployment method, deployment device and management node of virtual machines
CN109634802B (en) Process monitoring method and terminal equipment
US10102088B2 (en) Cluster system, server device, cluster system management method, and computer-readable recording medium
WO2023226380A1 (en) Disk processing method and system, and electronic device
CN110647318A (en) Method, device, equipment and medium for creating instance of stateful application
CN107104844B (en) Method and device for migrating public IP address by CTDB
CN110502399B (en) Fault detection method and device
US11263069B1 (en) Using unsupervised learning to monitor changes in fleet behavior
CN103810038A (en) Method and device for transferring virtual machine storage files in HA cluster
CN109660392B (en) Hardware unification self-adaptive management deployment method and system under Linux system
TW201513605A (en) System and method for monitoring multi-level devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant