CN114035997A - High-availability fault switching method based on MGR - Google Patents

High-availability fault switching method based on MGR

Info

Publication number
CN114035997A
Authority
CN
China
Prior art keywords
mgr
node
database
priority
cluster
Prior art date
Legal status
Pending
Application number
CN202111372825.2A
Other languages
Chinese (zh)
Inventor
刘攀
Current Assignee
Chongqing Fumin Bank Co Ltd
Original Assignee
Chongqing Fumin Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Fumin Bank Co Ltd
Priority to CN202111372825.2A
Publication of CN114035997A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0766 - Error or fault reporting or storing
    • G06F 11/079 - Root cause analysis, i.e. error or fault diagnosis
    • G06F 11/0793 - Remedial or corrective actions
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the field of communication technology and discloses an MGR-based high-availability failover method comprising the following steps: S1: build an MGR cluster in single-primary mode; S2: configure a priority (switchover) order for the nodes of the MGR cluster; S3: configure a priority order for keepalived on each node, identical to the priority order of the MGR cluster; S4: set a keepalived VIP and connect the application to the MGR database cluster through the keepalived VIP; S5: when a failure of the current node's database is detected, the MGR cluster switches nodes according to the priority order while keepalived moves the VIP in the same order, so that the application is switched to the next node's database. By combining the per-node priority of the MGR cluster with the keepalived priority, the keepalived VIP switchover order is kept consistent with the MGR node order, which improves the efficiency of MGR failover and achieves high availability of MGR.

Description

High-availability fault switching method based on MGR
Technical Field
The invention relates to the field of communication technology, and in particular to an MGR-based high-availability failover method.
Background
MySQL Group Replication (MGR) is a Paxos-based state-machine replication solution provided officially by MySQL. MGR guarantees data consistency, supports automatic switchover and has built-in failure detection; a group normally consists of three or more MySQL nodes. In single-primary mode, however, the failover mechanism is not application-friendly: every node except the primary is read-only, so the application client would have to detect the high-availability state of the MGR cluster itself and reconnect to the new primary automatically. Solving the problem of automatically redirecting the application's database connection when an MGR node fails is therefore the key to improving MGR's high-availability mechanism.
Disclosure of Invention
The object of the invention is to provide an MGR-based high-availability failover method that keeps the keepalived VIP switchover order consistent with the MGR node order by combining the per-node priority of the MGR cluster with the keepalived priority, thereby improving the efficiency of MGR failover and achieving high availability of MGR.
The technical solution provided by the invention is as follows. An MGR-based high-availability failover method includes the following steps:
S1: build an MGR cluster in single-primary mode;
S2: configure a priority (switchover) order for the nodes of the MGR cluster;
S3: configure a priority order for keepalived on each node, identical to the priority order of the MGR cluster;
S4: set a keepalived VIP and connect the application to the MGR database cluster through the keepalived VIP;
S5: when a failure of the current node's database is detected, the MGR cluster switches nodes according to the priority order while keepalived moves the VIP in the same order, so that the application is switched to the next node's database.
The working principle and advantages of the invention are as follows. In single-primary mode it is difficult for an application to discover the state of the MGR nodes on its own, which makes the failover mechanism unfriendly. The method therefore first configures a priority for every node of the MGR cluster through database parameters on each MGR node, which controls the order in which MGR switches nodes; at the same time keepalived is configured with a matching switchover priority on every node, which controls the order in which the VIP (virtual IP) moves. The database cluster is reached through the keepalived VIP, and because the keepalived priorities mirror the MGR priorities, the VIP always moves in step with the MGR nodes, achieving highly available MGR failover.
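One possible way to realise the two priorities on stock components is sketched below, purely as an illustration: the description only speaks of "database parameters", so the use of MySQL's single-primary election weight (group_replication_member_weight) for the MGR side, and of keepalived's VRRP priority field for the VIP side, is an assumption. Higher values are preferred on both sides.

    # Illustrative sketch; run on the node that should be preferred as primary:
    mysql -e "SET GLOBAL group_replication_member_weight = 99;"
    # and in that node's keepalived.conf (instance name VI_MGR is illustrative):
    #   vrrp_instance VI_MGR {
    #       priority 99       # lower values (e.g. 80, 70) on the standby nodes
    #   }

The embodiments below use this 99/80/70 ordering across three nodes.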
Further, the priority order in S2 is the primary-to-secondary switchover order.
The node database normally used by the application is set as the primary-node database with the highest priority; the remaining candidate nodes are secondary-node databases whose priorities decrease in turn. The primary-node database serves the application's day-to-day workload while the secondary-node databases remain in hot standby.
Further, S2 specifically includes:
S2-1: configure the priority of each node of the MGR cluster according to the primary-to-secondary switchover order;
S2-2: simulate a failure of the MGR primary-node database and test whether the MGR nodes switch in the correct order.
After the priorities have been set according to the switchover order, a failure or crash of the current primary-node database is simulated to verify that the MGR cluster switches nodes in the configured order.
Further, the priority order in S2 may be configured automatically by the system according to the overall performance of each node database, in primary-to-secondary order.
If there are too many MGR nodes to set conveniently by hand, the system analyses and evaluates the overall performance of each node database and assigns the primary and secondary roles in order of performance, avoiding the trouble of setting every node manually.
Further, the overall performance includes the performance parameters and the stability of the database.
The performance parameters reflect the processing capability of the database server and the running efficiency of the application on it, while the stability reflects the reliability of the database server. When the system assigns priorities automatically, it weighs both aspects, performance and reliability, to arrive at the best switchover order.
Further, S5 specifically includes:
S5-1: when a failure of the current primary-node database is detected, the MGR cluster switches nodes according to the primary-to-secondary switchover order while keepalived moves the VIP in the same order, so that the application is switched to the next node database;
S5-2: when recovery of the primary-node database is detected, the MGR cluster switches back to the primary node while keepalived moves the VIP back in the same order, so that the application is switched back to the primary-node database.
After the primary-node database has failed and the workload has switched to a secondary-node database, once the failed primary has been repaired and the system detects that it is back to normal, the MGR cluster switches back to the primary node and keepalived likewise returns the VIP to the primary, so that the application reconnects to the primary-node database. The secondary-node databases therefore never carry the workload for long; operation returns to the primary-node database as soon as possible, which safeguards the efficiency and reliability of the whole MGR cluster.
Further, the method also comprises S6: record the relevant information of database failures and of the switchover process during operation, and provide a query function.
While a node database fails and the nodes are being switched, the system records and stores the relevant information of the process, which facilitates later analysis and summarising and allows the required information to be looked up quickly by keyword.
Further, the method also comprises S7: analyse the recorded information, determine the failure type and generate a failure report.
The system provides an intelligent analysis function for the failure information: it determines the failure type from the failure data and includes it in the generated failure report.
Further, the failure types include transaction failures, system failures and media failures.
The system identifies the failure type automatically from the recorded failure data, which facilitates subsequent repair.
Further, the failure report includes the cause of the failure and a suggested solution.
By combining the failure data with the other first-hand data gathered during the switchover, the system's analysis goes a step further: it infers the cause of the failure and proposes a suggested solution. This reduces the dependence on specialists, allows ordinary staff to carry out simple repairs under the system's guidance, and facilitates later summarising, prevention and control.
Drawings
Fig. 1 is a logic block diagram of the MGR-based high-availability failover method according to Embodiment 1 of the invention.
Detailed Description
The following is described in further detail through specific embodiments.
Embodiment 1:
As shown in Fig. 1, this embodiment discloses an MGR-based high-availability failover method, which specifically includes the following steps:
S1: build an MGR cluster in single-primary mode. In this embodiment three MGR database servers form the cluster; the three MGR databases are configured as node A, node B and node C and the cluster is set to single-primary mode.
S2-1: and configuring the priority of each node of the MGR cluster according to the switching sequence of the master node and the slave node. The node A is defined as a main node, and the configuration priority of the node A database is 99. And defining the node B and the node C as slave nodes, wherein the configuration priority of the node B database is 80, and the configuration priority of the node C database is 70.
S2-2: and simulating the fault of the MGR main node database, and testing whether the switching sequence of the MGR node is correct. Simulating the crash of the main node database, closing the main node database, checking which slave node database becomes a new main node database, and if the system is switched to the B node database, the switching sequence of the MGR nodes is correct.
S3: and configuring a priority switching sequence for the keepalived of each node, wherein the keepalived priority switching sequence is the same as the priority switching sequence of the MGR cluster. Keepalived is configured for the three MGR database servers, the keepalived of the A node database is configured with the priority 99, the keepalived of the B node database is configured with the priority 80, and the keepalived of the C node database is configured with the priority 70.
S4: and setting keepalived VIP, and connecting the application program to the MGR database cluster through the keepalived VIP. Configuring the same keepalive VIP for the three MGR database servers, adding application program authority on the three MGR database servers, connecting the application program to the MGR database cluster through the keepalive VIP, and allowing only a single node to serve when the MGR cluster where the VIP is located provides services.
S5-1: when the fault of the current main node database is detected, the MGR cluster switches the nodes according to the switching sequence of the main node and the slave node, and simultaneously keepalived switches the VIP according to the same sequence, so that the application program is switched to the next node database. When a fault occurs in the operation of the main node database, the MGR cluster is switched to the slave node database, namely the B node database. And meanwhile, keepalived judges that the A node database is in a fault state, namely the A node database is switched to the B node database with the second priority, the keepalived enters a new main node, the VIP is switched to a new database server, and the application program finishes the automatic switching process.
S5-2: and after detecting that the database of the main node is recovered, the MGR cluster is switched back to the main node, and simultaneously keepalive switches the VIP according to the same sequence, so that the application program is switched to the database of the back-main node. After the worker completes the repair of the A node database server, the A node database is recovered to be normal, and the MGR cluster is switched back to the main node database, namely the A node database. And meanwhile, keepalive judges that the A node database is in a normal state, namely, the A node database with the highest priority is switched back, the keepalive reenters the main node, the VIP is switched back to the main node database server, and the application program finishes the process of switching back to the main node database.
S6: and recording related information of database failure and switching process in the operation process, and providing an inquiry function. The system records the relevant data information from the failure of the main node to the completion of the switching process, stores the relevant data information in the back-end server, improves the query function, and enables the staff to search the required relevant data information according to the keywords of the application program, the time, the server and the like.
S7: and analyzing the recorded related information, judging the fault type and generating a fault condition report. The system receives relevant information from the occurrence of the fault to the completion of the switching process, wherein the relevant information comprises database fault data, database switching data, VIP switching data and application program operation data. And integrating the data information, and firstly judging the fault type of the database by an intelligent algorithm, wherein the fault type mainly comprises a transaction fault, a system fault and a medium fault. The transaction fault refers to a common reason that a certain transaction does not run to a normal termination point due to various reasons in the running process, and the input data has error operation overflow and is locked by violating certain integrity limit. The system fault value causes for some reason a sudden stop in the normal operation of the entire system, causing all running transactions to terminate in an abnormal manner. When a system failure occurs, all the information in the database buffer area in the memory is lost, but the data stored on the external storage device is not affected. Media failures refer to hardware failures that cause data stored in external memory to be partially or completely lost, with media failures being much less likely than the first two types of failures, but most disruptive. And after the fault type is judged and identified, further analyzing the fault occurrence reason, sorting and correcting the fault data, comparing the fault data with the cases in the cloud database, selecting the case with the most similar data as the fault occurrence reason, searching the big data according to the reason, and selecting the most suitable solution. And (4) performing content sorting on the fault types, the fault occurrence reasons and the solutions to generate a comprehensive fault condition report.
Embodiment 2:
This embodiment differs from Embodiment 1 in the step of configuring the priority order of the MGR cluster nodes: here the system tests the performance and stability of the three MGR databases. The hardware parameters of each database can be measured by testing the running state of the database server cluster, and the stability of each database can be measured through its resource allocation while running.
The performance parameters and the stability data obtained in this way are compared and each aspect is scored out of 50. The two scores are added together; the database with the highest total is set as the primary-node database and given the highest priority, the rest are set as secondary-node databases, and their priorities decrease in order of their scores.
Accordingly, the node A database has the highest total score and becomes the primary-node database, with its priority set to 99; the node B database ranks second, with priority 80; and the node C database has the lowest total score, with priority 70.
A priority order is then configured for keepalived on each node, identical to the priority order of the MGR cluster: the node A database's keepalived is given priority 99, node B's priority 80 and node C's priority 70.
The remaining steps of this embodiment are the same as those of Embodiment 1.
The foregoing is merely an embodiment of the present invention; no attempt is made to describe structural details in more depth than is needed for a basic understanding, and the description together with the drawings enables those skilled in the art to put the several forms of the invention into practice. It should be noted that those skilled in the art can make a number of changes and improvements without departing from the structure of the invention; these shall also be regarded as falling within the scope of protection and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection of this application is determined by the content of the claims, and the embodiments and other descriptions in the specification are used to interpret the content of the claims.

Claims (10)

1. An MGR-based high-availability failover method, characterized by comprising the following steps:
S1: build an MGR cluster in single-primary mode;
S2: configure a priority (switchover) order for the nodes of the MGR cluster;
S3: configure a priority order for keepalived on each node, identical to the priority order of the MGR cluster;
S4: set a keepalived VIP and connect the application to the MGR database cluster through the keepalived VIP;
S5: when a failure of the current node database is detected, the MGR cluster switches nodes according to the priority order while keepalived moves the VIP in the same order, so that the application is switched to the next node database.
2. The MGR-based high-availability failover method of claim 1, characterized in that the priority order in S2 is the primary-to-secondary switchover order.
3. The MGR-based high-availability failover method of claim 2, characterized in that S2 specifically includes:
S2-1: configure the priority of each node of the MGR cluster according to the primary-to-secondary switchover order;
S2-2: simulate a failure of the MGR primary-node database and test whether the MGR nodes switch in the correct order.
4. The MGR-based high-availability failover method of claim 2, characterized in that the priority order in S2 is configured automatically by the system according to the overall performance of each node database, in primary-to-secondary order.
5. The MGR-based high-availability failover method of claim 4, characterized in that the overall performance includes the performance parameters and the stability of the database.
6. The MGR-based high-availability failover method of claim 2, characterized in that S5 specifically includes:
S5-1: when a failure of the current primary-node database is detected, the MGR cluster switches nodes according to the primary-to-secondary switchover order while keepalived moves the VIP in the same order, so that the application is switched to the next node database;
S5-2: when recovery of the primary-node database is detected, the MGR cluster switches back to the primary node while keepalived moves the VIP back in the same order, so that the application is switched back to the primary-node database.
7. The MGR-based high-availability failover method of claim 1, characterized by further comprising S6: record the relevant information of database failures and of the switchover process during operation, and provide a query function.
8. The MGR-based high-availability failover method of claim 7, characterized by further comprising S7: analyse the recorded information, determine the failure type and generate a failure report.
9. The MGR-based high-availability failover method of claim 8, characterized in that the failure types include transaction failures, system failures and media failures.
10. The MGR-based high-availability failover method of claim 9, characterized in that the failure report includes the cause of the failure and a suggested solution.
CN202111372825.2A 2021-11-18 2021-11-18 High-availability fault switching method based on MGR Pending CN114035997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111372825.2A CN114035997A (en) 2021-11-18 2021-11-18 High-availability fault switching method based on MGR

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111372825.2A CN114035997A (en) 2021-11-18 2021-11-18 High-availability fault switching method based on MGR

Publications (1)

Publication Number Publication Date
CN114035997A (en) 2022-02-11

Family

ID=80138216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111372825.2A Pending CN114035997A (en) 2021-11-18 2021-11-18 High-availability fault switching method based on MGR

Country Status (1)

Country Link
CN (1) CN114035997A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114826905A (en) * 2022-03-31 2022-07-29 西安超越申泰信息科技有限公司 Method, system, equipment and medium for switching management service of lower node
CN117240694A (en) * 2023-11-01 2023-12-15 广东保伦电子股份有限公司 Method, device and system for switching active and standby hot standby based on keepaled

Similar Documents

Publication Publication Date Title
US10949280B2 (en) Predicting failure reoccurrence in a high availability system
CN107291787B (en) Main and standby database switching method and device
US7191198B2 (en) Storage operation management program and method and a storage management computer
US7533292B2 (en) Management method for spare disk drives in a raid system
US8347143B2 (en) Facilitating event management and analysis within a communications environment
CN108710673B (en) Method, system, computer device and storage medium for realizing high availability of database
CN114035997A (en) High-availability fault switching method based on MGR
CN110178121B (en) Database detection method and terminal thereof
US7730029B2 (en) System and method of fault tolerant reconciliation for control card redundancy
CN109120522B (en) Multipath state monitoring method and device
JP2009510624A (en) A method and system for verifying the availability and freshness of replicated data.
WO2007147327A1 (en) Method, system and apparatus of fault location for communicaion apparatus
CN115114064A (en) Micro-service fault analysis method, system, equipment and storage medium
US10938623B2 (en) Computing element failure identification mechanism
CN112069018B (en) Database high availability method and system
EP0632381B1 (en) Fault-tolerant computer systems
CN111198920A (en) Method and device for synchronously determining comparison table snapshot based on database
CN111669452B (en) High-availability method and device based on multi-master DNS (Domain name System) architecture
CN113688017B (en) Automatic abnormality testing method and device for multi-node BeeGFS file system
CN107707402B (en) Management system and management method for service arbitration in distributed system
KR20200106124A (en) Test automation framework for dbms for analysis of bigdata and method of test automation
KR100302899B1 (en) The method of loading relation data in on-line while using database
CN114116885A (en) Database synchronization method, system, device and medium
KR100604552B1 (en) Method for dealing with system troubles through joint-owning of state information and control commands
KR950010488B1 (en) A db data obstacie recovery method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination