CN114035997A - High-availability fault switching method based on MGR - Google Patents
- Publication number
- CN114035997A (application CN202111372825.2A)
- Authority
- CN
- China
- Prior art keywords
- mgr
- node
- database
- priority
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
- G06F11/0793—Remedial or corrective actions
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to the technical field of communication and discloses a high-availability fault switching method based on MGR, which comprises the following steps: S1: build an MGR cluster in single-primary mode; S2: configure a priority switching order for each node of the MGR cluster; S3: configure a priority switching order for the keepalived instance of each node, identical to the priority switching order of the MGR cluster; S4: set a keepalived VIP and connect the application program to the MGR database cluster through the keepalived VIP; S5: when a fault of the current node database is detected, the MGR cluster switches nodes according to the priority order while keepalived switches the VIP according to the same priority order, so that the application program is switched to the next node database. By combining the priority of each MGR cluster node with the keepalived priority, the keepalived VIP switching order is kept consistent with the MGR nodes, which improves the efficiency of high-availability fault switching of MGR nodes and realizes high availability of MGR.
Description
Technical Field
The invention relates to the technical field of communication, and in particular to a high-availability fault switching method based on MGR.
Background
MySQL Group Replication (MGR for short) is a state-machine replication technology based on the Paxos protocol, officially provided by MySQL. MGR guarantees data consistency well, supports automatic switching, provides fault detection, and is generally composed of three or more MySQL nodes. However, in single-primary mode the MGR high-availability failover mechanism is not user-friendly: every node other than the primary node is read-only, so the application client has to detect the high-availability state of the MGR cluster itself and connect to the primary node automatically. Solving the problem of automatically switching the database an application is connected to when an MGR node fails is the key to improving the MGR high-availability mechanism.
Disclosure of Invention
The invention aims to provide a high-availability fault switching method based on MGR which, by combining the priority of each node of an MGR cluster with the keepalived priority, keeps the keepalived VIP switching order consistent with the MGR nodes, improves the efficiency of high-availability fault switching of MGR nodes, and realizes high availability of MGR.
The technical scheme provided by the invention is as follows: a high-availability failover method based on MGR includes the following steps:
S1: build an MGR cluster in single-primary mode;
S2: configure a priority switching order for each node of the MGR cluster;
S3: configure a priority switching order for the keepalived instance of each node, identical to the priority switching order of the MGR cluster;
S4: set a keepalived VIP and connect the application program to the MGR database cluster through the keepalived VIP;
S5: when a fault of the current node database is detected, the MGR cluster switches nodes according to the priority order while keepalived switches the VIP according to the same priority order, so that the application program is switched to the next node database.
The working principle and advantages of the invention are as follows. In single-primary mode it is difficult for an application program to discover the MGR node state automatically, which makes the high-availability failover mechanism unfriendly. The method therefore first configures the priority of each node of the MGR cluster through database parameters on each MGR node, to control the switching order of the MGR nodes; at the same time it configures the switching priority of each node in keepalived, to control the switching order of the VIP. The application connects to the database cluster through the keepalived VIP (virtual IP), and by combining the keepalived priority with the MGR priority the VIP switching order is kept consistent with the MGR nodes, realizing high-availability MGR failover.
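The core idea above, namely that two independent mechanisms (the MGR primary election and the keepalived VRRP election) pick the same successor because they share one priority table, can be sketched in Python. The node names and priority values mirror the embodiment later in the description (A=99, B=80, C=70); the election logic here is a deliberately simplified model, not the actual Paxos-based MGR election.

```python
# Simplified model: both MGR and keepalived choose the highest-priority
# *healthy* node. Because they consult the same priority table, the new
# MGR primary and the new VIP holder are always the same node.
PRIORITY = {"A": 99, "B": 80, "C": 70}  # values from the embodiment

def elect(healthy_nodes, priority=PRIORITY):
    """Return the highest-priority node among the healthy ones."""
    return max(healthy_nodes, key=priority.__getitem__)

# Normal operation: node A is primary and holds the VIP.
healthy = {"A", "B", "C"}
assert elect(healthy) == "A"

# Node A fails: MGR elects B as primary, keepalived moves the VIP to B,
# the same node, so the application reconnects to a writable primary.
healthy.discard("A")
assert elect(healthy) == "B"
```

If the two priority tables were allowed to diverge, the VIP could land on a read-only secondary, which is exactly the failure mode the shared configuration prevents.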
Further, the priority order in S2 is the master-slave node switching order.
The node database normally used by the application program is set as the master node database with the highest priority; the remaining nodes to be switched to are slave node databases whose priorities decrease in turn. The master node database provides the daily functions used by the application program, while the slave node databases remain in hot standby.
Further, the S2 specifically includes:
s21: configuring the priority of each node of the MGR cluster according to the switching sequence of the master node and the slave node;
s22: and simulating the fault of the MGR main node database, and testing whether the switching sequence of the MGR node is correct.
After the priorities are set according to the master-slave switching order, a fault or crash of the master node database is simulated to test whether the MGR cluster switches nodes in the configured order.
Further, the priority order in S2 may be configured automatically by the system as a master-slave order based on the comprehensive performance of each node database.
If there are too many MGR nodes to set conveniently by hand, the system can analyze and evaluate the comprehensive performance of each node database and order the master and slave nodes by performance, avoiding the trouble of manual, node-by-node configuration.
Further, the overall performance includes performance parameters and stability of the database.
The performance parameters reflect the running performance of the database server and the operating efficiency of the application program on it; the stability reflects the reliability of the database server. When the system sets priorities automatically, it determines the optimal switching order mainly by judging performance and reliability together.
Further, the S5 specifically includes:
S5-1: when a fault of the current master node database is detected, the MGR cluster switches nodes according to the master-slave switching order, while keepalived switches the VIP in the same order, so that the application program is switched to the next node database.
S5-2: when recovery of the master node database is detected, the MGR cluster switches back to the master node, while keepalived switches the VIP in the same order, so that the application program is switched back to the master node database.
After the master node database fails and operation has successfully been switched to a slave node database, and the failed master node database has then been repaired, the system detects that the master node database has returned to a normal state; the MGR cluster switches back to the master node first, and keepalived likewise switches the VIP back to the master node, so that the application program reconnects to the master node database. In this way a slave node database does not stay in the running role for long; operation is switched back to the master node database as soon as possible, ensuring the efficiency and reliability of the whole MGR cluster.
Further, the method also comprises the step of S6: and recording related information of database failure and switching process in the operation process, and providing an inquiry function.
While a node database is failing and nodes are being switched, the system records and stores the relevant information of the process, which facilitates later analysis and summarization and allows the required information to be queried quickly by keyword.
Further, the method also comprises the step of S7: and analyzing the recorded related information, judging the fault type and generating a fault condition report.
The system provides an intelligent analysis function for the fault-related information: it judges the fault type from the fault data and includes the fault type in the generated fault condition report.
Further, the failure categories include transaction failures, system failures, and media failures.
The system intelligently identifies the type of the fault according to the recorded fault data, and is convenient for subsequent repair.
Further, the fault condition report includes the fault occurrence reason and the suggested solution.
By combining the fault data with other first-hand data from the switching process, the system's intelligent analysis further infers the cause of the fault and proposes a suggested solution. This reduces dependence on professionals, lets ordinary staff carry out simple repairs under the system's guidance, and facilitates later summarization, prevention and control.
Drawings
Fig. 1 is a logic block diagram of a high availability failover method based on an MGR according to a first embodiment of the present invention.
Detailed Description
The invention is described in further detail below through specific embodiments.
Embodiment one:
As shown in Fig. 1, this embodiment discloses a high-availability fault switching method based on MGR, which specifically includes the following steps:
s1: and building the MGR cluster of the single master mode. In this embodiment, three MGR database servers form an MGR cluster, and an a node, a B node, and a C node are configured for the three MGR databases, and are set as a single master mode.
S2-1: and configuring the priority of each node of the MGR cluster according to the switching sequence of the master node and the slave node. The node A is defined as a main node, and the configuration priority of the node A database is 99. And defining the node B and the node C as slave nodes, wherein the configuration priority of the node B database is 80, and the configuration priority of the node C database is 70.
S2-2: and simulating the fault of the MGR main node database, and testing whether the switching sequence of the MGR node is correct. Simulating the crash of the main node database, closing the main node database, checking which slave node database becomes a new main node database, and if the system is switched to the B node database, the switching sequence of the MGR nodes is correct.
S3: and configuring a priority switching sequence for the keepalived of each node, wherein the keepalived priority switching sequence is the same as the priority switching sequence of the MGR cluster. Keepalived is configured for the three MGR database servers, the keepalived of the A node database is configured with the priority 99, the keepalived of the B node database is configured with the priority 80, and the keepalived of the C node database is configured with the priority 70.
S4: and setting keepalived VIP, and connecting the application program to the MGR database cluster through the keepalived VIP. Configuring the same keepalive VIP for the three MGR database servers, adding application program authority on the three MGR database servers, connecting the application program to the MGR database cluster through the keepalive VIP, and allowing only a single node to serve when the MGR cluster where the VIP is located provides services.
S5-1: when a fault of the current master node database is detected, the MGR cluster switches nodes according to the master-slave switching order, while keepalived switches the VIP in the same order, so that the application program is switched to the next node database. When a fault occurs while the master node database is running, the MGR cluster switches to the slave node database, namely the node B database. At the same time keepalived determines that the node A database is in a fault state and switches to the node B database, which has the second-highest priority; node B becomes the new keepalived master, the VIP is switched to the new database server, and the automatic switching of the application program is complete.
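From the application's point of view, such a failover appears as a short window in which connections to the VIP fail and must be retried. A minimal retry loop is sketched below with a stubbed connect function standing in for a real MySQL driver; the stub, its timings and the helper names are all hypothetical, not part of the patent.

```python
import time

def connect_with_retry(connect, retries=5, delay=0.1):
    """Try to connect to the VIP, retrying while the VIP is moving.

    `connect` is any callable that returns a connection object or raises
    ConnectionError; a real application would pass a MySQL driver call here.
    """
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error

# Stub: the first two attempts fail (the VIP is being switched to node B),
# after that the new primary answers.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ConnectionError("VIP not reachable yet")
    return "connection-to-node-B"

assert connect_with_retry(fake_connect, delay=0) == "connection-to-node-B"
assert attempts["n"] == 3
```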
S5-2: when recovery of the master node database is detected, the MGR cluster switches back to the master node, while keepalived switches the VIP in the same order, so that the application program is switched back to the master node database. After the staff finish repairing the node A database server and the node A database returns to normal, the MGR cluster switches back to the master node database, namely the node A database. At the same time keepalived determines that the node A database is in a normal state and switches back to the node A database, which has the highest priority; node A becomes the keepalived master again, the VIP is switched back to the master node database server, and the application program completes the switch back to the master node database.
S6: and recording related information of database failure and switching process in the operation process, and providing an inquiry function. The system records the relevant data information from the failure of the main node to the completion of the switching process, stores the relevant data information in the back-end server, improves the query function, and enables the staff to search the required relevant data information according to the keywords of the application program, the time, the server and the like.
S7: and analyzing the recorded related information, judging the fault type and generating a fault condition report. The system receives relevant information from the occurrence of the fault to the completion of the switching process, wherein the relevant information comprises database fault data, database switching data, VIP switching data and application program operation data. And integrating the data information, and firstly judging the fault type of the database by an intelligent algorithm, wherein the fault type mainly comprises a transaction fault, a system fault and a medium fault. The transaction fault refers to a common reason that a certain transaction does not run to a normal termination point due to various reasons in the running process, and the input data has error operation overflow and is locked by violating certain integrity limit. The system fault value causes for some reason a sudden stop in the normal operation of the entire system, causing all running transactions to terminate in an abnormal manner. When a system failure occurs, all the information in the database buffer area in the memory is lost, but the data stored on the external storage device is not affected. Media failures refer to hardware failures that cause data stored in external memory to be partially or completely lost, with media failures being much less likely than the first two types of failures, but most disruptive. And after the fault type is judged and identified, further analyzing the fault occurrence reason, sorting and correcting the fault data, comparing the fault data with the cases in the cloud database, selecting the case with the most similar data as the fault occurrence reason, searching the big data according to the reason, and selecting the most suitable solution. And (4) performing content sorting on the fault types, the fault occurrence reasons and the solutions to generate a comprehensive fault condition report.
Embodiment two:
This embodiment differs from embodiment one in the step of configuring the priority order of the nodes of the MGR cluster: here the system tests the performance and stability of the three MGR databases. The performance parameters of a database are obtained by testing the running state of the server cluster of the database under test, and its stability is evaluated from the resource allocation observed during that test.
After the performance parameters and stability data have been obtained in this way, they are compared and scored, each component out of 50 points. The two scores are added; the database with the highest total score is set as the master node database and given the highest priority, and the rest are set as slave node databases whose priorities decrease in order of their scores.
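The scoring and priority assignment described above can be sketched as follows. The fixed priority ladder 99/80/70 follows the embodiment, while the per-node score values are made up for illustration.

```python
# Hypothetical per-node scores, each component out of 50 as in the text.
scores = {
    "A": {"performance": 48, "stability": 47},
    "B": {"performance": 42, "stability": 40},
    "C": {"performance": 35, "stability": 38},
}

def assign_priorities(scores, ladder=(99, 80, 70)):
    """Rank nodes by total score and map them onto the priority ladder."""
    ranked = sorted(scores, key=lambda n: sum(scores[n].values()), reverse=True)
    return dict(zip(ranked, ladder))

priorities = assign_priorities(scores)
assert priorities == {"A": 99, "B": 80, "C": 70}
```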
Accordingly, the node A database has the highest total score and becomes the master node database with priority 99; the node B database is second, with priority 80; the node C database has the lowest total, with priority 70.
A priority switching order is then configured for the keepalived of each node, identical to the priority switching order of the MGR cluster: the keepalived of the node A database is configured with priority 99, that of the node B database with priority 80, and that of the node C database with priority 70.
The remaining steps of this embodiment are the same as those of the first embodiment.
The foregoing are merely exemplary embodiments of the present invention; no attempt is made to describe structural details in more depth than is necessary for a fundamental understanding by those skilled in the art, and the description together with the drawings makes apparent to those skilled in the art how the invention may be embodied in practice. It should be noted that those skilled in the art can make several changes and modifications without departing from the structure of the present invention; these shall also be regarded as falling within the protection scope of the present invention and do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection of this application shall be determined by the contents of the claims, and the embodiments and other descriptions in the specification may be used to interpret the contents of the claims.
Claims (10)
1. A high-availability fault switching method based on MGR, characterized by comprising the following steps:
S1: building an MGR cluster in single-primary mode;
S2: configuring a priority switching order for each node of the MGR cluster;
S3: configuring a priority switching order for the keepalived of each node, wherein the keepalived priority switching order is the same as the priority switching order of the MGR cluster;
S4: setting a keepalived VIP, and connecting the application program to the MGR database cluster through the keepalived VIP;
S5: when a fault of the current node database is detected, the MGR cluster switches nodes according to the priority order while keepalived switches the VIP according to the same priority order, so that the application program is switched to the next node database.
2. The MGR-based high availability failover method of claim 1, wherein: the priority switching order in S2 is the master-slave node switching order.
3. The MGR-based high availability failover method of claim 2, wherein: the S2 specifically includes:
s2-1: configuring the priority of each node of the MGR cluster according to the switching sequence of the master node and the slave node;
s2-2: and simulating the fault of the MGR main node database, and testing whether the switching sequence of the MGR node is correct.
4. The MGR-based high availability failover method of claim 2, wherein: the priority order in S2 is configured automatically by the system as a master-slave order according to the comprehensive performance of each node database.
5. The MGR-based high availability failover method of claim 4 wherein: the overall performance includes performance parameters and stability of the database.
6. The MGR-based high availability failover method of claim 2, wherein: the S5 specifically includes:
s5-1: when the fault of the current main node database is detected, the MGR cluster switches nodes according to the switching sequence of the main node and the slave node, and simultaneously keepalived switches the VIP according to the same sequence, so that an application program is switched to the next node database;
S5-2: when recovery of the master node database is detected, the MGR cluster switches back to the master node, while keepalived switches the VIP in the same order, so that the application program is switched back to the master node database.
7. The MGR-based high availability failover method of claim 1, further comprising S6: recording relevant information of database faults and the switching process during operation, and providing a query function.
8. The MGR-based high availability failover method of claim 7, further comprising S7: analyzing the recorded information, judging the fault type and generating a fault condition report.
9. The MGR-based high availability failover method of claim 8, wherein: the failure categories include transaction failures, system failures, and media failures.
10. The MGR-based high availability failover method of claim 9, wherein: the fault condition report includes the cause of the fault and a proposed solution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111372825.2A CN114035997A (en) | 2021-11-18 | 2021-11-18 | High-availability fault switching method based on MGR |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111372825.2A CN114035997A (en) | 2021-11-18 | 2021-11-18 | High-availability fault switching method based on MGR |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114035997A true CN114035997A (en) | 2022-02-11 |
Family
ID=80138216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111372825.2A Pending CN114035997A (en) | 2021-11-18 | 2021-11-18 | High-availability fault switching method based on MGR |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114035997A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114826905A (en) * | 2022-03-31 | 2022-07-29 | 西安超越申泰信息科技有限公司 | Method, system, equipment and medium for switching management service of lower node |
CN117240694A (en) * | 2023-11-01 | 2023-12-15 | 广东保伦电子股份有限公司 | Method, device and system for switching active and standby hot standby based on keepaled |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10949280B2 (en) | Predicting failure reoccurrence in a high availability system | |
CN107291787B (en) | Main and standby database switching method and device | |
US7191198B2 (en) | Storage operation management program and method and a storage management computer | |
US7533292B2 (en) | Management method for spare disk drives in a raid system | |
US8347143B2 (en) | Facilitating event management and analysis within a communications environment | |
CN108710673B (en) | Method, system, computer device and storage medium for realizing high availability of database | |
CN114035997A (en) | High-availability fault switching method based on MGR | |
CN110178121B (en) | Database detection method and terminal thereof | |
US7730029B2 (en) | System and method of fault tolerant reconciliation for control card redundancy | |
CN109120522B (en) | Multipath state monitoring method and device | |
JP2009510624A (en) | A method and system for verifying the availability and freshness of replicated data. | |
WO2007147327A1 (en) | Method, system and apparatus of fault location for communicaion apparatus | |
CN115114064A (en) | Micro-service fault analysis method, system, equipment and storage medium | |
US10938623B2 (en) | Computing element failure identification mechanism | |
CN112069018B (en) | Database high availability method and system | |
EP0632381B1 (en) | Fault-tolerant computer systems | |
CN111198920A (en) | Method and device for synchronously determining comparison table snapshot based on database | |
CN111669452B (en) | High-availability method and device based on multi-master DNS (Domain name System) architecture | |
CN113688017B (en) | Automatic abnormality testing method and device for multi-node BeeGFS file system | |
CN107707402B (en) | Management system and management method for service arbitration in distributed system | |
KR20200106124A (en) | Test automation framework for dbms for analysis of bigdata and method of test automation | |
KR100302899B1 (en) | The method of loading relation data in on-line while using database | |
CN114116885A (en) | Database synchronization method, system, device and medium | |
KR100604552B1 (en) | Method for dealing with system troubles through joint-owning of state information and control commands | |
KR950010488B1 (en) | A db data obstacie recovery method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |