CN114546427A - MySQL high-availability implementation method based on DNS and MGR - Google Patents
- Publication number
- CN114546427A (application CN202210154698.7A)
- Authority
- CN
- China
- Prior art keywords
- database
- dns
- node
- cluster
- mgr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F8/61 — Software deployment; Installation
- G06F11/0793 — Remedial or corrective actions (error or fault processing not based on redundancy)
- G06F16/21 — Design, administration or maintenance of databases
- G06F16/27 — Replication, distribution or synchronisation of data between databases; distributed database system architectures
- G06F9/4451 — User profiles; Roaming (configuring for program initiating)
- G06F9/4482 — Procedural execution paradigms
- H04L41/0631 — Management of faults using root cause analysis
- H04L41/0663 — Performing the actions predefined by failover planning, e.g. switching to standby network elements
Abstract
The invention discloses a MySQL high-availability implementation method based on DNS and MGR. The method comprises: installing and bringing online a database cluster; initializing and configuring a probe; having the probe monitor the running cluster; judging from the detected cluster architecture, transaction backlog and domain-name IP whether the cluster is operating normally; and, once an anomaly occurs, completing the database switch through the probe and MGR together, so that applications automatically connect to the switched database cluster. Because the design combines the overall architecture of the application system with the characteristics of DNS, MGR and the probe, the structure is simple. Moreover, the probe follows an interface-oriented design that is not tied to any particular language and can be implemented in a language the user is familiar with, which eases maintenance.
Description
Technical Field
The invention relates to the technical field of databases, in particular to a MySQL high-availability implementation method based on DNS and MGR.
Background
As the data-storage component of an information system, a database mainly provides data storage and access, and is often the core of the system. With the rapid iteration and development of database technology, vendors provide ever more functions and features, including database high-availability technology.
Database high availability is the ability, provided by features of the database software itself, to continue providing service after some nodes fail because of server, operating-system or database-software faults. MySQL, widely used across industries, has successively provided asynchronous master-slave replication, semi-synchronous master-slave replication and, most recently, the distributed MGR technology, and these schemes have gradually matured.
MGR (MySQL Group Replication) is a synchronization technology, provided by MySQL since version 5.7, that is built on the Paxos distributed protocol. It performs conflict checking and analysis on the transaction logs synchronized between the primary and standby databases, ensuring that database service and data remain consistent even in extreme scenarios.
The Domain Name Service (DNS) provides the mapping between domain names and IP addresses. Its main purpose is to let an application system reach the real service by configuring a domain-name server and accessing a domain name with business meaning, without caring about the IP address, thereby unifying application configuration and reducing maintenance effort. DNS is becoming a standard part of information-system construction across industries, and the market already offers mature solutions, including the commercial ZDNS and the open-source bind and bind-dlz.
Considering the above technologies together, databases and DNS belong to different technical fields and have not been effectively integrated. Although the database provides complete synchronization schemes such as master-slave replication and MGR, applications cannot detect that a switch has completed and must implement that capability themselves. The common industry schemes and their trade-offs are analyzed below:
1. The VIP-based switching scheme. An independent control server probes the database nodes; when probing of the master node fails, failover is started and a secondary node is promoted to master. However, because the detection is performed by nodes outside the database, failovers triggered by network isolation, a hung server and similar anomalies can split the cluster (split-brain) and lose data. Because of the special role of the control node, it is usually deployed single-node or active-standby, and in complex fault scenarios the control node itself may fail, leaving the switch unexecuted. In addition, such schemes are mostly third-party implementations whose core switching logic is strongly tied to the database version, so an incompatibility with the version actually in use — especially the current MySQL 8 — can defeat the whole high-availability scheme.
2. The intermediate-proxy scheme. A proxy layer is established between the database and the application; the proxy identifies the database master node and exposes a protocol close to the database's own, to which the application connects. After the master node fails, the proxy identifies the new node and re-establishes the application connections to it. However, because an additional proxy layer is introduced, every database request must pass through it, which inevitably makes the overall application architecture more complex and each request path longer and slower. Proxy compatibility is usually achieved by third parties emulating common MySQL operations, so operations the proxy cannot recognize inevitably exist and require extra adaptation work; furthermore, the proxy layer itself brings its own high-availability and performance-bottleneck problems.
3. The DNS health-check scheme. The DNS server checks the database periodically and, on finding that the original master node has failed, updates the corresponding domain name's IP to the new master's IP. However, this scheme only addresses how applications reconnect after a switch; it is not complete — it does not cover how to fail the database over without losing data and bring applications onto the new master node via the updated domain-name information. It also depends on a specific DNS product implementing active database probing, which limits its general applicability.
Disclosure of Invention
In view of the defects in the prior art, the object of the invention is to provide a MySQL high-availability implementation method based on DNS and MGR.
In order to achieve the above object, the present invention provides a MySQL high-availability implementation method based on DNS and MGR, including:
step 1, installing and online configuring a database cluster;
step 2, initializing the probe, and then detecting a database cluster in operation;
step 3, judging whether the database cluster normally operates according to the database cluster architecture, the transaction accumulation condition and the domain name IP acquired by detection, and specifically comprising the following steps:
if the database cluster main node IP is equal to the domain name IP and the database cluster node is equal to the configuration node, indicating that the cluster runs normally;
if the IP of the database cluster master node is equal to the domain name IP, but the database cluster node is not equal to the configuration node, indicating that the slave node is abnormal in operation, and triggering an alarm;
if the database cluster main node IP is not equal to the domain name IP and the local node IP is equal to the database cluster main node IP, triggering a fault switching process;
and if the database cluster main node IP is not equal to the domain name IP and the local node IP is not equal to the database cluster main node IP, triggering an alarm.
Further, the failover process includes:
selecting a new main node through a distributed consistency protocol, terminating the original main node database instance or entering an offline mode, and adjusting the detection interval time;
analyzing the running state of the database, and determining whether the new main node has transactions that are not yet fully applied;
after the application of the accumulated transaction is finished, if the node is a new main node, the probe initiates a DNS switching process;
after the DNS is successfully switched, if the DNS has a cache, executing the refreshing operation of the DNS cache;
checking via PING that the domain name IP matches the new main node IP, the fault switching being completed if the check succeeds;
and recovering the detection interval as the configuration interval time and continuously executing the detection.
Further, the DNS switching process comprises a consistency check of the domain name and IP before updating, updating the domain name's IP, and a consistency check of the domain name and IP after updating. A commercial DNS device is checked and updated through the API interface it provides, while an open-source DNS device is updated through a remote command or an SQL request. If the final check succeeds, the DNS switch has succeeded; otherwise the switch is aborted.
Further, the step 1 specifically includes:
step 1.1, resource application and configuration are carried out;
step 1.2, deploying database software;
step 1.3, optimizing parameter configuration;
step 1.4, initializing and starting a database;
step 1.5, configuring the user and the authority;
step 1.6, MGR plug-in and parameter configuration is carried out;
step 1.7, starting MGR operation;
step 1.8, configuring and starting each database node probe, so as to complete the configuration of the database node and gradually complete the configuration of all the database nodes;
and step 1.9, the application system connects the database cluster according to the configured information such as the domain name, the user name, the password and the like of the database.
Further, the step 2 specifically includes:
step 2.1, reading the fixed-path configuration file, and checking that the configuration file format is legal and the parameter values are reasonable;
step 2.2, parsing out the database login information, the variables to query after login, the cluster structure and the transaction backlog;
step 2.3, checking whether the cluster structure is matched with the information in the configuration file, if the node set of the configuration information is equal to or contains the real-time cluster architecture of the database, the checking returns success, otherwise, the checking returns failure;
step 2.4, executing ping operation according to the domain name to obtain an IP address corresponding to the domain name;
and 2.5, checking the DNS update interface, printing initialization information, and reasonably checking initialization configuration.
Further, after the fault switching process is completed, repairing the fault specifically includes:
analyzing the root cause of the database cluster for fault switching, and determining whether the components such as an operating system, a database, a probe and the like need to be optimized according to the analysis result;
checking data such as an operating system running state and a log, confirming whether repair operation on the operating system level needs to be executed, and restarting the operating system if fault switching is caused by operating system defects;
checking a database error log, confirming whether the repair operation of the database instance needs to be executed or not, and restarting the database if the database instance or the process abnormally triggers the fault switching;
confirming whether key parameters of a database of a fault node are normal or not, wherein the key parameters comprise related parameters such as an offline mode, read-only configuration and the like;
restarting the MGR, automatically recovering data after starting operation, and adding the fault node into the database cluster again;
and executing probe starting operation, and if the log is normal and the initialization is successful, indicating that the fault repair is successful.
Has the advantages that: the invention is suitable for the application using MySQL database, and has the following advantages:
1. Completely transparent to the application, with a broad application base. The scheme preserves the native MySQL standard end to end — application coding, driver, SQL parsing and execution — and is therefore entirely transparent to the application system. When an application migrates from another high-availability scheme to this one, only the necessary standardization changes are required before it can go into production.
2. Built entirely on a distributed consistency protocol, hence safe and reliable. The scheme is designed around the distributed consistency protocol provided by MGR: every operation that affects the database and cluster structure, such as transaction certification and view switching, goes through consistency negotiation and is carried out only after more than half of the members vote in favor, and the probe in turn acts on the consistent switch result (such as the updated view). The whole scheme is therefore safe and reliable under any condition.
3. Based on common infrastructure, with controllable operation and cost. The scheme does not change the overall system architecture: the common application-service-plus-database structure is retained, and no intermediate proxy service is introduced. DNS, as a basic data-center resource, meets the scheme's requirements whether commercial DNS software (such as ZDNS) or an open-source DNS (such as bind or bind-dlz) is used. During failover, the probe actively switches the DNS according to the database's consistent switch result; the switch call can be customized for the target DNS (an API interface, a remote command, or an SQL request), so no custom modification of the DNS itself is needed, keeping cost and operation controllable.
4. Simple structure, convenient to configure and maintain. In summary, the high-availability scheme is designed around the overall architecture of the application system itself and the characteristics of DNS, MGR and the probe, so the architecture is simple. Meanwhile, the probe follows an interface-oriented design that is not limited to any language and can be implemented in a language the user is familiar with, easing maintenance.
Drawings
FIG. 1 is a schematic flow diagram of an installation and on-line configuration of a database cluster;
FIG. 2 is a schematic flow chart of initializing a probe and probing a database cluster;
FIG. 3 is a schematic diagram of a failover process;
FIG. 4 is a schematic flow chart for repairing a fault;
FIG. 5 is an architecture diagram of an application system accessing DNS directly;
fig. 6 is an architecture diagram of an application system accessing a DNS cache.
Detailed Description
The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments, which are carried out on the basis of the technical solution of the invention. It should be understood that these embodiments only illustrate the invention and are not intended to limit its scope.
As shown in fig. 1 to 4, an embodiment of the present invention provides a MySQL high-availability implementation method based on DNS and MGR, including:
step 1, installing and online configuring a database cluster. Specifically, referring to fig. 1, step 1 includes:
and 1.1, resource application and configuration are carried out.
During system onboarding, the application-system owner applies for resources such as servers and domain names through an existing standard process (e.g. ITSM). After the application is submitted and approved, the responsible teams allocate and configure the resources. For example, the system-management department allocates the database servers and configures the IP addresses, security policies and monitoring policies; the network-management department allocates domain names for the assigned IPs and configures the related views and permissions.
And step 1.2, deploying database software.
In this step the database-management department installs and deploys the database software. It covers installing the operating-system dependency packages, configuring the operating-system user, tuning kernel parameters, creating the database directories, installing the database software, and adjusting directory permissions.
And 1.3, optimizing parameter configuration.
Specifically, the relevant database parameter values are calculated from the server configuration (such as the number of CPU cores and total memory size). For example, innodb_thread_concurrency is set to the server's CPU count, innodb_buffer_pool_size to 70% of total server memory, slave_parallel_workers to the number of cores, and a unique server-id is computed. Other key parameter settings are detailed in Table 1:
TABLE 1
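The calculation described above can be sketched in code. The 70%-of-memory and cores-based values follow the text; the helper name `derive_mysql_params` and the server-id formula (combining the last two octets of the host IP) are illustrative assumptions, not part of the original scheme.

```python
# Sketch of the parameter derivation described above. The server-id formula
# (last two IP octets) is an assumed uniqueness scheme for illustration only.

def derive_mysql_params(cpu_cores: int, mem_bytes: int, host_ip: str) -> dict:
    """Compute key MySQL tuning parameters from the server's resources."""
    octets = host_ip.split(".")
    # Assumed scheme: combine the last two octets of the node IP into a unique id.
    server_id = int(octets[-2]) * 256 + int(octets[-1])
    return {
        "innodb_thread_concurrency": cpu_cores,
        "innodb_buffer_pool_size": int(mem_bytes * 0.70),  # 70% of total memory
        "slave_parallel_workers": cpu_cores,
        "server_id": server_id,
    }

params = derive_mysql_params(cpu_cores=16, mem_bytes=64 * 1024**3, host_ip="10.0.12.34")
```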
And step 1.4, initializing and starting the database.
Using the deployed database software together with the optimized parameter file and directory configuration, perform the database initialization. This mainly covers generating the system tablespace, the redo log files and the doublewrite buffer files, creating the undo tablespaces and temporary sort files, and generating a temporary password. After initialization completes, execute the command to start the database.
And step 1.5, configuring the user and the authority.
After database initialization completes, the randomly generated password is inconvenient for unified management and, being recorded in the database log, poses a security risk. Users required by the application and by operations must also be created; the user list is shown in Table 2:
TABLE 2
Note that the users above follow the principle of least privilege: table- or database-level permissions should be granted according to actual need. In particular, the CONNECTION_ADMIN and SUPER privileges must not be granted.
And step 1.6, MGR plug-in and parameter configuration is carried out.
To give the database full MGR (MySQL Group Replication) support, two plug-ins must be installed: group_replication, which implements the core MGR function, and mysql_clone, which performs full data synchronization during fault repair. MGR-related parameters must also be set; the recommended settings are shown in Table 3:
TABLE 3
Step 1.7, start MGR operation.
After the MGR plug-in and parameter configuration is finished, restart the database instance so the parameters take effect. After the restart, set up the group_replication_recovery channel and then start group replication. On the first node of the database cluster, set bootstrap mode before starting group replication and turn it off once startup completes; then start group replication on the other nodes.
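The bootstrap sequence of step 1.7 can be sketched as the SQL statements below. The statement text follows the common MGR bootstrap pattern rather than being quoted from this method, and the replication user and password are placeholders.

```python
# Illustrative SQL sequence for starting group replication (step 1.7).
# The recovery-channel statement and the bootstrap toggling follow the usual
# MGR pattern; 'repl'/'***' are placeholder credentials.

def mgr_start_statements(first_node: bool) -> list:
    stmts = [
        "CHANGE REPLICATION SOURCE TO SOURCE_USER='repl', SOURCE_PASSWORD='***' "
        "FOR CHANNEL 'group_replication_recovery'",
    ]
    if first_node:
        # The first node boots the group, then turns bootstrap mode off again.
        stmts += [
            "SET GLOBAL group_replication_bootstrap_group = ON",
            "START GROUP_REPLICATION",
            "SET GLOBAL group_replication_bootstrap_group = OFF",
        ]
    else:
        stmts.append("START GROUP_REPLICATION")
    return stmts
```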
And step 1.8, configuring and starting each database node probe, so as to complete the configuration of the database node and gradually complete the configuration of all the database nodes.
When configuring and starting each database node probe, according to the applied domain name address and database cluster configuration, perfecting a probe configuration file, executing probe starting operation, and observing whether the log is normal and the initialization is successful.
And step 1.9, the application system connects the database cluster according to the configured information such as the domain name, the user name, the password and the like of the database.
Two points deserve emphasis. First, the MGR configuration is adjusted according to the specification above to ensure that, after a fault occurs, the switching operation completes within the expected time frame. Second, the application's database users are granted permissions on a least-privilege, as-needed basis, so that once the original master node is expelled from the cluster its existing database connections can be interrupted.
And 2, initializing the probe, and then detecting the running database cluster. Specifically, referring to fig. 2, step 2 specifically includes:
and 2.1, reading the fixed path configuration file, and determining whether the configuration file format is illegal and the parameter value is reasonable.
Step 2.2, parsing out the database login information, the variables to query after login, the cluster structure and the transaction backlog.
And 2.3, checking whether the cluster structure is matched with the information in the configuration file, if the node set of the configuration information is equal to or contains the real-time cluster architecture of the database, the checking returns success, and if not, the returning fails.
And 2.4, executing ping operation according to the domain name to acquire the IP address corresponding to the domain name.
And 2.5, checking the DNS update interface, printing initialization information, and reasonably checking initialization configuration.
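A minimal sketch of the initialization checks in steps 2.1-2.3 follows. All configuration field names are assumptions for illustration; the set-containment test implements the step 2.3 rule that the configured node set must equal or contain the live cluster.

```python
# Sketch of probe initialization (steps 2.1-2.3). Config field names are
# illustrative assumptions, not a documented format.

def validate_probe_config(cfg: dict) -> list:
    """Step 2.1: return a list of problems; an empty list means the config is legal."""
    errors = []
    for key in ("domain", "login_user", "login_password", "cluster_nodes"):
        if key not in cfg:
            errors.append("missing field: " + key)
    interval = cfg.get("probe_interval_seconds", 0)
    if not 1 <= interval <= 3600:
        errors.append("probe_interval_seconds out of range")
    return errors

def cluster_matches(config_nodes: set, live_nodes: set) -> bool:
    """Step 2.3: succeed if the configured node set equals or contains the live cluster."""
    return live_nodes <= config_nodes

cfg = {
    "domain": "orderdb.example.internal",
    "login_user": "probe",
    "login_password": "***",
    "cluster_nodes": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    "probe_interval_seconds": 5,
}
```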
Step 3, judging whether the database cluster normally operates according to the database cluster architecture, the transaction accumulation condition and the domain name IP acquired by detection, and specifically comprising the following steps:
if the database cluster main node IP is equal to the domain name IP and the database cluster node is equal to the configuration node, the cluster is normally operated.
If the IP of the database cluster master node is equal to the IP of the domain name, but the database cluster node is not equal to the configuration node, the fact that the slave node runs abnormally is indicated, and an alarm is triggered.
And if the database cluster main node IP is not equal to the domain name IP and the local node IP is equal to the database cluster main node IP, triggering a fault switching process.
And if the database cluster main node IP is not equal to the domain name IP and the local node IP is not equal to the database cluster main node IP, triggering an alarm.
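The four-way judgment above can be sketched as a pure function over the detected values; the string labels it returns are illustrative names, not part of the method.

```python
# Sketch of the step 3 decision. Inputs are the probed values; the labels
# are illustrative.

def classify(master_ip, domain_ip, local_ip, live_nodes, config_nodes):
    if master_ip == domain_ip:
        # Domain still points at the master: healthy, unless a slave dropped out.
        return "NORMAL" if live_nodes == config_nodes else "ALARM_SLAVE"
    # Domain no longer matches the master: only the master's own probe fails over.
    return "FAILOVER" if local_ip == master_ip else "ALARM"
```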
Referring to fig. 3, the above-mentioned failover process specifically includes:
and selecting a new main node through a distributed consistency protocol, terminating the original main node database instance or entering an offline mode, and adjusting the detection interval time. After the original master node database instance terminates or enters an offline mode, all sessions or connections running at the original master node are interrupted. The probe interval time is preferably adjusted to 1 second.
Analyze the running state of the database and determine whether the new master node still has transactions that have not been fully applied.
After the application of the accumulated transaction is completed, if the node is a new main node, the probe initiates a DNS switching process.
And after the DNS is successfully switched, if the DNS has a cache, executing the refreshing operation of the DNS cache.
Check via PING that the domain name now resolves to the new master node's IP; when the check succeeds, the failover is complete.
And restoring the probe interval to be the configured interval time and continuing to execute the detection.
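The sequence above can be sketched as an ordered action list; the step names are illustrative placeholders for the probe's operations, not the patent's actual interface:

```python
# A hypothetical sketch of the failover sequence driven by the probe. Each
# tuple names a step from the text above; a real probe would execute the
# corresponding operation instead of recording it.
def failover_steps(is_new_master, has_dns_cache, configured_interval=5):
    """Return the ordered probe actions for one failover."""
    steps = [
        ("set_probe_interval", 1),                  # tighten probing during failover
        ("wait_transaction_backlog_applied", None), # let the new master catch up
    ]
    if is_new_master:                               # only the new master switches DNS
        steps.append(("switch_dns", None))
        if has_dns_cache:
            steps.append(("refresh_dns_cache", None))
        steps.append(("check_domain_ip_and_ping", None))
    steps.append(("set_probe_interval", configured_interval))  # restore interval
    return steps
```

The `configured_interval` default of 5 seconds is an assumption for illustration; the patent only states that the interval is shortened (preferably to 1 second) and later restored to its configured value.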
The DNS switching process comprises a consistency check of the domain name and IP before the update, the update of the domain name's IP, and a consistency check of the domain name and IP after the update. A commercial DNS device is checked and updated through the API it provides, while an open-source DNS device is updated through a remote command or an SQL request. If the final verification succeeds, the DNS switch has succeeded; otherwise the DNS switching process exits abnormally.
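The check-update-check pattern can be sketched as follows. The `resolve` and `update` callables stand in for the DNS device's API (commercial) or a remote command / SQL request (open source); both hooks and the example domain are assumptions for illustration:

```python
# A sketch of the DNS switching process: pre-check, update, post-check.
def switch_dns(domain, old_ip, new_ip, resolve, update):
    """Return True when the post-update consistency check passes."""
    if resolve(domain) != old_ip:       # pre-update consistency check
        raise RuntimeError("pre-check failed: domain does not point at old master")
    update(domain, new_ip)              # update the domain's IP record
    if resolve(domain) != new_ip:       # post-update consistency check
        raise RuntimeError("post-check failed: DNS switch did not take effect")
    return True

# Usage with an in-memory stand-in for the DNS device:
records = {"db.example.internal": "10.0.0.1"}
ok = switch_dns("db.example.internal", "10.0.0.1", "10.0.0.2",
                resolve=records.get,
                update=records.__setitem__)
```

The pre-check guards against a stale or concurrent switch (the record must still point at the old master), and the post-check is what the text calls the final verification.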
After the failover process is completed, the fault may further be repaired as follows, as shown in fig. 4:
Analyze the root cause of the database cluster failover, and determine from the analysis result whether components such as the operating system, the database, and the probe need to be optimized.
Check data such as the operating system running state and logs to determine whether an operating-system-level repair is required; if the failover was caused by an operating system defect, restart the operating system.
Check the database error log to confirm whether a repair of the database instance is required; if a database instance or process abnormality triggered the failover, restart the database.
Confirm whether the key database parameters of the failed node are normal, including the offline mode and read-only configuration parameters. See table 4 for details:
| Parameter name | Parameter value |
| --- | --- |
| offline_mode | off |
| super_read_only | on |
| read_only | on |

TABLE 4
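The table-4 check can be sketched as a simple comparison. On a real node the values would come from a query such as `SHOW GLOBAL VARIABLES`; here they are passed in as a plain dict, and the function name is an illustrative assumption:

```python
# Expected values from table 4 for a repaired node rejoining the cluster:
# offline mode disabled, but read-only so it cannot accept writes as a slave.
EXPECTED = {"offline_mode": "off", "super_read_only": "on", "read_only": "on"}

def check_key_parameters(variables):
    """Return the names of parameters that deviate from table 4."""
    return [name for name, want in EXPECTED.items()
            if variables.get(name, "").lower() != want]
```

An empty result means all key parameters are normal; otherwise the returned names indicate which parameters must be corrected before restarting the MGR.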
Restart the MGR; after startup, data is recovered automatically and the failed node rejoins the database cluster.
Start the probe; if the log is normal and initialization succeeds, the fault repair has succeeded.
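On the failed node, the parameter reset and MGR restart reduce to a few SQL statements. The sketch below only assembles them (actually executing them requires a MySQL connection, which is outside this illustration); the statements themselves are standard MySQL / Group Replication syntax:

```python
# A sketch of the repair commands run on the failed node before it rejoins.
# SET GLOBAL super_read_only = ON also implies read_only = ON, but both are
# listed to mirror table 4.
def mgr_rejoin_statements():
    return [
        "SET GLOBAL offline_mode = OFF",
        "SET GLOBAL super_read_only = ON",  # table-4 values for a rejoining node
        "SET GLOBAL read_only = ON",
        "START GROUP_REPLICATION",          # node recovers data and rejoins
    ]
```

After `START GROUP_REPLICATION`, MGR's distributed recovery brings the node up to date automatically, matching the "automatically recovering data" step above.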
Referring to fig. 5 and 6, the present invention can be implemented with the application system accessing the DNS directly, or accessing a DNS cache. With direct DNS access, when a failure occurs, the MGR identifies the failure and performs the switch; the probe, based on the MGR switching result, accesses the API provided by the DNS to initiate the update and check of the domain name; because the original master node of the database cluster has failed, the application system obtains the updated IP of the domain name from the DNS when reconnecting, and thus connects to the new master node of the database cluster. With DNS cache access, when a failure occurs, the MGR likewise identifies the failure and performs the switch; the probe, based on the MGR switching result, accesses the API provided by the DNS to initiate the update and verification of the domain name, and refreshes the updated information to the cache in real time; the application system then obtains the updated IP of the domain name from the DNS cache when reconnecting, and thus connects to the new master node of the database cluster.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that other parts not specifically described belong to the prior art or the common general knowledge of those of ordinary skill in the art. Without departing from the principle of the invention, several improvements and modifications can be made, and these improvements and modifications should also be construed as falling within the scope of the invention.
Claims (6)
1. A MySQL high-availability implementation method based on DNS and MGR is characterized by comprising the following steps:
step 1, installing and online configuring a database cluster;
step 2, initializing the probe, and then detecting a database cluster in operation;
step 3, judging whether the database cluster normally operates according to the database cluster architecture, the transaction accumulation condition and the domain name IP acquired by detection, and specifically comprising the following steps:
if the database cluster main node IP is equal to the domain name IP and the database cluster node is equal to the configuration node, indicating that the cluster runs normally;
if the IP of the database cluster master node is equal to the domain name IP, but the database cluster node is not equal to the configuration node, indicating that the slave node is abnormal in operation, and triggering an alarm;
if the database cluster main node IP is not equal to the domain name IP and the local node IP is equal to the database cluster main node IP, triggering a fault switching process;
and if the database cluster main node IP is not equal to the domain name IP and the local node IP is not equal to the database cluster main node IP, triggering an alarm.
2. The MySQL high-availability implementation method based on DNS and MGR of claim 1, wherein the failover process comprises:
selecting a new main node through a distributed consistency protocol, terminating the original main node database instance or entering an offline mode, and adjusting the probe interval time;
analyzing the running state of the database, and determining whether the new main node has transactions that have not been fully applied;
after the application of the accumulated transaction is finished, if the node is a new main node, the probe initiates a DNS switching process;
after the DNS is successfully switched, if the DNS has a cache, executing the refreshing operation of the DNS cache;
checking the domain name IP, the new main node IP and the PING result, wherein the failure switching is completed if the checking is successful;
and restoring the probe interval to be the configured interval time and continuing to execute the detection.
3. The MySQL high-availability implementation method based on DNS and MGR as claimed in claim 1, wherein the DNS switching process comprises a consistency check of the domain name and IP before the update, the update of the domain name's IP, and a consistency check of the domain name and IP after the update; a commercial DNS device is checked and updated through the API it provides, while an open-source DNS device is updated through a remote command or an SQL request; finally, if the verification succeeds, the DNS switch has succeeded, otherwise the DNS switching process exits abnormally.
4. The MySQL high availability implementation method based on DNS and MGR as claimed in claim 1, wherein the step 1 specifically comprises:
step 1.1, resource application and configuration are carried out;
step 1.2, deploying database software;
step 1.3, optimizing parameter configuration;
step 1.4, initializing and starting a database;
step 1.5, configuring the user and the authority;
step 1.6, MGR plug-in and parameter configuration is carried out;
step 1.7, starting MGR operation;
step 1.8, configuring and starting the probe on each database node, thereby completing the configuration of that database node; this is repeated until all database nodes are configured;
and step 1.9, the application system connects to the database cluster according to the configured domain name, user name, password and other information of the database.
5. The MySQL high availability implementation method based on DNS and MGR as claimed in claim 1, wherein the step 2 specifically comprises:
step 2.1, reading the fixed-path configuration file, and checking whether the configuration file format is legal and the parameter values are reasonable;
step 2.2, analyzing to obtain database login information, a login database query variable, a cluster structure and a transaction accumulation condition;
step 2.3, checking whether the cluster structure matches the information in the configuration file: if the node set in the configuration information is equal to or contains the real-time cluster architecture of the database, the check returns success, otherwise it returns failure;
step 2.4, executing ping operation according to the domain name to obtain an IP address corresponding to the domain name;
and 2.5, checking the DNS update interface, printing initialization information, and validating that the initialization configuration is reasonable.
6. The MySQL high-availability implementation method based on the DNS and the MGR as recited in claim 1, wherein after the fault switching process is completed, the fault is repaired, and the method specifically comprises the following steps:
analyzing the root cause of the database cluster failover, and determining according to the analysis result whether components such as the operating system, the database, and the probe need to be optimized;
checking data such as an operating system running state and a log, confirming whether repair operation on the operating system level needs to be executed, and restarting the operating system if fault switching is caused by operating system defects;
checking a database error log, confirming whether the repair operation of the database instance needs to be executed or not, and restarting the database if the database instance or the process abnormally triggers the fault switching;
confirming whether key parameters of a database of a fault node are normal or not, wherein the key parameters comprise an offline mode and read-only configuration related parameters;
restarting the MGR, automatically recovering data after starting operation, and adding the fault node into the database cluster again;
and executing probe starting operation, and if the log is normal and the initialization is successful, indicating that the fault repair is successful.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210154698.7A CN114546427A (en) | 2022-02-21 | 2022-02-21 | MySQL high-availability implementation method based on DNS and MGR |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114546427A true CN114546427A (en) | 2022-05-27 |
Family
ID=81674839
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210154698.7A Pending CN114546427A (en) | 2022-02-21 | 2022-02-21 | MySQL high-availability implementation method based on DNS and MGR |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114546427A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115934428A (en) * | 2023-01-10 | 2023-04-07 | 湖南三湘银行股份有限公司 | Main disaster recovery backup switching method and device of MYSQL database and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Country or region after: China; Address after: No.4 building, Hexi Financial City, Jianye District, Nanjing City, Jiangsu Province, 210000; Applicant after: Jiangsu Sushang Bank Co.,Ltd.; Address before: No.4 building, Hexi Financial City, Jianye District, Nanjing City, Jiangsu Province, 210000; Applicant before: JIANGSU SUNING BANK Co.,Ltd.; Country or region before: China |