MySQL application layer high-availability system and method suitable for various cloud environments
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a MySQL application layer high-availability system and method suitable for various cloud environments.
Background
There are currently four common approaches to MySQL high availability: first, dual-node master-slave + keepalived/hash; second, MHA + multi-node cluster; third, ZooKeeper + proxy; fourth, SAN shared storage.
The first method (dual-node master-slave + keepalived/hash) has a simple structure and switches directly between the two nodes. Its drawback is that a high-availability mechanism for hash and keepalived must be considered separately; in a cloud environment in particular, because the hash is already virtualized, keepalived is often not supported.
The second method (MHA + multi-node cluster) detects faults and fails over automatically, scales well, and, with three or more nodes, has a lower probability of MySQL being unavailable. Its drawbacks are that it needs more resources than a dual-node setup, that its logic is relatively complex so that locating and resolving problems after a fault is harder, and that split brain may occur under a network partition.
The third method (ZooKeeper + proxy) ensures the high availability of the whole system and scales well. Its drawback is that introducing ZooKeeper makes the logic of the whole system more complex.
The fourth method (SAN shared storage) needs only two nodes, is simple to deploy, has simple switching logic, and ensures strong data consistency. Its drawback is that the high availability of the shared storage itself must be considered, which is expensive.
As can be seen from the above, the common existing MySQL high-availability methods all have significant drawbacks, so a system that has a simple structure and suits various cloud environments is urgently needed.
Disclosure of Invention
In order to solve the problems, the invention provides a MySQL application layer high-availability system suitable for various cloud environments, which is simple in structure and can adapt to various cloud environments.
The technical scheme of the invention is as follows: a MySQL application layer high-availability system suitable for various cloud environments comprises:
a plurality of MySQL instance units;
a guard unit, used for checking whether each MySQL instance unit is healthy;
a state storage unit, used for storing the state information of the MySQL instance units according to the checking result of the guard unit, and for setting a switching flag if a MySQL instance unit is unhealthy;
a switching unit, used for closing the connection to the unhealthy MySQL instance unit and creating a connection to a healthy MySQL instance unit;
and a switching detection unit, which periodically checks for the switching flag and, if the flag is found, calls the switching unit to trigger switching.
The invention can be applied to various cloud environments, such as Alibaba Cloud (Aliyun), Tencent Cloud, Huawei Cloud, Baidu Cloud and the like.
Preferably, there are two MySQL instance units, configured for master-master replication. With master-master replication, data is synchronized automatically after a switch, so no manual data handling is needed after switching.
Preferably, the system further comprises a switching monitoring unit for directly reading the information recorded in the state storage unit, a system state query interface for querying system state information, and a manual switching interface for manually triggered switching. The manual switching interface is mainly used for maintenance: when operations staff need to maintain a MySQL instance unit, they can call the manual switching interface to switch first and then shut the instance down for maintenance. Before and after switching, the system state query interface can be called to check the current state and confirm that the switch was performed correctly. The switching monitoring unit reads the records of the state storage unit directly and raises an alarm promptly when a switch is detected, so that a manual switch back can then be carried out, generally during the off-peak period of the service.
Preferably, the guard unit periodically connects to each MySQL instance unit; if the connection succeeds within a fixed time, it reports healthy to the state storage unit, and otherwise it reports unhealthy.
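The timed connection check described above can be sketched as follows. This is a minimal illustration only: it opens a plain TCP connection rather than a real MySQL session, the function name check_instance is hypothetical, and the 6-second default is taken from the embodiment described later in this document.

```python
import socket


def check_instance(host: str, port: int, timeout_s: float = 6.0) -> str:
    """Return "healthy" if a TCP connection opens within timeout_s, else "unhealthy".

    Assumption: a TCP connect stands in for a real MySQL login; a production
    guard would use a MySQL client library with the same timeout.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return "healthy"
    except OSError:
        return "unhealthy"
```

A guard would call this function on a timer for every monitored instance and write the returned status to the state storage unit.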
In order to ensure the accuracy of the check, it is preferable that there are a plurality of guard units whose connections are independent of each other, and that the switching flag is set in the state storage unit when half or more of the guard units report unhealthy. In the invention, a plurality of guard units perform the check; to ensure reliability and avoid interfering with the service, the guard units' connections are independent of the service connection pool.
As a further preference, the guard units are at least three.
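The "half or more" rule can be illustrated with a short sketch. The function name and the report format (a mapping from guard identifier to status string) are assumptions made for this example, not part of the described system.

```python
from typing import Mapping


def should_set_switch_flag(reports: Mapping[str, str]) -> bool:
    """reports maps a guard id to "healthy" or "unhealthy" for one MySQL instance."""
    if not reports:
        return False  # no guard has reported yet, so there is nothing to act on
    unhealthy = sum(1 for status in reports.values() if status == "unhealthy")
    # "Half or more": comparing 2 * unhealthy with the total avoids fractions.
    return 2 * unhealthy >= len(reports)
```

With three guards, a single unhealthy report does not trigger a switch, while two do. This is why an odd number of at least three guards gives an unambiguous majority.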
Preferably, there are a plurality of switching detection units; after switching starts, the switching flag is deleted once all switching detection units have responded, or once a switching detection unit has failed to respond within the timeout. The switching unit in the invention is mainly used for closing the connection to the old instance (the unhealthy MySQL instance unit) and creating a connection to the new instance; in the intermediate state of switching it can briefly enter connection protection, in which it connects to neither the new nor the old instance and directly returns a "switching in progress" prompt. To ensure the availability of switching, each switching unit switches independently and in parallel, so that the failure of an individual switch under complex network conditions does not affect overall availability. To ensure the consistency of the replicated data, an appropriate wait can be applied during switching; when all switching detection units have switched successfully, the guard unit deletes the switching flag.
Preferably, the state storage unit is Redis, and the switching detection unit is integrated in the form of an API package. The switching detection unit is generally integrated into each project as an API package; it periodically checks for the switching flag and, if the flag is found, calls the switching unit to initiate switching. The state storage unit is Redis, whose Sentinel mechanism ensures high availability. To guard against occasional network failures, the invention can also adopt a polling mechanism to ensure reliability: if an access fails, it can be attempted again.
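The retry behaviour mentioned for accesses to the state storage unit might look like the sketch below. The helper name with_retry, the number of attempts and the delay are all assumptions; the text only states that a failed access can be attempted again.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(op: Callable[[], T], attempts: int = 3, delay_s: float = 0.5) -> T:
    """Run op, retrying on ConnectionError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```

A switching detection unit could wrap its periodic read of the switching flag in such a helper so that a single dropped packet does not count as a missing flag.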
The invention also provides a MySQL application layer high-availability method suitable for various cloud environments, which comprises the following steps:
S1, a scheduled connection-check thread task is started for each MySQL instance unit; if the connection succeeds within a fixed time, the MySQL instance unit is healthy, and otherwise it is unhealthy;
S2, if a MySQL instance unit is unhealthy, switching is performed, namely closing the connection to the unhealthy MySQL instance unit and creating a connection to a healthy MySQL instance unit.
Preferably, in step S1, the MySQL instance unit is checked by a guard unit, and the checking process of the guard unit includes:
S1-1, initializing when the system starts, and starting a scheduled connection-check thread task for each MySQL instance unit;
S1-2, if the connection succeeds within a fixed time, reporting healthy to the state storage unit, and otherwise reporting unhealthy;
S1-3, if the report received by the state storage unit is unhealthy, the state storage unit sets the switching flag and the switching start time.
Preferably, there are a plurality of guard units whose connections are independent of each other, and the state storage unit sets the switching flag when half or more of the guard units report unhealthy. In the invention, a plurality of guard units perform the check; to ensure reliability and avoid interfering with the service, the guard units' connections are independent of the service connection pool. Generally, three or more guard units are deployed.
Preferably, the switching detection unit detects whether the switching flag exists; after switching starts, the guard unit monitors the responses of the switching detection units and deletes the switching flag if all of them have responded or if at least one has timed out without responding, then continues to execute the scheduled task in preparation for the next switch.
In order to prevent frequent switching, it is preferable that the switching is performed only once within a fixed time. More preferably, the switching is performed only once within 3 minutes.
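The "only once within 3 minutes" limit can be sketched as a small guard object. The class name SwitchLimiter and the injectable clock are assumptions introduced so that the example can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class SwitchLimiter:
    """Allows at most one switch per min_interval_s (180 s = 3 minutes)."""

    def __init__(self, min_interval_s: float = 180.0,
                 clock: Callable[[], float] = time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_switch: Optional[float] = None

    def try_acquire(self) -> bool:
        """Return True and record the time if a switch is allowed now."""
        now = self._clock()
        if self._last_switch is not None and now - self._last_switch < self._min_interval_s:
            return False  # a switch already happened inside the window
        self._last_switch = now
        return True
```

The guard would call try_acquire() before setting the switching flag and skip the switch when it returns False, so a flapping network cannot bounce connections back and forth.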
Preferably, in step S2, switching is performed by the switching unit, and the switching flow includes:
S2-1, initializing when the system starts, and starting a scheduled switching-detection thread task for each MySQL instance unit;
S2-2, if the switching detection unit detects the switching flag, calling the switching unit to trigger switching, and reporting to the state storage unit that it has responded to the switch.
Preferably, when switching is performed in step S2-2, the connections to the unhealthy MySQL instance unit are first stopped and a temporary maintenance state is entered; a new connection to the healthy MySQL instance unit is then established, and once it is established the temporary maintenance state is exited and normal business operation resumes.
Preferably, the switching time is 5 to 15 seconds. The invention can set a minimum switching time of 5 seconds, which leaves some time for replication, and a maximum switching time of 15 seconds, which prevents a switch from remaining unsuccessful for a long time.
Compared with the prior art, the invention has the beneficial effects that:
the method realizes MySQL high availability at the application layer, has a simple structure, is based on dual-node master-master replication, is suitable for various cloud environments, can be used online in a plurality of cloud machine rooms, switches quickly, does not affect services, and is imperceptible to users.
Drawings
Fig. 1 is a schematic diagram of the framework of the present invention.
Fig. 2 is a schematic diagram of the framework structure of the MySQL application layer high-availability system of the invention.
Detailed Description
This embodiment is a MySQL application layer high-availability system suitable for various cloud environments; it can be applied to cloud environments such as Alibaba Cloud (Aliyun), Tencent Cloud, Huawei Cloud and Baidu Cloud. As shown in Figs. 1 and 2, the system includes:
two MySQL instance units 1, configured for master-master replication. With master-master replication between the MySQL instance units 1, data is synchronized automatically after a switch, and no manual data handling is needed after switching;
a guard unit 2, used for checking whether each MySQL instance unit 1 is healthy;
a state storage unit 3, used for storing the state information of the MySQL instance units 1 according to the checking result of the guard unit 2, and for setting a switching flag if a MySQL instance unit 1 is unhealthy;
a switching unit 5, used for closing the connection to the unhealthy MySQL instance unit 1 and creating a connection to a healthy MySQL instance unit 1;
and a switching detection unit 4, which periodically checks for the switching flag and, if the flag is found, calls the switching unit 5 to trigger switching.
The system in this embodiment further comprises a switching monitoring unit 6 for directly reading the information recorded in the state storage unit 3, a system state query interface for querying system state information, and a manual switching interface 7 for manually triggered switching. The manual switching interface is mainly used for maintenance: when operations staff need to maintain a MySQL instance unit 1, they can call the manual switching interface to switch first and then shut the instance down for maintenance. Before and after switching, the system state query interface can be called to check the current state and confirm that the switch was performed correctly. The switching monitoring unit reads the records of the state storage unit 3 directly and raises an alarm promptly when a switch is detected, so that a manual switch back can then be carried out, generally during the off-peak period of the service.
In this embodiment, the guard unit 2 periodically connects to each MySQL instance unit 1; if the connection succeeds within a fixed time, it reports healthy to the state storage unit 3, and otherwise it reports unhealthy. To ensure the accuracy of the check, there are generally a plurality of guard units 2, for example at least three, whose connections are independent of each other; when half or more of the guard units 2 report unhealthy, the switching flag is set in the state storage unit 3. To ensure reliability and avoid interfering with the service, the connections of the guard units 2 are independent of the service connection pool.
In this embodiment, there are a plurality of switching detection units 4; after switching starts, the switching flag is deleted once all switching detection units 4 have responded, or once a switching detection unit 4 has failed to respond within the timeout. The switching unit 5 in the invention is mainly used for closing the connection to the old instance (the unhealthy MySQL instance unit 1) and creating a connection to the new instance; in the intermediate state of switching it can briefly enter connection protection, in which it connects to neither the new nor the old instance and directly returns a "switching in progress" prompt. To ensure the availability of switching, each switching unit 5 switches independently and in parallel, so that the failure of an individual switch under complex network conditions does not affect overall availability. To ensure the consistency of the replicated data, an appropriate wait can be applied during switching; when all switching detection units 4 have switched successfully, the guard unit 2 deletes the switching flag.
The state storage unit 3 in the invention can adopt various existing implementations; in this embodiment, the state storage unit 3 is Redis, and the switching detection unit 4 is integrated in the form of an API package. The switching detection unit 4 is generally integrated into each project as an API package; it periodically checks for the switching flag and, if the flag is found, calls the switching unit 5 to trigger switching. The Sentinel mechanism of Redis ensures the high availability of the state storage unit 3. To guard against occasional network failures, the invention can also adopt a polling mechanism to ensure reliability: if an access fails, it can be attempted again.
The main process of the guard unit 2 includes:
1. initializing when the system starts, and starting a scheduled connection-check thread task for each MySQL instance;
2. if the connection succeeds within 6 seconds, reporting healthy to the state storage unit 3, and otherwise reporting unhealthy;
3. reading the number of guard units reporting unhealthy, and if half or more report unhealthy, setting the switching flag and the switching start time;
4. to prevent frequent switching, allowing only one switch within 3 minutes;
5. after switching starts, monitoring the responses of the switching detection units 4, and deleting the switching flag once all have responded or once the timeout has expired without a response;
6. continuing to execute the scheduled task in preparation for the next switch.
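Step 5 of the guard process, the lifecycle of the switching flag, can be sketched with an in-memory stand-in for the state storage unit. The class and method names are hypothetical, the embodiment stores this state in Redis rather than in process memory, and the timeout value is left to the caller because the text does not fix it.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SwitchFlag:
    started_at: float                              # switching start time
    responded: Set[str] = field(default_factory=set)


class InMemoryStateStore:
    """Illustrative stand-in for the state storage unit (Redis in the embodiment)."""

    def __init__(self, detectors: Set[str], response_timeout_s: float):
        self.detectors = set(detectors)            # ids of all switching detection units
        self.response_timeout_s = response_timeout_s
        self.flag: Optional[SwitchFlag] = None

    def set_flag(self, now: float) -> None:
        self.flag = SwitchFlag(started_at=now)

    def acknowledge(self, detector_id: str) -> None:
        """Called by a switching detection unit after it has switched."""
        if self.flag is not None:
            self.flag.responded.add(detector_id)

    def clear_if_done(self, now: float) -> bool:
        """Delete the flag once every detector responded or the timeout expired."""
        if self.flag is None:
            return False
        all_responded = self.detectors <= self.flag.responded
        timed_out = now - self.flag.started_at >= self.response_timeout_s
        if all_responded or timed_out:
            self.flag = None
            return True
        return False
```

The guard would call clear_if_done() on each tick of its scheduled task. Clearing on timeout keeps one unresponsive application from leaving the flag set indefinitely.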
The main process of the switching detection unit 4 includes:
1. initializing when the system starts, and starting a scheduled switching-detection thread task for each MySQL instance;
2. if the switching flag is detected, calling the switching unit 5 to initiate switching, and reporting to the state storage unit 3 that it has responded to the switch;
3. when switching, first stopping the related connections and entering a temporary maintenance state, then establishing the new connection; once the new connection is established, exiting the temporary maintenance state and resuming normal service operation;
4. to prevent switching from being too fast, setting a minimum switching time of 5 seconds, which leaves some time for replication;
5. to prevent a switch from remaining unsuccessful for a long time, setting a maximum switching time of 15 seconds.
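Steps 3 to 5 above can be sketched as a small client-side state machine. Everything named here (SwitchingClient, the connect and close callables, the one-second retry interval, and leaving the client in the maintenance state on failure) is an assumption for illustration; only the 5-second minimum and 15-second maximum come from the text.

```python
import time
from typing import Callable


class SwitchingClient:
    """Holds one connection and moves it from the active to the standby instance."""

    def __init__(self, connect: Callable[[str], object], close: Callable[[object], None],
                 active: str, standby: str,
                 min_switch_s: float = 5.0, max_switch_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self._connect, self._close = connect, close
        self._min, self._max = min_switch_s, max_switch_s
        self._clock, self._sleep = clock, sleep
        self.active, self.standby = active, standby
        self.state = "normal"
        self.conn = connect(active)

    def switch(self) -> bool:
        started = self._clock()
        # Enter the temporary maintenance state: callers should answer
        # "switching in progress" instead of touching either instance.
        self.state = "maintenance"
        self._close(self.conn)
        self.conn = None
        # Hold for the minimum switching time so replication can catch up.
        self._sleep(self._min)
        # Keep trying the standby until the maximum switching time is used up.
        while self._clock() - started < self._max:
            try:
                self.conn = self._connect(self.standby)
            except ConnectionError:
                self._sleep(1.0)  # retry interval is an assumption
                continue
            self.active, self.standby = self.standby, self.active
            self.state = "normal"
            return True
        return False  # still in maintenance; the caller decides what to do next
```

Injecting clock and sleep keeps the timing rules testable without real delays. In the described system, each application would run this logic independently after its switching detection unit finds the flag.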
In the invention, for the guard unit 2 cluster, one machine room generally deploys three Tomcat instances with the parameters mha_role=guard and mha_cluster=maha. Each MySQL instance unit 1 that needs to be monitored is configured in the database. The current IP and the switch IP of each MySQL instance are configured in Redis.
The state storage Redis, mhamaster, requires three redis-sentinel sentinels and two Redis instances (one master and one slave) to form a high-availability cluster.
The service MySQL instances need to be configured in a replication structure in which each is both master and slave of the other. To make full use of instance resources, connections are split by odd and even database, so that both instances are in use.
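The odd/even split can be illustrated as follows. The function name, the use of a numeric database index, and the fallback to the other instance when the preferred one is unhealthy are assumptions; the text only states that connections are divided by odd and even database so that both instances carry load.

```python
from typing import AbstractSet


def pick_instance(db_number: int, odd_instance: str, even_instance: str,
                  unhealthy: AbstractSet[str] = frozenset()) -> str:
    """Route odd-numbered databases to one instance and even-numbered to the other."""
    if db_number % 2 == 1:
        preferred, fallback = odd_instance, even_instance
    else:
        preferred, fallback = even_instance, odd_instance
    # After a switch, traffic for the unhealthy instance moves to its peer.
    return fallback if preferred in unhealthy else preferred
```

Because the two instances replicate in both directions, either one can serve any database; the split only balances load during normal operation.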
A service project introduces the switching detection unit 4 package and sets the parameters mha_role=tgl_chk and mha_cluster=maha; the value of mha_cluster is the same as that of the guard unit 2 cluster.
After a switch occurs, Nagios raises an alarm if monitoring has been set up. Common causes of switching include cloud host downtime, sporadic network failures, and the like.
After a switching alarm is received, api/mha/toggle.do is called to switch back during a relatively idle period at night. The switching time is typically around 10 seconds.
The system can be used online in a plurality of cloud machine rooms; switching is fast, services are not affected, and users do not perceive it.