CN115001950A - Database cluster fault processing method, storage medium and equipment - Google Patents

Database cluster fault processing method, storage medium and equipment

Info

Publication number
CN115001950A
CN115001950A
Authority
CN
China
Prior art keywords
database
cluster
response
database cluster
gateway
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210594391.9A
Other languages
Chinese (zh)
Inventor
郭道兵
李翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingbase Information Technologies Co Ltd
Original Assignee
Beijing Kingbase Information Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingbase Information Technologies Co Ltd
Priority to CN202210594391.9A
Publication of CN115001950A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0659: Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L 41/0661: Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0663: Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0805: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L 43/0811: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity

Abstract

The invention provides a fault handling method for a database cluster, a storage medium, and a device. The fault handling method comprises the following steps: acquiring an abnormal event occurring in the database cluster where the current database is located; confirming the active/standby state of the current database; acquiring the connectivity state between the current database and a trusted gateway; and configuring a virtual IP according to the connectivity state and the active/standby state, wherein the virtual IP is the external connection address of the database cluster. By configuring the virtual IP, the invention ensures that transparent application failover is always achieved, so that application connections and read/write requests remain normal.

Description

Database cluster fault processing method, storage medium and equipment
Technical Field
The present invention relates to the field of database technologies, and in particular, to a method, a storage medium, and a device for handling a failure of a database cluster.
Background
Clustering is a relatively recent technology through which comparatively high performance, reliability, and flexibility can be obtained at low cost, and task scheduling is a core technology in a cluster system. A cluster is a group of mutually independent computers interconnected by a high-speed network; they form a group and are managed as a single system. A client interacts with the cluster as if it were a single server. Clusters are configured to improve availability and scalability.
In a primary/standby database cluster, connections that serve application read/write requests can only be provided by the primary database; the standby database usually provides no external connections, or only read-only query connections for reporting systems. When a primary/standby switchover occurs (especially in the case of failover), the application side may be unaware of the switchover, and its connection configuration remains as it was before the switchover, so application read/write requests fail. To reduce the impact of switchover on application connections, the primary/standby cluster usually needs to support Transparent Application Failover (TAF).
A primary/standby cluster is considered to support transparent application failover if, after a primary/standby switchover, application connections follow the switchover, so that database connections remain available and read/write requests are served normally. Transparent application failover is typically implemented by configuring a connection string containing multiple IP addresses that are tried in turn.
Fig. 1 is a diagram of a prior-art architecture for implementing transparent application failover. In this conventional mechanism, a Net Manager service is statically configured in JDBC (Java Database Connectivity) together with a connection string containing multiple IP addresses (for example, using the LOAD_BALANCE and FAILOVER parameters). When LOAD_BALANCE is set to OFF and FAILOVER is enabled, the multiple IP addresses are tried in order and the iteration stops as soon as an address that can be connected is found, so that read/write requests stay on the database service that is reachable. In this way the basic capability of transparent application failover in a primary/standby cluster is achieved.
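To make the round-robin behaviour concrete, the following minimal sketch (not the Net Manager implementation itself; the addresses, port, and timeout are hypothetical placeholders) tries each configured address in order and keeps the first one that accepts a TCP connection, which is the effect the multi-address configuration described above is meant to produce:

```python
import socket

# Hypothetical cluster addresses as they might appear in a multi-address
# connection string: intended primary first, then the standby node(s).
CANDIDATE_ADDRESSES = [("192.0.2.10", 5432), ("192.0.2.11", 5432)]

def connect_first_available(addresses, timeout=3.0):
    """Try each address in order and return the first TCP connection that
    succeeds, mimicking the multi-IP iteration of the prior-art scheme."""
    for host, port in addresses:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            return host, port, sock      # stop iterating at the first success
        except OSError:
            continue                     # address unreachable: try the next one
    raise ConnectionError("no configured database address is reachable")

if __name__ == "__main__":
    host, port, sock = connect_first_available(CANDIDATE_ADDRESSES)
    print(f"connected to {host}:{port}")
    sock.close()
```

Note that this iteration only checks reachability; as explained below, if the original primary rejoins the cluster as a read-only standby at its old address, the first reachable address may still lead write requests to the wrong node.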
The prerequisite for this multi-IP iteration is that the Net Manager service component is available, and not every customer purchases a database product that provides it. In addition, after a primary/standby failover handled by this scheme, when the original primary database rejoins the cluster with its original IP address or hostname as a standby (read-only) database, both its instance and its listener service are normal; the multi-IP connection string is still traversed from top to bottom according to the original logic, so write requests may be sent to the read-only node and fail. Manual intervention is then required (for example, reordering the connection string so that the IP address of the new primary node is at the top, or switching the node back to the primary role). It can thus be seen that this scheme, when used in a primary/standby cluster, does not truly achieve transparent application failover.
In view of the above, it is desirable that Transparent Application Failover is still achieved when, after a primary/standby failover or switchover, the original primary database rejoins the cluster as a standby (read-only) database, so that application connections and read/write requests remain normal.
Disclosure of Invention
An object of the present invention is to provide a method, a storage medium, and a device for handling a failure of a database cluster, which can solve any of the above problems.
It is a further object of the present invention to prevent application connectivity anomalies.
It is another further object of the present invention to prevent split-brain.
In particular, the invention provides a fault handling method for a database cluster, comprising the following steps:
acquiring an abnormal event occurring in the database cluster where the current database is located;
confirming the active/standby state of the current database;
acquiring the connectivity state between the current database and a trusted gateway;
and configuring a virtual IP according to the connectivity state and the active/standby state, wherein the virtual IP is the external connection address of the database cluster.
Optionally, the step of configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
in the case that the current database is the primary database, if the connectivity state is normal, keeping the virtual IP on the current database;
and if the connectivity state is abnormal, deleting the virtual IP.
Optionally, the abnormal event comprises:
discovering that another database in the database cluster has failed; and/or
discovering that another database in the database cluster cannot perform data synchronization with the current database.
Optionally, configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
in the case that the current database is a standby database, if the connectivity state is normal, promoting the current database to be the primary database, and adding and starting the virtual IP;
and if the connectivity state is abnormal, demoting the current database to an abnormal mode and attempting to remove the current database from the cluster.
Optionally, the step of acquiring the connectivity state between the current database and the trusted gateway further comprises:
sending a probe message to the trusted gateway of the database cluster, and confirming the connectivity state between the current database and the trusted gateway according to the response of the trusted gateway.
Optionally, the response of the trusted gateway comprises a normal response and a fault response, the fault response comprising an error response, a timeout response, and no response; and
the step of confirming the connectivity state between the current database and the trusted gateway according to the response of the trusted gateway comprises:
determining that the connectivity state is normal when the response of the trusted gateway is a normal response;
and determining that the connectivity state is abnormal when the response of the trusted gateway is a fault response.
Optionally, the trusted gateway is a gateway device of the network segment where the database cluster is located.
Optionally, the virtual IP is an IP address in the same network segment as the database cluster.
According to another aspect of the present invention, there is also provided a machine-readable storage medium having stored thereon a machine-executable program which, when executed by a processor, implements the database cluster fault handling method of any one of the above.
According to another aspect of the present invention, there is also provided a computer device comprising a memory, a processor, and a machine-executable program stored on the memory and running on the processor, wherein the processor, when executing the machine-executable program, implements the database cluster fault handling method of any one of the above.
The database cluster fault handling method of the invention acquires an abnormal event occurring in the database cluster where the current database is located; confirms the active/standby state of the current database; acquires the connectivity state between the current database and the trusted gateway; and configures a virtual IP according to the connectivity state and the active/standby state, the virtual IP being the external connection address of the database cluster. By configuring the virtual IP, the invention ensures that transparent application failover is always achieved, so that application connections and read/write requests remain normal.
Furthermore, the database cluster fault handling method introduces the concept of a trusted gateway. By sending a probe message to the trusted gateway of the database cluster, the current database can determine its connectivity state with the trusted gateway from the gateway's response and thereby judge its own network state, so that corresponding measures can be taken to prevent split-brain.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the invention will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a diagram of an architecture for implementing transparent application failover in the prior art;
FIG. 2 is a schematic diagram of a data interaction process between a user side and a database cluster of the database cluster fault handling method according to an embodiment of the present invention;
FIG. 3 is a schematic architecture diagram of a database cluster of a method of failure handling of the database cluster according to one embodiment of the present invention;
FIG. 4 is a schematic flow chart diagram of a method of fault handling for a database cluster of one embodiment of the present invention;
FIG. 5 is a schematic flow diagram of a method for fault handling for a database cluster in the case where the current database is the master database, according to one embodiment of the present invention;
FIG. 6 is a schematic flow diagram of a method for failure handling of a database cluster in the case where a current database is a standby database, according to one embodiment of the present invention;
FIG. 7 is a schematic diagram of a machine-readable storage medium according to one embodiment of the invention; and
FIG. 8 is a schematic diagram of a computer device according to one embodiment of the invention.
Detailed Description
In a primary/standby database cluster, connections that serve application read/write requests can only be provided by the primary database; the standby database usually provides no external connections, or only read-only query connections for reporting systems. When a primary/standby switchover occurs (the switchover is not necessarily planned, especially in the case of failover), the application side may be unaware of the switchover, and its connection configuration remains as it was before the switchover, so application read/write requests fail. To reduce the impact of switchover on application connections, the primary/standby cluster usually needs to support Transparent Application Failover (TAF).
A primary/standby cluster is considered to support transparent application failover if, after a primary/standby switchover, application connections follow the switchover, so that database connections remain available and read/write requests are served normally. Transparent application failover is typically implemented by configuring a connection string containing multiple IP addresses that are tried in turn.
The prerequisite for this multi-IP iteration is that a Net Manager (network application manager) service component is available, and not all customers purchase database products that provide it. In addition, after a primary/standby failover handled by this scheme, when the original primary database rejoins the cluster with its original IP address or hostname as a standby (read-only) database, both its instance and its listener service are normal; the multi-IP connection string is still traversed from top to bottom according to the original logic, so write requests may be sent to the read-only node and fail. Manual intervention is then required (for example, reordering the connection string so that the IP address of the new primary node is at the top, or switching the node back to the primary role). It can be seen that this scheme, when used in a primary/standby cluster, does not truly achieve transparent application failover.
In view of the above, it is required that transparent application failover is still achieved when, after a primary/standby failover, the original primary database rejoins the cluster as a standby (read-only) database, so that application connections and read/write requests remain normal.
To solve the problem of transparent failover in the above primary/standby cluster architecture, the concepts of a virtual IP and a trusted gateway are introduced. When the primary/standby switchover trigger condition is met, before the qualifying standby database is promoted to primary, the virtual IP can be dynamically migrated to the new primary database according to the switchover conditions, while the virtual IP address on the failed node or on the original primary database is unconfigured (deleted). As a result, even if a node later rejoins the database cluster (as a standby database), it provides no connection service externally because it carries no virtual IP, which avoids the problem that a write request reaches a standby database because multiple addresses are tried in order.
Fig. 2 is a schematic diagram of the data interaction process between the user side and the database cluster in the database cluster fault handling method according to an embodiment of the present invention. The data interaction involves the user side 400, the database cluster, and the virtual IP. The virtual IP is an IP address in the same network segment as the database cluster and serves as the external connection address of the cluster: an IP address in the same network segment is added on the primary database server (on the same network card as the public IP) as a virtual address for external connections. The database cluster comprises a primary database server 100 and a standby database server 200. The cluster may consist of one primary database and one standby database, or one primary database and multiple standby databases.
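As an illustration of how such a virtual IP is typically attached to and removed from the primary server's network card on Linux, the following sketch uses the iproute2 `ip` command; the interface name and the address are placeholder assumptions, and the patent does not prescribe this particular mechanism:

```python
import subprocess

VIP_CIDR = "192.0.2.100/24"   # hypothetical virtual IP in the cluster's network segment
INTERFACE = "eth0"            # hypothetical NIC that also carries the node's public IP

def add_virtual_ip():
    """Attach the virtual IP to the local interface (run on the primary node)."""
    subprocess.run(["ip", "addr", "add", VIP_CIDR, "dev", INTERFACE], check=True)

def delete_virtual_ip():
    """Remove the virtual IP from the local interface (run on a failed or demoted node)."""
    subprocess.run(["ip", "addr", "del", VIP_CIDR, "dev", INTERFACE], check=True)
```

In practice a gratuitous ARP announcement is usually sent after the address moves so that clients and switches learn its new location quickly.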
Fig. 3 is a schematic architecture diagram of the database cluster in the fault handling method according to an embodiment of the present invention. The architecture comprises a primary database server 100, a standby database server 200, and a trusted gateway 300. The trusted gateway 300 is a gateway device of the network segment where the database cluster is located, and may be a router or a switch. The database cluster first designates a gateway device of its network segment as the trusted gateway 300, and both the primary and standby databases interact with it. In subsequent checks, once a database finds that it has lost connectivity to the trusted gateway 300, it can be regarded as having lost connectivity to all other devices in the network at the same time. The trusted gateway is a design that ensures the virtual IP address is always present only on the primary database node.
The database cluster uses an existing device in the local network segment as the trusted gateway 300. After the IP address or hostname of the trusted gateway 300 is provided, every database in the cluster sends ICMP (Internet Control Message Protocol) messages to the trusted gateway 300 using ping and judges its connectivity state with the trusted gateway 300 from the returned messages. If ping returns a normal reply, the trusted gateway 300 responds normally. If ping returns an error message, the trusted gateway 300 responds with an error. If ping returns a timeout message, the trusted gateway 300 responds with a timeout. If ping times out without receiving any message, the trusted gateway 300 gives no response. When the response of the trusted gateway 300 is a normal response, the connectivity state is determined to be normal; when the response of the trusted gateway 300 is an abnormal response, the connectivity state is determined to be abnormal.
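The probe-and-classify behaviour described above can be sketched as follows. This is an illustrative outline only: it shells out to the Linux iputils `ping` command with a single echo request, whereas the actual cluster software may use raw ICMP sockets or repeated probes, and the output matching is an assumption:

```python
import subprocess

def probe_trusted_gateway(gateway_ip: str, timeout_s: int = 2) -> str:
    """Send one ICMP echo request to the trusted gateway and classify the
    result into the response categories used in the description."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), gateway_ip],
            capture_output=True, text=True, timeout=timeout_s + 1,
        )
    except subprocess.TimeoutExpired:
        return "no_response"              # ping itself produced nothing in time
    if result.returncode == 0:
        return "normal"                   # normal reply received
    if "Unreachable" in result.stdout or result.stderr:
        return "error"                    # an error message came back
    return "timeout"                      # no reply before the deadline

def connectivity_is_normal(gateway_ip: str) -> bool:
    """Map the gateway response onto the two connectivity states of the method."""
    return probe_trusted_gateway(gateway_ip) == "normal"
```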
FIG. 4 is a schematic flow chart of the database cluster fault handling method according to an embodiment of the present invention. The method comprises the following steps:
step S202, abnormal events occurring in the database cluster where the current database is located are obtained. The abnormal events comprise the discovery of the fault of any other database in the database cluster and the failure of data synchronization between any other database in the database cluster and the current database. The current database is a database within a database cluster. The current database can be a main database or a standby database.
Step S204: confirm the active/standby state of the current database, i.e., whether the current database is currently the primary database or a standby database.
Step S206: acquire the connectivity state between the current database and the trusted gateway. Step S206 may include: sending a probe message to the trusted gateway of the database cluster, and confirming the connectivity state between the current database and the trusted gateway according to the response of the trusted gateway.
The trusted gateway may be a gateway device of the network segment where the database cluster is located, for example a router or a switch; the device needs no modification, only its IP address. The response of the trusted gateway may include a normal response and a fault response.
Fault responses include error responses, timeout responses, and no response. The database cluster uses an existing device in the local network segment as the trusted gateway; after its IP address or hostname is provided, every database in the cluster sends ICMP messages to the trusted gateway using ping and judges its connectivity state with the trusted gateway from the returned messages. If ping returns a normal reply, the trusted gateway responds normally. If ping returns an error message, the trusted gateway responds with an error. If ping returns a timeout message, the trusted gateway responds with a timeout. If ping times out without receiving any message, the trusted gateway gives no response. When the response of the trusted gateway is a normal response, the connectivity state is determined to be normal; when the response is an abnormal response, the connectivity state is determined to be abnormal.
Step S208: configure the virtual IP according to the connectivity state and the active/standby state. In other embodiments, even when no abnormal event is detected in the current database, a network check is performed at preset time intervals to determine whether the local network has a problem and whether other databases have lost connectivity, so as to avoid misjudging a failure.
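For the periodic check mentioned above, a minimal sketch could simply re-probe the trusted gateway on a fixed interval (the gateway address and interval are hypothetical; probing by shelling out to `ping` is an assumption, as in the earlier sketch):

```python
import subprocess
import time

GATEWAY_IP = "192.0.2.1"   # hypothetical trusted-gateway address
CHECK_INTERVAL_S = 5       # hypothetical preset check interval

def gateway_reachable(ip: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo request to the gateway succeeds."""
    result = subprocess.run(["ping", "-c", "1", "-W", str(timeout_s), ip],
                            capture_output=True)
    return result.returncode == 0

def periodic_network_check() -> None:
    """Re-check the local network at preset intervals even when no abnormal
    event has been reported, so that a later failure is not misjudged."""
    while True:
        if not gateway_reachable(GATEWAY_IP):
            print("trusted gateway unreachable: the local network may be at fault")
        time.sleep(CHECK_INTERVAL_S)
```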
The database cluster fault handling method acquires an abnormal event occurring in the database cluster where the current database is located; confirms the active/standby state of the current database; acquires the connectivity state between the current database and the trusted gateway; and configures the virtual IP according to the connectivity state and the active/standby state, the virtual IP being the external connection address of the database cluster. By configuring the virtual IP, the method prevents abnormal application connectivity compared with a conventional primary/standby cluster, always achieves transparent application failover, and keeps application connections and read/write requests normal.
Fig. 5 is a schematic flowchart of the database cluster fault handling method in the case where the current database is the primary database, according to an embodiment of the present invention. In this embodiment, the step of configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
step S302, the current database is confirmed to be a main database.
And step S304, confirming the communication state of the current database and the trust gateway. If the connection state is normal connection, executing step S306; if the connection status is abnormal, step S308 is executed.
And S306, maintaining the virtual IP of the current database, and kicking the database with the abnormal event out of the cluster.
Step S308, closing the current database and deleting the virtual IP.
When data synchronization between the current database (primary) and another database (standby) is interrupted, the standby database is preliminarily judged to have failed, and trusted gateway probing is performed at the same time. If the trusted gateway cannot be reached, the local network is judged to be abnormal: the primary database is shut down and the virtual IP address is deleted.
When data synchronization between the current database (primary) and another database (standby) is interrupted, the standby database is preliminarily judged to have failed, and trusted gateway probing is performed at the same time. If the trusted gateway is reachable, the local network is normal: no state change is made, the virtual IP keeps running on the current database, and the cluster attempts to remove the failed standby database.
Fig. 6 is a schematic flowchart of the database cluster fault handling method in the case where the current database is a standby database, according to an embodiment of the present invention. In this embodiment, the step of configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
step S402, the current database is confirmed to be a standby database.
And step S404, confirming the communication state of the current database and the trust gateway. If the connection state is normal connection, go to step S406; if the connection status is abnormal, step S408 is executed.
Step S406, the current database is promoted to be a main database, and a virtual IP is added and started.
And step S408, degrading the current database into an abnormal mode, and trying to kick the current database out of the database cluster.
When data synchronization between the current database (standby) and another database (primary) is interrupted, the primary database is preliminarily judged to have failed, and trusted gateway probing is performed at the same time. If the trusted gateway is reachable, the local network is normal: automatic failover is executed, the virtual IP is added and started, and the current database (standby) is promoted to be the primary database.
When data synchronization between the current database (standby) and another database (primary) is interrupted, the primary database is preliminarily judged to have failed, and trusted gateway probing is performed at the same time. If the trusted gateway cannot be reached, the local network is judged to be abnormal and automatic failover cannot be performed. The current database (standby) is demoted to an abnormal mode and the cluster attempts to remove it; the database is merely kept running and performs no further fault handling. In this case the virtual IP remains running on the other database (primary) unchanged.
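Putting the branches of FIG. 5 and FIG. 6 together, the per-node decision can be condensed into the following sketch. It is an illustrative summary of steps S302 to S308 and S402 to S408 rather than code from the patent; the returned action names are hypothetical labels for the operations described above:

```python
def decide_actions(is_primary: bool, gateway_reachable: bool) -> list[str]:
    """Return the actions a node should take after data synchronization with
    its peer is interrupted, given its role and the trusted-gateway check."""
    if is_primary:
        if gateway_reachable:
            # Local network is fine, so the standby is the failed party.
            return ["keep_virtual_ip", "remove_failed_standby_from_cluster"]
        # The primary itself is isolated: release the VIP to avoid split-brain.
        return ["shut_down_database", "delete_virtual_ip"]
    if gateway_reachable:
        # The primary is unreachable but the network is fine: fail over.
        return ["promote_to_primary", "add_and_start_virtual_ip"]
    # An isolated standby must not promote itself.
    return ["demote_to_abnormal_mode", "leave_cluster"]

if __name__ == "__main__":
    for is_primary in (True, False):
        for reachable in (True, False):
            print(is_primary, reachable, "->", decide_actions(is_primary, reachable))
```

Because the isolated side always gives up (or never acquires) the virtual IP, at most one node in the cluster carries the external connection address at any time.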
Table 1 compares the two schemes and shows the effect of the database cluster fault handling method of this embodiment (scheme II). Scheme I: automatic primary/standby switchover, with application switching achieved by trying multiple IP addresses in turn. Scheme II: a primary/standby database cluster that introduces a trusted gateway and switches the virtual IP automatically. The comparison in the table takes a cluster with one primary database and one standby database as an example.
TABLE 1
[Table 1 is provided as images in the original publication (Figure BDA0003667177260000081 and Figure BDA0003667177260000091) and is not reproduced here.]
As can be seen from Table 1, compared with scheme I, the database cluster fault handling method of this embodiment (scheme II) avoids split-brain when the primary or standby database has a network problem, and prevents application connectivity (read/write requests) from becoming abnormal.
That is, compared with a conventional primary/standby cluster, the database cluster fault handling method of this embodiment prevents split-brain and at the same time keeps application connectivity (read/write requests) normal, without requiring any additional external service or device (the trusted gateway is an existing device in the network segment).
The embodiment also provides a machine-readable storage medium and a computer device. Fig. 7 is a schematic diagram of a machine-readable storage medium according to an embodiment of the present invention, and fig. 8 is a schematic diagram of a computer apparatus according to an embodiment of the present invention.
The machine-readable storage medium 40 has stored thereon a machine-executable program 41, the machine-executable program 41 when executed by a processor implementing the method of fault handling for a database cluster of any of the embodiments described above.
The computer device 50 may comprise a memory 520, a processor 510 and a machine executable program 41 stored on the memory 520 and running on the processor 510, and the processor 510 implements the method of failure handling of a database cluster of any of the embodiments described above when executing the machine executable program 41.
It should be noted that the logic and/or steps shown in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any machine-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a machine-readable storage medium 40 can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium 40 may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system.
The computer device 50 may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or a smartphone. In some examples, computer device 50 may be a cloud computing node. Computer device 50 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer device 50 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
The computer device 50 may include a processor 510 adapted to execute stored instructions, a memory 520 providing temporary storage for the operation of the instructions during operation. Processor 510 may be a single core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 520 may include Random Access Memory (RAM), read only memory, flash memory, or any other suitable storage system.
The processor 510 may also be linked through a system interconnect to a display interface suitable for connecting the computer device 50 to a display device. The display device may include a display screen as a built-in component of the computer device 50. The display device may also include a computer monitor, television, or projector, etc. externally connected to the computer device 50. In addition, a Network Interface Controller (NIC) may be adapted to connect computer device 50 to a network via a system interconnect. In some embodiments, the NIC may use any suitable interface or protocol (such as an internet small computer system interface, etc.) to transfer data. The network may be a cellular network, a radio network, a Wide Area Network (WAN)), a Local Area Network (LAN), the internet, or the like. The remote device may be connected to the computing device through a network.
The flowcharts provided by this embodiment are not intended to indicate that the operations of the method are to be performed in any particular order, or that all the operations of the method are included in each case. Further, the method may include additional operations. Additional variations on the above-described method are possible within the scope of the technical ideas provided by the method of this embodiment.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A fault handling method for a database cluster, comprising the following steps:
acquiring an abnormal event occurring in the database cluster where the current database is located;
confirming the active/standby state of the current database;
acquiring the connectivity state between the current database and a trusted gateway;
and configuring a virtual IP according to the connectivity state and the active/standby state, wherein the virtual IP is the external connection address of the database cluster.
2. The database cluster fault handling method according to claim 1, wherein the step of configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
in the case that the current database is the primary database, if the connectivity state is normal, keeping the virtual IP on the current database;
and if the connectivity state is abnormal, deleting the virtual IP.
3. The database cluster fault handling method according to claim 1, wherein the abnormal event comprises:
discovering that another database in the database cluster has failed; and/or
discovering that another database in the database cluster cannot perform data synchronization with the current database.
4. The database cluster fault handling method according to claim 3, wherein configuring the virtual IP according to the connectivity state and the active/standby state further comprises:
in the case that the current database is a standby database, if the connectivity state is normal, promoting the current database to be the primary database, and adding and starting the virtual IP;
and if the connectivity state is abnormal, demoting the current database to an abnormal mode and attempting to remove the current database from the cluster.
5. The database cluster fault handling method according to claim 1, wherein the step of acquiring the connectivity state between the current database and the trusted gateway further comprises:
sending a probe message to the trusted gateway of the database cluster, and confirming the connectivity state between the current database and the trusted gateway according to the response of the trusted gateway.
6. The database cluster fault handling method according to claim 5, wherein the response of the trusted gateway comprises a normal response and a fault response, the fault response comprising an error response, a timeout response, and no response; and
the step of confirming the connectivity state between the current database and the trusted gateway according to the response of the trusted gateway comprises:
determining that the connectivity state is normal when the response of the trusted gateway is a normal response;
and determining that the connectivity state is abnormal when the response of the trusted gateway is a fault response.
7. The database cluster fault handling method according to claim 1, wherein
the trusted gateway is a gateway device of the network segment where the database cluster is located.
8. The database cluster fault handling method according to claim 1, wherein
the virtual IP is an IP address in the same network segment as the database cluster.
9. A machine-readable storage medium having stored thereon a machine-executable program which, when executed by a processor, implements the database cluster fault handling method according to any one of claims 1 to 8.
10. A computer device comprising a memory, a processor, and a machine-executable program stored on the memory and running on the processor, wherein the processor, when executing the machine-executable program, implements the database cluster fault handling method according to any one of claims 1 to 8.
CN202210594391.9A 2022-05-27 2022-05-27 Database cluster fault processing method, storage medium and equipment Pending CN115001950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594391.9A CN115001950A (en) 2022-05-27 2022-05-27 Database cluster fault processing method, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594391.9A CN115001950A (en) 2022-05-27 2022-05-27 Database cluster fault processing method, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN115001950A true CN115001950A (en) 2022-09-02

Family

ID=83029205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594391.9A Pending CN115001950A (en) 2022-05-27 2022-05-27 Database cluster fault processing method, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN115001950A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357800A (en) * 2017-05-18 2017-11-17 杭州沃趣科技股份有限公司 A kind of database High Availabitity zero loses solution method
CN107391633A (en) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 Data-base cluster Automatic Optimal processing method, device and server
CN113360579A (en) * 2021-06-30 2021-09-07 平安普惠企业管理有限公司 Database high-availability processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US7225356B2 (en) System for managing operational failure occurrences in processing devices
US7076691B1 (en) Robust indication processing failure mode handling
JP5860497B2 (en) Failover and recovery for replicated data instances
US20080205286A1 (en) Test system using local loop to establish connection to baseboard management control and method therefor
JP2006114040A (en) Failover scope for node of computer cluster
CN107666493B (en) Database configuration method and equipment thereof
CN104503965A (en) High-elasticity high availability and load balancing realization method of PostgreSQL (Structured Query Language)
US20150256622A1 (en) Connection management device, communication system, connection management method, and computer program product
US11403319B2 (en) High-availability network device database synchronization
EP3648405B1 (en) System and method to create a highly available quorum for clustered solutions
EP2597818A1 (en) Cluster management system and method
US7499987B2 (en) Deterministically electing an active node
CN111651320A (en) High-concurrency connection method and system
US8671180B2 (en) Method and system for generic application liveliness monitoring for business resiliency
CN110351122B (en) Disaster recovery method, device, system and electronic equipment
CN114840495A (en) Database cluster split-brain prevention method, storage medium and device
CN115426258B (en) Information configuration method, device, switch and readable storage medium
US10367711B2 (en) Protecting virtual computing instances from network failures
CN113596195B (en) Public IP address management method, device, main node and storage medium
CN115001950A (en) Database cluster fault processing method, storage medium and equipment
JP2002344450A (en) High availability processing method, and executing system and processing program thereof
CN114697191A (en) Resource migration method, device, equipment and storage medium
JP2015114952A (en) Network system, monitoring control unit, and software verification method
CN114598643B (en) Data backup method and device
CN115499296B (en) Cloud desktop hot standby management method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination