CN110704223B

CN110704223B - Recovery system and method for single-node abnormity of database

Info

Publication number: CN110704223B
Application number: CN201910870884.9A
Authority: CN
Inventors: 张立明; 郭业俊; 孙迁
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Jiangsu Suning Cloud Computing Co ltd; SuningCom Co ltd
Priority date: 2019-09-16
Filing date: 2019-09-16
Publication date: 2022-12-13
Anticipated expiration: 2039-09-16
Also published as: CN110704223A

Abstract

The invention discloses a recovery system and a recovery method for a single-node abnormality of a database. The recovery system comprises: an index acquisition module, used for acquiring the node indexes of each node of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database; and an index monitoring module, used for monitoring the node indexes imported into the connection pool of the database, checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is determined to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions. By automatically recovering from abnormalities of database nodes, the embodiments of the invention handle such abnormalities automatically and improve the stability of the database cluster.

Description

Recovery system and method for single-node abnormity of database

Technical Field

The invention relates to the field of big data, in particular to a system and a method for recovering single-node abnormity of a database.

Background

Databases are now used in a wide range of applications, and different types of databases have emerged accordingly. For example, the HBase database is a distributed key-value database characterized by high availability, easy scalability and mass storage. HBase data is stored on HDFS, so the data is highly available and the failure of any single machine does not cause data loss. However, the database can still become unstable in use. The causes include server hardware problems, such as disk damage and memory anomalies; service process problems, such as garbage collection (GC) in the RegionServer and unknown program bugs; and database service problems, such as resource contention caused by faulty queries. Any of these can make a node of the database abnormal, so that user request latency increases or requests are blocked completely, which greatly affects the stability of HBase and, in turn, users' use of the database.

There are generally two approaches to problems on a database node. The first is to control a traffic threshold separately on each node of the database cluster, so that the traffic threshold of the whole database cluster is reduced in step. Its disadvantage is that when requests in the cluster are skewed, the request volume of a single node is excessive while the request volume of the whole cluster is still small, so the flow control is inaccurate. The second is to count the cluster request volume through a service on a single node: every request in the cluster passes through that service, which decides the traffic threshold of the whole cluster. Its disadvantage is that the sequential judgment by a single-node service easily becomes a bottleneck for the whole cluster, and reliability is hard to guarantee. Both approaches address node abnormality by controlling node traffic, and neither effectively ensures the stability of the database service in operation.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a system and a method for recovering a single-node exception of a database, which can automatically process an exception of a database node, automatically recover an exception of the database node, and improve stability of a database cluster.

In order to solve the technical problems, the invention adopts the technical scheme that:

In a first aspect, an embodiment of the present invention provides a recovery system for a single-node exception of a database, where the recovery system includes:

the index acquisition module is used for acquiring each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and the index monitoring module is used for monitoring the node indexes imported into the database connection pool, checking a request queue in the node indexes, checking the storage and management server corresponding to the request queue after the request queue is determined to have backlog, determining the request conditions of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request conditions meet preset conditions.

Further, the index monitoring module comprises a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit is used for comparing the request quantity in the request conditions with that of half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on the storage and management server whose increase does not exceed the preset threshold. The request time consumption monitoring unit is used for checking the request time of the storage and management server when the period-over-period increase in the request quantity of the Region units, compared with half an hour earlier, does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on the storage and management server whose request time does not exceed the preset time.

Further, the index monitoring module further includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all the Region units on the storage and management server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated any Region unit.

Further, the request queue in the node indexes is checked once every 30 seconds; the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes.

Further, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.

Further, the database at least comprises an HBase database, a connection pool of the database at least comprises a Druid connection pool, and the storage and management server comprises a plurality of Region units.

In a second aspect, an embodiment of the present invention further discloses a recovery method for a single-node abnormality of a database, which comprises the following steps:

collecting each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and checking the request queue in the node index, checking a storage and management server corresponding to the request queue after confirming that the request queue has backlog, determining the request conditions of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request conditions meet preset conditions.

Further, the backlog of the request queue comprises a request quantity backlog and a request time backlog. For the request quantity backlog, the request quantity in the request conditions is compared with that of half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose increase does not exceed the preset threshold are migrated. For the request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose request time does not exceed the preset time are migrated.

Further, the backlog of the request queue also includes a hardware request backlog: when neither the request quantity backlog nor the request time backlog occurs, the balance path of the storage and management server is closed, and all Region units on the storage and management server are migrated.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

The embodiments of the invention disclose a recovery system and a recovery method for a single-node abnormality of a database. When a node abnormality is caused by a serious backlog in the request queue of a database node, the index acquisition module and the index monitoring module monitor and process the request queue in the database. When the quantity and time of the request queue imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is handled according to a set request quantity threshold and a set request time threshold, and Region units on that storage and management server are migrated. This reduces the high latency and node blockage caused by request queue backlog on the database node, automatically restores data flow on the node, improves the efficiency of data transmission in the database, and ensures stability while the database is in use.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic structural diagram of a recovery system for single-node anomalies in a database according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for recovering a single-node exception of a database according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of processing backlog in the method for recovering from a single-node exception in a database according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The first embodiment is as follows:

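Fig. 1 is a schematic structural diagram of the recovery system for a single-node exception of a database disclosed in this embodiment. As shown in Fig. 1, an embodiment of the present invention provides a recovery system for a single-node exception of a database, where the recovery system includes:

the index acquisition module, used for acquiring each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and the index monitoring module, used for monitoring the node indexes imported into the database connection pool, checking a request queue in the node indexes, checking the storage and management server corresponding to the request queue after the request queue is determined to have backlog, determining the request conditions of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request conditions meet preset conditions.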

Specifically, the recovery system for a single-node abnormality of a database provided in this embodiment is mainly a fast recovery scheme for a single-node abnormality of an HBase database. Preferably, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server includes a plurality of Region units. The HBase database is a distributed key-value database characterized by high availability, easy scalability and mass storage. HBase data is stored on HDFS (Hadoop Distributed File System), a distributed file system designed to run on commodity hardware, so the failure of any single machine does not cause data loss. If a node becomes abnormal, however, user request latency increases or requests are blocked completely, which greatly affects the stability of the HBase database. Problems that arise while using the HBase database include server hardware problems, such as disk damage and memory anomalies; service process problems, such as GC in the RegionServer and unknown program bugs; and database service problems, such as resource contention caused by faulty queries. With the recovery system disclosed in this embodiment, once a node abnormality occurs, the index acquisition module and the index monitoring module collect and monitor the request quantity and request time of the request queue in the database. When the quantity and time of the request queue imported into the connection pool of the database increase significantly, the storage and management server corresponding to the backlogged queue is handled according to the set request quantity threshold and the set request time threshold, and Region units on that storage and management server are migrated. This reduces the high latency and blockage caused by request queue backlog on the database node, automatically restores data flow on the node, improves the efficiency of data transmission in the database, and ensures stability while the database is in use.

Further, as can be seen in Fig. 1, a data table is composed of a plurality of Region units distributed across different Region servers. In the present invention, a Region server is the storage and management server, a Region unit is the unit in which the HBase database stores data, and an index refers to an operating metric of the HBase database cluster. Specifically, the index acquisition module performs difference calculation on the node indexes to obtain their increments per unit time, and performs aggregation processing that aggregates the Region units into the data table.
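As a rough illustration of the difference calculation and aggregation described above, the following Python sketch converts cumulative per-Region request counters into per-interval increments and sums them per data table. The sample layout, the function names `delta_per_region` and `aggregate_by_table`, and the handling of counter resets are assumptions made for illustration and are not details of this embodiment.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Key: (table name, Region name); value: cumulative request counter.
# This layout is a hypothetical one chosen for the sketch.
Sample = Dict[Tuple[str, str], int]


def delta_per_region(previous: Sample, current: Sample) -> Sample:
    """Difference calculation: increment of each counter over one interval."""
    deltas: Sample = {}
    for key, value in current.items():
        before = previous.get(key, 0)
        # Assumption: a counter that went down was reset (for example by a
        # process restart), so the current value is taken as the increment.
        deltas[key] = value - before if value >= before else value
    return deltas


def aggregate_by_table(deltas: Sample) -> Dict[str, int]:
    """Aggregation: sum Region-level increments into table-level totals."""
    totals: Dict[str, int] = defaultdict(int)
    for (table, _region), increment in deltas.items():
        totals[table] += increment
    return dict(totals)


previous = {("orders", "r1"): 100, ("orders", "r2"): 50, ("users", "r3"): 10}
current = {("orders", "r1"): 160, ("orders", "r2"): 55, ("users", "r3"): 4}
deltas = delta_per_region(previous, current)
table_totals = aggregate_by_table(deltas)
```

Keeping the Region-level increments alongside the table-level totals would let a later step inspect individual Region units on a backlogged server.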

Preferably, the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit is configured to compare the request quantity in the request conditions with that of half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on the storage and management server whose increase does not exceed the preset threshold. The request time consumption monitoring unit is used for checking the request time of the storage and management server when the period-over-period increase in the request quantity of the Region units, compared with half an hour earlier, does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on the storage and management server whose request time does not exceed the preset time. Further, the index monitoring module further includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all Region units on the storage and management server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated any Region unit. The three units of the index monitoring module thus address the three kinds of problems that arise while using the HBase database. When an increase in the request quantity of a Region unit causes backlog on the storage and management server, and in turn a large backlog in the request queue, the request quantity monitoring unit migrates the Region units that do not meet the condition so that they are not affected, thereby recovering from the abnormality. When the request processing time of some Region units increases, which may indicate a problem with the Region data and poor query performance and which causes backlog in the request queue of the corresponding storage and management server, the request time consumption monitoring unit migrates Region units so that the other Region units are likewise not affected. Besides these two cases, a hardware problem may occur and cause a bug in the system, in which case all Region units need to be migrated by the hardware monitoring unit. Because the migration of Region units is handled automatically in this embodiment, manual operation and maintenance costs are reduced; the mutual influence of tables in the same cluster is also reduced, so that a problem with one data table does not affect the use of other tables.
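The three-way decision made by the request quantity, request time consumption and hardware monitoring units might be sketched as follows. This is only an illustration under stated assumptions: the formula for the period-over-period increase, the simplification of "increased and exceeds the preset time" to a single threshold comparison, and the names `RegionStats`, `MigrationPlan` and `plan_recovery` are all hypothetical and are not specified in the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionStats:
    requests_now: int           # requests in the current unit of time
    requests_30_min_ago: int    # same measure half an hour earlier
    request_time_ms: float      # current request processing time


@dataclass
class MigrationPlan:
    reason: str                 # "request_quantity", "request_time" or "hardware"
    close_balance_path: bool
    regions_to_migrate: List[str]


def increase_ratio(stats: RegionStats) -> float:
    """Period-over-period increase; the exact formula is an assumption."""
    baseline = max(stats.requests_30_min_ago, 1)  # avoid division by zero
    return (stats.requests_now - stats.requests_30_min_ago) / baseline


def plan_recovery(regions: Dict[str, RegionStats],
                  increase_threshold: float,
                  time_threshold_ms: float) -> MigrationPlan:
    """Decide which Region units to move off a backlogged server."""
    surged = {n for n, s in regions.items()
              if increase_ratio(s) > increase_threshold}
    if surged:
        # Request-quantity case: move the Regions that did not surge.
        return MigrationPlan("request_quantity", True,
                             sorted(set(regions) - surged))
    slow = {n for n, s in regions.items()
            if s.request_time_ms > time_threshold_ms}
    if slow:
        # Request-time case: move the Regions that are not slow.
        return MigrationPlan("request_time", True, sorted(set(regions) - slow))
    # Hardware case: neither check fired, so move every Region.
    return MigrationPlan("hardware", True, sorted(regions))


regions = {
    "r1": RegionStats(5000, 1000, 20.0),  # request quantity rose sharply
    "r2": RegionStats(1100, 1000, 15.0),
    "r3": RegionStats(900, 1000, 18.0),
}
plan = plan_recovery(regions, increase_threshold=1.0, time_threshold_ms=100.0)
```

In this sample only `r1` exceeds the hypothetical increase threshold, so the plan moves `r2` and `r3` away from the server and leaves `r1` in place, matching the stated aim of protecting the Region units that do not meet the condition.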

Preferably, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes. Because the request queue is checked at a fixed period, the recovery system for a single-node abnormality of a database disclosed in this embodiment is always in a working state, and the stable operation of the system indirectly ensures the stability of the database. The 30-second check period applies only to the HBase database; for other databases or other application scenarios, the check period can be adjusted programmatically so that the request queue is checked in a continuous loop at the most suitable period.
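The backlog criterion above can be sketched as a sliding window over the periodic checks. The sketch below assumes that "two consecutive minutes" at one check every 30 seconds corresponds to four consecutive samples; that mapping and the class name `BacklogDetector` are assumptions for illustration only.

```python
from collections import deque
from typing import Deque

CHECK_PERIOD_SECONDS = 30      # check period stated in the text
BACKLOG_THRESHOLD = 1000       # queue-length threshold stated in the text
BACKLOG_WINDOW_SECONDS = 120   # "two consecutive minutes"


class BacklogDetector:
    """Flags backlog when every sample in the window exceeds the threshold."""

    def __init__(self) -> None:
        # Assumption: two minutes at one check per 30 s is four samples.
        self._window_size = BACKLOG_WINDOW_SECONDS // CHECK_PERIOD_SECONDS
        self._samples: Deque[int] = deque(maxlen=self._window_size)

    def observe(self, queue_length: int) -> bool:
        """Record one periodic check and report whether backlog is confirmed."""
        self._samples.append(queue_length)
        window_full = len(self._samples) == self._window_size
        return window_full and all(s > BACKLOG_THRESHOLD for s in self._samples)


detector = BacklogDetector()
results = [detector.observe(n) for n in (1200, 1500, 900, 1100, 1300, 1250, 1400)]
```

A single dip below the threshold (the 900 sample here) restarts the count, so a brief spike alone does not trigger migration. The constants could be exposed as parameters for databases or scenarios that need a different check period.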

Preferably, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests. Specifically, the node indexes collected by the index acquisition module correspond to the parameters used by the index monitoring module when migrating Region units, so the node indexes include at least the items above. The quantity distribution of the request queues, the time distribution of the request queues and the time consumption distribution of reads, writes, queries and the like are the factors used in the subsequent backlog judgment; quantifying these factors as specific parameters standardizes the backlog judgment and shortens the recovery time when a database cluster node is abnormal.

The second embodiment:

The embodiment of the invention also discloses a recovery method for a single-node abnormality of a database. Fig. 2 is a schematic flowchart of the recovery method for a single-node abnormality of a database disclosed in this embodiment, and Fig. 3 is a schematic flowchart of backlog processing in that recovery method. As shown in Fig. 2 and Fig. 3, the recovery method includes the following steps:

S1: collecting each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;

S2: checking the request queue in the node index, checking a storage and management server corresponding to the request queue after confirming that the request queue has backlog, determining the request conditions of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request conditions meet preset conditions.

Specifically, the recovery method for a database node abnormality comprises two main steps: the first collects the node indexes and imports them into the connection pool of the database, and the second migrates Region units when requests are backlogged, so that the request queue in the connection pool does not become blocked and data flows smoothly through the cluster nodes. Further, the request quantity and request time of the request queue in the database are collected and monitored; when the quantity and time of the request queue imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is handled according to a set request quantity threshold and a set request time threshold, and Region units on that storage and management server are migrated. The quantity threshold and the time threshold can be adjusted to suit the scenario in which the database is used. The method thereby reduces the high latency and blockage caused by request queue backlog on the database node, automatically restores data flow on the node, improves the efficiency of data transmission in the database, and ensures stability while the database is in use.
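To make the two-step flow and the adjustable thresholds concrete, a minimal sketch of one monitoring pass is given below. The `RecoveryConfig` fields, the defaults for the increase and request time thresholds, and the injected `collect` and `recover` callables are hypothetical; for brevity the pass treats a single over-threshold sample as backlog rather than modelling the two-consecutive-minute criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RecoveryConfig:
    check_period_seconds: int = 30             # from the text
    backlog_queue_length: int = 1000           # from the text
    increase_threshold: float = 1.0            # hypothetical default
    request_time_threshold_ms: float = 500.0   # hypothetical default


def run_single_pass(collect: Callable[[], Dict[str, int]],
                    recover: Callable[[str], None],
                    config: RecoveryConfig) -> List[str]:
    """S1: collect queue lengths per server; S2: recover backlogged servers.

    Simplification: one over-threshold sample counts as backlog here.
    """
    handled: List[str] = []
    for server, queue_length in sorted(collect().items()):
        if queue_length > config.backlog_queue_length:
            recover(server)
            handled.append(server)
    return handled


migrated: List[str] = []
handled = run_single_pass(
    collect=lambda: {"rs-1": 120, "rs-2": 4300, "rs-3": 1000},
    recover=migrated.append,
    config=RecoveryConfig(),
)
```

Passing the thresholds through one configuration object is one way to let them be tuned per deployment without touching the monitoring logic.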

Preferably, when a database node abnormality is recovered through steps S1 and S2, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server includes a plurality of Region units. Accordingly, when a database node is abnormal, migration is preferably applied to the Region units in the HBase database, so that backlog of the request queue is avoided.

Preferably, the backlog of the request queue includes a request quantity backlog and a request time backlog. For the request quantity backlog, the request quantity in the request conditions is compared with that of half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose increase does not exceed the preset threshold are migrated. For the request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose request time does not exceed the preset time are migrated. Further, the backlog of the request queue also includes a hardware request backlog: when neither the request quantity backlog nor the request time backlog occurs, the balance path of the storage and management server is closed, and all Region units on the storage and management server are migrated. Specifically, recovery under the three conditions of request quantity backlog, request time backlog and hardware request backlog addresses the three kinds of problems that arise while using the HBase database. When an increase in the request quantity of a Region unit causes backlog on the storage and management server, and in turn a large backlog in the request queue, Region units are migrated by the request quantity backlog procedure so that the Region units that do not meet the condition are not affected, thereby recovering from the abnormality. When the request processing time of some Region units increases, which may indicate a problem with the Region data and poor query performance and which causes backlog in the request queue of the corresponding storage and management server, Region units are migrated by the request time backlog procedure so that the other Region units are likewise not affected and the node abnormality is recovered automatically. Besides these two cases, a hardware problem may occur and cause a bug in the system, in which case all Region units need to be migrated by the hardware request backlog procedure. In all three migration procedures, the balance path of the storage and management server needs to be closed to prevent the migrated Region units from flowing back. Because the migration of Region units is handled automatically in this embodiment, manual operation and maintenance costs are reduced; the mutual influence of tables in the same cluster is also reduced, so that a problem with one data table does not affect the use of other tables.
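The ordering constraint described above, closing the balance path before migrating so that Region units do not flow back, can be illustrated with a small sketch. The `ClusterAdmin` interface is hypothetical and is not an actual HBase client API; reading "closing the balance path" as switching off automatic balancing, and spreading Regions over target servers in round-robin order, are both assumptions.

```python
from typing import List, Protocol


class ClusterAdmin(Protocol):
    """Hypothetical admin interface; not an actual HBase client API."""

    def set_balancer_enabled(self, enabled: bool) -> None: ...

    def move_region(self, region: str, target_server: str) -> None: ...


class RecordingAdmin:
    """In-memory stand-in that records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def set_balancer_enabled(self, enabled: bool) -> None:
        self.calls.append(f"balancer={enabled}")

    def move_region(self, region: str, target_server: str) -> None:
        self.calls.append(f"move {region} -> {target_server}")


def migrate_regions(admin: ClusterAdmin, regions: List[str],
                    targets: List[str]) -> None:
    """Close the balance path, then spread Regions over the target servers."""
    if not targets:
        raise ValueError("at least one target server is required")
    # Disable balancing first so migrated Regions are not moved back.
    admin.set_balancer_enabled(False)
    for i, region in enumerate(regions):
        admin.move_region(region, targets[i % len(targets)])  # round-robin


admin = RecordingAdmin()
migrate_regions(admin, ["r2", "r3", "r4"], ["rs-2", "rs-3"])
```

The recording stand-in makes the call order easy to inspect; a real deployment would bind the same interface to whatever administrative tooling the cluster provides.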

Preferably, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests. Specifically, the node indexes collected in step S1 correspond to the parameters used when migrating Region units in step S2, so the node indexes include at least the items above. The quantity distribution of the request queues, the time distribution of the request queues and the time consumption distribution of reads, writes, queries and the like are the factors used in the subsequent backlog judgment; quantifying these factors as specific parameters standardizes the backlog judgment. Further, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes. Because the request queue is checked at a fixed period, the recovery system for a single-node abnormality of a database disclosed in this embodiment is always in a working state, and the stable operation of the system indirectly ensures the stability of the database.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present invention, and are not described in detail herein.

It should be noted that, when the recovery system for a single-node abnormality of a database provided in the foregoing embodiment recovers from a database cluster node abnormality, the division into the functional modules above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the recovery system may be divided into different functional modules to complete all or part of the functions described above. In addition, the recovery system embodiment and the recovery method embodiment provided above belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A system for recovering from a single-node exception to a database, the system comprising:

2. The recovery system for a single-node abnormality of a database according to claim 1, wherein the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit; the request quantity monitoring unit is configured to compare the request quantity in the request conditions with that of half an hour before the backlog and, if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, to close the balance path of the storage and management server and migrate the Region units on the storage and management server whose increase does not exceed the preset threshold; and the request time consumption monitoring unit is configured to check the request time of the storage and management server when the period-over-period increase in the request quantity compared with half an hour earlier does not exceed the preset threshold and, if the request time has increased and exceeds a preset time, to close the balance path of the storage and management server and migrate the Region units on the storage and management server whose request time does not exceed the preset time.

3. The system for recovering the single-node abnormality of the database according to claim 2, wherein the index monitoring module further includes a hardware monitoring unit configured to close the balance path of the storage and management server and migrate all the Region units on the storage and management server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated any Region unit.

4. The system for recovering the single-node exception of the database according to claim 1, wherein the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes.

5. The system for recovering a single-node exception from a database according to claim 1, wherein said node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of query requests.

6. The system for recovering the single-node abnormality of the database according to claim 1, wherein the database at least comprises an HBase database, the connection pool of the database at least comprises a Druid connection pool, and the storage and management server comprises a plurality of Region units.

7. A recovery method for single-node exception of a database is characterized by comprising the following steps:

8. The method for recovering the single-node abnormality of the database according to claim 7, wherein the backlog of the request queue includes a request quantity backlog and a request time backlog; for the request quantity backlog, the request quantity in the request conditions is compared with that of half an hour before the backlog and, if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on the storage and management server whose increase does not exceed the preset threshold are migrated; and for the request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked and, if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on the storage and management server whose request time does not exceed the preset time are migrated.

9. The method according to claim 8, wherein the backlog of the request queue further includes a hardware request backlog, in which, when neither the request quantity backlog nor the request time backlog occurs, the balance path of the storage and management server is closed and all Region units on the storage and management server are migrated.

10. The method according to claim 7, wherein the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.