CN110704223A

CN110704223A - Recovery system and method for single-node abnormity of database

Info

Publication number: CN110704223A
Application number: CN201910870884.9A
Authority: CN
Inventors: 张立明; 郭业俊; 孙迁
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Jiangsu Suning Cloud Computing Co ltd; SuningCom Co ltd
Priority date: 2019-09-16
Filing date: 2019-09-16
Publication date: 2020-01-17
Anticipated expiration: 2039-09-16
Also published as: CN110704223B

Abstract

The invention discloses a recovery system and a recovery method for a single-node abnormality of a database. The recovery system comprises: an index acquisition module, used for collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database; and an index monitoring module, used for monitoring the node indexes imported into the database connection pool, checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is determined to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions. Embodiments of the invention handle and recover from database node abnormalities automatically, which improves the stability of the database cluster.

Description

Recovery system and method for single-node abnormity of database

Technical Field

The invention relates to the field of big data, in particular to a system and a method for recovering single-node abnormity of a database.

Background

Databases are now used in a wide range of applications, and different types of database have emerged to serve them. HBase, for example, is a distributed key-value database characterized by high availability, easy scaling and mass storage. HBase data is stored on HDFS and is highly available, so the failure of any single machine does not cause data loss. The database can nevertheless become unstable in use. Typical causes include server hardware problems, such as disk damage or memory anomalies; service process problems, such as GC in the regionserver or unknown program bugs; and database service problems, such as resource contention caused by faulty queries. Any of these can leave a single node of the database in an abnormal state, which increases user request latency or blocks requests entirely, greatly reduces the stability of HBase, and in turn affects users of the database.

Two approaches exist for dealing with database node problems. The first controls the traffic threshold individually for each node in the database cluster network, after which the traffic threshold of the whole database cluster network can be reduced in a synchronized manner. Its disadvantage is that when requests in the cluster are skewed, the request volume of a single node is excessive while the request volume of the cluster as a whole is still small, so flow control is inaccurate. The second approach counts the request volume of the cluster through a service on a single node: every request in the cluster network passes through that service, which determines the traffic threshold of the whole cluster network. Its disadvantage is that sequential judgment by a single-node service easily becomes a bottleneck for the whole cluster network, and reliability is difficult to guarantee. Both approaches address node abnormality by controlling node traffic, and neither effectively ensures the stability of the database service in operation.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a system and a method for recovering a single-node exception of a database, which can automatically process an exception of a database node, automatically recover an exception of the database node, and improve stability of a database cluster.

In order to solve the technical problems, the invention adopts the technical scheme that:

In a first aspect, an embodiment of the present invention provides a recovery system for a single-node abnormality of a database, where the recovery system includes:

an index acquisition module, used for collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and an index monitoring module, used for monitoring the node indexes imported into the database connection pool, checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is determined to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions.

Further, the index monitoring module comprises a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit compares the request quantity in the request conditions with the request quantity half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on that server whose increase does not exceed the preset threshold. The request time consumption monitoring unit checks the request time of the storage and management server when the period-over-period increase in the request quantity of the Region units, compared with half an hour earlier, does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on that server whose request time does not exceed the preset time.

Further, the index monitoring module includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all the Region units on that server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated Region units.

Further, the request queue in the node indexes is checked once every 30 seconds; the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes.

Further, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.

Further, the database at least comprises an HBase database, a connection pool of the database at least comprises a Druid connection pool, and the storage and management server comprises a plurality of Region units.

In another aspect, an embodiment of the invention also discloses a recovery method for a single-node abnormality of a database, which comprises the following steps:

collecting each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and checking the request queue in the node index, checking a storage and management server corresponding to the request queue after confirming that the request queue has backlog, determining the request conditions of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request conditions meet preset conditions.

Further, the backlog of the request queue comprises a request quantity backlog and a request time backlog. For a request quantity backlog, the request quantity in the request conditions is compared with the request quantity half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on that server whose increase does not exceed the preset threshold are migrated. For a request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on that server whose request time does not exceed the preset time are migrated.

Further, the backlog of the request queue further includes a hardware request backlog, and the hardware request backlog is that when neither the request quantity backlog nor the request time backlog occurs, the balance path of the storage and management server is closed, and all Region units on the storage and management server are completely migrated.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

The embodiments of the invention disclose a recovery system and a recovery method for a single-node abnormality of a database. When a node abnormality is caused by a serious backlog of the request queue of a database node, the index acquisition module and the index monitoring module monitor and process the request quantity and request time of the request queues in the database. When the quantity and time of the request queues imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is processed according to the set request quantity threshold and request time threshold, and Region units on that server are migrated. This reduces the high latency and node blocking caused by request queue backlog on database nodes, automatically restores the flow of data through the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a recovery system for a single-node abnormality of a database according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for recovering from a single-node abnormality of a database according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of backlog processing in the method for recovering from a single-node abnormality of a database according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment one:

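FIG. 1 is a schematic structural diagram of the recovery system for a single-node abnormality of a database disclosed in this embodiment. As shown in FIG. 1, the recovery system includes:

an index acquisition module, used for collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and an index monitoring module, used for monitoring the node indexes imported into the database connection pool, checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is determined to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions.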

Specifically, the recovery system provided by this embodiment is a fast recovery scheme for single-node abnormalities of an HBase database. Preferably, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server includes a plurality of Region units. HBase is a distributed key-value database characterized by high availability, easy scaling and mass storage. Its data is stored on HDFS (Hadoop Distributed File System), a distributed file system designed to run on commodity hardware, so the data is highly available and the failure of any single machine does not cause data loss. If a node becomes abnormal, however, user request latency increases or requests are blocked entirely, which greatly affects the stability of the HBase database. In use, abnormal conditions can arise from server hardware problems, such as disk damage or memory abnormalities; from service process problems, such as GC in the regionserver or unknown program bugs; and from database service problems, such as resource contention caused by faulty queries.

When a node becomes abnormal, the recovery system disclosed in this embodiment uses the index acquisition module and the index monitoring module to collect and monitor the request quantity and request time of the request queues in the database. When the quantity and time of the request queues imported into the connection pool of the database increase significantly, the system processes the storage and management server corresponding to the backlogged queue according to the set request quantity threshold and request time threshold, and migrates Region units on that server. This reduces the high latency and blocking caused by request queue backlog on database nodes, automatically restores the flow of data through the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.

Further, as can be seen in FIG. 1, a data table is composed of a plurality of Region units distributed across different Region servers. In the present invention, a Region server is the storage and management server, a Region unit is the unit in which an HBase database stores data, and an index is an operating index of the HBase database cluster. The difference calculation and aggregation performed by the index acquisition module specifically consist of computing the increment of each node index per unit time and aggregating the values of the Region units up to the data table.
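As a non-limiting illustration, the difference calculation and aggregation described above might be sketched in Python as follows. The sketch assumes that the raw node indexes are cumulative per-Region counters sampled at a fixed interval; the function names, the Region naming scheme and the handling of counter resets are hypothetical and are not specified by the embodiment.

```python
from collections import defaultdict


def counter_deltas(previous, current):
    """Turn cumulative per-Region counters into increments for one interval.

    Assumption: a Region absent from `previous`, or whose counter went
    backwards (for example after a restart), contributes its current value.
    """
    deltas = {}
    for region, value in current.items():
        before = previous.get(region, 0)
        deltas[region] = value - before if value >= before else value
    return deltas


def aggregate_by_table(deltas, region_to_table):
    """Sum per-Region increments into per-table totals."""
    totals = defaultdict(int)
    for region, delta in deltas.items():
        totals[region_to_table[region]] += delta
    return dict(totals)


# Two consecutive samples of a cumulative request counter (made-up values).
previous = {"orders,a": 1000, "orders,m": 4000, "users,a": 250}
current = {"orders,a": 1600, "orders,m": 4100, "users,a": 300}
tables = {"orders,a": "orders", "orders,m": "orders", "users,a": "users"}

deltas = counter_deltas(previous, current)
per_table = aggregate_by_table(deltas, tables)
print(deltas)
print(per_table)
```

Keeping the per-Region increments alongside the per-table totals allows the later monitoring step to examine individual Region units on a backlogged server.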

Preferably, the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit compares the request quantity in the request conditions with the request quantity half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on that server whose increase does not exceed the preset threshold. The request time consumption monitoring unit checks the request time of the storage and management server when the period-over-period increase in the request quantity of the Region units, compared with half an hour earlier, does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on that server whose request time does not exceed the preset time. Further, the index monitoring module includes a hardware monitoring unit, which closes the balance path of the storage and management server and migrates all the Region units on that server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated Region units.

The three units of the index monitoring module thus address three problems that arise when the HBase database is in use. When an increase in the request quantity of a Region unit causes a backlog on the storage and management server, and hence a large backlog of the request queue, the request quantity monitoring unit migrates Region units so that the Region units that do not meet the condition are not affected, and the abnormality is recovered. When the request processing time of some Region units increases, possibly because of a problem with the Region data or poor query performance, and the request queue of the corresponding storage and management server becomes backlogged, the request time consumption monitoring unit migrates Region units so that the other Region units are likewise not affected. Besides these two cases, a hardware problem may cause the system to malfunction, in which case the hardware monitoring unit migrates all the Region units. Because Region units are migrated automatically in this embodiment, manual operation and maintenance costs are reduced, and tables in the same cluster affect one another less: a problem with one data table does not affect the use of the other tables.
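The order in which the three monitoring units act can be illustrated by the following Python sketch. It is a simplified reading of the embodiment, not a definitive implementation: the `RegionStats` record, the function names, and the default thresholds (a 100% increase and 500 ms) are hypothetical, and the request time is modelled per Region unit as an assumption.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    name: str
    requests_now: int      # requests in the current window
    requests_before: int   # requests in the same window half an hour earlier
    latency_ms: float      # current request processing time


def increase(now, before):
    """Period-over-period increase as a fraction of the earlier value."""
    if before == 0:
        return float("inf") if now > 0 else 0.0
    return (now - before) / before


def plan_migration(regions, max_increase=1.0, max_latency_ms=500.0):
    """Return (reason, names of Region units to migrate) for a backlogged server."""
    surged = {r.name for r in regions
              if increase(r.requests_now, r.requests_before) > max_increase}
    if surged:
        # Request quantity case: move the Region units that did not surge.
        return "request_quantity", [r.name for r in regions if r.name not in surged]
    slow = {r.name for r in regions if r.latency_ms > max_latency_ms}
    if slow:
        # Request time case: move the Region units that are not slow.
        return "request_time", [r.name for r in regions if r.name not in slow]
    # Neither case applies: treat as a hardware problem and move everything.
    return "hardware", [r.name for r in regions]


regions = [
    RegionStats("r1", 9000, 1000, 20.0),
    RegionStats("r2", 1100, 1000, 25.0),
    RegionStats("r3", 500, 480, 30.0),
]
print(plan_migration(regions))
```

In each of the first two cases the Region units that remain within the thresholds are the ones moved, so that they are no longer affected by the Region units causing the backlog.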

Preferably, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes. Checking the request queue at a fixed period keeps the recovery system disclosed in this embodiment continuously in operation, and the stable operation of the system in turn helps ensure the stability of the database. The 30-second period applies to the HBase database; for other databases or other application scenarios the check period can be adjusted programmatically so that the check runs cyclically at the most suitable period. Likewise, the criterion of more than 1000 for two consecutive minutes is the preferred choice for the HBase database in actual operation; criteria for other scenarios may differ according to the data volume involved. Other criteria and other check periods fall within the protection scope of the present invention.
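By way of example only, the backlog criterion above might be checked as in the following Python sketch. It assumes that "two consecutive minutes" at a 30-second check period means four consecutive samples above the threshold; the class name `BacklogDetector` and its parameters are hypothetical, and the period, window and threshold are left adjustable as the embodiment allows.

```python
from collections import deque


class BacklogDetector:
    """Flags a backlog when every sample in the window exceeds the threshold."""

    def __init__(self, period_s=30, window_s=120, threshold=1000):
        self.threshold = threshold
        # Number of samples that cover the window (4 with the defaults).
        self.samples = deque(maxlen=window_s // period_s)

    def observe(self, queue_length):
        """Record one periodic reading and report whether a backlog exists."""
        self.samples.append(queue_length)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(n > self.threshold for n in self.samples)


detector = BacklogDetector()
readings = [1200, 1500, 900, 1100, 1300, 1250, 1400]  # one reading per 30 s
flags = [detector.observe(n) for n in readings]
print(flags)
```

The single reading of 900 resets the run, so a backlog is reported only once four readings in a row exceed 1000.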

Preferably, the node indexes include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests, and the time consumption distribution of query requests. Specifically, the node indexes collected by the index acquisition module correspond to the parameters the index monitoring module uses when migrating Region units, so the node indexes include at least the items above. The number of request queues, the time distribution of the request queues, and the time consumption distribution of reads, writes and queries are all inputs to the subsequent backlog judgment. Quantifying them standardizes the backlog judgment and can shorten the recovery time when a database cluster node becomes abnormal. For other databases, the types of node index can be extended to suit the scenario, and the backlog criterion for the request queue adjusted accordingly, so that further means of handling node abnormalities can be provided. Such extended index types fall within the protection scope of the present invention.
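One possible way to hold the node indexes listed above is sketched below in Python. The record layout and field names are hypothetical, the "number of request queues" is represented here as a queue length as an assumption, and the nearest-rank percentile is only one of several ways to summarize a time distribution.

```python
import math
from dataclasses import dataclass, field


@dataclass
class NodeMetrics:
    server: str
    queue_length: int                                      # number of queued requests
    queue_wait_ms: list = field(default_factory=list)      # time distribution of queued requests
    read_requests: int = 0                                 # number of read requests
    write_requests: int = 0                                # number of write requests
    query_latency_ms: list = field(default_factory=list)   # time consumption of queries


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


metrics = NodeMetrics(
    server="rs1",
    queue_length=1350,
    queue_wait_ms=[5, 8, 40, 120],
    read_requests=5200,
    write_requests=800,
    query_latency_ms=[12, 15, 11, 240, 14, 13, 18, 16, 17, 19],
)
p50 = percentile(metrics.query_latency_ms, 50)
p99 = percentile(metrics.query_latency_ms, 99)
print(p50, p99)
```

Reducing each distribution to a few percentiles gives the backlog judgment fixed numbers to compare against preset thresholds.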

Embodiment two:

This embodiment of the invention also discloses a method for recovering from a single-node abnormality of a database. FIG. 2 is a schematic flowchart of the recovery method disclosed in this embodiment, and FIG. 3 is a schematic flowchart of backlog processing in the recovery method. As shown in FIG. 2 and FIG. 3, the recovery method includes the following steps:

S1: collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database;

S2: checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is confirmed to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions.

Specifically, the recovery method comprises two main steps: first, collecting the node indexes and importing them into the connection pool of the database; second, migrating Region units when requests are backlogged, so that the request queue in the connection pool is kept from blocking and data flows smoothly through the cluster nodes. Further, the request quantity and request time of the request queues in the database are collected and monitored. When the quantity and time of the request queues imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is processed according to a set request quantity threshold and a set request time threshold, and Region units on that server are migrated; the quantity threshold and the time threshold can be adjusted to suit the scenario in which the database is used. The method thereby reduces the high latency and blocking caused by request queue backlog on database nodes, automatically restores the flow of data through the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.

Preferably, when a database node abnormality is recovered through steps S1 and S2 above, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server includes a plurality of Region units. Accordingly, when a database node becomes abnormal, the Region units migrated are preferably those of the HBase database, which avoids request queue backlog.

Preferably, the backlog of the request queue includes a request quantity backlog and a request time backlog. For a request quantity backlog, the request quantity in the request conditions is compared with the request quantity half an hour before the backlog; if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on that server whose increase does not exceed the preset threshold are migrated. For a request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on that server whose request time does not exceed the preset time are migrated. Further, the backlog of the request queue includes a hardware request backlog: when neither a request quantity backlog nor a request time backlog has occurred, the balance path of the storage and management server is closed and all the Region units on that server are migrated.

In particular, recovering abnormal cluster nodes under these three conditions (request quantity backlog, request time backlog and hardware request backlog) addresses three problems that arise when the HBase database is in use. When an increase in the request quantity of a Region unit causes a backlog on the storage and management server, and hence a large backlog of the request queue, Region units are first migrated by the request quantity backlog procedure so that the other Region units, which do not meet the condition, are not affected, and the abnormality is recovered. When the request processing time of some Region units increases, possibly because of a problem with the Region data or poor query performance, and the request queue of the corresponding storage and management server becomes backlogged, Region units are migrated by the request time backlog procedure, which likewise keeps the other Region units from being affected and recovers the node abnormality automatically. Besides these two cases, a hardware problem may cause the system to malfunction, in which case all the Region units are migrated by the hardware request backlog procedure. In all three migration procedures, the balance path of the storage and management server must be closed to prevent the migrated Region units from flowing back. Because Region units are migrated automatically in this embodiment, manual operation and maintenance costs are reduced, and tables in the same cluster affect one another less: a problem with one data table does not affect the use of the other tables.
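The requirement that the balance path be closed before Region units are migrated can be illustrated with the following Python sketch. The `InMemoryCluster` class is a hypothetical stand-in for a cluster administration client and is not an HBase API; choosing the least-loaded other server as the migration target is also an assumption, since the embodiment does not state how the target is selected.

```python
class InMemoryCluster:
    """Hypothetical stand-in for a cluster admin client; not a real HBase API."""

    def __init__(self, assignment):
        self.assignment = {server: list(regions) for server, regions in assignment.items()}
        self.balancer_enabled = True
        self.log = []

    def set_balancer(self, enabled):
        self.balancer_enabled = enabled
        self.log.append(("balancer", enabled))

    def move_region(self, region, source, target):
        self.assignment[source].remove(region)
        self.assignment[target].append(region)
        self.log.append(("move", region, target))


def evacuate(cluster, source, regions):
    """Close the balance path first, then move each Region unit off `source`."""
    cluster.set_balancer(False)  # otherwise moved Region units could flow back
    for region in regions:
        # Assumed policy: pick the other server with the fewest Region units,
        # breaking ties by server name.
        target = min(
            (s for s in cluster.assignment if s != source),
            key=lambda s: (len(cluster.assignment[s]), s),
        )
        cluster.move_region(region, source, target)


cluster = InMemoryCluster({"rs1": ["a", "b", "c"], "rs2": ["d"], "rs3": []})
evacuate(cluster, "rs1", ["b", "c"])
print(cluster.assignment)
print(cluster.log)
```

The sketch leaves the balance path closed after the migration, because the embodiment does not say when, or whether, it is reopened.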

Preferably, the node indexes include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests, and the time consumption distribution of query requests. Specifically, the node indexes collected in step S1 correspond to the parameters used when migrating Region units in step S2, so the node indexes include at least the items above. The number of request queues, the time distribution of the request queues, and the time consumption distribution of reads, writes and queries are all inputs to the subsequent backlog judgment, and quantifying them standardizes that judgment. Further, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes. Checking the request queue at a fixed period keeps the recovery system disclosed in this embodiment continuously in operation, and the stable operation of the system in turn helps ensure the stability of the database.

All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.

It should be noted that the recovery system for a single-node abnormality of a database provided in the above embodiment is described using the above division of functional modules only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the recovery system may be divided into different functional modules to perform all or part of the functions described above. In addition, the recovery system embodiment and the recovery method embodiment belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

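1. A system for recovering from a single-node abnormality of a database, the system comprising:

an index acquisition module, used for collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and an index monitoring module, used for monitoring the node indexes imported into the database connection pool, checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is determined to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions.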

2. The system for recovering from a single-node abnormality of a database according to claim 1, wherein the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit; the request quantity monitoring unit is configured to compare the request quantity in the request conditions with the request quantity half an hour before the backlog and, if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, to close the balance path of the storage and management server and migrate the Region units on that server whose increase does not exceed the preset threshold; and the request time consumption monitoring unit is configured to check the request time of the storage and management server when the period-over-period increase in the request quantity of the Region units, compared with half an hour earlier, does not exceed the preset threshold and, if the request time has increased and exceeds a preset time, to close the balance path of the storage and management server and migrate the Region units on that server whose request time does not exceed the preset time.

3. The system for recovering from a single-node abnormality of a database according to claim 2, wherein the index monitoring module further includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all the Region units on that server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated Region units.

4. The system for recovering from a single-node abnormality of a database according to claim 1, wherein the request queue in the node indexes is checked once every 30 seconds; the request queue is judged to be backlogged when the number of requests in the queue stays above 1000 for two consecutive minutes.

5. The system for recovering database single-node anomaly according to claim 1, wherein the node indexes at least comprise: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.

6. The system for recovering from a single-node abnormality of a database according to claim 1, wherein the database at least comprises an HBase database, the connection pool of the database at least comprises a Druid connection pool, and the storage and management server comprises a plurality of Region units.

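7. A recovery method for a single-node abnormality of a database, characterized by comprising the following steps:

collecting the indexes of each node of the database, performing difference calculation and aggregation on the node indexes, and then importing the processed node indexes into a connection pool of the database;

and checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue once the request queue is confirmed to be backlogged, determining the request conditions of all Region units on that storage and management server, and migrating Region units on the storage and management server if the request conditions meet preset conditions.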

8. The method for recovering from a single-node abnormality of a database according to claim 7, wherein the backlog of the request queue includes a request quantity backlog and a request time backlog; for the request quantity backlog, the request quantity in the request conditions is compared with the request quantity half an hour before the backlog and, if the period-over-period increase in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on that server whose increase does not exceed the preset threshold are migrated; and for the request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked and, if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on that server whose request time does not exceed the preset time are migrated.

9. The method according to claim 8, wherein the backlog of the request queue further includes a hardware request backlog: when neither the request quantity backlog nor the request time backlog has occurred, the balance path of the storage and management server is closed and all the Region units on that server are migrated.

10. The method according to claim 7, wherein the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.