CN110704223A - Recovery system and method for single-node abnormity of database - Google Patents

Recovery system and method for single-node abnormity of database

Info

Publication number
CN110704223A
Authority
CN
China
Prior art keywords
request
database
node
storage
backlog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910870884.9A
Other languages
Chinese (zh)
Other versions
CN110704223B (en)
Inventor
张立明
郭业俊
孙迁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suning Cloud Computing Co ltd
SuningCom Co ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201910870884.9A
Publication of CN110704223A
Application granted
Publication of CN110704223B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0793 Remedial or corrective actions
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques
    • G06F2201/81 Threshold

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a recovery system and a recovery method for single-node abnormity of a database. The recovery system comprises: an index acquisition module, configured to acquire the node indexes of the database, perform difference calculation and aggregation processing on them, and import the processed node indexes into a connection pool of the database; and an index monitoring module, configured to monitor the node indexes imported into the database connection pool, check the request queue in the node indexes, and, after determining that the request queue has a backlog, check the storage and management server corresponding to the request queue, determine the request conditions of all Region units on that server, and migrate Region units on the server if the request conditions meet preset conditions. By automatically handling and recovering from database node abnormalities, the embodiments of the invention improve the stability of the database cluster.

Description

Recovery system and method for single-node abnormity of database
Technical Field
The invention relates to the field of big data, in particular to a system and a method for recovering single-node abnormity of a database.
Background
Databases are now widely used, and different types of database have emerged for different applications. For example, HBase is a distributed key-value database characterized by high availability, easy expansion and mass storage. HBase data is stored on HDFS and is highly available, so the failure of any single machine does not cause data loss. In use, however, a database can still become unstable for several reasons: server hardware problems, such as disk damage or memory anomalies; service process problems, such as garbage collection (GC) in the RegionServer or unknown program bugs; and database service problems, such as resource contention caused by faulty queries. When such a problem makes a node of the database abnormal, user request latency increases or requests are blocked entirely, which greatly affects the stability of HBase and, in turn, the user's use of the database.
Two approaches exist for dealing with database node problems. The first controls a traffic threshold individually for each node in the database cluster, so that traffic across the whole cluster is reduced in step. Its drawback is that when requests in the cluster are skewed, the request volume of a single node can be excessive while the request volume of the whole cluster is still small, so the flow control is inaccurate. The second counts the cluster's request volume through a service on a single node: every request in the cluster passes through that service, which determines the traffic threshold for the whole cluster. Its drawback is that the single-node service easily becomes a bottleneck for the whole cluster, and its reliability is difficult to guarantee. Both approaches address node abnormality by controlling node traffic, and neither effectively ensures the stability of the database service in operation.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a system and a method for recovering from a single-node abnormality of a database, which automatically handle and recover from abnormalities of database nodes and thereby improve the stability of the database cluster.
In order to solve the above technical problems, the invention adopts the following technical solutions:
In a first aspect, an embodiment of the present invention provides a recovery system for single-node abnormity of a database, the recovery system comprising:
an index acquisition module, configured to acquire the node indexes of the database, perform difference calculation and aggregation processing on the node indexes, and then import the processed node indexes into a connection pool of the database;
and an index monitoring module, configured to monitor the node indexes imported into the database connection pool, check the request queue in the node indexes, and, after determining that the request queue has a backlog, check the storage and management server corresponding to the request queue, determine the request conditions of all Region units on that server, and migrate Region units on the server if the request conditions meet preset conditions.
Further, the index monitoring module comprises a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit is configured to compare the current request quantity with the request quantity half an hour before the backlog; if the period-over-period rise in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on that server whose rise does not exceed the preset threshold. The request time consumption monitoring unit is configured to check the request time of the storage and management server when the rise in the request quantity of the Region units relative to half an hour earlier does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on that server whose request time does not exceed the preset time.
Further, the index monitoring module also includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all Region units on that server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated any Region unit.
Further, the request queue in the node indexes is checked once every 30 seconds; the request queue is judged to have a backlog when the number of queued requests exceeds 1000 for two consecutive minutes.
Further, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.
Further, the database at least comprises an HBase database, a connection pool of the database at least comprises a Druid connection pool, and the storage and management server comprises a plurality of Region units.
In another aspect, an embodiment of the invention also discloses a recovery method for single-node abnormity of a database, comprising the following steps:
collecting the node indexes of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;
and checking the request queue in the node indexes and, after confirming that the request queue has a backlog, checking the storage and management server corresponding to the request queue, determining the request conditions of all Region units on that server, and migrating Region units on the server if the request conditions meet preset conditions.
Further, the backlog of the request queue comprises a request quantity backlog and a request time backlog. For a request quantity backlog, the current request quantity is compared with the request quantity half an hour before the backlog; if the period-over-period rise in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on that server whose rise does not exceed the preset threshold are migrated. For a request time backlog, when the rise in the request quantity of the Region units does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on that server whose request time does not exceed the preset time are migrated.
Further, the backlog of the request queue also includes a hardware request backlog: when neither a request quantity backlog nor a request time backlog has occurred, the balance path of the storage and management server is closed and all Region units on that server are migrated.
Further, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.
The technical solutions provided by the embodiments of the invention have the following beneficial effects:
the embodiments of the invention disclose a recovery system and a recovery method for single-node abnormity of a database. When a node becomes abnormal because its request queue is severely backlogged, the index acquisition module and the index monitoring module monitor the request quantity and request time of the request queues in the database. When the quantity and time of the request queues imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is processed according to the set request quantity threshold and request time threshold, and Region units on that server are migrated. This reduces the high latency and node blocking caused by request queue backlog in database nodes, automatically restores the data flow of the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a recovery system for single-node abnormity of a database according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a recovery method for single-node abnormity of a database according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of backlog processing in the recovery method for single-node abnormity of a database according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment one:
FIG. 1 is a schematic structural diagram of the recovery system for single-node abnormity of a database disclosed in this embodiment. As shown in FIG. 1, the recovery system includes:
an index acquisition module, configured to acquire the node indexes of the database, perform difference calculation and aggregation processing on the node indexes, and then import the processed node indexes into a connection pool of the database;
and an index monitoring module, configured to monitor the node indexes imported into the database connection pool, check the request queue in the node indexes, and, after determining that the request queue has a backlog, check the storage and management server corresponding to the request queue, determine the request conditions of all Region units on that server, and migrate Region units on the server if the request conditions meet preset conditions.
Specifically, the recovery system provided by this embodiment is a fast recovery scheme for single-node abnormality of an HBase database. Preferably, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server hosts a plurality of Region units. HBase is a distributed key-value database characterized by high availability, easy expansion and mass storage. Its data is stored on HDFS (Hadoop Distributed File System), a distributed file system designed to run on commodity hardware, so the data is highly available and the failure of any single machine does not cause data loss. If a node becomes abnormal, however, user request latency increases or requests are blocked entirely, which greatly affects the stability of the HBase database. Several kinds of problem can cause such an abnormality while the HBase database is in use: server hardware problems, such as disk damage or memory anomalies; service process problems, such as garbage collection (GC) in the RegionServer or unknown program bugs; and database service problems, such as resource contention caused by faulty queries. Once a node abnormality occurs, the recovery system disclosed in this embodiment uses the index acquisition module and the index monitoring module to collect and monitor the request quantity and request time of the request queues in the database. When the quantity and time of the request queues imported into the connection pool of the database increase significantly, the system processes the storage and management server corresponding to the backlogged queue according to the set request quantity threshold and request time threshold and migrates Region units on that server. This reduces the high latency and blocking caused by request queue backlog in database nodes, automatically restores the data flow of the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.
Further, as can be seen in FIG. 1, a data table is composed of a plurality of Region units distributed across different Region servers. In the present invention, a Region server is the storage and management server, a Region unit is the unit in which an HBase database stores data, and an index is an operating metric of the HBase database cluster. Specifically, the index acquisition module performs the difference calculation on the node indexes to obtain their increment per unit time, and performs the aggregation processing by aggregating the Region units into their data tables.
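The difference calculation and aggregation described above can be sketched as follows. This is only a minimal illustration, not the patented implementation: it assumes the collected indexes are cumulative request counters keyed by a hypothetical (table, Region) pair, and it assumes that a counter which moves backwards indicates a restart.

```python
from collections import defaultdict


def per_interval_increments(previous, current):
    """Turn cumulative per-Region counters into increments for one interval.

    `previous` and `current` map (table, region) -> cumulative request count.
    A counter that went backwards is treated as a restart, so the current
    value is used as the increment (an assumption of this sketch).
    """
    increments = {}
    for key, value in current.items():
        before = previous.get(key, 0)
        increments[key] = value - before if value >= before else value
    return increments


def aggregate_by_table(increments):
    """Sum per-Region increments into per-table totals."""
    totals = defaultdict(int)
    for (table, _region), delta in increments.items():
        totals[table] += delta
    return dict(totals)


# Hypothetical sample data: two collection rounds for three Region units.
previous = {("orders", "r1"): 100, ("orders", "r2"): 40, ("users", "r3"): 7}
current = {("orders", "r1"): 160, ("orders", "r2"): 45, ("users", "r3"): 3}

deltas = per_interval_increments(previous, current)
table_totals = aggregate_by_table(deltas)
```

Working on increments rather than raw cumulative counters is what makes the later comparison against an earlier period meaningful.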
Preferably, the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit. The request quantity monitoring unit is configured to compare the current request quantity with the request quantity half an hour before the backlog; if the period-over-period rise in the request quantity of a Region unit exceeds a preset threshold, it closes the balance path of the storage and management server and migrates the Region units on that server whose rise does not exceed the preset threshold. The request time consumption monitoring unit is configured to check the request time of the storage and management server when the rise in the request quantity of the Region units relative to half an hour earlier does not exceed the preset threshold; if the request time has increased and exceeds a preset time, it closes the balance path of the storage and management server and migrates the Region units on that server whose request time does not exceed the preset time. Further, the index monitoring module also includes a hardware monitoring unit, configured to close the balance path of the storage and management server and migrate all Region units on that server when neither the request quantity monitoring unit nor the request time consumption monitoring unit has migrated any Region unit. The three units of the index monitoring module thus address the three problems that arise when the HBase database is in use. When a rise in the request quantity of a Region unit backlogs the storage and management server and, in turn, the request queue, the request quantity monitoring unit migrates Region units so that the Region units that do not meet the condition are not affected, which achieves recovery from the abnormality. When the request processing time of some Region units increases, possibly because of a problem with the Region data or poor query performance, and the request queue of the storage and management server is backlogged as a result, the request time consumption monitoring unit migrates Region units so that the other Region units are likewise not affected.
Besides these two cases, a hardware problem may cause faults in the system; in that case the hardware monitoring unit migrates all Region units. Because the migration of Region units is automatic in this embodiment, the cost of manual operation and maintenance is reduced. The mutual influence of tables in the same cluster is also reduced: a problem with one data table does not affect the use of the other tables.
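The three-way decision made by the monitoring units can be sketched as below. This is a simplified reading of the text under stated assumptions: the `RegionStats` fields, the rise threshold of 0.5 and the 200 ms latency limit are hypothetical, and the hardware branch is taken whenever neither of the first two conditions is detected.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    name: str
    requests_now: int          # request count in the current period
    requests_30min_ago: int    # request count half an hour before the backlog
    latency_ms: float          # current request time of the Region unit


def plan_migration(regions, rise_threshold=0.5, latency_limit_ms=200.0):
    """Return (reason, names of Region units to migrate) for a backlogged server."""

    def rise(region):
        base = max(region.requests_30min_ago, 1)  # avoid division by zero
        return (region.requests_now - region.requests_30min_ago) / base

    # Case 1: some Region unit's request quantity rose sharply; move the others.
    if any(rise(r) > rise_threshold for r in regions):
        return "request_count", [r.name for r in regions if rise(r) <= rise_threshold]

    # Case 2: request quantity is stable but some Region unit is slow; move the others.
    if any(r.latency_ms > latency_limit_ms for r in regions):
        return "request_time", [r.name for r in regions if r.latency_ms <= latency_limit_ms]

    # Case 3: neither condition holds, so a hardware problem is assumed; move everything.
    return "hardware", [r.name for r in regions]


hot_case = plan_migration([
    RegionStats("a", 3000, 1000, 20.0),
    RegionStats("b", 1050, 1000, 25.0),
])
slow_case = plan_migration([
    RegionStats("a", 1000, 1000, 900.0),
    RegionStats("b", 1000, 1000, 25.0),
])
hardware_case = plan_migration([
    RegionStats("a", 1000, 1000, 20.0),
    RegionStats("b", 1000, 1000, 25.0),
])
```

In each case the balance path of the server would be closed before the returned Region units are moved, as the text requires.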
Preferably, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to have a backlog when the number of queued requests exceeds 1000 for two consecutive minutes. Because the request queue is checked at a fixed period, the recovery system disclosed in this embodiment is always working, and its stable operation indirectly ensures the stability of the database. The 30-second check period applies only to the HBase database; for other databases or other application scenarios the check period can be adjusted in code, so that the request queue is checked cyclically at the period best suited to the scenario. Likewise, the criterion of more than 1000 queued requests for two consecutive minutes is the preferred choice for the HBase database in actual operation; criteria for other scenarios will differ according to the amount of data stored. Such other criteria and check periods are within the scope of the present invention.
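The backlog criterion above can be sketched as a small sliding-window check. The class name and the assumption that four consecutive 30-second samples cover "two consecutive minutes" are illustrative choices, not details given in the text.

```python
from collections import deque


class BacklogDetector:
    """Flag a backlog when every sample in the window exceeds the threshold."""

    def __init__(self, threshold=1000, check_period_s=30, window_s=120):
        self.threshold = threshold
        # Number of periodic checks that together span the window (4 by default).
        self.samples_needed = window_s // check_period_s
        self.recent = deque(maxlen=self.samples_needed)

    def observe(self, queue_length):
        """Record one periodic sample and report whether a backlog is confirmed."""
        self.recent.append(queue_length)
        return (len(self.recent) == self.samples_needed
                and all(q > self.threshold for q in self.recent))


detector = BacklogDetector()
# Hypothetical queue lengths sampled every 30 seconds.
samples = [1200, 1500, 900, 1100, 1300, 1400, 1600]
results = [detector.observe(q) for q in samples]
```

Requiring the whole window to exceed the threshold means that a single short spike, such as the dip to 900 in the sample data, resets the decision instead of triggering a migration.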
Preferably, the node metrics include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests. Specifically, the node indexes collected by the index acquisition module correspond to the parameters that the index monitoring module uses when migrating Region units, so the node indexes include at least the parameters above. The number of request queues, the time distribution of the request queues and the time consumption distribution of reads, writes and queries are all factors in the subsequent backlog judgment. Quantifying these factors standardizes the backlog judgment and can shorten the recovery time when a database cluster node becomes abnormal. For other databases, the types of node index can be extended according to the scenario and the backlog criterion for the request queue adjusted accordingly, so that further modules for handling node abnormality can be provided. Such extended index types are intended to be within the scope of the present invention.
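One way to hold the four metric types listed above is a simple record per storage and management server. The field names, the use of raw latency samples for the two distributions, and the `slow_query_ratio` helper are all assumptions of this sketch; the text does not specify a data layout.

```python
from dataclasses import dataclass, field


@dataclass
class NodeMetrics:
    server: str
    queue_length: int                                      # number of queued requests
    queue_wait_ms: list = field(default_factory=list)      # time distribution of the queue
    read_requests: int = 0                                 # read request count
    write_requests: int = 0                                # write request count
    query_latency_ms: list = field(default_factory=list)   # query time consumption

    def slow_query_ratio(self, limit_ms):
        """Fraction of sampled queries slower than `limit_ms`."""
        if not self.query_latency_ms:
            return 0.0
        slow = sum(1 for t in self.query_latency_ms if t > limit_ms)
        return slow / len(self.query_latency_ms)


# Hypothetical sample for one server.
sample = NodeMetrics("rs-1", 1200, [5.0, 40.0], 900, 300, [10.0, 250.0, 30.0, 400.0])
ratio = sample.slow_query_ratio(200.0)
```

Reducing a distribution to a single number like this ratio is one possible way to quantify the factors so that they can be compared with a preset threshold.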
Embodiment two:
An embodiment of the invention also discloses a recovery method for single-node abnormity of a database. FIG. 2 is a schematic flowchart of the recovery method, and FIG. 3 is a schematic flowchart of backlog processing in the recovery method. As shown in FIG. 2 and FIG. 3, the recovery method includes the following steps:
S1: collecting the node indexes of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;
S2: checking the request queue in the node indexes and, after confirming that the request queue has a backlog, checking the storage and management server corresponding to the request queue, determining the request conditions of all Region units on that server, and migrating Region units on the server if the request conditions meet preset conditions.
Specifically, the recovery method consists of two main steps: the first collects the node indexes and imports them into the connection pool of the database, and the second migrates Region units when requests are backlogged, so that the request queue in the connection pool does not stay blocked and data flows smoothly through the cluster nodes. Further, the request quantity and request time of the request queues in the database are collected and monitored. When the quantity and time of the request queues imported into the connection pool of the database are judged to have increased significantly, the storage and management server corresponding to the backlogged queue is processed according to a set request quantity threshold and a set request time threshold, and Region units on that server are migrated. The quantity threshold and the time threshold can be adjusted to suit the scenario in which the database is used. The method thereby reduces the high latency and blocking caused by request queue backlog in database nodes, automatically restores the data flow of the node, improves the efficiency of data transmission in the database, and ensures the stability of the database in use.
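The two steps can be pictured as one pass of a periodic loop. The sketch below is illustrative only: `recovery_cycle` and its four callables are hypothetical names, and the actual collection, backlog test, planning and migration logic are supplied by the caller.

```python
def recovery_cycle(collect, has_backlog, plan, migrate):
    """Run one S1 -> S2 pass and return the migrations that were requested."""
    metrics = collect()                      # S1: processed indexes per server
    actions = []
    for server, node in metrics.items():     # S2: inspect each server's queue
        if not has_backlog(node):
            continue
        for region in plan(node):            # Region units chosen for migration
            migrate(server, region)
            actions.append((server, region))
    return actions


# Stub callables standing in for a real cluster (hypothetical data).
moved = []
actions = recovery_cycle(
    collect=lambda: {
        "rs-1": {"queue": 1500, "regions": ["a", "b"]},
        "rs-2": {"queue": 10, "regions": ["c"]},
    },
    has_backlog=lambda node: node["queue"] > 1000,
    plan=lambda node: node["regions"],
    migrate=lambda server, region: moved.append((server, region)),
)
```

Passing the steps in as callables keeps the thresholds adjustable per scenario, which matches the statement that the quantity and time thresholds can be tuned.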
Preferably, when database node abnormality is recovered through steps S1 and S2 above, the database at least includes an HBase database, the connection pool of the database at least includes a Druid connection pool, and the storage and management server hosts a plurality of Region units. When a database node abnormality occurs, the migration therefore preferably operates on the Region units of the HBase database, which avoids backlog of the request queue.
Preferably, the backlog of the request queue includes a request quantity backlog and a request time backlog. For a request quantity backlog, the current request quantity is compared with the request quantity half an hour before the backlog; if the period-over-period rise in the request quantity of a Region unit exceeds a preset threshold, the balance path of the storage and management server is closed and the Region units on that server whose rise does not exceed the preset threshold are migrated. For a request time backlog, when the rise in the request quantity of the Region units does not exceed the preset threshold, the request time of the storage and management server is checked; if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed and the Region units on that server whose request time does not exceed the preset time are migrated. Further, the backlog of the request queue also includes a hardware request backlog: when neither a request quantity backlog nor a request time backlog has occurred, the balance path of the storage and management server is closed and all Region units on that server are migrated.
In particular, recovering abnormal cluster nodes under these three conditions (request quantity backlog, request time backlog and hardware request backlog) addresses the three problems that arise when the HBase database is in use. When a rise in the request quantity of a Region unit backlogs the storage and management server and, in turn, the request queue, the request quantity backlog procedure migrates Region units so that the other Region units, which do not meet the condition, are not affected, which achieves recovery from the abnormality. When the request processing time of some Region units increases, possibly because of a problem with the Region data or poor query performance, and the request queue of the storage and management server is backlogged as a result, the request time backlog procedure migrates Region units so that the other Region units are likewise not affected and the node abnormality is recovered automatically. Besides these two cases, a hardware problem may cause faults in the system; in that case the hardware request backlog procedure migrates all Region units. In all three migration procedures, the balance path of the storage and management server must be closed to prevent the migrated Region units from flowing back. Because the migration of Region units is automatic in this embodiment, the cost of manual operation and maintenance is reduced. The mutual influence of tables in the same cluster is also reduced: a problem with one data table does not affect the use of the other tables.
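For an HBase cluster, closing the balance path and migrating Region units could be expressed with HBase shell commands. The sketch below only builds the command text and executes nothing. It assumes that "closing the balance path" corresponds to switching off the HBase balancer with `balance_switch false` and that a migration corresponds to the shell's `move` command; the text does not name these commands, and the Region names and target server are hypothetical.

```python
def build_recovery_script(encoded_region_names, target_server):
    """Return HBase shell commands as text; nothing is executed here."""
    # Switch the balancer off first so moved Region units are not moved back.
    commands = ["balance_switch false"]
    for region in encoded_region_names:
        commands.append(f"move '{region}', '{target_server}'")
    return "\n".join(commands)


script = build_recovery_script(
    ["example_region_a", "example_region_b"],   # hypothetical encoded Region names
    "host2.example.com,16020,1568000000000",    # hypothetical host,port,startcode
)
print(script)
```

Issuing the balancer command before any move reflects the ordering the text asks for: the balance path is closed so that migrated Region units do not flow back to the abnormal server.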
Preferably, the node indexes include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests. Specifically, the node indexes collected from the database in step S1 correspond to the parameters used when migrating Region units in step S2, so at least the above node indexes are included. The number of request queues, the time distribution of the request queues, and the time consumption distribution of read, write and query requests are all factors in the subsequent backlog judgment; quantizing these factors into specific parameters standardizes the backlog judgment. Further, the request queue in the node indexes is checked once every 30 seconds, and the request queue is judged to be backlogged when the number of request queues exceeds 1000 for two consecutive minutes. Because the request queue is checked at a fixed period, the recovery system for single-node abnormality of the database disclosed in this embodiment is always in a working state, and the stable operation of the system indirectly ensures the stability of the database.
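The backlog criterion above (a check every 30 seconds, more than 1000 for two consecutive minutes) can be sketched as a sliding window over the most recent checks. This is a minimal sketch under one stated assumption: "two consecutive minutes" is interpreted as 120 / 30 = 4 consecutive checks that all exceed the threshold. The class name `BacklogDetector` and its method are hypothetical.

```python
from collections import deque

CHECK_PERIOD_S = 30      # check period stated in the text
QUEUE_THRESHOLD = 1000   # backlog threshold stated in the text
BACKLOG_WINDOW_S = 120   # "two consecutive minutes"


class BacklogDetector:
    """Flags a backlog when every check in the window exceeds the threshold."""

    def __init__(self, period_s: int = CHECK_PERIOD_S,
                 threshold: int = QUEUE_THRESHOLD,
                 window_s: int = BACKLOG_WINDOW_S) -> None:
        self.threshold = threshold
        # Assumption: the window is covered by window_s // period_s checks (4 here).
        self.samples = deque(maxlen=window_s // period_s)

    def observe(self, queue_count: int) -> bool:
        """Record one periodic check and report whether a backlog is confirmed."""
        self.samples.append(queue_count)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(c > self.threshold for c in self.samples)
```

A caller would invoke `observe` once per check period with the current request queue figure; a single check at or below the threshold resets the streak, because that sample stays in the window until it ages out.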
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
It should be noted that the recovery system for single-node abnormality of the database provided in the above embodiment is described, for the abnormal recovery of database cluster nodes, using the above division of functional modules only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the recovery system for single-node abnormality of the database may be divided into different functional modules to complete all or part of the functions described above. In addition, the recovery system for single-node abnormality of the database provided by the above embodiment and the embodiment of the recovery method for single-node abnormality of the database belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A recovery system for single-node abnormality of a database, the system comprising:
the index acquisition module is used for acquiring each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;
and the index monitoring module is used for monitoring the node indexes imported into the connection pool of the database, checking a request queue in the node indexes, checking the storage and management server corresponding to the request queue after the request queue is determined to have backlog, determining the request situations of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request situations meet preset conditions.
2. The recovery system for single-node abnormality of a database according to claim 1, wherein the index monitoring module includes a request quantity monitoring unit and a request time consumption monitoring unit; the request quantity monitoring unit is configured to compare the request quantity in the request situation with the request quantity half an hour before the backlog, and, if the increase in the request quantity of a Region unit relative to half an hour earlier exceeds a preset threshold, close the balance path of the storage and management server and migrate the Region units on the storage and management server whose increase does not exceed the preset threshold; and the request time consumption monitoring unit is configured to check the request time of the storage and management server when the increase in the request quantity of the Region units relative to half an hour earlier does not exceed the preset threshold, and, if the request time has increased and exceeds a preset time, close the balance path of the storage and management server and migrate the Region units on the storage and management server whose request time does not exceed the preset time.
3. The system for recovering the single-node abnormality of the database according to claim 2, wherein the index monitoring module further includes a hardware monitoring unit, and is configured to close a balance path of the storage and management server and migrate all the Region units on the storage and management server when neither the request number monitoring unit nor the request time consumption monitoring unit migrates the Region units.
4. The system for recovering the single-node exception of the database according to claim 1, wherein the check period of the request queue in the node index is 30 seconds each time; the judgment criterion of backlog of the request queue is that the number of the request queues is more than 1000 in two consecutive minutes.
5. The system for recovering database single-node anomaly according to claim 1, wherein the node indexes at least comprise: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.
6. The system for recovering the single-node abnormality of the database according to claim 1, wherein the database at least comprises an HBase database, the connection pool of the database at least comprises a drain connection pool, and the storage and management server comprises a plurality of Region units.
7. A recovery method for single-node exception of a database is characterized by comprising the following steps:
collecting each node index of the database, performing difference calculation and aggregation processing on the node indexes, and then importing the processed node indexes into a connection pool of the database;
and checking the request queue in the node indexes, checking the storage and management server corresponding to the request queue after confirming that the request queue has backlog, determining the request situations of all Region units on the storage and management server, and migrating the Region units on the storage and management server if the request situations meet preset conditions.
8. The method for recovering the single-node abnormality of the database according to claim 7, wherein the backlog of the request queue includes a request quantity backlog and a request time backlog; for the request quantity backlog, the request quantity in the request situation is compared with the request quantity half an hour before the backlog, and, if the increase in the request quantity of a Region unit relative to half an hour earlier exceeds a preset threshold, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose increase does not exceed the preset threshold are migrated; and for the request time backlog, when the request quantity of the Region units per unit time does not exceed the preset threshold, the request time of the storage and management server is checked, and, if the request time has increased and exceeds a preset time, the balance path of the storage and management server is closed, and the Region units on the storage and management server whose request time does not exceed the preset time are migrated.
9. The method according to claim 8, wherein the backlog of the request queue further includes a hardware request backlog, and the hardware request backlog is that when neither the request number backlog nor the request time backlog occurs, a balance path of the storage and management server is closed, and all Region units on the storage and management server are migrated.
10. The method according to claim 7, wherein the node indexes include at least: the number of request queues, the time distribution of the request queues, the number of read-write requests and the time consumption distribution of the query requests.
CN201910870884.9A 2019-09-16 2019-09-16 Recovery system and method for single-node abnormity of database Active CN110704223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910870884.9A CN110704223B (en) 2019-09-16 2019-09-16 Recovery system and method for single-node abnormity of database

Publications (2)

Publication Number Publication Date
CN110704223A (en) 2020-01-17
CN110704223B (en) 2022-12-13

Family

ID=69195290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910870884.9A Active CN110704223B (en) 2019-09-16 2019-09-16 Recovery system and method for single-node abnormity of database

Country Status (1)

Country Link
CN (1) CN110704223B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101854400A (en) * 2010-06-09 2010-10-06 中兴通讯股份有限公司 Database synchronization deployment and monitoring method and device
US20140258224A1 (en) * 2013-03-11 2014-09-11 Oracle International Corporation Automatic recovery of a failed standby database in a cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DHARAVATH RAMESH等: "Design of a transaction recovery instance based on bi-directional ring election algorithm for crashed coordinator in distributed database systems", 《2012 WORLD CONGRESS ON INFORMATION AND COMMUNICATION TECHNOLOGIES》 *
朱涛等: "分布式数据库中一致性与可用性的关系", 《软件学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115020A (en) * 2020-08-27 2020-12-22 北京基调网络股份有限公司 Database connection pool abnormity monitoring method and device and computer equipment
CN115396291A (en) * 2022-08-23 2022-11-25 度小满科技(北京)有限公司 Redis cluster fault self-healing method based on kubernets trustees
CN115396291B (en) * 2022-08-23 2024-06-18 度小满科技(北京)有限公司 Kubernetes-managed-based redis cluster fault self-healing method

Also Published As

Publication number Publication date
CN110704223B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN105426292B (en) A kind of games log real time processing system and method
US11474874B2 (en) Systems and methods for auto-scaling a big data system
CN112751726B (en) Data processing method and device, electronic equipment and storage medium
CN112799923B (en) System abnormality cause determination method, device, equipment and storage medium
CN106325984B (en) Big data task scheduling device
CN110704223B (en) Recovery system and method for single-node abnormity of database
CN108132837A (en) A kind of distributed type assemblies dispatch system and method
CN110175070B (en) Distributed database management method, device, system, medium and electronic equipment
CN101808351A (en) Method and system for business impact analysis
CN110647531A (en) Data synchronization method, device, equipment and computer readable storage medium
CN111552701A (en) Method for determining data consistency in distributed cluster and distributed data system
CN111181800A (en) Test data processing method and device, electronic equipment and storage medium
JP6252309B2 (en) Monitoring omission identification processing program, monitoring omission identification processing method, and monitoring omission identification processing device
CN109684130B (en) Method and device for backing up data of computer room
CN115269193A (en) Method and device for realizing distributed load balance in automatic test
CN107105037B (en) Distributed video CDN resource management system and method based on file verification
CN104915376A (en) Cloud storage file archiving and compressing method
CN112817687A (en) Data synchronization method and device
CN112448855B (en) Method and system for updating block chain system parameters
CN113360479A (en) Data migration method and device, computer equipment and storage medium
CN110377396A (en) A kind of virtual machine Autonomic Migration Framework method, system and electronic equipment
CN112838962A (en) Performance bottleneck detection method and device for big data cluster
CN113342758B (en) Metadata management method, device, equipment and medium of file system
CN115250278B (en) Method and device for synchronizing multi-cloud management platform and cloud service provider resources
CN115065685B (en) Cloud computing resource scheduling method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee after: Jiangsu Suning cloud computing Co.,Ltd.

Country or region after: China

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee before: Suning Cloud Computing Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240315

Address after: 210000, 1-5 story, Jinshan building, 8 Shanxi Road, Nanjing, Jiangsu.

Patentee after: SUNING.COM Co.,Ltd.

Country or region after: China

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee before: Jiangsu Suning cloud computing Co.,Ltd.

Country or region before: China
