CN116991580A

CN116991580A - Distributed database system load balancing method and device

Info

Publication number: CN116991580A
Application number: CN202310934592.3A
Authority: CN
Inventors: 张晖; 刘芳蕾; 史大义; 孙哲华; 缪艺玮
Original assignee: Shanghai Yunxi Technology Co ltd
Current assignee: Shanghai Yunxi Technology Co ltd
Priority date: 2023-07-27
Filing date: 2023-07-27
Publication date: 2023-11-03
Anticipated expiration: 2043-07-27
Also published as: CN116991580B

Abstract

The invention discloses a method and a device for load balancing in a distributed database system, belonging to the technical field of distributed databases. The method optimizes the ordering rules for hotspot data, wherein the index information required by the ordering rules comprises: load pressure index set data; storage node processing capability index set data; historical load data; and newly added statistical information of the load balancing scheduling service. Based on this information, the candidate queues of range copies to be migrated out and of migration target storage nodes are sorted by priority. By optimizing the rules for selecting the range copies to migrate out and the target nodes to migrate into, the invention achieves efficient load balancing of the system, thereby improving the read-write performance of the distributed database system under high load pressure.

Description

Distributed database system load balancing method and device

Technical Field

The invention relates to the technical field of distributed databases, in particular to a method and a device for balancing loads of a distributed database system.

Background

Load balancing is one of the factors that must be considered in the design of a distributed system architecture; it generally refers to distributing requests/data evenly across multiple operating units for execution. A common Internet distributed architecture is divided into a client layer, a reverse proxy (nginx) layer, a site layer, a service layer and a data layer, and load balancing between the layers is implemented with different strategies:

load balancing from the client layer to the reverse proxy layer is achieved through DNS round-robin;

load balancing from the reverse proxy layer to the site layer is achieved through nginx;

load balancing from the site layer to the service layer is achieved through a service connection pool;

load balancing of the data layer must consider two aspects, data balancing and request balancing; the common approaches are horizontal partitioning by range and horizontal partitioning by hash.
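The two data-layer partitioning approaches named above can be contrasted with a minimal sketch. The split points, the shard count and the function names `range_shard` and `hash_shard` are illustrative assumptions, not part of the patent or of KaiwuDB.

```python
import bisect
import zlib

# Hypothetical split points: keys below "g" go to shard 0,
# keys in ["g", "n") to shard 1, and so on.
SPLIT_POINTS = ["g", "n", "t"]


def range_shard(key: str) -> int:
    """Range partitioning: the shard index is the number of split points <= key."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hash_shard(key: str, shard_count: int = 4) -> int:
    """Hash partitioning: a stable hash of the key modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count
```

Range partitioning keeps neighbouring keys on the same shard, which favours range scans but can concentrate load on one shard; hash partitioning spreads keys more evenly at the cost of key locality.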

The existing load balancing implementation of the distributed database KaiwuDB is based on horizontal partitioning by range. KaiwuDB first constructs keys from the user data table and logically divides the key space horizontally into multiple fragments according to the value range of the keys; each fragment is called a Range. Each Range has multiple copies (the number of copies is configurable), which are distributed on different cluster nodes and kept strongly consistent by means of the Raft protocol. This provides high availability and fault tolerance at the partition level, while load balancing is achieved by scheduling the placement of Ranges. To achieve load balancing, the KaiwuDB background service may perform range splitting and range migration repeatedly.

Range splitting: the leader node of the range calculates an appropriate key as the split point, initiates the split through the Raft protocol, and updates the range metadata to complete the split. The newly split range inherits the copy node distribution of its parent range. Splitting by itself does not balance load, but it produces ranges of finer granularity; the balanced scheduling algorithm of KaiwuDB then migrates these ranges to disperse read-write traffic to other nodes, achieving horizontal scaling and load balancing.

Range migration: the migration process first adds a new copy B to the Raft group of the range; copy B then replays the log until its data is consistent with the primary copy; after synchronization is complete, the range metadata is updated and the source copy A is deleted. Through range migration, load pressure moves from node A to node B, thereby achieving load balancing.
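The three migration steps described above (add a copy, replay the log, update metadata and delete the source) can be illustrated with a small in-memory sketch. The classes `Replica` and `RangeGroup` and the function `migrate_replica` are hypothetical stand-ins; a real Raft group involves consensus, snapshots and lease handling that are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    node: str
    log: list = field(default_factory=list)  # log entries applied so far


@dataclass
class RangeGroup:
    """Hypothetical stand-in for one range and its Raft group membership."""
    leader_node: str
    replicas: dict = field(default_factory=dict)  # node name -> Replica


def migrate_replica(group: RangeGroup, source: str, target: str) -> None:
    """Move one copy of the range from node `source` to node `target`."""
    if source not in group.replicas or target in group.replicas:
        raise ValueError("source must hold a copy and target must not")
    if source == group.leader_node:
        # Simplification: moving the leader copy is outside this sketch.
        raise ValueError("sketch does not cover moving the leader copy")
    leader_log = group.replicas[group.leader_node].log

    # Step 1: add a new, empty copy on the target node to the group.
    new_copy = Replica(node=target)
    group.replicas[target] = new_copy

    # Step 2: replay the leader's log until the new copy has caught up.
    for entry in leader_log[len(new_copy.log):]:
        new_copy.log.append(entry)

    # Step 3: only after catch-up, update membership and drop the source copy.
    if new_copy.log != leader_log:
        raise RuntimeError("new copy has not caught up")
    del group.replicas[source]
```

The source copy is removed only after the new copy has caught up, so the number of up-to-date copies never drops below its starting value during the move.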

In summary, the existing load balancing strategy of KaiwuDB splits and migrates ranges based on the load pressure of each range, and has the following disadvantages:

1. The only load pressure condition for range splitting and migration is QPS, which reflects pressure only in query scenarios;

2. Balancing of the range copy distribution considers only the balance of data volume and ignores the processing capacity of the storage nodes;

3. Load balancing considers only historical pressure, so migration may transfer pressure in a concentrated way onto an idle node, causing that node's pressure to rise sharply and the load balancing process to oscillate;

4. When a storage node is decommissioned or goes down, load balance must be considered while replenishing copies. Otherwise pressure may be concentrated on one storage node and trigger a chain reaction of log backlog, excessive memory usage and node OOM, with a domino effect in which nodes are killed by the system one after another.

Disclosure of Invention

The technical task of the invention is to address the above deficiencies by providing a method and a device for load balancing in a distributed database system. By optimizing the rules for selecting the range copies to migrate out and the target nodes to migrate into, efficient load balancing of the system is achieved, thereby improving the read-write performance of the distributed database system under high load pressure.

The technical scheme adopted for solving the technical problems is as follows:

a load balancing method for a distributed database system optimizes the ordering rules for hotspot data, wherein the index information required by the ordering rules comprises: load pressure index set data; storage node processing capability index set data; historical load data; and newly added statistical information of the load balancing scheduling service;

based on this information, the candidate queues of range copies to be migrated out and of migration target storage nodes are sorted by priority;

the migrate-out priority of a hotspot data range copy is calculated according to the ordering rule for hotspot data range copies, the calculation rule being: compute a composite index value for the indexes of each attribute, and compare the attributes in order of priority:

load attribute indexes: composite index value = value × weight; the higher the composite index value, the hotter the range data;

resource attribute indexes: composite index value = value × weight; the higher the composite index value, the more strained the resources of the node holding the range data, and the more urgently the range copy should be migrated out;

service attribute indexes: a net outflow of ranges and a high migration success rate indicate a higher likelihood that the range is migrated out;

the priority of a migration target storage node is calculated according to the ordering rule for migration target nodes, the calculation rule being: compute a composite index value for the indexes of each attribute, and compare the attributes in order of priority:

resource attribute indexes: composite index value = value × weight; the higher the value, the more abundant the node's resources and the better suited it is to receive a new copy as a target node;

service attribute indexes: a net inflow of range copies and a high migration success rate indicate that the node is more feasible as a migration target node.

The method improves the rule for selecting storage nodes during load balancing scheduling, expands the evaluation indexes to cover storage node resources, read-write pressure and performance, and optimizes the load balancing scheduling of the distributed system. For storage read-write services, both the current state of a storage node and its expected future pressure are considered when choosing where to place copies of newly written data and which copy serves read requests, so that resource contention caused by concentrated service pressure is relieved as far as possible.

Preferably, the load pressure index set data includes: QPS, WPS, the QPS and WPS of traffic recorded in the statistical information as pending migration, and the pressure peak of the future load IO flow curve fitted from the statistical information.

The load pressure condition indexes comprise the real-time load indexes QPS (queries per second) and WPS (writes per second), together with processing capability indexes such as the number and performance of the CPU cores of the machine hosting the node, the hard disk capacity of the storage node and the disk speed. Using multiple indexes allows a more comprehensive judgment of whether the load of the current storage node needs to be offloaded.

Preferably, the elements of the storage node processing capability index set include: the storage node cache capacity, the stored data volume, the number of schedulable threads, the number of CPU cores and the CPU busyness of the machine hosting the node, the memory usage percentage of that machine, and the remaining capacity and busyness of the hard disk on the node's storage path.

Preferably, the historical load data comprises the read-write data access volume, the QPS/WPS peaks, the number of active ranges and the number of active copies over the last 1 h / 1 min / 1 s. The load handled by each storage node in historical periods is recorded, and a machine learning algorithm predicts the load IO flow curve for the next hour, providing more reliable input for electing the migration target node.

Further, the newly added statistical information records historical load data and is used to fit a future load IO flow curve. The fitting algorithm fits the IO flow curve with a polynomial function, trains on the data of the past 24 hours based on a Bayesian method and a minimum likelihood function, and predicts the IO flow curve for the next 12 hours. To improve the accuracy of the IO flow curve estimate, data training and curve fitting are repeated periodically at 12-hour intervals.
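The patent does not spell out the fitting procedure, so the following is only one possible reading: a polynomial fitted by regularised least squares, which corresponds to a maximum a posteriori estimate under a Gaussian prior on the coefficients. The function names, the polynomial degree, the `prior_strength` parameter and the time scaling (hours divided by 24) are all assumptions made for illustration.

```python
def fit_polynomial(xs, ys, degree=3, prior_strength=1e-3):
    """Fit y ~ sum(w[i] * x**i) by minimising squared error + prior_strength * |w|^2."""
    n = degree + 1
    # Normal equations: (Phi^T Phi + prior_strength * I) w = Phi^T y
    a = [[sum(x ** (i + j) for x in xs) + (prior_strength if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / a[i][i]
    return w


def predict(w, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(w))


def predicted_peak(w, start, end, steps=48):
    """Largest predicted value on an evenly spaced grid over [start, end]."""
    return max(predict(w, start + (end - start) * k / steps)
               for k in range(steps + 1))
```

With hourly samples from the past 24 hours placed at x = hour / 24, the next 12 hours correspond to x in [1.0, 1.5], so `predicted_peak(w, 1.0, 1.5)` would give the "future IO flow peak" index. Polynomials extrapolate poorly outside the training interval, which may be one reason the patent retrains every 12 hours.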

Preferably, the newly added statistical information of the load balancing scheduling service includes the number and data volume of migrated ranges, the migration success rate and the average migration response time over the last 1 h / 1 min.

Adding statistical information for the load balancing scheduling service prevents load pressure in intermediate states from being concentrated on one or a few nodes, makes the load balancing process more stable, and reduces jitter in the load pressure of the storage nodes.
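As an illustration of how such scheduling-service statistics might be kept over a recent time window, here is a minimal sketch. The class name `MigrationStats`, its fields and the explicit `now` timestamp argument are hypothetical; the patent only names the quantities to be tracked.

```python
from collections import deque


class MigrationStats:
    """Sliding-window statistics of range migrations (hypothetical structure)."""

    def __init__(self, window_seconds=3600.0):
        self.window = window_seconds
        # Each event: (timestamp, bytes_moved, succeeded, duration_seconds)
        self.events = deque()

    def record(self, timestamp, bytes_moved, succeeded, duration_seconds):
        self.events.append((timestamp, bytes_moved, succeeded, duration_seconds))

    def snapshot(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        count = len(self.events)
        if count == 0:
            return {"count": 0, "bytes": 0, "success_rate": None, "avg_seconds": None}
        return {
            "count": count,
            "bytes": sum(e[1] for e in self.events),
            "success_rate": sum(1 for e in self.events if e[2]) / count,
            "avg_seconds": sum(e[3] for e in self.events) / count,
        }
```

A 1-minute view would simply be a second instance created with `window_seconds=60.0` and fed the same events.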

Preferably, the ordering rule of the hotspot data range is as shown in the following table 1-1:

TABLE 1-1

Attribute	Index name	Priority	Value	Weight
Load	QPS (queries processed per second)	High	Actual QPS statistics	0.5
Load	WPS (write requests per second)	High	Actual WPS statistics	0.3
Load	QPS to be migrated	High	Predicted QPS statistics	0.05
Load	WPS to be migrated	High	Predicted WPS statistics	0.10
Load	Future IO flow peak	High	QPS + WPS value of the fitted curve	0.05
Resource	Storage node space utilization	Medium	Occupied capacity / total capacity	0.4
Resource	CPU busyness of the machine hosting the node	Medium	CPU busyness	0.2
Resource	Hard disk busyness	Medium	Disk statistics util	0.2
Resource	Memory utilization of the machine hosting the node	Medium	Used memory / total memory	0.2
Service	Range count net inflow	Low	Difference between migrated-in and migrated-out range counts	-
Service	Migration success rate	Low	Successful migrations / requested migrations	1
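The weights in Table 1-1 can be turned into a ranking of hotspot range copies roughly as follows. This is a sketch under stated assumptions: the per-attribute composite is taken to be the sum of value × weight, attributes are compared in priority order (load, then resource, then service), the unweighted net-flow index is left out, and the dictionary layout and function names are invented for illustration. Any normalisation of the raw values is also outside the sketch.

```python
# Weights copied from Table 1-1; the key names are hypothetical.
LOAD_WEIGHTS = {"qps": 0.5, "wps": 0.3, "pending_qps": 0.05,
                "pending_wps": 0.10, "future_io_peak": 0.05}
RESOURCE_WEIGHTS = {"space_util": 0.4, "cpu_busy": 0.2,
                    "disk_util": 0.2, "mem_util": 0.2}


def composite(values, weights):
    """Composite index value of one attribute (assumed: sum of value * weight)."""
    return sum(values[name] * weight for name, weight in weights.items())


def hotspot_sort_key(copy):
    """Tuple compared left to right: load first, then resource, then service."""
    return (
        composite(copy["load"], LOAD_WEIGHTS),
        composite(copy["resource"], RESOURCE_WEIGHTS),
        copy["success_rate"] * 1.0,  # service attribute, weight 1 in the table
    )


def rank_hotspots(copies):
    """Return range copies ordered from most to least urgent to migrate out."""
    return sorted(copies, key=hotspot_sort_key, reverse=True)
```

Because Python compares tuples element by element, the resource score only matters when two copies have equal load scores; a production scheduler would more likely compare within a tolerance than rely on exact floating-point ties.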

The ordering rules of the migration target node are shown in Table 1-2 below:

TABLE 1-2

Preferably, to select the range/node to be migrated, the calculation rule is applied in successive stages that retain 5, then 3, then 1 candidate targets, thereby determining the final range/node to be migrated.

Preferably, the statistical information update includes:

the statistical information of the historical load data of the storage nodes does not need to be updated; the relevant index information can be obtained from the system metrics when needed, and this data can be derived from the statistics stored by the KaiwuDB time-series engine;

the load service statistical information can be recorded by adding a statistics structure to the store instance structure, based on the data structure provided by the original metric interface; it is updated when a storage node selects a hotspot range copy to migrate out, or when it receives a range request and processes a range copy to be migrated in.

The invention also discloses a device for balancing the load of the distributed database system, which is used for realizing the method for balancing the load of the distributed database system.

Compared with the prior art, the method and the device for balancing the load of the distributed database system have the following beneficial effects:

the method expands the set of load pressure evaluation indexes, so that load balancing is no longer limited to query scenarios and can be achieved in more pressure scenarios such as data writing, node addition and node removal;

the method adds statistical information that records the load handled by each storage node in historical periods, uses a machine learning algorithm trained on the historical data to fit the load IO flow curve for the next hour, and weights the load balancing scheduling of migration target storage nodes accordingly, improving the cost-effectiveness of load balancing scheduling;

the method adds statistical information for the load balancing scheduling service, so that the service pressure of in-progress migrations is counted in the load pressure of a storage node and used to estimate the node's future load pressure. This avoids the blind spot that other load balancing schedulers would otherwise have regarding in-progress migrations, and prevents jitter in the pressure curves of one or more nodes that are unexpectedly chosen as the optimal migrate-in or migrate-out nodes;

the method effectively reduces the read-write pressure on nodes under high load; moreover, a target node that is idle and performs better completes the range copy migration faster, so range migration is quick and effective and extra copy data writes are reduced.

The method builds on the existing business flow, requires no modification of interfaces such as range splitting and migration, and is fully compatible with existing applications.

Drawings

Fig. 1 is a schematic diagram of a method for implementing load balancing of a distributed database system according to an embodiment of the present invention.

Detailed Description

The invention will be further illustrated with reference to specific examples.

An embodiment of the invention provides a load balancing method for a distributed database system that optimizes the ordering rules for hotspot data, wherein the index information required by the ordering rules comprises: load pressure index set data; storage node processing capability index set data; historical load data; and newly added statistical information of the load balancing scheduling service.

The load pressure condition indexes include the real-time load indexes QPS (queries per second) and WPS (writes per second), together with processing capability indexes such as the number and performance of the CPU cores of the machine hosting the node, the hard disk capacity of the storage node and the disk speed; using multiple indexes allows a more comprehensive judgment of whether the load of the current storage node needs to be offloaded;

statistical information is added to record the load handled by each storage node in historical periods, and a machine learning algorithm predicts the load IO flow curve for the next hour, providing more reliable input for electing the migration target node;

finally, rapid and effective load balancing improves the read-write performance of the distributed database system under high load pressure.

The specific implementation is as follows:

Whether for lease balancing or copy balancing, the hotspot data ranges must be obtained and sorted. The method optimizes the design of the ordering rules for hotspot data.

The index information required by the optimized ordering rules comprises: load pressure index set data and storage node processing capability index set data;

the load pressure index set is expanded from the existing QPS to: QPS, WPS, the QPS and WPS of traffic recorded in the statistical information as pending migration, and the pressure peak of the future load IO flow curve fitted from the statistical information;

the elements of the storage node processing capability index set include: the storage node cache capacity, the stored data volume, the number of schedulable threads, the number of CPU cores and the CPU busyness of the machine hosting the node, the memory usage percentage of that machine, the remaining capacity and busyness of the hard disk on the node's storage path, and so on;

the historical load data includes the read-write data access volume, the QPS/WPS peaks, the number of active ranges, the number of active copies and so on over the last 1 h / 1 min / 1 s;

the future load IO flow curve fitting algorithm fits the IO flow curve with a polynomial function, trains on the data of the past 24 hours based on a Bayesian method and a minimum likelihood function, and predicts the IO flow curve for the next 12 hours. To improve the accuracy of the IO flow curve estimate, data training and curve fitting are repeated periodically at 12-hour intervals.

The load service statistical information includes the number of migrated copies, the data volume, the migration success rate, the average migration response time and so on over the last 1 h / 1 min;

(1) The ordering rules for hotspot data range are as follows:

TABLE 1-1

Calculation rule: compute a composite index value for the indexes of each attribute, and compare the attributes in order of priority:

service attribute indexes: a net outflow of ranges and a high migration success rate indicate a higher likelihood that the range copy is migrated out.

(2) The ordering rules for migrating into the target node are as follows:

TABLE 1-2

Attribute	Index name	Priority	Value	Weight
Load	QPS (queries processed per second)	High	Actual QPS statistics	0.3
Load	WPS (write requests per second)	High	Actual WPS statistics	0.5
Load	QPS to be migrated	High	Predicted QPS statistics	0.05
Load	WPS to be migrated	High	Predicted WPS statistics	0.10
Load	Future IO flow peak	High	QPS + WPS value of the fitted curve	0.05
Resource	Storage node space free rate	High	Remaining capacity / total capacity	0.4
Resource	CPU idleness of the machine hosting the node	High	100 - CPU busyness	0.2
Resource	Hard disk idleness	High	1 - disk statistics util	0.2
Resource	Free memory rate of the machine hosting the node	High	Remaining memory / total memory	0.2
Service	Range count net outflow	Low	Difference between migrated-in and migrated-out range counts	-
Service	Migration success rate	Low	Successful migrations / requested migrations	1

Resource attribute indexes: composite index value = value × weight; the higher the value, the more abundant the node's resources and the better suited it is to receive a new copy as a target node;

load attribute indexes: composite index value = value × weight; the higher the value, the hotter the range data;

service attribute indexes: a net inflow of ranges and a high migration success rate indicate that the node is more feasible as a migration target node.

(3) To select the range/node to be migrated, the calculation rules of steps (1) and (2) are applied in stages: 5, 3 and 1 candidate targets are retained in turn according to the indexes of the resource attribute, the load attribute and the service attribute respectively, which determines the final range/node to be migrated.
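The staged 5, 3, 1 selection can be sketched as a simple funnel. The function name `funnel_select` and the candidate layout are hypothetical, and each score is assumed to be already oriented so that a larger value is preferable at that stage.

```python
def funnel_select(candidates, stage_keys, keep=(5, 3, 1)):
    """Narrow the candidates stage by stage and return the final one (or None).

    stage_keys: one scoring function per stage, in priority order.
    keep: how many candidates survive each stage.
    """
    survivors = list(candidates)
    for key, limit in zip(stage_keys, keep):
        survivors = sorted(survivors, key=key, reverse=True)[:limit]
    return survivors[0] if survivors else None


# Example stage order taken from the text: resource, then load, then service.
STAGES = (
    lambda c: c["resource"],
    lambda c: c["load"],
    lambda c: c["service"],
)
```

Compared with a single weighted sum over all attributes, the funnel lets an earlier attribute act as a hard filter: a candidate that fails the first cut cannot be rescued by a strong later score.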

Updating statistical information: the statistical information of the historical load data of the storage nodes does not need to be updated; the relevant index information can be obtained from the system metrics when needed, and this data can be derived from the statistics stored by the KaiwuDB time-series engine. The load service statistical information can be recorded by adding a statistics structure to the store instance structure, based on the data structure provided by the original metric interface; it is updated when a storage node selects a hotspot range copy to migrate out, or when it receives a range request and processes a range copy to be migrated in.
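One way to picture the added statistics structure is a small per-store record that tracks load already reserved by in-progress migrations. The class `StoreLoadStats`, its fields and its methods are hypothetical and do not reflect the actual KaiwuDB store or metric interfaces.

```python
from dataclasses import dataclass


@dataclass
class StoreLoadStats:
    """Hypothetical statistics block attached to a store instance."""
    qps: float = 0.0          # load currently served
    wps: float = 0.0
    pending_qps: float = 0.0  # load of range copies still migrating in
    pending_wps: float = 0.0

    def reserve(self, range_qps, range_wps):
        """Called when the store accepts a request to receive a range copy."""
        self.pending_qps += range_qps
        self.pending_wps += range_wps

    def complete(self, range_qps, range_wps, succeeded):
        """Called when the migration finishes, successfully or not."""
        self.pending_qps = max(0.0, self.pending_qps - range_qps)
        self.pending_wps = max(0.0, self.pending_wps - range_wps)
        if succeeded:
            self.qps += range_qps
            self.wps += range_wps

    def effective_load(self):
        """Current load plus load that is already on its way to this store."""
        return self.qps + self.wps + self.pending_qps + self.pending_wps
```

If every scheduler ranks targets by `effective_load()` rather than by current load alone, an idle store that has just been chosen already looks busier to the next scheduler, which is the oscillation-damping effect the description aims for.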

By evaluating and analysing more load pressure indexes and combining them with the processing capacity of the storage nodes, the method selects suitable hotspot data for migration more accurately and ensures that the migration target node can effectively absorb that load pressure. In addition, a machine learning algorithm is introduced to estimate and feed back the effectiveness of load balancing and the node load, so that the load pressure of each node adapts flexibly and load is balanced across the nodes of the whole cluster, improving the performance stability and reliability of the distributed database under high load.

The embodiment of the invention also provides a device for balancing the load of the distributed database system, which is used for realizing the method for balancing the load of the distributed database system described in the embodiment.

The present invention can be easily implemented by those skilled in the art through the above specific embodiments. It should be understood that the invention is not limited to the particular embodiments described above. Based on the disclosed embodiments, a person skilled in the art may combine different technical features at will, so as to implement different technical solutions.

Other than the technical features described in the specification, all are known to those skilled in the art.

Claims

1. A load balancing method for a distributed database system, characterized by optimizing the ordering rule for hotspot data range copies, wherein the index information required by the ordering rule comprises: load pressure index set data; storage node processing capability index set data; historical load data; and newly added statistical information of the load balancing scheduling service;

service attribute indexes: a net outflow of ranges and a high migration success rate indicate that migrating the range copy out is more feasible;

2. The method of claim 1, wherein the load pressure index set data comprises: QPS, WPS, the QPS and WPS of traffic recorded in the statistical information as pending migration, and the pressure peak of the future load IO flow curve fitted from the statistical information.

3. The load balancing method for a distributed database system according to claim 1 or 2, wherein the elements of the storage node processing capability index set comprise: the storage node cache capacity, the stored data volume, the number of schedulable threads, the number of CPU cores and the CPU busyness of the machine hosting the node, the memory usage percentage of that machine, and the remaining capacity and busyness of the hard disk on the node's storage path.

4. The load balancing method for a distributed database system according to claim 3, wherein the historical load data comprises the read-write data access volume, the QPS/WPS peaks, the number of active ranges and the number of active copies over the last 1 h / 1 min / 1 s.

5. The load balancing method for a distributed database system according to claim 4, wherein the statistical information records historical load data and is used to fit a future load IO flow curve; the future load IO flow curve fitting algorithm fits the IO flow curve with a polynomial function, trains on the data of the past 24 hours based on a Bayesian method and a minimum likelihood function, and predicts the IO flow curve for the next 12 hours;

data training and curve fitting are repeated periodically at 12-hour intervals.

6. The load balancing method for a distributed database system according to claim 4, wherein the newly added statistical information of the load balancing scheduling service includes the number and data volume of migrated range copies, the migration success rate and the average migration response time over the last 1 h / 1 min.

7. The method for load balancing of a distributed database system according to claim 1, wherein the ordering rule of the hotspot data range is as shown in table 1-1 below:

TABLE 1-1

TABLE 1-2.

8. The method according to claim 1 or 7, wherein, to select the range/node to be migrated, 5, 3 and 1 candidate targets are retained in successive stages based on the calculation rule, thereby determining the final range/node to be migrated.

9. The method of distributed database system load balancing of claim 8, wherein the statistical information update comprises:

the load service statistical information is recorded by adding a statistics structure to the store instance structure, based on the data structure provided by the original metric interface, and is updated when a storage node selects a hotspot range copy to migrate out, or when it receives a range request and processes a range copy to be migrated in.

10. An apparatus for load balancing a distributed database system, wherein the apparatus is configured to implement the method for load balancing a distributed database system according to any one of claims 1 to 9.