CN105512264A — Performance prediction method for concurrent workloads in a distributed database

Publication number: CN105512264A (granted as CN105512264B)
Application number: CN201510881758.5A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: query, regression model, linear regression, latency, database
Inventors: 李晖, 陈梅
Assignees (original and current): Guizhou Youlian Borui Technology Co Ltd; Guizhou University
Filing date: 2015-12-04
Legal status: Granted; Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval of structured data, e.g. relational data
    • G06F16/21 — Design, administration or maintenance of databases
    • G06F16/25 — Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a performance prediction method for concurrent workloads in a distributed database. A linear regression model is established to judge the interaction between queries in the distributed database and to predict the query latency L under different degrees of concurrency, and the database uses the predicted latency L to assign tasks selectively. The main steps are: A, selecting the metrics for the query latency L; B, modelling the interaction between queries under concurrent execution and establishing the linear regression model; C, verifying the correctness and validity of the linear regression model by experiment. Repeated experiments show that the overall average relative error is 14% for query latency, 30% for network delay and 37% for the number of I/O block reads. The experimental results show that the linear regression model can predict the performance of concurrent workloads on a distributed database well, which facilitates subsequent task assignment by the database and shortens the average waiting time of queries.

Description

Performance prediction method for concurrent workloads in a distributed database
Technical field
The present invention relates to a performance prediction method for workloads in databases, and in particular to a performance prediction method for concurrent workloads in a distributed database.
Background art
Performance prediction for database workloads has already been studied. However, existing research is confined to single-node databases, that is, databases running on a single server, whose performance depends mainly on the disk and CPU utilization of that server. With the growth of the data volumes produced in research and industry, distributed database systems are used to store and manage petabyte-scale data and to provide high concurrency and scalability. Data in a distributed database are processed in a scatter/gather pattern. For example, a query may be split by one node into multiple subqueries, these subqueries are executed concurrently by many other nodes, and the partial results of each node are then returned to the originating node and combined to obtain the final query result. The data in a distributed database are therefore stored in a distributed manner across the multiple nodes of a cluster, and the cluster can be extended easily by adding new nodes. This is one reason why distributed databases are used to store and process big data.
Usually, distributed databases are used to support concurrent analytical workloads, so as to reduce the required query execution time. However, while concurrent execution brings many advantages, it also raises challenges of resource contention, such as judging the interaction between multiple queries. The interactions between queries may differ: when two queries share a table scan, they may benefit each other; on the contrary, when two queries both need high network transmission bandwidth, they will increase each other's execution time because of network delay.
For a single-node database, task assignment is confined to one server, but for a multi-node distributed database there are multiple choices, and how to assign tasks so as to shorten the average query waiting time is something the database must consider. For example, suppose a distributed database has 3 servers, all executing query tasks; the disk and CPU utilization of server 1 is low, while those of servers 2 and 3 are high. If a new query arrives, the database must consider which server to assign it to. If server 1 has other queries waiting, or its disk and CPU utilization will rise at the next moment, while the disk and CPU utilization of servers 2 and 3 will fall, then the task should be assigned to server 2 or server 3 rather than to server 1 (a sketch of such latency-driven assignment is given below). The workload in a distributed database therefore needs to be predicted, so that subsequent task assignment is facilitated. Because of the particularities of distributed databases, previous database performance prediction methods are not applicable, and existing performance prediction methods do not address concurrent workloads.
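To illustrate the task-assignment decision described above, the following is a minimal Python sketch, not part of the patent text: a hypothetical predict_latency function stands in for the regression models of the invention, and all node statistics are invented example values.

```python
# Minimal sketch of latency-driven task assignment (illustrative only).
# `predict_latency` is a hypothetical stand-in for the regression models
# described later in this patent; all node data below are invented.

def predict_latency(query, running_queries):
    """Placeholder: a real implementation would apply formulas (1)-(3)."""
    # Assume each concurrently running query adds a fixed penalty.
    return query["base_cost"] + 0.5 * len(running_queries)

def assign(query, nodes):
    """Route the query to the node with the lowest predicted latency."""
    return min(nodes, key=lambda node: predict_latency(query, node["running"]))

nodes = [
    {"name": "server-1", "running": ["Q18", "Q7"]},  # heavily loaded
    {"name": "server-2", "running": ["Q3"]},
    {"name": "server-3", "running": []},             # idle
]
incoming = {"name": "Q5", "base_cost": 8.9}
print(assign(incoming, nodes)["name"])  # -> server-3
```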
Summary of the invention
The object of the invention is to provide a performance prediction method for concurrent workloads in a distributed database. The invention can predict the performance of concurrent workloads in a distributed database well, which facilitates subsequent task assignment and thus shortens the average waiting time of queries.
Technical scheme of the invention: a performance prediction method for concurrent workloads in a distributed database, in which multiple linear regression models are established to judge the interaction between queries in the distributed database and to predict the query latency L of the distributed database system under different degrees of concurrency, and the database assigns tasks selectively according to the predicted latency L. The main steps include:
A. selecting the metrics for the query latency L;
B. modelling the interaction between query combinations under concurrent execution and establishing the multiple linear regression models;
C. verifying the correctness and validity of the multiple linear regression models by experiment.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, the query latency L in step A comprises network delay and local processing.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, the network delay uses the transmission volume N as its metric, and the local processing uses the number of I/O block reads B as its metric.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, step B consists of the following parts:
B1: predicting query interaction;
B2: predicting query latency;
B3: training the linear regression models on sampled data.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, step B1 comprises: predicting the number of I/O block reads B and the transmission volume N of a primary query q when it executes concurrently with secondary queries p1...pn; the number of I/O block reads B is predicted by the following linear regression model:

B = \beta_1 B_q + \beta_2 \sum_{i=1}^{n} B_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta B_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta B_{p_i/p_j}    (1)

and the transmission volume N is predicted by the following linear regression model:

N = \beta_1 N_q + \beta_2 \sum_{i=1}^{n} N_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta N_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta N_{p_i/p_j}    (2)

Step B2 predicts the query latency L by the following linear regression model:

L = C_q + \alpha_1 B_q + \alpha_2 N_q    (3)

Step B3 is: given two or more queries, use LHS (Latin hypercube sampling) to generate different query combinations, run the different query combinations in pairs, record the number of I/O block reads and the transmission volume of each query combination to form samples, and estimate the coefficients \beta_1, \beta_2, \beta_3 and \beta_4 of the linear regression models from the samples by the least squares method.

In the formulas, B_q is the number of I/O block reads of the primary query q;
\sum_{i=1}^{n} B_{p_i} is the sum of the numbers of I/O block reads of all secondary queries;
\sum_{i=1}^{n} \Delta B_{q/p_i} is the sum of the direct influence values of the secondary queries on the primary query, measured in I/O block reads;
\sum_{i=1}^{n} \sum_{j \neq i} \Delta B_{p_i/p_j} is the sum of the indirect influence values between the secondary queries, measured in I/O block reads;
N_q is the transmission volume of the primary query q;
\sum_{i=1}^{n} N_{p_i} is the sum of the transmission volumes of all secondary queries;
\sum_{i=1}^{n} \Delta N_{q/p_i} is the sum of the direct influence values of the secondary queries on the primary query, measured in transmission volume;
\sum_{i=1}^{n} \sum_{j \neq i} \Delta N_{p_i/p_j} is the sum of the indirect influence values between the secondary queries, measured in transmission volume;
C_q is the CPU overhead time of query q.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, step C is: run queries Q1, Q2, Q3, ..., Qn to obtain measured values; put the measured values into the multiple linear regression models to obtain predicted values; divide the samples into a test data set and a training data set; and observe the goodness of fit between the predicted values and the measured values.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, the transmission volume uses the number of network packets transmitted between nodes as the raw data measured during query execution.
In the aforesaid performance prediction method for concurrent workloads in a distributed database, the number of network packets and the number of I/O block reads are obtained using SystemTap.
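As an illustration of how formulas (1)-(3) fit together at prediction time, here is a minimal Python sketch; it is not part of the patent, and the function names, coefficient values and query statistics are invented placeholders.

```python
# Minimal sketch of the prediction formulas (1)-(3) for a primary query q
# executed concurrently with secondary queries p1..pn.  Every number below
# is an illustrative placeholder, not a value from the patent.

def predict_metric(beta, x_q, x_p, dx_q_p, dx_p_p):
    """Formulas (1)/(2): x_q is the isolated value of the primary query,
    x_p the isolated values of the secondary queries, dx_q_p the direct
    influence of each secondary query on q, and dx_p_p the indirect
    influences between secondary queries."""
    b1, b2, b3, b4 = beta
    return b1 * x_q + b2 * sum(x_p) + b3 * sum(dx_q_p) + b4 * sum(dx_p_p)

def predict_latency(c_q, alpha, b_q, n_q):
    """Formula (3): latency = CPU overhead + a1 * I/O reads + a2 * transmission."""
    a1, a2 = alpha
    return c_q + a1 * b_q + a2 * n_q

beta_B = (0.9, 0.1, 0.8, 0.05)     # assumed coefficients for I/O block reads
beta_N = (0.95, 0.05, 0.7, 0.02)   # assumed coefficients for transmission volume
B = predict_metric(beta_B, 1200, [900, 400], [150, 60], [20, 10])
N = predict_metric(beta_N, 5000, [3000, 1200], [400, 150], [50, 30])
L = predict_latency(c_q=0.8, alpha=(0.002, 0.0005), b_q=B, n_q=N)
print(B, N, round(L, 2))
```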
Beneficial effects of the invention: compared with the prior art, data are transmitted between nodes in a distributed database, so executing a query incurs network overhead. When predicting concurrent query execution performance, the invention therefore takes network delay into account, and proposes linear regression models to predict the interaction of concurrent analytical workloads in a distributed database system. Because network delay and local processing are the two most important factors of query execution time, the invention analyses query execution behaviour with linear regression models from three aspects: network delay, local processing and different degrees of concurrency. In addition, a sampling technique is used to obtain query combinations under different degrees of concurrency. The model of the invention was evaluated on a cluster built with PostgreSQL, using the typical analytical workload data set of TPC-H to complete the performance prediction. Repeated experiments show that the overall average relative errors of query latency, network delay and number of I/O block reads are 14%, 30% and 37% respectively. The experimental results show that the proposed linear regression models can predict the performance of concurrent workloads on a distributed database well, which facilitates subsequent task assignment by the database and shortens the average waiting time of queries.
Brief description of the drawings
Figure 1 is a schematic diagram of the fit between the predicted and measured query latency of the invention;
Figure 2 is a schematic diagram of the fit between the predicted and measured numbers of I/O block reads of the invention;
Figure 3 is a schematic diagram of the fit between the predicted and measured network delay of the invention;
Figure 4 is a schematic diagram of the average relative error of the invention when the degree of concurrency is 3;
Figure 5 is a schematic diagram of the average relative error of the invention when the degree of concurrency is 4.
Detailed description
The invention is further illustrated below in conjunction with the drawings and the embodiments, which are not a basis for limiting the invention.
Embodiment of the invention:
1. Performance prediction
The goal of the invention is to study the prediction of concurrent query latency in a distributed database system. Performance in a distributed database system is mainly affected by contention for shared resources on top of the basic resource situation; the shared resources include RAM, CPU, disk I/O, network bandwidth and so on. The invention therefore first selects metrics that can be used for predicting query latency under concurrent workloads, particularly for a distributed database system.
The invention focuses on predicting the concurrent query latency of distributed analytical workloads. Analytical queries in a distributed database system mainly involve two aspects: network delay and local processing.
Local processing retrieves and processes the data needed by a query from a node. The local processing time is the average time between submitting a request to retrieve data on a node and obtaining the needed blocks. For a logical I/O request, local processing may require many disk seeks, sequences of sequential reads with a small amount of writing, or accesses to the buffer cache and cache pool. Usually most of the local processing time is spent on I/O operations, and reads far outnumber writes. The invention therefore uses the average number of I/O block reads as the metric for predicting query latency along the local-processing dimension.
Because the data in a distributed database are processed in a scatter/gather pattern, network transmission is unavoidable when executing a query. The data are partitioned and stored on multiple distributed nodes in the cluster; the transmitted data may be partial query results obtained from local nodes, or the final result returned to the node that submitted the request. The transmission volume is therefore a factor affecting query latency in a distributed database system, and the invention uses it as the metric along the network-delay dimension.
The invention mainly studies queries of medium analytical complexity. For this purpose, 10 moderately complex query statements were chosen from TPC-H to form the query mix of the invention, focusing on the concurrent execution performance of a distributed database system. First, the 10 query statements were run under different degrees of concurrency: a 10 GB data set was generated with TPC-H, and the queries were executed on a PostgreSQL cluster of 4 nodes to obtain measured query latencies, where MPL (multiprogramming level) denotes the number of concurrently executing queries, i.e. the degree of concurrency. As Table 1 shows, not every query's latency increases linearly with the degree of concurrency.
Table 1. Average query latency of the 10 queries under different degrees of concurrency

Query   MPL=1   MPL=2   MPL=3   MPL=4
Q3       0.07    0.13    0.12    0.10
Q4       5.23    5.48    5.32    5.61
Q5       8.92    9.62    9.70   10.46
Q6       2.63    3.14    2.76    2.80
Q7      27.80   29.48   31.03   32.06
Q8      26.95   28.24   31.85   28.12
Q10      3.13    3.68    3.61    3.71
Q14      3.50    4.10    3.84    4.11
Q18     83.14   93.47   87.93   86.03
Q19      4.83    5.90    5.92    6.19
2. Interaction modelling
As discussed in the previous section, the invention uses the number of I/O block reads and the transmission volume as the metrics on the two dimensions of local processing and network delay, and predicts the performance of different query combinations under different degrees of concurrency. In this section, the invention therefore proposes two multiple linear regression models to study the interaction of query combinations under concurrent execution, then proposes a further linear regression model that uses the number of I/O block reads and the network delay to predict query latency, and finally trains these models on the sampled data set to obtain the prediction models of the invention.
To predict how queries influence each other under concurrent execution, the invention first judges the influence on the number of I/O block reads and the transmission volume when two queries execute concurrently, and then increases the concurrency gradually. In particular, the invention builds multiple linear regression models to analyse the mutual influence at a concurrency degree of two. To make the models easier to understand, queries are divided into a primary query and secondary queries: the primary query is the query whose behaviour under concurrent execution is being studied, and the secondary queries are the queries executed concurrently with it. Before introducing the proposed models, the relevant variables are introduced; their values can be obtained from the training data set.
Isolated value: this variable is a baseline, namely the value observed when the primary query executes without concurrency. The invention uses it as the baseline for judging the corresponding value under concurrent execution. For example, for query i, B_i denotes its number of I/O block reads and N_i denotes its transmission volume.
Concurrent value: similarly, the value of this variable is the sum of the isolated values of the concurrent queries, such as B_i or N_i in the example above.
Direct influence value: the invention uses this value to describe the influence of a secondary query on the primary query; it is the sum of the changes of the metric. For example, when i is the primary query and j is a secondary query, for the transmission volume N_{i/j} denotes the value of i under the direct influence of j, and the change is \Delta N_{i/j} = N_{i/j} - N_i.
Indirect influence value: the invention uses this variable to describe how the secondary queries influence each other; its value is the sum of the direct influence values among the secondary queries.
Therefore, the invention uses the following formulas to predict the average number of I/O block reads B and the transmission volume N of a query q when it executes concurrently with queries p1...pn:

B = \beta_1 B_q + \beta_2 \sum_{i=1}^{n} B_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta B_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta B_{p_i/p_j}    (1)

N = \beta_1 N_q + \beta_2 \sum_{i=1}^{n} N_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta N_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta N_{p_i/p_j}    (2)

The coefficients \beta_1, \beta_2, \beta_3 and \beta_4 of each query are estimated from the training data set by the least squares method.

The invention also considers the number of I/O block reads and the transmission volume together, and establishes a linear regression model to predict the latency of each query. In general, for a distributed database system, query latency consists mainly of network delay and local processing, and the local processing time mainly comprises a query-specific CPU overhead time and the average logical I/O waiting time; the latency of query q can therefore be predicted by the following formula:

L = C_q + \alpha_1 B_q + \alpha_2 N_q    (3)

In formulas (1), (2) and (3), C_q denotes the query-specific CPU overhead time of query q, B_q denotes the average number of I/O block reads, and N_q denotes the average network transmission volume between the nodes of the distributed database.
The invention carries out repeated experiments and applies the least squares method to the samples to obtain the coefficients β1, β2 and the other model coefficients.
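To make the estimation step concrete, the following is a minimal sketch, not taken from the patent, of estimating the coefficients of formula (1) with ordinary least squares via numpy; the feature layout follows formula (1), but every sample value is an invented placeholder.

```python
# Minimal sketch: estimating beta_1..beta_4 of formula (1) by ordinary least
# squares.  Each row of X holds (B_q, sum B_p, sum dB_q/p, sum dB_p/p) for one
# sampled query combination; y holds the measured concurrent I/O block reads.
import numpy as np

X = np.array([
    [1200.0,  900.0, 150.0, 20.0],
    [1200.0, 1300.0, 210.0, 35.0],
    [ 800.0,  400.0,  90.0, 10.0],
    [ 800.0, 1600.0, 260.0, 55.0],
    [ 450.0,  900.0, 120.0, 25.0],
    [ 450.0, 2000.0, 300.0, 70.0],
])
y = np.array([1350.0, 1490.0, 910.0, 1120.0, 610.0, 820.0])

beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta)

# Prediction for a new combination, following formula (1):
x_new = np.array([1000.0, 1100.0, 180.0, 30.0])
print("predicted B:", x_new @ beta)
```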
To make the proposed models easier to understand, a simple example is introduced next. Suppose the invention wants to predict the latency of query a in a distributed database system when it executes concurrently with queries b and c. The following values are needed first:
the isolated numbers of I/O block reads of queries a, b and c: B_a, B_b and B_c, and their isolated transmission volumes: N_a, N_b and N_c;
the direct influence values on query a when it executes concurrently with queries b and c: \Delta B_{a/b}, \Delta N_{a/b}, \Delta B_{a/c}, \Delta N_{a/c};
the indirect influence values: \Delta B_{c/b}, \Delta N_{c/b}, \Delta B_{b/c}, \Delta N_{b/c}.
The corresponding concurrent metrics are then obtained from the following two formulas (the left-hand sides are the predicted concurrent values; the right-hand sides use the isolated values defined above):

B_a = \beta_1 B_a + \beta_2 (B_b + B_c) + \beta_3 (\Delta B_{a/b} + \Delta B_{a/c}) + \beta_4 (\Delta B_{c/b} + \Delta B_{b/c})

N_a = \beta_1 N_a + \beta_2 (N_b + N_c) + \beta_3 (\Delta N_{a/b} + \Delta N_{a/c}) + \beta_4 (\Delta N_{c/b} + \Delta N_{b/c})

Finally, formula (3) is used to predict the latency of query a:

L_a = C_a + \alpha_1 B_a + \alpha_2 N_a
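A numeric continuation of this a/b/c example, with invented coefficients and statistics rather than values measured in the patent experiments, could look as follows.

```python
# Numeric continuation of the a/b/c example; all values are illustrative.
beta = (0.9, 0.08, 0.7, 0.05)   # beta_1..beta_4, shared here by B and N
alpha = (0.002, 0.0005)         # alpha_1, alpha_2 of formula (3)

B_a, B_b, B_c = 1000.0, 600.0, 300.0       # isolated I/O block reads
N_a, N_b, N_c = 4000.0, 2500.0, 900.0      # isolated transmission volumes
dB_ab, dB_ac, dB_cb, dB_bc = 120.0, 40.0, 15.0, 25.0
dN_ab, dN_ac, dN_cb, dN_bc = 350.0, 110.0, 40.0, 60.0

b1, b2, b3, b4 = beta
B_conc = b1 * B_a + b2 * (B_b + B_c) + b3 * (dB_ab + dB_ac) + b4 * (dB_cb + dB_bc)
N_conc = b1 * N_a + b2 * (N_b + N_c) + b3 * (dN_ab + dN_ac) + b4 * (dN_cb + dN_bc)

C_a = 0.8                                   # CPU overhead time of query a
L_a = C_a + alpha[0] * B_conc + alpha[1] * N_conc
print(round(B_conc, 1), round(N_conc, 1), round(L_a, 2))
```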
To obtain the query latency from formula (3), the prediction models of the invention must be trained. First, the characteristics of each of the 10 queries when run in isolation are obtained, namely the query execution latency, the number of I/O block reads and the network delay; these 10 queries are the baseline query statements for the various query combinations under different MPLs. Running the queries in pairs, for example the 55 pairwise combinations, yields the concrete characteristics of how they affect each other.
To run these query statements on multiple machines and obtain interaction behaviour at higher degrees of concurrency, the invention uses LHS to generate different query combinations representing the desired workload (see the sketch after Table 2). LHS is a stratified sampling function that produces sample data conveniently. Table 2 gives an LHS example at MPL 2; in this example LHS generates 5 paired queries. In the experiments, the number of I/O block reads and the transmission volume of each query combination are recorded to form samples, and these samples are used to estimate the coefficients of the models. For each query, many query combination instances are generated to form samples. For example, "query 3" denotes the combination of Q3, Q4 and Q5, where Q3 is the primary query and Q4 and Q5 are the secondary queries.
Table 2. A two-dimensional LHS example at MPL 2: a 5 x 5 grid in which each of the 5 rows contains a single X marking the query that the row's query is paired with, giving 5 paired queries.
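As an illustration of how LHS could generate such pairings, the following is a minimal sketch using scipy's Latin hypercube sampler; the discretisation of the continuous samples into query indices is an assumption made here for illustration, not a detail given in the patent.

```python
# Minimal sketch: drawing MPL-2 query pairs from the 10 TPC-H queries with
# Latin hypercube sampling.  Mapping the continuous [0, 1) samples to query
# indices (and allowing a query to pair with itself) is an assumption.
from scipy.stats import qmc

queries = ["Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q10", "Q14", "Q18", "Q19"]

sampler = qmc.LatinHypercube(d=2, seed=0)  # 2 dimensions -> pairs (MPL 2)
points = sampler.random(n=5)               # 5 sampled pairs in [0, 1)^2

pairs = [(queries[int(p * len(queries))], queries[int(s * len(queries))])
         for p, s in points]
for primary, secondary in pairs:
    print(f"primary {primary} runs concurrently with secondary {secondary}")
```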
For the established linear regression models, the correctness and validity of the three proposed models must be assessed. Experiments are used to examine the measured and predicted values of query latency, transmission volume and number of I/O block reads, and to examine the average relative error of each query when the degree of concurrency is 3 and when it is 4.
3. Experimental verification
To assess the feasibility of the method and the accuracy of the models, queries are executed on the 10 GB data set generated by the QGEN tool provided with TPC-H. Because this research focuses on analytical workloads, the invention chooses Q3, Q4, Q5, Q6, Q7, Q8, Q10, Q14, Q18 and Q19 from the 22 TPC-H queries to form its query mix. These queries are chosen because their execution times are relatively long, which leaves more time for collecting the numbers of I/O block reads and the transmission volumes. The distributed database system of this experiment is a database cluster composed of four PostgreSQL nodes, realised with Postgres-XL, an open-source PostgreSQL database cluster with high scalability and flexibility for handling different database workloads. The cluster is deployed on physical machines with a 4-core 2 GHz Intel(R) Xeon(R) CPU E5-2620 processor and 8 GB of memory; each node runs CentOS 6.4 with Linux kernel 2.6.32.
First, a training data set is obtained by sampling, and Matlab is used to fit the multiple linear regression models; a test data set is then used to predict the number of I/O block reads and the network delay of queries under concurrent execution.
The training data set and the test data set are obtained as follows (a sketch of the split appears below): queries Q1, Q2, Q3, ..., Qn are run to obtain measured values; the measured values are put into the multiple linear regression models to obtain predicted values; part of the samples is taken as the test data set and the other part as the training data set; and the fit between the predicted and measured values is observed.
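A minimal sketch of such a split is shown below; the 70/30 ratio is an assumption made for illustration, since the patent states only how many samples were collected, not the split ratio.

```python
# Minimal sketch: splitting the collected samples into training and test sets
# before fitting and evaluating the regression models.  The 70/30 ratio is an
# illustrative assumption; 140 is the sample count mentioned in the text.
import numpy as np

rng = np.random.default_rng(0)
sample_ids = np.arange(140)
rng.shuffle(sample_ids)

split = int(0.7 * len(sample_ids))
train_ids, test_ids = sample_ids[:split], sample_ids[split:]
print(len(train_ids), "training samples,", len(test_ids), "test samples")
```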
The fit between the predicted and measured values is shown in Fig. 1. In the experiments the coefficient of determination R^2 is used to measure how well the regression models fit. R^2 ranges from 0 to 1; the closer its value is to 1, the closer the predicted values are to the measured values and the better the regression model. Figs. 1, 2 and 3 show, under multiple degrees of concurrency, the fit between the predicted and measured values of query latency, number of I/O block reads and network delay obtained with the proposed prediction models; the R^2 values for query latency, network delay and number of I/O block reads are 0.94, 0.58 and 0.84 respectively, which illustrates the ability of this work to predict query latency from the network delay and the number of I/O block reads. For each query, the models in formulas (1) and (2) are first used to predict the network delay and the number of I/O block reads, and formula (3) is then used to predict the query latency. In the experiments, the number of network packets transmitted between nodes is taken as the raw data for the transmission volume during query execution.
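For reference, the coefficient of determination used above can be computed as in the sketch below; the measured column reuses the MPL=1 latencies from Table 1, while the predicted values are invented for illustration.

```python
# Minimal sketch of the coefficient of determination R^2 that measures the
# fit between measured and predicted values.  Measured values are the MPL=1
# latencies of Table 1; the predicted values are invented placeholders.
import numpy as np

def r_squared(measured, predicted):
    measured, predicted = np.asarray(measured), np.asarray(predicted)
    ss_res = np.sum((measured - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((measured - measured.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

measured = [0.07, 5.23, 8.92, 2.63, 27.80, 26.95, 3.13, 3.50, 83.14, 4.83]
predicted = [0.09, 5.60, 9.40, 2.70, 29.10, 28.00, 3.40, 3.80, 88.00, 5.20]
print(round(r_squared(measured, predicted), 3))
```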
It should be noted how the raw data are obtained: the raw data become samples after processing, and the two key factors affecting the quality of a linear regression model are the quality and the quantity of the samples, so the method of obtaining the raw data matters greatly.
To collect the number of I/O block reads and the transmission volume of each query execution, SystemTap is used to run scripts written for this research and collect the data dynamically. SystemTap is a dynamic method of monitoring and tracing the operation of a running Linux kernel; it provides a simple command-line interface and scripting language. Compared with capturing PostgreSQL's own statistics or obtaining the network delay with other tools, SystemTap obtains the numbers of network packets and I/O block reads more accurately. In addition, to allow more time for collecting data and to make the collected data more accurate, the shared_buffers setting of PostgreSQL is adjusted appropriately.
As mentioned above, the invention applies ordinary least squares (OLS) to obtain the coefficients of the models. According to OLS, at least 6 samples are needed, empirically, to predict query latency and meet the basic requirement; 120 sample values were used in the experiments. Predicting the network delay and the number of I/O block reads requires at least 13 samples; 140 samples were used in this research. The experiments found that increasing the number of samples does not change the overall trend, it only makes the points denser. In Fig. 1, in order to show more points, particularly large values are not displayed; for example, the execution time of query 18 (Q18) does not appear in Fig. 1. In addition, some predictions of the network delay in Fig. 3 are noticeably high or low; this is because network fluctuations, or packet loss while collecting data, make the error between predicted and observed values larger.
Furthermore, to stay closer to practical application scenarios, the cache was not cleared before executing queries; this is also one reason why the prediction accuracy drops slightly each time the degree of concurrency is increased. The phenomenon can be seen by comparing the average relative errors of the number of I/O block reads in Figs. 4 and 5.
Comparing Figs. 4 and 5 also shows that the average relative error of query 3 (Q3) at a concurrency degree of 3 is higher than at a concurrency degree of 4. Analysis shows that this is because the execution time of Q3 is too short to obtain accurate source data, and the low sample quality makes the prediction error higher.
Figs. 4 and 5 show the average relative errors of query latency, network delay and number of I/O block reads when the degree of concurrency is 3 and 4 respectively, where the average relative error is calculated as |(measured value - predicted value) / measured value|. The overall average relative errors of query latency, network delay and number of I/O block reads are 14%, 30% and 37% respectively. These experimental results show that the proposed models can predict the performance of concurrent workloads on a distributed database system well.
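The average relative error defined above can be computed directly, as in this small sketch with invented example values.

```python
# Minimal sketch of the average relative error |(measured - predicted) / measured|
# used in Figs. 4 and 5; both lists are invented example values.
measured = [5.23, 8.92, 27.80, 3.13]
predicted = [5.60, 9.40, 29.10, 3.40]

errors = [abs((m - p) / m) for m, p in zip(measured, predicted)]
average_relative_error = sum(errors) / len(errors)
print(f"{average_relative_error:.1%}")
```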

Claims (8)

1. A performance prediction method for concurrent workloads in a distributed database, characterised in that: multiple linear regression models are established to judge the interaction between queries in the distributed database and to predict the query latency L of the distributed database under different degrees of concurrency, and the database assigns tasks selectively according to the predicted latency L; the main steps include:
A. selecting the metrics for the query latency L;
B. modelling the interaction between query combinations under concurrent execution and establishing the multiple linear regression models;
C. verifying the correctness and validity of the multiple linear regression models by experiment.
2. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterised in that the query latency L in step A comprises network delay and local processing.
3. The performance prediction method for concurrent workloads in a distributed database according to claim 2, characterised in that the network delay uses the transmission volume N as its metric, and the local processing uses the number of I/O block reads B as its metric.
4. The performance prediction method for concurrent workloads in a distributed database according to claim 3, characterised in that step B consists of the following parts:
B1: predicting query interaction;
B2: predicting query latency;
B3: training the linear regression models on sampled data.
5. The performance prediction method for concurrent workloads in a distributed database according to claim 4, characterised in that:
step B1 comprises: predicting the number of I/O block reads B and the transmission volume N of a primary query q when it executes concurrently with secondary queries p1...pn; the number of I/O block reads B is predicted by the following linear regression model:

B = \beta_1 B_q + \beta_2 \sum_{i=1}^{n} B_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta B_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta B_{p_i/p_j}    (1)

and the transmission volume N is predicted by the following linear regression model:

N = \beta_1 N_q + \beta_2 \sum_{i=1}^{n} N_{p_i} + \beta_3 \sum_{i=1}^{n} \Delta N_{q/p_i} + \beta_4 \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \Delta N_{p_i/p_j}    (2)

step B2 predicts the query latency L by the following linear regression model:

L = C_q + \alpha_1 B_q + \alpha_2 N_q    (3)

step B3 is: given two or more queries, use a stratified sampling function to generate different query combinations, run the different query combinations in pairs, record the number of I/O block reads B and the transmission volume N of each query combination to form samples, and estimate the coefficients \beta_1, \beta_2, \beta_3 and \beta_4 of the linear regression models from the samples by the least squares method;
in the formulas, B_q is the number of I/O block reads of the primary query q;
\sum_{i=1}^{n} B_{p_i} is the sum of the numbers of I/O block reads of all secondary queries;
\sum_{i=1}^{n} \Delta B_{q/p_i} is the sum of the direct influence values of the secondary queries on the primary query, measured in I/O block reads;
\sum_{i=1}^{n} \sum_{j \neq i} \Delta B_{p_i/p_j} is the sum of the indirect influence values between the secondary queries, measured in I/O block reads;
N_q is the transmission volume of the primary query q;
\sum_{i=1}^{n} N_{p_i} is the sum of the transmission volumes of all secondary queries;
\sum_{i=1}^{n} \Delta N_{q/p_i} is the sum of the direct influence values of the secondary queries on the primary query, measured in transmission volume;
\sum_{i=1}^{n} \sum_{j \neq i} \Delta N_{p_i/p_j} is the sum of the indirect influence values between the secondary queries, measured in transmission volume;
C_q is the CPU overhead time of query q.
6. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterised in that step C is: run queries Q1, Q2, Q3, ..., Qn to obtain measured values; put the measured values into the multiple linear regression models to obtain predicted values; divide the samples into a test data set and a training data set; and observe the goodness of fit between the predicted values and the measured values.
7. The performance prediction method for concurrent workloads in a distributed database according to claim 3, characterised in that the transmission volume uses the number of network packets transmitted between nodes as the raw data measured during query execution.
8. The performance prediction method for concurrent workloads in a distributed database according to claim 7, characterised in that the number of network packets and the number of I/O block reads are obtained using SystemTap.
Priority application (1): CN201510881758.5A, priority date 2015-12-04, filing date 2015-12-04 — "Performance prediction method for concurrent workloads in a distributed database" (granted as CN105512264B, active).

Publications (2)

CN105512264A — published 2016-04-20
CN105512264B — granted 2019-04-19

Family ID: 55720246
Family application: CN201510881758.5A (active; granted as CN105512264B)
Country status: CN — CN105512264B



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609416A (en) * 2009-07-13 2009-12-23 清华大学 Improve the method for performance tuning speed of distributed system
CN101841565A (en) * 2010-04-20 2010-09-22 中国科学院软件研究所 Database cluster system load balancing method and database cluster system
CN104794186A (en) * 2015-04-13 2015-07-22 太原理工大学 Collecting method for training samples of database load response time predicting model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邱能俊等: "阵列数据库系统FASTDB的研究与实现" (Research and implementation of the array database system FASTDB), 《计算机工程与设计》 (Computer Engineering and Design) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451041A (en) * 2017-07-24 2017-12-08 华中科技大学 A kind of object cloud storage system response delay Forecasting Methodology
CN107451041B (en) * 2017-07-24 2019-11-22 华中科技大学 A kind of object cloud storage system response delay prediction technique
CN107679243A (en) * 2017-10-31 2018-02-09 麦格创科技(深圳)有限公司 Task distributes the application process and system in distributed system
CN109308193A (en) * 2018-09-06 2019-02-05 广州市品高软件股份有限公司 A kind of multi-tenant function calculates the concurrency control method of service
CN110210000A (en) * 2019-04-18 2019-09-06 贵州大学 The identification of industrial process efficiency and diagnostic method based on Multiple Non Linear Regression
CN111782396A (en) * 2020-07-01 2020-10-16 浪潮云信息技术股份公司 Concurrency flexible control method based on distributed database
CN116745783A (en) * 2021-01-21 2023-09-12 斯诺弗雷克公司 Handling of system characteristic drift in machine learning applications
CN113157814A (en) * 2021-01-29 2021-07-23 东北大学 Query-driven intelligent workload analysis method under relational database
CN113157814B (en) * 2021-01-29 2023-07-18 东北大学 Query-driven intelligent workload analysis method under relational database
CN113485638A (en) * 2021-06-07 2021-10-08 贵州大学 Access optimization system for massive astronomical data
CN113296964A (en) * 2021-07-28 2021-08-24 阿里云计算有限公司 Data processing method and device

Also Published As

Publication number Publication date
CN105512264B (en) 2019-04-19


Legal Events

C06 / PB01 — Publication
C10 / SE01 — Entry into substantive examination / Entry into force of request for substantive examination
GR01 — Patent grant