CN106020719A - Initial parameter configuration method of distributed storage system - Google Patents

Initial parameter configuration method of distributed storage system

Info

Publication number
CN106020719A
CN106020719A CN201610318767.8A
Authority
CN
China
Prior art keywords
parameter
performance
performance model
test
memory system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610318767.8A
Other languages
Chinese (zh)
Inventor
彭泽武
黄剑文
王建民
冯歆尧
黄向东
钟雨
龙明盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Information Center of Guangdong Power Grid Co Ltd
Original Assignee
Tsinghua University
Information Center of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Information Center of Guangdong Power Grid Co Ltd filed Critical Tsinghua University
Priority to CN201610318767.8A priority Critical patent/CN106020719A/en
Publication of CN106020719A publication Critical patent/CN106020719A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G06F3/0607 Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems

Abstract

The invention relates to an initial parameter configuration method for a distributed storage system, and belongs to the technical field of computer database management. The method is divided into three stages: a training stage, a use stage, and a dynamic update stage. The user first runs the training stage to obtain a performance model; in the use stage, the performance model is used to solve the initial parameter configuration problem of the distributed storage system; in the dynamic update stage, the performance model is dynamically updated based on user feedback. The method effectively avoids the cold-start problem of traditional databases and traditional parameter tuning tools, reduces hardware cost as much as possible while meeting user requirements, improves system performance, and offers users better cost-effectiveness.

Description

Initial parameter configuration method for a distributed storage system
Technical field
The invention belongs to the technical field of computer database management, and in particular relates to an initial parameter configuration method for distributed storage systems in big-data application development.
Background technology
With the rapid development of the Internet and the accelerating pace of social informatization, data in every industry is growing fast, and humanity has stepped into the big-data era. Big data is characterized by the "4V" features: larger volume, higher variety, higher velocity, and lower value density. Among these four, volume has become the main focus of attention, because the scale of big data has exceeded the limits of traditional data storage systems: traditional local file systems top out at the TB level, while PB-level storage demands are seen everywhere in the big-data era; traditional relational databases can effectively manage data on the order of hundreds of millions of records, yet as early as 2012 Facebook processed up to 2.5 billion records per day. Against this background, the storage and management of big data has become a frontier problem in computer science.
With the development of big-data technology, distributed storage systems have emerged. By spreading data across multiple servers, they effectively solve the limited storage capacity of a single server and offer good scalability. Distributed storage systems mainly include distributed file systems, such as HDFS, GFS, and Lustre; distributed relational databases, such as MySQL Cluster; and distributed non-relational databases, such as Cassandra, HBase, and MongoDB. When a distributed storage system is used to store big data, the first problem faced is the initial parameter configuration of the system. The initial parameter configuration of a distributed storage system covers, first, the hardware parameters of the system, including the number of servers and the sizes of CPU, memory, and hard disk, and second, the core software parameters of the distributed storage system, including the replication factor, consistency level, and so on. From the user's perspective, the goal of parameter configuration is to ensure that every performance requirement of the distributed storage system is met while keeping the hardware cost as low as possible. Research shows that parameter configuration has a decisive effect on the performance of a distributed storage system. Moreover, adjusting any single configuration parameter in a distributed storage system rarely shows an immediate effect; its impact becomes apparent only once the data volume reaches a certain scale.
At present, the initial parameter configuration problem of distributed storage systems is mostly solved by manual configuration, which is labor-intensive; automated configuration methods are rare. Apart from initial configuration, there is more research on parameter optimization of running systems, especially relational database systems: after the system has run for some time, performance problems and resource bottlenecks are discovered, and the parameters to adjust are derived by analyzing historical data and logs, as in Oracle 11g and the cross-platform tuning tool iTuned. All such methods suffer from a cold-start problem when configuring parameters and cannot produce a good initial parameter configuration for the system.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the prior art by proposing an initial parameter configuration method for distributed storage systems that avoids the cold-start problem of traditional databases and parameter tuning tools, saves hardware cost as far as possible while meeting user requirements, improves system performance, and offers users better cost-effectiveness.
The present invention proposes an initial parameter configuration method for a distributed storage system. The method is divided into three stages: a training stage, a use stage, and a dynamic update stage. The user first runs the training stage to obtain a performance model; then, in the use stage, the performance model is used to solve the initial parameter configuration problem of the distributed storage system; afterwards, in the dynamic update stage, the performance model is dynamically updated according to user feedback.
The specific steps of the method are as follows:
(1) Training stage: test the distributed storage system and build a performance model. The specific steps are:
(1-1) Select the hardware parameters and core software parameters of the distributed storage system, and determine the value ranges of these parameters;
(1-2) Determine the combinations of the parameters from step (1-1) that need to be tested;
(1-3) Select a test server and its hardware configuration parameters;
(1-4) Install a virtual machine platform on the test server selected in step (1-3);
(1-5) Select a client server and configure the network environment so that the client server and the test server are on the same subnet; install the YCSB benchmark on the client server;
(1-6) Create a seed virtual machine on the platform built in step (1-4); install a Linux operating system and the necessary software, and install the distributed storage system;
(1-7) Test each parameter combination determined in step (1-2) and record the test results of the performance metrics;
(1-8) For each performance metric, build and solve a performance model from the test results of step (1-7);
(2) Use stage: perform the initial parameter configuration of the distributed storage system by computing its hardware parameter and core software parameter configuration. The specific steps are:
(2-1) The user sets a target value for each of the performance metrics according to their requirements;
(2-2) For every performance metric, use the performance model obtained in step (1) to compute the metric for each combination of parameter values within the ranges determined in step (1-1), and sort the combinations by the computed metric value in ascending order;
(2-3) From the computed results, filter out the parameter combinations whose metric values fall below the targets set in step (2-1);
(2-4) For each remaining parameter combination, compute its price with a preset cost function;
(2-5) From the results of step (2-4), choose the combination with the lowest price, i.e., the lowest cost, as the parameter configuration of the distributed storage system;
(3) Dynamic update stage: dynamically update the performance model. The specific steps are:
(3-1) Collect user feedback on the hardware parameter and core software parameter configuration of the distributed storage system obtained in the use stage and on the corresponding observed performance metrics;
(3-2) Update the performance model of each performance metric;
(3-3) Repeat steps (3-1) and (3-2) to continually update the performance model according to user requirements.
The initial parameter configuration method for distributed storage systems proposed by the present invention has the following advantages:
1. The method avoids the cold-start problem of traditional databases and parameter tuning tools; it saves hardware cost as far as possible while meeting user requirements, improves system performance, and offers users better cost-effectiveness;
2. The performance model in the method supports dynamic updates: it can learn from user feedback and continually improve the accuracy of the performance model.
Detailed description of the invention
The initial parameter configuration method for a distributed storage system proposed by the present invention is further described below in conjunction with a specific embodiment.
The method is divided into three stages: a training stage, in which the distributed storage system is tested and a performance model is built; a use stage, in which the initial parameter configuration of the distributed storage system is performed by computing its hardware parameter and core software parameter configuration; and a dynamic update stage, in which the performance model is continually updated according to user feedback.
The distributed storage system used in this embodiment is Cassandra. The concrete implementation steps are as follows:
(1) Training stage: test the distributed storage system (Cassandra) and build a performance model. The specific steps are:
(1-1) Select the hardware parameters and core software parameters of the distributed storage system, and determine the value ranges of these parameters. The parameters chosen in this embodiment and their corresponding value ranges are shown in Table 1; there are 5 parameters in total;
Table 1
(1-2) Determine the combinations of the parameter values in Table 1 that need to be tested. The combinations used in this embodiment are shown in Table 2; there are 54 groups in total;
Table 2
(1-3) Select a test server; the hardware configuration of the selected server is at least a 16-core CPU, 64 GB of memory, and a 2 TB hard disk;
(1-4) Install a virtual machine platform, such as VMware, OpenStack, or VirtualBox, on the test server selected in step (1-3);
(1-5) Select a client server and configure the network environment so that the client server and the test server are on the same subnet, with a network bandwidth of at least 1 GB; install the open-source YCSB benchmark on the client server;
(1-6) Create a seed virtual machine on the platform built in step (1-4); install a Linux operating system and the necessary software such as Java, and install the distributed storage system (Cassandra in this embodiment);
(1-7) Test each parameter combination in Table 2 and record the test results of the performance metrics. The specific steps are:
(1-7-1) Clone the seed virtual machine built in step (1-6) into several identical virtual machines to serve as test virtual machines; the number of test virtual machines equals the node count of the combination under test in Table 2;
(1-7-2) Ensure that, for each class of hardware resource (e.g., CPU, memory, hard disk), the sum across all test virtual machines does not exceed the corresponding resource of the test server selected in step (1-3);
(1-7-3) Modify the IP addresses of the test virtual machines so that they differ; modify the necessary parameters of the distributed storage system (Cassandra) on each test virtual machine (e.g., the seed IP address); form all the test virtual machines into a cluster and start the distributed storage system (Cassandra);
(1-7-4) On the client server, set the IP address list of the YCSB benchmark to the IP addresses of the test virtual machines configured in step (1-7-3), run the YCSB benchmark, and record the test results of the performance metrics as shown in Table 3;
Table 3
(1-8) For each of the performance metrics obtained in step (1-7) (read throughput, write throughput, read latency, write latency, and maximum client concurrency), build and solve a performance model from the test results. The specific steps are:
(1-8-1) Construct the objective function of the performance model. Let the objective function be Y = F(N, C, M, R, L), where Y is any one of the performance metrics in Table 3 and F is a polynomial in the parameters N, C, M, R, and L; after expansion the polynomial has X terms in total, denoted as X dummy variables with X corresponding undetermined coefficients;
(1-8-2) For each parameter combination in Table 2, take the values of the X dummy variables after expansion as the independent variables and the corresponding performance-metric test result as the dependent variable (i.e., Y), forming a training data set; in this embodiment the training data set consists of 54 groups of data;
(1-8-3) Apply multiple linear regression to the training data set of step (1-8-2) to obtain the regression values of the X undetermined coefficients, yielding the expression of Y = F(N, C, M, R, L), which is the expression of the performance model of the performance metric selected in step (1-8-1).
In this embodiment, write throughput is selected as the performance metric in step (1-8-1). The concrete calculation steps are as follows:
The objective function of the performance model is:
Y = c1 + c2·C + c3·M + c4·CM + c5·N + c6·CN + c7·MN + c8·CMN + c9·N² + c10·NL + c11·N/R + c12·NL/R
where Y denotes the write throughput; after expansion the polynomial has 12 terms, i.e., 12 dummy variables (1, C, M, CM, N, CN, MN, CMN, N², NL, N/R, NL/R) with 12 corresponding undetermined coefficients (c1 … c12);
(1-8-2) Using the 54 parameter combinations in Table 2, compute the values of the 12 dummy variables after expansion as the independent variables and take the write-throughput test results as the dependent variable (i.e., Y), forming the training data set from all the independent and dependent variables;
(1-8-3) Apply multiple linear regression to the training data set of (1-8-2) to obtain the regression values of the 12 undetermined coefficients (c1 … c12), yielding the expression of the objective function, i.e., the performance model of write throughput.
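As an illustration only (not part of the patent text), the regression of steps (1-8-1) to (1-8-3) can be sketched with NumPy least squares. The parameter combinations and measurements would come from Table 2 and the YCSB results, which are not reproduced here, so the inputs below are placeholders:

```python
import numpy as np

def features(N, C, M, R, L):
    """Expand one parameter combination (node count N, CPU cores C,
    memory size M, replication factor R, commit log size L) into the
    12 dummy variables of the write-throughput model."""
    return np.array([1.0, C, M, C*M, N, C*N, M*N, C*M*N,
                     N**2, N*L, N/R, N*L/R], dtype=float)

def fit_model(param_combos, measurements):
    """Multiple linear regression via least squares: returns the
    regression values of the undetermined coefficients c1..c12."""
    X = np.vstack([features(*p) for p in param_combos])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(measurements, dtype=float),
                                 rcond=None)
    return coeffs

def predict(coeffs, N, C, M, R, L):
    """Evaluate the fitted performance model Y = F(N, C, M, R, L)."""
    return float(features(N, C, M, R, L) @ coeffs)
```

With the 54 combinations of Table 2 as `param_combos` and the measured write throughputs as `measurements`, `fit_model` plays the role of step (1-8-3), and `predict` is the resulting performance model used in the use stage.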
(2) Use stage: perform the initial parameter configuration of the distributed storage system by computing its hardware parameter and core software parameter configuration. The specific steps are:
(2-1) The user sets a target value for each of the 5 performance metrics in Table 3 according to their requirements;
(2-2) For every performance metric, use the performance model obtained in step (1) to compute the metric for each combination of parameter values within the ranges of Table 1 (in this embodiment there are 10 × 4 × 7 × 5 × 3 = 4200 combinations in total), and sort the combinations by the computed metric value in ascending order;
(2-3) From the computed results, filter out the parameter combinations whose metric values fall below the targets set in step (2-1);
(2-4) For each remaining parameter combination, compute its price with a preset cost function;
In this embodiment, the cost function used is: Price(N, C, M) = (45C + 22M + 4(log2(M) − 14)) · N
(2-5) From the results of step (2-4), choose the combination with the lowest price, i.e., the lowest cost, as the parameter configuration of the distributed storage system.
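Steps (2-2) to (2-5) amount to an exhaustive search over the 4200 combinations: predict every metric, discard combinations that miss a target, and keep the cheapest survivor. A minimal sketch follows; the models, targets, and value ranges passed in are placeholders rather than the ones of Table 1, and for simplicity every target is treated as a required minimum (latency-type metrics, where lower is better, would flip the comparison):

```python
from itertools import product
from math import log2

def price(N, C, M):
    """Cost function of this embodiment:
    Price(N, C, M) = (45C + 22M + 4(log2(M) - 14)) * N"""
    return (45 * C + 22 * M + 4 * (log2(M) - 14)) * N

def choose_config(models, targets, ranges, cost):
    """models:  metric name -> predictor f(N, C, M, R, L)
    targets: metric name -> required minimum value
    ranges:  the value ranges of (N, C, M, R, L)
    Returns (lowest price, configuration), or None if no combination
    meets every target (steps 2-2 to 2-5)."""
    best = None
    for N, C, M, R, L in product(*ranges):
        if all(models[m](N, C, M, R, L) >= t for m, t in targets.items()):
            p = cost(N, C, M)
            if best is None or p < best[0]:
                best = (p, (N, C, M, R, L))
    return best
```

Because only the minimum-cost survivor is needed, the explicit ascending sort of step (2-2) can be replaced by this single pass that tracks the running best.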
(3) Dynamic update stage: dynamically update the performance model. The specific steps are:
(3-1) Collect user feedback on the hardware parameter and core software parameter configuration of the distributed storage system (Cassandra) obtained in the use stage and on the corresponding observed performance metrics;
(3-2) For each performance metric (read throughput, write throughput, read latency, write latency, and maximum client concurrency), update its performance model. The specific steps are:
(3-2-1) Using the objective function expression set in step (1-8-1), compute the values of all independent variables from the hardware parameter and core software parameter configuration fed back by the user, and take the performance metric fed back by the user as the dependent variable (Y), obtaining an update training record;
(3-2-2) Add the update training record to the existing training data set to obtain the updated training data set;
(3-2-3) Using the updated training data set, solve the objective function set in step (1-8-1) again by multiple linear regression, obtaining regression values for the same number of undetermined coefficients as in step (1-8-3) of the training stage; the new expression of the objective function thus obtained is the new performance model of this performance metric and replaces the original one;
(3-3) Repeat steps (3-1) and (3-2) to continually update the performance model according to user requirements; in subsequent use, the results computed in step (2) with the updated performance model will be more accurate.
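The update of steps (3-2-1) to (3-2-3) is simply a re-fit of the regression on the grown training set. A sketch under the same least-squares formulation, where `expand` stands for the dummy-variable expansion of step (1-8-1) and the data in the test are hypothetical:

```python
import numpy as np

def update_model(X_train, y_train, fb_params, fb_metric, expand):
    """Append one user-feedback record (steps 3-2-1 and 3-2-2) and
    re-solve the regression (step 3-2-3). Returns the grown training
    set and the new coefficients that replace the old model."""
    X_new = np.vstack([X_train, expand(*fb_params)])
    y_new = np.append(np.asarray(y_train, dtype=float), fb_metric)
    coeffs, *_ = np.linalg.lstsq(X_new, y_new, rcond=None)
    return X_new, y_new, coeffs
```

Each call returns the enlarged data set, so repeated feedback (step 3-3) is handled by feeding the returned `X_new`, `y_new` back into the next call.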

Claims (5)

1. An initial parameter configuration method for a distributed storage system, characterized in that the method is divided into three stages: a training stage, a use stage, and a dynamic update stage; the user first runs the training stage to obtain a performance model, then, in the use stage, uses the performance model to solve the initial parameter configuration problem of the distributed storage system; afterwards, in the dynamic update stage, the performance model is dynamically updated according to user feedback.
2. The method of claim 1, characterized in that the method specifically comprises the following steps:
(1) Training stage: test the distributed storage system and build a performance model. The specific steps are:
(1-1) Select the hardware parameters and core software parameters of the distributed storage system, and determine the value ranges of these parameters;
(1-2) Determine the combinations of the parameters from step (1-1) that need to be tested;
(1-3) Select a test server;
(1-4) Install a virtual machine platform on the test server selected in step (1-3);
(1-5) Select a client server and configure the network environment so that the client server and the test server are on the same subnet; install the YCSB benchmark on the client server;
(1-6) Create a seed virtual machine on the platform built in step (1-4); install a Linux operating system and the necessary software, and install the distributed storage system;
(1-7) Test each parameter combination determined in step (1-2) and record the test results of the performance metrics; there are 5 performance metrics in total: read throughput, write throughput, read latency, write latency, and maximum client concurrency;
(1-8) For each performance metric, build and solve a performance model from the test results of step (1-7);
(2) Use stage: perform the initial parameter configuration of the distributed storage system by computing its hardware parameter and core software parameter configuration. The specific steps are:
(2-1) The user sets a target value for each of the 5 performance metrics according to their requirements;
(2-2) For every performance metric, use the performance model obtained in step (1) to compute the metric for each combination of parameter values within the ranges determined in step (1-1), and sort the combinations by the computed metric value in ascending order;
(2-3) From the computed results, filter out the parameter combinations whose metric values fall below the targets set in step (2-1);
(2-4) For each remaining parameter combination, compute its price with a preset cost function;
(2-5) From the results of step (2-4), choose the combination with the lowest price, i.e., the lowest cost, as the parameter configuration of the distributed storage system;
(3) Dynamic update stage: dynamically update the performance model. The specific steps are:
(3-1) Collect user feedback on the hardware parameter and core software parameter configuration of the distributed storage system obtained in the use stage and on the corresponding observed performance metrics;
(3-2) Update the performance model of each performance metric;
(3-3) Repeat steps (3-1) and (3-2) to continually update the performance model according to user requirements.
3. The method of claim 2, characterized in that testing the parameter combinations and recording the test results of the performance metrics in step (1-7) specifically comprises the following steps:
(1-7-1) Clone the seed virtual machine built in step (1-6) into several identical virtual machines to serve as test virtual machines; the number of test virtual machines equals the node count of the combination under test determined in step (1-2);
(1-7-2) Ensure that, for each class of resource, the sum across all test virtual machines does not exceed the corresponding resource of the test server selected in step (1-3);
(1-7-3) Modify the IP addresses of the test virtual machines so that they differ; modify the necessary parameters of the distributed storage system on each test virtual machine; form all the test virtual machines into a cluster and start the distributed storage system;
(1-7-4) On the client server, set the IP address list of the YCSB benchmark to the IP addresses of the test virtual machines configured in step (1-7-3), run the YCSB benchmark, and record the test results of the performance metrics.
4. The method of claim 2, characterized in that building and solving the performance model in step (1-8) specifically comprises the following steps:
(1-8-1) Construct the objective function of the performance model. Let the objective function be Y = F(N, C, M, R, L), where Y is any one of the 5 performance metrics and F is a polynomial in the node count N, the CPU core count C, the memory size M, the replication factor R, and the commit log size L; after expansion the polynomial has X terms in total, denoted as X dummy variables with X corresponding undetermined coefficients;
(1-8-2) For each parameter combination determined in step (1-2), take the values of the X dummy variables after expansion as the independent variables and the corresponding performance-metric test result as the dependent variable, forming a training data set;
(1-8-3) Apply multiple linear regression to the training data set obtained in step (1-8-2) to obtain the regression values of the X undetermined coefficients, yielding the expression of Y = F(N, C, M, R, L), which is the expression of the performance model of the performance metric selected in step (1-8-1).
5. The method of claim 2, characterized in that updating the performance model of each performance metric in step (3-2) specifically comprises the following steps:
(3-2-1) Using the objective function expression set in step (1-8), compute the values of all independent variables from the hardware parameter and core software parameter configuration of the distributed storage system fed back by the user, and take the performance metric fed back by the user as the dependent variable, obtaining an update training record;
(3-2-2) Add the update training record to the existing training data set to obtain the updated training data set;
(3-2-3) Using the updated training data set, solve the objective function set in step (1-8) again by multiple linear regression, obtaining regression values for the same number of undetermined coefficients as in step (1-8) of the training stage; the new expression of the objective function thus obtained is the new performance model of this performance metric and replaces the original one.
CN201610318767.8A 2016-05-13 2016-05-13 Initial parameter configuration method of distributed storage system Pending CN106020719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610318767.8A CN106020719A (en) 2016-05-13 2016-05-13 Initial parameter configuration method of distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610318767.8A CN106020719A (en) 2016-05-13 2016-05-13 Initial parameter configuration method of distributed storage system

Publications (1)

Publication Number Publication Date
CN106020719A true CN106020719A (en) 2016-10-12

Family

ID=57100447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610318767.8A Pending CN106020719A (en) 2016-05-13 2016-05-13 Initial parameter configuration method of distributed storage system

Country Status (1)

Country Link
CN (1) CN106020719A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145414A (zh) * 2017-04-27 2017-09-08 郑州云海信息技术有限公司 Method and system for testing distributed object storage
WO2018098670A1 (en) * 2016-11-30 2018-06-07 华为技术有限公司 Method and apparatus for performing data processing
CN108234163A (zh) * 2016-12-14 2018-06-29 大唐移动通信设备有限公司 Configuration method and device
CN108733750A (zh) * 2018-04-04 2018-11-02 安徽水利开发股份有限公司 Database optimization method
CN111949281A (en) * 2020-08-25 2020-11-17 深圳市广通软件有限公司 Database installation method based on AI configuration, user equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102176723A (en) * 2011-03-25 2011-09-07 Beihang University Manufacturing cloud system for supporting on-demand use and dynamic collaboration of manufacturing resources and manufacturing capacities
CN102222129A (en) * 2011-05-11 2011-10-19 FiberHome Telecommunication Technologies Co., Ltd. Method for dynamically reconfiguring a simulation environment
CN102624865A (en) * 2012-01-09 2012-08-01 Zhejiang University Cluster load prediction method and distributed cluster management system
CN104317642A (en) * 2014-09-28 2015-01-28 Huawei Technologies Co., Ltd. Method and device for configuring software in a cloud computing environment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhong Yu et al.: "Automatic component selection and parameter configuration in big data system development", CNKI Online-First Publication *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018098670A1 (en) * 2016-11-30 2018-06-07 Huawei Technologies Co., Ltd. Method and apparatus for performing data processing
CN108463813A (en) * 2016-11-30 2018-08-28 Huawei Technologies Co., Ltd. Method and apparatus for performing data processing
CN108463813B (en) * 2016-11-30 2020-12-04 Huawei Technologies Co., Ltd. Method and device for processing data
CN108234163A (en) * 2016-12-14 2018-06-29 Datang Mobile Communications Equipment Co., Ltd. Configuration method and device
CN108234163B (en) * 2016-12-14 2020-05-29 Datang Mobile Communications Equipment Co., Ltd. Configuration method and device
CN107145414A (en) * 2017-04-27 2017-09-08 Zhengzhou Yunhai Information Technology Co., Ltd. Method and system for testing distributed object storage
CN107145414B (en) * 2017-04-27 2021-03-09 Suzhou Inspur Intelligent Technology Co., Ltd. Method and system for testing distributed object storage
CN108733750A (en) * 2018-04-04 2018-11-02 Anhui Water Resources Development Co., Ltd. Database optimization method
CN111949281A (en) * 2020-08-25 2020-11-17 Shenzhen Guangtong Software Co., Ltd. Database installation method based on AI configuration, user equipment and storage medium

Similar Documents

Publication Publication Date Title
Konstantinou et al. On the elasticity of NoSQL databases over cloud management platforms
CN106020719A (en) Initial parameter configuration method of distributed storage system
Taft et al. E-store: Fine-grained elastic partitioning for distributed transaction processing systems
Cooper et al. Benchmarking cloud serving systems with YCSB
US8504556B1 (en) System and method for diminishing workload imbalance across multiple database systems
Gandini et al. Performance evaluation of NoSQL databases
Taft et al. P-store: An elastic database system with predictive provisioning
CN103345508A (en) Data storage method and system suitable for social network graph
CN107734052A (en) The load balancing container dispatching method that facing assembly relies on
CN113177050B (en) Data equalization method, device, query system and storage medium
Ghosh et al. Morphus: Supporting online reconfigurations in sharded nosql systems
Gu et al. Chronos: An elastic parallel framework for stream benchmark generation and simulation
Abdelhamid et al. Prompt: Dynamic data-partitioning for distributed micro-batch stream processing systems
JP6607963B2 (en) Data store for aggregated metrics measurements
Conley et al. Achieving cost-efficient, data-intensive computing in the cloud
Li et al. Bohr: similarity aware geo-distributed data analytics
CN107480254B (en) Online load balancing method suitable for distributed memory database
Qi Digital forensics and NoSQL databases
Cao et al. Logstore: A cloud-native and multi-tenant log database
Hendawi et al. Distributed NoSQL data stores: Performance analysis and a case study
Bawankule et al. Historical data based approach for straggler avoidance in a heterogeneous Hadoop cluster
Singhal et al. Predicting job completion time in heterogeneous mapreduce environments
Qi et al. Big data management in digital forensics
CN103324577A (en) Large-scale itemizing file distributing system based on minimum IO access conflict and file itemizing
Wasi-ur-Rahman et al. Performance modeling for RDMA-enhanced hadoop MapReduce

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161012

WD01 Invention patent application deemed withdrawn after publication