CN106814994A - Parallel system optimization method for big data - Google Patents

Parallel system optimization method for big data

Info

Publication number
CN106814994A
CN106814994A
Authority
CN
China
Prior art keywords
formula
generated
data
variable
parallel system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710045825.9A
Other languages
Chinese (zh)
Other versions
CN106814994B (en)
Inventor
王宏志
宋扬
文豪
李建中
高宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201710045825.9A priority Critical patent/CN106814994B/en
Publication of CN106814994A publication Critical patent/CN106814994A/en
Application granted granted Critical
Publication of CN106814994B publication Critical patent/CN106814994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/465Distributed object oriented systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A parallel system optimization method for big data; the present invention relates to parallel system optimization methods for big data. Its purpose is to solve the problems that the prior art targets only specific algorithms, cannot handle complex formulas, and is time-consuming to compute. The detailed process is: Step 1: apply abstraction to the data-intensive formula; Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1; Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph; Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence; Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained. The present invention is used in the field of data analysis.

Description

Parallel system optimization method for big data
Technical field
The present invention relates to parallel system optimization methods for big data.
Background art
A data-intensive complex formula is a formula that must be computed over massive data and has a complex dependency structure; most such formulas involve continuous-summation and continuous-product operations, and their evaluation consumes a great deal of time. Data-intensive complex formulas are the basis of existing big data analysis and have very important applications in the field of data analysis. The prior art has the following problems:
1. Existing platforms provide only basic operations; Hadoop, for example, provides only the Map and Reduce operations. This model is very difficult for programmers unfamiliar with it.
2. Existing toolkits provide only the formula computation methods of existing algorithms and cannot provide a general-purpose formula computation method.
3. Under the prior art, data-intensive complex computations can only be completed through many rounds of iterative computation, which greatly increases running time.
A data-intensive (complex) formula is a formula that contains multiple continuous-summation and continuous-product operations for statistical calculation and whose input data volume is large.
Summary of the invention
The present invention proposes a parallel system optimization method for big data in order to solve the problems that prior-art parallel systems each target a specific algorithm, cannot handle complex calculation expressions, and are time-consuming to compute.
The parallel system optimization method for big data is realized by the following steps:
Step 1: apply abstraction to the data-intensive formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
Effects of the invention:
1. Experiments show that the algorithm achieves a better optimization effect as the data volume grows; at the GB scale it saves 57.3% of the computation time on average.
2. Experiments show that under this algorithm the computation time of a formula does not depend on the complexity of the formula, but on the number of tasks the formula generates.
3. The algorithm is general: it can be applied to different parallel platforms such as Hadoop and Spark, requires no programming experience from the user, and provides optimized computation results for complex expressions.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the semantic tree structure diagram; in the figure, "/" denotes the division sign, "-" the minus sign, "*" the multiplication sign, "sum" the continuous-summation sign, "pow" the power sign, "avg" the averaging sign and "count" the counting sign; "x", "y" and "2" denote operation variables and operands.
Fig. 3 is the simplification process diagram;
Fig. 4 is the layering schematic diagram;
Fig. 5 is the generated task sequence diagram; the task sequence in the figure runs bottom-up. MapReduce: avg() denotes one round of MapReduce performing an average computation; MapReduce: avg(), count() denotes one round performing average and count computations; MapReduce: sum() denotes one round performing a continuous summation; in total 4 rounds of MapReduce (mapping-reduction operations) are performed.
Fig. 6 is the task sequence scheduling diagram;
Fig. 7 is the simple-aggregation configuration diagram; in the figure, the input variable x is turned by map into Key:Value pairs and reduced by reduce into the results valuesum (the summation result) and valuecount (the count result); the configured operators sum and count perform summation and counting in the reduce phase.
Fig. 8 is the complex-aggregation configuration diagram;
Fig. 9 is the run time comparison diagram for CCA, MCA and ACF before optimization;
Fig. 10 is the run time comparison diagram for CCA, MCA and ACF after optimization;
Fig. 11 is the comparison diagram before and after ACF optimization;
Fig. 12 is the comparison diagram before and after MCA optimization;
Fig. 13 is the comparison diagram before and after CCA optimization.
Detailed description of the embodiments
Embodiment 1: a parallel system optimization method for big data comprises the following steps:
Step 1: apply abstraction to the data-intensive complex formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
Embodiment 2: this embodiment differs from Embodiment 1 in that the abstraction of the data-intensive formula in Step 1 is specifically:
The sub-operations in the data-intensive formula are divided into two kinds, simple computations and aggregation computations; each aggregation computation is completed with one round of MapReduce, and the data-intensive formula is abstracted into functional form. The simple computations are arithmetic operations, powers and roots; the aggregation computations are statistical calculations; MapReduce is a programming model for concurrent operations on large-scale data sets (larger than 1 TB).
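As an illustration only (the patent does not fix a naming scheme; operator names such as "pow" and "sqrt" are assumed here), the division of sub-operations into simple and aggregation computations can be sketched as a small classifier:

```python
# Hypothetical sketch of the step-one classification: aggregation operators
# are statistical and will each map to one MapReduce round; simple
# operators are arithmetic, power and root computations.
SIMPLE_OPS = {"+", "-", "*", "/", "pow", "sqrt"}
AGGREGATION_OPS = {"sum", "avg", "count"}

def classify(op):
    """Return 'aggregation' or 'simple' for a sub-operation name."""
    if op in AGGREGATION_OPS:
        return "aggregation"
    if op in SIMPLE_OPS:
        return "simple"
    raise ValueError(f"unknown operator: {op}")
```

For example, classify("sum") yields "aggregation" (one MapReduce round), while classify("pow") yields "simple".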
Other steps and parameters are identical to Embodiment 1.
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that the detailed process of generating the formula semantic tree from the abstracted data-intensive formula in Step 2 is:
Extract the variables in the data-intensive formula and determine the sub-formulas; take each operator in a sub-formula (addition, subtraction, multiplication, division, continuous summation, continuous multiplication) as a parent node and the variables operated on by that operator as child nodes, generating the formula semantic tree. In this semantic tree there is exactly one aggregation operation on each path from a leaf node to the root node. Sub-formulas are parts of the formula such as Σx².
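A minimal sketch of such a semantic tree, assuming a simple Node class (not part of the patent), built for the sub-formula Σx², i.e. sum(pow(x, 2)):

```python
# Each operator is a parent node; the variables/operands it acts on are
# its children. Built here for the sub-formula sum(pow(x, 2)) = sum(x^2).
class Node:
    def __init__(self, op, children=()):
        self.op = op                    # operator, variable or constant
        self.children = list(children)

def sum_of_squares_tree():
    x, two = Node("x"), Node("2")
    return Node("sum", [Node("pow", [x, two])])
```

Note that on the path from the leaf "x" to the root there is exactly one aggregation operator (sum), as the text requires.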
Other steps and parameters are identical to Embodiment 1 or 2.
Embodiment 4: this embodiment differs from Embodiments 1 to 3 in that the detailed process of simplifying the semantic tree generated in Step 2 into the formula dependency graph in Step 3 is:
Merge all nodes corresponding to the same variable in the semantic tree into the same vertex, and merge nodes performing the same computation on the same variable into the same vertex.
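The two merge rules amount to hash-consing the tree into a DAG; a sketch under the same assumed Node representation (not the patent's implementation):

```python
# Merge nodes for the same variable, and nodes computing the same operation
# on the same operands, into one vertex (the All-to-1 / Same-to-1 rules of
# the example below), turning the semantic tree into a dependency DAG.
class Node:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def merge(node, seen=None):
    if seen is None:
        seen = {}
    kids = [merge(c, seen) for c in node.children]
    # Two nodes are "the same computation" if operator and (merged)
    # operand vertices coincide.
    key = (node.op, tuple(id(k) for k in kids))
    if key not in seen:
        node.children = kids
        seen[key] = node
    return seen[key]
```

For instance, merging the tree for avg(x) - avg(x) leaves both operands pointing at one shared avg vertex, so that value is computed by a single task.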
Other steps and parameters are identical to Embodiments 1 to 3.
Embodiment 5: this embodiment differs from Embodiments 1 to 4 in that the detailed process of layering the formula dependency graph generated in Step 3 and generating the task sequence in Step 4 is:
Layer according to the distance between variables and operators in the formula dependency graph: take any variable as a start node, and take the number of vertices traversed from the variable to an operator as the layer of that operator; when there are multiple paths between a variable and an operator, the path with more vertices prevails. Each operator is one vertex.
Extract the aggregation operations on the same variable in each layer and generate the task sequence in order from the start nodes to the end node. In each layer, aggregation operations on different variables are placed into one round of MapReduce and executed in parallel; aggregation operations on the same variable in a layer are placed serially into one round of MapReduce.
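The layering rule (the longer variable-to-operator path prevails) is longest-path layering on the DAG; a sketch with the same assumed Node class, not the patent's own code:

```python
# Assign each vertex a layer: variables (leaves) are start nodes at
# layer 0; an operator's layer is the number of vertices traversed from
# a variable, i.e. 1 + the maximum layer among its operands, so the
# longer path wins when several paths exist.
class Node:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def layer_of(node, memo=None):
    if memo is None:
        memo = {}
    if id(node) not in memo:
        if not node.children:
            memo[id(node)] = 0
        else:
            memo[id(node)] = 1 + max(layer_of(c, memo) for c in node.children)
    return memo[id(node)]
```

For sum(pow(x, 2)): x is at layer 0, pow at layer 1 and sum at layer 2; aggregation operators that land in the same layer can then share one MapReduce round.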
Other steps and parameters are identical to Embodiments 1 to 4.
Example 1: as shown in Fig. 1, the steps of the parallel system optimization method for big data are:
1. Abstracting the formula structure
The sub-operations in the formula are divided into two kinds, simple computations and aggregation computations, and each aggregation operation is completed with one round of MapReduce. The formula is abstracted into functional form, e.g. the continuous-summation sign is expressed as sum(); the following steps then operate on the abstracted functional expression.
2. Generating the formula semantic tree
The semantic tree structure shown in Fig. 2 is generated according to the dependencies within the formula.
3rd, abbreviation and formula dependency graph is generated
Carry out semantic tree abbreviation.We carry out semantic tree abbreviation using two principles.
All-to-1.The node of the identical variable of all correspondences merges into Same Vertices, the node for eliminating redundancy.
Same-to-1.Same Vertices are merged into the node that identical variable carries out identical calculations.So identical calculating can Carried out with same task, and then reduce fast resampling process.
Abbreviation is carried out to example using the two principles, as shown in Figure 3.Formula dependency graph is generated after abbreviation, foundation should Formula dependency graph generates the task sequence of formula.
4. Generating the task sequence
A computation task plan is generated according to the formula dependency graph, and the computations are assigned to tasks. First, we layer according to the distance between the formula's operators and its variables, as shown in Fig. 4.
MapReduce tasks are generated according to the layering; identical operations within one layer generate the same task. The generated task sequence is shown in Fig. 5.
5. Execution on the parallel system
The task sequence is turned into a job dependency sequence in the parallel system and submitted to an existing big data parallel system for execution, as shown in Fig. 6. The formula generates 3 MapReduce tasks that are executed in parallel.
After execution, the final computation result of the formula is obtained. Experiments prove that the algorithm can greatly improve computational efficiency.
The algorithm is implemented in Hadoop, with system configuration done through the Job Configure file. We divide the configuration process into simple aggregation operations and complex aggregation operations. A simple aggregation operation is configured directly in the Reduce phase, as shown in Fig. 7.
Configuring a complex aggregation operation additionally requires writing code in the Mapper to determine the computation process and generate the result, as shown in Fig. 8.
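To illustrate the simple-aggregation configuration (sum and count as reduce-side operators, with the average derived from them), here is an in-memory mock of one MapReduce round; this is a sketch, not the Hadoop API:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducers):
    # One simulated MapReduce round: the mapper emits (key, value) pairs,
    # and each configured reduce-side operator folds the values per key.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return {k: {name: red(vs) for name, red in reducers.items()}
            for k, vs in groups.items()}

# Configure sum and count as the reduce operators, mirroring Fig. 7;
# the average then follows as sum / count.
out = map_reduce([1.0, 2.0, 3.0],
                 mapper=lambda x: [("x", x)],
                 reducers={"sum": sum, "count": len})
avg = out["x"]["sum"] / out["x"]["count"]  # 6.0 / 3 = 2.0
```

The design point this illustrates is that avg need not be its own reducer: one round that emits both sum and count suffices, which is why the patent can pack several aggregations of the same layer into a single MapReduce round.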
The method of the invention can be applied to other existing parallel platforms, such as Spark and Hyracks; only slight modification of the configuration is needed to adapt it to those systems.
As a basic optimization algorithm for data analysis, the method of the invention can be applied to data analysis fields such as business analysis, finance, industry and agriculture.
Complex correlation analysis (CCA):
Matrix correlation analysis (MCA):
Arbitrary complex formula (ACF):
The run time comparisons in Figs. 9 to 13 show that running time drops significantly after optimization, and that the larger the data volume, the more obvious the optimization effect.

Claims (5)

1. A parallel system optimization method for big data, characterized in that the parallel system optimization method for big data comprises the following steps:
Step 1: apply abstraction to the data-intensive formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
2. The parallel system optimization method for big data according to claim 1, characterized in that the abstraction of the data-intensive formula in Step 1 is specifically:
the sub-operations in the data-intensive formula are divided into two kinds, simple computations and aggregation computations; each aggregation computation is completed with one round of MapReduce, and the data-intensive formula is abstracted into functional form; the simple computations are arithmetic operations, powers and roots, the aggregation computations are statistical calculations, and MapReduce is a programming model.
3. The parallel system optimization method for big data according to claim 1 or 2, characterized in that the detailed process of generating the formula semantic tree from the abstracted data-intensive formula in Step 2 is:
extract the variables in the data-intensive formula and determine the sub-formulas; take each operator in a sub-formula as a parent node and the variables operated on by that operator as child nodes, generating the formula semantic tree, in which there is exactly one aggregation operation on each path from a leaf node to the root node.
4. The parallel system optimization method for big data according to claim 3, characterized in that the detailed process of simplifying the semantic tree generated in Step 2 into the formula dependency graph in Step 3 is:
merge all nodes corresponding to the same variable in the semantic tree into the same vertex, and merge nodes performing the same computation on the same variable into the same vertex.
5. The parallel system optimization method for big data according to claim 1, 2 or 4, characterized in that the detailed process of layering the formula dependency graph generated in Step 3 and generating the task sequence in Step 4 is:
layer according to the distance between variables and operators in the formula dependency graph: take any variable as a start node, and take the number of vertices traversed from the variable to an operator as the layer of that operator; when there are multiple paths between a variable and an operator, the path with more vertices prevails; each operator is one vertex;
extract the aggregation operations on the same variable in each layer and generate the task sequence in order from the start nodes to the end node; in each layer, aggregation operations on different variables are placed into one round of MapReduce and executed in parallel, while aggregation operations on the same variable in a layer are placed serially into one round of MapReduce.
CN201710045825.9A 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data Active CN106814994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710045825.9A CN106814994B (en) 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data


Publications (2)

Publication Number Publication Date
CN106814994A true CN106814994A (en) 2017-06-09
CN106814994B CN106814994B (en) 2019-02-19

Family

ID=59111200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710045825.9A Active CN106814994B (en) 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data

Country Status (1)

Country Link
CN (1) CN106814994B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885587A (en) * 2017-11-17 2018-04-06 清华大学 A kind of executive plan generation method of big data analysis process
CN108255689A (en) * 2018-01-11 2018-07-06 哈尔滨工业大学 A kind of Apache Spark application automation tuning methods based on historic task analysis
WO2024065525A1 (en) * 2022-09-29 2024-04-04 Intel Corporation Method and apparatus for optimizing deep learning computation graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137300A1 (en) * 2010-11-30 2012-05-31 Ryuji Sakai Information Processor and Information Processing Method
CN102591712A (en) * 2011-12-30 2012-07-18 大连理工大学 Decoupling parallel scheduling method for rely tasks in cloud computing
US8977898B1 (en) * 2012-09-24 2015-03-10 Emc Corporation Concurrent access to data during replay of a transaction log



Also Published As

Publication number Publication date
CN106814994B (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN106814994A (en) A kind of parallel system optimization method towards big data
CN110276450A (en) Deep neural network structural sparse system and method based on more granularities
CN102033748A (en) Method for generating data processing flow codes
Gölzer et al. Designing global manufacturing networks using Big Data
CN103399841A (en) Sparse matrix LU decomposition method based on GPU
Adams et al. Base-2 expansions for linearizing products of functions of discrete variables
US11977885B2 (en) Utilizing structured sparsity in systolic arrays
CN105183880A (en) Hash join method and device
CN108829501A (en) A kind of batch processing scientific workflow task scheduling algorithm based on improved adaptive GA-IAGA
CN107491571A (en) A kind of method and system of vehicle performance emulation
CN105302915B (en) The high-performance data processing system calculated based on memory
CN103699354A (en) Molecular adder establishment method based on strand displacement reaction
CN106445645A (en) Method and device for executing distributed computation tasks
CN108256638A (en) Microprocessor circuit and the method for performing neural network computing
Chen et al. Particle swarm optimization based on genetic operators for sensor-weapon-target assignment
García Martín Energy efficiency in machine learning: A position paper
CN104699449A (en) GMP (GNU multiple precision arithmetic library) based big integer addition and subtraction multinuclear parallelization implementation method
Hannachi et al. GMTE: A tool for graph transformation and exact/inexact graph matching
Durak et al. Towards an ontology for simulation systems engineering
Singh et al. Optimization of feature selection method for high dimensional data using fisher score and minimum spanning tree
Narkhede et al. Analyzing web application log files to find hit count through the utilization of Hadoop MapReduce in cloud computing environment
Redko et al. Concept-Monadic Model of Technological Environment of Programming
CN110119265A (en) Multiplication implementation method, device, computer storage medium and electronic equipment
CN107038244A (en) A kind of data digging method and device, a kind of computer-readable recording medium and storage control
CN107423028A (en) A kind of parallel scheduling method of extensive flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant