CN106814994A - Parallel system optimization method for big data - Google Patents

Parallel system optimization method for big data

Info

Publication number
CN106814994A
CN106814994A
Authority
CN
China
Prior art keywords
formula
generated
data
variable
parallel system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710045825.9A
Other languages
Chinese (zh)
Other versions
CN106814994B (en)
Inventor
王宏志
宋扬
文豪
李建中
高宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201710045825.9A priority Critical patent/CN106814994B/en
Publication of CN106814994A publication Critical patent/CN106814994A/en
Application granted granted Critical
Publication of CN106814994B publication Critical patent/CN106814994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/465Distributed object oriented systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A parallel system optimization method for big data; the present invention relates to parallel system optimization methods for big data. Its purpose is to solve the problems that the prior art targets only specific algorithms, cannot handle complex formulas, and is time-consuming to compute. The detailed process is: Step 1: apply abstraction to the data-intensive formula; Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1; Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph; Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence; Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained. The present invention is used in the field of data analysis.

Description

Parallel system optimization method for big data
Technical field
The present invention relates to parallel system optimization methods for big data.
Background art
A data-intensive complex formula is a formula that must be computed over massive data and has a complex dependency structure; most such formulas involve continuous-summation and continuous-product operations, and their evaluation consumes a great deal of time. Data-intensive complex formulas are the basis of existing big data analysis and have very important applications in the field of data analysis. The prior art has the following problems:
1. Existing platforms provide only basic operations; Hadoop, for example, provides only the Map and Reduce operations. This model is very difficult for programmers unfamiliar with it.
2. Existing toolkits provide only the formula computation methods of existing algorithms and cannot provide a general-purpose formula computation method.
3. Under the prior art, data-intensive complex computations can only be completed through many rounds of iterative computation, which greatly increases running time.
A data-intensive (complex) formula is a formula that contains multiple continuous-summation and continuous-product operations for statistical calculation and whose input data volume is large.
Summary of the invention
The present invention proposes a parallel system optimization method for big data in order to solve the problems that prior-art parallel systems each target a specific algorithm, cannot handle complex calculation expressions, and are time-consuming to compute.
The parallel system optimization method for big data is realized by the following steps:
Step 1: apply abstraction to the data-intensive formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
Effects of the invention:
1. Experiments show that the algorithm achieves a better optimization effect as the data volume grows; at the GB scale it saves 57.3% of the computation time on average.
2. Experiments show that under this algorithm the computation time of a formula does not depend on the complexity of the formula, but on the number of tasks the formula generates.
3. The algorithm is general: it can be applied to different parallel platforms such as Hadoop and Spark, requires no programming experience from the user, and provides optimized computation results for complex expressions.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the semantic tree structure diagram; in the figure, "/" denotes the division sign, "-" the minus sign, "*" the multiplication sign, "sum" the continuous-summation sign, "pow" the power sign, "avg" the averaging sign and "count" the counting sign; "x", "y" and "2" denote operation variables and operands.
Fig. 3 is the simplification process diagram;
Fig. 4 is the layering schematic diagram;
Fig. 5 is the generated task sequence diagram; the task sequence in the figure runs bottom-up. MapReduce: avg() denotes one round of MapReduce performing an average computation; MapReduce: avg(), count() denotes one round performing average and count computations; MapReduce: sum() denotes one round performing a continuous summation; in total 4 rounds of MapReduce (mapping-reduction operations) are performed.
Fig. 6 is the task sequence scheduling diagram;
Fig. 7 is the simple-aggregation configuration diagram; in the figure, the input variable x is turned by map into Key:Value pairs and reduced by reduce into the results valuesum (the summation result) and valuecount (the count result); the configured operators sum and count perform summation and counting in the reduce phase.
Fig. 8 is the complex-aggregation configuration diagram;
Fig. 9 is the run time comparison diagram for CCA, MCA and ACF before optimization;
Fig. 10 is the run time comparison diagram for CCA, MCA and ACF after optimization;
Fig. 11 is the comparison diagram before and after ACF optimization;
Fig. 12 is the comparison diagram before and after MCA optimization;
Fig. 13 is the comparison diagram before and after CCA optimization.
Detailed description of the embodiments
Embodiment 1: a parallel system optimization method for big data comprises the following steps:
Step 1: apply abstraction to the data-intensive complex formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
Embodiment 2: this embodiment differs from Embodiment 1 in that the abstraction of the data-intensive formula in Step 1 is specifically:
The sub-operations in the data-intensive formula are divided into two kinds, simple computations and aggregation computations; each aggregation computation is completed with one round of MapReduce, and the data-intensive formula is abstracted into functional form. The simple computations are arithmetic operations, powers and roots; the aggregation computations are statistical calculations; MapReduce is a programming model for concurrent operations on large-scale data sets (larger than 1 TB).
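As an illustration only (the patent does not fix a naming scheme; operator names such as "pow" and "sqrt" are assumed here), the division of sub-operations into simple and aggregation computations can be sketched as a small classifier:

```python
# Hypothetical sketch of the step-one classification: aggregation operators
# are statistical and will each map to one MapReduce round; simple
# operators are arithmetic, power and root computations.
SIMPLE_OPS = {"+", "-", "*", "/", "pow", "sqrt"}
AGGREGATION_OPS = {"sum", "avg", "count"}

def classify(op):
    """Return 'aggregation' or 'simple' for a sub-operation name."""
    if op in AGGREGATION_OPS:
        return "aggregation"
    if op in SIMPLE_OPS:
        return "simple"
    raise ValueError(f"unknown operator: {op}")
```

For example, classify("sum") yields "aggregation" (one MapReduce round), while classify("pow") yields "simple".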
Other steps and parameters are identical to Embodiment 1.
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that the detailed process of generating the formula semantic tree from the abstracted data-intensive formula in Step 2 is:
Extract the variables in the data-intensive formula and determine the sub-formulas; take each operator in a sub-formula (addition, subtraction, multiplication, division, continuous summation, continuous multiplication) as a parent node and the variables operated on by that operator as child nodes, generating the formula semantic tree. In this semantic tree there is exactly one aggregation operation on each path from a leaf node to the root node. Sub-formulas are parts of the formula such as Σx².
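A minimal sketch of such a semantic tree, assuming a simple Node class (not part of the patent), built for the sub-formula Σx², i.e. sum(pow(x, 2)):

```python
# Each operator is a parent node; the variables/operands it acts on are
# its children. Built here for the sub-formula sum(pow(x, 2)) = sum(x^2).
class Node:
    def __init__(self, op, children=()):
        self.op = op                    # operator, variable or constant
        self.children = list(children)

def sum_of_squares_tree():
    x, two = Node("x"), Node("2")
    return Node("sum", [Node("pow", [x, two])])
```

Note that on the path from the leaf "x" to the root there is exactly one aggregation operator (sum), as the text requires.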
Other steps and parameters are identical to Embodiment 1 or 2.
Embodiment 4: this embodiment differs from Embodiments 1 to 3 in that the detailed process of simplifying the semantic tree generated in Step 2 into the formula dependency graph in Step 3 is:
Merge all nodes corresponding to the same variable in the semantic tree into the same vertex, and merge nodes performing the same computation on the same variable into the same vertex.
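The two merge rules amount to hash-consing the tree into a DAG; a sketch under the same assumed Node representation (not the patent's implementation):

```python
# Merge nodes for the same variable, and nodes computing the same operation
# on the same operands, into one vertex (the All-to-1 / Same-to-1 rules of
# the example below), turning the semantic tree into a dependency DAG.
class Node:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def merge(node, seen=None):
    if seen is None:
        seen = {}
    kids = [merge(c, seen) for c in node.children]
    # Two nodes are "the same computation" if operator and (merged)
    # operand vertices coincide.
    key = (node.op, tuple(id(k) for k in kids))
    if key not in seen:
        node.children = kids
        seen[key] = node
    return seen[key]
```

For instance, merging the tree for avg(x) - avg(x) leaves both operands pointing at one shared avg vertex, so that value is computed by a single task.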
Other steps and parameters are identical to Embodiments 1 to 3.
Embodiment 5: this embodiment differs from Embodiments 1 to 4 in that the detailed process of layering the formula dependency graph generated in Step 3 and generating the task sequence in Step 4 is:
Layer according to the distance between variables and operators in the formula dependency graph: take any variable as a start node, and take the number of vertices traversed from the variable to an operator as the layer of that operator; when there are multiple paths between a variable and an operator, the path with more vertices prevails. Each operator is one vertex.
Extract the aggregation operations on the same variable in each layer and generate the task sequence in order from the start nodes to the end node. In each layer, aggregation operations on different variables are placed into one round of MapReduce and executed in parallel; aggregation operations on the same variable in a layer are placed serially into one round of MapReduce.
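The layering rule (the longer variable-to-operator path prevails) is longest-path layering on the DAG; a sketch with the same assumed Node class, not the patent's own code:

```python
# Assign each vertex a layer: variables (leaves) are start nodes at
# layer 0; an operator's layer is the number of vertices traversed from
# a variable, i.e. 1 + the maximum layer among its operands, so the
# longer path wins when several paths exist.
class Node:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def layer_of(node, memo=None):
    if memo is None:
        memo = {}
    if id(node) not in memo:
        if not node.children:
            memo[id(node)] = 0
        else:
            memo[id(node)] = 1 + max(layer_of(c, memo) for c in node.children)
    return memo[id(node)]
```

For sum(pow(x, 2)): x is at layer 0, pow at layer 1 and sum at layer 2; aggregation operators that land in the same layer can then share one MapReduce round.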
Other steps and parameters are identical to Embodiments 1 to 4.
Example 1: as shown in Fig. 1, the steps of the parallel system optimization method for big data are:
1. Abstracting the formula structure
The sub-operations in the formula are divided into two kinds, simple computations and aggregation computations, and each aggregation operation is completed with one round of MapReduce. The formula is abstracted into functional form, e.g. the continuous-summation sign is expressed as sum(); the following steps then operate on the abstracted functional expression.
2. Generating the formula semantic tree
The semantic tree structure shown in Fig. 2 is generated according to the dependencies within the formula.
3rd, abbreviation and formula dependency graph is generated
Carry out semantic tree abbreviation.We carry out semantic tree abbreviation using two principles.
All-to-1.The node of the identical variable of all correspondences merges into Same Vertices, the node for eliminating redundancy.
Same-to-1.Same Vertices are merged into the node that identical variable carries out identical calculations.So identical calculating can Carried out with same task, and then reduce fast resampling process.
Abbreviation is carried out to example using the two principles, as shown in Figure 3.Formula dependency graph is generated after abbreviation, foundation should Formula dependency graph generates the task sequence of formula.
4. Generating the task sequence
A computation task plan is generated according to the formula dependency graph, and the computations are assigned to tasks. First, we layer according to the distance between the formula's operators and its variables, as shown in Fig. 4.
MapReduce tasks are generated according to the layering; identical operations within one layer generate the same task. The generated task sequence is shown in Fig. 5.
5. Execution on the parallel system
The task sequence is turned into a job dependency sequence in the parallel system and submitted to an existing big data parallel system for execution, as shown in Fig. 6. The formula generates 3 MapReduce tasks that are executed in parallel.
After execution, the final computation result of the formula is obtained. Experiments prove that the algorithm can greatly improve computational efficiency.
The algorithm is implemented in Hadoop, with system configuration done through the Job Configure file. We divide the configuration process into simple aggregation operations and complex aggregation operations. A simple aggregation operation is configured directly in the Reduce phase, as shown in Fig. 7.
Configuring a complex aggregation operation additionally requires writing code in the Mapper to determine the computation process and generate the result, as shown in Fig. 8.
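To illustrate the simple-aggregation configuration (sum and count as reduce-side operators, with the average derived from them), here is an in-memory mock of one MapReduce round; this is a sketch, not the Hadoop API:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducers):
    # One simulated MapReduce round: the mapper emits (key, value) pairs,
    # and each configured reduce-side operator folds the values per key.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return {k: {name: red(vs) for name, red in reducers.items()}
            for k, vs in groups.items()}

# Configure sum and count as the reduce operators, mirroring Fig. 7;
# the average then follows as sum / count.
out = map_reduce([1.0, 2.0, 3.0],
                 mapper=lambda x: [("x", x)],
                 reducers={"sum": sum, "count": len})
avg = out["x"]["sum"] / out["x"]["count"]  # 6.0 / 3 = 2.0
```

The design point this illustrates is that avg need not be its own reducer: one round that emits both sum and count suffices, which is why the patent can pack several aggregations of the same layer into a single MapReduce round.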
The method of the invention can be applied to other existing parallel platforms, such as Spark and Hyracks; only slight modification of the configuration is needed to adapt it to those systems.
As a basic optimization algorithm for data analysis, the method of the invention can be applied to data analysis fields such as business analysis, finance, industry and agriculture.
Complex correlation analysis (CCA):
Matrix correlation analysis (MCA):
Arbitrary complex formula (ACF):
The run time comparisons in Figs. 9 to 13 show that running time drops significantly after optimization, and that the larger the data volume, the more obvious the optimization effect.

Claims (5)

1. A parallel system optimization method for big data, characterized in that the parallel system optimization method for big data comprises the following steps:
Step 1: apply abstraction to the data-intensive formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 and generate a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the computation result of the data-intensive formula is obtained.
2. The parallel system optimization method for big data according to claim 1, characterized in that the abstraction of the data-intensive formula in Step 1 is specifically:
the sub-operations in the data-intensive formula are divided into two kinds, simple computations and aggregation computations; each aggregation computation is completed with one round of MapReduce, and the data-intensive formula is abstracted into functional form; the simple computations are arithmetic operations, powers and roots, the aggregation computations are statistical calculations, and MapReduce is a programming model.
3. The parallel system optimization method for big data according to claim 1 or 2, characterized in that the detailed process of generating the formula semantic tree from the abstracted data-intensive formula in Step 2 is:
extract the variables in the data-intensive formula and determine the sub-formulas; take each operator in a sub-formula as a parent node and the variables operated on by that operator as child nodes, generating the formula semantic tree, in which there is exactly one aggregation operation on each path from a leaf node to the root node.
4. The parallel system optimization method for big data according to claim 3, characterized in that the detailed process of simplifying the semantic tree generated in Step 2 into the formula dependency graph in Step 3 is:
merge all nodes corresponding to the same variable in the semantic tree into the same vertex, and merge nodes performing the same computation on the same variable into the same vertex.
5. The parallel system optimization method for big data according to claim 1, 2 or 4, characterized in that the detailed process of layering the formula dependency graph generated in Step 3 and generating the task sequence in Step 4 is:
layer according to the distance between variables and operators in the formula dependency graph: take any variable as a start node, and take the number of vertices traversed from the variable to an operator as the layer of that operator; when there are multiple paths between a variable and an operator, the path with more vertices prevails; each operator is one vertex;
extract the aggregation operations on the same variable in each layer and generate the task sequence in order from the start nodes to the end node; in each layer, aggregation operations on different variables are placed into one round of MapReduce and executed in parallel, while aggregation operations on the same variable in a layer are placed serially into one round of MapReduce.
CN201710045825.9A 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data Active CN106814994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710045825.9A CN106814994B (en) 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data


Publications (2)

Publication Number Publication Date
CN106814994A true CN106814994A (en) 2017-06-09
CN106814994B CN106814994B (en) 2019-02-19

Family

ID=59111200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710045825.9A Active CN106814994B (en) 2017-01-20 2017-01-20 A kind of parallel system optimization method towards big data

Country Status (1)

Country Link
CN (1) CN106814994B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885587A (en) * 2017-11-17 2018-04-06 清华大学 A kind of executive plan generation method of big data analysis process
CN108255689A (en) * 2018-01-11 2018-07-06 哈尔滨工业大学 A kind of Apache Spark application automation tuning methods based on historic task analysis
WO2024065525A1 (en) * 2022-09-29 2024-04-04 Intel Corporation Method and apparatus for optimizing deep learning computation graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137300A1 (en) * 2010-11-30 2012-05-31 Ryuji Sakai Information Processor and Information Processing Method
CN102591712A (en) * 2011-12-30 2012-07-18 大连理工大学 Decoupling parallel scheduling method for rely tasks in cloud computing
US8977898B1 (en) * 2012-09-24 2015-03-10 Emc Corporation Concurrent access to data during replay of a transaction log



Also Published As

Publication number Publication date
CN106814994B (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN106814994A (en) A kind of parallel system optimization method towards big data
CN110276450A (en) Deep neural network structural sparse system and method based on more granularities
CN102033748A (en) Method for generating data processing flow codes
Gölzer et al. Designing global manufacturing networks using Big Data
CN103399841A (en) Sparse matrix LU decomposition method based on GPU
Adams et al. Base-2 expansions for linearizing products of functions of discrete variables
US11977885B2 (en) Utilizing structured sparsity in systolic arrays
CN105183880A (en) Hash join method and device
CN108829501A (en) A kind of batch processing scientific workflow task scheduling algorithm based on improved adaptive GA-IAGA
CN107491571A (en) A kind of method and system of vehicle performance emulation
CN105302915B (en) The high-performance data processing system calculated based on memory
CN103699354A (en) Molecular adder establishment method based on strand displacement reaction
CN106445645A (en) Method and device for executing distributed computation tasks
CN108256638A (en) Microprocessor circuit and the method for performing neural network computing
Chen et al. Particle swarm optimization based on genetic operators for sensor-weapon-target assignment
García Martín Energy efficiency in machine learning: A position paper
CN104699449A (en) GMP (GNU multiple precision arithmetic library) based big integer addition and subtraction multinuclear parallelization implementation method
Hannachi et al. GMTE: A tool for graph transformation and exact/inexact graph matching
Durak et al. Towards an ontology for simulation systems engineering
Singh et al. Optimization of feature selection method for high dimensional data using fisher score and minimum spanning tree
Narkhede et al. Analyzing web application log files to find hit count through the utilization of Hadoop MapReduce in cloud computing environment
Redko et al. Concept-Monadic Model of Technological Environment of Programming
CN110119265A (en) Multiplication implementation method, device, computer storage medium and electronic equipment
CN107038244A (en) A kind of data digging method and device, a kind of computer-readable recording medium and storage control
CN107423028A (en) A kind of parallel scheduling method of extensive flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant