CN106814994A - A big-data-oriented parallel system optimization method - Google Patents
- Publication number
- CN106814994A CN106814994A CN201710045825.9A CN201710045825A CN106814994A CN 106814994 A CN106814994 A CN 106814994A CN 201710045825 A CN201710045825 A CN 201710045825A CN 106814994 A CN106814994 A CN 106814994A
- Authority
- CN
- China
- Prior art keywords
- formula
- generated
- data
- variable
- parallel system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/465—Distributed object oriented systems
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A big-data-oriented parallel system optimization method. The present invention relates to parallel system optimization for big data. Its aim is to solve the problems that the prior art targets only specific algorithms, cannot handle complex formulas, and is time-consuming. The detailed process is: Step 1: abstract the data-intensive formula; Step 2: generate a formula semantic tree from the abstracted data-intensive formula; Step 3: simplify the semantic tree to generate a formula dependency graph; Step 4: layer the formula dependency graph to generate a task sequence; Step 5: generate task dependencies in the parallel system according to the task sequence, and obtain the calculation result of the data-intensive formula after execution. The present invention is applied in the field of data analysis.
Description
Technical field
The present invention relates to a big-data-oriented parallel system optimization method.
Background technology
A data-intensive complex formula is a formula that must be computed over massive data and that has a complex dependency structure; most such formulas involve continued-addition (summation) and continued-multiplication (product) operations, and evaluating them consumes a substantial amount of time. Data-intensive complex formulas are the basis of existing big data analysis and have very important applications in the data analysis field. The prior art has the following problems:
1. Existing platforms provide only basic operations; Hadoop, for example, provides only the Map and Reduce operations. This model is very difficult for unfamiliar programmers.
2. Existing toolkits provide formula computation methods only for existing algorithms; they cannot provide a universal formula computation method.
3. Under the prior art, data-intensive complex computations can only be completed through multiple rounds of resampling, which greatly increases the running time.
A data-intensive (complex) formula is a formula that contains multiple continued additions and continued multiplications for statistical calculation and whose data volume to be computed is large.
The content of the invention
The present invention is proposed to solve the problems that prior-art parallel systems target only specific algorithms, cannot handle complex calculation expressions, and are time-consuming, and accordingly proposes a big-data-oriented parallel system optimization method.
The big-data-oriented parallel system optimization method is realized through the following steps:
Step 1: abstract the data-intensive formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 to produce a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the calculation result of the data-intensive formula is obtained.
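As an illustrative sketch only (the nested-tuple encoding, operator names, and caching scheme below are my assumptions, not the patented implementation), the five steps can be approximated in a few lines: the formula is held as a semantic tree, each distinct aggregation is evaluated once (standing in for one MapReduce round), and the simple operations combine the cached aggregates:

```python
import operator

# Hypothetical sketch: nested tuples stand in for the formula semantic tree.
SIMPLE = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}

def run(tree, data, cache):
    """Evaluate a semantic tree; each distinct aggregation over the raw data
    is computed once and cached, standing in for one MapReduce round."""
    if isinstance(tree, (int, float)):        # constant leaf
        return tree
    op, *args = tree
    if op in SIMPLE:                          # simple computation on the driver
        return SIMPLE[op](run(args[0], data, cache), run(args[1], data, cache))
    key = (op, args[0])                       # aggregation over a named column
    if key not in cache:                      # merged nodes -> computed once
        col = data[args[0]]
        cache[key] = {"sum": sum(col), "count": len(col),
                      "avg": sum(col) / len(col)}[op]
    return cache[key]

data = {"x": [1.0, 2.0, 3.0, 4.0]}
# assumed example expression: sum(x) / (avg(x) * count(x))
expr = ("/", ("sum", "x"), ("*", ("avg", "x"), ("count", "x")))
cache = {}
result = run(expr, data, cache)
print(result, len(cache))   # 10.0 / (2.5 * 4) = 1.0, with 3 cached aggregates
```

Even in this toy, the repeated column `x` is scanned once per distinct aggregation rather than once per occurrence in the formula, which is the effect the merging and layering steps aim at.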
Effects of the invention:
1. Experiments show that the algorithm achieves a better optimization effect the larger the data volume; at GB-scale data volumes, it saves 57.3% of the calculation time on average.
2. Experiments show that under this algorithm the calculation time of a formula does not depend on the complexity of the formula but on the number of tasks the formula generates.
3. The algorithm is universal and can be applied to different parallel platforms, such as Hadoop and Spark; it requires no programming experience from the user, and it provides optimized calculation results for complex expressions.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the semantic tree structure diagram; in the figure, "/" denotes the division sign, "-" the minus sign, and "*" the multiplication sign; "sum" denotes the summation (continued-addition) operator, "pow" the power operator, "avg" the averaging operator, and "count" the counting operator; "x", "y" and "2" denote operation variables and operands.
Fig. 3 is the simplification process diagram;
Fig. 4 is the layering schematic diagram;
Fig. 5 is the generated task sequence diagram; the task sequence in the figure runs bottom-up: MapReduce:avg() denotes one round of map-reduce performing an average calculation; MapReduce:avg(), count() denotes one round of map-reduce performing an average calculation and a count; MapReduce:sum() denotes one round of map-reduce performing a continued-addition calculation; in total four rounds of MapReduce (map-reduce) operations are performed.
Fig. 6 is the task sequence scheduling diagram;
Fig. 7 is the configuration diagram for simple aggregation operations; in the figure, the input variable x is mapped by map into Key:Value pairs and reduced by reduce into results: valuesum is the continued-addition result and valuecount is the count result. The configured operators sum and count perform continued addition and counting during the reduce procedure.
Fig. 8 is the configuration diagram for complex aggregation operations;
Fig. 9 compares the running times of CCA, MCA and ACF before optimization;
Fig. 10 compares the running times of CCA, MCA and ACF after optimization;
Fig. 11 compares ACF before and after optimization;
Fig. 12 compares MCA before and after optimization;
Fig. 13 compares CCA before and after optimization.
Specific embodiment
Specific embodiment one: a big-data-oriented parallel system optimization method comprises the following steps:
Step 1: abstract the data-intensive complex formula;
Step 2: generate a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplify the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layer the formula dependency graph generated in Step 3 to produce a task sequence;
Step 5: generate task dependencies in the parallel system according to the task sequence generated in Step 4; after execution, the calculation result of the data-intensive formula is obtained.
Specific embodiment two: this embodiment differs from embodiment one in that abstracting the data-intensive formula in Step 1 is specifically:
The sub-operations in the data-intensive formula are divided into, and defined as, two kinds: simple computation and aggregation computation. Each aggregation operation is completed with one round of MapReduce, and the data-intensive formula is abstracted into functional form. The simple computations are the four arithmetic operations, powers and roots; the aggregation computations are statistical calculations; MapReduce is a programming model for concurrent operations on large-scale data sets (larger than 1 TB).
Other steps and parameters are identical to embodiment one.
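A minimal sketch of this split (the operator spellings such as "sqrt" are assumed names, not taken from the patent): classify each operator as simple or aggregation and count how many MapReduce rounds the aggregations would cost:

```python
# Assumed operator vocabulary for illustration only.
SIMPLE_OPS = {"+", "-", "*", "/", "pow", "sqrt"}   # four arithmetic ops, power, root
AGG_OPS = {"sum", "avg", "count"}                  # statistical aggregations

def classify(op):
    """Label an operator as 'simple' or 'aggregation', per this embodiment."""
    if op in AGG_OPS:
        return "aggregation"
    if op in SIMPLE_OPS:
        return "simple"
    raise ValueError(f"unknown operator: {op}")

def mapreduce_rounds(ops):
    """Each aggregation costs one MapReduce round; simple ops run on the driver."""
    return sum(1 for op in ops if classify(op) == "aggregation")

print(classify("sum"), mapreduce_rounds(["sum", "/", "avg", "count"]))  # aggregation 3
```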
Specific embodiment three: this embodiment differs from embodiments one and two in that the detailed process of generating, in Step 2, the formula semantic tree from the data-intensive formula abstracted in Step 1 is:
Extract the variables in the data-intensive formula and determine the sub-formulas; take the operators in the sub-formulas (addition, subtraction, multiplication, division, continued addition, continued multiplication) as parent nodes and the variables computed by each operator as child nodes to generate the formula semantic tree. In this semantic tree there is exactly one aggregation operation on each path from a leaf node to the root node. Sub-formulas are expressions such as Σx².
Other steps and parameters are identical to embodiments one and two.
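The one-aggregation-per-path property can be checked mechanically; the nested-tuple encoding and operator names below are my own illustration, not the patent's representation:

```python
AGGS = {"sum", "avg", "count"}   # assumed aggregation operator names

def aggs_on_paths(tree, seen=0):
    """Return the set of aggregation counts seen along all leaf-to-root paths
    of a semantic tree encoded as nested tuples (operator, child, ...)."""
    if not isinstance(tree, tuple):            # leaf: a variable or constant
        return {seen}
    op, *children = tree
    seen += op in AGGS
    counts = set()
    for child in children:
        counts |= aggs_on_paths(child, seen)
    return counts

# assumed example: sum(x*y) / count(x) -- every path crosses one aggregation
tree = ("/", ("sum", ("*", "x", "y")), ("count", "x"))
print(aggs_on_paths(tree))        # {1}: the property holds for this tree
```

A tree violating the property, such as `("/", "x", ("sum", "x"))`, would yield `{0, 1}`, flagging a path with no aggregation.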
Specific embodiment four: this embodiment differs from one of embodiments one to three in that the detailed process of simplifying the semantic tree generated in Step 2 to produce the formula dependency graph in Step 3 is:
All nodes in the semantic tree that correspond to the same variable are merged into the same vertex, and nodes that perform the same calculation on the same variable are merged into the same vertex.
Other steps and parameters are identical to one of embodiments one to three.
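Both merge rules behave like hash-consing (common-subexpression elimination); the sketch below is an analogy under that assumption, not the patent's data structure:

```python
def build_dag(tree, nodes):
    """Intern every subtree so that identical variables (All-to-1) and
    identical computations on them (Same-to-1) share a single vertex."""
    if not isinstance(tree, tuple):
        key = ("var", tree)                    # leaf: a variable name
    else:
        op, *children = tree
        key = (op,) + tuple(build_dag(c, nodes) for c in children)
    nodes.setdefault(key, key)                 # one vertex per distinct subtree
    return key

nodes = {}
# assumed example: avg(x) - avg(x) * count(x); avg(x) appears twice
build_dag(("-", ("avg", "x"), ("*", ("avg", "x"), ("count", "x"))), nodes)
print(len(nodes))   # 5 vertices: x, avg(x), count(x), the * node, the - node
```

The two occurrences of avg(x) collapse to one vertex, so the corresponding aggregation is computed by a single task.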
Specific embodiment five: this embodiment differs from one of embodiments one to four in that the detailed process of layering the formula dependency graph generated in Step 3 to produce the task sequence in Step 4 is:
Layering is performed according to the distance between variables and operators in the formula dependency graph: any variable serves as a start node, and the number of nodes a variable passes through to reach an operator gives the layer of that operator; when multiple paths exist between a variable and an operator, the path with more nodes prevails, each operator counting as one node.
The aggregation operations of each layer are extracted per variable, and the task sequence is generated in order from the start nodes to the end nodes. Within each layer, aggregation operations on different variables are placed into one round of MapReduce and executed in parallel, while aggregation operations on the same variable within a layer are placed into one round of MapReduce and executed serially.
Other steps and parameters are identical to one of embodiments one to four.
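The longest-path layering can be sketched as follows; the example dependency graph (for sum(x) / (avg(x) * count(x))) and the node names are my assumptions for illustration:

```python
def layer(node, dag, memo):
    """Layer = length of the longest path from a variable up to this node;
    when multiple paths exist, the path with more nodes prevails."""
    if node not in memo:
        inputs = dag.get(node, ())             # edges point from a node to its inputs
        memo[node] = 1 + max((layer(c, dag, memo) for c in inputs), default=-1)
    return memo[node]

# assumed dependency graph for sum(x) / (avg(x) * count(x))
dag = {"div": ["sumx", "mul"], "mul": ["avgx", "countx"],
       "sumx": ["x"], "avgx": ["x"], "countx": ["x"]}
memo = {}
layers = {n: layer(n, dag, memo)
          for n in ["x", "sumx", "avgx", "countx", "mul", "div"]}
print(layers)  # x:0; the three aggregations share layer 1 (one parallel round)
```

All three aggregations land in layer 1, so, per this embodiment, they would be packed into a single MapReduce round and executed together.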
Embodiment one: as shown in Fig. 1, the steps of the big-data-oriented parallel system optimization method are:
1. Abstracting the formula structure
The sub-operations in the formula are divided into two kinds, simple computation and aggregation computation, and each aggregation operation is completed with one round of MapReduce. The formula is abstracted into functional form; for example, the summation sign is expressed as sum(). After abstraction, the formula becomes a functional expression, and the next step is performed.
2. Generating the formula semantic tree
According to the dependencies of the formula, the semantic tree structure shown in Fig. 2 is generated.
3. Simplifying the tree and generating the formula dependency graph
The semantic tree is simplified using two principles:
All-to-1: all nodes corresponding to the same variable are merged into the same vertex, eliminating redundant nodes.
Same-to-1: nodes performing the same calculation on the same variable are merged into the same vertex, so that identical calculations can be carried out by the same task, reducing repeated rounds of resampling.
Applying these two principles to the example yields the simplification shown in Fig. 3. The formula dependency graph is generated after simplification, and the task sequence of the formula is generated from this dependency graph.
4. Generating the task order
A calculation task plan is generated from the formula dependency graph, and the calculations are assigned to tasks. First, layering is performed according to the distance between the formula's operators and its variables, as shown in Fig. 4.
MapReduce tasks are then generated according to the layering, with identical operations within each layer generating the same task; the generated task sequence is shown in Fig. 5.
5. Executing on the parallel system
The task sequence is turned into a job-dependency sequence in the parallel system and submitted to an existing big data parallel system for execution, as shown in Fig. 6. The formula generates 3 MapReduce tasks that execute in parallel.
After execution, the final calculation result of the formula is obtained. Experiments prove that the algorithm greatly improves computational efficiency.
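As an analogy for this step (a thread pool stands in for the parallel system; the jobs are plain Python callables, not Hadoop jobs), independent aggregation tasks from one layer can be launched together and combined on the driver:

```python
from concurrent.futures import ThreadPoolExecutor

x = [1.0, 2.0, 3.0, 4.0]                 # assumed input data
jobs = {"sum": lambda: sum(x),           # three independent "MapReduce" tasks
        "count": lambda: len(x),
        "avg": lambda: sum(x) / len(x)}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(fn) for name, fn in jobs.items()}
    results = {name: f.result() for name, f in futures.items()}

# the "simple" layer combines the aggregates after the parallel round
final = results["sum"] / (results["avg"] * results["count"])
print(results, final)   # {'sum': 10.0, 'count': 4, 'avg': 2.5} 1.0
```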
The algorithm is implemented in Hadoop, with system configuration performed through the Job Configure file. We divide the configuration process into simple aggregation operations and complex aggregation operations. Simple aggregation operations are configured directly in the Reduce phase, as shown in Fig. 7.
Complex aggregation operations must instead be configured in the Mapper to determine the calculation process and thereby produce the calculation result, as shown in Fig. 8.
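A pure-Python analogue of the Fig. 7 configuration (this is not the Hadoop Job Configure API; the function names are mine): the mapper emits key-value pairs and the reducer applies the configured operators sum and count in one pass:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: turn each input value of variable x into a (key, value) pair."""
    for value in records:
        yield ("x", value)                 # a single shared key, as in the figure

def reduce_phase(pairs, operators):
    """Reducer: group by key and apply every configured operator to the group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: {name: fn(values) for name, fn in operators.items()}
            for key, values in groups.items()}

ops = {"sum": sum, "count": len}           # the reduce-side operator configuration
out = reduce_phase(map_phase([1.0, 2.0, 3.0, 4.0]), ops)
print(out)   # {'x': {'sum': 10.0, 'count': 4}}
```

Both operators share one pass over the grouped values, which is the point of packing several aggregations into the same reduce round.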
The method of the invention can be applied to existing parallel platforms, such as Spark and Hyracks; only slight modifications of the configuration conditions are needed to adapt it to such systems.
As a basic optimization algorithm for data analysis, the method can be applied in data analysis fields such as business analytics, finance, industry and agriculture.
Three test expressions were used: complex correlation analysis (CCA), matrix correlation analysis (MCA), and an arbitrary complex formula (ACF).
The running time comparisons in Figs. 9 to 13 show that the running time drops significantly after optimization, and that the larger the data volume, the more pronounced the optimization effect.
Claims (5)
1. A big-data-oriented parallel system optimization method, characterized in that the method comprises the following steps:
Step 1: abstracting the data-intensive formula;
Step 2: generating a formula semantic tree from the data-intensive formula abstracted in Step 1;
Step 3: simplifying the semantic tree generated in Step 2 to produce a formula dependency graph;
Step 4: layering the formula dependency graph generated in Step 3 to produce a task sequence;
Step 5: generating task dependencies in the parallel system according to the task sequence generated in Step 4, and obtaining the calculation result of the data-intensive formula after execution.
2. The big-data-oriented parallel system optimization method according to claim 1, characterized in that abstracting the data-intensive formula in Step 1 is specifically:
dividing the sub-operations in the data-intensive formula into, and defining them as, two kinds, simple computation and aggregation computation, each aggregation operation being completed with one round of MapReduce, and abstracting the data-intensive formula into functional form; the simple computations are the four arithmetic operations, powers and roots, the aggregation computations are statistical calculations, and MapReduce is a programming model.
3. The big-data-oriented parallel system optimization method according to claim 1 or 2, characterized in that the detailed process of generating, in Step 2, the formula semantic tree from the data-intensive formula abstracted in Step 1 is:
extracting the variables in the data-intensive formula and determining the sub-formulas, taking the operators in the sub-formulas as parent nodes and the variables computed by each operator as child nodes to generate the formula semantic tree, the semantic tree having exactly one aggregation operation on each path from a leaf node to the root node.
4. The big-data-oriented parallel system optimization method according to claim 3, characterized in that the detailed process of simplifying the semantic tree generated in Step 2 to produce the formula dependency graph in Step 3 is:
merging all nodes in the semantic tree that correspond to the same variable into the same vertex, and merging nodes that perform the same calculation on the same variable into the same vertex.
5. The big-data-oriented parallel system optimization method according to claim 1, 2 or 4, characterized in that the detailed process of layering the formula dependency graph generated in Step 3 to produce the task sequence in Step 4 is:
performing layering according to the distance between variables and operators in the formula dependency graph, any variable serving as a start node and the number of nodes a variable passes through to reach an operator giving the layer of that operator, the path with more nodes prevailing when multiple paths exist between a variable and an operator, each operator counting as one node;
extracting the aggregation operations of each layer per variable and generating the task sequence in order from the start nodes to the end nodes, aggregation operations on different variables within each layer being placed into one round of MapReduce and executed in parallel, and aggregation operations on the same variable within each layer being placed into one round of MapReduce and executed serially.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710045825.9A CN106814994B (en) | 2017-01-20 | 2017-01-20 | A kind of parallel system optimization method towards big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710045825.9A CN106814994B (en) | 2017-01-20 | 2017-01-20 | A kind of parallel system optimization method towards big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106814994A true CN106814994A (en) | 2017-06-09 |
CN106814994B CN106814994B (en) | 2019-02-19 |
Family
ID=59111200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710045825.9A Active CN106814994B (en) | 2017-01-20 | 2017-01-20 | A kind of parallel system optimization method towards big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106814994B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885587A (en) * | 2017-11-17 | 2018-04-06 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN108255689A (en) * | 2018-01-11 | 2018-07-06 | 哈尔滨工业大学 | A kind of Apache Spark application automation tuning methods based on historic task analysis |
WO2024065525A1 (en) * | 2022-09-29 | 2024-04-04 | Intel Corporation | Method and apparatus for optimizing deep learning computation graph |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120137300A1 (en) * | 2010-11-30 | 2012-05-31 | Ryuji Sakai | Information Processor and Information Processing Method |
CN102591712A (en) * | 2011-12-30 | 2012-07-18 | 大连理工大学 | Decoupling parallel scheduling method for rely tasks in cloud computing |
US8977898B1 (en) * | 2012-09-24 | 2015-03-10 | Emc Corporation | Concurrent access to data during replay of a transaction log |
- 2017-01-20: CN application CN201710045825.9A filed; granted as patent CN106814994B (status: Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120137300A1 (en) * | 2010-11-30 | 2012-05-31 | Ryuji Sakai | Information Processor and Information Processing Method |
CN102591712A (en) * | 2011-12-30 | 2012-07-18 | 大连理工大学 | Decoupling parallel scheduling method for rely tasks in cloud computing |
US8977898B1 (en) * | 2012-09-24 | 2015-03-10 | Emc Corporation | Concurrent access to data during replay of a transaction log |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885587A (en) * | 2017-11-17 | 2018-04-06 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN107885587B (en) * | 2017-11-17 | 2018-12-07 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN108255689A (en) * | 2018-01-11 | 2018-07-06 | 哈尔滨工业大学 | A kind of Apache Spark application automation tuning methods based on historic task analysis |
CN108255689B (en) * | 2018-01-11 | 2021-02-12 | 哈尔滨工业大学 | Automatic Apache Spark application tuning method based on historical task analysis |
WO2024065525A1 (en) * | 2022-09-29 | 2024-04-04 | Intel Corporation | Method and apparatus for optimizing deep learning computation graph |
Also Published As
Publication number | Publication date |
---|---|
CN106814994B (en) | 2019-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106814994A (en) | A kind of parallel system optimization method towards big data | |
CN110276450A (en) | Deep neural network structural sparse system and method based on more granularities | |
CN102033748A (en) | Method for generating data processing flow codes | |
Gölzer et al. | Designing global manufacturing networks using Big Data | |
CN103399841A (en) | Sparse matrix LU decomposition method based on GPU | |
Adams et al. | Base-2 expansions for linearizing products of functions of discrete variables | |
US11977885B2 (en) | Utilizing structured sparsity in systolic arrays | |
CN105183880A (en) | Hash join method and device | |
CN108829501A (en) | A kind of batch processing scientific workflow task scheduling algorithm based on improved adaptive GA-IAGA | |
CN107491571A (en) | A kind of method and system of vehicle performance emulation | |
CN105302915B (en) | The high-performance data processing system calculated based on memory | |
CN103699354A (en) | Molecular adder establishment method based on strand displacement reaction | |
CN106445645A (en) | Method and device for executing distributed computation tasks | |
CN108256638A (en) | Microprocessor circuit and the method for performing neural network computing | |
Chen et al. | Particle swarm optimization based on genetic operators for sensor-weapon-target assignment | |
García Martín | Energy efficiency in machine learning: A position paper | |
CN104699449A (en) | GMP (GNU multiple precision arithmetic library) based big integer addition and subtraction multinuclear parallelization implementation method | |
Hannachi et al. | GMTE: A tool for graph transformation and exact/inexact graph matching | |
Durak et al. | Towards an ontology for simulation systems engineering | |
Singh et al. | Optimization of feature selection method for high dimensional data using fisher score and minimum spanning tree | |
Narkhede et al. | Analyzing web application log files to find hit count through the utilization of Hadoop MapReduce in cloud computing environment | |
Redko et al. | Concept-Monadic Model of Technological Environment of Programming | |
CN110119265A (en) | Multiplication implementation method, device, computer storage medium and electronic equipment | |
CN107038244A (en) | A kind of data digging method and device, a kind of computer-readable recording medium and storage control | |
CN107423028A (en) | A kind of parallel scheduling method of extensive flow |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||