CN106648654A - Data sensing-based Spark configuration parameter automatic optimization method - Google Patents

Publication number
CN106648654A
Authority
CN
China
Prior art keywords
parameter
spark
data
configuration
configuration parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611182310.5A
Other languages
Chinese (zh)
Inventor
罗妮
喻之斌
贝振东
姜春涛
须成忠
熊文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201611182310.5A
Publication of CN106648654A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to technical fields such as electronic information, big data and cloud computing, and in particular relates to a data sensing-based method for automatically optimizing Spark configuration parameters. The method comprises: determining in advance a Spark application and the parameters that influence Spark performance; randomly configuring the parameters to obtain a training set; building a performance model from the training set with a random forest algorithm; and searching for the optimal configuration parameters with a genetic algorithm. Without requiring the user to understand Spark's running mechanism, the meaning, effect and value range of the parameters, or the characteristics of the application and its input set, the method finds for the user the optimal configuration parameters of a specific application running in a specific cluster environment. Compared with conventional parameter configuration methods it is simpler and quicker, and the random forest algorithm it uses combines the advantages of machine learning and statistical reasoning, so that relatively high precision can be achieved with a relatively small training set.

Description

A data-aware method for automatically optimizing Spark configuration parameters
Technical field
The invention belongs to technical fields such as electronic information, big data and cloud computing, and in particular relates to a data-aware method for automatically optimizing Spark configuration parameters.
Background art
Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab (the AMP laboratory of the University of California, Berkeley). It has developed quickly and became a top-level project of the Apache Foundation in only five years. Because Spark keeps intermediate results in memory, it runs iterative and interactive programs 10 times faster than Hadoop, a traditional disk-based computing framework. Spark holds an important position in big data analytics: according to a survey by Typesafe, more than 500 enterprises were using Spark in 2015.
Configuration parameter optimization has long been a research focus for big data systems: there are many configuration parameters (more than 100), performance is strongly affected by them, and applications differ in their characteristics. The default configuration therefore falls far short of the best achievable performance. Spark is an emerging in-memory big data computing framework. Because it computes in memory, every resource in the cluster (CPU, network bandwidth, memory) can become the bottleneck that limits a Spark program. Different Spark applications also have different characteristics: for example, KMeans has good instruction locality but poor data locality, PageRank involves far more shuffle and iteration than KMeans, and WordCount contains no iteration. The problem the invention addresses is to automatically find the optimal Spark configuration parameters for a specific cluster environment, input data set and application.
RFHOC (A Random-Forest Approach to Auto-Tuning Hadoop's Configuration) is a random-forest-based method for automatically optimizing Hadoop parameters. It tunes the configuration parameters of an application running on a given cluster and consists of three main steps:
1. Performance testing
2. Building the performance model
3. Iteratively searching for the optimal configuration
When a user runs a Hadoop application for the first time, RFHOC's workload profiler collects the Hadoop configuration parameters in effect at run time and the execution time of each MapReduce phase. The execution times of the different phases and the corresponding configuration parameters are then used as input to the random forest algorithm to build a performance prediction model. RFHOC builds separate regression models for the map and reduce phases to predict the performance of each phase. Each phase first produces a training set S; each row of S is a vector vj that contains an execution time and the corresponding Hadoop configuration parameter values. Once the performance model is built, RFHOC uses a genetic algorithm to search for the optimal Hadoop parameters. The genetic algorithm takes the performance predicted by the random forest model and the corresponding configuration as input and performs a global search. The sum of the execution times of the map and reduce phases is the total running time of the program and also serves as the fitness value of the genetic algorithm.
The prior art comprises manual and automatic parameter configuration. Manual configuration is too time-consuming and requires the user to understand Spark's operating mechanism and the meaning, effect and value range of each parameter. The user has to raise or lower Spark parameter values by hand, configure Spark, run the application, and look for the parameter values that give the shortest execution time. Because the optimal configuration differs across cluster environments, applications and input data sets, manual configuration is time-consuming and tedious work.
Existing automatic configuration methods suffer from low performance-model accuracy and high modeling cost. Some methods model performance with artificial neural networks (Artificial Neural Network) or support vector machines (Support Vector Machine), but reaching high accuracy (within 10%) requires a very large training set.
Summary of the invention
In view of the above, it is necessary to provide a data-aware method for automatically optimizing Spark configuration parameters.
A data-aware method for automatically optimizing Spark configuration parameters comprises the following steps:
Collecting data. Collecting data specifically comprises: selecting the Spark applications, determining the parameters that affect Spark performance for those applications, and determining the value range of each parameter; generating parameter values randomly within their value ranges and generating a configuration file to configure Spark; after configuration, running the application and collecting data. The data include, but are not limited to: Spark running time, input data set and configuration parameter values;
Building the performance model. The collected Spark running time, input data set and configuration parameter values form a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm;
Searching for the optimal configuration parameters. Using the performance model that has been built, the optimal configuration parameters are searched for with a genetic algorithm.
Further, a verification step follows the step of searching for the optimal configuration parameters: Spark is configured with the optimal configuration parameters found by the search, and the application is run to verify whether the execution time is the shortest.
Further, the parameters are generated randomly in the data collection step as follows: assume the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character). The other configuration parameters are generated in the same way.
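The record generation described above can be sketched in Python as follows. This is a minimal illustration, not the patent's Conf Generator: the function name generate_config_records, the choice of the two Spark properties and their value ranges are assumptions made for the example.

```python
import random

def generate_config_records(ranges, rng):
    """Draw one value per parameter uniformly from its [a, b] range.

    Returns one "name<TAB>value" record per parameter.
    """
    records = []
    for name, (low, high) in ranges.items():
        if isinstance(low, int) and isinstance(high, int):
            value = rng.randint(low, high)   # integer range, both ends inclusive
        else:
            value = rng.uniform(low, high)   # continuous range
        records.append(f"{name}\t{value}")
    return records

# Hypothetical value ranges, chosen only for illustration.
example_ranges = {
    "spark.executor.cores": (1, 8),
    "spark.memory.fraction": (0.3, 0.9),
}
records = generate_config_records(example_ranges, random.Random(0))
```

Passing in a seeded random.Random instance keeps a generated training configuration reproducible, which helps when a run has to be repeated.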
As an improvement, modeling the training set with the random forest algorithm specifically comprises the following steps:
The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;
A decision tree is constructed for each bootstrap data set. Construction proceeds by iteratively assigning data points to a left and a right subset; this splitting process searches the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain;
At each leaf node, the class distribution on that leaf is estimated empirically from the histogram of the class labels of the training samples that reach the leaf;
Training iterates until the maximum tree depth set by the user is reached, or until no greater information gain can be obtained by further splitting;
In the random forest algorithm, the execution time is the dependent variable, and the input set and the configuration parameters are the independent variables. The values of ntree and mtry must also be determined: ntree is the number of decision trees built in the random forest, and mtry is the number of predictors sampled at each split node of a decision tree.
As a further improvement, searching for the optimal configuration parameters with the genetic algorithm is specifically:
A vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1. The initial values are then changed and fed into the performance model, which outputs an execution time t2. t2 is compared with t1, and the configuration parameters with the shorter time are kept as the optimal configuration. These steps are repeated until the configuration with the shortest execution time is found.
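A minimal Python sketch of such a search is given below. The text only describes the compare-and-keep loop, so the population size, truncation selection, uniform crossover and mutation rate are assumptions of this sketch. toy_model is a hypothetical stand-in for the trained performance model.

```python
import random

def genetic_search(predict_time, ranges, rng, pop_size=20, generations=30,
                   mutation_rate=0.2):
    """Search for the configuration with the lowest predicted execution time."""
    def random_config():
        return [rng.uniform(lo, hi) for lo, hi in ranges]

    def mutate(config):
        # Re-draw each gene within its range with probability mutation_rate.
        return [rng.uniform(lo, hi) if rng.random() < mutation_rate else v
                for v, (lo, hi) in zip(config, ranges)]

    def crossover(a, b):
        # Uniform crossover: each gene comes from either parent.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    population = [random_config() for _ in range(pop_size)]
    best = min(population, key=predict_time)
    for _ in range(generations):
        # Keep the faster half as parents (truncation selection).
        population.sort(key=predict_time)
        parents = population[: pop_size // 2]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
        candidate = min(population, key=predict_time)
        if predict_time(candidate) < predict_time(best):
            best = candidate   # shorter predicted time: keep as current optimum
    return best

# Hypothetical stand-in for the performance model: the predicted time is
# smallest at c1 = 4, c2 = 0.6.
def toy_model(config):
    c1, c2 = config
    return (c1 - 4.0) ** 2 + 10.0 * (c2 - 0.6) ** 2 + 30.0

ranges = [(1.0, 8.0), (0.3, 0.9)]
best = genetic_search(toy_model, ranges, random.Random(1))
```

Because every evaluation calls the model instead of running Spark, the search can afford many generations; only the final configuration has to be run on the cluster.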
Specifically, the random forest algorithm is as follows:
Input: training set S, inducer (base learner) F, integer ntree (number of bootstrap samples)
for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
$C^{*}(x) = \arg_{y \in Y} \sum_{i=1}^{ntree} C_i(x) / ntree$
Output: aggregated model C*.
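The loop above can be sketched in Python as shown here. The sketch assumes the aggregate C* is the plain average of the ntree individual predictions, the usual choice for a regression target such as execution time. fit_stump is a toy inducer F introduced only for illustration, and the data pairs are made up.

```python
import random
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def bagging(rows, fit, ntree, rng):
    """Train ntree models on bootstrap samples and average their predictions."""
    models = [fit(bootstrap_sample(rows, rng)) for _ in range(ntree)]

    def predict(x):
        return sum(model(x) for model in models) / ntree
    return predict

def fit_stump(rows):
    """Toy inducer F: a single split on the feature at its mean value."""
    threshold = mean(x for x, _ in rows)
    left = [y for x, y in rows if x <= threshold]
    right = [y for x, y in rows if x > threshold]
    overall = mean(y for _, y in rows)
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return lambda x: left_value if x <= threshold else right_value

# Hypothetical (parameter value, execution time) pairs.
rows = [(1, 90.0), (2, 80.0), (3, 70.0), (6, 40.0), (7, 35.0), (8, 30.0)]
predict = bagging(rows, fit_stump, ntree=25, rng=random.Random(0))
```

A full random forest would replace fit_stump with a decision tree grown to the configured depth and would also sample mtry predictors at every split.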
The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with the random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's operating mechanism, the meaning, effect and value range of the parameters, or the characteristics of the application and its input set, the invention finds for the user the optimal configuration parameters of a specific application running in a specific cluster environment, and is simpler and faster than earlier parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference and can reach relatively high accuracy with a relatively small training set.
Description of the drawings
Fig. 1 is a schematic diagram of the overall flow of the data-aware Spark configuration parameter automatic optimization method of the invention;
Fig. 2 is a schematic diagram of the genetic algorithm used in the data-aware Spark configuration parameter automatic optimization method of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Figs. 1 and 2, a data-aware method for automatically optimizing Spark configuration parameters comprises the following three major steps:
1) Collect data. Data collection comprises four sub-steps:
(1) identify, among all Spark parameters, the parameters that affect performance;
(2) determine the value range of each parameter;
(3) select the input sets of the application;
(4) vary the parameters randomly within the determined value ranges, configure Spark, and run the application on different input data sets; the collected data serve as the training set;
In the data collection (Collecting) stage, the four sub-steps above can be carried out as follows. Select the Spark applications for the experiment; a commonly used choice is the HiBench benchmark suite, which contains graph computation, machine learning and non-iterative applications, from which some representative programs are chosen, such as KMeans, Bayesian, PageRank, WordCount and TeraSort. Then determine the parameters that affect Spark performance for these applications and the value range of each parameter. Generate parameter values randomly within their value ranges and generate configuration files to configure Spark; several input sets are selected for each application. Conf Generator is the configuration parameter generator: it produces the configuration files, which contain the randomly generated parameters. After configuration, run the application and collect data. The data include, but are not limited to: Spark running time, input data set and configuration parameter values.
The random parameter generation in the data collection step works as follows: assume the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character). The other configuration parameters are generated in the same way.
2) Build the performance model. The collected Spark running time, input data set and configuration parameter values form a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm.
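The row vectors of the training set can be assembled as in the following Python sketch. It assumes, for illustration only, that the input data set is represented by its size in megabytes. The names RunRecord, to_row and build_training_set and all sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    runtime_s: float       # measured Spark running time
    input_size_mb: float   # assumed representation of the input data set
    config: dict           # parameter name -> value used for this run

def to_row(record, param_names):
    """Flatten one run into [runtime, input size, p1, ..., pm]."""
    return [record.runtime_s, record.input_size_mb] + [
        record.config[name] for name in param_names
    ]

def build_training_set(records, param_names):
    """Split the row vectors into features and the target to be predicted."""
    rows = [to_row(r, param_names) for r in records]
    targets = [row[0] for row in rows]     # dependent variable: execution time
    features = [row[1:] for row in rows]   # independent variables
    return features, targets

param_names = ["spark.executor.cores", "spark.memory.fraction"]
records = [
    RunRecord(120.5, 1024.0, {"spark.executor.cores": 4, "spark.memory.fraction": 0.6}),
    RunRecord(98.2, 1024.0, {"spark.executor.cores": 8, "spark.memory.fraction": 0.7}),
]
features, targets = build_training_set(records, param_names)
```

Fixing the order of param_names once keeps every row vector aligned, so column k always refers to the same parameter.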
Specifically, modeling the training set with the random forest algorithm comprises the following. Multiple bootstrap data sets are obtained from the given training set by repeated random sampling with replacement. A decision tree is constructed for each bootstrap data set; construction proceeds by iteratively assigning data points to a left and a right subset, and this splitting process searches the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain. At each leaf node, the class distribution on that leaf is estimated empirically from the histogram of the class labels of the training samples that reach the leaf. Training iterates until the maximum tree depth set by the user is reached, or until no greater information gain can be obtained by further splitting. In the random forest algorithm, the execution time is the dependent variable, and the input set and the configuration parameters are the independent variables. The values of ntree and mtry must also be determined: ntree is the number of decision trees built in the random forest, and mtry is the number of predictors sampled at each split node of a decision tree.
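The split search at a single node can be sketched in Python as follows. The sketch uses entropy-based information gain on class labels, as the description does, and examines only mtry randomly chosen predictors. For a continuous target such as execution time, a regression tree would typically use variance reduction instead. The "slow"/"fast" labels and the data are made up.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    total = len(labels)
    weighted = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - weighted

def best_split(rows, labels, mtry, rng):
    """Return (gain, feature index, threshold) over mtry random features."""
    best = (0.0, None, None)
    for feature in rng.sample(range(len(rows[0])), mtry):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue   # a split must send points to both sides
            gain = information_gain(labels, left, right)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 is constant.
rows = [[1, 5], [2, 5], [7, 5], [8, 5]]
labels = ["slow", "slow", "fast", "fast"]
gain, feature, threshold = best_split(rows, labels, mtry=2, rng=random.Random(0))
```

Growing a tree repeats this search recursively on the left and right subsets until the depth limit is reached or no split yields a positive gain.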
The random forest algorithm is specifically as follows:
Input: training set S, inducer (base learner) F, integer ntree (number of bootstrap samples)
for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
$C^{*}(x) = \arg_{y \in Y} \sum_{i=1}^{ntree} C_i(x) / ntree$
Output: aggregated model C*.
The invention models with random forest, an ensemble algorithm in machine learning. Compared with traditional statistical learning methods, machine learning can organize and fit parameters and can handle larger data sets. Compared with other machine learning algorithms, random forest can address overfitting and can handle cases with many features (high dimensionality).
3) Search for the optimal configuration parameters. Using the performance model that has been built, the optimal configuration parameters are searched for with a genetic algorithm. Specifically, a vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1. The initial values are then changed and fed into the performance model, which outputs an execution time t2. t2 is compared with t1, and the configuration parameters with the shorter time are kept as the optimal configuration. These steps are repeated until the configuration with the shortest execution time is found.
Compared with other optimization algorithms such as exhaustive search, greedy methods, simulated annealing and ant colony optimization, the genetic algorithm has good global search ability: it can quickly search the entire solution space without falling into the trap of rapid descent to a local optimum. It searches from a population and is inherently parallel, so multiple individuals can be compared at once. The search process is simple and is guided by an evaluation function. Iteration uses a probabilistic mechanism and is therefore random.
Finally, as a preferred embodiment, a verification step follows the step of searching for the optimal configuration parameters: Spark is configured with the optimal configuration parameters found by the search, and the application is run to verify whether the execution time is the shortest.
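One way to read the verification step is sketched below in Python. The text does not say what the measured time is compared against, so comparing with a handful of baseline configurations is an assumption of this sketch. measure_runtime stands for a real Spark run and is replaced here by the made-up function fake_measure.

```python
def verify_configuration(candidate, baselines, measure_runtime):
    """Measure the candidate and some baseline configurations.

    Returns the candidate's measured time and whether it was the shortest.
    """
    candidate_time = measure_runtime(candidate)
    baseline_times = [measure_runtime(config) for config in baselines]
    is_shortest = all(candidate_time <= t for t in baseline_times)
    return candidate_time, is_shortest

# Hypothetical stand-in for running the application on the cluster.
def fake_measure(config):
    return 100.0 - 5.0 * config["spark.executor.cores"]

candidate_time, is_shortest = verify_configuration(
    {"spark.executor.cores": 8},
    [{"spark.executor.cores": 2}, {"spark.executor.cores": 4}],
    fake_measure,
)
```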
The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with the random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's operating mechanism, the meaning, effect and value range of the parameters, or the characteristics of the application and its input set, the invention finds for the user the optimal configuration parameters of a specific application running in a specific cluster environment, and is simpler and faster than earlier parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference and can reach relatively high accuracy with a relatively small training set. The invention can find the optimal configuration parameters for any input data set; because in practice the input set changes arbitrarily each time a user runs an application, the invention takes practical usage into account.
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications also fall within the scope of protection of the invention.

Claims (6)

1. A data-aware method for automatically optimizing Spark configuration parameters, characterized in that it comprises the following steps:
Collecting data. Collecting data specifically comprises: selecting the Spark applications, determining the parameters that affect Spark performance for those applications, and determining the value range of each parameter; generating parameter values randomly within their value ranges and generating a configuration file to configure Spark; after configuration, running the application and collecting data. The data include, but are not limited to: Spark running time, input data set and configuration parameter values;
Building the performance model. The collected Spark running time, input data set and configuration parameter values form a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm;
Searching for the optimal configuration parameters. Using the performance model that has been built, the optimal configuration parameters are searched for with a genetic algorithm.
2. The data-aware method for automatically optimizing Spark configuration parameters as claimed in claim 1, characterized in that a verification step follows the step of searching for the optimal configuration parameters: Spark is configured with the optimal configuration parameters found by the search, and the application is run to verify whether the execution time is the shortest.
3. The data-aware method for automatically optimizing Spark configuration parameters as claimed in claim 2, characterized in that the parameters are generated randomly in the data collection step as follows: assume the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character). The other configuration parameters are generated in the same way.
4. The data-aware method for automatically optimizing Spark configuration parameters as claimed in claim 3, characterized in that modeling the training set with the random forest algorithm specifically comprises the following steps:
The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;
A decision tree is constructed for each bootstrap data set. Construction proceeds by iteratively assigning data points to a left and a right subset; the splitting process searches the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain;
At each leaf node, the class distribution on that leaf is estimated empirically from the histogram of the class labels of the training samples that reach the leaf; training iterates until the maximum tree depth set by the user is reached, or until no greater information gain can be obtained by further splitting;
In the random forest algorithm, the execution time is the dependent variable, and the input set and the configuration parameters are the independent variables. The values of ntree and mtry must also be determined: ntree is the number of decision trees built in the random forest, and mtry is the number of predictors sampled at each split node of a decision tree.
5. The data-aware method for automatically optimizing Spark configuration parameters as claimed in claim 4, characterized in that searching for the optimal configuration parameters with the genetic algorithm is specifically:
A vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1. The initial values are then changed and fed into the performance model, which outputs an execution time t2. t2 is compared with t1, and the configuration parameters with the shorter time are kept as the optimal configuration. These steps are repeated until the configuration with the shortest execution time is found.
6. The data-aware method for automatically optimizing Spark configuration parameters as claimed in claim 4, characterized in that the random forest algorithm is specifically as follows:
Input: training set S, inducer (base learner) F, integer ntree (number of bootstrap samples)
for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
$C^{*}(x) = \arg_{y \in Y} \sum_{i=1}^{ntree} C_i(x) / ntree$
Output: aggregated model C*.
CN201611182310.5A 2016-12-20 2016-12-20 Data sensing-based Spark configuration parameter automatic optimization method Pending CN106648654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611182310.5A CN106648654A (en) 2016-12-20 2016-12-20 Data sensing-based Spark configuration parameter automatic optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611182310.5A CN106648654A (en) 2016-12-20 2016-12-20 Data sensing-based Spark configuration parameter automatic optimization method

Publications (1)

Publication Number Publication Date
CN106648654A true CN106648654A (en) 2017-05-10

Family

ID=58833824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611182310.5A Pending CN106648654A (en) 2016-12-20 2016-12-20 Data sensing-based Spark configuration parameter automatic optimization method

Country Status (1)

Country Link
CN (1) CN106648654A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389585A (en) * 2015-10-20 2016-03-09 深圳大学 Random forest optimization method and system based on tensor decomposition
CN105550374A (en) * 2016-01-29 2016-05-04 湖南大学 Random forest parallelization machine learning method for big data in Spark cloud service environment
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
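Zeng Xiaxia (曾霞霞) et al.: "A head pose estimation algorithm based on random forests", Journal of Fujian Normal University (Natural Science Edition) *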

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229693A (en) * 2017-05-22 2017-10-03 哈工大大数据产业有限公司 Method and system for big data system configuration parameter tuning based on deep learning
CN107229693B (en) * 2017-05-22 2018-05-01 哈工大大数据产业有限公司 Method and system for big data system configuration parameter tuning based on deep learning
CN107390753A (en) * 2017-08-29 2017-11-24 贵州省岚林阳环保能源科技有限责任公司 Intelligent plant growth environment regulating device and method based on Internet of Things cloud platform
CN107390754A (en) * 2017-08-29 2017-11-24 贵州省岚林阳环保能源科技有限责任公司 Intelligent plant growth environment adjustment system and method based on Internet of Things cloud platform
CN107390754B (en) * 2017-08-29 2019-07-30 贵州省岚林阳环保能源科技有限责任公司 Intelligent plant growth environment adjustment system and method based on Internet of Things cloud platform
WO2019061187A1 (en) * 2017-09-28 2019-04-04 深圳乐信软件技术有限公司 Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus
CN108255689A (en) * 2018-01-11 2018-07-06 哈尔滨工业大学 Automatic Apache Spark application tuning method based on historical task analysis
CN108255689B (en) * 2018-01-11 2021-02-12 哈尔滨工业大学 Automatic Apache Spark application tuning method based on historical task analysis
CN110059842A (en) * 2018-01-19 2019-07-26 武汉十傅科技有限公司 Foundry production planning optimization method considering smelting furnace and sand mold size
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark configuration parameter automatic tuning method based on cluster scaling
CN108491226B (en) * 2018-02-05 2021-03-23 西安电子科技大学 Spark configuration parameter automatic tuning method based on cluster scaling
CN110427356A (en) * 2018-04-26 2019-11-08 中移(苏州)软件技术有限公司 Parameter configuration method and equipment
CN110427356B (en) * 2018-04-26 2021-08-13 中移(苏州)软件技术有限公司 Parameter configuration method and equipment
CN110427263A (en) * 2018-04-28 2019-11-08 深圳先进技术研究院 Spark big data application program performance modeling method and device for Docker container and storage device
CN110427263B (en) * 2018-04-28 2024-03-19 深圳先进技术研究院 Spark big data application program performance modeling method and device for Docker container and storage device
CN109035178A (en) * 2018-08-31 2018-12-18 杭州电子科技大学 Multi-parameter value tuning method applied to image denoising
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN109634924A (en) * 2018-11-02 2019-04-16 华南师范大学 File system parameter automated tuning method and system based on machine learning
CN109634924B (en) * 2018-11-02 2022-12-20 华南师范大学 File system parameter automatic tuning method and system based on machine learning
CN113574475A (en) * 2019-03-15 2021-10-29 3M创新有限公司 Determining causal models for a control environment
CN109947745A (en) * 2019-03-28 2019-06-28 浪潮商用机器有限公司 Database optimization method and device
CN110413313A (en) * 2019-07-19 2019-11-05 苏州浪潮智能科技有限公司 Parameter optimization method and device for Spark application
CN110413313B (en) * 2019-07-19 2023-05-23 苏州浪潮智能科技有限公司 Parameter optimization method and device for Spark application
CN112445746B (en) * 2019-09-04 2024-06-04 中国科学院深圳先进技术研究院 Automatic cluster configuration optimization method and system based on machine learning
CN112445746A (en) * 2019-09-04 2021-03-05 中国科学院深圳先进技术研究院 Cluster configuration automatic optimization method and system based on machine learning
CN112488319B (en) * 2019-09-12 2024-04-19 中国科学院深圳先进技术研究院 Parameter adjusting method and system with self-adaptive configuration generator
CN112488319A (en) * 2019-09-12 2021-03-12 中国科学院深圳先进技术研究院 Parameter adjusting method and system with self-adaptive configuration generator
CN110727506B (en) * 2019-10-18 2022-07-01 北京航空航天大学 SPARK parameter automatic tuning method based on cost model
CN110727506A (en) * 2019-10-18 2020-01-24 北京航空航天大学 SPARK parameter automatic tuning method based on cost model
CN110798314A (en) * 2019-11-01 2020-02-14 南京邮电大学 Quantum key distribution parameter optimization method based on random forest algorithm
CN110798314B (en) * 2019-11-01 2023-02-24 南京邮电大学 Quantum key distribution parameter optimization method based on random forest algorithm
CN113032033B (en) * 2019-12-05 2024-05-17 中国科学院深圳先进技术研究院 Automatic optimization method for big data processing platform configuration
CN113032033A (en) * 2019-12-05 2021-06-25 中国科学院深圳先进技术研究院 Automatic optimization method for big data processing platform configuration
CN111176832A (en) * 2019-12-06 2020-05-19 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark
CN111176832B (en) * 2019-12-06 2022-07-01 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark
CN111259933B (en) * 2020-01-09 2023-06-13 中国科学院计算技术研究所 High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
CN111461286B (en) * 2020-01-15 2022-03-29 华中科技大学 Spark parameter automatic optimization system and method based on evolutionary neural network
CN111461286A (en) * 2020-01-15 2020-07-28 华中科技大学 Spark parameter automatic optimization system and method based on evolutionary neural network
CN111629048A (en) * 2020-05-22 2020-09-04 浪潮电子信息产业股份有限公司 Spark cluster optimal configuration parameter determination method, device and equipment
CN111629048B (en) * 2020-05-22 2023-04-07 浪潮电子信息产业股份有限公司 Spark cluster optimal configuration parameter determination method, device and equipment
CN113743425A (en) * 2020-05-27 2021-12-03 北京沃东天骏信息技术有限公司 Method and device for generating classification model
CN114489574B (en) * 2020-11-12 2022-10-14 深圳先进技术研究院 SVM-based automatic optimization method for stream processing framework
WO2022100370A1 (en) * 2020-11-12 2022-05-19 深圳先进技术研究院 Automatic adjustment and optimization method for svm-based streaming
CN114489574A (en) * 2020-11-12 2022-05-13 深圳先进技术研究院 SVM-based automatic optimization method for stream processing framework
WO2022111125A1 (en) * 2020-11-27 2022-06-02 深圳先进技术研究院 Random-forest-based automatic optimization method for graphic data processing framework
CN114565001A (en) * 2020-11-27 2022-05-31 深圳先进技术研究院 Automatic tuning method for graph data processing framework based on random forest
CN112433853B (en) * 2020-11-30 2023-04-28 西安交通大学 Heterogeneous perception data partitioning method for supercomputer data parallel application
CN112433853A (en) * 2020-11-30 2021-03-02 西安交通大学 Heterogeneous sensing data partitioning method for parallel application of supercomputer data
CN113157538A (en) * 2021-02-02 2021-07-23 西安天和防务技术股份有限公司 Spark operation parameter determination method, device, equipment and storage medium
CN113204539B (en) * 2021-05-12 2023-08-22 南京大学 Automatic optimization method for big data system parameters fusing system semantics
CN113204539A (en) * 2021-05-12 2021-08-03 南京大学 Big data system parameter automatic optimization method fusing system semantics
CN114880108A (en) * 2021-12-15 2022-08-09 中国科学院深圳先进技术研究院 Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium
CN114416193A (en) * 2021-12-15 2022-04-29 中国科学院深圳先进技术研究院 Method for accurately and quickly determining configuration parameter value field of big data analysis system
CN116401451A (en) * 2023-03-31 2023-07-07 厦门海晟融创信息技术有限公司 Flow analysis method and system for building multi-dimensional strategy system
CN116401451B (en) * 2023-03-31 2024-07-02 厦门海晟融创信息技术有限公司 Flow analysis method and system for building multi-dimensional strategy system
CN116089022A (en) * 2023-04-11 2023-05-09 广州嘉为科技有限公司 Parameter configuration adjustment method, system and storage medium of log search engine

Similar Documents

Publication Publication Date Title
CN106648654A (en) Data sensing-based Spark configuration parameter automatic optimization method
Triguero et al. Evolutionary undersampling for extremely imbalanced big data classification under apache spark
CN106096727B (en) Network model building method and device based on machine learning
CN103679132B (en) Nude picture detection method and system
CN108280236B (en) Method for analyzing random forest visual data based on LargeVis
CN105718490A (en) Method and device for updating classifying model
CN106326346A (en) Text classification method and terminal device
CN108491226B (en) Spark configuration parameter automatic tuning method based on cluster scaling
Nallathambi et al. Prediction of electricity consumption based on DT and RF: An application on USA country power consumption
CN115563610B (en) Training method, recognition method and device for intrusion detection model
CN103971136A (en) Large-scale data-oriented parallel structured support vector machine classification method
An et al. Classification method of teaching resources based on improved KNN algorithm
CN116245019A (en) Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm
CN111461324A (en) Hierarchical pruning method based on layer recovery sensitivity
CN109977977A (en) Method and corresponding device for identifying potential users
CN103020864B (en) Corn fine breed breeding method
CN104468276A (en) Network traffic identification method based on random sampling multiple classifiers
Ntoutsi et al. A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees
Bo Research on the classification of high dimensional imbalanced data based on the optimizational random forest algorithm
Rothe et al. Topics and trends in cognitive science (2000-2017)
Vuyyala et al. Crop Recommender System Based on Ensemble Classifiers
Nugroho et al. Decision tree using ant colony for classification of health data
She et al. Text Classification Research Based on Improved SoftMax Regression Algorithm
Fraideinberze et al. Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets: removing linear and non-linear attribute correlations
Zhao et al. One-Shot Pruning for Fast-adapting Pre-trained Models on Devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170510)