CN106648654A - Data sensing-based Spark configuration parameter automatic optimization method - Google Patents
- Publication number
- CN106648654A (application number CN201611182310.5A)
- Authority
- CN
- China
- Prior art keywords
- parameter
- spark
- data
- configuration
- configuration parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention belongs to the technical fields of electronic information, big data and cloud computing, and in particular relates to a data sensing-based (data-aware) method for automatically optimizing Spark configuration parameters. The method comprises: selecting a Spark application and determining the parameters that affect Spark performance; randomly configuring those parameters to obtain a training set; building a performance model from the training set with a random forest algorithm; and searching for the optimal configuration parameters with a genetic algorithm. The method finds the optimal configuration parameters for a specific application running in a specific cluster environment without requiring the user to understand Spark's execution mechanism, the meaning, effect and value range of each parameter, or the characteristics of the application and its input set. Compared with conventional parameter configuration methods, it is simpler and faster, and because the random forest algorithm combines the strengths of machine learning and statistical inference, it can reach relatively high accuracy with a relatively small training set.
Description
Technical field
The invention belongs to the technical fields of electronic information, big data and cloud computing, and in particular relates to a data-aware method for automatically optimizing Spark configuration parameters.
Background
Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. It has developed rapidly, becoming a top-level project of the Apache Software Foundation within just five years. Because Spark keeps intermediate results in memory, it runs iterative and interactive programs about 10 times faster than Hadoop, a traditional disk-based computing framework. Spark holds an important position in big data analytics: according to a survey by Typesafe, more than 500 enterprises were using Spark in 2015.
Configuration parameter optimization has long been a research focus in big data systems: there are many configuration parameters (more than 100), performance depends heavily on them, and applications differ in their characteristics, so the default configuration falls far short of the best achievable performance. Spark is an emerging in-memory big data computing framework. Because it computes in memory, any cluster resource (CPU, network bandwidth or memory) can become the bottleneck of a Spark program. Different Spark applications also behave differently: KMeans has good instruction locality but poor data locality; PageRank performs many more shuffle and iterative selection operations than KMeans; WordCount involves no iteration at all. The problem addressed by the invention is to automatically find the optimal Spark configuration parameters for a specific cluster environment, input data set and application.
RFHOC (A Random-Forest Approach to Auto-Tuning Hadoop's Configuration) is a random-forest-based method for automatically optimizing Hadoop parameters. It optimizes the configuration parameters of an application running on a given cluster in three main steps:
1. Performance testing
2. Building the performance model
3. Iteratively searching for the optimal configuration
When a user runs a Hadoop application for the first time, the RFHOC workload profiler collects the Hadoop configuration parameters in effect at run time and the execution times of the MapReduce phases. The per-phase execution times and the corresponding configuration parameters are then used as input to the random forest algorithm to build performance prediction models. RFHOC builds separate regression models for the map and reduce phases, each predicting the performance of its phase. For each phase a training set S is first produced; each row of S is a vector vj containing an execution time and the corresponding Hadoop configuration parameter values. Once the performance models are built, RFHOC uses a genetic algorithm to search for the optimal Hadoop parameters. The genetic algorithm takes candidate configurations and the performance predicted for them by the random forest models as input and performs a global search. The sum of the execution times of the map and reduce phases is the total running time of the program and also serves as the fitness value of the genetic algorithm.
The prior art comprises manual and automatic parameter configuration. The drawback of manual configuration is that it is very time-consuming and requires the user to have a deep understanding of Spark's execution mechanism and of the meaning, effect and value range of each parameter. The user must manually increase or decrease Spark parameter values, reconfigure Spark, run the application, and look for the parameter values that minimize execution time. Because the optimal configuration differs across cluster environments, applications and input data sets, manual configuration is a time-consuming and tedious process.
Existing automatic configuration methods suffer from low performance-model accuracy and high modeling cost. Some methods build the model with artificial neural networks (ANN) or support vector machines (SVM), but reaching high accuracy (within 10%) requires a very large training set.
Summary of the invention
In view of the above, it is necessary to provide a data-aware method for automatically optimizing Spark configuration parameters.
A data-aware method for automatically optimizing Spark configuration parameters comprises the following steps:
Collecting data: select a Spark application, determine the parameters in that application that affect Spark performance, and determine the value range of each parameter; randomly generate parameter values within those ranges and generate a configuration file to configure Spark; after configuration, run the application and collect data. The data include, but are not limited to, the Spark run time, the input data set and the configuration parameter values.
Building the performance model: each collected record of Spark run time, input data set and configuration parameter values forms a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm.
Searching for the optimal configuration parameters: using the performance model, search for the optimal configuration parameters with a genetic algorithm.
Further, the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is the shortest.
Further, the parameters are randomly generated in the data collection step as follows: assuming the value range of parameter s is [a, b], a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character); the other configuration parameters are generated in the same way.
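The record format described above can be illustrated with a minimal Python sketch. The property names are existing Spark settings, but the value ranges, the rounding to three decimals and the helper names (`PARAM_RANGES`, `sample_value`, `generate_config`) are assumptions made for illustration and do not come from the patent.

```python
import random

# Hypothetical value ranges [a, b]; the ranges are illustrative assumptions.
PARAM_RANGES = {
    "spark.executor.cores": (1, 8),            # integer range
    "spark.default.parallelism": (8, 512),     # integer range
    "spark.memory.fraction": (0.3, 0.9),       # real-valued range
}


def sample_value(low, high, rng):
    """Draw c uniformly at random with low <= c <= high."""
    if isinstance(low, int) and isinstance(high, int):
        return rng.randint(low, high)          # inclusive on both ends
    return round(rng.uniform(low, high), 3)


def generate_config(ranges, rng):
    """Return one tab-separated record (name, TAB, value) per parameter."""
    return [f"{name}\t{sample_value(a, b, rng)}"
            for name, (a, b) in ranges.items()]


rng = random.Random(42)                        # seeded so runs are repeatable
config_lines = generate_config(PARAM_RANGES, rng)
print("\n".join(config_lines))
```

Each call to `generate_config` yields one candidate configuration; writing the lines to a file gives one configuration file per training run.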
As an improvement, modeling the training set with the random forest algorithm specifically comprises the following steps:
The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;
A decision tree is constructed for each bootstrap data set by iteratively assigning data points to a left and a right subset; this splitting is a search over the parameter space of the split function for the parameters that maximize information gain;
At each leaf node, the class distribution is estimated empirically from the histogram of class labels of the training samples that reach that leaf;
Training iterates until the user-specified maximum tree depth is reached or until further splitting yields no additional information gain;
In the random forest algorithm, execution time is the dependent variable and the input set and configuration parameters are the independent variables. The values ntree and mtry must also be chosen: ntree is the number of decision trees built in the forest, and mtry is the number of predictors sampled at each split node of a decision tree.
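The split search at a single node, including the role of mtry, can be sketched as follows. The text above speaks of information gain and class labels; because the dependent variable here is a continuous execution time, this sketch substitutes variance reduction, a common regression analogue. That substitution, the toy data and the function names are assumptions, not details from the patent.

```python
import random


def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets, mtry, rng):
    """Search mtry randomly chosen predictors for the (feature, threshold)
    pair with the largest variance reduction. Returns None if no split helps."""
    n = len(rows)
    parent = variance(targets)
    best = None                                # (gain, feature, threshold)
    for feature in rng.sample(range(len(rows[0])), mtry):
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            gain = parent - (len(left) * variance(left)
                             + len(right) * variance(right)) / n
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, feature, threshold)
    return best


# Toy data: columns are (input size, parameter value); the run time depends
# only on whether the parameter value exceeds 0.5.
rows = [[10, 0.1], [10, 0.9], [20, 0.2], [20, 0.8]]
targets = [5.0, 1.0, 5.0, 1.0]
split = best_split(rows, targets, mtry=2, rng=random.Random(0))
print(split)
```

Applying `best_split` recursively to the left and right subsets, and stopping at a maximum depth or when it returns `None`, would give one tree of the forest.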
As a further improvement, searching for the optimal configuration parameters with the genetic algorithm is specifically:
A vector {c1, ..., cm} is taken as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1. The values are then changed and fed into the performance model again, which outputs an execution time t2. t2 is compared with t1, and the configuration with the shorter time is kept as the current optimum. These steps are repeated until the configuration with the shortest execution time is found.
Specifically, the random forest algorithm is as follows:
Input: training set S, inducer F, integer ntree (number of bootstrap samples)
1. for i = 1 to ntree {
2. S′ = bootstrap sample drawn from S (independent, identically distributed sampling with replacement)
3. Ci = F(S′)
4. }
5. C* = aggregation of C1, ..., Cntree
Output: aggregate C*.
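The bootstrap-and-aggregate loop above can be written as a short Python sketch. The source does not spell out the aggregation rule in step 5; averaging the individual predictions is assumed here because the target is a continuous run time. The toy inducer `mean_inducer` and the sample data are likewise hypothetical.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items from samples uniformly with replacement."""
    return [rng.choice(samples) for _ in samples]


def mean_inducer(samples):
    """Toy inducer F: returns a model that always predicts the mean target."""
    mean = sum(t for _, t in samples) / len(samples)
    return lambda x: mean


def bagging(samples, inducer, ntree, rng):
    """Train ntree models on bootstrap samples and average their predictions."""
    models = [inducer(bootstrap_sample(samples, rng)) for _ in range(ntree)]
    return lambda x: sum(m(x) for m in models) / len(models)


# (features, run time) pairs; the values are made up for illustration.
S = [([1.0], 10.0), ([2.0], 12.0), ([3.0], 14.0), ([4.0], 16.0)]
c_star = bagging(S, mean_inducer, ntree=25, rng=random.Random(1))
print(c_star([2.5]))
```

In a real forest the inducer would grow a decision tree on each bootstrap sample; the mean predictor only keeps the sketch short.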
The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with the random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's execution mechanism, the meaning, effect and value range of the parameters, or the characteristics of the application and its input set, the invention finds the optimal configuration parameters for a specific application running in a specific cluster environment, and it is simpler and faster than earlier parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference, so it can reach high accuracy with a small training set.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall flow of the data-aware method for automatically optimizing Spark configuration parameters according to the invention;
Fig. 2 is a schematic diagram of the genetic algorithm used in the data-aware method for automatically optimizing Spark configuration parameters according to the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
As shown in Figs. 1-2, a data-aware method for automatically optimizing Spark configuration parameters comprises the following three main steps:
1) Collecting data. This comprises four sub-steps:
(1) identify, among all Spark parameters, those that affect performance;
(2) determine the value range of each parameter;
(3) select the input sets of the application;
(4) vary the parameters randomly within the determined ranges, configure Spark, run the application on different input data sets, and use the collected data as the training set.
In the data collection (Collecting) stage, these four sub-steps can be described concretely as follows. Select the Spark applications for the experiment; the HiBench benchmark suite is commonly used, and it covers graph computation, machine learning and non-iterative applications, from which representative programs such as KMeans, Bayesian, PageRank, WordCount and TeraSort are chosen. Determine the parameters that affect Spark performance in these applications and the value range of each parameter. Randomly generate parameter values within those ranges and generate a configuration file to configure Spark; several input sets are selected for each application. Conf Generator is the configuration parameter generator: it produces the configuration file, which contains the randomly generated parameters. After configuration, run the application and collect data. The data include, but are not limited to, the Spark run time, the input data set and the configuration parameter values.
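The collection loop can be outlined in Python as below. This is only a sketch under stated assumptions: the input data set is represented by a single number (its size), the launcher `run_app` is a hypothetical callable standing in for submitting a real Spark job, and the parameter names `p1` and `p2` are placeholders.

```python
import random
import time


def collect_training_rows(run_app, param_ranges, input_sizes, runs_per_input, rng):
    """Run the application under random configurations and record one row
    [run time, input size, parameter values...] per run."""
    names = sorted(param_ranges)
    rows = []
    for size in input_sizes:
        for _ in range(runs_per_input):
            config = {n: rng.uniform(*param_ranges[n]) for n in names}
            start = time.perf_counter()
            run_app(config, size)              # hypothetical job launcher
            elapsed = time.perf_counter() - start
            rows.append([elapsed, size] + [config[n] for n in names])
    return names, rows


def fake_run_app(config, input_size):
    """Stand-in for launching a real Spark job; it only sleeps briefly."""
    time.sleep(0.001)


names, rows = collect_training_rows(
    fake_run_app,
    {"p1": (0.0, 1.0), "p2": (1.0, 4.0)},      # placeholder parameter ranges
    input_sizes=[10, 20],
    runs_per_input=3,
    rng=random.Random(0),
)
print(len(rows), names)
```

Each element of `rows` is one row vector of the training set: the measured run time first, followed by the input descriptor and the configuration parameter values.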
In the data collection step, the parameters are randomly generated as follows: assuming the value range of parameter s is [a, b], a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character); the other configuration parameters are generated in the same way.
2) Building the performance model. Each collected record of Spark run time, input data set and configuration parameter values forms a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm.
Specifically, modeling the training set with the random forest algorithm comprises the following. Multiple bootstrap data sets are obtained from the given training set by repeated random sampling with replacement. A decision tree is constructed for each bootstrap data set by iteratively assigning data points to a left and a right subset; this splitting is a search over the parameter space of the split function for the parameters that maximize information gain. At each leaf node, the class distribution is estimated empirically from the histogram of class labels of the training samples that reach that leaf. Training iterates until the user-specified maximum tree depth is reached or until further splitting yields no additional information gain. In the random forest algorithm, execution time is the dependent variable and the input set and configuration parameters are the independent variables. The values ntree and mtry must also be chosen: ntree is the number of decision trees built in the forest, and mtry is the number of predictors sampled at each split node of a decision tree.
The random forest algorithm is as follows:
Input: training set S, inducer F, integer ntree (number of bootstrap samples)
1. for i = 1 to ntree {
2. S′ = bootstrap sample drawn from S (independent, identically distributed sampling with replacement)
3. Ci = F(S′)
4. }
5. C* = aggregation of C1, ..., Cntree
Output: aggregate C*.
The invention models performance with random forest, an ensemble algorithm from machine learning. Compared with traditional statistical learning methods, machine learning has the advantages of organizing and fitting parameters and of handling larger data sets. Compared with other machine learning algorithms, random forest mitigates overfitting and copes well with a large number of features (high-dimensional data).
3) Searching for the optimal configuration parameters. Using the performance model, the optimal configuration parameters are searched for with a genetic algorithm. Specifically, a vector {c1, ..., cm} is taken as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1. The values are then changed and fed into the model again, which outputs an execution time t2. t2 is compared with t1, and the configuration with the shorter time is kept as the current optimum. These steps are repeated until the configuration with the shortest execution time is found.
Compared with other optimization algorithms such as exhaustive search, greedy methods, simulated annealing and ant colony optimization, the genetic algorithm has good global search capability: it explores the solution space quickly without falling into the trap of rapid descent to a local optimum. It searches from a population and is therefore inherently parallel, comparing many individuals at once. The search process is simple and guided by an evaluation function. Iteration is driven by a probabilistic mechanism and is therefore random.
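A population-based search of this kind can be sketched in Python as follows. The patent does not specify the genetic operators, so truncation selection, one-point crossover and Gaussian mutation are assumptions, as are the population size, the rates and the function `predicted_time`, which merely stands in for the trained random forest model over parameters normalized to [0, 1].

```python
import random


def predicted_time(config):
    """Hypothetical stand-in for the trained performance model: a smooth
    function whose minimum is at (0.7, 0.2, 0.5)."""
    target = (0.7, 0.2, 0.5)
    return 10.0 + sum((c - t) ** 2 for c, t in zip(config, target))


def genetic_search(fitness, n_params, rng, pop_size=30, generations=40,
                   mutation_rate=0.2):
    """Minimize fitness over [0, 1]^n_params with a simple genetic algorithm."""
    population = [[rng.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[:pop_size // 2]   # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params)   # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_params):          # mutation: small Gaussian step
                if rng.random() < mutation_rate:
                    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return min(population, key=fitness)


rng = random.Random(7)
best = genetic_search(predicted_time, n_params=3, rng=rng)
print(best, predicted_time(best))
```

Because the fitter half of each generation survives unchanged, the best predicted time never gets worse from one generation to the next; the fitness here is the model's predicted execution time, so no Spark job runs during the search.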
Finally, in a preferred embodiment, the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is the shortest.
The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with the random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's execution mechanism, the meaning, effect and value range of the parameters, or the characteristics of the application and its input set, the invention finds the optimal configuration parameters for a specific application running in a specific cluster environment, and it is simpler and faster than earlier parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference, so it can reach high accuracy with a small training set. The invention can find the optimal configuration parameters for any input data set; since in practice the input set changes arbitrarily from one run of an application to the next, the invention takes real usage conditions into account.
The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also fall within the scope of protection of the invention.
Claims (6)
1. A data-aware method for automatically optimizing Spark configuration parameters, characterized by comprising the following steps:
Collecting data: select a Spark application, determine the parameters in that application that affect Spark performance, and determine the value range of each parameter; randomly generate parameter values within those ranges and generate a configuration file to configure Spark; after configuration, run the application and collect data; the data include, but are not limited to, the Spark run time, the input data set and the configuration parameter values;
Building the performance model: each collected record of Spark run time, input data set and configuration parameter values forms a row vector; multiple vectors form the training set, which is modeled with the random forest algorithm;
Searching for the optimal configuration parameters: using the performance model, search for the optimal configuration parameters with a genetic algorithm.
2. The data-aware method for automatically optimizing Spark configuration parameters according to claim 1, characterized in that the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is the shortest.
3. The data-aware method for automatically optimizing Spark configuration parameters according to claim 2, characterized in that the parameters are randomly generated in the data collection step as follows: assuming the value range of parameter s is [a, b], a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character); the other configuration parameters are generated in the same way.
4. The data-aware method for automatically optimizing Spark configuration parameters according to claim 3, characterized in that modeling the training set with the random forest algorithm specifically comprises the following steps:
The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;
A decision tree is constructed for each bootstrap data set by iteratively assigning data points to a left and a right subset; the splitting is a search over the parameter space of the split function for the parameters that maximize information gain;
At each leaf node, the class distribution is estimated empirically from the histogram of class labels of the training samples that reach that leaf; training iterates until the user-specified maximum tree depth is reached or until further splitting yields no additional information gain;
In the random forest algorithm, execution time is the dependent variable and the input set and configuration parameters are the independent variables; the values ntree and mtry must also be chosen, where ntree is the number of decision trees built in the forest and mtry is the number of predictors sampled at each split node of a decision tree.
5. The data-aware method for automatically optimizing Spark configuration parameters according to claim 4, characterized in that searching for the optimal configuration parameters with the genetic algorithm is specifically:
A vector {c1, ..., cm} is taken as the initial configuration parameter values and fed into the performance model, which outputs an execution time t1; the values are then changed and fed into the performance model again, which outputs an execution time t2; t2 is compared with t1, and the configuration with the shorter time is kept as the current optimum; these steps are repeated until the configuration with the shortest execution time is found.
6. The data-aware method for automatically optimizing Spark configuration parameters according to claim 4, characterized in that the random forest algorithm is as follows:
Input: training set S, inducer F, integer ntree (number of bootstrap samples)
for i = 1 to ntree {
S′ = bootstrap sample drawn from S (independent, identically distributed sampling with replacement)
Ci = F(S′)
}
Output: aggregate C*.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611182310.5A CN106648654A (en) | 2016-12-20 | 2016-12-20 | Data sensing-based Spark configuration parameter automatic optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611182310.5A CN106648654A (en) | 2016-12-20 | 2016-12-20 | Data sensing-based Spark configuration parameter automatic optimization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106648654A true CN106648654A (en) | 2017-05-10 |
Family
ID=58833824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611182310.5A Pending CN106648654A (en) | 2016-12-20 | 2016-12-20 | Data sensing-based Spark configuration parameter automatic optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106648654A (en) |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229693A (en) * | 2017-05-22 | 2017-10-03 | 哈工大大数据产业有限公司 | The method and system of big data system configuration parameter tuning based on deep learning |
CN107390753A (en) * | 2017-08-29 | 2017-11-24 | 贵州省岚林阳环保能源科技有限责任公司 | Intelligent plant growth environment regulating device and method based on Internet of Things cloud platform |
CN107390754A (en) * | 2017-08-29 | 2017-11-24 | 贵州省岚林阳环保能源科技有限责任公司 | Intelligent plant growth environment adjustment system and method based on Internet of Things cloud platform |
CN108255689A (en) * | 2018-01-11 | 2018-07-06 | 哈尔滨工业大学 | A kind of Apache Spark application automation tuning methods based on historic task analysis |
CN108491226A (en) * | 2018-02-05 | 2018-09-04 | 西安电子科技大学 | Spark based on cluster scaling configures parameter automated tuning method |
CN109035178A (en) * | 2018-08-31 | 2018-12-18 | 杭州电子科技大学 | A kind of multi-parameter value tuning method applied to image denoising |
CN109325541A (en) * | 2018-09-30 | 2019-02-12 | 北京字节跳动网络技术有限公司 | Method and apparatus for training pattern |
WO2019061187A1 (en) * | 2017-09-28 | 2019-04-04 | 深圳乐信软件技术有限公司 | Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus |
CN109634924A (en) * | 2018-11-02 | 2019-04-16 | 华南师范大学 | File system parameter automated tuning method and system based on machine learning |
CN109947745A (en) * | 2019-03-28 | 2019-06-28 | 浪潮商用机器有限公司 | A kind of database optimizing method and device |
CN110059842A (en) * | 2018-01-19 | 2019-07-26 | 武汉十傅科技有限公司 | A kind of foundry's production planning optimization method considering smelting furnace and sand mold size |
CN110413313A (en) * | 2019-07-19 | 2019-11-05 | 苏州浪潮智能科技有限公司 | A kind of the parameter preferred method and device of Spark application |
CN110427263A (en) * | 2018-04-28 | 2019-11-08 | 深圳先进技术研究院 | A kind of Spark big data application program capacity modeling method towards Docker container, equipment and storage equipment |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | One parameter configuration method and equipment |
CN110727506A (en) * | 2019-10-18 | 2020-01-24 | 北京航空航天大学 | SPARK parameter automatic tuning method based on cost model |
CN110798314A (en) * | 2019-11-01 | 2020-02-14 | 南京邮电大学 | Quantum key distribution parameter optimization method based on random forest algorithm |
CN111176832A (en) * | 2019-12-06 | 2020-05-19 | 重庆邮电大学 | Performance optimization and parameter configuration method based on memory computing framework Spark |
CN111259933A (en) * | 2020-01-09 | 2020-06-09 | 中国科学院计算技术研究所 | High-dimensional feature data classification method and system based on distributed parallel decision tree |
CN111461286A (en) * | 2020-01-15 | 2020-07-28 | 华中科技大学 | Spark parameter automatic optimization system and method based on evolutionary neural network |
CN111629048A (en) * | 2020-05-22 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | spark cluster optimal configuration parameter determination method, device and equipment |
CN112433853A (en) * | 2020-11-30 | 2021-03-02 | 西安交通大学 | Heterogeneous sensing data partitioning method for parallel application of supercomputer data |
CN112445746A (en) * | 2019-09-04 | 2021-03-05 | 中国科学院深圳先进技术研究院 | Cluster configuration automatic optimization method and system based on machine learning |
CN112488319A (en) * | 2019-09-12 | 2021-03-12 | 中国科学院深圳先进技术研究院 | Parameter adjusting method and system with self-adaptive configuration generator |
CN113032033A (en) * | 2019-12-05 | 2021-06-25 | 中国科学院深圳先进技术研究院 | Automatic optimization method for big data processing platform configuration |
CN113157538A (en) * | 2021-02-02 | 2021-07-23 | 西安天和防务技术股份有限公司 | Spark operation parameter determination method, device, equipment and storage medium |
CN113204539A (en) * | 2021-05-12 | 2021-08-03 | 南京大学 | Big data system parameter automatic optimization method fusing system semantics |
CN113574475A (en) * | 2019-03-15 | 2021-10-29 | 3M创新有限公司 | Determining causal models for a control environment |
CN113743425A (en) * | 2020-05-27 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Method and device for generating classification model |
CN114416193A (en) * | 2021-12-15 | 2022-04-29 | 中国科学院深圳先进技术研究院 | Method for accurately and quickly determining configuration parameter value field of big data analysis system |
CN114489574A (en) * | 2020-11-12 | 2022-05-13 | 深圳先进技术研究院 | SVM-based automatic optimization method for stream processing framework |
CN114565001A (en) * | 2020-11-27 | 2022-05-31 | 深圳先进技术研究院 | Automatic tuning method for graph data processing framework based on random forest |
CN114880108A (en) * | 2021-12-15 | 2022-08-09 | 中国科学院深圳先进技术研究院 | Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium |
CN116089022A (en) * | 2023-04-11 | 2023-05-09 | 广州嘉为科技有限公司 | Parameter configuration adjustment method, system and storage medium of log search engine |
CN116401451A (en) * | 2023-03-31 | 2023-07-07 | 厦门海晟融创信息技术有限公司 | Flow analysis method and system for building multi-dimensional strategy system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389585A (en) * | 2015-10-20 | 2016-03-09 | 深圳大学 | Random forest optimization method and system based on tensor decomposition |
CN105550374A (en) * | 2016-01-29 | 2016-05-04 | 湖南大学 | Random forest parallelization machine studying method for big data in Spark cloud service environment |
CN105868019A (en) * | 2016-02-01 | 2016-08-17 | 中国科学院大学 | Automatic optimization method for performance of Spark platform |
- 2016-12-20: CN CN201611182310.5A, published as CN106648654A, status: Pending
Non-Patent Citations (1)
Title |
---|
Zeng Xiaxia et al.: "A head pose estimation algorithm based on random forest", Journal of Fujian Normal University (Natural Science Edition) * |
Cited By (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229693A (en) * | 2017-05-22 | 2017-10-03 | 哈工大大数据产业有限公司 | Method and system for big data system configuration parameter tuning based on deep learning |
CN107229693B (en) * | 2017-05-22 | 2018-05-01 | 哈工大大数据产业有限公司 | Method and system for big data system configuration parameter tuning based on deep learning |
CN107390753A (en) * | 2017-08-29 | 2017-11-24 | 贵州省岚林阳环保能源科技有限责任公司 | Intelligent plant growth environment regulating device and method based on Internet of Things cloud platform |
CN107390754A (en) * | 2017-08-29 | 2017-11-24 | 贵州省岚林阳环保能源科技有限责任公司 | Intelligent plant growth environment adjustment system and method based on Internet of Things cloud platform |
CN107390754B (en) * | 2017-08-29 | 2019-07-30 | 贵州省岚林阳环保能源科技有限责任公司 | Intelligent plant growth environment adjustment system and method based on Internet of Things cloud platform |
WO2019061187A1 (en) * | 2017-09-28 | 2019-04-04 | 深圳乐信软件技术有限公司 | Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus |
CN108255689A (en) * | 2018-01-11 | 2018-07-06 | 哈尔滨工业大学 | Automatic Apache Spark application tuning method based on historical task analysis |
CN108255689B (en) * | 2018-01-11 | 2021-02-12 | 哈尔滨工业大学 | Automatic Apache Spark application tuning method based on historical task analysis |
CN110059842A (en) * | 2018-01-19 | 2019-07-26 | 武汉十傅科技有限公司 | Foundry production planning optimization method considering smelting furnace and sand mold size |
CN108491226A (en) * | 2018-02-05 | 2018-09-04 | 西安电子科技大学 | Spark configuration parameter automatic tuning method based on cluster scaling |
CN108491226B (en) * | 2018-02-05 | 2021-03-23 | 西安电子科技大学 | Spark configuration parameter automatic tuning method based on cluster scaling |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | Parameter configuration method and equipment |
CN110427356B (en) * | 2018-04-26 | 2021-08-13 | 中移(苏州)软件技术有限公司 | Parameter configuration method and equipment |
CN110427263A (en) * | 2018-04-28 | 2019-11-08 | 深圳先进技术研究院 | Spark big data application program performance modeling method and device for Docker container and storage device |
CN110427263B (en) * | 2018-04-28 | 2024-03-19 | 深圳先进技术研究院 | Spark big data application program performance modeling method and device for Docker container and storage device |
CN109035178A (en) * | 2018-08-31 | 2018-12-18 | 杭州电子科技大学 | Multi-parameter value tuning method applied to image denoising |
CN109325541A (en) * | 2018-09-30 | 2019-02-12 | 北京字节跳动网络技术有限公司 | Method and apparatus for training pattern |
CN109634924A (en) * | 2018-11-02 | 2019-04-16 | 华南师范大学 | File system parameter automated tuning method and system based on machine learning |
CN109634924B (en) * | 2018-11-02 | 2022-12-20 | 华南师范大学 | File system parameter automatic tuning method and system based on machine learning |
CN113574475A (en) * | 2019-03-15 | 2021-10-29 | 3M创新有限公司 | Determining causal models for a control environment |
CN109947745A (en) * | 2019-03-28 | 2019-06-28 | 浪潮商用机器有限公司 | Database optimization method and device |
CN110413313A (en) * | 2019-07-19 | 2019-11-05 | 苏州浪潮智能科技有限公司 | Parameter optimization method and device for Spark application |
CN110413313B (en) * | 2019-07-19 | 2023-05-23 | 苏州浪潮智能科技有限公司 | Parameter optimization method and device for Spark application |
CN112445746B (en) * | 2019-09-04 | 2024-06-04 | 中国科学院深圳先进技术研究院 | Automatic cluster configuration optimization method and system based on machine learning |
CN112445746A (en) * | 2019-09-04 | 2021-03-05 | 中国科学院深圳先进技术研究院 | Cluster configuration automatic optimization method and system based on machine learning |
CN112488319B (en) * | 2019-09-12 | 2024-04-19 | 中国科学院深圳先进技术研究院 | Parameter adjusting method and system with self-adaptive configuration generator |
CN112488319A (en) * | 2019-09-12 | 2021-03-12 | 中国科学院深圳先进技术研究院 | Parameter adjusting method and system with self-adaptive configuration generator |
CN110727506B (en) * | 2019-10-18 | 2022-07-01 | 北京航空航天大学 | SPARK parameter automatic tuning method based on cost model |
CN110727506A (en) * | 2019-10-18 | 2020-01-24 | 北京航空航天大学 | SPARK parameter automatic tuning method based on cost model |
CN110798314A (en) * | 2019-11-01 | 2020-02-14 | 南京邮电大学 | Quantum key distribution parameter optimization method based on random forest algorithm |
CN110798314B (en) * | 2019-11-01 | 2023-02-24 | 南京邮电大学 | Quantum key distribution parameter optimization method based on random forest algorithm |
CN113032033B (en) * | 2019-12-05 | 2024-05-17 | 中国科学院深圳先进技术研究院 | Automatic optimization method for big data processing platform configuration |
CN113032033A (en) * | 2019-12-05 | 2021-06-25 | 中国科学院深圳先进技术研究院 | Automatic optimization method for big data processing platform configuration |
CN111176832A (en) * | 2019-12-06 | 2020-05-19 | 重庆邮电大学 | Performance optimization and parameter configuration method based on memory computing framework Spark |
CN111176832B (en) * | 2019-12-06 | 2022-07-01 | 重庆邮电大学 | Performance optimization and parameter configuration method based on memory computing framework Spark |
CN111259933B (en) * | 2020-01-09 | 2023-06-13 | 中国科学院计算技术研究所 | High-dimensional characteristic data classification method and system based on distributed parallel decision tree |
CN111259933A (en) * | 2020-01-09 | 2020-06-09 | 中国科学院计算技术研究所 | High-dimensional feature data classification method and system based on distributed parallel decision tree |
CN111461286B (en) * | 2020-01-15 | 2022-03-29 | 华中科技大学 | Spark parameter automatic optimization system and method based on evolutionary neural network |
CN111461286A (en) * | 2020-01-15 | 2020-07-28 | 华中科技大学 | Spark parameter automatic optimization system and method based on evolutionary neural network |
CN111629048A (en) * | 2020-05-22 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | Spark cluster optimal configuration parameter determination method, device and equipment |
CN111629048B (en) * | 2020-05-22 | 2023-04-07 | 浪潮电子信息产业股份有限公司 | Spark cluster optimal configuration parameter determination method, device and equipment |
CN113743425A (en) * | 2020-05-27 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Method and device for generating classification model |
CN114489574B (en) * | 2020-11-12 | 2022-10-14 | 深圳先进技术研究院 | SVM-based automatic optimization method for stream processing framework |
WO2022100370A1 (en) * | 2020-11-12 | 2022-05-19 | 深圳先进技术研究院 | Automatic adjustment and optimization method for SVM-based streaming |
CN114489574A (en) * | 2020-11-12 | 2022-05-13 | 深圳先进技术研究院 | SVM-based automatic optimization method for stream processing framework |
WO2022111125A1 (en) * | 2020-11-27 | 2022-06-02 | 深圳先进技术研究院 | Random-forest-based automatic optimization method for graphic data processing framework |
CN114565001A (en) * | 2020-11-27 | 2022-05-31 | 深圳先进技术研究院 | Automatic tuning method for graph data processing framework based on random forest |
CN112433853B (en) * | 2020-11-30 | 2023-04-28 | 西安交通大学 | Heterogeneous perception data partitioning method for supercomputer data parallel application |
CN112433853A (en) * | 2020-11-30 | 2021-03-02 | 西安交通大学 | Heterogeneous sensing data partitioning method for parallel application of supercomputer data |
CN113157538A (en) * | 2021-02-02 | 2021-07-23 | 西安天和防务技术股份有限公司 | Spark operation parameter determination method, device, equipment and storage medium |
CN113204539B (en) * | 2021-05-12 | 2023-08-22 | 南京大学 | Automatic optimization method for big data system parameters fusing system semantics |
CN113204539A (en) * | 2021-05-12 | 2021-08-03 | 南京大学 | Big data system parameter automatic optimization method fusing system semantics |
CN114880108A (en) * | 2021-12-15 | 2022-08-09 | 中国科学院深圳先进技术研究院 | Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium |
CN114416193A (en) * | 2021-12-15 | 2022-04-29 | 中国科学院深圳先进技术研究院 | Method for accurately and quickly determining configuration parameter value field of big data analysis system |
CN116401451A (en) * | 2023-03-31 | 2023-07-07 | 厦门海晟融创信息技术有限公司 | Flow analysis method and system for building multi-dimensional strategy system |
CN116401451B (en) * | 2023-03-31 | 2024-07-02 | 厦门海晟融创信息技术有限公司 | Flow analysis method and system for building multi-dimensional strategy system |
CN116089022A (en) * | 2023-04-11 | 2023-05-09 | 广州嘉为科技有限公司 | Parameter configuration adjustment method, system and storage medium of log search engine |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106648654A (en) | Data sensing-based Spark configuration parameter automatic optimization method | |
Triguero et al. | Evolutionary undersampling for extremely imbalanced big data classification under apache spark | |
CN106096727B (en) | Network model building method and device based on machine learning | |
CN103679132B (en) | Nude picture detection method and system | |
CN108280236B (en) | Method for analyzing random forest visual data based on LargeVis | |
CN105718490A (en) | Method and device for updating classifying model | |
CN106326346A (en) | Text classification method and terminal device | |
CN108491226B (en) | Spark configuration parameter automatic tuning method based on cluster scaling | |
Nallathambi et al. | Prediction of electricity consumption based on DT and RF: An application on USA country power consumption | |
CN115563610B (en) | Training method, recognition method and device for intrusion detection model | |
CN103971136A (en) | Large-scale data-oriented parallel structured support vector machine classification method | |
An et al. | Classification method of teaching resources based on improved KNN algorithm | |
CN116245019A (en) | Load prediction method, system, device and storage medium based on Bagging sampling and improved random forest algorithm | |
CN111461324A (en) | Hierarchical pruning method based on layer recovery sensitivity | |
CN109977977A (en) | A kind of method and corresponding intrument identifying potential user | |
CN103020864B (en) | Corn fine breed breeding method | |
CN104468276A (en) | Network traffic identification method based on random sampling multiple classifiers | |
Ntoutsi et al. | A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees | |
Bo | Research on the classification of high dimensional imbalanced data based on the optimizational random forest algorithm | |
Rothe et al. | Topics and trends in cognitive science (2000-2017) | |
Vuyyala et al. | Crop Recommender System Based on Ensemble Classifiers | |
Nugroho et al. | Decision tree using ant colony for classification of health data | |
She et al. | Text Classification Research Based on Improved SoftMax Regression Algorithm | |
Fraideinberze et al. | Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets: removing linear and non-linear attribute correlations | |
Zhao et al. | One-Shot Pruning for Fast-adapting Pre-trained Models on Devices | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |