WO2022100370A1 - Automatic tuning method for a stream processing framework based on SVM - Google Patents

Automatic tuning method for a stream processing framework based on SVM

Info

Publication number
WO2022100370A1
Authority
WO
WIPO (PCT)
Prior art keywords
configuration parameters
stream processing
performance
individual
fitness
Prior art date
Application number
PCT/CN2021/124402
Other languages
English (en)
French (fr)
Inventor
辛锦瀚
陈超
王峥
杨永魁
喻之斌
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2022100370A1 publication Critical patent/WO2022100370A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/20Software design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • the present invention relates to the technical field of big data processing, and more particularly, to an automatic tuning method of a stream processing framework based on SVM.
  • Structured Streaming provides incremental queries, a more advanced programming interface, and a unified batch-stream programming model, achieving lower processing latency and simpler, higher-level business logic, and it realizes efficient stream processing by integrating with Spark SQL. These qualities have made Structured Streaming the first choice for real-time computing at a growing number of enterprises.
  • Structured Streaming is affected by configuration parameters at runtime, and unreasonable configurations can severely slow down task execution. Spark officially recommends a set of default configuration parameters; however, in actual stream processing tasks the defaults cannot adapt to real-time changes in the task scenario or to differing system resources, so the performance of Structured Streaming is limited and a large amount of system resources is wasted. The many configuration parameters in Structured Streaming need to be set reasonably for different application scenarios, and existing manual tuning is difficult and costly.
  • Existing automatic configuration-parameter tuning methods for Structured Streaming do not go deep enough: they consider only Structured Streaming's mixed batch-stream characteristics and automatically optimize only the parameters related to batch and stream processing.
  • Yet the underlying computing engine of Structured Streaming is Spark, which also needs tuning, so optimizing only the batch/stream-related parameters is far from sufficient.
  • Moreover, the machine learning algorithms used in existing methods cannot effectively search for the optimal configuration parameters.
  • The purpose of the present invention is to overcome the above defects of the prior art and to provide an automatic tuning method for a stream processing framework: a new technical solution that searches for the optimal configuration parameters for a specific application scenario based on an SVM (support vector machine) and a genetic algorithm.
  • the invention provides an automatic tuning method of a stream processing framework based on SVM.
  • the method includes the following steps:
  • each sample data item contains the correspondence between a set of configuration parameters and the execution performance of the stream processing framework
  • in the search space of configuration parameters, a set of configuration parameters is regarded as an individual, each parameter in the set is regarded as a gene of that individual, and the output performance of the performance prediction model for the set is used to measure the individual's fitness; a genetic algorithm then searches for the optimal configuration parameters of the stream processing framework.
  • Compared with the prior art, the present invention has the advantage that an automatic optimization method for the stream processing framework is designed based on an SVM and a meta-heuristic (genetic) algorithm, realizing automatic optimization from the bottom layer to the upper layer; combined with a better machine learning algorithm, it can search for better configuration parameters more efficiently.
  • FIG. 1 is a flowchart of an automatic tuning method for an SVM-based stream processing framework according to an embodiment of the present invention
  • FIG. 2 is a schematic process diagram of an automatic tuning method of an SVM-based stream processing framework according to an embodiment of the present invention
  • FIG. 3 is a comparison diagram of the effect of three prior-art algorithms and the method of the present invention;
  • FIG. 4 is an effect diagram of data processing throughput under different programs according to an embodiment of the present invention;
  • FIG. 5 is an effect diagram of data processing latency under different programs according to an embodiment of the present invention;
  • FIG. 6 is an effect diagram of the ratio of data processing latency to data processing throughput according to an embodiment of the present invention.
  • the automatic tuning method of the SVM-based stream processing framework includes the following steps.
  • In step S110, a training data set is constructed; each training sample represents the correspondence between the execution performance of the stream processing framework and the configuration parameter combination used.
  • This data collection stage includes a parameter generator (Conf Generator), which first selects the parameters that significantly affect the performance of Structured Streaming and the underlying Spark, and then automatically generates and assigns parameters for runs of the program to be optimized.
  • After each run, the runtime data processing latency and throughput of Structured Streaming, together with the parameter combination used, are collected as one sample in the training data set. In this way, a training data set consisting of multiple samples is obtained after multiple runs. Each training sample characterizes the correspondence between data processing latency and throughput and the parameter combination used.
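The data collection stage above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the parameter names, ranges, and the `run_workload` callback are placeholders (the patent selects 23 parameters that significantly affect Structured Streaming and the underlying Spark).

```python
import random

# Hypothetical search space: a few illustrative parameter ranges.
# These names and bounds are placeholders, not taken from the source.
PARAM_SPACE = {
    "spark.sql.shuffle.partitions": (8, 400),             # underlying Spark parameter
    "spark.executor.memory_mb": (1024, 8192),             # underlying Spark parameter
    "spark.streaming.maxRatePerPartition": (100, 10000),  # stream-level parameter
}

def generate_conf(space=PARAM_SPACE):
    """Randomly draw one configuration (later treated as one 'individual')."""
    return {name: random.randint(lo, hi) for name, (lo, hi) in space.items()}

def collect_samples(run_workload, n_runs):
    """Run the workload under n_runs random configurations and record
    (latency, throughput, conf) as one training sample per run."""
    dataset = []
    for _ in range(n_runs):
        conf = generate_conf()
        latency, throughput = run_workload(conf)  # placeholder for a real run
        dataset.append({"latency": latency, "throughput": throughput, "conf": conf})
    return dataset
```

In a real setup, `run_workload` would launch the Structured Streaming program under the given configuration and read back its measured latency and throughput.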
  • For example, the training data set is represented as {Pv_1, Pv_2, …, Pv_n}, where the i-th training sample Pv_i contains {t_i, I_i, conf_i1, …, conf_i23}: t_i is the data processing latency, I_i is the data throughput, and conf_i1, …, conf_i23 is the combination of 23 configuration parameters. Each group of configuration parameters includes both the upper-level parameters of the stream processing framework and the underlying Spark parameters.
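The sample layout {t_i, I_i, conf_i1, …, conf_i23} can be written as a simple record; the field values below are purely illustrative, not measurements from the source:

```python
from typing import NamedTuple, Tuple

class Sample(NamedTuple):
    """One training sample Pv_i = {t_i, I_i, conf_i1, ..., conf_i23}."""
    t: float                 # data processing latency t_i
    I: float                 # data throughput I_i
    conf: Tuple[float, ...]  # the 23 configuration parameter values

# Illustrative values only; real samples come from instrumented runs.
pv1 = Sample(t=1.8, I=950.0, conf=tuple(float(k) for k in range(23)))
```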
  • In step S120, using the training data set, with the configuration parameter combinations as input and the stream processing framework's execution performance as output, an SVM model is trained to obtain the performance prediction model of the stream processing framework.
  • This step is the modeling stage of the performance prediction model.
  • the obtained training data set is used for modeling based on machine learning algorithms.
  • the purpose is to build a performance prediction model that can reflect the impact of different configuration parameters on the delay and throughput of Structured Streaming.
  • each set of configuration parameters is used as input, and the execution performance of the corresponding stream processing framework is used as output to train an SVM model to obtain a performance prediction model, which is used for delay and throughput prediction for different configuration parameters.
  • Experiments have verified that, compared with other deep learning models, the SVM model can quickly and accurately predict the execution performance of the stream processing framework.
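A minimal sketch of this modeling stage, assuming scikit-learn's SVR on synthetic data; the patent does not name a library, kernel, or hyperparameters, so those choices are illustrative. Separate regressors predict latency and throughput:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the collected data set: 200 runs x 23 parameters.
X = rng.uniform(0.0, 1.0, size=(200, 23))
latency = X[:, 0] * 2.0 - X[:, 1] + rng.normal(0.0, 0.05, size=200)
throughput = 3.0 - latency + rng.normal(0.0, 0.05, size=200)

# One SVR per performance metric; feature scaling helps the RBF kernel.
latency_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
throughput_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
latency_model.fit(X, latency)
throughput_model.fit(X, throughput)

def predict_performance(conf):
    """Predict (latency, throughput) for one 23-parameter configuration."""
    conf = np.asarray(conf).reshape(1, -1)
    return (float(latency_model.predict(conf)[0]),
            float(throughput_model.predict(conf)[0]))
```

In the patent's pipeline, `X` would be the collected configuration combinations and the targets the measured latency and throughput; the trained models then score candidate configurations during the genetic search.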
  • In step S130, in the search space of configuration parameters, the performance prediction model is used in combination with a genetic algorithm to search for the optimal configuration.
  • the genetic algorithm is used to perform an iterative search based on the performance prediction model, and the optimal configuration parameters are finally screened.
  • Considering that the parameter search space of Structured Streaming is complex and high-dimensional, in order to find the optimal configuration parameters efficiently while avoiding local optima, the present invention preferably uses a genetic algorithm as the basic search algorithm and combines it with the Structured Streaming optimal-configuration search scenario. For example, from the perspective of individuals in the population, a set of configuration parameters is abstracted as an individual in the genetic algorithm; each parameter in the set is abstracted as a gene of the individual; and the performance of the set in Structured Streaming (e.g., the ratio of data processing latency to throughput) serves as the individual's fitness, with higher fitness representing a better individual.
  • the search process for optimal configuration parameters includes:
  • Step S131: randomly input a set of Structured Streaming parameters and compute the initial individual fitness standard A through the performance model;
  • Step S132: randomly select n groups of configuration parameters (for example, n greater than 1/5 of the number of training samples) from the training data set obtained in the data collection stage as the initial population P, and perform a random crossover operation and a mutation operation with a mutation rate of 0.02 on each individual in P;
  • Step S133: use the performance prediction model to compute the fitness of population P and its offspring, screen out the individuals whose fitness is higher than A to form a new population P', and take the fitness A' of the fittest individual as the new fitness standard;
  • Step S134: repeat S132 and S133 until no better individual can be generated; the current best individual then corresponds to the searched optimal configuration parameters.
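Steps S131–S134 can be sketched as the following search loop. The fitness function here is a toy surrogate standing in for the trained SVM performance model, and the population-maintenance details (single-point crossover, the fallback when too few individuals beat the standard A) are illustrative assumptions rather than specifics from the source:

```python
import random

random.seed(42)

N_GENES = 23          # one gene per configuration parameter
MUTATION_RATE = 0.02  # mutation rate from step S132

def fitness(individual):
    # Stand-in for the SVM performance model: higher is better.
    # Toy surrogate whose optimum is all genes equal to 0.7.
    return -sum((g - 0.7) ** 2 for g in individual)

def crossover(a, b):
    # Random single-point crossover between two parents.
    point = random.randrange(1, N_GENES)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind):
    # Re-draw each gene with probability MUTATION_RATE.
    return [random.random() if random.random() < MUTATION_RATE else g
            for g in ind]

def ga_search(population, max_stall=5):
    # S131: initialise fitness standard A from one random configuration.
    A = fitness([random.random() for _ in range(N_GENES)])
    best = max(population, key=fitness)
    stall = 0
    while stall < max_stall:                # S134: stop when no improvement
        offspring = []                      # S132: crossover + mutation
        for _ in range(len(population) // 2):
            p1, p2 = random.sample(population, 2)
            c1, c2 = crossover(p1, p2)
            offspring += [mutate(c1), mutate(c2)]
        pool = population + offspring
        # S133: keep individuals whose fitness beats the current standard A.
        survivors = [ind for ind in pool if fitness(ind) > A]
        if len(survivors) < 2:              # keep the population viable
            survivors = sorted(pool, key=fitness, reverse=True)[:2]
        candidate = max(survivors, key=fitness)
        if fitness(candidate) > fitness(best):
            best, stall = candidate, 0
        else:
            stall += 1
        A = fitness(candidate)              # new standard A'
        population = survivors
    return best                             # searched optimal configuration

initial = [[random.random() for _ in range(N_GENES)] for _ in range(20)]
best_conf = ga_search(initial)
```

With the real performance prediction model plugged in as `fitness`, `best_conf` would be the configuration handed directly to Structured Streaming.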
  • the genetic algorithm is used to search for the optimal configuration based on the performance prediction model obtained in the modeling stage.
  • The purpose is to use the crossover and mutation characteristics of the genetic algorithm to prevent the search from falling into a local optimum while ensuring excellent search performance.
  • In the search stage, the performance prediction model predicts how well the different configuration parameters generated by the genetic algorithm perform in Structured Streaming, achieving an efficient search; the search finally yields the optimal configuration parameters, which can be used directly in Structured Streaming.
  • By contrast, if a recursive random search algorithm, a pattern search algorithm, or the like is used for the optimal-configuration search, it easily falls into a local optimum, so the globally optimal configuration cannot be found.
  • To verify the feasibility and technical effect of the present invention, experiments were conducted using the Structured Streaming test programs officially provided by Spark, including StructuredKafkaWordCount (KafkaWC), StructuredNetworkWordCount (NetworkWC), StructuredNetworkWordCountWindows (NetworkWCW), and StructuredSessionization (Sessionization), to automatically optimize the configuration parameters of the Structured Streaming framework. The three most commonly used algorithms, the response surface algorithm (RS), the artificial neural network algorithm (ANN), and the random forest method (RF), were selected for performance comparison with the method of the present invention.
  • Fig. 3 compares the modeling effect of the three commonly used algorithms (RS, ANN, RF) and the method of the present invention (marked in the figure as the support vector machine algorithm) on the four selected Structured Streaming programs. It is evident from Fig. 3 that the modeling accuracy of the present method (the rightmost bar of each group) is higher than that of the other three algorithms under every program: on average 8% higher than the RS algorithm, 9.7% higher than the ANN algorithm, and 6.7% higher than the RF algorithm.
  • Fig. 4 shows the optimization effect of the present invention on the running throughput of Structured Streaming. Because the optimization method automatically configures reasonable parameters for different programs, compared with the official default configuration (right side) it significantly increases the data processing throughput of Structured Streaming under different programs, by 2.29 times on average and up to 2.52 times.
  • Fig. 5 shows the optimization effect of the present invention on reducing the runtime latency of Structured Streaming. Because the optimization method automatically configures reasonable parameters for different programs, compared with the official default configuration (right side) it significantly reduces the data processing latency of Structured Streaming under different programs, by 3.08 times on average and up to 3.96 times.
  • Figure 6 shows the optimization effect of the present invention on the ratio of data processing delay and data processing throughput.
  • Lower data processing latency and higher data processing throughput are the goals of stream processing performance optimization, and a lower ratio of the two at the same moment means that greater throughput is achieved together with lower latency; the ratio is therefore a more comprehensive optimization evaluation criterion.
  • Compared with the official default configuration (right side), the optimization method of the present invention significantly reduces the ratio of latency to throughput, by 5.95 times on average and up to 8.36 times.
  • The experimental results show that the optimization method of the present invention realizes automatic parameter tuning for Structured Streaming with better performance than the prior art: under the different program loads tested, compared with the official default configuration, it reduces data processing latency by up to 3.96 times while increasing data processing throughput by up to 2.52 times.
  • the existing automatic tuning method of Structured Streaming configuration parameters does not consider the optimization of Spark, the underlying computing engine of Structured Streaming.
  • the invention realizes the overall optimization from the bottom layer Spark to the upper layer Structured Streaming, the optimization is more in-depth and the effect is better.
  • the existing machine learning algorithms have poor performance and do not fit the optimization characteristics of Structured Streaming.
  • the invention combines the SVM with the genetic algorithm, designs a technical scheme more in line with Structured Streaming optimization, and realizes high-performance automatic parameter tuning optimization.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of more specific examples of the computer-readable storage medium includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within that device.
  • The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized using the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed thereon to produce a computer-implemented process, so that the instructions executing on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.


Abstract

The present invention discloses an automatic tuning method for a stream processing framework based on SVM. The method comprises: constructing a training data set, wherein each sample data item contains the correspondence between a set of configuration parameters and the execution performance of the stream processing framework; on the basis of the training data set, training an SVM model with each set of configuration parameters as input and the corresponding execution performance as output, to obtain a performance prediction model; and, in the search space of configuration parameters, regarding a set of configuration parameters as an individual and each parameter in the set as a gene of the individual, measuring the individual's fitness by the output performance of the performance prediction model for the corresponding set, and using a genetic algorithm to search for the optimal configuration parameters of the stream processing framework. The present invention combines the SVM with a genetic algorithm and realizes high-performance automatic parameter tuning for stream processing frameworks.

Description

Automatic tuning method for a stream processing framework based on SVM
Technical field
The present invention relates to the technical field of big data processing and, more particularly, to an automatic tuning method for a stream processing framework based on SVM.
Background art
In the field of big data processing, stream processing technology has been widely applied in many real-time data processing scenarios, and stream processing frameworks keep emerging. Traditional stream processing frameworks suffer from weak real-time processing, complex programming models, and poor batch/stream unification. To solve these problems, the Apache Spark team designed the Structured Streaming framework, which is based on the Spark computing engine and is open-sourced as a module of Spark. Structured Streaming provides incremental queries, a more advanced programming interface, and a unified batch-stream programming model, achieving lower processing latency and simpler, higher-level business logic, and it realizes efficient stream processing by integrating with Spark SQL; these qualities have made Structured Streaming the first choice for real-time computing at more and more enterprises.
Structured Streaming is affected by configuration parameters at runtime, and unreasonable configurations can severely slow down task execution. Spark officially recommends a set of default configuration parameters; however, in actual stream processing tasks the defaults cannot adapt to real-time changes in the task scenario or to differing system resources, so the performance of Structured Streaming is limited and a large amount of system resources is wasted. The many configuration parameters in Structured Streaming need to be set reasonably for different application scenarios, and existing manual tuning is difficult and costly.
In the prior art, research focuses on automatically optimizing batch- and stream-related configurations through machine learning. Such methods first collect the data processing latency and throughput of different applications running under different parameters, and then apply machine learning to classify and model the applications. When a new application is processed, the model automatically predicts its class, and the batch- and stream-related parameters are set accordingly, thereby optimizing the configuration parameters.
Analysis shows that existing automatic configuration-parameter tuning methods for Structured Streaming do not go deep enough: they consider only Structured Streaming's mixed batch-stream characteristics and automatically optimize only the parameters related to batch and stream processing. Yet the underlying computing engine of Structured Streaming is Spark, which also needs tuning, so optimizing only the batch/stream-related parameters is far from sufficient; meanwhile, the machine learning algorithms used by existing methods cannot effectively search for the optimal configuration parameters.
Summary of the invention
The purpose of the present invention is to overcome the above defects of the prior art and to provide an automatic tuning method for a stream processing framework: a new technical solution that searches for the optimal configuration parameters for a specific application scenario based on an SVM (support vector machine) and a genetic algorithm.
The present invention provides an automatic tuning method for a stream processing framework based on SVM. The method comprises the following steps:
constructing a training data set, wherein each sample data item contains the correspondence between a set of configuration parameters and the execution performance of the stream processing framework;
on the basis of the training data set, training an SVM model with each set of configuration parameters as input and the corresponding execution performance of the stream processing framework as output, to obtain a performance prediction model;
in the search space of configuration parameters, regarding a set of configuration parameters as an individual and each parameter in the set as a gene of the individual, measuring the individual's fitness by the output performance of the performance prediction model for the corresponding set, and using a genetic algorithm to search for the optimal configuration parameters of the stream processing framework.
Compared with the prior art, the present invention has the advantage that an automatic optimization method for the stream processing framework is designed based on an SVM and a meta-heuristic (genetic) algorithm, realizing automatic optimization from the bottom layer to the upper layer; combined with a better machine learning algorithm, it can search for better configuration parameters more efficiently.
Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
Fig. 1 is a flowchart of an automatic tuning method for an SVM-based stream processing framework according to an embodiment of the present invention;
Fig. 2 is a schematic process diagram of an automatic tuning method for an SVM-based stream processing framework according to an embodiment of the present invention;
Fig. 3 is a comparison diagram of the effect of three prior-art algorithms and the method of the present invention;
Fig. 4 is an effect diagram of data processing throughput under different programs according to an embodiment of the present invention;
Fig. 5 is an effect diagram of data processing latency under different programs according to an embodiment of the present invention;
Fig. 6 is an effect diagram of the ratio of data processing latency to data processing throughput according to an embodiment of the present invention.
Detailed description of the embodiments
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Note that, unless otherwise specified, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values.
Note that similar reference numerals and letters denote similar items in the following figures; once an item is defined in one figure, it need not be discussed further in subsequent figures.
The present invention is described below using the Structured Streaming framework as an example, but it should be understood that the invention can also be applied to other stream processing frameworks, such as Flink and Storm.
With reference to Figs. 1 and 2, the automatic tuning method for an SVM-based stream processing framework provided by this embodiment comprises the following steps.
Step S110: construct a training data set, where each training sample characterizes the correspondence between the execution performance of the stream processing framework and the configuration parameter combination used.
This data collection stage includes a parameter generator (Conf Generator), which first selects the parameters that significantly affect the performance of Structured Streaming and the underlying Spark, and then automatically generates and assigns parameters for runs of the program to be optimized. After each run, the runtime data processing latency and throughput of Structured Streaming, together with the parameter combination used, are collected as one sample in the training data set. In this way, a training data set consisting of multiple samples is obtained after multiple runs. Each training sample characterizes the correspondence between data processing latency and throughput and the parameter combination used.
For example, the training data set is represented as {Pv_1, Pv_2, …, Pv_n}, where the i-th training sample Pv_i contains {t_i, I_i, conf_i1, …, conf_i23}: t_i is the data processing latency, I_i is the data throughput, and conf_i1, …, conf_i23 is the combination of 23 configuration parameters. Each group of configuration parameters includes both the upper-level parameters of the stream processing framework and the underlying Spark parameters.
Step S120: using the training data set, with configuration parameter combinations as input and the execution performance of the stream processing framework as output, train an SVM model to obtain the performance prediction model of the stream processing framework.
This step is the modeling stage of the performance prediction model: the collected training data set is used for machine-learning-based modeling, with the goal of building a performance prediction model that reflects the impact of different configuration parameters on the latency and throughput of Structured Streaming.
For example, based on the training data set, an SVM model is trained with each set of configuration parameters as input and the corresponding execution performance as output, yielding a performance prediction model used to predict latency and throughput for different configuration parameters. Experiments have verified that, compared with other deep learning models, the SVM model can quickly and accurately predict the execution performance of the stream processing framework.
Step S130: in the search space of configuration parameters, use the performance prediction model in combination with a genetic algorithm to search for the optimal configuration.
In this step, the genetic algorithm performs an iterative search based on the performance prediction model, finally screening out the optimal configuration parameters.
Considering that the parameter search space of Structured Streaming is complex and high-dimensional, in order to find the optimal configuration parameters efficiently while avoiding local optima, the present invention preferably uses a genetic algorithm as the basic search algorithm and combines it with the Structured Streaming optimal-configuration search scenario. For example, from the perspective of individuals in the population, a set of configuration parameters is abstracted as an individual in the genetic algorithm; each parameter in the set is abstracted as a gene of the individual; and the performance of the set in Structured Streaming (e.g., the ratio of data processing latency to throughput) serves as the individual's fitness, with higher fitness representing a better individual.
Specifically, the search process for the optimal configuration parameters comprises:
Step S131: randomly input a set of Structured Streaming parameters and compute the initial individual fitness standard A through the performance model;
Step S132: randomly select n groups of configuration parameters (for example, n greater than 1/5 of the number of training samples) from the training data set obtained in the data collection stage as the initial population P, and perform a random crossover operation and a mutation operation with a mutation rate of 0.02 on each individual in P;
Step S133: use the performance prediction model to compute the fitness of population P and its offspring, screen out the individuals whose fitness is higher than A to form a new population P', and take the fitness A' of the fittest individual as the new fitness standard;
Step S134: repeat S132 and S133 until no better individual can be generated; the current best individual then corresponds to the searched optimal configuration parameters.
In summary, the genetic algorithm performs the optimal-configuration search based on the performance prediction model obtained in the modeling stage; the purpose is to use the crossover and mutation characteristics of the genetic algorithm to prevent the search from falling into a local optimum while ensuring excellent search performance. In the search stage, the performance prediction model predicts how well the different configuration parameters generated by the genetic algorithm perform in Structured Streaming, achieving an efficient search; the search finally yields the optimal configuration parameters, which can be used directly in Structured Streaming. By contrast, if a recursive random search algorithm, a pattern search algorithm, or the like is used for the optimal-configuration search, it easily falls into a local optimum, so the globally optimal configuration cannot be found.
Further, to verify the feasibility and technical effect of the present invention, experiments were conducted using the Structured Streaming test programs officially provided by Spark, including StructuredKafkaWordCount (KafkaWC), StructuredNetworkWordCount (NetworkWC), StructuredNetworkWordCountWindows (NetworkWCW), and StructuredSessionization (Sessionization), to automatically optimize the configuration parameters of the Structured Streaming framework. First, the three most commonly used algorithms, the response surface algorithm (RS), the artificial neural network algorithm (ANN), and the random forest method (RF), were selected for performance comparison with the method of the present invention; the method was then evaluated on optimizing both the latency and the throughput of Structured Streaming.
Fig. 3 compares the modeling effect of the three commonly used algorithms (RS, ANN, RF) and the method of the present invention (marked in the figure as the support vector machine algorithm) on the four selected Structured Streaming programs. It is evident from Fig. 3 that the modeling accuracy of the present method (the rightmost bar of each group) is higher than that of the other three algorithms under every program: on average 8% higher than the RS algorithm, 9.7% higher than the ANN algorithm, and 6.7% higher than the RF algorithm.
Fig. 4 shows the optimization effect of the present invention on the running throughput of Structured Streaming. Because the optimization method automatically configures reasonable parameters for different programs, compared with the official default configuration (right side) it significantly increases the data processing throughput of Structured Streaming under different programs, by 2.29 times on average and up to 2.52 times.
Fig. 5 shows the optimization effect of the present invention on reducing the runtime latency of Structured Streaming. Because the optimization method automatically configures reasonable parameters for different programs, compared with the official default configuration (right side) it significantly reduces the data processing latency of Structured Streaming under different programs, by 3.08 times on average and up to 3.96 times.
Fig. 6 shows the optimization effect of the present invention on the ratio of data processing latency to data processing throughput. Lower latency and higher throughput are the goals of stream processing performance optimization, and a lower ratio of the two at the same moment means that greater throughput is achieved together with lower latency, making the ratio a more comprehensive evaluation criterion. As is evident from Fig. 6, compared with the official default configuration (right side), the optimization method of the present invention clearly reduces the ratio of latency to throughput, by 5.95 times on average and up to 8.36 times.
The experimental results show that the optimization method of the present invention realizes automatic parameter tuning for Structured Streaming with better performance than the prior art: under the different program loads tested, compared with the official default configuration, it reduces data processing latency by up to 3.96 times while increasing data processing throughput by up to 2.52 times.
In summary, existing automatic tuning methods for Structured Streaming configuration parameters do not consider optimizing Spark, the underlying computing engine of Structured Streaming. The present invention realizes overall optimization from the underlying Spark to the upper-layer Structured Streaming, so the optimization is deeper and more effective. In addition, the machine learning algorithms used in existing methods perform poorly and do not fit the optimization characteristics of Structured Streaming; the present invention combines the SVM with a genetic algorithm, designs a technical scheme better suited to Structured Streaming optimization, and realizes high-performance automatic parameter tuning.
It should be noted that those skilled in the art may make appropriate changes or modifications to the above embodiments without departing from the spirit and scope of the present invention, for example by setting the mutation rate to another value, or by using data processing latency or throughput as the individual fitness.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. It may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above. The computer-readable storage medium as used here is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described here may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the device.
The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized using the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present invention.
Various aspects of the present invention are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks therein, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These instructions may also be stored in a computer-readable storage medium and cause a computer, programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed thereon to produce a computer-implemented process, so that the instructions executing on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures show possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures: two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functionality involved. Note also that each block of the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above; the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used here was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed here. The scope of the present invention is defined by the appended claims.

Claims (8)

  1. An automatic tuning method for a stream processing framework based on SVM, comprising the following steps:
    constructing a training data set, wherein each sample data item contains the correspondence between a set of configuration parameters and the execution performance of the stream processing framework;
    on the basis of the training data set, training an SVM model with each set of configuration parameters as input and the corresponding execution performance of the stream processing framework as output, to obtain a performance prediction model;
    in the search space of configuration parameters, regarding a set of configuration parameters as an individual and each parameter in the set as a gene of the individual, measuring the individual's fitness by the output performance of the performance prediction model for the corresponding set of configuration parameters, and using a genetic algorithm to search for the optimal configuration parameters of the stream processing framework.
  2. The method according to claim 1, wherein using the genetic algorithm to search for the optimal configuration parameters of the stream processing framework comprises:
    randomly inputting a set of configuration parameters and computing an initial individual fitness standard through the performance prediction model;
    randomly selecting n groups of configuration parameters from the training data set as an initial population P, and performing a random crossover operation and a mutation operation with a set mutation rate on each individual in P;
    using the performance prediction model to compute the fitness of population P and its offspring, screening out the individuals whose fitness is higher than the initial individual fitness standard to form a new population P', taking the fitness of the fittest individual as the new fitness standard, and finding the fittest individual through iterative computation, that individual corresponding to the optimal configuration parameters of the stream processing framework.
  3. The method according to claim 2, wherein, for the selected n groups of configuration parameters, n is greater than 1/5 of the number of training samples in the training data set, and the mutation rate is set to 0.02.
  4. The method according to claim 1, wherein the execution performance of the stream processing framework includes data processing latency and data throughput, and the individual fitness is the ratio of data processing latency to throughput.
  5. The method according to claim 1, wherein the stream processing framework comprises the Structured Streaming framework, the Flink framework, or the Storm framework.
  6. The method according to claim 1, wherein the stream processing framework is the Structured Streaming framework, and constructing the training data set comprises:
    selecting, according to their degree of influence on execution performance, the parameters that significantly affect the performance of the upper-layer Structured Streaming framework and the underlying Spark;
    automatically generating and assigning parameters for runs of the program to be optimized according to the selected parameters;
    after each program run, collecting the runtime data processing latency and throughput of the Structured Streaming framework together with the parameter combination used as one sample data item in the training data set.
  7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
  8. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 6.
PCT/CN2021/124402 2020-11-12 2021-10-18 An SVM-based automatic tuning method for a stream processing framework WO2022100370A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011261446.1 2020-11-12
CN202011261446.1A CN114489574B (zh) 2020-11-12 An SVM-based automatic tuning method for a stream processing framework

Publications (1)

Publication Number Publication Date
WO2022100370A1 true WO2022100370A1 (zh) 2022-05-19

Family

ID=81490256

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124402 WO2022100370A1 (zh) 2020-11-12 2021-10-18 An SVM-based automatic tuning method for a stream processing framework

Country Status (2)

Country Link
CN (1) CN114489574B (zh)
WO (1) WO2022100370A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130198119A1 (en) * 2012-01-09 2013-08-01 DecisionQ Corporation Application of machine learned bayesian networks to detection of anomalies in complex systems
CN106648654A (zh) * 2016-12-20 2017-05-10 Shenzhen Institutes of Advanced Technology A data-aware automatic optimization method for Spark configuration parameters
CN110086731A (zh) * 2019-04-25 2019-08-02 Beijing Institute of Computer Technology and Applications A method for stable collection of network data under a cloud architecture

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121999A (zh) * 2017-12-10 2018-06-05 Beijing University of Technology Support vector machine parameter selection method based on a hybrid bat algorithm
US11544621B2 (en) * 2019-03-26 2023-01-03 International Business Machines Corporation Cognitive model tuning with rich deep learning knowledge
CN111612528A (zh) * 2020-04-30 2020-09-01 China Mobile Group Jiangsu Co., Ltd. Method, apparatus, device, and storage medium for determining a user classification model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG DONG: "Parallel Analysis on Power Grid Equipment Monitoring Big Data Based on Storm Framework", Master Thesis, Tianjin Polytechnic University, CN, no. 1, 15 January 2020 (2020-01-15), XP055930023, ISSN: 1674-0246 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117252114A (zh) * 2023-11-17 2023-12-19 Hunan Hualing Wire and Cable Co., Ltd. A genetic-algorithm-based cable torsion resistance test method
CN117252114B (zh) * 2023-11-17 2024-02-13 Hunan Hualing Wire and Cable Co., Ltd. A genetic-algorithm-based cable torsion resistance test method

Also Published As

Publication number Publication date
CN114489574A (zh) 2022-05-13
CN114489574B (zh) 2022-10-14

Similar Documents

Publication Publication Date Title
JP7170779B2 (ja) Methods and systems for automatic intent mining, classification and disposition
Nguyen et al. Pay-as-you-go reconciliation in schema matching networks
WO2022111125A1 (zh) An automatic tuning method for a graph data processing framework based on random forest
JP7392668B2 (ja) Data processing method and electronic device
US11595415B2 (en) Root cause analysis in multivariate unsupervised anomaly detection
CN105868334B (zh) 一种基于特征递增型的电影个性化推荐方法及系统
US20160110657A1 (en) Configurable Machine Learning Method Selection and Parameter Optimization System and Method
WO2023124029A1 (zh) Training method for deep learning model, and content recommendation method and apparatus
US10373071B2 (en) Automated intelligent data navigation and prediction tool
US20200089832A1 (en) Application- or algorithm-specific quantum circuit design
US9852390B2 (en) Methods and systems for intelligent evolutionary optimization of workflows using big data infrastructure
US11429623B2 (en) System for rapid interactive exploration of big data
US10762166B1 (en) Adaptive accelerated yield analysis
US9582586B2 (en) Massive rule-based classification engine
US20230342359A1 (en) System and method for machine learning for system deployments without performance regressions
CN116057518A (zh) 使用机器学习模型的自动查询谓词选择性预测
Nagesh et al. High performance computation of big data: performance optimization approach towards a parallel frequent item set mining algorithm for transaction data based on hadoop MapReduce framework
WO2022011553A1 (en) Feature interaction via edge search
WO2022100370A1 (zh) An SVM-based automatic tuning method for a stream processing framework
WO2023174189A1 (zh) Node classification method, apparatus, device, and storage medium for graph network model
CN107679107A (zh) A graph-database-based reachability query method and system for power grid equipment
US20230186074A1 (en) Fabricating data using constraints translated from trained machine learning models
US9928327B2 (en) Efficient deployment of table lookup (TLU) in an enterprise-level scalable circuit simulation architecture
Oo et al. Hyperparameters optimization in scalable random forest for big data analytics
JP2022189805A (ja) Computer-implemented method, information processing system, and computer program (performance monitoring in the anomaly detection domain for IT environments)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21890897

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21890897

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 111223)