CN111858365A - A method and equipment for Flink K-Means performance test - Google Patents
- Publication number
- CN111858365A (application CN202010724528.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- flink
- parameters
- test
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
Technical Field
The invention relates to the computer field, and more particularly to a method and device for Flink K-Means performance testing.
Background Art
With the rapid growth of big data in recent years, many popular open source communities have emerged, notably Hadoop and Storm, followed by Spark and Flink. With its rapid development in recent years, Apache Flink has become the mainstream choice of users in the real-time computing field.
Apache Flink is a distributed, high-performance, highly available, and highly accurate open source stream processing framework built for data stream applications. At its core, Flink provides data distribution, communication, and fault-tolerant distributed computation over data streams. On top of its stream processing engine, Flink also provides unified batch and stream computing and SQL support. Flink is becoming increasingly mature, is gradually gaining the upper hand in its competition with Spark, and is a rising star in real-time processing. Apache Flink is an open source stream processing framework developed by the Apache Software Foundation; its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and stream processing programs. In addition, the Flink runtime natively supports the execution of iterative algorithms. The K-means clustering algorithm, also called the k-means algorithm, is a simple and classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm treats a cluster as a group of objects that lie close together, so its goal is to obtain compact and well-separated clusters.
HiBench is Intel's open source big data benchmarking tool, which evaluates the speed, throughput, and system resource utilization of different big data frameworks. Its workloads include Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight, and enhanced DFSIO. For the Flink framework, however, it currently supports only streaming tests; the batch computing mode of Flink's distributed dataflow engine cannot be tested. The present invention enables efficient batch performance testing of Flink K-Means.
Summary of the Invention
In view of this, the purpose of the embodiments of the present invention is to propose a method and device for Flink K-Means performance testing. With the method of the present invention, the performance of Flink batch K-Means can be tested, each node of the Flink cluster can be tested, and the test results can be analyzed and aggregated to confirm whether the Flink distributed real-time processing engine meets current production requirements and whether the physical machines, memory, CPU, and disks are sufficient for Flink online production.
To that end, one aspect of the embodiments of the present invention provides a method for Flink K-Means performance testing, comprising the following steps:
defining the parameters required to generate data based on the k-means clustering algorithm, parsing the parameters, and generating raw data based on the parsed parameters;
converting the raw data into the data format required for testing;
testing Flink K-Means on the converted data and displaying the test results graphically;
analyzing specific parameters of the test results to judge whether the Flink distributed real-time processing engine meets current production requirements.
According to an embodiment of the present invention, defining the parameters required to generate data based on the k-means clustering algorithm, parsing the parameters, and generating raw data based on the parsed parameters includes:
selecting a point and adding it to the center set S;
obtaining the mean of all points in the center set S in each dimension, and computing a new point as the per-dimension mean plus a variance term;
adding the new point to the center set S and repeating the previous step until enough initial centers have been obtained;
generating points around each initial center from a Gaussian distribution, dividing the number of points to be generated into partitions, and writing the data points generated for each partition to the result to produce the raw data.
According to an embodiment of the present invention, the parameters include the file output location, the number of dimensions, the number of data points, the number of clusters, the minimum distance from the mean of all centers, and the standard deviation of the data points.
According to an embodiment of the present invention, the specific parameters include the total throughput, the average latency, the error rate, and the throughput, latency, and error rate averaged over equal time intervals.
According to an embodiment of the present invention, the data format includes a data volume of 500G, 100 threads, and a k value of 3.
Another aspect of the embodiments of the present invention provides a device for Flink K-Means performance testing, the device including:
a parsing module configured to define the parameters required to generate data based on the k-means clustering algorithm, parse the parameters, and generate raw data based on the parsed parameters;
a conversion module configured to convert the raw data into the data format required for testing;
a test module configured to test Flink K-Means on the converted data and display the test results graphically;
an analysis module configured to analyze specific parameters of the test results to judge whether the Flink distributed real-time processing engine meets current production requirements.
According to an embodiment of the present invention, the parsing module is further configured to:
select a point and add it to the center set S;
obtain the mean of all points in the center set S in each dimension, and compute a new point as the per-dimension mean plus a variance term;
add the new point to the center set S and repeat the previous step until enough initial centers have been obtained;
generate points around each initial center from a Gaussian distribution, divide the number of points to be generated into partitions, and write the data points generated for each partition to the result to produce the raw data.
According to an embodiment of the present invention, the parameters include the file output location, the number of dimensions, the number of data points, the number of clusters, the minimum distance from the mean of all centers, and the standard deviation of the data points.
According to an embodiment of the present invention, the specific parameters include the total throughput, the average latency, the error rate, and the throughput, latency, and error rate averaged over equal time intervals.
According to an embodiment of the present invention, the data format includes a data volume of 500G, 100 threads, and a k value of 3.
The present invention has the following beneficial technical effects. The method for Flink K-Means performance testing provided by the embodiments defines the parameters required to generate data based on the k-means clustering algorithm, parses the parameters, and generates raw data based on the parsed parameters; converts the raw data into the data format required for testing; tests Flink K-Means on the converted data and displays the test results graphically; and analyzes specific parameters of the test results to judge whether the Flink distributed real-time processing engine meets current production requirements. With this technical solution, the performance of Flink batch K-Means can be tested, each node of the Flink cluster can be tested, and the test results can be analyzed and aggregated to confirm whether the Flink distributed real-time processing engine meets current production requirements and whether the physical machines, memory, CPU, and disks are sufficient for Flink online production.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for Flink K-Means performance testing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a device for Flink K-Means performance testing according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to specific embodiments and the accompanying drawings.
To that end, a first aspect of the embodiments of the present invention proposes an embodiment of a method for Flink K-Means performance testing. FIG. 1 shows a schematic flow chart of the method.
As shown in FIG. 1, the method may include the following steps:
S1: define the parameters required to generate data based on the k-means clustering algorithm, parse the parameters, and generate raw data based on the parsed parameters;
S2: convert the raw data into the data format required for testing;
S3: test Flink K-Means on the converted data and display the test results graphically;
S4: analyze specific parameters of the test results to judge whether the Flink distributed real-time processing engine meets current production requirements.
With the technical solution of the present invention, the performance of Flink batch K-Means can be tested, each node of the Flink cluster can be tested, and the test results can be analyzed and aggregated to confirm whether the Flink distributed real-time processing engine meets current production requirements and whether the physical machines, memory, CPU, and disks are sufficient for Flink online production.
In a preferred embodiment of the present invention, defining the parameters required to generate data based on the k-means clustering algorithm, parsing the parameters, and generating raw data based on the parsed parameters includes:
selecting a point and adding it to the center set S;
obtaining the mean of all points in the center set S in each dimension, and computing a new point as the per-dimension mean plus a variance term;
adding the new point to the center set S and repeating the previous step until enough initial centers have been obtained;
generating points around each initial center from a Gaussian distribution, dividing the number of points to be generated into partitions, and writing the data points generated for each partition to the result to produce the raw data.
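The center-selection loop in the steps above can be sketched in plain Java as follows. This is a minimal illustration, not the patented module's source: the class name `CenterSketch`, the method `generateCenters`, the `seed` argument, and the way the first point is drawn are all assumptions made for the example. The offset term follows the formula given in the description, minDistance + minDistance * N(0, 1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Sketch of the initial-center selection loop; all names here are hypothetical. */
final class CenterSketch {

    /** Returns k centers, each placed relative to the mean of the centers chosen so far. */
    static List<double[]> generateCenters(int k, int dimension, double minDistance, long seed) {
        Random rnd = new Random(seed);
        List<double[]> centers = new ArrayList<>();

        // Step 1: choose one starting point and add it to the center set S.
        // Assumption: the first point is uniform in [0, minDistance) per dimension.
        double[] first = new double[dimension];
        for (int d = 0; d < dimension; d++) {
            first[d] = rnd.nextDouble() * minDistance;
        }
        centers.add(first);

        // Steps 2-3: repeat until S holds k centers.
        while (centers.size() < k) {
            double[] next = new double[dimension];
            for (int d = 0; d < dimension; d++) {
                // Mean of dimension d over all centers currently in S.
                double sum = 0.0;
                for (double[] c : centers) {
                    sum += c[d];
                }
                double mean = sum / centers.size();
                // Offset (the "variance" term in the text).
                double offset = minDistance + minDistance * rnd.nextGaussian();
                next[d] = mean + offset;
            }
            centers.add(next);
        }
        return centers;
    }

    public static void main(String[] args) {
        for (double[] c : generateCenters(3, 2, 10.0, 42L)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

Fixing the seed keeps a generated data set reproducible between test runs, which matters when comparing performance numbers across cluster configurations.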
The Flink K-Means data generation module, flink-data-generator, is developed in Java. Its core is a distributed K-means data generator that uses Java IO stream programming and the original K-means algorithm to generate simulated data in a distributed manner. The module source must be compiled with mvn clean package to produce data-generator.jar, and the jar must be copied to ${FLINK_HOME}/examples/batch.
Specifically, the module is defined to receive the parameters output (file output location), number of dimensions, number of data points, number of clusters, minimum distance from the mean of all centers, and standard deviation of the data points; it parses these parameters and initializes itself. Next, a point is selected and added to the center set S, and the next point is created: the mean of all points in S is obtained in each dimension, and the new point is computed as the per-dimension mean plus a variance term, where variance = minDistance + (minDistance * rnd.nextGaussian()). The computed point is added to S, and the loop repeats from the previous step until enough initial centers have been obtained. Points are then generated around each initial center from a Gaussian distribution; the number of points to be generated is divided into partitions, and each partition generates its data points and writes them to the result. Data generation mainly uses classes such as BufferedWriter and FileWriter from java.io and the Random class from java.util. The data format is then converted: the data generated in the preceding steps is taken as input and formatted, and the data output is defined with the Flink DataSet API through the "--output" parameter. The file sink is lazy, and the output location is determined by that parameter.
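The partitioned generation and writing step described above might look like the following Java sketch. It is written under stated assumptions: the class `PointWriterSketch` and its methods are hypothetical, the Gaussian is taken to be isotropic with one shared standard deviation, points are assigned to centers round-robin, and the output is one space-separated point per line on the local file system rather than HDFS. It does use the `BufferedWriter`, `FileWriter`, and `Random` classes that the description names.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;

/** Sketch of partitioned point generation; names and file layout are hypothetical. */
final class PointWriterSketch {

    /** Splits totalPoints across partitions as evenly as possible. */
    static long[] splitPoints(long totalPoints, int partitions) {
        long[] counts = new long[partitions];
        long base = totalPoints / partitions;
        long remainder = totalPoints % partitions;
        for (int i = 0; i < partitions; i++) {
            // The first `remainder` partitions take one extra point each.
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    /** Draws one point from a Gaussian centered on `center` with the given standard deviation. */
    static double[] samplePoint(double[] center, double stdDev, Random rnd) {
        double[] p = new double[center.length];
        for (int d = 0; d < center.length; d++) {
            p[d] = center[d] + stdDev * rnd.nextGaussian();
        }
        return p;
    }

    /** Writes `count` points for one partition, cycling through the centers. */
    static void writePartition(String path, double[][] centers, long count,
                               double stdDev, long seed) throws IOException {
        Random rnd = new Random(seed);
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            for (long i = 0; i < count; i++) {
                double[] center = centers[(int) (i % centers.length)];
                double[] p = samplePoint(center, stdDev, rnd);
                StringBuilder line = new StringBuilder();
                for (int d = 0; d < p.length; d++) {
                    if (d > 0) {
                        line.append(' ');
                    }
                    line.append(p[d]);
                }
                out.write(line.toString());
                out.newLine();
            }
        }
    }

    public static void main(String[] args) {
        // Ten points over three partitions.
        System.out.println(Arrays.toString(splitPoints(10, 3)));
    }
}
```

Giving each partition its own seed and its own output file lets the partitions run independently, which is the property a distributed generator relies on.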
In a preferred embodiment of the present invention, the parameters include the file output location, the number of dimensions, the number of data points, the number of clusters, the minimum distance from the mean of all centers, and the standard deviation of the data points.
In a preferred embodiment of the present invention, the specific parameters include the total throughput, the average latency, the error rate, and the throughput, latency, and error rate averaged over equal time intervals. Judging from these parameters whether the Flink distributed real-time processing engine meets current production requirements includes:
1. While the test task runs, the test data analysis module calls the Flink RESTful web interface GET /v1/jobs/<jobid> and accumulates, for fixed time windows within the task, the throughput (average QPS/TPS, in records per minute), the latency, and the total number of errors. When the test ends, it records the execution time. Records are output in real time, and a queue service is used when the data volume is large;
2. Based on the total record count and the execution time, a trimmed mean is used to compute the total throughput, the average latency, and the error rate, as well as the throughput, latency, and error rate averaged over equal time intervals;
3. When the test ends, the results are exported through an interface to an Excel file in a fixed format as the result of a single test, and this result is used to judge whether the Flink distributed real-time processing engine meets current production requirements.
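The trimmed-mean aggregation in step 2 can be illustrated with a short Java sketch. The description does not state what fraction is trimmed, so `trimFraction` is an assumed parameter, and the class and method names are hypothetical. A trimmed mean sorts the per-window samples, drops the same share from both ends, and averages what remains, which damps warm-up and straggler windows.

```java
import java.util.Arrays;

/** Sketch of the per-window aggregation; names and the trim fraction are assumptions. */
final class TrimmedMeanSketch {

    /** Mean after discarding the lowest and highest trimFraction share of the samples. */
    static double trimmedMean(double[] samples, double trimFraction) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (trimFraction < 0.0 || trimFraction >= 0.5) {
            throw new IllegalArgumentException("trimFraction must be in [0, 0.5)");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Number of samples dropped from each end.
        int cut = (int) Math.floor(sorted.length * trimFraction);
        double sum = 0.0;
        for (int i = cut; i < sorted.length - cut; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2 * cut);
    }

    /** Records per minute over the whole run. */
    static double totalThroughput(long totalRecords, double elapsedMinutes) {
        return totalRecords / elapsedMinutes;
    }

    /** Share of records that failed; zero when nothing was processed. */
    static double errorRate(long errors, long totalRecords) {
        return totalRecords == 0 ? 0.0 : (double) errors / totalRecords;
    }

    public static void main(String[] args) {
        // Hypothetical per-window latencies in milliseconds, with one outlier.
        double[] latencies = {1.0, 2.0, 4.0, 100.0};
        System.out.println(trimmedMean(latencies, 0.25));
    }
}
```

With four samples and a trim fraction of 0.25, one sample is dropped from each end, so the outlier of 100.0 does not affect the reported average.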
In a preferred embodiment of the present invention, the data format includes a data volume of 500G, 100 threads, and a k value of 3. Simulated data is produced with data-generator.jar; the data volume and thread count can be customized, and the output location, record count, k value, and so on can be set, so test data is generated accurately and efficiently.
flink run -c DistributedDataGenerator data-generator-1.0-SNAPSHOT.jar --output hdfs://xx.xx.xx.xx:9000/flink/kmeans --d 100 --size 500000000 --k 3
Here, flink is the Flink command; the data volume is 500G, the thread count is 100, and the k value is 3.
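How the generator might read the `--output`, `--d`, `--size`, and `--k` flags shown in the command above is sketched below in plain Java. This is an assumption-laden illustration: the class `GeneratorArgsSketch` is hypothetical, and the real module may well use a utility such as Flink's `ParameterTool` instead. The sketch keeps the raw flag names and does not interpret what `--d` stands for.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of "--key value" argument parsing; the class name is hypothetical. */
final class GeneratorArgsSketch {

    /** Turns {"--k", "3", ...} into {k=3, ...}; rejects a flag without a value. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("expected --key, got: " + args[i]);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value for " + args[i]);
            }
            // Store the key without its leading dashes and consume the value.
            params.put(args[i].substring(2), args[++i]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse(new String[] {
            "--output", "hdfs://xx.xx.xx.xx:9000/flink/kmeans",
            "--d", "100",
            "--size", "500000000",
            "--k", "3"
        });
        long size = Long.parseLong(p.get("size"));
        int k = Integer.parseInt(p.get("k"));
        System.out.println("size=" + size + ", k=" + k + ", output=" + p.get("output"));
    }
}
```

Parsing `--size` as a `long` leaves headroom for point counts beyond the `int` range when larger data sets are generated.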
Finally, the data is written to Hadoop storage at hdfs://xx.xx.xx.xx:9000/flink/kmeans. Output is performed through an interface call that follows the RESTful interface specification, and this function is packaged as a module so that data output is automated.
In a preferred embodiment of the present invention, testing Flink K-Means on the converted data includes: running the Flink K-Means test module, using the bundled K-Means application KMeans.jar under ${FLINK_HOME}/examples/batch from the CLI, and storing the test results in the distributed system Hadoop.
flink run KMeans.jar --points hdfs://xx.xx.xx.xx:9000/flink/kmeans --output hdfs://xx.xx.xx.xx:9000/flink/kmeans/Output --k 3 --iterations 20
Here, flink is the Flink command; the input source is the data generated in step one and stored in Hadoop; the result data is written back to Hadoop; and iterations is the number of iterations.
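For readers unfamiliar with what the `--iterations` setting controls, the following single-machine Java sketch shows the standard K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, for a fixed number of rounds. It is not the Flink KMeans.jar implementation, which distributes these steps across the cluster; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Single-machine sketch of the K-means loop; not the Flink implementation. */
final class KMeansSketch {

    /** Index of the centroid closest to p by squared Euclidean distance. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Runs a fixed number of assign-and-update rounds and returns the final centroids. */
    static double[][] run(double[][] points, double[][] initial, int iterations) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: accumulate each point into its nearest cluster.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty cluster's centroid to its mean.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(run(points, initial, 20)));
    }
}
```

Because the round count is fixed rather than convergence-driven, the amount of work per run is predictable, which suits a benchmark.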
In a preferred embodiment of the present invention, displaying the test results graphically includes:
1. Configure Flink metrics monitoring to observe the Flink test process:
metrics.scope.jm, metrics.scope.jm.job,
metrics.scope.tm, metrics.scope.tm.job, and so on.
These are Flink's metrics configuration items; they allow metric data for Flink jobs, tasks, and so on to be monitored and collected;
2. Configure Flink to push monitoring information to Prometheus:
metrics.reporter.promgateway.class, metrics.reporter.promgateway.host, and so on are the Flink metrics configuration items that supply the Prometheus service information;
3. Configure Prometheus: download and install the Prometheus components, then configure Prometheus;
4. The Flink service calls its built-in metrics interface to collect monitoring information and at the same time calls the Prometheus RESTful web interface to push the monitoring data to Prometheus in real time. Monitoring results such as throughput and latency can then be viewed graphically and in real time through the Prometheus UI.
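The configuration keys named in steps 1 and 2 would typically live in flink-conf.yaml. The fragment below is a sketch, not the configuration used in the patent: the scope values are the format strings Flink documents as defaults, the host is the same placeholder used elsewhere in this description, the port is the usual Pushgateway port, and the job name is hypothetical. Exact reporter keys differ between Flink releases, so the documentation for the installed version should be checked.

```yaml
# flink-conf.yaml (fragment, illustrative values only)

# Step 1: metric scope formats for the JobManager and TaskManager.
metrics.scope.jm: <host>.jobmanager
metrics.scope.jm.job: <host>.jobmanager.<job_name>
metrics.scope.tm: <host>.taskmanager.<tm_id>
metrics.scope.tm.job: <host>.taskmanager.<tm_id>.<job_name>

# Step 2: push metrics to a Prometheus Pushgateway.
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: xx.xx.xx.xx            # placeholder address
metrics.reporter.promgateway.port: 9091                   # assumed Pushgateway port
metrics.reporter.promgateway.jobName: flink-kmeans-test   # hypothetical job name
```

A push-based reporter fits batch jobs, since a short-lived job may finish before a pull-based scrape reaches it.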
With the technical solution of the present invention, the performance of Flink batch K-Means can be tested, each node of the Flink cluster can be tested, and the test results can be analyzed and aggregated to confirm whether the Flink distributed real-time processing engine meets current production requirements and whether the physical machines, memory, CPU, and disks are sufficient for Flink online production.
It should be noted that those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The computer program embodiments can achieve effects identical or similar to those of any corresponding method embodiment described above.
In addition, the methods disclosed in the embodiments of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the functions defined in the methods disclosed in the embodiments of the present invention.
To that end, a second aspect of the embodiments of the present invention proposes a device for Flink K-Means performance testing. As shown in FIG. 2, the device 200 includes:
a parsing module configured to define the parameters required to generate data based on the k-means clustering algorithm, parse the parameters, and generate raw data based on the parsed parameters;
a conversion module configured to convert the raw data into the data format required for testing;
a test module configured to test Flink K-Means on the converted data and display the test results graphically;
an analysis module configured to analyze specific parameters of the test results to judge whether the Flink distributed real-time processing engine meets current production requirements.
In a preferred embodiment of the present invention, the parsing module is further configured to:
select a point and add it to the center set S;
obtain the mean of all points in the center set S in each dimension, and compute a new point as the per-dimension mean plus a variance term;
add the new point to the center set S and repeat the previous step until enough initial centers have been obtained;
generate points around each initial center from a Gaussian distribution, divide the number of points to be generated into partitions, and write the data points generated for each partition to the result to produce the raw data.
In a preferred embodiment of the present invention, the parameters include the file output location, the number of dimensions, the number of data points, the number of clusters, the minimum distance from the mean of all centers, and the standard deviation of the data points.
In a preferred embodiment of the present invention, the specific parameters include the total throughput, the average latency, the error rate, and the throughput, latency, and error rate averaged over equal time intervals.
In a preferred embodiment of the present invention, the data format includes a data volume of 500G, 100 threads, and a k value of 3.
It should be particularly noted that the above system embodiment uses the above method embodiments to describe how each module works; those skilled in the art can readily apply these modules to other embodiments of the method.
In addition, the above method steps and system units or modules can also be implemented with a controller and a computer-readable storage medium that stores a computer program causing the controller to perform the functions of those steps, units, or modules.
Those skilled in the art will also appreciate that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with this disclosure may be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, the various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each specific application, but such implementation decisions should not be interpreted as departing from the scope disclosed by the embodiments of the present invention.
The above embodiments, particularly any "preferred" embodiments, are possible examples of implementations and are presented merely for a clear understanding of the principles of the invention. Many changes and modifications may be made to the above embodiments without departing from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure and protected by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010724528.9A CN111858365A (en) | 2020-07-24 | 2020-07-24 | A method and equipment for Flink K-Means performance test |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111858365A (en) | 2020-10-30 |
Family
ID=72951203
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010724528.9A Withdrawn CN111858365A (en) | 2020-07-24 | 2020-07-24 | A method and equipment for Flink K-Means performance test |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111858365A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114490371A (en) * | 2022-01-20 | 2022-05-13 | 中国平安人寿保险股份有限公司 | Data testing method, device, testing equipment and medium based on artificial intelligence |
| CN114880103A (en) * | 2022-07-11 | 2022-08-09 | 中电云数智科技有限公司 | System and method for adapting flink task to hadoop ecology |
| CN114880103B (en) * | 2022-07-11 | 2022-09-09 | 中电云数智科技有限公司 | A system and method for adapting flink tasks to the hadoop ecosystem |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 2020-10-30 |