CN111858365A - Method and equipment for testing performance of Flink K-Means

Info

Publication number: CN111858365A
Application number: CN202010724528.9A
Authority: CN
Inventors: 蔡丽敏
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-10-30

Abstract

The invention provides a method and equipment for testing the performance of Flink K-Means, wherein the method comprises the following steps: defining parameters needed by data generation based on a k-means clustering algorithm, analyzing the parameters and generating original data based on the analyzed parameters; carrying out format conversion on the original data to form a data format required by the test; testing the Flink K-Means based on the data after format conversion, and graphically displaying the test result; and analyzing specific parameters of the test result to judge whether the Flink distributed real-time processing engine can meet the current production requirement. By using the scheme of the invention, the performance of the Flink batch K-Means can be tested, each node of the Flink cluster can be tested, the test data result can be analyzed and counted to determine whether the Flink distributed real-time processing engine can meet the current production requirement, and whether the physical machine, the memory, the CPU and the hard disk can meet the on-line production requirement of the Flink.

Description

Method and equipment for testing performance of Flink K-Means

Technical Field

The present invention relates to the field of computers, and more particularly to a method and apparatus for Flink K-Means performance testing.

Background

In recent years, the rapid development of big data has given rise to a number of popular open-source communities, well-known examples being Hadoop, Storm, Spark, and Flink. With its rapid development in recent years, Apache Flink has become a mainstream choice for users in the field of real-time computing.

Apache Flink is a distributed, high-performance, highly available, and accurate open-source stream processing framework for data streaming applications. At its core is a distributed computing engine that provides data distribution, communication, and fault tolerance over data streams. On top of this stream processing engine, Flink also provides unified batch and stream computing as well as SQL support. As the technology has matured, Flink has gradually gained the upper hand in its competition with Spark and has become a popular new choice in the real-time processing field. Flink is developed by the Apache Software Foundation, and its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and stream processing programs. In addition, the Flink runtime natively supports the execution of iterative algorithms. The K-means clustering algorithm, also called the K-average clustering algorithm, is a simple, classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm treats a cluster as a group of objects that lie close together, so its final goal is to obtain compact, well-separated clusters.

HiBench is a big data benchmark suite open-sourced by Intel that can evaluate the speed, throughput, and system resource utilization of different big data frameworks. Its workloads include Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight, enhanced DFSIO, and others. For the Flink framework, however, only streaming computation tests are currently supported; the batch computing mode of the Flink distributed dataflow engine cannot be tested.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a method and a device for testing Flink K-Means performance. By using the method of the present invention, it is possible to test the performance of Flink batch K-Means, test each node of the Flink cluster, and analyze and aggregate the test results to determine whether the Flink distributed real-time processing engine can meet the current production requirement, and whether the physical machines, memory, CPUs, and hard disks can meet the requirements for bringing Flink into production.

In view of the above object, an aspect of the embodiments of the present invention provides a method for testing performance of Flink K-Means, comprising the following steps:

defining parameters needed by data generation based on a k-means clustering algorithm, analyzing the parameters and generating original data based on the analyzed parameters;

carrying out format conversion on the original data to form a data format required by the test;

testing the Flink K-Means based on the data after format conversion, and graphically displaying the test result;

and analyzing specific parameters of the test result to judge whether the Flink distributed real-time processing engine can meet the current production requirement.

According to an embodiment of the present invention, defining parameters required for generating data based on a k-means clustering algorithm, analyzing the parameters and generating raw data based on the analyzed parameters includes:

selecting a point and adding the point to a central set S;

acquiring the mean value of all points of each dimension in the central set S, and calculating a new point through the mean value of the dimension plus the variance;

adding a new point to the center set S, and circulating the previous step until enough initial centers are obtained;

generating points around the initial center from the initial center through Gaussian distribution, dividing the number of the points to be generated, and writing the data points generated by each partition into the result to generate original data.

According to one embodiment of the invention, the parameters include file output location, dimensionality, number of data points, number of clusters, minimum distance to all central means, and standard deviation of data points.

According to one embodiment of the invention, the specific parameters include the total throughput, the average number of delays, the error rate and the average discontinuous throughput, delay, error rate for the same time.

According to one embodiment of the invention, the data format includes a data volume of 500G, a thread number of 100, and a k value of 3.

In another aspect of the embodiments of the present invention, there is also provided an apparatus for performing a Flink K-Means performance test, the apparatus including:

the parsing module is configured to define parameters required for generating data based on a k-means clustering algorithm, analyze the parameters and generate original data based on the analyzed parameters;

the conversion module is configured to convert the format of the original data to form a data format required by the test;

the testing module is configured to test the Flink K-Means based on the data after format conversion, and graphically display a testing result;

and the analysis module is configured to analyze specific parameters of the test result so as to judge whether the Flink distributed real-time processing engine can meet the current production requirement.

According to an embodiment of the invention, the parsing module is further configured to:

selecting a point and adding the point to a central set S;

The invention has the following beneficial technical effects. In the method for testing the performance of Flink K-Means provided by the embodiments of the invention, parameters needed for data generation are defined based on a k-means clustering algorithm, the parameters are analyzed, and original data are generated based on the analyzed parameters; format conversion is carried out on the original data to form the data format required by the test; Flink K-Means is tested based on the format-converted data, and the test result is displayed graphically; and specific parameters of the test result are analyzed to judge whether the Flink distributed real-time processing engine can meet the current production requirement. With this technical scheme, the performance of Flink batch K-Means can be tested, each node of a Flink cluster can be tested, and the test results can be analyzed and aggregated to confirm whether the Flink distributed real-time processing engine can meet the current production requirement and whether the physical machines, memory, CPUs, and hard disks can meet the requirements for bringing Flink into production.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a schematic flow chart diagram of a method of Flink K-Means performance testing in accordance with one embodiment of the present invention;

FIG. 2 is a schematic diagram of a device for Flink K-Means performance testing according to one embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

In view of the above objects, a first aspect of embodiments of the present invention provides an embodiment of a method for performing a performance test of Flink K-Means. Fig. 1 shows a schematic flow diagram of the method.

As shown in fig. 1, the method may include the steps of:

S1: defining parameters needed for data generation based on a k-means clustering algorithm, analyzing the parameters and generating original data based on the analyzed parameters;

S2: converting the format of the original data to form the data format required by the test;

S3: testing the Flink K-Means based on the data after format conversion, and graphically displaying the test result;

S4: analyzing the specific parameters of the test result to judge whether the Flink distributed real-time processing engine can meet the current production requirement.

By the technical scheme, the invention can test the performance of the Flink batch K-Means, test each node of the Flink cluster, analyze and count the test data result to determine whether the Flink distributed real-time processing engine can meet the current production requirement and whether the physical machine, the memory, the CPU and the hard disk can meet the on-line production requirement of the Flink.

In a preferred embodiment of the present invention, defining parameters required for generating data based on a k-means clustering algorithm, and analyzing the parameters and generating raw data based on the analyzed parameters includes:

selecting a point and adding the point to a central set S;

The Flink K-Means data generation module is developed in the Java language. Its core is a distributed K-means data generator, which generates simulation data in a distributed manner using Java IO stream programming and the original K-means algorithm. The module source code needs to be compiled with mvn clean package to produce the data generator jar, and the jar package is then copied to ${FLINK_HOME}/examples/batch.

Specifically, the module receives parameters such as the file output location, dimension, number of data points, number of clusters, minimum distance from all central means, and standard deviation of data points, then parses and initializes them. Next, a point is selected and added to a center set S, and the next point is created as follows: the mean of all points in S is obtained for each dimension, and a new point is calculated as the dimension mean plus the variance, where the variance is minDistance + (minDistance * rnd.nextGaussian()). The new point is added to S, and the previous step is repeated until enough initial centers are obtained. Points are then generated around the initial centers using a Gaussian distribution: the number of points to be generated is divided among partitions, each partition generates its data points, and the results are written out. Data generation mainly uses the BufferedWriter and FileWriter classes of the java.io package and the Random class of java.util.

The data format is then converted: the data generated in the previous step are taken as input and formatted, and the data output '--output' is defined using the Flink DataSet API. The file sink is lazy, so the output location supplied by this parameter is written only when execution is triggered.
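The center seeding and point generation described above can be sketched in Java as follows. This is a minimal single-threaded illustration and not the module's source code: the class and method names are hypothetical, the first center is assumed to be drawn uniformly from [0, 1) in each dimension, the offset is assumed to be minDistance + minDistance * rnd.nextGaussian(), and the distributed partitioning is replaced by a simple round-robin loop over the centers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; names and the seeding of the first center are assumptions. */
final class KMeansDataSketch {

    /** Builds k initial centers: each new center is the per-dimension mean of S plus a random offset. */
    static List<double[]> initialCenters(int k, int dim, double minDistance, Random rnd) {
        List<double[]> centers = new ArrayList<>();

        // Select a first point and add it to the center set S (assumed uniform in [0, 1)).
        double[] first = new double[dim];
        for (int d = 0; d < dim; d++) {
            first[d] = rnd.nextDouble();
        }
        centers.add(first);

        // Repeat until enough initial centers have been obtained.
        while (centers.size() < k) {
            double[] next = new double[dim];
            for (int d = 0; d < dim; d++) {
                double sum = 0.0;
                for (double[] c : centers) {
                    sum += c[d];
                }
                double mean = sum / centers.size();
                // The "variance" term of the description (assumed to be a product).
                double offset = minDistance + minDistance * rnd.nextGaussian();
                next[d] = mean + offset;
            }
            centers.add(next);
        }
        return centers;
    }

    /** Generates count points, each Gaussian-distributed around one of the centers. */
    static List<double[]> pointsAround(List<double[]> centers, int count, double stdDev, Random rnd) {
        List<double[]> points = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            double[] center = centers.get(i % centers.size()); // round-robin stand-in for partitions
            double[] point = new double[center.length];
            for (int d = 0; d < center.length; d++) {
                point[d] = center[d] + stdDev * rnd.nextGaussian();
            }
            points.add(point);
        }
        return points;
    }
}
```

In a distributed generator, each partition would presumably run the second method on its own share of the point count and write its points to the output location.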

In a preferred embodiment of the invention, the parameters include file output location, dimension, number of data points, number of clusters, minimum distance from all central means, and standard deviation of data points.

In a preferred embodiment of the invention, the specific parameters include the total throughput, the average number of delays, the error rate and the average discontinuous throughput, delay, error rate for the same time. Judging whether the Flink distributed real-time processing engine can meet the current production requirement or not based on the specific parameters comprises the following steps:

1. During execution of the test task, a test data analysis module calls the Flink RESTful web interface GET /v1/jobs/<jobid>, accumulates the throughput (AvgQPS/TPS, in records/min), delay, and total number of errors obtained for each fixed time period of the test task, records the execution time when the test finishes, and outputs the records in real time; when the data volume is large, a queue service is used;

2. From the total data and the execution time, a trimmed mean is used to calculate the average total throughput and delay, the error rate, and the average discontinuous throughput, delay, and error rate over the same time;

3. When the test finishes, the test result (a single test result) is output through an interface to an Excel file in a defined format, and whether the Flink distributed real-time processing engine can meet the current production requirement is judged from the test result.
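The accumulation and trimmed-mean steps above can be illustrated with a short Java sketch. It assumes, for illustration only, that the analysis module holds cumulative record counts sampled at a fixed interval; the class name, method names, and the trim fraction are hypothetical and do not come from the described module, and the REST polling itself is omitted.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and the sampling model are assumptions. */
final class ThroughputStatsSketch {

    /** Converts cumulative record counts, sampled every intervalMinutes, into records/min per period. */
    static double[] perIntervalRates(long[] cumulativeRecords, double intervalMinutes) {
        if (cumulativeRecords.length < 2 || intervalMinutes <= 0.0) {
            throw new IllegalArgumentException("need at least two samples and a positive interval");
        }
        double[] rates = new double[cumulativeRecords.length - 1];
        for (int i = 1; i < cumulativeRecords.length; i++) {
            rates[i - 1] = (cumulativeRecords[i] - cumulativeRecords[i - 1]) / intervalMinutes;
        }
        return rates;
    }

    /** Trimmed mean: sort, drop trimFraction of the samples at each end, average the rest. */
    static double trimmedMean(double[] samples, double trimFraction) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (trimFraction < 0.0 || trimFraction >= 0.5) {
            throw new IllegalArgumentException("trimFraction must be in [0, 0.5)");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int cut = (int) Math.floor(sorted.length * trimFraction);
        double sum = 0.0;
        for (int i = cut; i < sorted.length - cut; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2 * cut);
    }
}
```

For cumulative counts 0, 1000, 2100, 2300, and 7300 sampled once per minute, the per-period rates are 1000, 1100, 200, and 5000 records/min; trimming 25% from each end leaves 1000 and 1100, whose mean is 1050. Dropping the extremes keeps a single stalled or bursty period from dominating the reported average.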

In a preferred embodiment of the present invention, the data format includes a data volume of 500G, a thread number of 100, and a k value of 3. The data generator jar is used to generate the simulation data; the data volume and thread number can be user-defined, and the output location, number of threads, k value, and so on can be set, so that test data can be generated accurately and efficiently.

flink run -c DistributedDataGenerator data-generator-1.0-SNAPSHOT.jar --output hdfs://xx.xx.xx.xx:9000/flink/kmeans --d 100 --size 500000000 --k 3

Wherein: in this flink command, the data size is 500G, the thread number is 100, and the k value is 3.

Finally, the data are output to the Hadoop system and stored at hdfs://xx.xx.xx.xx:9000/flink/kmeans. The output is invoked through an interface that is developed to meet the RESTful interface specification, and the function is packaged into a module so that data output is performed automatically.

In a preferred embodiment of the present invention, testing the Flink K-Means based on the format-converted data comprises: running a Flink K-Means test module that uses the bundled K-Means application, ${FLINK_HOME}/examples/batch/KMeans.jar, through the CLI, and storing the test result in the Hadoop distributed system.

flink run KMeans.jar --points hdfs://xx.xx.xx.xx:9000/flink/kmeans --output hdfs://xx.xx.xx.xx:9000/flink/kmeans/Output --k 3 --iterations 20

Wherein: in this flink command, the data input source (--points) is the data generated in the first step and stored in Hadoop, the result data (--output) are likewise stored in Hadoop, and --iterations is the number of iterations.
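To make the role of the iteration count concrete, the following Java sketch shows what a single K-means iteration does in the standard algorithm: each point is assigned to its nearest centroid, and each centroid is then moved to the mean of its assigned points. It is a plain in-memory illustration with hypothetical class and method names, not the code of the bundled Flink example, which runs the equivalent steps as a distributed job.

```java
/** Illustrative sketch of one standard K-means iteration; not the Flink example's source. */
final class KMeansIterationSketch {

    /** Returns the index of the centroid closest to the point (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Performs one assignment and update step and returns the new centroids. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // Assignment step: accumulate each point into its nearest centroid's running sum.
        for (double[] p : points) {
            int c = nearest(p, centroids);
            counts[c]++;
            for (int d = 0; d < dim; d++) {
                sums[c][d] += p[d];
            }
        }

        // Update step: move each centroid to the mean of its points (unchanged if it has none).
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                next[c][d] = counts[c] == 0 ? centroids[c][d] : sums[c][d] / counts[c];
            }
        }
        return next;
    }
}
```

Calling iterate repeatedly, feeding each result back in as the centroids, corresponds to running the algorithm for a fixed number of iterations such as the 20 used in the command above.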

In a preferred embodiment of the present invention, graphically displaying the test results comprises:

1. Configure Flink metrics to monitor the Flink test process, using configuration items such as

metrics.scope.jm, metrics.scope.jm.job,

and other metrics items.

These Flink metrics configuration items make it possible to monitor and collect job, task, and other metric data of Flink;

2. Configure Flink to push monitoring information to Prometheus:

metrics.reporter.promgateway.host and the like are Flink configuration items for metrics, in which the Prometheus service information is configured;

3. Configure Prometheus: download and install the Prometheus components, and configure Prometheus;

4. The Flink service calls its built-in metrics interface to collect monitoring information and, at the same time, calls the Prometheus RESTful web interface to push the monitoring data to Prometheus in real time, so that monitoring results such as throughput and delay can be viewed graphically in real time through the Prometheus UI.

It should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, and the above programs may be stored in a computer-readable storage medium, and when executed, the programs may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.

In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided an apparatus for performing a Flink K-Means performance test. As shown in FIG. 2, the apparatus 200 includes:

In a preferred embodiment of the present invention, the parsing module is further configured to:

selecting a point and adding the point to a central set S;

In a preferred embodiment of the invention, the specific parameters include the total throughput, the average number of delays, the error rate and the average discontinuous throughput, delay, error rate for the same time.

In a preferred embodiment of the present invention, the data format includes a data volume of 500G, a thread number of 100, and a k value of 3.

It should be particularly noted that the embodiment of the system described above employs the embodiment of the method described above to specifically describe the working process of each module, and those skilled in the art can easily think that the modules are applied to other embodiments of the method described above.

Further, the above-described method steps and system elements or modules may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements or modules.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

The embodiments described above, particularly any "preferred" embodiments, are possible examples of implementations and are presented merely to clearly understand the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure and protected by the following claims.

Claims

1. A method for testing the performance of Flink K-Means is characterized by comprising the following steps:

2. The method of claim 1, wherein parameters needed to generate data are defined based on a k-means clustering algorithm, and wherein parsing the parameters and generating raw data based on the parsed parameters comprises:

selecting a point and adding the point to a central set S;

acquiring the mean value of all points of each dimension in the central set S, and calculating a new point through the mean value of the dimension + the variance;

adding the new point to the center set S, and circulating the previous step until enough initial centers are obtained;

generating points around the initial center from the initial center through Gaussian distribution, dividing the number of points to be generated, and writing data points generated by each partition into a result to generate the original data.

3. The method of claim 1, wherein the parameters include file output location, dimensionality, number of data points, number of clusters, minimum distance from all central means, and standard deviation of data points.

4. The method of claim 1, wherein the specific parameters include total throughput, average number of delays, error rate, and average discontinuous throughput, delay, error rate for the same time.

5. The method of claim 1, wherein the data format comprises a data volume of 500G, a thread count of 100, and a k value of 3.

6. An apparatus for performing a Flink K-Means performance test, the apparatus comprising:

an analysis module configured to perform parameter-specific analysis on the test result to determine whether the Flink distributed real-time processing engine can meet the current production requirement.

7. The device of claim 6, wherein the parsing module is further configured to:

selecting a point and adding the point to a central set S;

8. The apparatus of claim 6, wherein the parameters include file output location, dimensionality, number of data points, number of clusters, minimum distance from all central means, and standard deviation of data points.

9. The apparatus of claim 6, wherein the specific parameters comprise total throughput, average number of delays, error rate, and average discontinuous throughput, delay, error rate for the same time.

10. The apparatus of claim 6, wherein the data format comprises a data volume of 500G, a thread count of 100, and a k value of 3.