WO2022108002A1

WO2022108002A1 - Cross model data integrated processing platform for automating task-specific platform selection

Info

Publication number: WO2022108002A1
Application number: PCT/KR2021/002674
Authority: WO
Inventors: 황다영; 강준성; 이석원
Original assignee: 주식회사 와이즈넛
Priority date: 2020-11-19
Filing date: 2021-03-04
Publication date: 2022-05-27
Also published as: KR102465932B1; KR20220068381A

Abstract

The present invention relates to a method for automating a task-specific platform selection by a big data integrated processing platform server when a user terminal accesses a big data integrated processing platform connected to multiple individual platforms and requests big data analysis. While a workflow according to a user request is performed, operators included in each task are executed in an individual platform, and a performance time of each of the operators is predicted by using data obtained by monitoring a consumed CPU and memory, etc. Data obtained by monitoring a resource situation of an individual platform for each task and a type/field of data to be analyzed are used as learning data, so that a cost learner module for predicting a performance time of each of operators according to the individual platform is made in advance, wherein the cost learner module is executed to assist in a selection of an individual platform for each task. A special effect of the present invention is to save a time required for a platform selection and an analysis time.

Description

Cross-model data integration processing platform that automates platform selection for each task

The present invention relates to big data collection and analysis technology through a platform.

With the Fourth Industrial Revolution approaching, the collection, analysis, and processing of big data based on digital and network technology are growing in importance, and the value of data has risen remarkably. As a result, organizations across the market have begun introducing big data platforms that can operate and manage big data in order to build their own systems.

Big data is collected through various sensors based on IoT technology. In addition, big data is collected from the web through crawling technology. Analyzing the big data collected in this way makes it possible to increase productivity and provide new services, and also to reduce costs and manage operations efficiently.

However, there are various industries in the world, and there are various platforms created for collecting big data. Also, since there are many types and fields of data, the optimal analysis platform for analyzing such data is inevitably different. When analyzing big data, various tools and platforms such as Hadoop, Spark, Python, and R can be used for analysis, but it takes a lot of time to learn each technical field. In addition, it is difficult to select a platform suitable for the purpose of analysis. Ultimately, it is also a matter of time and cost.

The inventors of the present invention have completed the present invention after long research and efforts to solve the above problems.

To solve the difficulty of selecting a platform suitable for the purpose of analysis, the inventors of the present invention concluded that the solution is first to understand the various platforms, then to select a platform based on that understanding, and thereby to secure data analysis speed.

Therefore, an object of the present invention is to develop a single integrated data analysis processing system that automates the selection of the platform used for analysis. This requires a means of saving the resources and costs consumed in the analysis process, which in turn saves the time spent on platform selection and on data analysis. In other words, the invention is intended to provide a system that enables big data users to analyze data without having to master the technology of each individual platform for big data analysis, thereby saving time and resources.

On the other hand, other objects not specified in the present invention will be additionally considered within the range that can be easily inferred from the following detailed description and effects thereof.

To achieve the above object, the present invention provides a method by which, when a user terminal accesses a big data integrated processing platform connected to a plurality of individual platforms and requests big data analysis, the big data integrated processing platform server automates the selection of a platform for each task, the method comprising:

executing operators included in tasks of a workflow of a user terminal on separate platforms, respectively;

monitoring resource status data including CPU usage and memory usage of individual platforms for each operator;

estimating, by the cost learner module, the execution time of each operator using the type of data to be analyzed and the resource status data; and

automatically selecting an individual platform for each task based on the predicted execution times of the operators.

In the method for automating platform selection for each task in a system connected to a plurality of individual platforms according to a preferred embodiment of the present invention, the method may further comprise the step of determining, by the big data integrated processing platform server, an execution order of individual platform tasks.

In addition, in the method for automating platform selection for each task in a system connected to a plurality of individual platforms according to a preferred embodiment of the present invention, the method may further include calculating, by the cost learner module, a cost for each operator using the operator type, analysis data type, CPU usage by operator, memory usage by operator, operator progress status, and operator completion time as parameters.

In the method for automating platform selection for each task in a system connected to a plurality of individual platforms according to a preferred embodiment of the present invention, the cost learner module may include a machine learning model.

In addition, in the method for automating platform selection for each task in a system connected to a plurality of individual platforms according to a preferred embodiment of the present invention, the step of executing the operators included in the tasks of the workflow on the individual platforms, respectively, may be performed in containers on the individual platforms.

The present invention proposes a big data integrated processing platform that interworks with and manages multiple data analysis platforms. Analyzing big data by linking multiple platforms inevitably consumes a lot of time and resources, because the type of data to be analyzed differs from case to case and the resource situation of each platform performing the analysis differs as well. To improve the interoperability of heterogeneous systems by linking multiple platforms, it is necessary to reduce the resources consumed for analysis.

The present invention predicts the execution time for each operator included in each task by using the type of analysis data and resource conditions such as CPU and memory for each task of the individual platform. Since the platform selection for each task can be automatically performed through the operator's predicted execution time, it is possible to save resources and time for big data analysis.

On the other hand, even if it is an effect not explicitly mentioned herein, it is added that the effects described in the following specification expected by the technical features of the present invention and their potential effects are treated as described in the specification of the present invention.

FIG. 1 is a diagram schematically showing a system configuration according to a preferred embodiment of the present invention.

FIG. 2 schematically shows the concept of executing each operator on an individual platform to automatically select an individual platform according to a preferred embodiment of the present invention.

FIG. 3 schematically shows the concept of platform selection for each task according to a preferred embodiment of the present invention.

FIG. 4 conceptually illustrates a process of calculating the cost of each operator in the cost learner module 120 according to a preferred embodiment of the present invention.

FIG. 5 schematically illustrates the overall process of the method of the present invention.

※ The accompanying drawings are provided as a reference for understanding the technical idea of the present invention, and the scope of the present invention is not limited by them.

Hereinafter, the configuration of the present invention guided by various embodiments of the present invention and effects resulting from the configuration will be described with reference to the drawings. In the description of the present invention, if it is determined that related known functions are obvious to those skilled in the art and may unnecessarily obscure the gist of the present invention, the detailed description thereof will be omitted.

FIG. 1 schematically shows a system configuration according to a preferred embodiment of the present invention.

The user terminal 10 accesses the big data integrated processing platform. This big data integrated processing platform consists of one or more servers 100, and each server consists of one or more software/hardware devices. The big data integrated processing platform server 100 is connected to N (N is an integer greater than 1) heterogeneous big data individual platforms 200, 200... through a communication network. The big data integrated processing platform server 100 accesses the databases 210, 210... of these individual platforms 200, 200... to collect big data.

The individual platforms 200, 200... use a processor and memory when executing the operators included in tasks according to a user workflow. The big data integrated processing platform server 100 of the present invention monitors the resource status of these individual platforms 200, 200..., and in this way the heterogeneous individual platforms are interlinked with each other.

Users are diverse and have different needs for collecting and analyzing big data. The present invention provides a means for such diverse people to integrate and process big data scattered on several individual platforms to suit their needs.

When the user terminal 10 requests big data analysis after accessing the big data integrated processing platform server 100, the big data integrated processing platform server 100 selects, for the user terminal 10, the individual platform most suitable for the big data analysis. In particular, the cost learner module 120 selects an optimal analysis platform for each task according to the type/field of data to be analyzed. Although not shown, it is apparent to those skilled in the art that a plurality of modules for executing a series of processes such as collecting and processing big data are installed in the big data integrated processing platform server.

The big data integrated processing platform server 100 executes a workflow requested by the user from the user terminal 10. A workflow is the process of transforming data to produce the result the user wants. Using the tasks of the user-requested workflow and information on the resource status of each platform, the system of the present invention predicts the execution time of the operators with the cost learner module 120 and thereby helps select the right platform for each task.

Data analysis passes through several tasks. Such tasks include data cleansing, data visualization, and predictive model creation. Data cleansing is the task of putting data into a form suitable for analysis. It may include processes such as handling missing values, handling outliers, and normalization to quantify categorical data. The data visualization task is a process of visually expressing and delivering data analysis results so that they can be easily understood. The predictive modeling task is a process of selecting among various modeling techniques based on machine learning and deep learning algorithms, and of selecting and optimizing the detailed data used in the modeling process. In addition, each task includes various operators. An operator is a unit operation used to manipulate and process data.
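The workflow, task, and operator hierarchy described above can be sketched as a small data model. This is only an illustration: the class names and the example operator names (`fill_missing`, `train_model`, and so on) are assumptions and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Operator:
    """One data-processing step, e.g. filling in missing values."""
    name: str


@dataclass
class Task:
    """A stage of the analysis made up of one or more operators."""
    name: str
    operators: List[Operator] = field(default_factory=list)


@dataclass
class Workflow:
    """The analysis requested by the user: an ordered list of tasks."""
    tasks: List[Task] = field(default_factory=list)

    def all_operators(self) -> List[Operator]:
        # Flatten the tasks into the operators they contain, in order.
        return [op for task in self.tasks for op in task.operators]


# Hypothetical workflow using the three task kinds named in the description.
workflow = Workflow(tasks=[
    Task("data_cleansing", [Operator("fill_missing"),
                            Operator("remove_outliers"),
                            Operator("normalize")]),
    Task("predictive_modeling", [Operator("select_features"),
                                 Operator("train_model")]),
    Task("data_visualization", [Operator("plot_results")]),
])
```

Keeping operators as separate objects matches the granularity at which the specification monitors resources and predicts execution time.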

FIG. 2 schematically shows the essence of a method for automatically selecting an individual platform for each task according to a preferred embodiment of the present invention.

The big data integrated processing platform server 100 executes the operators 15 included in the tasks according to the user workflow on each individual platform as described above. As shown, the operators each run on separate platform containers 250. Such individual platform containers 250 include Spark, JavaStreams, Flink, Graph, and the like. Then, the big data integrated processing platform server 100 monitors the resources that each individual platform consumes when it executes operators. In doing so, the individual platforms consume CPU and memory. Preferably, the big data integrated processing platform server 100 may monitor CPU usage and memory usage for each operator 15.
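Per-operator monitoring of CPU and memory might look like the following minimal sketch. It is an assumption-laden stand-in: it measures the local Python process with the standard library (`time.process_time` and `tracemalloc`), whereas the specification monitors remote platform containers and does not say how. The function name `run_with_monitoring` and the metric keys are hypothetical.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Tuple


def run_with_monitoring(operator: Callable[[Any], Any],
                        data: Any) -> Tuple[Any, Dict[str, float]]:
    """Run one operator and record CPU time, wall time, and peak memory.

    Memory is the peak traced by tracemalloc, i.e. Python-level allocations
    in this process only (an assumption made for this sketch).
    """
    tracemalloc.start()
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    try:
        result = operator(data)
    finally:
        wall_seconds = time.perf_counter() - wall_start
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    metrics = {
        "cpu_seconds": cpu_seconds,
        "wall_seconds": wall_seconds,
        "peak_memory_bytes": float(peak_bytes),
    }
    return result, metrics


# Usage: treat the built-in `sorted` as a stand-in operator.
result, metrics = run_with_monitoring(sorted, [3, 1, 2])
```

Records of this kind, one per operator execution, are the sort of resource status data that the cost learner module would later consume.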

The cost learner 120 generates a learning model that predicts the execution time of the operators 15, using the type information of the data to be analyzed and the resource status data of the individual platforms monitored above (i.e., including CPU usage and memory usage data for each operator). With this model, the execution time of each operator 15 is predicted. The types of data to be analyzed include file data (csv, excel, xml), relational data (data that tabulates simple relationships between keys and values), and web service data (data obtained through REST services, SOAP services, and the like).

Then, the platform selection automation module 170 automatically selects a platform for each task using the execution time for each operator 15 predicted by the cost learner 120 (see FIG. 3). It then determines the order and method of performing individual platform tasks. The execution order is the order in which the tasks are executed. For example, when task 1 is performed in Flink, task 2 in JavaStreams, and task 3 in Spark, the platform selection automation module 170 specifies the execution order as Flink → JavaStreams → Spark. The method refers to the platform selection method. For example, if the initial estimated execution time of a specific operator exceeds the actual execution time, the result is transmitted to the cost learner, and the platform operation sequence can be rearranged. Determining the method means deciding how to proceed with this process.
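The selection and ordering step can be illustrated with a short sketch. Several points are assumptions rather than statements from the specification: the cost of a task on a platform is taken as the sum of the predicted times of its operators, the predicted times are invented numbers, and `needs_replanning` uses a symmetric relative tolerance as one possible trigger for sending results back to the cost learner.

```python
from typing import Dict, List, Tuple


def select_platforms(
    tasks: Dict[str, List[str]],
    predicted_seconds: Dict[Tuple[str, str], float],
    platforms: List[str],
) -> List[Tuple[str, str]]:
    """Pick, for each task, the platform with the lowest total predicted time."""
    plan = []
    for task_name, operators in tasks.items():
        best = min(
            platforms,
            key=lambda p: sum(predicted_seconds[(op, p)] for op in operators),
        )
        plan.append((task_name, best))
    return plan


def needs_replanning(estimated: float, actual: float,
                     tolerance: float = 0.2) -> bool:
    """Flag an operator whose actual time deviates from the estimate
    by more than `tolerance` (relative); the threshold is hypothetical."""
    return abs(actual - estimated) > tolerance * estimated


platforms = ["Flink", "JavaStreams", "Spark"]
tasks = {"task1": ["clean"], "task2": ["transform"], "task3": ["train"]}
# Made-up predicted execution times in seconds, keyed by (operator, platform).
predicted_seconds = {
    ("clean", "Flink"): 4.0, ("clean", "JavaStreams"): 6.0,
    ("clean", "Spark"): 9.0,
    ("transform", "Flink"): 7.0, ("transform", "JavaStreams"): 3.0,
    ("transform", "Spark"): 8.0,
    ("train", "Flink"): 30.0, ("train", "JavaStreams"): 45.0,
    ("train", "Spark"): 12.0,
}

plan = select_platforms(tasks, predicted_seconds, platforms)
# Tasks keep their workflow order, so the platform sequence follows directly.
execution_order = " → ".join(platform for _, platform in plan)
```

With these example numbers, `execution_order` is `Flink → JavaStreams → Spark`, the same sequence used as the example in the description.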

FIG. 4 conceptually illustrates a process of calculating the cost of each operator in the cost learner 120 according to a preferred embodiment of the present invention.

When each operator is executed on an individual platform, the input information of the cost learner 120 may include the operator type 121, the analysis data type 122, the CPU usage by operator 123, the memory usage by operator 124, the operator progress status 125, and the operator completion time 126. With these inputs, parameters for each operator are learned. The cost of each operator is then calculated and derived as the output.
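The six inputs 121 to 126 could be held in a record and turned into a numeric feature vector along the following lines. This is a sketch under assumptions: the operator vocabulary, the units (fractions for usage and progress, seconds for time), and the choice to treat the completion time 126 as the training target rather than as a feature are all illustrative and are not fixed by the specification.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical operator vocabulary; the data types follow the description.
OPERATOR_TYPES = ["filter", "join", "aggregate"]
DATA_TYPES = ["file", "relational", "web_service"]


@dataclass
class OperatorObservation:
    operator_type: str      # 121
    data_type: str          # 122
    cpu_usage: float        # 123, fraction of CPU in use (assumed unit)
    memory_usage: float     # 124, fraction of memory in use (assumed unit)
    progress: float         # 125, fraction of the operator completed
    completion_time: float  # 126, seconds; used here as the training target


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """Encode a categorical value as a 0/1 vector over the vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == v else 0.0 for v in vocabulary]


def to_features(obs: OperatorObservation) -> List[float]:
    """Concatenate encoded categories with the numeric resource readings."""
    return (one_hot(obs.operator_type, OPERATOR_TYPES)
            + one_hot(obs.data_type, DATA_TYPES)
            + [obs.cpu_usage, obs.memory_usage, obs.progress])


obs = OperatorObservation("join", "relational", 0.5, 0.25, 1.0, 12.0)
features = to_features(obs)   # model input
target = obs.completion_time  # value the model learns to predict
```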

Preferably, the cost can be calculated using machine learning and deep learning algorithms. More preferably, a boosting-type algorithm is used. Boosting is known as a technique that iteratively reduces errors by applying weights to entities misclassified during data prediction. Its outline is as follows. Change all data to random values. Temporarily assign weights to all random data. Adjust the weight of each data item according to the result (correct or incorrect) for the random values. Change the random data using the adjusted weights. Repeat this process 3 or 4 times; this number can be predefined by the user. After the iterations, weight each model to make the final model. Various boosting-type algorithms are known, such as GBM (Gradient Boosting Machine), LightGBM, and XGBoost. However, the technical idea of the present invention is not limited by the type of this technique.
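To make the boosting idea concrete, here is a minimal pure-Python sketch of gradient boosting for regression with one-split trees (stumps). It is not the reweighting procedure outlined above and not the implementation used by GBM, LightGBM, or XGBoost: it is a simplified residual-fitting variant, chosen because the target here (execution time) is numeric. The function names, the learning rate of 0.5, and the toy data are assumptions; the default of four rounds mirrors the "3 or 4 times" mentioned in the text.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]   # (feature, threshold, left, right)
Model = Tuple[float, float, List[Stump]]  # (base, learning rate, stumps)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    mean_all = sum(residuals) / len(residuals)
    best_err = sum((r - mean_all) ** 2 for r in residuals)
    best: Stump = (0, float("inf"), mean_all, mean_all)  # fallback: no split
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if err < best_err:
                best_err, best = err, (j, threshold, lv, rv)
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, threshold, lv, rv = stump
    return lv if row[j] <= threshold else rv


def fit_boosted(X: List[List[float]], y: List[float],
                rounds: int = 4, learning_rate: float = 0.5) -> Model:
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: List[float]) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Toy data: one feature (CPU usage fraction) and execution time in seconds.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [1.0, 1.0, 5.0, 5.0]
model = fit_boosted(X, y)
low_load, high_load = predict(model, [0.15]), predict(model, [0.85])
```

Each round halves the remaining error on this toy data, so after four rounds the predictions sit close to the observed 1 s and 5 s. A production system would more plausibly rely on an existing boosting library than on a hand-written learner like this one.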

Meanwhile, prediction requires training data. The training data consist of the actual consumption of resources such as CPU and memory and the actual execution time when the operators are executed on individual platforms. The training data are used to create a learning model using the machine learning and deep learning algorithms. The current CPU usage and memory usage of the individual platforms are then input to the trained model to predict the execution time of the operators. Here, the cost is the same concept as the execution time. In other words, calculating the cost means predicting the execution time. For reference, the execution time (cost) must be predicted because the CPU and memory states of individual platforms are not always the same. The above process of the present invention may be understood as a task of predicting the execution time of an operator when an individual platform is in a specific state.

Returning to FIG. 4, since the execution time of the operators can be obtained through the cost calculation, it is possible to select an appropriate platform for each task.
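One way to write this relationship down is shown below. The notation is an assumption introduced for illustration: $f$ is the trained model, $o$ an operator, $d$ the type of data to be analyzed, $c_p$ and $m_p$ the current CPU and memory usage of platform $p$, $T$ a task, and $P$ the set of individual platforms. Summing operator times to obtain a task cost is also an assumption; the specification states only that the selection uses the predicted execution time of each operator.

```latex
\hat{t}_{o,p} = f\bigl(\operatorname{type}(o),\, d,\, c_p,\, m_p\bigr),
\qquad
C(T, p) = \sum_{o \in T} \hat{t}_{o,p},
\qquad
p^{*}(T) = \operatorname*{arg\,min}_{p \in P} C(T, p)
```

Here $\hat{t}_{o,p}$ is the predicted execution time (the cost) of operator $o$ on platform $p$, and $p^{*}(T)$ is the platform selected for task $T$.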

To predict the execution time of each operator, the cost learner module 120 uses machine learning and deep learning algorithms. Since the present invention automatically selects an appropriate platform according to the data to be analyzed and the operator, it has the advantage of saving both the time it takes to select an individual platform for big data analysis and the analysis time.

FIG. 5 schematically shows the overall process of the preferred method of the present invention, summarizing its configuration.

First, when the user terminal accesses the big data integrated processing platform of the present invention, the workflow requested by the user terminal in relation to big data analysis is analyzed (S100). At this time, the big data integrated processing platform server obtains information related to the type of data that the user wants to analyze.

Next, the integrated processing platform server executes the operators included in each task on individual platforms (S110).

The big data integrated processing platform server monitors how many resources each individual platform uses when executing operators for each task (S120). Preferably, information related to CPU usage and memory usage is monitored. By monitoring in this way, the big data integrated processing platform server can obtain resource status data for each task from individual platforms.

Next, the cost learner module predicts the execution time of each operator using the type of analysis data obtained in step S100 and the resource status data of individual platforms for each task obtained in step S120 as input values (S130).

Next, the platform selection automation module automatically selects an individual platform for each task (S140). This saves time for individual platform selection.

Finally, the order and method of performing tasks on individual platforms are determined (S150). This has the advantage of saving time and resources devoted to data analysis. This is directly related to the effect of cost reduction. The present invention is applicable to all industrial fields such as marketing and commerce where there are conditions or plans to utilize big data production and analysis.
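Steps S120 to S150 can be tied together in a small driver, sketched below. The monitoring and prediction steps are injected as callables so that the flow stays visible; `toy_monitor` and `toy_predict` are hypothetical stand-ins with made-up numbers, not the trained cost learner module, and summing operator times per task is again an assumption.

```python
from typing import Callable, Dict, List, Tuple

ResourceStatus = Dict[str, float]  # e.g. {"cpu": 0.4, "memory": 0.6}
Predictor = Callable[[str, str, str, ResourceStatus], float]


def plan_workflow(
    tasks: List[Tuple[str, List[str]]],
    data_type: str,
    platforms: List[str],
    monitor: Callable[[str], ResourceStatus],
    predict: Predictor,
) -> List[Tuple[str, str, float]]:
    """Return (task, selected platform, predicted seconds) in workflow order."""
    # S120: collect the current resource status of every individual platform.
    status = {p: monitor(p) for p in platforms}
    plan = []
    for task_name, operators in tasks:
        # S130: predicted cost of the task on each platform.
        cost = {
            p: sum(predict(op, data_type, p, status[p]) for op in operators)
            for p in platforms
        }
        # S140: select the platform with the lowest predicted cost.
        best = min(cost, key=cost.get)
        plan.append((task_name, best, cost[best]))
    # S150: tasks run in workflow order on their selected platforms.
    return plan


def toy_monitor(platform: str) -> ResourceStatus:
    # Made-up readings: Spark is busy, Flink is mostly idle.
    readings = {"Spark": {"cpu": 0.8, "memory": 0.5},
                "Flink": {"cpu": 0.2, "memory": 0.3}}
    return readings[platform]


def toy_predict(operator: str, data_type: str, platform: str,
                status: ResourceStatus) -> float:
    # Stand-in for the trained model: 10 s per operator, slower under load.
    return 10.0 * (1.0 + status["cpu"])


plan = plan_workflow(
    tasks=[("data_cleansing", ["fill_missing", "normalize"])],
    data_type="file",
    platforms=["Spark", "Flink"],
    monitor=toy_monitor,
    predict=toy_predict,
)
```

With these stand-ins, the single cleansing task is assigned to Flink, because the lower CPU load reported for Flink yields the smaller predicted time.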

For reference, the method for automating platform selection for each task in a system connected to a plurality of individual platforms according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter as well as machine codes such as those generated by a compiler. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The protection scope of the present invention is not limited to the description and expression of the embodiments explicitly described above. In addition, it is added once again that the protection scope of the present invention cannot be limited due to obvious changes or substitutions in the technical field to which the present invention pertains.

Claims

A method by which, when a user terminal accesses a big data integrated processing platform connected to a plurality of individual platforms and requests big data analysis, the big data integrated processing platform server automates the selection of a platform for each task in a system connected to the plurality of individual platforms, the method comprising:

executing operators included in tasks of a workflow of the user terminal on individual platforms, respectively;

monitoring resource status data including CPU usage and memory usage of the individual platforms for each operator;

estimating, by a cost learner module, the execution time of each operator using the type of data to be analyzed and the resource status data; and

automatically selecting an individual platform for each task based on the predicted execution times of the operators.
The method of claim 1, further comprising the step of determining, by the big data integrated processing platform server, the execution order of individual platform tasks.
The method of claim 1, wherein the cost learner module calculates the cost for each operator by using the operator type, analysis data type, CPU usage per operator, memory usage per operator, operator progress status, and operator completion time as parameters.
The method of claim 1, wherein the cost learner module includes a machine learning model.
The method of claim 1, wherein the step of executing each of the operators included in the tasks of the workflow on individual platforms is performed in containers of the individual platforms.