CN113553353A - Scheduling system for distributed data mining workflow - Google Patents

Scheduling system for distributed data mining workflow

Info

Publication number
CN113553353A
Authority
CN
China
Prior art keywords
task
data mining
module
workflow
control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110650899.1A
Other languages
Chinese (zh)
Inventor
李晖
李一水
周彧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Youlian Borui Technology Co ltd
Original Assignee
Guizhou Youlian Borui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Youlian Borui Technology Co ltd
Priority to CN202110650899.1A
Publication of CN113553353A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312 Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Strategic Management (AREA)
  • Software Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a task scheduling system for distributed data mining workflows and relates to the field of data mining. The system comprises a function module, a search module, a control module and a canvas module, which correspond respectively to a function area, a search area, a control area and a canvas area on the user interface of the system. To address the problems of traditional data mining task scheduling techniques, the invention designs a workflow-based scheduling system for distributed data mining tasks, designs a longest-task-first (LTF) scheduling algorithm that targets the total completion time of the parallel subtasks in a data mining workflow task, and executes data mining tasks in a distributed manner using distributed technology, thereby greatly improving the execution efficiency of data mining tasks.

Description

Scheduling system for distributed data mining workflow
Technical Field
The invention relates to the field of data mining, and in particular to a scheduling system for distributed data mining workflows.
Background
With the rapid development of information technologies such as the Internet, big data and cloud computing, human society has entered the information age; the scale and speed of data generation are growing exponentially, producing massive amounts of data. Data mining and data analysis are the common methods for extracting valuable information from such data. They differ as follows: data analysis has a definite target, in that a hypothesis is made first and then verified through analysis of the data to reach a conclusion. Data mining has no definite target before the information is mined, and can discover unknown patterns and rules in the data. Compared with data analysis, data mining can maximize the value of the data by uncovering potential, valuable knowledge in it.
As enterprises continue to accumulate data, data has become an intangible enterprise asset. The data contains abundant information, and mining and analyzing it from different angles yields different knowledge, so in recent years deeply mining the value of data has become a way for many enterprises to improve their returns. In a distributed data mining system, when multiple mining algorithms are applied to the same data set, a suitable scheduling strategy is needed to schedule the parallel data mining tasks in order to achieve better quality-of-service (QoS) performance indicators.
Commonly used distributed task scheduling platforms currently include XXL-JOB, Easy Scheduler and JobKeeper of Nanjing cloud creation data. These platforms mainly take a timed task or a whole workflow task as the scheduling unit, and do not consider how to schedule the multiple parallel subtasks within a workflow task.
Disclosure of Invention
The present invention aims to provide a scheduling system for distributed data mining workflows, so as to solve the problems identified in the background art.
To achieve this purpose, the invention provides the following technical scheme:
a scheduling system for distributed data mining workflows comprises a function module, a search module, a control module and a canvas module, which correspond respectively to a function area, a search area, a control area and a canvas area on the user interface of the system;
the function module enables the user to operate on tasks through the function area, including creating, saving and opening a workflow, interactively executing a workflow, and deleting a control;
the search module enables the user to search for controls through the search area: the user enters a control name in the search bar and quickly finds the required control;
the control module provides commonly used data loading, data preprocessing and data mining controls, and the user selects the required control through the control area;
the canvas module corresponds to the canvas area, which is used for constructing the data mining workflow task: a control in the control area is selected with the mouse and dragged onto the canvas; clicking the output of one control and the input of another generates a curve that connects the two controls; and connecting the controls pairwise in this way constructs the data mining workflow task.
As a further scheme of the invention: the scheduling system of the data mining workflow uses a Workflow framework to execute data mining tasks automatically.
As a still further scheme of the invention: the Workflow framework contains the following table structures:
control configuration table: stores information about control configuration;
control input configuration table: stores information about the configuration of control inputs;
control output configuration table: stores information about the configuration of control outputs;
catalog table: stores information about the control catalog;
control table: stores information about the controls;
workflow table: stores information about the workflows.
As a still further scheme of the invention: the search module supports fuzzy search; when a keyword K is entered in the search bar, the controls related to K are found in the control area.
As a still further scheme of the invention: the controls comprise load data set, sampling, split data, select attributes, linear regression, logistic regression, K-Means, support vector machine, decision tree, random forest, prediction and data viewer;
load data set: a data set is selected through the load data set control, whose page displays the number of records, the attribute columns and the attribute column types of the data set;
sampling: parameters such as the stratification attribute and the sampling ratio can be set;
split data: the data set is divided into a training set and a test set; the split ratio, stratified sampling and the stratification attribute, and whether repeated sampling is allowed can be set; the output has two parts: train is the training set and test is the test set;
select attributes: feature attributes, grouping attributes, label attributes and the like can be set;
linear regression: the algorithm name can be set;
logistic regression: the optimizer and the number of iterations can be set;
K-Means: parameters such as the number of clusters, the maximum number of iterations, the minimum centroid and the aggregation function can be set;
support vector machine: parameters such as the SVM type, the kernel function and the initial learning rate can be set;
decision tree: parameters such as the maximum depth of the tree and the minimum number of branch nodes can be set;
random forest: parameters such as the number of trees, the maximum depth of the trees and the minimum number of branch nodes can be set;
prediction: the trained model is connected to the test set, the test set is predicted, and the prediction result is displayed as a table;
data viewer: the data is presented in tabular form.
The effective scope of the method of the present invention is not limited to the algorithms and data processing components mentioned above.
As a still further scheme of the invention: the scheduling system of the data mining workflow further comprises a secondary scheduling system, which is deployed in a cluster environment built on KVM.
As a still further scheme of the invention: the secondary scheduling system includes a front-end module, a task scheduling module and a task execution module, wherein the front-end module is used for executing data mining workflow tasks in the default-parameter operation mode; the task scheduling module is used for obtaining the parallel subtasks from the database and scheduling them according to the LTF scheduling algorithm; and the task execution module is used for taking subtasks from the task queue and submitting them to the Greenplum cluster for execution.
As a still further scheme of the invention: the Greenplum cluster executes tasks as follows: the master node of the cluster receives tasks from the task scheduling module in sequence according to the information of the cluster resource queue, allocates system resources (such as memory) to the tasks, generates a task execution plan and distributes it to the child nodes, and the child nodes are responsible for executing the tasks.
As a still further scheme of the invention: the task scheduling module builds a linear regression model on the data set size V to predict the execution time T of each mining algorithm, using the following formula:
$T = \beta_1 V + \beta_2$ (1);
the coefficients $\beta_1$ and $\beta_2$ in formula (1) can be obtained by the least-squares method from samples collected in repeated experiments.
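For reference, the standard closed-form least-squares solution for a line of the form in formula (1) is sketched below. The sample count $n$, the sample index $k$, the measured pairs $(V_k, T_k)$ and the sample means $\bar{V}$ and $\bar{T}$ are notation assumed here for illustration; they do not appear in the original text.

```latex
\begin{aligned}
(\beta_1, \beta_2) &= \arg\min_{\beta_1, \beta_2} \sum_{k=1}^{n} \left(T_k - \beta_1 V_k - \beta_2\right)^2, \\
\beta_1 &= \frac{\sum_{k=1}^{n} (V_k - \bar{V})(T_k - \bar{T})}{\sum_{k=1}^{n} (V_k - \bar{V})^2}, \qquad
\beta_2 = \bar{T} - \beta_1 \bar{V}.
\end{aligned}
```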
As a still further scheme of the invention: the task scheduling module adopts a longest-task-first scheduling strategy:
define the set of data mining tasks $E = \{E_1, \ldots, E_i, \ldots, E_m\}$, where $E_i$ denotes the i-th task;
define the set of data set sizes $V = \{V_1, \ldots, V_i, \ldots, V_m\}$, where $V_i$ denotes the size of the data set processed by the i-th task in this run;
define the set of predicted execution times $T = \{T_1, \ldots, T_i, \ldots, T_m\}$, where $T_i$ denotes the predicted execution time of the i-th data mining task, and $T_i = E_i.\beta_1 \times V_i + E_i.\beta_2$, with $E_i.\beta_1$ and $E_i.\beta_2$ being the coefficients of the linear equation for task $E_i$;
the scheduling strategy is as follows:
acquiring the tasks: obtain the parallel subtask list E from the system;
predicting task execution time: select a data mining task $E_i$ from the task list, select its corresponding linear equation, and predict the execution time $T_i$ of that data mining task;
adding the task and the predicted execution time to the task dictionary Dict: store each task $E_i$ and its predicted execution time $T_i$ in the dictionary as a key/value pair, with $E_i$ as the key and $T_i$ as the value;
sorting the tasks: sort the dictionary by the value $T_i$ in descending order, so that the task with the longest predicted execution time comes first;
outputting the tasks: output the keys $E_i$ of the sorted dictionary, in order, to the task list E.
Compared with the prior art, the invention has the following beneficial effects: to address the problems of traditional data mining tools, the invention designs a workflow-based distributed data mining scheduling system in which, through workflow and drag-and-drop techniques, a data mining task can be built quickly with the mouse for mining and analysis. At the same time, data mining tasks are executed in a distributed manner using distributed technology, which greatly improves mining efficiency. The task scheduling submodule builds an execution-time prediction model for data mining tasks, which predicts task execution time and provides the basis for task scheduling. Taking the shortest total completion time of the parallel subtasks as the scheduling objective, and in view of the large number of long tasks in a data mining system, the task scheduling submodule adopts a longest-task-first scheduling strategy. This strategy reduces the total completion time of the parallel subtasks and thereby shortens the execution time of the whole mining task.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a user interface of a workflow-based distributed data mining scheduling system according to the present invention;
FIG. 2 is a diagram of a task scheduling subsystem architecture according to the present invention;
FIG. 3 is a flow chart of the LTF scheduling algorithm;
FIG. 4 is a diagram illustrating the distribution of positions of controls in the control area.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In this specification, references to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, step, or characteristic, but every embodiment may not necessarily include the particular feature, step, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
The invention is described in detail below with reference to the drawings and specific examples.
Embodiment one: referring to FIG. 1, a scheduling system for distributed data mining workflows includes a function module, a search module, a control module and a canvas module, which correspond respectively to a function area, a search area, a control area and a canvas area on the user interface of the system;
the function module enables the user to operate on tasks through the function area, including creating, saving and opening a workflow, interactively executing a workflow, and deleting a control; the specific operations are as follows:
The user clicks the "new workflow" icon button, and a new-workflow page pops up; the user enters the workflow name and workflow information on this page and clicks "confirm" to finish. The user clicks the "open workflow" icon to open the workflow list, which displays the workflow tasks already created; clicking a workflow name opens its workflow design page. The user clicks the "save workflow" icon to save; after entering the workflow name and workflow information, the user clicks "confirm" to complete the save or "close" to cancel it. The user selects a control on the canvas area and clicks the "delete" icon button to delete it; alternatively, the user can right-click the control and select "delete control". The user clicks the "interactive operation" icon button to execute a data mining workflow task; when a subtask control that requires interactive parameter setting is reached, execution of the workflow is interrupted and a parameter-setting interface pops up. After setting the parameters, the user clicks the "confirm" button to continue execution, and the subsequent controls can be run manually one by one until the workflow finishes.
The search module enables the user to search for controls through the search area: the user enters a control name in the search bar and quickly finds the required control;
the canvas module corresponds to the canvas area, which is used for constructing the data mining workflow task: a control in the control area is selected with the mouse and dragged onto the canvas; clicking the output of one control and the input of another generates a curve that connects the two controls; and connecting the controls pairwise in this way constructs the data mining workflow task. Using workflow and drag-and-drop techniques, a data mining task can thus be built quickly with the mouse for mining and analysis.
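One way to picture the workflow built on the canvas is as a directed graph whose nodes are controls and whose edges are the drawn connections. The sketch below is illustrative only: the control names, the `connections` list and the `parallel_groups` helper are hypothetical and are not taken from the original. It uses Python's standard `graphlib` module (Python 3.9 or later) to group controls that become runnable at the same time, which is where parallel subtasks arise.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each pair (source, target) is one connection
# drawn on the canvas from the output of `source` to the input of `target`.
connections = [
    ("load_data", "split_data"),
    ("split_data", "linear_regression"),
    ("split_data", "decision_tree"),
    ("split_data", "random_forest"),
]


def parallel_groups(edges):
    """Return batches of controls whose inputs become available together."""
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # target depends on source
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        groups.append(ready)
        sorter.done(*ready)
    return groups


groups = parallel_groups(connections)
for level, group in enumerate(groups):
    print(level, group)
```

In this made-up workflow the last group holds three mining controls fed by the same split, which is the kind of parallel subtask set a scheduler would need to order.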
The scheduling system of the data mining workflow uses a Workflow framework to execute data mining tasks automatically.
The Workflow framework contains the following table structures:
Control configuration table: stores information about control configuration.
Table 1. Control configuration table
Control input configuration table: stores information about the configuration of control inputs.
Table 2. Control input configuration table
Control output configuration table: stores information about the configuration of control outputs.
Table 3. Control output configuration table
Catalog table: stores information about the control catalog.
Table 4. Catalog table
Control table: stores information about the controls.
Table 5. Control table
Workflow table: stores information about the workflows.
Table 6. Workflow table
The search module supports fuzzy search: when a keyword K is entered in the search bar, the controls related to K are found in the control area.
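The original does not specify the matching rule behind the fuzzy search. As a minimal sketch, assuming it means a case-insensitive substring match on control names, the behavior could look like the following; the `fuzzy_search` function and the English control names are hypothetical.

```python
def fuzzy_search(query, control_names):
    """Return the controls whose name contains the query, ignoring case.

    Assumption: "fuzzy" is interpreted here as substring matching; the
    original text does not define the matching rule.
    """
    needle = query.strip().lower()
    if not needle:
        return list(control_names)  # empty query: show every control
    return [name for name in control_names if needle in name.lower()]


# Hypothetical English names for the controls listed in the description.
controls = [
    "Load Data Set", "Sampling", "Split Data", "Select Attributes",
    "Linear Regression", "Logistic Regression", "K-Means",
    "Support Vector Machine", "Decision Tree", "Random Forest",
    "Prediction", "Data Viewer",
]

matches_k = fuzzy_search("K", controls)
matches_regression = fuzzy_search("regression", controls)
print(matches_k)
print(matches_regression)
```

Substring matching keeps the sketch short; a real implementation might rank results or tolerate typos, which this sketch does not attempt.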
Referring to FIG. 4, the controls include load data set, sampling, split data, select attributes, linear regression, logistic regression, K-Means, support vector machine, decision tree, random forest, prediction and data viewer, among others. Their functions are described below:
load data set: a data set is selected through the load data set control, whose page displays the number of records, the attribute columns and the attribute column types of the data set;
sampling: parameters such as the stratification attribute and the sampling ratio can be set;
split data: the data set is divided into a training set and a test set; the split ratio, stratified sampling and the stratification attribute, and whether repeated sampling is allowed can be set; the output has two parts: train is the training set and test is the test set;
select attributes: feature attributes, grouping attributes, label attributes and the like can be set;
linear regression: the algorithm name can be set;
logistic regression: the optimizer and the number of iterations can be set;
K-Means: parameters such as the number of clusters, the maximum number of iterations, the minimum centroid and the aggregation function can be set;
support vector machine: parameters such as the SVM type, the kernel function and the initial learning rate can be set;
decision tree: parameters such as the maximum depth of the tree and the minimum number of branch nodes can be set;
random forest: parameters such as the number of trees, the maximum depth of the trees and the minimum number of branch nodes can be set;
prediction: the trained model is connected to the test set, the test set is predicted, and the prediction result is displayed as a table;
data viewer: the data is presented in tabular form.
The effective scope of the method of the present invention is not limited to the algorithms and data processing components mentioned above.
The scheduling system of the data mining workflow further comprises a secondary scheduling system, which is deployed in a cluster environment built on KVM virtual machines and has the following functions:
(1) default-parameter operation of the workflow: default parameters are set for the subtask controls that require parameter setting;
(2) scheduling strategy for parallel mining subtasks: multiple parallel subtasks are scheduled.
Referring to FIG. 2, the secondary scheduling system includes a front-end module, a task scheduling module and a task execution module, wherein the front-end module is used for executing data mining workflow tasks in the default-parameter operation mode; the task scheduling module is used for obtaining the parallel subtasks from the database and scheduling them according to the LTF scheduling algorithm; and the task execution module is used for taking subtasks from the task queue and submitting them to the Greenplum cluster for execution.
The Greenplum cluster executes tasks as follows: the master node of the cluster receives tasks from the task scheduling module in sequence according to the information of the cluster resource queue, allocates system resources (such as memory) to the tasks, generates a task execution plan and distributes it to the child nodes, and the child nodes are responsible for executing the tasks. Because the secondary scheduling system is deployed in a cluster environment built on KVM virtual machines, system development cost is saved.
Therefore, the implementation of the secondary scheduling system in this embodiment focuses on the back end. A mode of running the data mining workflow with default parameters and a parallel subtask scheduling mechanism are added to the workflow-based distributed data mining scheduling system.
(1) Default-parameter operation
Default parameters are set for the controls that require interaction, so no interactive interface pops up while the workflow is running. This operation mode is convenient for non-professionals, lowers the threshold for using the data mining system, and provides the system environment for the parallel subtask scheduling strategy.
(2) Parallel subtask scheduling mechanism
When multiple mining subtasks are executed in parallel, they are scheduled with the LTF scheduling algorithm described below, which reduces the total completion time of the parallel subtasks, shortens the execution time of the whole mining task and improves overall mining efficiency.
Referring to FIG. 3, one embodiment of the present invention provides the scheduling algorithm used by the task scheduling module. In the workflow-based distributed data mining scheduling system, different mining algorithms can be applied to the same data set to construct multiple parallel mining subtasks. Executing multiple mining subtasks in parallel is equivalent to the Greenplum database executing multiple SQL statements at the same time. Because the number of concurrent SQL statements in a Greenplum resource queue is limited, when the number of parallel mining subtasks in a data mining workflow task exceeds the SQL concurrency limit of the resource queue, the task scheduling module must schedule the parallel mining subtasks so as to reduce their total completion time.
The execution duration of a task is the basis for designing a multi-task scheduling algorithm, and it must be obtained before the task is executed. The task scheduling module therefore first builds an execution-time prediction model for data mining tasks to estimate task execution time. The size of a data set in the database is the most direct and important factor affecting SQL query speed, so in the workflow-based distributed data mining scheduling system the data set size is the key factor affecting the execution duration of a mining algorithm. The parallel subtask scheduling mechanism applies when the data mining workflow task runs without interruption, that is, when it is executed in the "default parameter operation" mode. In this mode every mining algorithm uses its default parameters, so the influence of different parameter settings on execution time can be ignored; of the factors affecting the execution time of a mining algorithm, only the data set size is considered, and the execution time of every mining algorithm grows roughly linearly with the data volume. The task scheduling module therefore builds a linear regression model on the data set size V to predict the execution time T of each mining algorithm, using the following formula:
T=β1*V+β2 (1);
The coefficients β1 and β2 in formula (1) can be obtained by applying the least squares method to samples collected through repeated experiments. In this way, the time required by each mining subtask can be predicted accurately, which provides a decision basis for subsequent task scheduling.
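As an illustration of how the coefficients in formula (1) might be fitted, the following is a minimal sketch of an ordinary least squares fit for a single mining algorithm. The function names `fit_linear` and `predict_time` and the sample measurements are hypothetical and do not come from the original disclosure; the units of V and T are likewise assumed.

```python
from statistics import mean


def fit_linear(sizes, times):
    """Ordinary least squares fit of T = beta1 * V + beta2 (formula (1))."""
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (V, T) samples of equal length")
    v_bar, t_bar = mean(sizes), mean(times)
    # Sum of squared deviations of V and cross-deviations of V and T.
    sxx = sum((v - v_bar) ** 2 for v in sizes)
    if sxx == 0:
        raise ValueError("all data set sizes are identical; slope is undefined")
    sxy = sum((v - v_bar) * (t - t_bar) for v, t in zip(sizes, times))
    beta1 = sxy / sxx
    beta2 = t_bar - beta1 * v_bar
    return beta1, beta2


def predict_time(beta1, beta2, size):
    """Predicted execution time for a data set of the given size."""
    return beta1 * size + beta2


# Hypothetical measurements from repeated runs of one mining algorithm:
# data set size V (assumed unit: millions of rows) and run time T (seconds).
sizes = [1.0, 2.0, 3.0, 4.0]
times = [3.0, 5.0, 7.0, 9.0]

beta1, beta2 = fit_linear(sizes, times)
print(beta1, beta2, predict_time(beta1, beta2, 5.0))
```

Under these assumed samples, which lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. In practice one pair of coefficients would be fitted per mining algorithm, since each algorithm has its own linear equation.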
Combining the above method with the fact that long tasks are common in actual data mining systems, the task scheduling module adopts a longest-task-first (LTF) scheduling strategy, and the parallel subtask scheduling problem is defined as follows:
defining a data mining task set E = {E1, …, Ei, …, Em}, wherein Ei represents the ith task;
defining a data set V = {V1, …, Vi, …, Vm}, wherein Vi represents the size of the data set processed by the ith task in this run;
defining a set of predicted data mining task execution times T = {T1, …, Ti, …, Tm}, wherein Ti represents the predicted execution time of the ith data mining task, and Ti = Ei.β1 × Vi + Ei.β2;
the specific scheduling strategy is as follows:
1. acquiring tasks: acquiring the parallel subtask list E from the system;
2. predicting task execution time: for each data mining task Ei in the task list, selecting the corresponding linear equation and predicting the execution time Ti of that data mining task;
3. adding the task and its predicted execution time to the task dictionary Dict: storing each task Ei and its predicted execution time Ti in the dictionary as a key/value pair E[i]/T[i];
4. sorting tasks: sorting the dictionary by the value Ti, from the longest to the shortest predicted execution time;
5. outputting tasks: outputting the keys Ei of the sorted dictionary, in order, to the task list E.
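The five steps above can be sketched roughly as follows. This is an illustrative reading rather than the patented implementation: the `MiningTask` class, its field names, and the coefficient values are assumptions, and the descending sort reflects the longest-task-first strategy stated earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningTask:
    """Hypothetical task record: Ei with its own fitted coefficients."""
    name: str
    beta1: float      # slope of this algorithm's linear equation
    beta2: float      # intercept of this algorithm's linear equation
    data_size: float  # Vi, size of the data set processed in this run


def ltf_order(tasks):
    """Return the tasks ordered from longest to shortest predicted time."""
    # Steps 2 and 3: predict Ti and store the Ei/Ti pairs in a dictionary.
    predicted = {task: task.beta1 * task.data_size + task.beta2 for task in tasks}
    # Step 4: sort the dictionary entries by predicted time, longest first.
    ranked = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
    # Step 5: output the keys in sorted order as the new task list.
    return [task for task, _ in ranked]


# Step 1: the parallel subtask list E (values chosen purely for illustration).
tasks = [
    MiningTask("A", beta1=1.0, beta2=0.0, data_size=10.0),
    MiningTask("B", beta1=1.0, beta2=0.0, data_size=5.0),
    MiningTask("C", beta1=1.0, beta2=0.0, data_size=15.0),
]

print([task.name for task in ltf_order(tasks)])
```

With these assumed values the predicted times are 10, 5, and 15, so the resulting dispatch order is C, A, B.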
The following example illustrates how the scheduling algorithm of this embodiment of the invention outperforms the short-task-first scheduling algorithm and the first-come-first-served algorithm.
Assume that the SQL concurrency limit of the Greenplum resource queue is 2, so that 2 SQL statements can be executed at the same time; the two execution slots are denoted SQL1 and SQL2. The execution times of the three tasks A, B, and C are 10 seconds, 5 seconds, and 15 seconds respectively. When the three tasks A, B, and C are submitted simultaneously, different scheduling algorithms yield different total completion times.
Using the short-task-first scheduling algorithm, the task allocation process is as follows: task B is first assigned to SQL1 and task A is assigned to SQL2. After 5 seconds, task B finishes while task A is still executing. At this point SQL1 is idle, and task C is assigned to SQL1. After another 5 seconds, task A finishes while task C is still executing. After a further 10 seconds, task C finishes and all tasks are complete. The total completion time of tasks A, B, and C is 20 seconds.
Using the first-come-first-served scheduling algorithm, the task allocation process is as follows: assuming the three tasks arrive in alphabetical order, task A is first assigned to SQL1 and task B is assigned to SQL2. After 5 seconds, task B finishes while task A is still executing. At this point SQL2 is idle, and task C is assigned to SQL2. After another 5 seconds, task A finishes while task C is still executing. After a further 10 seconds, task C finishes and all tasks are complete. The total completion time of tasks A, B, and C is 20 seconds.
Using the scheduling algorithm employed by the present invention, the task allocation process is as follows: task C is first assigned to SQL1 and task A is assigned to SQL2. After 10 seconds, task A finishes while task C is still executing. At this point SQL2 is idle, and task B is assigned to SQL2. After another 5 seconds, tasks B and C finish simultaneously and all tasks are complete. The total completion time of tasks A, B, and C is 15 seconds.
The above comparison with common algorithms shows that scheduling tasks A, B, and C with the scheduling algorithm of the present invention yields the shortest total completion time. The present invention therefore takes the shortest total completion time of the parallel subtasks as its scheduling objective and optimizes the scheduling algorithm for the prevalence of long tasks in data mining systems, which reduces the total completion time of the parallel subtasks and thus shortens the execution time of the whole mining task.
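The three-task comparison above can be reproduced with a small simulation. The sketch below assumes that tasks are dispatched in list order and that each task goes to whichever of the two concurrency slots becomes free first; the function name `makespan` is hypothetical, and the simulation models only the timing, not any actual Greenplum behaviour.

```python
import heapq


def makespan(durations, slots):
    """Total completion time when tasks are dispatched in the given order,
    each to the slot that becomes free earliest."""
    free_at = [0.0] * slots          # time at which each slot becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Execution times from the example: A = 10 s, B = 5 s, C = 15 s.
tasks = {"A": 10, "B": 5, "C": 15}

fcfs = list(tasks)                                  # arrival order A, B, C
stf = sorted(tasks, key=tasks.get)                  # shortest first: B, A, C
ltf = sorted(tasks, key=tasks.get, reverse=True)    # longest first: C, A, B

for label, order in (
    ("first-come-first-served", fcfs),
    ("short-task-first", stf),
    ("longest-task-first", ltf),
):
    print(label, order, makespan([tasks[name] for name in order], slots=2))
```

Running the sketch gives total completion times of 20, 20, and 15 seconds for the three policies, matching the walkthrough in the text.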
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (10)

1. A scheduling system oriented to distributed data mining workflows, characterized by comprising a function module, a search module, a control module and a canvas module, wherein the function module is used for enabling a user to operate on tasks through a function area, including creating, saving and opening workflows, interactively executing workflows, and operations for deleting controls;
the search module is used for enabling the user to search for controls through a search area, wherein the user enters a control name in the search bar to quickly find the required control;
the control module is used for providing common data loading, data preprocessing and data mining controls, and the user selects a required control through the control area;
the canvas module corresponds to a canvas area used for constructing the data mining workflow task, wherein a control in the control area is selected with the mouse and dragged onto the canvas, clicking the corresponding input and output of two controls generates a curve that connects the two controls together, and controls are connected pairwise in this way to construct the data mining workflow task.
2. The distributed data mining Workflow-oriented scheduling system of claim 1 wherein the data mining Workflow scheduling system employs a Workflow framework to implement automated execution of data mining tasks.
3. The distributed data mining Workflow-oriented scheduling system of claim 2, wherein the Workflow framework contains the following table structure information:
the control configuration table is used for storing relevant information of control configuration;
the control input end configuration table is used for storing relevant information of control input end configuration;
the control output end configuration table is used for storing relevant information of control output end configuration;
the directory table is used for storing relevant information of the control directory;
the control table is used for storing relevant information of the control;
and the workflow table is used for storing relevant information of the workflow.
4. The distributed data mining workflow-oriented scheduling system of claim 1, wherein the search module supports fuzzy search, such that when K is entered in the search bar, controls related to K are located in the control area.
5. The distributed data mining workflow-oriented scheduling system of claim 4, wherein the control comprises:
load data set, which is used for selecting a data set, and whose page displays the number of records, the attribute columns and the types of the attribute columns of the data set;
sampling, for which the stratification attribute and the sampling ratio parameters can be set;
split data, which divides a data set into a training set and a test set, for which the split ratio, the stratification attribute for stratified sampling and repeated sampling can be set, and whose output has two parts: train is the training set and test is the test set;
select attributes, for which feature attributes, grouping attributes and label attributes can be set;
linear regression, for which the algorithm name can be set;
logistic regression, for which the optimizer and the number of iterations can be set;
K-Means, for which the number of clusters, the maximum number of iterations, the minimum centroid and the aggregation function parameters can be set;
support vector machine, for which the SVM type, the kernel function and the initial learning rate can be set;
decision tree, for which the maximum depth of the tree and the minimum number of branch nodes can be set;
random forest, for which the number of trees, the maximum depth of the tree and the minimum number of branch nodes can be set;
prediction, which is connected to the trained model and the test set, predicts on the test set, and displays the prediction results in tabular form;
data viewer, which presents the data in tabular form.
6. The distributed data mining workflow-oriented scheduling system of claim 1, further comprising a secondary scheduling system deployed in a cluster environment built based on KVM virtual machines.
7. The distributed data mining workflow-oriented scheduling system of claim 6, wherein the secondary scheduling system comprises a front-end module, a task scheduling module and a task execution module, wherein the front-end module is used for executing data mining workflow tasks in the mode of running with default parameters; the task scheduling module is used for acquiring parallel subtasks from a database and scheduling the subtasks according to an LTF scheduling algorithm; and the task execution module is used for acquiring the subtasks from the task queue and delivering the subtasks to the Greenplum cluster for execution.
8. The distributed data mining workflow-oriented scheduling system of claim 7, wherein the Greenplum cluster performs the following steps: the master node in the cluster receives the tasks from the task scheduling module in sequence according to the information of the cluster resource queue, allocates system resources to the tasks, generates a task execution plan and distributes the task execution plan to each child node, and the child nodes are responsible for executing the tasks.
9. The distributed data mining workflow-oriented scheduling system of claim 8 wherein the task scheduling module predicts the execution time T of each mining algorithm by building a linear regression model of the size V of the dataset, the execution time T of the mining algorithm being predictable by the following formula:
T=β1*V+β2
wherein the coefficients β1 and β2 can be obtained by applying the least squares method to samples collected through repeated experiments.
10. The distributed data mining workflow-oriented scheduling system of claim 9, wherein the task scheduling module adopts a longest-task-first scheduling strategy:
defining a data mining task set E = {E1, …, Ei, …, Em}, wherein Ei represents the ith task;
defining a data set V = {V1, …, Vi, …, Vm}, wherein Vi represents the size of the data set processed by the ith task in this run;
defining a set of predicted data mining task execution times T = {T1, …, Ti, …, Tm}, wherein Ti represents the predicted execution time of the ith data mining task, and Ti = Ei.β1 × Vi + Ei.β2;
the scheduling strategy is as follows:
acquiring tasks: acquiring the parallel subtask list E from the system;
predicting task execution time: for each data mining task Ei in the task list, selecting the corresponding linear equation and predicting the execution time Ti of that data mining task;
adding the task and its predicted execution time to the task dictionary Dict: storing each task Ei and its predicted execution time Ti in the dictionary as a key/value pair E[i]/T[i];
sorting tasks: sorting the dictionary by the value Ti;
outputting tasks: outputting the keys Ei of the sorted dictionary, in order, to the task list E.
CN202110650899.1A 2021-06-10 2021-06-10 Scheduling system for distributed data mining workflow Withdrawn CN113553353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650899.1A CN113553353A (en) 2021-06-10 2021-06-10 Scheduling system for distributed data mining workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110650899.1A CN113553353A (en) 2021-06-10 2021-06-10 Scheduling system for distributed data mining workflow

Publications (1)

Publication Number Publication Date
CN113553353A true CN113553353A (en) 2021-10-26

Family

ID=78130457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110650899.1A Withdrawn CN113553353A (en) 2021-06-10 2021-06-10 Scheduling system for distributed data mining workflow

Country Status (1)

Country Link
CN (1) CN113553353A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116521753A (en) * 2023-03-29 2023-08-01 国网上海市电力公司 Mining method and system for electric energy service data, electronic equipment and storage medium
CN116521753B (en) * 2023-03-29 2024-07-12 国网上海市电力公司 Mining method and system for electric energy service data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Hu et al. Spear: Optimized dependency-aware task scheduling with deep reinforcement learning
CN104317658B (en) A kind of loaded self-adaptive method for scheduling task based on MapReduce
CN111176832B (en) Performance optimization and parameter configuration method based on memory computing framework Spark
US7747641B2 (en) Modeling sequence and time series data in predictive analytics
CN109240901B (en) Performance analysis method, performance analysis device, storage medium, and electronic apparatus
Grover et al. Extending map-reduce for efficient predicate-based sampling
CN109243532A (en) Eukaryon based on calculating cloud platform is without ginseng transcript profile interaction analysis system and method
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN107038069A (en) Dynamic labels match DLMS dispatching methods under Hadoop platform
He et al. Parallel implementation of classification algorithms based on MapReduce
CN103136337A (en) Distributed knowledge data mining device and mining method used for complex network
Dias et al. Supporting dynamic parameter sweep in adaptive and user-steered workflow
JP2009223833A (en) Workflow management system
CN109299180B (en) ETL operating system of data warehouse
CN105653647B (en) The information collecting method and system of SQL statement
CN109857532A (en) DAG method for scheduling task based on the search of Monte Carlo tree
CN110825526B (en) Distributed scheduling method and device based on ER relationship, equipment and storage medium
Yu et al. Design and implementation of curriculum system based on knowledge graph
Abdul et al. Database workload management through CBR and fuzzy based characterization
CN113762514A (en) Data processing method, device, equipment and computer readable storage medium
CN113553353A (en) Scheduling system for distributed data mining workflow
CN116756373A (en) Project review expert screening method, system and medium based on knowledge graph update
CN114090583A (en) Cross-business system order data analysis method and device
Liu et al. Multivariate modeling and two-level scheduling of analytic queries
Shin et al. Hippo: Taming hyper-parameter optimization of deep learning with stage trees

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211026
