CN108073582B - Computing framework selection method and device - Google Patents

Info

Publication number
CN108073582B
Authority
CN
China
Prior art keywords
data
data mining
selecting
model
node
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610981871.5A
Other languages
Chinese (zh)
Other versions
CN108073582A (en)
Inventor
李杰亮
崔洪涛
李光瑞
钱岭
齐骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201610981871.5A
Publication of CN108073582A
Application granted
Publication of CN108073582B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a computing framework selection method, comprising: selecting preset components according to a predetermined data mining process; converting the preset components into a directed acyclic graph, wherein the nodes of the directed acyclic graph correspond to the components; selecting a data mining computing framework for the nodes; and submitting the nodes to the selected data mining computing framework and computing the nodes using that framework. An embodiment of the invention also provides a computing framework selection device.

Description

Computing framework selection method and device
Technical Field
The invention relates to the field of data mining, in particular to a computing framework selection method and device.
Background
With the advent of the information age, the volume of accumulated data has grown geometrically, and various data processing platforms based on parallel computing frameworks have appeared in order to process this massive data. Existing parallel computing frameworks fall into two main types: MapReduce and Spark. MapReduce is an offline computing framework and, compared with Spark, a disk-based one. When MapReduce is used for computation, an algorithm is abstracted into two stages, Map and Reduce, which makes it well suited to data-intensive computing. Spark is an in-memory computing framework that keeps data in memory as far as possible to improve the computational efficiency of iterative and interactive applications.
At present, a big data mining platform is implemented on a single computing framework. If the MapReduce framework is selected, then when the volume of data to be processed is smaller than the resources of the cluster, processing is very slow compared with Spark, so cluster resources are seriously wasted and resource utilization drops. If the Spark framework is selected, then when the volume of data to be processed is far larger than the resources of the cluster, cluster resources are seriously insufficient, processing performance degrades sharply, and the data may not be processable at all.
Disclosure of Invention
In view of this, embodiments of the present invention are expected to provide a method and an apparatus for selecting a computing framework, which can select a suitable computing framework according to an actual data mining process.
The technical scheme of the embodiment of the invention is realized as follows:
a computing framework selection method, comprising:
selecting a preset component according to a preset data mining process;
converting the preset component into a directed acyclic graph; wherein the nodes of the directed acyclic graph have corresponding relations with the components;
selecting a computing framework for data mining for the node;
submitting the nodes to the data-mining computation framework and computing the nodes using the data-mining computation framework.
In the method described above, the predetermined data mining process comprises a data extraction process, a data processing process, an algorithm application process and a model building process, and selecting the preset components according to the predetermined data mining process comprises:
determining the data source to be extracted according to the data extraction process, and selecting a data extraction component according to the data source;
determining a processing method for the data according to the data processing process, and selecting a data processing component according to the processing method;
determining the algorithm used to build the model according to the algorithm application process, and selecting an algorithm component according to the algorithm;
and determining the purpose of the data mining according to the model building process, and selecting a modeling tool component according to the purpose.
In the method described above, selecting a data mining computing framework for the nodes comprises:
selecting a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster; wherein the cluster is the storage space of the extracted data.
In the method described above, selecting a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster comprises:
screening the directed acyclic graph for nodes with an in-degree of 0;
acquiring the input data volume of the nodes with an in-degree of 0 and the resource usage of the cluster;
selecting, by using an intelligent discrimination model, a data mining computing framework for the nodes with an in-degree of 0 according to their input data volume and the resource usage of the cluster;
accordingly, submitting the nodes to the selected data mining computing framework and computing the nodes using the computing framework comprises:
submitting the nodes with an in-degree of 0 to the selected data mining computing framework, and computing the nodes using the computing framework.
In the method described above, after submitting the nodes with an in-degree of 0 to the selected data mining computing framework and computing the nodes using the computing framework, the method further comprises:
deleting the nodes with an in-degree of 0.
The method as described above, further comprising:
generating a data mining model according to the calculation result of the data mining calculation framework on the nodes;
converting the data mining model into a file in YAML (Yet Another Markup Language) format and storing the file on a predetermined path of the cluster.
The method as described above, further comprising:
and monitoring the operation condition of the component and positioning the abnormal component.
The method as described above, further comprising:
maintaining data generated during the data mining process.
A computing framework selection apparatus, comprising:
the selection module is used for selecting the preset components according to a preset data mining process; selecting a computing framework for data mining for the node;
the conversion module is used for converting the preset component into a directed acyclic graph; wherein the nodes of the directed acyclic graph have corresponding relations with the components;
and the processing module is used for submitting the nodes to the data mining calculation framework and calculating the nodes by using the data mining calculation framework.
In the apparatus described above, the predetermined data mining process comprises a data extraction process, a data processing process, an algorithm application process and a model building process, and the selection module is specifically used for:
determining the data source to be extracted according to the data extraction process, and selecting a data extraction component according to the data source;
determining a processing method for the data according to the data processing process, and selecting a data processing component according to the processing method;
determining the algorithm used to build the model according to the algorithm application process, and selecting an algorithm component according to the algorithm;
and determining the purpose of the data mining according to the model building process, and selecting a modeling tool component according to the purpose.
The apparatus as described above, the selection module comprising:
the screening unit is used for screening the directed acyclic graph for nodes with an in-degree of 0;
the acquisition unit is used for acquiring the input data volume of the nodes with an in-degree of 0 and the resource usage of the cluster;
the selection unit is used for selecting, by using an intelligent discrimination model, a data mining computing framework for the nodes with an in-degree of 0 according to their input data volume and the resource usage of the cluster;
the processing module is specifically configured to submit the nodes with an in-degree of 0 to the selected data mining computing framework, and to compute the nodes using the computing framework.
The apparatus as described above, the processing module further to:
obtaining a data mining model according to the calculation result of the data mining calculation framework on the nodes;
and converting the data mining model into a file in YAML (Yet Another Markup Language) format and storing the file on a predetermined path of the storage space.
The computing framework selection method and device provided by the embodiments of the invention can select preset components according to a predetermined data mining process, convert the preset components into a directed acyclic graph, select a data mining computing framework for the nodes, submit the nodes to the selected data mining computing framework, and compute the nodes using that framework. Therefore, a different computing framework can be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible way, and improves the efficiency of the data mining process.
Drawings
FIG. 1 is a schematic flowchart of a computing framework selection method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a predetermined data mining process according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for selecting preset components according to a predetermined data mining process according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of another computing framework selection method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of yet another computing framework selection method according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of yet another computing framework selection method according to an embodiment of the present invention;
FIG. 7 is a schematic flowchart of the storage and application of a model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computing framework selection device according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another computing framework selection device according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of yet another computing framework selection device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of yet another computing framework selection device according to an embodiment of the present invention;
FIG. 12 is a schematic flowchart of the working process of the execution engine module according to this embodiment.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
FIG. 1 is a schematic flowchart of a computing framework selection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step 101: selecting preset components according to a predetermined data mining process.
Specifically, step 101 may be implemented by the computing framework selection device. Before data mining is carried out, the data mining process is determined, and the corresponding components are then selected in the data mining system according to the determined process.
Step 102: converting the preset components into a directed acyclic graph, wherein the nodes of the directed acyclic graph correspond to the components.
Specifically, step 102 may be implemented by the computing framework selection device. It should be noted that a directed acyclic graph is a directed graph in which no path that starts from a node and follows one or more edges can return to that node. The selected preset components are converted into a directed acyclic graph so that each node corresponds to one component. Because the sub-processes corresponding to the selected preset components together make up the whole data mining process, each node represents one sub-process of that process.
Step 103: selecting a data mining computing framework for the nodes.
Specifically, step 103 may be implemented by the computing framework selection device. In effect, a data mining computing framework is selected for each sub-process of the whole data mining process, and the computing frameworks selected for different nodes may be the same or different.
Step 104: submitting the nodes to the data mining computing framework, and computing the nodes using the data mining computing framework.
Specifically, step 104 may be implemented by the computing framework selection device. Each node is submitted to the computing framework selected for it and is computed by that framework, so that each node is matched with a suitable computing framework and its computation is performed there.
The computing framework selection method provided by the embodiment of the invention can select preset components according to a predetermined data mining process, convert the preset components into a directed acyclic graph, select a data mining computing framework for the nodes, submit the nodes to the selected data mining computing framework, and compute the nodes using that framework. Therefore, a different computing framework can be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible way, and improves the efficiency of the data mining process.
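Steps 101 to 104 can be illustrated with a minimal Python sketch. The class names `Component` and `Dag`, the functions `to_dag` and `assign_frameworks`, and the example selection policy are all hypothetical and are not taken from the embodiment; the sketch only shows one node per component, a check that the graph is acyclic, and a per-node framework choice.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A preset component chosen for one sub-process (hypothetical structure)."""
    name: str
    kind: str  # e.g. "extract", "process", "algorithm", "model"


@dataclass
class Dag:
    """Directed graph with one node per component."""
    nodes: dict = field(default_factory=dict)  # node name -> Component
    edges: dict = field(default_factory=dict)  # node name -> set of successor names


def to_dag(components, links):
    """Step 102: build the graph and verify that it contains no cycle."""
    dag = Dag()
    for component in components:
        dag.nodes[component.name] = component
        dag.edges[component.name] = set()
    for src, dst in links:
        dag.edges[src].add(dst)

    state = {}  # node name -> "visiting" or "done"

    def visit(node):
        # Reaching a node that is still being visited means a back edge, i.e. a cycle.
        if state.get(node) == "visiting":
            raise ValueError(f"cycle detected at {node!r}")
        if state.get(node) == "done":
            return
        state[node] = "visiting"
        for successor in dag.edges[node]:
            visit(successor)
        state[node] = "done"

    for node in dag.nodes:
        visit(node)
    return dag


def assign_frameworks(dag, choose):
    """Step 103: choose a framework per node; nodes may receive different ones."""
    return {name: choose(component) for name, component in dag.nodes.items()}


components = [
    Component("mysql_extract", "extract"),
    Component("normalize", "process"),
    Component("kmeans", "algorithm"),
]
links = [("mysql_extract", "normalize"), ("normalize", "kmeans")]
dag = to_dag(components, links)

# Placeholder policy for illustration only; it is an assumption of this sketch.
plan = assign_frameworks(dag, lambda c: "Spark" if c.kind == "algorithm" else "MapReduce")
```

Step 104 would then hand each node to the framework named in `plan`; how a node is submitted depends on the cluster and is left out of the sketch.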
FIG. 2 is a schematic flowchart of a predetermined data mining process according to an embodiment of the present invention. As shown in FIG. 2, the process includes a data extraction process, a data processing process, an algorithm application process and a model building process.
It should be noted that the data extraction process refers to extracting the data of a data source into the cluster, where the data source includes a MySQL database, an Oracle database, a DB2 database, a File Transfer Protocol (FTP) server, and the like. The data processing process refers to processing the extracted data in the cluster so that the final data meets the input requirements of the algorithm, where the data processing includes missing value processing, sampling, deduplication, extreme value removal, conditional filtering, conditional replacement, sorting, column generation, interval discretization, normalization, table association, statistical summarization, and the like. The algorithm application process refers to applying an algorithm to perform data mining; each algorithm is implemented differently in different computing frameworks, but the configuration of each algorithm is uniform and independent of any specific computing framework. The algorithms include the naive Bayes algorithm, the random forest algorithm, the linear regression algorithm, the frequent pattern growth (FP-Growth) algorithm, the K-means clustering algorithm, the item-based recommendation algorithm, and the like. The model building process refers to building a data mining model; after the model is built, the process further includes testing, prediction, evaluation, and the like of the model.
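As a small illustration of the data processing process, the following Python sketch implements three of the listed operations (missing value processing, deduplication and normalization) on in-memory lists. The function names and the choice of mean filling and min-max scaling are assumptions of the sketch; the embodiment does not specify how each operation is performed.

```python
def fill_missing(values, default=None):
    """Replace None with `default`, or with the mean of the present values."""
    present = [v for v in values if v is not None]
    if default is None:
        if not present:
            raise ValueError("no values available to compute a mean")
        default = sum(present) / len(present)
    return [default if v is None else v for v in values]


def deduplicate(rows):
    """Drop repeated rows while keeping the first-seen order (rows must be hashable)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


filled = fill_missing([1.0, None, 3.0])          # [1.0, 2.0, 3.0]
unique = deduplicate([(1, 2), (1, 2), (3, 4)])   # [(1, 2), (3, 4)]
scaled = min_max_normalize([2, 4, 6])            # [0.0, 0.5, 1.0]
```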
FIG. 3 is a schematic flowchart of a method for selecting preset components according to a predetermined data mining process according to an embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:
Step 201: the computing framework selection device determines the data source to be extracted according to the data extraction process, and selects a data extraction component according to the data source.
It should be noted that the data extraction components include a MySQL component, an Oracle component, a DB2 component and an FTP component. Selecting a data extraction component selects the source of the data. For example, if the Oracle component is selected, the data in the Oracle data source is extracted into the cluster as the raw input data; if the FTP component is selected, the Internet Protocol (IP) address, port, username and password of the machine hosting the FTP server are supplied so that its data can be extracted into the cluster as the raw input data.
Step 202: the computing framework selection device determines a processing method for the data according to the data processing process, and selects a data processing component according to the processing method.
It should be noted that the data processing components include a missing value processing component, a sampling component, a deduplication component, an extreme value removal component, a conditional filtering component, a conditional replacement component, a sorting component, a column generation component, an interval discretization component, a normalization component, a table association component and a statistical summarization component. Selecting a data processing component selects the processing method applied to the data. For example, if the sampling component is selected, the data extracted by the data extraction component is sampled; if the normalization component is selected, the extracted data is normalized; if both the sampling component and the normalization component are selected, the extracted data is sampled and normalized, respectively.
Step 203: the computing framework selection device determines the algorithm used to build the model according to the algorithm application process, and selects an algorithm component according to the algorithm.
It should be noted that the algorithm components include a naive Bayes algorithm component, a random forest algorithm component, a linear regression algorithm component, an FP-Growth algorithm component, a K-means clustering algorithm component and an item-based recommendation algorithm component. Selecting an algorithm component selects the data mining algorithm. For example, if the K-means clustering algorithm component is selected, data mining is performed using the K-means clustering algorithm.
Step 204: the computing framework selection device determines the purpose of the data mining according to the model building process, and selects a modeling tool component according to the purpose.
It should be noted that the modeling tool components include a data set segmentation component, a training component, a testing component, a prediction component, an evaluation component, a model visualization component, and the like. For example, if the purpose of the data mining is only to build a model, the training component, which trains the model, is selected. If the purpose is to build a model and test it, the data set segmentation component, the training component and the testing component are selected; the data set segmentation component splits the data in the cluster into two parts, one of which is used by the training component to train the model and the other by the testing component to test the trained model. If the purpose is to build a model and display it, the training component and the model visualization component are selected; the model visualization component displays the trained model, and if the trained model is a tree, it is displayed as a tree structure.
With the method for selecting preset components according to a predetermined data mining process, the user does not need to understand the underlying implementation or have in-depth knowledge of data mining algorithms. The user builds the data mining process simply by selecting components and can then mine the patterns hidden in big data, which greatly improves time efficiency and the user experience.
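Steps 201 to 204 amount to a lookup from the user's choices to component names, which can be sketched as follows. The catalogue dictionaries, the component class names and the purpose keys (`train`, `train_and_test`, `train_and_visualize`) are hypothetical names introduced for illustration; only the grouping of components follows the description above.

```python
# Hypothetical catalogues; each maps a user-facing choice to a component name.
EXTRACT = {"mysql": "MysqlComponent", "oracle": "OracleComponent",
           "db2": "DB2Component", "ftp": "FTPComponent"}
PROCESS = {"missing_value": "MissingValueComponent", "sampling": "SamplingComponent",
           "dedup": "DeduplicationComponent", "normalization": "NormalizationComponent"}
ALGORITHM = {"naive_bayes": "NaiveBayesComponent", "random_forest": "RandomForestComponent",
             "linear_regression": "LinearRegressionComponent", "fp_growth": "FPGrowthComponent",
             "kmeans": "KMeansComponent", "item_based": "ItemBasedRecommendationComponent"}
PURPOSE = {
    "train": ["TrainingComponent"],
    "train_and_test": ["DatasetSplitComponent", "TrainingComponent", "TestingComponent"],
    "train_and_visualize": ["TrainingComponent", "ModelVisualizationComponent"],
}


def select_components(data_source, processing_methods, algorithm, purpose):
    """Return the component names for one data mining process, in pipeline order."""
    selected = [EXTRACT[data_source]]                       # step 201
    selected += [PROCESS[m] for m in processing_methods]    # step 202
    selected.append(ALGORITHM[algorithm])                   # step 203
    selected += PURPOSE[purpose]                            # step 204
    return selected


pipeline = select_components("oracle", ["sampling", "normalization"], "kmeans", "train_and_test")
```

An unknown choice raises `KeyError`, which keeps the sketch short; a real system would more likely validate the selection in its user interface.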
FIG. 4 is a schematic flowchart of another computing framework selection method according to an embodiment of the present invention. As shown in FIG. 4, the method includes the following steps:
Step 301: the computing framework selection device selects preset components according to a predetermined data mining process.
Step 302: the computing framework selection device converts the preset components into a directed acyclic graph, wherein the nodes of the directed acyclic graph correspond to the components.
Step 303: the computing framework selection device selects a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster, wherein the cluster is the storage space of the extracted data.
It should be noted that the input data volume of a node is the volume of data output by the preceding node after it has been computed by its computing framework. Selecting a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster means selecting a suitable data mining computing framework for each node according to that node's input data volume and the current resource usage of the cluster.
Step 304: the computing framework selection device submits the nodes to the data mining computing framework, and the nodes are computed using the data mining computing framework.
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
The computing framework selection method provided by the embodiment of the invention can select preset components according to a predetermined data mining process, convert the preset components into a directed acyclic graph, select a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster, submit the nodes to the selected data mining computing framework, and compute the nodes using that framework. Therefore, a different computing framework can be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible way, and improves the efficiency of the data mining process.
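The selection in step 303 can be pictured with a deliberately simple rule. The sketch below is an assumption, not the rule used by the embodiment: it picks Spark when the input, multiplied by a hypothetical `headroom` factor, fits into the cluster's free memory, and MapReduce otherwise. The `ClusterStatus` class, the field `free_memory_gb` and the default factor of 2.0 are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ClusterStatus:
    """Snapshot of cluster resource usage (hypothetical; only free memory is modelled)."""
    free_memory_gb: float


def choose_framework(input_gb, cluster, headroom=2.0):
    """Pick an in-memory framework only when the input comfortably fits in memory.

    `headroom` is an assumed safety factor for intermediate data.
    """
    if input_gb * headroom <= cluster.free_memory_gb:
        return "Spark"
    return "MapReduce"


cluster = ClusterStatus(free_memory_gb=64.0)
small = choose_framework(10.0, cluster)    # "Spark": 20 GB needed, 64 GB free
large = choose_framework(500.0, cluster)   # "MapReduce": 1000 GB needed, 64 GB free
```

A fixed threshold like this reflects the trade-off described in the Background, but it cannot adapt to observed run times.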
FIG. 5 is a schematic flowchart of yet another computing framework selection method according to an embodiment of the present invention. As shown in FIG. 5, the method includes the following steps:
Step 401: the computing framework selection device selects preset components according to a predetermined data mining process.
Step 402: the computing framework selection device converts the preset components into a directed acyclic graph, wherein the nodes of the directed acyclic graph correspond to the components.
Step 403: the computing framework selection device screens the directed acyclic graph for nodes with an in-degree of 0.
It should be noted that nodes with an in-degree of 0 are screened because only a node with an in-degree of 0 has a determined input data volume and can therefore be computed.
Step 404: the computing framework selection device acquires the input data volume of the nodes with an in-degree of 0 and the resource usage of the cluster.
Step 405: the computing framework selection device uses an intelligent discrimination model to select a data mining computing framework for the nodes with an in-degree of 0 according to their input data volume and the resource usage of the cluster.
It should be noted that the intelligent discrimination model is generated by an algorithm and, preferably, may be generated by a stand-alone C4.5 decision tree algorithm. The intelligent discrimination model is used to select a suitable data mining computing framework for each node with an in-degree of 0 according to that node's input data volume and the resource usage of the cluster at that time.
Step 406: the computing framework selection device submits the nodes to the data mining computing framework, and the nodes are computed using the data mining computing framework.
Step 407: the computing framework selection device deletes the nodes with an in-degree of 0.
It should be noted that once the nodes with an in-degree of 0 are deleted from the directed acyclic graph, some nodes whose in-degree was previously not 0 become nodes with an in-degree of 0. Steps 403 to 407 are therefore executed in a loop until a suitable data mining computing framework has been selected for every node in the directed acyclic graph. After step 407, the method further includes recording the running time of the computing framework selected for each node, the input data volume of each node, the resource usage of the cluster when each node is submitted, and the computing framework to which each node is submitted, and updating the intelligent discrimination model with the recorded data, so that the model makes better selections the next time it is applied.
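The loop over steps 403 to 407 follows the same pattern as a topological sort that repeatedly removes nodes with an in-degree of 0 (Kahn's algorithm). The Python sketch below shows that loop under several assumptions: `run_pipeline`, `choose`, `execute` and `cluster_free_gb` are hypothetical callables, every node must appear as a key of `edges`, and a node with several predecessors is assumed to receive the sum of their outputs.

```python
import time


def run_pipeline(edges, source_input_gb, choose, execute, cluster_free_gb):
    """Repeat steps 403-407 until every node has been submitted.

    edges: node -> list of successor nodes (every node must be a key).
    source_input_gb: input volume of the nodes that start with in-degree 0.
    choose(size_gb, free_gb) -> framework name (stands in for the discrimination model).
    execute(node, framework, size_gb) -> output volume in GB.
    cluster_free_gb() -> currently free cluster resources in GB.
    """
    in_degree = {node: 0 for node in edges}
    for successors in edges.values():
        for successor in successors:
            in_degree[successor] += 1

    input_gb = dict(source_input_gb)
    remaining = set(edges)
    history = []
    while remaining:
        ready = sorted(n for n in remaining if in_degree[n] == 0)   # step 403
        if not ready:
            raise ValueError("graph contains a cycle")
        for node in ready:
            size = input_gb.get(node, 0.0)
            free = cluster_free_gb()                                  # step 404
            framework = choose(size, free)                            # step 405
            started = time.perf_counter()
            output_gb = execute(node, framework, size)                # step 406
            # Record what the description says is fed back into the model.
            history.append({"node": node, "framework": framework, "input_gb": size,
                            "free_gb": free, "seconds": time.perf_counter() - started})
            for successor in edges[node]:                             # step 407
                in_degree[successor] -= 1
                input_gb[successor] = input_gb.get(successor, 0.0) + output_gb
            remaining.remove(node)
    return history


edges = {"extract": ["sample"], "sample": ["kmeans"], "kmeans": []}
history = run_pipeline(
    edges,
    {"extract": 100.0},
    choose=lambda size, free: "Spark" if size * 2 <= free else "MapReduce",
    execute=lambda node, framework, size: size / 10,  # pretend each step shrinks the data
    cluster_free_gb=lambda: 64.0,
)
order = [(record["node"], record["framework"]) for record in history]
```

In this example the 100 GB extraction step goes to MapReduce, while the two smaller downstream steps go to Spark, so a single process ends up using both frameworks.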
It should be further noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, which is not repeated herein.
The computing framework selection method provided by the embodiment of the invention can select preset components according to a predetermined data mining process, convert the preset components into a directed acyclic graph, use an intelligent discrimination model to select a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster, submit the nodes to the selected data mining computing framework, and compute the nodes using that framework. Therefore, a different computing framework can be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible way, and improves the efficiency of the data mining process.
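The intelligent discrimination model is described as preferably generated by a C4.5 decision tree algorithm. C4.5 chooses splits by gain ratio, and the sketch below computes that criterion for a single numeric feature. It is not a full C4.5 implementation (no recursion, no pruning), candidate thresholds are taken as midpoints between consecutive distinct values as a simplification, and the feature (ratio of input data volume to free cluster resources) and the training records are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting at `value <= threshold`, the criterion C4.5 uses."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    if not left or not right:
        return 0.0
    total = len(labels)
    remainder = len(left) / total * entropy(left) + len(right) / total * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain / split_info


def best_threshold(values, labels):
    """Return the midpoint threshold with the highest gain ratio.

    Requires at least two distinct values.
    """
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints, key=lambda t: gain_ratio(values, labels, t))


# Invented records: data-to-free-resource ratio and the framework that ran faster.
ratios = [0.1, 0.3, 0.5, 2.0, 4.0, 8.0]
fastest = ["Spark", "Spark", "Spark", "MapReduce", "MapReduce", "MapReduce"]
threshold = best_threshold(ratios, fastest)   # 1.25 separates the two classes exactly
```

With records like those logged after step 407, the learned threshold would replace a hand-picked constant and could be recomputed as new runs are recorded.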
FIG. 6 is a schematic flowchart of yet another computing framework selection method according to an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps:
Step 501: the computing framework selection device selects preset components according to a predetermined data mining process.
Step 502: the computing framework selection device converts the preset components into a directed acyclic graph, wherein the nodes of the directed acyclic graph correspond to the components.
Step 503: the computing framework selection device screens the directed acyclic graph for nodes with an in-degree of 0.
Step 504: the computing framework selection device acquires the input data volume of the nodes with an in-degree of 0 and the resource usage of the cluster.
Step 505: the computing framework selection device uses an intelligent discrimination model to select a data mining computing framework for the nodes with an in-degree of 0 according to their input data volume and the resource usage of the cluster.
Step 506: the computing framework selection device submits the nodes to the data mining computing framework, and the nodes are computed using the data mining computing framework.
Step 507: the computing framework selection device deletes the nodes with an in-degree of 0.
Step 508: the computing framework selection device generates a data mining model according to the results of the data mining computing framework's computation on the nodes.
Step 509: the computing framework selection device converts the data mining model into a file in YAML (Yet Another Markup Language) format and stores the file on a predetermined path of the cluster.
In particular, since a YAML file is portable, a data mining model converted into a YAML file can be loaded on any data mining platform.
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
The computing framework selection method provided by the embodiment of the invention selects preset components according to a predetermined data mining process, converts the preset components into a directed acyclic graph, uses the intelligent discriminant model to select a data mining computing framework for each node according to the input data volume of the node and the resource usage of the cluster, submits the node to the selected computing framework, computes the node using that framework, and finally generates a data mining model from the computation results and stores it in YAML format on a predetermined path of the cluster. A different computing framework can therefore be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible manner, and improves the efficiency of the data mining process; in addition, a portable data mining model is generated that can be used by any data mining platform.
Fig. 7 is a schematic flowchart of the storage and application of a model according to an embodiment of the present invention. As shown in Fig. 7, a data mining model is generated through a series of computations, and the generated model is converted into a YAML file and stored on a predetermined path of the cluster. When the model is to be applied for production prediction or is to be tested, the YAML model file is loaded from that path, the model is converted into a model for the specific data mining platform according to the computing framework on which the component runs, and the data is finally analyzed using the converted model.
It should be noted that existing data mining models are stored in binary serialized form. To port such a model, it is necessary to know how the model was serialized, and in practical applications this is generally not known, so a model stored in this form is not portable.
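As a non-limiting illustration of storing a model as text rather than as a binary serialization, the following Python sketch writes and reads a flat model description using a small subset of YAML (one `key: value` pair per line, with numeric lists in flow style). The functions `dump_flat_yaml` and `load_flat_yaml` and the field names `algorithm` and `coefficients` are hypothetical and introduced only for this sketch; a practical implementation would more likely rely on a full YAML library and a richer model schema.

```python
def dump_flat_yaml(model):
    """Render a flat dict as YAML text (strings and lists of numbers only)."""
    lines = []
    for key, value in model.items():
        if isinstance(value, (list, tuple)):
            rendered = "[" + ", ".join(repr(v) for v in value) + "]"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


def load_flat_yaml(text):
    """Parse the subset written by dump_flat_yaml back into a dict."""
    model = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, raw = line.partition(": ")
        if raw.startswith("[") and raw.endswith("]"):
            inner = raw[1:-1].strip()
            # List items are read back as floats; an empty list stays empty.
            model[key] = [float(x) for x in inner.split(",")] if inner else []
        else:
            model[key] = raw
    return model


# Hypothetical model description for illustration only.
model = {"algorithm": "logistic_regression", "coefficients": [0.5, -1.25]}
text = dump_flat_yaml(model)
restored = load_flat_yaml(text)
```

Because the stored file is plain text with named fields, a platform that reads it does not need to know how the originating platform serializes its in-memory objects.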
Further, the method for selecting a computing framework provided by this embodiment further includes monitoring the operating condition of the component and locating the component in which the abnormality occurs.
It should be noted that when an abnormality occurs in the component during operation, the component is located.
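As a non-limiting illustration, monitoring the operating condition of the components and locating the component in which an abnormality occurs may be sketched in Python as follows. The function `run_with_monitoring` and the status strings are hypothetical names used only in this sketch.

```python
def run_with_monitoring(components):
    """Run components in order and report the first one that fails.

    components: list of (name, callable) pairs.
    Returns (statuses, failed_name); failed_name is None if all succeed.
    """
    statuses = {name: "pending" for name, _ in components}
    for name, func in components:
        statuses[name] = "running"
        try:
            func()
        except Exception as exc:
            # The abnormal component is located by name; later ones stay pending.
            statuses[name] = f"failed: {exc}"
            return statuses, name
        statuses[name] = "succeeded"
    return statuses, None
```

The status table shows which components have completed, which one failed, and which have not yet run.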
Further, the method for selecting a computing framework provided by this embodiment further includes maintaining data generated in the data mining process.
The data maintenance comprises maintenance of data in data tables and maintenance of data of the data mining model. A data table stores the intermediate data and result data of the data mining process, in table form, in Hive, a Hadoop-based data warehouse tool. Maintenance of data in a data table comprises viewing table data, visualizing table data, and deleting table data; maintenance of data of the data mining model comprises downloading model data, viewing model data, and deleting model data.
Fig. 8 is a schematic structural diagram of a computing framework selecting apparatus according to an embodiment of the present invention, and as shown in fig. 8, the apparatus 6 includes:
A selection module 61, configured to select preset components according to a predetermined data mining process, and to select a data mining computing framework for the nodes.
A conversion module 62, configured to convert the preset component into a directed acyclic graph; wherein, the nodes of the directed acyclic graph have corresponding relations with the components.
A processing module 63, configured to submit the nodes to the data mining computing framework and compute the nodes using that framework.
The computing framework selection device provided by the embodiment of the invention selects preset components according to a predetermined data mining process, converts the preset components into a directed acyclic graph, selects a data mining computing framework for each node, submits the node to the selected computing framework, and computes the node using that framework. A different computing framework can therefore be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible manner, and improves the efficiency of the data mining process.
Further, the predetermined data mining process includes a data extraction process, a data processing process, an algorithm application process, and a model establishment process, and the selection module 61 is specifically configured to perform the following:
determining the data source to be extracted according to the data extraction process, and selecting a data extraction class component according to the data source; determining the processing method for processing the data according to the data processing process, and selecting a data processing class component according to the processing method; determining the algorithm for building the model according to the algorithm application process, and selecting an algorithm class component according to the algorithm; and determining the purpose of the data mining according to the model establishment process, and selecting a modeling tool class component according to the purpose.
Further, the selection module 61 is specifically further configured to select a data mining computing framework for the nodes according to the input data volume of the nodes and the resource usage of the cluster, wherein the cluster is the storage space of the extracted data.
Fig. 9 is a schematic structural diagram of another computing framework selecting apparatus according to an embodiment of the present invention, and as shown in fig. 9, a selecting module 61 in the apparatus 6 includes:
A screening unit 611, configured to screen out the nodes with an in-degree of 0 in the directed acyclic graph.
An obtaining unit 612, configured to obtain the input data volume of the node with an in-degree of 0 and the resource usage of the cluster.
A selecting unit 613, configured to use the intelligent discriminant model to select a data mining computing framework for the node with an in-degree of 0 according to the input data volume of that node and the resource usage of the cluster.
The processing module 63 is specifically configured to submit the node with an in-degree of 0 to the selected data mining computing framework and compute the node using that framework.
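As a non-limiting illustration of the selecting unit, the following Python sketch chooses a framework from the input data volume, the cluster resource usage, and previously recorded runs. The internals of the intelligent discriminant model are not reproduced here; the rule below (prefer the framework with the lowest mean recorded running time on inputs of comparable size, otherwise fall back to a memory threshold), the framework labels `in_memory` and `disk_based`, and the factor of 3 are all assumptions made only for this sketch.

```python
from statistics import mean


def select_framework(input_bytes, free_memory_bytes, history=()):
    """Pick a framework label for one node (illustrative rule only).

    history: iterable of dicts with keys "framework", "input_bytes", "seconds".
    """
    # Group recorded running times by framework for inputs of comparable size.
    similar = {}
    for rec in history:
        if input_bytes / 2 <= rec["input_bytes"] <= input_bytes * 2:
            similar.setdefault(rec["framework"], []).append(rec["seconds"])
    if similar:
        return min(similar, key=lambda fw: mean(similar[fw]))
    # Fallback: use the in-memory framework only if the data fits comfortably.
    return "in_memory" if input_bytes * 3 <= free_memory_bytes else "disk_based"
```

Feeding each run record back into `history` is one simple way in which recorded run data could influence later selections.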
Further, the processing module 63 is further configured to: delete the node with an in-degree of 0; obtain a data mining model according to the results computed for the nodes by the data mining computing framework; and convert the data mining model into a file in YAML format and store the file on a predetermined path of the storage space.
Fig. 10 is a schematic structural diagram of another computing framework selecting apparatus according to an embodiment of the present invention, and as shown in fig. 10, the apparatus 6 further includes:
A monitoring module 64, configured to monitor the operating condition of the components and locate any component in which an abnormality occurs.
A maintenance module 65, configured to maintain the data generated during the data mining process.
It should be noted that, in the present embodiment, reference may be made to the method embodiments corresponding to fig. 1 to 6 for the interaction process between the modules, and details are not described here.
The computing framework selection device provided by the embodiment of the invention selects preset components according to a predetermined data mining process, converts the preset components into a directed acyclic graph, uses the intelligent discriminant model to select a data mining computing framework for each node according to the input data volume of the node and the resource usage of the cluster, submits the node to the selected computing framework, computes the node using that framework, and finally generates a data mining model from the computation results and stores it in YAML format on a predetermined path of the cluster. A different computing framework can therefore be selected for each step (node) of the data mining process, which solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible manner, and improves the efficiency of the data mining process; in addition, a portable data mining model is generated that can be used by any data mining platform.
In practical applications, the selection module 61, the screening unit 611, the obtaining unit 612, the selecting unit 613, the conversion module 62, the processing module 63, the monitoring module 64, and the maintenance module 65 may each be implemented by a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or the like located in the computing framework selection device.
Fig. 11 is a schematic structural diagram of another computing framework selecting apparatus according to an embodiment of the present invention, and as shown in fig. 11, the computing framework selecting apparatus 7 includes a data source management module 71, a design development module 72, an execution engine module 73, a process monitoring module 74, and a data management module 75.
The data source management module 71 is configured to extract data from a data source, and different processes may reuse the same data source to extract different data from the same data source; a design development module 72 for implementing data mining processes of data extraction, data processing, model building, model testing, model evaluation, model visualization and model prediction in a component selection manner; an execution engine module 73 for sequentially submitting the selected components onto the selected computing framework; a process monitoring module 74 for monitoring the operational status of the components of the data mining process; the data management module 75 is configured to manage the tabular data and the model data, perform visualization of the tabular data, and perform visualization of the model data.
Specifically, the data management module 75 mainly includes a table management module 751, a model management module 752, and a component management module 753.
The table management module 751 is configured to manage the data tables in the cluster, with functions including previewing table data, viewing table data, visualizing table data, and deleting tables. This greatly facilitates the user's management of the Hive data in the cluster.
The model management module 752 is configured to manage the model files generated by the data mining algorithms, with functions including previewing a model (that is, visualizing it), evaluating a model, downloading a model, and deleting a model. The model preview helps the user understand the model and make good use of it. The evaluation information shows the user the quality of a model and makes it easy to compare the models generated from the same data by different algorithms, or by the same algorithm with different parameters, so that the user can select the most suitable model. Downloading a model means downloading its YAML file to the user's local machine, from which the model can be ported to any other data mining platform according to the given documentation and rules; a data mining model built by the user can thus be reused on other systems or platforms. Deleting a model mainly means removing a model that is no longer used, which reduces disk usage.
The component management module 753 is configured to manage built-in components and custom components. Managing a built-in component mainly means setting whether the component is available. If a built-in component is set to the available state, it is visible on the design and development page and can be used; if it is set to the unavailable state, it is hidden on the design and development page and cannot be used to construct a data mining process. Managing a custom component mainly comprises defining the component, setting its state, and deleting it. A custom component is likewise defined by configuring the inputs, outputs, and parameters of its front end by drag and drop, and once it has been defined its page configuration can be previewed; the back-end code that the component executes must be written according to a given interface and specification, then packaged and uploaded to the server when the component is defined. The state of a custom component is set in the same way as that of a built-in component. Deleting a custom component removes both its front-end configuration and the corresponding back-end package.
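As a non-limiting illustration, the availability state and deletion behaviour described for the component management module may be sketched in Python as follows. The class `ComponentRegistry` and its method names are hypothetical; the rule that built-in components cannot be deleted is an assumption inferred from the fact that deletion is described only for custom components.

```python
class ComponentRegistry:
    """Tracks built-in and custom components and their availability."""

    def __init__(self):
        self._components = {}

    def register(self, name, builtin=True, enabled=True):
        self._components[name] = {"builtin": builtin, "enabled": enabled}

    def set_enabled(self, name, enabled):
        # The same state switch applies to built-in and custom components.
        self._components[name]["enabled"] = enabled

    def delete(self, name):
        # Assumption: only custom components may be deleted.
        if self._components[name]["builtin"]:
            raise ValueError("built-in components cannot be deleted")
        del self._components[name]

    def visible(self):
        """Names shown on the design and development page."""
        return sorted(n for n, c in self._components.items() if c["enabled"])
```

A component that is set to the unavailable state simply drops out of `visible()`, which corresponds to hiding it on the design and development page.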
Fig. 12 is a schematic flow chart of a working process of the execution engine module provided in this embodiment, and as shown in fig. 12, the process includes a flow parsing process, an intelligent selection process, a submission execution process, and a model storage process.
In the flow parsing process, the developed data mining process is parsed into executable logic units. In the intelligent selection process, for each logic unit, the intelligent discriminant model is used to select the computing framework to which the unit will be submitted, according to the data volume and the current resource usage of the cluster. In the submission execution process, each logic unit to be executed is submitted to the selected computing framework for computation. In the model storage process, the model generated by the computation is converted into a file in YAML format and stored.
The computing framework selection device provided by the embodiment of the invention uses the data source management module, the design development module, the execution engine module, the process monitoring module, and the data management module to select a different computing framework for each step (node) of the data mining process. This solves the prior-art problem that only one computing framework can be selected for the whole data mining process, achieves flexible selection of computing frameworks in a simple and feasible manner, and improves the efficiency of the data mining process. In addition, a portable data mining model is generated that can be used by any data mining platform, the progress of the whole data mining process can be monitored, and problems can be located and repaired promptly.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (9)

1. A computing framework selection method, the method comprising:
selecting a preset component according to a preset data mining process;
converting the preset component into a directed acyclic graph; wherein the nodes of the directed acyclic graph have corresponding relations with the components;
screening out nodes with an in-degree of 0 in the directed acyclic graph;
acquiring the input data volume of the node with an in-degree of 0 and the resource usage of the cluster;
selecting, by using an intelligent discriminant model, a data mining computing framework for the node with an in-degree of 0 according to the input data volume of the node with an in-degree of 0 and the resource usage of the cluster; wherein the cluster is a storage space of the extracted data;
submitting the node with an in-degree of 0 to the selected data mining computing framework, and computing the node by using the computing framework.
2. The method of claim 1, wherein the predetermined data mining process comprises: the method comprises a data extraction process, a data processing process, an algorithm application process and a model establishment process, wherein the preset components are selected according to a preset data mining process, and the method comprises the following steps:
determining a data source to be extracted according to the data extraction process, and selecting a data extraction type component according to the data source;
determining a processing method for processing data according to the data processing process, and selecting a data processing assembly according to the processing method;
determining an algorithm established by a model according to the algorithm application process, and selecting an algorithm component according to the algorithm;
and determining the purpose of data mining according to the model building process, and selecting a modeling tool class component according to the purpose.
3. The method of claim 1, wherein after submitting the node with an in-degree of 0 to the selected computing framework for data mining and computing the node using the computing framework, the method further comprises:
deleting the node with an in-degree of 0.
4. The method of claim 1, further comprising:
generating a data mining model according to the results computed for the nodes by the data mining computing framework;
converting the data mining model into a file in YAML format and storing the file on a predetermined path of the cluster.
5. The method of claim 1, further comprising:
monitoring the operating condition of the components and locating any component in which an abnormality occurs.
6. The method of claim 4, further comprising:
maintaining data generated during the data mining process.
7. A computing framework selection apparatus, the apparatus comprising:
a selection module, configured to select preset components according to a predetermined data mining process, and to select a data mining computing framework for the nodes;
a conversion module, configured to convert the preset components into a directed acyclic graph; wherein the nodes of the directed acyclic graph correspond to the components;
wherein the selection module comprises: a screening unit, configured to screen out nodes with an in-degree of 0 in the directed acyclic graph; an obtaining unit, configured to acquire the input data volume of the node with an in-degree of 0 and the resource usage of the cluster; and a selecting unit, configured to select, by using an intelligent discriminant model, a data mining computing framework for the node with an in-degree of 0 according to the input data volume of the node with an in-degree of 0 and the resource usage of the cluster; wherein the cluster is a storage space of the extracted data;
and a processing module, configured to submit the node with an in-degree of 0 to the selected data mining computing framework and compute the node by using the computing framework.
8. The apparatus of claim 7, wherein the predetermined data mining process comprises: the system comprises a data extraction process, a data processing process, an algorithm application process and a model establishment process, wherein the selection module is specifically used for:
determining a data source to be extracted according to the data extraction process, and selecting a data extraction type component according to the data source;
determining a processing method for processing data according to the data processing process, and selecting a data processing assembly according to the processing method;
determining an algorithm established by a model according to the algorithm application process, and selecting an algorithm component according to the algorithm;
and determining the purpose of data mining according to the model building process, and selecting a modeling tool class component according to the purpose.
9. The apparatus of claim 7, wherein the processing module is further configured to:
obtaining a data mining model according to the results computed for the nodes by the data mining computing framework;
and converting the data mining model into a file in YAML format and storing the file on a predetermined path of the storage space.
CN201610981871.5A 2016-11-08 2016-11-08 Computing framework selection method and device Active CN108073582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610981871.5A CN108073582B (en) 2016-11-08 2016-11-08 Computing framework selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610981871.5A CN108073582B (en) 2016-11-08 2016-11-08 Computing framework selection method and device

Publications (2)

Publication Number Publication Date
CN108073582A CN108073582A (en) 2018-05-25
CN108073582B true CN108073582B (en) 2021-08-06

Family

ID=62154125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610981871.5A Active CN108073582B (en) 2016-11-08 2016-11-08 Computing framework selection method and device

Country Status (1)

Country Link
CN (1) CN108073582B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763573A (en) * 2018-06-06 2018-11-06 众安信息技术服务有限公司 A kind of OLAP engines method for routing and system based on machine learning
CN111382193A (en) * 2018-12-28 2020-07-07 顺丰科技有限公司 Method and device for constructing data warehouse topic model
CN110363280A (en) * 2019-09-02 2019-10-22 国家气象信息中心 Algorithm model training analysis system
CN113342489A (en) * 2021-05-25 2021-09-03 上海商汤智能科技有限公司 Task processing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573063A (en) * 2015-01-23 2015-04-29 四川中科腾信科技有限公司 Data analysis method based on big data
CN104834561A (en) * 2015-04-29 2015-08-12 华为技术有限公司 Data processing method and device
CN106020811A (en) * 2016-05-13 2016-10-12 乐视控股(北京)有限公司 Development method and device of algorithm model

Also Published As

Publication number Publication date
CN108073582A (en) 2018-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant