CN102693317B - Method and device for data mining process generating - Google Patents

Info

Publication number
CN102693317B
Authority
CN
China
Prior art keywords
iteration
node
data
time
assessment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210171554.9A
Other languages
Chinese (zh)
Other versions
CN102693317A (en)
Inventor
刘诗凯
杨志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201210171554.9A
Publication of CN102693317A
Application granted
Publication of CN102693317B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for generating a data mining process. The method includes the following steps: metadata and training data submitted by an input device are received, where the metadata includes an output data type; the preset output data types of multiple algorithm nodes are matched against the output data type in the metadata to determine the successfully matched algorithm nodes; all forward and backward nodes of each successfully matched algorithm node are recursively searched to determine the multiple data mining paths corresponding to each successfully matched algorithm node; multiple data mining processes are created according to preset node parameter templates; and the multiple data mining processes are each verified with the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling.

Description

Method and device for generating a data mining process
Technical field
Embodiments of the present invention relate to communication technologies, and in particular to a method and a device for generating a data mining process.
Background art
At present, modeling in the data mining field is mainly carried out by experts who are proficient in both the business domain and the algorithms, which prevents data mining from being widely adopted.
For example, the data mining tool platform Clementine provides a graphical user interface that lets analysts visualize each step of the data mining process. By interacting with the data stream, analysts and business personnel can cooperate and incorporate business knowledge into the data mining process. Fig. 1 is a schematic diagram of how Clementine generates a data mining process. As shown in Fig. 1, after business understanding and data understanding, analysts must rely on considerable professional knowledge and their understanding of the data to repeatedly carry out data preparation and modeling, evaluate the result of each round of data preparation and modeling, and finally determine and publish the optimal data mining process.
It can be seen that, in the existing method for generating a data mining process, modeling is difficult, and determining the optimal data mining process is even more difficult and involves a complex procedure.
Summary of the invention
Embodiments of the present invention provide a method and a device for generating a data mining process, to solve the problems that, in the existing method for generating a data mining process, modeling is difficult and determining the optimal data mining process is even more difficult and procedurally complex.
A first aspect of the present invention provides a method for generating a data mining process, including:
A data mining process generating device receives metadata and training data submitted by an input device, where the metadata includes an output data type;
The output data types of multiple preset algorithm nodes are matched against the output data type in the metadata, and the successfully matched algorithm nodes are determined;
According to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched, and the multiple data mining paths corresponding to each successfully matched algorithm node are determined;
Multiple data mining processes corresponding to the multiple data mining paths are created according to preset node parameter templates;
The multiple data mining processes are each verified with the training data, and the optimal data mining process among the multiple data mining processes is determined according to the verification results and a preset evaluation rule.
Further, the metadata also includes an input data type;
Before the multiple data mining processes corresponding to the multiple data mining paths are created according to the preset node parameter templates, the method further includes:
According to the input data type of each successfully matched algorithm node and the input data type in the metadata, the type conversion node on which each successfully matched algorithm node depends is determined. The target type of that type conversion node includes the input data type of the successfully matched algorithm node, and the source type of that type conversion node includes the input data type in the metadata;
Creating the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates specifically includes:
For each data mining path, it is determined whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends; if so, the data mining process corresponding to that data mining path is created according to the preset node parameter templates.
Further, the training data includes multiple records; verifying each of the multiple data mining processes with the training data and determining the optimal data mining process among them according to the verification results and the preset evaluation rule specifically includes:
The training data is sampled according to a preset sampling rule, each of the multiple data mining processes is verified with the sampled records, and the optimal data mining process among the multiple data mining processes is determined according to the verification results and the preset evaluation rule.
Further, the evaluation rule includes at least one evaluation factor and a baseline condition corresponding to each evaluation factor;
Sampling the training data according to the preset sampling rule, verifying each of the multiple data mining processes with the sampled records, and determining the optimal data mining process according to the verification results and the preset evaluation rule specifically includes:
The training data is divided into N parts of training sub-data according to a preset number of iterations N;
The first part of the training sub-data is sampled a first time according to the sampling rule corresponding to the first iteration, and the records obtained by the first sampling are input to each data mining process to obtain the first-iteration verification result of each data mining process; the first-iteration verification result includes a first-iteration evaluation value for each evaluation factor;
The first-iteration evaluation value of each evaluation factor of each data mining process is compared with the baseline condition corresponding to that evaluation factor, and the data mining processes whose first-iteration evaluation values all satisfy the corresponding baseline conditions are determined to be first-iteration processes;
The second part of the training sub-data is sampled a second time according to the sampling rule corresponding to the second iteration, and the records obtained by the second sampling are input to each first-iteration process; verification is iterated in this way until the N-th-iteration process is determined, and the determined N-th-iteration process is taken as the optimal data mining process.
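The iterative verification described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: each data mining process is modeled as a hypothetical callable that takes a list of records and returns a dictionary of evaluation values, each baseline condition is modeled as a predicate, and the sampling rule is reduced to a fixed sample size per iteration. The names `split_into_parts`, `meets_baseline`, and `iterate_validation` are invented for this example.

```python
import random


def split_into_parts(records, n):
    """Divide the training data into n consecutive, nearly equal parts."""
    base, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def meets_baseline(result, baseline):
    """True if every evaluation factor satisfies its baseline condition."""
    return all(check(result[factor]) for factor, check in baseline.items())


def iterate_validation(processes, records, n, sample_sizes, baseline, seed=0):
    """Run n verification iterations and return the surviving process names.

    processes:    dict name -> callable(sample) -> dict factor -> value
                  (hypothetical stand-in for a data mining process)
    sample_sizes: one sample size per iteration (simplified sampling rule)
    baseline:     dict factor -> predicate on the evaluation value
    """
    rng = random.Random(seed)
    survivors = dict(processes)
    for part, size in zip(split_into_parts(records, n), sample_sizes):
        # One sample per iteration, shared by all surviving processes.
        sample = rng.sample(part, min(size, len(part)))
        survivors = {
            name: run
            for name, run in survivors.items()
            if meets_baseline(run(sample), baseline)
        }
    return list(survivors)
```

Each iteration uses a different part of the training data, so a process that survives to the N-th iteration has met every baseline condition on N independent samples.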
Optionally, inputting the records obtained by the first sampling to each data mining process specifically includes:
If the algorithm node corresponding to the data mining process requires record integrity, incomplete records are filtered out of the records obtained by the first sampling according to the metadata;
If the metadata contains a value range requirement, records that do not satisfy the value range requirement are filtered out of the records obtained by the first sampling;
The filtered records are input to the data mining process.
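The two filtering rules above can be illustrated with a short Python sketch. It assumes records are dictionaries keyed by attribute name, that a missing value is represented by `None` or an empty string, and that a value range is either a `(low, high)` tuple or a set of allowed values; none of these representations come from the patent text. Treating a missing value as not violating a range, so that only the integrity rule removes it, is also an assumption.

```python
def is_complete(record):
    """A record is complete when no attribute value is missing."""
    return all(value is not None and value != "" for value in record.values())


def in_range(value, allowed):
    """Check one value against a range: (low, high) tuple or a set of values."""
    if allowed is None or value is None:
        # Assumption: missing values are handled by the integrity rule only.
        return True
    if isinstance(allowed, tuple):
        low, high = allowed
        return low <= value <= high
    return value in allowed


def filter_records(records, ranges, require_complete):
    """Drop incomplete records (if required) and out-of-range records."""
    kept = []
    for record in records:
        if require_complete and not is_complete(record):
            continue
        if not all(in_range(record.get(name), allowed)
                   for name, allowed in ranges.items()):
            continue
        kept.append(record)
    return kept


sampled = [
    {"Sex": "male", "Age": 32, "Status": 0},
    {"Sex": "female", "Age": None, "Status": 1},   # incomplete
    {"Sex": "male", "Age": 200, "Status": 1},      # Age outside 0-150
]
ranges = {"Sex": {"male", "female"}, "Age": (0, 150), "Status": {0, 1}}
for_strict_node = filter_records(sampled, ranges, require_complete=True)
for_lenient_node = filter_records(sampled, ranges, require_complete=False)
```

With these inputs, a node that requires record integrity receives only the first record, while a node that does not require it also receives the record with the missing age.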
Optionally, the evaluation rule also includes a weight order corresponding to each evaluation factor;
Before the records obtained by the second sampling are input to each first-iteration process, the method further includes:
If the number of first-iteration processes exceeds the preset maximum number N of processes to be selected in the first iteration, the first-iteration processes are sorted according to the weight order of the evaluation factors and the first-iteration evaluation value of each evaluation factor of each first-iteration process, and the N first-iteration processes ranked in the top N are determined;
Inputting the records obtained by the second sampling to each first-iteration process specifically includes:
The records obtained by the second sampling are input to the N first-iteration processes.
Optionally, after the records obtained by the second sampling are input to each first-iteration process, the method further includes:
The second-iteration verification result of each first-iteration process is obtained; the second-iteration verification result includes a second-iteration evaluation value for each evaluation factor;
The second-iteration evaluation value of each evaluation factor of each first-iteration process is compared with the baseline condition corresponding to that evaluation factor, and the first-iteration processes whose second-iteration evaluation values all satisfy the corresponding baseline conditions are determined to be second-iteration processes;
If the number of second-iteration processes exceeds the preset maximum number M of processes to be selected in the second iteration, a combined evaluation value of each evaluation factor of each second-iteration process is determined according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration and second-iteration evaluation values of each evaluation factor of each second-iteration process;
The second-iteration processes are sorted according to the weight order of the evaluation factors and the combined evaluation value of each evaluation factor of each second-iteration process, and the M second-iteration processes ranked in the top M are determined;
The third part of the training sub-data is sampled a third time according to the sampling rule corresponding to the third iteration, and the records obtained by the third sampling are input to the M second-iteration processes.
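The ranking step can be sketched as below. The text does not give the formula for the combined evaluation value or the exact meaning of the weight order, so this sketch makes two assumptions: the combined value is a weighted sum of the first-iteration and second-iteration values, and the weight order is a priority list of evaluation factors used for a lexicographic sort in which larger values are better. The function names and the sample numbers are invented for illustration.

```python
def combined_values(first, second, weight_first, weight_second):
    """Weighted sum of two iterations' evaluation values (assumed formula)."""
    return {
        factor: weight_first * first[factor] + weight_second * second[factor]
        for factor in first
    }


def top_processes(scores, factor_order, limit):
    """Rank processes by factors in priority order and keep the top `limit`.

    scores:       dict process name -> dict factor -> evaluation value
    factor_order: evaluation factors, most important first (assumed meaning
                  of the "weight order")
    """
    ranked = sorted(
        scores,
        key=lambda name: tuple(scores[name][f] for f in factor_order),
        reverse=True,  # assumption: a larger value is better for every factor
    )
    return ranked[:limit]


first_iter = {
    "A": {"accuracy": 0.75, "speed": 0.5},
    "B": {"accuracy": 0.5, "speed": 1.0},
    "C": {"accuracy": 0.75, "speed": 0.25},
}
second_iter = {
    "A": {"accuracy": 0.5, "speed": 0.5},
    "B": {"accuracy": 0.5, "speed": 0.5},
    "C": {"accuracy": 0.5, "speed": 1.0},
}
combined = {
    name: combined_values(first_iter[name], second_iter[name], 0.25, 0.75)
    for name in first_iter
}
selected = top_processes(combined, ["accuracy", "speed"], limit=2)
```

Here A and C tie on combined accuracy, so the second-priority factor decides their order, and B is dropped because only M = 2 processes are kept.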
Further, determining the multiple data mining paths corresponding to each successfully matched algorithm node specifically includes:
For each successfully matched algorithm node, at least one data mining path corresponding to that algorithm node is determined from the algorithm node together with all of its forward nodes and backward nodes.
Further, creating the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates specifically includes:
The data mining process corresponding to each data mining path is created according to the node parameter template of each node in that data mining path.
Another aspect of the present invention provides a device for generating a data mining process, including:
A receiving module, configured to receive metadata and training data submitted by an input device, where the metadata includes an output data type;
A matching module, configured to match the output data types of multiple preset algorithm nodes against the output data type in the metadata and determine the successfully matched algorithm nodes;
A path determination module, configured to recursively search, according to a preset node relationship table, for all forward nodes and backward nodes of each successfully matched algorithm node, and determine the multiple data mining paths corresponding to each successfully matched algorithm node;
A process creation module, configured to create, according to preset node parameter templates, multiple data mining processes corresponding to the multiple data mining paths;
A verification module, configured to verify each of the multiple data mining processes with the training data and determine the optimal data mining process among the multiple data mining processes according to the verification results and a preset evaluation rule.
Further, the metadata also includes an input data type, and the device further includes:
A conversion matching module, configured to determine, before the process creation module creates the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates, the type conversion node on which each successfully matched algorithm node depends, according to the input data type of each successfully matched algorithm node and the input data type in the metadata. The target type of that type conversion node includes the input data type of the successfully matched algorithm node, and the source type of that type conversion node includes the input data type in the metadata;
The process creation module is specifically configured to:
Determine, for each data mining path, whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends; if so, create the data mining process corresponding to that data mining path according to the preset node parameter templates.
Further, the training data includes multiple records, and the verification module is specifically configured to:
Sample the training data according to a preset sampling rule, verify each of the multiple data mining processes with the sampled records, and determine the optimal data mining process among the multiple data mining processes according to the verification results and the preset evaluation rule.
Further, the evaluation rule includes at least one evaluation factor and a baseline condition corresponding to each evaluation factor;
The verification module specifically includes:
A data division unit, configured to divide the training data into N parts of training sub-data according to a preset number of iterations N;
A verification unit, configured to sample the first part of the training sub-data a first time according to the sampling rule corresponding to the first iteration, and input the records obtained by the first sampling to each data mining process to obtain the first-iteration verification result of each data mining process, where the first-iteration verification result includes a first-iteration evaluation value for each evaluation factor;
An evaluation unit, configured to compare the first-iteration evaluation value of each evaluation factor of each data mining process with the baseline condition corresponding to that evaluation factor in the evaluation rule, and determine that the data mining processes whose first-iteration evaluation values all satisfy the corresponding baseline conditions are first-iteration processes;
An iteration unit, configured to sample the second part of the training sub-data a second time according to the sampling rule corresponding to the second iteration, input the records obtained by the second sampling to each first-iteration process, iterate the verification until the N-th-iteration process is determined, and take the determined N-th-iteration process as the optimal data mining process.
Further, the verification unit is specifically configured to:
If the algorithm node corresponding to the data mining process requires record integrity, filter incomplete records out of the records obtained by the first sampling according to the metadata;
If the metadata contains a value range requirement, filter out of the records obtained by the first sampling the records that do not satisfy the value range requirement;
Input the filtered records to the data mining process.
Optionally, the evaluation rule also includes a weight order corresponding to each evaluation factor;
The evaluation unit is further configured to:
If the number of first-iteration processes exceeds the preset maximum number N of processes to be selected in the first iteration, sort the first-iteration processes according to the weight order of the evaluation factors and the first-iteration evaluation value of each evaluation factor of each first-iteration process, and determine the N first-iteration processes ranked in the top N;
The iteration unit is specifically configured to input the records obtained by the second sampling to the N first-iteration processes.
Optionally, the iteration unit is further configured to:
After the records obtained by the second sampling are input to each first-iteration process, obtain the second-iteration verification result of each first-iteration process, where the second-iteration verification result includes a second-iteration evaluation value for each evaluation factor;
Compare the second-iteration evaluation value of each evaluation factor of each first-iteration process with the baseline condition corresponding to that evaluation factor, and determine that the first-iteration processes whose second-iteration evaluation values all satisfy the corresponding baseline conditions are second-iteration processes;
If the number of second-iteration processes exceeds the preset maximum number M of processes to be selected in the second iteration, determine a combined evaluation value of each evaluation factor of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration and second-iteration evaluation values of each evaluation factor of each second-iteration process;
Sort the second-iteration processes according to the weight order of the evaluation factors and the combined evaluation value of each evaluation factor of each second-iteration process, and determine the M second-iteration processes ranked in the top M;
Sample the third part of the training sub-data a third time according to the sampling rule corresponding to the third iteration, and input the records obtained by the third sampling to the M second-iteration processes.
At least one of the above technical solutions has the following beneficial effects or advantages:
In the embodiments of the present invention, the output data types of multiple algorithm nodes preset according to expert experience are matched; all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched according to a node relationship table preset according to expert experience, to determine multiple data mining paths; the corresponding multiple data mining processes are created from the node parameter templates of each node, which are also preset according to expert experience; the multiple data mining processes are then each verified with the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. This solves the problems that, in the existing method for generating a data mining process, modeling is difficult and determining the optimal data mining process is even more difficult and procedurally complex. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of how Clementine generates a data mining process;
Fig. 2 is a schematic flowchart of a method for generating a data mining process according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for generating a data mining process according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 2 is a schematic flowchart of a method for generating a data mining process according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
201. A data mining process generating device receives metadata and training data submitted by an input device, where the metadata includes an output data type.
The data mining process generating device is the device provided by the embodiments of the present invention. The input device may be a user interface provided by the data mining process generating device, or may be a device independent of the data mining process generating device. When the input device is an independent device, it submits the metadata and training data through a custom interface between itself and the data mining process generating device.
Specifically, the training data has the same format as data exported from a general-purpose database, for example "A|B|C|D". The metadata is the definition of the training data, also called the data description. Table 1 is an example of metadata for analyzing whether users have left the network (churned).
Table 1
Attribute name   Data type   Attribute direction   Value range     Default value
UserID           String      ID                    NULL            NULL
Sex              String      Input                 Male, female    NULL
Age              Integer     Input                 0-150           NULL
Status           Integer     Output                0, 1            NULL
As shown in Table 1, the metadata defines the four attributes of the training data and, for each attribute, its data type, attribute direction, value range, and default value. The data type may be string (String), integer (Integer), and so on; the value range and default value may be empty (NULL). The attribute direction is one of input, output, or ID; an attribute whose direction is ID is the primary key of a record and does not need to be analyzed during data mining. The four attributes are user ID (UserID), sex (Sex), age (Age), and user status (Status), where Status 0 means the user has left the network and Status 1 means the user is on the network. Training data usually contains multiple records, and each record contains a value for each attribute. For example, the record "Xiao Ming|male|32|0" indicates that the user whose UserID is "Xiao Ming", whose sex is "male", and whose age is "32" has left the network.
Specifically, the data type of an attribute whose direction is "input" is an input data type, and the data type of an attribute whose direction is "output" is an output data type. The training data and metadata in step 201 are usually provided as files.
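As an illustration of how the metadata of Table 1 describes a training record, the following Python sketch parses one "|"-delimited record. The `Attribute` class, the in-memory form of the metadata, and the helper names are assumptions made for this example; the patent only states that metadata and training data are supplied as files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    name: str
    data_type: str                      # "String" or "Integer", as in Table 1
    direction: str                      # "ID", "Input" or "Output"
    value_range: Optional[str] = None   # None stands for NULL
    default: Optional[str] = None


# Hypothetical in-memory form of the metadata in Table 1.
METADATA = [
    Attribute("UserID", "String", "ID"),
    Attribute("Sex", "String", "Input", "male,female"),
    Attribute("Age", "Integer", "Input", "0-150"),
    Attribute("Status", "Integer", "Output", "0,1"),
]


def parse_record(line, metadata, delimiter="|"):
    """Split one training record and convert each field to its declared type."""
    fields = line.split(delimiter)
    if len(fields) != len(metadata):
        raise ValueError("record does not match the metadata definition")
    return {
        attr.name: int(raw) if attr.data_type == "Integer" else raw
        for attr, raw in zip(metadata, fields)
    }


def attributes_by_direction(metadata, direction):
    """Names of the attributes with the given direction, in metadata order."""
    return [attr.name for attr in metadata if attr.direction == direction]


record = parse_record("Xiao Ming|male|32|0", METADATA)
input_attributes = attributes_by_direction(METADATA, "Input")
output_attributes = attributes_by_direction(METADATA, "Output")
```

The attribute with direction ID is parsed but would be skipped during analysis, since it only identifies the record.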
202. The output data types of multiple preset algorithm nodes are matched against the output data type in the metadata, and the successfully matched algorithm nodes are determined.
Specifically, the multiple algorithm nodes are preset according to expert experience. Each algorithm node corresponds to the implementation of one algorithm; it mines the associations within the input training data and outputs a model based on those associations. Each algorithm node usually has corresponding constraint information, including the node category, input data types, output data types, and whether record integrity is required. Table 2 is an example of the constraint information of two algorithm nodes.
Table 2
The input and output data type IDs are enumerated values; the corresponding data types are shown in Table 3.
Table 3
ID   Data type
1    Continuous
2    Multi-valued discrete
3    Binary discrete
As shown in Tables 2 and 3, the node category of the Bayesian algorithm node (Bayesian) is predictive (Predictive), its input data type is multi-valued discrete or binary discrete, its output data type is multi-valued discrete or binary discrete, and it does not require record integrity. The node category of the logistic algorithm node (Logistic) is predictive (Predictive), its input data type is continuous, multi-valued discrete, or binary discrete, its output data type is binary discrete, and it requires record integrity.
Note that one or more algorithm nodes may be successfully matched in step 202. For example, if the output data type in the metadata is binary discrete, the successfully matched algorithm nodes are Bayesian and Logistic; if the output data type in the metadata is multi-valued discrete, the successfully matched algorithm node is Bayesian.
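The matching in step 202 can be illustrated with a minimal Python sketch. The constraint information is taken from the description of the Bayesian and Logistic nodes above, and the type IDs follow Table 3; the dictionary layout and the function name `match_algorithm_nodes` are assumptions made for this example.

```python
# Data type IDs as enumerated in Table 3.
CONTINUOUS, MULTI_DISCRETE, BINARY_DISCRETE = 1, 2, 3

# Constraint information of the two example algorithm nodes
# (hypothetical in-memory layout).
ALGORITHM_NODES = {
    "Bayesian": {
        "category": "Predictive",
        "input_types": {MULTI_DISCRETE, BINARY_DISCRETE},
        "output_types": {MULTI_DISCRETE, BINARY_DISCRETE},
        "requires_complete_records": False,
    },
    "Logistic": {
        "category": "Predictive",
        "input_types": {CONTINUOUS, MULTI_DISCRETE, BINARY_DISCRETE},
        "output_types": {BINARY_DISCRETE},
        "requires_complete_records": True,
    },
}


def match_algorithm_nodes(output_type, nodes=ALGORITHM_NODES):
    """Return the algorithm nodes whose output types include `output_type`."""
    return [
        name
        for name, constraints in nodes.items()
        if output_type in constraints["output_types"]
    ]


binary_matches = match_algorithm_nodes(BINARY_DISCRETE)
multi_matches = match_algorithm_nodes(MULTI_DISCRETE)
```

For a binary discrete output such as Status in Table 1, both nodes match; for a multi-valued discrete output only Bayesian matches, which agrees with the example in the text.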
203. According to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched, and the multiple data mining paths corresponding to each successfully matched algorithm node are determined.
Specifically, the node relationship table is preset according to expert experience. The forward nodes of an algorithm node include its predecessor node, the predecessor of that predecessor, and so on; the backward nodes of an algorithm node include the successor node whose predecessor is the algorithm node, the successor whose predecessor is that successor, and so on. Table 4 is an example of a node relationship table.
Table 4
As shown in Table 4, the node relationship table stores, for each node, its node parameter template name, the storage path of the node parameter template, and the template name of its predecessor node. Based on Table 4, it can be determined that the predecessor of the algorithm node Bayesian is feature selection (Feature-Select), the predecessor of Feature-Select is binning (Binning), the predecessor of Binning is type (Type), the predecessor of Type is file import (FileImport), and FileImport has no predecessor, so FileImport is the start node of the data mining path. In addition, the algorithm node Bayesian is the predecessor of model application (Model-Apply), and Model-Apply is the predecessor of model evaluation (Model-Evaluation); that is, Model-Apply is the successor of Bayesian, and Model-Evaluation is the successor of Model-Apply. Because Model-Evaluation is not the predecessor of any other node, it is the end node of the data mining path. In summary, the node relationship table in Table 4 yields one data mining path, FileImport → Type → Binning → Feature-Select → Bayesian → Model-Apply → Model-Evaluation, which corresponds to the algorithm node Bayesian.
Note that a node may have one or more node parameter templates. A node may have one or more predecessor nodes and may itself be the predecessor of one or more other nodes; therefore, one or more data mining paths can be obtained from a single algorithm node.
Specifically, determining the multiple data mining paths corresponding to each successfully matched algorithm node specifically includes: for each successfully matched algorithm node, at least one data mining path corresponding to that algorithm node is determined from the algorithm node together with all of its forward nodes and backward nodes.
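The recursive search in step 203 can be sketched as follows. The node relationship table is reduced here to a hypothetical dictionary that maps each node to its predecessor nodes, using the relationships described for the Bayesian example; the template names and storage paths that the real table also holds are omitted. The sketch assumes the table contains no cycles.

```python
# Node -> predecessor nodes, following the relationships described in the text.
PREDECESSORS = {
    "FileImport": [],
    "Type": ["FileImport"],
    "Binning": ["Type"],
    "Feature-Select": ["Binning"],
    "Bayesian": ["Feature-Select"],
    "Model-Apply": ["Bayesian"],
    "Model-Evaluation": ["Model-Apply"],
}


def chains_to(node, table):
    """All node chains from a start node (no predecessor) up to `node`."""
    predecessors = table.get(node, [])
    if not predecessors:
        return [[node]]
    return [
        chain + [node]
        for predecessor in predecessors
        for chain in chains_to(predecessor, table)
    ]


def chains_from(node, table):
    """All node chains from `node` down to an end node (no successor)."""
    successors = [name for name, preds in table.items() if node in preds]
    if not successors:
        return [[node]]
    return [
        [node] + chain
        for successor in successors
        for chain in chains_from(successor, table)
    ]


def mining_paths(algorithm_node, table):
    """Combine every forward chain with every backward chain of the node."""
    return [
        head + tail[1:]  # tail[0] is the algorithm node itself
        for head in chains_to(algorithm_node, table)
        for tail in chains_from(algorithm_node, table)
    ]


paths = mining_paths("Bayesian", PREDECESSORS)
```

If a node had two predecessors or two successors, the same functions would return one path per combination, which is how a single algorithm node can yield several data mining paths.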
204. Create multiple data mining processes corresponding to the multiple data mining paths according to preset node parameter templates.
Specifically, the node parameter template of each node is preset according to expert experience. A node parameter template may be a FreeMarker template saved in ftl format. Most parameters of a node parameter template are fixed based on expert experience; only a few parameters related to the input data, such as the input data type, are reserved to be filled in while the data mining process is generated.
For example, in the preset node parameter template FileImport-1, the fixed parameters may include: input file name (inputFileName), encoding (encoding), first row as names (first_row_as_names), and column delimiter (columnDelimiter). The value of the parameter inputFileName is the name of the file used when the metadata and training data are input in file form, for example "TrainData". The value of the parameter encoding indicates the encoding format, which may be 8-bit Unicode Transformation Format (UTF-8), American Standard Code for Information Interchange (ASCII), and so on. The value of the parameter first_row_as_names indicates whether the first row of the input file can be used as the data description of the file: "true" means it can and "false" means it cannot; this parameter is mainly used to simplify configuration. The value of the parameter columnDelimiter indicates the separator between fields in the input file, which may be ",".
In the preset node parameter template Binning-1, the fixed parameters may include: binning type (binningtype), binning detail (binningdetail), and binning columns (binningColumns). The binning type includes, for example, equal-depth binning, equal-width binning, and standard-deviation binning. The value of the parameter binningdetail indicates the specific parameters of the chosen binning type, such as the number of bins required. The value of the parameter binningColumns indicates the field to be binned; for example, the value "Age" means binning by the Age field.
In the preset node parameter template FeatureSelection-1, the fixed parameters may include: missing ratio threshold (missingRatioThreshold), repeat ratio threshold (repeatRatioThreshold), different-value ratio threshold (diffValueRatioThreshold), confidence (confidence), and maximum discretized number (maxDiscretizedNum). The value of the parameter missingRatioThreshold indicates the threshold for the proportion of incomplete records in the total number of records; the value of the parameter repeatRatioThreshold indicates the threshold for the proportion of repeated records in the total number of records; the value of the parameter diffValueRatioThreshold indicates the threshold for the proportion of distinct values in the total number of records; the value of the parameter maxDiscretizedNum indicates the default maximum number of elements supported for a set type, beyond which the data is not treated as a set type.
In the preset node parameter template Bayesian-1, the fixed parameters may include: model file (ModelFile), ignore missing values (IgnoreMissings), and use partitioned data (Usepartition). The value of the parameter Usepartition indicates whether partitioned data is used.
In the preset node parameter template Model-Apply-1, the fixed parameters may include: model file (ModelFile) and output data file (OutputDataFile). The value of the parameter ModelFile indicates the path of the model file output by the algorithm node, and the value of the parameter OutputDataFile indicates the path of the output data file.
In the preset node parameter template Model-Evaluation-1, the fixed parameters may include: analysis file (AnalysisFile), output data file (OutputDataFile), ROI, and F-score. The value of ROI indicates whether ROI is used as an evaluation element, and the value of F-score indicates whether F-score is used as an evaluation element.
204 specifically includes: creating, according to the node parameter template of each node in each data mining path, the data mining process corresponding to that data mining path.
For each data mining path obtained in 203, the node parameter templates of all nodes on the path can be found according to the node parameter template storage paths in the node relationship table; these node parameter templates together form the data mining process corresponding to that data mining path.
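As a rough sketch of how a path and its templates might be combined, the example below uses Python's string.Template as a stand-in for FreeMarker. The template bodies, the placeholder names (input_file_name, binning_column), the NODE_TEMPLATE mapping, and the function create_process are all assumptions made for illustration; the actual template contents are not given in the text.

```python
from string import Template

# Hypothetical stand-ins for stored .ftl templates, keyed by template name.
# Most values are fixed; only the ${...} placeholders are filled in at generation time.
TEMPLATES = {
    "FileImport-1": Template(
        "inputFileName=${input_file_name};encoding=UTF-8;"
        "first_row_as_names=true;columnDelimiter=,"
    ),
    "Binning-1": Template("binningtype=equal-width;binningColumns=${binning_column}"),
    "Bayesian-1": Template("IgnoreMissings=true;Usepartition=false"),
}

# Hypothetical node -> template name mapping, as a node relationship table might record it.
NODE_TEMPLATE = {
    "FileImport": "FileImport-1",
    "Binning": "Binning-1",
    "Bayesian": "Bayesian-1",
}


def create_process(path, reserved_values):
    """Render the template of every node on the path, in path order."""
    return [
        (node, TEMPLATES[NODE_TEMPLATE[node]].substitute(reserved_values))
        for node in path
    ]


process = create_process(
    ["FileImport", "Binning", "Bayesian"],
    {"input_file_name": "TrainData", "binning_column": "Age"},
)
for node, config in process:
    print(node, config)
```

The ordered list of rendered node configurations plays the role of one data mining process; a real system would load the templates from their storage paths rather than from a dictionary.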
205. Verify the multiple data mining processes respectively according to the training data, and determine the optimal data mining process among them according to the verification results and a preset evaluation rule.
Verifying the multiple data mining processes respectively according to the training data specifically means inputting the training data into each data mining process. Correspondingly, the output data of each data mining process is its verification result.
The evaluation rule may be preset by the user according to the goal of the data mining, and generally includes at least one evaluation element and a baseline condition corresponding to each evaluation element. For example, the evaluation elements may be return on investment (ROI), F-score (F-Score), and so on; the baseline condition corresponding to ROI may be greater than 0.9, and the baseline condition corresponding to F-Score may be greater than 0.95.
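The baseline check in this example can be written down in a few lines. The dictionary layout and the function name meets_baseline are assumptions for illustration; only the two thresholds come from the text.

```python
# Evaluation rule from the example: each evaluation element must exceed its baseline.
EVALUATION_RULE = {"ROI": 0.9, "F-Score": 0.95}


def meets_baseline(evaluation_values, rule):
    """True only if every evaluation element is strictly greater than its baseline."""
    return all(evaluation_values[element] > baseline for element, baseline in rule.items())


print(meets_baseline({"ROI": 0.92, "F-Score": 0.96}, EVALUATION_RULE))
print(meets_baseline({"ROI": 0.92, "F-Score": 0.94}, EVALUATION_RULE))
```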
In an optional embodiment of the present invention, to ensure that each data mining process created in 204 is valid, each data mining path may first be checked. Specifically, the metadata further includes an input data type, and before 204 the method may further include:
Determining, according to the input data type of each successfully matched algorithm node and the input data type in the metadata, the type conversion node on which each successfully matched algorithm node depends, where the target type of the depended-on type conversion node includes the input data type of the successfully matched algorithm node, and the source type of the type conversion node includes the input data type in the metadata;
Correspondingly, 204 may specifically include:
Determining, for each data mining path, whether the path includes the type conversion node on which the corresponding successfully matched algorithm node depends, and if so, creating the data mining process corresponding to that data mining path according to the preset node parameter templates.
Specifically, the type conversion node on which a successfully matched algorithm node depends converts data whose type is the input data type in the metadata into data whose type is the input data type of the matched algorithm node. Table 5 is an example definition of the data type mapping and node relationships.
Table 5
Target type               Source type              Data transformation node
Integer, Float, Double    Multi-valued discrete    Binning, Typing
String                    Multi-valued discrete    Typing
Integer, Float, Double    Continuous               Binning, Typing
String                    Continuous               Typing
As shown in Table 5, the data transformation nodes Binning and Typing can convert data of the multi-valued discrete type into data of type integer (Integer), 32-bit single-precision floating point (Float), or 64-bit double-precision floating point (Double).
For example, if the type conversion node on which the algorithm node Bayesian depends is Binning, but a data mining path corresponding to Bayesian determined in 203 does not include Binning, that data mining path is ignored; otherwise, the data mining process corresponding to that data mining path is created.
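The path check in this example amounts to a membership test. In the sketch below the function name keep_valid_paths and the second, shorter path are hypothetical; the set of required conversion nodes is passed in directly rather than derived from Table 5.

```python
def keep_valid_paths(paths, required_conversion_nodes):
    """Keep only the paths that contain every type conversion node the algorithm node depends on."""
    required = set(required_conversion_nodes)
    return [path for path in paths if required.issubset(path)]


candidate_paths = [
    ["FileImport", "Type", "Binning", "Feature-Select", "Bayesian", "Model-Apply", "Model-Evaluation"],
    # Hypothetical second path that skips Binning and is therefore ignored.
    ["FileImport", "Type", "Feature-Select", "Bayesian", "Model-Apply", "Model-Evaluation"],
]

valid_paths = keep_valid_paths(candidate_paths, ["Binning"])
print(len(valid_paths))
```

Only the paths that survive this check would go on to have data mining processes created for them in 204.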
In another optional embodiment of the present invention, because the volume of training data is usually very large, the training data may first be sampled to reduce the verification burden, and the data mining processes are then verified with the sampled training data. Specifically, the training data includes multiple records, and 205 may include:
Sampling the training data according to a preset sampling rule, verifying the multiple data mining processes respectively with the sampled records, and determining the optimal data mining process among them according to the verification results and the preset evaluation rule.
Specifically, sampling the training data according to the preset sampling rule may be random sampling at a certain ratio, or sampling according to the value range of each attribute in the training data, and so on. The sampling rule is preset according to expert experience.
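The two sampling rules mentioned above might look like the following. Both function names, the seed parameter, and the record layout (a dictionary per record) are assumptions chosen for the example.

```python
import random


def sample_by_ratio(records, ratio, seed=None):
    """Randomly sample a fixed proportion of the records, without replacement."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if not records:
        return []
    rng = random.Random(seed)  # a seed makes the sample reproducible
    count = max(1, round(len(records) * ratio))
    return rng.sample(records, count)


def sample_by_value_range(records, attribute, low, high):
    """Keep the records whose attribute value lies within [low, high]."""
    return [record for record in records if low <= record[attribute] <= high]


records = [{"UserID": i, "Age": 18 + i % 50} for i in range(100)]
sampled = sample_by_ratio(records, 0.1, seed=0)
adults_30_to_39 = sample_by_value_range(records, "Age", 30, 39)
print(len(sampled), len(adults_30_to_39))
```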
Further, to keep the verification comprehensive, cross-iteration verification may also be used. Specifically, the evaluation rule includes at least one evaluation element and a baseline condition corresponding to each evaluation element;
Sampling the training data according to the preset sampling rule, verifying the multiple data mining processes respectively with the sampled records, and determining the optimal data mining process among them according to the verification results and the preset evaluation rule specifically includes:
Dividing the training data into N parts of training sub-data according to a preset number of iterations N;
Performing a first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and inputting the records obtained by the first sampling into each data mining process to obtain the first-iteration verification result of each data mining process, where the first-iteration verification result includes a first-iteration evaluation value of each evaluation element;
Matching the first-iteration evaluation value of each evaluation element of each data mining process against the baseline condition corresponding to that evaluation element in the evaluation rule, and determining the data mining processes whose first-iteration evaluation values all satisfy the corresponding baseline conditions as first-iteration processes;
Performing a second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, inputting the records obtained by the second sampling into each first-iteration process, and continuing the iterative verification until the Nth-iteration processes are determined; the Nth-iteration process is determined as the optimal data mining process.
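The steps above can be sketched as a loop over the N parts. In this illustration a data mining process is modelled as a hypothetical callable that takes records and returns a dictionary of evaluation values, and a sampling rule as a callable that takes a part and returns the sampled records; neither interface is specified in the text.

```python
def split_into_parts(records, n):
    """Divide the records into n consecutive, nearly equal parts."""
    size, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def iterate_verification(processes, records, n, sampling_rules, rule):
    """Run n iterations; after each one keep only processes meeting every baseline."""
    survivors = dict(processes)
    for part, sample in zip(split_into_parts(records, n), sampling_rules):
        sampled = sample(part)
        next_round = {}
        for name, run in survivors.items():
            values = run(sampled)  # verification result of this iteration
            if all(values[element] > baseline for element, baseline in rule.items()):
                next_round[name] = run
        survivors = next_round
    return survivors


# Toy processes with fixed evaluation values, purely for demonstration.
toy_processes = {
    "flow-A": lambda recs: {"ROI": 0.95, "F-Score": 0.97},
    "flow-B": lambda recs: {"ROI": 0.85, "F-Score": 0.99},
}
take_half = lambda part: part[: max(1, len(part) // 2)]

remaining = iterate_verification(
    toy_processes, list(range(30)), 3, [take_half] * 3, {"ROI": 0.9, "F-Score": 0.95}
)
print(sorted(remaining))
```

This sketch omits the limit on how many processes enter the next iteration; ranking and truncation would be applied to next_round before the loop continues.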
Specifically, each sampling may also be performed by stratified sampling. The sampling rules corresponding to different iterations may differ, so that the records used in each iteration are different and cover as many different types of records in the training data as possible. Usually, each iteration yields multiple iteration processes; to further reduce the burden of subsequent iterative verification, the number of iteration processes that enter the next iteration may also be limited. Specifically, the evaluation rule further includes a weight order corresponding to each evaluation element;
Before inputting the records obtained by the second sampling into each first-iteration process, the method further includes:
If the number of first-iteration processes exceeds the preset maximum number N of selected processes corresponding to the first iteration, sorting the first-iteration processes according to the weight order corresponding to each evaluation element and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determining the top N first-iteration processes;
Inputting the records obtained by the second sampling into each first-iteration process specifically includes:
Inputting the records obtained by the second sampling into the N first-iteration processes.
For example, if the weight order of the evaluation element ROI is 1 and the weight order of the evaluation element F-Score is 2, that is, ROI is more important in the ranking, then the multiple first-iteration processes obtained in the first iteration are sorted by the first-iteration evaluation value of ROI; where the first-iteration evaluation values of ROI are equal, they are sorted by the first-iteration evaluation value of F-Score, thereby determining the top N first-iteration processes. Usually, the maximum number of selected processes corresponding to the Nth iteration is set to 1, so that a single optimal data mining process is finally determined.
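The ranking in this example is a lexicographic sort: ROI first, F-Score as the tie-breaker. A minimal sketch follows, in which the function name top_n and the three sets of evaluation values are invented for illustration.

```python
def top_n(evaluation_values, weight_order, n):
    """Rank processes by evaluation elements in weight order (1 = most important), best first."""
    elements = sorted(weight_order, key=weight_order.get)
    ranked = sorted(
        evaluation_values.items(),
        key=lambda item: tuple(item[1][element] for element in elements),
        reverse=True,  # higher evaluation values rank first
    )
    return [name for name, _ in ranked[:n]]


first_iteration_values = {
    "flow-1": {"ROI": 0.95, "F-Score": 0.96},
    "flow-2": {"ROI": 0.95, "F-Score": 0.98},  # ties with flow-1 on ROI, wins on F-Score
    "flow-3": {"ROI": 0.97, "F-Score": 0.955},
}
selected = top_n(first_iteration_values, {"ROI": 1, "F-Score": 2}, 2)
print(selected)
```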
Further, to ensure that the verification proceeds smoothly, the records to be input may first be filtered. Specifically, inputting the records obtained by the first sampling into each data mining process specifically includes:
If the algorithm node corresponding to the data mining process requires record completeness, filtering out, according to the metadata, the incomplete records among the records obtained by the first sampling;
If the metadata includes a value range requirement, filtering out the records obtained by the first sampling that do not satisfy the value range requirement;
Inputting the filtered records into the data mining process.
Specifically, the records may be filtered according to the above steps for each data mining process. It should be noted that the algorithm nodes in different data mining processes may differ, and different algorithm nodes may have different requirements for record completeness; therefore, the filtered records may differ from one data mining process to another.
For example, four attributes are defined in the metadata: UserID, Sex, Age, and Status. If a record is "Xiao Ming||32|0", it contains no value for the attribute Sex, so the record is incomplete. Further, if the algorithm node corresponding to the data mining process is Logistic, and the constraint information of Logistic indicates that Logistic requires record completeness, the record is filtered out and is not input into that data mining process. Conversely, if the algorithm node corresponding to the data mining process is Bayesian, and the constraint information of Bayesian indicates that Bayesian does not require record completeness, the record can be input into that data mining process. Usually, before the record is input into the data mining process, the default value of the attribute Sex may be added to the record to form a complete record. If the default value of the attribute Sex is empty, no default value needs to be added, and the record is input into the data mining process directly.
It should be noted that the above filtering steps are performed each time before the sampled records are input into a data mining process.
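The filtering steps can be sketched as follows, using the record from the example. The "|" delimiter is read off the example record; the second record, the default value "unknown", the numeric range for Age, and all function names are hypothetical. Skipping the range check for a missing value is also an assumption, since the text does not cover that case.

```python
ATTRIBUTES = ["UserID", "Sex", "Age", "Status"]  # attributes defined in the metadata


def parse_record(line, attributes=ATTRIBUTES, delimiter="|"):
    """Split a delimited record into an attribute -> value dictionary."""
    return dict(zip(attributes, line.split(delimiter)))


def within_ranges(record, value_ranges):
    """Check numeric range requirements; missing values are not range-checked (assumption)."""
    for attribute, (low, high) in value_ranges.items():
        value = record.get(attribute, "")
        if value != "" and not low <= float(value) <= high:
            return False
    return True


def filter_records(records, requires_completeness, value_ranges=None, defaults=None):
    """Apply the completeness and value-range filters for one data mining process."""
    kept = []
    for record in records:
        if any(value == "" for value in record.values()):
            if requires_completeness:
                continue  # e.g. Logistic: drop the incomplete record
            # e.g. Bayesian: keep it, filling in any non-empty default values
            record = {k: v if v != "" else (defaults or {}).get(k, "") for k, v in record.items()}
        if value_ranges and not within_ranges(record, value_ranges):
            continue
        kept.append(record)
    return kept


records = [parse_record("Xiao Ming||32|0"), parse_record("Xiao Hong|F|28|1")]
for_logistic = filter_records(records, requires_completeness=True)
for_bayesian = filter_records(records, requires_completeness=False, defaults={"Sex": "unknown"})
print(len(for_logistic), len(for_bayesian))
```

Because the completeness requirement is a property of the algorithm node, the same sampled records produce different filtered inputs for the Logistic and Bayesian processes.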
Further, in subsequent iterations, a comprehensive evaluation may be performed using both the evaluation values obtained in the preceding iterations and the evaluation values obtained in the current iteration, in order to select the data mining processes that enter the next iteration. Specifically, after inputting the records obtained by the second sampling into each first-iteration process, the method further includes:
Obtaining the second-iteration verification result of each first-iteration process, where the second-iteration verification result includes a second-iteration evaluation value of each evaluation element;
Matching the second-iteration evaluation value of each evaluation element of each first-iteration process against the baseline condition corresponding to that evaluation element, and determining the first-iteration processes whose second-iteration evaluation values all satisfy the corresponding baseline conditions as second-iteration processes;
If the number of second-iteration processes exceeds the preset maximum number M of selected processes corresponding to the second iteration, determining the comprehensive evaluation value of each evaluation element of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration evaluation value and second-iteration evaluation value of each evaluation element of each second-iteration process;
Sorting the second-iteration processes according to the weight order corresponding to each evaluation element and the comprehensive evaluation value of each evaluation element of each second-iteration process, and determining the top M second-iteration processes;
Performing a third sampling on the third part of training sub-data according to the sampling rule corresponding to the third iteration, and inputting the records obtained by the third sampling into the M second-iteration processes.
For example, suppose the evaluation weight of the first iteration is 1 and the evaluation weight of the second iteration is 4, and the first-iteration and second-iteration evaluation values of ROI for a certain data mining process are 0.9 and 0.95 respectively; the comprehensive evaluation value of ROI for that data mining process is then 4.7. After the second iteration, the data mining processes (second-iteration processes) are sorted by their comprehensive evaluation values of ROI, and the top M second-iteration processes are determined, so that these M second-iteration processes are verified in the third iteration.
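The text does not state the formula for the comprehensive evaluation value, but the figures in the example are consistent with a weighted sum of the per-iteration evaluation values (0.9 × 1 + 0.95 × 4 = 4.7). The sketch below assumes that reading; the function name is hypothetical.

```python
def comprehensive_value(iteration_values, iteration_weights):
    """Weighted sum of one evaluation element's values across iterations (assumed formula)."""
    if len(iteration_values) != len(iteration_weights):
        raise ValueError("one weight is needed per iteration")
    return sum(value * weight for value, weight in zip(iteration_values, iteration_weights))


roi_comprehensive = comprehensive_value([0.9, 0.95], [1, 4])
print(round(roi_comprehensive, 2))
```

Giving the second iteration a larger weight than the first means that more recent verification results dominate the ranking while earlier results still contribute.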
In the embodiments of the present invention, the output data type is matched against multiple algorithm nodes preset according to expert experience; all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched based on a node relationship table preset according to expert experience, so as to determine multiple data mining paths; the corresponding multiple data mining processes are created according to preset node parameter templates; the multiple data mining processes are then verified respectively according to the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. This solves the problems in existing methods for generating data mining processes that modeling is difficult and that determining the optimal data mining process is difficult and complex. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling. Moreover, because the processes and parameters that an ordinary user needs to configure are simplified and the optimal process is created and found automatically, the utilization of data mining is also improved to some extent.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments; the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
Fig. 3 is a schematic structural diagram of a data mining process generation device according to an embodiment of the present invention. As shown in Fig. 3, the device includes:
A receiving module 31, configured to receive the metadata and training data submitted by an input device, where the metadata includes an output data type;
A matching module 32, configured to match the output data types of multiple preset algorithm nodes respectively against the output data type in the metadata, and determine the successfully matched algorithm nodes;
A path determination module 33, configured to recursively search, according to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node, and determine the multiple data mining paths respectively corresponding to each successfully matched algorithm node;
A process creation module 34, configured to create multiple data mining processes corresponding to the multiple data mining paths according to preset node parameter templates;
A verification module 35, configured to verify the multiple data mining processes respectively according to the training data, and determine the optimal data mining process among them according to the verification results and a preset evaluation rule.
The input device may be a user interface provided by the data mining process generation device, or may be a device independent of the data mining process generation device. When the input device is independent of the data mining process generation device, the input device submits the metadata and training data through a custom interface between itself and the data mining process generation device.
In an optional embodiment of the present invention, the metadata further includes an input data type, and the device further includes:
A conversion matching module, configured to, before the process creation module 34 creates the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates, determine, according to the input data type of each successfully matched algorithm node and the input data type in the metadata, the type conversion node on which each successfully matched algorithm node depends, where the target type of the depended-on type conversion node includes the input data type of the successfully matched algorithm node, and the source type of the type conversion node includes the input data type in the metadata;
The process creation module 34 is specifically configured to:
Determine, for each data mining path, whether the path includes the type conversion node on which the corresponding successfully matched algorithm node depends, and if so, create the data mining process corresponding to that data mining path according to the preset node parameter templates.
In another optional embodiment of the present invention, the training data includes multiple records, and the verification module 35 is specifically configured to:
Sample the training data according to a preset sampling rule, verify the multiple data mining processes respectively with the sampled records, and determine the optimal data mining process among them according to the verification results and the preset evaluation rule.
Further, the evaluation rule includes at least one evaluation element and a baseline condition corresponding to each evaluation element;
The verification module 35 specifically includes:
A data division unit, configured to divide the training data into N parts of training sub-data according to a preset number of iterations N;
A verification unit, configured to perform a first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and input the records obtained by the first sampling into each data mining process to obtain the first-iteration verification result of each data mining process, where the first-iteration verification result includes a first-iteration evaluation value of each evaluation element;
An evaluation unit, configured to match the first-iteration evaluation value of each evaluation element of each data mining process against the baseline condition corresponding to that evaluation element in the evaluation rule, and determine the data mining processes whose first-iteration evaluation values all satisfy the corresponding baseline conditions as first-iteration processes;
An iteration unit, configured to perform a second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, input the records obtained by the second sampling into each first-iteration process, and continue the iterative verification until the Nth-iteration processes are determined; the Nth-iteration process is determined as the optimal data mining process.
Further, the verification unit is specifically configured to:
If the algorithm node corresponding to the data mining process requires record completeness, filter out, according to the metadata, the incomplete records among the records obtained by the first sampling;
If the metadata includes a value range requirement, filter out the records obtained by the first sampling that do not satisfy the value range requirement;
Input the filtered records into the data mining process.
Optionally, the evaluation rule further includes a weight order corresponding to each evaluation element;
The evaluation unit is further configured to:
If the number of first-iteration processes exceeds the preset maximum number N of selected processes corresponding to the first iteration, sort the first-iteration processes according to the weight order corresponding to each evaluation element and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determine the top N first-iteration processes;
The iteration unit is specifically configured to input the records obtained by the second sampling into the N first-iteration processes.
Optionally, the iteration unit is further configured to:
After the records obtained by the second sampling are input into each first-iteration process, obtain the second-iteration verification result of each first-iteration process, where the second-iteration verification result includes a second-iteration evaluation value of each evaluation element;
Match the second-iteration evaluation value of each evaluation element of each first-iteration process against the baseline condition corresponding to that evaluation element, and determine the first-iteration processes whose second-iteration evaluation values all satisfy the corresponding baseline conditions as second-iteration processes;
If the number of second-iteration processes exceeds the preset maximum number M of selected processes corresponding to the second iteration, determine the comprehensive evaluation value of each evaluation element of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration evaluation value and second-iteration evaluation value of each evaluation element of each second-iteration process;
Sort the second-iteration processes according to the weight order corresponding to each evaluation element and the comprehensive evaluation value of each evaluation element of each second-iteration process, and determine the top M second-iteration processes;
Perform a third sampling on the third part of training sub-data according to the sampling rule corresponding to the third iteration, and input the records obtained by the third sampling into the M second-iteration processes.
It should be noted that the data mining process generation device of this embodiment may be deployed together with a data mining system, or may be deployed independently.
In the embodiments of the present invention, the output data type is matched against multiple algorithm nodes preset according to expert experience; all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched based on a node relationship table preset according to expert experience, so as to determine multiple data mining paths; the corresponding multiple data mining processes are created according to preset node parameter templates; the multiple data mining processes are then verified respectively according to the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. This solves the problems in existing methods for generating data mining processes that modeling is difficult and that determining the optimal data mining process is difficult and complex. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling. Moreover, because the processes and parameters that an ordinary user needs to configure are simplified and the optimal process is created and found automatically, the utilization of data mining is also improved to some extent.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method for generating a data mining process, characterized by comprising:
a data mining process generating device receiving metadata and training data submitted by an input device, the metadata comprising an output data type;
matching the preset output data types of a plurality of algorithm nodes against the output data type in the metadata, to determine the successfully matched algorithm nodes;
recursively searching, according to a preset node relationship table, for all forward nodes and backward nodes of each successfully matched algorithm node, to determine the plurality of data mining paths corresponding to each successfully matched algorithm node; wherein the node relationship table stores, for each node, a node parameter template name, a node parameter template storage path, and a predecessor node parameter template name; and wherein determining the plurality of data mining paths corresponding to each successfully matched algorithm node specifically comprises: for each successfully matched algorithm node, determining at least one data mining path corresponding to that algorithm node according to that algorithm node and all of its forward nodes and backward nodes;
creating, according to preset node parameter templates, a plurality of data mining processes corresponding to the plurality of data mining paths; wherein creating the plurality of data mining processes specifically comprises: creating the data mining process corresponding to each data mining path according to the node parameter template of each node in that data mining path;
verifying the plurality of data mining processes respectively according to the training data, and determining the optimal data mining process among the plurality of data mining processes according to the verification results and a preset evaluation rule;
wherein the metadata further comprises an input data type;
before creating, according to the preset node parameter templates, the plurality of data mining processes corresponding to the plurality of data mining paths, the method further comprises:
determining, according to the input data type of each successfully matched algorithm node and the input data type in the metadata, the type conversion node on which each successfully matched algorithm node depends, wherein the target type of that type conversion node comprises the input data type of the successfully matched algorithm node, and the source type of that type conversion node comprises the input data type in the metadata;
and creating, according to the preset node parameter templates, the plurality of data mining processes corresponding to the plurality of data mining paths specifically comprises:
judging, for each data mining path, whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends, and if so, creating the data mining process corresponding to that data mining path according to the preset node parameter templates.
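The path determination and type conversion check of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the tables `PREDECESSORS`, `OUTPUT_TYPE`, `INPUT_TYPE` and `CONVERTERS`, all node names, and the function names are hypothetical. The sketch follows only predecessor links (the backward direction would be handled symmetrically), and it assumes that a path needs no conversion node when the input data type in the metadata already equals the input data type of the algorithm node.

```python
# Hypothetical node relationship table: node -> predecessor nodes.
PREDECESSORS = {
    "decision_tree": ["to_nominal", "read_data"],
    "to_nominal": ["read_data"],
    "kmeans": ["read_data"],
    "read_data": [],
}
# Hypothetical preset output and input data types of the algorithm nodes.
OUTPUT_TYPE = {"decision_tree": "class_label", "kmeans": "cluster_id"}
INPUT_TYPE = {"decision_tree": "nominal", "kmeans": "numeric"}
# Hypothetical type conversion nodes: name -> (source type, target type).
CONVERTERS = {"to_nominal": ("numeric", "nominal")}


def match_algorithm_nodes(metadata_output_type):
    """Return the algorithm nodes whose preset output type matches the metadata."""
    return [n for n, t in OUTPUT_TYPE.items() if t == metadata_output_type]


def paths_to(node):
    """Recursively enumerate every path that ends at `node`."""
    preds = PREDECESSORS.get(node, [])
    if not preds:
        return [[node]]
    return [path + [node] for pred in preds for path in paths_to(pred)]


def path_is_usable(path, node, metadata_input_type):
    """Check that the path contains the conversion node the algorithm depends on."""
    needed = INPUT_TYPE[node]
    if needed == metadata_input_type:
        return True  # assumption: no conversion is required in this case
    return any(CONVERTERS.get(n) == (metadata_input_type, needed) for n in path)


def candidate_paths(metadata):
    """Return the data mining paths for which a process would be created."""
    result = []
    for node in match_algorithm_nodes(metadata["output_type"]):
        for path in paths_to(node):
            if path_is_usable(path, node, metadata["input_type"]):
                result.append(path)
    return result
```

With metadata `{"input_type": "numeric", "output_type": "class_label"}`, the path that skips `to_nominal` is discarded, because the hypothetical `decision_tree` node expects nominal input.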
2. The method according to claim 1, characterized in that the training data comprises a plurality of records; and verifying the plurality of data mining processes respectively according to the training data, and determining the optimal data mining process among the plurality of data mining processes according to the verification results and the preset evaluation rule, specifically comprises:
sampling the training data according to a preset sampling rule, verifying the plurality of data mining processes respectively according to the records obtained by sampling, and determining the optimal data mining process among the plurality of data mining processes according to the verification results and the preset evaluation rule.
3. The method according to claim 2, characterized in that the evaluation rule comprises at least one evaluation element and a basic condition corresponding to each evaluation element;
sampling the training data according to the preset sampling rule, verifying the plurality of data mining processes respectively according to the records obtained by sampling, and determining the optimal data mining process among the plurality of data mining processes according to the verification results and the preset evaluation rule, specifically comprises:
dividing the training data into N parts of training sub-data according to a preset number of iterations N;
performing first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and inputting the records obtained by the first sampling into each data mining process, to obtain a first-iteration verification result of each data mining process, the first-iteration verification result comprising a first-iteration evaluation value of each evaluation element;
matching the first-iteration evaluation value of each evaluation element of each data mining process against the basic condition corresponding to that evaluation element, and determining each data mining process whose first-iteration evaluation values all satisfy the corresponding basic conditions to be a first-iteration process;
performing second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, inputting the records obtained by the second sampling into each first-iteration process, and iterating the verification until the N-th iteration process is determined, the determined N-th iteration process being taken as the optimal data mining process.
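The iterative verification of claim 3 can be sketched as follows. This is only an illustration under stated assumptions: each data mining process is modeled as a hypothetical callable that takes the sampled records and returns a dictionary of evaluation values, each basic condition is a predicate on one evaluation value, and the sampling rule of each iteration is reduced to a sampling fraction. None of these representations is prescribed by the claims.

```python
import random


def split_into_parts(records, n):
    """Divide the records into n contiguous parts of nearly equal size."""
    size, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def meets_basic_conditions(evaluation, basic_conditions):
    """True if every evaluation element satisfies its basic condition."""
    return all(check(evaluation[name]) for name, check in basic_conditions.items())


def iterative_selection(processes, records, sample_fractions, basic_conditions, seed=0):
    """Run one verification round per part and keep only the passing processes.

    processes: name -> callable(sampled_records) -> {element: evaluation value}
    sample_fractions: one sampling fraction per iteration (assumed sampling rule)
    """
    rng = random.Random(seed)
    survivors = dict(processes)
    parts = split_into_parts(records, len(sample_fractions))
    for part, fraction in zip(parts, sample_fractions):
        k = min(len(part), max(1, int(len(part) * fraction)))
        sample = rng.sample(part, k)
        survivors = {
            name: run
            for name, run in survivors.items()
            if meets_basic_conditions(run(sample), basic_conditions)
        }
    return survivors
```

A process that fails a basic condition in any iteration is dropped and is not evaluated in later iterations, so the more expensive later rounds run on fewer candidates.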
4. The method according to claim 3, characterized in that inputting the records obtained by the first sampling into each data mining process specifically comprises:
if the algorithm node corresponding to the data mining process requires record integrity, filtering out, according to the metadata, the incomplete records among the records obtained by the first sampling;
if the metadata comprises a value range requirement, filtering out the records, among the records obtained by the first sampling, that do not satisfy the value range requirement;
inputting the filtered records into the data mining process.
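The record filtering of claim 4 can be sketched as below. The metadata layout (a `fields` list and an optional `value_ranges` mapping of field name to an inclusive `(low, high)` pair) and the function name are assumptions made for illustration. A record is treated as incomplete when any listed field is missing or `None`, and a missing value is not checked against the value range.

```python
def filter_records(records, metadata, requires_complete):
    """Drop incomplete records (if required) and records outside the value ranges."""
    fields = metadata["fields"]
    ranges = metadata.get("value_ranges", {})
    kept = []
    for record in records:
        # Integrity check, applied only when the algorithm node requires it.
        if requires_complete and any(record.get(f) is None for f in fields):
            continue
        # Value range check on the fields that are present.
        out_of_range = any(
            record.get(f) is not None and not (low <= record[f] <= high)
            for f, (low, high) in ranges.items()
        )
        if out_of_range:
            continue
        kept.append(record)
    return kept
```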
5. The method according to claim 3, characterized in that the evaluation rule further comprises a weight order corresponding to each evaluation element;
before inputting the records obtained by the second sampling into each first-iteration process, the method further comprises:
if the number of first-iteration processes exceeds the preset maximum number N of selected processes corresponding to the first iteration, sorting the first-iteration processes according to the weight order corresponding to each evaluation element and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determining the N first-iteration processes ranked in the top N;
and inputting the records obtained by the second sampling into each first-iteration process specifically comprises:
inputting the records obtained by the second sampling into the N first-iteration processes.
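One possible reading of the ranking in claim 5 is a lexicographic sort: processes are compared on the evaluation element that comes first in the weight order, and ties are broken by the next element. The sketch below follows that reading and additionally assumes that a larger evaluation value is better for every element. Both points are assumptions, and the function name is hypothetical.

```python
def top_processes(evaluations, weight_order, limit):
    """Rank processes by their evaluation values, taken in weight order.

    evaluations: process name -> {evaluation element: evaluation value}
    weight_order: evaluation elements, most important first
    limit: maximum number of processes to keep
    """
    ranked = sorted(
        evaluations,
        key=lambda name: tuple(evaluations[name][e] for e in weight_order),
        reverse=True,
    )
    return ranked[:limit]
```

For example, with the weight order `["accuracy", "recall"]`, two processes that tie on accuracy are separated by recall, and a process with lower accuracy ranks below both regardless of its recall.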
6. The method according to claim 3, characterized in that, after inputting the records obtained by the second sampling into each first-iteration process, the method further comprises:
obtaining a second-iteration verification result of each first-iteration process, the second-iteration verification result comprising a second-iteration evaluation value of each evaluation element;
matching the second-iteration evaluation value of each evaluation element of each first-iteration process against the basic condition corresponding to that evaluation element, and determining each first-iteration process whose second-iteration evaluation values all satisfy the corresponding basic conditions to be a second-iteration process;
if the number of second-iteration processes exceeds the preset maximum number M of selected processes corresponding to the second iteration, determining a comprehensive evaluation value of each evaluation element of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration evaluation value and second-iteration evaluation value of each evaluation element of that second-iteration process;
sorting the second-iteration processes according to the weight order corresponding to each evaluation element and the comprehensive evaluation value of each evaluation element of each second-iteration process, and determining the M second-iteration processes ranked in the top M;
performing third sampling on the third part of training sub-data according to the sampling rule corresponding to the third iteration, and inputting the records obtained by the third sampling into the M second-iteration processes.
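Claim 6 does not state how the evaluation weights of the two iterations are combined with the evaluation values. A minimal sketch, assuming a plain weighted sum per evaluation element, is given below. The function name and the weighted-sum formula are assumptions, not the method fixed by the claims.

```python
def comprehensive_values(first, second, first_weight, second_weight):
    """Combine first- and second-iteration values of each evaluation element.

    first, second: {evaluation element: evaluation value} for one process
    first_weight, second_weight: preset evaluation weights of the iterations
    """
    return {
        element: first_weight * first[element] + second_weight * second[element]
        for element in first
    }
```

The resulting comprehensive values can then be ranked by the weight order of the evaluation elements to keep the top M second-iteration processes.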
7. A device for generating a data mining process, characterized by comprising:
a receiving module, configured to receive metadata and training data submitted by an input device, the metadata comprising an output data type;
a matching module, configured to match the preset output data types of a plurality of algorithm nodes against the output data type in the metadata, to determine the successfully matched algorithm nodes;
a path determination module, configured to recursively search, according to a preset node relationship table, for all forward nodes and backward nodes of each successfully matched algorithm node, to determine the plurality of data mining paths corresponding to each successfully matched algorithm node; wherein the node relationship table stores, for each node, a node parameter template name, a node parameter template storage path, and a predecessor node parameter template name; and wherein, in determining the plurality of data mining paths corresponding to each successfully matched algorithm node, the path determination module is specifically configured to: for each successfully matched algorithm node, determine at least one data mining path corresponding to that algorithm node according to that algorithm node and all of its forward nodes and backward nodes;
a process creation module, configured to create, according to preset node parameter templates, a plurality of data mining processes corresponding to the plurality of data mining paths; wherein, in creating the plurality of data mining processes, the process creation module is specifically configured to: create the data mining process corresponding to each data mining path according to the node parameter template of each node in that data mining path;
a verification module, configured to verify the plurality of data mining processes respectively according to the training data, and to determine the optimal data mining process among the plurality of data mining processes according to the verification results and a preset evaluation rule;
wherein the metadata further comprises an input data type, and the device further comprises:
a conversion matching module, configured to determine, before the process creation module creates the plurality of data mining processes corresponding to the plurality of data mining paths according to the preset node parameter templates, and according to the input data type of each successfully matched algorithm node and the input data type in the metadata, the type conversion node on which each successfully matched algorithm node depends, wherein the target type of that type conversion node comprises the input data type of the successfully matched algorithm node, and the source type of that type conversion node comprises the input data type in the metadata;
and the process creation module is specifically configured to:
judge, for each data mining path, whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends, and if so, create the data mining process corresponding to that data mining path according to the preset node parameter templates.
8. The device according to claim 7, characterized in that the training data comprises a plurality of records, and the verification module is specifically configured to:
sample the training data according to a preset sampling rule, verify the plurality of data mining processes respectively according to the records obtained by sampling, and determine the optimal data mining process among the plurality of data mining processes according to the verification results and the preset evaluation rule.
9. The device according to claim 8, characterized in that the evaluation rule comprises at least one evaluation element and a basic condition corresponding to each evaluation element;
the verification module specifically comprises:
a data division unit, configured to divide the training data into N parts of training sub-data according to a preset number of iterations N;
a verification unit, configured to perform first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and to input the records obtained by the first sampling into each data mining process, to obtain a first-iteration verification result of each data mining process, the first-iteration verification result comprising a first-iteration evaluation value of each evaluation element;
an evaluation unit, configured to match the first-iteration evaluation value of each evaluation element of each data mining process against the basic condition corresponding to that evaluation element in the evaluation rule, and to determine each data mining process whose first-iteration evaluation values all satisfy the corresponding basic conditions to be a first-iteration process;
an iteration unit, configured to perform second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, to input the records obtained by the second sampling into each first-iteration process, and to iterate the verification until the N-th iteration process is determined, the determined N-th iteration process being taken as the optimal data mining process.
10. The device according to claim 9, characterized in that the verification unit is specifically configured to:
if the algorithm node corresponding to the data mining process requires record integrity, filter out, according to the metadata, the incomplete records among the records obtained by the first sampling;
if the metadata comprises a value range requirement, filter out the records, among the records obtained by the first sampling, that do not satisfy the value range requirement;
input the filtered records into the data mining process.
11. The device according to claim 9, characterized in that the evaluation rule further comprises a weight order corresponding to each evaluation element;
the evaluation unit is further configured to:
if the number of first-iteration processes exceeds the preset maximum number N of selected processes corresponding to the first iteration, sort the first-iteration processes according to the weight order corresponding to each evaluation element and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determine the N first-iteration processes ranked in the top N;
and the iteration unit is specifically configured to input the records obtained by the second sampling into the N first-iteration processes.
12. The device according to claim 9, characterized in that the iteration unit is further configured to:
after the records obtained by the second sampling are input into each first-iteration process, obtain a second-iteration verification result of each first-iteration process, the second-iteration verification result comprising a second-iteration evaluation value of each evaluation element;
match the second-iteration evaluation value of each evaluation element of each first-iteration process against the basic condition corresponding to that evaluation element, and determine each first-iteration process whose second-iteration evaluation values all satisfy the corresponding basic conditions to be a second-iteration process;
if the number of second-iteration processes exceeds the preset maximum number M of selected processes corresponding to the second iteration, determine a comprehensive evaluation value of each evaluation element of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration evaluation value and second-iteration evaluation value of each evaluation element of that second-iteration process;
sort the second-iteration processes according to the weight order corresponding to each evaluation element and the comprehensive evaluation value of each evaluation element of each second-iteration process, and determine the M second-iteration processes ranked in the top M;
perform third sampling on the third part of training sub-data according to the sampling rule corresponding to the third iteration, and input the records obtained by the third sampling into the M second-iteration processes.
CN201210171554.9A 2012-05-29 2012-05-29 Method and device for data mining process generating Active CN102693317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210171554.9A CN102693317B (en) 2012-05-29 2012-05-29 Method and device for data mining process generating

Publications (2)

Publication Number Publication Date
CN102693317A CN102693317A (en) 2012-09-26
CN102693317B true CN102693317B (en) 2014-11-05

Family

ID=46858750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210171554.9A Active CN102693317B (en) 2012-05-29 2012-05-29 Method and device for data mining process generating

Country Status (1)

Country Link
CN (1) CN102693317B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995873B (en) * 2014-05-22 2017-03-15 长春工业大学 Data mining method and data mining system
CN105205052B (en) * 2014-05-30 2019-01-25 华为技术有限公司 Data mining method and device
CN107038167A (en) * 2016-02-03 2017-08-11 普华诚信信息技术有限公司 Big data mining and analysis system based on model evaluation and analysis method thereof
WO2017185285A1 (en) * 2016-04-28 2017-11-02 华为技术有限公司 Method and device for assigning graphics processing unit task
CN108932300B (en) * 2018-06-06 2022-05-27 成都深思科技有限公司 Filter analysis method and device for infinite iteration and storage medium
CN110008306A (en) * 2019-04-04 2019-07-12 北京易华录信息技术股份有限公司 Data relationship analysis method, device and data service system
CN110188159B (en) * 2019-05-27 2023-05-12 深圳前海微众银行股份有限公司 Credit data access method, device, equipment and computer readable storage medium
CN111143577B (en) * 2019-12-27 2023-06-16 北京百度网讯科技有限公司 Data labeling method, device and system
CN111523798B (en) * 2020-04-21 2023-09-01 武汉市奥拓智能科技有限公司 Automatic modeling method, device, system and electronic equipment thereof
CN112948469B (en) * 2021-04-16 2023-10-13 平安科技(深圳)有限公司 Data mining method, device, computer equipment and storage medium
CN113190582B (en) * 2021-05-06 2021-11-16 北京三维天地科技股份有限公司 Data real-time interactive mining flow modeling analysis system
CN114996331B (en) * 2022-06-10 2023-01-20 北京柏睿数据技术股份有限公司 Data mining control method and system
CN115686867B (en) * 2022-11-30 2024-10-18 北京市大数据中心 Data mining method, device, system, equipment and medium based on cloud computing
CN117406972B (en) * 2023-12-14 2024-02-13 安徽思高智能科技有限公司 RPA high-value flow instance discovery method and system based on fitness analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6820089B2 (en) * 2001-04-05 2004-11-16 International Business Machines Corporation Method and system for simplifying the use of data mining in domain-specific analytic applications by packaging predefined data mining models
US6823334B2 (en) * 2002-03-07 2004-11-23 International Business Machines Corporation Metadata system for managing data mining environments
CN101110089A (en) * 2007-09-04 2008-01-23 华为技术有限公司 Method and system for data mining and model building
CN102346747A (en) * 2010-08-04 2012-02-08 鸿富锦精密工业(深圳)有限公司 Method for searching parameters in data model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200225

Address after: 518129 Huawei Headquarters Office Building, Bantian, Longgang District, Shenzhen, Guangdong

Patentee after: HUAWEI TECHNOLOGIES Co.,Ltd.

Address before: Kokusai Hotel No. 11 Nanjing Avenue in the flora of 210012 cities in Jiangsu Province

Patentee before: Huawei Technologies Co.,Ltd.