CN102693317B

CN102693317B - Method and device for generating a data mining process

Info

Publication number: CN102693317B
Application number: CN201210171554.9A
Authority: CN
Inventors: 刘诗凯; 杨志
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2012-05-29
Filing date: 2012-05-29
Publication date: 2014-11-05
Anticipated expiration: 2032-05-29
Also published as: CN102693317A

Abstract

The invention provides a method and a device for generating a data mining process. The method includes the following steps: metadata and training data submitted by an input device are received, the metadata including an output data type; the preset output data types of multiple algorithm nodes are matched against the output data type in the metadata to determine the successfully matched algorithm nodes; all forward and backward nodes of each successfully matched algorithm node are searched recursively to determine the multiple data mining paths corresponding to each successfully matched algorithm node; multiple data mining processes are created according to preset node parameter templates; and the multiple data mining processes are each verified against the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling.

Description

Method and device for generating a data mining process

Technical field

Embodiments of the present invention relate to communication technologies, and in particular to a method and a device for generating a data mining process.

Background

At present, data mining modeling is mainly carried out by experts who are proficient in both the business domain and the algorithms, which prevents data mining from being widely adopted.

For example, the data mining tool platform Clementine provides a graphical user interface that lets analysts visualize each step of the data mining process. By interacting with the data stream, analysts and business personnel can cooperate and bring business knowledge into the data mining process. Fig. 1 is a schematic diagram of how a data mining process is generated in Clementine. As shown in Fig. 1, after business understanding and data understanding, analysts must rely on considerable professional knowledge and their understanding of the data to repeatedly carry out data preparation and modeling, evaluate the result of each round of data preparation and modeling, and finally determine and deploy the optimal data mining process.

It can be seen that, in existing methods for generating a data mining process, modeling is difficult, and determining the optimal data mining process is even more difficult and complex.

Summary of the invention

Embodiments of the present invention provide a method and a device for generating a data mining process, to solve the problem that, in existing methods for generating a data mining process, modeling is difficult and determining the optimal data mining process is even more difficult and complex.

A first aspect of the present invention provides a method for generating a data mining process, including:

A data mining process generating device receives metadata and training data submitted by an input device, where the metadata includes an output data type;

The output data types of multiple preset algorithm nodes are each matched against the output data type in the metadata to determine the successfully matched algorithm nodes;

According to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node are searched recursively to determine the multiple data mining paths corresponding to each successfully matched algorithm node;

Multiple data mining processes corresponding to the multiple data mining paths are created according to preset node parameter templates;

The multiple data mining processes are each verified against the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule.

Further, the metadata also includes an input data type;

Before the creating of the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates, the method further includes:

According to the input data type of each successfully matched algorithm node and the input data type in the metadata, determining the type conversion node on which each successfully matched algorithm node depends, where the target type of that type conversion node includes the input data type of the successfully matched algorithm node, and the source type of that type conversion node includes the input data type in the metadata;

The creating of the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates specifically includes:

Determining, for each data mining path, whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends and, if so, creating the data mining process corresponding to that data mining path according to the preset node parameter templates.

Further, the training data includes multiple records; the verifying of the multiple data mining processes against the training data and the determining of the optimal data mining process among them according to the verification results and the preset evaluation rule specifically includes:

Sampling the training data according to a preset sampling rule, verifying each of the multiple data mining processes with the sampled records, and determining the optimal data mining process among them according to the verification results and the preset evaluation rule.

Further, the evaluation rule includes at least one evaluation element and a basic condition corresponding to each evaluation element;

The sampling of the training data according to the preset sampling rule, the verifying of each of the multiple data mining processes with the sampled records, and the determining of the optimal data mining process among them according to the verification results and the preset evaluation rule specifically includes:

Dividing the training data into N parts of training sub-data according to a preset number of iterations N;

Performing a first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and inputting the records obtained by the first sampling into each data mining process to obtain a first-iteration verification result for each data mining process, where the first-iteration verification result includes a first-iteration evaluation value for each evaluation element;

Matching the first-iteration evaluation value of each evaluation element of each data mining process against the basic condition corresponding to that evaluation element, and determining each data mining process whose first-iteration evaluation values all satisfy the corresponding basic conditions to be a first-iteration process;

Performing a second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, inputting the records obtained by the second sampling into each first-iteration process, and iterating the verification until the N-th iteration processes are determined, the N-th iteration processes being determined to be the optimal data mining process.
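
For illustration only, the iterative verification above could be sketched in Python as follows. The sketch assumes that each data mining process can be modeled as a callable that returns evaluation values for a list of records, and that the sampling rule of each iteration is simply a fraction of the corresponding part. The names `split_into_parts`, `iterate_verification`, `sample_fractions`, and `passes` are hypothetical and do not appear in the patent.

```python
import random


def split_into_parts(records, n):
    """Divide records into n consecutive parts of nearly equal size."""
    size, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def iterate_verification(processes, records, n, sample_fractions, passes, seed=0):
    """Keep only the processes that meet the basic conditions in every iteration.

    processes: dict mapping a process name to a callable(sample) -> evaluation values.
    passes: callable(evaluation values) -> bool, the basic-condition check.
    """
    rng = random.Random(seed)
    survivors = dict(processes)
    for part, fraction in zip(split_into_parts(records, n), sample_fractions):
        if not part:
            continue
        # Assumed sampling rule: draw a fixed fraction of the current part.
        k = min(len(part), max(1, int(len(part) * fraction)))
        sample = rng.sample(part, k)
        survivors = {name: run for name, run in survivors.items()
                     if passes(run(sample))}
    return list(survivors)
```

Each iteration uses a different part of the training data, so a process that survives all N iterations has been checked against N independent samples.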

Optionally, the inputting of the records obtained by the first sampling into each data mining process specifically includes:

If the algorithm node corresponding to the data mining process requires record integrity, filtering out, according to the metadata, the incomplete records among the records obtained by the first sampling;

If the metadata includes a value range requirement, filtering out the records obtained by the first sampling that do not satisfy the value range requirement;

Inputting the filtered records into the data mining process.
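
The filtering above could look roughly like the following Python sketch. It assumes that records are dictionaries, that a missing value is `None` or an empty string, and that value ranges are written either as a non-negative numeric interval such as `0-150` or as a comma-separated list such as `0,1`. The helpers `parse_range` and `filter_records` are hypothetical names.

```python
def parse_range(spec):
    """Return a membership test for '0-150' style intervals or '0,1' style lists."""
    if "-" in spec and "," not in spec:
        low, high = (int(x) for x in spec.split("-"))
        return lambda v: low <= int(v) <= high
    allowed = {x.strip() for x in spec.split(",")}
    return lambda v: str(v) in allowed


def filter_records(records, ranges, require_complete):
    """Drop incomplete records (if required) and records outside the value ranges."""
    kept = []
    for rec in records:
        if require_complete and any(v in (None, "") for v in rec.values()):
            continue
        # Missing values are skipped here; only present values are range-checked.
        if all(parse_range(spec)(rec[name]) for name, spec in ranges.items()
               if rec.get(name) not in (None, "")):
            kept.append(rec)
    return kept
```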

Optionally, the evaluation rule also includes a weight order corresponding to each evaluation element;

Before the inputting of the records obtained by the second sampling into each first-iteration process, the method further includes:

If the number of first-iteration processes exceeds the preset maximum number of selected processes N corresponding to the first iteration, sorting the first-iteration processes according to the weight order of the evaluation elements and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determining the top N first-iteration processes;

The inputting of the records obtained by the second sampling into each first-iteration process specifically includes:

Inputting the records obtained by the second sampling into the N first-iteration processes.

Optionally, after the inputting of the records obtained by the second sampling into each first-iteration process, the method further includes:

Obtaining a second-iteration verification result for each first-iteration process, where the second-iteration verification result includes a second-iteration evaluation value for each evaluation element;

Matching the second-iteration evaluation value of each evaluation element of each first-iteration process against the basic condition corresponding to that evaluation element, and determining each first-iteration process whose second-iteration evaluation values all satisfy the corresponding basic conditions to be a second-iteration process;

If the number of second-iteration processes exceeds the preset maximum number of selected processes M corresponding to the second iteration, determining a comprehensive evaluation value for each evaluation element of each second-iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration and second-iteration evaluation values of each evaluation element of each second-iteration process;

Sorting the second-iteration processes according to the weight order of the evaluation elements and the comprehensive evaluation value of each evaluation element of each second-iteration process, and determining the top M second-iteration processes;

Performing a third sampling on the third part of training sub-data according to the sampling rule corresponding to the third iteration, and inputting the records obtained by the third sampling into the M second-iteration processes.
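
One possible reading of the ranking steps above is sketched below in Python. Two points are assumptions rather than statements of the patent: the comprehensive evaluation value is taken to be a weighted sum of the per-iteration values, and the weight order is taken to mean that processes are compared element by element in priority order. `comprehensive_values` and `top_processes` are hypothetical names.

```python
def comprehensive_values(per_iteration, iteration_weights):
    """Weighted sum per evaluation element across iterations (assumed formula)."""
    elements = per_iteration[0].keys()
    return {e: sum(w * values[e] for w, values in zip(iteration_weights, per_iteration))
            for e in elements}


def top_processes(scores, weight_order, limit):
    """Rank process names by evaluation elements in priority order, best first."""
    ranked = sorted(scores,
                    key=lambda name: tuple(scores[name][e] for e in weight_order),
                    reverse=True)
    return ranked[:limit]
```

With `weight_order = ["F-Score", "ROI"]`, two processes are compared on F-Score first, and ROI only breaks ties.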

Further, the determining of the multiple data mining paths corresponding to each successfully matched algorithm node specifically includes:

For each successfully matched algorithm node, determining at least one data mining path corresponding to that algorithm node according to the algorithm node itself and all of its forward nodes and backward nodes.

Further, the creating of the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates specifically includes:

Creating the data mining process corresponding to each data mining path according to the node parameter template of each node in that path.

Another aspect of the present invention provides a device for generating a data mining process, including:

A receiving module, configured to receive metadata and training data submitted by an input device, where the metadata includes an output data type;

A matching module, configured to match the output data types of multiple preset algorithm nodes against the output data type in the metadata to determine the successfully matched algorithm nodes;

A path determination module, configured to recursively search, according to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node to determine the multiple data mining paths corresponding to each successfully matched algorithm node;

A process creation module, configured to create multiple data mining processes corresponding to the multiple data mining paths according to preset node parameter templates;

A verification module, configured to verify each of the multiple data mining processes against the training data and to determine the optimal data mining process among them according to the verification results and a preset evaluation rule.

Further, the metadata also includes an input data type, and the device further includes:

A conversion matching module, configured to determine, before the process creation module creates the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates, the type conversion node on which each successfully matched algorithm node depends, according to the input data type of each successfully matched algorithm node and the input data type in the metadata, where the target type of that type conversion node includes the input data type of the successfully matched algorithm node, and the source type of that type conversion node includes the input data type in the metadata;

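The process creation module is specifically configured to determine, for each data mining path, whether the path contains the type conversion node on which the corresponding successfully matched algorithm node depends and, if so, to create the data mining process corresponding to that data mining path according to the preset node parameter templates.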

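Further, the training data includes multiple records, and the verification module is specifically configured to sample the training data according to a preset sampling rule, verify each of the multiple data mining processes with the sampled records, and determine the optimal data mining process among them according to the verification results and the preset evaluation rule.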

The verification module specifically includes:

A data division unit, configured to divide the training data into N parts of training sub-data according to a preset number of iterations N;

A verification unit, configured to perform a first sampling on the first part of training sub-data according to the sampling rule corresponding to the first iteration, and to input the records obtained by the first sampling into each data mining process to obtain a first-iteration verification result for each data mining process, where the first-iteration verification result includes a first-iteration evaluation value for each evaluation element;

An evaluation unit, configured to match the first-iteration evaluation value of each evaluation element of each data mining process against the basic condition corresponding to that evaluation element in the evaluation rule, and to determine each data mining process whose first-iteration evaluation values all satisfy the corresponding basic conditions to be a first-iteration process;

An iteration unit, configured to perform a second sampling on the second part of training sub-data according to the sampling rule corresponding to the second iteration, to input the records obtained by the second sampling into each first-iteration process, and to iterate the verification until the N-th iteration processes are determined, the N-th iteration processes being determined to be the optimal data mining process.

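Further, the verification unit is specifically configured to: if the algorithm node corresponding to the data mining process requires record integrity, filter out, according to the metadata, the incomplete records among the records obtained by the first sampling; if the metadata includes a value range requirement, filter out the records obtained by the first sampling that do not satisfy the value range requirement; and

Input the filtered records into the data mining process.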

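The evaluation unit is further configured to: if the number of first-iteration processes exceeds the preset maximum number of selected processes N corresponding to the first iteration, sort the first-iteration processes according to the weight order of the evaluation elements and the first-iteration evaluation value of each evaluation element of each first-iteration process, and determine the top N first-iteration processes.

The iteration unit is specifically configured to input the records obtained by the second sampling into the N first-iteration processes.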

Optionally, the iteration unit is further configured to:

After the records obtained by the second sampling are input into each first-iteration process, obtain a second-iteration verification result for each first-iteration process, where the second-iteration verification result includes a second-iteration evaluation value for each evaluation element;

At least one of the foregoing technical solutions has the following beneficial effects or advantages:

In the embodiments of the present invention, the output data type is matched against multiple algorithm nodes preset according to expert experience; all forward nodes and backward nodes of each successfully matched algorithm node are searched recursively in a node relationship table preset according to expert experience to determine multiple data mining paths; the corresponding data mining processes are created from the node parameter templates of each node, which are also preset according to expert experience; and the data mining processes are then each verified against the training data, with the optimal one determined according to the verification results and a preset evaluation rule. This solves the problem that, in existing methods for generating a data mining process, modeling is difficult and determining the optimal data mining process is even more difficult and complex. An ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which reduces the difficulty of modeling.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of how a data mining process is generated in Clementine;

Fig. 2 is a schematic flowchart of a method for generating a data mining process according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for generating a data mining process according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Fig. 2 is a schematic flowchart of a method for generating a data mining process according to an embodiment of the present invention. As shown in Fig. 2, the method includes:

201. A data mining process generating device receives metadata and training data submitted by an input device, where the metadata includes an output data type.

The data mining process generating device is the device provided by the embodiments of the present invention. The input device may be a user interface provided by the data mining process generating device, or a device independent of the data mining process generating device. When the input device is independent of the data mining process generating device, it submits the metadata and training data through a custom interface between the two devices.

Specifically, the training data has the same format as data exported from a general database, for example "A|B|C|D". The metadata is the definition of the training data, also called the data description. Table 1 is an example of metadata for analyzing whether users have left the network.

Table 1

Attribute name	Data type	Attribute direction	Data value range	Default value
UserID	String	ID	NULL	NULL
Sex	String	Input	Male, Female	NULL
Age	Integer	Input	0-150	NULL
Status	Integer	Output	0,1	NULL

As shown in Table 1, the metadata defines the four attributes of the training data and, for each attribute, the data type, attribute direction, data value range, and default value. The data type may be string (String), integer (Integer), and so on; the data value range and the default value may be empty (NULL); the attribute direction may be input, output, or ID. An attribute whose direction is ID is the primary key of a record and does not need to be analyzed during data mining. The four attributes are the user ID (UserID), sex (Sex), age (Age), and user status (Status), where a Status of 0 means the user has left the network and a Status of 1 means the user is on the network. Typically, the training data includes multiple records, and each record contains a value for each attribute. For example, the record "Xiao Ming|Male|32|0" indicates that the user whose UserID is "Xiao Ming", whose sex is "Male", and whose age is "32" has left the network.

Specifically, the data type of an attribute whose direction is "input" is an input data type, and the data type of an attribute whose direction is "output" is an output data type. The training data and metadata in step 201 are usually submitted as files.
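
As a minimal sketch, the metadata of Table 1 and one training record could be represented in Python as follows. The `Attribute` class and the helpers `parse_record` and `types_by_direction` are hypothetical names chosen for illustration; the patent does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    name: str
    data_type: str                     # "String" or "Integer"
    direction: str                     # "ID", "Input" or "Output"
    value_range: Optional[str] = None  # None stands for NULL
    default: Optional[str] = None


# Metadata mirroring Table 1.
METADATA = [
    Attribute("UserID", "String", "ID"),
    Attribute("Sex", "String", "Input", "Male,Female"),
    Attribute("Age", "Integer", "Input", "0-150"),
    Attribute("Status", "Integer", "Output", "0,1"),
]


def parse_record(line, metadata, delimiter="|"):
    """Split one training record and cast each field to its declared type."""
    fields = line.split(delimiter)
    if len(fields) != len(metadata):
        raise ValueError("record does not match metadata")
    return {attr.name: int(raw) if attr.data_type == "Integer" else raw
            for attr, raw in zip(metadata, fields)}


def types_by_direction(metadata, direction):
    """Data types of all attributes with the given attribute direction."""
    return [a.data_type for a in metadata if a.direction == direction]


record = parse_record("Xiao Ming|Male|32|0", METADATA)
```

Here `types_by_direction(METADATA, "Output")` yields the output data type that step 202 matches against the algorithm nodes.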

202. The output data types of multiple preset algorithm nodes are each matched against the output data type in the metadata to determine the successfully matched algorithm nodes.

Specifically, the multiple algorithm nodes are preset according to expert experience. Each algorithm node is the technical implementation of an algorithm and is used to mine the associations in the input training data and output a model based on those associations. Typically, each algorithm node has corresponding constraint information, including the node category, input data types, output data types, whether record integrity is required, and so on. Table 2 is an example of the constraint information of two algorithm nodes.

Table 2

The input and output data type IDs are enumerated values; the corresponding data types are shown in Table 3.

Table 3

ID	Data type
1	Continuous
2	Multi-valued discrete
3	Binary discrete

As shown in Table 2 and Table 3, the Bayesian algorithm node (Bayesian) belongs to the predictive category (Predictive), its input data type is multi-valued discrete or binary discrete, its output data type is multi-valued discrete or binary discrete, and it does not require record integrity. The logistic algorithm node (Logistic) belongs to the predictive category (Predictive), its input data type is continuous, multi-valued discrete, or binary discrete, its output data type is binary discrete, and it requires record integrity.

Note that one or more algorithm nodes may be successfully matched in step 202. For example, if the output data type in the metadata is binary discrete, the successfully matched algorithm nodes are Bayesian and Logistic; if the output data type in the metadata is multi-valued discrete, the successfully matched algorithm node is Bayesian.
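
The matching in step 202 can be illustrated with a short Python sketch. The constraint information below is taken from the description of the Bayesian and Logistic nodes above; the `AlgorithmNode` class and the function `match_algorithm_nodes` are hypothetical names.

```python
from dataclasses import dataclass

# Data type IDs from Table 3.
CONTINUOUS, MULTI_DISCRETE, BINARY_DISCRETE = 1, 2, 3


@dataclass(frozen=True)
class AlgorithmNode:
    name: str
    category: str
    input_types: frozenset
    output_types: frozenset
    requires_complete_records: bool


NODES = [
    AlgorithmNode("Bayesian", "Predictive",
                  frozenset({MULTI_DISCRETE, BINARY_DISCRETE}),
                  frozenset({MULTI_DISCRETE, BINARY_DISCRETE}), False),
    AlgorithmNode("Logistic", "Predictive",
                  frozenset({CONTINUOUS, MULTI_DISCRETE, BINARY_DISCRETE}),
                  frozenset({BINARY_DISCRETE}), True),
]


def match_algorithm_nodes(nodes, output_type):
    """Names of the nodes whose output data types include the requested type."""
    return [n.name for n in nodes if output_type in n.output_types]
```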

203. According to a preset node relationship table, all forward nodes and backward nodes of each successfully matched algorithm node are searched recursively to determine the multiple data mining paths corresponding to each successfully matched algorithm node.

Specifically, the node relationship table is preset according to expert experience. All forward nodes of an algorithm node include the predecessor node of the algorithm node, the predecessor node of that predecessor node, and so on; all backward nodes of an algorithm node include the successor node that has the algorithm node as its predecessor, the successor node that has that successor node as its predecessor, and so on. Table 4 is an example of a node relationship table.

Table 4

As shown in Table 4, the node relationship table stores, for each node, the node parameter template name, the node parameter template storage path, and the predecessor node parameter template name. Based on Table 4, it can be determined that the predecessor node of the algorithm node Bayesian is feature selection (Feature-Select), the predecessor of Feature-Select is binning (Binning), the predecessor of Binning is type (Type), the predecessor of Type is file import (FileImport), and the predecessor of FileImport is empty, so FileImport is the start node of the data mining path. In addition, the algorithm node Bayesian is the predecessor of model application (Model-Apply), and Model-Apply is the predecessor of model evaluation (Model-Evaluation); that is, Model-Apply is the successor of Bayesian, and Model-Evaluation is the successor of Model-Apply. Because Model-Evaluation is not the predecessor of any other node, Model-Evaluation is the end node of the data mining path. In summary, the node relationship table in Table 4 yields one data mining path: FileImport → Type → Binning → Feature-Select → Bayesian → Model-Apply → Model-Evaluation, which corresponds to the algorithm node Bayesian.

Note that a node may have one or more node parameter templates, may have one or more predecessor nodes, and may itself be the predecessor of one or more other nodes. Therefore, one or more data mining paths can be obtained from a single algorithm node.

Specifically, the determining of the multiple data mining paths corresponding to each successfully matched algorithm node includes: for each successfully matched algorithm node, determining at least one data mining path corresponding to that algorithm node according to the algorithm node itself and all of its forward nodes and backward nodes.
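
A minimal Python sketch of the recursive search in step 203 is given below. The `PREDECESSORS` dictionary is a simplified stand-in for the node relationship table, reconstructed from the relationships described in the text; it is assumed to be acyclic. The function names are hypothetical.

```python
# Simplified node relationship table: node -> list of predecessor nodes.
PREDECESSORS = {
    "FileImport": [],
    "Type": ["FileImport"],
    "Binning": ["Type"],
    "Feature-Select": ["Binning"],
    "Bayesian": ["Feature-Select"],
    "Model-Apply": ["Bayesian"],
    "Model-Evaluation": ["Model-Apply"],
}


def upstream_chains(node, predecessors):
    """All chains from a start node down to and including `node` (forward nodes)."""
    preds = predecessors[node]
    if not preds:
        return [[node]]
    return [chain + [node]
            for p in preds
            for chain in upstream_chains(p, predecessors)]


def downstream_chains(node, predecessors):
    """All chains from `node` to an end node (backward nodes)."""
    succs = [n for n, preds in predecessors.items() if node in preds]
    if not succs:
        return [[node]]
    return [[node] + chain
            for s in succs
            for chain in downstream_chains(s, predecessors)]


def mining_paths(algorithm_node, predecessors):
    """Every start-to-end path that passes through the algorithm node."""
    return [head + tail[1:]
            for head in upstream_chains(algorithm_node, predecessors)
            for tail in downstream_chains(algorithm_node, predecessors)]
```

If a node has several predecessors or successors, each combination produces its own path, which is why one algorithm node can yield more than one data mining path.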

204. Multiple data mining processes corresponding to the multiple data mining paths are created according to preset node parameter templates.

Specifically, the node parameter template of each node is preset according to expert experience. A node parameter template may be a FreeMarker template, stored in ftl format. Most parameters in a node parameter template are fixed based on expert experience; only the few reserved parameters that depend on the input data, such as the input data type, are filled in while the data mining process is generated.

For example, in the preset node parameter template FileImport-1, the fixed parameters may include the input file name (inputFileName), the encoding (encoding), first row as names (first_row_as_names), and the column delimiter (columnDelimiter). The value of inputFileName is the name of the file used when the metadata and training data are submitted as files, for example "TrainData". The value of encoding indicates the encoding format, which may be 8-bit Unicode Transformation Format (UTF-8), American Standard Code for Information Interchange (ASCII), and so on. The value of first_row_as_names indicates whether the first row of the input file can be used as the data description of the file: "true" means it can and "false" means it cannot; this parameter is mainly used to simplify configuration. The value of columnDelimiter indicates the separator between fields in the input file, which may be ",".
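
To show how the FileImport parameters interact, the following Python sketch reads delimited records with the standard `csv` module. The function `import_records` and its arguments are hypothetical; they only mirror the roles of columnDelimiter and first_row_as_names and are not the patent's implementation.

```python
import csv
import io


def import_records(stream, column_delimiter=",", first_row_as_names=True, names=None):
    """Read delimited rows into dictionaries keyed by column name."""
    rows = [row for row in csv.reader(stream, delimiter=column_delimiter) if row]
    if first_row_as_names:
        # The first row serves as the data description of the file.
        names, rows = rows[0], rows[1:]
    if names is None:
        raise ValueError("column names are required when the first row holds data")
    return [dict(zip(names, row)) for row in rows]


sample = io.StringIO("UserID|Sex|Age|Status\nXiao Ming|Male|32|0\n")
records = import_records(sample, column_delimiter="|")
```

For a real file, the encoding parameter would be applied when the file is opened, for example `open(path, encoding="utf-8", newline="")`.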

In the preset node parameter template Binning-1, the fixed parameters may include the binning type (binningtype), the binning detail (binningdetail), and the binning columns (binningColumns). The binning type may be, for example, equal-depth binning, equal-width binning, or standard-deviation binning. The value of binningdetail indicates the specific parameters of the chosen binning type, such as the number of bins. The value of binningColumns indicates the field to be binned; for example, the value "Age" means binning by the Age field.
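
For readers unfamiliar with the binning types named above, the following Python sketch contrasts equal-width and equal-depth binning. It is a generic illustration, not the Binning node's actual implementation, and both function names are hypothetical.

```python
def equal_width_bins(values, bins):
    """Assign each value a bin index using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid a zero width when all values are equal
    return [min(int((v - low) / width), bins - 1) for v in values]


def equal_depth_bins(values, bins):
    """Assign bin indexes so that each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank * bins // len(values)
    return result
```

Equal-width binning splits the value range evenly, whereas equal-depth binning splits the sorted records evenly, so the two can give different results on skewed data.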

In the preset node parameter template FeatureSelection-1, the fixed parameters may include the missing ratio threshold (missingRatioThreshold), the repeat ratio threshold (repeatRatioThreshold), the different-value ratio threshold (diffValueRatioThreshold), the confidence (confidence), and the maximum number of discrete values (maxDiscretizedNum). The value of missingRatioThreshold indicates the threshold for the ratio of incomplete records to the total number of records; the value of repeatRatioThreshold indicates the threshold for the ratio of repeated records to the total number of records; the value of diffValueRatioThreshold indicates the threshold for the ratio of distinct values to the total number of records; the value of maxDiscretizedNum indicates the default maximum number of elements supported for a set type, beyond which the attribute is no longer treated as a set type.

In the preset node parameter template Bayesian-1, the fixed parameters may include the model file (ModelFile), ignore missing values (IgnoreMissings), and use partitioned data (Usepartition). The value of Usepartition indicates whether partitioned data is used.

In the preset node parameter template Model-Apply-1, the fixed parameters may include the model file (ModelFile) and the output data file (OutputDataFile). The value of ModelFile indicates the path of the model file output by the algorithm node, and the value of OutputDataFile indicates the path of the output data file.

In the preset node parameter template Model-Evaluation-1, the fixed parameters may include the analysis file (AnalysisFile), the output data file (OutputDataFile), ROI, and F-score. The value of ROI indicates whether ROI is used as an evaluation element, and the value of F-score indicates whether F-score is used as an evaluation element.

Step 204 specifically includes: creating the data mining process corresponding to each data mining path according to the node parameter template of each node in that path.

For each data mining path obtained in step 203, the node parameter templates of all nodes on the path can be found through the node parameter template storage paths in the node relationship table; together, these node parameter templates form the data mining process corresponding to that data mining path.
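
The template-based creation in step 204 can be sketched as follows. Python's `string.Template` is used here purely as a stand-in for FreeMarker, and the template texts, parameter values, and the function `create_process` are hypothetical illustrations rather than the contents of the actual ftl files.

```python
from string import Template

# Hypothetical stand-ins for the ftl files: most parameters are fixed, and only
# the input-dependent ones are left as placeholders to be filled in.
TEMPLATES = {
    "FileImport": Template("FileImport(inputFileName=$input_file, encoding=UTF-8)"),
    "Bayesian": Template("Bayesian(IgnoreMissings=true)"),
    "Model-Apply": Template("ModelApply(OutputDataFile=$output_file)"),
}


def create_process(path, templates, reserved):
    """Render one node configuration per node on the path, in path order."""
    return [templates[node].substitute(reserved) for node in path]


process = create_process(
    ["FileImport", "Bayesian", "Model-Apply"],
    TEMPLATES,
    {"input_file": "TrainData", "output_file": "result.csv"},
)
```

Because the fixed parameters live in the templates, generating a process only requires supplying the few reserved values derived from the input data.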

205. Verify the multiple data mining processes separately according to the training data, and determine the optimal data mining process among the multiple data mining processes according to the verification results and a preset evaluation rule.

Verifying the multiple data mining processes separately according to the training data specifically means inputting the training data into each data mining process. Correspondingly, the output data of each data mining process is its verification result.

The evaluation rule may be preset by the user according to the goal of the data mining, and generally includes at least one evaluation element and a baseline condition corresponding to each evaluation element. For example, the evaluation elements may be return on investment (ROI), F value (F-Score), and so on; the baseline condition corresponding to ROI may be "greater than 0.9", and the baseline condition corresponding to F-Score may be "greater than 0.95".
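A minimal sketch of such an evaluation rule is given below, using the two baseline conditions from the example. The dictionary layout, the function name `meets_baseline`, and the sample verification results are assumptions made for illustration.

```python
import operator

# Evaluation rule from the example: each evaluation element has a baseline condition.
EVALUATION_RULE = {
    "ROI": (operator.gt, 0.9),
    "F-Score": (operator.gt, 0.95),
}


def meets_baseline(result, rule=EVALUATION_RULE):
    """Return True if every evaluation element satisfies its baseline condition."""
    return all(compare(result[element], threshold)
               for element, (compare, threshold) in rule.items())


# Hypothetical verification results of two data mining processes.
passing = meets_baseline({"ROI": 0.93, "F-Score": 0.97})
failing = meets_baseline({"ROI": 0.93, "F-Score": 0.90})
print(passing, failing)
```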

In an optional embodiment of the present invention, to ensure that every data mining process created in 204 is valid, each data mining path may first be checked. Specifically, the metadata further includes an input data type, and before 204 the method may further include:

Accordingly, 204 may specifically include:

Specifically, the type conversion node on which a successfully matched algorithm node depends may be a node that converts data whose type is the input data type in the metadata into data whose type is the input data type of the matched algorithm node. Table 5 is an example of the data type mapping and node relationship definition.

Table 5

Target type	Source type	Data conversion node name
Integer, Float, Double	Multi-valued discrete	Binning, Typing
String	Multi-valued discrete	Typing
Integer, Float, Double	Continuous	Binning, Typing
String	Continuous	Typing

As shown in Table 5, the data conversion nodes Binning and Typing can convert multi-valued discrete data into data of type integer (Integer), 32-bit single-precision floating point (Float), or 64-bit double-precision floating point (Double).

For example, if the type conversion node on which the algorithm node Bayesian depends is Binning, but the data mining path corresponding to Bayesian determined in 203 does not include Binning, this data mining path is ignored; otherwise, the data mining process corresponding to this data mining path is created.
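The path check in this example can be sketched as follows. The function name `path_is_valid` and the two candidate paths are hypothetical; only the rule itself (ignore a path that lacks a required type conversion node) comes from the text.

```python
def path_is_valid(mining_path, required_conversion_nodes):
    """Keep a path only if it contains every type conversion node the algorithm depends on."""
    return set(required_conversion_nodes).issubset(mining_path)


# Two hypothetical data mining paths found for the algorithm node Bayesian.
candidate_paths = [
    ["Binning", "FeatureSelection", "Bayesian", "Model-Apply"],
    ["FeatureSelection", "Bayesian", "Model-Apply"],
]

# Bayesian is assumed here to depend on the type conversion node Binning.
valid_paths = [p for p in candidate_paths if path_is_valid(p, ["Binning"])]
print(valid_paths)
```

Only the paths that survive this check would go on to process creation in 204.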

In another optional embodiment of the present invention, because the volume of training data is usually very large, the training data may first be sampled to lighten the verification burden, and the data mining processes are then verified with the sampled training data. Specifically, the training data includes multiple records, and 205 may include:

Specifically, sampling the training data according to a preset sampling rule may be random sampling at a certain ratio, or sampling according to the value range of each attribute in the training data, and so on. The sampling rule is preset according to expert experience.
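The two sampling rules mentioned above can be sketched as below. The function names, the record layout (one dictionary per record), and the `Age` values are assumptions for illustration; the source does not specify how a sampling rule is encoded.

```python
import random


def sample_by_ratio(records, ratio, seed=None):
    """Randomly sample a fixed fraction of the records (0 < ratio <= 1 assumed)."""
    if not records:
        return []
    rng = random.Random(seed)
    k = min(len(records), max(1, round(len(records) * ratio)))
    return rng.sample(records, k)


def sample_by_range(records, attribute, low, high):
    """Keep the records whose attribute value lies in the closed range [low, high]."""
    return [r for r in records if low <= r[attribute] <= high]


# Hypothetical training records with a single numeric attribute.
records = [{"Age": age} for age in range(20, 60)]
quarter = sample_by_ratio(records, 0.25, seed=0)
thirties = sample_by_range(records, "Age", 30, 39)
print(len(quarter), len(thirties))
```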

Further, to make the verification more comprehensive, cross-iteration verification may also be adopted. Specifically, the evaluation rule includes at least one evaluation element and a baseline condition corresponding to each evaluation element;

The first-iteration evaluation value of each evaluation element of each data mining process is matched against the baseline condition corresponding to that evaluation element in the evaluation rule, and a data mining process whose first-iteration evaluation values all satisfy the corresponding baseline conditions is determined to be a first iteration process;

Specifically, each sampling may also be performed by stratified sampling. The sampling rule may differ from one iteration to the next, so that the records used in each iteration differ and cover as many different types of records in the training data as possible. Each iteration usually yields multiple iteration processes; to further lighten the burden of subsequent iterations, the number of iteration processes that enter the next iteration may also be limited. Specifically, the evaluation rule further includes a weight order corresponding to each evaluation element;

For example, suppose the weight order of the evaluation element ROI is 1 and the weight order of the evaluation element F-Score is 2, that is, ROI ranks as more important. The multiple first iteration processes obtained in the first iteration are then sorted by the first-iteration evaluation value of ROI; where the first-iteration evaluation values of ROI are equal, they are sorted by the first-iteration evaluation value of F-Score. The N first iteration processes ranked in the top N are thereby determined. Usually, the maximum number of selected processes corresponding to the N-th iteration is set to 1, so that a single optimal data mining process is finally determined.
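The ranking by weight order amounts to a lexicographic sort, sketched below. The function name `top_n`, the process names, and the scores are hypothetical, and the sketch assumes that a larger evaluation value is better for every evaluation element.

```python
def top_n(processes, weight_order, n):
    """Rank processes by evaluation value, most important element first, and keep n."""
    # Elements sorted by weight order: a smaller number means a more important element.
    elements = sorted(weight_order, key=weight_order.get)
    ranked = sorted(processes,
                    key=lambda p: tuple(p["scores"][e] for e in elements),
                    reverse=True)
    return ranked[:n]


weight_order = {"ROI": 1, "F-Score": 2}

# Hypothetical first iteration processes with their first-iteration evaluation values.
first_iteration = [
    {"name": "A", "scores": {"ROI": 0.95, "F-Score": 0.96}},
    {"name": "B", "scores": {"ROI": 0.95, "F-Score": 0.98}},
    {"name": "C", "scores": {"ROI": 0.97, "F-Score": 0.96}},
]

best = top_n(first_iteration, weight_order, 2)
print([p["name"] for p in best])
```

Here C ranks first on ROI alone, while A and B tie on ROI and are separated by F-Score, so B takes the second place.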

Further, to ensure that verification runs smoothly, the records to be input may first be filtered. Specifically, inputting the records obtained by the first sampling into each data mining process specifically includes:

Inputting the filtered records into the data mining process.

Specifically, the records may be filtered according to the above steps for each data mining process. It should be noted that the algorithm nodes in different data mining processes may differ, and different algorithm nodes may place different requirements on record integrity; therefore, the filtered records may differ from one data mining process to another.

For example, suppose four attributes are defined in the metadata: UserID, Sex, Age, and Status. A record "Xiao Ming||32|0" contains no value for the attribute Sex, so the record is incomplete. If the algorithm node of a data mining process is Logistic, and the constraint information of Logistic shows that Logistic requires complete records, this record is filtered out and is not input to that data mining process. Conversely, if the algorithm node of a data mining process is Bayesian, and the constraint information of Bayesian shows that Bayesian does not require complete records, this record can be input to that data mining process. Usually, before the record is input, the default value of the attribute Sex may also be added to it to form a complete record. If the default value of the attribute Sex is empty, no default value needs to be added, and the record is input to the data mining process directly.

It should be noted that the above filtering step is performed every time before the records obtained by a sampling are input to a data mining process.
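The filtering of the example record can be sketched as follows. The `|` delimiter, the `REQUIRES_COMPLETE` table standing in for the constraint information, the function name `prepare_record`, and the default value `"unknown"` are all assumptions for illustration.

```python
ATTRIBUTES = ["UserID", "Sex", "Age", "Status"]

# Hypothetical constraint information: does the algorithm node require complete records?
REQUIRES_COMPLETE = {"Logistic": True, "Bayesian": False}


def prepare_record(line, algorithm, defaults=None):
    """Return the record as a dict, or None if it must be filtered out."""
    defaults = defaults or {}
    values = [v.strip() for v in line.split("|")]
    record = dict(zip(ATTRIBUTES, values))
    missing = [a for a in ATTRIBUTES if not record.get(a)]
    if missing and REQUIRES_COMPLETE[algorithm]:
        return None  # an incomplete record is not input to this process
    for attribute in missing:
        # Add the default value when one is defined; an empty default leaves the gap.
        record[attribute] = defaults.get(attribute, "")
    return record


line = "Xiao Ming||32|0"
for_logistic = prepare_record(line, "Logistic")
for_bayesian = prepare_record(line, "Bayesian", defaults={"Sex": "unknown"})
print(for_logistic, for_bayesian)
```

Because the requirement is looked up per algorithm node, the same sampled record can be dropped for one data mining process and kept for another, as the text notes.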

Further, in subsequent iterations, a comprehensive evaluation may be made from the evaluation values obtained in the previous iterations together with the evaluation values obtained in the current iteration, in order to select the data mining processes that proceed to the next iteration. Specifically, after the records obtained by the second sampling are input to each first iteration process, the method further includes:

For example, suppose the evaluation weight of the first iteration is 1 and the evaluation weight of the second iteration is 4, and the first-iteration and second-iteration evaluation values of ROI for a certain data mining process are 0.9 and 0.95 respectively. The comprehensive evaluation value of ROI for this data mining process is then 1 × 0.9 + 4 × 0.95 = 4.7. After the second iteration, the data mining processes (second iteration processes) are sorted by their comprehensive evaluation values of ROI, and the M second iteration processes ranked in the top M are determined, so that these M second iteration processes are verified in the third iteration.
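The arithmetic of this example is a weighted sum of the per-iteration evaluation values, as the following sketch shows. The function name `comprehensive_value` is hypothetical; the weights and values are those of the example.

```python
def comprehensive_value(values, weights):
    """Weighted sum of the per-iteration evaluation values of one evaluation element."""
    if len(values) != len(weights):
        raise ValueError("one evaluation weight is needed per iteration")
    return sum(w * v for w, v in zip(weights, values))


# ROI of one process: first iteration 0.9 (weight 1), second iteration 0.95 (weight 4).
roi = comprehensive_value([0.9, 0.95], [1, 4])
print(round(roi, 2))  # 4.7
```

Giving the later iteration a larger weight lets the most recent verification dominate the ranking while earlier results still contribute.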

In the embodiment of the present invention, the output data types of multiple algorithm nodes preset according to expert experience are matched; all forward nodes and backward nodes of each successfully matched algorithm node are recursively searched according to a node relationship table preset according to expert experience, so as to determine multiple data mining paths; the corresponding multiple data mining processes are created according to preset node parameter templates; the multiple data mining processes are then verified separately according to the training data, and the optimal data mining process among them is determined according to the verification results and a preset evaluation rule. This solves the problems of existing methods for generating data mining processes, in which modeling is difficult and determining the optimal data mining process is difficult and complex: an ordinary user only needs to specify the evaluation rule for the optimal data mining process to be selected automatically, which lowers the difficulty of modeling. Moreover, because the process and parameters that an ordinary user must configure are simplified, and the optimal process is created and found automatically, the utilization of data mining is also improved to some extent.

A person of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.

Fig. 3 is a schematic structural diagram of a data mining process generating device provided by an embodiment of the present invention. As shown in Fig. 3, the device includes:

A receiving module 31, configured to receive metadata and training data submitted by an input device, the metadata including an output data type;

A matching module 32, configured to match the preset output data types of multiple algorithm nodes against the output data type in the metadata, and determine the successfully matched algorithm nodes;

A path determination module 33, configured to recursively search, according to a preset node relationship table, for all forward nodes and backward nodes of each successfully matched algorithm node, and determine the multiple data mining paths corresponding to each successfully matched algorithm node;

A process creation module 34, configured to create, according to preset node parameter templates, the multiple data mining processes corresponding to the multiple data mining paths;

A verification module 35, configured to verify the multiple data mining processes separately according to the training data, and determine the optimal data mining process among the multiple data mining processes according to the verification results and a preset evaluation rule.

The input device may be a user interface provided by the data mining process generating device, or a device independent of the data mining process generating device. When the input device is independent of the data mining process generating device, it submits the metadata and training data through a custom interface between the two.

In an optional embodiment of the present invention, the metadata further includes an input data type, and the device further includes:

A conversion matching module, configured to determine, before the process creation module 34 creates the multiple data mining processes corresponding to the multiple data mining paths according to the preset node parameter templates, the type conversion node on which each successfully matched algorithm node depends, according to the input data type of each successfully matched algorithm node and the input data type in the metadata; the target type of the depended-on type conversion node includes the input data type of the successfully matched algorithm node, and the source type of the type conversion node includes the input data type in the metadata;

The process creation module 34 is specifically configured to:

In another optional embodiment of the present invention, the training data includes multiple records, and the verification module 35 is specifically configured to:

The verification module 35 specifically includes:

Further, the verification unit is specifically configured to:

Input the filtered records into the data mining process.

The evaluation unit is further configured to:

Optionally, the iteration unit is further configured to:

It should be noted that the data mining process generating device of this embodiment may be deployed together with a data mining system, or deployed independently.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data mining process generation method, characterized by comprising:

According to a preset node relationship table, recursively searching for all forward nodes and backward nodes of each successfully matched algorithm node, and determining the multiple data mining paths corresponding to each successfully matched algorithm node; the node relationship table stores, for each node, the node parameter template name, the node parameter template storage path, and the predecessor node parameter template name; the determining of the multiple data mining paths corresponding to each successfully matched algorithm node specifically comprises: for each successfully matched algorithm node, determining at least one data mining path corresponding to that algorithm node according to the algorithm node and all of its forward nodes and backward nodes;

Creating, according to preset node parameter templates, the multiple data mining processes corresponding to the multiple data mining paths; the creating specifically comprises: creating, according to the node parameter template of each node in each data mining path, the data mining process corresponding to that data mining path;

Verifying the multiple data mining processes separately according to the training data, and determining the optimal data mining process among the multiple data mining processes according to the verification results and a preset evaluation rule;

The metadata further comprises an input data type;

2. The method according to claim 1, characterized in that the training data comprises multiple records; the verifying of the multiple data mining processes separately according to the training data and the determining of the optimal data mining process among the multiple data mining processes according to the verification results and the preset evaluation rule specifically comprise:

3. The method according to claim 2, characterized in that the evaluation rule comprises at least one evaluation element and a baseline condition corresponding to each evaluation element;

4. The method according to claim 3, characterized in that the inputting of the records obtained by the first sampling into each data mining process specifically comprises:

Inputting the filtered records into the data mining process.

5. The method according to claim 3, characterized in that the evaluation rule further comprises a weight order corresponding to each evaluation element;

6. The method according to claim 3, characterized in that, after the records obtained by the second sampling are input to each first iteration process, the method further comprises:

If the number of second iteration processes exceeds the preset maximum number M of selected processes corresponding to the second iteration, determining the comprehensive evaluation value of each evaluation element of each second iteration process according to the preset evaluation weights of the first iteration and the second iteration and the first-iteration and second-iteration evaluation values of each evaluation element of each second iteration process;

7. A data mining process generating device, characterized by comprising:

A path determination module, configured to recursively search, according to a preset node relationship table, for all forward nodes and backward nodes of each successfully matched algorithm node, and determine the multiple data mining paths corresponding to each successfully matched algorithm node; the node relationship table stores, for each node, the node parameter template name, the node parameter template storage path, and the predecessor node parameter template name; in determining the multiple data mining paths corresponding to each successfully matched algorithm node, the path determination module is specifically configured to: for each successfully matched algorithm node, determine at least one data mining path corresponding to that algorithm node according to the algorithm node and all of its forward nodes and backward nodes;

A process creation module, configured to create, according to preset node parameter templates, the multiple data mining processes corresponding to the multiple data mining paths; in doing so, the process creation module is specifically configured to: create, according to the node parameter template of each node in each data mining path, the data mining process corresponding to that data mining path;

A verification module, configured to verify the multiple data mining processes separately according to the training data, and determine the optimal data mining process among the multiple data mining processes according to the verification results and a preset evaluation rule;

The metadata further comprises an input data type, and the device further comprises:

The process creation module is specifically configured to:

8. The device according to claim 7, characterized in that the training data comprises multiple records, and the verification module is specifically configured to:

9. The device according to claim 8, characterized in that the evaluation rule comprises at least one evaluation element and a baseline condition corresponding to each evaluation element;

The verification module specifically comprises:

10. The device according to claim 9, characterized in that the verification unit is specifically configured to:

Input the filtered records into the data mining process.

11. The device according to claim 9, characterized in that the evaluation rule further comprises a weight order corresponding to each evaluation element;

The evaluation unit is further configured to:

12. The device according to claim 9, characterized in that the iteration unit is further configured to: