Summary of the invention
To solve the problems in the prior art, embodiments of the invention provide an automated feature construction method and device for structured data. They overcome prior-art problems such as the need to train models on external data or domain-specific knowledge, a narrow range of application, and high complexity.
To solve one or more of the above technical problems, the invention adopts the following technical solution:
In one aspect, an automated feature construction method for structured data is provided, the method comprising the following steps:
S1: performing data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling;
S2: initializing a building tree constructed in advance, using the processed data;
S3: performing a generation operation on the initialized building tree, using the processed data, to obtain feature generation modes;
S4: performing feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
Further, if the initial data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain coded data columns.
Further, the preprocessing further includes:
after the one-hot encoding of the categorical data, computing statistics separately for the coded data columns and for the other initial data apart from the categorical data, to obtain statistical information;
according to the values in the statistical information, dividing the coded data columns at least into one-hot columns and/or numeric columns, and labeling the one-hot columns and/or numeric columns.
Further, step S2 specifically comprises:
inputting the processed data into the building tree constructed in advance;
defining the feature construction operations of the building tree;
configuring the external parameters of the building tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
Further, if the building tree uses a random strategy, step S2 further comprises:
configuring random weights for the feature construction operations.
Further, step S3 specifically comprises:
starting from the root node of the building tree and traversing its nodes, recursively performing the following operations on each node:
checking the splitting condition of the node according to the pruning limit parameter and, if the node cannot be split further, stopping subsequent operations on the node and taking it as a leaf node;
traversing the original columns, obtaining a filter threshold from the filter threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns including the processed data;
traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generation modes that satisfy the filter threshold;
if the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generation modes of the node, and recursively applying the above process to the child nodes.
Further, the data processing further includes data resampling, and the method comprises:
preprocessing the initial data to obtain preprocessed data;
resampling the preprocessed data to obtain several sets of sampled data;
initializing the building tree constructed in advance, using the sampled data;
performing the generation operation on the initialized building tree, using the sampled data, to obtain feature generation modes;
performing feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
In another aspect, an automated feature construction device for structured data is provided, the device comprising:
a data processing module, configured to perform data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling;
an initialization module, configured to initialize a building tree constructed in advance, using the processed data;
a generation operation module, configured to perform a generation operation on the initialized building tree, using the processed data, to obtain feature generation modes;
a feature extraction module, configured to perform feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
Further, if the initial data is categorical data, the data processing module includes:
a coding unit, configured to perform one-hot encoding on the categorical data to obtain coded data columns.
Further, the data processing module further includes:
a statistics unit, configured to compute statistics, after the one-hot encoding of the categorical data, separately for the coded data columns and for the other initial data apart from the categorical data, to obtain statistical information;
a division unit, configured to divide the coded data columns at least into one-hot columns and/or numeric columns according to the values in the statistical information, and to label the one-hot columns and/or numeric columns.
Further, the initialization module includes:
a data input unit, configured to input the preprocessed data into the building tree constructed in advance;
an operation definition unit, configured to define the feature construction operations of the building tree;
a parameter configuration unit, configured to configure the external parameters of the building tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
Further, if the building tree uses a random strategy, the initialization module further includes:
a weight configuration unit, configured to configure random weights for the feature construction operations.
Further, the generation operation module is specifically configured to:
starting from the root node of the building tree and traversing its nodes, recursively perform the following operations on each node:
check the splitting condition of the node according to the pruning limit parameter and, if the node cannot be split further, stop subsequent operations on the node and take it as a leaf node;
traverse the original columns, obtain a filter threshold from the filter threshold parameter, and apply the feature construction operations to the original columns to produce generated columns, the original columns including the preprocessed data together with the coded data columns;
traverse the generated columns, obtain the best split point of each generated column, and record the feature generation modes that satisfy the filter threshold;
if the node satisfies the condition for continued splitting, split the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, record the feature generation modes of the node, and recursively apply the above process to the child nodes.
Further, the data processing module further includes:
a sampling unit, configured to resample the preprocessed data to obtain several sets of sampled data;
the initialization module is further configured to initialize the building tree constructed in advance, using the sampled data;
the generation operation module is further configured to perform the generation operation on the initialized building tree, using the sampled data, to obtain feature generation modes.
The technical solution provided by the embodiments of the invention has the following beneficial effects:
1. The automated feature construction method and device for structured data provided by the embodiments of the invention do not require training on external data or domain-specific knowledge, and are therefore widely applicable;
2. The automated feature construction method and device for structured data provided by the embodiments of the invention have low complexity, so large-scale computation can be carried out to cope with the huge search space of automated feature construction;
3. The automated feature construction method and device for structured data provided by the embodiments of the invention use a resampling strategy: the preprocessed data is resampled to obtain several sets of sampled data, the building tree generation operation is performed on each set, and the corresponding feature generation modes are obtained, which improves the stability of the construction result and reduces the influence of noise.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the invention.
Fig. 1 is a flowchart of the automated feature construction method for structured data according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: performing data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling.
Specifically, in the embodiments of the invention the initial data is structured data. Structured data, also called row data, is data that is logically expressed and implemented through a two-dimensional table structure. It strictly follows data format and length specifications, and is usually stored and managed in a relational database. The columns of structured data are of two kinds: numeric (such as a person's age) and categorical (such as a person's gender).
Data processing operations such as preprocessing are performed on the initial data, where the preprocessing includes at least missing value handling. A missing value refers to the clustering, grouping, censoring, or truncation of data caused by a lack of information in the raw data; that is, the value of one or more attributes in the existing data set is incomplete. In the embodiments of the invention, missing value handling may delete the affected entries from the data, or fill them in a specified manner.
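The two missing-value options described above (deletion or filling) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `handle_missing`, the row-of-dicts layout, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
def handle_missing(rows, strategy="drop", fill_value=None):
    """Handle missing values (marked as None) in a list of row dicts.

    strategy="drop": delete every row that contains a missing value.
    strategy="fill": replace each missing value with fill_value.
    Both the name and the signature are hypothetical.
    """
    if strategy == "drop":
        return [row for row in rows if all(v is not None for v in row.values())]
    if strategy == "fill":
        return [{k: fill_value if v is None else v for k, v in row.items()}
                for row in rows]
    raise ValueError(f"unknown strategy: {strategy!r}")


rows = [{"age": 30, "gender": "male"}, {"age": None, "gender": "female"}]
dropped = handle_missing(rows, "drop")
filled = handle_missing(rows, "fill", fill_value=0)
```

A constant fill is only one possible "specified manner"; a column mean or mode could be substituted without changing the structure.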
S2: initializing a building tree constructed in advance, using the processed data.
Specifically, in the embodiments of the invention the automated feature construction method for structured data is based on a building tree. The building tree is a decision tree, except that its splitting process differs from that of an ordinary decision tree: in addition to the preprocessed data, its input includes other information relevant to feature generation, and its final output also includes the automated feature generation result.
S3: performing a generation operation on the initialized building tree, using the processed data, to obtain feature generation modes.
Specifically, after initialization is complete, the generation operation must be performed on the building tree. Once all nodes of the building tree have been split and processed, a complete building tree and all the recorded feature generation modes are obtained.
S4: performing feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
As a preferred implementation, in the embodiments of the invention, if the initial data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain coded data columns.
Specifically, categorical data is usually converted to numeric values before being fed into a model, for example by converting gender [male, female] to [0, 1]. However, models often treat values as continuous by default, so using [0, 1] directly would degrade the performance of the building tree. One-hot encoding uses an N-bit status register to encode N states: each state has its own register bit, and only one bit is active at any time. For example, the natural codes 0 and 1 correspond to the one-hot codes 10 and 01.
One-hot encoding the categorical data allows the building tree of the invention to handle discrete-valued features and, in addition, expands the features of the data to some extent. For example, gender is a single feature; after one-hot encoding it becomes two features, male and female.
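The gender example can be reproduced with a short sketch. The function name `one_hot_encode` and the `name=category` column naming are assumptions chosen for illustration; only the encoding idea itself comes from the text.

```python
def one_hot_encode(name, values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        f"{name}={category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


encoded = one_hot_encode("gender", ["male", "female", "male"])
```

Each row has exactly one 1 across the generated columns, which is the "only one bit is active" property of one-hot encoding.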
As a preferred implementation, in the embodiments of the invention, the preprocessing further includes:
after the one-hot encoding of the categorical data, computing statistics separately for the coded data columns and for the other initial data apart from the categorical data, to obtain statistical information.
Specifically, in the embodiments of the invention, after the categorical data has been one-hot encoded, statistics must also be computed for the data (including the coded data columns and the other initial data apart from the categorical data). The statistical information includes not only the values of the data columns but also the maximum, minimum, mean, variance, and so on; this information can be used during generation of the building tree.
According to the values in the statistical information, the coded data columns are divided at least into one-hot columns and/or numeric columns, and the one-hot columns and/or numeric columns are labeled.
Specifically, the coded data columns are distinguished according to their values (that is, the values in the statistical information): a column whose feature values are only 0 and 1 is recorded as a one-hot column, and the remaining columns are recorded as numeric columns; the one-hot columns and numeric columns are then labeled. Note that the data columns are divided and labeled so that different feature generation operations can be applied to them.
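The statistics and column labeling just described might look like the following sketch. The helper names `column_stats` and `label_columns`, the dictionary layout, and the choice of population variance are assumptions; the rule "only 0 and 1 means one-hot, otherwise numeric" is taken from the text.

```python
from statistics import mean, pvariance


def column_stats(column):
    """Collect the statistics the text lists: values, max, min, mean, variance."""
    return {
        "values": set(column),
        "max": max(column),
        "min": min(column),
        "mean": mean(column),
        "variance": pvariance(column),  # population variance (an assumption)
    }


def label_columns(columns):
    """Label a column 'one_hot' if it holds only 0/1, otherwise 'numeric'."""
    labels = {}
    for name, column in columns.items():
        stats = column_stats(column)
        labels[name] = "one_hot" if stats["values"] <= {0, 1} else "numeric"
    return labels


columns = {"gender=male": [1, 0, 1], "age": [30, 41, 25]}
labels = label_columns(columns)
age_stats = column_stats(columns["age"])
```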
As a preferred implementation, in the embodiments of the invention, step S2 specifically comprises:
inputting the processed data into the building tree constructed in advance.
Specifically, note that the processed data includes the coded data columns (that is, the one-hot columns and numeric columns). In the embodiments of the invention, besides the processed data, the labels of the one-hot columns and numeric columns and the various statistics of all coded data columns must also be input into the building tree.
Defining the feature construction operations of the building tree.
Specifically, the building tree includes several feature construction operations. These operations are designed for specific types of data features and are highly interpretable. For example, the construction operations on numeric columns may be defined to include sum, difference, product, discretization, and so on; the construction operations on one-hot columns may be defined to include logical AND, logical OR, logical NOT, addition, and so on; and the construction operations that use both numeric columns and one-hot columns may be defined to include gate operations and so on.
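One possible way to organize these operations is a small registry per column type, sketched below. The dictionary names and `apply_op` are hypothetical, and only binary operations are shown. In particular, the text does not define the gate operation; reading it as "pass the numeric value through when the one-hot flag is 1, otherwise output 0" is an assumption made for this example.

```python
import operator

# Operations on two numeric columns.
NUMERIC_OPS = {
    "sum": operator.add,
    "difference": operator.sub,
    "product": operator.mul,
}

# Operations on two one-hot (0/1) columns.
ONE_HOT_OPS = {
    "and": operator.and_,
    "or": operator.or_,
    "add": operator.add,
}


def gate(flag, value):
    """Assumed gate semantics: keep value where flag is 1, else 0."""
    return value if flag else 0


# Operations on a one-hot column combined with a numeric column.
MIXED_OPS = {"gate": gate}


def apply_op(op, left, right):
    """Apply a binary operation element-wise to two equal-length columns."""
    return [op(a, b) for a, b in zip(left, right)]


summed = apply_op(NUMERIC_OPS["sum"], [1, 2], [3, 4])
both = apply_op(ONE_HOT_OPS["and"], [1, 0], [1, 1])
gated = apply_op(MIXED_OPS["gate"], [1, 0], [5.0, 7.0])
```

Keeping the registries separate mirrors the reason the columns were labeled earlier: each column type gets its own set of admissible operations.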
Configuring the external parameters of the building tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
As a preferred implementation, in the embodiments of the invention, if the building tree uses a random strategy, step S2 further comprises:
configuring random weights for the feature construction operations.
Specifically, if the building tree uses a random strategy, the random weights of the various construction operations must also be initialized; the random weights may be set according to the statistical information of the data set. If a traversal strategy is used, no weights need to be initialized.
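Under the random strategy, choosing a construction operation by weight can be sketched with the standard library's `random.choices`. The helper names are hypothetical, and starting from uniform weights is an assumption: the text only says the weights may be set from data set statistics, without giving a rule.

```python
import random


def init_weights(op_names):
    """Start every construction operation with the same weight (an assumption)."""
    return {name: 1.0 for name in op_names}


def pick_operation(weights, rng):
    """Draw one operation name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
weights = init_weights(["sum", "difference", "product"])
weights["difference"] = 0.0  # a zero weight disables an operation
picks = [pick_operation(weights, rng) for _ in range(20)]
```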
Fig. 2 is a schematic diagram, according to an exemplary embodiment, of performing the generation operation on the initialized building tree using the processed data to obtain the feature generation modes. Referring to Fig. 2, as a preferred implementation, in the embodiments of the invention, step S3 specifically comprises:
starting from the root node of the building tree and traversing its nodes, recursively performing the following operations on each node:
checking the splitting condition of the node according to the pruning limit parameter and, if the node cannot be split further, stopping subsequent operations on the node and taking it as a leaf node.
Specifically, the splitting condition of each node in the building tree is checked according to the pruning limit parameter. The pruning limit parameter may be the minimum proportion of samples a node must hold, the minimum number of samples in a child node produced by a split, or some other condition. If the node cannot be split further, subsequent operations on it stop and the node is taken as a leaf node.
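The pruning check can be illustrated with the two limits named above. The function name `can_split`, the default values, and the rule that a node needs at least twice the minimum child size are assumptions for the sketch, not values from the source.

```python
def can_split(node_size, total_size, min_fraction=0.05, min_child_size=10):
    """Return True if the node may still be split under the pruning limits.

    min_fraction: minimum share of all samples that the node must hold.
    min_child_size: minimum samples per child, so a node needs at least
    twice that many samples to produce two valid children.
    """
    if node_size / total_size < min_fraction:
        return False
    return node_size >= 2 * min_child_size


root_ok = can_split(500, 1000)
too_small_share = can_split(30, 1000)
too_few_for_children = can_split(15, 100)
```

A node for which `can_split` returns False would be treated as a leaf and skipped by the later steps.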
Traversing the original columns, obtaining a filter threshold from the filter threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns including the processed data.
Specifically, the original columns are traversed to find the best split feature and position that satisfy the splitting condition, that is, the split that maximizes the information gain of the child nodes; this information gain is denoted I. The filter threshold is computed from the filter threshold parameter and the information gain I. For example, the filter threshold may be 1.2*I, and other parameters, such as the number of samples at the current node, may also be taken into account.
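The information gain I and the example threshold 1.2*I can be computed as in the sketch below. It assumes a supervised setting with class labels and an entropy-based gain over binary splits of the form `value <= point`; the text names information gain but does not fix these details, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(column, labels):
    """Return (information gain, split point) of the best `value <= point` split."""
    base, n = entropy(labels), len(labels)
    best_gain, best_point = 0.0, None
    for point in sorted(set(column))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(column, labels) if x <= point]
        right = [y for x, y in zip(column, labels) if x > point]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_gain, best_point


def filter_threshold(original_columns, labels, factor=1.2):
    """Filter threshold = factor * I, with I the best gain over the original columns."""
    best_gain = max(best_split(col, labels)[0] for col in original_columns.values())
    return factor * best_gain


columns = {"age": [20, 30, 40, 50], "income": [1, 9, 2, 8]}
labels = [0, 0, 1, 1]
gain, point = best_split(columns["age"], labels)
threshold = filter_threshold(columns, labels)
```

Here `factor` plays the role of the filter threshold parameter; a generated column is only worth recording if it beats the original columns by that margin.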
Feature generation is performed a total number of times equal to the node exploration count. Each time, a construction operation is selected, the feature columns it requires are selected, and the operation is applied to the selected feature columns to obtain a new column (a generated column). The selection may be made at random according to the weights of the construction operations, or by traversal; with traversal, this step can be performed once, at the root node. Obtaining the filter threshold and producing the generated columns need not happen in the order given above, as long as both are available before the generated columns are traversed. The feature generation calculation may use the various statistics of the coded data columns. That is, some generation operations may use the statistics of the column itself; for example, binarizing or discretizing a numeric column uses the mean, maximum, minimum, and so on.
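As an illustration of generation operations that rely on a column's own statistics, the sketch below binarizes a numeric column around its mean and discretizes it into equal-width bins between its minimum and maximum. Both rules and both function names are assumptions; the text only says that such operations use the mean, maximum, and minimum.

```python
from statistics import mean


def binarize(column):
    """Return 1 where the value is above the column mean, else 0."""
    m = mean(column)
    return [1 if v > m else 0 for v in column]


def discretize(column, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1  # avoid dividing by zero on constant columns
    return [min(int((v - lo) / width), bins - 1) for v in column]


flags = binarize([1, 2, 3, 10])
buckets = discretize([0, 5, 10], bins=2)
```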
Traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generation modes that satisfy the filter threshold.
Specifically, each generated column is traversed to find the best split feature and position that satisfy the splitting condition (that is, the best split point of the generated column), maximizing the information gain of the child nodes; this gives the information gain corresponding to each generated column. If that information gain is greater than the filter threshold, the construction operation and feature columns of the generated column are recorded (that is, the feature generation mode satisfying the filter threshold is recorded). Alternatively, only the result that has the highest information gain, satisfies the splitting condition, and satisfies the filter threshold may be recorded.
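The recording rule can be sketched independently of how the gains are computed. In this hypothetical helper, each candidate is a tuple of operation name, input column names, and the information gain of the resulting generated column; the tuple layout and the `keep_best_only` flag are assumptions that model the two recording variants described above.

```python
def record_generation_modes(candidates, threshold, keep_best_only=False):
    """Keep the candidates whose information gain exceeds the filter threshold.

    candidates: list of (op_name, input_columns, gain) tuples.
    keep_best_only: if True, keep only the single highest-gain survivor.
    """
    kept = [c for c in candidates if c[2] > threshold]
    if keep_best_only and kept:
        kept = [max(kept, key=lambda c: c[2])]
    return [{"op": op, "inputs": inputs, "gain": gain} for op, inputs, gain in kept]


candidates = [
    ("sum", ("age", "income"), 0.10),
    ("product", ("age", "income"), 0.45),
    ("difference", ("age", "income"), 0.30),
]
all_modes = record_generation_modes(candidates, threshold=0.25)
best_mode = record_generation_modes(candidates, threshold=0.25, keep_best_only=True)
```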
If the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generation modes of the node, and recursively applying the above process to the child nodes.
Specifically, if the node satisfies the condition for continued splitting, its samples (here, the data columns) are split according to the best split result (the best split point), that is, the split with the best information gain that satisfies the splitting condition, producing the original columns of the child nodes. If no information gain satisfying the splitting condition can be found, no split is made. The split may be made on an original column or on a generated column. The feature generation modes of the node are recorded, and the process is applied recursively to the child nodes until no node satisfies the splitting condition. Once all nodes have been split and processed, a complete building tree and all the recorded feature generation modes are obtained. Finally, applying the recorded feature generation modes to the preprocessed data yields the automated feature generation result.
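Putting the steps of S3 together, the recursion might look like the compact sketch below. It is an illustration under several assumptions rather than the patented procedure: entropy-based gain over class labels, a traversal strategy over all column pairs, only two construction operations (`OPS`), a single `min_samples` pruning limit, and a filter threshold of `factor` times the best gain over the original columns. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

OPS = {"sum": lambda a, b: a + b, "product": lambda a, b: a * b}


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(column, labels):
    """Return (gain, point) of the best `value <= point` split of a column."""
    base, n = entropy(labels), len(labels)
    best_gain, best_point = 0.0, None
    for point in sorted(set(column))[:-1]:
        left = [y for x, y in zip(column, labels) if x <= point]
        right = [y for x, y in zip(column, labels) if x > point]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_gain, best_point


def grow(columns, labels, recorded, min_samples=4, factor=1.0):
    """Recursively split a node, appending feature generation modes to recorded."""
    # Pruning check: small or pure nodes become leaves.
    if len(labels) < min_samples or len(set(labels)) < 2:
        return
    # Traverse the original columns to obtain the reference gain I.
    best_gain, best_col, best_point = 0.0, None, None
    for col in columns.values():
        gain, point = best_split(col, labels)
        if gain > best_gain:
            best_gain, best_col, best_point = gain, col, point
    threshold = factor * best_gain
    # Produce generated columns and record the modes that beat the threshold.
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        for op_name, op in OPS.items():
            generated = [op(x, y) for x, y in zip(a, b)]
            gain, point = best_split(generated, labels)
            if gain > threshold:
                recorded.append((op_name, name_a, name_b))
            if gain > best_gain:
                best_gain, best_col, best_point = gain, generated, point
    if best_col is None:  # no split with positive gain was found
        return
    # Split on the best column; children inherit only the original columns.
    for keep_left in (True, False):
        idx = [i for i, v in enumerate(best_col) if (v <= best_point) == keep_left]
        child_columns = {n: [c[i] for i in idx] for n, c in columns.items()}
        grow(child_columns, [labels[i] for i in idx], recorded, min_samples, factor)


# XOR-like data: neither column helps alone, but their product separates the classes.
columns = {"x1": [1, 1, -1, -1, 1, 1, -1, -1], "x2": [1, -1, 1, -1, 1, -1, 1, -1]}
labels = [1, 0, 0, 1, 1, 0, 0, 1]
recorded = []
grow(columns, labels, recorded)
```

On this toy data the original columns have zero gain, so the product of `x1` and `x2` is recorded and used for the split, after which both children are pure and become leaves.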
As a preferred implementation, in the embodiments of the invention, the data processing further includes data resampling, and the method comprises:
preprocessing the initial data to obtain preprocessed data;
resampling the preprocessed data to obtain several sets of sampled data;
initializing the building tree constructed in advance, using the sampled data;
performing the generation operation on the initialized building tree, using the sampled data, to obtain feature generation modes;
performing feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
Specifically, to improve the stability of the construction result and reduce the influence of noise, an ensemble method may also be added in the embodiments of the invention.
The ensemble method is as follows: a resampling strategy, such as bagging or bootstrap, is used to resample the preprocessed data and obtain several sets of sampled data. Using the sampled data, the building tree constructed in advance is initialized and the building tree generation operation is performed on each set of sampled data, giving the corresponding feature generation modes.
In addition, all the feature generation modes obtained may be tallied, and a voting result is finally obtained according to a screening condition. The screening condition may be, but is not limited to, threshold filtering: the number of occurrences of each mode is counted, a threshold is set, and the feature generation modes whose number of occurrences exceeds the threshold are selected.
Finally, all the feature generation modes that were ultimately selected are applied to the preprocessed data to obtain the automated feature generation result.
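Applying the selected feature generation modes to the preprocessed data can be sketched as below. The mode format (operation name plus two input column names), the `OPS` table, and the naming of the new columns are assumptions for illustration.

```python
OPS = {"sum": lambda a, b: a + b, "product": lambda a, b: a * b}


def apply_modes(columns, modes):
    """Return the original columns plus one new column per recorded mode."""
    result = dict(columns)
    for op_name, name_a, name_b in modes:
        op = OPS[op_name]
        result[f"{op_name}({name_a},{name_b})"] = [
            op(x, y) for x, y in zip(columns[name_a], columns[name_b])
        ]
    return result


preprocessed = {"x1": [1, 2, 3], "x2": [4, 5, 6]}
features = apply_modes(preprocessed, [("product", "x1", "x2")])
```

Because a mode only stores the operation and its input columns, the same modes can be replayed on any data set with the same columns.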
Fig. 3 is a schematic structural diagram of the automated feature construction device for structured data according to an exemplary embodiment. Referring to Fig. 3, the device comprises:
a data processing module, configured to perform data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling;
an initialization module, configured to initialize a building tree constructed in advance, using the processed data;
a generation operation module, configured to perform a generation operation on the initialized building tree, using the processed data, to obtain feature generation modes;
a feature extraction module, configured to perform feature extraction on the preprocessed data using the feature generation modes, to obtain a feature generation result.
As a preferred implementation, in the embodiments of the invention, if the initial data is categorical data, the data processing module includes:
a coding unit, configured to perform one-hot encoding on the categorical data to obtain coded data columns.
As a preferred implementation, in the embodiments of the invention, the data processing module further includes:
a statistics unit, configured to compute statistics, after the one-hot encoding of the categorical data, separately for the coded data columns and for the other initial data apart from the categorical data, to obtain statistical information;
a division unit, configured to divide the coded data columns at least into one-hot columns and/or numeric columns according to the values in the statistical information, and to label the one-hot columns and/or numeric columns.
As a preferred implementation, in the embodiments of the invention, the initialization module includes:
a data input unit, configured to input the preprocessed data into the building tree constructed in advance;
an operation definition unit, configured to define the feature construction operations of the building tree;
a parameter configuration unit, configured to configure the external parameters of the building tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
As a preferred implementation, in the embodiments of the invention, if the building tree uses a random strategy, the initialization module further includes:
a weight configuration unit, configured to configure random weights for the feature construction operations.
As a preferred implementation, in the embodiments of the invention, the generation operation module is specifically configured to:
starting from the root node of the building tree and traversing its nodes, recursively perform the following operations on each node:
check the splitting condition of the node according to the pruning limit parameter and, if the node cannot be split further, stop subsequent operations on the node and take it as a leaf node;
traverse the original columns, obtain a filter threshold from the filter threshold parameter, and apply the feature construction operations to the original columns to produce generated columns, the original columns including the preprocessed data together with the coded data columns;
traverse the generated columns, obtain the best split point of each generated column, and record the feature generation modes that satisfy the filter threshold;
if the node satisfies the condition for continued splitting, split the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, record the feature generation modes of the node, and recursively apply the above process to the child nodes.
As a preferred implementation, in the embodiments of the invention, the data processing module further includes:
a sampling unit, configured to resample the preprocessed data to obtain several sets of sampled data;
the initialization module is further configured to initialize the building tree constructed in advance, using the sampled data;
the generation operation module is further configured to perform the generation operation on the initialized building tree, using the sampled data, to obtain feature generation modes.
In summary, the technical solution provided by the embodiments of the invention has the following beneficial effects:
1. The automated feature construction method and device for structured data provided by the embodiments of the invention do not require training on external data or domain-specific knowledge, and are therefore widely applicable;
2. The automated feature construction method and device for structured data provided by the embodiments of the invention have low complexity, so large-scale computation can be carried out to cope with the huge search space of automated feature construction;
3. The automated feature construction method and device for structured data provided by the embodiments of the invention use a resampling strategy: the preprocessed data is resampled to obtain several sets of sampled data, the building tree generation operation is performed on each set, and the corresponding feature generation modes are obtained, which improves the stability of the construction result and reduces the influence of noise.
It should be noted that, when the automated feature construction device for structured data provided by the above embodiment triggers an automated feature construction service, the division into the functional modules described above is given only as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the automated feature construction device for structured data provided by the above embodiment belongs to the same concept as the embodiment of the automated feature construction method for structured data, that is, the device is based on the method; its specific implementation is described in detail in the method embodiment and is not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only the preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.