CN109993217A - Automatic feature construction method and device for structured data - Google Patents


Info

Publication number
CN109993217A
Authority
CN
China
Prior art keywords
data
column
feature
node
pretreatment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910206424.6A
Other languages
Chinese (zh)
Other versions
CN109993217B (en)
Inventor
安睿
施兴天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhongan Information Technology Service Co ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201910206424.6A
Publication of CN109993217A
Application granted
Publication of CN109993217B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic feature construction method and device for structured data. The method comprises: S1: performing data processing on raw data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling; S2: initializing a pre-built construction tree in combination with the processed data; S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes; S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result. The method and device provided by the embodiments of the present invention do not require training on external data or domain-specific knowledge, have a wide scope of application and low complexity, and can perform large-scale computation to cope with the huge search space of automatic feature construction. By using a resampling strategy, they improve the stability of the construction result and reduce the influence of noise.

Description

Automatic feature construction method and device for structured data
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic feature construction method and device for structured data.
Background art
Feature engineering refers to the process of using domain knowledge to process a raw data set so that machine learning can achieve its goal (that is, the process of turning raw data into training data for a model); its purpose is to obtain better features for the training data. Data is the carrier of information, and a raw data set may contain a large amount of noise, so that the information is not expressed concisely, and hidden information is not obvious and is difficult for machine learning algorithms to capture. Through a series of engineering activities, feature engineering expresses this information with more efficient features, which not only preserves and enhances the regularities of the raw data set but also reduces the influence of uncertain factors, noise, missing data, and the like in the raw data, enabling machine learning algorithms to obtain better results. Common feature engineering methods include dimension-raising feature construction methods such as normalization and calculations between features, and dimension-reduction methods such as principal component analysis, autoencoders, and feature selection.
However, existing feature construction methods usually require model training based on external data or specific domain knowledge, which narrows their scope of application; and because feature construction requires large-scale computation, existing feature construction methods are also comparatively complex.
Summary of the invention
In order to solve the problems in the prior art, the embodiments of the present invention provide an automatic feature construction method and device for structured data, so as to overcome the problems that the prior art requires model training based on external data or specific domain knowledge, has a narrow scope of application, and is comparatively complex.
To solve one or more of the above technical problems, the technical solution adopted by the present invention is as follows:
In one aspect, an automatic feature construction method for structured data is provided, the method comprising the following steps:
S1: performing data processing on raw data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling;
S2: initializing a pre-built construction tree in combination with the processed data;
S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes;
S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
Further, if the raw data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain encoded data columns.
Further, the preprocessing further includes:
after one-hot encoding the categorical data, computing statistics separately for the encoded data columns and for the other raw data apart from the categorical data, to obtain statistical information;
according to the values in the statistical information, dividing the encoded data columns into at least one-hot columns and/or numeric columns, and labeling the one-hot columns and/or numeric columns.
Further, step S2 specifically includes:
inputting the processed data into the pre-built construction tree;
defining the feature construction operations of the construction tree;
configuring the external parameters of the construction tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
Further, if the construction tree uses a random strategy, step S2 further includes:
configuring random weights for the feature construction operations.
Further, step S3 specifically includes:
starting from the root node of the construction tree and traversing the tree, recursively performing the following operations on each node of the construction tree:
checking the splitting condition of the node according to the pruning limit parameter, and if the node cannot be split further, stopping subsequent operations on the node and taking the node as a leaf node;
traversing the original columns, obtaining a filter threshold in combination with the filter threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns comprising the processed data;
traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generation modes that satisfy the filter threshold;
if the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generation modes of the node, and recursively applying the above process to the child nodes.
Further, the data processing further includes data resampling, and the method comprises:
preprocessing the raw data to obtain preprocessed data;
resampling the preprocessed data to obtain several sampled data sets;
initializing the pre-built construction tree in combination with the sampled data;
performing the generation operation on the initialized construction tree in combination with the sampled data to obtain feature generation modes;
performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
In another aspect, an automatic feature construction device for structured data is provided, the device comprising:
a data processing module, configured to perform data processing on raw data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling;
an initialization module, configured to initialize a pre-built construction tree in combination with the processed data;
a generation operation module, configured to perform a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes;
a feature extraction module, configured to perform feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
Further, if the raw data is categorical data, the data processing module includes:
an encoding unit, configured to perform one-hot encoding on the categorical data to obtain encoded data columns.
Further, the data processing module further includes:
a statistics unit, configured to, after the categorical data has been one-hot encoded, compute statistics separately for the encoded data columns and for the other raw data apart from the categorical data, to obtain statistical information;
a division unit, configured to divide the encoded data columns into at least one-hot columns and/or numeric columns according to the values in the statistical information, and to label the one-hot columns and/or numeric columns.
Further, the initialization module includes:
a data input unit, configured to input the preprocessed data into the pre-built construction tree;
an operation definition unit, configured to define the feature construction operations of the construction tree;
a parameter configuration unit, configured to configure the external parameters of the construction tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
Further, if the construction tree uses a random strategy, the initialization module further includes:
a weight configuration unit, configured to configure random weights for the feature construction operations.
Further, the generation operation module is specifically configured to:
starting from the root node of the construction tree and traversing the tree, recursively perform the following operations on each node of the construction tree:
checking the splitting condition of the node according to the pruning limit parameter, and if the node cannot be split further, stopping subsequent operations on the node and taking the node as a leaf node;
traversing the original columns, obtaining a filter threshold in combination with the filter threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns comprising the preprocessed data, including the encoded data columns;
traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generation modes that satisfy the filter threshold;
if the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generation modes of the node, and recursively applying the above process to the child nodes.
Further, the data processing module further includes:
a sampling unit, configured to resample the preprocessed data to obtain several sampled data sets;
the initialization module is further configured to initialize the pre-built construction tree in combination with the sampled data;
the generation operation module is further configured to perform the generation operation on the initialized construction tree in combination with the sampled data to obtain feature generation modes.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
1. The automatic feature construction method and device for structured data provided by the embodiments of the present invention do not require training on external data or domain-specific knowledge, and therefore have a wide scope of application;
2. The automatic feature construction method and device for structured data provided by the embodiments of the present invention have low complexity and can perform large-scale computation to cope with the huge search space of automatic feature construction;
3. The automatic feature construction method and device for structured data provided by the embodiments of the present invention use a resampling strategy: the preprocessed data is resampled to obtain several sampled data sets, the construction tree generation operation is performed on each sampled data set to obtain the corresponding feature generation modes, and this improves the stability of the construction result and reduces the influence of noise.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the automatic feature construction method for structured data according to an exemplary embodiment;
Fig. 2 is a schematic diagram, according to an exemplary embodiment, of performing the generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes;
Fig. 3 is a schematic structural diagram of the automatic feature construction device for structured data according to an exemplary embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the automatic feature construction method for structured data according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: performing data processing on raw data, the data processing including at least preprocessing, and the preprocessing including at least missing value handling.
Specifically, in the embodiments of the present invention, raw data refers to structured data. Structured data, also called row data, is data that is logically expressed and implemented through a two-dimensional table structure. It strictly follows data format and length specifications and is usually stored and managed in a relational database. The columns of structured data are of two kinds: numeric (such as a person's age) and categorical (such as a person's gender).
Data processing operations such as preprocessing are performed on the raw data, where the preprocessing includes at least missing value handling. Missing values are the clustering, grouping, deletion, or truncation of data caused by missing information in the rough data; in other words, the values of one or more attributes in the existing data set are incomplete. In the embodiments of the present invention, missing values may be handled either by deleting them from the data or by filling them in a specified manner.
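The two missing value strategies mentioned above (deletion and filling) can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation; the row layout (a list of dicts with `None` marking a missing value), the function names, and the choice of the column mean as the default fill value are all assumptions made for the example.

```python
from statistics import mean

def drop_missing(rows):
    """Delete every row that contains at least one missing (None) value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def fill_missing(rows, column, fill_value=None):
    """Fill missing values in one column; the default fill is the mean of the
    observed values (an assumed choice, the source only says 'a specified manner')."""
    observed = [row[column] for row in rows if row[column] is not None]
    if fill_value is None:
        fill_value = mean(observed)
    return [{**row, column: fill_value if row[column] is None else row[column]}
            for row in rows]

rows = [
    {"age": 30, "gender": "male"},
    {"age": None, "gender": "female"},
    {"age": 40, "gender": "female"},
]
dropped = drop_missing(rows)       # the row with the missing age is removed
filled = fill_missing(rows, "age")  # the missing age becomes the observed mean
```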
S2: initializing a pre-built construction tree in combination with the processed data.
Specifically, the automatic feature construction method for structured data in the embodiments of the present invention is based on a construction tree. The construction tree is a kind of decision tree, but its splitting process differs from that of an ordinary decision tree: in addition to the preprocessed data, its input includes other information related to feature generation, and its final output also includes the automatic feature generation result.
S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes.
Specifically, after initialization is complete, the generation operation must still be performed on the construction tree. Once all nodes of the construction tree have been split and processed, a complete construction tree and all recorded feature generation modes are obtained.
S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
As a preferred implementation, in the embodiments of the present invention, if the raw data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain encoded data columns.
Specifically, categorical data is usually converted to numeric values before being fed into a model, for example converting gender [male, female] to [0, 1]. However, models often treat their input as continuous values by default, so using [0, 1] directly would affect the performance of the construction tree. One-hot encoding uses an N-bit state register to encode N states: each state has its own independent register bit, and at any time only one bit is active. For example, where the natural encoding is 0, 1, the one-hot encoding is 10, 01.
One-hot encoding the categorical data on the one hand allows the construction tree of the present invention to handle discrete numeric features, and on the other hand also expands the features of the data to some extent. For example, gender is a single feature by itself; after one-hot encoding it becomes two features, male and female.
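As an illustration of the encoding described above, the following sketch replaces one categorical column with one 0/1 column per category. The row layout and the `column=category` naming of the encoded columns are assumptions made for the example.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

rows = [{"age": 30, "gender": "male"}, {"age": 25, "gender": "female"}]
encoded = one_hot_encode(rows, "gender")
```

Here the single `gender` feature becomes the two features `gender=female` and `gender=male`, exactly one of which is 1 in each row.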
As a preferred implementation, in the embodiments of the present invention, the preprocessing further includes:
after one-hot encoding the categorical data, computing statistics separately for the encoded data columns and for the other raw data apart from the categorical data, to obtain statistical information.
Specifically, in the embodiments of the present invention, after the categorical data has been one-hot encoded, statistics must also be computed for the data (including the encoded data columns and the other raw data apart from the categorical data). The statistical information includes not only the values of the data columns but also the maximum, minimum, mean, variance, and so on; this information may be used during construction tree generation.
According to the values in the statistical information, the encoded data columns are divided into at least one-hot columns and/or numeric columns, and the one-hot columns and/or numeric columns are labeled.
Specifically, the encoded data columns are distinguished according to their values (that is, the values in the statistical information): columns whose feature values are only 0 and 1 are recorded as one-hot columns, and the rest are recorded as numeric columns; the one-hot columns and numeric columns are then labeled. Note that the data columns are divided and labeled so that different feature generation operations can be applied to them.
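The statistics and column labeling step can be sketched as below, assuming a column-oriented layout (a dict mapping column names to lists of numbers). The label strings `"one_hot"` and `"numeric"` and the use of population variance are illustrative choices, not taken from the source.

```python
from statistics import mean, pvariance

def column_stats(values):
    """Collect the per-column statistics mentioned in the text."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }

def label_columns(columns):
    """Label each column as one-hot (only 0/1 values) or numeric, with its stats."""
    labels = {}
    for name, values in columns.items():
        kind = "one_hot" if set(values) <= {0, 1} else "numeric"
        labels[name] = (kind, column_stats(values))
    return labels

columns = {"age": [30, 25, 41], "gender=male": [1, 0, 1]}
labels = label_columns(columns)
```

Under this rule a numeric column that happens to contain only 0 and 1 is also labeled one-hot, which matches the criterion stated in the text.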
As a preferred implementation, in the embodiments of the present invention, step S2 specifically includes:
inputting the processed data into the pre-built construction tree.
Specifically, note that the processed data includes the encoded data columns (that is, the one-hot columns and numeric columns). In the embodiments of the present invention, besides inputting the processed data into the pre-built construction tree, the labels of the one-hot columns and numeric columns and the various statistics of all encoded data columns must also be input into the construction tree.
Defining the feature construction operations of the construction tree.
Specifically, the construction tree includes several feature construction operations. Because these operations are designed for particular feature types, they have good interpretability. For example: construction operations on numeric columns may be defined to include sum, difference, product, discretization, and so on; construction operations on one-hot columns may be defined to include logical AND, logical OR, logical NOT, summation, and so on; and construction operations on a numeric column together with a one-hot column may be defined to include gate operations and so on.
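One way to organize such operations is a small registry keyed by column type, sketched below. The registry names are hypothetical, and the semantics given to the gate operation (keep the numeric value where the one-hot flag is 1, otherwise 0) and the equal-width discretization are assumptions, since the source names the operations without defining them.

```python
# Operations on two numeric columns.
NUMERIC_OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "sub": lambda a, b: [x - y for x, y in zip(a, b)],
    "mul": lambda a, b: [x * y for x, y in zip(a, b)],
}

# Operations on one-hot (0/1) columns.
ONE_HOT_OPS = {
    "and": lambda a, b: [x & y for x, y in zip(a, b)],
    "or": lambda a, b: [x | y for x, y in zip(a, b)],
    "not": lambda a: [1 - x for x in a],
}

# Assumed gate semantics: pass the numeric value through where the flag is 1.
MIXED_OPS = {
    "gate": lambda values, flags: [v if f else 0 for v, f in zip(values, flags)],
}

def discretize(values, n_bins):
    """Equal-width binning based on the column minimum and maximum."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - low) / width), n_bins - 1) for v in values]
```

For instance, `MIXED_OPS["gate"]([5, 7], [1, 0])` keeps the first value and zeroes the second.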
Configuring the external parameters of the construction tree, the external parameters including at least: a node exploration count, a pruning limit parameter, and a filter threshold parameter.
As a preferred implementation, in the embodiments of the present invention, if the construction tree uses a random strategy, step S2 further includes:
configuring random weights for the feature construction operations.
Specifically, if the construction tree uses a random strategy, the random weights of the various construction operations must also be initialized; the random weights may be set according to the statistical information of the data set. If a traversal strategy is used, no weights need to be initialized.
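Under the random strategy, choosing an operation according to its weight could look like the following sketch. The weight values and operation names are hypothetical, and `random.choices` from the Python standard library is used as one straightforward way to draw a weighted choice.

```python
import random

def pick_operation(weights, rng):
    """Pick one operation name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical weights; the source says they may be derived from data set statistics.
weights = {"add": 1.0, "mul": 1.0, "and": 2.0}
rng = random.Random(0)  # seeded so that a run is reproducible
picks = [pick_operation(weights, rng) for _ in range(5)]
```

An operation whose weight is 0 is never selected, which gives a simple way to disable operations that do not suit a data set.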
Fig. 2 is a schematic diagram, according to an exemplary embodiment, of performing the generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes. Referring to Fig. 2, as a preferred implementation, in the embodiments of the present invention, step S3 specifically includes:
starting from the root node of the construction tree and traversing the tree, recursively performing the following operations on each node of the construction tree:
checking the splitting condition of the node according to the pruning limit parameter, and if the node cannot be split further, stopping subsequent operations on the node and taking the node as a leaf node.
Specifically, the splitting condition of each node in the construction tree is checked according to the pruning limit parameter. The pruning limit parameter may be the minimum proportion of samples in a node, the minimum number of samples in the child nodes after a split, or some other condition. If the node cannot be split further, subsequent operations on the node stop and the node is taken as a leaf node.
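The pruning check can be sketched as a small predicate combining the two example conditions named above. The parameter names and default values are assumptions for illustration only.

```python
def can_split(node_size, total_size, min_fraction=0.05, min_child_size=2):
    """Return True if a node may be split further under the pruning limits.

    min_fraction:   minimum share of all samples a node must hold (assumed default).
    min_child_size: minimum number of samples each child must receive (assumed default).
    """
    if node_size / total_size < min_fraction:
        return False
    # A binary split needs enough samples to give both children the minimum size.
    return node_size >= 2 * min_child_size
```

A node that fails this check would be treated as a leaf and excluded from further feature generation.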
Traversing the original columns, obtaining a filter threshold in combination with the filter threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns comprising the processed data.
Specifically, the original columns are traversed to find the best split feature and position that satisfy the splitting condition, that is, the split that maximizes the information gain of the child nodes; this information gain is denoted I. The filter threshold is computed from the filter threshold parameter and the information gain I. For example, the filter threshold may be 1.2*I, and other parameters such as the number of samples in the current node may also be taken into account.
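The computation of I and of the filter threshold can be sketched as follows. The source does not specify how information gain is computed, so this example assumes entropy-based gain with binary splits of the form `value <= threshold`; the factor 1.2 is the example value from the text, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, gain) of the best binary split 'value <= threshold'."""
    base, best = entropy(labels), (None, 0.0)
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

def filter_threshold(original_columns, labels, factor=1.2):
    """Filter threshold = factor * I, where I is the best gain over the original columns."""
    best_gain = max(best_split(col, labels)[1] for col in original_columns.values())
    return factor * best_gain

labels = [0, 0, 1, 1]
split = best_split([1, 2, 3, 4], labels)
threshold = filter_threshold({"a": [1, 4, 2, 3]}, labels)
```

With a factor above 1, a generated column is only recorded if it separates the labels better than every original column does at the same node.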
Feature generation is performed a total number of times equal to the node exploration count. Each time, a construction operation is selected, the feature columns that the operation requires are selected, and the operation is applied to the selected feature columns to obtain a new column (a generated column). The selection may be random according to the weights of the construction operations, or it may be a traversal; if it is a traversal, this step can be performed once at the root node. Obtaining the filter threshold and producing the generated columns need not be carried out in the order described above, as long as both are available before the generated columns are traversed. The computation in feature generation may use the various statistics of the encoded data columns. That is, some feature generation operations may use the statistics of the column itself: for example, binarizing or discretizing a numeric column uses the mean, maximum, minimum, and so on.
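A minimal sketch of the random variant of this exploration step is shown below. It assumes element-wise binary operations on pairs of numeric columns; the operation set, the weights, and the tuple format `(operation, column_a, column_b)` used to describe a feature generation mode are all hypothetical.

```python
import random

# Hypothetical element-wise operation set for numeric columns.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def explore(columns, weights, n_explore, rng):
    """Generate n_explore columns; each entry is ((op, col_a, col_b), values)."""
    names = sorted(columns)
    op_names = list(weights)
    generated = []
    for _ in range(n_explore):
        # Select an operation by weight, then the two columns it needs.
        op = rng.choices(op_names, weights=[weights[o] for o in op_names], k=1)[0]
        a, b = rng.sample(names, 2)
        values = [OPS[op](x, y) for x, y in zip(columns[a], columns[b])]
        generated.append(((op, a, b), values))
    return generated

columns = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
weights = {"add": 1.0, "sub": 1.0, "mul": 2.0}
generated = explore(columns, weights, n_explore=4, rng=random.Random(1))
```

Keeping the `(operation, column_a, column_b)` description alongside each generated column is what later allows the same feature to be rebuilt on the full preprocessed data.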
Traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generation modes that satisfy the filter threshold.
Specifically, each generated column is traversed to find the best split feature and position that satisfy the splitting condition (that is, the best split point in the generated column), maximizing the information gain of the child nodes, so that the information gain corresponding to each generated column is obtained. If this information gain is greater than the filter threshold, the construction operation and feature columns of the generated column are recorded (that is, the feature generation mode that satisfies the filter threshold is recorded). Alternatively, only the result that satisfies the splitting condition with the highest information gain and also meets the filter threshold may be recorded.
If the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generation modes of the node, and recursively applying the above process to the child nodes.
Specifically, if the node satisfies the condition for continued splitting, the samples of the node (here, the data columns) are split according to the split result with the best information gain that satisfies the splitting condition (that is, the best split point), producing the original columns of the child nodes. If no information gain satisfying the splitting condition can be found, no split is made. The split may be made on an original column or on a generated column. The feature generation modes of the node are recorded, and the process is applied recursively to the child nodes until no node satisfies the splitting condition any longer. After all nodes have been split and processed, a complete construction tree and all recorded feature generation modes are obtained. Finally, the recorded feature generation modes are applied to the preprocessed data to obtain the automatic feature generation result.
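The recursive process described above can be sketched end to end as follows. This is a simplified illustration under several assumptions, not the patent's implementation: it uses the traversal strategy over all column pairs, entropy-based information gain, a minimum node size as the only pruning limit, and a hypothetical two-operation set.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Best threshold and information gain for the split 'value <= threshold'."""
    base, best = entropy(labels), (None, 0.0)
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}  # hypothetical operation set

def grow(columns, labels, min_samples=2, factor=1.2, recorded=None):
    """Recursively split a node, recording generation modes that beat the filter threshold."""
    recorded = [] if recorded is None else recorded
    if len(labels) < min_samples or len(set(labels)) < 2:  # pruning limit: leaf node
        return recorded
    # The best gain over the original columns is I; the filter threshold is factor * I.
    threshold = factor * max(best_split(col, labels)[1] for col in columns.values())
    candidates = dict(columns)
    for op_name, op in OPS.items():
        for a, b in combinations(columns, 2):
            generated = [op(x, y) for x, y in zip(columns[a], columns[b])]
            if best_split(generated, labels)[1] > threshold:
                recorded.append((op_name, a, b))
                candidates[(op_name, a, b)] = generated
    # Split on the best candidate, which may be an original or a generated column.
    name, (t, _) = max(((n, best_split(v, labels)) for n, v in candidates.items()),
                       key=lambda item: item[1][1])
    if t is None:  # no split with positive gain exists
        return recorded
    for keep in (lambda v: v <= t, lambda v: v > t):
        idx = [i for i, v in enumerate(candidates[name]) if keep(v)]
        child_columns = {n: [col[i] for i in idx] for n, col in columns.items()}
        grow(child_columns, [labels[i] for i in idx], min_samples, factor, recorded)
    return recorded

columns = {"a": [1, 4, 2, 3], "b": [4, 1, 3, 2]}
labels = [0, 0, 1, 1]
modes = grow(columns, labels)
```

In this toy data neither original column separates the labels cleanly, while the product of the two columns does, so the only recorded feature generation mode is the product.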
As a preferred implementation, in the embodiments of the present invention, the data processing further includes data resampling, and the method comprises:
preprocessing the raw data to obtain preprocessed data;
resampling the preprocessed data to obtain several sampled data sets;
initializing the pre-built construction tree in combination with the sampled data;
performing the generation operation on the initialized construction tree in combination with the sampled data to obtain feature generation modes;
performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
Specifically, to improve the stability of the construction result and reduce the influence of noise, an ensemble method may also be added in the embodiments of the present invention.
The ensemble method is as follows: a resampling strategy, such as bagging or bootstrap, is used to resample the preprocessed data and obtain several sampled data sets. In combination with the sampled data, the pre-built construction tree is initialized, and the construction tree generation operation is performed on the sampled data to obtain the corresponding feature generation modes.
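Bootstrap resampling, one of the strategies named above, draws each sampled data set from the preprocessed rows with replacement. A minimal sketch, with the function name and the choice of replicate size equal to the original size as assumptions:

```python
import random

def bootstrap_samples(rows, n_samples, rng):
    """Draw n_samples bootstrap replicates, each the same size as rows, with replacement."""
    return [rng.choices(rows, k=len(rows)) for _ in range(n_samples)]

rows = list(range(10))  # stand-in for preprocessed data rows
samples = bootstrap_samples(rows, 3, random.Random(42))
```

Each replicate would then be used to initialize and grow its own construction tree.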
In addition, statistics may be computed over all the feature generation modes obtained, and a voting result is finally obtained according to a screening condition. The screening condition may be, but is not limited to, threshold filtering: the number of occurrences of each feature generation mode is counted, a threshold is set, and the feature generation modes whose number of occurrences exceeds the threshold are selected.
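The threshold-filtering vote can be sketched with a counter over the feature generation modes recorded for each sampled data set. The tuple format of a mode and the decision to count a mode at most once per sampled data set are assumptions made for the example.

```python
from collections import Counter

def vote(modes_per_sample, min_count):
    """Keep the modes that occur in more than min_count sampled data sets."""
    counts = Counter(mode for modes in modes_per_sample for mode in set(modes))
    return sorted(mode for mode, count in counts.items() if count > min_count)

modes_per_sample = [
    [("mul", "a", "b")],
    [("mul", "a", "b"), ("add", "a", "b")],
    [("mul", "a", "b")],
]
selected = vote(modes_per_sample, min_count=1)
```

Here the product mode appears in all three sampled data sets and is selected, while the sum mode appears only once and is dropped.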
Finally, all the feature generation modes ultimately obtained are applied to the preprocessed data to obtain the automatic feature generation result.
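Applying the selected feature generation modes to the preprocessed data amounts to replaying each recorded operation on the named columns. A minimal sketch, assuming modes are stored as `(operation, column_a, column_b)` tuples and using a hypothetical operation set and naming scheme for the new columns:

```python
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}  # hypothetical operation set

def apply_modes(columns, modes):
    """Return the original columns plus one new column per feature generation mode."""
    features = dict(columns)
    for op, a, b in modes:
        features[f"{op}({a},{b})"] = [OPS[op](x, y) for x, y in zip(columns[a], columns[b])]
    return features

columns = {"a": [1, 2], "b": [3, 4]}
features = apply_modes(columns, [("mul", "a", "b")])
```

Because a mode records only the operation and its input columns, the same list of modes can be applied unchanged to new data with the same schema.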
Fig. 3 is a schematic structural diagram of an automatic feature construction device for structured data according to an exemplary embodiment. Referring to Fig. 3, the device includes:
A data processing module, configured to perform data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing-value handling;
An initialization module, configured to initialize the pre-constructed building tree in combination with the processed data;
A generation operation module, configured to perform the generation operation on the initialized building tree in combination with the processed data to obtain feature generating modes;
A feature extraction module, configured to perform feature extraction on the preprocessed data using the feature generating modes to obtain a feature generation result.
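As a rough illustration of the first and last modules, the sketch below fills missing values and then applies recorded feature generating modes to the preprocessed columns. The patent does not state which missing-value strategy is used, so mean imputation is an assumption here, as are the function names `fill_missing` and `apply_modes` and the `(operation, column, column)` encoding of a feature generating mode.

```python
from statistics import mean

def fill_missing(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    fallback = mean(observed) if observed else 0.0
    return [fallback if v is None else v for v in column]

def apply_modes(cols, modes):
    """Feature extraction: apply recorded (op, a, b) feature generating modes."""
    ops = {"add": lambda x, z: x + z, "mul": lambda x, z: x * z}
    return {
        f"{op}({a},{b})": [ops[op](x, z) for x, z in zip(cols[a], cols[b])]
        for op, a, b in modes
    }

raw = {"a": [1.0, None, 3.0], "b": [2.0, 2.0, None]}
clean = {name: fill_missing(col) for name, col in raw.items()}
features = apply_modes(clean, [("mul", "a", "b")])
```

The initialization and generation operation modules sit between these two steps and produce the list of modes passed to `apply_modes`.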
As a preferred embodiment, in embodiments of the present invention, if the initial data is categorical data, the data processing module includes:
A coding unit, configured to perform one-hot encoding on the categorical data to obtain coded data columns.
As a preferred embodiment, in embodiments of the present invention, the data processing module further includes:
A statistics unit, configured to, after the categorical data has been one-hot encoded, compute statistics separately on the coded data columns and on the other initial data apart from the categorical data, to obtain statistical information;
A division unit, configured to divide the coded data columns at least into one-hot columns and/or numerical columns according to the values in the statistical information, and to label the one-hot columns and/or numerical columns.
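The coding, statistics, and division units can be illustrated with a small sketch. The text does not say which statistic drives the division, so the rule used here (a column whose distinct values are all 0 or 1 is labeled as a one-hot column, anything else as numerical) is an assumption, as are the names `one_hot` and `label_column` and the `is_` column prefix.

```python
def one_hot(values):
    """One-hot encode a categorical column into a dict of 0/1 columns."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

def label_column(column):
    """Label a column using a simple statistic: its set of distinct values."""
    distinct = set(column)
    # Assumed rule: only 0/1 values means one-hot, otherwise numerical.
    return "one_hot" if distinct <= {0, 1} else "numerical"

color = ["red", "blue", "red"]   # categorical initial data
age = [23, 41, 35]               # other (non-categorical) initial data
encoded = one_hot(color)
labels = {name: label_column(col) for name, col in {**encoded, "age": age}.items()}
```

Labeling columns this way lets later steps choose which feature construction operations make sense for each column type.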
As a preferred embodiment, in embodiments of the present invention, the initialization module includes:
A data input unit, configured to input the preprocessed data into the pre-constructed building tree;
An operation definition unit, configured to define the feature construction operations of the building tree;
A parameter configuration unit, configured to configure the external parameters of the building tree, the external parameters including at least: the number of node explorations, a pruning limit parameter, and a filtering threshold parameter.
As a preferred embodiment, in embodiments of the present invention, if the building tree uses a randomized strategy, the initialization module further includes:
A weight configuration unit, configured to configure the random weights of the feature construction operations.
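One way to picture the initialization step is as a small configuration object holding the external parameters, plus weighted random selection of a feature construction operation for the randomized strategy. The class `TreeConfig`, its field names and default values, and the reading of the pruning limit as a maximum depth are illustrative assumptions, not details given in the text.

```python
import random
from dataclasses import dataclass, field

@dataclass
class TreeConfig:
    n_explorations: int = 10        # number of node explorations
    max_depth: int = 3              # pruning limit parameter (assumed to cap depth)
    filter_threshold: float = 0.1   # filtering threshold parameter
    # Random weights of the feature construction operations (randomized strategy).
    op_weights: dict = field(default_factory=lambda: {"add": 1.0, "mul": 1.0})

def sample_operation(config, rng):
    """Pick one feature construction operation with probability proportional to its weight."""
    ops = list(config.op_weights)
    weights = [config.op_weights[o] for o in ops]
    return rng.choices(ops, weights=weights, k=1)[0]

config = TreeConfig(op_weights={"add": 3.0, "mul": 1.0})
op = sample_operation(config, random.Random(0))
```

With these weights, `"add"` would be drawn about three times as often as `"mul"` over many samples.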
As a preferred embodiment, in embodiments of the present invention, the generation operation module is specifically configured to:
Starting from the root node of the building tree and traversing the tree, recursively perform the following operations on each node of the building tree:
Checking the splitting condition of the node according to the pruning limit parameter; if the node cannot be split further, stopping subsequent operations on the node and treating the node as a leaf node;
Traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, where the original columns comprise the preprocessed data, including the coded data columns;
Traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generating modes that satisfy the filtering threshold;
If the node satisfies the condition for continued splitting, dividing the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generating mode of the node, and recursing the above process on the child nodes.
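The recursive procedure above can be sketched as follows. Several choices here are assumptions made only to obtain something runnable: split quality is measured as variance reduction of a target column `y`, the filtering threshold is used directly as a minimum gain, the pruning limit is a depth cap plus a minimum row count of 4, and the only operations are pairwise `add` and `mul`. None of these specifics is stated in the text, and the names `best_split` and `grow` are hypothetical.

```python
from itertools import combinations
from statistics import pvariance

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def best_split(col, y):
    """Return (gain, threshold) for the split of col that most reduces the variance of y."""
    base, n, best = pvariance(y), len(y), (0.0, None)
    for t in sorted(set(col))[:-1]:  # every candidate leaves both sides non-empty
        left = [yi for ci, yi in zip(col, y) if ci <= t]
        right = [yi for ci, yi in zip(col, y) if ci > t]
        gain = base - (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if gain > best[0]:
            best = (gain, t)
    return best

def grow(cols, y, depth, max_depth, min_gain, found):
    """Recursively explore a node, recording modes whose best split meets min_gain."""
    # Pruning check: the node becomes a leaf and no further work is done.
    if depth >= max_depth or len(y) < 4:
        return
    node_best = (0.0, None, None)
    for a, b in combinations(sorted(cols), 2):        # traverse the original columns
        for name, op in OPS.items():                  # apply feature construction operations
            gen = [op(x, z) for x, z in zip(cols[a], cols[b])]
            gain, t = best_split(gen, y)
            if t is not None and gain >= min_gain:    # filtering threshold
                found.add((name, a, b))
            if gain > node_best[0]:
                node_best = (gain, t, gen)
    gain, t, gen = node_best
    if t is None:                                     # nothing to split on
        return
    for keep in (lambda v: v <= t, lambda v: v > t):  # divide rows at the best split point
        idx = [i for i, v in enumerate(gen) if keep(v)]
        child = {k: [c[i] for i in idx] for k, c in cols.items()}
        grow(child, [y[i] for i in idx], depth + 1, max_depth, min_gain, found)

cols = {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "b": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0]}
y = [a * b for a, b in zip(cols["a"], cols["b"])]
found = set()
grow(cols, y, depth=0, max_depth=2, min_gain=1.0, found=found)
```

After the call, `found` holds the `(operation, column, column)` modes whose best split met the threshold at some node; these are the candidates that the feature extraction step would later apply to the preprocessed data.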
As a preferred embodiment, in embodiments of the present invention, the data processing module further includes:
A sampling unit, configured to resample the preprocessed data to obtain several sets of sampled data;
The initialization module is further configured to initialize the pre-constructed building tree in combination with the sampled data;
The generation operation module is further configured to perform the generation operation on the initialized building tree in combination with the sampled data to obtain feature generating modes.
In conclusion technical solution provided in an embodiment of the present invention has the benefit that
1, the automated characterization construction method and device of structural data provided in an embodiment of the present invention, is not needed upon outside Data training or domain-specific knowledge, it is applied widely;
2, the automated characterization construction method and device of structural data provided in an embodiment of the present invention, complexity is low, can be with Extensive computation is carried out, constructs huge search space to cope with automated characterization;
3, the automated characterization construction method and device of structural data provided in an embodiment of the present invention, uses resampling plan Slightly, resampling is carried out to pretreated data, obtains several sampled datas, carried out building tree respectively to sampled data and generate fortune It calculates, obtains corresponding feature generating mode, be able to ascend the stability of building result, reduce the influence of noise.
It should be noted that, when the automatic feature construction device for structured data provided by the above embodiment triggers the automatic feature construction service, the division into the above functional modules is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the automatic feature construction device for structured data provided by the above embodiment and the embodiment of the automatic feature construction method for structured data belong to the same concept, that is, the device is based on the method. Its specific implementation process is detailed in the method embodiment and is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. An automatic feature construction method for structured data, characterized in that the method comprises the following steps:
S1: performing data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing-value handling;
S2: initializing the pre-constructed building tree in combination with the processed data;
S3: performing the generation operation on the initialized building tree in combination with the processed data to obtain feature generating modes;
S4: performing feature extraction on the preprocessed data using the feature generating modes to obtain a feature generation result.
2. The feature construction method for structured data according to claim 1, characterized in that, if the initial data is categorical data, the preprocessing further comprises:
Performing one-hot encoding on the categorical data to obtain coded data columns.
3. The feature construction method for structured data according to claim 2, characterized in that the preprocessing further comprises: after one-hot encoding the categorical data, computing statistics separately on the coded data columns and on the other initial data apart from the categorical data, to obtain statistical information;
According to the values in the statistical information, dividing the coded data columns at least into one-hot columns and/or numerical columns, and labeling the one-hot columns and/or numerical columns.
4. The feature construction method for structured data according to any one of claims 1 to 3, characterized in that step S2 specifically comprises:
Inputting the processed data into the pre-constructed building tree;
Defining the feature construction operations of the building tree;
Configuring the external parameters of the building tree, the external parameters including at least: the number of node explorations, a pruning limit parameter, and a filtering threshold parameter.
5. The feature construction method for structured data according to claim 4, characterized in that, if the building tree uses a randomized strategy, step S2 further comprises:
Configuring the random weights of the feature construction operations.
6. The feature construction method for structured data according to claim 4, characterized in that step S3 specifically comprises:
Starting from the root node of the building tree and traversing the tree, recursively performing the following operations on each node of the building tree:
Checking the splitting condition of the node according to the pruning limit parameter; if the node cannot be split further, stopping subsequent operations on the node and treating the node as a leaf node;
Traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns comprising the processed data;
Traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generating modes that satisfy the filtering threshold;
If the node satisfies the condition for continued splitting, dividing the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generating mode of the node, and recursing the above process on the child nodes.
7. The feature construction method for structured data according to any one of claims 1 to 3, characterized in that the data processing further includes data resampling, and the method comprises:
Preprocessing the initial data to obtain preprocessed data;
Resampling the preprocessed data to obtain several sets of sampled data;
Initializing the pre-constructed building tree in combination with the sampled data;
Performing the generation operation on the initialized building tree in combination with the sampled data to obtain feature generating modes;
Performing feature extraction on the preprocessed data using the feature generating modes to obtain a feature generation result.
8. An automatic feature construction device for structured data, characterized in that the device comprises:
A data processing module, configured to perform data processing on the initial data, the data processing including at least preprocessing, and the preprocessing including at least missing-value handling;
An initialization module, configured to initialize the pre-constructed building tree in combination with the processed data;
A generation operation module, configured to perform the generation operation on the initialized building tree in combination with the processed data to obtain feature generating modes;
A feature extraction module, configured to perform feature extraction on the preprocessed data using the feature generating modes to obtain a feature generation result.
9. The feature construction device for structured data according to claim 8, characterized in that, if the initial data is categorical data, the data processing module comprises:
A coding unit, configured to perform one-hot encoding on the categorical data to obtain coded data columns.
10. The feature construction device for structured data according to claim 9, characterized in that the data processing module further comprises:
A statistics unit, configured to, after the categorical data has been one-hot encoded, compute statistics separately on the coded data columns and on the other initial data apart from the categorical data, to obtain statistical information;
A division unit, configured to divide the coded data columns at least into one-hot columns and/or numerical columns according to the values in the statistical information, and to label the one-hot columns and/or numerical columns.
11. The feature construction device for structured data according to any one of claims 8 to 10, characterized in that the initialization module comprises:
A data input unit, configured to input the preprocessed data into the pre-constructed building tree;
An operation definition unit, configured to define the feature construction operations of the building tree;
A parameter configuration unit, configured to configure the external parameters of the building tree, the external parameters including at least: the number of node explorations, a pruning limit parameter, and a filtering threshold parameter.
12. The feature construction device for structured data according to claim 11, characterized in that, if the building tree uses a randomized strategy, the initialization module further comprises:
A weight configuration unit, configured to configure the random weights of the feature construction operations.
13. The feature construction device for structured data according to claim 11, characterized in that the generation operation module is specifically configured to:
Starting from the root node of the building tree and traversing the tree, recursively perform the following operations on each node of the building tree:
Checking the splitting condition of the node according to the pruning limit parameter; if the node cannot be split further, stopping subsequent operations on the node and treating the node as a leaf node;
Traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying the feature construction operations to the original columns to produce generated columns, the original columns comprising the preprocessed data, including the coded data columns;
Traversing the generated columns, obtaining the best split point of each generated column, and recording the feature generating modes that satisfy the filtering threshold;
If the node satisfies the condition for continued splitting, dividing the original columns and/or the generated columns at the best split point to produce the original columns of the child nodes, recording the feature generating mode of the node, and recursing the above process on the child nodes.
14. The feature construction device for structured data according to any one of claims 8 to 10, characterized in that the data processing module further comprises:
A sampling unit, configured to resample the preprocessed data to obtain several sets of sampled data;
The initialization module is further configured to initialize the pre-constructed building tree in combination with the sampled data;
The generation operation module is further configured to perform the generation operation on the initialized building tree in combination with the sampled data to obtain feature generating modes.
CN201910206424.6A 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data Active CN109993217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910206424.6A CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910206424.6A CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Publications (2)

Publication Number Publication Date
CN109993217A true CN109993217A (en) 2019-07-09
CN109993217B CN109993217B (en) 2021-06-15

Family

ID=67130699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910206424.6A Active CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Country Status (1)

Country Link
CN (1) CN109993217B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011091470A1 (en) * 2010-01-27 2011-08-04 National Ict Australia Limited Query processing of tree-structured data
CN103049587A (en) * 2011-10-13 2013-04-17 同济大学 Feature recognition method based on hierarchy and construction method of product feature semantic network
US20180053107A1 (en) * 2016-08-19 2018-02-22 Sap Se Aspect-based sentiment analysis
CN107729349A (en) * 2017-08-25 2018-02-23 昆仑智汇数据科技(北京)有限公司 A kind of characteristic data set automatic generation method and device based on metadata
CN108363759A (en) * 2018-02-01 2018-08-03 厦门快商通信息技术有限公司 Subject tree generation method and system based on structural data and Intelligent dialogue method
CN108829804A (en) * 2018-06-05 2018-11-16 洛阳师范学院 Based on the high dimensional data similarity join querying method and device apart from partition tree


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDREW CAIN等: "An automatic test data generation system based on the integrated classification-tree methodology", 《SOFTWARE ENGINEERING RESEARCH AND APPLICATIONS》 *
刘龙霞等: "基于分类树和贪心算法的测试数据自动生成方法", 《计算机工程与设计》 *

Also Published As

Publication number Publication date
CN109993217B (en) 2021-06-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240306

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240415

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China
