CN109993217B

CN109993217B - Automatic feature construction method and device for structured data

Info

Publication number: CN109993217B
Application number: CN201910206424.6A
Authority: CN
Inventors: 安睿; 施兴天
Original assignee: Zhongan Information Technology Service Co Ltd
Current assignee: Shanghai Zhongan Information Technology Service Co Ltd
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2021-06-15
Anticipated expiration: 2039-03-18
Also published as: CN109993217A

Abstract

The invention discloses an automatic feature construction method and device for structured data. The method comprises the following steps: S1: performing data processing on original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing; S2: performing an initialization operation on a pre-constructed construction tree in combination with the processed data; S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain a feature generation mode; S4: performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result. The method and device do not rely on training with external data or on domain-specific knowledge, have a wide application range and low complexity, and can run at large scale so as to construct a huge search space for automatic features. In addition, a resampling strategy is used, which can improve the stability of the construction result and reduce the influence of noise.

Description

Automatic feature construction method and device for structured data

Technical Field

The invention relates to the technical field of data processing, in particular to an automatic feature construction method and device for structured data.

Background

Feature engineering refers to the process of using domain knowledge to process an original data set so that machine learning can achieve its purpose (i.e., the process of converting original data into training data for a model); its goal is to obtain better training data features. Data is a carrier of information, and an original data set may contain a large amount of noise, so that information is not expressed concisely, and hidden information is not obvious and is difficult for a machine learning algorithm to capture. Feature engineering uses a series of engineering activities to express this information with more efficient features, so that the regularities of the original data set are preserved and strengthened, the influence of uncertain factors, noise, missing data and the like is reduced, and a machine learning algorithm can obtain better results. Common feature engineering includes dimension-increasing methods based on feature construction, such as normalization and inter-feature calculation, and dimension-reducing methods, such as principal component analysis, autoencoders and feature selection.

However, existing feature construction methods generally require model training on external data or domain-specific knowledge, which narrows their application range, and they are complex because feature construction requires large-scale computation.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide an automatic feature construction method and apparatus for structured data, so as to overcome the problems in the prior art that model training needs to be performed based on external data or specific domain knowledge, the application range is small, and the model training is complex.

In order to solve one or more technical problems, the invention adopts the technical scheme that:

in one aspect, a method for automatic feature construction of structured data is provided, which includes the following steps:

S1: performing data processing on original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;

S2: performing an initialization operation on a pre-constructed construction tree in combination with the processed data;

S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain a feature generation mode;

S4: performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.

Further, if the original data is the category type data, the preprocessing further includes:

and carrying out one-hot coding on the class type data to obtain a coded data column.

Further, the preprocessing further comprises:

after the category type data is subjected to one-hot coding, counting the coded data column and other original data except the category type data respectively to obtain statistical information;

and dividing the coded data column into at least one one-hot column and/or numerical value column according to numerical values in the statistical information, and labeling the one-hot column and/or numerical value column.

Further, the step S2 specifically includes:

inputting the processed data into the pre-constructed construction tree;

defining a feature building operation of the building tree;

configuring external parameters of the building tree, wherein the external parameters at least comprise: node exploration times, pruning limit parameters and filtering threshold parameters.

Further, if the building tree is a random policy, step S2 further includes:

configuring random weights for the feature construction operations.

Further, the step S3 specifically includes:

starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:

checking the splitting condition of the nodes of the constructed tree according to the pruning limiting parameters, if the nodes can not be split again, stopping the subsequent operation of the nodes, and taking the nodes as leaf nodes;

traversing an original column, acquiring a filtering threshold value by combining the filtering threshold value parameter, and generating a generated column for the original column by using a feature construction operation, wherein the original column comprises the processed data;

traversing the generated column, acquiring the optimal splitting point of the generated column, and recording the feature generation mode meeting the filtering threshold;

and if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording the characteristic generation mode of the node, and performing recursion on the child node.

Further, the data processing further comprises data resampling, and the method comprises:

preprocessing the original data to obtain preprocessed data;

performing data resampling on the preprocessed data to obtain a plurality of sampling data;

carrying out initialization operation on a pre-constructed construction tree by combining the sampling data;

combining the sampling data, performing generation operation on the initialized construction tree to obtain a feature generation mode;

and performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.

In another aspect, an apparatus for automatic feature construction of structured data is provided, the apparatus comprising:

the data processing module is used for carrying out data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;

the initialization module is used for carrying out initialization operation on a pre-constructed construction tree in combination with the processed data;

the generating operation module is used for combining the processed data to perform generating operation on the initialized construction tree to acquire a feature generating mode;

and the feature extraction module is used for extracting features of the preprocessed data by using the feature generation mode to obtain a feature generation result.

Further, if the original data is the category type data, the data processing module includes:

and the coding unit is used for carrying out one-hot coding on the class type data to obtain a coded data column.

Further, the data processing module further includes:

the statistical unit is used for respectively carrying out statistics on the coded data column and other original data except the category type data after carrying out the one-hot coding on the category type data to obtain statistical information;

and the dividing unit is used for dividing the coded data column into at least one one-hot column and/or one numerical value column according to the numerical values in the statistical information, and labeling the one-hot column and/or the numerical value column.

Further, the initialization module includes:

the data input unit is used for inputting the preprocessed data into the pre-constructed construction tree;

an operation definition unit for defining the characteristic construction operation of the construction tree;

a parameter configuration unit, configured to configure external parameters of the building tree, where the external parameters at least include: node exploration times, pruning limit parameters and filtering threshold parameters.

Further, if the constructed tree is a random policy, the initialization module further includes:

and the weight configuration unit is used for configuring the random weight of the feature construction operation.

Further, the generating operation module is specifically configured to:

traversing an original column, acquiring a filtering threshold value by combining the filtering threshold value parameter, and generating a generated column by operating the original column by using a characteristic construction operation, wherein the original column comprises preprocessed data including the coded data column;

Further, the data processing module further includes:

the sampling unit is used for resampling the preprocessed data to obtain a plurality of sampling data;

the initialization module is also used for carrying out initialization operation on a pre-constructed construction tree in combination with the sampling data;

and the generating operation module is also used for combining the sampling data to perform generating operation on the initialized construction tree to acquire a feature generating mode.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

1. the method and the device for automatically constructing the characteristics of the structured data do not need to be based on external data training or specific field knowledge, and have wide application range;

2. the method and the device for automatically constructing the features of the structured data, provided by the embodiment of the invention, have low complexity, and can perform large-scale operation so as to construct a huge search space for the automatic features;

3. according to the method and the device for automatically constructing the characteristics of the structured data, the resampling strategy is used for resampling the preprocessed data to obtain a plurality of sampling data, the construction tree generation operation is respectively carried out on the sampling data to obtain the corresponding characteristic generation mode, the stability of the construction result can be improved, and the influence of noise is reduced.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flow diagram illustrating a method for automatic feature building of structured data, according to an exemplary embodiment;

FIG. 2 is a schematic diagram illustrating a feature generation manner obtained by performing a generation operation on an initialized construction tree in combination with processed data according to an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating the structure of an apparatus for automated feature construction of structured data according to an exemplary embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a flow diagram illustrating a method for automatic feature building of structured data, according to an exemplary embodiment, and as shown with reference to FIG. 1, the method includes the steps of:

S1: Performing data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing.

Specifically, in the embodiment of the present invention, the original data refers to structured data. The structured data is also called row data, is data logically expressed and realized by a two-dimensional table structure, strictly follows the data format and length specification, and is stored and managed by a relational database. The columns of structured data are divided into two categories, a numerical type (such as age of a person) and a category type (such as gender of a person).

Data processing operations such as preprocessing are performed on the original data, and the preprocessing at least comprises missing value processing. A missing value arises when a lack of information causes data in the raw data set to be clustered, grouped, deleted or truncated; that is, the value of one or more attributes in the existing data set is incomplete. In the embodiment of the present invention, missing values may be handled either by deleting them from the data or by filling them in a specified manner.
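As an illustration only, the two handling strategies named above (deletion, or filling in a specified manner) can be sketched in Python. The function names, the use of `None` to mark a missing value, and the choice of the column mean as the default fill value are assumptions made for this sketch; the embodiment does not prescribe them.

```python
from statistics import mean

def drop_missing(rows):
    """Delete every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

def fill_missing(rows, column, fill_value=None):
    """Fill missing values in one numerical column.

    Assumption: when no fill value is given, the mean of the observed
    values is used. Any other specified manner could be substituted.
    """
    if fill_value is None:
        fill_value = mean(row[column] for row in rows if row[column] is not None)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

rows = [
    {"age": 30, "income": 5000},
    {"age": None, "income": 7000},
    {"age": 50, "income": 9000},
]
dropped = drop_missing(rows)        # the second row is removed
filled = fill_missing(rows, "age")  # the missing age is replaced by the mean
```

Both functions return new lists and leave the input rows unchanged, so the original data remains available for later steps.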

S2: Performing an initialization operation on the pre-constructed construction tree in combination with the processed data.

Specifically, in the embodiment of the present invention, the automatic feature construction method for structured data is based on a construction tree (also called the building tree in this description). The construction tree is a kind of decision tree, but its splitting process differs from that of an ordinary decision tree: in addition to the preprocessed data, its input includes other information related to feature generation, and its final output also includes the automatic feature generation result.

S3: Performing a generation operation on the initialized construction tree in combination with the processed data to obtain a feature generation mode.

Specifically, generation operation is required to be performed on the constructed tree after initialization is completed, and after all nodes of the constructed tree are split and processed, a whole constructed tree and all recorded feature generation modes can be obtained.

As a preferred implementation manner, in an embodiment of the present invention, if the original data is class type data, the preprocessing further includes:

Specifically, category-type data is usually converted into numerical values before being fed into a model; for example, gender [male, female] is converted into [0, 1]. However, a model usually treats numerical values as continuous by default, so using [0, 1] directly would degrade the effect of the construction tree. One-hot encoding uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time. For example, if the natural codes are 0 and 1, the corresponding one-hot codes are 10 and 01.

One-hot encoding the category-type data enables the construction tree of the invention to process discontinuous numerical features and, at the same time, expands the features of the data to a certain extent. For example, gender is a single feature by itself and becomes two features, male and female, after one-hot encoding.
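A minimal sketch of this encoding step follows. The row-of-dictionaries layout, the function name and the `column_category` naming of the new columns are assumptions chosen for illustration.

```python
def one_hot_encode(rows, column):
    """Replace one category-type column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Exactly one of the new columns is 1 for each row.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

rows = [
    {"age": 30, "gender": "male"},
    {"age": 25, "gender": "female"},
]
encoded = one_hot_encode(rows, "gender")
```

In this example the single gender feature becomes the two columns `gender_female` and `gender_male`, matching the expansion described above.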

As a preferred embodiment, in the embodiment of the present invention, the preprocessing further includes:

after the category type data is subjected to the one-hot coding, the coded data column and other original data except the category type data are respectively counted to obtain statistical information.

Specifically, in the embodiment of the present invention, after the class-type data is subjected to the one-hot encoding, statistics on the data (including the encoded data column and other raw data besides the class-type data) is also required. The statistical information includes not only the values of the data columns but also the maximum value, minimum value, mean, variance, etc., which can be used in the construction tree generation.

Specifically, the encoded data columns are distinguished according to their values (i.e., the values in the statistical information): columns whose values are only 0 and 1 are recorded as one-hot columns, the rest are recorded as numerical columns, and the one-hot columns and numerical columns are labeled accordingly. Note that the data columns are divided and labeled so that different feature generation operations can be applied to them.
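The statistics and labeling steps might look like the following sketch. The column-oriented `table` dictionary, the function names and the label strings `"one_hot"` and `"numerical"` are hypothetical; the use of the population variance is likewise an assumption, since the description only says "variance".

```python
from statistics import mean, pvariance

def column_statistics(values):
    """Collect the statistics mentioned above for one data column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "variance": pvariance(values),
        "distinct": set(values),
    }

def label_columns(table):
    """Label each column as one-hot (values only 0 and 1) or numerical."""
    stats = {name: column_statistics(values) for name, values in table.items()}
    labels = {
        name: "one_hot" if s["distinct"] <= {0, 1} else "numerical"
        for name, s in stats.items()
    }
    return stats, labels

table = {"age": [30, 25, 41], "gender_male": [1, 0, 1]}
stats, labels = label_columns(table)
```

Under this rule a numerical column that happens to contain only 0 and 1 is also labeled one-hot, which follows the value-based criterion stated in the text.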

As a preferred implementation manner, in an embodiment of the present invention, the step S2 specifically includes:

inputting the processed data into the pre-constructed building tree.

Specifically, it should be noted that the processed data includes the encoded data columns (i.e., one-hot columns and numerical columns). In the embodiment of the present invention, in addition to the processed data, the labels of the one-hot columns and the numerical columns and the statistical information of all the encoded data columns also need to be input into the pre-constructed construction tree.

Defining a feature building operation for the building tree.

Specifically, the construction tree comprises a plurality of feature construction operations. Because these operations are designed for the feature types of the data, they have good interpretability. For example, the construction operations on numerical columns may be defined to include summing, differencing, integrating, discretizing, etc.; the construction operations on one-hot columns may be defined to include logical AND, logical OR, logical NOT, summing, etc.; and the construction operations that combine numerical columns with one-hot columns may be defined to include gate operations, etc.
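A few of these operations can be sketched as plain Python functions over columns stored as lists. The registry names are hypothetical. Two readings are assumptions of this sketch rather than definitions from the embodiment: discretization is shown as binarization at the column mean, and the gate operation is read as passing a numerical value through only where the one-hot flag is 1.

```python
from statistics import mean

# Operations on two numerical columns.
NUMERIC_OPS = {
    "sum": lambda a, b: [x + y for x, y in zip(a, b)],
    "difference": lambda a, b: [x - y for x, y in zip(a, b)],
}

# Operations on two one-hot (0/1) columns.
ONE_HOT_OPS = {
    "and": lambda a, b: [x & y for x, y in zip(a, b)],
    "or": lambda a, b: [x | y for x, y in zip(a, b)],
}

def logical_not(column):
    """Logical NOT of a one-hot column."""
    return [1 - x for x in column]

def binarize_by_mean(column):
    """Assumed discretization: 1 where the value exceeds the column mean."""
    cut = mean(column)
    return [1 if x > cut else 0 for x in column]

def gate(one_hot, numeric):
    """Assumed gate operation: keep the value where the flag is 1, else 0."""
    return [value if flag == 1 else 0 for flag, value in zip(one_hot, numeric)]

age = [30, 25, 41]
is_male = [1, 0, 1]
is_smoker = [0, 0, 1]
male_smoker = ONE_HOT_OPS["and"](is_male, is_smoker)
male_age = gate(is_male, age)
```

Each generated column has a direct reading (for example, "age of male customers"), which is the interpretability benefit the paragraph refers to.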

As a preferred implementation manner, in an embodiment of the present invention, if the building tree is a random policy, step S2 further includes:

configuring random weights for the feature construction operations.

Specifically, if the building tree is a random strategy, random weights of various building operations need to be initialized, and the random weights can be set according to statistical information of the data set. If a traversal strategy is used, no initialization weights are required.
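Under the random strategy, selecting a construction operation according to its weight could be sketched as below. The weight values, the operation names and the function name are illustrative assumptions; how the weights are derived from the statistical information of the data set is not specified here and is left to the caller.

```python
import random

def pick_operation(weights, rng):
    """Draw one construction operation with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]

# Hypothetical initial weights; "sum" is three times as likely as the others.
weights = {"sum": 3.0, "difference": 1.0, "gate": 1.0}
rng = random.Random(0)  # seeded so that a run can be reproduced
picks = [pick_operation(weights, rng) for _ in range(10)]
```

With the traversal strategy no such weights are needed, because every operation and column combination is enumerated instead of sampled.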

Fig. 2 is a schematic diagram illustrating a feature generation manner obtained by performing a generation operation on an initialized construction tree in combination with processed data according to an exemplary embodiment, and referring to fig. 2, as a preferred implementation manner, in an embodiment of the present invention, the step S3 specifically includes:

Checking the splitting condition of the nodes of the construction tree according to the pruning limit parameters; if a node can no longer be split, stopping subsequent operations on the node and taking the node as a leaf node.

Specifically, the splitting condition of each node of the construction tree is checked according to the pruning limit parameter. The pruning limit parameter may be the minimum proportion of samples at the node, the minimum number of samples in a child node after splitting, or another condition. If the node can no longer be split, subsequent operations on the node stop and the node is taken as a leaf node.
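The two example pruning limits can be combined into a single check, sketched here. The parameter names and default values are assumptions, and requiring at least twice the minimum child size is one simple way to guarantee that a two-way split can satisfy the child-size limit.

```python
def can_split(node_samples, total_samples, min_sample_ratio=0.05, min_child_samples=2):
    """Return True if the node may still be split under the pruning limits."""
    # Limit 1: the node must hold at least a minimum share of all samples.
    if node_samples / total_samples < min_sample_ratio:
        return False
    # Limit 2: a two-way split must be able to leave enough samples in each child.
    return node_samples >= 2 * min_child_samples

root_ok = can_split(node_samples=100, total_samples=1000)
too_small_share = can_split(node_samples=10, total_samples=1000)
too_few_for_children = can_split(node_samples=3, total_samples=10)
```

A node for which the check returns False becomes a leaf node and is not explored further.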

Traversing an original column, acquiring a filtering threshold value by combining the filtering threshold value parameter, and generating a generated column for the original column by using a feature construction operation, wherein the original column comprises the processed data.

Specifically, the original columns are traversed to find the optimal splitting feature and position that satisfy the splitting condition, i.e., the split that maximizes the information gain of the child nodes; this information gain is recorded as I. The filtering threshold is calculated from the filtering threshold parameter and the information gain I; for example, the filtering threshold may be 1.2 × I, and other parameters such as the number of samples at the current node may also be taken into account.
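The computation of I and of the filtering threshold can be illustrated as follows. This sketch assumes a classification target with discrete labels, entropy-based information gain, and two-way splits of the form `value <= threshold`; none of these choices is fixed by the description, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (information gain, threshold) of the best split `value <= threshold`."""
    parent, best = entropy(labels), (0.0, None)
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - children > best[0]:
            best = (parent - children, t)
    return best

def filter_threshold(original_columns, labels, factor=1.2):
    """Return (I, factor * I), where I is the best gain over all original columns."""
    gain_i = max(best_split(column, labels)[0] for column in original_columns.values())
    return gain_i, factor * gain_i

labels = [0, 1, 1, 0]
gain_i, threshold = filter_threshold({"age": [1, 2, 3, 4]}, labels)
```

A generated column is only worth recording if its own best split beats this threshold, that is, if it is clearly more informative than the best original column at the node.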

Feature generation is then performed as many times as the configured number of node explorations: each time, a construction operation is selected, the feature columns it requires are selected, and the operation is applied to the selected columns to obtain a new column (i.e., a generated column). The selection may be either random, according to the weights of the construction operations, or by traversal; if traversal is used, this step can be performed once at the root node. The filtering threshold and the generated columns need not be obtained in the above order, as long as both are available before the generated columns are traversed. A feature generation operation may use the statistical information of the encoded data columns; for example, binarization and discretization of a log column use the mean, maximum, minimum and so on.
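One possible shape of this exploration loop under the random strategy is sketched below. Only two numerical operations are included, operations are drawn uniformly rather than by weight, and the `(operation, left, right)` key used to remember how a column was generated is a hypothetical representation of a feature generation mode.

```python
import random

OPERATIONS = {
    "sum": lambda a, b: [x + y for x, y in zip(a, b)],
    "difference": lambda a, b: [x - y for x, y in zip(a, b)],
}

def generate_columns(columns, n_explorations, rng):
    """Generate up to n_explorations new columns from randomly chosen inputs."""
    generated = {}
    names = sorted(columns)
    for _ in range(n_explorations):
        operation = rng.choice(sorted(OPERATIONS))  # select a construction operation
        left, right = rng.sample(names, 2)          # select the columns it needs
        # Repeated draws of the same combination overwrite the same entry.
        generated[(operation, left, right)] = OPERATIONS[operation](
            columns[left], columns[right]
        )
    return generated

columns = {"a": [1, 2, 3], "b": [10, 20, 30], "c": [5, 5, 5]}
generated = generate_columns(columns, n_explorations=4, rng=random.Random(0))
```

Keeping the generating key next to each column is what later allows the successful modes to be recorded and replayed on the full data.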

Traversing the generated columns, acquiring the optimal splitting point of each generated column, and recording the feature generation modes that satisfy the filtering threshold.

Specifically, each generated column is traversed to find its optimal splitting feature and position (i.e., the optimal splitting point of the generated column) that satisfy the splitting condition and maximize the information gain of the child nodes, which yields the information gain corresponding to each generated column. If the information gain is greater than the filtering threshold, the construction operation and the feature columns of the generated column are recorded (i.e., the feature generation mode satisfying the filtering threshold is recorded). Alternatively, only the result that satisfies the splitting condition, has the highest information gain and satisfies the filtering threshold may be recorded.
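Both recording variants can be expressed in a few lines once the information gain of each generated column is known. In this sketch the gains are given as a ready-made dictionary keyed by a hypothetical `(operation, left, right)` tuple, and the numeric values are invented for illustration.

```python
def record_modes(gains, threshold, keep_only_best=False):
    """Keep the feature generation modes whose information gain exceeds the threshold."""
    passing = {mode: gain for mode, gain in gains.items() if gain > threshold}
    if keep_only_best and passing:
        best = max(passing, key=passing.get)
        return {best: passing[best]}
    return passing

# Illustrative gains of three generated columns at one node.
gains = {
    ("sum", "a", "b"): 0.9,
    ("difference", "a", "b"): 0.2,
    ("sum", "a", "c"): 0.5,
}
recorded_all = record_modes(gains, threshold=0.4)
recorded_best = record_modes(gains, threshold=0.4, keep_only_best=True)
```

Recording every passing mode yields more candidate features per node, while recording only the best one keeps the output small.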

Specifically, if the node satisfies the condition for continuing to split, the samples of the node (here, the data columns) are split according to the splitting result that satisfies the splitting condition and has the best information gain (i.e., the optimal splitting point), so as to generate the original columns of the child nodes. If no information gain satisfying the splitting condition can be found, no splitting is performed. The column used for splitting may be either an original column or a generated column. The feature generation mode of the node is recorded, and the process recurses on the child nodes until no node satisfies the splitting condition. After all nodes have been split and processed, the whole construction tree and all recorded feature generation modes are obtained. Finally, the recorded feature generation modes are applied to the preprocessed data to obtain the automatic feature generation result.
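The whole recursive procedure can be condensed into a small, heavily simplified sketch. It is not the patented implementation: it assumes class labels with entropy-based information gain, uses the traversal strategy with pairwise sums as the only construction operation, prunes on a minimum sample count, and passes only the original columns down to the child nodes. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (information gain, threshold) of the best split `value <= threshold`."""
    parent, best = entropy(labels), (0.0, None)
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - children > best[0]:
            best = (parent - children, t)
    return best

def grow(columns, labels, modes, min_samples=4, factor=1.2):
    """Recursively grow the tree, appending recorded generation modes to `modes`."""
    if len(labels) < min_samples or len(set(labels)) < 2:
        return  # pruning limit reached: this node is a leaf
    candidates = {("original", n): (c, *best_split(c, labels)) for n, c in columns.items()}
    threshold = factor * max(gain for _, gain, _ in candidates.values())
    for a, b in combinations(sorted(columns), 2):  # traversal strategy, sums only
        values = [x + y for x, y in zip(columns[a], columns[b])]
        gain, t = best_split(values, labels)
        candidates[("sum", a, b)] = (values, gain, t)
        if gain > threshold and ("sum", a, b) not in modes:
            modes.append(("sum", a, b))  # record the feature generation mode
    values, gain, t = max(candidates.values(), key=lambda c: c[1])
    if t is None:
        return  # no split with positive information gain
    for side in (True, False):  # split on the best original or generated column
        keep = [i for i, v in enumerate(values) if (v <= t) == side]
        grow({n: [c[i] for i in keep] for n, c in columns.items()},
             [labels[i] for i in keep], modes, min_samples, factor)

# Toy data in which the label is 1 exactly when a + b >= 6.
a = [1, 5, 1, 2, 4, 2, 3, 2]
b = [1, 1, 5, 2, 2, 4, 2, 3]
labels = [0, 1, 1, 0, 1, 1, 0, 0]
modes = []
grow({"a": a, "b": b}, labels, modes)
```

On this toy data neither original column separates the classes well, whereas the generated sum does, so the sum of `a` and `b` is the mode that gets recorded and the root is split on it.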

As a preferred implementation, in an embodiment of the present invention, the data processing further includes data resampling, and the method includes:

preprocessing the original data to obtain preprocessed data;

combining the sampling data, performing generation operation on the initialized construction tree to obtain a feature generation mode;

Specifically, in order to improve the stability of the construction result and reduce the influence of noise, an integration method may be further added in the embodiment of the present invention.

The integration method is specifically as follows: the preprocessed data is resampled using a resampling strategy, such as bagging or bootstrapping, to obtain a plurality of sampled data sets. In combination with each sampled data set, the pre-constructed construction tree is initialized and the construction tree generation operation is performed, so as to obtain the corresponding feature generation modes.
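Bootstrap resampling, one of the strategies mentioned, can be sketched as drawing rows with replacement. The function name and the choice to make every sample the same size as the input are assumptions of this sketch.

```python
import random

def bootstrap_samples(rows, n_samples, rng):
    """Draw n_samples data sets, each the size of `rows`, sampling with replacement."""
    return [[rng.choice(rows) for _ in rows] for _ in range(n_samples)]

rows = [{"age": 30}, {"age": 25}, {"age": 41}, {"age": 52}]
samples = bootstrap_samples(rows, n_samples=3, rng=random.Random(0))
```

One construction tree would then be initialized and generated per sampled data set, each producing its own list of feature generation modes.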

In addition, statistics may be collected over all the obtained feature generation modes, and a voting result is finally obtained according to a screening condition. The screening condition may be, but is not limited to, threshold filtering: the number of times each feature generation mode occurs is counted, a threshold is set, and the feature generation modes whose number of occurrences exceeds the threshold are selected.
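Threshold voting over the modes found by several trees might be sketched as follows. Counting each mode at most once per tree is an assumption, as are the names and the `(operation, left, right)` representation of a mode.

```python
from collections import Counter

def vote(modes_per_tree, threshold):
    """Select the modes that occur in more than `threshold` trees."""
    counts = Counter(mode for modes in modes_per_tree for mode in set(modes))
    return sorted(mode for mode, count in counts.items() if count > threshold)

# Hypothetical modes recorded by three trees built on three resampled data sets.
modes_per_tree = [
    [("sum", "a", "b"), ("and", "x", "y")],
    [("sum", "a", "b")],
    [("sum", "a", "b"), ("difference", "a", "b")],
]
selected = vote(modes_per_tree, threshold=1)
```

Modes that appear in only one resampled tree are likely to reflect noise in that sample, which is why they are filtered out here.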

Finally, all the feature generation modes obtained in this way are applied to the preprocessed data to obtain the automatic feature generation result.
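Applying the selected modes to the preprocessed data amounts to replaying each recorded operation on the full columns. The sketch below assumes that a mode is stored as an `(operation, left, right)` tuple and that only two operations exist; both are illustrative choices.

```python
OPERATIONS = {
    "sum": lambda a, b: [x + y for x, y in zip(a, b)],
    "difference": lambda a, b: [x - y for x, y in zip(a, b)],
}

def apply_modes(table, modes):
    """Append one generated column per recorded (operation, left, right) mode."""
    result = dict(table)  # keep the preprocessed columns as they are
    for operation, left, right in modes:
        name = f"{operation}({left},{right})"
        result[name] = OPERATIONS[operation](table[left], table[right])
    return result

table = {"a": [1, 5, 2], "b": [1, 1, 4]}
features = apply_modes(table, [("sum", "a", "b")])
```

The modes are applied to the full preprocessed data rather than to the resampled subsets, so every row receives the new features.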

Fig. 3 is a schematic structural diagram illustrating an apparatus for automatic feature construction of structured data according to an exemplary embodiment, and referring to fig. 3, the apparatus includes:

As a preferred implementation manner, in an embodiment of the present invention, if the original data is category-type data, the data processing module includes:

As a preferred implementation manner, in an embodiment of the present invention, the data processing module further includes:

As a preferred implementation manner, in an embodiment of the present invention, the initialization module includes:

As a preferred implementation manner, in an embodiment of the present invention, if the building tree is a random policy, the initialization module further includes:

As a preferred implementation manner, in an embodiment of the present invention, the generating operation module is specifically configured to:

In summary, the technical solution provided by the embodiment of the present invention has the following beneficial effects:

It should be noted that: the automatic feature building apparatus for structured data provided in the foregoing embodiment is only illustrated by dividing the functional modules when triggering an automatic feature building service, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the automatic feature construction device for structured data and the automatic feature construction method for structured data provided in the above embodiments belong to the same concept, that is, the method is based on the system, and the specific implementation process thereof is described in the method embodiments, and will not be described herein again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for automatic feature construction of structured data, the method comprising the steps of:

S3: combining the processed data, performing generation operation on the initialized construction tree to obtain a feature generation mode, comprising:

starting from the root node of the building tree, recursively performing the following operations on the nodes of the building tree in a traversal mode:

traversing an original column, acquiring a filtering threshold value by combining with a filtering threshold value parameter, and generating a generated column for the original column by using a feature construction operation, wherein the original column comprises the processed data;

if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording a characteristic generation mode of the node, and performing recursion on the child node;

S4: performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.

2. The method of claim 1, wherein if the raw data is class type data, the preprocessing further comprises:

3. The method of automatic feature construction of structured data according to claim 2, wherein said preprocessing further comprises: after the category type data is subjected to one-hot coding, counting the coded data column and other original data except the category type data respectively to obtain statistical information;

4. The method for automatically constructing features of structured data according to any one of claims 1 to 3, wherein the step S2 specifically comprises:

inputting the processed data into the pre-constructed construction tree;

defining a feature building operation of the building tree;

5. The method according to claim 4, wherein if the building tree is a random strategy, the step S2 further comprises:

configuring random weights for the feature construction operations.

6. A method for automatic feature construction of structured data according to any of claims 1 to 3, wherein the data processing further comprises data resampling, the method comprising:

preprocessing the original data to obtain preprocessed data;

7. An apparatus for automatic feature construction of structured data, the apparatus comprising:

the generation operation module is specifically configured to:

traversing an original column, acquiring a filtering threshold value by combining with a filtering threshold value parameter, and generating a generated column by operating the original column by using a characteristic construction operation, wherein the original column comprises preprocessed data including a coded data column;

if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording a characteristic generation mode of the node, and performing recursion on the child node;

8. The apparatus according to claim 7, wherein if the raw data is class type data, the data processing module comprises:

9. The apparatus for automated feature building of structured data according to claim 8, wherein said data processing module further comprises:

10. The apparatus for automated feature construction of structured data according to any of claims 7 to 9, wherein the initialization module comprises:

11. The apparatus for automated feature building of structured data according to claim 10, wherein if the building tree is a random policy, the initialization module further comprises:

12. The apparatus for automated feature construction of structured data according to any of claims 7 to 9, wherein the data processing module further comprises: