CN109993217B - Automatic feature construction method and device for structured data - Google Patents

Automatic feature construction method and device for structured data

Info

Publication number: CN109993217B
Authority: CN (China)
Prior art keywords: data, column, feature, tree, construction
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201910206424.6A
Other languages: Chinese (zh)
Other versions: CN109993217A (en)
Inventors: 安睿, 施兴天
Current Assignee: Shanghai Zhongan Information Technology Service Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Zhongan Information Technology Service Co Ltd
Application filed by: Zhongan Information Technology Service Co Ltd
Priority to: CN201910206424.6A
Published as: CN109993217A, CN109993217B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic feature construction method and device for structured data. The method comprises the following steps: S1: performing data processing on original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing; S2: initializing a pre-constructed construction tree in combination with the processed data; S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes; S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result. The method and device require neither training on external data nor domain-specific knowledge, are widely applicable and of low complexity, and can run at large scale so as to build a very large search space for automatic features. In addition, a resampling strategy is used, which improves the stability of the construction result and reduces the influence of noise.

Description

Automatic feature construction method and device for structured data
Technical Field
The invention relates to the technical field of data processing, in particular to an automatic feature construction method and device for structured data.
Background
Feature engineering is the process of using domain knowledge to process an original data set so that machine learning can achieve its goal (i.e., the process of converting original data into training data for a model); its purpose is to obtain better training data features. Data is a carrier of information, and an original data set may contain a large amount of noise, so the information is not expressed concisely, and hidden information is not obvious and is difficult for a machine learning algorithm to capture. Feature engineering uses a series of engineering activities to express this information with more efficient features, so that the regularities of the original data set are preserved and strengthened, the influence of uncertainty, noise, missing data and similar factors is reduced, and a machine learning algorithm can obtain better results. Common feature engineering includes dimension-increasing feature construction methods, such as normalization and computation between features, and dimension-reducing methods, such as principal component analysis, autoencoders and feature selection.
However, existing feature construction methods generally require model training based on external data or domain-specific knowledge, so their range of application is narrow; they are also complex, because feature construction requires large-scale computation.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide an automatic feature construction method and device for structured data, so as to overcome the prior-art problems that model training must rely on external data or domain-specific knowledge, that the range of application is narrow, and that feature construction is complex.
To solve one or more of these technical problems, the invention adopts the following technical solution:
In one aspect, an automatic feature construction method for structured data is provided, which includes the following steps:
S1: performing data processing on original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;
S2: initializing a pre-constructed construction tree in combination with the processed data;
S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes;
S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
Further, if the original data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain encoded data columns.
Further, the preprocessing further includes:
after the categorical data is one-hot encoded, computing statistics separately on the encoded data columns and on the other original data besides the categorical data to obtain statistical information;
dividing the encoded data columns into at least one one-hot column and/or numerical column according to the values in the statistical information, and labeling the one-hot columns and/or numerical columns.
Further, the step S2 specifically includes:
inputting the processed data into the pre-constructed construction tree;
defining a feature building operation of the building tree;
configuring external parameters of the building tree, wherein the external parameters at least comprise: node exploration times, pruning limit parameters and filtering threshold parameters.
Further, if the construction tree uses a random strategy, step S2 further includes:
configuring random weights for the feature construction operations.
Further, the step S3 specifically includes:
starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of a node of the construction tree according to the pruning limit parameter, and, if the node cannot be split any further, stopping subsequent operations on the node and taking it as a leaf node;
traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying feature construction operations to the original columns to produce generated columns, wherein the original columns comprise the processed data;
traversing the generated columns, obtaining the optimal split point of each generated column, and recording the feature generation modes that satisfy the filtering threshold;
and, if the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the optimal split point to generate the original columns of the child nodes, recording the feature generation modes of the node, and recursing on the child nodes.
Further, the data processing further comprises data resampling, and the method comprises:
preprocessing the original data to obtain preprocessed data;
performing data resampling on the preprocessed data to obtain a plurality of sampling data;
carrying out initialization operation on a pre-constructed construction tree by combining the sampling data;
combining the sampling data, performing generation operation on the initialized construction tree to obtain a feature generation mode;
and performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.
In another aspect, an apparatus for automatic feature construction of structured data is provided, the apparatus comprising:
the data processing module is used for carrying out data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;
the initialization module is used for carrying out initialization operation on a pre-constructed construction tree in combination with the processed data;
the generating operation module is used for combining the processed data to perform generating operation on the initialized construction tree to acquire a feature generating mode;
and the feature extraction module is used for extracting features of the preprocessed data by using the feature generation mode to obtain a feature generation result.
Further, if the original data is categorical data, the data processing module includes:
an encoding unit, configured to perform one-hot encoding on the categorical data to obtain encoded data columns.
Further, the data processing module further includes:
a statistics unit, configured to, after the categorical data is one-hot encoded, compute statistics separately on the encoded data columns and on the other original data besides the categorical data to obtain statistical information;
a dividing unit, configured to divide the encoded data columns into at least one one-hot column and/or numerical column according to the values in the statistical information, and to label the one-hot columns and/or numerical columns.
Further, the initialization module includes:
the data input unit is used for inputting the preprocessed data into the pre-constructed construction tree;
an operation definition unit, configured to define the feature construction operations of the construction tree;
a parameter configuration unit, configured to configure external parameters of the construction tree, where the external parameters at least include: node exploration times, pruning limit parameters and filtering threshold parameters.
Further, if the construction tree uses a random strategy, the initialization module further includes:
a weight configuration unit, configured to configure random weights for the feature construction operations.
Further, the generating operation module is specifically configured to:
starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of a node of the construction tree according to the pruning limit parameter, and, if the node cannot be split any further, stopping subsequent operations on the node and taking it as a leaf node;
traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying feature construction operations to the original columns to produce generated columns, wherein the original columns comprise the preprocessed data, including the encoded data columns;
traversing the generated columns, obtaining the optimal split point of each generated column, and recording the feature generation modes that satisfy the filtering threshold;
and, if the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the optimal split point to generate the original columns of the child nodes, recording the feature generation modes of the node, and recursing on the child nodes.
Further, the data processing module further includes:
the sampling unit is used for resampling the preprocessed data to obtain a plurality of sampling data;
the initialization module is also used for carrying out initialization operation on a pre-constructed construction tree in combination with the sampling data;
and the generating operation module is also used for combining the sampling data to perform generating operation on the initialized construction tree to acquire a feature generating mode.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. the automatic feature construction method and device for structured data provided by the embodiments of the invention require neither training on external data nor domain-specific knowledge, and therefore have a wide range of application;
2. the automatic feature construction method and device for structured data provided by the embodiments of the invention have low complexity and can run at large scale, so as to build a very large search space for automatic features;
3. the automatic feature construction method and device for structured data provided by the embodiments of the invention use a resampling strategy: the preprocessed data is resampled to obtain multiple sampled data sets, and the construction tree generation operation is performed on each of them to obtain the corresponding feature generation modes, which improves the stability of the construction result and reduces the influence of noise.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram illustrating a method for automatic feature construction of structured data, according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a feature generation manner obtained by performing a generation operation on an initialized construction tree in combination with processed data according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the structure of an apparatus for automated feature construction of structured data according to an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flow diagram illustrating a method for automatic feature construction of structured data according to an exemplary embodiment. Referring to FIG. 1, the method includes the following steps:
S1: performing data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing.
Specifically, in the embodiment of the present invention, the original data is structured data. Structured data, also called row data, is data that is logically expressed and implemented as a two-dimensional table, strictly follows data format and length specifications, and is stored and managed in a relational database. The columns of structured data fall into two types: numerical (such as a person's age) and categorical (such as a person's gender).
Data processing operations such as preprocessing are performed on the original data, and the preprocessing at least includes missing value processing. Missing values arise when data in the raw data is clustered, grouped, censored or truncated because of a lack of information; that is, the values of one or more attributes in the existing data set are incomplete. In the embodiment of the present invention, missing values may be handled either by deleting the affected data or by filling them in a specified manner.
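The two missing value treatments mentioned above (deletion and filling) can be sketched as follows. This is only an illustration under assumptions: rows are represented as dictionaries, `None` marks a missing value, the mean is used as the default fill value, and the function names `drop_missing` and `fill_missing` are hypothetical and do not come from the patent.

```python
from statistics import mean

def drop_missing(rows):
    """Delete every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

def fill_missing(rows, column, fill_value=None):
    """Fill missing values in one numerical column (default: the column mean)."""
    if fill_value is None:
        observed = [row[column] for row in rows if row[column] is not None]
        fill_value = mean(observed)
    # Build new row dictionaries so that the input rows stay unchanged.
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

rows = [
    {"age": 30, "gender": "male"},
    {"age": None, "gender": "female"},
    {"age": 50, "gender": "female"},
]
dropped = drop_missing(rows)        # keeps the two complete rows
filled = fill_missing(rows, "age")  # replaces the missing age with the mean
```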
S2: initializing the pre-constructed construction tree in combination with the processed data.
Specifically, in the embodiment of the present invention, the automatic feature construction method for structured data is based on a construction tree. The construction tree is a kind of decision tree, but it differs from an ordinary decision tree in its splitting process: besides the preprocessed data, its input includes other information related to feature generation, and its final output also includes the automatic feature generation result.
S3: performing a generation operation on the initialized construction tree in combination with the processed data to obtain feature generation modes.
Specifically, after initialization is complete, the generation operation is performed on the construction tree; once all of its nodes have been split and processed, the complete construction tree and all recorded feature generation modes are obtained.
S4: performing feature extraction on the preprocessed data using the feature generation modes to obtain a feature generation result.
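Step S4 can be pictured as replaying recorded feature generation modes on the preprocessed columns. The sketch below assumes, purely for illustration, that a feature generation mode is stored as a tuple `(operation name, column a, column b)` and that the result keeps the input columns alongside the generated ones; the patent does not prescribe this representation, and `OPS` and `extract_features` are hypothetical names.

```python
# Hypothetical registry of element-wise construction operations.
OPS = {
    "sum": lambda x, y: x + y,
    "difference": lambda x, y: x - y,
}

def extract_features(columns, recorded_modes):
    """Apply each recorded (op_name, col_a, col_b) mode to the given columns."""
    features = dict(columns)  # assumption: keep the input columns in the result
    for op_name, a, b in recorded_modes:
        op = OPS[op_name]
        features[f"{op_name}({a},{b})"] = [
            op(x, y) for x, y in zip(columns[a], columns[b])
        ]
    return features

preprocessed = {"a": [3, 5], "b": [1, 2]}
features = extract_features(preprocessed, [("difference", "a", "b")])
```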
As a preferred implementation, in an embodiment of the present invention, if the original data is categorical data, the preprocessing further includes:
performing one-hot encoding on the categorical data to obtain encoded data columns.
Specifically, categorical data is usually converted into numerical values before being fed into a model; for example, gender [male, female] is converted into [0, 1]. However, models usually treat such inputs as continuous values by default, so using [0, 1] directly would harm the effect of the construction tree. One-hot encoding uses an N-bit status register to encode N states: each state has its own register bit, and only one bit is active at any time. For example, the natural codes 0 and 1 are one-hot encoded as 10 and 01.
One-hot encoding the categorical data allows the construction tree of the invention to handle non-continuous numerical features and also expands the features of the data to some extent. For example, gender is a single feature, and after one-hot encoding it becomes two features, male and female.
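As a minimal sketch of the one-hot encoding described above, the following function turns one categorical column into one 0/1 column per category. The function name and the `column_category` naming scheme for the new columns are assumptions made for this example.

```python
def one_hot_encode(values, column_name):
    """Map a categorical column to one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{column_name}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

gender = ["male", "female", "female", "male"]
encoded = one_hot_encode(gender, "gender")
# The single "gender" feature becomes two features:
# "gender_female" and "gender_male".
```

In every row exactly one of the generated columns is 1, which matches the "only one bit is active at any time" property of one-hot codes.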
As a preferred implementation, in the embodiment of the present invention, the preprocessing further includes:
after the categorical data is one-hot encoded, computing statistics separately on the encoded data columns and on the other original data besides the categorical data to obtain statistical information.
Specifically, in the embodiment of the present invention, after the categorical data is one-hot encoded, statistics must also be computed on the data (including the encoded data columns and the other original data besides the categorical data). The statistical information includes not only the values of the data columns but also their maximum, minimum, mean, variance and so on, which can be used when generating the construction tree.
The encoded data columns are divided into at least one one-hot column and/or numerical column according to the values in the statistical information, and the one-hot columns and/or numerical columns are labeled.
Specifically, the encoded data columns are distinguished by their values (i.e., the values in the statistical information): columns whose feature values are only 0 and 1 are recorded as one-hot columns, the rest are recorded as numerical columns, and both kinds are labeled. Note that the data columns are divided and labeled so that different feature generation operations can be applied to them.
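The statistics and labeling step can be sketched as below. The choice of statistics follows the text (values, maximum, minimum, mean, variance); the dictionary layout, the label strings `"one_hot"` and `"numeric"`, and the function names are assumptions for illustration. Population variance is used here, although the text does not say which variance is meant.

```python
from statistics import mean, pvariance

def column_statistics(values):
    """Collect the distinct values plus basic summary statistics of a column."""
    return {
        "values": sorted(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }

def label_columns(columns):
    """Label a column 'one_hot' if it only holds 0 and 1, else 'numeric'."""
    stats = {name: column_statistics(vals) for name, vals in columns.items()}
    labels = {
        name: "one_hot" if set(s["values"]) <= {0, 1} else "numeric"
        for name, s in stats.items()
    }
    return stats, labels

columns = {
    "age": [30, 40, 50],
    "gender_male": [1, 0, 0],
    "gender_female": [0, 1, 1],
}
stats, labels = label_columns(columns)
```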
As a preferred implementation manner, in an embodiment of the present invention, the step S2 specifically includes:
inputting the processed data into the pre-constructed building tree.
Specifically, note that the processed data includes the encoded data columns (i.e., one-hot columns and numerical columns). In the embodiment of the present invention, in addition to the processed data, the labels of the one-hot columns and numerical columns and the statistical information of all encoded data columns must also be input into the construction tree.
Defining the feature construction operations of the construction tree.
Specifically, the construction tree contains multiple feature construction operations, and because these operations are designed for the types of the data features, they are highly interpretable. For example, construction operations on numerical columns may be defined to include sum, difference, product, discretization and so on; construction operations on one-hot columns may be defined to include logical AND, logical OR, logical NOT, sum and so on; and construction operations on a numerical column together with a one-hot column may be defined to include gate operations and so on.
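One way to organize construction operations by column type is a small registry per type, as sketched below. Only a few of the operations named in the text are shown. The registry names and the helper `pairwise` are hypothetical, and the behaviour of the "gate" operation (pass the numerical value through where the one-hot flag is 1, otherwise output 0) is an assumed reading, since the text does not define it.

```python
def pairwise(op):
    """Lift a two-argument scalar function to two equal-length columns."""
    return lambda a, b: [op(x, y) for x, y in zip(a, b)]

# Operations on two numerical columns.
NUMERIC_OPS = {
    "sum": pairwise(lambda x, y: x + y),
    "difference": pairwise(lambda x, y: x - y),
}
# Operations on two one-hot (0/1) columns.
ONE_HOT_OPS = {
    "and": pairwise(lambda x, y: x & y),
    "or": pairwise(lambda x, y: x | y),
}
# Operations on one numerical column and one one-hot column.
# Assumed meaning of "gate": keep the value where the flag is 1, else 0.
MIXED_OPS = {
    "gate": pairwise(lambda x, flag: x if flag == 1 else 0),
}

age = [30, 40, 50]
income = [10, 20, 30]
is_male = [1, 0, 1]
is_smoker = [1, 1, 0]

generated = {
    "sum(age,income)": NUMERIC_OPS["sum"](age, income),
    "and(is_male,is_smoker)": ONE_HOT_OPS["and"](is_male, is_smoker),
    "gate(age,is_male)": MIXED_OPS["gate"](age, is_male),
}
```

Keeping the registries separate makes it straightforward to pick only operations that are valid for the labels assigned to the selected columns.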
Configuring external parameters of the building tree, wherein the external parameters at least comprise: node exploration times, pruning limit parameters and filtering threshold parameters.
As a preferred implementation, in an embodiment of the present invention, if the construction tree uses a random strategy, step S2 further includes:
configuring random weights for the feature construction operations.
Specifically, if the construction tree uses a random strategy, the random weights of the various construction operations must be initialized; these weights can be set according to the statistical information of the data set. If a traversal strategy is used, no weights need to be initialized.
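The difference between the random and traversal strategies can be sketched as follows. The parameter names in `EXTERNAL_PARAMS` and the function `choose_operations` are hypothetical, and the default of equal weights is an assumption; the text only says that weights may be derived from data set statistics.

```python
import random

# Hypothetical names for the three external parameters listed in the text.
EXTERNAL_PARAMS = {
    "n_explorations": 5,       # node exploration times
    "min_sample_ratio": 0.05,  # one possible pruning limit parameter
    "filter_ratio": 1.2,       # filtering threshold parameter
}

def choose_operations(op_names, n_explorations, strategy="random",
                      weights=None, rng=None):
    """Pick the construction operations to try at one node."""
    if strategy == "traversal":
        # Traversal needs no weights: every operation is tried once.
        return list(op_names)
    rng = rng or random.Random()
    # Assumption: operations without a configured weight get weight 1.0.
    w = [(weights or {}).get(name, 1.0) for name in op_names]
    return rng.choices(list(op_names), weights=w, k=n_explorations)

picks = choose_operations(
    ["sum", "and"],
    EXTERNAL_PARAMS["n_explorations"],
    weights={"sum": 3.0, "and": 1.0},
    rng=random.Random(0),
)
```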
Fig. 2 is a schematic diagram illustrating a feature generation manner obtained by performing a generation operation on an initialized construction tree in combination with processed data according to an exemplary embodiment, and referring to fig. 2, as a preferred implementation manner, in an embodiment of the present invention, the step S3 specifically includes:
starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of a node of the construction tree according to the pruning limit parameter, and, if the node cannot be split any further, stopping subsequent operations on the node and taking it as a leaf node.
Specifically, the splitting condition of each node of the construction tree is checked according to the pruning limit parameter. The pruning limit parameter may be a minimum proportion of samples at the node, a minimum number of samples in a child node after splitting, or another condition. If the node cannot be split any further, subsequent operations on it stop and the node is taken as a leaf node.
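A pruning check along these lines might look as follows. The two conditions correspond to the examples given in the text; the parameter names, the default values, and the reading that a node must hold at least twice the minimum child size to be splittable are assumptions for this sketch.

```python
def can_split(n_node_samples, n_total_samples,
              min_sample_ratio=0.05, min_child_samples=5):
    """Return True if a node may still be split under the pruning limits."""
    # Limit 1: the node must hold a minimum share of all samples.
    if n_node_samples / n_total_samples < min_sample_ratio:
        return False
    # Limit 2: a binary split must be able to leave min_child_samples
    # on each side, so the node needs at least twice that many samples.
    if n_node_samples < 2 * min_child_samples:
        return False
    return True
```

A node for which `can_split` returns `False` would be treated as a leaf and skipped by the remaining steps.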
Traversing the original columns, obtaining a filtering threshold in combination with the filtering threshold parameter, and applying feature construction operations to the original columns to produce generated columns, wherein the original columns comprise the processed data.
Specifically, the original columns are traversed to find the optimal split feature and position satisfying the splitting condition, i.e., the split that maximizes the information gain of the child nodes; this information gain is recorded as I. The filtering threshold is calculated from the filtering threshold parameter and the information gain I; for example, the filtering threshold may be 1.2 × I, and other quantities, such as the number of samples at the current node, may also be taken into account.
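The information gain I and the derived filtering threshold can be sketched as below. The text does not fix the impurity measure, so Shannon entropy is assumed here, and the split is assumed to be a binary split of the form `value <= point`. The example uses a single original column, so I is simply that column's best gain; with several original columns, I would be the maximum over them.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (information gain, split point) of the best split value <= point."""
    base, n = entropy(labels), len(labels)
    best = (0.0, None)
    # The largest value is excluded so that both sides are never empty.
    for point in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= point]
        right = [y for x, y in zip(values, labels) if x > point]
        gain = (base
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, point)
    return best

age = [20, 30, 40, 50]
labels = [0, 1, 0, 1]
I, split_at = best_split(age, labels)  # best gain over the original column
filter_ratio = 1.2                     # filtering threshold parameter
filter_threshold = filter_ratio * I    # e.g. 1.2 x I, as in the text
```

A generated column is then only worth recording if its own best gain exceeds `filter_threshold`, that is, if it separates the classes clearly better than any original column does.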
Feature generation is then performed as many times as the node exploration count: a construction operation is selected, the feature columns it requires are selected, and the operation is applied to those columns to obtain a new column (i.e., a generated column). The selection may be random, according to the weights of the construction operations, or by traversal; with traversal, this step can be performed once at the root node. The filtering threshold and the generated columns need not be obtained in the order given above, as long as both are available before the generated columns are traversed. Feature generation operations may use the statistical information of the encoded data columns; for example, binarization and discretization of a numerical column use its mean, maximum, minimum and so on.
Traversing the generated columns, obtaining the optimal split point of each generated column, and recording the feature generation modes that satisfy the filtering threshold.
Specifically, each generated column is traversed to find the optimal split feature and position satisfying the splitting condition (i.e., the optimal split point of the generated column), namely the split that maximizes the information gain of the child nodes, which yields the information gain of each generated column. If that information gain is greater than the filtering threshold, the construction operation and the feature columns of the generated column are recorded (i.e., the feature generation mode satisfying the filtering threshold is recorded). Alternatively, only the single result that has the highest information gain satisfying the splitting condition and that satisfies the filtering threshold may be recorded.
If the node satisfies the condition for continued splitting, splitting the original columns and/or the generated columns at the optimal split point to generate the original columns of the child nodes, recording the feature generation modes of the node, and recursing on the child nodes.
Specifically, if the node satisfies the condition for continued splitting, its samples (here, the data columns) are split according to the split result that satisfies the splitting condition and has the best information gain (i.e., the optimal split point), generating the original columns of the child nodes. If no information gain satisfying the splitting condition can be found, no split is performed. The split may be on either an original column or a generated column. The feature generation modes of the node are recorded, and the process recurses on the child nodes until no node satisfies the splitting condition. After all nodes have been split and processed, the complete construction tree and all recorded feature generation modes are obtained. Finally, the recorded feature generation modes are applied to the preprocessed data to obtain the automatic feature generation result.
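The whole generation operation can be put together in a compact recursive sketch. It is a simplified illustration, not the patented implementation, and it makes several assumptions: entropy-based information gain, binary splits of the form `value <= point`, a traversal strategy over all ordered column pairs, a pruning rule based only on node size and purity, and child nodes that inherit only the original columns. All names (`build`, `recorded`, the `(op, a, b)` tuples) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Best binary split value <= point of a column: (gain, point)."""
    base, n, best = entropy(labels), len(labels), (0.0, None)
    for point in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= point]
        right = [y for x, y in zip(values, labels) if x > point]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best[0]:
            best = (gain, point)
    return best

def build(columns, labels, ops, recorded, min_samples=4, filter_ratio=1.2):
    """Split one node recursively and record (op, col_a, col_b) modes."""
    # Pruning check: small or pure nodes become leaves.
    if len(labels) < min_samples or len(set(labels)) < 2:
        return
    # Best gain over the original columns gives I and the filter threshold.
    candidates = {name: (best_split(vals, labels), vals)
                  for name, vals in columns.items()}
    threshold = filter_ratio * max(split[0] for split, _ in candidates.values())
    # Traversal strategy: every operation on every ordered pair of columns.
    names = list(columns)
    for op_name, op in ops.items():
        for a in names:
            for b in names:
                if a == b:
                    continue
                generated = [op(x, y) for x, y in zip(columns[a], columns[b])]
                gain, point = best_split(generated, labels)
                if point is not None and gain > threshold:
                    recorded.add((op_name, a, b))
                candidates[f"{op_name}({a},{b})"] = ((gain, point), generated)
    # Split on the best column, whether original or generated.
    (gain, point), values = max(candidates.values(), key=lambda c: c[0][0])
    if point is None:
        return
    left = [i for i, x in enumerate(values) if x <= point]
    right = [i for i, x in enumerate(values) if x > point]
    for idx in (left, right):
        child = {name: [col[i] for i in idx] for name, col in columns.items()}
        build(child, [labels[i] for i in idx], ops, recorded,
              min_samples, filter_ratio)

columns = {"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": [1, 3, 2, 5, 4, 7, 6, 9]}
labels = [1 if x > y else 0 for x, y in zip(columns["a"], columns["b"])]
ops = {"sum": lambda x, y: x + y, "difference": lambda x, y: x - y}
recorded = set()
build(columns, labels, ops, recorded)
```

In this toy data the label depends on whether `a` exceeds `b`, so neither original column splits well on its own, while the generated difference column separates the classes completely and is the only kind of mode that passes the filter.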
As a preferred implementation, in an embodiment of the present invention, the data processing further includes data resampling, and the method includes:
preprocessing the original data to obtain preprocessed data;
performing data resampling on the preprocessed data to obtain a plurality of sampling data;
carrying out initialization operation on a pre-constructed construction tree by combining the sampling data;
combining the sampling data, performing generation operation on the initialized construction tree to obtain a feature generation mode;
and performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.
Specifically, to improve the stability of the construction result and reduce the influence of noise, an ensemble method may additionally be used in the embodiment of the present invention.
The ensemble method is as follows: the preprocessed data is resampled using a resampling strategy, such as bagging or bootstrapping, to obtain multiple sampled data sets. For each sampled data set, the pre-constructed construction tree is initialized and the construction tree generation operation is performed, yielding the corresponding feature generation modes.
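Bootstrap resampling, one of the strategies named above, can be sketched with the standard library alone. The function name and the choice of drawing each sample at the same size as the input are assumptions for this example.

```python
import random

def bootstrap_samples(rows, n_samples, rng):
    """Draw n_samples data sets of len(rows) rows each, with replacement."""
    return [rng.choices(rows, k=len(rows)) for _ in range(n_samples)]

rows = list(range(10))  # stand-in for the rows of the preprocessed data
samples = bootstrap_samples(rows, n_samples=3, rng=random.Random(42))
# A separate construction tree would then be initialized and generated
# on each element of `samples`.
```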
In addition, statistics may be computed over all of the feature generation modes obtained, and a final voting result is obtained according to a screening condition. The screening condition may be, but is not limited to, threshold filtering: count how many times each feature generation mode recurs, set a threshold, and select the feature generation modes whose number of occurrences exceeds the threshold.
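The threshold-based vote can be sketched with a counter over the modes recorded by each resampled tree. The tuple representation of a feature generation mode and the function name `vote` are assumptions; each mode is counted at most once per tree here, which is one possible reading of "number of repeated occurrences".

```python
from collections import Counter

def vote(modes_per_tree, threshold):
    """Keep the feature generation modes that occur in more than threshold trees."""
    counts = Counter(mode for modes in modes_per_tree for mode in set(modes))
    return {mode for mode, count in counts.items() if count > threshold}

modes_per_tree = [
    {("difference", "a", "b"), ("sum", "a", "b")},
    {("difference", "a", "b")},
    {("difference", "a", "b"), ("and", "m", "s")},
]
selected = vote(modes_per_tree, threshold=1)
```

Modes that show up in only one resampled tree are likely to reflect noise in that particular sample, which is why requiring repeated occurrences stabilizes the result.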
And finally, applying all finally obtained feature generation modes to the preprocessed data to obtain an automatic feature generation result.
Fig. 3 is a schematic structural diagram illustrating an apparatus for automatic feature construction of structured data according to an exemplary embodiment, and referring to fig. 3, the apparatus includes:
the data processing module is used for carrying out data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;
the initialization module is used for carrying out initialization operation on a pre-constructed construction tree in combination with the processed data;
the generating operation module is used for combining the processed data to perform generating operation on the initialized construction tree to acquire a feature generating mode;
and the feature extraction module is used for extracting features of the preprocessed data by using the feature generation mode to obtain a feature generation result.
As a preferred implementation manner, in an embodiment of the present invention, if the original data is categorical data, the data processing module includes:
a coding unit, used for performing one-hot coding on the categorical data to obtain a coded data column.
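As an illustration of what the coding unit does, the following sketch one-hot encodes a single categorical column in plain Python. The function name and the `<prefix>=<category>` column naming are assumptions made for the example; the text does not prescribe them.

```python
from typing import Dict, List


def one_hot_encode(values: List[str], prefix: str) -> Dict[str, List[int]]:
    """Return one 0/1 column per distinct category, keyed '<prefix>=<category>'."""
    categories = sorted(set(values))
    return {
        f"{prefix}={category}": [1 if value == category else 0 for value in values]
        for category in categories
    }
```

For example, `one_hot_encode(["red", "blue", "red"], "color")` yields the two columns `color=blue` and `color=red`.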
As a preferred implementation manner, in an embodiment of the present invention, the data processing module further includes:
a statistical unit, used for collecting statistics separately on the coded data column and on the original data other than the categorical data, after the categorical data has been one-hot coded, to obtain statistical information;
and a dividing unit, used for dividing the coded data column into at least one one-hot column and/or numerical column according to the numerical values in the statistical information, and labeling the one-hot column and/or numerical column.
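One possible reading of the statistical and dividing units is sketched below. The text does not state which statistics are collected or how the division is decided, so the rule used here (a column whose distinct values are a subset of {0, 1} is labeled one-hot, anything else numerical) is an assumption, as are all names. Columns are assumed to be non-empty.

```python
from typing import Dict, List


def column_statistics(column: List[float]) -> Dict[str, object]:
    """Collect simple per-column statistics (assumed set: min, max, distinct values)."""
    return {"min": min(column), "max": max(column), "distinct": frozenset(column)}


def label_columns(columns: Dict[str, List[float]]) -> Dict[str, str]:
    """Label each column 'one_hot' or 'numeric' from its statistics."""
    labels: Dict[str, str] = {}
    for name, values in columns.items():
        stats = column_statistics(values)
        # Assumed rule: only the values 0 and 1 occur in a one-hot column.
        is_one_hot = stats["distinct"] <= {0, 1}
        labels[name] = "one_hot" if is_one_hot else "numeric"
    return labels
```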
As a preferred implementation manner, in an embodiment of the present invention, the initialization module includes:
the data input unit is used for inputting the preprocessed data into the pre-constructed construction tree;
an operation definition unit for defining the characteristic construction operation of the construction tree;
a parameter configuration unit, configured to configure external parameters of the building tree, where the external parameters at least include: node exploration times, pruning limit parameters and filtering threshold parameters.
As a preferred implementation manner, in an embodiment of the present invention, if the construction tree uses a random strategy, the initialization module further includes:
and the weight configuration unit is used for configuring the random weight of the feature construction operation.
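The external parameters and random weights configured by the initialization module could be grouped as in the following sketch. The class and field names are hypothetical, and interpreting the pruning limit parameters as a depth cap plus a minimum row count is an assumption; the text only names the parameter categories.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ConstructionTreeConfig:
    n_explorations: int = 10         # node exploration times
    max_depth: int = 3               # pruning limit (assumed: maximum tree depth)
    min_rows: int = 20               # pruning limit (assumed: minimum rows to split)
    filter_threshold: float = 0.01   # filtering threshold parameter
    # Feature construction operations, keyed by name.
    operations: Dict[str, Callable[[float, float], float]] = field(default_factory=dict)
    # Random weights, used only under the random strategy; missing names default to 1.0.
    operation_weights: Dict[str, float] = field(default_factory=dict)

    def sample_operation(self, rng: random.Random) -> str:
        """Pick one operation name at random, proportionally to its weight."""
        names = list(self.operations)
        weights = [self.operation_weights.get(name, 1.0) for name in names]
        return rng.choices(names, weights=weights, k=1)[0]
```

Under a non-random (exhaustive) strategy the weights would simply be ignored and every operation tried in turn.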
As a preferred implementation manner, in an embodiment of the present invention, the generating operation module is specifically configured to:
starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of the nodes of the constructed tree according to the pruning limiting parameters, if the nodes can not be split again, stopping the subsequent operation of the nodes, and taking the nodes as leaf nodes;
traversing an original column, acquiring a filtering threshold value in combination with the filtering threshold value parameter, and applying a feature construction operation to the original column to generate a generated column, wherein the original column comprises the preprocessed data, including the coded data column;
traversing the generated column, acquiring the optimal splitting point of the generated column, and recording the feature generation mode meeting the filtering threshold;
and if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording the characteristic generation mode of the node, and performing recursion on the child node.
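The recursive generation operation above can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: operations are binary and tried exhaustively on every pair of original columns (no random strategy or exploration budget), the split quality is the reduction in variance of a hypothetical `target` column, and only original columns are passed down to child nodes.

```python
from itertools import combinations
from statistics import pvariance
from typing import Callable, Dict, List, Tuple

Columns = Dict[str, List[float]]
OPERATIONS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def best_split(values: List[float], target: List[float]) -> Tuple[float, float]:
    """Return (gain, split_point) with the largest drop in target variance."""
    base = pvariance(target)
    best = (0.0, values[0])
    for point in sorted(set(values))[:-1]:  # excluding the maximum keeps both sides non-empty
        left = [t for v, t in zip(values, target) if v <= point]
        right = [t for v, t in zip(values, target) if v > point]
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(target)
        best = max(best, (base - weighted, point))
    return best


def generate(columns: Columns, target: List[float], depth: int, max_depth: int,
             min_rows: int, filter_threshold: float, recorded: List[str]) -> None:
    # Pruning check: a node that cannot be split again becomes a leaf.
    if depth >= max_depth or len(target) < min_rows:
        return
    candidates = []
    # Traverse the original columns and build generated columns from them.
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        for op_name, op in OPERATIONS.items():
            values = [op(x, y) for x, y in zip(col_a, col_b)]
            gain, point = best_split(values, target)
            if gain > filter_threshold:  # record only modes passing the filter
                mode = f"{op_name}({name_a},{name_b})"
                if mode not in recorded:
                    recorded.append(mode)
                candidates.append((gain, point, values))
    if not candidates:
        return
    _, point, values = max(candidates, key=lambda c: c[0])
    # Split the rows at the optimal splitting point and recurse into both children.
    for keep in (lambda v: v <= point, lambda v: v > point):
        idx = [i for i, v in enumerate(values) if keep(v)]
        child = {name: [col[i] for i in idx] for name, col in columns.items()}
        generate(child, [target[i] for i in idx], depth + 1, max_depth,
                 min_rows, filter_threshold, recorded)
```

After the call returns, `recorded` holds the feature generation modes found across all nodes, in the order they were first recorded.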
As a preferred implementation manner, in an embodiment of the present invention, the data processing module further includes:
the sampling unit is used for resampling the preprocessed data to obtain a plurality of sampling data;
the initialization module is also used for carrying out initialization operation on a pre-constructed construction tree in combination with the sampling data;
and the generating operation module is also used for combining the sampling data to perform generating operation on the initialized construction tree to acquire a feature generating mode.
In summary, the technical solution provided by the embodiment of the present invention has the following beneficial effects:
1. the method and the device for automatic feature construction of structured data provided by the embodiment of the invention do not rely on training with external data or on domain-specific knowledge, and therefore have a wide range of application;
2. the method and the device for automatic feature construction of structured data provided by the embodiment of the invention have low complexity and can be run at large scale, so that a huge search space can be constructed for the automatic features;
3. according to the method and the device for automatic feature construction of structured data, a resampling strategy is used to resample the preprocessed data to obtain a plurality of sampled data sets, and the construction tree generation operation is performed on each sampled data set to obtain its corresponding feature generation modes, which improves the stability of the construction result and reduces the influence of noise.
It should be noted that the automatic feature construction apparatus for structured data provided in the foregoing embodiment is illustrated only with the above division of functional modules when triggering an automatic feature construction service. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the automatic feature construction apparatus for structured data and the automatic feature construction method for structured data provided in the above embodiments belong to the same concept; the specific implementation process of the apparatus is described in the method embodiments and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (12)

1. A method for automatic feature construction of structured data, the method comprising the steps of:
S1: performing data processing on original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;
S2: performing initialization operation on a pre-constructed construction tree by combining the processed data;
S3: combining the processed data, performing generation operation on the initialized construction tree to obtain a feature generation mode, comprising:
starting from the root node of the building tree, recursively performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of the nodes of the constructed tree according to the pruning limiting parameters, if the nodes can not be split again, stopping the subsequent operation of the nodes, and taking the nodes as leaf nodes;
traversing an original column, acquiring a filtering threshold value by combining with a filtering threshold value parameter, and generating a generated column for the original column by using a feature construction operation, wherein the original column comprises the processed data;
traversing the generated column, acquiring the optimal splitting point of the generated column, and recording the feature generation mode meeting the filtering threshold;
if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording a characteristic generation mode of the node, and performing recursion on the child node;
S4: performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.
2. The method of claim 1, wherein if the raw data is categorical data, the preprocessing further comprises:
performing one-hot coding on the categorical data to obtain a coded data column.
3. The method of automatic feature construction of structured data according to claim 2, wherein said preprocessing further comprises: after the categorical data is subjected to one-hot coding, collecting statistics on the coded data column and on the other original data except the categorical data respectively to obtain statistical information;
and dividing the coded data column into at least one one-hot column and/or numerical column according to numerical values in the statistical information, and labeling the one-hot column and/or numerical column.
4. The method for automatically constructing features of structured data according to any one of claims 1 to 3, wherein the step S2 specifically comprises:
inputting the processed data into the pre-constructed construction tree;
defining a feature building operation of the building tree;
configuring external parameters of the building tree, wherein the external parameters at least comprise: node exploration times, pruning limit parameters and filtering threshold parameters.
5. The method according to claim 4, wherein if the building tree is a random strategy, the step S2 further comprises:
configuring random weights for the feature construction operations.
6. A method for automatic feature construction of structured data according to any of claims 1 to 3, wherein the data processing further comprises data resampling, the method comprising:
preprocessing the original data to obtain preprocessed data;
performing data resampling on the preprocessed data to obtain a plurality of sampling data;
carrying out initialization operation on a pre-constructed construction tree by combining the sampling data;
combining the sampling data, performing generation operation on the initialized construction tree to obtain a feature generation mode;
and performing feature extraction on the preprocessed data by using the feature generation mode to obtain a feature generation result.
7. An apparatus for automatic feature construction of structured data, the apparatus comprising:
the data processing module is used for carrying out data processing on the original data, wherein the data processing at least comprises preprocessing, and the preprocessing at least comprises missing value processing;
the initialization module is used for carrying out initialization operation on a pre-constructed construction tree in combination with the processed data;
the generating operation module is used for combining the processed data to perform generating operation on the initialized construction tree to acquire a feature generating mode;
the generation operation module is specifically configured to:
starting from the root node of the building tree, performing the following operations on the nodes of the building tree in a traversal mode:
checking the splitting condition of the nodes of the constructed tree according to the pruning limiting parameters, if the nodes can not be split again, stopping the subsequent operation of the nodes, and taking the nodes as leaf nodes;
traversing an original column, acquiring a filtering threshold value in combination with a filtering threshold value parameter, and applying a feature construction operation to the original column to generate a generated column, wherein the original column comprises preprocessed data, including a coded data column;
traversing the generated column, acquiring the optimal splitting point of the generated column, and recording the feature generation mode meeting the filtering threshold;
if the node meets the condition of continuing splitting, splitting the original column and/or the generated column by using the optimal splitting point to generate an original column of a child node, recording a characteristic generation mode of the node, and performing recursion on the child node;
and the feature extraction module is used for extracting features of the preprocessed data by using the feature generation mode to obtain a feature generation result.
8. The apparatus according to claim 7, wherein if the raw data is categorical data, the data processing module comprises:
a coding unit, used for performing one-hot coding on the categorical data to obtain a coded data column.
9. The apparatus for automated feature building of structured data according to claim 8, wherein said data processing module further comprises:
a statistical unit, used for collecting statistics separately on the coded data column and on the original data other than the categorical data, after the categorical data has been one-hot coded, to obtain statistical information;
and a dividing unit, used for dividing the coded data column into at least one one-hot column and/or numerical column according to the numerical values in the statistical information, and labeling the one-hot column and/or numerical column.
10. The apparatus for automated feature construction of structured data according to any of claims 7 to 9, wherein the initialization module comprises:
the data input unit is used for inputting the preprocessed data into the pre-constructed construction tree;
an operation definition unit for defining the characteristic construction operation of the construction tree;
a parameter configuration unit, configured to configure external parameters of the building tree, where the external parameters at least include: node exploration times, pruning limit parameters and filtering threshold parameters.
11. The apparatus for automated feature building of structured data according to claim 10, wherein if the building tree is a random strategy, the initialization module further comprises:
and the weight configuration unit is used for configuring the random weight of the feature construction operation.
12. The apparatus for automated feature construction of structured data according to any of claims 7 to 9, wherein the data processing module further comprises:
the sampling unit is used for resampling the preprocessed data to obtain a plurality of sampling data;
the initialization module is also used for carrying out initialization operation on a pre-constructed construction tree in combination with the sampling data;
and the generating operation module is also used for combining the sampling data to perform generating operation on the initialized construction tree to acquire a feature generating mode.
CN201910206424.6A 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data Active CN109993217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910206424.6A CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910206424.6A CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Publications (2)

Publication Number Publication Date
CN109993217A CN109993217A (en) 2019-07-09
CN109993217B true CN109993217B (en) 2021-06-15

Family

ID=67130699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910206424.6A Active CN109993217B (en) 2019-03-18 2019-03-18 Automatic feature construction method and device for structured data

Country Status (1)

Country Link
CN (1) CN109993217B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049587A (en) * 2011-10-13 2013-04-17 同济大学 Feature recognition method based on hierarchy and construction method of product feature semantic network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011091470A1 (en) * 2010-01-27 2011-08-04 National Ict Australia Limited Query processing of tree-structured data
US20180053107A1 (en) * 2016-08-19 2018-02-22 Sap Se Aspect-based sentiment analysis
CN107729349B (en) * 2017-08-25 2022-06-07 昆仑智汇数据科技(北京)有限公司 Method and device for automatically generating feature data set based on metadata
CN108363759A (en) * 2018-02-01 2018-08-03 厦门快商通信息技术有限公司 Subject tree generation method and system based on structural data and Intelligent dialogue method
CN108829804A (en) * 2018-06-05 2018-11-16 洛阳师范学院 Based on the high dimensional data similarity join querying method and device apart from partition tree

Also Published As

Publication number Publication date
CN109993217A (en) 2019-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240306

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240415

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China