CN116414880A - Feature engineering method and device and electronic equipment - Google Patents


Info

Publication number
CN116414880A
Authority
CN
China
Prior art keywords
data
data table
node
analyzed
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111652827.7A
Other languages
Chinese (zh)
Inventor
侯俊雄
姜伟浩
浦世亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111652827.7A
Publication of CN116414880A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention provide a feature engineering method, a feature engineering device, and electronic equipment. The method comprises: acquiring a master data table and at least one slave data table, wherein each data table represents the correspondence between data of multiple dimensions, the master data table comprises target data that uniquely identifies an object to be analyzed, and the dimensions of any data table intersect the dimensions of at least one other data table; for each slave data table, determining an association mode between the data in the slave data table and the target data according to the dimensions of the data tables; according to the association mode, determining the data associated with the target data in each data table as associated data; and constructing features from the associated data as the features of the object to be analyzed. This reduces the domain expertise that feature engineering demands of algorithm engineers, thereby improving applicability.

Description

Feature engineering method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a feature engineering method, a device, and an electronic device.
Background
A model trained through machine learning can map data to results. However, there is often no explicit mapping relationship between the raw data and the result, which makes it difficult for the model to learn that relationship directly. Therefore, features that have a more obvious mapping relationship with the result are extracted and constructed from the data (this process is referred to herein as feature engineering), so that the model can learn the mapping between the features and the result, and the mapping between the data and the result is realized through it.
In the related art, feature engineering is usually carried out by an algorithm development engineer, who relies on professional knowledge of the field in which the model is applied to obtain features with a more obvious mapping relationship with the result. This requires the algorithm engineer to have professional knowledge of the relevant field, so the applicability is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a feature engineering method, a device and electronic equipment, so as to improve the applicability of feature engineering. The specific technical scheme is as follows:
In a first aspect of the embodiments of the present invention, there is provided a feature engineering method, the method comprising:
acquiring a master data table and at least one slave data table, wherein each data table is used for representing the corresponding relation between data of multiple dimensions, the master data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table;
for each slave data table, determining an association mode between data in the slave data table and the target data according to the dimension of each data table;
according to the association mode, determining data associated with the target data in each data table as associated data;
and constructing a characteristic according to the associated data as the characteristic of the object to be analyzed.
In a possible embodiment, the determining, for each of the slave data tables, an association mode between the data in the slave data table and the target data according to the dimensions of the respective data tables includes:
generating a data relation tree according to the dimension of each data table, wherein each node in the data relation tree is used for representing one data table, the root node is used for representing the main data table, and an intersection exists between the dimension of the data table represented by any node and the dimension of the data table represented by the father node of the node;
for each node except the root node, determining the association mode between the data table represented by the node and the main data table as the association mode between the data in the data table represented by the node and the target data according to the common dimension between the data table represented by the node and the data table represented by the father node and the association mode between the data table represented by the father node and the main data table.
In a possible embodiment, the constructing a feature according to the association data as the feature of the object to be analyzed includes:
constructing features separately for the associated data in each data table, as single-table features of the object to be analyzed.
In one possible embodiment, the method further comprises:
jointly constructing features for associated data in at least two different data tables, as cross features of the target data.
In one possible embodiment, the method further comprises:
calculating the information entropy of each sub-feature in the single-table features;
and deleting, from the single-table features, the sub-features whose information entropy is lower than a preset information entropy threshold, to obtain screened single-table features.
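As a minimal sketch of this screening step, the following Python code assumes that the information entropy is the Shannon entropy of each sub-feature's empirical value distribution and that each sub-feature is stored as a column of discrete values. The embodiment fixes neither choice, and the names shannon_entropy, filter_by_entropy, and the sample columns are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_by_entropy(feature_columns, threshold):
    """Keep only the sub-features whose entropy is not lower than threshold."""
    return {
        name: column
        for name, column in feature_columns.items()
        if shannon_entropy(column) >= threshold
    }


# Hypothetical single-table features: one informative column, one constant column.
single_table_features = {
    "overdue_flag": [0, 1, 0, 1],
    "currency": ["CNY", "CNY", "CNY", "CNY"],
}
screened = filter_by_entropy(single_table_features, threshold=0.5)
print(sorted(screened))  # prints ['overdue_flag']
```

A constant column carries no information about the object to be analyzed, so its entropy is zero and it is removed. Continuous sub-features would need to be binned before an entropy of this kind is meaningful.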
In a possible embodiment, the constructing features separately for the associated data in each data table, as single-table features of the object to be analyzed, includes:
sequentially extracting features of the associated data in the data table corresponding to each node, from the root node to the child nodes in level-order traversal, until the number of extracted features is greater than a preset number threshold;
and determining the extracted features as the single-table features of the object to be analyzed.
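The level-order extraction can be sketched as a breadth-first walk that stops once the feature count exceeds the threshold. This is only an illustration: the tree, the per-table feature names, and the function extract_level_order are hypothetical, and real feature extraction is replaced by a lookup of precomputed feature names.

```python
from collections import deque


def extract_level_order(children, root, features_per_table, max_features):
    """Visit tables breadth-first from the root and stop once more than
    max_features features have been collected."""
    collected = []
    queue = deque([root])
    while queue and len(collected) <= max_features:
        node = queue.popleft()
        collected.extend(features_per_table.get(node, []))
        queue.extend(children.get(node, []))
    return collected


# Hypothetical data relationship tree: parent table -> child tables.
children = {
    "billing_day_state": ["id_mapping", "transaction_detail"],
    "id_mapping": ["customer_info", "repayment_state"],
}
# Hypothetical features that each table would contribute.
features_per_table = {
    "billing_day_state": ["billing_state_mode"],
    "id_mapping": ["card_count"],
    "transaction_detail": ["amount_sum", "amount_max"],
    "customer_info": ["age"],
    "repayment_state": ["overdue_ratio"],
}

selected = extract_level_order(children, "billing_day_state", features_per_table, 3)
print(selected)
# prints ['billing_state_mode', 'card_count', 'amount_sum', 'amount_max']
```

Because tables nearer the root are visited first, the features that survive the cut-off come from the tables most directly associated with the master data table.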
In one possible embodiment, the method further comprises:
extracting characteristics of data to be analyzed according to the characteristics of the target data, wherein the characteristics of the data to be analyzed are consistent with characteristic dimensions included in the characteristics of the target data;
and analyzing the data to be analyzed according to the characteristics of the data to be analyzed to obtain an analysis result.
In a second aspect of the embodiments of the present invention, there is provided a feature engineering apparatus, the apparatus comprising:
the data table acquisition module is used for acquiring a master data table and at least one slave data table, wherein each data table is used for representing the corresponding relation between data of a plurality of dimensions, the master data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table;
the relation construction module is used for determining the association mode between the data in the slave data table and the target data according to the dimension of each data table for each slave data table; according to the association mode, determining the data associated with the target data in each data table as associated data;
and the characteristic engineering module is used for constructing characteristics according to the associated data and taking the characteristics as the characteristics of the object to be analyzed.
In one possible embodiment, the relationship construction module determining, for each of the slave data tables, an association mode between the data in the slave data table and the target data according to the dimensions of the respective data tables includes:
generating a data relation tree according to the dimension of each data table, wherein each node in the data relation tree is used for representing one data table, the root node is used for representing the main data table, and an intersection exists between the dimension of the data table represented by any node and the dimension of the data table represented by the father node of the node;
for each node except the root node, determining the association mode between the data table represented by the node and the main data table as the association mode between the data in the data table represented by the node and the target data according to the common dimension between the data table represented by the node and the data table represented by the father node and the association mode between the data table represented by the father node and the main data table.
In a possible embodiment, the feature engineering module includes a recursive feature construction sub-module for constructing features separately for the associated data in each data table, as single-table features of the object to be analyzed.
In a possible embodiment, the feature engineering module further comprises a feature cross-deriving sub-module for jointly constructing features for associated data in at least two different data tables, as cross features of the target data.
In a possible embodiment, the feature engineering module further includes a feature fusion and selection sub-module, configured to calculate the information entropy of each sub-feature in the single-table features;
and to delete, from the single-table features, the sub-features whose information entropy is lower than a preset information entropy threshold, to obtain screened single-table features.
In a possible embodiment, the recursive feature construction sub-module is specifically configured to sequentially extract features of the associated data in the data table corresponding to each node, from the root node to the child nodes in level-order traversal, until the number of extracted features is greater than a preset number threshold;
and to determine the extracted features as the single-table features of the object to be analyzed.
In a possible embodiment, the device further comprises an analysis module, configured to extract a feature of data to be analyzed according to a feature of the target data, where the feature of the data to be analyzed is consistent with a feature dimension included in the feature of the target data;
and analyzing the data to be analyzed according to the characteristics of the data to be analyzed to obtain an analysis result.
In a third aspect of the embodiment of the present invention, there is provided an electronic device, including:
a memory for storing a computer program;
a processor for implementing the method steps of any of the above first aspects when executing a program stored on a memory.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored therein a computer program which when executed by a processor implements the method steps of any of the first aspects described above.
The embodiment of the invention has the beneficial effects that:
According to the feature engineering method, device, and electronic equipment provided by the embodiments of the invention, the data in the data tables can be associated automatically based on the dimensions shared among the data tables, so that data describing different dimensions in each data table is converted into data describing the object to be analyzed, and the features of the object to be analyzed can be constructed from that data. During feature engineering, the algorithm engineer therefore does not need to spend much effort on screening the data tables and extracting features, and only needs to focus on defining the business problem and expressing the association modes among the data tables. This lowers the professional knowledge required of the algorithm engineer and makes the method more widely applicable.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the invention, and those skilled in the art may obtain other embodiments from these drawings.
FIG. 1 is a schematic flow chart of a feature engineering method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data relationship tree according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a feature engineering device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
To describe the feature engineering method provided by the embodiment of the present invention more clearly, an application scenario of the method is described below by way of example. The example is only one possible application scenario; in other possible embodiments, the method may also be applied to other application scenarios, which this embodiment does not limit.
In financial data mining, data about the object to be analyzed is often spread across many data tables. For example, when the risk condition of a credit card account needs to be analyzed, the data related to the account may be distributed across a credit card billing day state table, a credit card transaction detail table, an ID mapping table, a customer basic information table, and a credit card repayment state table.
The credit card billing day state table records the credit card numbers under each credit card account number and the billing day state of each credit card number. The credit card transaction detail table records the consumption records of each credit card number. The ID mapping table records the credit card account numbers and credit card numbers under each user name. The customer basic information table records the information related to each user. The credit card repayment state table records the current repayment state of each credit card number.
In the related art, an algorithm engineer may consult the data tables one by one and, relying on his or her own expertise, extract from each data table the features that have an obvious mapping relationship with the risk condition of the credit card account.
However, the algorithm engineer needs professional knowledge of the financial field to know which data relates to the risk condition of a credit card account, and which features have a more obvious mapping relationship with it, before feature engineering can be carried out. In addition, the data tables have to be consulted manually, so the efficiency is low.
Based on this, the embodiment of the invention provides a feature engineering method, which can be applied to any electronic equipment with feature engineering capability, including but not limited to servers, personal computers, and mobile terminals. The feature engineering method provided by the embodiment of the invention is shown in FIG. 1 and comprises the following steps:
S101, acquiring a master data table and at least one slave data table.
Wherein each data table is used for representing the corresponding relation between the data of a plurality of dimensions, the main data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table.
S102, for each slave data table, determining an association mode between data in the slave data table and target data according to the dimension of each data table.
S103, according to the association mode, determining the data corresponding to the target data in each data table as association data.
S104, constructing characteristics according to the associated data, and taking the characteristics as characteristics of the object to be analyzed.
With this embodiment, the data in the data tables can be associated automatically based on the dimensions shared among the data tables, so that data describing different dimensions in each data table is converted into data describing the object to be analyzed, and the features of the object to be analyzed can be constructed from that data. During feature engineering, the algorithm engineer therefore does not need to spend much effort on screening the data tables and extracting features, and only needs to focus on defining the business problem and expressing the association modes among the data tables. This lowers the professional knowledge required of the algorithm engineer and makes the method more widely applicable.
In addition, the data in each data table is fully utilized in the scheme, so that richer features can be established for the object to be analyzed, feature engineering is automatically realized, and the efficiency of the feature engineering is improved. Taking the aforementioned scenario of analyzing the risk condition of the credit card account number as an example, the present solution does not require an algorithm engineer to manually sort data from each data table, so that the operation is convenient and the efficiency is high.
The steps of the foregoing S101 to S104 will be described below, respectively:
In S101, there is one master data table and one or more slave data tables. For convenience, the description below uses the case of multiple slave data tables as an example; the principle is the same when there is only one slave data table, so that case is not described separately.
The master data table includes target data that uniquely identifies an object to be analyzed. The object to be analyzed depends on the application scenario: when the risk condition of a credit card account is analyzed, the object to be analyzed is the credit card account; when the credit condition of a user is analyzed, the object to be analyzed is the user.
Assume that the data tables include the aforementioned credit card billing day state table, credit card transaction detail table, ID mapping table, customer basic information table, and credit card repayment state table. The object to be analyzed is a credit card account, and the credit card billing day state table records the credit card numbers under each credit card account number, so one dimension of the credit card billing day state table represents the credit card account; that is, the table includes target data capable of uniquely identifying the credit card account. In this example, the credit card billing day state table can therefore serve as the master data table, and the credit card transaction detail table, the ID mapping table, the customer basic information table, and the credit card repayment state table serve as slave data tables.
That the dimensions of any data table intersect the dimensions of at least one other data table means that, for any data table, there is always at least one other data table that shares at least one dimension with it. Illustratively, the dimension "customer" exists in both the ID mapping table and the customer basic information table, the dimension "credit card number" exists in both the ID mapping table and the credit card repayment state table, and the dimension "credit card account number" exists in the credit card billing day state table, the ID mapping table, and the credit card transaction detail table, so in this example the dimensions of every data table intersect the dimensions of at least one other data table.
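The intersection condition can be checked mechanically. The sketch below is illustrative only: each table is reduced to a set of dimension names, the names themselves (including "amount" and "age") are hypothetical stand-ins for the tables in the example, and tables_without_shared_dimension is not a function named in the embodiment.

```python
from typing import Dict, List, Set

# Hypothetical dimension names for the five tables in the example.
TABLE_DIMENSIONS: Dict[str, Set[str]] = {
    "billing_day_state": {"account", "card_number", "billing_day_state"},
    "transaction_detail": {"account", "card_number", "amount"},
    "id_mapping": {"customer", "account", "card_number"},
    "customer_info": {"customer", "age"},
    "repayment_state": {"card_number", "repayment_state"},
}


def tables_without_shared_dimension(tables: Dict[str, Set[str]]) -> List[str]:
    """Return the tables whose dimensions intersect no other table's dimensions."""
    isolated = []
    for name, dims in tables.items():
        shares = any(
            dims & other_dims
            for other_name, other_dims in tables.items()
            if other_name != name
        )
        if not shares:
            isolated.append(name)
    return isolated


print(tables_without_shared_dimension(TABLE_DIMENSIONS))  # prints []
```

An empty result means that every table can be reached through at least one shared dimension; a table listed in the result could not be associated with any other table.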
In S102, a dimension of a data table indicates the kind of object described by the data in that dimension: for example, the data in the dimension "customer" represents individual customers, and the data in the dimension "credit card number" represents individual credit card numbers. There are dependency relationships between these objects, for example, a credit card number belongs to a credit card account and a credit card account belongs to a customer, so there are correspondences between the data of different data tables.
For example, the data in the dimension "credit card number" of the credit card repayment state table corresponds to the data in the dimension "repayment state" of the same table, and the data in the dimension "credit card number" of the ID mapping table is associated with the data in the dimension "credit card account number" of that table. Therefore, the data in the dimension "credit card number" of the credit card repayment state table, and through it the corresponding data in the dimension "repayment state", is associated with the data in the dimension "credit card account number", which is the target data. In this way, the association mode between the data of each dimension in each data table and the target data can be determined from the dimensions of the data tables.
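The association just described can be illustrated with a small join over the shared dimension. This is a sketch under assumptions: the rows, the column names, and the function repayment_states_by_account are hypothetical, and tables are modeled as lists of dictionaries.

```python
from collections import defaultdict

# Hypothetical rows; the column names are illustrative only.
id_mapping = [
    {"customer": "alice", "account": "A1", "card_number": "C1"},
    {"customer": "alice", "account": "A1", "card_number": "C2"},
    {"customer": "bob", "account": "A2", "card_number": "C3"},
]
repayment_state = [
    {"card_number": "C1", "repayment_state": "paid"},
    {"card_number": "C2", "repayment_state": "overdue"},
    {"card_number": "C3", "repayment_state": "paid"},
]


def repayment_states_by_account(mapping_rows, repayment_rows):
    """Link each repayment row to an account through the shared card_number dimension."""
    card_to_account = {row["card_number"]: row["account"] for row in mapping_rows}
    states = defaultdict(list)
    for row in repayment_rows:
        account = card_to_account.get(row["card_number"])
        if account is not None:  # skip cards that no account claims
            states[account].append(row["repayment_state"])
    return dict(states)


print(repayment_states_by_account(id_mapping, repayment_state))
# prints {'A1': ['paid', 'overdue'], 'A2': ['paid']}
```

The repayment state table never mentions an account, yet each of its rows ends up attached to one, because the ID mapping table supplies the link from card number to account number.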
In S103, the master data table contains the target data, so the data associated with the target data can be extracted from the master data table directly. The association mode between the data in each slave data table and the target data has been determined in S102, so the data associated with the target data can also be extracted from each slave data table according to that association mode. That is, the data associated with the target data can be comprehensively extracted from all the data tables as the associated data.
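One way to picture this extraction is to follow an association mode hop by hop, starting from the rows of the master data table that carry the target value. The sketch below is not the embodiment's implementation: the representation of an association mode as a list of (table, shared dimension) hops, the function follow_path, and all rows are assumptions made for illustration.

```python
def follow_path(tables, master, hops, target_dim, target_value):
    """Collect the rows associated with one target value.

    hops is a list of (next_table, shared_dimension) pairs. Each hop keeps the
    rows of next_table whose shared_dimension value occurs in the rows kept so far.
    """
    rows = [r for r in tables[master] if r[target_dim] == target_value]
    for next_table, dim in hops:
        keys = {r[dim] for r in rows}
        rows = [r for r in tables[next_table] if r[dim] in keys]
    return rows


# Hypothetical tables, modeled as lists of dictionaries.
tables = {
    "billing_day_state": [
        {"account": "A1", "card_number": "C1", "billing_day_state": "normal"},
        {"account": "A2", "card_number": "C3", "billing_day_state": "normal"},
    ],
    "id_mapping": [
        {"customer": "alice", "account": "A1", "card_number": "C1"},
        {"customer": "alice", "account": "A1", "card_number": "C2"},
        {"customer": "bob", "account": "A2", "card_number": "C3"},
    ],
    "customer_info": [
        {"customer": "alice", "age": 30},
        {"customer": "bob", "age": 41},
    ],
}

# Master table -> ID mapping table (via account) -> customer table (via customer).
path = [("id_mapping", "account"), ("customer_info", "customer")]
print(follow_path(tables, "billing_day_state", path, "account", "A1"))
# prints [{'customer': 'alice', 'age': 30}]
```

With an empty list of hops the function returns the matching rows of the master data table itself, which corresponds to extracting associated data from the master data table directly.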
In S104, the manner of constructing the features may be different according to the application scenario, but the manner of constructing the features should be as comprehensive as possible so that the features obtained by the construction can better represent the object to be analyzed. As in the previous analysis, the associated data can be used to describe the object to be analyzed, i.e. the associated data can reflect the object to be analyzed to some extent, so that the features constructed from the associated data can be used as features of the object to be analyzed.
It will be appreciated that, because the master data table contains the target data capable of uniquely identifying the object to be analyzed, the slave data tables should be traversed step by step starting from the master data table, in descending order of their degree of association with the master data table. In this way, the features constructed from the associated data in the slave data tables accurately characterize the object to be analyzed.
After the features of the object to be analyzed are constructed, machine learning can be performed based on the features of the object to be analyzed, so that a model capable of being used for analyzing the object of the class to which the object to be analyzed belongs is trained, and analysis is performed by using the model obtained by training. The model obtained by training may be different according to the application scenario, may be an algorithm model obtained by training using traditional machine learning, or may be a neural network model obtained by training based on deep learning, and the embodiment does not limit the method.
Take the analysis of the risk condition of a credit card account as an example, where the target data uniquely identifies the credit card account. As analyzed in S102, when the association mode is determined, the data in the dimension "repayment state" of the credit card repayment state table is determined to be associated with the target data, so the repayment state of each credit card number under the credit card account is taken into account when the features of the credit card account are constructed. Likewise, the data in the dimension "customer basic information" of the customer basic information table is determined to be associated with the target data, so the basic information of the customer to whom the credit card account belongs is taken into account as well. The same applies to the credit card transaction detail table.
Therefore, the feature engineering method provided by the embodiment of the invention can construct features of the credit card account comprehensively by drawing on the credit card billing day state table, the credit card transaction detail table, the customer basic information table, and the credit card repayment state table together, so that the constructed features characterize the credit card account more accurately.
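As a rough illustration of constructing features from associated data, the sketch below aggregates the associated data of one account into a flat feature dictionary. The embodiment does not prescribe these aggregations; the counts, ratio, sum, and maximum used here, the function account_features, and the field names are all assumptions.

```python
def account_features(repayment_states, customer_row, transaction_amounts):
    """Build a flat feature dictionary for one account from its associated data."""
    n_cards = len(repayment_states)
    n_overdue = sum(1 for state in repayment_states if state == "overdue")
    return {
        # From the repayment state table, reached through the card number.
        "card_count": n_cards,
        "overdue_ratio": n_overdue / n_cards if n_cards else 0.0,
        # From the customer basic information table, reached through the customer.
        "customer_age": customer_row.get("age"),
        # From the transaction detail table.
        "transaction_total": sum(transaction_amounts),
        "transaction_max": max(transaction_amounts, default=0.0),
    }


# Hypothetical associated data for a single credit card account.
features = account_features(
    repayment_states=["paid", "overdue"],
    customer_row={"customer": "alice", "age": 30},
    transaction_amounts=[120.0, 35.5],
)
print(features)
```

Each entry is computed from data that the association mode has already tied to the account, so no table-specific domain knowledge is needed to decide which rows belong to it.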
The foregoing S102 may be implemented by:
S1021, generating a data relation tree according to the dimension of each data table.
Each node in the data relationship tree is used to represent one data table, and the root node is used to represent the primary data table, with the dimensions of the data table represented by any node intersecting the dimensions of the data table represented by the parent node of that node.
For example, still taking the scenario of analyzing the risk condition of a credit card account: the master data table is the credit card billing day state table, so the root node of the generated data relationship tree represents the credit card billing day state table. The credit card billing day state table shares the common dimension "credit card account number" with both the ID mapping table and the credit card transaction detail table, so the nodes representing the ID mapping table and the credit card transaction detail table are child nodes of the root node. The ID mapping table shares the common dimension "customer" with the customer basic information table and the common dimension "credit card number" with the credit card repayment state table, so the nodes representing the customer basic information table and the credit card repayment state table are child nodes of the node representing the ID mapping table. The generated data relationship tree is shown in FIG. 2.
It will be appreciated that the data relationship tree shown in FIG. 2 is only one possible data relationship tree. For example, the credit card transaction detail table and the credit card repayment state table share the common dimension "credit card number", so the node representing the credit card repayment state table could instead be a child node of the node representing the credit card transaction detail table. In an actual application scenario, which data relationship tree to generate may be decided according to actual requirements and/or the experience of the algorithm engineer.
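One possible way to generate such a tree is a breadth-first search from the master data table that attaches each unvisited table to the first visited table sharing a dimension with it. This is a sketch, not the embodiment's prescribed procedure. The dimension sets below are hypothetical and deliberately trimmed to the common dimensions named in the text so that the result matches the tree described above; with fuller dimension sets or a different table order the search may attach a table elsewhere, which is consistent with the tree not being unique.

```python
from collections import deque


def build_relation_tree(table_dims, master):
    """Return {child: (parent, sorted common dimensions)} by breadth-first search."""
    parent = {}
    visited = {master}
    queue = deque([master])
    while queue:
        node = queue.popleft()
        for name, dims in table_dims.items():
            common = table_dims[node] & dims
            if name not in visited and common:
                visited.add(name)
                parent[name] = (node, sorted(common))
                queue.append(name)
    return parent


# Hypothetical, trimmed dimension sets (see the note above).
table_dims = {
    "billing_day_state": {"account", "billing_day_state"},
    "id_mapping": {"customer", "account", "card_number"},
    "transaction_detail": {"account", "amount"},
    "customer_info": {"customer", "age"},
    "repayment_state": {"card_number", "repayment_state"},
}

tree = build_relation_tree(table_dims, "billing_day_state")
for child, (parent_name, common) in tree.items():
    print(f"{child} -> parent {parent_name} via {common}")
```

A table that shares no dimension with any reachable table never enters the tree, which is one way to detect input that violates the intersection requirement of S101.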
S1022, for each node other than the root node, determining the association mode between the data table represented by the node and the data table represented by the root node, based on the common dimension between the data table represented by the node and the data table represented by its parent node and on the association mode between the data table represented by the parent node and the data table represented by the root node; the determined association mode serves as the association mode between the data in the data table represented by the node and the target data.
Illustratively, taking the node representing the ID mapping table as an example, its parent node represents the credit card billing day status table, and the common dimension between the ID mapping table and the credit card billing day status table is "credit card account number", so the association mode between the ID mapping table and the credit card billing day status table is "credit card account number-credit card account number".
Taking the customer basic information table as an example, the parent node of the node representing the customer basic information table represents the ID mapping table, the common dimension between the customer basic information table and the ID mapping table is "customer", and the association mode between the ID mapping table and the credit card billing day status table is "credit card account number-credit card account number", so the association mode between the customer basic information table and the credit card billing day status table is "credit card account number-credit card account number/customer-customer".
By adopting this embodiment, the association mode between the data in each node and the target data can be accurately determined by constructing the data relationship tree and propagating the association relationships between nodes layer by layer, starting from the root node of the data relationship tree.
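The layer-by-layer propagation of S1022 can be illustrated with a short sketch. The parent mapping below is a hand-written stand-in for the tree of fig. 2, and the function names and the "dimension-dimension" string format are assumptions chosen to mirror the notation used in the examples above.

```python
# Hypothetical encoding of the tree in fig. 2: child -> (parent, common dimension).
PARENT = {
    "id_mapping": ("billing_day_status", "credit_card_account"),
    "transaction_detail": ("billing_day_status", "credit_card_account"),
    "customer_info": ("id_mapping", "customer"),
    "repayment_status": ("id_mapping", "credit_card_number"),
}
ROOT = "billing_day_status"


def association_path(table, parent=PARENT, root=ROOT):
    """Return the join dimensions leading from the root table to `table`.

    The path of a node is its parent's path plus the dimension the node
    shares with its parent.
    """
    if table == root:
        return []
    up, dim = parent[table]
    return association_path(up, parent, root) + [dim]


def format_path(path):
    """Render a path in the 'dimension-dimension/...' notation."""
    return "/".join(f"{dim}-{dim}" for dim in path)


print(format_path(association_path("customer_info")))
```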
The foregoing S104 may be realized by:
S1041, constructing data features for the associated data in each data table respectively, as single-table features of the object to be analyzed.
The operators for extracting data features in the embodiment of the invention fall into two types: transformation operators and aggregation operators. A transformation operator is suitable for the case where the object to be analyzed corresponds to only one piece of data in the data table, while an aggregation operator is suitable for the case where the same object to be analyzed corresponds to multiple pieces of data in the data table, so that an aggregation operation is needed when extracting features.
Further, the operators can be subdivided by data type into operators for Numerical data, operators for Timestamp data, and operators for Categorical data. For example, among the transformation operators, the operators selectable for Numerical data fields include StandardScaler and the like, and the operators selectable for Categorical data fields include OneHotEncoder and the like; among the aggregation operators, the operators selectable for Timestamp data fields include AvgDiff (the average of the differences between adjacent times) and the like.
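To make the distinction between the two operator types concrete, the following sketch shows one transformation operator (standard scaling of a Numerical field, one output per record) and one aggregation operator (AvgDiff over a Timestamp field, one output per object). The function names, the record layout, and the choice of seconds as the unit are assumptions for illustration only.

```python
from datetime import datetime
from statistics import mean, pstdev


def standard_scale(values):
    """Transformation operator: returns one scaled value per input record."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def avg_diff(timestamps):
    """Aggregation operator: mean gap in seconds between adjacent timestamps."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None  # a single record has no adjacent difference
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return mean(gaps)


def aggregate_by_key(records, key, field, operator):
    """Group records by `key` and apply an aggregation operator to `field`."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec[field])
    return {k: operator(v) for k, v in groups.items()}


# Hypothetical transaction records: several rows per account.
transactions = [
    {"account": "A1", "time": datetime(2021, 1, 1, 0, 0)},
    {"account": "A1", "time": datetime(2021, 1, 1, 1, 0)},
    {"account": "A1", "time": datetime(2021, 1, 1, 3, 0)},
    {"account": "A2", "time": datetime(2021, 1, 2, 0, 0)},
]
avg_gap = aggregate_by_key(transactions, "account", "time", avg_diff)
scaled = standard_scale([1, 2, 3])
```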
In a possible embodiment, before the subsequent processing based on the single-table features, the single-table features may be further preprocessed, and the object to be analyzed may be analyzed according to the features obtained after the preprocessing. The preprocessing includes, but is not limited to, fusion, screening, and cross derivation.
Feature fusion can directly splice multiple single-table features horizontally. In addition, data in the features that is used only to assist in association, such as the customer and the credit card number in the previous example, may be deleted during feature fusion.
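A minimal sketch of such a fusion step is given here, assuming that each single-table feature set is a mapping from an object identifier to a dictionary of named features and that feature names do not collide across tables. The function name and the sample values are hypothetical.

```python
def fuse_features(single_table_features, helper_keys=()):
    """Splice per-table feature rows side by side, keyed by object id.

    Columns listed in `helper_keys` only served to associate tables and
    are dropped from the fused result.
    """
    fused = {}
    for table_features in single_table_features:
        for object_id, feats in table_features.items():
            row = fused.setdefault(object_id, {})
            for name, value in feats.items():
                if name not in helper_keys:
                    row[name] = value
    return fused


# Hypothetical single-table features for two accounts.
from_id_mapping = {"A1": {"customer": "C1"}, "A2": {"customer": "C2"}}
from_customer_info = {"A1": {"age": 30}, "A2": {"age": 41}}
from_transactions = {"A1": {"monthly_txn_total": 100.0}}

fused = fuse_features(
    [from_id_mapping, from_customer_info, from_transactions],
    helper_keys=("customer", "credit_card_number"),
)
```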
The rule of feature screening is to delete features carrying little information, following the filter-method principle of feature screening, for example deleting features that have only one unique value, features whose variance is smaller than a given threshold, features with low correlation with the label column, and the like.
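The filter-style screening can be sketched as below. Only the first two rules (single unique value, variance below a threshold) are covered; the threshold value, the column layout, and the function name are assumptions, and the correlation rule is omitted for brevity.

```python
from statistics import pvariance


def screen_features(columns, min_variance=1e-3):
    """Keep only numeric feature columns that carry some information.

    `columns` maps a feature name to its list of values. A column is dropped
    if it has a single unique value or a population variance below
    `min_variance` (an assumed default).
    """
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column
        if pvariance(values) < min_variance:
            continue  # nearly constant column
        kept[name] = values
    return kept


columns = {
    "constant": [1.0, 1.0, 1.0],
    "nearly_constant": [1.0, 1.001, 1.0],
    "useful": [1.0, 5.0, 9.0],
}
screened = screen_features(columns)
```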
Cross derivation builds on the single-table features: features from different tables can be crossed to further enrich the feature information and improve the training effect of the model. For example, the monthly total transaction amount of each account is extracted from the credit card transaction detail table, the monthly total repayment amount is extracted from the credit card repayment status table, and the two features are subtracted to obtain a new derived feature, namely the difference between the account's monthly total transaction amount and its monthly total repayment amount, which may help to increase the amount of information in the feature system. Specifically, a pool of feature cross operators can be defined, such as common binary operators like addition, subtraction, multiplication and division, and operators are selected from the pool in turn when the actual cross derivation is performed.
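One way to picture the operator pool is the sketch below, which applies every binary operator in the pool to every pair of feature columns. The feature names, the naming scheme for derived features, and the handling of division by zero are assumptions made for the example.

```python
import operator
from itertools import combinations


def safe_div(a, b):
    """Division that yields None instead of raising on a zero divisor."""
    return a / b if b != 0 else None


# Pool of binary cross operators, as suggested in the description.
OPERATOR_POOL = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": safe_div,
}


def cross_derive(features, pool=OPERATOR_POOL):
    """Apply each operator in the pool to each pair of feature columns."""
    derived = {}
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        for op_name, op in pool.items():
            derived[f"{name_a}_{op_name}_{name_b}"] = [
                op(a, b) for a, b in zip(col_a, col_b)
            ]
    return derived


# Hypothetical per-account monthly totals from two different tables.
features = {
    "monthly_txn_total": [100.0, 50.0],
    "monthly_repay_total": [80.0, 0.0],
}
derived = cross_derive(features)
```

Because the number of derived columns grows with the number of feature pairs times the pool size, a screening step would normally follow cross derivation.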
As analyzed above, in S1041 all slave data tables need to be traversed step by step in descending order of their degree of association with the master data table, so that features for the object to be analyzed can be constructed from the associated data of the slave data tables. In the constructed data relationship tree the root node represents the master data table, so the closer a node is to the root node, the higher the degree of association between the data table it represents and the master data table.
Thus, the descending order of degree of association with the master data table is equivalent to the level-order traversal from the root node to the child nodes. Therefore, in S1041, features of the associated data in the data table corresponding to each node are extracted in turn, in level-order traversal from the root node to the child nodes, until the number of extracted features is greater than a preset number threshold, and the extracted data features are determined as the single-table features of the object to be analyzed.
Taking the data relationship tree shown in fig. 2 as an example, the order of traversing from the root node to the child nodes in the hierarchy is: credit card billing day status table → ID mapping table and credit card transaction detail table → customer basic information table and credit card repayment status table.
By adopting this embodiment, dimension explosion of the extracted features can be avoided.
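The threshold-bounded level-order traversal can be sketched as follows. The child mapping stands in for the tree of fig. 2, and the per-table feature names are placeholders; in practice they would come from the transformation and aggregation operators applied to each table.

```python
from collections import deque

# Hypothetical encoding of the tree in fig. 2: parent -> ordered children.
CHILDREN = {
    "billing_day_status": ["id_mapping", "transaction_detail"],
    "id_mapping": ["customer_info", "repayment_status"],
}

# Placeholder feature names produced for each table.
FEATURES_PER_TABLE = {
    "billing_day_status": ["status_code", "days_overdue"],
    "id_mapping": ["n_cards", "n_accounts"],
    "transaction_detail": ["txn_sum", "txn_count", "txn_avg_diff"],
    "customer_info": ["age", "income_band"],
    "repayment_status": ["repay_sum", "repay_count"],
}


def extract_until_threshold(root, children, features_per_table, max_features):
    """Visit tables level by level; stop once more than `max_features` exist."""
    collected, visited = [], []
    queue = deque([root])
    while queue and len(collected) <= max_features:
        table = queue.popleft()
        visited.append(table)
        collected.extend(features_per_table.get(table, []))
        queue.extend(children.get(table, []))
    return visited, collected


visited, collected = extract_until_threshold(
    "billing_day_status", CHILDREN, FEATURES_PER_TABLE, max_features=5
)
```

With a threshold of 5 the sketch stops after the transaction detail table, so the tables farthest from the master data table contribute no features.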
Referring to fig. 3, fig. 3 is a schematic structural diagram of a feature engineering device according to an embodiment of the present invention, which may include:
a data table obtaining module 301, configured to obtain a master data table and at least one slave data table, where each data table is configured to represent a correspondence between data of multiple dimensions, and the master data table includes target data for uniquely identifying an object to be analyzed, and a dimension of any data table has an intersection with a dimension of at least one other data table;
a relationship construction module 302, configured to determine, for each of the slave data tables, a relationship between data in the slave data table and the target data according to dimensions of the respective data tables; according to the association mode, determining the data associated with the target data in each data table as associated data;
and a feature engineering module 303, configured to construct features according to the associated data, as features of the object to be analyzed.
In one possible embodiment, the relationship construction module determines, for each of the slave data tables, a relationship between data in the slave data table and the target data according to dimensions of the respective data table, including:
generating a data relation tree according to the dimension of each data table, wherein each node in the data relation tree is used for representing one data table, the root node is used for representing the master data table, and an intersection exists between the dimension of the data table represented by any node and the dimension of the data table represented by the parent node of the node;
for each node except the root node, determining the association mode between the data table represented by the node and the master data table as the association mode between the data in the data table represented by the node and the target data, according to the common dimension between the data table represented by the node and the data table represented by its parent node and the association mode between the data table represented by the parent node and the master data table.
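Once an association mode has been determined, the associated data for a given piece of target data can be located by following the chain of common dimensions. The sketch below assumes that each table is a list of row dictionaries and that the association mode is given as an ordered list of (table, dimension) steps; both the representation and the sample rows are hypothetical.

```python
def find_associated_rows(master_row, path, tables):
    """Follow `path` from the master row and return the rows reached.

    `path` is a list of (table_name, join_dimension) steps; at each step the
    rows of the next table are filtered by the values of the join dimension
    found in the rows matched so far.
    """
    rows = [master_row]
    for table_name, dim in path:
        keys = {row[dim] for row in rows}
        rows = [row for row in tables[table_name] if row[dim] in keys]
    return rows


# Hypothetical table contents.
TABLE_ROWS = {
    "id_mapping": [
        {"credit_card_account": "A1", "customer": "C1", "credit_card_number": "N1"},
        {"credit_card_account": "A2", "customer": "C2", "credit_card_number": "N2"},
    ],
    "customer_info": [
        {"customer": "C1", "age": 30},
        {"customer": "C2", "age": 41},
    ],
}

# Association mode "credit_card_account-credit_card_account/customer-customer".
PATH_TO_CUSTOMER_INFO = [
    ("id_mapping", "credit_card_account"),
    ("customer_info", "customer"),
]

associated = find_associated_rows(
    {"credit_card_account": "A1"}, PATH_TO_CUSTOMER_INFO, TABLE_ROWS
)
```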
In a possible embodiment, the feature engineering module includes a recursive feature construction sub-module for constructing features for the associated data in each data table respectively, as single-table features of the object to be analyzed.
In a possible embodiment, the feature engineering module further comprises a feature cross-deriving sub-module for jointly constructing features for associated data in at least two different data tables as cross-features of the target data.
In a possible embodiment, the feature engineering module further includes a feature fusion and selection sub-module, configured to calculate an information entropy of each sub-feature in the single table feature;
and deleting the sub-features of the single table features, the information entropy of which is lower than a preset information entropy threshold, so as to obtain the screened single table features.
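The entropy-based screening can be illustrated with the following sketch, which uses Shannon entropy (in bits) over the empirical distribution of each sub-feature's values. The source does not specify how entropy is computed, so treating values as discrete categories, as well as the function names and the threshold, are assumptions; continuous features would need to be binned first.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def drop_low_entropy(columns, threshold):
    """Keep sub-features whose entropy is not below `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if shannon_entropy(values) >= threshold
    }


# Hypothetical sub-features of a single-table feature.
sub_features = {
    "always_same": ["a", "a", "a", "a"],
    "balanced_flag": [0, 1, 0, 1],
}
screened_by_entropy = drop_low_entropy(sub_features, threshold=0.5)
```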
In a possible embodiment, the recursive feature construction submodule is specifically configured to sequentially extract features of associated data in a data table corresponding to each node from a root node to a child node according to a hierarchical traversal order until the number of extracted features is greater than a preset number threshold;
and determining the extracted data characteristics as single-table characteristics of the object to be analyzed.
In a possible embodiment, the device further comprises an analysis module, configured to extract a feature of data to be analyzed according to a feature of the target data, where the feature of the data to be analyzed is consistent with a feature dimension included in the feature of the target data;
and analyzing the data to be analyzed according to the characteristics of the data to be analyzed to obtain an analysis result.
The embodiment of the invention also provides an electronic device, as shown in fig. 4, including:
a memory 401 for storing a computer program;
a processor 402, configured to execute a program stored in the memory 401, and implement the following steps:
acquiring a master data table and at least one slave data table, wherein each data table is used for representing the corresponding relation between data of multiple dimensions, the master data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table;
for each slave data table, determining an association mode between data in the slave data table and the target data according to the dimension of each data table;
according to the association mode, determining data associated with the target data in each data table as associated data;
and constructing a characteristic according to the associated data as the characteristic of the object to be analyzed.
The electronic device may comprise other components besides the memory 401 and the processor 402, such as a communication interface and a communication bus, where the communication bus connects the communication interface, the memory 401 and the processor 402, and the communication interface is used for communication between the electronic device and other devices.
The communication bus mentioned may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented in the figure by only one bold line, but this does not mean that there is only one bus or one type of bus.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of any of the above-described feature engineering methods.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the feature engineering methods of the embodiments described above.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (11)

1. A method of feature engineering, the method comprising:
acquiring a master data table and at least one slave data table, wherein each data table is used for representing the corresponding relation between data of multiple dimensions, the master data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table;
for each slave data table, determining an association mode between data in the slave data table and the target data according to the dimension of each data table;
according to the association mode, determining data associated with the target data in each data table as associated data;
and constructing a characteristic according to the associated data as the characteristic of the object to be analyzed.
2. The method according to claim 1, wherein said determining, for each of said slave data tables, a manner of association between data in said slave data table and said target data based on dimensions of the respective data table, comprises:
generating a data relation tree according to the dimension of each data table, wherein each node in the data relation tree is used for representing one data table, the root node is used for representing the master data table, and an intersection exists between the dimension of the data table represented by any node and the dimension of the data table represented by the parent node of the node;
for each node except the root node, determining the association mode between the data table represented by the node and the master data table as the association mode between the data in the data table represented by the node and the target data, according to the common dimension between the data table represented by the node and the data table represented by its parent node and the association mode between the data table represented by the parent node and the master data table.
3. The method according to claim 1, wherein said constructing features from said associated data as features of said object to be analyzed comprises:
and respectively constructing characteristics aiming at the data of the associated data in each data table as single table characteristics of the object to be analyzed.
4. A method according to claim 3, characterized in that the method further comprises:
features are commonly constructed for associated data in at least two different data tables as intersecting features of the target data.
5. A method according to claim 3, characterized in that the method further comprises:
calculating information entropy of each sub-feature in the single-table feature;
and deleting the sub-features of the single table features, the information entropy of which is lower than a preset information entropy threshold, so as to obtain the screened single table features.
6. A method according to claim 3, wherein the constructing a feature for the data of the associated data in each data table, respectively, as a single table feature of the object to be analyzed, comprises:
sequentially extracting the characteristics of the associated data in the data table corresponding to each node from the root node to the child nodes according to the hierarchical traversal sequence until the number of the extracted characteristics is greater than a preset number threshold;
and determining the extracted data characteristics as single-table characteristics of the object to be analyzed.
7. The method according to claim 1, wherein the method further comprises:
extracting characteristics of data to be analyzed according to the characteristics of the target data, wherein the characteristics of the data to be analyzed are consistent with characteristic dimensions included in the characteristics of the target data;
and analyzing the data to be analyzed according to the characteristics of the data to be analyzed to obtain an analysis result.
8. A feature engineering apparatus, the apparatus comprising:
the data table acquisition module is used for acquiring a master data table and at least one slave data table, wherein each data table is used for representing the corresponding relation between data of a plurality of dimensions, the master data table comprises target data for uniquely identifying an object to be analyzed, and the dimensions of any data table are intersected with the dimensions of at least one other data table;
the relation construction module is used for determining the association mode between the data in the slave data table and the target data according to the dimension of each data table for each slave data table; according to the association mode, determining the data associated with the target data in each data table as associated data;
and the characteristic engineering module is used for constructing characteristics according to the associated data and taking the characteristics as the characteristics of the object to be analyzed.
9. The apparatus of claim 8, wherein the relationship construction module determines, for each of the slave data tables, a manner of association between data in the slave data table and the target data based on dimensions of the respective data table, comprising:
generating a data relation tree according to the dimension of each data table, wherein each node in the data relation tree is used for representing one data table, the root node is used for representing the master data table, and an intersection exists between the dimension of the data table represented by any node and the dimension of the data table represented by the parent node of the node;
for each node except the root node, determining the association mode between the data table represented by the node and the master data table as the association mode between the data in the data table represented by the node and the target data, according to the common dimension between the data table represented by the node and the data table represented by its parent node and the association mode between the data table represented by the parent node and the master data table.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-7 when executing a program stored on a memory.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111652827.7A CN116414880A (en) 2021-12-30 2021-12-30 Feature engineering method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111652827.7A CN116414880A (en) 2021-12-30 2021-12-30 Feature engineering method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116414880A true CN116414880A (en) 2023-07-11

Family

ID=87053321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111652827.7A Pending CN116414880A (en) 2021-12-30 2021-12-30 Feature engineering method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116414880A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination