WO2021164382A1 - Method and apparatus for feature processing for a user classification model - Google Patents

Method and apparatus for feature processing for a user classification model

Info

- Publication number: WO2021164382A1
- Application number: PCT/CN2020/134499
- Authority: WO (WIPO PCT)
- Prior art keywords: feature, features, node, user, value
- Other languages: English (en), French (fr)
- Inventors: 张屹綮, 张天翼, 王维强
- Original assignee: 支付宝(杭州)信息技术有限公司 (Alipay (Hangzhou) Information Technology Co., Ltd.)
- Application filed by 支付宝(杭州)信息技术有限公司
- Publication of WO2021164382A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00: Commerce
    • G06Q 30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0201: Market modelling; Market analysis; Collecting market data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03: Credit; Loans; Processing thereof

Definitions

  • One or more embodiments of the present specification relate to the field of machine learning, and in particular to a method and apparatus for performing feature processing for a user classification model.
  • Machine learning models are now used for business analysis in a variety of business scenarios. In many application scenarios, users need to be classified and identified, for example to identify a user's risk level or to distinguish the group to which a user belongs. To this end, user classification models are trained to perform business-related user identification and classification.
  • The selection and processing of features is the basis of model training.
  • To train a user classification model with excellent performance and accurate predictions, it is necessary to select, from a large number of user features, those that are more relevant to the prediction target and better reflect the characteristics of the user.
  • One or more embodiments of this specification describe a method and apparatus for feature processing for user classification models, which address the insufficient efficiency of feature selection in existing feature engineering and perform feature selection and processing efficiently, thereby enabling rapid automated modeling.
  • According to a first aspect, a feature processing method for a user classification model is provided, including: obtaining a label data table and N first feature tables, where the label data table includes user category labels and each first feature table records several features of users; for each first feature table, determining the information value IV of each feature in combination with the label data table, and performing a first screening operation on the features based on the IV values to obtain a corresponding second feature table; constructing a bipartite graph in which each second feature table is a first-type node, each feature contained in the second feature tables is a second-type node, and the inclusion relationship between a second feature table and a feature forms a connecting edge; determining in the bipartite graph a first node set containing the smallest number of first-type nodes connected to all second-type nodes, so as to obtain the M second feature tables corresponding to the first-type nodes in the first node set; merging the M second feature tables into a comprehensive feature table, and based on the comprehensive feature table, calculating correlation coefficients between features; and performing a second screening operation on the features based on the correlation coefficients, to obtain multiple selected features for training the user classification model.
  • In one embodiment, the aforementioned N first feature tables may include user feature tables obtained from multiple data platforms.
  • In one embodiment, the label data table further includes at least one feature of the user; in this case, the N first feature tables may include a first feature table generated based on that at least one feature.
  • In different embodiments, the category label of the user may include one of the following: a risk level label of the user, a label of the marketing group to which the user belongs, and a credit level label of the user.
  • In one embodiment, before determining the information value IV of each feature in combination with the label data table, the method further includes preprocessing each first feature table. The preprocessing includes: counting the missing rate of feature values for each feature and eliminating features whose missing rate is greater than a predetermined threshold; and, for each feature retained in the first feature table, replacing missing feature values with a unified default value.
  • In one embodiment, both the first feature tables and the label data table use user identification information as the primary key, where the user identification information includes one of the following: account ID, mobile phone number, and email address.
  • In one embodiment, determining the information value IV of each feature in combination with the label data table specifically includes the following steps: obtaining, from any first feature table, each user's first feature value for any first feature, and sorting the first feature values to form a first feature value sequence; associating the label data table with the first feature table on the user identification information to obtain a label value sequence aligned with the first feature value sequence in user order; binning the users according to the first feature value sequence; counting, based on the label value sequence, the label value distribution of the category labels in each bin; and determining the information value IV of the first feature according to the label value distribution of each bin.
  • In one embodiment, the label data table further includes the labeling time of the category label, and the first feature table includes multiple feature values collected for the first feature at different collection times, together with their collection timestamps. In this case, the first feature value is obtained as follows: for each user, among the multiple feature values collected for the first feature, the feature value whose collection timestamp is earlier than, and closest to, the labeling time is used as that user's feature value for the first feature.
  • In one embodiment, the process of determining the first node set in the bipartite graph specifically includes: among the first-type nodes contained in the current bipartite graph, determining the node with the largest number of connected edges as the selected node, and adding it to a selected node set; updating the current bipartite graph by deleting the selected node and the second-type nodes connected to it; updating the connected edges of the remaining first-type nodes according to the deleted second-type nodes, and deleting first-type nodes that no longer have connected edges; and repeating the above steps until the updated bipartite graph contains no nodes, at which point the selected node set is used as the first node set.
  • In one embodiment, if more than one first-type node has the largest number of connected edges, the number of non-repeated nodes connected to each of them is determined, where a non-repeated node is a second-type node with only one connected edge; the first-type node connected to the largest number of non-repeated nodes is determined as the selected node.
  • In another embodiment, one of the more than one first-type nodes is randomly selected as the selected node.
  • In one embodiment, the second screening operation is performed as follows: for each feature in the comprehensive feature table, if the correlation coefficient between that feature and any other feature is higher than a predetermined correlation threshold, the feature is eliminated, thereby obtaining a retained feature set; the multiple selected features are then determined based on the retained feature set.
  • Further, the features in the retained feature set may be sorted by the magnitude of their information value IV, and a predetermined number of features with larger IV values selected as the multiple selected features.
  • In another embodiment, the second screening operation may be performed as follows: for each feature in the comprehensive feature table, calculate the mean of the correlation coefficients between that feature and the other features; sort the features by this mean, and select a predetermined number of features with smaller means as the multiple selected features.
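  • Purely as an illustrative sketch (not part of the original disclosure), the two variants of the second screening operation described above could look as follows; the function name, threshold value, and feature columns are assumptions:

```python
import pandas as pd

def screen_by_correlation(df: pd.DataFrame, corr_threshold: float = 0.9,
                          top_k: int = 2) -> list:
    """Variant 1: greedily drop a feature whose |corr| with any retained
    feature exceeds the threshold. Variant 2: rank the survivors by their
    mean |corr| with the others and keep the top_k least-correlated ones."""
    corr = df.corr().abs()
    retained = []
    for col in df.columns:
        if all(corr.loc[col, kept] <= corr_threshold for kept in retained):
            retained.append(col)
    n = len(retained)
    # subtract the self-correlation (always 1) before averaging
    mean_corr = corr.loc[retained, retained].apply(
        lambda row: (row.sum() - 1.0) / max(n - 1, 1), axis=1)
    return mean_corr.nsmallest(top_k).index.tolist()
```

In this sketch a perfectly duplicated feature is removed by variant 1, and variant 2 then prefers features that are least correlated with everything else.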
  • In one embodiment, the user classification model is trained based on the multiple selected features and the label data table, and its performance is evaluated; when the performance evaluation meets the preset requirements, the feature information of the multiple selected features is added to a feature pool for selection by other prediction models.
  • Further, the feature information of the multiple selected features may include the feature name of each selected feature, the table name of the first feature table from which it comes, and information on the model's usage of the feature.
  • In one embodiment, if the performance evaluation of the trained user classification model does not meet the preset requirements, several feature derivation methods are used to generate derived features, which form a derived feature table; the derived feature table is merged into the comprehensive feature table to obtain an updated comprehensive feature table; correlation coefficients between features are calculated based on the updated comprehensive feature table; and the second screening operation is performed again based on these correlation coefficients, to obtain an expanded set of selected features for training the user classification model again.
  • In different embodiments, the derived features include one or more of the following: cumulative features based on basic features, combined features based on basic features, sequence features, and graph features related to the user relationship network.
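  • As a minimal sketch of what cumulative and combined feature derivation might look like (the column names `user_id`, `amount`, and `n_purchases` are hypothetical, not taken from the specification):

```python
import pandas as pd

def derive_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative derivation: a cumulative feature aggregates a base
    feature over a user's rows in time order; a combined feature crosses
    two base features into one."""
    out = pd.DataFrame(index=df.index)
    # cumulative feature: running total of a base feature per user
    out["amount_cumsum"] = df.groupby("user_id")["amount"].cumsum()
    # combined feature: ratio of two base features (guard against zero)
    out["amount_per_purchase"] = df["amount"] / df["n_purchases"].clip(lower=1)
    return out
```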
  • According to a second aspect, an apparatus for performing feature processing for a user classification model is provided, including: a first obtaining unit configured to obtain a label data table and N first feature tables, where the label data table includes user category labels and each first feature table records several features of users; a first screening unit configured to determine, for each first feature table, the information value IV of each feature in combination with the label data table, and to perform the first screening operation on the features based on the IV values to obtain a corresponding second feature table; a bipartite graph construction unit configured to construct a bipartite graph with each second feature table as a first-type node, each feature contained in the second feature tables as a second-type node, and the inclusion relationship between a second feature table and a feature as a connecting edge; a node set determining unit configured to determine in the bipartite graph a first node set containing the smallest number of first-type nodes connected to all second-type nodes, so as to obtain the M second feature tables corresponding to the first-type nodes in the first node set; and units configured to merge the M second feature tables into a comprehensive feature table and to perform the second screening operation based on correlation coefficients between features, to obtain multiple selected features for training the user classification model.
  • A computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • A computing device is also provided, including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
  • With the feature processing solution for user classification models provided by the embodiments of this specification, feature screening is generally carried out in two stages. Before the second-stage screening, which is based on correlation coefficients between features, the number of feature tables is reduced using the minimum point cover principle on a bipartite graph, which greatly speeds up the calculation of correlation coefficients between features and thus the feature selection process. Further, by adding information about the selected features to a feature pool, feature selection for other models of the same type is accelerated, enabling rapid modeling of multiple models. Furthermore, the features can be enriched and expanded through feature derivation, which further benefits automated modeling.
  • Fig. 1 is a schematic diagram of the feature processing process of an embodiment disclosed in this specification;
  • Fig. 2 is a flowchart of a method for performing feature processing for a user classification model according to an embodiment;
  • Fig. 3 shows the steps of determining the IV value of each feature in an embodiment;
  • Fig. 4 shows a schematic diagram of a bipartite graph constructed from feature tables and features according to an embodiment;
  • Fig. 5 shows the iterative process of determining the first node set according to an embodiment;
  • Fig. 6 shows the process of determining the first node set for the bipartite graph of Fig. 4;
  • Fig. 7 shows a schematic block diagram of a feature processing apparatus according to an embodiment.
  • In the embodiments of this specification, an end-to-end feature processing solution is provided.
  • Based on a large number of user features in multiple original feature tables, the solution can quickly perform feature analysis and selection, efficiently determine the features suitable for modeling, and output them to a modeling tool for modeling.
  • Further, the selected feature information and the model's usage of the features can be recorded in a feature pool, to facilitate the selection and training of other models of the same type.
  • Fig. 1 is a schematic diagram of the feature processing process of an embodiment disclosed in this specification. As shown in Fig. 1, the feature processing process includes two stages of feature screening, based respectively on the information value IV of each feature and on the correlation coefficients between features.
  • The original feature set contains a large number of user features, each represented here by an ellipse.
  • These user features may come from multiple original feature tables, and different original feature tables may record duplicate features.
  • First, the information value IV (Information Value) of each feature is determined, hereinafter referred to as the IV value. Then, based on the IV values, preliminary screening is performed on the features in the original feature set, for example by eliminating features whose IV value is below a certain threshold, thereby obtaining the preliminarily screened features.
  • These preliminarily screened features are still distributed across a number of different feature tables.
  • The second stage of screening is based on the correlation coefficient between pairs of features. To calculate the correlation coefficient between two features from two different feature tables, a data table association (join) must be performed on the two tables. The calculation of correlation coefficients between features therefore involves a large number of table association operations, which consume computing resources and time, especially when each feature table holds a large volume of data. Considering that there may be duplicate features across the feature tables, before starting the second stage of screening, the set of feature tables is first "simplified" in order to reduce the number of tables to be associated.
  • The simplification of the feature tables is based on the minimum point cover principle for bipartite graphs. That is, each feature table is taken as a first-type node and each feature in the tables as a second-type node to construct a bipartite graph. Then, the smallest number of first-type nodes connected to all second-type nodes is found in the bipartite graph, which corresponds to the smallest number of feature tables covering all the feature items.
  • The minimum set of feature tables obtained above is merged into a comprehensive table, and the correlation coefficients between features are calculated based on this comprehensive table. The second stage of screening can then be performed based on these correlation coefficients, eliminating some further features and finally yielding the selected features.
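  • A minimal sketch of this merge-once-then-correlate step (assuming pandas tables keyed by a hypothetical `account_id` column):

```python
import pandas as pd
from functools import reduce

def build_comprehensive_table(tables, key="account_id"):
    """Join the minimum set of feature tables on the user key a single time,
    so every later pairwise correlation reads from one aligned table."""
    return reduce(lambda left, right: left.merge(right, on=key, how="inner"),
                  tables)

def feature_correlations(tables, key="account_id"):
    """Correlation matrix over all feature columns of the comprehensive table."""
    comprehensive = build_comprehensive_table(tables, key)
    return comprehensive.drop(columns=[key]).corr()
```

The design point is that N joins happen once up front, instead of one join per feature pair during the second screening stage.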
  • The selected features can then be output to a modeling tool for user classification model training and performance evaluation.
  • If the performance meets the requirements, the selected features are determined to be applicable to the user classification model, and relevant information about these features, such as the name of the feature table each comes from and the model's usage of the feature, is added to the feature pool. When subsequently training a user classification model of the same type, feature selection can then be made directly based on the feature information recorded in the feature pool, instead of processing and selecting features from scratch.
  • In summary, the above scheme uses two-stage feature screening for feature selection. Before the second stage, the number of feature tables is reduced through the minimum point cover principle on the bipartite graph, which greatly accelerates the calculation of correlation coefficients between features and thus speeds up the feature selection process. Further, by adding information about the selected features to the feature pool, feature selection for other models of the same type is accelerated, enabling rapid modeling of multiple models.
  • Fig. 2 shows a flowchart of a method for performing feature processing for a user classification model according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 2, the feature processing method includes at least the following steps.
  • In step 21, a label data table and N first feature tables are obtained.
  • The label data table includes user category labels, and these category labels are used as label data for training the user classification model.
  • In different business scenarios, the category labels differ accordingly.
  • In one embodiment, the user classification model is used to predict the user's risk category, for example ordinary users versus high-risk users (accounts suspected of fraud or account theft); correspondingly, the user category label in the label data table can be a risk level label reflecting the user's real risk status.
  • In another embodiment, the user classification model is used to predict the marketing group to which a user belongs, for example marketing-sensitive versus marketing-insensitive users, or to predict the user's marketing value level; correspondingly, the user category label may be a label of the marketing group to which the user belongs.
  • In yet another embodiment, the user classification model is used by a lending platform to evaluate a user's credit status; in this case, the user category label may be the user's credit rating label.
  • In other embodiments, the user category labels can have further meanings according to the classification goals and usage scenarios of the user classification model.
  • The label data table usually uses user identification information as the primary key, where the user identification information uniquely identifies different users.
  • For example, the user identification information may take the form of an account ID, mobile phone number, or email address.
  • In addition, N first feature tables are obtained, each of which records several features of the user.
  • The user features may specifically include static portrait features, such as gender, age, occupation, income, and education level; operating behavior features, such as the type of the most recent operation, the pages operated, and the dwell time; financial asset features, such as the Yu'ebao balance, the number of recent purchases, and the consumption amount; credit history features, such as the number of borrowings, the borrowing amount, and the repayment amount; social features, such as the number of friends, the frequency of communication with friends, and the communication type; as well as other user features not enumerated one by one here.
  • In one embodiment, the aforementioned N feature tables may be obtained by the computing platform (such as Alipay) implementing the method of Fig. 2 recording user features in multiple aspects.
  • In another embodiment, the aforementioned N first feature tables may come from multiple different data platforms, and the computing platform implementing the method of Fig. 2 obtains the data tables from those platforms.
  • For example, the computing platform may obtain a feature table related to loan and credit records from a banking institution, a feature table related to financial consumption from a shopping platform (such as Taobao), and a feature table related to social interaction from a social platform (such as DingTalk).
  • In one embodiment, the label data table also includes a small number of user features; for example, each row records (account ID, age, category label), where age is a user feature.
  • In this case, a feature table can be generated based on the features in the label data table and included among the above N feature tables.
  • The N feature tables obtained above all use the same type of user identification information as the primary key.
  • Table 1 exemplarily shows a feature table recording a user's static portrait features, and Table 2 exemplarily shows a feature table recording the user's financial and credit features.
  • Account ID | Age | Yu'ebao balance | Sesame score
    Xuxu | 30 | 30k | -00000
    Coco | 22 | 5k | 610
    Peny123 | 26 | 50k | 680
    Lily | 28 | 55k | -00000
    ... | ... | ... | ...
  • Both Table 1 and Table 2 use the account ID as the user identification information and as the primary key of the table.
  • In addition, both Table 1 and Table 2 record the user's age feature.
  • Hereinafter, a feature table obtained in step 21 is referred to as a first feature table.
  • In one embodiment, some preprocessing is performed on these feature tables before the IV-based screening.
  • Specifically, the preprocessing may include the handling of missing feature values.
  • For each feature, the missing rate of its feature values can be counted, and features whose missing rate is greater than a certain threshold can be eliminated.
  • If a feature's missing rate exceeds a certain threshold, such as 30%, the feature cannot provide enough information, and it can be eliminated to simplify subsequent calculations.
  • For example, missing records for the age value in Table 1 are shown as "--", and missing records for the Sesame score in Table 2 are shown as "-00000".
  • For the retained features, a unified default value can be used to replace missing feature values, which can be called normalization of missing values.
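  • The preprocessing just described could be sketched as follows (the 30% threshold and the sentinel strings come from the surrounding examples; the unified default of -1 is an assumption):

```python
import pandas as pd

def preprocess_features(df: pd.DataFrame, missing_threshold: float = 0.3,
                        default_value=-1) -> pd.DataFrame:
    """Treat the sentinel strings from Tables 1-2 as missing, drop features
    whose missing rate exceeds the threshold, and fill the rest with one
    unified default value."""
    df = df.replace({"--": None, "-00000": None})
    missing_rate = df.isna().mean()                 # per-feature missing rate
    kept = missing_rate[missing_rate <= missing_threshold].index
    return df[kept].fillna(default_value)
```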
  • Other preprocessing can also be performed on each first feature table to facilitate subsequent calculations.
  • In step 22, for each first feature table, the information value IV of each feature is determined in combination with the label data table, and the first screening operation is performed on the features based on the IV values to obtain the corresponding second feature table.
  • Figure 3 shows the steps of determining the IV value of each feature in one embodiment.
  • First, the first feature value of each user for the first feature is obtained from the first feature table, and the first feature values are sorted to form a first feature value sequence.
  • In one embodiment, the first feature is a static feature, such as gender or education level in Table 1.
  • In such a case, the feature value of each user for the first feature can be read directly from the first feature table.
  • In another embodiment, the first feature table may also contain dynamic features that change over time, for example the Yu'ebao balance and the Sesame score in Table 2.
  • In such a case, the first feature table usually records, for the dynamic feature, multiple feature values collected at different collection times, together with their collection timestamps.
  • Table 3 shows a first feature table containing a time stamp on the basis of Table 2.
  • Account ID | Age | Yu'ebao balance | Sesame score | Timestamp
    Xuxu | 30 | 30k | -00000 | February 1
    Xuxu | 30 | 30k | -00000 | February 2
    Xuxu | 30 | 35k | 665 | February 3
    ... | ... | ... | ... | ...
    Coco | 22 | 5k | 610 | February 1
    Coco | 22 | 6k | 615 | February 2
    Coco | 22 | 5k | 615 | February 3
    ... | ... | ... | ... | ...
    Peny123 | 26 | 50k | 680 | February 1
    ... | ... | ... | ... | ...
  • In this case, the label data table also includes the labeling time of each user's category label, and the labeling times of different users may be the same or different.
  • Correspondingly, the process of obtaining the first feature value of each user may include: for each user, among the multiple feature values collected for the first feature, determining the feature value whose collection timestamp is earlier than, and closest to, the labeling time of the category label corresponding to that user, and using it as the user's first feature value. For example, suppose the first feature is the Yu'ebao balance in Table 3.
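  • This point-in-time selection could be sketched as follows (column names such as `account_id`, `day`, and `label_day` are illustrative, and integer days stand in for real timestamps):

```python
import pandas as pd

def value_before_label(feature_df: pd.DataFrame, label_df: pd.DataFrame,
                       key="account_id", feat="balance",
                       ts="day", label_ts="label_day") -> pd.Series:
    """For each user, take the feature value whose collection timestamp is
    earlier than, and closest to, that user's labeling time."""
    merged = feature_df.merge(label_df[[key, label_ts]], on=key)
    valid = merged[merged[ts] < merged[label_ts]]
    idx = valid.groupby(key)[ts].idxmax()   # latest snapshot before the label
    return valid.loc[idx].set_index(key)[feat]
```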
  • The obtained first feature values are sorted to form a first feature value sequence (x_1, x_2, ..., x_n), where x_i is the first feature value of user i for the first feature X.
  • If the feature values are numeric, the sorting can be performed directly, either from largest to smallest or from smallest to largest. If the feature values of the first feature X are not numeric, for example for features such as education level or gender, they can first be mapped to numeric values according to a predetermined mapping relationship and then sorted.
  • Next, the label data table and the first feature table are associated on the user identification information to obtain a label value sequence (L_1, L_2, ..., L_n), which is aligned with the first feature value sequence (x_1, x_2, ..., x_n) with respect to the user order.
  • Specifically, the user identification information of user i, such as the account ID, is used to associate with the label data table, and the category label obtained for user i is taken as the label value L_i. In this way, the label value sequence (L_1, L_2, ..., L_n) is obtained.
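  • As an illustrative sketch of this alignment step (the join is the "data table association" described above; column names are assumptions):

```python
import pandas as pd

def aligned_sequences(feature_df: pd.DataFrame, label_df: pd.DataFrame,
                      key="account_id", feat="age", label="label"):
    """Join on the user key, then sort by the feature value so that the
    feature sequence (x_1..x_n) and label sequence (L_1..L_n) share one
    user order."""
    joined = feature_df.merge(label_df, on=key)   # the data table association
    joined = joined.sort_values(feat, kind="stable")
    return joined[feat].tolist(), joined[label].tolist()
```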
  • Then, the users are binned according to the first feature value sequence (x_1, x_2, ..., x_n).
  • In one embodiment, uniform binning is performed over the value range defined by the maximum and minimum values in the first feature value sequence.
  • In another embodiment, non-uniform automatic binning is performed according to the data distribution embodied in the first feature value sequence.
  • Optionally, another batch of users can be used as a verification set to verify the stability of the data distribution of the first feature values. If the feature values of the first feature for the other batch of users reflect a similar distribution, the data distribution is stable, and non-uniform automatic binning can be performed based on it.
  • Next, in step 34, based on the label value sequence, the distribution of the users' label values in each bin is counted; in step 35, the information value IV of the first feature is determined according to the label value distribution of each bin.
  • In the typical case of binary classification, users can be divided into positive samples and negative samples according to whether the label value is 0 or 1.
  • In step 34, the number of positive samples pos_i and the number of negative samples neg_i in bin i are counted; in step 35, the weight of evidence (WOE) value corresponding to bin i can be calculated as WOE_i = ln[(pos_i/POS) / (neg_i/NEG)], where POS and NEG are the total numbers of positive and negative samples; the information value is then IV = Σ_i (pos_i/POS - neg_i/NEG) · WOE_i.
  • In this way, the IV value can be determined for each feature in each first feature table.
  • More generally, the IV value of a feature can be determined from the distribution of label values in each bin through existing corresponding calculation methods.
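  • As a minimal illustrative sketch of the WOE/IV calculation above (the +0.5 smoothing for empty bins is an assumption; the specification does not prescribe one):

```python
import numpy as np

def information_value(bin_pos, bin_neg, eps: float = 0.5) -> float:
    """IV from per-bin positive/negative counts:
    WOE_i = ln((pos_i/POS) / (neg_i/NEG))
    IV    = sum_i (pos_i/POS - neg_i/NEG) * WOE_i
    eps smooths bins with zero counts."""
    pos = np.asarray(bin_pos, dtype=float) + eps
    neg = np.asarray(bin_neg, dtype=float) + eps
    p, q = pos / pos.sum(), neg / neg.sum()   # per-bin shares of POS and NEG
    woe = np.log(p / q)
    return float(((p - q) * woe).sum())
```

A feature whose bins separate positives from negatives gets a large IV; identical distributions give an IV of zero.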
  • After that, the first screening operation can be performed on the features based on their IV values, to obtain the corresponding second feature tables.
  • Specifically, the IV value of each feature may be compared with a threshold; features whose IV value is below the threshold are eliminated, and features whose IV value is above the threshold are retained.
  • The threshold can be set to, for example, 0.5.
  • The threshold can also be adjusted according to the screening target.
  • Hereinafter, a feature table obtained by removing features from a first feature table based on the IV values is referred to as a second feature table. In this way, N' second feature tables are obtained. Since all the features in some first feature table may be eliminated, the number N' of second feature tables is less than or equal to N.
  • Next comes the second stage of screening, which is based on the correlation coefficients between features. It should be understood that when calculating the correlation coefficient between two features, for example a first feature X and a second feature Y, the feature value sequences of the two features must be aligned with respect to the users. When the first feature X and the second feature Y come from different feature tables, this alignment is a data table association operation.
  • To reduce such association operations, the idea of minimum point cover on a bipartite graph is adopted: from the above N' second feature tables, the minimum number of second feature tables covering all features is determined, thereby reducing the number of feature tables.
  • Specifically, each second feature table is used as a first-type node, each feature contained in the second feature tables is used as a second-type node, and the inclusion relationship between a second feature table and a feature is used as a connecting edge to construct a bipartite graph.
  • Fig. 4 shows a schematic diagram of a bipartite graph constructed based on a feature table-feature according to an embodiment.
  • the nodes in the left column of Fig. 4 are the first-type nodes, and each first-type node corresponds to a feature table.
  • the nodes in the right column are the second type nodes, and each second type node corresponds to a feature. If the feature table i contains the feature j, a connecting edge is constructed between the first type node i corresponding to the feature table i and the second type node j corresponding to the feature j. It can be seen that the schematic bipartite graph of FIG. 4 is established based on 5 feature tables and a total of 12 features included in the 5 feature tables, and therefore, there are a total of 5 nodes of the first type and 12 nodes of the second type.
  • Some second-type nodes have more than one connected edge, meaning the corresponding feature appears in more than one feature table.
  • Such a second-type node is called a repeating node.
  • A second-type node with only one connected edge is called a non-repeating node.
  • In Fig. 4, the second-type nodes with serial numbers 1, 5, 8, and 12 are repeating nodes, represented by dark circles; the other second-type nodes are non-repeating nodes.
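The feature table–feature bipartite graph described above can be represented compactly as adjacency sets, with repeating nodes identified by their degree. The sketch below is illustrative; the dict-based representation is an assumption, not something prescribed by the text.

```python
def build_bipartite_graph(second_tables):
    """Build a table-feature bipartite graph.

    second_tables: dict mapping a table id (first-type node) to the set
    of features it contains (second-type nodes). An edge exists between
    a table and each feature it contains.
    Returns (edges, repeating): edges maps table -> feature set, and
    repeating is the set of features appearing in more than one table.
    """
    edges = {t: set(feats) for t, feats in second_tables.items()}
    # count how many tables each feature appears in
    degree = {}
    for feats in edges.values():
        for f in feats:
            degree[f] = degree.get(f, 0) + 1
    repeating = {f for f, d in degree.items() if d > 1}
    return edges, repeating
```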
  • In step 24, a first node set is determined in the above bipartite graph, containing the smallest number of first-type nodes that connect to all second-type nodes, so as to obtain the corresponding M second feature tables. That is, the first-type nodes included in the first node set correspond to the reduced set of second feature tables.
  • Determining the above first node set, i.e. solving the minimum point coverage problem on the bipartite graph, can be achieved through the iterative process shown in Fig. 5 below.
  • In step 51, among the first-type nodes contained in the current bipartite graph, the node with the largest number of connected edges is determined as the selected node, and the selected node is added to the selected node set.
  • There may be more than one first-type node with the largest number of connected edges in the current bipartite graph. In that case, in one example, one of them can be chosen at random as the selected node. Preferably, in another example, if multiple first-type nodes share the same maximum number of connected edges, the number of non-repeating nodes connected to each of them is determined respectively, and the first-type node connected to the largest number of non-repeating nodes is determined as the selected node.
  • If more than one first-type node is still tied after this, one of them is randomly selected as the selected node.
  • In step 52, the selected node and the second-type nodes connected to it are deleted from the bipartite graph.
  • In step 53, the connected edges of the remaining first-type nodes are updated according to the deleted second-type nodes, and any first-type node that no longer has connected edges is deleted. Steps 52 and 53 thus update the bipartite graph.
  • In step 54, it is judged whether any nodes remain in the updated bipartite graph; if so, the process returns to step 51 with the updated bipartite graph as the current bipartite graph, and the loop iterates again, until after some iteration it is determined in step 54 that the updated bipartite graph contains no nodes. In that case, in step 55, the selected node set at that point is taken as the first node set.
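Steps 51 through 55 above amount to a greedy cover procedure. The following is one possible sketch of it, including the preferred tie-break by the number of connected non-repeating nodes; the graph representation and function name are assumptions, not prescribed by the text.

```python
def greedy_min_cover(edges):
    """Greedy approximation of minimum point coverage (steps 51-55).

    edges: dict mapping first-type node -> set of second-type nodes.
    Returns the list of selected first-type nodes that together cover
    all second-type nodes.
    """
    # work on a mutable copy of the current bipartite graph
    graph = {t: set(fs) for t, fs in edges.items()}
    selected = []
    while graph:
        # count, for tie-breaking, how many tables each feature touches
        degree = {}
        for fs in graph.values():
            for f in fs:
                degree[f] = degree.get(f, 0) + 1

        # step 51: pick the node with the most edges; break ties by the
        # number of connected non-repeating nodes (degree == 1)
        def key(t):
            fs = graph[t]
            return (len(fs), sum(1 for f in fs if degree[f] == 1))

        chosen = max(graph, key=key)
        selected.append(chosen)
        covered = graph.pop(chosen)  # step 52: delete node + its features
        # step 53: update remaining edges; drop nodes with no edges left
        for t in list(graph):
            graph[t] -= covered
            if not graph[t]:
                del graph[t]
    return selected  # step 55: the selected node set
```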
  • Fig. 6 shows the process of determining the first set of nodes for the bipartite graph of Fig. 4.
  • The initial bipartite graph is shown in Fig. 4 and in the leftmost part A of Fig. 6. Based on the initial bipartite graph, the connected-edge information of each first-type node is counted.
  • The connected-edge information of each first-type node is represented as [a, b], where a is the number of second-type nodes connected to the first-type node, i.e. the number of connected edges, and b is the number of non-repeating nodes among them.
  • For example, the connected-edge information of first-type node (1) is [3, 2], which means that the node is connected to 3 second-type nodes, 2 of which are non-repeating nodes.
  • Similarly, the connected-edge information of node (2) is [4, 2],
  • that of node (3) is [4, 2],
  • that of node (4) is [3, 0],
  • and that of node (5) is [4, 2].
  • Nodes (2), (3), and (5) share the maximum number of connected edges and the same number of non-repeating nodes; assume node (2) is the selected node in this round. Then, according to step 52 of Fig. 5, first-type node (2) is deleted from the bipartite graph, and the four second-type nodes connected to node (2) are deleted at the same time.
  • In step 53, the connected edges of the remaining first-type nodes are updated; that is, the edges that originally connected the remaining first-type nodes to the 4 deleted second-type nodes are deleted accordingly. The bipartite graph is thus updated once, yielding the graph shown in part B as the current bipartite graph. At this point, all remaining first-type nodes still have connected edges.
  • In the graph of part B, the connected-edge information of node (1) is [2, 2],
  • that of node (3) is [3, 2],
  • that of node (4) is [2, 0],
  • and that of node (5) is [4, 2].
  • The number of connected edges of node (5) is the largest, so in this round of iteration node (5) is taken as the selected node and added to the selected node set. The selected node set is then {(2), (5)}.
  • Then node (5) and the second-type nodes connected to it, namely those with serial numbers 8, 10, 11, and 12, are deleted.
  • The connected edges of the remaining first-type nodes are updated; that is, the edges that originally connected the remaining first-type nodes to second-type nodes 8, 10, 11, and 12 are deleted accordingly.
  • In particular, first-type node (4) was originally connected to second-type nodes 8 and 12. With the deletion of these two second-type nodes and the update of the connected edges, first-type node (4) no longer has any connected edges, so first-type node (4) is also deleted.
  • Thus, the bipartite graph shown in part C is obtained as the current bipartite graph.
  • In the graph of part C, the connected-edge information of node (1) is [2, 2],
  • and that of node (3) is [2, 2].
  • The connected-edge information of these two nodes is exactly the same, so one of them is randomly selected as the selected node. Assume that node (1) is selected in this round; the selected node set is then {(2), (5), (1)}.
  • After the next round, the selected node set is {(2), (5), (1), (3)}, which can be used as the first node set achieving minimum point coverage.
  • The first node set thus obtained contains only 4 first-type nodes, fewer than the original 5, yet these 4 first-type nodes cover all 12 second-type nodes.
  • This means that the second feature tables represented by the first-type nodes in the first node set cover all the candidate features.
  • In other embodiments, the minimum point coverage of the bipartite graph can also be achieved in other ways. For example, in each iteration, find a first-type node all of whose connected second-type nodes are repeating nodes, and delete that first-type node and its connecting edges; repeat until no such first-type node exists, and take the remaining first-type nodes as the first node set.
  • Let M denote the number of second feature tables obtained from the first node set.
  • M is less than or equal to the number N' of second feature tables before step 23 is executed.
  • On the basis of the M second feature tables obtained, in step 25 the M second feature tables are merged into a comprehensive feature table, and the correlation coefficients between features are calculated based on that comprehensive feature table.
  • The process of merging the M second feature tables into a comprehensive feature table associates each second feature table with the comprehensive feature table through data-table join operations. Since the M second feature tables have already been reduced in number, the amount of computation is greatly lowered compared with joining and merging the original feature tables.
  • In the comprehensive feature table, the features have been aligned by user, so various existing methods can be used to calculate the correlation coefficient between any two features.
  • The correlation coefficient is usually the Pearson correlation coefficient, which can be calculated according to known algorithms; other measures, such as the Spearman rank correlation coefficient, can also be used.
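For concreteness, the Pearson correlation coefficient over two user-aligned feature value sequences can be computed as below. This is the textbook formula, not code from the document; the sequences must already be aligned by user as described above, and constant sequences are assumed not to occur.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two user-aligned feature sequences.

    Assumes equal length and non-constant sequences (nonzero variance).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In practice a statistics library would be used; the point is only that the computation presupposes the per-user alignment achieved by the table merge.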
  • In step 26, a second screening operation is performed on the features based on the above correlation coefficients, to obtain multiple selected features for training the user classification model.
  • the second screening operation can be performed in the following manner.
  • In one embodiment, for each feature in the comprehensive feature table, if the correlation coefficient between that feature and any other feature is higher than a predetermined correlation threshold, such as 0.8, the feature is eliminated; if its correlation coefficients with all other features are below the threshold, the feature is retained. A second round of elimination is thus performed, yielding a retained feature set.
  • In one example, all features in the retained feature set can be used as the selected features.
  • In another example, the features in the retained feature set can be sorted by the magnitude of the information value IV, and a predetermined number of features with the largest IV values selected as the selected features.
  • In another embodiment, for each feature in the comprehensive feature table, the mean of the correlation coefficients between that feature and the other features can be calculated. The features in the comprehensive feature table are then sorted by this mean, and a predetermined number of features with the smallest means are selected as the selected features. Of course, this can be further combined with the IV values for an additional round of screening.
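The threshold-based variant of the second screening can be sketched as follows. Note this is a common greedy reading that keeps the earlier feature of each highly correlated pair (e.g. when features are pre-sorted by descending IV); the literal text removes any feature correlated above the threshold with any other feature, so treat this as an assumed variant rather than the claimed procedure.

```python
def second_screening(features, corr, threshold=0.8):
    """Eliminate features highly correlated with an already-kept feature.

    features: feature names, e.g. ordered by descending IV value
    corr: function (f, g) -> correlation coefficient between features
    Returns the retained feature list, in input order.
    """
    kept = []
    for f in features:
        # keep f only if it is weakly correlated with every kept feature
        if all(abs(corr(f, g)) <= threshold for g in kept):
            kept.append(f)
    return kept
```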
  • In this way, the second stage of screening is performed, yielding multiple selected features that can then be used to train the user classification model. Thus, through the method steps of Fig. 2, feature processing and selection for the user classification model are completed.
  • Subsequently, these selected features can be output to a modeling tool to build the user classification model.
  • Specifically, the user classification model can be trained based on the above multiple selected features and the user label data in the label data table.
  • The user classification model can be implemented in various forms, such as a tree model or a deep neural network (DNN).
  • the tree model specifically includes, for example, a PS-Smart tree model, a GBDT tree, and the like.
  • the test set can be used to evaluate the performance of the model.
  • Performance evaluation can include a variety of evaluation indicators, such as prediction accuracy, recall, ROC curve, and so on.
  • If the performance evaluation meets preset requirements, for example both the accuracy rate and the recall rate are higher than 70%, the model performance is considered satisfactory, which further indicates that the selected features are suitable for the user classification model; the feature information of these selected features is then added to the feature pool for other models to choose from.
  • the feature information recorded in the feature pool may include the feature name of each selected feature, the table name of the first feature table from which the feature comes, and the usage information of the feature used by the model.
  • the usage information can specifically be the number of times used by each model.
  • the usage information may also include a description of the model that uses the feature.
  • When models of the same type need to be trained subsequently, for example when multiple user classification models are customized for different subjects based on different user sample sets, and these user classification models are all used to predict the same user category (for example, user risk), the feature pool can be consulted directly.
  • Specifically, frequently used features can be identified from the number of times each feature has been used by the models of the same type, and the required feature value data can be obtained directly via the table name of the first feature table from which each feature comes, for model training. In this way, other models of the same type do not need to perform feature processing again from scratch, but can quickly select features based on the information recorded in the feature pool.
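The feature pool lookup described above might be sketched as follows; the record fields mirror the information listed in the text (feature name, source table name, usage count), but the structure and function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical record structure for the feature pool; the fields mirror
# the feature information listed in the text.
FeatureInfo = namedtuple("FeatureInfo", ["feature_name", "table_name", "use_count"])

def pick_high_frequency(pool, top_k=2):
    """Select the most frequently used features recorded in the pool."""
    ranked = sorted(pool, key=lambda rec: rec.use_count, reverse=True)
    return ranked[:top_k]
```

A new model of the same type could then fetch feature values directly from each record's `table_name` instead of redoing the two-stage screening.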
  • If the performance evaluation does not meet the preset requirements, feature derivation can be used to further expand the features.
  • The derived features can include cumulative features based on basic features (for example, the cumulative number and cumulative amount of consumption over a period of time, based on a single-consumption feature), combined features formed by combinatorial operations on multiple different feature items, sequence features (for example, operation sequence features formed from multiple operations), and graph features related to the user relationship network, and so on.
  • These derived features can have more complex forms (for example, sequence vectors) and more abstract meanings (for example, features obtained from graph embeddings), and serve to expand and supplement the original features.
  • the derived feature table can be merged into the aforementioned comprehensive feature table to obtain an updated comprehensive feature table.
  • The updated comprehensive feature table contains both the original features and the above derived features. Then, based on the updated comprehensive feature table, the correlation coefficients between features are calculated, and based on them the second screening operation is performed on the features again to obtain expanded selected features. The user classification model is then trained again using these expanded selected features.
  • If the retrained user classification model meets the performance requirements, the feature information of the aforementioned expanded selected features is recorded in the aforementioned feature pool. If the performance of the retrained user classification model still does not meet the requirements, the performance improvement of the retrained model relative to the previously trained model, such as the improvement in prediction accuracy, is judged: if the improvement is higher than a certain threshold, the feature information of the expanded selected features is recorded in the feature pool; if it is not, the feature information of the selected features obtained in step 26 is recorded instead. In this way, feature derivation further expands the features and optimizes the effectiveness of the feature information in the feature pool.
  • In summary, the feature processing scheme for the user classification model proceeds through two stages of feature screening. Before the second-stage screening based on inter-feature correlation coefficients, the minimum point coverage principle on a bipartite graph is used to reduce the number of feature tables, which greatly speeds up the calculation of correlation coefficients between features and thus the feature selection process. Further, by adding the relevant information of the selected features to a feature pool, the feature selection process of other models of the same type is accelerated, enabling rapid modeling of multiple models. Furthermore, the features can be enriched and expanded through feature derivation, which further benefits automatic modeling.
  • an apparatus for performing feature processing for a user classification model is provided.
  • the apparatus can be deployed in any device, platform, or device cluster with computing and processing capabilities.
  • Fig. 7 shows a schematic block diagram of a feature processing apparatus according to an embodiment. As shown in FIG. 7, the device 700 includes:
  • the first obtaining unit 71 is configured to obtain a tag data table and obtain N first feature tables, the tag data tables include category tags of users, and each of the first feature tables records several features of the user;
  • the first screening unit 72 is configured to determine the information value IV of each feature in combination with the label data table for each first feature table, and perform a first screening operation on the feature based on the information value IV to obtain the corresponding second feature table.
  • the bipartite graph construction unit 73 is configured to use each second feature table as a first-type node, use the features contained in the second feature tables as second-type nodes, and use the inclusion relation between a second feature table and a feature as a connecting edge, to construct a bipartite graph;
  • the node set determining unit 74 is configured to determine a first node set in the bipartite graph, containing the smallest number of first-type nodes connected to all the second-type nodes, so as to obtain the M second feature tables corresponding to the first-type nodes in the first node set;
  • the correlation calculation unit 75 is configured to merge the M second feature tables to obtain a comprehensive feature table, and to calculate the correlation coefficients between features based on the comprehensive feature table;
  • the second screening unit 76 is configured to perform a second screening operation on features based on the correlation coefficient to obtain multiple selected features for training the user classification model.
  • the first obtaining unit 71 is configured to obtain respective statistical user characteristic tables from multiple data platforms as the first characteristic table.
  • the tag data table further includes at least one characteristic of the user; in this case, the first acquiring unit 71 may be configured to generate a first characteristic table based on the at least one characteristic.
  • the category label of the user may include one of the following: the risk level label of the user, the marketing group label to which the user belongs, and the credit level label of the user.
  • In one embodiment, the device 700 further includes a pre-processing unit (not shown) configured to pre-process each first feature table, the pre-processing including: counting the feature value missing rate of each feature and eliminating features whose missing rate is greater than a predetermined missing threshold; and, for the features retained in each first feature table, replacing missing feature values with a unified default value.
  • In one embodiment, both the first feature tables and the tag data table use user identification information as the primary key, the user identification information including one of the following: account ID, mobile phone number, and email address.
  • The first screening unit 72 is specifically configured to determine the IV value of each feature in the following manner: obtain, from any first feature table, the first feature value of each user for any first feature, and sort the first feature values to form a first feature value sequence; join the tag data table and that first feature table on the user identification information to obtain a tag value sequence aligned with the first feature value sequence in user order; bin the users according to the first feature value sequence; based on the tag value sequence, count the tag value distribution of the category tag in each bin; and determine the information value IV of the first feature according to the tag value distribution of each bin.
  • In one embodiment, the tag data table further includes the tagging time of the category tag;
  • and the first feature table includes multiple feature values collected for the first feature at different collection times, together with the collection timestamps corresponding to those feature values. In this case, the first screening unit 72 obtains the above first feature value as follows: for each user, among the multiple feature values collected for the first feature, the feature value whose collection timestamp is earlier than the tagging time and closest to the tagging time is determined as that user's feature value for the first feature.
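The binning-based IV determination described above is commonly formulated with weight-of-evidence style bin statistics. The sketch below uses the standard information value formula; the equal-frequency binning, the 0/1 label convention, and the smoothing constant are assumptions, since the text does not fix a formula.

```python
import math

def information_value(feature_values, labels, n_bins=4):
    """Compute a standard WOE-based IV for one feature.

    feature_values and labels are user-aligned sequences; labels are
    0/1 category labels. Users are binned by sorted feature value
    (equal-frequency), then IV = sum over bins of
    (pos_share - neg_share) * ln(pos_share / neg_share).
    """
    pairs = sorted(zip(feature_values, labels))
    total_pos = sum(l for _, l in pairs)
    total_neg = len(pairs) - total_pos
    bin_size = math.ceil(len(pairs) / n_bins)
    iv = 0.0
    for i in range(0, len(pairs), bin_size):
        bin_labels = [l for _, l in pairs[i:i + bin_size]]
        pos = sum(bin_labels)
        neg = len(bin_labels) - pos
        # smooth to avoid log(0) when a bin lacks one class entirely
        pos_share = max(pos, 0.5) / total_pos
        neg_share = max(neg, 0.5) / total_neg
        iv += (pos_share - neg_share) * math.log(pos_share / neg_share)
    return iv
```

A feature that separates the two label values well yields a large IV; a feature whose label distribution is the same in every bin yields an IV near zero.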
  • The node set determining unit 74 is specifically configured to: among the first-type nodes contained in the current bipartite graph, determine the node with the largest number of connected edges as the selected node and add it to the selected node set; update the current bipartite graph, including deleting the selected node and the second-type nodes connected to it; update the connected edges of the remaining first-type nodes according to the deleted second-type nodes and delete first-type nodes that no longer have connected edges; and repeat the above steps until the updated bipartite graph contains no nodes, taking the selected node set at that point as the first node set.
  • Further, the node set determining unit 74 is specifically configured to, if multiple first-type nodes share the same maximum number of connected edges, respectively determine the number of non-repeating nodes connected to each of those first-type nodes, where a non-repeating node is a second-type node with only one connected edge, and determine the first-type node connected to the largest number of non-repeating nodes as the selected node.
  • Still further, the node set determining unit 74 may also be configured to, if more than one first-type node is connected to the same maximum number of non-repeating nodes, randomly select one of them as the selected node.
  • The second screening unit 76 is specifically configured to: for each feature in the comprehensive feature table, eliminate the feature if the correlation coefficient between it and any other feature is higher than a predetermined correlation threshold, thereby obtaining a retained feature set; and determine the multiple selected features based on the retained feature set.
  • Further, the second screening unit 76 may sort the features in the retained feature set by the magnitude of the information value IV, and select a predetermined number of features with the largest IV values as the multiple selected features.
  • In another embodiment, the second screening unit 76 may perform the second screening operation as follows: for each feature in the comprehensive feature table, calculate the mean of the correlation coefficients between that feature and the other features; sort the features in the comprehensive feature table by this mean, and select a predetermined number of features with the smallest means as the multiple selected features.
  • In one embodiment, the above device 700 may further include (not shown) a model training and evaluation unit, configured to train the user classification model based on the multiple selected features and the tag data table and to evaluate its performance; and a feature adding unit, configured to add the feature information of the multiple selected features to the feature pool when the performance evaluation of the user classification model meets preset requirements, for selection by other prediction models.
  • the feature information of the multiple selected features includes the feature name of each selected feature, the table name of the first feature table from which the feature comes, and the usage information of the feature used by the model.
  • Further, the above device may include a feature derivation unit (not shown), configured to, when the performance evaluation of the trained user classification model does not meet the preset requirements, use several feature derivation tools to generate several derived features and form a derived feature table, and to merge the derived feature table into the comprehensive feature table to obtain an updated comprehensive feature table; the correlation calculation unit 75 is further configured to calculate the correlation coefficients between features based on the updated comprehensive feature table; and the second screening unit 76 is further configured to perform the second screening operation again based on those correlation coefficients, to obtain expanded selected features for retraining the user classification model.
  • The several derived features include one or more of the following: cumulative features based on basic features, combined features based on basic features, sequence features, and graph features related to the user relationship network.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • According to another aspect, a computing device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method described in conjunction with Fig. 2 is implemented.
  • the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof.
  • these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.


Abstract

The embodiments of this specification provide a method and apparatus for feature processing for a user classification model. The method includes: first obtaining a label data table and first feature tables, each first feature table recording several features of users. For the features in each first feature table, the feature IV values are calculated, and a first screening operation is performed on the features based on the IV values, yielding corresponding second feature tables. Then, a bipartite graph is constructed with the second feature tables and the features therein as first-type and second-type nodes respectively, and the smallest number of first-type nodes connected to all second-type nodes is determined in the bipartite graph, yielding the corresponding M second feature tables. Next, the M second feature tables are merged to obtain a comprehensive feature table, based on which the correlation coefficients between features are calculated; based on the correlation coefficients, a second screening operation is performed on the features to obtain multiple selected features for training the user classification model.

Description

Method and Apparatus for Feature Processing for a User Classification Model — Technical Field
One or more embodiments of this specification relate to the field of machine learning, and in particular to methods and apparatus for feature processing for a user classification model.
Background
With the rapid development of artificial intelligence and machine learning, machine learning models are being used for business analysis in a variety of business scenarios. For example, in many application scenarios users need to be classified and identified, e.g. identifying a user's risk level, distinguishing the group a user belongs to, and so on. To this end, it is often necessary to train user classification models for business-related user identification and classification.
The selection and processing of features is the foundation of model training. For a user classification model, in order to train a model with excellent performance and accurate predictions, features that are more relevant to the prediction target and better reflect user characteristics must be selected from a large number of candidate user features for model training.
In practical scenarios, the large number of candidate user features is often distributed across many different data tables, and joining and consolidating these tables requires enormous computational overhead, which makes fast unified analysis of features very difficult. In addition, in some cases multiple user classification models need to be trained for multiple different subjects. For example, a payment platform may need to customize user risk identification models for different large payment entities (e.g. different banks); a shopping platform may need to customize user value classification models for different merchants. Faced with a large number of customized models of the same type, quickly performing feature selection and processing becomes another challenge for feature engineering.
Therefore, an improved solution is desired that can perform feature selection and processing for user classification models more efficiently, thereby achieving fast automated modeling.
Summary of the Invention
One or more embodiments of this specification describe a method and apparatus for feature processing for a user classification model, which solve the problem of insufficient feature selection efficiency in existing feature engineering and efficiently perform feature selection and processing for a user classification model, thereby achieving fast automated modeling.
According to a first aspect, a method for feature processing for a user classification model is provided, including: obtaining a label data table and obtaining N first feature tables, where the label data table includes category labels of users and each first feature table records several features of users; for each first feature table, determining the information value IV of each feature in combination with the label data table, and performing a first screening operation on the features based on the information value IV to obtain a corresponding second feature table; constructing a bipartite graph with each second feature table as a first-type node, the features contained in the second feature tables as second-type nodes, and the inclusion relations between second feature tables and features as connecting edges; determining in the bipartite graph a first node set containing the smallest number of first-type nodes connected to all second-type nodes, thereby obtaining M second feature tables corresponding to the first-type nodes in the first node set; merging the M second feature tables to obtain a comprehensive feature table, and calculating correlation coefficients between features based on the comprehensive feature table; and performing a second screening operation on the features based on the correlation coefficients to obtain multiple selected features for training the user classification model.
In one embodiment, the above N first feature tables may include user feature tables obtained from multiple data platforms, each compiled by its own platform.
In another embodiment, the label data table further includes at least one feature of the users; in this case, the N first feature tables may include a first feature table generated based on that at least one feature.
In different embodiments, the category label of a user may include one of the following: the user's risk level label, the marketing group label the user belongs to, and the user's credit level label.
According to one implementation, before determining the information value IV of each feature in combination with the label data table, the method further includes pre-processing each first feature table, the pre-processing including: counting the feature value missing rate of each feature and eliminating features whose missing rate is greater than a predetermined missing threshold; and, for the features retained in each first feature table, replacing missing feature values with a unified default value.
According to one embodiment, both the first feature tables and the label data table use user identification information as the primary key, the user identification information including one of the following: account ID, mobile phone number, and email address.
In one embodiment, determining the information value IV of each feature in combination with the label data table may specifically include the following steps: obtaining, from any first feature table, the first feature value of each user for any first feature, and sorting the first feature values to form a first feature value sequence; joining the label data table and that first feature table on the user identification information to obtain a label value sequence aligned with the first feature value sequence in user order; binning the users according to the first feature value sequence; based on the label value sequence, counting the label value distribution of the category label in each bin; and determining the information value IV of the first feature according to the label value distribution of each bin.
Further, in one embodiment, the label data table further includes the labeling time of the category label, and the first feature table includes multiple feature values collected for the first feature at different collection times, together with the collection timestamps corresponding to those feature values; in this case, the first feature value is obtained as follows: for each user, among the multiple feature values collected for the first feature, the feature value whose collection timestamp is earlier than the labeling time and closest to the labeling time is determined as that user's feature value for the first feature.
According to one embodiment, the process of determining the first node set in the bipartite graph specifically includes: among the first-type nodes contained in the current bipartite graph, determining the node with the largest number of connected edges as the selected node and adding it to the selected node set; updating the current bipartite graph, including deleting the selected node and the second-type nodes connected to it; updating the connected edges of the remaining first-type nodes according to the deleted second-type nodes and deleting first-type nodes that no longer have connected edges; and repeating the above steps until the updated bipartite graph contains no nodes, taking the selected node set at that point as the first node set.
In an example of the above embodiment, if multiple first-type nodes share the same maximum number of connected edges, the number of non-repeating nodes connected to each of those first-type nodes is determined respectively, a non-repeating node being a second-type node with only one connected edge; and the first-type node connected to the largest number of non-repeating nodes is determined as the selected node.
Still further, if more than one first-type node is connected to the same maximum number of non-repeating nodes, one of them is randomly selected as the selected node.
According to one implementation, the second screening operation is performed as follows: for each feature in the comprehensive feature table, eliminating the feature if the correlation coefficient between it and any other feature is higher than a predetermined correlation threshold, thereby obtaining a retained feature set; and determining the multiple selected features based on the retained feature set.
Further, in one embodiment, the features in the retained feature set may be sorted by the magnitude of the information value IV, and a predetermined number of features with the largest IV values selected as the multiple selected features.
According to another implementation, the second screening operation may be performed as follows: for each feature in the comprehensive feature table, calculating the mean of the correlation coefficients between that feature and the other features; sorting the features in the comprehensive feature table by this mean, and selecting a predetermined number of features with the smallest means as the multiple selected features.
According to one implementation, after the multiple selected features are obtained, the user classification model is trained based on the multiple selected features and the label data table, and its performance is evaluated; when the performance evaluation of the user classification model meets preset requirements, the feature information of the multiple selected features is added to a feature pool for selection by other prediction models.
In a specific example, the feature information of the multiple selected features includes the feature name of each selected feature, the table name of the first feature table from which the feature comes, and the usage information of the feature by models.
In one embodiment, when the performance evaluation of the trained user classification model does not meet the preset requirements, several feature derivation tools are used to generate several derived features and form a derived feature table; the derived feature table is merged into the comprehensive feature table to obtain an updated comprehensive feature table; correlation coefficients between features are calculated based on the updated comprehensive feature table; and the second screening operation is performed again on the features based on those correlation coefficients, to obtain expanded selected features for retraining the user classification model.
In a specific example, the several derived features include one or more of the following: cumulative features based on basic features, combined features based on basic features, sequence features, and graph features related to the user relationship network.
According to a second aspect, an apparatus for feature processing for a user classification model is provided, including: a first obtaining unit configured to obtain a label data table and N first feature tables, the label data table including category labels of users and each first feature table recording several features of users; a first screening unit configured to, for each first feature table, determine the information value IV of each feature in combination with the label data table and perform a first screening operation on the features based on the information value IV to obtain a corresponding second feature table; a bipartite graph construction unit configured to construct a bipartite graph with each second feature table as a first-type node, the features contained in the second feature tables as second-type nodes, and the inclusion relations between second feature tables and features as connecting edges; a node set determining unit configured to determine in the bipartite graph a first node set containing the smallest number of first-type nodes connected to all second-type nodes, thereby obtaining M second feature tables corresponding to the first-type nodes in the first node set; a correlation calculation unit configured to merge the M second feature tables to obtain a comprehensive feature table and calculate correlation coefficients between features based on the comprehensive feature table; and a second screening unit configured to perform a second screening operation on the features based on the correlation coefficients to obtain multiple selected features for training the user classification model.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method of the first aspect is implemented.
According to the feature processing scheme for a user classification model provided by the embodiments of this specification, processing proceeds overall through two stages of feature screening. Before the second-stage screening based on inter-feature correlation coefficients, the minimum point coverage principle on a bipartite graph is used to reduce the number of feature tables, greatly speeding up the calculation of correlation coefficients between features and thus the feature screening process. Further, by adding the relevant information of the selected features to a feature pool, the feature selection process of other models of the same type is accelerated, enabling rapid modeling of multiple models. Furthermore, features can be further enriched and expanded through feature derivation, which further benefits the effect of automatic modeling.
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。
图1为本说明书披露的一个实施例的特征处理过程的示意图;
图2为根据一个实施例的针对用户分类模型进行特征处理的方法流程图;
图3示出在一个实施例中,确定各项特征的IV值的步骤;
图4示出根据一个实施例基于特征表-特征构建的二部图的示意图;
图5示出根据一个实施例确定第一节点集合的反复迭代过程；
图6示出针对图4的二部图确定其第一节点集合的过程;
图7示出根据一个实施例的特征处理装置的示意性框图。
具体实施方式
下面结合附图,对本说明书提供的方案进行描述。
为了更高效地实现用户分类模型的建模和训练，在本说明书的一个实施例中，提供一种端到端的特征处理方案，该方案可以基于多个原始特征表中的大量用户特征，快速地进行特征分析和选择，从而高效确定出适合建模的特征，输出给建模工具进行建模。进一步地，可以将选择出的特征信息以及模型对特征的使用状况记录在特征池中，从而便于同类型的其他模型进行选择和训练。
图1为本说明书披露的一个实施例的特征处理过程的示意图。如图1所示,特征处理过程包含两阶段的特征筛选,这两个阶段的特征筛选分别基于特征的信息价值IV和特征之间的相关系数进行。
具体地,如图1所示,原始特征集中包含大量的用户特征,每项用户特征示例性用一个椭圆圈表示。这些用户特征可以来自于多个原始特征表,并且不同的原始特征表中可能存在重复记录的特征。
在第一阶段筛选中，针对各项特征，基于原始特征表和标签数据表的关联，确定特征的信息价值IV(Information Value)，下文中简称为IV值。然后基于特征的IV值，对原始特征集中的特征进行初步筛选，例如，剔除IV值低于一定阈值的特征，由此得到初步筛选的特征。初步筛选的特征仍然分布于多个不同的特征表中。
第二阶段的筛选基于两两特征之间的相关系数进行。如果要计算来自两个不同特征表的两项特征之间的相关系数,就需要对这两个特征表进行数据表关联运算。因此,特征间相关系数的计算,涉及大量的数据表关联运算,而这部分运算非常消耗计算资源和计算时间,特别是在各个特征表的数据量都比较大时。考虑到特征表中有可能存在重复特征,因此,在开始第二阶段的筛选之前,创新性地对特征表进行“精简”,以期减少后续有待关联的特征表的数目。
特征表的精简基于二部图的最少点覆盖原则来进行。也就是,将特征表作为第一类节点,将表中的各项特征作为第二类节点,构建成二部图。然后在该二部图中找到,能够连接到全部第二类节点的最小数目的第一类节点,也就找到了,能够覆盖所有特征项的最少数目的特征表。
然后,将以上得到的最少数目的特征表合并成一个综合表,基于该综合表,计算特征间的相关系数。于是,可以执行第二阶段的筛选,基于特征间的相关系数,再剔除一些特征,最终得到一些选中特征。
上述选中特征于是可以输出给建模工具,进行用户分类模型的训练以及性能评估。在性能满足要求的情况下,确定上述选中特征为针对用户分类模型适用的特征,将这些特征的相关信息,例如对应的特征表名,模型对该特征的使用状况等,添加到特征池中。于是,后续在训练同类型的用户分类模型时,可以直接根据特征池中所记录的特征相关信息,进行特征的选择,而不必从零开始重新进行特征的处理和选择。
因此，以上的方案总体上通过两阶段的特征筛选进行特征选择，其中在第二阶段筛选之前，通过二部图中的最少点覆盖原则，对特征表数目进行精简，从而极大地加快特征间相关系数的计算过程，进而加快特征筛选过程。进一步地，通过将选中的特征的相关信息添加到特征池中，来加速同类型的其他模型的特征选择过程，由此实现多个模型的快速建模。
下面描述以上方案的具体步骤和执行方式。
图2示出根据一个实施例的针对用户分类模型进行特征处理的方法流程图。可以理解,该方法可以通过任何具有计算、处理能力的装置、设备、平台、设备集群来执行。如图2所示,该特征处理方法至少包括以下步骤。
在步骤21,获取标签数据表以及获取N个第一特征表。
可以理解,标签数据表中包括用户的类别标签,这些类别标签用作训练用户分类模型的标注数据。取决于用户分类模型的具体分类目标,类别标签也相应的不同。例如,在一个例子中,用户分类模型用于预测用户的风险类别,例如,普通用户或是高风险用户(涉嫌欺诈、盗号的账户);相应的,标签数据表中的用户类别标签可以是,示出用户真实的风险状况的风险等级标签。在另一例子中,用户分类模型用于预测用户所属的营销人群,例如,营销敏感用户/营销不敏感用户,或者,预测用户的营销价值等级;相应的,用户类别标签可以是,用户所属的营销人群标签。在又一例子中,用户分类模型用于借贷平台对用户信用状况的评估;在这样的情况下,用户类别标签可以为,用户的信用等级标签。在更多其他例子中,根据用户分类模型的分类目标和使用场景,用户类别标签可以具有更多种含义。
标签数据表通常以用户标识信息为主键,其中用户标识信息用于唯一地标识出不同用户。具体的,用户标识信息可以采用账户ID、手机号、邮箱地址等形式。
为了进行用户分类模型的训练,除了获取用户类别标签,还要获取用户的特征数据。特征数据往往分布记录在多个特征表中,因此在步骤21中,获取N个第一特征表,每个特征表记录用户的若干项特征。
用户的特征具体可以包括,用户的静态画像方面的特征,例如性别、年龄、职业、收入、教育程度等;用户的操作行为方面的特征,例如最近一次操作的类型、操作的页面、停留的时间等等;用户的金融资产方面的特征,例如余额宝余额、近期消费次数,消费金额等等;用户的信用记录方面的特征,例如借款次数、借款金额、还款金额等等;用户的社交方面的特征,例如好友数目、与好友的沟通频次、沟通类别等等;以及用户的其他方面的特征,此处不一一进行枚举。
在一个实施例中，上述N个特征表可以是实施图2方法的计算平台（例如支付宝）通过对多个方面的用户特征进行记录而得到的。在另一实施例中，上述N个第一特征表可以来自多个不同的数据平台，实施图2方法的计算平台从该多个不同的数据平台获取到各个数据表。例如，计算平台可以从银行机构获得与借贷信用记录相关的特征表，从购物平台（例如淘宝）获取与金融消费相关的特征表，从社交平台（例如钉钉）获取与社交相关的特征表。在又一实施例中，上述标签数据表中也包括少量用户特征，例如每行记录有（账户ID、年龄、类别标签），其中年龄为用户特征。此时，可以基于标签数据表中的特征，生成特征表，包含在上述N个特征表中。
以上获取的N个特征表均以相同类型的用户标识信息为主键。
下面的表1示例性示出一个记录用户静态画像特征的特征表,表2示例性示出一个记录用户金融和信用方面特征的特征表。
表1:
账户ID 性别 年龄 教育程度 注册时长
Lucy F 30 BA 5y
Lily F 28 MA 6y
Lilei M -- Under 1y
Xuxu M 35 Phd 8y
…… …… …… …… ……
表2:
账户ID 年龄 余额宝余额 芝麻分
Xuxu 30 30k -00000
Coco 22 5k 610
Peny123 26 50k 680
Lily 28 55k -00000
…… …… …… ……
可以看到,表1和表2均采用账户ID作为用户标识信息,并以此作为表的主键。并且,表1和表2中都记录有用户年龄这一特征。
通过以上具体例子可以看到，所获取的N个特征表中，可以有特征的重复记录，并且不同表之间，用户记录的顺序通常是不同的。为了区别于后续经过筛选操作后的特征表，方便描述，将步骤21获取的特征表称为第一特征表。
在一个实施例中,可选地,在获取到上述N个第一特征表后,进行基于IV值的筛选之前,先对这些特征表进行一些预处理,该预处理可以包括,针对特征缺失值的预处理。
具体的，针对各个第一特征表中的各项特征，可以统计该项特征的特征值缺失率，将缺失率大于一定阈值的特征剔除。例如，在表1中，年龄这一特征下，用户Lilei的特征值缺失；在表2中，芝麻分这一特征下，至少两个用户(Xuxu和Lily)的特征值缺失。如果某项特征的特征值缺失率大于一定阈值，例如30%，则说明该项特征不足以提供足够多的信息量，可以将其剔除，以简化后续操作的计算量。
如以上表1和表2所示,由于第一特征表的来源可能不同,记录的特征项不同,不同的第一特征表中常常会采用不同的方式记录特征的缺失项。例如,表1中对于年龄值的缺失记录为“--”,而表2中对于芝麻分的缺失记录为“-00000”。为便于后续各个特征表的统一分析,可以在预处理阶段,对于以上剔除之后保留的特征,用统一的缺省值替代缺失的特征值,这可以称为缺失特征的归一化。
还可以对各个第一特征表进行其他方面的预处理,以便于后续的计算。
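上述针对缺失值的预处理（剔除高缺失率特征、用统一缺省值替代各表不一致的缺失标记），可以用如下代码示意。这是一个基于若干假设的最简示意实现，并非专利原文代码：其中以 pandas 的 DataFrame 表示特征表，缺失标记（如"--"、"-00000"）、阈值与缺省值均为示例自拟。

```python
import pandas as pd

def preprocess(df, missing_markers, max_missing_rate=0.3, default=-1):
    # 缺失特征的归一化：将各表自定义的缺失标记统一视为缺失值
    df = df.replace(list(missing_markers), pd.NA)
    # 剔除特征值缺失率大于阈值的特征列
    keep_cols = [c for c in df.columns
                 if df[c].isna().mean() <= max_missing_rate]
    # 对保留的特征列，用统一的缺省值替代缺失的特征值
    return df[keep_cols].fillna(default)
```

例如，对表1、表2风格的数据，"--"与"-00000"会先被统一识别为缺失，再按列缺失率决定特征的去留。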
接着,在步骤22,针对每个第一特征表,结合标签数据表确定各项特征的信息价值IV,基于IV值对特征进行第一筛选操作,得到对应的第二特征表。
图3示出在一个实施例中,确定各项特征的IV值的步骤。如图3所示,在步骤31,从第一特征表中获取各个用户针对第一特征的第一特征值,将各个第一特征值排序形成第一特征值序列。
在一个实施例中,第一特征为静态特征,例如表1中的性别,教育程度等。此时,可以直接从第一特征表中读取各个用户针对该第一特征的特征值。
第一特征表中还可能会包含随时间变化的动态特征,例如,表2中的余额宝金额,芝麻分等。在这样的情况下,第一特征表通常会记录针对动态特征在不同采集时间采集的多个特征值,以及该多个特征值对应的采集时间戳。例如,表3示出在表2的基础上包含时间戳的第一特征表。
表3:
账户ID 年龄 余额宝余额 芝麻分 时间戳
Xuxu 30 30k -00000 2月1日
Xuxu 30 30k -00000 2月2日
Xuxu 30 35k 665 2月3日
…… …… …… …… ……
Coco 22 5k 610 2月1日
Coco 22 6k 615 2月2日
Coco 22 5k 615 2月3日
…… …… …… …… ……
Peny123 26 50k 680 2月1日
…… …… …… …… ……
相应的，标签数据表也会包括，用户的类别标签的标注时间，各个用户的标注时间可以相同，也可以不同。在这样的情况下，获取各个用户的第一特征值的过程可以包括，对于每个用户，在针对第一特征采集的多个特征值中，确定采集时间戳早于该用户对应的类别标签的标注时间，且距离该标注时间最近的特征值，作为该用户对应的第一特征值。例如，假定第一特征为表3中的余额宝余额。对于表3中的用户Xuxu，如果标签数据表中该用户的标签标注时间为2月4日，那么从表3中用户Xuxu的多个余额值中选取2月3日的余额值35k，作为其第一特征值。如此，获取到各个用户的第一特征值。
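上述"采集时间早于标注时间、且距标注时间最近"的取值逻辑，可以用如下 pandas 代码示意（假设性实现：user_id、ts、value、label_ts 等列名均为本示例自拟，时间以可比较的数值或时间戳表示）：

```python
import pandas as pd

def pick_latest_before_label(feature_df, label_df):
    # 以用户标识为键，关联特征表与标签数据表中的标注时间
    merged = feature_df.merge(label_df[["user_id", "label_ts"]], on="user_id")
    # 仅保留采集时间戳早于标注时间的记录
    merged = merged[merged["ts"] < merged["label_ts"]]
    # 每个用户取时间戳最大（即距标注时间最近）的那条记录
    idx = merged.groupby("user_id")["ts"].idxmax()
    return merged.loc[idx, ["user_id", "value"]].set_index("user_id")["value"]
```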
然后，将得到的各个第一特征值进行排序，形成第一特征值序列(x_1, x_2, …, x_n)，其中x_i为用户i针对第一特征X的第一特征值。如果第一特征X对应的特征值为数值，则可以直接进行排序。排序可以是从大到小排序，或者从小到大排序。如果第一特征X对应的特征值不是数值，例如教育程度，性别之类的特征，可以按照预定映射关系，将其映射为数值，然后进行排序。
接着在步骤32，利用用户标识信息关联标签数据表和该第一特征表，得到标签值序列(L_1, L_2, …, L_n)，该标签值序列与第一特征值序列(x_1, x_2, …, x_n)关于用户顺序相对齐。具体的，对于第一特征值序列(x_1, x_2, …, x_n)中第i个第一特征值，在步骤31中已知其对应于用户i，然后利用该用户i的用户标识信息，例如账户ID，关联到标签数据表，获取到该用户i的类别标签的标签值L_i。如此得到标签值序列(L_1, L_2, …, L_n)。
接下来，在步骤33，根据第一特征值序列(x_1, x_2, …, x_n)对用户进行分箱。在一个实施例中，根据第一特征值序列中最大值和最小值所限定的取值范围，进行均匀分箱。在另一实施例中，根据第一特征值序列所体现的数据分布，进行自动分箱。在这样的情况下，可以使用另一批用户作为验证集，验证上述第一特征值的数据分布的稳定性。如果该另一批用户的第一特征的特征值也反映出类似的数据分布，则表明该数据分布是稳定的，可以基于该数据分布进行非均匀的自动分箱。
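其中的均匀分箱（等宽分箱），可以借助 pandas 内置的 pd.cut 示意如下（仅为示例，取值与分箱数均为自拟）：

```python
import pandas as pd

# 按第一特征值序列中最大值和最小值限定的取值范围，划分为 3 个等宽分箱
values = pd.Series([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
bin_ids = pd.cut(values, bins=3, labels=False)  # 每个用户得到所属分箱的编号
```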
如此,各个用户被划分到各个分箱中。于是,在步骤34,基于标签值序列,统计各个分箱中用户的标签值分布情况;在步骤35,根据各个分箱的标签值分布情况,确定第一特征的信息价值IV。
以用户分类模型为二分类模型，类别标签具有二值化的情况为例，根据标签值为0还是1，可以将用户划分为正样本和负样本。在步骤34，统计分箱i中正样本个数pos_i，负样本个数neg_i；在步骤35，可以计算分箱i对应的证据权重WOE值：

WOE_i = ln(p_i / n_i)

其中，p_i = pos_i / Σ_j pos_j，为分箱i中正样本数目占全部正样本数目的比例；n_i = neg_i / Σ_j neg_j，为分箱i中负样本数目占全部负样本数目的比例。

进而，可以得到第一特征的IV值：

IV = Σ_i (p_i − n_i) × WOE_i
通过以上方式,针对每个第一特征表中的每项特征,可以确定出其IV值。对于其他的标签值情况,可以通过已有的相应计算方式,根据各个分箱中标签值的分布,确定出特征的IV值。
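上述由各分箱的正/负样本计数计算 WOE 与 IV 的过程，可以用如下代码示意（假设性实现，仅针对二分类标签的情形；为避免除零和 log(0)，示例中加入了平滑项 eps，该处理为示例自拟，非专利原文所述）：

```python
import math

def information_value(pos_counts, neg_counts, eps=1e-10):
    # pos_counts[i] / neg_counts[i]：分箱 i 中正/负样本的个数
    total_pos, total_neg = sum(pos_counts), sum(neg_counts)
    iv = 0.0
    for pos_i, neg_i in zip(pos_counts, neg_counts):
        p_i = pos_i / total_pos + eps  # 分箱 i 正样本占全部正样本的比例
        n_i = neg_i / total_neg + eps  # 分箱 i 负样本占全部负样本的比例
        woe_i = math.log(p_i / n_i)    # 分箱 i 的证据权重 WOE
        iv += (p_i - n_i) * woe_i
    return iv
```

分箱内正负样本分布与总体一致时，IV 接近 0；各分箱分布差异越大，IV 越大，特征的区分能力越强。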
回到图2的步骤22,于是可以基于各项特征的IV值,对特征进行第一筛选操作,得到对应的第二特征表。具体的,可以将各项特征的IV值与一阈值比较,将IV值低于该阈值的特征剔除,保留IV值高于该阈值的特征。实际操作中,可以将该阈值设置为例如0.5。当然也可以根据筛选目标调整该阈值。在此,将第一特征表中基于IV值进行特征剔除之后的特征表,称为第二特征表。如此,得到了N’个第二特征表。由于有可能存在某个第一特征表中所有特征均被剔除的情况,第二特征表的数目N’小于或等于N。
在许多情况下,在以上执行基于IV值的第一阶段筛选之后,保留的特征仍然比较多,第二特征表的数目N’仍然比较大。如前所述,在第一阶段筛选之后,后续将要进行第二阶段的筛选,其中基于特征之间的相关系数进行筛选。需要理解,在计算两项特征,例如第一特征X和第二特征Y之间的相关系数的过程中,需要将该两项特征的特征值序列关于用户进行对齐。在第一特征X和第二特征Y来自于不同的特征表时,上述关于用户的对齐操作即为数据表的关联操作。在关联操作的基本算法中,每对齐一个用户的特征值,就需要遍历特征表的所有表项。在实际场景中,每个特征表包含的用户数目常常在几十万,百万甚至更多的量级,因此,特征表的关联操作需要极大的计算量。如果针对大量的第二特征表中的大量特征项,计算两两特征之间的相关系数,则需要进行大量的表关联操作,会极大地消耗计算资源和计算时间。
考虑到N’个第二特征表中仍然可能存在重复特征,根据本说明书的一个实施例,创新性采用二部图的最少点覆盖思路,从上述N’个第二特征表中,确定出能覆盖所有特征的最少数目的第二特征表,从而对特征表数目进行精简。
具体的,在步骤23,以各个第二特征表为第一类节点,以第二特征表中包含的特征为第二类节点,以第二特征表与特征的包含关系为连接边,构建二部图。
图4示出根据一个实施例基于特征表-特征构建的二部图的示意图。图4左侧一列的节点为第一类节点,每个第一类节点对应一个特征表。右侧一列的节点为第二类节点,每个第二类节点对应一项特征。如果特征表i中包含特征j,则在对应于特征表i的第一类节点i,和对应于特征j的第二类节点j之间构建连接边。可以看到,图4的示意性二部图基于5个特征表以及这5个特征表包括的共计12项特征而建立,因此,共具有5个第一类节点和12个第二类节点。
如前所述,不同特征表有可能重复性记录同一特征,反应在二部图中,表现为,存在多个第一类节点连接到同一个第二类节点,于是,该第二类节点的连接边的数目大于1。为了便于描述,将这样的第二类节点称为重复节点。相应的,将仅有一条连接边的第二类节点称为非重复节点。在图4中,序号为1,5,8,12的第二类节点为重复节点,用深色圆圈表示;其他第二类节点为非重复节点。
接着,在步骤24,在上述二部图中确定出第一节点集合,其中包含连接到所有第二类节点的最小数目的第一类节点,从而得到对应的M个第二特征表。于是,该第一节点集合中包含的第一类节点,即对应于精简后的第二特征表。
确定上述第一节点集合,也就是解决二部图中的最少点覆盖问题,可以通过以下图5所示的反复迭代过程实现。如图5所示,在每次迭代过程,首先在步骤51,在当前二部图包含的第一类节点中,确定出连接边数目最大的节点作为选中节点,将该选中节点添加到选中节点集合。
当前二部图中具有最大连接边数目的第一类节点有可能不止一个。在这样的情况下,在一个例子中,可以从中随机选择一个作为选中节点。不过优选的,在另一例子中,如果存在多个第一类节点具有相同的最大连接边数目,则分别确定该多个第一类节点中各第一类节点所连接的非重复节点的数目,将所连接的非重复节点的数目最大的第一类节点,确定为选中节点。
进一步地,如果仍然存在多于一个第一类节点连接到相同的最大数目的非重复节点,则从该多于一个第一类节点中随机选择一个作为选中节点。
在确定出本轮的选中节点后,在步骤52,从二部图中删除该选中节点以及与该选中节点相连接的第二类节点。在步骤53,根据删除后的第二类节点,更新其余第一类节点的连接边,并删除不再具有连接边的第一类节点。即通过步骤52和53对二部图进行更新。
然后在步骤54,判断更新后的二部图中是否仍然存在节点;如果有,则返回到步骤51,以更新后的二部图作为当前二部图,再次进行循环迭代。直到在某次循环后,在步骤54判断出更新后的二部图中不包含节点,在这样的情况下,在步骤55,将此时的选中节点集合作为上述第一节点集合。
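图5所述的循环迭代，可以用如下代码示意（假设性实现：以"表名 -> 特征集合"的字典表示二部图；当连接边数与非重复节点数都并列时，取迭代顺序中的第一个，以代替原文所述的随机选择，使结果可复现）：

```python
def greedy_table_cover(tables):
    # tables: {第二特征表名: 该表包含的特征集合}
    remaining = {t: set(feats) for t, feats in tables.items() if feats}
    selected = []
    while remaining:
        # 统计每个特征被多少个表包含，用于识别"非重复节点"（仅被一个表包含的特征）
        freq = {}
        for feats in remaining.values():
            for f in feats:
                freq[f] = freq.get(f, 0) + 1
        # 先比连接边数目，再比所连非重复节点数目
        best = max(remaining,
                   key=lambda t: (len(remaining[t]),
                                  sum(1 for f in remaining[t] if freq[f] == 1)))
        selected.append(best)
        # 删除选中节点及其连接的第二类节点，并更新其余表的连接边
        covered = remaining.pop(best)
        for t in list(remaining):
            remaining[t] -= covered
            if not remaining[t]:       # 不再具有连接边的表被删除
                del remaining[t]
    return selected
```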
图6示出针对图4的二部图确定其第一节点集合的过程。
初始的二部图如图4和图6最左侧A部分所示，基于该初始二部图统计各个第一类节点的连接边信息。在一个例子中，将各个第一类节点的连接边信息表示为[a,b]，其中a为第一类节点所连接到的第二类节点的数目，即连接边数目，b为所连接到的非重复节点的数目。如此可以看到，初始二部图中，第一类节点(1)的连接边信息为[3,2]，表示该节点连接到3个第二类节点，其中2个是非重复节点。类似的，节点(2)的连接边信息为[4,2]，节点(3)的连接边信息为[4,2]，节点(4)的连接边信息为[3,0]，节点(5)的连接边信息为[4,2]。通过各个第一类节点的连接边信息可以看到，节点(2)，(4)，(5)均具有最大的连接边数目4，于是进一步判断其中非重复节点的数目。可以看到，这3个节点所连接到的非重复节点的数目也是相同的，均为2，于是，可以从这3个节点中随机选择一个，作为选中节点。假定，在第一轮迭代中，选择了节点(2)，并将其添加到选中节点集合。于是此时，选中节点集合中仅包含节点(2)，可以表示为{(2)}。
接着,如图5的步骤52所示,在二部图中删除该第一类节点(2),同时删除该节点(2)所连接到的4个第二类节点。相应地,在步骤53,更新其余第一类节点的连接边。也就是,将其余第一类节点原本连接到被删除的4个第二类节点的连接边,都相应删除。于是对二部图进行了一次更新,得到B部分所示的二部图作为当前二部图。此时,所有剩余第一类节点仍然具有连接边。
对于B部分所示的二部图,更新各个第一类节点的连接边信息,于是得到:节点(1)为[2,2],节点(3)为[3,2],节点(4)为[2,0],节点(5)为[4,2]。显然,节点(5)的连接边数目最大,因此,在该轮迭代中,将节点(5)作为选中节点,添加到选中节点集合中。此时,选中节点集合为{(2),(5)}。
然后,删除节点(5),以及其连接的所有4个第二类节点(序号为8,10,11,12的第二类节点)。相应的,更新其余第一类节点的连接边,也就是,将其余第一类节点原本连接到8,10,11,12号第二类节点的连接边都相应删除。可以看到,第一类节点(4)原本连接到8和12号第二类节点,随着这两个第二类节点的删除以及连接边的更新,该第一类节点(4)不再具有任何连接边。于是,将该第一类节点(4)也删除。于是,得到C部分所示的二部图作为当前二部图。
对于C部分所示的二部图,将各个第一类节点的连接边信息更新为:节点(1)为[2,2],节点(3)为[2,2]。这两个节点的连接边信息完全相同,从中随机选择一个作为选中节点。假定本轮选择了节点(1)。那么此时选中节点集合为{(2),(5),(1)}。
然后,删除节点(1)及其连接节点,对二部图进行更新,得到D部分所示的二部图。接下来选择节点(3),添加到选中节点集合。然后,在删除节点(3)及其连接节点后,二部图中不再包含任何节点,于是循环迭代结束。此时的选中节点集合为{(2),(5),(1),(3)},就可以作为最少点覆盖的第一节点集合。
可以看到，如此得到的第一节点集合仅包含了4个第一类节点，少于原始的第一类节点数目，但是这4个第一类节点能够覆盖所有12个第二类节点。对应于节点的含义，即意味着，第一节点集合中的第一类节点所表示的第二特征表，能够涵盖备选的所有特征项。于是，通过这样的方式，实现了第二特征表数目的精简，同时不损失任何特征项。
在其他实施例中,也可以通过其他方式实现二部图的最少点覆盖。例如,在每次迭代中,找到其所有连接节点均为重复节点的第一类节点,然后删除这样的第一类节点及其连接边,直到不存在这样的第一类节点。将剩下的节点作为第一节点集合。
简单清楚起见,将根据第一节点集合得到的第二特征表数目记为M。原则上,M小于或等于执行步骤23之前的第二特征表的数目N’。实际操作中,由于特征表中常常会有重复记录的特征项,因此,M相对于N’往往有明显的减小。
在如此得到M个第二特征表的基础上,在步骤25,合并该M个第二特征表,得到综合特征表,并基于该综合特征表,计算特征之间的相关系数。
可以理解,将M个第二特征表合并为综合特征表的过程,即通过数据表的关联操作,将各个第二特征表关联到综合特征表的过程。由于此处M个第二特征表已经经过精简,相对于基于原始的特征表进行关联和合并,可以极大减小计算量。
在得到的综合特征表中,各个特征已经按照用户进行对齐。因此,可以采用各种已有的方式,计算两两特征之间的相关系数。相关系数通常采用Pearson相关系数,可以根据已知的算法来计算。也可以采用其他计算方式,例如Spearman秩相关系数等。
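上述合并 M 个第二特征表并计算两两 Pearson 相关系数的步骤，可以用 pandas 的 merge 与 corr 示意如下（假设各表均以 user_id 列为主键，列名为示例自拟）：

```python
import pandas as pd

def merge_and_corr(tables):
    # 以用户标识为主键，逐个关联各个第二特征表，得到综合特征表
    merged = tables[0]
    for t in tables[1:]:
        merged = merged.merge(t, on="user_id", how="inner")
    # 综合特征表中各特征已按用户对齐，可一次性计算两两相关系数
    return merged.drop(columns=["user_id"]).corr(method="pearson")
```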
接着,在步骤26,基于上述相关系数,对特征进行第二筛选操作,得到多项选中特征,用于训练用户分类模型。具体的,第二筛选操作可以通过以下方式执行。
在一个实施例中,对于综合特征表中每一项特征,如果该特征与任何其他特征之间的相关系数高于预定相关性阈值,例如0.8,则剔除该项特征,如果与所有其他特征之间的相关系数均低于该阈值,则保留该特征。由此进行二次剔除,得到保留特征集合。可以将该保留特征集合中的特征作为选中特征。
在另一实施例中,基于以上的保留特征集合,结合之前确定的特征的IV值,再次进行筛选。具体的,可以将保留特征集合中的各项特征,按照信息价值IV的大小排序,选取IV值较大的预定数目的特征,作为选中特征。
在又一实施例中,对于综合特征表中的每一项特征,可以计算该特征与其他各项特征之间的相关系数的均值。然后,将综合特征表中的各项特征,按照相关系数的均值大小进行排序,选取均值较小的预定数目的特征作为选中特征。当然还可以进一步结合IV值,再次筛选。
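上述"先按相关性阈值剔除、再按 IV 值排序选取"的第二筛选操作，可以用如下代码示意（假设性实现：corr 为以特征名为行列索引的相关系数矩阵，iv 为特征名到 IV 值的映射）：

```python
import pandas as pd

def second_screen(corr, iv, threshold=0.8, top_k=None):
    feats = list(corr.columns)
    # 与任一其他特征的相关系数（绝对值）高于阈值的特征被剔除，得到保留特征集合
    kept = [f for f in feats
            if all(abs(corr.loc[f, g]) <= threshold for g in feats if g != f)]
    # 保留特征按 IV 值从大到小排序，选取预定数目的特征作为选中特征
    kept.sort(key=lambda f: iv[f], reverse=True)
    return kept[:top_k] if top_k is not None else kept
```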
如此,通过多种方式,基于特征之间的相关系数,进行第二阶段的筛选,得到多个选中特征。这多个选中特征于是可以用于用户分类模型的训练。如此,通过图2的方法步骤,针对用户分类模型,进行特征的处理和选择。
进一步地，在这之后，就可以将这些选中特征，输出给用户分类模型进行建模。具体的，可以基于上述多项选中特征，以及标签数据表中的用户标签数据，训练用户分类模型。该用户分类模型具体可以采用树模型，深度神经网络DNN等各种形式实现，树模型又具体包括，例如PS-Smart树模型，GBDT树等。
在利用训练集对用户分类模型进行训练后,可以利用测试集,评估该模型的性能。性能评估可以包括多种评估指标,例如预测准确率,召回率,ROC曲线等等。在性能评估满足预设要求的情况下,例如准确率和召回率均高于70%,则认为模型性能满足要求,进而说明,所选的特征适用于该用户分类模型,于是,在特征池中添加前述选中特征的特征信息,以供其他模型选择。
具体的,在特征池中记录的特征信息可以包括,各项选中特征的特征名,该特征所来自的第一特征表的表名,该特征被模型使用的使用信息。使用信息具体可以是,被各个模型使用的次数。在一个例子中,使用信息还可以包括,使用该特征的模型的描述。
于是,在后续需要训练同类型的模型时,例如针对不同主体,基于不同用户样本集定制多个用户分类模型,而这些用户分类模型均用于预测相同的用户分类,例如均用于预测用户风险,此时,就可以参照特征池中记录的特征信息,进行特征选择。例如,可以根据特征被各个同类模型使用的次数,确定出高频使用特征,根据该特征所来自的第一特征表的表名,直接从中获取所需的特征值数据进行模型训练。如此,同类型的其他模型可以不必从零开始重新进行特征处理,而是基于特征池中记录的信息,快速进行特征的选择。
在一种情况下,在利用图2方式得到的选中特征进行用户分类模型的训练后,评估结果不够理想。此时,可以采用增强方式,进一步扩展特征。
具体的,如果前述训练得到的用户分类模型的性能评估不满足预设要求,则可以使用若干特征衍生工具,生成若干衍生特征,形成衍生特征表。这些衍生特征可以包括,基于基础特征的累积特征(例如基于单笔消费特征得到的一段时间内的累积消费次数,累积消费金额等),基于基础特征的组合特征(例如对多个不同的特征项进行组合运算),序列特征(例如基于多次操作形成的操作序列特征),与用户关系网络相关的图特征,等等。这些衍生特征可以具有更复杂的形式(例如序列向量形式),更抽象的含义(例如进行图嵌入后得到的特征),用于对原始的特征进行扩展和补充。
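以基于"单笔消费"基础特征衍生累积特征为例，一种最简的衍生方式可以示意如下（假设性实现：user_id、amount 等列名为本示例自拟，并非专利所述特征衍生工具的真实接口）：

```python
import pandas as pd

def derive_cumulative(tx):
    # tx: 每行一笔消费记录；按用户聚合出累积消费次数与累积消费金额
    g = tx.groupby("user_id")["amount"]
    return pd.DataFrame({
        "cum_count": g.count(),   # 一段时间内的累积消费次数
        "cum_amount": g.sum(),    # 一段时间内的累积消费金额
    }).reset_index()
```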
于是,可以将该衍生特征表合并到前述综合特征表中,得到更新的综合特征表。该更新的综合特征表中包含原有特征和上述衍生特征。然后基于该更新的综合特征表,计算特征之间的相关系数;并基于相关系数,再次对特征进行第二筛选操作,得到扩展的选中特征。利用这些扩展的选中特征,再次训练用户分类模型。
如果再次训练的用户分类模型的性能评估结果满足要求，则将上述扩展的选中特征的特征信息记录在前述特征池中。如果再次训练的用户分类模型的性能仍然没有达到性能要求，则判断再次训练的用户分类模型相对于前次训练的用户分类模型，性能的提升量，例如预测准确率的提升值。如果提升量高于一定阈值，则将扩展的选中特征的特征信息记录到特征池中；如果提升量不高于上述阈值，则仍然将之前步骤26得到的选中特征的特征信息记录到特征池中。如此，采用特征衍生的增强方式，进一步扩展特征，优化特征池中的特征信息的有效性。
回顾以上过程,针对用户分类模型的特征处理方案,总体上通过两阶段的特征筛选进行,其中在第二阶段基于特征间相关系数的筛选之前,通过二部图中的最少点覆盖原则,对特征表数目进行精简,从而极大地加快特征间相关系数的计算过程,进而加快特征筛选过程。进一步地,通过将选中的特征的相关信息添加到特征池中,来加速同类型的其他模型的特征选择过程,由此实现多个模型的快速建模。更进一步地,还可以通过特征衍生的方式,进一步对特征进行丰富和扩展,从而更有利于自动建模的效果。
根据另一方面的实施例,提供了一种针对用户分类模型进行特征处理的装置,该装置可以部署在任何具有计算、处理能力的设备、平台或设备集群中。图7示出根据一个实施例的特征处理装置的示意性框图。如图7所示,该装置700包括:
第一获取单元71,配置为获取标签数据表以及获取N个第一特征表,所述标签数据表中包括用户的类别标签,每个所述第一特征表记录用户的若干项特征;
第一筛选单元72,配置为针对每个第一特征表,结合所述标签数据表确定各项特征的信息价值IV,基于所述信息价值IV对特征进行第一筛选操作,得到对应的第二特征表;
二部图构建单元73,配置为以各个第二特征表为第一类节点,以所述第二特征表中包含的特征为第二类节点,以第二特征表与特征的包含关系为连接边,构建二部图;
节点集确定单元74,配置为在所述二部图中确定出第一节点集合,其中包含连接到所有第二类节点的最小数目的第一类节点,从而得到与该第一节点集合中的第一类节点对应的M个第二特征表;
相关性计算单元75,配置为合并所述M个第二特征表,得到综合特征表,并基于该综合特征表,计算特征之间的相关系数;
第二筛选单元76,配置为基于所述相关系数,对特征进行第二筛选操作,得到多项选中特征,用于训练所述用户分类模型。
在一个实施例中,第一获取单元71配置为,从多个数据平台获取各自统计的用户特征表,作为第一特征表。
在另一实施例中,所述标签数据表中还包括用户的至少一项特征;在这样的情况下,第一获取单元71可以配置,基于该至少一项特征,生成第一特征表。
在不同实施例中,用户的类别标签可以包括以下之一:用户的风险等级标签,用户所属的营销人群标签,用户的信用等级标签。
根据一种实施方式,该装置700还包括预处理单元(未示出),配置为对各个第一特征表进行预处理,所述预处理包括:统计各项特征的特征值缺失率,将缺失率大于预定缺失阈值的特征剔除;对于各个第一特征表中保留的特征,用统一的缺省值替代缺失的特征值。
根据一个实施例,第一特征表和标签数据表均以用户标识信息为主键,所述用户标识信息包括以下之一:账户ID、手机号、邮箱地址。
在一个实施例中,第一筛选单元72具体配置为通过以下方式确定各项特征的IV值:从任意一个第一特征表中获取各个用户针对任意的第一特征的第一特征值,将各个第一特征值排序形成第一特征值序列;利用用户标识信息关联标签数据表和该第一特征表,得到标签值序列,该标签值序列与第一特征值序列关于用户顺序相对齐;根据所述第一特征值序列对用户进行分箱;基于所述标签值序列,统计各个分箱中所述类别标签的标签值分布情况;根据各个分箱的标签值分布情况,确定所述第一特征的信息价值IV。
进一步地,在一个实施例中,标签数据表还包括,所述类别标签的标注时间;所述第一特征表包括,用户针对所述第一特征在不同采集时间采集的多个特征值,以及该多个特征值对应的采集时间戳;在这样的情况下,第一筛选单元72通过以下方式获取上述第一特征值:对于每个用户,在针对第一特征采集的多个特征值中,确定采集时间戳早于所述标注时间,且距离所述标注时间最近的特征值,作为该用户针对第一特征的特征值。
根据一个实施例,节点集确定单元74具体配置为,在当前二部图包含的第一类节点中,确定出连接边数目最大的节点作为选中节点,将该选中节点添加到选中节点集合;更新当前二部图,包括,删除该选中节点以及与该选中节点相连接的第二类节点;根据删除后的第二类节点,更新其余第一类节点的连接边,并删除不再具有连接边的第一类节点;重复执行以上步骤,直到更新后的二部图不包含任何节点,将此时的选中节点集合作为所述第一节点集合。
在以上实施例的一个例子中,节点集确定单元74具体配置为,如果存在多个第一类节点具有相同的最大连接边数目,则分别确定该多个第一类节点中各第一类节点所连接的非重复节点的数目,所述非重复节点为,仅有一条连接边的第二类节点;将所连接的非重复节点的数目最大的第一类节点,确定为所述选中节点。
更进一步的,节点集确定单元74还可以配置为,如果存在多于一个第一类节点连接到相同的最大数目的非重复节点,则从该多于一个第一类节点中随机选择一个作为所述选中节点。
根据一种实施方式,第二筛选单元76具体配置为:对于所述综合特征表中每一项特征,如果该特征与任何其他特征之间的相关系数高于预定相关性阈值,则剔除该项特征,由此得到保留特征集合;基于该保留特征集合,确定所述多项选中特征。
进一步地，在一个实施例中，第二筛选单元76可以将所述保留特征集合中的各项特征按照信息价值IV的大小排序，选取IV值较大的预定数目的特征，作为所述多项选中特征。
根据另一种实施方式,第二筛选单元76可以通过以下方式执行第二筛选操作:对于所述综合特征表中每一项特征,计算该特征与其他各项特征之间的相关系数的均值;将所述综合特征表中的各项特征,按照相关系数的均值大小进行排序,选取均值较小的预定数目的特征作为所述多项选中特征。
根据一种实施方式,上述装置700还可以包括(未示出)模型训练和评估单元,配置为基于所述多项选中特征,以及所述标签数据表,训练所述用户分类模型,并评估其性能;以及包括特征添加单元,配置为在所述用户分类模型的性能评估满足预设要求的情况下,在特征池中添加所述多项选中特征的特征信息,以供其他预测模型选择。
在一个具体例子中,所述多项选中特征的特征信息包括,各项选中特征的特征名,该特征所来自的第一特征表的表名,该特征被模型使用的使用信息。
在一个实施例中,上述装置还可以包括特征衍生单元(未示出),配置为在训练的用户分类模型的性能评估不满足预设要求的情况下,使用若干特征衍生工具,生成若干衍生特征,形成衍生特征表;将所述衍生特征表合并到所述综合特征表中,得到更新的综合特征表;相关性计算单元75还配置为,基于该更新的综合特征表,计算特征之间的相关系数;第二筛选单元76还配置为,基于所述相关系数,再次对特征进行第二筛选操作,得到扩展的选中特征,用于再次训练所述用户分类模型。
在具体例子中,所述若干衍生特征包括以下中的一项或多项:基于基础特征的累积特征,基于基础特征的组合特征,序列特征,与用户关系网络相关的图特征。
通过以上装置,针对用户分类模型实现特征的处理和选择。
根据另一方面的实施例,还提供一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行结合图2所描述的方法。
根据再一方面的实施例,还提供一种计算设备,包括存储器和处理器,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现结合图2所述的方法。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本发明所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。
以上所述的具体实施方式,对本申请的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本申请的具体实施方式而已,并不用于限定本申请的保护范围,凡在本发明的技术方案的基础之上,所做的任何修改、等同替换、改进等,均应包括在本发明的保护范围之内。

Claims (21)

  1. 一种针对用户分类模型进行特征处理的方法,包括:
    获取标签数据表以及获取N个第一特征表,所述标签数据表中包括用户的类别标签,每个所述第一特征表记录用户的若干项特征;
    针对每个第一特征表,结合所述标签数据表确定各项特征的信息价值IV,基于所述信息价值IV对特征进行第一筛选操作,得到对应的第二特征表;
    以各个第二特征表为第一类节点,以所述第二特征表中包含的特征为第二类节点,以第二特征表与特征的包含关系为连接边,构建二部图;
    在所述二部图中确定出第一节点集合,其中包含连接到所有第二类节点的最小数目的第一类节点,从而得到与该第一节点集合中的第一类节点对应的M个第二特征表;
    合并所述M个第二特征表,得到综合特征表,并基于该综合特征表,计算特征之间的相关系数;
    基于所述相关系数,对特征进行第二筛选操作,得到多项选中特征,用于训练所述用户分类模型。
  2. 根据权利要求1所述的方法,其中,获取N个第一特征表包括,从多个数据平台获取其各自统计的用户特征表,作为所述第一特征表。
  3. 根据权利要求1所述的方法,其中,所述标签数据表中还包括用户的至少一项特征;所述获取N个第一特征表包括:基于所述至少一项特征,生成第一特征表。
  4. 根据权利要求1所述的方法,其中,所述用户的类别标签包括以下之一:用户的风险等级标签、用户所属的营销人群标签、用户的信用等级标签。
  5. 根据权利要求1所述的方法,其中,在结合所述标签数据表确定各项特征的信息价值IV之前,还包括对各个第一特征表进行预处理,所述预处理包括:
    统计各项特征的特征值缺失率,将缺失率大于预定缺失阈值的特征剔除;
    对于各个第一特征表中保留的特征,用统一的缺省值替代缺失的特征值。
  6. 根据权利要求1所述的方法,其中,所述第一特征表和所述标签数据表均以用户标识信息为主键,所述用户标识信息包括以下之一:账户ID、手机号、邮箱地址。
  7. 根据权利要求6所述的方法,其中,结合所述标签数据表确定各项特征的信息价值IV,包括:
    从任意一个第一特征表中获取各个用户针对任意的第一特征的第一特征值,将各个第一特征值排序形成第一特征值序列;
    利用用户标识信息关联标签数据表和该第一特征表，得到标签值序列，该标签值序列与第一特征值序列关于用户顺序相对齐；
    根据所述第一特征值序列对用户进行分箱;
    基于所述标签值序列,统计各个分箱中所述类别标签的标签值分布情况;
    根据各个分箱的标签值分布情况,确定所述第一特征的信息价值IV。
  8. 根据权利要求7所述的方法,其中,所述标签数据表还包括,所述类别标签的标注时间;所述第一特征表包括,用户针对所述第一特征在不同采集时间采集的多个特征值,以及该多个特征值对应的采集时间戳;
    从任意一个第一特征表中获取各个用户针对任意的第一特征的第一特征值,包括:对于每个用户,在针对第一特征采集的多个特征值中,确定采集时间戳早于所述标注时间,且距离所述标注时间最近的特征值,作为该用户针对第一特征的特征值。
  9. 根据权利要求1所述的方法,其中,在所述二部图中确定出第一节点集合,包括:
    在当前二部图包含的第一类节点中,确定出连接边数目最大的节点作为选中节点,将该选中节点添加到选中节点集合;
    更新当前二部图,包括,删除该选中节点以及与该选中节点相连接的第二类节点;根据删除后的第二类节点,更新其余第一类节点的连接边,并删除不再具有连接边的第一类节点;
    重复执行以上步骤,直到更新后的二部图不包含任何节点,将此时的选中节点集合作为所述第一节点集合。
  10. 根据权利要求9所述的方法,其中,在当前二部图包含的第一类节点中,确定出连接边数目最大的节点作为选中节点,包括:
    如果存在多个第一类节点具有相同的最大连接边数目,则分别确定该多个第一类节点中各第一类节点所连接的非重复节点的数目,所述非重复节点为,仅有一条连接边的第二类节点;
    将所连接的非重复节点的数目最大的第一类节点,确定为所述选中节点。
  11. 根据权利要求10所述的方法,其中,将所连接的非重复节点的数目最大的第一类节点,确定为所述选中节点,包括:
    如果存在多于一个第一类节点连接到相同的最大数目的非重复节点,则从该多于一个第一类节点中随机选择一个作为所述选中节点。
  12. 根据权利要求1所述的方法,其中,基于所述相关系数,对特征进行第二筛选操作,得到多项选中特征,具体包括:
    对于所述综合特征表中每一项特征,如果该特征与任何其他特征之间的相关系数高于预定相关性阈值,则剔除该项特征,由此得到保留特征集合;
    基于该保留特征集合,确定所述多项选中特征。
  13. 根据权利要求12所述的方法,其中,基于该保留特征集合,确定所述多项选中特征,包括:
    将所述保留特征集合中的各项特征按照信息价值IV的大小排序,选取IV值较大的预定数目的特征,作为所述多项选中特征。
  14. 根据权利要求1所述的方法,其中,基于所述相关系数,对特征进行第二筛选操作,得到多项选中特征,包括:
    对于所述综合特征表中每一项特征,计算该特征与其他各项特征之间的相关系数的均值;
    将所述综合特征表中的各项特征,按照相关系数的均值大小进行排序,选取均值较小的预定数目的特征作为所述多项选中特征。
  15. 根据权利要求1所述的方法,其中,在所述得到多项选中特征之后,还包括:
    基于所述多项选中特征,以及所述标签数据表,训练所述用户分类模型,并评估其性能;
    在所述用户分类模型的性能评估满足预设要求的情况下,在特征池中添加所述多项选中特征的特征信息,以供其他预测模型选择。
  16. 根据权利要求15所述的方法,其中,所述多项选中特征的特征信息包括:各项选中特征的特征名、该特征所来自的第一特征表的表名、该特征被模型使用的使用信息。
  17. 根据权利要求15所述的方法,其中,在训练所述用户分类模型,并评估其性能之后,还包括:
    在所述用户分类模型的性能评估不满足预设要求的情况下,使用若干特征衍生工具,生成若干衍生特征,形成衍生特征表;
    将所述衍生特征表合并到所述综合特征表中,得到更新的综合特征表;并基于该更新的综合特征表,计算特征之间的相关系数;
    基于所述相关系数,对特征进行所述第二筛选操作,得到扩展的选中特征,用于再次训练所述用户分类模型。
  18. 根据权利要求17所述的方法，其中，所述若干衍生特征包括以下中的一项或多项：基于基础特征的累积特征、基于基础特征的组合特征、序列特征、与用户关系网络相关的图特征。
  19. 一种针对用户分类模型进行特征处理的装置,包括:
    第一获取单元,配置为获取标签数据表以及获取N个第一特征表,所述标签数据表中包括用户的类别标签,每个所述第一特征表记录用户的若干项特征;
    第一筛选单元,配置为针对每个第一特征表,结合所述标签数据表确定各项特征的信息价值IV,基于所述信息价值IV对特征进行第一筛选操作,得到对应的第二特征表;
    二部图构建单元,配置为以各个第二特征表为第一类节点,以所述第二特征表中包含的特征为第二类节点,以第二特征表与特征的包含关系为连接边,构建二部图;
    节点集确定单元,配置为在所述二部图中确定出第一节点集合,其中包含连接到所有第二类节点的最小数目的第一类节点,从而得到与该第一节点集合中的第一类节点对应的M个第二特征表;
    相关性计算单元,配置为合并所述M个第二特征表,得到综合特征表,并基于该综合特征表,计算特征之间的相关系数;
    第二筛选单元,配置为基于所述相关系数,对特征进行第二筛选操作,得到多项选中特征,用于训练所述用户分类模型。
  20. 一种计算机可读存储介质，其上存储有计算机程序，当所述计算机程序在计算机中执行时，令计算机执行权利要求1-18中任一项所述的方法。
  21. 一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现权利要求1-18中任一项所述的方法。
PCT/CN2020/134499 2020-02-17 2020-12-08 针对用户分类模型进行特征处理的方法及装置 WO2021164382A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010097814.7 2020-02-17
CN202010097814.7A CN111291816B (zh) 2020-02-17 2020-02-17 针对用户分类模型进行特征处理的方法及装置


Also Published As

Publication number Publication date
CN111291816B (zh) 2021-08-06
CN111291816A (zh) 2020-06-16

