WO2021169454A1 - Method and apparatus for graph feature processing - Google Patents

Method and apparatus for graph feature processing

Info

Publication number
WO2021169454A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
graph
feature
user
Prior art date
Application number
PCT/CN2020/132654
Other languages
English (en)
French (fr)
Inventor
张屹綮
张天翼
王维强
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 filed Critical 支付宝(杭州)信息技术有限公司
Publication of WO2021169454A1 publication Critical patent/WO2021169454A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Definitions

  • One or more embodiments of this specification relate to the field of machine learning, and in particular to a method and device for processing graph features for a user classification model.
  • With the rapid development of artificial intelligence and machine learning, machine learning models have begun to be used for business analysis in a variety of business scenarios. For example, in many application scenarios, users need to be classified and identified, for example, to identify a user's risk level, or to distinguish the group to which a user belongs. For this purpose, it is often necessary to train user classification models to perform business-related user identification and user classification.
  • The selection and processing of features is the basis of model training.
  • For a user classification model, in order to train a model with excellent performance and accurate predictions, it is necessary to select, from a large number of user features, those that are more relevant to the prediction target and better reflect the characteristics of the users, and to use them for model training.
  • In the simplest scenarios, selecting features from the users' basic attribute features is sufficient for the trained model to meet the requirements.
  • However, as business scenarios become more complex, the users' basic attribute features are often not rich and comprehensive enough to meet the performance requirements of model training.
  • Generating graph features based on user relationship networks is one way of supplementing these features.
  • However, a network graph is a relatively complex data structure whose analysis requires a large amount of computation; efficiently extracting from it meaningful features suitable for model training is a difficulty and a challenge.
  • One or more embodiments of this specification describe a method and device for processing graph features for a user classification model, which can efficiently generate rich graph features, thereby facilitating feature selection and training of the user classification model.
  • A method for graph feature processing includes: constructing a relationship network graph from relationship data, the relationship data including records of interaction events in which users participate, and the relationship network graph including a plurality of nodes and directed edges between nodes formed on the basis of the interaction events, the plurality of nodes including user nodes; splitting the relationship network graph into multiple subgraphs, including a first subgraph used for training a user classification model; for each node in the first subgraph, obtaining low-order features of the node, the low-order features including at least the degree of the node; converting the first subgraph into an undirected graph; for each node in the undirected graph, obtaining high-order features of the node, the high-order features including multi-order H-indexes, where each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node; and generating, based on at least the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
  • the interaction event is an event performed by a user with the aid of a medium; the multiple nodes further include a media node; and the directed edge is a directed connection edge between the user node and the media node.
  • the interaction event is specifically a login event or an authentication event
  • the information of the media node includes one or more of the following: device identification information, network environment information, and authentication media information.
  • According to another embodiment, the interaction event is a directional interaction event between users, and the user nodes include first-type nodes and second-type nodes; the directed edge is a connecting edge pointing from a first-type node to a second-type node.
  • the interaction event may be a transaction event.
  • the first type of node is a buyer node
  • the second type of node is a seller node
  • the interaction event may be a transfer event.
  • the node of the first type is the transfer-out node
  • the node of the second type is the beneficiary node.
  • Performing graph filtering on the relationship network graph includes removing from the relationship network graph several nodes that do not meet the training needs of the user classification model, as well as the connecting edges corresponding to those nodes.
  • The removed nodes may include one or more of the following: invalid nodes that do not conform to a predetermined format; nodes whose number of connecting edges is greater than a certain threshold; nodes on a whitelist; and, in the case where the interaction events involve funds, nodes whose funds flow within a predetermined time period exceeds a predetermined threshold.
  • The relationship network graph is split into multiple subgraphs in the following manner: according to the time period in which the interaction event corresponding to each directed edge in the relationship network graph occurred, the relationship network graph is split into multiple subgraphs, each subgraph corresponding to one time period; the time period corresponding to the labelling time of the label data used for training the user classification model is determined, and the subgraph corresponding to that time period is determined as the first subgraph.
  • Alternatively, the relationship network graph is split into multiple subgraphs in the following manner: according to the geographic area in the basic attributes of the user nodes, the relationship network graph is split into multiple subgraphs, each subgraph corresponding to one geographic area; the subgraph corresponding to the geographic area of the user sample set in the label data used for training the user classification model is determined as the first subgraph.
  • the relationship network graph is a homogenous graph.
  • The obtained low-order features of a node further include: the number and proportion of dual nodes among the neighbor nodes connected to the node, where a dual node is a user node that serves as both a first-type node and a second-type node in the relationship network graph.
  • Converting the first subgraph into an undirected graph specifically includes: converting the directed edges in the first subgraph into undirected edges, and merging the duplicate nodes therein, to obtain the undirected graph.
  • For any order of H-index, when the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H cannot be determined, the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than H is used as the H-index of that order.
  • Generating the candidate feature set specifically includes: for each node, obtaining statistical features according to the statistical results of the low-order features and high-order features of its neighbor nodes, and including the statistical features in the candidate feature set; the statistical results include one or more of the following: maximum, minimum, mean, median and mode.
  • the method further includes: acquiring label data used to train the user classification model, the label data including a user sample set and a category label of each user sample therein; and mapping the user sample set to The first node set in the first subgraph; according to the feature value distribution and label value distribution of each feature in the candidate feature set on the first node set, feature screening is performed to obtain The feature set of the user classification model.
  • The process of feature screening may specifically include: determining the information value IV of each feature according to the feature value distribution of each feature and the label value distribution, and performing a first screening operation on the features based on the information value IV;
  • for the features retained after the first screening operation, calculating correlation coefficients between the retained features, and performing a second screening operation based on the correlation coefficients to obtain the feature set.
  • a feature record table is also generated to record the description information of each feature in the feature set.
  • A graph feature processing apparatus includes: a graph construction unit configured to construct a relationship network graph from relationship data, the relationship data including records of interaction events in which users participate, and the relationship network graph including a plurality of nodes and directed edges between nodes formed on the basis of the interaction events, the plurality of nodes including user nodes; a graph splitting unit configured to split the relationship network graph into multiple subgraphs, including a first subgraph used for training a user classification model; a low-order feature acquisition unit configured to acquire, for each node in the first subgraph, low-order features of the node, the low-order features including at least the degree of the node; a graph conversion unit configured to convert the first subgraph into an undirected graph; a high-order feature acquisition unit configured to acquire, for each node in the undirected graph, high-order features of the node, the high-order features including multi-order H-indexes, where each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node; and a feature set generating unit configured to generate, based on at least the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • a computing device including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented .
  • a relationship network graph is constructed based on the interaction events participated by the user, and graph features are extracted from it.
  • graph features not only include low-level features such as the degree of nodes, but also innovatively introduce H index as high-level graph features. In this way, a richer graph feature of each node is obtained, which is used for feature selection and training of the user analysis model.
  • FIG. 1 is a schematic diagram of a graph feature processing procedure according to an embodiment disclosed in this specification.
  • Fig. 2 shows a flowchart of a method for graph feature processing for a user classification model according to an embodiment.
  • Figure 3 shows an example of a homogeneous graph according to one embodiment.
  • Figure 4 shows an example of transforming a homogeneous graph according to an embodiment.
  • Fig. 5 shows a schematic block diagram of a graph feature processing apparatus according to an embodiment.
  • An end-to-end graph feature processing solution is provided, which can generate a relationship network graph based on relationship data recording user interaction events, and extract low-order graph features and high-order graph features of the nodes as candidate features for screening and training of the user classification model.
  • FIG. 1 is a schematic diagram of a graph feature processing procedure according to an embodiment disclosed in this specification.
  • the relational data is used to record the event records of the interaction events in which the user participates;
  • the relational network graph constructed accordingly includes user nodes, and the connection edges between the nodes are established based on the interaction events.
  • the directionality of the interaction event may be considered to establish a directional connection edge.
  • the relationship network graph can be embodied as a bipartite graph.
  • some filtering processing can be performed on the relationship network graph constructed above to remove some nodes and edges that do not need to be analyzed.
  • the relational network diagram can be split into sub-graphs to facilitate subsequent processing.
  • node features can be extracted.
  • the extracted node features include low-order features and high-order features, where the low-order features include at least the degree of the node.
  • The H-index used in other fields is innovatively applied to graph analysis as a high-order graph feature.
  • The H-index of a node in the relationship network graph refers to the number of at most H neighbor nodes whose degree is greater than or equal to H. Further, multi-order H-indexes can be obtained iteratively. In this way, richer high-order features of each node are obtained.
  • the above low-level features, high-level features, and optional statistical features together constitute a candidate feature set.
  • the candidate feature set contains graph features generated and extracted based on the relational network graph, and these graph features, especially node high-order features, are essentially different from the features extracted in a conventional manner.
  • a feature set suitable for the user classification model can be selected from the candidate feature set, which helps to train a user classification model with better performance.
  • Fig. 2 shows a flowchart of a method for graph feature processing for a user classification model according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 2, the graph feature processing method includes at least the following steps.
  • a relationship network diagram is constructed based on the relationship data.
  • the relationship data includes event records of interactive events in which the user participates; accordingly, the constructed relationship network graph includes multiple nodes and directed edges between nodes formed based on the foregoing interactive events, and the multiple nodes include User node.
  • the aforementioned interaction event may be an event that the user participates in and is helpful to the classification prediction target of the user classification model.
  • For example, when the user classification model is used to evaluate a user's transaction risk, a relationship network graph can be constructed based on transaction events; when the user classification model is used to evaluate a user's login risk, a relationship network graph can be constructed based on login events; when the user classification model is used to determine the marketing group to which a user belongs, a relationship network graph can be constructed based on discount write-off events; and so on.
  • the interaction event may be an interaction between users, and may also involve other objects.
  • In these two cases, the resulting relationship network graph is a homogeneous graph or a heterogeneous graph, respectively.
  • the aforementioned interaction event is an event performed by a user with the aid of a medium, which involves a medium object.
  • the constructed relationship network graph is a heterogeneous graph, which includes not only user nodes, but also media nodes.
  • the connecting edge is the directed connecting edge between the user node and the media node.
  • the interaction event may be a login event, in which the user logs in with the help of a specific device and network environment.
  • the media node may include a device node, and/or a network environment node.
  • the device node may be represented by device identification information, which may specifically include device identification such as the MAC address of the device, the SIM number of the mobile phone, UMID, APDID, and so on.
  • the network environment node can show the network environment information when logging in, for example, IP address, wifi network identification, and so on. If a user logs in by means of a certain medium, a connecting edge is constructed between the user node corresponding to the user and the medium node corresponding to the medium.
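  • For illustration only (not part of the patent text): a minimal sketch of how such a heterogeneous user–medium graph might be built for login events, assuming the login records are available as Python dicts and using the networkx library; all field names and identifiers are hypothetical, and the edge direction is an illustrative choice since the text only says the connecting edges are directed.

```python
# Sketch (assumption): build a heterogeneous graph with user nodes and medium
# nodes (device, network environment) from login records, one directed
# connecting edge per (user, medium) pair observed in a login event.
import networkx as nx

login_records = [
    {"user": "u_001", "device": "AA:BB:CC:DD:EE:01", "ip": "203.0.113.7"},
    {"user": "u_002", "device": "AA:BB:CC:DD:EE:01", "ip": "198.51.100.9"},
]

g = nx.DiGraph()
for rec in login_records:
    user = ("user", rec["user"])
    device = ("device", rec["device"])   # medium node: device identification
    ip = ("ip", rec["ip"])               # medium node: network environment
    g.add_node(user, node_type="user")
    g.add_node(device, node_type="medium")
    g.add_node(ip, node_type="medium")
    g.add_edge(user, device)             # user -> medium (direction illustrative)
    g.add_edge(user, ip)
```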
  • the interaction event may be an authentication event, in which the user performs identity authentication by means of some authentication media.
  • the media node can be the above-mentioned authentication medium, for example, the credit card number, ID number, mobile phone number, etc. used for authentication. If a user uses a certain authentication medium for identity authentication, a connection edge is constructed between the user node corresponding to the user and the media node corresponding to the authentication medium.
  • There are other specific examples of heterogeneous graphs, which are not enumerated here.
  • user nodes can be regarded as one type of node, and other objects can be regarded as another type of node.
  • the heterogeneous graph thus obtained can be a bipartite graph.
  • the aforementioned interaction event is a directional interaction event between users.
  • the constructed relationship network graph is a homogenous graph, in which all nodes are user nodes.
  • Each user node can be represented by user identification information, where the user identification information can specifically take the form of account ID, mobile phone number, email address, etc.
  • user nodes can be divided into two types of nodes, called the first type of node and the second type of node.
  • the first type of node corresponds to the event starting point
  • the second type of node corresponds to the event target.
  • the connected edges in the homogenous graph are directed edges from the nodes of the first type to the nodes of the second type.
  • the aforementioned interaction event is a transaction event.
  • the first type of node corresponds to the buyer user
  • the second type of node corresponds to the seller user.
  • the corresponding relational data is a transaction record table, in which each row records one transaction.
  • the transaction record table may include, for example, 4 columns of data: buyer account, seller account, transaction amount, and transaction time.
  • each account in the column of buyer accounts can be used as the first type of node
  • each account in the seller account column can be used as a second-type node, and a directed edge, pointing from the buyer account to the seller account, can be established between the buyer account and the seller account that appear in the same transaction.
  • If the same account is sometimes recorded as a buyer in the buyer account column and sometimes as a seller in the seller account column in different transactions, then when constructing the graph the account is recorded in both the first-type nodes and the second-type nodes; that is, the account is represented separately as one first-type node and one second-type node.
  • transaction amount and transaction time can be used as edge attribute information.
  • the same set of buyer accounts and seller accounts may have conducted multiple transactions.
  • information such as the number of transactions may also be included in the side attribute information.
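  • For illustration only (not part of the patent text): a minimal sketch, assuming the transaction record table is available as a list of tuples and using the networkx library, of how the directed buyer-to-seller graph described above might be built, with transaction amount, time and count kept as edge attributes; the account names are hypothetical and the attribute aggregation is one possible choice.

```python
# Sketch (assumption): build the directed first-type (buyer) -> second-type
# (seller) graph from a 4-column transaction table.  The same account appearing
# in both columns is represented as two separate nodes, as described above.
import networkx as nx

transactions = [
    # (buyer_account, seller_account, amount, time)
    ("acct_1", "acct_2", 35.0, "2020-07-02 10:13:00"),
    ("acct_3", "acct_2", 12.5, "2020-07-05 18:40:00"),
    ("acct_2", "acct_4", 99.0, "2020-08-06 09:01:00"),  # acct_2 also acts as buyer
]

g = nx.DiGraph()
for buyer, seller, amount, ts in transactions:
    u, v = ("buyer", buyer), ("seller", seller)
    if g.has_edge(u, v):
        # the same buyer/seller pair traded more than once: accumulate attributes
        g[u][v]["amount"] += amount
        g[u][v]["count"] += 1
        g[u][v]["times"].append(ts)
    else:
        g.add_edge(u, v, amount=amount, count=1, times=[ts])
```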
  • the interaction event may be a transfer event.
  • the first type of node is the transferor node
  • the second type of node is the payee node.
  • The interaction event may be a social event that includes a certain behavior, such as a call behavior or a sharing behavior.
  • In this case, the first-type node corresponds to the initiator of the behavior, such as the caller or the sharing initiator,
  • and the second-type node corresponds to the recipient of the behavior, such as the called party, the sharing recipient, and so on.
  • There are other specific examples of homogeneous graphs, which are not enumerated here.
  • Since the user nodes are divided into two types, the homogeneous graph obtained in this case can also be regarded as a bipartite graph.
  • The relationship network graph can record its topology in the form of a table; for example, it can be recorded as an adjacency list, or as a two-column table whose columns hold the source node and target node of each directed edge, with one row per connecting edge.
  • the preprocessing operation may include a graph filtering operation, that is, removing nodes and related connecting edges that do not meet the user classification model training requirements from the above-mentioned relational network graph.
  • the graph filtering operation may include first removing some invalid nodes.
  • Invalid nodes are nodes that do not meet the format requirements, and mainly include invalid nodes caused by node format errors during data transmission. In actual business, invalid nodes are mostly medium nodes, including UMID, APDID, SIM and other nodes. If the format does not meet the standard format, the node and the edges connected to the node are all removed.
  • the graph filtering operation may also include removing nodes whose number of connected edges is greater than a certain threshold. Such nodes can be called hotspot nodes.
  • In business, different thresholds are set for different relationship data. For example, for transaction events, nodes with more than 300 connecting edges are regarded as hotspot nodes, and in a heterogeneous graph a threshold of more than 1000 connecting edges can be set for medium nodes.
  • In the case where the interaction events involve funds, nodes whose funds flow within a predetermined time period exceeds a predetermined threshold can be removed in the graph filtering step, for example, nodes whose single-day transaction volume reaches 100,000.
  • a whitelist may be preset, which contains nodes whose classification status is known and does not need to be analyzed, such as the accounts of known merchants. In this case, you can remove the nodes in the whitelist and related connected edges in the graph filtering operation.
  • the above-mentioned hotspot nodes, whitelist nodes, and nodes with a large transaction volume can usually be classified by other rules, and they are often not used as training samples for user classification models.
  • the connection structure of such nodes is usually more complicated. Therefore, removing these nodes during preprocessing can simplify the relational network graph, thereby facilitating subsequent graph calculation and analysis, and at the same time, does not affect the selection of training data for the user classification model.
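  • For illustration only (not part of the patent text): a sketch of the graph filtering step under the assumptions of the earlier sketches (nodes are (type, id) tuples); the id format check, the hotspot threshold and the whitelist are hypothetical placeholders. Funds-based removal could be handled analogously by first aggregating the "amount" edge attribute per node.

```python
# Sketch (assumption): remove invalid-format nodes, hotspot nodes and
# whitelisted nodes; removing a node also removes its connecting edges.
import re

def filter_graph(g, whitelist=frozenset(), hotspot_threshold=300,
                 id_pattern=re.compile(r"^acct_\d+$")):
    to_remove = set()
    for node in g.nodes():
        _, node_id = node
        if not id_pattern.match(str(node_id)):        # invalid node: bad format
            to_remove.add(node)
        elif g.degree(node) > hotspot_threshold:      # hotspot node
            to_remove.add(node)
        elif node_id in whitelist:                    # already-classified node
            to_remove.add(node)
    g.remove_nodes_from(to_remove)                    # incident edges removed too
    return g
```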
  • the relationship network graph is divided into a plurality of sub-graphs, including the first sub-graph used for user classification model training.
  • the relationship network diagram in step 22 may be a relationship network diagram that has been filtered or not.
  • the training of the user classification model requires not only feature data, but also label data.
  • the label data includes the user sample set and the category label of each user sample therein.
  • the segmentation of the relationship network graph and the selection of subgraphs can be performed with reference to the label data.
  • the tag data contains the tagging time of the category tag.
  • The relationship network graph can be split based on time. Specifically, the relationship network graph may be split into multiple subgraphs according to the time period in which the interaction event corresponding to each connecting edge of the relationship network graph occurred, each subgraph corresponding to one time period. Then, the time period corresponding to the labelling time in the label data is determined, and the subgraph corresponding to that time period is determined as the first subgraph used for model training. It should be understood that the first subgraph may be a collective term for multiple subgraphs. For example, the label data may be labelled by month, containing labels annotated in July and August respectively.
  • In that case, the transaction relationship network graph can be split into multiple subgraphs according to the month in which the transactions occurred, each subgraph corresponding to one month. The subgraphs corresponding to July and August can then be selected from these subgraphs as the above first subgraph.
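  • For illustration only (not part of the patent text): a sketch of splitting the graph into monthly subgraphs and selecting the ones matching the labelling months, reusing the "times" edge attribute from the earlier sketch; the label months are hypothetical.

```python
# Sketch (assumption): partition edges by the month of their (first) transaction
# time, build one subgraph per month, then keep the months the labels were
# annotated in as the "first subgraph(s)" used for model training.
from collections import defaultdict
import networkx as nx

def split_by_month(g):
    edges_by_month = defaultdict(list)
    for u, v, data in g.edges(data=True):
        month = data["times"][0][:7]            # e.g. "2020-07"
        edges_by_month[month].append((u, v, data))
    subgraphs = {}
    for month, edges in edges_by_month.items():
        sub = nx.DiGraph()
        sub.add_edges_from(edges)               # (u, v, data) triples are supported
        subgraphs[month] = sub
    return subgraphs

subgraphs = split_by_month(g)
label_months = {"2020-07", "2020-08"}           # hypothetical labelling months
first_subgraphs = [subgraphs[m] for m in label_months if m in subgraphs]
```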
  • the tag data is divided according to the geographical area where the user sample is located.
  • the relationship network graph can be divided based on the geographical area.
  • the relationship network graph can be divided into a plurality of sub-graphs according to the geographic area in the basic attributes of each user node in the relationship network graph, for example, a city, and each sub-graph corresponds to a geographic area. Therefore, the subgraph corresponding to the geographic area of the user sample set in the label data can be determined as the first subgraph used for model training.
  • a pre-trained segmentation model can be used to segment the above-mentioned relational network graph. For example, you can train a meta-learning multi-classification model to classify the connected edges in the graph, and then segment the relational network graph according to the classification of the edges.
  • The loss function of the meta-learning multi-classification model may be the error between the information value IV of the graph features generated after graph splitting and the IV of the graph features generated without splitting.
  • the training of the meta-learning multi-classification model can be carried out in an existing way, which will not be described in detail here.
  • the above-mentioned relational network graph may be divided into multiple subgraphs. Among the obtained multiple subgraphs, there may be one or more subgraphs corresponding to the label data as the first subgraph used for user classification model training.
  • step 23 for each node in the first subgraph, the low-level features of the node are obtained.
  • The relationship network graph can be a heterogeneous graph or a homogeneous graph, and correspondingly the first subgraph is of the same type.
  • the low-order feature of the node may be the degree of the node.
  • the degree of a node indicates the number of neighbor nodes that the node is connected to, or the number of connected edges the node has.
  • In the case where the first subgraph is a homogeneous graph, the low-order features of a node include not only the degree of the node but also the number and proportion of dual nodes among the neighbor nodes it is connected to; a dual node is a user node that serves as both a first-type node and a second-type node in the relationship network graph.
  • Figure 3 shows an example of a homogenous graph according to one embodiment.
  • the left column is the buyer user node, that is, the payer in the transaction event
  • the right column is the seller user node, that is the payee of the transaction event.
  • nodes 2 and 4 are both sellers and buyers, so nodes 2 and 4 are dual nodes, or are called interchange identity nodes.
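  • For illustration only (not part of the patent text): a sketch of computing the low-order features of each node of a homogeneous first subgraph under the (role, account) node convention of the earlier sketches — the degree plus the number and proportion of dual-node neighbors.

```python
# Sketch (assumption): dual accounts are those appearing both as a buyer node
# and as a seller node; for each node, count how many of its neighbors
# correspond to dual accounts.
def low_order_features(g):
    buyers = {acct for role, acct in g.nodes() if role == "buyer"}
    sellers = {acct for role, acct in g.nodes() if role == "seller"}
    dual_accounts = buyers & sellers

    feats = {}
    for node in g.nodes():
        neighbors = set(g.predecessors(node)) | set(g.successors(node))
        degree = len(neighbors)
        dual_cnt = sum(1 for _, acct in neighbors if acct in dual_accounts)
        feats[node] = {
            "degree": degree,
            "dual_count": dual_cnt,
            "dual_ratio": dual_cnt / degree if degree else 0.0,
        }
    return feats
```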
  • step 24 the first subgraph is converted into an undirected graph.
  • In the case where the first subgraph is a heterogeneous graph, it is only necessary to convert its directed edges into undirected edges to obtain the undirected graph.
  • In the case where the first subgraph is a homogeneous graph, the conversion may include converting the directed edges in the homogeneous graph into undirected edges and merging the duplicate nodes therein, thereby obtaining the undirected graph.
  • FIG. 4 shows an example of transforming a homogeneous graph according to an embodiment.
  • The leftmost part of FIG. 4 shows the original homogeneous graph, which is the same as that shown in FIG. 3.
  • For this homogeneous graph, the directed edges pointing from the first-type nodes on the left to the second-type nodes on the right are first converted into undirected edges, obtaining graph A. Then, the duplicate nodes in graph A are merged: the two node-2 nodes are merged into one node, and the two node-4 nodes are merged into one node. When two duplicate nodes are merged into one node, the connecting edges between other nodes and the two duplicate nodes all become connecting edges with the merged node. Graph B is thus obtained, in which the nodes and connecting edges of the homogeneous graph have been updated.
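  • For illustration only (not part of the patent text): a sketch of the conversion of the homogeneous first subgraph into an undirected graph; under the (role, account) node convention, merging the duplicate buyer/seller nodes of the same account amounts to keying the undirected nodes by account only.

```python
# Sketch (assumption): drop edge directions and merge duplicate nodes by
# collapsing ("buyer", acct) and ("seller", acct) into the single node `acct`.
import networkx as nx

def to_undirected_merged(directed_subgraph):
    ug = nx.Graph()
    ug.add_nodes_from(acct for _, acct in directed_subgraph.nodes())
    for (_, acct_u), (_, acct_v) in directed_subgraph.edges():
        if acct_u != acct_v:          # ignore self-loops created by the merge
            ug.add_edge(acct_u, acct_v)
    return ug
```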
  • In step 25, for each node in the undirected graph, high-order features of the node are obtained, where the high-order features include multi-order H-indexes;
  • each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node.
  • Based on the undirected graph, high-order features can be extracted to obtain graph features that are higher-dimensional and more abstract than the degree of the node.
  • the concept of H index is innovatively introduced in graph analysis as a feature of higher-order graphs.
  • The H-index, also known as the h-factor,
  • is a method of evaluating academic achievements. H stands for "high citations".
  • The H-index of a researcher means that at most H of his papers have each been cited at least H times.
  • Applied to graph analysis, the H-index of a certain node refers to the number of at most H neighbor nodes whose degree is greater than or equal to H, or in other words, the maximum H value satisfying the condition that "there exist H neighbor nodes with degree greater than or equal to H".
  • If at most H neighbor nodes with degree greater than or equal to H cannot be determined, the H value of the at most H neighbor nodes with degree greater than H is used as the H-index.
  • the degree of the node here is the degree of the node in the undirected graph.
  • For a heterogeneous graph, the degree of a node in the undirected graph is the same as the degree determined in the low-order features; for a homogeneous graph, the nodes are updated during the conversion to the undirected graph, and accordingly the degree of each node in the undirected graph needs to be re-determined.
  • the rightmost graph C in FIG. 4 shows the degree of each node more clearly.
  • For example, the neighbors of node 1 are nodes 2, 4, and 6, and Table 2 shows that the degrees of these 3 neighbor nodes are 4, 4, and 2 respectively. Therefore, there exist 2 neighbor nodes with a degree greater than or equal to 2 (but there do not exist 3 neighbor nodes with a degree greater than or equal to 3); therefore, the H-index of node 1 is 2.
  • If at most H neighbors with a degree greater than or equal to H cannot be found, at most H neighbors with a degree greater than H are searched for instead.
  • In this way, the H-index of each node can be determined one by one. Based on the H-indexes thus determined, higher-order H-indexes can then be determined. That is, taking the degree of a node as its 0-order H-index and the H-index determined above as its 1-order H-index, higher-order H-indexes are determined recursively, where the k-order H-index indicates the maximum number H of neighbor nodes whose (k-1)-order H-index is greater than or equal to H, or in other words the maximum H value satisfying the condition that H neighbor nodes have a (k-1)-order H-index greater than or equal to H. In this way, the 2-order H-index, the 3-order H-index, and so on of each node can be determined iteratively, until a predetermined order N is reached.
  • The above predetermined order N can be set according to the characteristics of the graph structure and business needs. Generally, through the above recursive calculation, the high-order H-index of each node eventually converges, converging to the core degree (K-Core) of the graph. Therefore, in one example, the order N can be set to the order at which convergence is reached.
  • the high-order features of each node in the first subgraph are obtained: the first-order H-index, the second-order H-index, ..., the N-order H-index.
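  • For illustration only (not part of the patent text): a sketch of the iterative multi-order H-index computation on the undirected graph, following the definition above (0-order H-index = degree; k-order H-index = the largest H such that at least H neighbors have a (k-1)-order H-index greater than or equal to H). Applied to node 1 of Fig. 4, whose neighbors have degrees 4, 4 and 2, it yields a 1-order H-index of 2, matching the example above.

```python
# Sketch (assumption): compute [H_0, H_1, ..., H_N] for every node; the text
# above notes the iteration eventually converges (related to the K-core).
def h_of(values):
    """Classic H-index of a list of non-negative integers."""
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i        # at least i of the values are >= i
        else:
            break
    return h

def multi_order_h_index(ug, n_orders):
    h_prev = {node: ug.degree(node) for node in ug.nodes()}   # 0-order = degree
    history = {node: [h_prev[node]] for node in ug.nodes()}
    for _ in range(n_orders):
        h_next = {node: h_of([h_prev[nb] for nb in ug.neighbors(node)])
                  for node in ug.nodes()}
        for node, value in h_next.items():
            history[node].append(value)
        h_prev = h_next
    return history
```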
  • step 26 based on at least the aforementioned low-level features and high-level features, a candidate feature set is generated as candidate features for training the user classification model.
  • the low-level features and high-level features obtained above are aggregated to form a candidate feature set.
  • For each node, statistical features are obtained according to the statistical results of the low-order features and high-order features of its neighbor nodes, and the statistical features are included in the candidate feature set.
  • the above statistical results include one or more of the following: maximum value, minimum value, average value, median and mode.
  • the median represents the middle one found by sorting all the observed values in a limited set of numbers. If there is an even number of observations, the average of the two middle values is usually taken as the median.
  • the mode represents the most frequent value in a set of data. When there are multiple modes, the average of multiple modes can be selected as the output.
  • The final candidate feature set generated for each node includes: the degree of the node, the number of dual nodes, the proportion of dual nodes, the 0-order H-index, the 1-order H-index, the 2-order H-index, ..., the N-order H-index, as well as the maximum, minimum, mean, median and mode of the above features over the node's neighbor nodes.
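  • For illustration only (not part of the patent text): a sketch of deriving the neighbor statistical features (max, min, mean, median, mode) for every base feature, given per-node feature dicts such as those produced by the earlier sketches.

```python
# Sketch (assumption): aggregate each base feature over a node's neighbors; when
# several modes exist, their average is used, as described above.
import statistics

def neighbor_statistics(ug, node_features):
    # node_features: {node: {feature_name: value}}
    candidate_set = {}
    for node in ug.nodes():
        per_name = {}
        for nb in ug.neighbors(node):
            for name, value in node_features[nb].items():
                per_name.setdefault(name, []).append(value)
        stats = {}
        for name, values in per_name.items():
            stats[f"nbr_max_{name}"] = max(values)
            stats[f"nbr_min_{name}"] = min(values)
            stats[f"nbr_mean_{name}"] = statistics.mean(values)
            stats[f"nbr_median_{name}"] = statistics.median(values)
            modes = statistics.multimode(values)
            stats[f"nbr_mode_{name}"] = sum(modes) / len(modes)
        candidate_set[node] = {**node_features[node], **stats}
    return candidate_set
```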
  • a candidate feature set is generated for user classification model selection and training.
  • the label data used to train the user classification model can be obtained, which includes the user sample set and the category label of each user sample therein; then the above user sample set is mapped to the first node set in the first subgraph; According to the feature value distribution and label value distribution of each feature in the candidate feature set on the first node set, feature screening is performed to obtain the selected feature set for the user classification model.
  • Feature screening can be performed based on feature information value IV, and/or correlation coefficients between features.
  • In one embodiment, screening is performed first based on the feature IV values and then based on the feature correlation coefficients. To this end, take any feature in the candidate feature set (for example, the 2-order H-index) and call it the first feature; its information value IV is determined according to the feature value distribution of the first feature over the first node set and the label value distribution.
  • Specifically, the first feature value of each user node (assuming n nodes) in the first node set for the first feature can be obtained, and the first feature values are sorted to form a first feature value sequence (x_1, x_2, ..., x_n).
  • The user nodes are then binned according to the first feature value sequence (x_1, x_2, ..., x_n).
  • In one embodiment, uniform binning is performed over the value range defined by the maximum and minimum values in the first feature value sequence.
  • In another embodiment, automatic binning is performed according to the data distribution embodied in the first feature value sequence.
  • In either case, each user node is assigned to a bin. Based on the label value sequence, the distribution of the label values of the user nodes in each bin is then counted, and the information value IV of the first feature is determined according to the label value distribution of the bins.
  • For example, users can be divided into positive samples and negative samples according to whether the label value is 0 or 1.
  • For each bin i, the number of positive samples pos_i and the number of negative samples neg_i can be counted; the weight of evidence (WOE) value corresponding to bin i can then be calculated, and the IV of the first feature is obtained by aggregating over all bins.
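  • For illustration only (not part of the patent text): the WOE/IV formula image of the original publication is not reproduced here, so the sketch below assumes the standard definitions WOE_i = ln((pos_i/POS)/(neg_i/NEG)) and IV = sum_i (pos_i/POS - neg_i/NEG) * WOE_i, together with simple equal-width binning.

```python
# Sketch (assumption): equal-width binning of one feature plus the standard
# WOE / IV computation; eps avoids log(0) and division by zero for empty bins.
import math
import numpy as np

def information_value(feature_values, labels, n_bins=10, eps=1e-6):
    x = np.asarray(feature_values, dtype=float)
    y = np.asarray(labels)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

    total_pos = (y == 1).sum() + eps
    total_neg = (y == 0).sum() + eps
    iv = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        pos = (y[mask] == 1).sum() + eps           # positive samples in bin b
        neg = (y[mask] == 0).sum() + eps           # negative samples in bin b
        woe = math.log((pos / total_pos) / (neg / total_neg))
        iv += (pos / total_pos - neg / total_neg) * woe
    return iv
```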
  • the first screening operation can be performed based on the IV value of each feature.
  • the IV value of each feature may be compared with a threshold value, the features whose IV value is lower than the threshold value can be eliminated, and the features whose IV value is higher than the threshold value are retained.
  • the threshold can be set to, for example, 0.5.
  • the threshold can also be adjusted according to the screening target.
  • a correlation coefficient between the retained features is calculated, and a second screening operation is performed based on the correlation coefficient to obtain a selected feature set.
  • The correlation coefficient usually used is the Pearson correlation coefficient, which can be calculated according to known algorithms. Other calculation methods can also be used, such as the Spearman rank correlation coefficient.
  • A second screening operation can then be performed on the features to obtain multiple selected features. Specifically, the second screening operation can be performed in one of the following ways.
  • In one way, if the correlation coefficient between a feature and any other feature is higher than a predetermined correlation threshold, the feature is removed; if the correlation coefficients between the feature and all other features are lower than the threshold, the feature is retained.
  • The predetermined correlation threshold may be, for example, 0.8.
  • In another way, the mean of the correlation coefficients between each feature and the other features can be calculated. The features are then sorted according to this mean, and a predetermined number of features with smaller means are selected and retained. The retained features can be further filtered again in combination with their IV values, finally obtaining the selected features.
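  • For illustration only (not part of the patent text): a sketch of the first of the two second-screening variants described above, assuming the retained features are held in a pandas DataFrame with one column per feature; the 0.8 threshold follows the example above.

```python
# Sketch (assumption): greedily keep a feature only if its absolute Pearson
# correlation with every already-kept feature stays below the threshold.
import pandas as pd

def correlation_screen(feature_df: pd.DataFrame, threshold: float = 0.8):
    corr = feature_df.corr(method="pearson").abs()
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, other] < threshold for other in kept):
            kept.append(col)
    return kept          # feature names retained after the second screening
```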
  • the second stage of screening is performed to obtain multiple selected features to form a selected feature set. These multiple selected features can then be used for training the user classification model.
  • a feature record table is generated to record the description information of each feature in the selected feature set.
  • the description information can specifically be a definition explanation of the selected feature, or a description of the generation process. In this way, such a feature record table can be used for feature generation and selection of similar other models during modeling.
  • a relationship network graph is constructed based on the interaction events participated by the user, and graph features are extracted from it.
  • graph features not only include low-level features such as the degree of nodes, but also innovatively introduce H index as high-level graph features. In this way, a richer graph feature of each node is obtained, which is used for feature selection and training of the user analysis model.
  • an apparatus for processing graph features for a user classification model is provided.
  • the apparatus can be deployed in any device, platform, or device cluster with computing and processing capabilities.
  • Fig. 5 shows a schematic block diagram of a graph feature processing apparatus according to an embodiment. As shown in FIG. 5, the device 500 includes:
  • the graph construction unit 51 is configured to construct a relational network graph based on relational data;
  • the relational data includes records of interaction events in which users participate;
  • the relationship network graph includes a plurality of nodes, and directed edges between nodes formed based on the interaction events, the plurality of nodes including user nodes;
  • the graph dividing unit 52 is configured to divide the relationship network graph into a plurality of subgraphs, including a first subgraph used for user classification model training;
  • the low-level feature acquiring unit 53 is configured to acquire low-level features of the nodes for each node in the first subgraph, where the low-level features include at least the degree of the node;
  • the graph conversion unit 54 is configured to convert the first subgraph into an undirected graph
  • the high-order feature acquisition unit 55 is configured to acquire, for each node in the undirected graph, high-order features of the node, where the high-order features include multi-order H-indexes, and each order of H-index represents
  • the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, where the 0-order H-index is the degree of the node;
  • the feature set generating unit 56 is configured to generate a candidate feature set based on at least the low-order features and high-order features, as candidate features for training the user classification model.
  • the aforementioned interaction event is an event performed by a user with the aid of a medium; the multiple nodes further include a media node; and the directed edge is a directed connection edge between the user node and the media node.
  • the interaction event may specifically be a login event or an authentication event
  • the information of the media node includes one or more of the following: device identification information, network environment information, and authentication media information.
  • According to another embodiment, the interaction event is a directional interaction event between users, and the user nodes include first-type nodes and second-type nodes; the directed edge is a connecting edge pointing from a first-type node to a second-type node.
  • the interaction event may be a transaction event.
  • the first type of node is a buyer node
  • the second type of node is a seller node
  • the interaction event may be a transfer event.
  • the node of the first type is the transfer-out node
  • the node of the second type is the beneficiary node.
  • The above apparatus 500 further includes a graph filtering unit (not shown) configured to remove from the relationship network graph several nodes that do not meet the training needs of the user classification model, as well as the connecting edges corresponding to those nodes.
  • The removed nodes may include one or more of the following: invalid nodes that do not conform to a predetermined format; nodes whose number of connecting edges is greater than a certain threshold; nodes on a whitelist; and, in the case where the interaction events involve funds, nodes whose funds flow within a predetermined time period exceeds a predetermined threshold.
  • The graph dividing unit 52 is specifically configured to divide the relationship network graph into multiple subgraphs according to the time period in which the interaction event corresponding to each directed edge in the relationship network graph occurred, each subgraph corresponding to one time period; the time period corresponding to the labelling time of the label data used for training the user classification model is determined, and the subgraph corresponding to that time period is determined as the first subgraph.
  • Alternatively, the graph dividing unit 52 is specifically configured to divide the relationship network graph into multiple subgraphs according to the geographic area in the basic attributes of the user nodes, each subgraph corresponding to one geographic area;
  • the subgraph corresponding to the geographic area of the user sample set in the label data used for training the user classification model is determined as the first subgraph.
  • the relationship network graph is a homogenous graph.
  • the low-level feature acquisition unit 53 is further configured to acquire the following characteristics of the node: the number and proportion of dual nodes among the neighbor nodes connected to the node;
  • the dual node is a user node that serves as both a first-type node and a second-type node in the relationship network graph.
  • the graph conversion unit 54 is configured to: convert the directed edges in the first subgraph into undirected edges, and merge the repeated nodes therein to obtain the undirected graph .
  • When the high-order feature acquisition unit 55 acquires the high-order features of a node, for any order of H-index, when it is impossible to determine the maximum H value satisfying the condition that the previous-order H-index of H neighbor nodes is greater than or equal to H,
  • the maximum H value satisfying the condition that the previous-order H-index of H neighbor nodes is greater than H is taken as the H-index of that order.
  • The feature set generating unit 56 is configured to: for each node, obtain statistical features according to the statistical results of the low-order features and high-order features of its neighbor nodes, and include the statistical features in the candidate feature set; the statistical results include one or more of the following: maximum, minimum, mean, median and mode.
  • The device further includes a feature screening unit (not shown) configured to obtain label data used to train the user classification model, the label data including a user sample set and the category label of each user sample therein; map the user sample set to a first node set in the first subgraph; and perform feature screening according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set, to obtain a feature set for the user classification model.
  • The process of feature screening may specifically include: determining the information value IV of each feature according to the feature value distribution of each feature and the label value distribution, and performing a first screening operation on the features based on the information value IV;
  • for the features retained after the first screening operation, calculating correlation coefficients between the retained features, and performing a second screening operation based on the correlation coefficients to obtain the feature set.
  • After the feature set is obtained, the feature screening unit further generates a feature record table for recording the description information of each feature in the feature set.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, the memory is stored with executable code, and when the processor executes the executable code, it implements the method described in conjunction with FIG. 2 method.
  • the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof.
  • these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for graph feature processing. According to the method, a relationship network graph is first constructed from relationship data, where the relationship data includes records of interaction events in which users participate; the relationship network graph includes multiple user nodes and directed edges formed on the basis of the interaction events. The relationship graph is then split into multiple subgraphs, including a first subgraph used for training a user classification model. For each node in the first subgraph, low-order features of the node are obtained, including the degree of the node. Then, for each node of the undirected graph obtained from the first subgraph, high-order features of the node are obtained, including multi-order H-indexes, where each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node. A candidate feature set can then be generated based on the low-order features and high-order features, as candidate features for training the user classification model.

Description

Method and apparatus for graph feature processing

Technical Field
One or more embodiments of this specification relate to the field of machine learning, and in particular to a method and apparatus for graph feature processing for a user classification model.
Background Art
With the rapid development of artificial intelligence and machine learning, machine learning models have begun to be used for business analysis in a variety of business scenarios. For example, in many application scenarios, users need to be classified and identified, for example, to identify a user's risk level or to distinguish the group to which a user belongs, and so on. For this purpose, it is often necessary to train user classification models to perform business-related user identification and user classification.
The selection and processing of features is the basis of model training. For a user classification model, in order to train a model with excellent performance and accurate predictions, it is necessary to select, from a large number of user features, those that are more relevant to the prediction target and better reflect the characteristics of the users, and to use them for model training. In the simplest scenarios, selecting features from the users' basic attribute features is sufficient for the trained model to meet the requirements. However, as business scenarios become more and more complex, in many cases the users' basic attribute features are not rich and comprehensive enough to meet the performance requirements of model training. For this reason, consideration is given to generating some additional or derived features as a supplement for model training; among these, generating graph features based on user relationship networks is one aspect of such supplementary features. However, a network graph is a relatively complex data structure whose analysis and computation require a large amount of calculation, and how to efficiently extract from it meaningful features suitable for model training is a difficulty and a challenge.
Therefore, an improved solution is desired that can process graph data more efficiently and quickly extract effective graph features for selection and training of a user classification model.
Summary of the Invention
One or more embodiments of this specification describe a method and apparatus for graph feature processing for a user classification model, which can efficiently generate rich graph features, thereby facilitating feature selection and training of the user classification model.
According to a first aspect, a method for graph feature processing is provided, including: constructing a relationship network graph from relationship data, the relationship data including records of interaction events in which users participate, and the relationship network graph including a plurality of nodes and directed edges between nodes formed on the basis of the interaction events, the plurality of nodes including user nodes; splitting the relationship network graph into multiple subgraphs, including a first subgraph used for training a user classification model; for each node in the first subgraph, obtaining low-order features of the node, the low-order features including at least the degree of the node; converting the first subgraph into an undirected graph; for each node in the undirected graph, obtaining high-order features of the node, the high-order features including multi-order H-indexes, where each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node; and generating, based on at least the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
According to one embodiment, the interaction event is an event performed by a user with the aid of a medium; the plurality of nodes further include medium nodes; and the directed edges are directed connecting edges between user nodes and medium nodes.
In a specific example of the above embodiment, the interaction event is specifically a login event or an authentication event, and the information of the medium nodes includes one or more of the following: device identification information, network environment information, and authentication medium information.
According to another embodiment, the interaction event is a directional interaction event between users, the user nodes include first-type nodes and second-type nodes, and the directed edges are connecting edges pointing from first-type nodes to second-type nodes.
In a specific example of the above embodiment, the interaction event may be a transaction event, in which case the first-type nodes are buyer nodes and the second-type nodes are seller nodes; or the interaction event may be a transfer event, in which case the first-type nodes are transferor nodes and the second-type nodes are payee nodes.
According to one embodiment, before the relationship network graph is split into multiple subgraphs, graph filtering is performed on the relationship network graph, which includes removing from the relationship network graph several nodes that do not meet the training needs of the user classification model, as well as the connecting edges corresponding to those nodes.
Specifically, the removed nodes may include one or more of the following: invalid nodes that do not conform to a predetermined format; nodes whose number of connecting edges is greater than a certain threshold; nodes on a whitelist; and, in the case where the interaction events involve funds, nodes whose funds flow within a predetermined time period exceeds a predetermined threshold.
According to one embodiment, the relationship network graph is split into multiple subgraphs in the following manner: the relationship network graph is split into multiple subgraphs according to the time period in which the interaction event corresponding to each directed edge in the relationship network graph occurred, each subgraph corresponding to one time period; the time period corresponding to the labelling time of the label data used for training the user classification model is determined, and the subgraph corresponding to that time period is determined as the first subgraph.
According to another embodiment, the relationship network graph is split into multiple subgraphs in the following manner: the relationship network graph is split into multiple subgraphs according to the geographic area in the basic attributes of the user nodes, each subgraph corresponding to one geographic area; the subgraph corresponding to the geographic area of the user sample set in the label data used for training the user classification model is determined as the first subgraph.
According to one embodiment, the relationship network graph is a homogeneous graph; in this case, the obtained low-order features of a node further include: the number and proportion of dual nodes among the neighbor nodes connected to the node, where a dual node is a user node that serves as both a first-type node and a second-type node in the relationship network graph.
In the case where the relationship network graph is a homogeneous graph, converting the first subgraph into an undirected graph specifically includes: converting the directed edges in the first subgraph into undirected edges, and merging the duplicate nodes therein, to obtain the undirected graph.
According to one embodiment, when the high-order features of a node are obtained, for any order of H-index, when the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H cannot be determined, the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than H is taken as the H-index of that order.
According to one embodiment, generating the candidate feature set specifically includes: for each node, obtaining statistical features according to the statistical results of each of the low-order features and high-order features of its neighbor nodes, and including the statistical features in the candidate feature set; the statistical results include one or more of the following: maximum, minimum, mean, median and mode.
According to one embodiment, the method further includes: obtaining label data used for training the user classification model, the label data including a user sample set and the category label of each user sample therein; mapping the user sample set to a first node set in the first subgraph; and performing feature screening according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set, to obtain a feature set for the user classification model.
In the above embodiment, the process of feature screening may specifically include: determining the information value IV of each feature according to the feature value distribution of each feature and the label value distribution, and performing a first screening operation on the features based on the information value IV; for the features retained after the first screening operation, calculating correlation coefficients between the retained features, and performing a second screening operation based on the correlation coefficients, to obtain the feature set.
In one embodiment, after the above feature set is obtained, a feature record table is further generated for recording the description information of each feature in the feature set.
According to a second aspect, an apparatus for graph feature processing is provided, including: a graph construction unit configured to construct a relationship network graph from relationship data, the relationship data including records of interaction events in which users participate, and the relationship network graph including a plurality of nodes and directed edges between nodes formed on the basis of the interaction events, the plurality of nodes including user nodes; a graph splitting unit configured to split the relationship network graph into multiple subgraphs, including a first subgraph used for training a user classification model; a low-order feature acquisition unit configured to acquire, for each node in the first subgraph, low-order features of the node, the low-order features including at least the degree of the node; a graph conversion unit configured to convert the first subgraph into an undirected graph; a high-order feature acquisition unit configured to acquire, for each node in the undirected graph, high-order features of the node, the high-order features including multi-order H-indexes, where each order of H-index represents the maximum H value satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0-order H-index is the degree of the node; and a feature set generating unit configured to generate, based on at least the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method of the first aspect.
According to embodiments of this specification, in order to provide richer feature choices for the training of the user classification model, in the solutions of the embodiments a relationship network graph is constructed based on the interaction events in which users participate, and graph features are extracted from it. These graph features include not only low-order features such as the degree of a node, but also innovatively introduce the H-index as a high-order graph feature. In this way, richer graph features of each node are obtained for feature selection and training of the user analysis model.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of a graph feature processing procedure according to an embodiment disclosed in this specification;
Fig. 2 shows a flowchart of a method for graph feature processing for a user classification model according to an embodiment;
Fig. 3 shows an example of a homogeneous graph according to an embodiment;
Fig. 4 shows an example of transforming a homogeneous graph according to an embodiment;
Fig. 5 shows a schematic block diagram of a graph feature processing apparatus according to an embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the drawings.
In order to implement the modeling and training of a user classification model more efficiently, one embodiment of this specification provides an end-to-end graph feature processing solution, which can generate a relationship network graph based on relationship data recording user interaction events, and extract low-order graph features and high-order graph features of the nodes from it as candidate features for screening and training of the user classification model.
Fig. 1 is a schematic diagram of a graph feature processing procedure according to an embodiment disclosed in this specification. As shown in Fig. 1, a relationship network graph is first constructed based on relationship data. The relationship data is used to record event records of the interaction events in which users participate; correspondingly, the relationship network graph constructed from it includes user nodes, and the connecting edges between nodes are established based on the interaction events. In the embodiments of this specification, the directionality of the interaction events may be taken into account to establish directed connecting edges. Correspondingly, the relationship network graph can be embodied as a bipartite graph.
Optionally, some filtering processing can be performed on the relationship network graph constructed above, to remove some nodes and edges that do not need to be analyzed. Further, the relationship network graph can be split into subgraphs to facilitate subsequent processing.
Based on the subgraphs obtained by the above processing, node features can be extracted. The extracted node features include low-order features and high-order features, where the low-order features include at least the degree of the node. For the high-order features, the embodiments of this specification innovatively apply the H-index used in other fields to graph analysis as a high-order graph feature, where the H-index of a node in the relationship network graph refers to the number of at most H neighbor nodes whose degree is greater than or equal to H. Further, multi-order H-indexes can be obtained iteratively. In this way, richer high-order features of each node are obtained.
Optionally, the low-order/high-order features of each node's neighbor nodes can also be aggregated statistically to obtain statistical features. The above low-order features, high-order features, and optional statistical features together constitute a candidate feature set. The candidate feature set contains graph features generated and extracted based on the relationship network graph, and these graph features, especially the node high-order features, are essentially different from features extracted in a conventional manner.
The features in the candidate feature set can be evaluated and screened by various evaluation methods, such as feature information value IV, correlation coefficients, and so on. Finally, a feature set suitable for the user classification model can be selected from the candidate feature set, which helps to train a user classification model with better performance.
The specific steps and execution of the above solution are described below.
Fig. 2 shows a flowchart of a method for graph feature processing for a user classification model according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 2, the graph feature processing method includes at least the following steps.
In step 21, a relationship network graph is constructed from relationship data. The relationship data includes event records of the interaction events in which users participate; correspondingly, the constructed relationship network graph includes a plurality of nodes and directed edges between nodes formed based on the above interaction events, and the plurality of nodes include user nodes.
Specifically, the above interaction event may be an event that users participate in and that is helpful for the classification prediction target of the user classification model. For example, when the user classification model is used to evaluate a user's transaction risk, the relationship network graph can be constructed based on transaction events; when the user classification model is used to evaluate a user's login risk, the relationship network graph can be constructed based on login events; when the user classification model is used to determine the marketing group to which a user belongs, the relationship network graph can be constructed based on discount write-off events; and so on.
In different embodiments, the interaction event may be an interaction between users, or may also involve other objects. In these two cases, the resulting relationship network graph is a homogeneous graph or a heterogeneous graph, respectively.
Specifically, in one embodiment, the above interaction event is an event performed by a user with the aid of a medium, which involves a medium object. In such a case, the constructed relationship network graph is a heterogeneous graph, which includes not only user nodes but also medium nodes. Correspondingly, the connecting edges are directed connecting edges between user nodes and medium nodes.
For example, the interaction event may be a login event, in which the user logs in with the aid of a specific device and network environment. In this case, the medium nodes may include device nodes and/or network environment nodes. More specifically, a device node may be represented by device identification information, which may specifically include device identifiers such as the MAC address of the device, the SIM number of the mobile phone, the UMID, the APDID, and so on. A network environment node may indicate the network environment information at the time of login, for example, the IP address, the wifi network identifier, and so on. If a user logs in with the aid of a certain medium, a connecting edge is constructed between the user node corresponding to that user and the medium node corresponding to that medium.
For another example, the interaction event may be an authentication event, in which the user performs identity authentication with the aid of some authentication medium. In this case, the medium node may be the above authentication medium, for example, the credit card number, ID card number, mobile phone number, and so on used for the authentication. If a user uses a certain authentication medium for identity authentication, a connecting edge is constructed between the user node corresponding to that user and the medium node corresponding to that authentication medium.
There are other specific examples of heterogeneous graphs, which are not enumerated one by one here. In the case of a heterogeneous graph, the user nodes may be regarded as one type of node and the other objects as another type of node, and the heterogeneous graph thus obtained may be a bipartite graph.
In another implementation, the interaction events are directional interaction events between users. In this case, the constructed relationship network graph is a homogeneous graph in which all nodes are user nodes. Each user node may be represented by user identification information, which may take the form of an account ID, mobile phone number, email address, and so on. Further, according to the directionality of the interaction events, the user nodes can be divided into two classes, called first-class nodes and second-class nodes, where first-class nodes correspond to event initiators and second-class nodes correspond to event targets. Accordingly, the edges in the homogeneous graph are directed edges pointing from first-class nodes to second-class nodes.
Specifically, in one example, the interaction event is a transaction event. In this case, first-class nodes correspond to buyer users and second-class nodes correspond to seller users. In a typical implementation, the corresponding relationship data is a transaction record table in which each row records one transaction. The transaction record table may, for example, contain four columns: buyer account, seller account, transaction amount, and transaction time. Each account in the buyer account column can then be taken as a first-class node, each account in the seller account column as a second-class node, and a directed edge pointing from the buyer account to the seller account is established between the buyer account and the seller account appearing in the same transaction.
If the same account is recorded as a buyer in the buyer account column in some transactions and as a seller in the seller account column in others, then when constructing the graph, the account is recorded both among the first-class nodes and among the second-class nodes; that is, the account is represented as one first-class node and one second-class node, respectively.
For the edges in the above graph, the transaction amount and transaction time can be used as edge attribute information. In one case, the same pair of buyer account and seller account may have conducted multiple transactions; in this case, information such as the number of transactions can also be included in the edge attribute information.
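As an illustration of this construction step (not part of the original disclosure), the following minimal Python sketch builds the directed bipartite structure from a transaction record table; the field names buyer_id, seller_id, amount and ts, and the aggregation of repeated transactions into edge attributes, are assumptions made for the example.

    # Minimal sketch: build the directed graph of a transaction record table.
    # Assumed record fields: buyer_id, seller_id, amount, ts.
    from collections import defaultdict

    def build_transaction_graph(records):
        edges = defaultdict(lambda: {"amount": 0.0, "count": 0, "last_ts": None})
        for r in records:
            # Buyers and sellers form two node classes; an account appearing in
            # both columns becomes one buyer node and one seller node.
            src = ("buyer", r["buyer_id"])
            dst = ("seller", r["seller_id"])
            e = edges[(src, dst)]
            e["amount"] += r["amount"]   # aggregated amount as an edge attribute
            e["count"] += 1              # repeated transactions -> transaction count
            e["last_ts"] = r["ts"]       # most recent transaction time
        return edges                     # {(buyer_node, seller_node): attributes}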
In another example, the interaction event may be a transfer event; in this case, first-class nodes are transferor nodes and second-class nodes are payee nodes. In yet another example, the interaction event may be a social event involving a certain behavior, such as a call or a share. In this case, first-class nodes correspond to the initiators of the behavior, such as the caller or the sharer, and second-class nodes correspond to the receivers of the behavior, such as the callee or the share recipient, and so on.
There are other specific examples of homogeneous graphs, which are not enumerated here. In the homogeneous graph case, since the user nodes are divided into two classes, the resulting homogeneous graph can also be regarded as a bipartite graph.
In one embodiment, the relationship network graph may record its topology in table form; for example, it may be recorded as an adjacency list, or each edge may be recorded using two columns holding the source node and the target node of the directed edge, and so on.
After the above relationship network graph is obtained, optionally, some preprocessing operations may be performed on it to simplify or facilitate subsequent computation. In one embodiment, the preprocessing operations may include a graph filtering operation, that is, removing from the relationship network graph the nodes that do not meet the training needs of the user classification model, together with their edges.
Specifically, the graph filtering operation may include first removing invalid nodes. Invalid nodes are nodes that do not meet format requirements, mainly nodes whose format becomes corrupted during data transmission. In actual business, invalid nodes are mostly medium-type nodes, including UMID, APDID, and SIM nodes. If the format of such a node does not satisfy the standard format, the node and all edges connected to it are removed.
The graph filtering operation may also include removing nodes whose number of edges is greater than a certain threshold. Such nodes may be called hot nodes. In business practice, different thresholds are set depending on the relationship data. For example, for transaction events, a node with more than 300 edges is regarded as a hot node; in a heterogeneous graph, a medium node with more than 1,000 edges may be set as a hot node.
When the interaction events involve funds, such as transaction events and transfer events, the graph filtering step may remove nodes whose fund flow within a predetermined time period exceeds a predetermined threshold, for example, nodes whose single-day transaction volume reaches 100,000.
In other examples, a whitelist may be preset, containing nodes whose classification is already known and that do not need to be analyzed, such as the accounts of known merchants. In this case, the graph filtering operation may remove the nodes on the whitelist and their edges.
It should be understood that the above hot nodes, whitelisted nodes, and nodes with large transaction volumes can usually be classified by other rules and are generally not used as training samples for the user classification model, while such nodes usually have relatively complex connection structures. Removing these nodes during preprocessing therefore simplifies the relationship network graph and facilitates subsequent graph computation and analysis, without affecting the selection of training data for the user classification model.
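The graph filtering operation can be sketched as follows (an illustrative simplification, not the original implementation); the hot-node threshold follows the example in the text, and the whitelist is assumed to be supplied externally, while format checks and fund-flow thresholds are omitted.

    # Minimal sketch of graph filtering: drop hot nodes and whitelisted nodes
    # together with their edges.
    from collections import defaultdict

    def filter_graph(edges, whitelist=frozenset(), hot_threshold=300):
        degree = defaultdict(int)
        for (src, dst) in edges:
            degree[src] += 1
            degree[dst] += 1
        def keep(node):
            return node not in whitelist and degree[node] <= hot_threshold
        return {(s, d): attrs for (s, d), attrs in edges.items()
                if keep(s) and keep(d)}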
To further simplify the processing of the relationship network graph, in step 22, the relationship network graph is partitioned into a plurality of subgraphs, including a first subgraph used for training the user classification model. The relationship network graph in step 22 may be one that has or has not undergone graph filtering.
It should be understood that training the user classification model requires not only feature data but also label data, which includes a user sample set and a class label for each user sample. In one implementation, the partition of the relationship network graph and the selection of subgraphs may be performed with reference to the label data.
In one embodiment, the label data contains the annotation time of the class labels. In this case, the relationship network graph can be partitioned based on time. Specifically, according to the time periods in which the interaction events corresponding to the edges in the relationship network graph occurred, the relationship network graph can be partitioned into a plurality of subgraphs, each corresponding to one time period. The time periods corresponding to the annotation times in the label data are then determined, and the subgraphs corresponding to those time periods are determined as the first subgraph used for model training. It should be understood that the first subgraph may collectively refer to multiple subgraphs. For example, the label data may be annotated by month, containing labels annotated in July and August respectively. Accordingly, the transaction relationship network graph can be partitioned by the month in which each transaction occurred into multiple subgraphs, each corresponding to one month, and the subgraphs corresponding to July and August can be selected as the first subgraph.
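A time-based partition of this kind can be sketched as follows (illustrative only); it assumes each edge carries a transaction timestamp attribute, here called last_ts and holding a datetime, as in the earlier construction sketch.

    # Minimal sketch: partition edges into monthly subgraphs by event time.
    from collections import defaultdict

    def split_by_month(edges):
        subgraphs = defaultdict(dict)
        for (src, dst), attrs in edges.items():
            key = (attrs["last_ts"].year, attrs["last_ts"].month)  # assumed datetime
            subgraphs[key][(src, dst)] = attrs
        return subgraphs   # e.g. subgraphs[(2020, 7)] and subgraphs[(2020, 8)]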
In one embodiment, the label data is divided by the geographic region in which the user samples are located. In this case, the relationship network graph can be partitioned based on geographic regions. Specifically, according to the geographic region, for example the city, in the basic attributes of each user node in the relationship network graph, the relationship network graph can be partitioned into a plurality of subgraphs, each corresponding to one geographic region. The subgraphs corresponding to the geographic regions of the user sample set in the label data can then be determined as the first subgraph used for model training.
According to another implementation, a pre-trained partition model can be used to partition the relationship network graph. For example, a meta-learning multi-class model can be trained to classify the edges in the graph, and the relationship network graph is then partitioned according to the edge classes. The loss function of the meta-learning multi-class model may be the error between the information value (IV) of the graph features generated after partitioning and that of the graph features generated without partitioning. The meta-learning multi-class model can be trained in existing ways, which are not described in detail here.
In other specific examples, the relationship network graph can also be partitioned into a plurality of subgraphs based on other principles. Among the resulting subgraphs, one or more subgraphs corresponding to the label data can serve as the first subgraph used for training the user classification model.
Next, in step 23, for each node in the first subgraph, the low-order features of the node are acquired.
As described above, the relationship network graph may be a heterogeneous graph or a homogeneous graph, and the first subgraph corresponds to it accordingly. When the first subgraph is a heterogeneous graph, the low-order feature of a node may be the node degree. The degree of a node is the number of neighbor nodes it connects to, or the number of edges it has.
When the first subgraph is a homogeneous graph, the low-order features of a node include, in addition to the node degree, the number and proportion of dual nodes among its connected neighbor nodes, where a dual node is a user node that acts as both a first-class node and a second-class node in the relationship network graph.
FIG. 3 shows an example of a homogeneous graph according to an embodiment. Assume that this relationship network graph is constructed based on transaction events: the left column contains buyer user nodes, that is, the payers in the transaction events, and the right column contains seller user nodes, that is, the payees in the transaction events. As shown in the figure, nodes 2 and 4 are both sellers and buyers, so node 2 and node 4 are dual nodes, also called identity-swapping nodes.
As shown in FIG. 3, the user corresponding to a dual node is represented as two separate nodes, a first-class node and a second-class node, when the graph is constructed; when computing the low-order features, its low-order features as a first-class node and as a second-class node are also considered separately. In this way, the low-order features can be determined separately for each buyer node and each seller node in FIG. 3. For example, buyer node 1 connects to three seller nodes (6, 2, 4), and among these three seller nodes, nodes 2 and 4 are dual nodes; therefore, the number of dual nodes is 2 and the proportion is 2/3. The low-order features computed separately for the buyer node group and the seller node group are shown in Table 1 below.
Table 1:
[Table 1 is reproduced as an image in the original publication (Figure PCTCN2020132654-appb-000001). It lists, for each buyer node and each seller node of FIG. 3, the node degree, the number of dual nodes among its neighbors, and the dual-node proportion.]
In this way, the low-order features of each node in the first subgraph are obtained.
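As a sketch of the low-order feature computation for the homogeneous case (not part of the original disclosure), the following Python code derives the degree, dual-node count, and dual-node proportion from edges whose endpoints are labelled ("buyer", id) / ("seller", id), an assumed representation carried over from the earlier sketch.

    # Minimal sketch of low-order features: degree plus the number and
    # proportion of dual nodes (accounts acting as both buyer and seller).
    from collections import defaultdict

    def low_order_features(edges):
        neighbors = defaultdict(set)
        roles = defaultdict(set)                 # account id -> {"buyer", "seller"}
        for (src, dst) in edges:
            neighbors[src].add(dst)
            neighbors[dst].add(src)
            roles[src[1]].add(src[0])
            roles[dst[1]].add(dst[0])
        dual = {a for a, r in roles.items() if r == {"buyer", "seller"}}
        feats = {}
        for node, nbrs in neighbors.items():
            n_dual = sum(1 for nb in nbrs if nb[1] in dual)
            feats[node] = {
                "degree": len(nbrs),
                "dual_count": n_dual,
                "dual_ratio": n_dual / len(nbrs),
            }
        return feats

For buyer node 1 in FIG. 3, with seller neighbors 6, 2 and 4, this yields degree 3, dual_count 2 and dual_ratio 2/3, consistent with the example above.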
Next, in step 24, the first subgraph is converted into an undirected graph.
When the first subgraph is a heterogeneous graph, the undirected graph is obtained simply by converting the directed edges into undirected edges. When the first subgraph is a homogeneous graph, the conversion may include converting the directed edges in the homogeneous graph into undirected edges and merging the duplicate nodes, thereby obtaining the undirected graph.
FIG. 4 shows an example of transforming a homogeneous graph according to an embodiment. The leftmost part of FIG. 4 shows the original homogeneous graph, which is the same as that in FIG. 3. For this homogeneous graph, the directed edges pointing from the first-class nodes on the left to the second-class nodes on the right are first converted into undirected edges, yielding graph A. The duplicate nodes in graph A are then merged: the two nodes 2 are merged into one node, and the two nodes 4 are merged into one node. When two duplicate nodes are merged into one node, the edges between other nodes and the two duplicate nodes are all attributed to the merged node. This yields graph B, in which the nodes and edges of the homogeneous graph have been updated.
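The conversion can be sketched as follows (illustrative only, using the same assumed node labelling as above): edge direction is dropped and the buyer and seller copies of the same account are merged into a single node.

    # Minimal sketch: convert the directed homogeneous subgraph into an
    # undirected graph, merging duplicate (buyer/seller) copies of an account.
    def to_undirected(edges):
        pairs = set()
        for (src, dst) in edges:
            u, v = src[1], dst[1]        # keep only the account id -> merge copies
            if u != v:                   # skip self-loops created by merging
                pairs.add(frozenset((u, v)))
        adj = {}
        for pair in pairs:
            a, b = tuple(pair)
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        return adj                       # adjacency sets of the undirected graph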
Next, in step 25, for each node in the undirected graph, the high-order features of the node are acquired. The high-order features include multi-order H-indexes, where each order of H-index is the maximum value H such that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0th-order H-index is the node degree. That is, high-order features can be extracted based on the undirected graph, yielding graph features that are higher-dimensional and more abstract than the node degree. The embodiments of this specification innovatively introduce the concept of the H-index into graph analysis as a high-order graph feature.
The H-index, also called the h-factor, is a method for evaluating academic achievement. H stands for "high citations": a researcher's H-index is the largest number H such that H of his or her papers have each been cited at least H times. In the solution of this embodiment, the concept of the H-index is applied to graph analysis, where the H-index of a node is the number of at most H neighbor nodes whose degree is greater than or equal to H, or in other words, the maximum value H satisfying the condition that H neighbor nodes have a degree greater than or equal to H. If the at most H neighbor nodes with degree greater than or equal to H cannot be determined, the H value of the at most H neighbor nodes with degree greater than H is used as the H-index. The node degree here is the degree in the undirected graph. For a heterogeneous graph, the node degree in the undirected graph is the same as the degree determined in the low-order features; for a homogeneous graph, the nodes are updated during the conversion to the undirected graph, so the node degrees in the undirected graph need to be re-determined accordingly.
This is described below with an example. Continuing the example of FIG. 4, in graph B the homogeneous graph has been transformed, with the nodes and edges updated, to obtain the undirected graph. The degree of each node can therefore be re-determined, giving the following list:
Table 2:

    Node    Degree        Node    Degree
    1       3             5       1
    2       4             6       2
    3       2             7       2
    4       4
Graph C at the far right of FIG. 4 shows the degree of each node more clearly. Taking node 1, the dark node in graph C, as an example, the determination of the H-index is described below.
It can be seen that the neighbors of node 1 are nodes 2, 4, and 6. Looking up Table 2 above, the degrees of these three neighbor nodes are 4, 4, and 2, respectively. There are therefore 2 neighbor nodes with degree greater than 2 (but not 3 neighbor nodes with degree greater than 3), so the H-index of node 1 is 2. Here, since at most H neighbors with degree greater than or equal to H cannot be found, at most H neighbors with degree greater than H are sought instead.
In a similar way, the H-index of each node can be determined one by one. Based on the H-indexes determined in this way, higher-order H-indexes can then be further determined. That is, taking the node degree as the 0th-order H-index and the H-index determined above as the 1st-order H-index, higher-order H-indexes are determined recursively, where the kth-order H-index is the number of at most H neighbor nodes whose (k-1)th-order H-index is greater than or equal to H, or in other words, the maximum value H satisfying the condition that H neighbor nodes have a (k-1)th-order H-index greater than or equal to H. In this way, the 2nd-order H-index, 3rd-order H-index, and so on of each node can be determined iteratively, up to a predetermined order N.
The predetermined order N can be set according to the characteristics of the graph structure and business needs. Generally, through the above recursive computation, the high-order H-indexes of the nodes eventually converge, and they converge to the coreness (k-core) of the graph. In one example, the order N can therefore be set to the order at which convergence is reached.
In the above manner, the high-order features of each node in the first subgraph are obtained: the 1st-order H-index, the 2nd-order H-index, ..., and the Nth-order H-index.
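The multi-order H-index iteration can be sketched as follows (an illustrative implementation, not the original one); it uses the standard rule "the largest H such that H neighbors have a previous-order index of at least H" and stops at convergence, while the strict-inequality fallback described above is not modelled.

    # Minimal sketch of the multi-order H-index on the undirected graph,
    # given adjacency sets as produced by the conversion sketch.
    def h_index(values):
        # Largest H such that at least H of the values are >= H.
        h = 0
        for i, v in enumerate(sorted(values, reverse=True), start=1):
            if v >= i:
                h = i
            else:
                break
        return h

    def multi_order_h_index(adj, max_order=10):
        orders = [{n: len(nbrs) for n, nbrs in adj.items()}]   # 0th order = degree
        for _ in range(max_order):
            prev = orders[-1]
            cur = {n: h_index([prev[nb] for nb in nbrs]) for n, nbrs in adj.items()}
            orders.append(cur)
            if cur == prev:              # converged (to the coreness / k-core)
                break
        return orders                    # orders[k][node] = kth-order H-index

For node 1 in graph C, whose neighbors 2, 4, and 6 have degrees 4, 4, and 2, h_index([4, 4, 2]) returns 2, matching the 1st-order H-index derived above.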
It should be noted that when the relationship network graph is recorded in table form, both the low-order features and the high-order features can be implemented simply by SQL query statements, avoiding the large number of matrix operations in conventional graph feature computation, so feature generation is highly efficient.
Next, in step 26, a candidate feature set is generated based at least on the above low-order features and high-order features, as candidate features for training the user classification model.
In one embodiment, the low-order features and high-order features obtained above are aggregated to form the candidate feature set. In another embodiment, for each node, statistical features are obtained according to the statistical results of each of the low-order and high-order features of its neighbor nodes, and the statistical features are included in the candidate feature set. The statistical results include one or more of the following: maximum, minimum, mean, median, and mode.
Among the above statistics, the median is the middle value found by sorting all observed values in a finite data set from low to high; if there is an even number of observed values, the average of the two middle values is usually taken as the median. The mode is the value that appears most frequently in a set of data; when there are multiple modes, the average of the modes may be taken as the output.
In a specific example, for a first subgraph that is a homogeneous graph, the candidate feature set finally generated for each node includes: the node's own degree, dual-node count, dual-node proportion, 0th-order H-index, 1st-order H-index, 2nd-order H-index, ..., Nth-order H-index, and the maximum, minimum, mean, median, and mode of each of the above features over its neighbor nodes.
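The neighbor statistics can be sketched as follows (illustrative only); feats is assumed to map each node to the value of one feature, and the statistics module of the Python standard library supplies mean, median, and multimode.

    # Minimal sketch of the neighbor statistics (max/min/mean/median/mode)
    # for a single feature over the undirected adjacency sets.
    from statistics import mean, median, multimode

    def neighbor_stats(adj, feats):
        stats = {}
        for node, nbrs in adj.items():
            vals = [feats[nb] for nb in nbrs]
            if not vals:
                continue
            modes = multimode(vals)          # several modes -> take their average
            stats[node] = {
                "max": max(vals), "min": min(vals), "mean": mean(vals),
                "median": median(vals), "mode": mean(modes),
            }
        return stats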
In this way, a candidate feature set is generated based on the relationship network graph for selection and training of the user classification model.
Next, feature screening can be performed on the candidate feature set to select features suitable for the user classification model. Specifically, label data for training the user classification model can be obtained, including a user sample set and a class label for each user sample; the user sample set is then mapped to a first node set in the first subgraph; and feature screening is performed according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set, to obtain a selected feature set for the user classification model. The feature screening can be based on the feature information value (IV) and/or correlation coefficients between features.
In one embodiment, screening is first performed based on the feature IV values and then based on the correlation coefficients between features. To this end, for any feature in the candidate feature set (for example, the 2nd-order H-index), called the first feature, its information value IV can be determined according to the distribution of the first feature's values over the first node set and the label value distribution.
More specifically, for the first feature X, the first feature value of each user node in the first node set (assumed to contain n nodes) with respect to the first feature can be obtained, and the first feature values are sorted to form a first feature value sequence (x_1, x_2, ..., x_n).
Next, the label data is associated to obtain a label value sequence (L_1, L_2, ..., L_n), which is aligned with the first feature value sequence (x_1, x_2, ..., x_n) with respect to the user order.
The user nodes are then binned according to the first feature value sequence (x_1, x_2, ..., x_n). In one embodiment, uniform binning is performed over the value range delimited by the maximum and minimum values in the first feature value sequence. In another embodiment, automatic binning is performed according to the data distribution reflected by the first feature value sequence.
In this way, the user nodes are divided into bins. Based on the label value sequence, the distribution of label values of the user nodes in each bin is counted, and the information value IV of the first feature is then determined according to the label value distribution of each bin.
Take as an example the case where the user classification model is a binary classification model and the class labels are binarized: users can be divided into positive samples and negative samples according to whether the label value is 0 or 1. For any bin i, the number of positive samples pos_i and the number of negative samples neg_i in the bin can be counted, and the weight of evidence (WOE) of bin i is then computed as:

WOE_i = ln(p_i / q_i)

where

p_i = pos_i / Σ_j pos_j

is the proportion of the positive samples in bin i among all positive samples, and

q_i = neg_i / Σ_j neg_j

is the proportion of the negative samples in bin i among all negative samples. The IV value of the first feature is then obtained as:

IV = Σ_i (p_i − q_i) · WOE_i
In the above manner, the IV value of each feature in the candidate feature set can be determined, and a first screening operation can then be performed based on the IV values. Specifically, the IV value of each feature can be compared with a threshold: features whose IV value is below the threshold are removed, and features whose IV value is above the threshold are retained. In practice, the threshold may be set to, for example, 0.5, and it can of course be adjusted according to the screening objective.
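The binning-based IV computation described above can be sketched as follows (illustrative only); uniform binning over the value range is assumed, and a small eps guards against empty bins.

    # Minimal sketch of the IV computation for one feature, with binary labels.
    import math

    def information_value(values, labels, n_bins=10, eps=1e-6):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0
        pos = [0] * n_bins
        neg = [0] * n_bins
        for x, y in zip(values, labels):                 # y in {0, 1}
            b = min(int((x - lo) / width), n_bins - 1)
            (pos if y == 1 else neg)[b] += 1
        total_pos, total_neg = sum(pos) or 1, sum(neg) or 1
        iv = 0.0
        for i in range(n_bins):
            p = pos[i] / total_pos + eps
            q = neg[i] / total_neg + eps
            iv += (p - q) * math.log(p / q)              # (p_i - q_i) * WOE_i
        return iv

Features whose IV falls below the chosen threshold (for example 0.5) would then be dropped in the first screening operation.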
Then, for the features retained after the first screening operation, correlation coefficients between the retained features are computed, and a second screening operation is performed based on the correlation coefficients to obtain the selected feature set.
Various existing methods can be used to compute the correlation coefficient between each pair of features. The correlation coefficient is usually the Pearson correlation coefficient, which can be computed by known algorithms; other measures, such as the Spearman rank correlation coefficient, may also be used. Based on the correlation coefficients, the second screening operation can be performed on the features to obtain a plurality of selected features. Specifically, the second screening operation can be performed in the following ways.
In one embodiment, for each feature, if the correlation coefficient between the feature and any other feature is above a predetermined correlation threshold, for example 0.8, the feature is removed; if the correlation coefficients between the feature and all other features are below the threshold, the feature is retained. In yet another embodiment, for each feature, the mean of the correlation coefficients between the feature and each of the other features can be computed; the features in the comprehensive feature table are then sorted by this mean, and a predetermined number of features with smaller means are selected for retention. The retained features can be further screened again in combination with the IV values, finally yielding the selected features.
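A greedy simplification of the first variant can be sketched as follows (illustrative; it keeps one feature of each highly correlated pair rather than dropping both). feature_matrix is assumed to map a feature name to its list of values over the first node set, and statistics.correlation (Python 3.10+) computes the Pearson correlation coefficient.

    # Minimal sketch of correlation-based second screening with a 0.8 threshold.
    from statistics import correlation   # Pearson correlation, Python 3.10+

    def screen_by_correlation(feature_matrix, threshold=0.8):
        kept = []
        for name, vals in feature_matrix.items():
            if all(abs(correlation(vals, feature_matrix[k])) <= threshold
                   for k in kept):
                kept.append(name)
        return kept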
In this way, the second-stage screening based on the correlation coefficients between features can be performed by various methods, yielding a plurality of selected features that constitute the selected feature set. These selected features can then be used for training the user classification model.
On the basis of the determined selected features, in one embodiment, a feature record table is generated to record description information of each feature in the selected feature set. The description information may specifically be a definition or explanation of the selected feature, or a description of its generation process. Such a feature record table can then be used for feature generation and selection when building other, similar models.
Reviewing the above process, in the embodiments of this specification, in order to provide a richer selection of features for training a user classification model, a relationship network graph is constructed based on the interaction events in which users participate, and graph features are extracted from it. The graph features include not only low-order features such as the node degree, but also, innovatively, the H-index introduced as a high-order graph feature. In this way, richer graph features are obtained for each node for feature selection and training of the user classification model.
According to an embodiment of another aspect, an apparatus for graph feature processing for a user classification model is provided, and the apparatus can be deployed in any device, platform, or device cluster with computing and processing capabilities. FIG. 5 shows a schematic block diagram of a graph feature processing apparatus according to an embodiment. As shown in FIG. 5, the apparatus 500 includes:
a graph construction unit 51, configured to construct a relationship network graph according to relationship data, where the relationship data includes records of interaction events in which users participate, and the relationship network graph includes a plurality of nodes and directed edges between nodes formed based on the interaction events, the plurality of nodes including user nodes;
a graph partition unit 52, configured to partition the relationship network graph into a plurality of subgraphs, including a first subgraph used for training a user classification model;
a low-order feature acquisition unit 53, configured to acquire, for each node in the first subgraph, low-order features of the node, where the low-order features at least include the degree of the node;
a graph conversion unit 54, configured to convert the first subgraph into an undirected graph;
a high-order feature acquisition unit 55, configured to acquire, for each node in the undirected graph, high-order features of the node, where the high-order features include multi-order H-indexes, each order of H-index being the maximum value H such that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0th-order H-index being the degree of the node; and
a feature set generation unit 56, configured to generate, based at least on the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
According to one implementation, the interaction event is an event that a user performs via a medium; the plurality of nodes further include medium nodes; and the directed edges are directed edges between user nodes and medium nodes.
In a specific example of the above implementation, the interaction event may specifically be a login event or an authentication event, and the information of the medium node includes one or more of the following: device identification information, network environment information, and authentication medium information.
According to another implementation, the interaction events are directional interaction events between users, the user nodes include first-class nodes and second-class nodes, and the directed edges are edges pointing from first-class nodes to second-class nodes.
In a specific example of the above implementation, the interaction event may be a transaction event, in which case the first-class nodes are buyer nodes and the second-class nodes are seller nodes; or the interaction event may be a transfer event, in which case the first-class nodes are transferor nodes and the second-class nodes are payee nodes.
According to one embodiment, the apparatus 500 further includes a graph filtering unit (not shown), configured to remove, from the relationship network graph, nodes that do not meet the training needs of the user classification model, together with the edges corresponding to those nodes.
Specifically, the removed nodes may include one or more of the following: invalid nodes that do not conform to a predetermined format; nodes whose number of edges is greater than a certain threshold; nodes on a whitelist; and, when the interaction events involve funds, nodes whose fund flow within a predetermined time period exceeds a predetermined threshold.
According to one implementation, the graph partition unit 52 is specifically configured to: partition the relationship network graph into a plurality of subgraphs according to the time periods in which the interaction events corresponding to the directed edges in the relationship network graph occurred, each subgraph corresponding to one time period; and determine the time period corresponding to the annotation time of the label data used for training the user classification model, and determine the subgraph corresponding to that time period as the first subgraph.
According to another implementation, the graph partition unit 52 is specifically configured to partition the relationship network graph into a plurality of subgraphs according to the geographic region in the basic attributes of the user nodes, each subgraph corresponding to one geographic region, and to determine the subgraph corresponding to the geographic region of the user sample set in the label data used for training the user classification model as the first subgraph.
According to one embodiment, the relationship network graph is a homogeneous graph, and the low-order feature acquisition unit 53 is further configured to acquire the following features of a node: the number and proportion of dual nodes among the neighbor nodes connected to the node, where a dual node is a user node that acts as both a first-class node and a second-class node in the relationship network graph.
When the relationship network graph is a homogeneous graph, the graph conversion unit 54 is configured to convert the directed edges in the first subgraph into undirected edges and merge the duplicate nodes therein to obtain the undirected graph.
According to one embodiment, when acquiring the high-order features of a node, for an H-index of any order, if the maximum value H satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H cannot be determined, the high-order feature acquisition unit 55 takes the maximum value H satisfying the condition that H neighbor nodes have a previous-order H-index greater than H as the H-index of the current order.
According to one embodiment, the feature set generation unit 56 is configured to: for each node, obtain statistical features according to the statistical results of each of the low-order and high-order features of its neighbor nodes, and include the statistical features in the candidate feature set, where the statistical results include one or more of the following: maximum, minimum, mean, median, and mode.
According to one implementation, the apparatus further includes a feature screening unit (not shown), configured to: obtain label data for training the user classification model, the label data including a user sample set and a class label for each user sample therein; map the user sample set to a first node set in the first subgraph; and perform feature screening according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set, to obtain a feature set for the user classification model.
In the above implementation, the feature screening process may specifically include: determining the information value IV of each feature according to the feature value distribution of each feature and the label value distribution, and performing a first screening operation on the features based on the IV; and, for the features retained after the first screening operation, computing correlation coefficients between the retained features and performing a second screening operation based on the correlation coefficients to obtain the feature set.
In one embodiment, after obtaining the above feature set, the feature screening unit further generates a feature record table for recording description information of each feature in the feature set.
With the above apparatus, rich graph features are generated quickly and efficiently for the user classification model, thereby facilitating feature selection and training of the user classification model.
According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2.
According to an embodiment of yet another aspect, a computing device is further provided, including a memory and a processor, where executable code is stored in the memory, and the processor, when executing the executable code, implements the method described in conjunction with FIG. 2.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall be included within the protection scope of the present invention.

Claims (19)

  1. A method for graph feature processing, comprising:
    constructing a relationship network graph according to relationship data, wherein the relationship data comprises records of interaction events in which users participate, and the relationship network graph comprises a plurality of nodes and directed edges between nodes formed based on the interaction events, the plurality of nodes comprising user nodes;
    partitioning the relationship network graph into a plurality of subgraphs, comprising a first subgraph used for training a user classification model;
    for each node in the first subgraph, acquiring low-order features of the node, wherein the low-order features at least comprise the degree of the node;
    converting the first subgraph into an undirected graph;
    for each node in the undirected graph, acquiring high-order features of the node, wherein the high-order features comprise multi-order H-indexes, each order of H-index being the maximum value H such that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0th-order H-index being the degree of the node; and
    generating, based at least on the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
  2. The method according to claim 1, wherein the interaction event is an event that a user performs via a medium, the plurality of nodes further comprise medium nodes, and the directed edges are directed edges between user nodes and medium nodes.
  3. The method according to claim 2, wherein the interaction event is a login event or an authentication event, and the information of the medium node comprises one or more of the following: device identification information, network environment information, and authentication medium information.
  4. The method according to claim 1, wherein the interaction events are directional interaction events between users, the user nodes comprise first-class nodes and second-class nodes, and the directed edges are edges pointing from first-class nodes to second-class nodes.
  5. The method according to claim 4, wherein
    the interaction event is a transaction event, the first-class nodes are buyer nodes, and the second-class nodes are seller nodes; or
    the interaction event is a transfer event, the first-class nodes are transferor nodes, and the second-class nodes are payee nodes.
  6. The method according to claim 1, further comprising, before partitioning the relationship network graph into a plurality of subgraphs: removing, from the relationship network graph, a number of nodes that do not meet the training needs of the user classification model, together with the edges corresponding to those nodes.
  7. The method according to claim 6, wherein the removed nodes comprise one or more of the following:
    invalid nodes that do not conform to a predetermined format;
    nodes whose number of edges is greater than a certain threshold;
    nodes on a whitelist; and
    when the interaction events involve funds, nodes whose fund flow within a predetermined time period exceeds a predetermined threshold.
  8. The method according to claim 1, wherein partitioning the relationship network graph into a plurality of subgraphs comprises:
    partitioning the relationship network graph into a plurality of subgraphs according to the time periods in which the interaction events corresponding to the directed edges in the relationship network graph occurred, each subgraph corresponding to one time period; and
    determining the time period corresponding to the annotation time of label data used for training the user classification model, and determining the subgraph corresponding to that time period as the first subgraph.
  9. The method according to claim 1, wherein partitioning the relationship network graph into a plurality of subgraphs comprises:
    partitioning the relationship network graph into a plurality of subgraphs according to the geographic region in the basic attributes of the user nodes, each subgraph corresponding to one geographic region; and
    determining the subgraph corresponding to the geographic region of a user sample set in label data used for training the user classification model as the first subgraph.
  10. The method according to claim 4, wherein the low-order features of the node further comprise: the number and proportion of dual nodes among the neighbor nodes connected to the node, wherein a dual node is a user node that acts as both a first-class node and a second-class node in the relationship network graph.
  11. The method according to claim 4, wherein converting the first subgraph into an undirected graph comprises:
    converting the directed edges in the first subgraph into undirected edges and merging the duplicate nodes therein to obtain the undirected graph.
  12. The method according to claim 1, wherein acquiring the high-order features of the node comprises: for an H-index of any order, when the maximum value H satisfying the condition that H neighbor nodes have a previous-order H-index greater than or equal to H cannot be determined, taking the maximum value H satisfying the condition that H neighbor nodes have a previous-order H-index greater than H as the H-index of the current order.
  13. The method according to claim 1, wherein generating a candidate feature set based at least on the low-order features and the high-order features comprises: for each node, obtaining statistical features according to the statistical results of each of the low-order and high-order features of its neighbor nodes, and including the statistical features in the candidate feature set, wherein the statistical results comprise one or more of the following: maximum, minimum, mean, median, and mode.
  14. The method according to claim 1 or 13, further comprising:
    obtaining label data for training the user classification model, the label data comprising a user sample set and a class label for each user sample therein;
    mapping the user sample set to a first node set in the first subgraph; and
    performing feature screening according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set, to obtain a feature set for the user classification model.
  15. The method according to claim 14, wherein performing feature screening according to the feature value distribution and label value distribution of each feature in the candidate feature set over the first node set comprises:
    determining the information value IV of each feature according to the feature value distribution of each feature and the label value distribution, and performing a first screening operation on the features based on the information value IV; and
    for the features retained after the first screening operation, computing correlation coefficients between the retained features, and performing a second screening operation based on the correlation coefficients to obtain the feature set.
  16. The method according to claim 14, further comprising generating a feature record table for recording description information of each feature in the feature set.
  17. An apparatus for graph feature processing, comprising:
    a graph construction unit, configured to construct a relationship network graph according to relationship data, wherein the relationship data comprises records of interaction events in which users participate, and the relationship network graph comprises a plurality of nodes and directed edges between nodes formed based on the interaction events, the plurality of nodes comprising user nodes;
    a graph partition unit, configured to partition the relationship network graph into a plurality of subgraphs, comprising a first subgraph used for training a user classification model;
    a low-order feature acquisition unit, configured to acquire, for each node in the first subgraph, low-order features of the node, wherein the low-order features at least comprise the degree of the node;
    a graph conversion unit, configured to convert the first subgraph into an undirected graph;
    a high-order feature acquisition unit, configured to acquire, for each node in the undirected graph, high-order features of the node, wherein the high-order features comprise multi-order H-indexes, each order of H-index being the maximum value H such that H neighbor nodes have a previous-order H-index greater than or equal to H, and the 0th-order H-index being the degree of the node; and
    a feature set generation unit, configured to generate, based at least on the low-order features and the high-order features, a candidate feature set as candidate features for training the user classification model.
  18. A computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to perform the method according to any one of claims 1 to 16.
  19. A computing device, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method according to any one of claims 1 to 16 is implemented.
PCT/CN2020/132654 2020-02-25 2020-11-30 图特征处理的方法及装置 WO2021169454A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010114823.2A CN111368147B (zh) 2020-02-25 2020-02-25 图特征处理的方法及装置
CN202010114823.2 2020-02-25

Publications (1)

Publication Number Publication Date
WO2021169454A1 true WO2021169454A1 (zh) 2021-09-02

Family

ID=71206435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132654 WO2021169454A1 (zh) 2020-02-25 2020-11-30 图特征处理的方法及装置

Country Status (2)

Country Link
CN (1) CN111368147B (zh)
WO (1) WO2021169454A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368147B (zh) * 2020-02-25 2021-07-06 支付宝(杭州)信息技术有限公司 图特征处理的方法及装置
CN112071435B (zh) * 2020-09-09 2023-07-18 北京百度网讯科技有限公司 无向关系至有向关系转换方法、装置、设备以及存储介质
CN111932273B (zh) * 2020-09-28 2021-02-19 支付宝(杭州)信息技术有限公司 一种交易风险识别方法、装置、设备及介质
CN112380216B (zh) * 2020-11-17 2023-07-28 北京融七牛信息技术有限公司 一种基于交叉的自动特征生成方法
CN112214499B (zh) 2020-12-03 2021-03-19 腾讯科技(深圳)有限公司 图数据处理方法、装置、计算机设备和存储介质
CN112600810B (zh) * 2020-12-07 2021-10-08 中山大学 一种基于图分类的以太坊网络钓鱼诈骗检测方法及装置


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653689B (zh) * 2015-12-30 2019-03-26 杭州师范大学 一种用户传播影响力的确定方法和装置
CN106446124B (zh) * 2016-09-19 2019-11-15 成都知道创宇信息技术有限公司 一种基于网络关系图的网站分类方法
CN107220902A (zh) * 2017-06-12 2017-09-29 东莞理工学院 在线社会网络的级联规模预测方法
CN108763354B (zh) * 2018-05-16 2021-04-06 浙江工业大学 一种个性化的学术文献推荐方法
CN109034562B (zh) * 2018-07-09 2021-07-23 中国矿业大学 一种社交网络节点重要性评估方法及系统
CN109445843B (zh) * 2018-10-26 2021-08-03 浙江工商大学 一种基于类多层网络的软件类重要性度量方法
CN109472626B (zh) * 2018-11-26 2020-08-18 浙江大学 一种面向手机租赁业务的智能金融风险控制方法及系统
CN110555455A (zh) * 2019-06-18 2019-12-10 东华大学 一种基于实体关系的在线交易欺诈检测方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144818A1 (en) * 2011-12-06 2013-06-06 The Trustees Of Columbia University In The City Of New York Network information methods devices and systems
CN108491511A (zh) * 2018-03-23 2018-09-04 腾讯科技(深圳)有限公司 基于图数据的数据挖掘方法和装置、模型训练方法和装置
US20190378050A1 (en) * 2018-06-12 2019-12-12 Bank Of America Corporation Machine learning system to identify and optimize features based on historical data, known patterns, or emerging patterns
CN109102393A (zh) * 2018-08-15 2018-12-28 阿里巴巴集团控股有限公司 训练和使用关系网络嵌入模型的方法及装置
CN110020662A (zh) * 2019-01-09 2019-07-16 阿里巴巴集团控股有限公司 用户分类模型的训练方法和装置
CN110213164A (zh) * 2019-05-21 2019-09-06 南瑞集团有限公司 一种基于拓扑信息融合的识别网络关键传播者的方法及装置
CN111368147A (zh) * 2020-02-25 2020-07-03 支付宝(杭州)信息技术有限公司 图特征处理的方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, KAI: "Research on Technology of User Behavior Modeling Based on Graph Mining", THESIS SUBMITTED TO NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS FOR THE DEGREE OF MASTER OF ENGINEERING, 1 April 2018 (2018-04-01), XP055840565, [retrieved on 20210913] *

Also Published As

Publication number Publication date
CN111368147A (zh) 2020-07-03
CN111368147B (zh) 2021-07-06

Similar Documents

Publication Publication Date Title
WO2021169454A1 (zh) 图特征处理的方法及装置
WO2021164382A1 (zh) 针对用户分类模型进行特征处理的方法及装置
WO2018077039A1 (zh) 社区发现方法、装置、服务器及计算机存储介质
WO2015135321A1 (zh) 基于金融数据的社会关系挖掘的方法及装置
JP2019057286A (ja) データアップロード、処理及び予測クエリapi公開を実施するシステム、方法及び装置
KR101674924B1 (ko) 데이터베이스 마이그레이션 방법 및 그 장치
CN107832407B (zh) 用于生成知识图谱的信息处理方法、装置和可读存储介质
CN108885673B (zh) 用于计算数据隐私-效用折衷的系统和方法
CN104077723B (zh) 一种社交网络推荐系统及方法
CN110991474A (zh) 一种机器学习建模平台
CN105824855B (zh) 一种对数据对象筛选分类的方法、装置以及电子设备
US10713573B2 (en) Methods and systems for identifying and prioritizing insights from hidden patterns
CN111090780A (zh) 可疑交易信息的确定方法及装置、存储介质、电子设备
CN112989059A (zh) 潜在客户识别方法及装置、设备及可读计算机存储介质
CN110224859B (zh) 用于识别团伙的方法和系统
CN111639690A (zh) 基于关系图谱学习的欺诈分析方法、系统、介质及设备
CN111581450A (zh) 确定用户的业务属性的方法及装置
US20130006880A1 (en) Method for finding actionable communities within social networks
CN116401379A (zh) 金融产品数据推送方法、装置、设备及存储介质
Ma et al. Class-imbalanced learning on graphs: A survey
CN112100452B (zh) 数据处理的方法、装置、设备及计算机可读存储介质
WO2023178767A1 (zh) 基于企业征信大数据知识图谱的企业风险检测方法和装置
CN114331665A (zh) 用于预定申请人的信用判定模型的训练方法、装置和电子设备
CN110765100B (zh) 标签的生成方法、装置、计算机可读存储介质及服务器
CN114202418A (zh) 信息处理方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921753

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921753

Country of ref document: EP

Kind code of ref document: A1