CN111368147B - Graph feature processing method and device

Info

Publication number: CN111368147B
Application number: CN202010114823.2A
Authority: CN
Inventors: 张屹綮; 张天翼; 王维强
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2021-07-06
Anticipated expiration: 2040-02-25
Also published as: WO2021169454A1; CN111368147A

Abstract

The embodiments of this specification provide a method and a device for processing graph features. In the method, a relational network graph is first constructed from relational data, where the relational data comprises records of interaction events in which users participate, and the relational network graph includes a plurality of user nodes and directed edges formed based on the interaction events. The relational network graph is then partitioned into a plurality of sub-graphs, including a first sub-graph used for training a user classification model. For each node in the first sub-graph, low-order features of the node are obtained, the low-order features including the degree of the node. Then, for each node in an undirected graph obtained from the first sub-graph, high-order features of the node are obtained, the high-order features including multi-order H indexes, where the H index of each order is the maximum value H such that H neighbor nodes have a previous-order H index greater than or equal to H, and the 0th-order H index is the degree of the node. A candidate feature set can thus be generated based on the low-order features and the high-order features, to serve as candidate features for training the user classification model.

Description

Graph feature processing method and device

Technical Field

One or more embodiments of the present specification relate to the field of machine learning, and more particularly, to a method and apparatus for graph feature processing for a user classification model.

Background

With the rapid development of artificial intelligence and machine learning, machine learning models are increasingly used for business analysis in a variety of business scenarios. For example, many application scenarios require users to be classified, for instance to identify the risk level of a user or to determine the group to which a user belongs. For this reason, it is often necessary to train a user classification model for service-related user identification and user classification.

The selection and processing of features is the basis of model training. To train a user classification model with good performance and accurate predictions, features that are closely related to the prediction target and reflect user characteristics need to be selected from a large number of user features. In the simplest scenario, features are selected from the basic attribute features of the user, and the trained model meets the requirements. However, as service scenarios become more complex, the basic attribute features of the user are often not rich or comprehensive enough to meet the performance requirements of model training. It is therefore desirable to generate additional or derived features as a supplement for model training, and graph features generated from a user relationship network are one kind of such supplementary features. However, a network graph is a relatively complex data structure and its analysis requires a large amount of computation, so efficiently extracting meaningful features suitable for model training is difficult and challenging.

Therefore, it is desirable to have an improved scheme for more efficiently processing graph data and rapidly extracting effective graph features for user classification model selection and training.

Disclosure of Invention

One or more embodiments of the present specification describe a method and an apparatus for graph feature processing for a user classification model, which can efficiently generate rich graph features, thereby facilitating feature selection and training of the user classification model.

According to a first aspect, there is provided a method of graph feature processing, comprising:

constructing a relationship network graph according to the relationship data; the relationship data comprises a record of interaction events in which the user participates; the relational network graph comprises a plurality of nodes and directed edges between the nodes formed based on the interaction events, wherein the plurality of nodes comprise user nodes;

segmenting the relational network graph into a plurality of sub-graphs, wherein the sub-graphs comprise a first sub-graph used for user classification model training;

for each node in the first sub-graph, obtaining low-order features of the node, wherein the low-order features at least comprise the degree of the node;

converting the first subgraph into an undirected graph;

for each node in the undirected graph, obtaining high-order features of the node, wherein the high-order features comprise multi-order H indexes, and the H index of each order is the maximum value H such that the previous-order H indexes of H neighbor nodes are greater than or equal to H; wherein the 0-order H index is the degree of the node;

and generating, at least based on the low-order features and the high-order features, a candidate feature set to serve as candidate features for training the user classification model.

According to one embodiment, the interactive event is an event performed by a user via a medium; the plurality of nodes further comprises a media node; the directed edge is a directed connecting edge between the user node and the medium node.

In a specific embodiment of the foregoing implementation, the interaction event is specifically a login event or an authentication event, and the information of the media node includes one or more of the following: device identification information, network environment information, authentication medium information.

According to another embodiment, the interactive event is a directional interactive event between users, and the user nodes comprise a first type node and a second type node; the directed edge is a connecting edge pointing from the first type node to the second type node.

In a specific embodiment of the foregoing embodiment, the interaction event may be a transaction event, where the first type of node is a buyer node and the second type of node is a seller node; or, the interaction event may be a transfer event, where the first type node is a transfer party node and the second type node is a receiving party node.

According to one embodiment, before the relational network graph is divided into a plurality of sub-graphs, graph filtering is performed on the relational network graph, and the graph filtering includes eliminating a plurality of nodes which do not meet the training requirement of the user classification model and connecting edges corresponding to the nodes from the relational network graph.

Specifically, the removed nodes may include one or more of the following: invalid nodes that do not conform to a predetermined format; nodes whose number of connecting edges is larger than a certain threshold; nodes located in a white list; and, in the case that the interaction event relates to funds, nodes whose fund flow exceeds a preset threshold within a preset time period.

According to one embodiment, a relational network graph is partitioned into a plurality of subgraphs by: according to the time period of the occurrence of the interaction event corresponding to the directed edge in the relational network graph, dividing the relational network graph into a plurality of sub-graphs, wherein each sub-graph corresponds to one time period; and determining a time period corresponding to the labeling time of the label data used for training the user classification model, and determining a sub-graph corresponding to the time period as the first sub-graph.

According to another embodiment, the relational network graph is partitioned into a plurality of subgraphs by: dividing the relationship network graph into a plurality of sub-graphs according to the geographic area in the basic attribute of the user node, wherein each sub-graph corresponds to one geographic area; determining a sub-graph corresponding to a geographic region of a user sample set in label data used to train the user classification model as the first sub-graph.

According to one embodiment, the relational network graph is a homogeneous graph, and the low-order features obtained for a node further include: the number and the ratio of dual nodes among the neighbor nodes connected to the node; a dual node is a user node that serves as both a first-type node and a second-type node in the relational network graph.

When the relational network graph is a homogeneous graph, converting the first sub-graph into an undirected graph specifically includes: converting the directed edges in the first sub-graph into undirected edges, and merging the repeated nodes connected by the undirected edges to obtain the undirected graph.

According to one embodiment, when the high-order features of a node are obtained, for the H index of any order, if the maximum value H such that the previous-order H indexes of H neighbor nodes are greater than or equal to H cannot be determined, the maximum value H such that the previous-order H indexes of H neighbor nodes are greater than H is used as the H index of that order.

According to an embodiment, generating the candidate feature set specifically comprises: for each node, obtaining statistical features from statistics of each of the low-order features and the high-order features of its neighbor nodes, and including the statistical features in the candidate feature set; the statistics include one or more of: maximum, minimum, mean, median, and mode.

According to one embodiment, the method further comprises: obtaining label data for training the user classification model, wherein the label data comprises a user sample set and category labels of all user samples; mapping the set of user samples to a first set of nodes in the first subgraph; and performing feature screening according to the feature value distribution and the label value distribution of each feature in the candidate feature set on the first node set to obtain a feature set for the user classification model.

In the above embodiment, the process of feature screening may specifically include: determining the information value IV of each feature according to the feature value distribution and the label value distribution of each feature, and performing first screening operation on each feature based on the information value IV; and calculating a correlation coefficient between the reserved features after the first screening operation, and performing a second screening operation based on the correlation coefficient to obtain the feature set.

In one embodiment, after the feature set is obtained, a feature recording table is further generated for recording description information of each feature in the feature set.

According to a second aspect, there is provided an apparatus for graph feature processing, comprising:

the graph building unit is configured to build a relational network graph according to the relational data; the relationship data comprises a record of interaction events in which the user participates; the relational network graph comprises a plurality of nodes and directed edges between the nodes formed based on the interaction events, wherein the plurality of nodes comprise user nodes;

a graph partitioning unit configured to partition the relational network graph into a plurality of sub-graphs, including a first sub-graph for user classification model training;

a low-order feature obtaining unit configured to obtain, for each node in the first subgraph, a low-order feature of the node, where the low-order feature at least includes a degree of the node;

a graph converting unit configured to convert the first sub-graph into an undirected graph;

a high-order feature obtaining unit configured to obtain, for each node in the undirected graph, high-order features of the node, wherein the high-order features comprise multi-order H indexes, and the H index of each order is the maximum value H such that the previous-order H indexes of H neighbor nodes are greater than or equal to H; wherein the 0-order H index is the degree of the node;

and a feature set generating unit configured to generate, at least based on the low-order features and the high-order features, a candidate feature set to serve as candidate features for training the user classification model.

According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.

According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first aspect.

According to the embodiments of this specification, in order to provide richer feature selection for training the user classification model, a relational network graph is constructed based on interaction events in which users participate, and graph features are extracted from the relational network graph. The graph features comprise not only low-order features such as the degree of a node, but also the H index, which is innovatively introduced as a high-order graph feature. Richer graph features of each node are thereby obtained and used for feature selection and training of the user classification model.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a graph feature processing process according to one embodiment disclosed herein;

FIG. 2 illustrates a flow diagram of a method of graph feature processing for a user classification model, according to one embodiment;

FIG. 3 illustrates an example of a homogeneous graph in accordance with one embodiment;

FIG. 4 illustrates an example of transforming a homogeneous graph according to one embodiment;

FIG. 5 shows a schematic block diagram of a graph feature processing apparatus according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

In order to model and train the user classification model more efficiently, one embodiment of this specification provides an end-to-end graph feature processing scheme, which generates a relational network graph from relational data recording user interaction events, and extracts low-order and high-order graph features of nodes from it as candidate features for feature screening and training of the user classification model.

FIG. 1 is a schematic diagram of a graph feature processing procedure according to an embodiment disclosed herein. As shown in fig. 1, a relational network graph is first constructed based on relational data. The relational data holds records of interaction events in which users participate; correspondingly, the relational network graph constructed from it comprises user nodes, and the connecting edges between nodes are established based on the interaction events. In embodiments of this specification, directed connecting edges may be established to reflect the directionality of the interaction events. Accordingly, the relational network graph may take the form of a bipartite graph.

Optionally, some filtering processes may be performed on the relationship network graph constructed above, so as to remove some nodes and edges that do not need to be analyzed. Further, the relational network graph can be split into subgraphs, thereby facilitating subsequent processing.

Based on the sub-graph obtained by the above processing, node features can be extracted. The extracted node features include low-order features and high-order features, where the low-order features include at least the degree of the node. For the high-order features, the embodiments of this specification innovatively apply the H index, which is used in other fields, to graph analysis as a high-order graph feature: the H index of a node in a relational network graph is the largest value H such that the node has at least H neighbor nodes whose degree is greater than or equal to H. Further, multi-order H indexes can be obtained through iteration. Richer high-order features of each node are thus obtained.
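The iteration described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the adjacency-set layout, the function names `h_operator` and `multi_order_h_index`, and the toy graph are all assumptions made for the example.

```python
from typing import Dict, Hashable, List, Set

# Undirected graph stored as adjacency sets (a hypothetical in-memory layout).
Graph = Dict[Hashable, Set[Hashable]]


def h_operator(values: List[int]) -> int:
    """Return the largest h such that at least h of the values are >= h."""
    h = 0
    for rank, value in enumerate(sorted(values, reverse=True), start=1):
        if value < rank:
            break
        h = rank
    return h


def multi_order_h_index(graph: Graph, max_order: int) -> List[Dict[Hashable, int]]:
    """Return one {node: H index} mapping per order, from order 0 to max_order."""
    # Order 0 is the degree of each node.
    current = {node: len(neighbors) for node, neighbors in graph.items()}
    orders = [current]
    for _ in range(max_order):
        # Each order applies the H operator to the neighbors' previous-order values.
        current = {
            node: h_operator([current[n] for n in neighbors])
            for node, neighbors in graph.items()
        }
        orders.append(current)
    return orders


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy_graph: Graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
h_orders = multi_order_h_index(toy_graph, max_order=2)
```

In this toy graph, node a has degree 3 but a first-order H index of 2, because only two of its neighbors have degree 2 or more.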

Optionally, the low-order and high-order features of the neighbor nodes of each node may be aggregated to obtain statistical features. The low-order features, the high-order features, and optionally the statistical features together constitute a candidate feature set. The candidate feature set comprises graph features generated and extracted based on the relational network graph; these graph features, particularly the high-order node features, differ in nature from features extracted in conventional ways.
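As a rough sketch of such neighbor statistics, the following computes the maximum, minimum, mean, median, and mode of one feature over each node's neighbors. The function name `neighbor_statistics`, the data layout, and the tie-breaking rule for the mode (smallest value wins) are assumptions made for illustration.

```python
from statistics import mean, median, multimode
from typing import Dict, Hashable, Set


def neighbor_statistics(
    graph: Dict[Hashable, Set[Hashable]],
    feature: Dict[Hashable, float],
) -> Dict[Hashable, Dict[str, float]]:
    """Aggregate one node feature over the neighbors of every node."""
    stats = {}
    for node, neighbors in graph.items():
        values = [feature[n] for n in neighbors]
        if not values:
            continue  # an isolated node has no neighbor statistics
        stats[node] = {
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            "median": median(values),
            # Assumption: when several values tie for the mode, keep the smallest.
            "mode": min(multimode(values)),
        }
    return stats


# Toy graph: a triangle a-b-c with a pendant node d; the feature is the degree.
toy_graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
degree = {node: len(neighbors) for node, neighbors in toy_graph.items()}
degree_stats = neighbor_statistics(toy_graph, degree)
```

The same function could be applied once per low-order or high-order feature, yielding five statistical features for each.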

Each feature in the candidate feature set may be evaluated in various ways, for example by its information value (IV) or by correlation coefficients, and screened accordingly. A feature set suitable for the user classification model can thus be selected from the candidate feature set, and a user classification model with better performance can be obtained through training.
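A two-stage screening of this kind might look like the sketch below. It uses the standard IV formula, the sum over bins of (p - q) * ln(p / q), where p and q are the shares of positive and negative samples falling in a bin, and the Pearson correlation coefficient. The additive smoothing, the thresholds `iv_min` and `corr_max`, the assumption that feature values are already discretized, and all feature names are illustrative choices that the specification does not fix.

```python
import math
from typing import Dict, List, Sequence


def information_value(bins: Sequence[float], labels: Sequence[int], eps: float = 0.5) -> float:
    """IV of an already-discretized feature against binary labels (1 = positive)."""
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    distinct = sorted(set(bins))
    iv = 0.0
    for b in distinct:
        pos = sum(1 for x, y in zip(bins, labels) if x == b and y == 1)
        neg = sum(1 for x, y in zip(bins, labels) if x == b and y == 0)
        # Additive smoothing (an assumption) avoids log(0) for empty cells.
        p = (pos + eps) / (total_pos + eps * len(distinct))
        q = (neg + eps) / (total_neg + eps * len(distinct))
        iv += (p - q) * math.log(p / q)
    return iv


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def screen_features(
    features: Dict[str, List[float]],
    labels: List[int],
    iv_min: float,
    corr_max: float,
) -> List[str]:
    """First keep features with IV >= iv_min, then drop near-duplicates by correlation."""
    ivs = {name: information_value(vals, labels) for name, vals in features.items()}
    by_iv = sorted((n for n in features if ivs[n] >= iv_min), key=lambda n: ivs[n], reverse=True)
    selected: List[str] = []
    for name in by_iv:
        if all(abs(pearson(features[name], features[s])) <= corr_max for s in selected):
            selected.append(name)
    return selected


# Hypothetical candidate features for eight labeled user nodes.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
candidates = {
    "h1": [3, 3, 3, 1, 1, 1, 1, 1],
    "h1_copy": [6, 6, 6, 2, 2, 2, 2, 2],  # perfectly correlated with h1
    "deg": [5, 1, 5, 1, 1, 5, 1, 1],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
}
selected = screen_features(candidates, labels, iv_min=0.3, corr_max=0.9)
```

Here the low-IV feature is dropped in the first stage, and the redundant copy of `h1` is dropped in the second stage.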

The specific steps and implementations of the above scheme are described below.

FIG. 2 illustrates a flow diagram of a method for graph feature processing for a user classification model, according to one embodiment. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 2, the graph feature processing method includes at least the following steps.

At step 21, a relational network graph is constructed from the relational data. Wherein the relationship data comprises event records of interaction events in which the user participates; correspondingly, the constructed relational network graph comprises a plurality of nodes and directed edges between the nodes formed on the basis of the interaction events, wherein the plurality of nodes comprise user nodes.

In particular, the interaction events may be events in which users participate and which are helpful to the classification prediction goal of the user classification model. For example, when the user classification model is used to assess a user's transaction risk, the relational network graph may be constructed based on transaction events; when the user classification model is used to evaluate a user's login risk, the relational network graph may be constructed based on login events; when the user classification model is used to determine the marketing audience group to which a user belongs, the relational network graph may be constructed based on coupon redemption events; and so on.

In different embodiments, the interaction event may be an interaction between users, or may involve other objects. In the two cases, the formed relationship network graph is a homogeneous graph and a heterogeneous graph respectively.

In particular, in one embodiment, the interaction event is an event performed by a user via a medium, wherein the medium object is involved. In such a case, the constructed relational network graph is a heterogeneous graph, which includes media nodes in addition to user nodes. Correspondingly, the connection edge is a directed connection edge between the user node and the media node.

For example, the interaction event may be a login event, in which the user logs in with a particular device and network environment. In this case, the media nodes may include device nodes and/or network environment nodes. More specifically, a device node may be represented by device identification information, which may include a device identifier such as the MAC address, mobile phone SIM number, UMID, or APDID of the device. A network environment node may be represented by the network environment information at login, e.g., an IP address or a Wi-Fi network identifier. If a user logs in by means of a certain medium, a connecting edge is constructed between the user node corresponding to the user and the media node corresponding to the medium.

As another example, the interaction event may be an authentication event, in which the user is authenticated by means of some authentication medium. In this case, the media node may use the above-mentioned authentication medium, such as a credit card number, an identification number, a mobile phone number, and the like for authentication. If a certain user uses a certain authentication medium to perform identity authentication, a connection edge is constructed between a user node corresponding to the user and a medium node corresponding to the authentication medium.

There are other specific examples of heterogeneous graphs, not enumerated here. In the case of a heterogeneous graph, the user nodes may be treated as one type of node, and the other objects may be treated as another type of node, and the heterogeneous graph thus obtained may be a bipartite graph.

In another embodiment, the interactive event is a directional interactive event between users. In such a case, the constructed relational network graph is a homogenous graph in which all nodes are user nodes. Each user node can be represented by user identification information, wherein the user identification information can specifically adopt the forms of account ID, mobile phone number, mailbox address and the like. Further, according to the directionality of the interaction event, the user nodes may be divided into two types of nodes, referred to as a first type of node and a second type of node, where the first type of node corresponds to the event starting point and the second type of node corresponds to the event target. Accordingly, the connecting edges in the homogenous graph are directed edges pointing from the first type of node to the second type of node.

Specifically, in one example, the interaction event is a transaction event. In such a case, the first type of node corresponds to a buyer user and the second type of node corresponds to a seller user. In a typical implementation, the corresponding relational data is a transaction record table in which each row records one transaction. The transaction record table may contain, for example, 4 columns of data: buyer account, seller account, transaction amount, and transaction time. Each account in the buyer account column can then be used as a first-type node, each account in the seller account column can be used as a second-type node, and a directed edge can be established between the buyer account and the seller account that appear in the same transaction, pointing from the buyer account to the seller account.

If the same account appears in different transactions, sometimes as a buyer in the buyer account column and sometimes as a seller in the seller account column, then when the graph is constructed the account is recorded among both the first-type nodes and the second-type nodes; that is, the account is represented as a first-type node and, separately, as a second-type node.

For the connecting edges in the above graph, the transaction amount and the transaction time can be used as edge attribute information. The same pair of buyer account and seller account may have many transactions, in which case information such as the number of transactions may also be included in the edge attribute information.
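The construction described above can be sketched as follows, assuming rows of the form (buyer account, seller account, amount, time). The role-tagged node keys, the attribute names, and the sample rows are hypothetical choices for illustration.

```python
from typing import Dict, Iterable, Tuple

Transaction = Tuple[str, str, float, str]  # (buyer account, seller account, amount, time)
Node = Tuple[str, str]                     # (role, account)


def build_transaction_graph(rows: Iterable[Transaction]) -> Dict[Tuple[Node, Node], dict]:
    """Build directed buyer -> seller edges with aggregated edge attributes."""
    edges: Dict[Tuple[Node, Node], dict] = {}
    for buyer, seller, amount, time in rows:
        # Role-tagged keys keep an account's buyer role and seller role as separate nodes.
        key = (("buyer", buyer), ("seller", seller))
        attrs = edges.setdefault(key, {"count": 0, "total_amount": 0.0, "last_time": time})
        attrs["count"] += 1
        attrs["total_amount"] += amount
        # ISO-formatted date strings compare correctly as plain strings.
        attrs["last_time"] = max(attrs["last_time"], time)
    return edges


# Hypothetical transaction record table.
rows = [
    ("A", "B", 10.0, "2020-01-03"),
    ("A", "B", 5.0, "2020-01-05"),
    ("B", "C", 7.5, "2020-01-04"),
]
edges = build_transaction_graph(rows)
nodes = {node for edge in edges for node in edge}
```

Account B appears here both as a seller node and as a buyer node, and the two transactions from A to B collapse into one edge whose attributes record a count of 2.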

In another example, the interaction event may be a transfer event, in which case the first type of node is a transferring-party node and the second type of node is a receiving-party node. In yet another example, the interaction event may be a social event involving a certain behavior, such as a calling behavior or a sharing behavior, where the first type of node corresponds to the initiator of the behavior, e.g., a caller or a sharing initiator, and the second type of node corresponds to the recipient of the behavior, e.g., a callee or a sharing recipient.

There are other specific examples of homogenous graphs, not enumerated here. In the case of the homogeneous graph, since the user nodes are divided into two types, the homogeneous graph obtained at this time can also be regarded as a bipartite graph.

In one embodiment, the topology of the relational network graph may be recorded in table form, for example as an adjacency table, or as an edge table in which each connecting edge is recorded with the start point and the target point of the directed edge in two separate columns, and so on.
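The two table forms carry the same topology. As a small sketch, with the function name and sample data being assumptions, a two-column edge table can be converted into an adjacency table like this:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def edge_table_to_adjacency(edge_table: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Convert (start point, target point) rows into an adjacency table."""
    adjacency: Dict[str, List[str]] = defaultdict(list)
    for start, target in edge_table:
        if target not in adjacency[start]:
            adjacency[start].append(target)  # skip repeated edges
        adjacency.setdefault(target, [])     # keep nodes that only receive edges
    return dict(adjacency)


# Hypothetical edge table with one repeated edge.
edge_table = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u1", "u2")]
adjacency = edge_table_to_adjacency(edge_table)
```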

After the above-mentioned relational network graph is obtained, optionally, some preprocessing operations may be performed on the relational network graph to simplify or facilitate subsequent operations. In one embodiment, the preprocessing operations may include graph filtering operations, i.e., removing nodes and associated connecting edges from the relational network graph that do not meet the training requirements of the user classification model.

In particular, the graph filtering operation may include first removing invalid nodes. Invalid nodes are nodes that do not meet format requirements, mainly nodes whose format was corrupted during data transmission. In practice, invalid nodes are mostly media nodes, such as UMID, APDID, and SIM nodes. If a node's format does not conform to the standard format, the node and the edges connected to it are all removed.

The graph filtering operation may also include removing nodes whose number of connecting edges is greater than a threshold. Such nodes may be referred to as hotspot nodes. In practice, different thresholds are set for different relational data. For example, in a graph built from transaction events, a node with more than 300 connecting edges may be regarded as a hotspot node, while for media nodes in a heterogeneous graph the threshold may be set to 1000 connecting edges.

In the case where the interaction event involves funds, such as a transaction event or a transfer event, nodes whose fund flow exceeds a predetermined threshold within a predetermined period of time, such as nodes whose single-day transaction amount reaches 100,000, may be removed in the graph filtering step.

In other examples, a white list may be pre-configured, which includes nodes whose classification is known and need not be analyzed, such as accounts of known merchants. In such a case, the nodes in the white list, and associated connecting edges, may be removed in a graph filtering operation.

It should be understood that the classification of hotspot nodes, white list nodes, and nodes with large transaction amounts can generally be determined through other rules, so these nodes are seldom used as training samples for the user classification model. Moreover, the connection structure of such nodes is usually complex. Removing them during preprocessing therefore simplifies the relational network graph and facilitates subsequent graph computation and analysis, without affecting the selection of training data for the user classification model.
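The four removal rules above can be combined in a single pass, as in the following sketch. The row layout (source, target, amount, day), the function name `filter_graph`, the ID pattern, and every threshold value are assumptions; this version also counts distinct neighbors as a node's number of connecting edges and evaluates all rules on the unfiltered graph.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple[str, str, float, str]  # (source, target, amount, day)


def filter_graph(
    rows: List[Row],
    is_valid: Callable[[str], bool],
    max_degree: int,
    max_daily_amount: float,
    whitelist: Set[str],
) -> List[Row]:
    """Drop every edge that touches an invalid, hotspot, high-amount, or whitelisted node."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    daily_amount: Dict[Tuple[str, str], float] = defaultdict(float)
    for src, dst, amount, day in rows:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
        daily_amount[(src, day)] += amount
        daily_amount[(dst, day)] += amount

    removed = set(whitelist)
    removed |= {n for n in neighbors if not is_valid(n)}                          # invalid format
    removed |= {n for n, nb in neighbors.items() if len(nb) > max_degree}         # hotspot nodes
    removed |= {n for (n, _), total in daily_amount.items() if total > max_daily_amount}
    return [row for row in rows if row[0] not in removed and row[1] not in removed]


# Hypothetical data; the thresholds are far smaller than realistic ones.
rows = [
    ("hub", "u1", 10.0, "d1"), ("hub", "u2", 10.0, "d1"),
    ("hub", "u3", 10.0, "d1"), ("hub", "u4", 10.0, "d1"),
    ("u1", "u2", 10.0, "d1"),
    ("bad id", "u2", 5.0, "d1"),
    ("u3", "u4", 500.0, "d2"),
    ("u1", "shop", 10.0, "d2"),
]
kept = filter_graph(
    rows,
    is_valid=lambda n: re.fullmatch(r"[a-z0-9]+", n) is not None,
    max_degree=3,
    max_daily_amount=100.0,
    whitelist={"shop"},
)
```

Only the edge between u1 and u2 survives: the other edges touch the hotspot node, the malformed ID, the whitelisted merchant, or a node with a large single-day amount.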

To further simplify the processing of the relational network graph, at step 22, the relational network graph is partitioned into a plurality of sub-graphs, including a first sub-graph for user classification model training. The relationship network graph in step 22 may be graph filtered or not graph filtered.

It should be understood that training of the user classification model requires not only feature data, but also label data, which includes a user sample set and category labels of each user sample therein. In one embodiment, the segmentation of the relational network graph and the selection of the subgraph can be performed by referring to the tag data.

In one embodiment, the label data includes the labeling time of the category labels, in which case the relational network graph can be segmented based on time. Specifically, the relational network graph may be divided into a plurality of sub-graphs according to the time period in which the interaction event corresponding to each connecting edge occurred, where each sub-graph corresponds to one time period. Then, the time period corresponding to the labeling time in the label data is determined, and the sub-graph corresponding to that time period is determined as the first sub-graph used for model training. It is to be understood that the first sub-graph may be a collective term for multiple sub-graphs. For example, the label data may be labeled by month and include labels assigned in July and in August. Correspondingly, the transaction relational network graph can be divided into a plurality of sub-graphs according to the month in which each transaction occurred, each sub-graph corresponding to one month. The sub-graphs corresponding to July and August can then be selected as the first sub-graph.
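The monthly example can be sketched as below, assuming each edge carries an ISO-formatted date string. The function names and the sample edges are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TimedEdge = Tuple[str, str, str]  # (source, target, date as "YYYY-MM-DD")


def partition_by_month(edges: Iterable[TimedEdge]) -> Dict[str, List[Tuple[str, str]]]:
    """Split the edge set into one sub-graph per calendar month."""
    subgraphs: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for src, dst, date in edges:
        subgraphs[date[:7]].append((src, dst))  # "YYYY-MM" identifies the month
    return dict(subgraphs)


def select_first_subgraph(
    subgraphs: Dict[str, List[Tuple[str, str]]],
    label_months: Iterable[str],
) -> List[Tuple[str, str]]:
    """Union of the monthly sub-graphs whose month matches the labeling time."""
    first: List[Tuple[str, str]] = []
    for month in sorted(set(label_months)):
        first.extend(subgraphs.get(month, []))
    return first


# Hypothetical edges; the labels are assumed to have been assigned in July and August.
timed_edges = [
    ("u1", "u2", "2019-06-30"),
    ("u1", "u3", "2019-07-02"),
    ("u2", "u3", "2019-08-15"),
    ("u3", "u4", "2019-09-01"),
]
monthly = partition_by_month(timed_edges)
first_subgraph = select_first_subgraph(monthly, ["2019-07", "2019-08"])
```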

In one embodiment, the tag data is divided by the geographic region in which the user sample is located, in which case the relational network graph may be segmented based on the geographic region. Specifically, the relationship network graph may be divided into a plurality of sub-graphs according to the geographic area, such as a city, in the basic attribute of each user node in the relationship network graph, where each sub-graph corresponds to one geographic area. Thus, a sub-graph corresponding to the geographic region of the user sample set in the label data may be determined as the first sub-graph for model training.

According to another embodiment, the relational network graph can be segmented using a pre-trained segmentation model. For example, a meta-learning multi-classification model may be trained to classify the connecting edges in the graph, and the relational network graph is then segmented according to the classification of the edges. The loss function of the meta-learning multi-classification model may be the error between the information value (IV) of the graph features generated after graph segmentation and the IV of the graph features generated without segmentation. The meta-learning multi-classification model can be trained in an existing manner, which is not described in detail here.

In other specific examples, the relational network graph may be graph-partitioned based on other principles to obtain a plurality of subgraphs. One or more sub-graphs among the obtained sub-graphs may correspond to the label data as a first sub-graph for user classification model training.

Next, at step 23, for each node in the first sub-graph, a low-order feature of the node is obtained.

As previously mentioned, the relational network graph may be a heterogeneous graph or a homogeneous graph, and the first sub-graph is correspondingly a heterogeneous graph or a homogeneous graph. In the case where the first sub-graph is a heterogeneous graph, the low-order feature of a node may be the degree of the node. The degree of a node represents the number of neighbor nodes to which the node is connected, or equivalently the number of connecting edges the node has.

In the case where the first sub-graph is a homogeneous graph, the low-order features of a node include, in addition to the degree of the node, the number and the proportion of dual nodes among its connected neighbor nodes. Dual nodes are user nodes that serve as both a first-type node and a second-type node in the relational network graph.

FIG. 3 illustrates an example of a homogeneous graph in accordance with one embodiment. It is assumed that the relational network graph is constructed based on transaction events, with the buyer user nodes (the payers in the transaction events) in the left column and the seller user nodes (the payees in the transaction events) in the right column. As shown, nodes 2 and 4 are both sellers and buyers, and thus nodes 2 and 4 are dual nodes, also referred to as interchangeable-identity nodes.

As shown in FIG. 3, when the graph is constructed, a user corresponding to a dual node is represented as two nodes, one as a first-type node and one as a second-type node. When calculating the low-order features, the low-order features of the node as a first-type node and its low-order features as a second-type node are likewise considered separately. Thus, the low-order features of each buyer node and each seller node in FIG. 3 can be determined separately. For example, buyer node 1 is connected to 3 seller nodes (6, 2, 4), and of these 3 seller nodes, nodes 2 and 4 are both dual nodes, so the number of dual nodes is 2 and the proportion is 2/3. The low-order features calculated separately for the buyer node group and the seller node group are shown in Table 1 below.

Table 1:

Thus, the low-order features of each node in the first sub-graph are obtained.
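As a rough sketch of the low-order feature calculation for a homogeneous transaction graph, the following computes the degree, the dual-node count, and the dual-node proportion separately for buyer nodes and seller nodes. The edge list `EDGES` is an assumption: the text only states that buyer node 1 is connected to sellers 6, 2 and 4 and that nodes 2 and 4 are dual nodes, so the remaining edges are hypothetical and were merely chosen to agree with the node degrees listed in Table 2.

```python
from collections import defaultdict

# Hypothetical (buyer, seller) edges; only the three edges of buyer 1 are stated in the text.
EDGES = [(1, 6), (1, 2), (1, 4), (2, 4), (2, 7), (3, 2), (3, 4), (4, 7), (5, 6)]


def low_order_features(edges):
    """Return per-buyer and per-seller degree, dual-node count and dual-node ratio."""
    buyers = {b for b, _ in edges}
    sellers = {s for _, s in edges}
    dual = buyers & sellers  # users acting as both buyer and seller

    buyer_neighbors = defaultdict(set)   # buyer -> sellers it pays
    seller_neighbors = defaultdict(set)  # seller -> buyers paying it
    for b, s in edges:
        buyer_neighbors[b].add(s)
        seller_neighbors[s].add(b)

    def summarize(neighbors):
        features = {}
        for node, nbrs in neighbors.items():
            n_dual = len(nbrs & dual)
            features[node] = {
                "degree": len(nbrs),
                "dual_count": n_dual,
                "dual_ratio": n_dual / len(nbrs),
            }
        return features

    return summarize(buyer_neighbors), summarize(seller_neighbors)


buyer_feats, seller_feats = low_order_features(EDGES)
# For buyer node 1: degree 3, two dual neighbors (2 and 4), ratio 2/3.
```

Keeping the buyer-side and seller-side dictionaries apart mirrors the statement that a dual node's features as a first-type node and as a second-type node are considered separately.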

Next, at step 24, the first subgraph is converted to an undirected graph.

For the case that the first sub-graph is a heterogeneous graph, the undirected graph can be obtained only by converting the directed edges into undirected edges. For the case that the first sub-graph is a homogeneous graph, the converting may include converting directed edges in the homogeneous graph into undirected edges, and merging repeated nodes therein, thereby obtaining the undirected graph.

FIG. 4 illustrates an example of converting a homogeneous graph according to one embodiment. The leftmost side of FIG. 4 shows the original homogeneous graph, which is the same as that shown in FIG. 3. For this homogeneous graph, the directed edges pointing from the first-type nodes on the left to the second-type nodes are first converted into undirected edges, yielding graph A. Then, the duplicate nodes in graph A are merged: the two nodes 2 are merged into one node, and the two nodes 4 are merged into one node. When two duplicate nodes are merged into one node, the connecting edges between other nodes and either of the two duplicate nodes become connecting edges of the merged node. Thus, graph B is obtained, in which the nodes and connecting edges of the homogeneous graph are updated.
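The conversion can be sketched as below, assuming each node of the homogeneous graph is stored as a (role, user id) pair so that a dual user appears twice. The edge list is hypothetical (only the edges of buyer 1 are given in the text), and storing neighbors in sets, which collapses parallel edges, is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical (buyer, seller) pairs, chosen to agree with the degrees in Table 2.
EDGES = [(1, 6), (1, 2), (1, 4), (2, 4), (2, 7), (3, 2), (3, 4), (4, 7), (5, 6)]
# Role-tagged directed edges: a dual user such as 2 appears as ("buyer", 2) and ("seller", 2).
DIRECTED = [(("buyer", b), ("seller", s)) for b, s in EDGES]


def to_undirected(directed_edges):
    """Drop edge direction and merge duplicate nodes by discarding the role tag."""
    adjacency = defaultdict(set)
    for (_, u), (_, v) in directed_edges:
        adjacency[u].add(v)  # edges of both role copies land on the merged node
        adjacency[v].add(u)
    return dict(adjacency)


adjacency = to_undirected(DIRECTED)
degrees = {node: len(nbrs) for node, nbrs in adjacency.items()}
```

With this edge list the recomputed degrees are 3, 4, 2, 4, 1, 2, 2 for nodes 1 to 7.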

Next, in step 25, high-order feature extraction may be performed based on the undirected graph, yielding graph features that are higher-dimensional and more abstract than the node-level low-order features. In the embodiments of the present specification, the concept of the H index is innovatively introduced into graph analysis as a high-order graph feature.

The H-index, also known as the H-factor, is a measure for assessing academic achievement. H stands for "high citations"; a researcher has an H-index of h if h is the largest number such that h of the researcher's papers have each been cited at least h times. In the solution of this embodiment, the concept of the H index is applied to graph analysis: the H index of a node is the maximum value H that satisfies the condition that the degrees of H neighbor nodes are greater than or equal to H. If the maximum H for which H neighbor nodes have a degree greater than or equal to H cannot be determined, the maximum H for which H neighbor nodes have a degree greater than H is taken as the H index. Here, the degree of a node is its degree in the undirected graph. For a heterogeneous graph, the degree of a node in the undirected graph is the same as the degree determined in the low-order features; for a homogeneous graph, the nodes are updated during conversion to the undirected graph, and accordingly the degrees of the nodes in the undirected graph need to be determined again.

The following description is made with reference to examples. Continuing with the example of FIG. 4, in graph B, the homogenous graph is transformed, and the nodes and connecting edges are updated, resulting in an undirected graph. Thus, the degrees of each node can be redetermined, resulting in the following list:

Table 2:

Node	Degree	Node	Degree
1	3	5	1
2	4	6	2
3	2	7	2
4	4

The rightmost graph C in FIG. 4 shows the degrees of the various nodes more clearly. The determination of the H index is described below using node 1, the dark node in graph C, as an example.

It can be seen that the neighbors of node 1 are nodes 2, 4, and 6, and from Table 2 above the degrees of these 3 neighbor nodes are 4, 4, and 2, respectively. Thus, 2 neighbor nodes have a degree greater than 2 (but there are not 3 neighbor nodes with a degree greater than 3), and therefore the H index of node 1 is 2. Here, since the maximum H neighbors with a degree greater than or equal to H cannot be found, the maximum H neighbors with a degree greater than H are found.
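The worked example for node 1 can be reproduced with a short sketch. It implements only the "greater than or equal to" form of the definition; the fallback to a strict "greater than" comparison described above is not modeled, and the function name `h_index` is illustrative.

```python
def h_index(values):
    """Largest h such that at least h of the given values are >= h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank  # the top `rank` values are all >= rank
        else:
            break
    return h


# Node 1 has neighbors 2, 4 and 6 with degrees 4, 4 and 2 (Table 2).
h_node_1 = h_index([4, 4, 2])
```

For node 1 this yields 2: two neighbors have a degree of at least 2, but there are not three neighbors with a degree of at least 3.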

In a similar manner, the H indices of the individual nodes can be determined one by one. Then, based on the H indices thus determined, higher-order H indices may be further determined. That is, the degree of a node is taken as its 0th-order H index, the H index determined above is taken as its 1st-order H index, and higher-order H indices are determined recursively, where the kth-order H index of a node is the maximum value H that satisfies the condition that the (k-1)th-order H indices of H neighbor nodes are greater than or equal to H. In this way, the 2nd-order H index, the 3rd-order H index, and so on of each node can be determined iteratively in sequence up to a predetermined order N.

The predetermined order N may be set according to the characteristics of the graph structure and the service requirements. Generally, under the above recursive calculation, the high-order H index of each node converges to the coreness (k-core value) of the node in the graph. Thus, in one example, the order N may be set to the order at which convergence is reached.
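The recursion and its convergence can be sketched as follows. The adjacency `ADJ` is a hypothetical undirected graph chosen to match the degrees in Table 2 and the neighbors of node 1; it is not taken from FIG. 4. Only the "greater than or equal to" form of the H index is implemented, and the names `h_index` and `multi_order_h` are illustrative.

```python
# Hypothetical undirected graph consistent with Table 2 (node -> set of neighbors).
ADJ = {
    1: {2, 4, 6},
    2: {1, 3, 4, 7},
    3: {2, 4},
    4: {1, 2, 3, 7},
    5: {6},
    6: {1, 5},
    7: {2, 4},
}


def h_index(values):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    for rank, value in enumerate(sorted(values, reverse=True), start=1):
        if value < rank:
            break
        h = rank
    return h


def multi_order_h(adj, max_order):
    """Return [order 0, order 1, ...] H indices, stopping early once they stop changing."""
    orders = [{node: len(nbrs) for node, nbrs in adj.items()}]  # order 0 = degree
    for _ in range(max_order):
        prev = orders[-1]
        cur = {node: h_index(prev[m] for m in nbrs) for node, nbrs in adj.items()}
        orders.append(cur)
        if cur == prev:  # converged: further orders would be identical
            break
    return orders


orders = multi_order_h(ADJ, max_order=10)
```

On this small graph the 1st-order and 2nd-order H indices already coincide: nodes 1, 2, 3, 4 and 7 settle at 2, while nodes 5 and 6 settle at 1, which is also the coreness of each node here.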

In the above manner, the high-order features of each node in the first sub-graph are obtained: the 1st-order H index, the 2nd-order H index, …, and the Nth-order H index.

It should be noted that, when the relational network graph is recorded in the form of a table, both the low-order features and the high-order features can be simply implemented by SQL query statements, which avoids a large amount of matrix operations in the conventional graph feature operations, and therefore, the feature generation efficiency is very high.

Next, at step 26, an alternative feature set is generated as an alternative feature for training the user classification model based on at least the above-mentioned low-order features and high-order features.

In one embodiment, the low-order features and the high-order features obtained above are aggregated to form an alternative feature set. In another embodiment, for each node, a statistical feature is obtained according to the statistical results of each feature in the low-order features and the high-order features of its neighboring nodes, and the statistical feature is classified into a candidate feature set. Wherein the statistical result comprises one or more of the following: maximum, minimum, mean, median, and mode.

Among the above statistics, the median is the middle value found by ranking all observations in a finite set of numbers in order of magnitude. If there is an even number of observations, the median is usually taken as the average of the two middle values. The mode is the value that occurs most frequently in a set of data. When there are multiple modes, the average of the modes may be taken as the result.
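A minimal sketch of the neighbor statistics for one feature is given below, using Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The function name `neighbor_stats` is illustrative, and the input is assumed to be non-empty, since every node in the graph has at least one neighbor.

```python
import statistics


def neighbor_stats(values):
    """Max, min, mean, median and mode of one feature over a node's neighbors."""
    modes = statistics.multimode(values)  # all most-frequent values
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),  # averages the two middle values if even
        "mode": sum(modes) / len(modes),      # average when several modes tie
    }


# Degrees of the neighbors of node 1 in Table 2 (nodes 2, 4 and 6).
stats = neighbor_stats([4, 4, 2])
# A tie between two modes: both 3 and 1 occur once, so the mode is their average.
tie_stats = neighbor_stats([3, 1])
```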

In a specific example, for a first sub-graph that is a homogeneous graph, the candidate feature set finally generated for each node includes the node's own degree, number of dual nodes, proportion of dual nodes, 0th-order H index, 1st-order H index, 2nd-order H index, …, and Nth-order H index, as well as the maximum, minimum, mean, median and mode of each of these features over its neighbor nodes.

Therefore, based on the relational network diagram, an alternative characteristic set is generated for selection and training of the user classification model.

Next, feature screening may be performed on the candidate feature sets, and features suitable for the user classification model are selected from the candidate feature sets. Specifically, label data for training a user classification model may be obtained, where the label data includes a user sample set and category labels of respective user samples; then mapping the user sample set to a first node set in a first sub-graph; and performing feature screening according to the feature value distribution and the label value distribution of each feature in the candidate feature set on the first node set to obtain a selected feature set for the user classification model. Feature screening may be performed based on the feature information value IV, and/or a correlation coefficient between features.

In one embodiment, the screening is performed first based on the feature IV values and then based on the feature correlation coefficients. For this reason, for any one feature (for example, H index of order 2) in the candidate feature set, referred to as a first feature, the information value IV of the first feature may be determined according to the feature value distribution of the first feature in the first node set and the tag value distribution.

More specifically, for the first feature X, the first feature values of each user node in the first node set (assuming n nodes) may be obtained and sorted to form a first feature value sequence (x₁, x₂, …, x_n).

Then, the label data is associated to obtain a label value sequence (L₁, L₂, …, L_n), which is aligned with the first feature value sequence (x₁, x₂, …, x_n) with respect to the user order.

Next, the user nodes are binned according to the first feature value sequence (x₁, x₂, …, x_n). In one embodiment, uniform binning is performed over the value range defined by the maximum and minimum values in the first feature value sequence. In another embodiment, automatic binning is performed based on the data distribution represented by the first feature value sequence.

In this manner, the individual user nodes are partitioned into individual bins. Then, based on the label value sequence, counting the label value distribution condition of the user nodes in each sub-box; and then determining the information value IV of the first characteristic according to the label value distribution condition of each sub-box.

Taking as an example the case where the user classification model is a binary classification model and the class labels are binary, users can be divided into positive samples and negative samples according to whether the label value is 0 or 1. For any bin i, the number of positive samples pos_i and the number of negative samples neg_i in the bin can be counted; then, the weight of evidence (WOE) value corresponding to bin i is calculated:

$$\mathrm{WOE}_i = \ln\left(\frac{\mathrm{pos}_i / \sum_j \mathrm{pos}_j}{\mathrm{neg}_i / \sum_j \mathrm{neg}_j}\right)$$

where $\mathrm{pos}_i / \sum_j \mathrm{pos}_j$ is the proportion of the number of positive samples in bin i to the total number of positive samples, and $\mathrm{neg}_i / \sum_j \mathrm{neg}_j$ is the proportion of the number of negative samples in bin i to the total number of negative samples.

Further, the IV value of the first feature can be obtained:

$$\mathrm{IV} = \sum_i \left(\frac{\mathrm{pos}_i}{\sum_j \mathrm{pos}_j} - \frac{\mathrm{neg}_i}{\sum_j \mathrm{neg}_j}\right)\mathrm{WOE}_i$$

In the above manner, the IV value of each feature in the candidate feature set may be determined. A first screening operation may then be performed based on the IV values of the various features. Specifically, the IV value of each feature may be compared with a threshold; features with IV values lower than the threshold are eliminated, and features with IV values higher than the threshold are retained. In practice, the threshold may be set to, for example, 0.5. Of course, the threshold may be adjusted according to the screening objective.
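The IV calculation with uniform binning can be sketched as below. It assumes the conventional definitions, in which the WOE of a bin is the natural logarithm of the bin's share of positive samples divided by its share of negative samples, and the IV is the sum over bins of the difference of the two shares multiplied by the WOE. Skipping bins that contain no positive or no negative samples is a simplifying assumption, and the sample data and the name `information_value` are hypothetical.

```python
import math


def information_value(feature_values, labels, n_bins=2):
    """IV of one feature with uniform binning between its min and max value."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    pos = [0] * n_bins
    neg = [0] * n_bins
    for x, label in zip(feature_values, labels):
        i = min(int((x - lo) / width), n_bins - 1)  # the max value falls in the last bin
        if label == 1:
            pos[i] += 1
        else:
            neg[i] += 1

    pos_total, neg_total = sum(pos), sum(neg)
    iv = 0.0
    for p, n in zip(pos, neg):
        if p == 0 or n == 0:
            continue  # assumption: WOE is undefined for such bins, so they are skipped
        pos_rate = p / pos_total
        neg_rate = n / neg_total
        woe = math.log(pos_rate / neg_rate)
        iv += (pos_rate - neg_rate) * woe
    return iv


# Hypothetical feature values and binary labels for eight user nodes.
values = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
iv = information_value(values, labels, n_bins=2)
keep_feature = iv >= 0.5  # example threshold mentioned in the text
```

In this toy case each bin holds a 1/4 versus 3/4 split, so the IV equals ln 3 (about 1.10) and the feature would be retained under a threshold of 0.5.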

And then, calculating a correlation coefficient between the retained features after the first screening operation, and performing a second screening operation based on the correlation coefficient to obtain a selected feature set.

The correlation coefficient between two features may be calculated in various known ways. The Pearson correlation coefficient is typically used and may be calculated according to a known algorithm. Other measures, such as the Spearman rank correlation coefficient, may also be used. Based on the correlation coefficients, a second screening operation can be performed on the features to obtain a plurality of selected features. Specifically, the second screening operation may be performed in various ways.

In one embodiment, for each feature, the feature is rejected if the correlation coefficient between the feature and any other feature is above a predetermined correlation threshold, e.g. 0.8, and retained if the correlation coefficients between the feature and all other features are below the threshold. In yet another embodiment, for each feature, the mean of the correlation coefficients between the feature and the other features may be calculated. Then, the features in the comprehensive feature table are sorted according to the mean correlation coefficient, and a predetermined number of features with smaller means are retained. The retained features are then screened again in combination with their IV values to finally obtain the selected features.
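The first of these embodiments can be sketched as below, read literally: a feature is dropped as soon as its correlation with any other feature exceeds the threshold. Comparing the absolute value of the Pearson coefficient is an assumption, and the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat a constant feature as uncorrelated
    return cov / (norm_x * norm_y)


def screen_by_correlation(features, threshold=0.8):
    """Keep only features whose |correlation| with every other feature is <= threshold."""
    kept = []
    for name, values in features.items():
        others = (v for other, v in features.items() if other != name)
        if all(abs(pearson(values, v)) <= threshold for v in others):
            kept.append(name)
    return kept


# Hypothetical feature columns over five user nodes.
features = {
    "h_index_1": [1, 2, 3, 4, 5],
    "h_index_2": [2, 4, 6, 8, 11],   # almost a multiple of h_index_1
    "dual_ratio": [2, -1, 0, -1, 2],
}
selected = screen_by_correlation(features, threshold=0.8)
```

Under this literal reading both members of a highly correlated pair are removed, so only `dual_ratio` survives here; the mean-correlation ranking of the other embodiment is one way to avoid discarding both.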

Therefore, the second stage of screening is carried out in various modes based on the correlation coefficient among the features to obtain a plurality of selected features to form a selected feature set. The plurality of selected features may then be used for training of the user classification model.

On the basis of determining each selected feature, in one embodiment, a feature record table is generated for recording description information of each feature in the selected feature set. The description information may specifically be a definition explanation of the selected feature, or a generation process description. Thus, such a profile may be used for feature generation and selection in modeling other models similarly.

Reviewing the above process, in the embodiments of this specification, in order to provide richer feature choices for training the user classification model, a relational network graph is constructed based on interaction events in which users participate, and graph features are extracted from the graph. The graph features not only include low-order features such as node degree, but also innovatively introduce the H index as a high-order graph feature. Therefore, richer graph features are obtained for each node and are used for feature selection and training of the user classification model.

According to another aspect, an apparatus for graph feature processing for a user classification model is provided, which may be deployed in any device, platform, or device cluster having computing and processing capabilities. FIG. 5 shows a schematic block diagram of a graph feature processing apparatus according to one embodiment. As shown in fig. 5, the apparatus 500 includes:

a graph construction unit 51 configured to construct a relational network graph according to the relational data; the relationship data comprises a record of interaction events in which the user participates; the relational network graph comprises a plurality of nodes and directed edges between the nodes formed based on the interaction events, wherein the plurality of nodes comprise user nodes;

a graph segmenting unit 52 configured to segment the relational network graph into a plurality of sub-graphs, including a first sub-graph for user classification model training;

a low-order feature obtaining unit 53, configured to obtain, for each node in the first subgraph, a low-order feature of the node, where the low-order feature at least includes a degree of the node;

a graph converting unit 54 configured to convert the first sub-graph into an undirected graph;

a high-order feature obtaining unit 55, configured to obtain, for each node in the undirected graph, a high-order feature of the node, where the high-order feature includes multiple-order H indices, where each order H index represents a maximum H value that satisfies a condition that a last-order H index of H neighbor nodes is greater than or equal to H; wherein the 0-order H index is the degree of a node;

a feature set generating unit 56, configured to generate an alternative feature set as an alternative feature for training the user classification model based on at least the low-order feature and the high-order feature.

In a specific embodiment of the foregoing implementation, the interaction event may specifically be a login event or an authentication event, and the information of the media node includes one or more of the following: device identification information, network environment information, authentication medium information.

According to an embodiment, the apparatus 500 further includes a graph filtering unit (not shown) configured to remove, from the relationship network graph, a number of nodes that do not meet the training requirement of the user classification model and connection edges corresponding to the number of nodes.

According to one embodiment, the graph partitioning unit 52 is specifically configured to: according to the time period of the occurrence of the interaction event corresponding to the directed edge in the relational network graph, dividing the relational network graph into a plurality of sub-graphs, wherein each sub-graph corresponds to one time period; and determining a time period corresponding to the labeling time of the label data used for training the user classification model, and determining a sub-graph corresponding to the time period as the first sub-graph.

According to another embodiment, the graph partitioning unit 52 is specifically configured to partition the relationship network graph into a plurality of sub-graphs according to the geographic region in the basic attribute of the user node, where each sub-graph corresponds to one geographic region; determining a sub-graph corresponding to a geographic region of a user sample set in label data used to train the user classification model as the first sub-graph.

According to an embodiment, the relational network graph is a homogeneous graph, and in this case, the low-order feature obtaining unit 53 is further configured to obtain the following features of the node: the number and the proportion of dual nodes among the neighbor nodes connected with the node; the dual nodes are user nodes that serve as both a first-type node and a second-type node in the relational network graph.

In the case where the relational network graph is a homogeneous graph, the graph converting unit 54 is configured to: convert the directed edges in the first sub-graph into undirected edges, and merge the duplicate nodes therein to obtain the undirected graph.

According to an embodiment, when acquiring the high-order features of a node, for an H index of any order, when it cannot be determined that the maximum H value satisfying the condition that the H index of the last order of H neighbor nodes is equal to or greater than H, the high-order feature acquiring unit 55 uses the maximum H value satisfying the condition that the H index of the last order of H neighbor nodes is greater than H as the H index of the current order.

According to one embodiment, the feature set generation unit 56 is configured to: for each node, obtaining statistical characteristics according to the statistical results of each characteristic in the low-order characteristics and the high-order characteristics of the neighbor nodes, and including the statistical characteristics in the alternative characteristic set; the statistical results include one or more of: maximum, minimum, mean, median, and mode.

According to an embodiment, the apparatus further comprises a feature screening unit (not shown) configured to: obtaining label data for training the user classification model, wherein the label data comprises a user sample set and category labels of all user samples; mapping the set of user samples to a first set of nodes in the first subgraph; and performing feature screening according to the feature value distribution and the label value distribution of each feature in the candidate feature set on the first node set to obtain a feature set for the user classification model.

In an embodiment, after obtaining the feature set, the feature screening unit further generates a feature record table for recording description information of each feature in the feature set.

Through the device, rich graph features are generated quickly and efficiently aiming at the user classification model, so that feature selection and training of the user classification model are facilitated.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 2.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims

1. A method of graph feature processing, comprising:

converting the first subgraph into an undirected graph;

2. The method of claim 1, wherein the interaction event is an event by a user via a medium; the plurality of nodes further comprises a media node; the directed edge is a directed connecting edge between the user node and the medium node.

3. The method of claim 2, wherein the interaction event is a login event or an authentication event, and the information of the media node comprises one or more of: device identification information, network environment information, authentication medium information.

4. The method of claim 1, wherein the interaction event is a directional interaction event between users, and the user nodes comprise a first type node and a second type node; the directed edge is a connecting edge pointing from the first type node to the second type node.

5. The method of claim 4, wherein,

the interaction event is a transaction event, the first class node is a buyer node, and the second class node is a seller node; or:

the interactive event is a transfer event, the first type node is a transfer party node, and the second type node is a receiving party node.

6. The method of claim 1, wherein prior to partitioning the relational network graph into a plurality of subgraphs, further comprising: and removing a plurality of nodes which do not meet the training requirement of the user classification model and connecting edges corresponding to the nodes from the relational network graph.

7. The method of claim 6, wherein the number of nodes comprises one or more of:

invalid nodes not conforming to a predetermined format;

connecting nodes with the number of edges larger than a set threshold value;

a node located in a white list;

and, in the case that the interaction event relates to funds, nodes whose fund flow within a preset time period exceeds a preset threshold.

8. The method of claim 1, wherein partitioning the relational network graph into a plurality of subgraphs comprises:

according to the time period of the occurrence of the interaction event corresponding to the directed edge in the relational network graph, dividing the relational network graph into a plurality of sub-graphs, wherein each sub-graph corresponds to one time period;

and determining a time period corresponding to the labeling time of the label data used for training the user classification model, and determining a sub-graph corresponding to the time period as the first sub-graph.

9. The method of claim 1, wherein partitioning the relational network graph into a plurality of subgraphs comprises:

dividing the relationship network graph into a plurality of sub-graphs according to the geographic area in the attribute of the user node, wherein each sub-graph corresponds to one geographic area;

determining a sub-graph corresponding to a geographic region of a user sample set in label data used to train the user classification model as the first sub-graph.

10. The method of claim 4, wherein the low-order features of the node further comprise: the number and the proportion of dual nodes among the neighbor nodes connected with the node; the dual nodes are user nodes that serve as both a first-type node and a second-type node in the relational network graph.

11. The method of claim 4, wherein converting the first subgraph to an undirected graph comprises:

converting the directed edges in the first subgraph into undirected edges, and merging the duplicate nodes therein to obtain the undirected graph.

12. The method according to claim 1, wherein obtaining the high-order characteristics of the node comprises, for any order H index, when the maximum H value satisfying the condition that the last-order H index of the H neighbor nodes is equal to or greater than H cannot be determined, taking the maximum H value satisfying the condition that the last-order H index of the H neighbor nodes is greater than H as the own-order H index.

13. The method of claim 1, wherein generating an alternative feature set based at least on the low-order features and the high-order features comprises: for each node, obtaining statistical characteristics according to the statistical results of each characteristic in the low-order characteristics and the high-order characteristics of the neighbor nodes, and including the statistical characteristics in the alternative characteristic set; the statistical results include one or more of: maximum, minimum, mean, median, and mode.

14. The method of claim 1 or 13, further comprising:

obtaining label data for training the user classification model, wherein the label data comprises a user sample set and category labels of all user samples;

mapping the set of user samples to a first set of nodes in the first subgraph;

and performing feature screening according to the feature value distribution and the label value distribution of each feature in the candidate feature set on the first node set to obtain a feature set of the user classification model.

15. The method of claim 14, wherein the feature screening according to the feature value distribution and the tag value distribution of each feature in the candidate feature set on the first node set comprises:

determining the information value IV of each feature according to the feature value distribution and the label value distribution of each feature, and performing first screening operation on each feature based on the information value IV;

and calculating a correlation coefficient between the reserved features after the first screening operation, and performing a second screening operation based on the correlation coefficient to obtain a feature set of the user classification model.

16. The method of claim 14, further comprising generating a feature record table for recording description information of each feature in the feature set of the user classification model.

17. An apparatus for graph feature processing, comprising:

18. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-16.

19. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-16.