WO2022116689A1 - Graph data processing method, apparatus, computer device and storage medium - Google Patents

Graph data processing method, apparatus, computer device and storage medium

Info

Publication number
WO2022116689A1
WO2022116689A1 (PCT/CN2021/123265)
Authority
WO
WIPO (PCT)
Prior art keywords
node
cores
nodes
subgraph
current
Prior art date
Application number
PCT/CN2021/123265
Other languages
English (en)
French (fr)
Inventor
许杰
李晓森
欧阳文
肖品
陶阳宇
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority to JP2023518909A priority Critical patent/JP2023545940A/ja
Priority to EP21899723.7A priority patent/EP4206943A4/en
Publication of WO2022116689A1 publication Critical patent/WO2022116689A1/zh
Priority to US17/949,030 priority patent/US11935049B2/en

Classifications

    • G06F16/2228 Indexing structures
    • G06Q20/389 Keeping log of transactions for guaranteeing non-repudiation of a transaction
    • G06F16/284 Relational databases
    • G06F16/9024 Graphs; Linked lists
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G06Q50/01 Social networking
    • G06N20/00 Machine learning

Definitions

  • the present application relates to the field of big data technology, and in particular, to a graph data processing method, apparatus, computer equipment and storage medium.
  • the K-Core algorithm is a subgraph mining algorithm that can mine closely related subgraphs from a complex network. For example, it can uncover groups of buyers or sellers with abnormal behavior in a transaction network, and it can also identify the buyers or sellers at the heart of the entire transaction network.
  • a graph data processing method executed by a computer device, the method comprising:
  • the determined number of cores is used to generate the feature vector corresponding to the corresponding node.
  • a graph data processing device comprising:
  • a network graph obtaining module, configured to obtain the association degree of each node in the network graph;
  • a dense subgraph obtaining module configured to split a dense subgraph from the network graph according to a preset threshold and the degree of association of each node;
  • a first determination module configured to determine, based on the dense subgraph, a stable node in the network graph and the number of cores of the stable node, where the number of cores of the stable node is greater than the preset threshold;
  • a sparse subgraph obtaining module configured to obtain the sparse subgraph in the network graph according to the remaining nodes other than the stable node in the network graph and the connections between the remaining nodes;
  • the determined number of cores is used to generate the feature vector corresponding to the corresponding node.
  • a computer device comprising a memory and one or more processors, the memory storing a computer program that, when executed by the one or more processors, causes the one or more processors to implement the steps of the above graph data processing method.
  • a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the above graph data processing method.
  • a computer program comprising computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the storage medium, and executing them causes the computer device to perform the steps of the above-mentioned graph data processing method.
  • FIG. 1 is a diagram of an application environment of the graph data processing method in one embodiment
  • FIG. 3 is a schematic diagram of dividing a network graph into 3-core subgraphs in one embodiment
  • FIG. 4 is a schematic diagram of decomposing a network graph by k-core and dividing by a threshold respectively in one embodiment
  • FIG. 6 is a schematic flowchart of determining the number of cores of a node in a network graph according to a sparse subgraph in one embodiment
  • FIG. 7 is a schematic diagram of a method for processing graph data in one embodiment
  • FIG. 8 is a schematic flowchart of a method for processing graph data in a specific embodiment
  • the graph data processing method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • terminals 102 interact with each other through the server 104; the server 104 can acquire the interaction data formed when the terminals 102 interact over the network, and generate a network graph from the interaction data.
  • the terminal 102 may be, but is not limited to, a personal computer, smart phone, tablet computer, notebook computer, desktop computer, smart speaker, or smart watch.
  • the server 104 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • the execution body of the graph data processing method may be the graph data processing apparatus provided in the embodiments of the present application, or a computer device integrated with the graph data processing apparatus, wherein the graph data processing apparatus can be implemented in hardware or software.
  • the computer device may be the terminal 102 or the server 104 described above.
  • after the core number of each node in the network graph is obtained, the core number can also be used to generate a feature vector corresponding to the node, and the feature vector is used to classify the node.
  • the feature vector can be used as the input of a machine learning (ML) algorithm to classify the node.
  • Graph is a data structure that models the connection between things.
  • the graph includes a series of nodes and edges used to connect the nodes, and the nodes can also be called vertices.
  • An edge exists between two nodes, indicating that there is an association between the two nodes.
  • An edge between two nodes can have weights.
  • the association degree of a node refers to the number of edges connected to the node, that is, the number of neighbor nodes adjacent to the node.
  • Neighbor nodes refer to nodes that have edges connected to the node.
  • in order to mine useful information from a complex network, a computer device can generate a network graph based on a large amount of interaction data in the network and obtain the association degree of each node in the network graph, so that graph mining can be performed on the network graph according to the graph and the association degree of each node.
  • Graph mining refers to the process of mining potentially useful information from graphs using some algorithms. Graph mining includes graph classification, graph distance, subgraph mining, and so on.
  • the network graph may be a payment relationship network graph
  • the steps of generating the payment relationship network graph include: acquiring payment records corresponding to user identifiers; acquiring payment interaction data between the user identifiers according to the payment records; and generating the payment relationship network graph according to the payment interaction data; wherein the nodes of the payment relationship network graph represent user identifiers, and an edge between two nodes indicates that a payment interaction event exists between the corresponding two user identifiers.
  • the payment interaction event is at least one of transaction events such as transfer, red envelope issuance, borrowing, and payment by scanning code.
  • a user is a node, and if there is a payment interaction event between two users, an edge will be formed between the two users. For example, if user a transfers money to user b, an edge is formed between user a and user b.
  • the number of edges formed between these users is extremely large, and thus the generated payment relationship network graph is extremely large. For example, in the WeChat payment scenario, the number of nodes can reach 2 billion, and the number of edges formed between these 2 billion nodes can reach a super scale of 100 billion.
  • the network graph may be a social relationship network graph
  • the steps of generating the social relationship network graph include: acquiring historical session data between user identifiers; and generating the social relationship network graph according to the historical session data; wherein the nodes of the social relationship network graph represent user identifiers, and an edge between two nodes indicates that a historical session exists between the corresponding two user identifiers.
  • one user is one node. If there is a historical session between two users, an edge is formed between the two users. In another embodiment, if two users have added each other as friends, an edge is formed between them. Similarly, when the number of users is large, the resulting social relationship network graph is also very complex.
  • obtaining the association degree of each node in the network graph includes: obtaining the network graph; determining the number of neighbor nodes of each node in the network graph; and using the number of neighbor nodes as the association degree of the corresponding node.
  • a graph can be represented by an adjacency matrix or an adjacency list.
  • in the adjacency list, a list of the edges starting from each node in the graph is stored. For example, if node A has three edges, connected to B, C and D respectively, then A's list contains three edges.
  • in the adjacency matrix, both rows and columns represent nodes. The element determined by two nodes indicates whether the two nodes are connected; if they are, its value can indicate the weight of the edge between them.
  • the computer device can obtain the adjacency list or adjacency matrix corresponding to the network graph, count the number of neighbor nodes of each node in the network graph from it, and use that number as the association degree of the corresponding node.
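As an illustrative sketch (not from the patent), counting each node's neighbors from an adjacency list could look like the following; all function names here are hypothetical:

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def association_degrees(adj):
    """The association degree of a node is its number of neighbor nodes."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

# Node A interacts with B, C and D; B also interacts with C.
adj = build_adjacency([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
print(association_degrees(adj))  # {'A': 3, 'B': 2, 'C': 2, 'D': 1}
```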
  • the degree of association of a node in the payment relationship network graph can be understood as the number of nodes that have transaction behavior with this node.
  • the degree of association of a node in the social relationship network graph can be understood as the number of nodes that have historical conversations with this node.
  • Step 204 splitting a dense subgraph from the network graph according to a preset threshold and the degree of association of each node.
  • in this application, the core number of each node in the network graph is mainly mined. The core number (coreness) is one of the indicators used to judge the importance of a node in the entire network graph.
  • the k-core subgraph of a graph is the subgraph remaining after nodes with association degree less than k are repeatedly removed: all vertices with association degree less than k in graph G are removed to obtain subgraph G'; all vertices with association degree less than k in G' are removed to obtain a new subgraph G''; and so on, stopping when the association degree of every node in the remaining subgraph is at least k, which yields the k-core subgraph of G.
  • the core number of a node is defined as the largest k such that the node is in the k-core subgraph; that is, if a node exists in the M-core subgraph but is removed from the (M+1)-core subgraph, its core number is M.
  • a 2-core subgraph is obtained by first removing all nodes with association degree less than 2 from the graph, then removing nodes with association degree less than 2 from the remaining graph, and so on, until no more can be removed; a 3-core subgraph is obtained in the same way with 3 in place of 2. If a node is at most in the 5-core subgraph but not in the 6-core subgraph, its core number is 5. FIG. 3 is a schematic diagram of the process of dividing a 3-core subgraph; referring to FIG. 3, the final 3-core subgraph is obtained by twice removing nodes with association degree less than 3 from the graph.
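The peeling process above can be sketched in a few lines. This is a minimal illustration of the standard k-core definition, not the patent's optimized implementation; the names are illustrative:

```python
def k_core(adj, k):
    """Repeatedly remove nodes with association degree less than k;
    the (possibly empty) remainder is the k-core subgraph."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in [n for n, vs in adj.items() if len(vs) < k]:
            for v in adj[u]:
                adj[v].discard(u)  # also remove the edges where u is located
            del adj[u]
            changed = True
    return adj

def core_numbers(adj):
    """Core number of a node = largest k such that it survives in the k-core."""
    cores = {u: 0 for u in adj}
    k, remaining = 1, adj
    while remaining:
        remaining = k_core(remaining, k)
        for u in remaining:
            cores[u] = k
        k += 1
    return cores

# Triangle A-B-C with a pendant node D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(core_numbers(adj))  # {'A': 2, 'B': 2, 'C': 2, 'D': 1}
```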
  • the computer device sets a threshold, divides the original network graph into two parts, the dense subgraph and the sparse subgraph, according to the association degree of each node and the threshold, and then mines the core number of each node in turn.
  • splitting the dense subgraph from the network graph by the threshold makes it possible to mine the dense subgraph directly and avoid wasting a large amount of iteration time and computing resources on non-important nodes whose core numbers are less than the threshold; this is very important for core-number mining.
  • the association degree of each node in the dense subgraph must be greater than the threshold, but the nodes with association degree greater than the threshold in the network graph do not necessarily exist in the dense subgraph.
  • the preset threshold may be set according to actual needs.
  • the preset threshold may be determined according to the needs of specific business scenarios. For example, if past experience shows that nodes with a core number greater than 300 play a larger role in the network graph, the computer device may set the preset threshold to 300.
  • the preset threshold can also be determined according to computing resource limitations: the smaller the threshold, the more nodes the dense subgraph split from the network graph contains and the more computing resources are required; conversely, the larger the threshold, the smaller the dense subgraph and the fewer computing resources are required.
  • the threshold can also be set according to the distribution of association degrees in the network graph. For example, if the association degree of most nodes in the network graph is less than a certain value, the threshold can be set to that value.
  • splitting a dense subgraph from the network graph according to the association degree of each node and a preset threshold includes: obtaining the preset threshold; removing from the network graph the nodes whose association degree is less than or equal to the threshold, together with the edges where those nodes are located; and obtaining the dense subgraph from the remaining nodes and the edges between them.
  • the computer device filters out the nodes whose association degree is less than or equal to the threshold from the original graph, thereby obtaining the dense subgraph, in which every node's association degree is greater than the threshold. It can be seen that the larger the threshold, the smaller the resulting dense subgraph and the fewer computing resources required.
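The split step could be sketched as follows; this is an assumption-laden illustration (function names and the returned sparse seed are not from the patent, which derives the sparse subgraph later from the non-stable nodes):

```python
def split_by_threshold(adj, threshold):
    """Iteratively remove nodes whose association degree is <= threshold.
    What remains is the dense subgraph; the removed nodes, with only the
    edges among themselves, seed the sparse part."""
    dense = {u: set(vs) for u, vs in adj.items()}
    removed = set()
    changed = True
    while changed:
        changed = False
        for u in [n for n, vs in dense.items() if len(vs) <= threshold]:
            for v in dense[u]:
                dense[v].discard(u)
            del dense[u]
            removed.add(u)
            changed = True
    # keep only the edges between removed nodes
    sparse = {u: {v for v in adj[u] if v in removed} for u in removed}
    return dense, sparse

# A 4-clique A-B-C-D with a pendant node E: with threshold 2, only E is peeled.
adj = {"A": {"B", "C", "D", "E"}, "B": {"A", "C", "D"},
       "C": {"A", "B", "D"}, "D": {"A", "B", "C"}, "E": {"A"}}
dense, sparse = split_by_threshold(adj, 2)
print(sorted(dense))   # ['A', 'B', 'C', 'D']
print(sparse)          # {'E': set()}
```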
  • FIG. 4 is a schematic diagram of decomposing a network graph by k-core and of dividing it by a threshold, respectively, in one embodiment.
  • the computer device iteratively filters out the nodes with an association degree less than or equal to 2; only 2 iterations in total are needed to determine the dense subgraph and the sparse subgraph from the original network graph. Because the iterative computation is sparse, many nodes will not be updated again in subsequent iterations once their core numbers are determined.
  • Step 206 based on the dense subgraph, determine the stable node in the network graph and the number of cores of the stable node, where the number of cores of the stable node is greater than a preset threshold.
  • a stable node is a node whose core number, mined from the dense subgraph, is greater than the preset threshold.
  • after the computer device splits the dense subgraph from the network graph, it first mines the dense subgraph to determine the stable nodes and their core numbers, realizing the first step of divide and conquer.
  • since the nodes in the sparse subgraph do not affect the core numbers of the nodes in the dense subgraph, the computer device can start directly from the dense subgraph, mine it, determine the core number of each node according to its association degree in the dense subgraph, and take the nodes whose core number is greater than the preset threshold as the stable nodes of the network graph.
  • that is, the maximum-core subgraph in which each node is located is found so as to determine the node's core number, and the nodes whose core number is greater than the preset threshold are regarded as stable nodes.
  • in each iteration, the computer device may update a node's current core number using the core index computed from the core numbers of its neighbor nodes after the previous iteration.
  • the computer device can also let only the nodes whose updated core number is greater than the preset threshold continue to participate in the next iteration, while nodes whose updated core number is less than or equal to the preset threshold no longer participate, so that the nodes with core numbers greater than the preset threshold can be mined from the dense subgraph.
  • the core index of a node over all its neighbor nodes may be an H index. If the H index of a node is h, the node has at least h neighbor nodes whose current core numbers are all not less than h. That is, if h of the node's neighbors have current core numbers greater than or equal to h, but it is not the case that h+1 neighbors have current core numbers greater than or equal to h+1, the node's core index is h, where h is a positive integer.
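The H index over a list of neighbor core numbers can be computed with a short sort-and-scan; this is a generic sketch of the definition above, not the patent's implementation:

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h
    (the H index over a node's neighbors' current core numbers)."""
    h = 0
    for i, v in enumerate(sorted(values, reverse=True), start=1):
        if v >= i:
            h = i
        else:
            break
    return h

# A node whose neighbors currently have core numbers 3, 3, 2 and 1:
# at least 2 neighbors have core number >= 2, but fewer than 3 have >= 3.
print(h_index([3, 3, 2, 1]))  # 2
```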
  • the stable nodes in the network graph and the number of cores of the stable nodes are determined, including:
  • Step 502: obtain the association degree of each node in the dense subgraph according to its number of neighbor nodes in the dense subgraph, and use this association degree as the node's initial current core number.
  • the computer device may use the correlation degree of each node in the dense subgraph to initialize the number of cores of each node as the initial current number of cores.
  • the "current core number" in this embodiment changes dynamically, referring to the core number of each node after the previous iteration; the "previous iteration process" and "current iteration process" are likewise dynamic: in the next iteration, the "current iteration process" becomes the "previous iteration process", and the next iteration becomes the "current iteration process".
  • Step 504: for each node in the dense subgraph, calculate the node's core index according to the current core numbers of its neighbor nodes in the dense subgraph; when the core index is less than or equal to the preset threshold, remove the node from the dense subgraph; when the core index is greater than the threshold and less than the node's current core number, update the node's current core number to the core index; stop iterating when no node's current core number is updated during the current iteration.
  • the computer equipment needs to process each node in the dense subgraph.
  • the core index corresponding to the node is calculated according to the current core number of its neighbor nodes, that is, the core number of all neighbor nodes after the previous iteration process.
  • if a node's core index is less than or equal to the preset threshold, the node will not affect the core-number calculation of nodes whose core numbers are greater than its own, so it need not participate in subsequent iterations and can be removed from the dense subgraph; if the node's core index is greater than the preset threshold and less than its current core number, the current core number is updated to the core index, and the node continues to participate in subsequent iterations.
  • since the core number of each node in the current iteration is determined from the core numbers of all its neighbors in the previous iteration, the computation is local and can easily be extended to distributed parallel computing, thereby speeding up the entire mining process.
  • the iteration stop condition is that the current core numbers of all remaining nodes in the dense subgraph have not changed during the current iteration. That is, when the core index calculated from a node's neighbors' core numbers in the previous iteration equals the node's current core number, the node's core number is not updated; if no node's current core number is updated during the current iteration, the iteration stops.
  • the dense subgraph also changes dynamically during iteration, so the neighbor nodes of each node in the dense subgraph keep changing. When calculating the core index from the current core numbers of a node's neighbors, it should therefore be calculated from the node's neighbors in the current dense subgraph, not from its neighbors in the initial dense subgraph; this further reduces the amount of computation.
  • the computer device can mark such a node as being in an unstable state, and a node marked as unstable no longer participates in the next iteration.
  • the above method further includes: after the current iteration ends, recording the nodes whose current core numbers were updated during the current iteration; the recorded nodes indicate that, when the next iteration starts, their neighbor nodes in the dense subgraph are the target nodes whose core indexes need to be recalculated in the next iteration. For each node in the dense subgraph, calculating the node's core index according to the current core numbers of its neighbor nodes in the dense subgraph includes: for each target node in the dense subgraph, calculating the target node's core index according to the current core numbers of the target node's neighbor nodes in the dense subgraph.
  • in this way, the nodes whose core indexes need to be recalculated in the next iteration can be determined directly.
  • only a node whose core number is updated can affect the core-number determination of its neighbor nodes. Therefore, when an iteration ends, the nodes whose core numbers were updated are recorded, and their neighbors among the remaining nodes in the dense subgraph are taken as the nodes whose core indexes need to be recalculated in the next iteration. This avoids recalculating the core index for every node in the dense subgraph and improves mining efficiency. It can be understood that the neighbors of the updated nodes do not include nodes that have already been removed from the dense subgraph.
  • the above method further includes: when the current iteration starts, initializing a node update number to zero, where the node update number records the number of nodes whose current core numbers are updated in the current iteration; counting the nodes whose current core numbers are updated during the iteration and updating the node update number accordingly; if the node update number is non-zero when the iteration ends, continuing with the next iteration; if the node update number is zero when the iteration ends, stopping the iteration.
  • in a specific implementation, a flag may be used to record the number of nodes whose current core numbers are updated in each iteration. The computer device sets the flag to 0 when the current iteration starts; for each node participating in the iteration, the flag is incremented by 1 every time a node's core number is updated. After the iteration ends, a non-zero flag means some node's core number was updated in the current iteration and iteration must continue; a flag of 0 means no node's core number was updated during the entire iteration, and the whole iterative process ends.
  • Step 506: the nodes remaining in the dense subgraph when the iteration stops are taken as the stable nodes, and their current core numbers at that time are taken as the stable nodes' core numbers.
  • the number of cores of a stable node is the number of cores of the node in the entire original network graph.
  • in a specific embodiment, the process of determining the core number of each node in the dense subgraph is as follows: for each node in the dense subgraph, the core index is calculated from the current core numbers of its neighbor nodes, where the neighbor nodes are those in the dense subgraph after nodes in the nonActive state are filtered out. If the core index is less than or equal to the preset threshold, the node is marked as nonActive; if the core index is greater than the preset threshold and less than the node's current core number, the node's current core number is updated to the core index and numMsgs is incremented by 1. When the iteration ends, the current core number of each node in the dense subgraph not marked nonActive is that node's core number in the entire original network graph, and the nodes not marked nonActive are the stable nodes of the network graph.
  • the core number of each node in the dense subgraph is calculated based on the core index, and the core number obtained in each iteration is compared with the preset threshold; only when the iteratively calculated core number is greater than the threshold does the node continue to iterate, otherwise it no longer participates in subsequent iterations, which improves the mining efficiency of the dense subgraph.
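The dense-subgraph iteration described above can be sketched as a single loop; this is a simplified, sequential illustration under my own naming (the patent's version is distributed and tracks nonActive marks and target nodes incrementally):

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    h = 0
    for i, v in enumerate(sorted(values, reverse=True), start=1):
        if v >= i:
            h = i
        else:
            break
    return h

def mine_dense(dense, threshold):
    """Initialize each node's current core number to its degree in the dense
    subgraph, then repeatedly replace it with the H index of its neighbors'
    current core numbers. Nodes whose index falls to <= threshold are treated
    as nonActive and removed; stop when an iteration updates nothing."""
    adj = {u: set(vs) for u, vs in dense.items()}
    core = {u: len(vs) for u, vs in adj.items()}  # degrees as initial values
    while True:
        num_msgs = 0  # counts core-number updates in this iteration
        for u in list(adj):
            h = h_index([core[v] for v in adj[u]])
            if h <= threshold:
                for v in adj[u]:       # u becomes nonActive: remove it
                    adj[v].discard(u)
                del adj[u], core[u]
                num_msgs += 1
            elif h < core[u]:
                core[u] = h
                num_msgs += 1
        if num_msgs == 0:
            return core  # stable nodes and their core numbers

# A 4-clique is its own 3-core, so with threshold 2 all four nodes are stable.
clique = {u: {v for v in "ABCD" if v != u} for u in "ABCD"}
print(mine_dense(clique, 2))  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}
```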
  • Step 208: obtain a sparse subgraph according to the remaining nodes in the network graph other than the stable nodes and the edges between the remaining nodes.
  • after the computer device determines the stable nodes in the network graph, the core numbers of the remaining nodes other than the stable nodes are less than or equal to the preset threshold; these remaining nodes, together with the edges formed between them, constitute the sparse subgraph.
  • Obtaining the sparse subgraph in the network graph according to the remaining nodes other than the stable nodes and the edges between the remaining nodes includes: removing the stable nodes from the network graph, and then obtaining the sparse subgraph from the remaining nodes and the edges between them.
  • the graph can be stored in the form of an adjacency matrix or an adjacency list.
  • After the computer device determines the stable nodes in the network graph, it can traverse the adjacency matrix or adjacency list, remove the stable nodes, and obtain the sparse subgraph from the remaining nodes and the connection relationships between them.
  • Step 210 based on the sparse subgraph and the stable nodes, determine the number of cores of each node in the sparse subgraph.
  • The calculation of the number of cores of each node in the sparse subgraph also follows the above method of core index iteration; however, since the stable nodes affect this calculation, their influence must be taken into account during the iterative process.
  • the computer device can determine the number of cores of each node in the sparse subgraph based on the sparse subgraph and the stable nodes, so as to realize the second step of divide-and-conquer solution.
  • When iterating on the sparse subgraph, the computer device may also use the core index computed from the neighbor nodes of a node in the network graph to update that node's current number of cores from the previous iteration, for use in the next iteration.
  • The core index computed over all neighbor nodes of a node may be an H index. If the H index of a node is h, the node has at least h neighbor nodes whose association degrees are all not less than h. That is, if the node has h neighbor nodes whose current number of cores is greater than or equal to h, but does not have h+1 neighbor nodes whose current number of cores is greater than or equal to h+1, the core index corresponding to the node is determined to be h, where h is a positive integer.
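The H-index rule above can be written directly as a small helper (a sketch; the function and variable names are our own):

```python
def core_index(neighbor_cores):
    """H index of a list of neighbor core numbers: the largest h such that
    at least h of the neighbors have a current number of cores >= h."""
    ranked = sorted(neighbor_cores, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:       # the i-th largest value still supports h = i
            h = i
        else:
            break        # sorted descending, so no later value can qualify
    return h
```

For example, `core_index([5, 4, 1])` is 2: two neighbors have at least 2 cores, but there are not three neighbors with at least 3.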
  • the number of cores of each node in the sparse subgraph is determined, including:
  • Step 602 Initialize the current number of cores of each node in the sparse subgraph according to the number of neighbor nodes in the original network graph of each node in the sparse subgraph.
  • the computer device may use the correlation degree of each node in the sparse subgraph in the original network graph to initialize the number of cores of each node as the initial current number of cores.
  • The number of cores of each stable node has already been determined and is greater than the preset threshold, while the number of cores of each node in the sparse subgraph is less than or equal to the threshold. Therefore, when the number of cores of a stable node is needed while calculating the sparse subgraph, it can, to reduce memory usage, be set to the preset threshold, or to any value greater than the threshold, or the number of cores determined in the preceding steps can be used directly. These different settings do not affect the calculated number of cores of the nodes in the sparse subgraph.
  • Step 604 For each node in the sparse subgraph, calculate the core index corresponding to the node according to the current number of cores of its neighbor nodes in the network graph; when the core index is less than the node's current number of cores, update the current number of cores according to the core index. These steps are iterated, and the iteration stops once no node's current number of cores is updated during an iteration.
  • the computer equipment needs to process each node in the sparse subgraph.
  • the core index corresponding to the node is calculated according to the current core number of its neighbor nodes in the network graph, that is, the core number of all neighbor nodes after the previous iteration process. It can be understood that if the neighbor nodes include stable nodes, the number of cores of the stable nodes has been determined in the preceding steps, so in the iterative process of the sparse subgraph, the number of cores of the stable nodes does not need to participate in the update. If the core index of the node is smaller than the current core number of the node, the current core number of the node is updated by using the core index.
  • Because the number of cores of each node in the current iteration is determined only by the numbers of cores of its neighbor nodes from the previous iteration, the computation is local and can easily be extended to distributed parallel computing logic, thereby speeding up the entire mining process.
  • the iteration stop condition is that the current number of cores of all nodes in the sparse subgraph does not change during the current iteration. That is to say, when the core index calculated according to the core number of the node's neighbor nodes in the previous iteration is consistent with the current core number of the node, the core number of the node will not be updated. If the current number of cores has not been updated during the current iteration, the iteration is stopped.
  • The above method further includes: after the current iteration ends, recording the nodes whose current number of cores was updated during the iteration. When the next iteration starts, the neighbor nodes of the recorded nodes in the sparse subgraph serve as the target nodes whose core indexes need to be recalculated. Accordingly, calculating the core index for each node in the sparse subgraph includes: for each target node in the sparse subgraph, calculating the core index corresponding to the target node according to the current numbers of cores of its neighbor nodes in the network graph.
  • In this way, the nodes whose number of cores must be recalculated in the next iteration can be determined directly. An update to a node's number of cores affects only the determination of its neighbors' numbers of cores, so when an iteration ends, the updated nodes are recorded, and their neighbors in the sparse subgraph become the nodes whose number of cores is recalculated in the next iteration. This avoids recalculating the number of cores for every node in the sparse subgraph and improves mining efficiency. It can be understood that, after the neighbor nodes of an updated node are determined, any stable nodes among them do not need to be recalculated.
  • The above method further includes: when the current iteration starts, initializing a node update count to zero, the count being used to record the number of nodes whose current number of cores is updated during the iteration; counting the nodes whose current number of cores is updated and adjusting the count accordingly; continuing with the next iteration if the count is non-zero when the iteration ends; and stopping the iteration if the count is zero when the iteration ends.
  • a mark may be used to record the number of nodes whose current core number is updated in the current iteration process.
  • The computer device can maintain a counter recording the number of nodes whose current number of cores is updated in each iteration, setting this flag to 0 when the iteration starts. Each time a participating node's number of cores is updated, the flag is incremented by 1. When the iteration ends, a non-zero flag means some node's number of cores was updated in the current iteration, so iteration must continue; a flag of 0 means no node was updated during the entire iteration, and the whole iterative process ends.
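The counter-controlled iteration described above can be sketched as follows (a single-machine sketch with names of our own choosing):

```python
def h_index(values):
    # largest h such that at least h of the values are >= h
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
        else:
            break
    return h

def sweep(adj, cores):
    """One iteration over all nodes; returns the update counter (the flag)."""
    num_updates = 0                 # flag reset to 0 at the start of the iteration
    prev = dict(cores)              # numbers of cores from the previous iteration
    for u in adj:
        h = h_index([prev[v] for v in adj[u]])
        if h < cores[u]:
            cores[u] = h            # this node's number of cores is updated ...
            num_updates += 1        # ... so the flag is incremented by 1
    return num_updates

# driver: keep iterating while some node was updated in the last full pass
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # a 4-node path graph
cores = {u: len(adj[u]) for u in adj}            # init with degree
while sweep(adj, cores):
    pass
```

On the 4-node path, the first sweep updates the two middle nodes, the second sweep returns 0, and every node ends with core number 1.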
  • Step 606 The current number of cores of each node when the iteration stops is taken as the number of cores corresponding to that node.
  • the number of cores of each node in the sparse subgraph is the number of cores of the node in the entire original network graph.
  • the process of determining the number of cores of each node in the sparse subgraph is as follows:
  • the neighbor set here refers to the neighbor nodes of the node in the original network graph, that is to say, the neighbor nodes include not only the nodes in the sparse subgraph, but also the stable nodes.
  • the core index is less than the current core number of the node, the current core number of the node is updated according to the core index, and numMsgs is increased by 1.
  • the number of cores of each node in the sparse subgraph is the number of cores of each node in the entire original network graph.
  • The network graph is thus solved by divide and conquer, which can support subgraph mining of super-large-scale networks. That is, according to the association degree of each node, the complete network graph is divided into a dense subgraph and a sparse subgraph that are mined separately, which greatly reduces memory usage and allows mining to cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on non-important nodes and improving mining performance.
  • Because the nodes in the sparse subgraph do not affect the nodes in the dense subgraph, the stable nodes and their numbers of cores are determined directly from the dense subgraph; the part of the network graph remaining after the stable nodes and the edges between them are removed constitutes the sparse subgraph.
  • Because the stable nodes of the dense subgraph do affect the nodes in the sparse subgraph, the number of cores of each node in the sparse subgraph must be determined based on both the sparse subgraph itself and the stable nodes. After the number of cores of every node in the network graph has been mined, the numbers of cores can be used as features of the corresponding nodes to generate feature vectors for input to downstream tasks.
  • Parameter Server is a super-large-scale parameter server used to store or update parameters in a distributed manner in the field of machine learning.
  • Angel is a high-performance distributed machine learning platform developed based on the concept of parameter server.
  • Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark on Angel is a high-performance distributed computing platform that combines Angel's powerful parameter server with Spark's large-scale data processing capabilities.
  • the computer device can implement the iterative process of the above graph data processing method on Spark on Angel.
  • In each iteration, the nodes whose number of cores was updated in the previous iteration can be pulled from the parameter server. Since a node's number of cores is determined by the numbers of cores of its neighbors, a change in a neighbor's core value affects the node; from this it can be inferred which nodes need to recalculate their number of cores in the current iteration. Next, the numbers of cores of those nodes and of their neighbors are pulled from the parameter server, and the new values are calculated based on the core index. If a previously stored number of cores needs to be updated with a newly calculated one, the updated value is stored back to the parameter server for use in subsequent iterations.
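The pull/compute/push cycle can be illustrated with a toy in-memory stand-in; this is not Angel's actual parameter-server API, only a sketch of the interaction pattern, with hypothetical method names:

```python
class ToyParameterServer:
    """In-memory stand-in for a distributed parameter server (illustrative only)."""

    def __init__(self):
        self._cores = {}       # node -> current number of cores
        self._updated = set()  # nodes whose cores changed in the last push round

    def pull_cores(self, nodes):
        """Pull the current numbers of cores for the requested nodes."""
        return {u: self._cores[u] for u in nodes}

    def pull_updated(self):
        """Pull the set of nodes updated in the previous iteration."""
        return set(self._updated)

    def push_cores(self, updates):
        """Store updated numbers of cores; remember which nodes changed."""
        self._cores.update(updates)
        self._updated = set(updates)
```

An executor would call `pull_updated()` to find last round's changed nodes, derive their neighbors as this round's targets, `pull_cores()` for the targets and their neighbors, recompute core indexes locally, and `push_cores()` for any value that shrank.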
  • FIG. 7 is a schematic diagram of a graph data processing method in one embodiment.
  • the parameter server stores the current number of cores of all nodes, the updated nodes and their core numbers in each iteration and the previous iteration, and the iteration server stores the adjacency list.
  • each iteration mainly has the following steps:
  • ReadMessage is updated with WriteMessage, and WriteMessage is reset. Whether ReadMessage is empty is then determined: if it is empty, the number of cores of every node is no longer being updated and the iteration ends; otherwise iteration continues.
  • the efficient data parallel processing capability of the Spark platform is used to perform the iterative calculation of the update of the number of cores, which can improve the data processing efficiency.
  • Using the storage capacity of Angel's powerful parameter server to pull or update the number of cores can eliminate the network bottleneck of single-point drivers in Spark and support k-core mining of super-large relational networks.
  • In k-core mining itself, as the iteration deepens, most nodes remain stable and are no longer updated, so the calculation process has a certain sparsity. Therefore, based on the divide-and-conquer idea, a threshold is set to divide the complete graph structure into a dense subgraph and a sparse subgraph mined in two steps, which greatly reduces memory usage, reduces computation, runs faster, and consumes fewer resources. Mining can also cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on non-important nodes with core numbers of 1, 2, and so on, which is very important for k-core mining of ultra-large networks.
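The two-phase divide-and-conquer can be sketched end to end on a single machine (the distributed parameter-server machinery is omitted; names are our own):

```python
def h_index(values):
    # largest h such that at least h of the values are >= h
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
        else:
            break
    return h

def k_core_divide_and_conquer(adj, threshold):
    """adj: node -> set of neighbors (undirected). Returns the coreness of every node."""
    # Phase 1: dense subgraph = nodes whose degree exceeds the threshold.
    active = {u for u in adj if len(adj[u]) > threshold}
    cores = {u: sum(1 for v in adj[u] if v in active) for u in active}
    updated = True
    while updated:
        updated = False
        prev = dict(cores)                       # previous-iteration snapshot
        dropped = set()
        for u in active:
            h = h_index([prev[v] for v in adj[u] if v in active])
            if h <= threshold:
                dropped.add(u)                   # nonActive: falls out of the dense part
                updated = True
            elif h < cores[u]:
                cores[u] = h
                updated = True
        active -= dropped
        for u in dropped:
            del cores[u]
    stable = cores                               # stable nodes: coreness is final
    # Phase 2: sparse subgraph = all remaining nodes; stable cores stay fixed.
    sparse = set(adj) - set(stable)
    s_cores = {u: len(adj[u]) for u in sparse}   # init with original-graph degree
    updated = True
    while updated:
        updated = False
        prev = dict(s_cores)
        for u in sparse:
            h = h_index([stable[v] if v in stable else prev[v] for v in adj[u]])
            if h < s_cores[u]:
                s_cores[u] = h
                updated = True
    return {**stable, **s_cores}
```

On a small test graph (a 4-clique, a node 4 linked to two clique members and to a pendant node 5), the result matches the known coreness: 3 for the clique, 2 for node 4, 1 for node 5, and is the same for different thresholds.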
  • the above-mentioned graph data processing method includes the following steps:
  • Step 802 Acquire a network graph.
  • Step 804 Determine the number of neighbor nodes of each node in the network graph.
  • Step 806 taking the number of neighbor nodes as the association degree of the corresponding node.
  • Step 808 Obtain a preset threshold.
  • Step 810 remove the nodes with an association degree less than or equal to the threshold value and the connected edges of the nodes from the network graph, and obtain a dense subgraph according to the remaining nodes and the connected edges between the remaining nodes in the network graph.
  • Step 812 Obtain the association degree of each node in the dense subgraph according to the number of neighbor nodes in the dense subgraph, and use the association degree in the dense subgraph as the initial current core number of the corresponding node.
  • Step 814 For each node in the dense subgraph, calculate the core index corresponding to the node according to the current core number of the node's neighbor nodes in the dense subgraph; when the core index is less than or equal to a preset threshold, Remove the node from the dense subgraph; when the core index is greater than the threshold and less than the current number of cores of the node, update the current number of cores of the node according to the core index of the node, until each node in the dense subgraph in the current iteration process Stop iteration when the current number of cores has not been updated.
  • Step 816 The nodes in the dense subgraph obtained when the iteration stops are used as stable nodes, and the current number of cores of each stable node at that point is used as the number of cores corresponding to that stable node.
  • Step 818 remove stable nodes from the network graph.
  • Step 820 Obtain a sparse subgraph according to the remaining nodes after removing the stable nodes and the edges between the remaining nodes.
  • Step 822 Initialize the current number of cores of each node in the sparse subgraph according to the number of neighbor nodes in the original network graph of each node in the sparse subgraph.
  • Step 824 For each node in the sparse subgraph, calculate the core index corresponding to the node according to the current core number of the node's neighbor nodes in the network graph; when the core index is less than the current core number of the node, calculate the core index according to The step of updating the current number of cores of the node by the core index of the node, the iteration is stopped until the current number of cores of each node in the sparse subgraph has not been updated in the current iteration process.
  • Step 826 taking the current number of cores of the node when the iteration is stopped as the number of cores corresponding to the node;
  • Step 828 generating a feature vector corresponding to the node according to the number of cores of each node;
  • Step 830 classify the nodes according to the feature vectors of the nodes.
  • After the number of cores of each node is obtained, it can be used to generate a feature vector corresponding to the node, and the feature vector is then used to classify the node.
  • The number of cores of a node can be input into a machine learning algorithm as a feature to classify nodes. For example, this can be applied to mining merchants' business models, classifying consumers and merchants in a large-scale payment network. It can also be applied to financial risk control products to detect abnormal operations such as illegal credit intermediaries, cashing out, long-term lending, and gambling.
  • the network diagram is a payment relationship network diagram
  • nodes in the payment relationship network diagram represent user IDs
  • an edge between two nodes in the payment relationship network diagram indicates that there is a payment interaction between the corresponding two user IDs
  • the above method further includes: generating a feature vector corresponding to the user ID represented by the node according to the number of cores of each node in the payment relationship network graph; and predicting the payment type corresponding to the user ID based on the feature vector through a pre-trained classification model.
  • The computer device can obtain payment records corresponding to user IDs, obtain payment interaction data between user IDs from the payment records, and generate the payment relationship network graph from the payment interaction data. The graph data processing method provided by the embodiments of this application then processes the payment relationship network graph to obtain the number of cores of each node; corresponding feature vectors are generated from these numbers of cores, and a machine-learning-based classification algorithm classifies each node to distinguish whether it is a merchant or a consumer.
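As a toy illustration of this downstream step: the [coreness, degree] feature vector and the threshold rule below are hypothetical stand-ins for the pre-trained classification model, which the document does not specify:

```python
def node_feature_vector(node, cores, adj):
    """Assemble a per-node feature vector; [coreness, degree] is an
    illustrative choice, not the prescribed feature set."""
    return [cores[node], len(adj[node])]

def classify(feature, core_cutoff=3):
    """Toy stand-in for a trained classifier: high-coreness nodes are
    treated as merchants, the rest as consumers (hypothetical rule)."""
    return "merchant" if feature[0] >= core_cutoff else "consumer"
```

In practice the feature vectors would be fed to a real trained model; the cutoff rule here only shows where the mined core numbers plug in.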
  • Although the steps in the flowcharts of FIG. 2, FIG. 5, FIG. 6, and FIG. 8 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the above figures may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their execution order is likewise not necessarily sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
  • a graph data processing apparatus 900 is provided.
  • The apparatus may adopt software modules, hardware modules, or a combination of the two, and may be part of a computer device.
  • The apparatus specifically includes: a network graph obtaining module 902, a dense subgraph obtaining module 904, a first determining module 906, a sparse subgraph obtaining module 908, and a second determining module 910, wherein:
  • a network graph obtaining module 902 configured to obtain the association degree of each node in the network graph
  • a dense subgraph obtaining module 904 configured to split the dense subgraph from the network graph according to a preset threshold and the degree of association of each node;
  • a first determination module 906 configured to determine the stable node in the network graph and the number of cores of the stable node based on the dense subgraph, where the number of cores of the stable node is greater than the preset threshold;
  • a sparse subgraph obtaining module 908 configured to obtain a sparse subgraph in the network graph according to the remaining nodes other than the stable node and the connections between the remaining nodes in the network graph;
  • the second determination module 910 is configured to determine the number of cores of each node in the sparse subgraph based on the sparse subgraph and the stable node; wherein the number of cores of the node is used to generate a feature vector corresponding to the node
  • the network graph obtaining module 902 is further configured to obtain a network graph; determine the number of neighbor nodes of each node in the network graph; and use the number of neighbor nodes as the correlation degree of the corresponding node.
  • the network graph obtaining module 902 is further configured to obtain payment records corresponding to the user IDs; obtain payment interaction data between user IDs according to the payment records; generate a payment relationship network diagram according to the payment interaction data;
  • the nodes of the relationship network graph represent user IDs, and the connection between two nodes in the payment relationship network graph indicates that there is a payment interaction event between the corresponding two user IDs.
  • the dense subgraph obtaining module 904 is further configured to obtain a preset threshold; remove the nodes whose correlation degree is less than or equal to the threshold and the connecting edges where the nodes are located from the network graph, according to the remaining nodes and remaining nodes in the network graph The edges between nodes obtain dense subgraphs.
  • The first determining module 906 is further configured to: obtain the association degree of each node in the dense subgraph according to the number of its neighbor nodes in the dense subgraph, and use that association degree as the node's initial current number of cores; iteratively, for each node in the dense subgraph, calculate the core index corresponding to the node according to the current numbers of cores of its neighbor nodes in the dense subgraph; when the core index is less than or equal to the preset threshold, remove the node from the dense subgraph; when the core index is greater than the threshold and less than the node's current number of cores, update the current number of cores according to the core index, stopping the iteration when no node's current number of cores is updated during an iteration; take the nodes remaining in the dense subgraph when the iteration stops as the stable nodes, and take their current numbers of cores at that point as the numbers of cores corresponding to the stable nodes.
  • The first determining module 906 is further configured to: after the current iteration ends, record the nodes whose current number of cores was updated during the iteration; when the next iteration starts, use the neighbor nodes of the recorded nodes in the dense subgraph as the target nodes whose core indexes need to be recalculated; and, for each target node in the dense subgraph, calculate the core index corresponding to the target node according to the current numbers of cores of its neighbor nodes in the dense subgraph.
  • The first determining module 906 is further configured to: if the node has h neighbor nodes whose current number of cores is greater than or equal to h, but does not have h+1 neighbor nodes whose current number of cores is greater than or equal to h+1, determine that the core index corresponding to the node is h, where h is a positive integer.
  • The first determining module 906 is further configured to: initialize a node update count to zero when the current iteration starts, the count recording the number of nodes whose current number of cores is updated during the iteration; count the updated nodes and adjust the count accordingly; continue with the next iteration if the count is non-zero when the current iteration ends; and stop the iteration if the count is zero when the current iteration ends.
  • the sparse subgraph obtaining module 908 is further configured to remove stable nodes from the network graph; and obtain the sparse subgraph according to the remaining nodes after removing the stable nodes and the connections between the remaining nodes.
  • The second determining module 910 is further configured to: after the current iteration ends, record the nodes whose current number of cores was updated during the iteration; when the next iteration starts, use the neighbor nodes of the recorded nodes in the sparse subgraph as the target nodes whose core indexes need to be recalculated; and, for each target node in the sparse subgraph, calculate the core index corresponding to the target node according to the current numbers of cores of its neighbor nodes in the network graph.
  • The second determining module 910 is further configured to: if the node has h neighbor nodes whose current number of cores is greater than or equal to h, but does not have h+1 neighbor nodes whose current number of cores is greater than or equal to h+1, determine that the core index corresponding to the node is h, where h is a positive integer.
  • The second determining module 910 is further configured to: initialize a node update count to zero when the current iteration starts, the count recording the number of nodes whose current number of cores is updated during the iteration; count the updated nodes and adjust the count accordingly; continue with the next iteration if the count is non-zero when the current iteration ends; and stop the iteration if the count is zero when the current iteration ends.
  • the network diagram is a payment relationship network diagram
  • the nodes in the payment relationship network diagram represent user IDs
  • an edge between two nodes in the payment relationship network diagram indicates that there is a payment interaction between the corresponding two user IDs event
  • The above-mentioned apparatus further includes a classification module, configured to generate a feature vector corresponding to the user ID represented by each node according to the number of cores of each node in the payment relationship network graph, and to predict, through a pre-trained classification model, the payment type corresponding to the user ID based on the feature vector.
  • After acquiring the association degree of each node in the network graph, the above graph data processing apparatus solves the network graph by divide and conquer, so as to support subgraph mining of super-large-scale networks. That is, according to the association degree of each node, the complete network graph is divided into a dense subgraph and a sparse subgraph that are mined separately, which greatly reduces memory usage and allows mining to cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on unimportant nodes and improving mining performance.
  • Because the nodes in the sparse subgraph do not affect the nodes in the dense subgraph, the stable nodes and their numbers of cores are determined directly from the dense subgraph; the part of the network graph remaining after the stable nodes and the edges between them are removed constitutes the sparse subgraph.
  • Because the stable nodes of the dense subgraph do affect the nodes in the sparse subgraph, the number of cores of each node in the sparse subgraph must be determined based on both the sparse subgraph itself and the stable nodes. After the number of cores of every node in the network graph has been mined, the numbers of cores can be used as features of the corresponding nodes to generate feature vectors for input to downstream tasks.
  • Each module in the above-mentioned graph data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10 .
  • the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by a processor, implement a graph data processing method.
  • FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution can be applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps in the foregoing method embodiments are implemented.
  • a computer-readable storage medium which stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, implements the steps in the foregoing method embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).


Abstract

A graph data processing method, comprising: obtaining the degree of each node in a network graph; splitting a dense subgraph out of the network graph according to a preset threshold and the degree of each node; determining, based on the dense subgraph, the stable nodes in the network graph and the core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold; obtaining a sparse subgraph of the network graph from the nodes remaining in the network graph other than the stable nodes and the edges between those remaining nodes; and determining, based on the sparse subgraph and the stable nodes, the core number of each node in the sparse subgraph.

Description

Graph data processing method and apparatus, computer device, and storage medium

This application claims priority to Chinese Patent Application No. 202011394355.5, entitled "Graph data processing method and apparatus, computer device and storage medium", filed with the Chinese Patent Office on December 3, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of big data technologies, and in particular to a graph data processing method and apparatus, a computer device, and a storage medium.

Background

With the continual improvement and spread of Internet technology, more and more users join a wide variety of online platforms, forming large-scale networks. Analyzing and mining the information hidden in these large-scale networks is of great research value.

K-core is a subgraph mining algorithm that can extract tightly connected subgraphs from complex networks. For example, it can uncover groups of buyers or sellers with abnormal behavior in a transaction network, and it can also identify the buyers or sellers at the core of the entire transaction network.

Current k-core mining algorithms mainly use recursive pruning, that is, iteratively determining the core number of every node in the network starting from k = 1. However, this recursive pruning mines upward through core values 1, 2, ..., k, spending considerable iteration time and computing resources on unimportant nodes; the overall computation time becomes too long, and mining performance on ultra-large-scale networks is poor.
Summary

A graph data processing method, performed by a computer device, the method comprising:

obtaining the degree of each node in a network graph;

splitting a dense subgraph out of the network graph according to a preset threshold and the degree of each node;

determining, based on the dense subgraph, stable nodes in the network graph and the core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold;

obtaining a sparse subgraph of the network graph from the nodes remaining in the network graph other than the stable nodes and the edges between the remaining nodes;

determining, based on the sparse subgraph and the stable nodes, the core number of each node in the sparse subgraph;

wherein the determined core numbers are used to generate feature vectors corresponding to the respective nodes.

A graph data processing apparatus, the apparatus comprising:

a network graph obtaining module, configured to obtain the degree of each node in a network graph;

a dense subgraph obtaining module, configured to split a dense subgraph out of the network graph according to a preset threshold and the degree of each node;

a first determining module, configured to determine, based on the dense subgraph, stable nodes in the network graph and the core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold;

a sparse subgraph obtaining module, configured to obtain a sparse subgraph of the network graph from the nodes remaining in the network graph other than the stable nodes and the edges between the remaining nodes;

a second determining module, configured to determine, based on the sparse subgraph and the stable nodes, the core number of each node in the sparse subgraph;

wherein the determined core numbers are used to generate feature vectors corresponding to the respective nodes.

A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the above graph data processing method.

One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the above graph data processing method.

A computer program, comprising computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of the above graph data processing method.
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为一个实施例中图数据处理方法的应用环境图;
图2为一个实施例中图数据处理方法的流程示意图;
图3为一个实施例中对网络图进行3核子图划分的示意图;
图4为一个实施例中分别对网络图按k-core分解和按阈值拆分的示意图;
图5为一个实施例中根据稠密子图确定网络图中的稳定节点的流程示意图;
图6为一个实施例中根据稀疏子图确定网络图中节点的核数的流程示意图;
图7为一个实施例中图数据处理方法的示意图;
图8为一个具体的实施例中图数据处理方法的流程示意图;
图9为一个实施例中图数据处理装置的结构框图;
图10为一个实施例中计算机设备的内部结构图。
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application, not to limit it.

The graph data processing method provided in this application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. Terminals 102 interact with one another through the server; the server 104 can obtain the interaction data formed when terminals 102 interact over the network and generate a network graph from that interaction data. In one embodiment, the server 104 obtains the degree of each node in the network graph; splits a dense subgraph out of the network graph according to a preset threshold and the degrees of the nodes; determines, based on the dense subgraph, the stable nodes in the network graph and their core numbers, the core numbers of the stable nodes being greater than the preset threshold; obtains a sparse subgraph of the network graph from the remaining nodes other than the stable nodes and the edges between the remaining nodes; and determines, based on the sparse subgraph and the stable nodes, the core number of each node in the sparse subgraph.

The terminal 102 may be, but is not limited to, a personal computer, a smartphone, a tablet, a laptop, a desktop computer, a smart speaker, or a smartwatch. The server 104 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The terminal 102 and the server 104 may be connected directly or indirectly in a wired or wireless manner, which is not limited in this application.

The graph data processing method provided in the embodiments of this application may be executed by the graph data processing apparatus provided in the embodiments, or by a computer device integrating the apparatus, where the apparatus may be implemented in hardware or software. The computer device may be the terminal 102 or the server 104 described above.

With the graph data processing method provided in the embodiments of this application, after the core number of each node in the network graph is obtained, the core numbers can further be used to generate feature vectors corresponding to the nodes, and the feature vectors can be used to classify the nodes. For example, a feature vector can be fed into a machine learning (ML) algorithm to classify the corresponding node.
In one embodiment, as shown in FIG. 2, a graph data processing method is provided. Taking the method being applied to the computer device (terminal 102 or server 104) in FIG. 1 as an example, it includes the following steps:

Step 202: obtain the degree of each node in a network graph.

A graph is a data structure that models relationships between entities; it consists of a set of nodes (also called vertices) and edges connecting the nodes. An edge between two nodes indicates an association between them and may carry a weight. The degree of a node is the number of edges incident to it, which is also the number of its neighbor nodes, where a neighbor node is a node connected to it by an edge.

A network graph is a graph generated from Internet-based network interaction data. The network interaction data may be, for example, payment interaction data, instant messaging interaction data, or online shopping interaction data; the corresponding network graphs are, for example, a payment relationship network graph, an instant messaging relationship network graph, or an online shopping relationship network graph.

Specifically, to mine useful information from a complex network, the computer device can generate a network graph from the large amount of interaction data in the network and obtain the degree of each node in it, and then perform graph mining on the network graph based on the graph and the node degrees. Graph mining is the process of extracting potentially useful information from a graph with certain algorithms, and includes graph classification, graph distance, subgraph mining, and so on. The embodiments of this application mainly mine the core number of each node in the network graph. Once the core numbers are obtained, one can not only find the set of nodes with a specified core number in the network graph, but also generate feature vectors from the core numbers as input to other machine learning algorithms.

In one embodiment, the network graph may be a payment relationship network graph, whose generation includes: obtaining payment records corresponding to user identifiers; obtaining payment interaction data between the user identifiers from the payment records; and generating the payment relationship network graph from the payment interaction data, where a node of the payment relationship network graph represents a user identifier, and an edge between two nodes indicates that a payment interaction event exists between the two corresponding user identifiers.

A payment interaction event is at least one of transaction events such as a transfer, sending a red packet, a loan, or a scan-to-pay payment. In this embodiment, each user is a node; if a payment interaction event exists between two users, an edge is formed between them. For example, if user a transfers money to user b, an edge is formed between user a and user b. Understandably, when the user population is large, the number of edges formed between these users is extremely large, so the generated payment network graph is ultra-large-scale. For example, in the WeChat Pay scenario, the number of nodes can reach 2 billion, and the number of edges between these 2 billion nodes can reach the ultra-large scale of hundreds of billions.

In one embodiment, the network graph may be a social relationship network graph, whose generation includes: obtaining historical session data of user identifiers; and generating the social relationship network graph from the historical session data, where a node of the social relationship network graph represents a user identifier, and an edge between two nodes indicates that a historical session exists between the two corresponding user identifiers.

In this embodiment, each user is a node. If a historical session exists between two users, an edge is formed between them. In another embodiment, if two users have added each other as friends, an edge is formed between them. Likewise, when the number of users is large, the resulting social relationship network graph is also very complex.

In one embodiment, obtaining the degree of each node in the network graph includes: obtaining the network graph; determining the number of neighbor nodes of each node in the network graph; and taking the number of neighbor nodes as the degree of the corresponding node.

A graph can be represented by an adjacency matrix or an adjacency list. In an adjacency list, a list of the edges starting from each node is stored for that node; for example, if node A has three edges connecting to B, C, and D, then A's list contains three edges. In an adjacency matrix, rows and columns both represent nodes, and the matrix element determined by two nodes indicates whether the two nodes are connected; if they are, its value can represent the weight of the edge between them.

The computer device can obtain the adjacency list or adjacency matrix of the network graph and traverse it to count the number of neighbor nodes of each node; the number of neighbor nodes serves as the degree of the corresponding node.
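As a minimal illustration of this step, degrees can be read directly off an adjacency list. The node names and the toy graph below are hypothetical, not taken from this application:

```python
# Toy adjacency list (hypothetical node names): node -> set of neighbours.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def degrees(adj):
    """The degree of a node is simply the size of its neighbour set."""
    return {node: len(neighbours) for node, neighbours in adj.items()}

print(degrees(adjacency))  # {'A': 3, 'B': 2, 'C': 2, 'D': 1}
```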
In a payment scenario, the degree of a node in the payment relationship network graph can be understood as the number of nodes that have had transactions with it. In a social scenario, the degree of a node in the social relationship network graph can be understood as the number of nodes that have historical sessions with it.
Step 204: split a dense subgraph out of the network graph according to a preset threshold and the degree of each node.

This embodiment mainly mines the core number of each node in the network graph. The core number (coreness) is one of the indicators used to judge a node's importance in the whole network graph. The k-core of a graph is the subgraph remaining after nodes with degree less than k are repeatedly removed from the graph: all vertices with degree less than k are removed from graph G, yielding subgraph G'; all vertices with degree less than k are removed from G', yielding a new subgraph G''; and so on, until every node in the remaining subgraph has degree at least k, giving the k-core of G. The core number of a node is defined by the largest core subgraph containing it: if a node exists in the M-core but is removed in the (M+1)-core, its core number is M.

For example, the 2-core is obtained by first removing all nodes with degree less than 2 from the graph, then removing the nodes with degree less than 2 from the remaining graph, and so on until no more can be removed; the 3-core is obtained analogously by removing nodes with degree less than 3. If a node appears at most in the 5-core but not in the 6-core, its core number is 5. FIG. 3 is a schematic diagram of the process of extracting a 3-core subgraph; referring to FIG. 3, after twice removing the nodes with degree less than 3 from the graph, the final 3-core subgraph is obtained.
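The k-core definition above can be sketched as a direct iterative-pruning routine. This is a naive single-machine reference, not the optimized method of this application, and the toy graph in the usage note is hypothetical:

```python
def k_core(adj, k):
    """Repeatedly remove nodes of degree < k; return the surviving node set."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj[u]:
                    if v in adj:          # neighbour may already be gone
                        adj[v].discard(u)
                del adj[u]
                changed = True
    return set(adj)

def coreness(adj):
    """Core number of each node: the largest k whose k-core still contains it."""
    core = {u: 0 for u in adj}
    k = 1
    while True:
        survivors = k_core(adj, k)
        if not survivors:
            return core
        for u in survivors:
            core[u] = k
        k += 1
```

For instance, on a triangle A-B-C with a pendant node D attached to A, the triangle nodes receive core number 2 and D receives core number 1.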
From the analysis above, a node whose core number is greater than k necessarily has degree greater than k. Therefore, in the embodiments of this application, the computer device sets a threshold and uses the node degrees and this threshold to split the original network graph into two parts, a dense subgraph and a sparse subgraph, and then mines the core numbers of the nodes in each in turn. Splitting the dense subgraph out of the network graph by a threshold allows mining to proceed directly on the dense subgraph, avoiding wasting iteration time and computing resources on unimportant nodes whose core numbers are below the threshold, which is very important for core mining on ultra-large-scale networks. Note that while the degree of every node in the dense subgraph is necessarily greater than the threshold, a node of the network graph whose degree is greater than the threshold does not necessarily belong to the dense subgraph.

The preset threshold can be set according to actual needs. Optionally, it can be determined by the requirements of the specific business scenario; for example, if past experience shows that nodes with core number greater than 300 play a significant role in the network graph, the computer device can set the preset threshold to 300. Optionally, it can be determined by computing-resource constraints: the smaller the threshold, the more nodes the dense subgraph split from the network graph contains and the more computing resources are needed; conversely, the larger the threshold, the smaller the dense subgraph and the fewer resources needed. Optionally, the threshold can also be set according to the distribution of node degrees in the network graph; for example, if the degrees of most nodes are below some value, the threshold can be set to that value.
In one embodiment, splitting the dense subgraph out of the network graph according to the node degrees and the preset threshold includes: obtaining the preset threshold; removing from the network graph the nodes whose degree is less than or equal to the threshold, together with the edges incident to those nodes; and obtaining the dense subgraph from the nodes remaining in the network graph and the edges between the remaining nodes.

Specifically, based on the preset threshold, the computer device filters out of the original graph the nodes whose degree is less than or equal to the threshold, obtaining the dense subgraph, in which every node's degree is greater than the threshold. Evidently, the larger the threshold, the smaller the resulting dense subgraph and the fewer computing resources are required.
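The threshold split described above can be sketched as follows: nodes are dropped whenever their degree is less than or equal to the threshold, so every surviving node has degree greater than the threshold. The toy graph and threshold value in the usage note are illustrative only:

```python
def split_dense(adj, threshold):
    """Iteratively drop nodes whose degree is <= threshold; what survives is
    the dense subgraph (every remaining degree > threshold). The dropped
    nodes and the edges between them later form the sparse subgraph."""
    dense = {u: set(vs) for u, vs in adj.items()}  # keep the original intact
    changed = True
    while changed:
        changed = False
        for u in list(dense):
            if len(dense[u]) <= threshold:
                for v in dense[u]:
                    if v in dense:          # neighbour may already be gone
                        dense[v].discard(u)
                del dense[u]
                changed = True
    return dense
```

For example, with threshold 2, a 4-clique A-B-C-D with a pendant node E attached to A is reduced to the clique alone.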
FIG. 4 is a schematic diagram, in one embodiment, comparing k-core decomposition of a network graph with threshold splitting. The left side of FIG. 4 mines the core numbers of the nodes bottom-up with the k-core algorithm for k = 1, k = 2, k = 3, and so on, that is, starting from k = 1 and repeatedly removing nodes whose degree is less than or equal to k. For k = 1 the computer device needs 2 iterations, for k = 2 it needs 2, for k = 3 it needs 2, and for k = 4 it needs 2; since no node has degree greater than 5, k = 5 needs 1 iteration. In other words, the computer device needs 9 iterations in total to determine the core number of every node in the graph and obtain the subgraphs formed by nodes of equal core number. The right side of FIG. 4 instead iteratively removes from the original graph the nodes whose degree is below the preset threshold. Taking a threshold of 2 as an example, the computer device iteratively filters out the nodes whose degree is less than or equal to 2, and only 2 iterations are needed to split the original network graph into the dense subgraph and the sparse subgraph. Because the iterative computation is sparse, the core numbers of many nodes, once determined, are never updated again in subsequent iterations.
Step 206: based on the dense subgraph, determine the stable nodes in the network graph and the core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold.

A stable node is a node mined from the dense subgraph whose core number is greater than the preset threshold. After splitting the dense subgraph out of the network graph, the computer device first mines the dense subgraph to determine the stable nodes and their core numbers, accomplishing the first step of the divide-and-conquer solution.

Specifically, since the degrees of the nodes in the sparse subgraph are less than the preset threshold, the nodes of the sparse subgraph cannot affect the core numbers of the nodes in the dense subgraph. The computer device can therefore cut directly into the dense subgraph, mine it, determine the core number of each node from the node degrees in the dense subgraph, and take the nodes whose core number is greater than the preset threshold as the stable nodes of the network graph.

In one embodiment, the computer device can run the k-core algorithm directly on the dense subgraph to mine the stable nodes whose core number is greater than the preset threshold. Specifically, for k = 1, k = 2, ..., up to k equal to the preset threshold, nodes with degree less than or equal to k are repeatedly removed from the dense subgraph to obtain the k-core, thereby determining, for each node, the largest-core subgraph it belongs to and hence its core number; the nodes whose core number is greater than the preset threshold are taken as the stable nodes.

In one embodiment, when iterating over the dense subgraph, in the current iteration the computer device can update a node's core number using the core indices of its neighbor nodes from the previous iteration. Moreover, since a node cannot affect the computation of the core numbers of other nodes whose core numbers are larger than its own, after the current iteration updates the nodes' core numbers, the computer device can keep only the nodes whose updated core number is greater than the preset threshold in the next iteration; nodes whose updated core number is less than or equal to the preset threshold no longer participate. In this way, the nodes of the dense subgraph with core number greater than the preset threshold can be mined.

In one embodiment, the core index over a node's neighbor nodes can be the H-index: if a node's H-index is h, the node has at least h neighbor nodes whose degrees are all no less than h. That is, if among a node's neighbor nodes there exist h neighbors whose current core numbers are greater than or equal to h, but there do not exist h+1 neighbors whose current core numbers are greater than or equal to h+1, then the node's core index is determined to be h, where h is a positive integer.
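A minimal sketch of the H-index-style core index defined above; the function name `core_index` is ours, not from this application:

```python
def core_index(neighbour_cores):
    """Largest h such that at least h of the given values are >= h
    (the H-index of the neighbours' current core numbers)."""
    h = 0
    while sum(1 for c in neighbour_cores if c >= h + 1) >= h + 1:
        h += 1
    return h

# Three neighbours have core >= 2, but only two have core >= 3, so h = 2.
print(core_index([3, 3, 2, 1]))  # 2
```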
In one embodiment, as shown in FIG. 5, determining the stable nodes in the network graph and their core numbers based on the dense subgraph includes:

Step 502: obtain each node's degree in the dense subgraph from its number of neighbor nodes in the dense subgraph, and take the degree in the dense subgraph as the node's initial current core number.

Specifically, when mining the dense subgraph, the computer device can initialize each node's core number with the node's degree within the dense subgraph, as the initial current core number.

Understandably, the "current core number" in this embodiment changes dynamically: it refers to each node's core number as updated after the previous iteration. "Previous iteration" and "current iteration" are likewise dynamic: at the next iteration, the "current iteration" becomes the "previous iteration", and the next iteration becomes the "current iteration".

Step 504: iteratively perform, for each node in the dense subgraph, the steps of computing the node's core index from the current core numbers of its neighbor nodes in the dense subgraph; removing the node from the dense subgraph when the core index is less than or equal to the preset threshold; and updating the node's current core number with its core index when the core index is greater than the threshold and less than the node's current core number; stopping the iteration when no node's current core number in the dense subgraph is updated during the current iteration.

Specifically, in each iteration the computer device processes every node of the dense subgraph. For each node, its core index is computed from the current core numbers of its neighbor nodes, that is, the neighbors' core numbers after the previous round. If the node's core index is less than or equal to the preset threshold, the node cannot affect the computation of the core numbers of nodes with larger core numbers, need not participate in subsequent iterations, and can be removed from the dense subgraph. If the node's core index is greater than the preset threshold and less than the node's current core number, the current core number is updated with the core index and the node continues to participate in subsequent iterations. Because each node's core number in the current iteration is determined only by its neighbors' core numbers from the previous iteration, the computation is local and can easily be extended into distributed parallel logic, accelerating the whole mining process.

The stopping condition is that, during the current iteration, none of the current core numbers of the nodes remaining in the dense subgraph changed. That is, when the core index computed from a node's neighbors' previous-iteration core numbers equals the node's current core number, the node's core number is not updated; if no remaining node's current core number is updated in the current iteration, the iteration stops.

Understandably, since each iteration removes the nodes of the dense subgraph whose core index is less than or equal to the preset threshold, the dense subgraph changes dynamically during iteration, and so does each node's neighbor set. Therefore, when computing a node's core index from its neighbors' current core numbers, the neighbors in the current dense subgraph should be used, rather than the neighbors in the initial dense subgraph; this further reduces the computation.
In one embodiment, if a node's core index computed in the current iteration is less than or equal to the preset threshold, the computer device can mark the node as non-stable, and nodes marked non-stable no longer participate in the next iteration.

In one embodiment, the method further includes: after the current iteration ends, recording the nodes whose current core numbers were updated during the current iteration; the recorded nodes are used to indicate that, at the start of the next iteration, their neighbor nodes in the dense subgraph are taken as the target nodes whose core indices need to be recomputed during the next iteration. Computing, for each node in the dense subgraph, the node's core index from the current core numbers of its neighbors in the dense subgraph then includes: for a target node in the dense subgraph, computing the target node's core index from the current core numbers of the target node's neighbor nodes in the dense subgraph.

In this embodiment, by recording the nodes whose current core numbers were updated in the current iteration, the nodes that need recomputation in the next iteration can be determined directly. When a node's core number is updated, it affects the determination of its neighbors' core numbers. Therefore, after the current iteration ends, these updated nodes are recorded; at the start of the next iteration, their neighbor nodes are traversed from the nodes remaining in the dense subgraph as the nodes whose core numbers must be recomputed, which avoids recomputing core numbers for every node of the dense subgraph and improves mining efficiency. Understandably, the neighbors of the updated nodes do not include nodes already removed from the dense subgraph.

In one embodiment, the method further includes: at the start of the current iteration, initializing a node-update count to zero, the count being used to record the number of nodes whose current core numbers are updated during the current iteration; counting the nodes whose current core numbers are updated during the iteration; updating the node-update count accordingly; continuing with the next iteration if the count is nonzero when the current iteration ends; and stopping the iteration if the count is zero when the current iteration ends.

In this embodiment, while mining the dense subgraph, a marker can be used to record the number of nodes whose current core numbers are updated in the current iteration. The computer device can set such a counter for each round: at the start of the iteration the marker is set to 0, and for the nodes participating in the iteration, each time a node's core number is updated the marker is incremented by 1. After the iteration, a nonzero marker means some node's core number was updated and iteration must continue; a marker of 0 means no node's core number was updated during the whole iteration, and the overall iteration ends.

Step 506: take the nodes of the dense subgraph obtained when the iteration stops as the stable nodes, and take the stable nodes' current core numbers at that point as the core numbers corresponding to the stable nodes.

Since the core numbers of the nodes remaining in the dense subgraph after the iteration ends are all greater than the preset threshold, these nodes are called stable nodes. A stable node's core number is its core number in the entire original network graph.
In a specific embodiment, the process of determining the core numbers of the nodes in the dense subgraph is as follows:

1. Compute each node's degree in the dense subgraph from its number of neighbor nodes there, and initialize each node's current core number with that degree;

2. Initialize numMsgs with zero, where numMsgs denotes the number of nodes whose core numbers are updated in each round of iteration;

3. For each node of the dense subgraph, compute its core index from the current core numbers of its neighbor nodes, where the neighbors are the node's neighbors in the dense subgraph with the nonActive nodes filtered out. When the core index is less than or equal to the preset threshold, mark the node as nonActive; when the core index is greater than the preset threshold and less than the node's current core number, update the node's current core number with the core index and increment numMsgs by 1;

4. When numMsgs is nonzero, repeat steps 2-3; otherwise end the iteration. At this point, the current core number of each node of the dense subgraph not marked nonActive is that node's core number in the entire original network graph, and the nodes not marked nonActive are the stable nodes of the network graph.
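Steps 1-4 above can be sketched as a single-machine routine; this is an illustrative simplification of the distributed method, with one deviation noted in a comment (node removals are also counted in numMsgs so that the affected neighbours are always re-examined). The `core_index` helper is the H-index defined earlier:

```python
def core_index(cores):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    while sum(1 for c in cores if c >= h + 1) >= h + 1:
        h += 1
    return h

def mine_dense(dense_adj, threshold):
    """Single-machine sketch of steps 1-4: returns {stable node: core number}."""
    core = {u: len(vs) for u, vs in dense_adj.items()}   # step 1: degree init
    active = set(dense_adj)                              # not yet nonActive
    while True:
        num_msgs = 0                                     # step 2
        for u in list(active):                           # step 3
            h = core_index([core[v] for v in dense_adj[u] if v in active])
            if h <= threshold:
                active.discard(u)   # mark nonActive
                num_msgs += 1       # deviation: count removals too, so that
                                    # the neighbours are re-checked next round
            elif h < core[u]:
                core[u] = h
                num_msgs += 1
        if num_msgs == 0:                                # step 4: converged
            break
    return {u: core[u] for u in active}
```

For example, on a 4-clique the routine returns core number 3 for every node when the threshold is 2, and an empty result when the threshold is 3, consistent with stable nodes having core numbers strictly greater than the threshold.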
In this embodiment, the core numbers of the nodes of the dense subgraph are computed based on the core index, and the core number computed in each iteration is compared against the preset threshold; a node continues iterating only while its computed core number is greater than the threshold, and otherwise no longer participates in subsequent iterations, which improves the mining efficiency on the dense subgraph.

Step 208: obtain the sparse subgraph of the network graph from the nodes remaining in the network graph other than the stable nodes and the edges between the remaining nodes.

Specifically, after the computer device has determined the stable nodes of the network graph, the core numbers of the nodes remaining in the network graph other than the stable nodes are less than or equal to the preset threshold; these remaining nodes and the edges formed between them are called the sparse subgraph.

In one embodiment, obtaining the sparse subgraph from the remaining nodes and their edges includes: removing the stable nodes from the network graph; and obtaining the sparse subgraph from the nodes remaining after removing the stable nodes and the edges between those remaining nodes.

As mentioned above, a graph can be stored as an adjacency matrix or an adjacency list. After determining the stable nodes of the network graph, the computer device can traverse the adjacency matrix or adjacency list, remove the stable nodes, and obtain the remaining nodes and the connections between them, that is, the sparse subgraph.

Step 210: based on the sparse subgraph and the stable nodes, determine the core number of each node in the sparse subgraph.
The computation of the core numbers of the nodes in the sparse subgraph follows the same core-index iteration described above, but since the stable nodes do affect the computation of the core numbers in the sparse subgraph, the iteration must also account for the increment the stable nodes contribute to the core numbers of the sparse nodes. Having obtained the sparse subgraph and the stable nodes of the network graph, the computer device can determine the core number of each node in the sparse subgraph based on both, accomplishing the second step of the divide-and-conquer solution.

In one embodiment, the computer device can run the k-core algorithm on the sparse subgraph to mine each node's core number. Specifically, for k = 1, k = 2, ..., up to k equal to the preset threshold, nodes with degree less than or equal to k are repeatedly removed from the sparse subgraph to obtain the k-core, thereby determining the largest-core subgraph each node belongs to and hence its core number.

In one embodiment, when iterating over the sparse subgraph, in the current iteration the computer device can also update a node's core number using the core indices of the node's neighbor nodes in the network graph after the previous iteration.

In one embodiment, the core index over a node's neighbor nodes can be the H-index: if a node's H-index is h, the node has at least h neighbor nodes whose degrees are all no less than h. That is, if among a node's neighbors there exist h neighbors whose current core numbers are greater than or equal to h, but there do not exist h+1 neighbors whose current core numbers are greater than or equal to h+1, the node's core index is determined to be h, where h is a positive integer.
In one embodiment, as shown in FIG. 6, determining the core numbers of the nodes in the sparse subgraph based on the sparse subgraph and the stable nodes includes:

Step 602: initialize the current core number of each node in the sparse subgraph from the node's number of neighbor nodes in the original network graph.

Specifically, when mining the sparse subgraph, the computer device can initialize each node's core number with the node's degree in the original network graph, as the initial current core number.

That is, when computing the core numbers of the nodes in the sparse subgraph, each iteration must consider not only the influence of the other sparse-subgraph nodes on a node but also the influence of the stable nodes, and therefore the increment the stable nodes contribute to the node's degree. A node's current core number is accordingly initialized with the sum of its degree within the sparse subgraph and the number of its connections to stable nodes, which is in fact the node's degree in the original network graph.

In one embodiment, following the preceding steps, the core numbers of the stable nodes are already determined and are all greater than the preset threshold, while the core numbers of the nodes in the sparse subgraph are all less than or equal to the preset threshold. Therefore, when the stable nodes' core numbers are needed in computing the core numbers of the sparse-subgraph nodes, to reduce memory usage the stable nodes' core numbers may all be set to the preset threshold, or to any value greater than the preset threshold, or the core numbers determined in the preceding steps may be used directly; these different settings do not affect the computed core numbers of the sparse-subgraph nodes.

Step 604: iteratively perform, for each node in the sparse subgraph, the steps of computing the node's core index from the current core numbers of its neighbor nodes in the network graph, and updating the node's current core number with the core index when the core index is less than the node's current core number; stopping the iteration when no node's current core number in the sparse subgraph is updated during the current iteration.

Specifically, in each iteration the computer device processes every node of the sparse subgraph. For each node, its core index is computed from the current core numbers of its neighbor nodes in the network graph, that is, the neighbors' core numbers after the previous round. Understandably, if the neighbors include stable nodes, the stable nodes' core numbers were determined in the preceding steps and do not participate in updates during the sparse-subgraph iteration. If a node's core index is less than its current core number, the current core number is updated with the core index. Because each node's core number in the current iteration is determined only by the core numbers of all its neighbors from the previous iteration, the computation is local and can easily be extended into distributed parallel logic, accelerating the whole mining process.

The stopping condition is that, during the current iteration, none of the current core numbers of the nodes in the sparse subgraph changed. That is, when the core index computed from a node's neighbors' previous-iteration core numbers equals the node's current core number, the node's core number is not updated; if no node of the sparse subgraph has its current core number updated in the current iteration, the iteration stops.

In one embodiment, the method further includes: after the current iteration ends, recording the nodes whose current core numbers were updated during the current iteration; the recorded nodes are used to indicate that, at the start of the next iteration, their neighbor nodes in the sparse subgraph are taken as the target nodes whose core indices need to be recomputed during the next iteration. Computing, for each node of the sparse subgraph, the node's core index from the current core numbers of its neighbors in the network graph then includes: for a target node in the sparse subgraph, computing the target node's core index from the current core numbers of the target node's neighbor nodes in the network graph.

In this embodiment, by recording the nodes whose current core numbers were updated in the current iteration, the nodes that need recomputation in the next iteration can be determined directly. When a node's core number is updated, it affects the determination of its neighbors' core numbers; therefore, after the current iteration ends, these updated nodes are recorded, and at the start of the next iteration their neighbor nodes are traversed from the sparse subgraph as the nodes whose core numbers must be recomputed, which avoids recomputing core numbers for all nodes of the sparse subgraph and improves mining efficiency. Understandably, after the neighbors of the updated nodes are determined, if the neighbors include stable nodes, the stable nodes' core numbers need not be recomputed.

In one embodiment, the method further includes: at the start of the current iteration, initializing a node-update count to zero, the count being used to record the number of nodes whose current core numbers are updated during the current iteration; counting the nodes whose current core numbers are updated during the iteration; updating the node-update count accordingly; continuing with the next iteration if the count is nonzero when the current iteration ends; and stopping the iteration if the count is zero when the current iteration ends.

In this embodiment, while mining the sparse subgraph, a marker can be used to record the number of nodes whose current core numbers are updated in the current iteration. The computer device can set such a counter for each round: at the start of the iteration the marker is set to 0, and for the nodes participating in the iteration, each time a node's core number is updated the marker is incremented by 1. After the iteration, a nonzero marker means some node's core number was updated and iteration must continue; a marker of 0 means no node's core number was updated during the whole iteration, and the overall iteration ends.

Step 606: take each node's current core number when the iteration stops as the core number corresponding to the node.

After the iteration ends, the core number of each node of the sparse subgraph is that node's core number in the entire original network graph.
In a specific embodiment, the process of determining the core numbers of the nodes in the sparse subgraph is as follows:

1. Compute each node's degree in the sparse subgraph;

2. For each node of the sparse subgraph, count its number q of connections to stable nodes, and initialize the node's current core number with the sum of q and its degree;

3. Initialize numMsgs with zero, where numMsgs denotes the number of nodes whose core numbers are updated in each round of iteration;

4. For each node of the sparse subgraph, compute its core index from the current core numbers of its neighbor nodes. Here the neighbor set means the node's neighbors in the original network graph; that is, the neighbors include not only sparse-subgraph nodes but possibly stable nodes as well. When the core index is less than the node's current core number, update the node's current core number with the core index and increment numMsgs by 1.

5. When numMsgs is nonzero, repeat steps 3-4; otherwise end the iteration. At this point, the core number of each node of the sparse subgraph is that node's core number in the entire original network graph.
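Steps 1-5 above can be sketched as follows (a single-machine illustration, not the distributed implementation). Here `full_adj` is the original graph's adjacency list and `stable_cores` is the output of the dense-subgraph pass; as noted above, the exact values stored for stable nodes do not change the result as long as they exceed the threshold:

```python
def core_index(cores):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    while sum(1 for c in cores if c >= h + 1) >= h + 1:
        h += 1
    return h

def mine_sparse(full_adj, stable_cores):
    """Single-machine sketch of steps 1-5. full_adj: original adjacency
    list; stable_cores: {stable node: fixed core number}."""
    sparse = [u for u in full_adj if u not in stable_cores]
    # steps 1-2: degree within the sparse part plus the count q of stable
    # neighbours, which is simply the degree in the original graph
    core = {u: len(full_adj[u]) for u in sparse}
    while True:
        num_msgs = 0                                   # step 3
        for u in sparse:                               # step 4: neighbours
            h = core_index([                           # in the ORIGINAL graph
                stable_cores.get(v, core.get(v, 0)) for v in full_adj[u]
            ])
            if h < core[u]:
                core[u] = h
                num_msgs += 1
        if num_msgs == 0:                              # step 5: converged
            break
    return core
```

For example, attaching a pendant node E to a 4-clique whose nodes are all stable with core number 3 yields core number 1 for E.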
In the above graph data processing method, after the degrees of the nodes in the network graph are obtained, the network graph is solved by divide and conquer, so that subgraph mining on ultra-large-scale networks can be supported. That is, the complete network graph is divided by node degrees into a dense subgraph and a sparse subgraph and mined in two parts, which greatly reduces memory usage; mining can also cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on unimportant nodes and improving mining performance.

Since the nodes of the sparse subgraph cannot affect the nodes of the dense subgraph, the stable nodes and their corresponding core numbers are determined directly from the dense subgraph; the part of the network graph remaining after removing the stable nodes and the edges between them then forms the sparse subgraph. Considering that the stable nodes of the dense subgraph do affect the nodes of the sparse subgraph, the core numbers of the sparse subgraph's nodes must be determined from the sparse subgraph itself together with the stable nodes of the dense subgraph. Once the core numbers of the nodes in the network graph are mined, the core numbers can serve as features of the corresponding nodes to generate feature vectors fed into other downstream tasks.
Parameter Server is an ultra-large-scale parameter server for storing or updating parameters in a distributed manner in the machine learning field. Angel is a high-performance distributed machine learning platform developed on the parameter server concept. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark on Angel is a high-performance distributed computing platform that combines Angel's powerful parameter server capabilities with Spark's large-scale data processing capability.

In one embodiment, the computer device can implement the iterative process of the above graph data processing method on Spark on Angel.

Specifically, first, the nodes whose core numbers were updated in the previous round can be pulled from the parameter server; since a node's core number is determined by the core numbers of its neighbors, a change in a neighbor's core number affects the node's core number, so the nodes whose core numbers need recomputation in the current iteration can be inferred. Next, the core numbers of the nodes to be recomputed and of their neighbors are pulled from the parameter server. Then the core numbers of these nodes in the current iteration are computed based on the core index; when a previously stored core number must be updated with the newly computed one, the updated core number is stored back on the parameter server for use in the iterative process.
FIG. 7 is a schematic diagram of the graph data processing method in one embodiment, showing the flow of implementing the method on the Spark on Angel platform. The parameter server stores the current core numbers of all nodes, together with the nodes updated in each round and in the previous round and their core numbers; the iteration servers store the adjacency list. For each iteration server, each iteration mainly has the following steps:

1. Pull from the parameter server's ReadMessage the nodes updated in the previous round, and from them infer the nodes whose core numbers need recomputation in this round. The rationale: a node's core number is determined by the core numbers of its neighbors, so if a neighbor's core number changes, the node's core number is affected.

2. Pull from the parameter server's Coreness the core numbers of the nodes to be computed and of their neighbors.

3. Compute the core numbers of the nodes in this round.

4. Update the core numbers stored in the parameter server's WriteMessage and Coreness with the core numbers obtained in step 3.

After all iteration servers have computed once, update ReadMessage with WriteMessage and reset WriteMessage. Check whether ReadMessage is empty: if it is, no node's core number is being updated any more and the iteration ends; otherwise the iteration continues.
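The ReadMessage/WriteMessage cycle above can be imitated in memory with plain sets and a dict standing in for the parameter server. This is a toy, single-process sketch of the message flow, not the Spark on Angel implementation; for simplicity it iterates the whole graph rather than the dense/sparse split:

```python
def core_index(cores):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    while sum(1 for c in cores if c >= h + 1) >= h + 1:
        h += 1
    return h

def iterate_with_messages(adj):
    """Toy stand-in for the parameter-server loop: `core` plays Coreness,
    `read_message` and `write_message` play ReadMessage and WriteMessage."""
    core = {u: len(vs) for u, vs in adj.items()}  # initialise with degrees
    read_message = set(adj)       # round 1: treat every node as updated
    while read_message:           # empty ReadMessage: iteration ends
        write_message = set()
        # step 1: only neighbours of previously updated nodes can change
        targets = {v for u in read_message for v in adj[u]}
        for u in targets:
            # steps 2-3: pull the neighbour cores and recompute
            h = core_index([core[v] for v in adj[u]])
            if h < core[u]:
                core[u] = h
                write_message.add(u)   # step 4: record the update
        read_message = write_message   # WriteMessage -> ReadMessage, reset
    return core
```

The loop converges because core numbers only ever decrease and never fall below the true coreness, mirroring the sparsity the text describes: as iteration deepens, most nodes stop appearing in ReadMessage.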
In this embodiment, the iterative core-number computation uses Spark's efficient data-parallel processing capability, improving data processing efficiency. Using the storage capability of Angel's powerful parameter server to pull or update core numbers eliminates the network bottleneck of the single-point Driver in Spark and supports k-core mining on ultra-large-scale relationship networks. Exploiting the nature of k-core mining itself, most nodes become stable and stop being updated as the iteration deepens, so the computation is sparse; hence, based on divide and conquer, setting a threshold splits the complete graph structure into a dense subgraph and a sparse subgraph mined in two steps, greatly reducing memory usage and computation, running faster and consuming fewer resources. Mining can also cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on unimportant nodes with core numbers 1, 2, and so on, which is very important for k-core mining on ultra-large-scale networks.
In a specific embodiment, as shown in FIG. 8, the above graph data processing method includes the following steps:

Step 802: obtain a network graph.

Step 804: determine the number of neighbor nodes of each node in the network graph.

Step 806: take the number of neighbor nodes as the degree of the corresponding node.

Step 808: obtain a preset threshold.

Step 810: remove from the network graph the nodes whose degree is less than or equal to the threshold together with the edges incident to those nodes, and obtain the dense subgraph from the nodes remaining in the network graph and the edges between the remaining nodes.

Step 812: obtain each node's degree in the dense subgraph from its number of neighbor nodes in the dense subgraph, and take that degree as the node's initial current core number.

Step 814: iteratively perform, for each node in the dense subgraph, the steps of computing the node's core index from the current core numbers of its neighbor nodes in the dense subgraph; removing the node from the dense subgraph when the core index is less than or equal to the preset threshold; and updating the node's current core number with its core index when the core index is greater than the threshold and less than the node's current core number; stopping the iteration when no node's current core number in the dense subgraph is updated during the current iteration.

Step 816: take the nodes of the dense subgraph obtained when the iteration stops as the stable nodes, and take the stable nodes' current core numbers at that point as the core numbers corresponding to the stable nodes.

Step 818: remove the stable nodes from the network graph.

Step 820: obtain the sparse subgraph from the nodes remaining after removing the stable nodes and the edges between the remaining nodes.

Step 822: initialize the current core number of each node in the sparse subgraph from the node's number of neighbor nodes in the original network graph.

Step 824: iteratively perform, for each node in the sparse subgraph, the steps of computing the node's core index from the current core numbers of its neighbor nodes in the network graph, and updating the node's current core number with the core index when the core index is less than the node's current core number; stopping the iteration when no node's current core number in the sparse subgraph is updated during the current iteration.

Step 826: take each node's current core number when the iteration stops as the core number corresponding to the node.

Step 828: generate a feature vector corresponding to each node from the node's core number.

Step 830: classify the nodes according to their feature vectors.

In one embodiment, a node's core number can be used to generate the feature vector corresponding to the node, and the feature vector is used to classify the node. Specifically, a node's core number can be input to a machine learning algorithm as a feature to classify the node. For example, this can be applied to mining merchants' business models, classifying the consumers and merchants in an ultra-large-scale payment network; it can also be applied to financial risk-control products to detect anomalies such as illegal loan intermediaries, cash-out, multi-platform borrowing, and gambling.

In one embodiment, the network graph is a payment relationship network graph, in which a node represents a user identifier and an edge between two nodes indicates that a payment interaction event exists between the two corresponding user identifiers. The method further includes: generating, from the core number of each node in the payment relationship network graph, a feature vector corresponding to the user identifier the node represents; and predicting, with a pre-trained classification model, the payment type corresponding to the user identifier based on the feature vector.

In a specific application scenario, the computer device can obtain the payment records corresponding to user identifiers; obtain the payment interaction data between the user identifiers from the payment records; generate the payment relationship network graph from the payment interaction data; process the payment relationship network graph with the graph data processing method provided in the embodiments of this application to obtain the core number of each node; generate the corresponding feature vectors from the core numbers; and classify the nodes with a machine-learning-based classification algorithm to distinguish whether each node is a merchant or a consumer.

It should be understood that although the steps in the flowcharts of FIG. 2, FIG. 5, FIG. 6, and FIG. 8 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the figures above may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 9, a graph data processing apparatus 900 is provided. The apparatus may adopt software modules or hardware modules, or a combination of both, as part of a computer device, and specifically includes: a network graph obtaining module 902, a dense subgraph obtaining module 904, a first determining module 906, a sparse subgraph obtaining module 908, and a second determining module 910, where:

the network graph obtaining module 902 is configured to obtain the degree of each node in a network graph;

the dense subgraph obtaining module 904 is configured to split a dense subgraph out of the network graph according to a preset threshold and the degrees of the nodes;

the first determining module 906 is configured to determine, based on the dense subgraph, the stable nodes in the network graph and the core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold;

the sparse subgraph obtaining module 908 is configured to obtain the sparse subgraph of the network graph from the nodes remaining in the network graph other than the stable nodes and the edges between the remaining nodes;

the second determining module 910 is configured to determine, based on the sparse subgraph and the stable nodes, the core number of each node in the sparse subgraph, where a node's core number is used to generate the feature vector corresponding to the node.

In one embodiment, the network graph obtaining module 902 is further configured to obtain the network graph; determine the number of neighbor nodes of each node in the network graph; and take the number of neighbor nodes as the degree of the corresponding node.

In one embodiment, the network graph obtaining module 902 is further configured to obtain payment records corresponding to user identifiers; obtain payment interaction data between the user identifiers from the payment records; and generate a payment relationship network graph from the payment interaction data, where a node of the payment relationship network graph represents a user identifier, and an edge between two nodes indicates that a payment interaction event exists between the two corresponding user identifiers.

In one embodiment, the dense subgraph obtaining module 904 is further configured to obtain the preset threshold; remove from the network graph the nodes whose degree is less than or equal to the threshold together with the edges incident to those nodes; and obtain the dense subgraph from the nodes remaining in the network graph and the edges between the remaining nodes.

In one embodiment, the first determining module 906 is further configured to obtain each node's degree in the dense subgraph from its number of neighbor nodes in the dense subgraph, taking that degree as the node's initial current core number; iteratively perform, for each node in the dense subgraph, the steps of computing the node's core index from the current core numbers of its neighbors in the dense subgraph, removing the node when the core index is less than or equal to the preset threshold, and updating the node's current core number with the core index when the core index is greater than the threshold and less than the node's current core number, until no node's current core number in the dense subgraph is updated during the current iteration; and take the nodes of the dense subgraph obtained when the iteration stops as the stable nodes and their current core numbers at that point as the core numbers corresponding to the stable nodes.

In one embodiment, the first determining module 906 is further configured to record, after the current iteration ends, the nodes whose current core numbers were updated during the current iteration, the recorded nodes being used to indicate that, at the start of the next iteration, the recorded nodes' neighbors in the dense subgraph are taken as the target nodes whose core indices need to be recomputed during the next iteration; and, for a target node in the dense subgraph, compute the target node's core index from the current core numbers of the target node's neighbors in the dense subgraph.

In one embodiment, the first determining module 906 is further configured to determine a node's core index as h when, among the node's neighbors, there exist h neighbors whose current core numbers are greater than or equal to h, but there do not exist h+1 neighbors whose current core numbers are greater than or equal to h+1, where h is a positive integer.

In one embodiment, the first determining module 906 is further configured to initialize, at the start of the current iteration, a node-update count to zero, the count recording the number of nodes whose current core numbers are updated during the current iteration; count the updated nodes; update the node-update count accordingly; continue with the next iteration if the count is nonzero when the current iteration ends; and stop the iteration if the count is zero when the current iteration ends.

In one embodiment, the sparse subgraph obtaining module 908 is further configured to remove the stable nodes from the network graph, and obtain the sparse subgraph from the nodes remaining after the removal and the edges between the remaining nodes.

In one embodiment, the second determining module 910 is further configured to record, after the current iteration ends, the nodes whose current core numbers were updated during the current iteration, the recorded nodes being used to indicate that, at the start of the next iteration, the recorded nodes' neighbors in the sparse subgraph are taken as the target nodes whose core indices need to be recomputed during the next iteration; and, for a target node in the sparse subgraph, compute the target node's core index from the current core numbers of the target node's neighbors in the network graph.

In one embodiment, the second determining module 910 is further configured to determine a node's core index as h when, among the node's neighbors, there exist h neighbors whose current core numbers are greater than or equal to h, but there do not exist h+1 neighbors whose current core numbers are greater than or equal to h+1, where h is a positive integer.

In one embodiment, the second determining module 910 is further configured to initialize, at the start of the current iteration, a node-update count to zero, the count recording the number of nodes whose current core numbers are updated during the current iteration; count the updated nodes; update the node-update count accordingly; continue with the next iteration if the count is nonzero when the current iteration ends; and stop the iteration if the count is zero when the current iteration ends.

In one embodiment, the network graph is a payment relationship network graph, in which a node represents a user identifier and an edge between two nodes indicates that a payment interaction event exists between the two corresponding user identifiers; the apparatus further includes a classification module, configured to generate, from the core number of each node in the payment relationship network graph, the feature vector corresponding to the user identifier the node represents, and to predict, with a pre-trained classification model, the payment type corresponding to the user identifier based on the feature vector.
With the above graph data processing apparatus, after the degrees of the nodes in the network graph are obtained, the network graph is solved by divide and conquer, so that subgraph mining on ultra-large-scale networks can be supported. That is, the complete network graph is divided by node degrees into a dense subgraph and a sparse subgraph and mined in two parts, which greatly reduces memory usage; mining can also cut directly into the dense subgraph, avoiding wasting iteration time and computing resources on unimportant nodes and improving mining performance.

Since the nodes of the sparse subgraph cannot affect the nodes of the dense subgraph, the stable nodes and their corresponding core numbers are determined directly from the dense subgraph; the part of the network graph remaining after removing the stable nodes and the edges between them then forms the sparse subgraph. Considering that the stable nodes of the dense subgraph do affect the nodes of the sparse subgraph, the core numbers of the sparse subgraph's nodes must be determined from the sparse subgraph itself together with the stable nodes of the dense subgraph. Once the core numbers of the nodes in the network graph are mined, the core numbers can serve as features of the corresponding nodes to generate feature vectors fed into other downstream tasks.

For specific limitations on the graph data processing apparatus, refer to the limitations on the graph data processing method above, which are not repeated here. Each module of the above graph data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server whose internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus, where the processor of the computer device provides computing and control capabilities; the memory includes a non-volatile storage medium and an internal memory, the non-volatile storage medium storing an operating system and computer-readable instructions, and the internal memory providing an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium; the network interface of the computer device communicates with external terminals over a network connection; and the computer-readable instructions, when executed by the processor, implement a graph data processing method.

A person skilled in the art can understand that the structure shown in FIG. 10 is only a block diagram of a partial structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

In one embodiment, a computer device is further provided, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, implement the steps in the above method embodiments.

In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions which, when executed by a processor, implement the steps in the above method embodiments.

In one embodiment, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps in the above method embodiments.

A person of ordinary skill in the art can understand that all or part of the procedures of the above method embodiments can be implemented by computer-readable instructions instructing related hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the procedures of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as such combinations contain no contradiction, they shall be regarded as falling within the scope of this specification.

The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. A graph data processing method, the method being performed by a computer device and comprising:
    obtaining the degree of each node in a network graph;
    splitting a dense subgraph out of the network graph according to a preset threshold and the degree of each node;
    determining, based on the dense subgraph, stable nodes in the network graph and core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold;
    obtaining a sparse subgraph of the network graph from nodes remaining in the network graph other than the stable nodes and edges between the remaining nodes;
    determining, based on the sparse subgraph and the stable nodes, a core number of each node in the sparse subgraph;
    wherein the determined core numbers are used to generate feature vectors corresponding to the respective nodes.
  2. The method according to claim 1, wherein the obtaining the degree of each node in a network graph comprises:
    obtaining the network graph;
    determining the number of neighbor nodes of each node in the network graph; and
    taking the number of neighbor nodes as the degree of the corresponding node.
  3. The method according to claim 1, further comprising:
    obtaining payment records corresponding to user identifiers;
    obtaining payment interaction data between the user identifiers from the payment records; and
    generating a payment relationship network graph from the payment interaction data;
    wherein a node of the payment relationship network graph represents a user identifier, and an edge between two nodes in the payment relationship network graph indicates that a payment interaction event exists between the two corresponding user identifiers.
  4. The method according to claim 1, wherein the splitting a dense subgraph out of the network graph according to a preset threshold and the degree of each node comprises:
    obtaining the preset threshold; and
    removing from the network graph the nodes whose degree is less than or equal to the preset threshold together with the edges incident to those nodes, and obtaining the dense subgraph from the nodes remaining in the network graph and the edges between the remaining nodes.
  5. The method according to claim 1, wherein the determining, based on the dense subgraph, stable nodes in the network graph and core numbers of the stable nodes comprises:
    obtaining each node's degree in the dense subgraph from the node's number of neighbor nodes in the dense subgraph, and taking the degree in the dense subgraph as the node's initial current core number;
    iteratively performing, for each node in the dense subgraph, the steps of computing the node's core index from the current core numbers of the node's neighbor nodes in the dense subgraph, removing the node from the dense subgraph when the core index is less than or equal to the preset threshold, and updating the node's current core number with the node's core index when the core index is greater than the threshold and less than the node's current core number, and stopping the iteration when no node's current core number in the dense subgraph is updated during the current iteration; and
    taking the nodes in the dense subgraph obtained when the iteration stops as the stable nodes, and taking the stable nodes' current core numbers when the iteration stops as the core numbers corresponding to the stable nodes.
  6. The method according to claim 5, further comprising:
    after the current iteration ends, recording the nodes whose current core numbers were updated during the current iteration;
    the recorded nodes being used to indicate that, at the start of the next iteration, the recorded nodes' neighbor nodes in the dense subgraph are taken as target nodes whose core indices need to be recomputed during the next iteration;
    wherein the computing, for each node in the dense subgraph, the node's core index from the current core numbers of the node's neighbor nodes in the dense subgraph comprises:
    for a target node in the dense subgraph, computing the target node's core index from the current core numbers of the target node's neighbor nodes in the dense subgraph.
  7. The method according to claim 1, wherein the obtaining a sparse subgraph of the network graph from nodes remaining in the network graph other than the stable nodes and edges between the remaining nodes comprises:
    removing the stable nodes from the network graph; and
    obtaining the sparse subgraph from the nodes remaining after removing the stable nodes and the edges between the remaining nodes.
  8. The method according to claim 1, wherein the determining, based on the sparse subgraph and the stable nodes, a core number of each node in the sparse subgraph comprises:
    initializing the current core number of each node in the sparse subgraph from the node's number of neighbor nodes in the original network graph;
    iteratively performing, for each node in the sparse subgraph, the steps of computing the node's core index from the current core numbers of the node's neighbor nodes in the network graph, and updating the node's current core number with the node's core index when the core index is less than the node's current core number, and stopping the iteration when no node's current core number in the sparse subgraph is updated during the current iteration; and
    taking the node's current core number when the iteration stops as the core number corresponding to the node.
  9. The method according to claim 8, further comprising:
    after the current iteration ends, recording the nodes whose current core numbers were updated during the current iteration;
    the recorded nodes being used to indicate that, at the start of the next iteration, the recorded nodes' neighbor nodes in the sparse subgraph are taken as target nodes whose core indices need to be recomputed during the next iteration;
    wherein the computing, for each node in the sparse subgraph, the node's core index from the current core numbers of the node's neighbor nodes in the network graph comprises:
    for a target node in the sparse subgraph, computing the target node's core index from the current core numbers of the target node's neighbor nodes in the network graph.
  10. The method according to claim 5 or 8, wherein the computing the node's core index comprises:
    when, among the node's neighbor nodes, there exist h neighbor nodes whose current core numbers are greater than or equal to h, and there do not exist h+1 neighbor nodes whose current core numbers are greater than or equal to h+1,
    determining the node's core index as h, where h is a positive integer.
  11. The method according to claim 5 or 8, further comprising:
    at the start of the current iteration, initializing a node-update count to zero, the node-update count being used to record the number of nodes whose current core numbers are updated during the current iteration;
    counting the number of nodes whose current core numbers are updated during the current iteration;
    updating the node-update count according to the counted number;
    continuing with the next iteration if the node-update count is nonzero when the current iteration ends; and
    stopping the iteration if the node-update count is zero when the current iteration ends.
  12. The method according to any one of claims 1 to 9, wherein the network graph is a payment relationship network graph, a node in the payment relationship network graph represents a user identifier, and an edge between two nodes in the payment relationship network graph indicates that a payment interaction event exists between the two corresponding user identifiers, the method further comprising:
    generating, from the core number of each node in the payment relationship network graph, a feature vector corresponding to the user identifier represented by the node; and
    predicting, by a pre-trained classification model, a payment type corresponding to the user identifier based on the feature vector.
  13. A graph data processing apparatus, the apparatus comprising:
    a network graph obtaining module, configured to obtain the degree of each node in a network graph;
    a dense subgraph obtaining module, configured to split a dense subgraph out of the network graph according to a preset threshold and the degree of each node;
    a first determining module, configured to determine, based on the dense subgraph, stable nodes in the network graph and core numbers of the stable nodes, the core numbers of the stable nodes being greater than the preset threshold;
    a sparse subgraph obtaining module, configured to obtain a sparse subgraph of the network graph from nodes remaining in the network graph other than the stable nodes and edges between the remaining nodes; and
    a second determining module, configured to determine, based on the sparse subgraph and the stable nodes, a core number of each node in the sparse subgraph;
    wherein the determined core numbers are used to generate feature vectors corresponding to the respective nodes.
  14. The apparatus according to claim 13, wherein the network graph obtaining module is further configured to obtain the network graph; determine the number of neighbor nodes of each node in the network graph; and take the number of neighbor nodes as the degree of the corresponding node.
  15. The apparatus according to claim 13, wherein the network graph obtaining module is further configured to obtain payment records corresponding to user identifiers; obtain payment interaction data between the user identifiers from the payment records; and generate a payment relationship network graph from the payment interaction data, wherein a node of the payment relationship network graph represents a user identifier, and an edge between two nodes in the payment relationship network graph indicates that a payment interaction event exists between the two corresponding user identifiers.
  16. The apparatus according to claim 13, wherein the dense subgraph obtaining module is further configured to obtain the preset threshold; remove from the network graph the nodes whose degree is less than or equal to the preset threshold together with the edges incident to those nodes; and obtain the dense subgraph from the nodes remaining in the network graph and the edges between the remaining nodes.
  17. The apparatus according to claim 13, wherein the first determining module is further configured to obtain each node's degree in the dense subgraph from the node's number of neighbor nodes in the dense subgraph, taking that degree as the node's initial current core number; iteratively perform, for each node in the dense subgraph, the steps of computing the node's core index from the current core numbers of the node's neighbor nodes in the dense subgraph, removing the node from the dense subgraph when the core index is less than or equal to the preset threshold, and updating the node's current core number with the node's core index when the core index is greater than the threshold and less than the node's current core number, until no node's current core number in the dense subgraph is updated during the current iteration; and take the nodes in the dense subgraph obtained when the iteration stops as the stable nodes and the stable nodes' current core numbers at that point as the core numbers corresponding to the stable nodes.
  18. The apparatus according to claim 17, wherein the first determining module is further configured to record, after the current iteration ends, the nodes whose current core numbers were updated during the current iteration, the recorded nodes being used to indicate that, at the start of the next iteration, the recorded nodes' neighbor nodes in the dense subgraph are taken as target nodes whose core indices need to be recomputed during the next iteration; and, for a target node in the dense subgraph, compute the target node's core index from the current core numbers of the target node's neighbor nodes in the dense subgraph.
  19. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the method according to any one of claims 1 to 12.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method according to any one of claims 1 to 12.
PCT/CN2021/123265 2020-12-03 2021-10-12 Graph data processing method and apparatus, computer device and storage medium WO2022116689A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023518909A JP2023545940A (ja) 2020-12-03 2021-10-12 グラフデータ処理方法、装置、コンピュータ機器及びコンピュータプログラム
EP21899723.7A EP4206943A4 (en) 2020-12-03 2021-10-12 METHOD AND APPARATUS FOR PROCESSING GRAPH DATA, COMPUTER DEVICE AND STORAGE MEDIUM
US17/949,030 US11935049B2 (en) 2020-12-03 2022-09-20 Graph data processing method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011394355.5 2020-12-03
CN202011394355.5A CN112214499B (zh) 2020-12-03 2020-12-03 图数据处理方法、装置、计算机设备和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/949,030 Continuation US11935049B2 (en) 2020-12-03 2022-09-20 Graph data processing method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022116689A1 true WO2022116689A1 (zh) 2022-06-09

Family

ID=74068022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123265 WO2022116689A1 (zh) 2020-12-03 2021-10-12 图数据处理方法、装置、计算机设备和存储介质

Country Status (5)

Country Link
US (1) US11935049B2 (zh)
EP (1) EP4206943A4 (zh)
JP (1) JP2023545940A (zh)
CN (1) CN112214499B (zh)
WO (1) WO2022116689A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214499B (zh) 2020-12-03 2021-03-19 腾讯科技(深圳)有限公司 图数据处理方法、装置、计算机设备和存储介质
CN113704309B (zh) * 2021-09-02 2024-01-26 湖南大学 图数据处理方法、装置、计算机设备和存储介质
CN113609345B (zh) * 2021-09-30 2021-12-10 腾讯科技(深圳)有限公司 目标对象关联方法和装置、计算设备以及存储介质
CN117408806A (zh) * 2022-07-07 2024-01-16 汇丰软件开发(广东)有限公司 一种识别加密货币市场中操纵价格行为的方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9195941B2 (en) * 2013-04-23 2015-11-24 International Business Machines Corporation Predictive and descriptive analysis on relations graphs with heterogeneous entities
CN107203619A (zh) * 2017-05-25 2017-09-26 电子科技大学 一种复杂网络下的核心子图提取算法
CN109978705A (zh) * 2019-02-26 2019-07-05 华中科技大学 一种基于极大团枚举的社交网络中社团发现方法
CN111339374A (zh) * 2020-02-25 2020-06-26 华南理工大学 一种基于加权三角密度的稠密子图抽取方法
CN111444395A (zh) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 获取实体间关系表达的方法、系统和设备、广告召回系统
CN111475680A (zh) * 2020-03-27 2020-07-31 深圳壹账通智能科技有限公司 检测异常高密子图的方法、装置、设备及存储介质
CN112214499A (zh) * 2020-12-03 2021-01-12 腾讯科技(深圳)有限公司 图数据处理方法、装置、计算机设备和存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140145018A (ko) * 2013-06-12 2014-12-22 한국전자통신연구원 지식 인덱스 시스템 및 그 방법
CN106844500A (zh) * 2016-12-26 2017-06-13 深圳大学 一种k‑core‑truss社区模型及分解、搜索算法
US11328128B2 (en) * 2017-02-28 2022-05-10 SavantX, Inc. System and method for analysis and navigation of data
CN111444394B (zh) * 2019-01-16 2023-05-23 阿里巴巴集团控股有限公司 获取实体间关系表达的方法、系统和设备、广告召回系统
CN109921939B (zh) * 2019-03-18 2022-04-15 中电科大数据研究院有限公司 一种通信网络中关键节点的选取方法及系统
CN111177479B (zh) * 2019-12-23 2023-08-18 北京百度网讯科技有限公司 获取关系网络图中节点的特征向量的方法以及装置
CN111368147B (zh) * 2020-02-25 2021-07-06 支付宝(杭州)信息技术有限公司 图特征处理的方法及装置


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of EP4206943A4
ZHU, RONG ET AL.: "Mining Top-k Dense Subgraphs from Uncertain Graphs", CHINESE JOURNAL OF COMPUTERS, vol. 39, no. 8, 31 August 2016 (2016-08-31), pages 1570 - 1582, XP055935700 *

Also Published As

Publication number Publication date
US20230013392A1 (en) 2023-01-19
EP4206943A4 (en) 2024-01-31
EP4206943A1 (en) 2023-07-05
US11935049B2 (en) 2024-03-19
JP2023545940A (ja) 2023-11-01
CN112214499A (zh) 2021-01-12
CN112214499B (zh) 2021-03-19


Legal Events

  • 121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21899723; Country of ref document: EP; Kind code of ref document: A1)
  • WWE: WIPO information: entry into national phase (Ref document number: 2023518909; Country of ref document: JP)
  • ENP: Entry into the national phase (Ref document number: 2021899723; Country of ref document: EP; Effective date: 20230327)
  • NENP: Non-entry into the national phase (Ref country code: DE)