CN111523831B

CN111523831B - Risk group identification method and device, storage medium and computer equipment

Info

Publication number: CN111523831B
Application number: CN202010629923.9A
Authority: CN
Inventors: 王宝坤; 张屹綮; 王维强
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2020-11-03
Anticipated expiration: 2040-07-03
Also published as: CN111523831A

Abstract

The embodiments of this specification provide a risk group identification method and device, a storage medium, and computer equipment. The method comprises: constructing a relationship graph from the obtained black seed nodes and the diffusion nodes of the black seed nodes; performing subgraph cutting on the relationship graph to generate a plurality of node subsets; removing useless nodes from each node subset to generate a query subset; and identifying the nodes in each query subset as a first risk group.

Description

Risk group identification method and device, storage medium and computer equipment

Technical Field

The embodiment of the specification relates to the technical field of internet, in particular to a risk group identification method, a risk group identification device, a storage medium and computer equipment.

Background

In risk-control scenarios such as account theft, fraud, cheating, cash-out, and fraudulent use, offenses are often committed by gangs. Because gang members collude with one another, such offenses spread widely and can easily cause large financial losses.

In the related art, the identification process of the risk group is mainly realized through an unsupervised learning algorithm.

Disclosure of Invention

In view of the above, the present specification provides a method, an apparatus, a storage medium, and a computer device for risk group identification, which are used to reduce the amount of computation and improve the accuracy of identification.

In one aspect, an embodiment of the present specification provides a method for identifying a risk group, including:

constructing a relational graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes;

performing subgraph cutting on the relationship graph to generate a plurality of node subsets;

removing useless nodes in the node subset to generate a query subset;

identifying nodes in each of the query subsets as each first risk group.

Optionally, before the constructing the relationship graph according to the obtained black seed node and the diffusion node of the black seed node, the method includes:

acquiring a plurality of black seed nodes and an edge relation table;

and according to the edge relation table, performing N-degree diffusion by taking a plurality of black seed nodes as initial nodes to obtain diffusion nodes of the black seed nodes.

Optionally, the performing subgraph cut on the relationship graph to generate a plurality of node subsets includes:

and performing subgraph cutting on the relational graph through a Louvain algorithm to generate a plurality of node subsets.

Optionally, the removing useless nodes in the node subset to generate a query subset includes:

calculating the node subset through a minimum spanning tree algorithm to obtain a minimum spanning tree;

performing iterative pruning on the minimum spanning tree through a k-core algorithm to remove the useless nodes;

generating the query subset from the remaining nodes.

Optionally, after identifying nodes in each of the query subsets as each first risk group, the method further includes:

taking the nodes in the query subset as initial nodes to recall the nodes to obtain recalled nodes;

generating a subgraph according to the nodes in the query subset and the recall nodes;

scoring the nodes in the subgraph to generate risk scores of all the nodes;

determining a node selection threshold according to the risk score of each node;

identifying, in the subgraph, a number of nodes equal to the node selection threshold as a second risk group.

Optionally, the risk score comprises a first risk score and a second risk score;

the scoring the nodes in the subgraph to generate the risk score of each node comprises the following steps:

splitting the subgraph into a risk set and an intermediary set, wherein the risk set comprises nodes with incoming edges, the intermediary set comprises nodes with outgoing edges, and connection relations are formed between the nodes in the risk set and the nodes in the intermediary set;

calculating a state transition matrix of a node in the risk set and a state transition matrix of a node in the intermediary set according to the connection relation between the risk set and the intermediary set;

calculating a first risk score of the nodes in the risk set from the state transition matrix of the nodes in the risk set by using a personalized PageRank algorithm;

and calculating a second risk score of the nodes in the intermediary set from the state transition matrix of the nodes in the intermediary set by using a personalized PageRank algorithm.

Optionally, the determining a node selection threshold according to the risk score of each node includes:

sequencing the nodes in the subgraph according to the sequence of the first risk scores of the nodes in the risk set from large to small;

generating an alternative set according to the sorted nodes;

according to the number of the nodes in the alternative set, different alternative node thresholds are selected;

selecting nodes corresponding to different alternative node thresholds from the alternative set according to different alternative node thresholds, and generating alternative subsets corresponding to different alternative node thresholds according to the nodes corresponding to different alternative node thresholds;

calculating a connectivity value corresponding to each alternative subset;

and taking the number of nodes in the alternative subset corresponding to the minimum connectivity value as the node selection threshold.

Optionally, the different alternative node thresholds increase gradually at a set numerical interval, and the maximum alternative node threshold is the number of nodes in the alternative set.

In another aspect, an apparatus for identifying a risk group includes:

the construction module is used for constructing a relational graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes;

the subgraph cutting module is used for carrying out subgraph cutting on the relational graph to generate a plurality of node subsets;

the node removing module is used for removing useless nodes in the node subset to generate a query subset;

a first identification module to identify nodes in each of the query subsets as each first risk group.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring a plurality of black seed nodes and the edge relation table;

and the diffusion module is used for performing N-degree diffusion according to the edge relation table, taking the plurality of black seed nodes as initial nodes, to obtain the diffusion nodes of the black seed nodes.

Optionally, the subgraph cutting module is specifically configured to perform subgraph cutting on the relationship graph through a Louvain algorithm to generate a plurality of node subsets.

Optionally, the node removing module is specifically configured to calculate a node subset through a minimum spanning tree algorithm to obtain a minimum spanning tree, perform iterative pruning on the minimum spanning tree through a k-core algorithm to remove the useless nodes, and generate the query subset from the remaining nodes.

Optionally, the apparatus further comprises:

the node recall module is used for recalling nodes by taking the nodes in the query subset as initial nodes to obtain recalled nodes;

the subgraph generation module is used for generating a subgraph according to the nodes in the query subset and the recall nodes;

the scoring module is used for scoring the nodes in the subgraph to generate risk scores of all the nodes;

the threshold selection module is used for determining a node selection threshold according to the risk score of each node;

a second identification module to identify, in the subgraph, a number of nodes equal to the node selection threshold as a second risk group.

Optionally, the risk score comprises a first risk score and a second risk score. The scoring module is specifically configured to: split the subgraph into a risk set and an intermediary set, where the risk set includes nodes with incoming edges, the intermediary set includes nodes with outgoing edges, and the nodes in the risk set are connected to the nodes in the intermediary set; calculate a state transition matrix of the nodes in the risk set and a state transition matrix of the nodes in the intermediary set according to the connection relationship between the risk set and the intermediary set; calculate a first risk score of the nodes in the risk set from the state transition matrix of the nodes in the risk set by using a personalized PageRank algorithm; and calculate a second risk score of the nodes in the intermediary set from the state transition matrix of the nodes in the intermediary set by using a personalized PageRank algorithm.

Optionally, the threshold selection module is specifically configured to: sort the nodes in the subgraph in descending order of the first risk scores of the nodes in the risk set; generate an alternative set from the sorted nodes; select different alternative node thresholds according to the number of nodes in the alternative set; select, for each alternative node threshold, the corresponding nodes from the alternative set and generate the corresponding alternative subset from them; calculate the connectivity value of each alternative subset; and take the number of nodes in the alternative subset corresponding to the minimum connectivity value as the node selection threshold.

In another aspect, the present specification provides a storage medium comprising a stored program, wherein the program is executed to control a device on which the storage medium is located to perform the steps of the above method for identifying a risk group.

In another aspect, the present specification provides a computer device comprising a memory for storing information including program instructions and a processor for controlling the execution of the program instructions, which program instructions, when loaded and executed by the processor, implement the steps of the above-described method of identification of risk groups.

In the scheme of the embodiments of this specification, the relationship graph is constructed from the obtained black seed nodes and their diffusion nodes, so the graph need not be built from the full set of nodes. The identified first risk groups are all related to black seed nodes and a large number of unrelated communities are not generated, which reduces the amount of computation. Useless nodes are removed from the node subsets obtained by cutting to generate query subsets, and the nodes in each query subset are identified as a first risk group, which improves the accuracy of the identified first risk groups.

Drawings

To illustrate the technical solutions of the embodiments of this specification more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of this specification; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a risk group identification method according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a risk group identification method according to an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of performing subgraph cutting on a relationship graph by a Louvain algorithm in an embodiment of the present specification;

Fig. 4 is a schematic diagram of a risk group identification method according to another embodiment of the present disclosure;

Fig. 5 is a flowchart of a risk group identification method according to another embodiment of the present disclosure;

Fig. 6 is a schematic diagram of a node recall in an embodiment of the present description;

Fig. 7 is a schematic diagram illustrating the scoring of nodes by the SALSA algorithm in an embodiment of the present disclosure;

Fig. 8 is another schematic diagram illustrating the scoring of nodes by the SALSA algorithm in an embodiment of the present disclosure;

Fig. 9 is a schematic diagram of determining a node selection threshold in an embodiment of the present description;

Fig. 10 is a schematic structural diagram of a risk group identification apparatus according to an embodiment of the present disclosure;

Fig. 11 is a schematic diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

For better understanding of the technical solutions in the present specification, the following detailed description of the embodiments of the present specification is provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only a few embodiments of the present specification, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present specification.

The terminology used in the embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the specification. As used in the specification examples and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

For example, in a fraud scenario, a person may impersonate a public security, procuratorial, or court official, call a victim, and threaten the victim into transferring money to a designated account. Such fraud gangs usually show a degree of clustering: the impersonators may be linked to one another through association relationships such as funds, address books, social contacts, and media. If known offenders are linked through these association relationships, the group relationships among them can be obtained.

In a cheating scenario, when a merchant runs a campaign or brand promotion through its own channel or a platform channel, it usually markets with coupons or small cash rewards. However, there are so-called "wool-pulling" (bonus-hunting) groups that harvest these coupons or small cash rewards by various means. The members of such groups are likewise linked by association relationships such as funds, address books, social contacts, and media, and linking them through these relationships yields similar risk groups. What these groups have in common is that a number of people band together to carry out illicit activities in bulk. If only people whose risk has already materialized are linked together, potential offenders cannot be found.

If, without any after-the-fact information, a person is found to be related to a number of historical black-market (illicit-industry) participants, that person can be judged highly likely to engage in black-market activity in the future. Restrictions can then be applied to such people in advance, in processes such as user education, rights granting, and fund transactions, so as to avoid risk. The embodiments of this specification therefore mainly aim to mine and identify these people in advance, thereby reducing financial losses and the number of cases.

In the related art, group mining is mainly implemented with unsupervised learning algorithms. Such schemes partition the entire graph into subgraphs and therefore find many irrelevant communities; the irrelevant nodes they introduce interfere with risk group identification, which increases the amount of computation and reduces identification accuracy.

In the embodiment of the specification, the semi-supervised learning algorithm is adopted to realize the identification of the risk group, the relationship graph is constructed from the black seed nodes, and the calculation on the full graph formed by the full amount of nodes is avoided, so that the calculation amount is reduced, and the identification accuracy of the risk group is improved.

Fig. 1 is a schematic diagram of a risk group identification method provided in an embodiment of this specification, and Fig. 2 is a flowchart of the same method. As shown in Fig. 1 and Fig. 2, the method includes:

Step 102, constructing a relationship graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes.

The steps of the embodiments of the present specification may be performed by a computing platform, wherein the computing platform may include a platform having data computing, data processing capabilities.

In this embodiment of the present specification, before step 102, the method further includes:

Step 101a, acquiring a plurality of black seed nodes and an edge relation table.

The service personnel can input a plurality of black seed nodes and edge relation tables to the client, and then the client sends the black seed nodes and the edge relation tables input by the service personnel to the computing platform so that the computing platform can obtain the black seed nodes and the edge relation tables.

Step 101b, performing N-degree diffusion according to the edge relation table, taking the plurality of black seed nodes as initial nodes, to obtain the diffusion nodes of the black seed nodes.

The edge relation table contains the full set of association relationships. Taking all of the black seed nodes as initial nodes, the association relationships are expanded to N degrees to obtain the diffusion nodes; in other words, the diffusion nodes are the nodes reached by N-degree diffusion from the black seed nodes. A graph can then be built (Build Graph) from the black seed nodes and their diffusion nodes to construct the relationship graph. As an alternative, N may be 2: performing 2-degree diffusion from the black seed nodes covers the range needed for risk group identification while keeping the amount of computation low.
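The diffusion step can be sketched as a breadth-first search limited to N hops. The following is a minimal illustration only: the function name `n_degree_diffusion`, the representation of the edge relation table as a list of node pairs, and the choice to treat those pairs as undirected are all assumptions that are not stated in this specification.

```python
from collections import deque

def n_degree_diffusion(edge_table, seeds, n=2):
    """Return the nodes reachable from any seed within n hops (seeds excluded)."""
    # Assumption: each row of the edge relation table is an undirected pair.
    adjacency = {}
    for src, dst in edge_table:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Multi-source BFS: depth[node] is the hop distance to the nearest seed.
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # do not expand beyond n degrees
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return {node for node, d in depth.items() if d > 0}

# Hypothetical edge relation table and black seed nodes.
edge_table = [("s1", "a"), ("a", "b"), ("b", "c"), ("s2", "a"), ("x", "y")]
seeds = ["s1", "s2"]
diffusion_nodes = n_degree_diffusion(edge_table, seeds, n=2)

# The relationship graph keeps only edges whose endpoints are both retained.
graph_nodes = set(seeds) | diffusion_nodes
graph_edges = [(u, v) for u, v in edge_table
               if u in graph_nodes and v in graph_nodes]
```

In this toy table, node `c` lies three hops from the nearest seed and `x`, `y` are unreachable, so none of them enters the relationship graph.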

In the embodiments of this specification, the constructed relationship graph includes black seed nodes and diffusion nodes. A black seed node is a node that has carried out illegal activities in a past period, for example theft, fraud, cheating, cash-out, or fraudulent use. A black seed node may correspond to an illegal account, for example an illegal personal account or an illegal merchant account. The diffusion nodes include the high-risk nodes to be queried.

Step 104, performing subgraph cutting on the relationship graph to generate a plurality of node subsets.

Specifically, this step includes: performing subgraph cutting on the relationship graph through a Louvain algorithm to generate a plurality of node subsets.

The Louvain algorithm is a modularity-based community detection algorithm. It is efficient and effective, can discover hierarchical community structure, and its optimization goal is to maximize the modularity of the whole community network. Its core idea is to repeatedly collapse each community into a single node until the overall modularity no longer increases.

The modularity is calculated as:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$

where $m$ is the total number of edges in the network, $A_{ij}$ is the edge weight of the edge between node i and node j, $k_i$ is the sum of the weights of the edges adjacent to node i, $k_j$ is the sum of the weights of the edges adjacent to node j, $C_i$ is the label of the community to which node i belongs, and $C_j$ is the label of the community to which node j belongs. Here $\delta(C_i, C_j) = 1$ if and only if $C_i = C_j$, and $\delta(C_i, C_j) = 0$ otherwise.

This setting facilitates calculation.
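As a rough illustration of how the modularity could be evaluated for a given assignment of nodes to communities, the sketch below applies the standard Louvain modularity definition directly. The function name `modularity`, the dictionary-based edge representation, and the toy graph are assumptions made for this example; `m` is taken as the total edge weight, which equals the number of edges when every weight is 1.

```python
def modularity(edges, community):
    """edges: {(i, j): weight}, each undirected edge listed once, no self-loops.
    community: {node: community label}."""
    m = sum(edges.values())  # total edge weight
    k = {}                   # k[i]: sum of weights of edges adjacent to node i
    for (i, j), w in edges.items():
        k[i] = k.get(i, 0.0) + w
        k[j] = k.get(j, 0.0) + w

    q = 0.0
    # A_ij term: each intra-community edge counts twice (as A_ij and A_ji).
    for (i, j), w in edges.items():
        if community[i] == community[j]:
            q += 2 * w
    # k_i * k_j / (2m) term over all ordered pairs in the same community.
    for i in k:
        for j in k:
            if community[i] == community[j]:
                q -= k[i] * k[j] / (2 * m)
    return q / (2 * m)

# Toy graph: two triangles joined by a single edge, all weights 1.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
         (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
         (2, 3): 1.0}
split = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
merged = {n: "x" for n in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the toy graph into its two triangles gives a higher modularity than placing all six nodes in one community, which is the kind of comparison the Louvain algorithm makes while iterating.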

In this embodiment of the present specification, when subgraph cutting is performed on the relationship graph by the Louvain algorithm, the weights of the edges between nodes are set according to a weight-setting rule in order to mine the independent node subsets. The weight-setting rule includes: (1) the weight of an edge between two black seed nodes is increased regardless of its original weight; (2) diffusion nodes connected to several black seed nodes are identified and assigned edge weights according to the number of black seed nodes they connect to, so that the more black seed nodes a diffusion node is connected to, the larger the weight of the edges between that diffusion node and other nodes.
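One possible way to express the two rules in code is sketched below. The specification does not give concrete weight values, so the additive constants `seed_edge_bonus` and `per_seed_bonus`, the function name `apply_weight_rules`, and the additive form of the adjustment are purely hypothetical.

```python
def apply_weight_rules(edges, seeds, seed_edge_bonus=10.0, per_seed_bonus=1.0):
    """edges: {(u, v): weight}. Returns a new dict with adjusted weights.
    Both bonus constants are hypothetical; the source gives no values."""
    seeds = set(seeds)
    # For each diffusion node, collect the distinct black seed nodes it touches.
    seed_links = {}
    for u, v in edges:
        for node, other in ((u, v), (v, u)):
            if node not in seeds and other in seeds:
                seed_links.setdefault(node, set()).add(other)

    adjusted = {}
    for (u, v), w in edges.items():
        if u in seeds and v in seeds:
            # Rule (1): always raise seed-to-seed edges.
            adjusted[(u, v)] = w + seed_edge_bonus
        else:
            # Rule (2): edges of diffusion nodes linked to several seeds
            # grow with the number of seeds those nodes are linked to.
            bonus = 0.0
            for node in (u, v):
                n_links = len(seed_links.get(node, ()))
                if n_links > 1:
                    bonus += per_seed_bonus * n_links
            adjusted[(u, v)] = w + bonus
    return adjusted

# Hypothetical graph: s1 and s2 are black seed nodes.
edges = {("s1", "s2"): 1.0, ("s1", "a"): 1.0, ("s2", "a"): 1.0,
         ("s1", "b"): 1.0, ("a", "c"): 1.0}
adjusted = apply_weight_rules(edges, ["s1", "s2"])
```

Here node `a` touches two black seed nodes, so every edge of `a` becomes heavier, while node `b` touches only one and keeps its original weight.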

As an alternative, the step of setting the edge weights may be performed in step 101a, for example: in step 101a, a service person inputs a plurality of black seed nodes and an edge relation table to a client and inputs a set edge weight to the client; and then the client sends the input edge weight to a computing platform so that the computing platform can obtain the edge weight, and the computing platform constructs a relational graph according to the obtained black seed nodes, the diffusion nodes of the black seed nodes and the edge weight.

The following example illustrates subgraph cutting of the relationship graph by the Louvain algorithm. Fig. 3 is a schematic diagram of performing subgraph cutting on a relationship graph by the Louvain algorithm in an embodiment of this specification. As shown in Fig. 3, assume that the relationship graph includes node A, node B, and node C, and calculate the difference between the modularity of the community formed by (A, C) and the modularity of the community formed by (A, B):

where $W_{A,C}$ is the edge weight of the edge between node A and node C, $W_{A,B}$ is the edge weight of the edge between node A and node B, $W_A$ is the sum of the weights of the edges adjacent to node A, and $W_B$ is the sum of the weights of the edges adjacent to node B.

Suppose that

Then when

the modularity of the community (A, C) is larger, so node A and node C are more likely to be grouped together during the iteration. Node A and node C may then form a node subset.
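The comparison between the two candidate communities can be illustrated numerically with the standard Louvain gain for moving an isolated node into an existing community. This is not the expression of the specification, which is not reproduced in the text above; the function name `merge_gain` and the three edge weights are hypothetical values chosen only for illustration.

```python
def merge_gain(k_i_in, sigma_tot, k_i, m):
    """Standard Louvain modularity gain for moving an isolated node i
    into a community.
    k_i_in:    total weight of edges between node i and the community
    sigma_tot: total weight of edges adjacent to nodes of the community
    k_i:       total weight of edges adjacent to node i
    m:         total edge weight of the graph"""
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)

# Hypothetical weights on a triangle A-B-C.
w_ab, w_ac, w_bc = 1.0, 3.0, 1.0
m = w_ab + w_ac + w_bc
k_a = w_ab + w_ac  # sum of weights of edges adjacent to A
k_b = w_ab + w_bc  # sum of weights of edges adjacent to B
k_c = w_ac + w_bc  # sum of weights of edges adjacent to C

gain_ac = merge_gain(w_ac, k_c, k_a, m)  # A joins the community {C}
gain_ab = merge_gain(w_ab, k_b, k_a, m)  # A joins the community {B}
```

With these weights, joining A to C yields the larger gain, which matches the intuition that a heavier edge makes two nodes more likely to end up in the same node subset.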

For example: one node subset generated by calculation of the Louvain algorithm comprises 185 nodes and 1232 edges, wherein the 185 nodes comprise black seed nodes and diffusion nodes.

Step 106, removing useless nodes from the node subsets to generate query subsets.

This step may include: refining the plurality of node subsets with a minimum spanning tree algorithm and a k-core algorithm to remove useless nodes, thereby generating a plurality of query subsets.

Because the node subsets generated in step 104 may contain many useless nodes, in this step the node subsets can be further refined with a minimum spanning tree algorithm (Kruskal) and a k-core algorithm to remove the useless nodes.

Specifically, step 106 may include:

Step 1062, calculating the node subset through a minimum spanning tree algorithm to obtain a minimum spanning tree.

A higher edge weight indicates a closer relationship between two nodes. In this embodiment, a monotonically decreasing exponential function is therefore used in the minimum spanning tree algorithm, so that the tree is built from the edges with the highest weights. The resulting minimum spanning tree may thus in practice be regarded as a "maximum spanning tree".
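The idea of running a minimum spanning tree algorithm on transformed weights can be sketched with Kruskal's algorithm as follows. The specification only says that a monotonically decreasing exponential function is used; the concrete cost `exp(-w)`, the function name `max_weight_spanning_tree`, and the toy edges are assumptions of this example.

```python
import math

def max_weight_spanning_tree(nodes, edges):
    """Kruskal's algorithm on the transformed cost exp(-w).
    edges: list of (u, v, w). Returns the chosen tree edges."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # exp(-w) decreases as w grows, so ascending cost means descending weight.
    for u, v, w in sorted(edges, key=lambda e: math.exp(-e[2])):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

# Hypothetical node subset with weighted edges.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 5.0), ("b", "c", 1.0), ("a", "c", 4.0), ("c", "d", 2.0)]
tree = max_weight_spanning_tree(nodes, edges)
```

Minimizing the transformed cost keeps the heaviest edges that do not close a cycle, so the weak edge between `b` and `c` is the one left out.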

Step 1064, performing iterative pruning on the minimum spanning tree through a k-core algorithm to remove useless nodes.

The minimum spanning tree produced by the minimum spanning tree algorithm contains many diffusion nodes of degree 1, each connected to only one black seed node. Iterative pruning is therefore implemented with the k-core algorithm by repeatedly removing the degree-1 diffusion nodes that are connected to only one black seed node. For example, pruning removes 68 useless diffusion nodes, and the remaining diffusion nodes are highly suspected of being black-related.

Step 1066, generating a query subset from the remaining nodes.

For example, the remaining nodes are those left after the 68 useless diffusion nodes are removed; they include the black seed nodes and the diffusion nodes that survive pruning. In this step, a query subset is generated from the remaining nodes, so the query subset includes the black seed nodes and the diffusion nodes that survive pruning.
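Steps 1064 and 1066 can be sketched as repeated leaf removal on the spanning tree. The sketch below reflects one reading of the pruning rule, namely that any diffusion node whose degree has dropped to 1 or less is removed while black seed nodes are always kept; the function name `prune_leaf_diffusion_nodes` and the toy tree are hypothetical.

```python
def prune_leaf_diffusion_nodes(tree_edges, seeds):
    """Iteratively remove diffusion (non-seed) nodes of degree <= 1.
    Returns (remaining_nodes, removed_nodes)."""
    seeds = set(seeds)
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    removed = set()
    changed = True
    while changed:  # repeat until no diffusion leaf is left
        changed = False
        for node in list(adjacency):
            if node not in seeds and len(adjacency[node]) <= 1:
                # Detach the leaf from its neighbor, then drop it.
                for neighbor in adjacency.pop(node):
                    adjacency[neighbor].discard(node)
                removed.add(node)
                changed = True
    return set(adjacency), removed

# Hypothetical spanning tree: s1 and s2 are black seed nodes.
tree_edges = [("s1", "a"), ("a", "s2"), ("s1", "b"), ("a", "c"), ("c", "d")]
query_subset, useless = prune_leaf_diffusion_nodes(tree_edges, ["s1", "s2"])
```

Node `a` survives because it bridges two black seed nodes, whereas `b`, `d`, and then `c` are peeled away as leaves; the surviving nodes form the query subset.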

Step 108, identify nodes in each query subset as each first risk group.

In this step, all nodes in one query subset may be identified as one first risk group; that is, each query subset is a first risk group. As shown in Fig. 1, n query subsets are generated (query subset 1, query subset 2, ..., query subset n), and n first risk groups are identified.

In the embodiments of this specification, the relationship graph is constructed from the obtained black seed nodes and their diffusion nodes, so the graph need not be built from the full set of nodes. The identified first risk groups are all related to black seed nodes and a large number of unrelated communities are not generated, which reduces the amount of computation. Useless nodes are removed from the node subsets obtained by cutting to generate query subsets, and the nodes in each query subset are identified as a first risk group, which improves the accuracy of the identified first risk groups. The algorithms used by the risk group identification method provided in the embodiments of this specification are stable: every execution produces the same result, so randomly varying results are avoided.

Fig. 4 is a schematic diagram of a risk group identification method according to another embodiment of the present disclosure, and Fig. 5 is a flowchart of the same method. As shown in Fig. 4 and Fig. 5, the method includes:

Step 202, constructing a relationship graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes.

Step 204, performing subgraph cutting on the relationship graph to generate a plurality of node subsets.

Step 206, removing useless nodes from the node subsets to generate query subsets.

Step 208, identifying the nodes in each query subset as a first risk group.

In this embodiment, the description of step 202 to step 208 may refer to the description of step 102 to step 108 in the above embodiment, and will not be repeated herein.

Step 210, recalling nodes by taking the nodes in the query subset as initial nodes to obtain recall nodes.

Fig. 6 is a schematic diagram of node recall in an embodiment of this specification. Each solid dot in Fig. 6 represents a node: the node at the center, surrounded by a plurality of arrows, is the initial node, and the remaining nodes are recall nodes. Specifically, M-degree node recall is performed with the nodes in the query subset as initial nodes to obtain the recall nodes. As an alternative, M may be 2: performing 2-degree node recall from the nodes in the query subset covers the range needed for risk group identification while keeping the amount of computation low.

Step 212, a subgraph is generated according to the nodes in the query subset and the recall nodes.

For example, as shown in Fig. 4, subgraph 1 is generated from the nodes in query subset 1 and their recall nodes and, in the same way, subgraph n is generated from the nodes in query subset n and their recall nodes.

Step 214, scoring the nodes in the subgraph to generate a risk score for each node.

Fig. 7 is a schematic diagram of scoring nodes by the SALSA algorithm in an embodiment of this specification, and Fig. 8 is another such schematic diagram. As shown in Fig. 7 and Fig. 8, in the embodiments of this specification the risk score includes a first risk score and/or a second risk score; that is, a node in the subgraph may have only the first risk score, only the second risk score, or both. This embodiment is described taking a risk score that includes both the first risk score and the second risk score as an example.

Step 214 specifically includes:

Step 2142, splitting the subgraph into a risk set and an intermediary set, where the risk set includes the nodes with incoming edges, the intermediary set includes the nodes with outgoing edges, and the nodes in the risk set are connected to the nodes in the intermediary set.

In this embodiment, as shown in Fig. 8, the subgraph includes node 1, node 2, node 3, node 5, node 6, and node 10. The nodes of the subgraph form a directed graph: the edges between nodes are directed, and the edges of a node may include incoming edges or outgoing edges. The subgraph is split into a risk set (Authority set) and an intermediary set (Hub set) according to the outgoing and incoming edges of the nodes. As shown in Fig. 8, the risk set includes node 1, node 3, node 5, and node 6, which have incoming edges, and the intermediary set includes node 1, node 2, node 3, node 6, and node 10, which have outgoing edges. A node that has both outgoing and incoming edges is assigned to both sets; for example, node 1 belongs to both the risk set and the intermediary set.

In this embodiment of the present specification, if there is an edge between a node in the risk set and a node in the broker set, the node in the risk set and the node in the broker set are connected together, so that a connection relationship is provided between the node in the risk set and the node in the broker set.

In summary, step 2142 transforms the original directed graph into an undirected bipartite graph.
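The split described in step 2142 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `split_bipartite` is hypothetical, and of the edges listed only 1→3, 1→6, 6→3 and 6→5 follow from the worked example in the text; the remaining edges are assumed so that the two sets match those named for fig. 8.

```python
# Hypothetical directed edge list (source, target). Only 1->3, 1->6, 6->3 and
# 6->5 follow from the text; the other edges are assumed for illustration.
edges = [(1, 3), (1, 6), (6, 3), (6, 5), (2, 1), (3, 5), (10, 1)]


def split_bipartite(edges):
    """Split a directed graph into a risk (Authority) set and an intermediary (Hub) set."""
    # Nodes with at least one incoming edge go to the risk set.
    risk_set = sorted({dst for _, dst in edges})
    # Nodes with at least one outgoing edge go to the intermediary set.
    intermediary_set = sorted({src for src, _ in edges})
    # Each directed edge src -> dst links the hub copy of src
    # to the authority copy of dst in the bipartite graph.
    links = sorted(set(edges))
    return risk_set, intermediary_set, links


risk_set, intermediary_set, links = split_bipartite(edges)
print("risk set:", risk_set)
print("intermediary set:", intermediary_set)
```

A node with both incoming and outgoing edges, such as node 1 here, appears in both sets, once as an authority copy and once as a hub copy.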

Step 2144, according to the connection relationship between the nodes in the risk set and the nodes in the intermediary set, calculating the state transition matrix of the nodes in the risk set and the state transition matrix of the nodes in the intermediary set.

In this embodiment, step 2144 may specifically include: calculating the state transition probabilities of the nodes in the risk set and the state transition probabilities of the nodes in the intermediary set according to the connection relationship between the risk set and the intermediary set; and generating the state transition matrix of the nodes in the risk set according to the state transition probabilities of the nodes in the risk set, and the state transition matrix of the nodes in the intermediary set according to the state transition probabilities of the nodes in the intermediary set.

As shown in fig. 8, the calculation process of the state transition matrix is described by taking the nodes in the risk set as an example.

The state transition probability of a node in the risk set is a two-step state transition probability: the node in the risk set first jumps to a node in the intermediary set, and that node in the intermediary set then jumps to a node in the risk set.

For example, assume that the edge weights of the edges between the nodes are all 1, and take node 3 as an example. In the following, the subscript $a$ marks a node in the risk set and the subscript $h$ marks a node in the intermediary set. The first-step state transition probabilities of node 3 are:

$$P(3_a \to 1_h) = \frac{1}{2}, \qquad P(3_a \to 6_h) = \frac{1}{2}$$

That is, the state transition probabilities of node 3 in the risk set jumping to node 1 and to node 6 in the intermediary set are both 1/2.

The second-step state transition probabilities for node 3 are:

$$P(1_h \to 3_a) = P(1_h \to 6_a) = \frac{1}{2}, \qquad P(6_h \to 3_a) = P(6_h \to 5_a) = \frac{1}{2}$$

That is, the state transition probabilities of node 1 in the intermediary set jumping to node 3 and to node 6 in the risk set are both 1/2, and the state transition probabilities of node 6 in the intermediary set jumping to node 3 and to node 5 in the risk set are both 1/2.

The two-step state transition probabilities of node 3 are generated from the first-step and second-step state transition probabilities:

$$P(3_a \to 3_a) = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{2}, \qquad P(3_a \to 6_a) = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}, \qquad P(3_a \to 5_a) = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$$

The two-step state transition probabilities calculated above are the state transition probabilities of node 3. They form the entries of the state transition matrix for node 3; calculating them in the same way for each node in the risk set yields the state transition matrix.
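The two-step calculation for the risk set can be sketched in Python as follows. This is a minimal sketch assuming unit edge weights; the function name `risk_transition` is hypothetical, and apart from 1→3, 1→6, 6→3 and 6→5 the edges are assumed for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical directed edges; only 1->3, 1->6, 6->3, 6->5 follow from the text.
edges = [(1, 3), (1, 6), (6, 3), (6, 5), (2, 1), (3, 5), (10, 1)]


def risk_transition(edges):
    """Two-step transition probabilities between risk-set nodes (unit edge weights)."""
    hubs_of = defaultdict(list)   # risk-set node -> intermediary nodes pointing to it
    auths_of = defaultdict(list)  # intermediary node -> risk-set nodes it points to
    for src, dst in edges:
        hubs_of[dst].append(src)
        auths_of[src].append(dst)

    matrix = {}
    for a, hubs in hubs_of.items():
        row = defaultdict(Fraction)
        for h in hubs:
            step1 = Fraction(1, len(hubs))              # risk node a -> intermediary node h
            for b in auths_of[h]:
                step2 = Fraction(1, len(auths_of[h]))   # intermediary node h -> risk node b
                row[b] += step1 * step2
        matrix[a] = dict(row)
    return matrix


P = risk_transition(edges)
print(P[3])  # two-step transition probabilities of node 3
```

Exact fractions are used so that each row can be checked to sum to 1 without rounding error. The matrix for the intermediary set can be obtained the same way by swapping the roles of the two sets.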

Step 2146, a first risk score of a node in the risk set is calculated according to the state transition matrix of the node in the risk set by a Personalized PageRank (PPR) algorithm.

In the PPR algorithm in the embodiment of the present specification, the starting node is a black seed node, and therefore the first risk score of the node can be calculated by the following formula.

By formula (1):

$$\mathbf{r}^{(t+1)} = d \cdot \mathbf{M}\,\mathbf{r}^{(t)} + (1 - d) \cdot \mathbf{v} \tag{1}$$

iterative computation is performed repeatedly according to the state transition matrix of the nodes in the risk set until a convergence condition is reached, so as to compute the first risk score of the nodes in the risk set. The convergence condition may be that the difference between $\mathbf{r}^{(t+1)}$ and $\mathbf{r}^{(t)}$ is less than a set threshold, or that a set number of iterations is reached. Here, $\mathbf{r}^{(t+1)}$ and $\mathbf{r}^{(t)}$ are risk score vectors; $d$ is a random probability in the range of 0 to 1: when a random walk is performed on the network from a node, the walk continues with probability $d$ and jumps back to a black seed node with probability $1-d$; $\mathbf{M}$ is the state transition matrix; and $\mathbf{v}$ is the black seed node vector.
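The iteration described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the transition probabilities are stored row-wise as `P[i][j]` (probability of moving from node i to node j), the black seed node vector puts equal weight on each seed, and the convergence test is an L1 difference. The function name `personalized_pagerank`, the value `d=0.85` and the example matrix are hypothetical.

```python
def personalized_pagerank(P, seeds, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate r <- d * (transition step) + (1 - d) * v until the change is below tol."""
    nodes = sorted(P)
    # Black seed node vector: equal weight on every seed (an assumption).
    v = {i: (1.0 / len(seeds) if i in seeds else 0.0) for i in nodes}
    r = dict(v)
    for _ in range(max_iter):
        # With probability 1 - d the walk jumps back to a black seed node.
        nxt = {i: (1.0 - d) * v[i] for i in nodes}
        # With probability d the walk continues along the transition matrix.
        for i in nodes:
            for j, p in P[i].items():
                nxt[j] += d * r[i] * p
        diff = sum(abs(nxt[i] - r[i]) for i in nodes)
        r = nxt
        if diff < tol:
            break
    return r


# Hypothetical row-stochastic transition matrix over risk-set nodes 1, 3, 5, 6.
P = {
    1: {1: 1.0},
    3: {3: 0.5, 5: 0.25, 6: 0.25},
    5: {3: 0.25, 5: 0.75},
    6: {3: 0.5, 6: 0.5},
}
scores = personalized_pagerank(P, seeds={3})
print(scores)
```

Because the walk restarts only at black seed nodes, nodes that cannot be reached from a seed, such as node 1 in this example, keep a score of zero.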

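The above formula (1) can also be written as formula (2):

$$a(i) = d \sum_{j \in in(i)} \frac{a(j)}{out(j)} + (1 - d)\, v_i \tag{2}$$

wherein $a(i)$ is the first risk score of node i in the risk set, $in(i)$ is the set of other nodes pointing to node i, $out(j)$ is the out-degree sum of node j, $a(j)$ is the first risk score of node j in the risk set, and $v_i$ is the entry of the black seed node vector for node i. N is the total number of nodes in the risk set and n is the total number of black seed nodes in the risk set, so the black seed node vector has N entries, of which the n entries for black seed nodes are non-zero.

The risk score vectors in formula (1) correspond to $a(i)$ and $a(j)$ in formula (2), the state transition matrix in formula (1) corresponds to the terms $1/out(j)$ in formula (2), and the black seed node vector in formula (1) corresponds to $v_i$ in formula (2).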

Step 2148, a second risk score of a node in the intermediary set is calculated according to the state transition matrix of the nodes in the intermediary set by the PPR algorithm.

For the specific calculation process and the formulas used, refer to the description of step 2146; they are not repeated here.

In the embodiment of the present specification, as an alternative, some nodes in the subgraph have the first risk score and the second risk score at the same time. As shown in fig. 8, for example, the first risk score a(3) of node 3 is 0.265, and the second risk score h(3) of node 3 is 0.154.

In the embodiment of the present specification, if a certain node is closer to a black seed node, the first risk score and the second risk score of the node are larger; and if the edge weight of an edge between a certain node and the black seed node is larger, the first risk score and the second risk score of the node are larger.

Step 216, a node selection threshold is determined according to the risk score of each node.

In the embodiment of the present specification, step 216 may specifically include:

Step 2160, the nodes in the subgraph are sorted in descending order of the first risk scores of the nodes in the risk set.

Step 2162, a candidate set is generated according to the sorted nodes.

In the candidate set, the nodes are sorted in descending order of the first risk score.

Step 2164, different candidate node thresholds are selected according to the number of nodes in the candidate set.

As an alternative, the different candidate node thresholds increase step by step at a set numerical interval, and the largest candidate node threshold may be the number of nodes in the candidate set.

Fig. 9 is a schematic diagram of determining a node selection threshold in an embodiment of the present specification. As shown in fig. 9, for example, the number of nodes in the candidate set is 100, and the different candidate node thresholds are 20, 30, 40, 50, 60, 70, and 80.

Step 2166, according to the different candidate node thresholds, the nodes corresponding to each candidate node threshold are selected from the candidate set, and a candidate subset corresponding to each candidate node threshold is generated from the selected nodes.

The nodes corresponding to a candidate node threshold are the first nodes in the candidate set, in a number equal to that threshold. For example, if the candidate node threshold is 20, the first 20 nodes are selected from the candidate set and the candidate subset corresponding to 20 is generated from them, and so on.
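Steps 2160 to 2166 can be sketched in Python as follows. This is a minimal sketch: the function name `candidate_subsets`, the node labels and the scores are hypothetical, and ties in the first risk score are broken by node label, which the text does not specify.

```python
def candidate_subsets(first_scores, thresholds):
    """Return, for each threshold K, the first K nodes ranked by first risk score."""
    # Sort nodes from the largest first risk score to the smallest;
    # ties are broken by node label (an assumption).
    ranked = sorted(first_scores, key=lambda n: (-first_scores[n], n))
    # A threshold larger than the candidate set is skipped.
    return {k: ranked[:k] for k in thresholds if k <= len(ranked)}


# Hypothetical first risk scores for five nodes.
first_scores = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.10}
subsets = candidate_subsets(first_scores, thresholds=[2, 4, 5])
print(subsets)
```

Each subset is a prefix of the same ranking, so a larger threshold always yields a superset of the subset for a smaller threshold.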

Step 2168, a connectivity value corresponding to each candidate subset is calculated.

The connectivity value of the current candidate subset is calculated by formula from the following quantities: the nodes in the candidate subset, the edges formed by pairs of nodes, the degrees of the nodes in the candidate subset, the total number of edges in the subgraph, and the set of all edges in the subgraph.

In this embodiment of the present specification, since each candidate subset corresponds to one candidate threshold, the connectivity value corresponding to each candidate subset also corresponds to one candidate threshold. As shown in fig. 9, after the connectivity value corresponding to each candidate subset is calculated, a curve of the connectivity value against the candidate threshold K may be generated.
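One way to compute such a value per candidate subset is sketched below in Python. It assumes that the connectivity value is the conductance of the subset (edges leaving the subset divided by the smaller of the degree sums inside and outside it); this is an assumption chosen because it uses the same quantities listed above, not something the text confirms. The function name `conductance` and the example graph are hypothetical.

```python
def conductance(subset, edges):
    """Conductance of a node subset in an undirected graph given as an edge list."""
    s = set(subset)
    m = len(edges)                       # total number of edges in the subgraph
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Edges with exactly one endpoint inside the subset.
    cut = sum(1 for u, v in edges if (u in s) != (v in s))
    # Sum of degrees of the nodes inside the subset.
    vol = sum(degree.get(n, 0) for n in s)
    denom = min(vol, 2 * m - vol)
    return cut / denom if denom > 0 else float("inf")


# Hypothetical subgraph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
ranked = [1, 2, 3, 4, 5, 6]              # nodes already sorted by first risk score
values = {k: conductance(ranked[:k], edges) for k in range(1, len(ranked))}
print(values)
```

In this example the value is smallest for the first three nodes, which is the point where the prefix coincides with one tightly connected triangle.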

Step 2170, the number of nodes in the candidate subset corresponding to the minimum connectivity value is taken as the node selection threshold.

For example, as shown in fig. 9, the minimum connectivity value corresponds to the candidate threshold K₁. Since a candidate threshold is the number of nodes in its candidate subset, and the number of nodes in the candidate subset corresponding to the minimum connectivity value is 60, 60 is taken as the node selection threshold.

As an alternative, step 2170 may instead include: taking the number of nodes in the candidate subset corresponding to the second-smallest connectivity value as the node selection threshold. As shown in fig. 9, the second-smallest connectivity value corresponds to the candidate threshold K₂; since the number of nodes in the candidate subset corresponding to the second-smallest connectivity value is 69, 69 is taken as the node selection threshold.
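The selection in step 2170 can be sketched in Python as follows. This is a minimal sketch: the function name `select_threshold` and the connectivity values are hypothetical and are not read from fig. 9.

```python
def select_threshold(connectivity, rank=0):
    """Pick the candidate threshold with the smallest (rank=0) or
    second-smallest (rank=1) connectivity value."""
    # Order thresholds by connectivity value; ties go to the smaller threshold.
    ordered = sorted(connectivity, key=lambda k: (connectivity[k], k))
    return ordered[rank]


# Hypothetical connectivity values keyed by candidate threshold K.
connectivity = {20: 0.41, 30: 0.35, 40: 0.30, 50: 0.22, 60: 0.12, 70: 0.18, 80: 0.27}
print(select_threshold(connectivity))          # threshold with the minimum value
print(select_threshold(connectivity, rank=1))  # threshold with the second-smallest value
```

Choosing `rank=1` trades some accuracy for coverage when the second-smallest value belongs to a larger threshold, as in this example.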

Step 218, the first nodes in the subgraph, in a number equal to the node selection threshold, are identified as a second risk group according to the node selection threshold.

In this embodiment, a corresponding node selection threshold may be calculated for each subgraph. For example, as shown in fig. 9, the first 60 nodes in the subgraph are identified as a second risk group. As another example, as shown in fig. 9, the first 69 nodes in the subgraph are identified as a second risk group.

In this embodiment of the present specification, the smaller the node selection threshold is, the smaller the number of identified nodes in the second risk group is, the higher the accuracy of the identification result is; the larger the node selection threshold, the greater the number of identified nodes in the second risk group, and the higher the coverage of the identification result.

Step 220, group evaluation is performed on the second risk group to generate group characteristics.

As an alternative, the group characteristics include the number of black seed nodes, the maximum degree, the minimum degree, and the node in the intermediary set with the highest second risk score.

The number of black seed nodes is the total number of black seed nodes in the second risk group, the maximum degree is the largest of the degrees of the nodes in the second risk group, and the minimum degree is the smallest of the degrees of the nodes in the second risk group.
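The group characteristics can be sketched in Python as follows. This is a minimal sketch: the function name `group_features`, the dictionary keys and the example data are hypothetical, and degrees are counted only over edges whose two endpoints are both in the group, which is an assumption. Of the second risk scores, only 0.154 for node 3 comes from the text.

```python
def group_features(group, edges, black_seeds, second_scores):
    """Summarise a second risk group with the group characteristics named in step 220."""
    members = set(group)
    degree = {n: 0 for n in members}
    for u, v in edges:
        # Count only edges inside the group (an assumption).
        if u in members and v in members:
            degree[u] += 1
            degree[v] += 1
    # Nodes of the group that belong to the intermediary set have a second risk score.
    hubs = [n for n in group if n in second_scores]
    return {
        "black_seed_count": len(members & set(black_seeds)),
        "max_degree": max(degree.values()),
        "min_degree": min(degree.values()),
        "top_intermediary": max(hubs, key=lambda n: second_scores[n]) if hubs else None,
    }


# Hypothetical example data.
group = [1, 3, 5, 6]
edges = [(1, 3), (1, 6), (6, 3), (6, 5), (3, 5)]
features = group_features(group, edges, black_seeds={3},
                          second_scores={1: 0.2, 3: 0.154, 6: 0.3})
print(features)
```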

Step 222, individual evaluation is performed on the nodes in the second risk group to generate individual characteristics.

As an alternative, the individual characteristics include the first risk score, the second risk score, the degree centrality, the closeness centrality and the betweenness centrality of the nodes in the second risk group.

In the embodiment of the present specification, the degree centrality is calculated by the formula

$$C_D(i) = \sum_{j=1,\, j \neq i}^{g} x_{ij}$$

wherein $C_D(i)$ is the degree centrality of node i, $x_{ij}$ is 1 if node i and node j are directly connected and 0 otherwise, and the sum is the total number of direct connections between node i and the other g-1 nodes j. Node i and node j are both nodes in the second risk group.
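Counting the direct connections of each node can be sketched in Python as follows. This is a minimal sketch assuming an undirected, unweighted view of the second risk group; the function name `degree_centrality` and the example data are hypothetical.

```python
def degree_centrality(nodes, edges):
    """Number of distinct direct connections of each node within the group."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        # Ignore self-loops and edges that leave the group.
        if u != v and u in neighbours and v in neighbours:
            neighbours[u].add(v)
            neighbours[v].add(u)
    return {n: len(adj) for n, adj in neighbours.items()}


# Hypothetical second risk group and its edges.
nodes = [1, 3, 5, 6]
edges = [(1, 3), (1, 6), (6, 3), (6, 5), (3, 5)]
degrees = degree_centrality(nodes, edges)
print(degrees)
```

Using a set of neighbours means that parallel edges between the same two nodes are counted once.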

In the embodiment of the present specification, the closeness centrality is calculated by the formula

$$C(x) = \frac{1}{\sum_{y} d(x, y)}$$

wherein $C(x)$ is the closeness centrality of node x, $d(x, y)$ is the distance from node x to another node y, and $\sum_{y} d(x, y)$ is the sum of the distances from node x to all other nodes y. Node x and node y are both nodes in the second risk group.
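The closeness calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the group is treated as an undirected, unweighted graph, distances are hop counts found by breadth-first search, closeness is the reciprocal of the distance sum, and a node that cannot reach every other node gets 0. The function name `closeness_centrality` and the example graph are hypothetical.

```python
from collections import deque


def closeness_centrality(nodes, edges):
    """Reciprocal of the sum of shortest-path distances to all other nodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u != v and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    result = {}
    for x in nodes:
        # Breadth-first search gives hop distances from x.
        dist = {x: 0}
        queue = deque([x])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        reachable_all = len(dist) == len(adj)
        result[x] = 1.0 / total if total > 0 and reachable_all else 0.0
    return result


# Hypothetical path graph 1 - 2 - 3.
closeness = closeness_centrality([1, 2, 3], [(1, 2), (2, 3)])
print(closeness)
```

The middle node of the path has the smallest distance sum and therefore the largest closeness.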

In the embodiment of the present specification, the betweenness centrality is calculated by the formula

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

wherein $C_B(v)$ is the betweenness centrality of node v, $\sigma_{st}(v)$ is the number of shortest paths from node s to node t that pass through node v, and $\sigma_{st}$ is the number of shortest paths from node s to node t. Node s, node v, and node t are all nodes in the second risk group V.
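The betweenness calculation can be sketched in Python as follows. This is a minimal sketch that uses Brandes' accumulation scheme, a technique chosen here for illustration rather than one named in the text. It assumes an undirected, unweighted graph and counts each unordered pair (s, t) once; the function name `betweenness_centrality` and the example graphs are hypothetical.

```python
from collections import deque


def betweenness_centrality(nodes, edges):
    """Sum over node pairs of the fraction of shortest paths passing through each node."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u != v and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        order = []                       # nodes in order of non-decreasing distance from s
        preds = {n: [] for n in nodes}   # predecessors on shortest paths from s
        sigma = {n: 0 for n in nodes}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate path fractions from the farthest nodes back towards s.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {n: c / 2.0 for n, c in score.items()}


path = betweenness_centrality([1, 2, 3], [(1, 2), (2, 3)])
square = betweenness_centrality([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
print(path, square)
```

In the four-node cycle, each pair of opposite nodes has two shortest paths, so every node carries half of one pair.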

In the embodiment of the present specification, a relationship graph is constructed according to the obtained black seed nodes and the diffusion nodes of the black seed nodes, so the graph does not need to be built from the full set of nodes; the identified first risk groups are related to the black seed nodes and a large number of unrelated communities is not generated, which reduces the amount of calculation. Useless nodes are removed from the node subsets obtained by cutting to generate query subsets, and the nodes in each query subset are identified as a respective first risk group, which improves the accuracy of the identified first risk groups.

The algorithm adopted by the identification method of the risk group provided by the embodiment of the specification is a stable algorithm, and the result of each execution of the algorithm is the same, so that the situation of randomly generating the result is avoided.

In the embodiment of the specification, a first risk score and a second risk score of a node are calculated through an SALSA algorithm, wherein the first risk score is used for describing the risk degree of the node, the second risk score is used for describing the intermediary attribute of the node, and a second risk group is identified according to the first risk score and the second risk score, so that more nodes with risks are inquired, and the coverage rate of group identification is improved.

In the embodiment of the specification, the second risk group is subjected to group evaluation to generate group characteristics, and nodes in the second risk group are subjected to individual evaluation to generate individual characteristics, so that the identification method of the risk group has strong interpretability and can be directly used in business.

The risk group identification scheme provided by the embodiment of the present specification has high execution efficiency, stable output and strong interpretability.

Fig. 10 is a schematic structural diagram of an apparatus for identifying a risk group according to an embodiment of the present disclosure. As shown in fig. 10, the apparatus includes: a construction module 11, a subgraph cutting module 12, a node removing module 13 and a first identification module 14.

The construction module 11 is configured to construct a relationship graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes;

the subgraph cutting module 12 is configured to perform subgraph cutting on the relationship graph to generate a plurality of node subsets;

the node removing module 13 is configured to perform a removing operation on the useless nodes in the node subset to generate a query subset;

the first identification module 14 is configured to identify the nodes in each query subset as a respective first risk group.

In an embodiment of this specification, the apparatus further includes: an acquisition module 15 and a diffusion module 16.

The obtaining module 15 is configured to obtain a plurality of black seed nodes and an edge relation table. The diffusion module 16 is configured to perform N-degree diffusion according to the edge relation table, using the plurality of black seed nodes as initial nodes, to obtain the diffusion nodes of the black seed nodes.
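N-degree diffusion from the black seed nodes can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the edge relation table is a list of node pairs, edges are followed in both directions, and "N-degree" is read as "within N hops of any black seed node". The function name `diffuse` and the example data are hypothetical.

```python
from collections import deque


def diffuse(black_seeds, edge_table, max_degree):
    """Return the nodes reachable within max_degree hops of any black seed node."""
    adj = {}
    for u, v in edge_table:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    hops = {s: 0 for s in black_seeds}
    queue = deque(black_seeds)
    while queue:
        u = queue.popleft()
        if hops[u] == max_degree:
            continue                     # do not expand beyond N hops
        for w in adj.get(u, ()):
            if w not in hops:
                hops[w] = hops[u] + 1
                queue.append(w)
    # Diffusion nodes are the reached nodes other than the seeds themselves.
    return {n for n, h in hops.items() if h > 0}


# Hypothetical edge relation table.
edge_table = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
diffusion_nodes = diffuse(["a"], edge_table, max_degree=2)
print(sorted(diffusion_nodes))
```

Nodes that are not connected to any black seed node, such as "x" and "y" here, never enter the relationship graph, which is what keeps the graph smaller than one built from the full set of nodes.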

In this embodiment of the present specification, the subgraph cutting module 12 is specifically configured to perform subgraph cutting on the relationship graph through a Louvain algorithm, so as to generate a plurality of node subsets.

In this embodiment of the present specification, the node removing module 13 is specifically configured to compute a minimum spanning tree from a node subset through a minimum spanning tree algorithm, perform iterative pruning on the minimum spanning tree through a k-core algorithm to remove the useless nodes, and generate the query subset according to the remaining nodes.

In an embodiment of this specification, the apparatus further includes: node recall module 17, subgraph generation module 18, scoring module 19, threshold selection module 20, and second identification module 21.

The node recall module 17 is configured to perform node recall by using the nodes in the query subset as starting nodes to obtain recalled nodes. The subgraph generation module 18 is configured to generate a subgraph from the nodes in the query subset and the recalled nodes. The scoring module 19 is configured to score the nodes in the subgraph to generate a risk score for each node. The threshold selection module 20 is configured to determine a node selection threshold according to the risk score of each node. The second identification module 21 is configured to identify the first nodes in the subgraph, in a number equal to the node selection threshold, as a second risk group according to the node selection threshold.

In embodiments of the present description, the risk score includes a first risk score and a second risk score. The scoring module 19 is specifically configured to: split the subgraph into a risk set and an intermediary set, where the risk set includes nodes with incoming edges, the intermediary set includes nodes with outgoing edges, and a connection relationship is provided between the nodes in the risk set and the nodes in the intermediary set; calculate a state transition matrix of the nodes in the risk set and a state transition matrix of the nodes in the intermediary set according to the connection relationship between the risk set and the intermediary set; calculate a first risk score of the nodes in the risk set according to the state transition matrix of the nodes in the risk set by the Personalized PageRank (PPR) algorithm; and calculate a second risk score of the nodes in the intermediary set according to the state transition matrix of the nodes in the intermediary set by the PPR algorithm.

In this embodiment of the present specification, the threshold selection module 20 is specifically configured to: sort the nodes in the subgraph in descending order of the first risk scores of the nodes in the risk set; generate a candidate set according to the sorted nodes; select different candidate node thresholds according to the number of nodes in the candidate set; select, from the candidate set, the nodes corresponding to each candidate node threshold and generate a candidate subset corresponding to each candidate node threshold from the selected nodes; calculate a connectivity value corresponding to each candidate subset; and take the number of nodes in the candidate subset corresponding to the minimum connectivity value as the node selection threshold.

The present specification provides a storage medium, where the storage medium includes a stored program, where, when the program runs, a device on which the storage medium is located is controlled to execute the steps of the above embodiment of the risk group identification method, and the specific description may refer to the above embodiment of the risk group identification method.

Embodiments of the present description provide a computer device, including a memory for storing information including program instructions and a processor for controlling execution of the program instructions, where the program instructions are loaded and executed by the processor to implement the steps of the embodiments of the method for identifying a risk group, and the specific description may refer to the embodiments of the method for identifying a risk group.

Fig. 11 is a schematic diagram of a computer device provided in an embodiment of the present specification. As shown in fig. 11, the computer device 3 of the embodiment includes: a processor 31, a memory 32, and a computer program 33 stored in the memory 32 and capable of running on the processor 31. When executed by the processor 31, the computer program 33 implements the risk group identification method in the embodiments, which is not described again here to avoid repetition. Alternatively, when executed by the processor 31, the computer program implements the functions of each module/unit in the risk group identification apparatus in the embodiments, which are likewise not described again here.

The computer device 3 may be a desktop computer, a notebook, a palm computer, a cloud computer device, or another computing device. The computer device 3 may include, but is not limited to, a processor 31 and a memory 32. Those skilled in the art will appreciate that fig. 11 is merely an example of the computer device 3 and does not constitute a limitation of the computer device 3, which may include more or fewer components than those shown, combine certain components, or use different components; for example, the computer device 3 may also include input/output devices, network access devices, buses, etc.

The Processor 31 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 32 may be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. The memory 32 may also be an external storage device of the computer device 3, such as a plug-in hard disk provided on the computer device 3, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like. Further, the memory 32 may also include both an internal storage unit of the computer device 3 and an external storage device. The memory 32 is used for storing computer programs and other programs and data required by the computer device. The memory 32 may also be used to temporarily store data that has been output or is to be output.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present specification, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present description may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A method of identifying a risk group, comprising:

removing useless nodes in the node subset to generate a query subset;

identifying nodes in each of the query subsets as respective first risk groups;

scoring the nodes in the subgraph to generate risk scores of all the nodes;

identifying the first nodes in the subgraph, in a number equal to the node selection threshold, as a second risk group according to the node selection threshold;

the risk score comprises a first risk score; determining a node selection threshold according to the risk score of each node includes:

according to the sequence of the first risk scores of the nodes in the risk set split from the subgraph from large to small, the nodes in the subgraph are sorted;

generating an alternative set according to the sorted nodes;

calculating a connectivity value corresponding to each alternative subset;

2. The method of claim 1, wherein before the constructing the relationship graph according to the obtained black seed nodes and the diffusion nodes of the black seed nodes, the method comprises:

acquiring a plurality of black seed nodes and an edge relation table;

3. The method of claim 1, wherein the performing a subgraph cut on the relationship graph to generate a plurality of node subsets comprises:

4. The method of claim 1, wherein said removing of unwanted nodes from said subset of nodes to generate a subset of queries comprises:

generating the query subset from the remaining nodes.

5. The method of claim 1, wherein the risk score comprises a second risk score;

6. The method of claim 1, wherein the different candidate node thresholds are incrementally increased at set numerical intervals, the largest candidate node threshold comprising the number of nodes in the candidate set.

7. An apparatus for risk group identification, comprising:

a first identification module for identifying nodes in each of the subsets of queries as each first risk group;

a second identification module configured to identify the first nodes in the subgraph, in a number equal to the node selection threshold, as a second risk group according to the node selection threshold;

the risk score comprises a first risk score; the threshold selection module is specifically configured to sort the nodes in the subgraph according to a descending order of the first risk scores of the nodes in the risk set split from the subgraph; generating an alternative set according to the sorted nodes; according to the number of the nodes in the alternative set, different alternative node thresholds are selected; selecting nodes corresponding to different alternative node thresholds from the alternative set according to different alternative node thresholds, and generating alternative subsets corresponding to different alternative node thresholds according to the nodes corresponding to different alternative node thresholds; calculating a connectivity value corresponding to each alternative subset; and taking the number of the nodes in the front candidate subset corresponding to the minimum connectivity value as the node selection threshold value.

8. The apparatus of claim 7, wherein the apparatus further comprises:

9. The apparatus of claim 7, wherein the sub-graph cut module is specifically configured to perform sub-graph cutting on the relationship graph by a Louvain algorithm to generate a plurality of node subsets.

10. The apparatus according to claim 7, wherein the node removing module is specifically configured to compute a minimum spanning tree from a node subset through a minimum spanning tree algorithm, iteratively prune the minimum spanning tree through a k-core algorithm to remove the useless nodes, and generate the query subset according to the remaining nodes.

11. The apparatus of claim 7, wherein the risk score comprises a second risk score; the scoring module is specifically configured to split the subgraph into a risk set and an intermediary set, where the risk set includes nodes with incoming edges, the intermediary set includes nodes with outgoing edges, and a connection relationship is provided between the nodes in the risk set and the nodes in the intermediary set; calculate a state transition matrix of the nodes in the risk set and a state transition matrix of the nodes in the intermediary set according to the connection relationship between the risk set and the intermediary set; calculate a first risk score of the nodes in the risk set according to the state transition matrix of the nodes in the risk set by a Personalized PageRank algorithm; and calculate a second risk score of the nodes in the intermediary set according to the state transition matrix of the nodes in the intermediary set by the Personalized PageRank algorithm.

12. A storage medium, comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the steps of the method for risk group identification according to any one of claims 1 to 6.

13. A computer device comprising a memory for storing information including program instructions and a processor for controlling the execution of the program instructions, characterized in that the program instructions are loaded and executed by the processor to implement the steps of the method of identification of risk groups according to any one of claims 1 to 6.