CN111831923A

CN111831923A - Method, device and storage medium for identifying associated specific account

Info

Publication number: CN111831923A
Application number: CN202010674195.3A
Authority: CN
Inventors: 王文刚; 郭豪; 康晓中; 蔡准; 孙悦; 郭晓鹏
Original assignee: Beijing Trusfort Technology Co ltd
Current assignee: Beijing Trusfort Technology Co ltd
Priority date: 2020-07-14
Filing date: 2020-07-14
Publication date: 2020-10-27

Abstract

The invention discloses a method, a device and a storage medium for identifying associated specific accounts. The method comprises the following steps: first, constructing an isomorphic directed graph from the associated data occurring among accounts within a period of time; then, finding seed nodes that may be specific accounts in the isomorphic directed graph according to certain patterns characteristic of specific accounts; then, finding the connected subgraphs containing the seed nodes; then, dividing each connected subgraph into a plurality of communities according to a community discovery algorithm, and evaluating each community to obtain a corresponding score; finally, determining the communities with higher risk as target communities, the accounts represented by the nodes in a target community being the identified associated specific accounts. The method can therefore identify associated specific accounts automatically from large-scale associated data through unsupervised learning, using data mining methods such as graph theory, graph computation, community discovery and social network analysis.

Description

Method, device and storage medium for identifying associated specific account

Technical Field

The present invention relates to the field of data mining technologies, and in particular, to a method, an apparatus, and a storage medium for identifying an associated specific account.

Background

In recent years, with the rapid development of the internet and the popularization of intelligent terminals, more and more daily activities and transactions can be completed online through self-service, and a large amount of related data is generated in the process. If this data is fully utilized through association analysis and data mining, more instructive information can be obtained to provide data support for high-level management and decision making.

Especially in specific industries, service data can be used for some special service discovery, for example, according to a specific pattern of some special services, a series of associated specific accounts related to the service are identified, so as to provide assistance for developing the specific service.

Currently, these associated specific accounts are identified primarily through expert rules and machine learning models. However, identification by expert rules is often limited by the expressive form of the rules and by limited expert experience, and struggles to cover complex and variable data patterns, which results in a relatively high false negative rate. Machine learning techniques generally use supervised learning algorithms when identifying associated specific accounts, but supervised learning is limited by the number of labeled samples and by the ratio of positive to negative samples; it can effectively identify only known patterns and cannot detect novel or slightly changed patterns, so its real effectiveness is rarely realized.

Therefore, how to overcome the problems of the above solutions and use a more refined and intelligent data mining method to automatically identify associated specific accounts from large-scale data with higher accuracy has become an urgent technical problem in the field of data mining.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a method, an apparatus, and a storage medium for identifying an associated specific account based on unsupervised complex network self-feedback.

According to a first aspect of embodiments of the present invention, a method of identifying an associated specific account includes: acquiring associated data occurring among accounts within a period of time; constructing an isomorphic directed graph according to the associated data, wherein nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent associated data occurring among the accounts, and the associated data comprise association time, number of associations and association content; determining a seed node representing a suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account; determining the connected subgraph where each seed node is located according to the seed node and a connected subgraph algorithm; dividing each connected subgraph into at least two communities according to a community discovery algorithm, wherein a community is a node subset in the corresponding connected subgraph; and evaluating each community of the at least two communities to obtain a risk score corresponding to each community, and determining a target community from the at least two communities according to the risk score, wherein an account represented by a node in the target community is an associated specific account.

According to an embodiment of the present invention, after determining the target community from the at least two communities, the method further includes: determining a hub node representing a core account from a target community according to a social network analysis algorithm or a shortest path algorithm; and determining a key path representing the core incidence relation in the target community according to a minimum spanning tree algorithm.

According to an embodiment of the present invention, constructing an isomorphic directed graph according to associated data includes: constructing a first isomorphic directed graph containing all accounts and all associated data according to all associated data; deleting the isolated edges from the first isomorphic directed graph to obtain a second isomorphic directed graph, wherein an isolated edge is an edge that has no intersection point with any other edge in the first isomorphic directed graph; and deleting edges whose association degree is smaller than a first threshold from the second isomorphic directed graph to obtain the isomorphic directed graph to be constructed.

According to an embodiment of the present invention, determining a seed node representing a suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account includes detecting whether each node in the isomorphic directed graph satisfies any one of the following conditions, and if so, determining the corresponding node as a suspected seed node: the number of associations represented by an edge connected to the node is greater than a second threshold; or the out-degree is greater than a first out-degree threshold and the in-degree is less than a first in-degree threshold, the first out-degree threshold being greater than the first in-degree threshold; or the in-degree is greater than a second in-degree threshold and the out-degree is less than a second out-degree threshold, the second in-degree threshold being greater than the second out-degree threshold; or the node lies in a ring-shaped strongly connected subgraph; and detecting whether the account represented by each suspected seed node meets a first condition, and if so, determining the corresponding suspected seed node as a seed node.

According to an embodiment of the present invention, evaluating each of at least two communities to obtain a score corresponding to each community includes performing the following operations for each community: evaluating judgment factors for judging whether the account is a related specific account to obtain the score of each judgment factor, wherein the judgment factors comprise the number of correlation times, the number of specific data structures and the distribution of correlation time points; and obtaining the corresponding score of the corresponding community according to the score of each judgment factor.

According to an embodiment of the present invention, determining a hub node representing a core account from a target community according to a social network analysis algorithm or a shortest path algorithm includes: determining nodes with a higher association frequency as hub nodes according to a degree centrality algorithm; and/or determining closely connected nodes as hub nodes according to a closeness centrality algorithm; and/or determining nodes at the center of associations as hub nodes according to eigenvector centrality; and/or determining nodes on the key path as hub nodes according to a shortest path algorithm.

According to an embodiment of the present invention, determining a key path representing a core association relationship in a target community according to a minimum spanning tree algorithm includes: determining at least two connected subgraphs containing the target community; determining bridge nodes of the at least two connected subgraphs according to a betweenness centrality algorithm; and determining a key path that connects the at least two connected subgraphs through the bridge nodes according to a minimum spanning tree algorithm.

According to a second aspect of the embodiments of the present invention, an apparatus for identifying an associated specific account includes: an associated data acquisition module, configured to acquire associated data occurring among accounts within a period of time; an isomorphic directed graph construction module, configured to construct an isomorphic directed graph according to the associated data, wherein nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent associated data occurring among the accounts, and the associated data comprise association time, number of associations and association content; a seed node determining module, configured to determine a seed node representing a suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account; a connected subgraph determining module, configured to determine the connected subgraph where each seed node is located according to the seed node and a connected subgraph algorithm; a community discovery module, configured to divide each connected subgraph into at least two communities according to a community discovery algorithm, wherein a community is a node subset in the corresponding connected subgraph; and a target community determining module, configured to evaluate each community of the at least two communities to obtain a risk score corresponding to each community, and to determine a target community from the at least two communities according to the risk score, wherein an account represented by a node in the target community is an associated specific account.

According to an embodiment of the present invention, the apparatus further includes: the hub node determining module is used for determining a hub node representing a core account from the target community according to a social network analysis algorithm or a shortest path algorithm; and the critical path determining module is used for determining a critical path representing the core incidence relation in the target community according to a minimum spanning tree algorithm.

According to a third aspect of embodiments of the present invention, a computer storage medium comprises a set of computer-executable instructions which, when executed, perform any one of the above-mentioned methods of identifying an associated specific account.

The embodiment of the invention provides a method, a device and a storage medium for identifying associated specific accounts, wherein the method comprises the following steps: first, constructing an isomorphic directed graph from the associated data occurring among accounts within a period of time; then, finding seed nodes that may be specific accounts in the isomorphic directed graph according to certain patterns characteristic of specific accounts; then, finding the connected subgraphs containing the seed nodes; then, dividing each connected subgraph into a plurality of communities according to a community discovery algorithm, and evaluating each community to obtain a corresponding score; finally, determining the communities with higher risk as target communities, the accounts represented by the nodes in a target community being the identified associated specific accounts. The method can therefore identify associated specific accounts automatically from large-scale associated data through unsupervised learning, using data mining methods such as graph theory, graph computation, community discovery and social network analysis.

It is to be understood that the teachings of the present invention need not achieve all of the above-described benefits, but rather that specific embodiments may achieve specific technical results, and that other embodiments of the present invention may achieve benefits not mentioned above.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

FIG. 1 is a flow chart illustrating an implementation of a method for identifying an associated specific account according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an abnormal transaction graph structure identifying seed nodes according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a community obtained by applying a community discovery algorithm to perform community discovery according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of different risk levels corresponding to different transaction time intervals according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a target community obtained after risk assessment according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating analysis of social network analysis metrics, according to an embodiment of the invention;

FIG. 7 is a schematic structural diagram of an apparatus for identifying an associated specific account according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

Since identifying the money laundering group behind the scenes from transfer transaction records and certain patterns unique to money laundering activities is a typical application of embodiments of the present invention, this application scenario is used as the example in the following description.

It should be noted that the embodiment of the present invention is a general method for identifying associated specific accounts. Identifying the money laundering group behind the scenes from transfer transaction records and certain patterns unique to money laundering activities is only one application scenario of the embodiment, not the only one, and the embodiment can be applied to any other suitable application scenario. For example, the embodiment of the invention can also identify possible marketing groups according to users' purchase records and purchase sources. In that scenario, an account corresponds to a user who purchases a certain type of goods; the associated data is the record of purchases of that type of goods, including the number of purchases, the purchase amount, and the user from whom the goods were purchased; the specific accounts correspond to the marketing group concealed behind the purchases; and the specific numerical features and graph structure features correspond to the purchase patterns specific to that group.

According to a first aspect of the embodiments of the present invention, a method for identifying an associated specific account is provided. As shown in FIG. 1, the method includes: operation 110, acquiring associated data occurring between accounts within a period of time; operation 120, constructing an isomorphic directed graph according to the associated data, wherein nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent associated data occurring between the accounts, and the associated data include association time, number of associations and association content; operation 130, determining a seed node representing a suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account; operation 140, determining the connected subgraph where each seed node is located according to the seed node and a connected subgraph algorithm; operation 150, dividing each connected subgraph into at least two communities according to a community discovery algorithm, wherein a community is a subset of nodes within the corresponding connected subgraph; and operation 160, evaluating each of the at least two communities to obtain a risk score corresponding to each community, and determining a target community from the at least two communities according to the risk score, wherein an account represented by a node in the target community is an associated specific account.

Here, the associated data is generally data generated by an action in which an association occurs between accounts, such as a transaction between accounts. A specific account generally refers to an account whose behavior in generating associated data differs from normal behavior in content, purpose, frequency, pattern, and the like. These specific accounts are often used to conduct special transaction activities such as transfers and transactions of goods, and possibly even illegal transactions, for example money laundering. Behind these anomalous transactions, criminal groups are often hidden. Identifying the associated specific accounts helps to find these criminal groups and provides strong evidence for the investigation and trial of the case.

In operation 110, the period of time is generally determined in advance, for example from a time clue provided by a case handler or another party involved in the event. For the illegal activity of money laundering, because the transfer times are concentrated, the associated data within N days from the beginning of the case can be selected, where the recommended range of N is 3 to 14. The associated data occurring between the accounts may be transfer data, which can often be extracted from the daily transaction system of a financial institution or a transaction institution, for example a bank involved in the event. The associated data include at least association time, number of associations and association content; for example, in the application of identifying the bank accounts of a money laundering group from transfer records, the associated data mainly include the transfer-out account, the transfer-in account, the transfer amount, the transaction time and the like.

In operation 120, a directed graph D generally refers to an ordered triple (V(D), A(D), ψ(D)), where V(D) is the set of nodes, A(D) is the set of directed edges (arcs), and ψ(D) is an incidence function that maps each element of A(D), i.e., each directed edge or arc, to an ordered pair of elements of V(D), i.e., the pair of nodes at the two ends of that directed edge or arc. An isomorphic directed graph is a directed graph in which all nodes represent the same kind of data. In the isomorphic directed graph constructed in the embodiment of the invention, each node represents an account and each association relationship corresponds to one edge. When multiple association behaviors occur between two accounts, the edge between the two corresponding nodes fuses the associated data of those behaviors, and the fused associated data, namely the attributes of the edge, mainly comprise association time, number of associations and association content. Taking the application of identifying the bank accounts of a money laundering group from transfer records as an example, an edge represents the transfer activity between two accounts, and the edge attributes include the total number of transfers, the total amount transferred and the average transfer time.
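The edge fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout (source account, destination account, amount, time in epoch seconds), the function name build_graph and the attribute names are all hypothetical.

```python
from collections import defaultdict

def build_graph(records):
    """Fuse transfer records into one directed edge per (src, dst) pair.

    records: iterable of (src_account, dst_account, amount, epoch_seconds).
    Returns {(src, dst): {"count", "total_amount", "mean_time"}}.
    """
    acc = defaultdict(lambda: {"count": 0, "total_amount": 0.0, "time_sum": 0.0})
    for src, dst, amount, ts in records:
        edge = acc[(src, dst)]          # direction matters: A->B differs from B->A
        edge["count"] += 1
        edge["total_amount"] += amount
        edge["time_sum"] += ts
    return {
        key: {
            "count": e["count"],
            "total_amount": e["total_amount"],
            "mean_time": e["time_sum"] / e["count"],
        }
        for key, e in acc.items()
    }

# Hypothetical sample data: two transfers A->B are fused into a single edge.
records = [
    ("A", "B", 100.0, 1000.0),
    ("A", "B", 300.0, 2000.0),
    ("B", "C", 50.0, 5000.0),
]
graph = build_graph(records)
```

Keeping one fused edge per ordered account pair keeps the graph small while preserving the three attributes that the later screening and scoring steps rely on.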

The isomorphic directed graph is an important basis for specific account identification, and subsequent various data mining or operation are carried out aiming at the data structure.

In operation 130, the amount of associated data in the business system is typically large, and it is difficult to manually label the associated data. Therefore, only suspicious accounts can be screened according to some service features that a specific account has. How to map the service features of a specific account to the numerical features or the graphic structure features in the isomorphic directed graph is particularly important for accurately identifying the specific account.

Taking bank accounts used for money laundering as an example, the following patterned anomalous transaction structures often appear in the above isomorphic directed graph: 1) a frequent remittance-in/remittance-out transaction structure; 2) a chain transaction structure; 3) a centralized transfer-in/decentralized transfer-out transaction structure; 4) a decentralized transfer-in/centralized transfer-out transaction structure; 5) a circular transaction structure; 6) other complex anomalous transaction structures.

FIG. 2 illustrates several graph structures of abnormal transactions, where (a) is a chain transaction graph structure; (b) is a nested ring transaction graph structure; (c) is a transaction graph structure with centralized transfer-in and decentralized transfer-out; and (d) is a transaction graph structure with centralized transfer-out and decentralized transfer-in.

All points on transaction graph structures (a) and (b) can be determined to be seed nodes representing specific accounts, while the center points of transaction graph structures (c) and (d) can be determined to be seed nodes representing specific accounts.

It should be noted that the accounts represented by the seed nodes are often accounts with obvious abnormal transaction structure characteristics, and more specific accounts associated with the seed nodes can be found through the seed nodes. It can be seen that the seed node determination provides an important clue for subsequent identification of the associated specific account, and is also an important basis for subsequent data mining. Thus, operation 130 is a key step in the identification of a particular account.
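As a rough illustration of how such structural features might be screened, the sketch below flags nodes with a large out-degree and small in-degree, nodes with a large in-degree and small out-degree, and nodes lying on a directed cycle. It covers only part of the structures listed above (chain structures and edge-count thresholds are omitted), and the function names and threshold values are assumptions chosen for the example.

```python
from collections import defaultdict

def degree_tables(edges):
    """Build successor and predecessor sets from (src, dst) pairs."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    return out_nbrs, in_nbrs

def on_cycle(node, out_nbrs):
    """True if some directed path leads from node back to itself."""
    stack, seen = list(out_nbrs.get(node, ())), set()
    while stack:
        cur = stack.pop()
        if cur == node:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(out_nbrs.get(cur, ()))
    return False

def suspected_seeds(edges, high=3, low=2):
    """Hypothetical thresholds: degree > high is 'large', degree < low is 'small'."""
    out_nbrs, in_nbrs = degree_tables(edges)
    seeds = set()
    for n in set(out_nbrs) | set(in_nbrs):
        out_deg = len(out_nbrs.get(n, ()))
        in_deg = len(in_nbrs.get(n, ()))
        scatter_out = out_deg > high and in_deg < low   # decentralized transfer-out
        gather_in = in_deg > high and out_deg < low     # centralized transfer-in
        if scatter_out or gather_in or on_cycle(n, out_nbrs):
            seeds.add(n)
    return seeds

edges = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"),   # fan-out from H
         ("x", "y"), ("y", "z"), ("z", "x"),               # ring x -> y -> z -> x
         ("p", "q")]                                       # ordinary transfer
seeds = suspected_seeds(edges)
```

The cycle test here restarts a search from every node, which is acceptable for a sketch; on large graphs a single strongly connected component pass would be the more usual choice.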

In operation 140, given a directed graph G = (V, E) and a subgraph G' = (V', E') of G, where V' is a subset of V and E' is a subset of E, G' is a connected subgraph of the directed graph G if, for any two nodes v1 and v2 in G', there is at least one reachable path between v1 and v2.

The connected subgraph algorithm refers to an algorithm for finding a connected subgraph in a graph, which is connected with a certain node as a starting point, for example, a Kosaraju algorithm, a Tarjan algorithm and the like. In operation 140, the connected subgraph in which the seed node is located is found, typically with the seed node as a starting point.
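A minimal sketch of finding the connected subgraph that contains a given seed node is shown below. It assumes weak connectivity, that is, edge direction is ignored during the search; this is a simplification for illustration and is not the Kosaraju or Tarjan algorithm named above, which compute strongly connected components. The function name is hypothetical.

```python
from collections import defaultdict, deque

def connected_subgraph(edges, seed):
    """Return the node set reachable from seed when edges are treated as undirected."""
    nbrs = defaultdict(set)
    for src, dst in edges:
        nbrs[src].add(dst)
        nbrs[dst].add(src)
    seen, queue = {seed}, deque([seed])
    while queue:                      # breadth-first search from the seed
        cur = queue.popleft()
        for nxt in nbrs[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = [("A", "B"), ("C", "B"), ("D", "E")]
component = connected_subgraph(edges, "A")
```

Running the search once per seed and discarding duplicates yields the set of candidate subgraphs that the later steps operate on.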

In the practical application process, if there are many seed nodes and there are many connected subgraphs found according to the seed nodes, only the connected subgraphs with many nodes may be selected to improve the processing efficiency. For example, only the connected subgraph with the number of nodes greater than the threshold number of nodes is retained, or the connected subgraph is sorted according to the number of nodes and the connected subgraph with the top sorting is selected, wherein the threshold number of nodes is a value preset according to an empirical value, and the threshold value can be adjusted according to implementation effects.

Determining the connected subgraphs delimits a range for the subsequent identification of associated specific accounts and reduces the number of accounts to be checked, so that the amount of data to be processed subsequently is greatly reduced and the specific accounts can be identified more efficiently and more accurately.

In operation 150, communities refer to subsets of nodes that are relatively tightly connected internally. Communities whose node sets do not intersect with each other are called non-overlapping (disjoint) communities, and communities whose node sets intersect are called overlapping communities. The internal connection between nodes can be expressed as the distance between them, so a node subset with tighter internal connections appears in the graph as an area of higher node density. The process of finding community structures in a graph is community discovery. In the embodiment of the invention, community discovery algorithms based on modularity optimization are recommended, such as the Louvain algorithm, the Newman fast algorithm, the CNM algorithm, the MSG-MV algorithm and the like. However, the embodiment of the present invention does not limit the specific community discovery algorithm adopted, and any algorithm with a better implementation effect may be used.
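To make the objective of modularity-based algorithms concrete, the sketch below computes the modularity Q of a given partition, the quantity that algorithms such as Louvain try to maximize. It is not a community discovery algorithm itself, and it assumes an unweighted graph with edge direction ignored; the function name is hypothetical. For each community c, with m edges in total, L_c internal edges and d_c the sum of node degrees in c, the contribution is L_c/m - (d_c/(2m))^2.

```python
def modularity(edges, community_of):
    """Modularity of a partition; edges are (u, v) pairs, direction ignored."""
    m = len(edges)
    internal = {}     # community -> number of edges with both ends inside
    degree_sum = {}   # community -> sum of degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_communities = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_community = {n: "A" for n in range(1, 7)}

q_split = modularity(edges, two_communities)   # 2 * (3/7 - (7/14)**2) = 5/14
q_merged = modularity(edges, one_community)    # 7/7 - (14/14)**2 = 0
```

The split into two triangles scores higher than the single merged community, which is why a modularity optimizer would separate them.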

FIG. 3 shows four communities obtained by applying the Louvain community discovery algorithm according to the embodiment of the present invention: community A, community B, community C, and community D. Each node subset within an area enclosed by a dotted line in the figure is a community.

In practical application, it is still difficult to locate abnormal accounts from a connected subgraph alone. Taking the application of identifying the bank accounts of a money laundering group from transfer records as an example, when a criminal group builds a complex money laundering transaction network, its core money laundering structure is typically hidden among seemingly normal transactions. That is, if a connected subgraph is analyzed directly as a whole, it is likely to appear to have a low money laundering risk. Thus, in operation 150, each larger connected subgraph is further divided into several smaller communities that discriminate money laundering behavior better. If a large connected subgraph contains one community with a very high money laundering risk, or several communities with higher money laundering risks, the connected subgraph as a whole can be judged to have a larger money laundering risk. In this way, the miss rate for specific accounts can be greatly reduced.

In operation 160, the risk assessment of each community is based primarily on the specific goal of the identification, i.e., what kind of specific account is being sought. Taking the application of identifying the bank accounts of a money laundering group from transfer records as an example, the main basis for risk assessment is the degree of abnormality of the transaction amounts of the accounts represented by the nodes in the community, the complexity of the associated data between accounts, the degree of concentration of the time point distribution, and the like.

By performing risk assessment on the communities and ranking them according to their risk scores, the identification range of the specific accounts can be further narrowed, and the communities with higher risk scores can be analyzed first, which greatly improves the speed and processing efficiency of identifying the associated specific accounts.
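One simple way to combine per-factor scores into a community score and rank the communities is a weighted sum, sketched below. The factor names, weights, threshold and sample scores are all hypothetical values chosen for illustration; the text does not specify how the factor scores are combined.

```python
# Hypothetical weights for three judgment factors, each scored in [0, 1].
WEIGHTS = {
    "association_count": 0.4,
    "structure_count": 0.4,
    "time_concentration": 0.2,
}

def community_score(factor_scores, weights=WEIGHTS):
    """Weighted sum of the per-factor scores of one community."""
    return sum(weights[name] * factor_scores[name] for name in weights)

def target_communities(scores, threshold=0.6):
    """Community names with score >= threshold, highest risk first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]

# Hypothetical factor scores for three communities.
factors = {
    "A": {"association_count": 0.9, "structure_count": 0.8, "time_concentration": 0.7},
    "B": {"association_count": 0.2, "structure_count": 0.1, "time_concentration": 0.3},
    "C": {"association_count": 0.6, "structure_count": 0.7, "time_concentration": 0.5},
}
scores = {name: community_score(f) for name, f in factors.items()}
targets = target_communities(scores)
```

Because the output is ordered by score, analysts can start with the first entries and stop once the remaining communities fall below the level of interest.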

After the target community is determined, all specific accounts which are desired to be identified can be obtained, but in some application scenarios, different roles of the specific accounts in abnormal transactions need to be further analyzed to locate the most critical core account and core association relationship.

Taking a money laundering group as an example, the division of labor within the group is very clear, and accounts in different roles have different characteristics. For example, in money laundering cases involving underground banks, illegal fundraising, telecommunication fraud and the like, the core accounts responsible for collecting money are few in number but large in transaction amount, while the accounts used as transfer intermediaries are many in number but small in transaction amount; in the isomorphic directed graph, a node representing an account with centralized transfer-in and transfer-out transactions is closer to the nodes representing core accounts; and the large number of peripheral accounts with small transaction amounts and low frequency is of limited use for finding the core accounts of the money laundering group. For the investigation of money laundering cases, the key is to find the core accounts and core financial transactions that can provide strong evidence for handling the case.

For this purpose, after the target communities are determined, the present embodiment further analyzes each community through a Social Network Analysis (SNA) algorithm, a shortest path algorithm, and a minimum spanning tree algorithm to find the hub nodes representing the core accounts and the key paths representing the core association relations.
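As an illustration of the minimum spanning tree step mentioned above, the following is a minimal sketch of Kruskal's algorithm over the edges of a community, treated as undirected. The function name `minimum_spanning_tree` and the edge weights are hypothetical: the description does not state how edges are weighted, so any cost (for example, one that decreases as the transaction amount grows) is an assumption.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm.

    nodes: iterable of account identifiers.
    weighted_edges: iterable of (weight, account_a, account_b) tuples
                    (the weighting scheme is an assumption, not from the text).
    Returns the tree as a list of (account_a, account_b, weight).
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree
```

The edges kept in the tree form one candidate for the key path; which edges count as "core" still depends on the chosen weighting.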

Since a hub node is a core account deduced backwards from the identification result, whereas a seed node determined earlier is a core account assumed from a certain point of view, the process of deducing hub nodes backwards from the identification result can also be regarded as a self-feedback mechanism. That is, by comparing the degree of overlap between the hub nodes and the seed nodes, the correctness of the identification result can be verified from another angle: the higher the degree of overlap, the more credible the identification result.

Adding the self-feedback mechanism makes it possible not only to mine groups whose methods resemble those of known cases, but also to catch groups using novel methods in the same sweep, and the mined groups can be further corroborated by the self-feedback module, which serves a verification function, at the time they are identified. A closed-loop system covering the whole process from mining through verification to tracing is thus achieved: the identification strategy can be adjusted according to the degree of overlap between the hub nodes and the seed nodes, and the identification process repeated recursively until an identification result of high accuracy is obtained. The accuracy of the identification result can therefore be greatly improved.
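The comparison between hub nodes and seed nodes can be sketched as follows. The description does not define how the degree of overlap is measured; the fraction of hub nodes that were also seed nodes is used here purely as an assumption, and `overlap_degree` is a hypothetical name.

```python
def overlap_degree(hub_nodes, seed_nodes):
    """Fraction of hub nodes that also appear among the seed nodes.

    This particular measure is an assumption; the text only says that a
    higher degree of overlap makes the identification result more credible.
    """
    hubs, seeds = set(hub_nodes), set(seed_nodes)
    if not hubs:
        return 0.0
    return len(hubs & seeds) / len(hubs)
```

A low value would suggest adjusting the identification strategy (for example, the seed thresholds) and repeating the process.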

According to an embodiment of the present invention, constructing an isomorphic directed graph according to associated data includes: constructing, according to all associated data, a first isomorphic directed graph containing all accounts and all associated data; deleting the isolated edges in the first isomorphic directed graph, an isolated edge being an edge that shares no node with any other edge in the first isomorphic directed graph, to obtain a second isomorphic directed graph; and deleting edges whose association degree is smaller than a first threshold from the second isomorphic directed graph to obtain the isomorphic directed graph to be constructed. The first threshold is a value preset according to an empirical value, and the threshold can be adjusted according to implementation effects.

The volume of associated data is usually very large; even over a short period, for example a single day, the associated data may amount to tens of thousands or even billions of records. An isomorphic directed graph built from such data is extremely large and complex. If it were used directly for subsequent analysis and identification, the amount of computation would be enormous, identification would be extremely inefficient, and a result might not be obtained at all.

Therefore, in the present embodiment, after the first isomorphic directed graph has been constructed from all associated data, edges that are obviously not abnormal transactions and nodes that are obviously not specific accounts are deleted.

Taking bank accounts as an example, normal transaction activity between individuals typically has the following features: 1) transfer islands, that is, two account nodes transfer only to each other and to no other account; such transactions are generally normal personal transfers, so the isolated edges representing transfer islands can be deleted from the first isomorphic directed graph; 2) small totals, that is, although transfers occur several times between two accounts, their total amount is smaller than a certain total amount threshold; the total transaction amount in the edge attributes is checked, and the edge is deleted if that total is below the total amount threshold. The total amount threshold is a value preset according to an empirical value, and the threshold can be adjusted according to implementation effects.

Judging from practical application, these two steps reduce the node scale of the isomorphic directed graph by at least half, which greatly reduces the amount of computation and improves the calculation speed and the identification efficiency.
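The two pruning steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the graph is represented as a plain dictionary from a directed edge `(src, dst)` to the total amount transferred on that edge, and the names `prune_graph` and `total_amount_threshold` are hypothetical.

```python
from collections import defaultdict


def prune_graph(edges, total_amount_threshold):
    """edges: dict mapping (src, dst) -> total amount transferred on that edge.

    Returns a new dict without transfer islands and without low-total edges.
    """
    # Collect each account's neighbours, ignoring edge direction.
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    kept = {}
    for (src, dst), amount in edges.items():
        # Step 1: a transfer island is a pair of accounts that trade only
        # with each other, so its edges touch no other edge in the graph.
        if neighbours[src] == {dst} and neighbours[dst] == {src}:
            continue
        # Step 2: drop edges whose total amount is below the threshold.
        if amount < total_amount_threshold:
            continue
        kept[(src, dst)] = amount
    return kept
```

Because islands are detected on the original graph before any low-total edge is removed, the single pass gives the same result as building the second graph first and then filtering it.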

Taking the application of identifying money laundering group bank accounts according to transfer records as an example, the transfers between money laundering groups all have the following characteristics:

1) Frequent transactions, i.e., the number of associations within a period of time is greater than a second threshold. For such transactions, the edges whose transaction count in the edge attributes exceeds a certain count threshold can be found in the isomorphic directed graph, and the nodes at both ends of each such edge are determined as seed nodes. The count threshold is a value preset empirically, and the threshold can be adjusted according to the implementation effect.

2) Centralized transfer-in/decentralized transfer-out. For such transactions, an algorithm based on threshold filtering may be used. For example, the out-degree and in-degree of each node are calculated, a first out-degree threshold θ1 and a first in-degree threshold θ2 are set, and the nodes representing such accounts are filtered out using the condition "out-degree > θ1 and in-degree < θ2". The first out-degree threshold and the first in-degree threshold are values preset empirically, and these thresholds can be adjusted according to implementation effects.

3) Decentralized transfer-in/centralized transfer-out. An algorithm based on threshold filtering may also be used for such transactions. For example, the out-degree and in-degree of each node are calculated, a second out-degree threshold θ3 and a second in-degree threshold θ4 are set, and the nodes representing such accounts are filtered out using the condition "out-degree < θ3 and in-degree > θ4". The second out-degree threshold and the second in-degree threshold are values preset empirically, and these thresholds can be adjusted according to implementation effects.

4) A ring-shaped transaction structure. For such transactions, ring-shaped strongly connected subgraphs are searched for using an algorithm such as Tarjan's or Kosaraju's, which yields the corresponding abnormal structures.
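The four patterns above can be sketched together as follows. This is an illustrative sketch only: the graph is assumed to be a dictionary from a directed edge `(src, dst)` to the number of transfers on that edge, the function names are hypothetical, and a strongly connected component with more than one node is taken as a stand-in for a "ring-shaped" structure. The component search uses Kosaraju's algorithm, written iteratively.

```python
from collections import defaultdict


def strongly_connected_components(nodes, graph, reverse):
    """Kosaraju's algorithm; graph/reverse map a node to its successors/predecessors."""
    order, visited = [], set()
    for start in nodes:                      # pass 1: record finishing order
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                            # all successors explored
                order.append(node)
                stack.pop()
    components, assigned = [], set()
    for start in reversed(order):            # pass 2: walk the reversed graph
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = {start}, [start]
        while todo:
            node = todo.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def suspected_seeds(edges, count_threshold, theta1, theta2, theta3, theta4):
    """edges: dict (src, dst) -> number of transfers on that directed edge."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    seeds = set()
    for (src, dst), count in edges.items():  # 1) frequent transactions
        if count > count_threshold:
            seeds.update((src, dst))
    for n in nodes:
        if out_deg[n] > theta1 and in_deg[n] < theta2:   # 2) out-degree > θ1, in-degree < θ2
            seeds.add(n)
        if out_deg[n] < theta3 and in_deg[n] > theta4:   # 3) out-degree < θ3, in-degree > θ4
            seeds.add(n)
    for component in strongly_connected_components(nodes, graph, reverse):
        if len(component) > 1:               # 4) ring-like structure
            seeds.update(component)
    return seeds
```

The result is the set of suspected seed nodes; further constraints are still needed to narrow it down.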

In practical applications, quite a few accounts meeting the above conditions may be found. Additional constraints are then needed to narrow the scope of investigation to a manageable size. One of the more common constraints is the center-point breakage rate, i.e., the ratio of the amount transferred in to the amount transferred out. Taking an intermediary account used for money laundering as an example, its transactions are usually transitional, that is, most of the money transferred in is transferred out again by various means, so its center-point breakage rate is close to 1. In other words, if the center-point breakage rate of an account is 1 or close to 1, the account is more likely to be a specific account.

It should be noted that, in practical applications, the center-point breakage rate is not required to be strictly equal to 1 when screening suspicious accounts; it may instead lie within a range close to 1, for example [0.8, 1.2].

In addition, abnormal transactions typically complete many transactions in a relatively short period of time, and thus the transaction time interval is also a common constraint. If the time difference between the two transactions at one edge is small, the probability that the transaction is an abnormal transaction is also high.

In this case, the first condition may be defined as: the center-point breakage rate lies within [0.8, 1.2] and the transaction time interval is less than a threshold preset based on empirical values. Screening the suspected seed nodes again with these two constraints yields seed nodes that are more likely to be specific accounts.
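A check of the first condition might look like the sketch below. Several details are assumptions rather than statements of the description: the function name `meets_first_condition` is hypothetical, the breakage rate is taken as amount in divided by amount out, timestamps are plain numbers (for example seconds), and the "transaction time interval" is interpreted as the smallest gap between two consecutive transactions.

```python
def meets_first_condition(amount_in, amount_out, timestamps, max_interval,
                          low=0.8, high=1.2):
    """Return True if the account passes both constraints.

    amount_in / amount_out: total amounts transferred in and out.
    timestamps: transaction times of the account (numeric).
    max_interval: empirical threshold on the time gap (assumed unit: same as timestamps).
    """
    if amount_out == 0:
        return False                       # ratio undefined, cannot be close to 1
    rate = amount_in / amount_out          # center-point breakage rate
    if not (low <= rate <= high):
        return False
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Assumption: one sufficiently short gap is enough to satisfy the constraint.
    return bool(gaps) and min(gaps) < max_interval
```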

The judgment factors for judging whether the account is abnormal may be:

1) Abnormal business indicators, such as the average number of transactions and the average transaction amount.

Taking the average number of transactions as an example, the following method may be used for risk assessment.

First, the average number of trades per community i is calculated using the following formula:

then, the average number of trades for all nodes over a period of time is calculated using the following formula:

thereafter, a risk score is calculated using the following formula:

P_i = Count_{i_mean} / Count_{mean}

where the larger P_i is, the higher the risk.

The risk score of the average transaction amount may also be calculated by replacing the transaction times in the above formula with the transaction amount, which is not described herein again.
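The ratio P_i can be illustrated with the sketch below. The two averaging formulas are not reproduced in this text, so the sketch assumes plain arithmetic means over nodes: the community mean over the nodes of community i, and the overall mean over all nodes. The function and parameter names are hypothetical.

```python
def average_count_risk(trade_counts, communities):
    """trade_counts: dict node -> number of trades in the period.
    communities: dict community id -> list of member nodes.

    Returns dict community id -> P_i (community mean / overall mean).
    """
    overall_mean = sum(trade_counts.values()) / len(trade_counts)
    scores = {}
    for community_id, members in communities.items():
        community_mean = sum(trade_counts[n] for n in members) / len(members)
        scores[community_id] = community_mean / overall_mean
    return scores
```

Passing transaction amounts instead of trade counts gives the corresponding score for the average transaction amount.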

2) Abnormal transaction structures, for example, a frequent transfer-in/transfer-out structure, a centralized transfer-in/decentralized transfer-out structure, a decentralized transfer-in/centralized transfer-out structure, a ring transaction structure, and the like. The transaction structure risk of a community is scored mainly by counting the number of abnormal transaction structures it contains. How abnormal transaction structures are identified has been described above and is not repeated here.

3) Abnormal transaction time distribution. The more concentrated the times of the transactions represented by the edges of a community are, the more likely it is that the batch of transactions is abnormal. Temporal entropy can therefore be used for the risk assessment.

The temporal entropy of a community can be calculated using the following method:

firstly, the time point of each transaction is calculated by taking the initial transaction in the community as a starting point, the average transaction time of the community is calculated according to the time points, and then the absolute value of the difference between the average transaction time and the time of each transaction in the community is calculated.

Then, each transaction is assigned, according to the magnitude of that absolute difference, to one of several time intervals corresponding to different risk levels.

As shown in Fig. 4, T is the average transaction time of the community, and P1, P2, P3 and P4 are the risk assessment levels of the time intervals T1, T2, T3 and T4, respectively.

Then, the transaction time entropy of each community can be calculated according to the following formula:

wherein H_c represents the transaction time entropy of the community, and P_i represents the risk assessment level corresponding to each transaction.

After the transaction time entropy of each community has been calculated, the overall money laundering risk of the community can be assessed quantitatively from it: the smaller the time entropy, the higher the risk level.
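Since the entropy formula itself is not reproduced in this text, the following is only one possible reading, offered as a sketch: each transaction is bucketed by its absolute deviation from the community's mean transaction time, and the Shannon entropy of the resulting bucket shares is used as the time entropy. The function name `time_entropy` and the parameter `interval_bounds` (upper bounds of the deviation intervals) are hypothetical.

```python
import math
from collections import Counter


def time_entropy(timestamps, interval_bounds):
    """Shannon entropy of how transactions spread over deviation intervals.

    timestamps: transaction times measured from the community's first transaction.
    interval_bounds: ascending upper bounds of the deviation intervals;
                     deviations above the last bound fall in a final interval.
    """
    mean_time = sum(timestamps) / len(timestamps)

    def bucket(deviation):
        for index, bound in enumerate(interval_bounds):
            if deviation <= bound:
                return index
        return len(interval_bounds)

    counts = Counter(bucket(abs(t - mean_time)) for t in timestamps)
    total = len(timestamps)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under this reading, transactions packed tightly around the mean time all land in one interval and give an entropy of zero, which matches the rule that a smaller time entropy means a higher risk.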

After the risk scores of the above judgment factors have been calculated, the money laundering risk of each community can be quantified as a score using the Sigmoid function S(x) = 1 / (1 + e^{-x}), whose argument is the following weighted sum of the judgment factors:

w_ij x_ij = w_i1 * (P_{number of transactions} + P_{Transaction total}) + w_i2 * (G_{Frequent roll-in/roll-out} + G_{Centralized transfer-in/decentralized transfer-out} + G_{Decentralized transfer-in/centralized transfer-out} + G_{Circular transaction structure}) + w_i3 * H_c

where i denotes the i-th community and j denotes each judgment factor; w_i1, w_i2 and w_i3 are the weights corresponding to the respective judgment factors, which are preset according to empirical values and can be adjusted according to implementation effects; G_{Frequent roll-in/roll-out} represents the number of frequent transfer-in/transfer-out transaction structures; G_{Centralized transfer-in/decentralized transfer-out} represents the number of centralized transfer-in/decentralized transfer-out transaction structures; G_{Decentralized transfer-in/centralized transfer-out} represents the number of decentralized transfer-in/centralized transfer-out transaction structures; G_{Circular transaction structure} represents the number of ring transaction structures; and H_c represents the transaction time entropy.
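The scoring step can be sketched as below, assuming the standard logistic form of the Sigmoid function, S(x) = 1 / (1 + e^{-x}), applied to the weighted sum given above. The function and parameter names are hypothetical, and the weights must be supplied by the caller because the description only says they are preset empirically.

```python
import math


def community_risk_score(p_count, p_amount,
                         g_frequent, g_in_out, g_out_in, g_ring,
                         h_c, w1, w2, w3):
    """Squash the weighted sum of judgment factors into a score in (0, 1).

    p_count, p_amount: risk scores of the business indicators.
    g_*: numbers of abnormal transaction structures of each kind.
    h_c: transaction time entropy of the community.
    w1, w2, w3: empirical weights (assumed to be chosen by the user).
    """
    x = (w1 * (p_count + p_amount)
         + w2 * (g_frequent + g_in_out + g_out_in + g_ring)
         + w3 * h_c)
    return 1.0 / (1.0 + math.exp(-x))
```

Because a smaller time entropy indicates a higher risk, a negative value for `w3` would be one natural choice, although the description does not say so.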

After the risk that each target community contains specific accounts has been quantified as a score, a corresponding percentile chart can be drawn and the communities graded by percentile range. Generally, communities whose scores fall in the 95-100%, 90-95% and 80-90% percentile ranges can be labeled risk level 1, 2 and 3 respectively, and all remaining communities risk level 4, and so on. In subsequent business work, attention can then be focused on the communities with high risk values, which greatly improves working efficiency.
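The percentile grading can be sketched as follows. The percentile of a community is taken here as its rank position divided by the number of communities, which is an assumption about how the percentile chart is built; `risk_levels` is a hypothetical name.

```python
def risk_levels(scores):
    """scores: dict community id -> risk score.

    Returns dict community id -> risk level (1 is the highest risk).
    """
    ranked = sorted(scores, key=scores.get)   # ascending by score
    n = len(ranked)
    levels = {}
    for position, community_id in enumerate(ranked, start=1):
        # Integer arithmetic avoids rounding issues: position / n > 0.95, etc.
        if position * 100 > 95 * n:
            levels[community_id] = 1
        elif position * 100 > 90 * n:
            levels[community_id] = 2
        elif position * 100 > 80 * n:
            levels[community_id] = 3
        else:
            levels[community_id] = 4
    return levels
```

With 20 communities, for example, the top one is labeled level 1, the next one level 2, the next two level 3, and the remaining sixteen level 4.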

Fig. 5 shows the target community representing a highly suspicious money laundering group after risk stratification.

Taking the social network analysis indicator diagram shown in Fig. 6 as an example, the enlarged node in part (a) of Fig. 6 is the account with the highest degree centrality in the network, which indicates that it has the most connections with other nodes and may be the most active node in the transfer transactions. The two enlarged nodes in part (b) of Fig. 6 can reach most other account members in the network most readily; they have the highest closeness centrality and are most likely to be the nodes responsible for transaction transit. The enlarged node in part (c) of Fig. 6 has the highest betweenness (intermediary) centrality between the two partial transaction networks and is therefore likely to be the bridge between them. The enlarged node in part (d) of Fig. 6 has the highest eigenvector centrality; it has the most direct contact with the most active account nodes and can therefore influence those active nodes most.
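Two of the indicators discussed above, degree centrality and closeness centrality, can be sketched in a few lines. The sketch assumes a connected community represented as an undirected adjacency dictionary (node to set of neighbours) with at least two nodes; the function names are hypothetical, and betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque


def degree_centrality(adj):
    """Share of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def closeness_centrality(adj):
    """Reciprocal of the average shortest-path distance to reachable nodes."""
    result = {}
    for source in adj:
        distance = {source: 0}
        queue = deque([source])
        while queue:                          # breadth-first search
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        result[source] = (len(distance) - 1) / total if total else 0.0
    return result
```

In a star-shaped community the center scores 1.0 on both measures, which corresponds to the kind of enlarged node shown in parts (a) and (b) of Fig. 6.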

According to a second aspect of the embodiment of the present invention, an apparatus for identifying associated specific accounts is provided. As shown in Fig. 7, the apparatus 70 includes: an associated data acquiring module 701, configured to acquire associated data occurring between accounts within a period of time; an isomorphic directed graph constructing module 702, configured to construct an isomorphic directed graph according to the associated data, where nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent associated data occurring between the accounts, and the associated data includes associated time, associated times, and associated content; a seed node determining module 703, configured to determine, according to the numerical features and graph structure features specific to the specific account, seed nodes representing suspected specific accounts from the isomorphic directed graph; a connected subgraph determining module 704, configured to determine the connected subgraphs in which the seed nodes are located according to the seed nodes and a connected subgraph algorithm; a community discovery module 705, configured to divide each of the connected subgraphs into at least two communities according to a community discovery algorithm, where a community is a node subset within the corresponding connected subgraph; and a target community determining module 706, configured to evaluate each of the at least two communities to obtain a risk score corresponding to each community, and to determine a target community from the at least two communities according to the risk scores, where the accounts represented by the nodes in the target community are the associated specific accounts.

According to an embodiment of the present invention, the apparatus 70 further includes: the hub node determining module is used for determining a hub node representing a core account from the target community according to a social network analysis algorithm or a shortest path algorithm; and the critical path determining module is used for determining a critical path representing the core incidence relation in the target community according to a minimum spanning tree algorithm.

According to an embodiment of the present invention, the isomorphic directed graph constructing module 702 includes: a first isomorphic directed graph constructing submodule, configured to construct, according to all associated data, a first isomorphic directed graph containing all accounts and all associated data; an isolated edge deleting submodule, configured to delete the isolated edges in the first isomorphic directed graph, an isolated edge being an edge that shares no node with any other edge in the first isomorphic directed graph, to obtain a second isomorphic directed graph; and a low transaction amount edge deleting submodule, configured to delete edges whose association degree is smaller than the first threshold from the second isomorphic directed graph to obtain the isomorphic directed graph to be constructed.

According to an embodiment of the present invention, the seed node determining module 703 includes: a suspected seed node determining submodule, configured to detect whether each node in the isomorphic directed graph meets any of the following conditions and, if so, to determine the corresponding node as a suspected seed node: the number of associations represented by an edge connected to the node is greater than a second threshold; or the out-degree is greater than a first out-degree threshold and the in-degree is less than a first in-degree threshold, the first out-degree threshold being greater than the first in-degree threshold; or the in-degree is greater than a second in-degree threshold and the out-degree is less than a second out-degree threshold, the second in-degree threshold being greater than the second out-degree threshold; or the node belongs to a ring-shaped strongly connected subgraph; and a first condition detecting submodule, configured to detect whether the account represented by each suspected seed node meets the first condition and, if so, to determine the corresponding suspected seed node as a seed node.

According to an embodiment of the present invention, the target community determining module 706 includes: a judgment factor evaluation submodule, configured to evaluate the judgment factors used for judging whether an account is an associated specific account to obtain the score of each judgment factor, where the judgment factors include the number of associations, the number of specific data structures, and the distribution of association time points; and a risk scoring submodule, configured to obtain the score of the corresponding community according to the scores of the judgment factors.

According to an embodiment of the present invention, the hub node determining module is specifically configured to: determine nodes with a higher association frequency as hub nodes according to a frequency (degree) centrality algorithm; and/or determine closely associated nodes as hub nodes according to a closeness centrality algorithm; and/or determine nodes at the center of associations as hub nodes according to eigenvector centrality; and/or determine nodes on the critical path as hub nodes according to a shortest path algorithm.

According to an embodiment of the present invention, the critical path determining module includes: the high-risk connected subgraph determining sub-module is used for determining at least two connected subgraphs containing the target community; the bridge node determining submodule is used for determining bridge nodes of at least two connected sub-graphs according to an intermediary centrality algorithm; and the critical path determining submodule is used for determining a critical path for connecting at least two connected subgraphs through the bridge node according to a minimum spanning tree algorithm.

It should be noted here that the descriptions of the apparatus embodiment for identifying associated specific accounts and of the computer storage medium embodiment are similar to the description of the foregoing method embodiments and have similar beneficial effects, so they are not repeated. For technical details not disclosed in the apparatus embodiment and the computer storage medium embodiment of the present invention, please refer to the description of the foregoing method embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage medium, a Read Only Memory (ROM), a magnetic disk, and an optical disk.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage medium, a ROM, a magnetic disk, an optical disk, or the like, which can store the program code.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for identifying an associated specific account, the method comprising:

acquiring associated data occurring among accounts within a period of time;

constructing an isomorphic directed graph according to the associated data, wherein nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent associated data occurring between the accounts, and the associated data comprise association time, association times and association degree;

determining seed nodes representing suspected specific accounts from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account;

determining a connected subgraph where the seed nodes are located according to the seed nodes and a connected subgraph algorithm;

dividing each of the connected subgraphs into at least two communities according to a community discovery algorithm, wherein the community is a node subset within the corresponding connected subgraph;

and evaluating each community of the at least two communities to obtain a risk score corresponding to each community, and determining a target community from the at least two communities according to the risk scores, wherein an account represented by a node in the target community is an associated specific account.

2. The method of claim 1, wherein after the determining the target community from the at least two communities, the method further comprises:

determining a hub node representing a core account from the target community according to a social network analysis algorithm or a shortest path algorithm;

and determining a key path representing the core incidence relation in the target community according to a minimum spanning tree algorithm.

3. The method according to claim 1, wherein the constructing the isomorphic directed graph according to the associated data comprises:

constructing a first isomorphic directed graph containing all accounts and all associated data according to all associated data;

deleting an isolated edge in the first isomorphic directed graph, wherein the isolated edge and other edges in the first isomorphic directed graph have no intersection point to obtain a second isomorphic directed graph;

and deleting the edges with the association degree smaller than a first threshold value from the second isomorphic directed graph to obtain the isomorphic directed graph to be constructed.

4. The method according to claim 1, wherein determining the seed node representing the suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account comprises:

detecting whether each node in the isomorphic directed graph meets any of the following conditions, and if so, determining the corresponding node as a suspected seed node:

the number of associations represented by an edge connected to the node is greater than a second threshold, or

the out-degree is greater than a first out-degree threshold and the in-degree is less than a first in-degree threshold, the first out-degree threshold being greater than the first in-degree threshold, or

the in-degree is greater than a second in-degree threshold and the out-degree is less than a second out-degree threshold, the second in-degree threshold being greater than the second out-degree threshold, or

the node is a node in a ring-shaped strongly connected subgraph;

and detecting whether the account represented by each suspected seed node meets a first condition, and if so, determining the corresponding suspected seed node as the seed node.

5. The method of claim 1, wherein evaluating each of the at least two communities for a score corresponding to each community comprises, for each community:

respectively evaluating judgment factors for judging whether the account is a related specific account to obtain the score of each judgment factor, wherein the judgment factors comprise the number of correlation times, the number of specific data structures and the distribution of correlation time points;

and obtaining the score corresponding to the corresponding community according to the score of each judgment factor.

6. The method of claim 2, wherein determining a hub node representing a core account from the target community according to a social network analysis algorithm or a shortest path algorithm comprises:

determining a node with a higher association frequency as a hub node according to a frequency centrality algorithm; and/or

determining closely associated nodes as hub nodes according to a closeness centrality algorithm; and/or

determining nodes at the center of associations as hub nodes according to eigenvector centrality; and/or

determining nodes on the key path as hub nodes according to the shortest path algorithm.

7. The method of claim 2, wherein determining the critical path representing the core association relationship in the target community according to a minimum spanning tree algorithm comprises:

determining at least two connected subgraphs containing the target community;

determining bridge nodes of the at least two connected sub-graphs according to an intermediary centrality algorithm;

and determining a key path for connecting the at least two connected subgraphs through the bridge nodes according to a minimum spanning tree algorithm.

8. An apparatus for identifying an associated specific account, the apparatus comprising:

the associated data acquisition module is used for acquiring associated data occurring among accounts within a period of time;

the isomorphic directed graph construction module is used for constructing an isomorphic directed graph according to the association data, wherein nodes in the isomorphic directed graph represent accounts, edges in the isomorphic directed graph represent association data occurring among the accounts, and the association data comprise association time, association times and association content;

the seed node determining module is used for determining a seed node representing a suspected specific account from the isomorphic directed graph according to the numerical features and the graph structure features specific to the specific account;

the connected subgraph determining module is used for determining a connected subgraph where the seed node is located according to the seed node and a connected subgraph algorithm;

the community discovery module is used for dividing each connected subgraph in the connected subgraphs into at least two communities according to a community discovery algorithm, wherein each community is a node subset in the corresponding connected subgraph;

and the target community determining module is used for evaluating each community of the at least two communities to obtain a score corresponding to each community, and determining the target community from the at least two communities according to the risk score, wherein the account represented by the node in the target community is the associated specific account.

9. The apparatus of claim 8, further comprising:

the hub node determining module is used for determining a hub node representing a core account from the target community according to a social network analysis algorithm or a shortest path algorithm;

and the critical path determining module is used for determining a critical path representing the core incidence relation in the target community according to a minimum spanning tree algorithm.

10. A computer storage medium comprising a set of computer executable instructions for performing the method of any one of claims 1 to 7 when executed.