Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure;
the invention also provides a working method of the abnormal financial organization hierarchy dividing system;
The method extracts the transaction flow information of an abnormal financial organization, constructs an abnormal transaction network on that basis, extracts the neighborhood topological structure features of each account using a custom network representation learning method based on the neighborhood topological structure, generates a corresponding low-dimensional dense vector for each account, and then clusters the node vectors with the k-means algorithm to complete the hierarchical division of the abnormal financial organization. The invention can be used for: 1) constructing an abnormal financial organization transaction network based on the financial transaction flow; 2) hierarchically dividing the abnormal financial organization based on neighborhood topological structure features.
Interpretation of terms:
1. NTF algorithm, belongs to a network representation learning method. The key idea of the method comprises the following steps:
1) Measuring the structural similarity of node pairs is independent of the relative positions of the node pairs. The degree of similarity of two nodes is only related to their neighborhood topology, and is not related to whether two nodes are connected or not, and the position, the label and the attribute in the network.
2) An energy level concept is introduced to measure the importance of each neighborhood node within the neighborhood topological structure of the central node, and this importance is used as a feature of the neighborhood node when describing the neighborhood topological structure of the central node. Following the idea of energy dissipation, neighborhood nodes that have a large influence on the central node are screened out and an influence sub-network of the central node is constructed, which effectively reduces the interference of noisy data.
3) The relative characteristics and the absolute characteristics of the neighborhood nodes are used for representing the structural characteristics of the central node, the inherent attributes of the neighbor nodes and the importance degree of the neighborhood structure of the neighbor nodes relative to the central node are comprehensively considered, and the neighborhood structural characteristics of the central node are better reflected.
4) Hierarchy of node neighborhood structural representation. From the central node, sampling is carried out layer by layer according to the distance from the neighbor node to the central node by adopting the thought of breadth-first search, and the topological characteristic of the neighborhood structure of the central node is more reasonably described.
2. The NTF model mainly comprises the following four steps:
1) Generating the influence sub-network of a node. The energy level concept of the NTF model is introduced to measure the importance of neighbor nodes relative to a central node; the neighbor nodes that are more important to the neighborhood topological structure of the central node are first screened out from the network and, together with their edges, form the influence sub-network of the central node.
2) Obtaining the neighborhood topological structure representation of a node. According to a given maximum sampling depth, on the influence sub-network of the central node obtained in step 1), sampling starts from the central node and expands outward layer by layer by distance using a breadth-first search strategy; each neighbor node reached during sampling is represented by its energy level and its degree, where the energy level is a relative feature and the degree is an absolute feature. The sequence of neighbor nodes obtained by this expansion sampling is the neighborhood topological structure representation of the central node.
3) Constructing the secondary graph. The secondary graph is a new undirected weighted graph constructed on the basis of the original graph. The neighborhood topological structure similarity of each node pair is calculated from the neighborhood topological structure representations obtained in step 2), and the weight of the edge between the node pair on the secondary graph is calculated from it; the specific calculation is shown in formula (III) below.
4) Generating the vector representation of a node. A random walk is first performed on the secondary graph constructed in step 3) to generate a node context sequence for each node, and a vector representation is then generated for each node from its context sequence using the Skip-Gram algorithm.
The technical scheme of the invention is as follows:
an abnormal financial organization hierarchical division system based on a neighborhood topological structure comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchical division module which are connected in sequence;
the data preprocessing module is used for: sequentially carrying out duplication removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracting transaction account information and transaction counter account information in the abnormal financial organization transaction flow data, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, an abstract, a transaction balance and a transaction place;
the neighborhood topology feature extraction module is used for: processing data of a transaction network of an abnormal financial organization by using a user-defined network representation learning method based on a neighborhood topological structure, extracting neighborhood topological structure information of account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;
the abnormal financial organization hierarchy partitioning module is used for: firstly, performing dimension reduction on low-dimensional dense vectors generated by each account by using a PCA (principal component analysis) method, and then performing clustering operation on the low-dimensional dense vectors subjected to the dimension reduction by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization.
The working method of the abnormal financial organization hierarchy dividing system runs in a computer and comprises the following steps:
(1) The data preprocessing module sequentially performs duplicate removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracts a transaction account number and a transaction counter-account number of each transaction record in the abnormal financial organization transaction flow data, and constructs a transaction network of the abnormal financial organization;
(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined network representation learning method based on a neighborhood topological structure, extracts neighborhood topological structure information of account number nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account number;
(3) And the abnormal financial organization hierarchical division module firstly performs dimension reduction on the low-dimensional dense vector generated by each account by using a PCA method, and then performs clustering operation on the low-dimensional dense vector subjected to the dimension reduction by using a k-means algorithm to obtain the abnormal financial organization hierarchical division result.
The original financial transaction flow data suffers from data redundancy, missing data, non-standard formats and similar defects, so data cleaning is first required to extract valid transaction record information. Preferably, in step (1), the data cleaning includes:
A. Incomplete data removal: cleaning data with a non-standard format, i.e. data in which the number of digits of the transaction account number and the counter-party account number is inconsistent with the number of digits of a standard bank card account number; such data is regarded as incomplete data and is removed.
B. Denoising: removing records whose transaction amount is less than 50 yuan; the abnormal transactions of an abnormal financial organization generally involve large amounts, so removing small-amount transaction records helps to improve the accuracy of the hierarchical division.
C. Deduplication: removing redundant data, i.e. cases where the same transaction is recorded twice in different ways in the transaction flow data of the abnormal financial organization. For example, when a transaction takes place between Account_1 and Account_2, it may be recorded both as "transaction account number is Account_1, counter-party account number is Account_2" and as "transaction account number is Account_2, counter-party account number is Account_1"; one of the two records needs to be deleted.
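As a non-limiting illustration, cleaning steps A to C may be sketched in Python as follows. The record field names (`account`, `counterparty`, `amount`, `time`), the 19-digit account length and the choice of deduplication key are assumptions made for this sketch only; the text above fixes only the 50 yuan threshold.

```python
from typing import Dict, List

STANDARD_ACCOUNT_DIGITS = 19  # assumption: digit count of a standard bank card number
MIN_AMOUNT = 50.0             # threshold named in step B (yuan)


def clean_records(records: List[Dict]) -> List[Dict]:
    """Apply steps A (incomplete data), B (denoising) and C (deduplication)."""
    cleaned = []
    seen = set()
    for rec in records:
        acct, peer = str(rec["account"]), str(rec["counterparty"])
        # A. drop records whose account numbers do not have the standard digit count
        if not all(a.isdigit() and len(a) == STANDARD_ACCOUNT_DIGITS for a in (acct, peer)):
            continue
        # B. drop small-amount transactions
        if rec["amount"] < MIN_AMOUNT:
            continue
        # C. treat (A -> B) and (B -> A) with the same amount and time as one transaction
        #    (assumed key; the text does not say how mirrored records are matched)
        key = (frozenset((acct, peer)), rec["amount"], rec["time"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

In this sketch the first copy of a mirrored record is kept and the second is dropped; either choice satisfies step C.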
Preferably, in step (1), extracting the transaction account information and the counter-party account information from the transaction flow data of the abnormal financial organization and constructing the transaction network of the abnormal financial organization includes:
D. Obtaining the account set $C = \{c_1, c_2, \dots, c_i, \dots, c_n\}$ involved in the transaction flow data of the abnormal financial organization, where $n$ is the total number of accounts, $C$ is the set of all accounts, and $c_i$ is the $i$-th account in the set;
E. Constructing the point set $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ of the transaction network according to $f(c_i) = v_i$, i.e. each account $c_i$ is mapped to a positive integer in the range $[1, n]$, with $v_i = i$, $i = 1, 2, \dots, n$;
F. Constructing the edge set $E$ of the transaction network according to the point set $V$, the account set $C$ and the transaction flow data, where $e_{i,j} = (v_i, v_j)$; each edge $e_{i,j} \in E$ has a corresponding weight $w_{i,j}$, which represents the number of transactions between accounts $c_i$ and $c_j$, with $w_{i,j} = w_{j,i}$;
G. Finally, constructing the transaction network: $G = \{V, E\}$.
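As a non-limiting illustration, steps D to G may be sketched in Python as follows. The function name `build_transaction_network` and the use of sorted order to decide which account receives which integer are assumptions of this sketch; the text only requires a one-to-one mapping onto $[1, n]$.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_transaction_network(pairs: Iterable[Tuple[str, str]]):
    """Return (node_id, weights) from (account, counter-party) pairs.

    node_id maps each account c_i to the integer v_i = i (steps D and E);
    weights maps an undirected edge (i, j), i <= j, to the transaction count w_ij (step F).
    """
    pairs = list(pairs)
    accounts = sorted({acct for pair in pairs for acct in pair})      # D. account set C
    node_id: Dict[str, int] = {acct: i for i, acct in enumerate(accounts, start=1)}  # E.
    weights: Counter = Counter()
    for a, b in pairs:
        i, j = sorted((node_id[a], node_id[b]))
        weights[(i, j)] += 1   # storing one key per unordered pair gives w_ij = w_ji
    return node_id, dict(weights)   # G. the network is the pair (V, E) with weights
```

Storing each undirected edge once under the key `(i, j)` with `i <= j` is one simple way to keep the weights symmetric.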
Preferably, step (2) comprises the following steps:
In a network, nodes with similar neighborhood structures often have similar functions and play similar roles; correspondingly, account nodes with similar neighborhood topological structures in the transaction network of an abnormal financial organization are often at the same level in the organization. The NTF method designed by the invention can effectively extract the neighborhood topological structure information of the nodes and thereby effectively complete the network hierarchical division task.
Given a graph $G = (V, E)$, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ represents the set of $n$ nodes and $E$ represents the edge set of graph $G$; for any edge $e_{i,j}$ there is a corresponding weight $w_{i,j} \ge 0$; if there is no edge between two nodes $v_i$ and $v_j$, then $w_{i,j} = 0$, otherwise $w_{i,j} > 0$; if $G$ is an undirected graph, then $e_{i,j} \equiv e_{j,i}$ and $w_{i,j} = w_{j,i}$; if $G$ is an unweighted graph, then $w_{i,j} = 1$, otherwise $w_{i,j} \ge 0$;
The energy level of node $v$ relative to node $u$ is $el_u(v)$. The initial energy is set to $Q_u(u) = 1$ and there are $k^*$ energy levels in total, the energy range of the $k$-th energy level being [formula not reproduced]. The residual energy on reaching point $v$ from point $u$ is $Q_u(v)$; if $Q_u(v)$ falls within the energy range of the $k$-th energy level, then $el_u(v) = k$; the set of all nodes with energy level $k$ relative to point $u$ is $N_k(u)$, $N_k(u) = \{v_i \mid el_u(v_i) = k\}$;
The maximum sampling depth $d^*$ is the largest distance from a sampled node to the central node when sampling expands outward layer by layer, by distance, from a given central node; nodes whose distance to the central node exceeds $d^*$ are not included in the sampling range;
The secondary graph $G' = \{V, E', S\}$ is an undirected, weighted complete graph generated on the basis of the original graph, where $E'$ is its edge set; every edge $e_{i,j}$ has a corresponding weight $s_{i,j} \ge 0$, which reflects the neighborhood structure similarity of node $v_i$ and node $v_j$;
H. Generating an influence sub-network for each node through the NTF algorithm; the subsequent sampling process is performed on this influence sub-network. A node is taken from the transaction network $G$ as the central node; the NTF model screens out from $G$ the neighbor nodes that are more important to the neighborhood structure of the central node, and these nodes together with their edges form the influence subgraph, i.e. the influence sub-network, of the central node. Specifically: let the central node be $u$ with initial energy $Q_u(u) = 1$; starting from point $u$, energy propagates along the edges until it is at the lowest energy level or the remaining energy is not enough to reach the next node. During propagation, the energy dissipation rate depends on factors such as the distance from a node to $u$, the degree of the node and the edge weight: the closer a node is to $u$, the larger its degree and the larger the edge weight, the more important the node is to the neighborhood structure of $u$ and the less energy is lost on reaching it. Given a path $P = \{u, v_1, \dots, v_i, v_j\}$, where the distance from $v_j$ to the central node $u$ is $k$, the residual energy $Q_u(v_j)$ when energy starting from point $u$ propagates along the path via $v_i$ and reaches $v_j$ is shown in formula (I):
In formula (I), $Q_u(v_i)$ is the residual energy on arriving at point $v_i$ along path $P$ from node $u$, $k$ is the distance from $v_j$ to $u$, $w_{i,j}$ is the weight of edge $e_{i,j}$, $a$ is the energy decay rate, $l(P)$ is the length of path $P$, and $d(v_j)$ is the degree of node $v_j$;
Note that since there may be multiple paths from $u$ to $v_j$, $v_j$ may belong to both $N_{k_1}(u)$ and $N_{k_2}(u)$ at the same time, where $k_1 < k_2$. In that case, according to the proximity principle, it finally belongs to $N_{k_1}(u)$; a node is not allowed to be at multiple energy levels at the same time.
Starting from point $u$, the set of reachable nodes is denoted $V(u)$ and the set of edges is denoted $E(u)$; the influence subgraph is $G(u) = \{V(u), E(u), N\}$, where $N = \{el_u(v) \mid v \in V(u)\}$ is the set of energy levels of the reached nodes;
I. Sampling the neighborhood topological structure features on the basis of the influence sub-network, which includes: according to a given sampling depth $d^*$, starting from the central node in the influence sub-network, a breadth-first search strategy is adopted to expand the sampling layer by layer by distance, and each node reached during sampling is represented by its energy level and its degree, where the energy level of the node is a relative feature and the degree is an absolute feature;
Define $R_i(u)$ as the set of nodes at distance $i$ from the central node $u$; $DE_i(u)$ denotes the set of ordered pairs of absolute and relative features of the nodes reached when expanding from node $u$ to sampling depth $i$, defined as shown in formula (II):
$$DE_i(u) = \{(d(v), el_u(v)) \mid v \in R_i(u)\} \tag{II}$$
In formula (II), $d(v)$ is the degree of node $v$ and $el_u(v)$ is the energy level of node $v$ relative to node $u$.
The elements of $DE_i(u)$ are arranged in ascending order, with the degree of the node as the primary key and the energy level of the node as the secondary key; combining the absolute and relative features of the nodes in this way yields the neighborhood structure feature of the central node;
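As a non-limiting illustration, the layer-by-layer sampling of step I and the ordering of $DE_i(u)$ may be sketched in Python as follows. The energy levels are taken as a given input here, because they depend on formula (I); the adjacency-dictionary representation, the function name and the use of the degree within the supplied adjacency are assumptions of this sketch.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def neighborhood_features(adj: Dict[int, Set[int]], energy_level: Dict[int, int],
                          u: int, max_depth: int) -> List[List[Tuple[int, int]]]:
    """Return [DE_1(u), ..., DE_max_depth(u)] on the influence sub-network `adj`.

    Each DE_i(u) is a list of (degree, energy level) pairs for the nodes at
    distance i from u, sorted by degree first and energy level second.
    """
    dist = {u: 0}
    queue = deque([u])
    layers: List[List[Tuple[int, int]]] = [[] for _ in range(max_depth)]
    while queue:
        x = queue.popleft()
        if dist[x] == max_depth:
            continue  # nodes farther than the sampling depth are not sampled
        for y in adj[x]:
            if y not in dist:  # breadth-first: the first visit gives the distance
                dist[y] = dist[x] + 1
                layers[dist[y] - 1].append((len(adj[y]), energy_level[y]))
                queue.append(y)
    # tuple comparison sorts by degree (primary key), then energy level (secondary key)
    return [sorted(layer) for layer in layers]
```

Sorting Python tuples gives the ascending order by primary and secondary key directly, so no custom comparison is needed.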
J. Constructing the secondary graph according to the neighborhood structure features of the nodes; the random walk operation is performed on this secondary graph.
Define $distance(u, v)$ as the neighborhood topological structure distance between nodes $u$ and $v$; it is calculated from the neighborhood topological structure feature sets $DE(u)$ and $DE(v)$ of nodes $u$ and $v$, as shown in formula (III):
$dist(\cdot)$ is a function that calculates the distance between two sequences;
the distance of the $k$-th hop neighborhood exists if and only if nodes $u$ and $v$ both have neighbors at distance $k$, where $w_i$ represents the proportion that the $i$-th hop neighborhood distance contributes to the total distance, calculated as shown in formula (IV):
In the present model, the DTW (Dynamic Time Warping) algorithm is used; in practice, the FastDTW algorithm is adopted to balance calculation speed and accuracy. For the DTW algorithm, the difference between any two elements is defined as shown in formula (V):
The difference takes both the degree and the energy level of the nodes into account: $a_1$, $b_1$ are the first elements of the ordered pairs $a$, $b$, i.e. the degrees of the nodes, and $a_2$, $b_2$ are the second elements of the ordered pairs, i.e. the energy levels of the nodes;
On this basis the secondary graph is constructed. In the secondary graph, the weight $s_{u,v}$ of the edge $(u, v)$ between two points $u$ and $v$ reflects their neighborhood structure similarity and is calculated as shown in formula (VI):
$$s_{u,v} = e^{-distance(u,v)} \tag{VI}$$
Note that $s_{u,v} = s_{v,u}$. In this way the neighborhood structure distance of two nodes is mapped to a value in the interval $(0, 1]$;
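As a non-limiting illustration, the sequence distance and the mapping of formula (VI) may be sketched in Python as follows. This is plain DTW, not FastDTW. The element cost `element_cost` is only a stand-in chosen for this sketch (sum of absolute differences of degree and energy level), since formula (V) is not reproduced in this text; the per-hop weighting of formulas (III) and (IV) is likewise left out.

```python
import math
from typing import Callable, Sequence, Tuple

Pair = Tuple[int, int]  # (degree, energy level)


def element_cost(a: Pair, b: Pair) -> float:
    """Assumed stand-in for formula (V); not the definition used in the source."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dtw(seq_a: Sequence[Pair], seq_b: Sequence[Pair],
        cost: Callable[[Pair, Pair], float] = element_cost) -> float:
    """Classic dynamic time warping distance between two ordered sequences."""
    n, m = len(seq_a), len(seq_b)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def similarity(distance: float) -> float:
    """Formula (VI): s = exp(-distance), which lies in (0, 1] for distance >= 0."""
    return math.exp(-distance)
```

Identical feature sequences give a distance of 0 and therefore a similarity of 1, the upper end of the interval.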
K. Generating the vector representation of each node, i.e. the low-dimensional dense vector, which refers to:
performing random walks on the secondary graph to generate the context sequences of the nodes, where the probability of moving from the current node $u$ to another node is calculated from the weights of the secondary graph edges, as shown in formula (VII):
$Z(u)$ is a normalization factor; as the walk probability shows, starting from the current node, the walk is more likely to move to a node whose structure is similar to that of the current node, so the context of node $u$ consists of nodes structurally similar to $u$, regardless of their relative positions in the network.
Using the Skip-Gram algorithm, a vector representation is learned for each node. Nodes whose neighborhood topological structures are similar have vector representations that are closer together in the vector space.
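As a non-limiting illustration, a weighted random walk on the secondary graph may be sketched in Python as follows. Formula (VII) is not reproduced in this text, so the sketch assumes the common choice that the transition probability is the edge weight divided by the sum of the weights leaving the current node, which plays the role of $Z(u)$; the nested-dictionary layout `sim[u][v]` and the function name are also assumptions.

```python
import random
from typing import Dict, List, Optional


def random_walk(sim: Dict[int, Dict[int, float]], start: int, length: int,
                rng: Optional[random.Random] = None) -> List[int]:
    """Generate one node context sequence of at most `length` nodes.

    sim[u][v] is the secondary-graph weight s_uv. The next node is drawn with
    probability proportional to s_uv (assumed form of the walk probability).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        u = walk[-1]
        neighbors = [v for v in sim[u] if v != u]
        if not neighbors:
            break
        weights = [sim[u][v] for v in neighbors]
        # random.choices normalizes the weights internally
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

Passing a seeded `random.Random` instance makes the generated sequences reproducible, which is convenient when comparing division results across runs.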
Preferably, according to the invention, learning a vector representation for each node using the Skip-Gram algorithm comprises the following steps:
(1) generating a one-hot vector for each node: for each node $v_i \in V$, the corresponding one-hot vector has its $i$-th element equal to 1 and all other elements equal to 0;
(2) letting $d$ be the dimension of the generated dense vectors, randomly initializing the matrices $U$ and $W$, where $U_i$ is the node vector of node $v_i$ when it is the central node and $W_i$ is the node vector of node $v_i$ when it is a background node;
(3) letting $v_i$ denote a central node, $v_i \in V$, and $context(v_i)$ denote the set of background nodes of $v_i$, with $|context(v_i)| = wsize$, where $wsize$ is the window length; the objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):
if $v_j \in context(v_i)$, then $v_i \in context(v_j)$, i.e. $v_i$ and $v_j$ are each other's background node and central node; the background nodes are the nodes whose distance to the central node in the node context sequence does not exceed $wsize$;
(4) updating the matrices $U$ and $W$ using the back-propagation algorithm so as to maximize $f(\text{Skip-Gram})$; for each node $v_i \in V$, the low-dimensional dense vector generated by the Skip-Gram algorithm is $W_i$, recorded as $vec_i$.
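As a non-limiting illustration, the selection of background nodes in step (3) may be sketched in Python as follows. The sketch only enumerates (central node, background node) training pairs from one context sequence; it assumes a window that extends `wsize` positions on both sides of the central node, and it does not implement the objective of formula (VIII) or the training of $U$ and $W$.

```python
from typing import List, Tuple


def context_pairs(walk: List[int], wsize: int) -> List[Tuple[int, int]]:
    """List (center, background) pairs for every position in a context sequence.

    A background node is any node at most `wsize` positions away from the
    central node in the sequence (symmetric window; an assumption of this sketch).
    """
    pairs = []
    for pos, center in enumerate(walk):
        lo = max(0, pos - wsize)
        hi = min(len(walk), pos + wsize + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((center, walk[ctx_pos]))
    return pairs
```

With a symmetric window the pair relation is symmetric as well, matching the statement that $v_j \in context(v_i)$ implies $v_i \in context(v_j)$.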
Preferably, in step (3), the PCA method is used to reduce the dimension of the low-dimensional dense vector generated for each account, where the low-dimensional dense vector of $v_i$ is $vec_i$. This comprises the following steps:
L. constructing the matrix $X$ from the node vectors $vec_i$;
M. calculating the covariance matrix $Cov$ of $X$;
N. calculating the eigenvalues and eigenvectors of $Cov$: $(\lambda A - Cov)x = 0$, where $\lambda$ is an eigenvalue of $Cov$, $A$ is the identity matrix and $x$ is an eigenvector of $Cov$;
O. setting the dimension after reduction to $d'$, arranging the eigenvectors as rows from top to bottom in descending order of their eigenvalues, and taking the first $d'$ rows to obtain the matrix $P$;
P. letting the data after dimension reduction be $Y$, $Y = PX$, where $y_i$ is the low-dimensional dense vector of $v_i$ after dimension reduction.
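As a non-limiting illustration, steps L to P may be sketched with NumPy as follows. The sketch assumes that $X$ holds one node vector per column (so that $Y = PX$ has one reduced vector per column), that $X$ is mean-centred before the covariance is computed, and that the covariance is normalized by $n$; none of these conventions is stated in the text above, and the function name is hypothetical.

```python
import numpy as np


def pca_reduce(X: np.ndarray, d_out: int) -> np.ndarray:
    """Reduce X of shape (d, n), one column per node vector, to shape (d_out, n)."""
    Xc = X - X.mean(axis=1, keepdims=True)      # centre each feature (assumption)
    cov = Xc @ Xc.T / X.shape[1]                # M. covariance matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)      # N. eigen-decomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1][:d_out]   # O. indices of the d_out largest eigenvalues
    P = eigvecs[:, order].T                     # rows of P are the selected eigenvectors
    return P @ Xc                               # P. Y = P X on the centred data
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reversal.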
Preferably, in step (3), the k-means algorithm is used to cluster the dimension-reduced low-dimensional dense vectors to obtain the hierarchical division result of the abnormal financial organization. The dimension-reduced node vector corresponding to $v_i$ is $y_i$, and the input sample set of the k-means algorithm is $S = \{y_i \mid i \in [1, n]\}$; with $Classnum$ as the number of categories and $Iternum$ as the number of iterations, the algorithm comprises the following steps:
Q. randomly selecting $Classnum$ node vectors from $S$ as the initial centroid vectors of the categories; the centroid vectors are $\{\mu_1, \mu_2, \dots, \mu_{Classnum}\}$ and the corresponding categories are $\{class_1, class_2, \dots, class_c\}$;
R. for each $y_i$, calculating $dist(y_i, \mu_j)$ and assigning $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; $dist(y_i, \mu_j)$ is calculated as shown in formula (IX):
S. updating the centroid vector of each category, where the update formula is shown in formula (X):
T. executing step R and step S in sequence, terminating after $Iternum$ executions, and outputting the classification result.
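As a non-limiting illustration, steps Q to T may be sketched in Python as follows. Formulas (IX) and (X) are not reproduced in this text, so the sketch assumes the usual k-means choices: Euclidean distance for $dist(y_i, \mu_j)$ and the arithmetic mean of the members for the centroid update. The function name and the handling of empty categories are likewise assumptions.

```python
import math
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def kmeans(samples: List[Vector], classnum: int, iternum: int,
           rng: Optional[random.Random] = None) -> List[int]:
    """Return, for each sample, the index of the category it is assigned to."""
    rng = rng or random.Random()
    centroids = [list(v) for v in rng.sample(samples, classnum)]  # Q. random initial centroids
    labels = [0] * len(samples)
    for _ in range(iternum):                                       # T. fixed number of rounds
        # R. assign each sample to the nearest centroid (Euclidean distance, assumed)
        for i, y in enumerate(samples):
            labels[i] = min(range(classnum), key=lambda j: math.dist(y, centroids[j]))
        # S. move each centroid to the mean of its members (assumed update rule)
        for j in range(classnum):
            members = [samples[i] for i, lab in enumerate(labels) if lab == j]
            if members:  # an empty category keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The returned labels are arbitrary category indices; mapping them onto organization levels is outside the scope of this sketch.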
The invention has the beneficial effects that:
The invention uses the transaction flow records of an abnormal financial organization to construct the transaction network of the organization, extracts the neighborhood topological structure features of the account nodes in the network through the custom NTF algorithm, and generates a low-dimensional dense vector for each account node based on those features. The vectors are then reduced in dimension with the PCA algorithm, and the dimension-reduced node vectors are processed with the k-means algorithm to finally obtain the hierarchy information of the account nodes. The abnormal financial organization hierarchical division algorithm based on the neighborhood topological structure provided by the invention requires little information: only the two transfer parties of each transaction record of the abnormal financial organization are needed, with no additional information items. This reduces manual involvement and labor cost to a certain extent while still obtaining a good hierarchical division result, and realizes automated hierarchical division of abnormal financial organizations. The method can assist the relevant staff in analyzing and judging abnormal financial organizations and improve their working efficiency.
Detailed Description
The invention is further described below with reference to the drawings and examples, but is not limited thereto.
Example 1
An abnormal financial organization hierarchy dividing system based on a neighborhood topological structure is shown in figure 1 and comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchy dividing module which are connected in sequence;
the data preprocessing module is used for: sequentially carrying out duplication removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracting transaction account information and transaction counter account information in the abnormal financial organization transaction flow data, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, a summary, a transaction balance and a transaction place;
the neighborhood topology feature extraction module is used for: processing the data of the transaction network of the abnormal financial organization by using a custom network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;
the abnormal financial organization hierarchy partitioning module is used for: firstly, carrying out dimension reduction processing on low-dimensional dense vectors generated by each account by using a PCA (principal component analysis) method, and then carrying out clustering operation on the low-dimensional dense vectors subjected to the dimension reduction processing by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization.
Example 2
The working method of the abnormal financial organization hierarchy dividing system in the embodiment 1 is operated in a computer, as shown in fig. 3, and comprises the following steps:
(1) The data preprocessing module sequentially performs duplicate removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracts a transaction account number and a transaction counter-account number of each transaction record in the abnormal financial organization transaction flow data, and constructs a transaction network of the abnormal financial organization;
The original financial transaction flow data suffers from data redundancy, missing data, non-standard formats and similar defects, so data cleaning is first required to extract valid transaction record information. The data cleaning includes:
A. Incomplete data removal: cleaning data with a non-standard format, i.e. data in which the number of digits of the transaction account number and the counter-party account number is inconsistent with the number of digits of a standard bank card account number; such data is regarded as incomplete data and is removed.
B. Denoising: removing records whose transaction amount is less than 50 yuan; the abnormal transactions of an abnormal financial organization generally involve large amounts, so removing small-amount transaction records helps to improve the accuracy of the hierarchical division.
C. Deduplication: removing redundant data, i.e. cases where the same transaction is recorded twice in different ways in the transaction flow data of the abnormal financial organization. For example, when a transaction takes place between Account_1 and Account_2, it may be recorded both as "transaction account number is Account_1, counter-party account number is Account_2" and as "transaction account number is Account_2, counter-party account number is Account_1"; one of the two records needs to be deleted.
Extracting the transaction account information and the counter-party account information from the transaction flow data of the abnormal financial organization and constructing the transaction network of the abnormal financial organization includes:
D. Obtaining the account set $C = \{c_1, c_2, \dots, c_i, \dots, c_n\}$ involved in the transaction flow data of the abnormal financial organization, where $n$ is the total number of accounts, $C$ is the set of all accounts, and $c_i$ is the $i$-th account in the set;
E. Constructing the point set $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ of the transaction network according to $f(c_i) = v_i$, i.e. each account $c_i$ is mapped to a positive integer in the range $[1, n]$, with $v_i = i$, $i = 1, 2, \dots, n$;
F. Constructing the edge set $E$ of the transaction network according to the point set $V$, the account set $C$ and the transaction flow data, where $e_{i,j} = (v_i, v_j)$; each edge $e_{i,j} \in E$ has a corresponding weight $w_{i,j}$, which represents the number of transactions between accounts $c_i$ and $c_j$, with $w_{i,j} = w_{j,i}$;
G. Finally, constructing the transaction network: $G = \{V, E\}$.
(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined neighborhood topological structure-based network representation learning method, extracts neighborhood topological structure information of account nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account; the method comprises the following steps:
In a network, nodes with similar neighborhood structures often have similar functions and play similar roles; correspondingly, account nodes with similar neighborhood topological structures in the transaction network of an abnormal financial organization are often at the same level in the organization. The NTF method designed by the invention can effectively extract the neighborhood topological structure information of the nodes and thereby effectively complete the network hierarchical division task.
Given a graph $G = (V, E)$, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ represents the set of $n$ nodes and $E$ represents the edge set of graph $G$; for any edge $e_{i,j}$ there is a corresponding weight $w_{i,j} \ge 0$; if there is no edge between two nodes $v_i$ and $v_j$, then $w_{i,j} = 0$, otherwise $w_{i,j} > 0$; if $G$ is an undirected graph, then $e_{i,j} \equiv e_{j,i}$ and $w_{i,j} = w_{j,i}$; if $G$ is an unweighted graph, then $w_{i,j} = 1$, otherwise $w_{i,j} \ge 0$;
The energy level of node $v$ relative to node $u$ is $el_u(v)$. The initial energy is set to $Q_u(u) = 1$ and there are $k^*$ energy levels in total, the energy range of the $k$-th energy level being [formula not reproduced]. The residual energy on reaching point $v$ from point $u$ is $Q_u(v)$; if $Q_u(v)$ falls within the energy range of the $k$-th energy level, then $el_u(v) = k$; the set of all nodes with energy level $k$ relative to point $u$ is $N_k(u)$, $N_k(u) = \{v_i \mid el_u(v_i) = k\}$;
The maximum sampling depth $d^*$ is the largest distance from a sampled node to the central node when sampling expands outward layer by layer, by distance, from a given central node; nodes whose distance to the central node exceeds $d^*$ are not included in the sampling range;
The quadratic graph $G' = \{V, E', S\}$ is an undirected, weighted complete graph generated on the basis of the original graph; $E'$ is its edge set, and every edge $e_{i,j} \in E'$ has a corresponding weight $s_{i,j} \ge 0$ that reflects the neighborhood structure similarity of node $v_i$ and node $v_j$;
H. Generating an influence sub-network for each node through the NTF algorithm; the subsequent sampling process is performed on the influence sub-network. As shown in fig. 2, a node is taken from the transaction network G as the central node; the NTF model screens out from G the neighbor nodes that are most important to the neighborhood structure of the central node, and these neighbor nodes together with their edges form the influence subgraph, i.e. the influence sub-network, of the central node. The method comprises the following steps: let the central node be $u$ with initial energy $Q_u(u) = 1$; starting from point $u$, energy propagates along the edges until the energy is at the lowest energy level or the remaining energy is not enough to reach the next node. During propagation, the energy dissipation rate depends on factors such as the distance from a node to $u$, the degree of the node, and the edge weight: the closer a node is to $u$, the larger its degree, and the larger the edge weight, the greater the importance of the node to the neighborhood structure of $u$ and the less energy is lost on reaching it. Given a path $P = \{u, v_1, \dots, v_i, v_j\}$ in which the distance from $v_j$ to the central node $u$ is $k$, the energy starting from point $u$ propagates along the path via $v_i$ and reaches $v_j$ with residual energy $Q_u(v_j)$, as shown in formula (I):
In formula (I), $Q_u(v_i)$ denotes the residual energy on arriving at point $v_i$ along path $P$ from node $u$, $k$ denotes the distance from $v_j$ to $u$, $w_{i,j}$ denotes the weight of edge $e_{i,j}$, $a$ is the energy decay rate, $l(P)$ denotes the length of path $P$, and $d(v_j)$ denotes the degree of node $v_j$;
Note that since there may be multiple paths from $u$ to $v_j$, $v_j$ may belong to both $N_{k_1}(u)$ and $N_{k_2}(u)$ at the same time, where $k_1 < k_2$. In that case it is finally assigned to only one of them according to the proximity principle; a node is not allowed to be at multiple energy levels at the same time.
Starting from point $u$, the set of reachable nodes is denoted $V(u)$ and the set of edges is denoted $E(u)$; the influence subgraph is $G(u) = \{V(u), E(u), N\}$, where $N = \{el_u(v) \mid v \in V(u)\}$ is the set of energy levels of the reached nodes;
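The energy propagation of step H can be sketched as follows, under several loudly stated assumptions. The dissipation rule and the energy-level ranges are not reproduced here, so both are passed in as caller-supplied functions (`step_energy` and `level_of`, hypothetical names). When several paths reach the same node, this sketch keeps the largest residual energy, which is one possible reading of the rule that a node has a single energy level. The helpers `halve` and `log2_level` are placeholders for illustration only and are not formula (I).

```python
import heapq
import itertools
import math


def influence_subgraph(adj, u, step_energy, level_of, k_star):
    """Propagate energy from u over adj = {node: {neighbor: weight}}.

    Returns {node: (residual_energy, energy_level)} for every reached node.
    step_energy(q, hops, w, deg) must return a value no larger than q.
    """
    best = {u: 1.0}  # Q_u(u) = 1
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(-1.0, next(tie), 0, u)]
    while heap:
        neg_q, _, hops, v = heapq.heappop(heap)
        q = -neg_q
        # Skip stale entries and stop expanding at the lowest energy level.
        if q < best[v] or level_of(q) >= k_star:
            continue
        for nb, w in adj[v].items():
            q_nb = step_energy(q, hops + 1, w, len(adj[nb]))
            if q_nb <= 0.0 or level_of(q_nb) > k_star:
                continue  # not enough energy left to reach this neighbor
            if q_nb > best.get(nb, 0.0):
                best[nb] = q_nb
                heapq.heappush(heap, (-q_nb, next(tie), hops + 1, nb))
    return {v: (q, level_of(q)) for v, q in best.items()}


def halve(q, hops, w, deg):
    """Placeholder dissipation rule (assumption): lose half the energy per hop."""
    return q * 0.5


def log2_level(q):
    """Placeholder level ranges (assumption): level k covers (2**-k, 2**-(k-1)]."""
    return 1 + math.floor(-math.log2(q))
```

With a path graph a-b-c-d, `halve`, `log2_level` and `k_star = 3`, the nodes a, b and c are reached at levels 1, 2 and 3, and d lies outside the influence sub-network.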
I. Sampling the neighborhood topological structure features on the basis of the influence sub-network, comprising the following steps: according to a given sampling depth $d^*$, starting from the central node in the influence sub-network, a breadth-first search strategy is adopted to expand the sampling layer by layer according to distance; each node expanded during sampling is represented by its energy level and its degree, where the energy level of a node is a relative feature and the degree is an absolute feature;
Define $R_i(u)$ as the set of nodes at distance $i$ from the central node $u$; $DE_i(u)$ denotes the set of ordered pairs of absolute and relative features of the nodes reached when expanding from node $u$ to sampling depth $i$, defined as shown in formula (II):
$$DE_i(u) = \{(d(v), el_u(v)) \mid v \in R_i(u)\} \tag{II}$$
In formula (II), $d(v)$ denotes the degree of node $v$ and $el_u(v)$ denotes the energy level of node $v$ relative to node $u$.
The elements of $DE_i(u)$ are arranged in ascending order, with the degree of the node as the primary key and the energy level of the node as the secondary key. Combining the absolute and relative features of the nodes in this way yields the neighborhood structure feature of the central node;
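The layer-by-layer sampling of step I and formula (II) can be sketched as follows. The function name `sample_neighborhood` and the inputs `adj` (adjacency of the influence sub-network) and `energy_level` (a mapping from node to $el_u(v)$) are hypothetical names; the sketch also assumes that the degree is measured in the supplied adjacency and that layer 0 contains the central node itself.

```python
def sample_neighborhood(adj, energy_level, u, max_depth):
    """Return [DE_0(u), DE_1(u), ...] up to max_depth.

    Each DE_i(u) is a list of (degree, energy level) pairs sorted in ascending
    order, with degree as the primary key and energy level as the secondary key.
    """
    visited = {u}
    frontier = [u]  # R_0(u): the central node itself
    layers = []
    depth = 0
    while frontier and depth <= max_depth:
        # Tuples sort by first element (degree), then second (energy level).
        layers.append(sorted((len(adj[v]), energy_level[v]) for v in frontier))
        next_frontier = []
        for v in frontier:
            for nb in adj[v]:
                if nb not in visited:  # breadth-first: each node joins one layer only
                    visited.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
        depth += 1
    return layers
```

For a central node with two neighbors of degrees 2 and 1 and energy levels 1 and 2, the first layer is `[(1, 2), (2, 1)]`: the lower degree comes first regardless of energy level.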
J. Constructing a quadratic graph according to the neighborhood structure features of the nodes; the random walk operation is performed on the quadratic graph.
Define distance(u, v) as the neighborhood topological structure distance of nodes $u, v$; it is calculated from the neighborhood topological structure feature sets $DE(u)$ and $DE(v)$ of nodes $u, v$, as shown in formula (III):
dist(·) is a function that measures the distance between two sequences;
the $k$-th hop neighborhood distance exists if and only if nodes $u, v$ both have neighbors at distance $k$; $w_i$ denotes the contribution ratio of the $i$-th hop neighborhood distance to the total distance and is calculated as shown in formula (IV):
In the present model, the DTW (Dynamic Time Warping) algorithm is used; in practical calculation, the FastDTW algorithm is adopted to balance calculation speed and accuracy. For the DTW algorithm, the difference between any two elements is defined as shown in formula (V):
The difference comprehensively considers the degree and energy level features of the nodes: $a_1, b_1$ denote the first elements of the ordered pairs $a, b$, i.e. the degrees of the nodes, and $a_2, b_2$ denote the second elements of the ordered pairs, i.e. the energy levels of the nodes;
On this basis, the quadratic graph is constructed. For two points $u, v$ in the quadratic graph, the weight $s_{u,v}$ of the edge $(u, v)$ embodies their neighborhood structure similarity and is calculated as shown in formula (VI):
$$s_{u,v} = e^{-\mathrm{distance}(u,v)} \tag{VI}$$
Note that $s_{u,v} = s_{v,u}$. In this way, the neighborhood structure distance of two nodes is mapped to a value within the interval $(0, 1]$;
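The chain from per-hop sequence distance to edge weight in step J can be sketched as follows. The DTW recurrence and the mapping $s_{u,v} = e^{-\mathrm{distance}(u,v)}$ follow the text; the element difference `pair_cost` (a sum of absolute differences) and the caller-supplied `hop_weights` are assumptions introduced for illustration, and plain DTW is used here in place of FastDTW.

```python
import math


def dtw(seq_a, seq_b, cost):
    """Classic dynamic time warping distance between two non-empty sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


def pair_cost(a, b):
    """Assumed element difference: |a1 - b1| on degree plus |a2 - b2| on energy level."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def structural_distance(de_u, de_v, hop_weights):
    """Weighted sum of per-hop DTW distances over the hops both nodes have."""
    hops = min(len(de_u), len(de_v), len(hop_weights))
    return sum(hop_weights[i] * dtw(de_u[i], de_v[i], pair_cost) for i in range(hops))


def similarity(distance):
    """Edge weight of the quadratic graph: s = exp(-distance), a value in (0, 1]."""
    return math.exp(-distance)
```

Two nodes with identical per-hop feature sequences have distance 0 and therefore similarity 1, the upper end of the interval.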
K. Generating the vector representation of each node, i.e. a low-dimensional dense vector, which means:
A random walk is carried out on the quadratic graph to generate the context sequence of each node; the probability of transferring from the current node $u$ to another node is calculated from the edge weights of the quadratic graph, as shown in formula (VII):
$Z(u)$ is a normalization factor. According to this walk probability, a walk starting from the current node tends to move to nodes whose structure is similar to that of the current node, so that the context of node $u$ contains nodes structurally similar to $u$,
regardless of their relative location in the network.
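A random walk on the quadratic graph can be sketched as follows. The sketch assumes that the transition probability from $u$ to $v$ is proportional to the edge weight $s_{u,v}$ and that the normalization factor is the sum of the weights leaving $u$; the names `transition_probs`, `random_walk` and the nested dictionary `sim` are hypothetical.

```python
import random


def transition_probs(sim, u):
    """Assumed walk probability: p(u -> v) = s_uv / Z(u), with Z(u) the sum over v != u."""
    z = sum(s for v, s in sim[u].items() if v != u)
    return {v: s / z for v, s in sim[u].items() if v != u}


def random_walk(sim, start, length, rng=None):
    """Generate one context sequence of `length` nodes starting from `start`."""
    rng = rng or random.Random()
    walk = [start]
    for _ in range(length - 1):
        probs = transition_probs(sim, walk[-1])
        nodes = list(probs)
        weights = [probs[v] for v in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk
```

Because structurally similar nodes have larger $s_{u,v}$, they are visited more often and thus appear together in the generated context sequences even when they are far apart in the original network.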
Using the Skip-Gram algorithm, a vector representation is learned for each node. The vector representations of nodes whose neighborhood topologies are close are closer in vector space.
Preferably, according to the invention, a vector representation is learned for each node using the Skip-Gram algorithm. The method comprises the following steps:
(1) generating a one-hot vector for each node: for each node $v_i \in V$, the corresponding one-hot vector has its $i$-th element equal to 1 and all other elements equal to 0;
(2) assuming the dimension of the generated dense vectors is $d$, randomly initializing the matrices $U$ and $W$; $U_i$ denotes the node vector of node $v_i$ when it serves as the central node, and $W_i$ denotes the node vector of node $v_i$ when it serves as a background node;
(3) let $v_i$ denote a central node, $v_i \in V$; $context(v_i)$ denotes the context of $v_i$, and $|context(v_i)| = wsize$, where wsize is the window length; the objective function f(Skip-Gram) of Skip-Gram is shown in formula (VIII):
If $v_j \in context(v_i)$, then $v_i \in context(v_j)$, i.e. $v_i$ and $v_j$ are each other's background node and central node; the background nodes of a central node are the nodes that are no more than wsize away from it in the context sequence;
(4) updating the matrices $U$ and $W$ using the back propagation algorithm so as to maximize f(Skip-Gram); for each $v_i \in V$, the low-dimensional dense vector generated by the Skip-Gram algorithm is $W_i$, recorded as $vec_i$.
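Steps (1) to (4) can be sketched with NumPy as follows. The sketch assumes a full-softmax Skip-Gram objective trained by plain gradient ascent, with nodes already mapped to integer indices; the function name `train_skipgram` and the parameters `epochs`, `lr` and `seed` are hypothetical, and practical implementations usually rely on negative sampling or hierarchical softmax instead.

```python
import numpy as np


def train_skipgram(walks, n_nodes, dim, wsize, epochs=50, lr=0.05, seed=0):
    """Learn U (center vectors) and W (background vectors), each of shape (n_nodes, dim)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_nodes, dim))  # random initialization, step (2)
    W = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for walk in walks:
            for pos, i in enumerate(walk):
                lo = max(0, pos - wsize)
                hi = min(len(walk), pos + wsize + 1)
                for cpos in range(lo, hi):
                    if cpos == pos:
                        continue
                    j = walk[cpos]  # background node within the window
                    h = U[i].copy()  # same as one_hot(i) @ U, step (1)
                    scores = W @ h
                    p = np.exp(scores - scores.max())
                    p /= p.sum()  # softmax over all nodes
                    err = -p
                    err[j] += 1.0  # gradient of log p(j | i) w.r.t. the scores
                    U[i] += lr * (W.T @ err)  # ascent step on the center vector
                    W += lr * np.outer(err, h)  # ascent step on the background vectors
    return U, W
```

Following the text, row `W[i]` would then be taken as the vector of node $v_i$.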
(3) The abnormal financial organization hierarchical division module first performs dimension reduction on the low-dimensional dense vector generated for each account by using the PCA method, and then performs a clustering operation on the dimension-reduced vectors by using the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization.
The PCA method is used to perform dimension reduction on the low-dimensional dense vector generated for each account, the low-dimensional dense vector of $v_i$ being $vec_i$. The method comprises the following steps:
L. constructing the matrix $X$,
M. calculating the covariance matrix Cov of $X$,
N. calculating the eigenvalues and eigenvectors of Cov: $(\lambda A - \mathrm{Cov})x = 0$, where $\lambda$ is an eigenvalue of Cov, $A$ is the identity matrix, and $x$ is an eigenvector of Cov;
O. setting the dimensionality after reduction to $d'$, arranging the eigenvectors as rows from top to bottom in descending order of eigenvalue, and taking the first $d'$ rows to obtain the matrix $P$,
P. setting the data after dimensionality reduction as $Y$, $Y = PX$;
$y_i$ is the dimension-reduced low-dimensional dense vector of $v_i$.
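The PCA steps above can be sketched with NumPy as follows. The sketch assumes that $X$ holds one node vector per column, that each dimension is centered before the covariance is computed, and that the covariance is normalized by the number of nodes; these details and the function name `pca_reduce` are assumptions made for illustration.

```python
import numpy as np


def pca_reduce(vectors, d_prime):
    """Reduce node vectors (one per row of `vectors`) to d_prime dimensions."""
    X = np.asarray(vectors, dtype=float).T  # d x n, one column per node
    X = X - X.mean(axis=1, keepdims=True)  # assumption: center each dimension
    cov = X @ X.T / X.shape[1]  # covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d_prime]  # indices of the d' largest eigenvalues
    P = eigvecs[:, order].T  # d' x d, eigenvectors as rows
    Y = P @ X  # d' x n
    return Y.T  # row i is y_i
```

`np.linalg.eigh` is used because the covariance matrix is symmetric, which keeps the eigenvalues real and the eigenvectors orthonormal.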
The dimension-reduced low-dimensional dense vectors are clustered by using the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization. The dimension-reduced node vector corresponding to $v_i$ is $y_i$.
The input sample set of the k-means algorithm is $S = \{y_i \mid i \in [1, n]\}$; Classnum is the number of classification categories and Iternum is the number of iterations. The method comprises the following steps:
Q. randomly selecting Classnum node vectors from $S$ as the initial centroid vectors of the categories; the centroid vectors are $\{\mu_1, \mu_2, \dots, \mu_{Classnum}\}$ and the corresponding categories are $\{class_1, class_2, \dots, class_{Classnum}\}$;
R. for each $y_i \in S$, calculating $dist(y_i, \mu_j)$ and assigning $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; $dist(y_i, \mu_j)$ is calculated as shown in formula (IX):
S. updating the centroid vector of each category; the update formula is shown in formula (X):
T. executing step R and step S in sequence, terminating after Iternum executions, and outputting the classification result.
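Steps Q to T can be sketched in plain Python as follows. The sketch assumes that dist is the Euclidean distance and that the centroid update takes the mean of the vectors assigned to a category; the function names and the `seed` parameter are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Assumed distance between a sample and a centroid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(samples, classnum, iternum, seed=0):
    """Cluster `samples` (a list of equal-length vectors) into `classnum` categories."""
    rng = random.Random(seed)
    # Step Q: pick Classnum samples at random as the initial centroids.
    centroids = [list(c) for c in rng.sample(samples, classnum)]
    labels = [0] * len(samples)
    for _ in range(iternum):  # Step T: repeat R and S Iternum times
        # Step R: assign each sample to the category of its nearest centroid.
        labels = [
            min(range(classnum), key=lambda j: euclidean(y, centroids[j]))
            for y in samples
        ]
        # Step S: move each centroid to the mean of its members (assumption).
        for j in range(classnum):
            members = [y for y, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a category is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Running a fixed number of iterations, rather than testing for convergence, mirrors the termination rule of step T.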
The method of the embodiment is adopted to process the transaction flow data set T of an abnormal financial organization input by a certain user.
The transaction flow data set adopted in this example contains 105 account nodes and 784 transfer records. The accounts are divided into 3 types: 22 high-level accounts, 11 special function accounts, and 72 bottom-level accounts.
This example uses six indices to evaluate the effect of the present embodiment: CHI (Calinski-Harabasz Index), SC (Silhouette Coefficient), DBI (Davies-Bouldin Index), ARI (Adjusted Rand Index), AMI (Adjusted Mutual Information) and V-measure. The first three indices, CHI, SC and DBI, are internal criteria used to evaluate the quality of the clusters obtained after clustering; the last three indices, ARI, AMI and V-measure, are external criteria used to compare how well the clustering result matches the real distribution.
Fig. 4 (a) is a schematic diagram of the cluster visualization result obtained by the NTF method provided in this embodiment. Fig. 4 (b) is a schematic diagram of the cluster visualization result obtained by the existing BoostNE method; the existing BoostNE method is described in Li J, Wu L, Guo R, et al. Multi-level network embedding with boosted low-rank matrix approximation [C]// Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2019: 49-56. Fig. 4 (c) is a schematic diagram of the cluster visualization result obtained by the existing Role2Vec method; the existing Role2Vec method is described in Ahmed N K, Rossi R, Lee J B, et al. Fig. 4 (d) is a schematic diagram of the cluster visualization result obtained by the existing Struc2Vec method; the existing Struc2Vec method is described in Ribeiro L F R, Saverese P H P, Figueiredo D R. struc2vec: Learning node representations from structural identity [C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.
In fig. 4 (a) to 4 (d), the abscissa and ordinate are the coordinate values on the two-dimensional plane obtained when the dense vectors generated for each node by the algorithms (NTF, BoostNE, Role2Vec, and Struc2Vec) are reduced to two-dimensional vectors. Since the vectors are normalized when the clustering result is evaluated using the relevant indices (CHI, SC, DBI, ARI, AMI, and V-measure), the absolute magnitude of a node's abscissa and ordinate does not affect the clustering result.
As can be seen from fig. 4 (a) to 4 (d), the two methods BoostNE and Role2Vec do not distinguish the three types of nodes well. Struc2Vec distinguishes category 3 from categories 1 and 2 fairly well, but does not distinguish category 1 from category 2 well. The NTF method provided by the invention distinguishes all three types of nodes well and obtains a better clustering effect.
The three indices CHI, SC and DBI are used to measure the clustering effect of NTF and of the three comparison methods BoostNE, Role2Vec and Struc2Vec; the results are shown in Table 1.
TABLE 1
A clustering operation is carried out with the k-means algorithm on the dimension-reduced node vectors generated by the four methods, and ARI, AMI and V-measure are then used to measure the clustering effect; the results are shown in Table 2.
TABLE 2
As can be seen from Tables 1 and 2, compared with the three existing methods BoostNE, Role2Vec, and Struc2Vec, the NTF method provided by the present invention obtains the best results on all 6 indices. This reflects that the NTF method effectively extracts the neighborhood topological structure information of nodes, gathering nodes of the same type together while separating nodes of different types as far as possible, and thus obtains a better clustering result. The NTF method can also better support subsequent node classification tasks.