CN112150285B - Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure - Google Patents

Info

Publication number
CN112150285B
CN112150285B CN202011009471.0A CN202011009471A CN112150285B CN 112150285 B CN112150285 B CN 112150285B CN 202011009471 A CN202011009471 A CN 202011009471A CN 112150285 B CN112150285 B CN 112150285B
Authority
CN
China
Prior art keywords
node
transaction
financial organization
nodes
abnormal financial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011009471.0A
Other languages
Chinese (zh)
Other versions
CN112150285A (en)
Inventor
王巍
王佰玲
辛国栋
刘扬
黄俊恒
马东阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN202011009471.0A
Publication of CN112150285A
Application granted
Publication of CN112150285B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an abnormal financial organization hierarchical division system based on a neighborhood topological structure and a working method thereof. The system comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchical division module, connected in sequence. The data preprocessing module cleans the input transaction flow data of the abnormal financial organization and constructs the organization's transaction network. The neighborhood topological structure feature extraction module generates a low-dimensional dense vector for each account. The hierarchical division module reduces the dimension of each account's low-dimensional dense vector and then clusters the vectors to obtain the hierarchical division result of the abnormal financial organization. The invention needs only the two parties to each transfer in the organization's transaction records, which reduces manual involvement and labor cost to a certain extent, yields a good hierarchical division result, and automates the hierarchical division of abnormal financial organizations.

Description

Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure
Technical Field
The invention relates to an abnormal financial organization hierarchy dividing system and a working method thereof, in particular to an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure and a working method thereof.
Background
An abnormal financial organization is an organization whose financial transactions behave abnormally. There are various types, the common ones being pyramid selling organizations, illegal fundraising organizations and money laundering organizations. The pyramid selling organization is a typical abnormal financial organization. Pyramid selling is a criminal activity that disturbs the economic order and is essentially a large-scale fraud: money from newly joined members is used to pay interest and short-term returns to earlier members, creating the illusion of profit in order to attract more investment. A pyramid selling organization has a typical pyramid structure, and the transaction network formed by transfers among its members also has a typical hierarchy. Accurately determining the level of each member within the organization from its transaction network is therefore of real practical significance. At present, analyzing member levels from the organization's transfer transactions relies largely on manual judgment and requires a great deal of manpower. Faced with large volumes of complex and wide-ranging pyramid selling data, manual judgment cannot meet the demands of large-scale data analysis and processing.
Methods based on network representation learning have made great progress in extracting network information. Such methods map the nodes of a network to low-dimensional dense vectors, which are then used in downstream tasks. Many studies borrow algorithms from natural language processing and apply the Word2Vec model to network representation learning, achieving good learning results and producing node vectors well suited to downstream tasks. Domestic research on network representation learning mainly includes the LINE, SDNE and TADW algorithms. However, the LINE and SDNE algorithms mainly capture first-order and second-order neighbor information of the nodes, so their sampling range is small. A pyramid selling network is a multi-level network, and these two methods cannot comprehensively capture the level information of account nodes in the pyramid selling transaction network. They tend to group accounts that are close to each other in the transfer network into one class and cannot divide the network into levels. The TADW algorithm is designed for social networks and uses heterogeneous information in the form of the text published by each node. Applying TADW to a pyramid selling transaction network requires additional types of data beyond the transaction network, such as transaction summaries, call records and chat records. More staff are then needed to collect information, the up-front data preprocessing becomes labor intensive, and labor cost cannot be effectively reduced.
Most existing methods do not analyze the specific characteristics of the pyramid selling transaction network. They therefore cannot effectively extract the level information of account nodes and complete the hierarchical division of abnormal financial organizations in a way that reduces labor-intensive work and raises the degree of automation of member-level division.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure.
The invention also provides a working method of the abnormal financial organization hierarchy dividing system.
The method extracts the transaction flow information of an abnormal financial organization, constructs the abnormal transaction network from it, extracts the neighborhood topological structure features of the accounts with a custom network representation learning method based on the neighborhood topological structure, generates a low-dimensional dense vector for each account, and then clusters the node vectors with the k-means algorithm to complete the hierarchical division of the abnormal financial organization. The invention can be used for: 1) constructing the transaction network of an abnormal financial organization from its financial transaction flow; 2) hierarchical division of the abnormal financial organization based on neighborhood topological structure features.
Interpretation of terms:
1. The NTF algorithm is a network representation learning method. Its key ideas are as follows:
1) The structural similarity of a node pair is measured independently of the relative positions of the two nodes. The similarity of two nodes depends only on their neighborhood topological structures, not on whether the two nodes are connected or on their positions, labels or attributes in the network.
2) The concept of an energy level is introduced to measure how important a neighborhood node is within the neighborhood topological structure of the central node, and this importance is used as a feature of the neighborhood node in describing that structure. Following the idea of energy dissipation, the neighborhood nodes with a large influence on the central node are screened out to build the influence sub-network of the central node, which effectively reduces interference from noisy data.
3) The relative and absolute features of the neighborhood nodes are used to represent the structural features of the central node. This takes into account both the inherent attributes of the neighbor nodes and their importance to the neighborhood structure of the central node, and so reflects the neighborhood structural features of the central node better.
4) The neighborhood structure of a node is represented hierarchically. Starting from the central node, sampling proceeds layer by layer by distance from the central node, in the manner of breadth-first search, which describes the topological features of the central node's neighborhood structure more reasonably.
2. The NTF model mainly comprises the following four steps:
1) Generate the influence sub-network of each node. Using the energy level concept of the NTF model to measure the importance of neighbor nodes relative to the central node, first screen out from the network the neighbor nodes that are most important to the neighborhood topological structure of the central node, and construct the influence sub-network of the central node from these nodes and their edges.
2) Obtain the neighborhood topological structure representation of each node. Given a maximum sampling depth, start from the central node on the influence sub-network obtained in step 1), apply a breadth-first search strategy and extend the sample outward layer by layer by distance. Each neighbor node reached during sampling is represented by its energy level, which is a relative feature, and its degree, which is an absolute feature. The sequence of neighbor nodes obtained by this extended sampling is the neighborhood topological structure representation of the central node.
3) Construct the secondary graph. The secondary graph is a new undirected weighted graph built on the basis of the original graph. The neighborhood topological structure similarity of each node pair is calculated from the representations obtained in step 2) and determines the weight of the edge between the pair on the secondary graph; the specific calculation is shown in formula (III) below.
4) Generate the vector representation of each node. First perform random walks on the secondary graph constructed in step 3) to generate a node context sequence for each node, then use the Skip-Gram algorithm to generate a representation for each node from its context sequence.
The technical scheme of the invention is as follows:
an abnormal financial organization hierarchical division system based on a neighborhood topological structure comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchical division module, which are connected in sequence;
the data preprocessing module is used for: sequentially performing deduplication, denoising and incomplete-data removal on the input transaction flow data of the abnormal financial organization, extracting the transaction account and counterparty account information from the data, and constructing the transaction network of the abnormal financial organization; the transaction flow data comprises the transaction account number, counterparty account number, transaction amount, transaction time, summary, transaction balance and transaction place;
the neighborhood topological structure feature extraction module is used for: processing the transaction network of the abnormal financial organization with a custom network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a low-dimensional dense vector for each account;
the abnormal financial organization hierarchical division module is used for: first reducing the dimension of the low-dimensional dense vector generated for each account with the PCA (principal component analysis) method, and then clustering the reduced vectors with the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization.
The working method of the abnormal financial organization hierarchy dividing system runs on a computer and comprises the following steps:
(1) The data preprocessing module sequentially performs deduplication, denoising and incomplete-data removal on the input transaction flow data of the abnormal financial organization, extracts the transaction account number and counterparty account number of each transaction record, and constructs the transaction network of the abnormal financial organization;
(2) The neighborhood topological structure feature extraction module processes the transaction network of the abnormal financial organization with a custom network representation learning method based on the neighborhood topological structure, extracts the neighborhood topological structure information of the account nodes of the transaction network, and generates a low-dimensional dense vector for each account;
(3) The abnormal financial organization hierarchical division module first reduces the dimension of the low-dimensional dense vector generated for each account with the PCA method, and then clusters the reduced vectors with the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization.
Raw financial transaction flow data suffers from redundancy, missing data, non-standard formats and similar defects, so data cleaning is required first to extract the valid transaction records. Preferably, in step (1), the data cleaning includes:
A. Incomplete data removal: cleaning data with a non-standard format, i.e. records in which the number of digits of the transaction account number and counterparty account number is inconsistent with that of a standard bank card account number; such records are regarded as incomplete data and removed.
B. Denoising: removing records whose transaction amount is less than 50 yuan; abnormal transactions of abnormal financial organizations generally involve large amounts, so removing small-amount transaction records helps improve the accuracy of the hierarchical division.
C. Deduplication: removing redundant data, i.e. cases where the same transaction appears twice in the transaction flow data, recorded in different ways. For example, when a transaction takes place between $Account_1$ and $Account_2$, it may be recorded both as "transaction account $Account_1$, counterparty account $Account_2$" and as "transaction account $Account_2$, counterparty account $Account_1$"; one of the two records must be deleted.
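The three cleaning rules can be illustrated with a minimal Python sketch. The field names (`account`, `counterparty`, `amount`, `time`), the set of accepted card-number lengths and the choice to treat two records as the same transfer when they share the unordered account pair, amount and time are all assumptions made for illustration; the text does not specify them. The sketch also drops a record when either account number has a non-standard length.

```python
from typing import Dict, List

CARD_LENGTHS = {16, 19}   # hypothetical "standard" card-number lengths
MIN_AMOUNT = 50.0         # denoising threshold in yuan, from rule B


def clean_transactions(records: List[Dict]) -> List[Dict]:
    """Apply rules A (incomplete data), B (denoising) and C (deduplication)."""
    cleaned, seen = [], set()
    for rec in records:
        src, dst = str(rec["account"]), str(rec["counterparty"])
        # A. Drop records whose account numbers have a non-standard length.
        if len(src) not in CARD_LENGTHS or len(dst) not in CARD_LENGTHS:
            continue
        # B. Drop small transactions.
        if float(rec["amount"]) < MIN_AMOUNT:
            continue
        # C. Assumption: a transfer recorded from both sides shares the same
        #    unordered account pair, amount and time; keep the first copy.
        key = (frozenset((src, dst)), float(rec["amount"]), rec["time"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

The order of the checks does not change the result here, because each rule only discards records.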
According to the invention, in step (1), the transaction account and counterparty account information in the transaction flow data of the abnormal financial organization is extracted to construct the transaction network of the abnormal financial organization, as follows:
D. Let $C = \{c_1, c_2, \dots, c_i, \dots, c_n\}$ be the set of all accounts involved in the transaction flow data of the abnormal financial organization, where $n$ is the total number of accounts and $c_i$ is the $i$-th account in the set;
E. Build the point set $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ of the transaction network according to $f(c_i) = v_i$: each account $c_i$ is mapped to a positive integer in the range $[1, n]$, with $v_i = i$, $i = 1, 2, \dots, n$;
F. Construct the edge set of the transaction network from the point set $V$, the account set $C$ and the transaction flow data, as given in Figure GDA0003715238920000041, where $e_{i,j} = (v_i, v_j)$. For the edges specified in Figure GDA0003715238920000042 there is a corresponding weight $w_{i,j}$, which is the number of transactions between accounts $c_i$ and $c_j$, with $w_{i,j} = w_{j,i}$;
G. Finally, construct the transaction network $G = \{V, E\}$.
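As a rough illustration of steps D to G, the following Python sketch maps accounts to integer node ids and counts transactions per account pair. The function name, the alphabetical ordering used to assign ids and the decision to ignore self-transfers are assumptions, not details taken from the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_transaction_network(pairs: Iterable[Tuple[str, str]]):
    """Return (node_of, V, W): account -> node id, node list, edge weights."""
    pairs = list(pairs)
    # D. Account set C (sorted only to make the numbering reproducible).
    accounts = sorted({acc for pair in pairs for acc in pair})
    # E. f(c_i) = v_i = i, with i in [1, n].
    node_of: Dict[str, int] = {acc: i for i, acc in enumerate(accounts, start=1)}
    V: List[int] = list(node_of.values())
    # F. w_{i,j} counts the transactions between c_i and c_j. Storing each
    #    edge under a sorted key makes w_{i,j} = w_{j,i} hold by construction.
    W: Counter = Counter()
    for src, dst in pairs:
        i, j = node_of[src], node_of[dst]
        if i != j:  # assumption: self-transfers are ignored
            W[(min(i, j), max(i, j))] += 1
    # G. The network is G = {V, E}, with E given by the keys of W.
    return node_of, V, W
```

For example, the transfers `[("B", "A"), ("A", "B"), ("A", "C")]` yield three nodes and two edges, with weight 2 on the edge between the nodes of `A` and `B`.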
Preferably, step (2) comprises the following steps:
In a network, nodes with similar neighborhood structures often have similar functions and play similar roles. In the transaction network of an abnormal financial organization, this means that account nodes with similar neighborhood topological structures tend to sit at the same level of the organization. The NTF method designed by the invention extracts the neighborhood topological structure information of each node and thereby completes the network hierarchical division task.
Given a graph $G = (V, E)$, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ represents a set of $n$ nodes, and the set given in Figure GDA0003715238920000043 represents the edge set of graph $G$. Every edge $e_{i,j}$ has a corresponding weight $w_{i,j} \ge 0$. If there is no edge between two nodes $v_i, v_j$, then $w_{i,j} = 0$; otherwise $w_{i,j} > 0$. If $G$ is an undirected graph, then $e_{i,j} \equiv e_{j,i}$ and $w_{i,j} = w_{j,i}$. If $G$ is an unweighted graph, then $w_{i,j} = 1$; otherwise $w_{i,j} \ge 0$.
Node $v$ has an energy level $el_u(v)$ relative to node $u$. The initial energy is set to $Q_u(u) = 1$ and there are $k^*$ energy levels in total; the energy range of the $k$-th level is given in Figure GDA0003715238920000044. Let $Q_u(v)$ be the residual energy on arriving at $v$ from $u$; if $Q_u(v)$ satisfies the condition in Figure GDA0003715238920000045, then $el_u(v) = k$. The set of all nodes whose energy level relative to $u$ is $k$ is $N_k(u) = \{v_i \mid el_u(v_i) = k\}$.
The maximum sampling depth $d^*$ is the largest distance from a sampled node to the central node when sampling is centered on a node and expanded layer by layer by distance; nodes farther than $d^*$ from the central node fall outside the sampling range.
The secondary graph $G' = \{V, E', S\}$ is an undirected, weighted complete graph generated on the basis of the original graph. Its edge set is given in Figure GDA0003715238920000046. As specified in Figure GDA0003715238920000047, every edge $e_{i,j}$ has a corresponding weight $s_{i,j} \ge 0$ that reflects the neighborhood structure similarity of nodes $v_i$ and $v_j$;
H. Generate an influence sub-network for each node with the NTF algorithm; the subsequent sampling is also performed on this influence sub-network. Taking a node of the transaction network $G$ as the central node, the NTF model screens out the neighbor nodes that matter most to the neighborhood structure of the central node and, together with their edges, forms the influence subgraph (influence sub-network) of the central node. The procedure is as follows. Let the central node be $u$ with initial energy $Q_u(u) = 1$. Starting from $u$, energy propagates along the edges until it falls to the lowest energy level or the remaining energy is not enough to reach the next node. During propagation, the energy dissipation rate depends on factors such as the distance from a node to $u$, the degree of the node and the edge weight: the closer a node is to $u$, the larger its degree and the larger the edge weight, the more important the node is to the neighborhood structure of $u$ and the less energy is lost in reaching it. Given a path $P = \{u, v_1, \dots, v_i, v_j\}$ on which the distance from $v_j$ to the central node $u$ is $k$, the residual energy $Q_u(v_j)$ when the energy starting from $u$ propagates along the path through $v_i$ and reaches $v_j$ is shown in formula (I):
Figure GDA0003715238920000051
In formula (I), $Q_u(v_i)$ is the residual energy on arriving at $v_i$ from node $u$ along path $P$, $k$ is the distance from $v_j$ to $u$, $w_{i,j}$ is the weight of edge $e_{i,j}$, $a$ is the energy decay rate, $l(P)$ is the length of path $P$, and the symbol in Figure GDA0003715238920000055 is the degree of node $v_j$.
Because there may be multiple paths from $u$ to $v_j$, $v_j$ may belong at the same time to the sets given in Figure GDA0003715238920000052, where $k_1 < k_2$. By the proximity principle it is finally assigned to the set given in Figure GDA0003715238920000053; a node is not allowed to be at multiple energy levels at the same time.
Starting from $u$, the set of reachable nodes is denoted $V(u)$ and the set of edges traversed is denoted $E(u)$. The influence subgraph is $G(u) = \{V(u), E(u), N\}$, where $N = \{el_u(v) \mid v \in V(u)\}$ is the set of energy levels of the reached nodes;
I. Sample the neighborhood topological structure features on the influence sub-network: given a sampling depth $d^*$, start from the central node in the influence sub-network, apply a breadth-first search strategy and expand the sample layer by layer by distance. Each node reached during sampling is represented by its energy level and its degree, where the energy level is a relative feature and the degree is an absolute feature.
Define $R_i(u)$ as the set of nodes at distance $i$ from the central node $u$, and let $DE_i(u)$ be the set of ordered pairs of absolute and relative features of the nodes reached from $u$ at sampling depth $i$, as shown in formula (II):
$$DE_i(u) = \{(d(v), el_u(v)) \mid v \in R_i(u)\} \tag{II}$$
In formula (II), $d(v)$ is the degree of node $v$ and $el_u(v)$ is the energy level of node $v$ relative to node $u$.
The elements of $DE_i(u)$ are sorted in ascending order, with the node degree as the primary key and the node energy level as the secondary key. Combining the absolute and relative features of the nodes, the feature given in Figure GDA0003715238920000054 is the neighborhood structure feature of the central node;
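The layer-by-layer sampling of formula (II) can be sketched in Python as follows. The sketch assumes that the influence sub-network is already available as an adjacency dictionary `adj` and that the energy levels $el_u(v)$ have been precomputed into a dictionary `energy`; both names are hypothetical. It also assumes that the degree is counted inside the influence sub-network, which the text does not state.

```python
from typing import Dict, List, Set, Tuple


def sample_neighborhood(adj: Dict[int, Set[int]], energy: Dict[int, int],
                        u: int, max_depth: int) -> List[List[Tuple[int, int]]]:
    """Return [DE_1(u), ..., DE_d(u)] as sorted lists of (degree, energy level)."""
    visited = {u}
    frontier = [u]
    layers = []
    for _ in range(max_depth):
        # R_i(u): nodes first reached at distance i from u (breadth-first).
        nxt = sorted({w for v in frontier for w in adj[v]} - visited)
        if not nxt:
            break
        visited.update(nxt)
        # Tuples sort by degree first (primary key), then by energy level
        # (secondary key), both ascending.
        layers.append(sorted((len(adj[v]), energy[v]) for v in nxt))
        frontier = nxt
    return layers
```

Each returned layer is an ordered sequence of pairs, which is the form a sequence-distance function needs as input.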
J. Construct the secondary graph from the neighborhood structure features of the nodes; the random walk is later performed on this secondary graph.
Define $distance(u, v)$ as the neighborhood topological structure distance of nodes $u$ and $v$. It is calculated from their neighborhood topological structure feature sets $DE(u)$ and $DE(v)$, as shown in formula (III):
Figure GDA0003715238920000061
$dist(\cdot)$ is a function that calculates the distance between two sequences.
The distance of the $k$-th hop neighborhood exists if and only if nodes $u$ and $v$ both have neighbors at distance $k$. $w_i$ is the proportion that the $i$-th hop neighborhood distance contributes to the total distance, calculated as shown in formula (IV):
Figure GDA0003715238920000062
In the present model, the DTW (Dynamic Time Warping) algorithm is used; in practice the FastDTW algorithm is adopted to balance calculation speed and accuracy. For the DTW algorithm, the difference between any two elements is defined as shown in formula (V):
Figure GDA0003715238920000063
The difference takes both the degree and the energy level of the nodes into account: $a_1, b_1$ are the first elements of the ordered pairs $a, b$, i.e. the node degrees, and $a_2, b_2$ are the second elements, i.e. the node energy levels;
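To make the sequence distance concrete, here is a minimal Python sketch of plain dynamic time warping over sequences of (degree, energy level) pairs. It is the classic quadratic-time DTW, not the FastDTW approximation mentioned in the text. The element cost `pair_cost` is a hypothetical stand-in, because formula (V) is not reproduced in this text.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[int, int]  # (degree, energy level)


def pair_cost(a: Pair, b: Pair) -> float:
    # Hypothetical element difference; NOT the patent's formula (V).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dtw(xs: Sequence[Pair], ys: Sequence[Pair],
        cost: Callable[[Pair, Pair], float] = pair_cost) -> float:
    """Classic O(len(xs) * len(ys)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(xs), len(ys)
    # D[i][j] = cheapest alignment of the first i items of xs
    # with the first j items of ys.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(xs[i - 1], ys[j - 1])
            D[i][j] = step + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because DTW may align one element with several elements of the other sequence, two layers of different sizes can still be compared, which suits neighborhoods with different numbers of nodes.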
On this basis the secondary graph is constructed. For two points $u, v$ in the secondary graph, the weight $s_{u,v}$ of the edge $(u, v)$ reflects their neighborhood structure similarity and is calculated as shown in formula (VI):
$$s_{u,v} = e^{-distance(u,v)} \tag{VI}$$
Note that $s_{u,v} = s_{v,u}$. In this way the neighborhood structure distance of two nodes is mapped to a value in the interval $(0, 1]$;
K. Generate the vector representation of each node, i.e. its low-dimensional dense vector, as follows:
Perform random walks on the secondary graph to generate the context sequences of the nodes. The probability of moving from the current node $u$ to another node is calculated from the weights of the secondary graph edges, as shown in formula (VII):
Figure GDA0003715238920000064
$Z(u)$ is a normalization factor. The walk probability shows that, starting from the current node $u$, the walk tends to move to nodes whose structure is similar to that of the current node, so the context of $u$ consists of nodes with a structure similar to that of $u$ (Figure GDA0003715238920000065),
regardless of their relative location in the network.
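A short Python sketch of the similarity weight and the walk on the secondary graph follows. The weight function implements $s_{u,v} = e^{-distance(u,v)}$ from formula (VI). The transition rule $p(u, v) = s_{u,v} / Z(u)$, with $Z(u)$ the sum of the weights leaving $u$, is an assumption consistent with the description of $Z(u)$ as a normalization factor, since formula (VII) itself is not reproduced in this text. The nested dictionary `S` holding the weights is a hypothetical data layout.

```python
import math
import random
from typing import Dict, List


def similarity(distance: float) -> float:
    """Formula (VI): map a neighborhood distance to a weight in (0, 1]."""
    return math.exp(-distance)


def transition_probs(S: Dict[int, Dict[int, float]], u: int) -> Dict[int, float]:
    """Assumed walk probability: p(u, v) = s_{u,v} / Z(u)."""
    z = sum(s for v, s in S[u].items() if v != u)  # normalization factor Z(u)
    return {v: s / z for v, s in S[u].items() if v != u}


def random_walk(S: Dict[int, Dict[int, float]], start: int, length: int,
                rng: random.Random) -> List[int]:
    """Generate one node context sequence of the given length."""
    walk = [start]
    while len(walk) < length:
        probs = transition_probs(S, walk[-1])
        nodes = list(probs)
        walk.append(rng.choices(nodes, weights=[probs[v] for v in nodes])[0])
    return walk
```

Passing a seeded `random.Random` instance keeps the generated sequences reproducible between runs.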
Using the Skip-Gram algorithm, a vector representation is learned for each node. Nodes whose neighborhood topological structures are similar have vector representations that lie closer together in the vector space.
Preferably, according to the invention, learning a vector representation for each node with the Skip-Gram algorithm comprises the following steps:
(1) Generate a one-hot vector for each node: for the nodes specified in Figure GDA0003715238920000071, generate the corresponding one-hot vectors as given in Figure GDA0003715238920000072 and Figure GDA0003715238920000073, with the other elements being 0;
(2) Let $d$ be the dimension of the generated dense vectors and randomly initialize the matrix $U$ (Figure GDA0003715238920000074), where $U_i$ is the node vector of $v_i$ when it acts as the central node, and the matrix $W$ (Figure GDA0003715238920000075), where $W_i$ is the node vector of $v_i$ when it acts as a background node;
(3) Let $v_i \in V$ be a central node and $context(v_i)$ the background nodes of $v_i$, with $|context(v_i)| = wsize$, where $wsize$ is the window length. The objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):
Figure GDA0003715238920000076
If $v_j \in context(v_i)$, then $v_i \in context(v_j)$, i.e. $v_i$ and $v_j$ are background nodes of each other. The background nodes of a central node are the nodes no more than $wsize$ positions away from it in the node context sequence;
(4) Update the matrices $U$ and $W$ with the back-propagation algorithm so as to maximize $f(\text{Skip-Gram})$. For the nodes specified in Figure GDA0003715238920000077, the low-dimensional dense vector generated by the Skip-Gram algorithm is $W_i$, recorded as $vec_i$ (Figure GDA0003715238920000078).
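The notion of background nodes in step (3) can be illustrated with a small Python sketch that lists the (central node, background node) pairs a Skip-Gram model would train on. Only the pair extraction is shown; the training of $U$ and $W$ is not. The function name is hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def skipgram_pairs(walks: Iterable[Sequence[int]],
                   wsize: int) -> List[Tuple[int, int]]:
    """Return (central node, background node) pairs within wsize positions."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo, hi = max(0, i - wsize), min(len(walk), i + wsize + 1)
            pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs
```

For the single context sequence `[1, 2, 3]` with `wsize = 1`, the pairs are `(1, 2)`, `(2, 1)`, `(2, 3)` and `(3, 2)`, which shows the symmetry stated in step (3): whenever one node is a background node of another, the reverse also holds.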
Preferably, in step (3), the PCA method is used to reduce the dimension of the low-dimensional dense vector generated for each account, where the low-dimensional dense vector of $v_i$ is $vec_i$. The method comprises the following steps:
L. Construct the matrix $X$ (Figure GDA0003715238920000079);
M. Calculate the covariance matrix $Cov$ of $X$ (Figure GDA00037152389200000710);
N. Calculate the eigenvalues and eigenvectors of $Cov$ by solving $(\lambda A - Cov)x = 0$, where $\lambda$ is an eigenvalue of $Cov$, $A$ is the identity matrix and $x$ is an eigenvector of $Cov$;
O. Set the dimension after reduction to $d'$, arrange the eigenvectors as rows from top to bottom in descending order of eigenvalue, and take the first $d'$ rows to obtain the matrix $P$ (Figure GDA00037152389200000711);
P. Let $Y$ be the data after dimension reduction, $Y = PX$; the vector given in Figure GDA00037152389200000712 is the low-dimensional dense vector of $v_i$ after dimension reduction.
Preferably, in step (3), the k-means algorithm is used to cluster the dimension-reduced low-dimensional dense vectors to obtain the hierarchical division result of the abnormal financial organization. The dimension-reduced node vector corresponding to v_i is y_i,
Figure GDA00037152389200000713
The input sample set of the k-means algorithm is S = {y_i | i ∈ [1, n]}; Classnum is the number of categories and Iternum is the number of iterations. The method comprises the following steps:
Q. randomly selecting Classnum node vectors from S as the initial centroid vectors of the categories, the centroid vectors being {μ_1, μ_2, …, μ_Classnum} and the corresponding categories being {class_1, class_2, …, class_c};
R.
Figure GDA0003715238920000081
calculating dist(y_i, μ_j) and assigning y_i to the category class_j of the centroid μ_j nearest to y_i, where dist(y_i, μ_j) is calculated as shown in formula (IX):
Figure GDA0003715238920000082
S. updating the centroid vector of each category, the update formula being shown in formula (X):
Figure GDA0003715238920000083
T. executing step R and step S in sequence, terminating after Iternum executions, and outputting the classification result.
The invention has the beneficial effects that:
The invention uses the transaction flow records of an abnormal financial organization to construct the transaction network of the organization, extracts the neighborhood topological structure features of the account nodes in the network through the self-defined NTF algorithm, and generates a low-dimensional dense vector for each account node based on these features. The vectors are then reduced in dimension with the PCA algorithm, and the reduced node vectors are processed with the k-means algorithm to obtain the hierarchy information of the account nodes. The proposed hierarchy division algorithm requires little information: only the two transfer parties of each transaction record of the abnormal financial organization are needed, without any additional information items. This reduces manual participation and labor cost to a certain extent while still yielding a good hierarchy division result, and automates the hierarchy division of abnormal financial organizations. The method can assist relevant personnel in analyzing and judging abnormal financial organizations and improve their working efficiency.
Drawings
FIG. 1 is a block diagram of an abnormal financial organization hierarchy partitioning system based on neighborhood topology of the present invention;
FIG. 2 is a schematic diagram of influence sub-network sampling according to the present invention;
FIG. 3 is a schematic flow chart of the working method of the abnormal financial organization hierarchy partitioning system of the present invention;
FIG. 4 (a) is a schematic diagram of the cluster visualization result obtained by the NTF method provided in Example 2;
FIG. 4 (b) is a schematic diagram of the cluster visualization result obtained by the existing BoostNE method;
FIG. 4 (c) is a schematic diagram of the cluster visualization result obtained by the existing Role2Vec method;
FIG. 4 (d) is a schematic diagram of the cluster visualization result obtained by the existing Struc2Vec method.
Detailed Description
The invention is further described below with reference to the drawings and examples, but is not limited thereto.
Example 1
An abnormal financial organization hierarchy dividing system based on a neighborhood topological structure is shown in figure 1 and comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchy dividing module which are connected in sequence;
the data preprocessing module is used for: sequentially carrying out duplication removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracting transaction account information and transaction counter account information in the abnormal financial organization transaction flow data, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, a summary, a transaction balance and a transaction place;
the neighborhood topology feature extraction module is used for: processing the data of the transaction network of the abnormal financial organization by using a self-defined network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;
the abnormal financial organization hierarchy partitioning module is used for: firstly, carrying out dimension reduction processing on low-dimensional dense vectors generated by each account by using a PCA (principal component analysis) method, and then carrying out clustering operation on the low-dimensional dense vectors subjected to the dimension reduction processing by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization.
Example 2
The working method of the abnormal financial organization hierarchy dividing system of Example 1 runs on a computer and, as shown in fig. 3, comprises the following steps:
(1) The data preprocessing module sequentially performs duplicate removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracts a transaction account number and a transaction counter-account number of each transaction record in the abnormal financial organization transaction flow data, and constructs a transaction network of the abnormal financial organization;
The original financial transaction flow data suffers from defects such as redundancy, missing data and non-standard formats, so data cleaning is first required to extract valid transaction records. Data cleaning comprises:
A. Incomplete data removal: cleaning data with a non-standard format, i.e., records in which the number of digits of the transaction account number and the counterparty account number is inconsistent with that of a standard bank card account number; such records are regarded as incomplete data and are removed.
B. Denoising: cleaning records whose transaction amount is less than 50 yuan; the abnormal transactions of abnormal financial organizations generally involve large amounts, so removing small-amount transaction records helps improve the accuracy of the hierarchical division.
C. Deduplication: cleaning redundant data, i.e., cases where the same transaction in the abnormal financial organization transaction flow data is recorded twice in different ways. For example, when a transaction takes place between Account_1 and Account_2, it is recorded both as "transaction account Account_1, counterparty account Account_2" and as "transaction account Account_2, counterparty account Account_1"; one of the two records needs to be deleted.
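The three cleaning operations can be illustrated with a minimal Python sketch. The record field names (`account`, `counterparty`, `amount`, `time`), the accepted account-number lengths and the key used to recognize mirrored records are assumptions made for illustration only; the text specifies only the 50 yuan threshold.

```python
from typing import Dict, List

MIN_AMOUNT = 50.0              # threshold stated in step B (yuan)
VALID_LENGTHS = range(16, 20)  # assumption: 16-19 digit card numbers


def is_complete(rec: Dict) -> bool:
    """Step A: both account numbers must look like standard card numbers."""
    return all(
        acc.isdigit() and len(acc) in VALID_LENGTHS
        for acc in (rec["account"], rec["counterparty"])
    )


def clean(records: List[Dict]) -> List[Dict]:
    """Apply steps A, B and C and return the surviving records."""
    seen = set()
    kept = []
    for rec in records:
        if not is_complete(rec):          # A. incomplete data removal
            continue
        if rec["amount"] < MIN_AMOUNT:    # B. denoising
            continue
        # C. deduplication: mirrored records are assumed to share the same
        # unordered account pair, amount and time.
        key = (frozenset((rec["account"], rec["counterparty"])),
               rec["amount"], rec["time"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

Because each filter only discards records, applying the three checks in a different order leaves the same set of records.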
Extracting the transaction account information and the counterparty account information in the transaction flow data of the abnormal financial organization and constructing the transaction network of the abnormal financial organization comprises the following steps:
D. the set of accounts involved in the transaction flow data of the abnormal financial organization is C = {c_1, c_2, …, c_i, …, c_n}, which indicates that there are n accounts in total; C denotes the set of all accounts, and c_i denotes the i-th account in the set;
E. according to f(c_i) = v_i, the point set V = {v_1, v_2, …, v_i, …, v_n} of the transaction network is constructed, i.e., each account c_i is mapped to a positive integer in the range [1, n], v_i = i, i = 1, 2, …, n;
F. according to the point set V of the transaction network, the account set C and the transaction flow data, the edge set of the transaction network is constructed,
Figure GDA0003715238920000101
e_{i,j} = (v_i, v_j); for
Figure GDA0003715238920000102
there is a weight w_{i,j} corresponding to it, where w_{i,j} represents the number of transactions between accounts c_i and c_j, and w_{i,j} = w_{j,i};
G. finally, the transaction network is constructed: G = {V, E}.
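Steps D to G can be sketched in Python as follows. The function name `build_network` and the choice to number accounts in sorted order are assumptions; the text only requires that each account be mapped to a distinct integer in [1, n] and that w_{i,j} count the transactions between the two accounts.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_network(pairs: List[Tuple[str, str]]):
    """Return (account -> node id, undirected edge -> weight) from account pairs."""
    accounts = sorted({acc for pair in pairs for acc in pair})
    # f(c_i) = v_i = i, with i running from 1 to n
    node_of: Dict[str, int] = {acc: i for i, acc in enumerate(accounts, start=1)}
    weights: Counter = Counter()
    for src, dst in pairs:
        i, j = node_of[src], node_of[dst]
        # store each undirected edge once, so that w_ij == w_ji by construction
        weights[(min(i, j), max(i, j))] += 1
    return node_of, dict(weights)
```

For example, `build_network([("A", "B"), ("B", "A"), ("A", "C")])` gives edge (1, 2) a weight of 2 and edge (1, 3) a weight of 1.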
(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined neighborhood topological structure-based network representation learning method, extracts neighborhood topological structure information of account nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account; the method comprises the following steps:
in the network, the nodes with similar neighborhood structures often have similar functions and play similar roles, which reflects that account number nodes with similar neighborhood topology structures in the transaction network of the abnormal financial organization are often in the same level in the organization. The NTF method designed by the invention can effectively extract the neighborhood topological structure information of the node, thereby effectively completing the task of network level division.
Given a graph G = (V, E), V = {v_1, v_2, …, v_i, …, v_n} represents the set of n nodes;
Figure GDA0003715238920000103
represents the edge set of graph G; for any edge e_{i,j} there is a weight w_{i,j} corresponding to it, w_{i,j} ≥ 0; if there is no edge between two nodes v_i and v_j, then w_{i,j} = 0, otherwise w_{i,j} > 0; if graph G is an undirected graph, then e_{i,j} ≡ e_{j,i} and w_{i,j} = w_{j,i}; if graph G is an unweighted graph, then w_{i,j} = 1, otherwise w_{i,j} ≥ 0;
The energy level of node v relative to node u is el_u(v). The initial energy is set to Q_u(u) = 1, and there are k* energy levels in total; the energy range of the k-th energy level is
Figure GDA0003715238920000104
The residual energy on reaching point v from point u is Q_u(v); if
Figure GDA0003715238920000105
then el_u(v) = k; the set of all nodes whose energy level relative to point u is k is N_k(u), N_k(u) = {v_i | el(v_i) = k};
The maximum sampling depth d* is the maximum distance from a sampled node to the central node when sampling expands outward layer by layer, by distance, from a given central node; nodes whose distance to the central node exceeds d* are not included in the sampling range;
The quadratic graph G′ = {V, E′, S} is an undirected, weighted complete graph generated on the basis of the original graph,
Figure GDA0003715238920000106
is the set of edges,
Figure GDA0003715238920000107
for any edge e_{i,j} there is a weight s_{i,j} corresponding to it, s_{i,j} ≥ 0, reflecting the neighborhood structure similarity of node v_i and node v_j;
H. generating an influence sub-network for each node through the NTF algorithm; the subsequent sampling process is also performed on the influence sub-network. As shown in fig. 2, a node is taken from the transaction network G as the central node; the NTF model screens out from the transaction network G the neighbor nodes that are more important to the neighborhood structure of the central node, and these neighbor nodes together with their edges form the influence subgraph, i.e., the influence sub-network, of the central node. Specifically: let the central node be u with initial energy Q_u(u) = 1; starting from point u, the energy propagates along the edges until it is at the lowest energy level or the residual energy is insufficient to reach the next node. During propagation, the rate of energy dissipation is related to factors such as the distance from a node to u, the degree of the node and the edge weight: the closer a node is to u, the larger its degree and the larger the edge weight, the more important the node is to the neighborhood structure of u and the less energy is lost on reaching it. Let the path be P = {u, v_1, …, v_i, v_j}, where the distance from v_j to the central node u is k; starting from point u, the energy propagates along the path, and the residual energy Q_u(v_j) on reaching v_j via v_i is shown in formula (I):
Figure GDA0003715238920000111
in formula (I), Q_u(v_i) denotes the residual energy on reaching point v_i along path P from node u, k denotes the distance from v_j to u, w_{i,j} denotes the weight of edge e_{i,j}, a is the energy decay rate, l(P) denotes the length of path P, and
Figure GDA0003715238920000112
denotes the degree of node v_j;
Note that since there may be multiple paths from u to v_j, v_j may belong simultaneously to
Figure GDA0003715238920000113
where k_1 < k_2. In that case, according to the proximity principle, it finally belongs to
Figure GDA0003715238920000114
A node is not allowed to be at multiple energy levels at the same time.
Starting from point u, the set of reachable nodes is denoted V(u) and the set of edges is denoted E(u); the influence subgraph is G(u) = {V(u), E(u), N}, where N = {el_u(v) | v ∈ V(u)} is the set of energy levels of the reached nodes;
I. sampling the neighborhood topological structure features on the basis of the influence sub-network, which comprises: according to a given sampling depth d*, starting from the central node in the influence sub-network, adopting a breadth-first search strategy and expanding the sampling layer by layer according to distance; each node expanded during sampling is represented by its energy level and its degree, where the energy level of a node is a relative feature and the degree is an absolute feature;
R_i(u) is defined as the set of nodes at distance i from the central node u, and DE_i(u) denotes the set of ordered pairs of absolute and relative features of the nodes expanded from node u at sampling depth i, defined as shown in formula (II):
DE_i(u) = {(d(v), el_u(v)) | v ∈ R_i(u)}  (II)
in formula (II), d(v) denotes the degree of node v, and el_u(v) denotes the energy level of node v relative to node u.
The elements of DE_i(u) are arranged in order, sorted in ascending order with the node degree as the primary key and the node energy level as the secondary key; combining the absolute and relative features of the nodes,
Figure GDA0003715238920000115
this feature is the neighborhood structure feature of the central node;
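The layer-by-layer sampling of step I can be sketched in Python as below. The sketch assumes that the influence sub-network is given as an adjacency dictionary and that the energy levels el_u(v) have already been computed; the function name `sample_features` and the use of the degree within the supplied adjacency are illustrative choices, not taken from the text.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def sample_features(adj: Dict[int, Set[int]],
                    energy_level: Dict[int, int],
                    u: int, max_depth: int) -> List[List[Tuple[int, int]]]:
    """Return [DE_1(u), ..., DE_{d*}(u)] as sorted lists of (degree, level)."""
    dist = {u: 0}
    queue = deque([u])
    layers: List[List[Tuple[int, int]]] = [[] for _ in range(max_depth)]
    while queue:
        v = queue.popleft()
        if dist[v] == max_depth:      # do not expand beyond depth d*
            continue
        for nb in adj[v]:
            if nb not in dist:        # breadth-first: first visit fixes the distance
                dist[nb] = dist[v] + 1
                layers[dist[nb] - 1].append((len(adj[nb]), energy_level[nb]))
                queue.append(nb)
    # tuples sort by degree first, then by energy level, both ascending
    return [sorted(layer) for layer in layers]
```

Sorting the tuples directly reproduces the ordering described above, because Python compares tuples element by element.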
J. constructing the quadratic graph according to the neighborhood structure features of the nodes; the random walk operation is performed on the quadratic graph.
distance(u, v) is defined as the neighborhood topological structure distance between nodes u and v; it is calculated from the neighborhood topological structure feature sets DE(u) and DE(v) of nodes u and v, as shown in formula (III):
Figure GDA0003715238920000121
dist(·) is a function that computes the distance between two sequences;
the k-th hop neighborhood distance exists if and only if nodes u and v both have neighbors at distance k; w_i denotes the contribution ratio of the i-th hop neighborhood distance to the total distance and is calculated as shown in formula (IV):
Figure GDA0003715238920000122
in this model, the DTW (Dynamic Time Warping) algorithm is used; in practical calculation, the FastDTW algorithm is adopted to balance calculation speed and accuracy. For the DTW algorithm, the difference between any two elements is defined as shown in formula (V):
Figure GDA0003715238920000123
the difference value jointly considers the degree and energy-level features of the nodes; a_1 and b_1 denote the first elements of the ordered pairs a and b, i.e., the node degrees, and a_2 and b_2 denote the second elements of the ordered pairs, i.e., the node energy levels;
on this basis, the quadratic graph is constructed: for two points u and v in the quadratic graph, the weight s_{u,v} of the edge (u, v) reflects their neighborhood structure similarity, and s_{u,v} is calculated as shown in formula (VI):
s_{u,v} = e^{−distance(u,v)}  (VI)
note that s_{u,v} = s_{v,u}. In this way, the neighborhood structure distance of two nodes is mapped to a value within the interval (0, 1];
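As a rough illustration of how a sequence distance feeds formula (VI), the sketch below implements the classic dynamic time warping recurrence and the exponential mapping in Python. The element difference `pair_cost` is a hypothetical stand-in, since formula (V) is not reproduced in the text; the hop weighting of formulas (III) and (IV) and the FastDTW approximation are likewise not modelled.

```python
import math
from typing import Callable, Sequence, Tuple

Pair = Tuple[int, int]  # (degree, energy level)


def pair_cost(a: Pair, b: Pair) -> float:
    # Hypothetical element difference; not the patent's formula (V).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dtw(s: Sequence[Pair], t: Sequence[Pair],
        cost: Callable[[Pair, Pair], float] = pair_cost) -> float:
    """Classic O(len(s) * len(t)) dynamic time warping distance."""
    n, m = len(s), len(t)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = cost(s[i - 1], t[j - 1]) + step
    return table[n][m]


def similarity(distance: float) -> float:
    """Formula (VI): s = e^(-distance), a value in (0, 1]."""
    return math.exp(-distance)
```

Identical feature sequences have distance 0 and therefore similarity 1, and the similarity decreases towards 0 as the distance grows.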
K. generating the vector representation of each node, i.e., the low-dimensional dense vector, which means:
performing random walks on the quadratic graph to generate the context sequences of the nodes; the probability of moving from the current node u to another node is calculated from the edge weights of the quadratic graph, as shown in formula (VII):
Figure GDA0003715238920000124
Z(u) is a normalization factor; according to this walk probability, the walk starting from the current node tends to move to nodes whose structure is similar to that of the current node, so that the context of node u contains nodes structurally similar to u,
Figure GDA0003715238920000125
regardless of their relative location in the network.
Using the Skip-Gram algorithm, a vector representation is learned for each node. The vector representations of nodes whose neighborhood topologies are close are closer in vector space.
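A weighted random walk on the quadratic graph might look like the following Python sketch. It assumes that the transition probability is the edge weight s_{u,v} divided by the sum of the weights leaving u, which is one natural reading of "Z(u) is a normalization factor"; the exact form of formula (VII) is not reproduced here, and the function name and nested-dictionary input are illustrative.

```python
import random
from typing import Dict, List


def random_walk(sim: Dict[int, Dict[int, float]], start: int,
                length: int, rng: random.Random) -> List[int]:
    """Generate one walk; sim[u][v] is the quadratic-graph weight s_uv."""
    walk = [start]
    while len(walk) < length:
        u = walk[-1]
        neighbors = list(sim[u])
        # random.choices normalizes the weights, i.e. divides by their sum Z(u)
        nxt = rng.choices(neighbors, weights=[sim[u][v] for v in neighbors])[0]
        walk.append(nxt)
    return walk
```

Each generated walk serves as one context sequence for Skip-Gram training.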
Preferably, according to the invention, a vector representation is learned for each node using the Skip-Gram algorithm. The method comprises the following steps:
(1) generating a one-hot vector for each node: for every node v_i ∈ V, a corresponding one-hot vector is generated, in which the i-th element is 1 and the other elements are 0;
(2) letting d be the dimension of the generated dense vectors, randomly initializing the matrices U and W,
Figure GDA0003715238920000134
U_i represents the node vector of node v_i when it serves as the central node,
Figure GDA0003715238920000135
W_i represents the node vector of node v_i when it serves as a background node;
(3) letting v_i represent a central node, v_i ∈ V; context(v_i) denotes the context of v_i, with |context(v_i)| = wsize, where wsize is the window length; the objective function f(Skip-Gram) of Skip-Gram is shown in formula (VIII):
Figure GDA0003715238920000136
if v_j ∈ context(v_i), then v_i ∈ context(v_j), i.e., v_i and v_j are each other's background node and central node; a background node is a node whose distance from the central node in the node context sequence does not exceed wsize;
(4) updating the matrices U and W with the back-propagation algorithm so as to maximize f(Skip-Gram); for every node v_i ∈ V, the low-dimensional dense vector generated by the Skip-Gram algorithm is W_i, recorded as
Figure GDA0003715238920000138
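The relationship between central nodes and background nodes can be illustrated by extracting training pairs from one context sequence. This Python sketch assumes a symmetric window of wsize positions on each side of the central node; the function name `skipgram_pairs` is hypothetical, and the optimization of formula (VIII) itself is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(walk: List[int], wsize: int) -> List[Tuple[int, int]]:
    """Return (center, background) pairs at most wsize positions apart."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - wsize), min(len(walk), i + wsize + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```

Because the window is symmetric, every pair (v_i, v_j) is accompanied by (v_j, v_i), which matches the statement that the two nodes are each other's background node and central node.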
(3) The abnormal financial organization hierarchy division module first reduces the dimension of the low-dimensional dense vector generated for each account by using the PCA method, and then clusters the dimension-reduced low-dimensional dense vectors by using the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization.
The PCA method is used to reduce the dimension of the low-dimensional dense vector generated for each account, the low-dimensional dense vector of v_i being vec_i. The method comprises the following steps:
L. constructing a matrix X,
Figure GDA0003715238920000139
M. calculating the covariance matrix Cov of X,
Figure GDA00037152389200001310
N. calculating the eigenvalues and eigenvectors of the covariance matrix Cov: (λA − Cov)x = 0, where λ is an eigenvalue of Cov, A is the identity matrix, and x is an eigenvector of Cov;
O. setting the dimension after reduction to d′, arranging the eigenvectors as rows from top to bottom in descending order of eigenvalue, and taking the first d′ rows to obtain the matrix P,
Figure GDA00037152389200001311
P. letting the data after dimension reduction be Y, Y = PX,
Figure GDA00037152389200001312
i.e., the low-dimensional dense vector of v_i after dimension reduction.
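Steps L to P can be sketched with NumPy as follows. The sketch assumes that X stores one node vector per column (so that Y = PX is well defined), that each dimension is centered before the covariance is computed, and that the covariance is scaled by 1/n; these details are not recoverable from the text, and the function name `pca_reduce` is illustrative.

```python
import numpy as np


def pca_reduce(X: np.ndarray, d_prime: int) -> np.ndarray:
    """X has shape (d, n): one column per node vector. Returns Y = P X."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each dimension (assumed)
    cov = Xc @ Xc.T / X.shape[1]             # d x d covariance (1/n scaling assumed)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the columns
    order = np.argsort(eigvals)[::-1][:d_prime]
    P = eigvecs[:, order].T                  # top d' eigenvectors as rows
    return P @ Xc                            # shape (d', n)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; column i of the result is the reduced vector y_i.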
The k-means algorithm is then used to cluster the dimension-reduced low-dimensional dense vectors to obtain the hierarchical division result of the abnormal financial organization. The dimension-reduced node vector corresponding to v_i is y_i,
Figure GDA0003715238920000141
The input sample set of the k-means algorithm is S = {y_i | i ∈ [1, n]}; Classnum is the number of categories and Iternum is the number of iterations. The method comprises the following steps:
Q. randomly selecting Classnum node vectors from S as the initial centroid vectors of the categories, the centroid vectors being {μ_1, μ_2, …, μ_Classnum} and the corresponding categories being {class_1, class_2, …, class_c};
R.
Figure GDA0003715238920000142
calculating dist(y_i, μ_j) and assigning y_i to the category class_j of the centroid μ_j nearest to y_i, where dist(y_i, μ_j) is calculated as shown in formula (IX):
Figure GDA0003715238920000143
S. updating the centroid vector of each category, the update formula being shown in formula (X):
Figure GDA0003715238920000144
T. executing step R and step S in sequence, terminating after Iternum executions, and outputting the classification result.
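Steps Q to T can be sketched in plain Python as below. The sketch assumes that dist(y_i, μ_j) is the Euclidean distance and that the centroid update is the mean of the members of each category, which is the usual k-means formulation; formulas (IX) and (X) are not reproduced in the text, so both are assumptions, as is the handling of empty categories.

```python
import math
import random
from typing import List, Sequence


def kmeans(S: Sequence[Sequence[float]], classnum: int, iternum: int,
           rng: random.Random) -> List[int]:
    """Return the category index assigned to each sample in S."""
    centroids = [list(y) for y in rng.sample(list(S), classnum)]   # step Q
    labels = [0] * len(S)
    for _ in range(iternum):                                       # step T
        # step R: assign each sample to its nearest centroid
        for i, y in enumerate(S):
            labels[i] = min(range(classnum),
                            key=lambda j: math.dist(y, centroids[j]))
        # step S: move each centroid to the mean of its members
        for j in range(classnum):
            members = [S[i] for i in range(len(S)) if labels[i] == j]
            if members:  # keep the old centroid if a category is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The loop always runs Iternum times, mirroring step T, rather than stopping early on convergence.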
The method of this embodiment is used to process the transaction flow data set T of an abnormal financial organization input by a user.
The transaction flow data set used in this example contains 105 account nodes divided into 3 categories: 22 high-level accounts, 11 special-function accounts and 72 bottom-level accounts, with 784 transfer records in total.
This example uses six indices, CHI (Calinski-Harabasz Index), SC (Silhouette Coefficient), DBI (Davies-Bouldin Index), ARI (Adjusted Rand Index), AMI (Adjusted Mutual Information) and V-measure, to evaluate the effect of this embodiment. The first three indices, CHI, SC and DBI, are internal criteria used to evaluate the quality of the clusters obtained after clustering; the last three, ARI, AMI and V-measure, are external criteria used to compare how well the clustering result matches the true distribution.
Fig. 4 (a) is a schematic diagram of the cluster visualization result obtained by the NTF method provided in this embodiment; fig. 4 (b) is a schematic diagram of the cluster visualization result obtained by the existing BoostNE method, which is described in Li J, Wu L, Guo R, et al. Multi-level network embedding with boosted low-rank matrix approximation [C]// Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2019: 49-56. Fig. 4 (c) is a schematic diagram of the cluster visualization result obtained by the existing Role2Vec method, which is described in Ahmed N K, Rossi R, Lee J B, et al. Fig. 4 (d) is a schematic diagram of the cluster visualization result obtained by the existing Struc2Vec method, which is described in Ribeiro L F R, Saverese P H P, Figueiredo D R. struc2vec: Learning node representations from structural identity [C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.
In fig. 4 (a) to 4 (d), the abscissa and ordinate are the coordinates on the two-dimensional plane obtained when the dense vectors generated for each node by the algorithms (NTF, BoostNE, Role2Vec and Struc2Vec) are reduced to two dimensions. Since the vectors are normalized when the clustering result is evaluated using the indices (CHI, SC, DBI, ARI, AMI and V-measure), the absolute magnitude of a node's abscissa and ordinate does not affect the clustering result.
As can be seen from fig. 4 (a) to 4 (d), the two methods, boostNE and Role2Vec, do not distinguish the three types of nodes well. Struc2Vec can better distinguish category 3 from category 1 and category 2, but cannot better distinguish category 1 from category 2. The NTF method provided by the invention can well distinguish three types of nodes, and a better clustering effect is obtained.
Three indexes of CHI, SC and DBI are used for measuring the clustering effect of NTF and three comparison methods of BoostNE, role2Vec and Struc2Vec, and the result is shown in Table 1.
TABLE 1
Figure GDA0003715238920000151
The k-means algorithm is used to cluster the dimension-reduced node vectors generated by the four methods, and ARI, AMI and V-measure are then used to measure the clustering effect; the results are shown in Table 2.
TABLE 2
Figure GDA0003715238920000152
As can be seen from tables 1 and 2, compared with the existing three methods of BoostNE, role2Vec, and Struc2Vec, the NTF method provided by the present invention obtains the best results on 6 indexes, which reflects that the NTF method can effectively extract the neighborhood topology information of nodes, and separates different types of nodes as far as possible while gathering the same type of nodes together, so as to obtain a better clustering result. Meanwhile, the NTF method can better support the subsequent node classification task.

Claims (5)

1. A working method of an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure, the system comprising a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchy dividing module which are connected in sequence;
the data preprocessing module is used for: sequentially carrying out duplicate removal, denoising and incomplete data removal operations on the input transaction flow data of the abnormal financial organization, extracting transaction account information and transaction counter-party account information in the transaction flow data of the abnormal financial organization, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, an abstract, a transaction balance and a transaction place;
the neighborhood topology feature extraction module is used for: processing the data of the transaction network of the abnormal financial organization by using a self-defined network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;
the abnormal financial organization hierarchy partitioning module is used for: firstly, carrying out dimensionality reduction on a low-dimensional dense vector generated by each account by using a PCA (principal component analysis) method, and then carrying out clustering operation on the low-dimensional dense vector subjected to dimensionality reduction by using a k-means algorithm to obtain a hierarchical division result of abnormal financial organization; the method is characterized by operating in a computer and comprising the following steps:
(1) The data preprocessing module sequentially performs duplicate removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracts a transaction account number and a transaction counter-account number of each transaction record in the abnormal financial organization transaction flow data, and constructs a transaction network of the abnormal financial organization;
(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined network representation learning method based on a neighborhood topological structure, extracts neighborhood topological structure information of account number nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account number;
(3) The abnormal financial organization hierarchical division module firstly performs dimension reduction processing on the low-dimensional dense vector generated by each account by using a PCA method, and then performs clustering operation on the low-dimensional dense vector subjected to the dimension reduction processing by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization;
in the step (1), the transaction account information and the transaction counter-party account information in the transaction flow data of the abnormal financial organization are extracted, and a transaction network of the abnormal financial organization is constructed, wherein the method comprises the following steps:
D. the set of accounts involved in the transaction flow data of the abnormal financial organization is C = {c_1, c_2, …, c_i, …, c_n}, which indicates that there are n accounts in total; C denotes the set of all accounts, and c_i denotes the i-th account in the set;
E. according to f(c_i) = v_i, the point set V = {v_1, v_2, …, v_i, …, v_n} of the transaction network is constructed, i.e., each account c_i is mapped to a positive integer in the range [1, n], v_i = i, i = 1, 2, …, n;
F. according to the point set V of the transaction network, the account set C and the transaction flow data, the edge set of the transaction network is constructed,
Figure FDA0003715238910000011
e_{i,j} = (v_i, v_j); for
Figure FDA0003715238910000021
there is a weight w_{i,j} corresponding to it, where w_{i,j} represents the number of transactions between accounts c_i and c_j, and w_{i,j} = w_{j,i};
G. finally, the transaction network is constructed: G = {V, E};
in the step (2), the method comprises the following steps:
given a graph G = (V, E), V = {v_1, v_2, …, v_i, …, v_n} represents the set of n nodes;
Figure FDA0003715238910000022
represents the edge set of graph G; for any edge e_{i,j} there is a weight w_{i,j} corresponding to it, w_{i,j} ≥ 0; if there is no edge between two nodes v_i and v_j, then w_{i,j} = 0, otherwise w_{i,j} > 0; if graph G is an undirected graph, then e_{i,j} ≡ e_{j,i} and w_{i,j} = w_{j,i}; if graph G is an unweighted graph, then w_{i,j} = 1, otherwise w_{i,j} ≥ 0;
the energy level of node v relative to node u is el_u(v); the initial energy is set to Q_u(u) = 1, and there are k* energy levels in total, the energy range of the k-th energy level being
Figure FDA0003715238910000023
the residual energy on reaching point v from point u is Q_u(v); if
Figure FDA0003715238910000024
then el_u(v) = k; the set of all nodes whose energy level relative to point u is k is N_k(u), N_k(u) = {v_i | el(v_i) = k};
the maximum sampling depth d* is the maximum distance from a sampled node to the central node when sampling expands outward layer by layer, by distance, from a given central node;
the quadratic graph G′ = {V, E′, S} is an undirected, weighted complete graph generated on the basis of the original graph,
Figure FDA0003715238910000025
is the set of edges,
Figure FDA0003715238910000026
for any edge e_{i,j} there is a weight s_{i,j} corresponding to it, s_{i,j} ≥ 0, reflecting the neighborhood structure similarity of node v_i and node v_j;
H. Generating an influence sub-network for each node through the NTF algorithm: taking a node of the transaction network $G$ as the central node, the NTF model screens out from $G$ the neighbor nodes that matter most to the neighborhood structure of the central node; these nodes, together with their edges, form the influence subgraph, i.e. the influence sub-network, of the central node. The method comprises the following steps: let the central node be $u$ with initial energy $Q_u(u) = 1$; starting from point $u$, energy propagates along the edges until it is at the lowest energy level or the remaining energy is insufficient to reach the next node; given a path $P = \{u, v_1, \ldots, v_i, v_j\}$ in which $v_j$ is at distance $k$ from the central node $u$, the residual energy $Q_u(v_j)$ when energy propagating from $u$ along the path reaches $v_j$ via $v_i$ is given by formula (I):
(formula image FDA0003715238910000027)
in formula (I), $Q_u(v_i)$ is the residual energy on arriving at point $v_i$ along path $P$ from node $u$, $k$ is the distance from $v_j$ to $u$, $w_{i,j}$ is the weight of edge $e_{i,j}$, $a$ is the energy decay rate, $l(P)$ is the length of path $P$, and
(formula image FDA0003715238910000028)
is the degree of node $v_j$;
starting from point $u$, the set of reachable nodes is denoted $V(u)$ and the set of edges is denoted $E(u)$; the influence subgraph is $G(u) = \{V(u), E(u), N\}$, where $N = \{el_u(v) \mid v \in V(u)\}$ is the set of energy levels of the reached nodes;
I. On the basis of the influence sub-network, the neighborhood topological structure features are sampled as follows: according to the given sampling depth $d^*$, starting from the central node of the influence sub-network, a breadth-first search strategy expands the sampling layer by layer by distance, and each node reached during sampling is represented by its energy level and its degree, the energy level being a relative feature and the degree an absolute feature;
define $R_i(u)$ as the set of nodes at distance $i$ from the central node $u$; $DE_i(u)$ denotes the set of ordered pairs of absolute and relative features of the nodes reached from node $u$ at sampling depth $i$, defined as shown in formula (II):
$DE_i(u) = \{(d(v), el_u(v)) \mid v \in R_i(u)\}$ (II)
in formula (II), $d(v)$ is the degree of node $v$ and $el_u(v)$ is the energy level of node $v$ relative to node $u$;
the elements of $DE_i(u)$ are sorted in ascending order, with the node degree as the primary key and the node energy level as the secondary key, thereby combining the absolute and relative features of the nodes,
(formula image FDA0003715238910000031)
is the neighborhood structure feature of the central node;
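The sampling in step I can be sketched in Python as follows. This is only an illustration under stated assumptions: the influence sub-network is passed as an adjacency dictionary `adj`, and the degrees `degree` and energy levels `energy_level` are assumed to have been computed beforehand; the function and parameter names are hypothetical.

```python
from collections import deque

def sample_neighborhood_features(adj, degree, energy_level, u, max_depth):
    """Return {i: sorted list of (d(v), el_u(v))} for the depths reached from u.

    adj: dict node -> set of neighbours inside the influence sub-network of u.
    degree: dict node -> d(v), the absolute feature (assumed precomputed).
    energy_level: dict node -> el_u(v), the relative feature (assumed precomputed).
    max_depth: the sampling depth d*.
    """
    depth = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search outward from the central node
        v = queue.popleft()
        if depth[v] == max_depth:
            continue  # do not expand beyond the sampling depth
        for nb in adj[v]:
            if nb not in depth:
                depth[nb] = depth[v] + 1
                queue.append(nb)
    features = {}
    for v, i in depth.items():
        # ordered pair (absolute feature, relative feature), as in formula (II)
        features.setdefault(i, []).append((degree[v], energy_level[v]))
    # ascending sort: degree is the primary key, energy level the secondary key
    return {i: sorted(pairs) for i, pairs in features.items()}
```

Python sorts tuples lexicographically, so `sorted(pairs)` gives exactly the primary-key and secondary-key ordering described above.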
J. Constructing the secondary graph from the neighborhood structure features of the nodes: define distance(u, v) as the neighborhood topological structure distance of nodes $u$ and $v$, calculated from their neighborhood topological structure feature sets $DE(u)$ and $DE(v)$ as shown in formula (III):
(formula image FDA0003715238910000032)
dist(·) is a function that calculates the distance between two sequences;
the distance of the $k$-th hop neighborhood exists if and only if nodes $u$ and $v$ both have neighbors at distance $k$; $w_i$ is the proportion that the $i$-th hop neighborhood distance contributes to the total distance, calculated as shown in formula (IV):
(formula image FDA0003715238910000033)
following the DTW algorithm, the difference between any two elements is defined as shown in formula (V):
(formula image FDA0003715238910000034)
the difference takes both the degree and the energy level of the nodes into account; $a_1, b_1$ are the first elements of the ordered pairs $a, b$, i.e. the node degrees, and $a_2, b_2$ are the second elements of the ordered pairs, i.e. the node energy levels;
on this basis the secondary graph is constructed; for two points $u, v$ in the secondary graph, the weight $s_{u,v}$ of the edge $(u, v)$ embodies their neighborhood structure similarity and is calculated as shown in formula (VI):
$s_{u,v} = e^{-\mathrm{distance}(u,v)}$ (VI)
which maps the neighborhood structure distance of the two nodes to a decimal in the interval (0, 1);
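As an illustration of step J, the following Python sketch computes a DTW distance between two sorted feature sequences and maps a distance to a similarity with formula (VI). The element cost `pair_cost` is a hypothetical stand-in, because formula (V) is only available as an image here; the per-hop weighting of formulas (III) and (IV) is likewise not reproduced. All function names are assumptions.

```python
import math

def dtw_distance(seq_a, seq_b, cost):
    """Classic dynamic time warping between two sequences, given a pairwise cost."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # extend the cheapest of the three admissible warping moves
            table[i][j] = step + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]

def pair_cost(a, b):
    # Hypothetical stand-in for formula (V): a = (a_1, a_2), b = (b_1, b_2),
    # penalising differences in degree and in energy level equally.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def similarity(distance):
    # Formula (VI): s_{u,v} = e^{-distance(u,v)}
    return math.exp(-distance)
```

DTW suits this comparison because the feature sequences of two nodes at the same hop generally have different lengths.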
K. Generating the vector representation of each node, i.e. the low-dimensional dense vector:
random walks are performed on the secondary graph to generate the context sequences of the nodes; the probability of moving from the current node $u$ to another node is calculated from the weights of the secondary graph edges, as shown in formula (VII):
(formula image FDA0003715238910000041)
$z(u)$ is a normalization factor;
(formula image FDA0003715238910000042)
using the Skip-Gram algorithm, a vector representation is learned for each node.
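The random walk of step K can be sketched as below. Since formula (VII) is only available as an image, the sketch assumes the common form in which the transition probability is the edge weight divided by the sum of the weights leaving the current node; that form, the nested-dictionary layout of `weights`, and the function names are assumptions.

```python
import random

def transition_probabilities(weights, u):
    """weights: dict u -> dict v -> s_{u,v} (no self-loops assumed)."""
    z = sum(weights[u].values())  # assumed normalization factor z(u)
    return {v: s / z for v, s in weights[u].items()}

def random_walk(weights, start, length, rng=None):
    """Generate one context sequence of `length` nodes starting at `start`."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        probs = transition_probabilities(weights, walk[-1])
        nodes = list(probs)
        # draw the next node in proportion to the edge weights
        walk.append(rng.choices(nodes, weights=[probs[v] for v in nodes], k=1)[0])
    return walk
```

Passing a seeded `random.Random` instance makes the generated sequences repeatable, which helps when comparing embeddings across runs.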
2. The method for operating an abnormal financial organization hierarchy dividing system according to claim 1, wherein in step (1) the data cleaning comprises:
A. Removing incomplete data: cleaning out data with a non-standard format, i.e. data in which the number of digits of the transaction account number and the transaction counterparty account number is inconsistent with the number of digits of a standard bank card account number;
B. Denoising: cleaning out data whose transaction amount is less than 50 yuan;
C. Deduplication: cleaning out redundant data, i.e. cases where the same transaction is recorded twice in different ways in the transaction flow data of the abnormal financial organization.
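The three cleaning rules can be sketched in Python as follows. The record layout (dictionaries with `account`, `counterparty`, `amount` and `time` keys), the accepted account lengths in `valid_lengths`, and the deduplication key are hypothetical choices made for illustration; the claim only fixes the 50 yuan threshold.

```python
def clean_transactions(records, valid_lengths=(16, 19), min_amount=50):
    """Apply rules A to C to a list of transaction dicts and return the kept records."""
    seen = set()
    cleaned = []
    for rec in records:
        acc, cp = rec["account"], rec["counterparty"]
        # A. drop records whose account numbers do not have a standard digit count
        if not all(a.isdigit() and len(a) in valid_lengths for a in (acc, cp)):
            continue
        # B. drop records whose amount is below the threshold (50 yuan in the claim)
        if rec["amount"] < min_amount:
            continue
        # C. drop repeats; assumption: a repeat is the same transfer seen from either side
        key = (frozenset((acc, cp)), rec["amount"], rec["time"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Using a `frozenset` of the two accounts in the key treats a transfer listed in the payer's statement and the same transfer listed in the payee's statement as one record.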
3. The method of claim 2, wherein learning a vector representation for each node using the Skip-Gram algorithm comprises the following steps:
(1) generating a one-hot vector for each node: for
(formula image FDA0003715238910000043)
the corresponding one-hot vector is generated,
(formula image FDA0003715238910000044)
(formula image FDA0003715238910000045)
and all other elements are 0;
(2) setting the dimension of the generated dense vector to $d$ and randomly initializing the matrices
(formula image FDA0003715238910000046)
where $U_i$ is the node vector of $v_i$ when it acts as a central node, and
(formula image FDA0003715238910000047)
where $W_i$ is the node vector of $v_i$ when it acts as a background node;
(3) $v_i$ denotes a central node, $v_i \in V$; $context(v_i)$ denotes the set of background nodes of $v_i$, with $|context(v_i)| = wsize$, where wsize is the window length; the objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):
(formula image FDA0003715238910000048)
if $v_j \in context(v_i)$, then $v_i \in context(v_j)$, i.e. $v_i$ and $v_j$ are background nodes and central nodes of each other; the background nodes of a central node are the nodes no more than wsize away from it in the context sequence of the nodes;
(4) updating the matrices $U$ and $W$ using the back propagation algorithm so as to maximize $f(\text{Skip-Gram})$; for
(formula image FDA0003715238910000049)
the low-dimensional dense vector generated by the Skip-Gram algorithm is $W_i$, recorded as $vec_i$,
(formula image FDA0003715238910000051)
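The notion of background nodes used in step (3) can be illustrated with a short Python sketch that lists the (central node, background node) pairs of one context sequence. The function name `skipgram_pairs` is hypothetical, and the sketch covers only the windowing, not the training of $U$ and $W$.

```python
def skipgram_pairs(walk, wsize):
    """List (center, background) pairs for every position of a context sequence."""
    pairs = []
    for pos, center in enumerate(walk):
        lo = max(0, pos - wsize)
        hi = min(len(walk), pos + wsize + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # a node is not its own background node
                pairs.append((center, walk[ctx_pos]))
    return pairs
```

Because the window is symmetric, every pair appears in both directions, which matches the statement that $v_j \in context(v_i)$ implies $v_i \in context(v_j)$.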
4. The method of claim 3, wherein in step (3) the PCA method is used to reduce the dimensionality of the low-dimensional dense vector generated for each account, the low-dimensional dense vector of $v_i$ being $vec_i$, comprising the following steps:
L. constructing the matrix $X$,
(formula image FDA0003715238910000052)
M. calculating the covariance matrix $Cov$ of $X$,
(formula image FDA0003715238910000053)
N. calculating the eigenvalues and eigenvectors of the covariance matrix: $(\lambda A - Cov)x = 0$, where $\lambda$ is an eigenvalue of $Cov$, $A$ is the identity matrix, and $x$ is an eigenvector of $Cov$;
setting the dimensionality after reduction to $d'$, arranging the eigenvectors as rows from top to bottom according to their eigenvalues, and taking the first $d'$ rows to obtain the matrix $P$,
(formula image FDA0003715238910000054)
P. the data after dimension reduction is $Y$, $Y = PX$,
(formula image FDA0003715238910000055)
$Y_i^T$ is the low-dimensional dense vector of $v_i$ after dimension reduction.
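The PCA steps can be sketched with NumPy as below. Several details are assumptions, because the formulas for $X$ and $Cov$ are only available as images: $X$ is taken to hold one node vector per column, each dimension is centered before the covariance is computed, the covariance is normalized by the number of nodes, and eigenvectors are ordered from the largest eigenvalue down. The function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(vectors, d_prime):
    """vectors: array-like of shape (n, d), one row per node; returns shape (n, d_prime)."""
    X = np.asarray(vectors, dtype=float).T   # assumed layout: d x n, one column per node
    X = X - X.mean(axis=1, keepdims=True)    # assumption: center each dimension
    cov = X @ X.T / X.shape[1]               # d x d covariance matrix (1/n normalization assumed)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]        # reorder so the largest eigenvalue comes first
    P = eigvecs[:, order[:d_prime]].T        # d' x d, eigenvectors as rows
    Y = P @ X                                # d' x n, matching Y = PX
    return Y.T                               # row i is the reduced vector of node v_i
```

The sign of each eigenvector is arbitrary, so reduced coordinates may be mirrored between runs or libraries without affecting the distances used by the later clustering.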
5. The method of claim 4, wherein in step (3) the dimension-reduced low-dimensional dense vectors are clustered using the k-means algorithm to obtain the hierarchy division result of the abnormal financial organization; the dimension-reduced node vector corresponding to $v_i$ is $y_i$, $y_i = Y_i^T$; the input sample set of the k-means algorithm is $S = \{y_i \mid i \in [1, n]\}$; Classnum is the number of categories and Iternum is the number of iterations; the method comprises the following steps:
Q. randomly selecting Classnum node vectors from $S$ as the initial centroid vectors of the categories, the centroid vectors being $\{\mu_1, \mu_2, \ldots, \mu_{Classnum}\}$, corresponding to the categories $\{class_1, class_2, \ldots, class_c\}$;
R. for
(formula image FDA0003715238910000056)
calculating $dist(y_i, \mu_j)$ and assigning $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; $dist(y_i, \mu_j)$ is calculated as shown in formula (IX):
(formula image FDA0003715238910000057)
S. updating the centroid vector of each category as shown in formula (X):
(formula image FDA0003715238910000058)
T. executing step R and step S in sequence, stopping after Iternum executions, and outputting the classification result.
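Steps Q to T can be sketched with NumPy as follows. Formulas (IX) and (X) are only available as images, so the sketch assumes the usual choices: Euclidean distance for $dist(y_i, \mu_j)$ and the mean of the assigned vectors for the centroid update. The function name, the `seed` parameter, and the handling of empty categories are also assumptions.

```python
import numpy as np

def kmeans(S, classnum, iternum, seed=0):
    """Cluster the rows of S into classnum categories using iternum iterations."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    # Step Q: pick Classnum distinct samples as the initial centroid vectors
    centroids = S[rng.choice(len(S), size=classnum, replace=False)].copy()
    labels = np.zeros(len(S), dtype=int)
    for _ in range(iternum):  # Step T: repeat R and S exactly Iternum times
        # Step R: assign each y_i to the nearest centroid (Euclidean distance assumed)
        dists = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step S: move each centroid to the mean of its members (assumed update rule)
        for j in range(classnum):
            members = S[labels == j]
            if len(members):  # an empty category keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

For example, `kmeans(Y, classnum=3, iternum=100)` would return one category index per account, where `Y` stands for the dimension-reduced vectors and the values 3 and 100 are illustrative only.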
CN202011009471.0A 2020-09-23 2020-09-23 Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure Active CN112150285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011009471.0A CN112150285B (en) 2020-09-23 2020-09-23 Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011009471.0A CN112150285B (en) 2020-09-23 2020-09-23 Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure

Publications (2)

Publication Number Publication Date
CN112150285A CN112150285A (en) 2020-12-29
CN112150285B true CN112150285B (en) 2022-10-04

Family

ID=73897986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011009471.0A Active CN112150285B (en) 2020-09-23 2020-09-23 Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure

Country Status (1)

Country Link
CN (1) CN112150285B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114741433B (en) * 2022-06-09 2022-09-23 北京芯盾时代科技有限公司 Community mining method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622072A (en) * 2016-07-15 2018-01-23 阿里巴巴集团控股有限公司 A kind of recognition methods and server, terminal for web page operation behavior
CN109740722A (en) * 2018-12-26 2019-05-10 西安电子科技大学 A kind of network representation learning method based on Memetic algorithm
CN110704694A (en) * 2019-09-29 2020-01-17 哈尔滨工业大学(威海) Organization hierarchy dividing method based on network representation learning and application thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200097921A1 (en) * 2018-09-24 2020-03-26 Hitachi, Ltd. Equipment repair management and execution

Also Published As

Publication number Publication date
CN112150285A (en) 2020-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Wang Wei, Wang Bailing, Xin Guodong, Liu Yang, Huang Junheng, Ma Dongyang
Inventor before: Ma Dongyang, Wang Wei, Wang Bailing, Xin Guodong, Liu Yang, Huang Junheng
GR01 Patent grant