CN112150285B

CN112150285B - Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure

Info

Publication number: CN112150285B
Application number: CN202011009471.0A
Authority: CN
Inventors: 王巍; 王佰玲; 辛国栋; 刘扬; 黄俊恒; 马东阳
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2022-10-04
Anticipated expiration: 2040-09-23
Also published as: CN112150285A

Abstract

The invention relates to an abnormal financial organization level division system based on a neighborhood topological structure and a working method thereof, wherein the abnormal financial organization level division system comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization level division module which are connected in sequence; the data preprocessing module carries out data cleaning on the input abnormal financial organization transaction flow data and constructs a transaction network of the abnormal financial organization; the neighborhood topological structure feature extraction module generates a corresponding low-dimensional dense vector for each account; and the abnormal financial organization hierarchical division module performs dimension reduction processing on the low-dimensional dense vector generated by each account, performs clustering operation and acquires the hierarchical division result of the abnormal financial organization. According to the invention, only the information of the transfer parties of the transaction records of the abnormal financial organization is needed, so that the manual participation is reduced to a certain extent, the labor cost is reduced, a good hierarchical division result can be obtained, and the automatic processing of the hierarchical division of the abnormal financial organization is realized.

Description

Abnormal financial organization hierarchy dividing system and method based on neighborhood topological structure

Technical Field

The invention relates to an abnormal financial organization hierarchy dividing system and a working method thereof, in particular to an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure and a working method thereof.

Background

An abnormal financial organization is an organization that exhibits abnormal financial transaction behavior. Common types include pyramid-selling organizations, illegal fundraising organizations and money-laundering organizations, of which the pyramid-selling organization is typical. Pyramid selling is a criminal activity that disturbs the economic order and is essentially a Ponzi scheme: money from newly joined members is used to pay interest and short-term returns to earlier members, creating the illusion of profit and luring further investment. A pyramid-selling organization has a typical pyramid structure, and the transaction network formed by transfers among its members is likewise hierarchical. Accurately determining the level of each member from the transaction network of the organization is therefore of real practical significance. At present, analyzing member levels from the transfer transactions of such an organization depends largely on manual judgment and requires a great deal of manpower. Faced with large volumes of complex and widespread pyramid-selling data, the traditional manual approach cannot meet the requirements of large-scale data analysis and processing.

At present, methods based on network representation learning have made great progress in extracting network information. Such methods map the nodes of a network to low-dimensional dense vectors, which are then used in downstream tasks. Many studies borrow algorithms from natural language processing and apply the Word2Vec model to network representation learning with good results, producing node vectors that serve downstream tasks well. Domestic research on network representation learning mainly includes the LINE, SDNE and TADW algorithms. However, LINE and SDNE mainly capture the first-order and second-order neighbor information of nodes, so their sampling range is small. A pyramid-selling network is a multi-level network; these two methods cannot comprehensively capture the hierarchical information of account nodes in a pyramid-selling transaction network, tend to group accounts that are close in transfer transactions into one class, and therefore cannot divide the network into levels. The TADW algorithm is designed for social networks and uses heterogeneous information, namely the text posted by nodes. Applying TADW to a pyramid-selling transaction network requires additional data beyond the transaction network, such as transaction summaries, call records and chat logs, so more staff are needed for information collection, the up-front data preprocessing becomes labor-intensive, and labor costs cannot be effectively reduced.
Most existing methods do not analyze the specific characteristics of a pyramid-selling organization's transaction network, and so cannot effectively extract the hierarchical information of account nodes, complete the hierarchical division of abnormal financial organizations, reduce labor-intensive work or raise the degree of automation of member-level division.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure;

the invention also provides a working method of the abnormal financial organization hierarchy dividing system;

the method extracts the transaction flow information of an abnormal financial organization, constructs an abnormal transaction network on that basis, extracts the neighborhood topological structure features of each account using a custom network representation learning method based on the neighborhood topological structure, generates a corresponding low-dimensional dense vector for each account, and then clusters the node vectors with the k-means algorithm to complete the hierarchical division of the abnormal financial organization. The invention can be used for: 1) constructing an abnormal financial organization transaction network from financial transaction flow data; 2) hierarchical division of an abnormal financial organization based on neighborhood topological structure features.

Interpretation of terms:

1. The NTF algorithm is a network representation learning method. Its key ideas are as follows:

1) The structural similarity of a node pair is measured independently of the relative positions of the two nodes. The similarity of two nodes depends only on their neighborhood topologies, not on whether the two nodes are connected or on their positions, labels or attributes in the network.

2) An energy level concept is introduced to measure the importance of each neighborhood node within the neighborhood topology of the central node, and this importance is used as a feature of the neighborhood node when describing the neighborhood topology of the central node. Following the idea of energy dissipation, the neighborhood nodes that have a large influence on the central node are selected and an influence sub-network of the central node is constructed, which effectively reduces the interference of noisy data.

3) The relative and absolute features of the neighborhood nodes are used to represent the structural features of the central node. Both the inherent attributes of a neighbor node and its importance to the neighborhood structure of the central node are considered, so the neighborhood structural features of the central node are reflected more faithfully.

4) The representation of a node's neighborhood structure is hierarchical. Starting from the central node, sampling proceeds layer by layer according to the distance from each neighbor node to the central node, following the idea of breadth-first search, which describes the topological features of the neighborhood structure of the central node more reasonably.

2. The NTF model mainly comprises the following four steps:

1) Generate an influence sub-network for each node. Using the energy level concept of the NTF model to measure the importance of neighbor nodes relative to a central node, the neighbor nodes that matter most to the neighborhood topology of the central node are first selected from the network and, together with their edges, form the influence sub-network of the central node.

2) Obtain the neighborhood topology representation of each node. Given a maximum sampling depth, on the influence sub-network obtained in step 1), a breadth-first search starting from the central node expands the sampling outwards layer by layer according to distance; each neighbor node reached during sampling is represented by its energy level and its degree, where the energy level is a relative feature and the degree is an absolute feature. The sequence of neighbor nodes obtained by this expanded sampling is the neighborhood topology representation of the central node.

3) Construct a secondary graph. The secondary graph is a new undirected weighted graph constructed on the basis of the original graph. The neighborhood topology similarity of each node pair is calculated from the neighborhood topology representations obtained in step 2), and from it the weight of the corresponding edge of the secondary graph; the specific calculation is shown in formula (III) below.

4) Generate a vector representation of each node. A random walk is first performed on the secondary graph constructed in step 3) to generate a context sequence for each node, and the Skip-Gram algorithm then generates a vector representation for each node from its context sequence.

The technical scheme of the invention is as follows:

an abnormal financial organization hierarchical division system based on a neighborhood topological structure comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchical division module which are connected in sequence;

the data preprocessing module is used for: sequentially carrying out duplication removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracting transaction account information and transaction counter account information in the abnormal financial organization transaction flow data, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, an abstract, a transaction balance and a transaction place;

the neighborhood topology feature extraction module is used for: processing data of a transaction network of an abnormal financial organization by using a user-defined network representation learning method based on a neighborhood topological structure, extracting neighborhood topological structure information of account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;

the abnormal financial organization hierarchy partitioning module is used for: firstly, performing dimension reduction on low-dimensional dense vectors generated by each account by using a PCA (principal component analysis) method, and then performing clustering operation on the low-dimensional dense vectors subjected to the dimension reduction by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization.

The working method of the abnormal financial organization hierarchy dividing system runs in a computer and comprises the following steps:

(1) The data preprocessing module sequentially performs duplicate removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracts a transaction account number and a transaction counter-account number of each transaction record in the abnormal financial organization transaction flow data, and constructs a transaction network of the abnormal financial organization;

(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined network representation learning method based on a neighborhood topological structure, extracts neighborhood topological structure information of account number nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account number;

(3) And the abnormal financial organization hierarchical division module firstly performs dimension reduction on the low-dimensional dense vector generated by each account by using a PCA method, and then performs clustering operation on the low-dimensional dense vector subjected to the dimension reduction by using a k-means algorithm to obtain the abnormal financial organization hierarchical division result.

The original financial transaction flow data suffers from redundancy, missing values, non-standard formats and similar defects, so data cleaning is required first to extract valid transaction records. Preferably, in step (1), the data cleaning includes:

A. Incomplete data removal: records with a non-standard format are cleaned, i.e. records in which the number of digits of the transaction account or the counterparty account is inconsistent with the number of digits of a standard bank card account; such records are regarded as incomplete data and removed.

B. Denoising: records with a transaction amount of less than 50 yuan are removed; abnormal transactions of abnormal financial organizations generally involve large amounts, so removing small-amount records helps improve the accuracy of the hierarchical division.

C. Deduplication: redundant data is removed, where redundant data means that the same transaction in the transaction flow data is recorded twice in different ways. For example, for a transaction between $Account_1$ and $Account_2$, the flow data contains both the record "transaction account is $Account_1$, counterparty account is $Account_2$" and the record "transaction account is $Account_2$, counterparty account is $Account_1$"; one of the two records needs to be deleted.
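The three cleaning rules can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the record field names (`account`, `counterparty`, `amount`, `time`), the accepted account lengths in `valid_lengths` and the rule that two records are mirrored copies when they share the same unordered account pair, amount and time are all assumptions made for illustration.

```python
def clean_records(records, valid_lengths=(16, 19), min_amount=50.0):
    """Apply rules A (incomplete data), B (denoising) and C (deduplication)."""
    seen = set()
    cleaned = []
    for rec in records:
        a, b = rec["account"], rec["counterparty"]
        # A. Drop records whose account numbers do not have a standard length.
        #    valid_lengths is a hypothetical choice, not taken from the text.
        if not all(x.isdigit() and len(x) in valid_lengths for x in (a, b)):
            continue
        # B. Drop small transactions (amount below 50 yuan).
        if rec["amount"] < min_amount:
            continue
        # C. Treat records with the same unordered account pair, amount and
        #    time as one transaction recorded from both sides (assumption).
        key = (frozenset((a, b)), rec["amount"], rec["time"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


records = [
    {"account": "6222000000000001", "counterparty": "6222000000000002",
     "amount": 1000.0, "time": "2020-01-01 10:00"},
    # Same transaction recorded from the other side.
    {"account": "6222000000000002", "counterparty": "6222000000000001",
     "amount": 1000.0, "time": "2020-01-01 10:00"},
    # Amount below the threshold.
    {"account": "6222000000000001", "counterparty": "6222000000000003",
     "amount": 20.0, "time": "2020-01-02 09:00"},
    # Account number with a non-standard length.
    {"account": "62220001", "counterparty": "6222000000000003",
     "amount": 500.0, "time": "2020-01-03 09:00"},
]
cleaned = clean_records(records)
```

Because the three filters are independent of one another, the order in which they are applied does not change the result in this sketch.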

According to the invention, in the step (1), the transaction account information and the counter-transaction account information in the transaction flow data of the abnormal financial organization are extracted to construct the transaction network of the abnormal financial organization, which includes:

D. The account set involved in the transaction flow data of the abnormal financial organization is $C = \{c_1, c_2, \dots, c_i, \dots, c_n\}$, where $n$ is the total number of accounts, $C$ is the set of all accounts and $c_i$ is the $i$-th account in the set;

E. According to $f(c_i) = v_i$, the point set $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ of the transaction network is built, i.e. every account $c_i$ is mapped to a positive integer in the range $[1, n]$: $v_i = i$, $i = 1, 2, \dots, n$;

F. According to the point set $V$ of the transaction network, the account set $C$ and the transaction flow data, the edge set $E$ of the transaction network is constructed, with $e_{i,j} = (v_i, v_j)$. Each edge $e_{i,j} \in E$ has a corresponding weight $w_{i,j}$, which represents the number of transactions between accounts $c_i$ and $c_j$, and $w_{i,j} = w_{j,i}$;

G. Finally, the transaction network is constructed: $G = \{V, E\}$.
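Steps D to G can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_transaction_network`, the input format (a list of account pairs taken from the cleaned records) and the choice to number accounts in sorted order are assumptions, since the text only requires that each account be mapped to a distinct integer in $[1, n]$.

```python
from collections import Counter


def build_transaction_network(pairs):
    """Map accounts to node ids 1..n and count transactions per undirected edge."""
    # D/E. Collect the account set C and map each account c_i to v_i = i.
    accounts = sorted({acc for pair in pairs for acc in pair})
    node_of = {acc: i for i, acc in enumerate(accounts, start=1)}
    # F. The edge weight w_ij is the number of transactions between c_i and c_j;
    #    storing the pair as (min, max) enforces w_ij = w_ji.
    weights = Counter()
    for a, b in pairs:
        i, j = node_of[a], node_of[b]
        if i != j:  # ignore transfers from an account to itself
            weights[(min(i, j), max(i, j))] += 1
    # G. The network is fully described by the node mapping and weighted edges.
    return node_of, dict(weights)


pairs = [("A", "B"), ("B", "A"), ("B", "C")]
node_of, weights = build_transaction_network(pairs)
```

Here `node_of` plays the role of the mapping $f$ and the keys of `weights` form the edge set $E$.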

Preferably, step (2) comprises the following steps:

in the network, the nodes with similar neighborhood structures often have similar functions and play similar roles, which reflects that account number nodes with similar neighborhood topology structures in the transaction network of the abnormal financial organization are often in the same level in the organization. The NTF method designed by the invention can effectively extract the neighborhood topological structure information of the node, thereby effectively completing the task of network level division.

Given a graph $G = (V, E)$, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$ represents the set of $n$ nodes and $E$ represents the edge set of graph $G$; every edge $e_{i,j}$ has a corresponding weight $w_{i,j} \ge 0$; if there is no edge between two nodes $v_i, v_j$, then $w_{i,j} = 0$, otherwise $w_{i,j} > 0$; if $G$ is an undirected graph, then $e_{i,j} \equiv e_{j,i}$ and $w_{i,j} = w_{j,i}$; if $G$ is an unweighted graph, then $w_{i,j} = 1$, otherwise $w_{i,j} \ge 0$;

Node $v$ has an energy level $el_u(v)$ relative to node $u$. The initial energy is set to $Q_u(u) = 1$ and there are $k^*$ energy levels in total, each covering its own range of energy. The residual energy on arriving at node $v$ from node $u$ is $Q_u(v)$; if $Q_u(v)$ falls within the energy range of the $k$-th level, then $el_u(v) = k$; the set of all nodes whose energy level relative to $u$ is $k$ is $N_k(u) = \{v_i \mid el_u(v_i) = k\}$;

The maximum sampling depth $d^*$ is the maximum distance from a sampled node to the central node when sampling is centered on a given node and expanded layer by layer according to distance; nodes whose distance to the central node exceeds $d^*$ are not included in the sampling range;

The secondary graph $G' = \{V, E', S\}$ is an undirected, weighted complete graph generated on the basis of the original graph, where $E'$ is the edge set and every edge $e_{i,j} \in E'$ has a corresponding weight $s_{i,j} \ge 0$ that reflects the neighborhood structure similarity of nodes $v_i$ and $v_j$;

H. A separate influence sub-network is generated for each node by the NTF algorithm; the subsequent sampling is also performed on the influence sub-network. Taking a node of the transaction network $G$ as the central node, the NTF model selects from $G$ the neighbor nodes that matter most to the neighborhood structure of the central node and, together with their edges, forms the influence subgraph, i.e. the influence sub-network, of the central node. Specifically: let the central node be $u$ with initial energy $Q_u(u) = 1$; starting from $u$, energy propagates along the edges until it falls to the lowest energy level or the remaining energy is insufficient to reach the next node. During propagation, the energy dissipation rate depends on factors such as the distance from a node to $u$, the degree of the node and the edge weight: the closer a node is to $u$, the larger its degree and the larger the edge weight, the more important the node is to the neighborhood structure of $u$ and the less energy is lost on reaching it. Let a path be $P = \{u, v_1, \dots, v_i, v_j\}$, where the distance from $v_j$ to the central node $u$ is $k$; starting from $u$, energy propagates along the path via $v_i$ to $v_j$, and the residual energy $Q_u(v_j)$ on arrival is given by formula (I):

In formula (I), $Q_u(v_i)$ denotes the residual energy on arriving at $v_i$ along path $P$ from node $u$, $k$ denotes the distance from $v_j$ to $u$, $w_{i,j}$ denotes the weight of edge $e_{i,j}$, $a$ is the energy decay rate, $l(P)$ denotes the length of path $P$, and $d(v_j)$ denotes the degree of node $v_j$;

Note that there may be multiple paths from $u$ to $v_j$, so $v_j$ may belong to both $N_{k_1}(u)$ and $N_{k_2}(u)$ at the same time, where $k_1 < k_2$. In that case it is finally assigned to $N_{k_1}(u)$ according to the proximity principle; a node is not allowed to be at multiple energy levels at the same time.

Starting from $u$, the set of reachable nodes is denoted $V(u)$ and the set of edges $E(u)$; the influence subgraph is $G(u) = \{V(u), E(u), N\}$, where $N = \{el_u(v) \mid v \in V(u)\}$ is the set of energy levels of the reached nodes;

I. Neighborhood topological structure feature sampling is carried out on the influence sub-network as follows: according to a given sampling depth $d^*$, starting from the central node in the influence sub-network, a breadth-first search strategy expands the sampling layer by layer according to distance, and each node reached during sampling is represented by its energy level and its degree, where the energy level is a relative feature and the degree is an absolute feature;

Define $R_i(u)$ as the set of nodes at distance $i$ from the central node $u$, and let $DE_i(u)$ denote the set of ordered pairs of absolute and relative features of the nodes reached from $u$ at sampling depth $i$, as defined in formula (II):

$DE_i(u) = \{(d(v), el_u(v)) \mid v \in R_i(u)\}$ (II)

In formula (II), $d(v)$ denotes the degree of node $v$ and $el_u(v)$ denotes the energy level of node $v$ relative to node $u$.

The elements of $DE_i(u)$ are arranged in ascending order, with the node degree as the primary key and the node energy level as the secondary key. Combining the absolute and relative features of the nodes in this way gives the neighborhood topological structure feature set $DE(u)$, which is the neighborhood structure feature of the central node;
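The layer-by-layer sampling of formula (II) can be sketched in Python as follows. The sketch assumes that the influence sub-network is given as an adjacency dictionary `adj`, that the energy levels $el_u(v)$ have already been computed and are passed in as `energy_level`, and that the degree $d(v)$ is measured on the same adjacency structure; all three are illustrative assumptions, as is the function name.

```python
def neighborhood_features(adj, energy_level, center, max_depth):
    """Return [DE_1(u), DE_2(u), ...] as sorted lists of (degree, energy level)."""
    visited = {center}
    frontier = [center]
    layers = []
    for _ in range(max_depth):
        # Breadth-first expansion: collect the unvisited nodes one hop further out.
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        if not next_frontier:
            break
        # Tuple sorting gives ascending order by degree, then by energy level.
        layers.append(sorted((len(adj[v]), energy_level[v]) for v in next_frontier))
        frontier = next_frontier
    return layers


adj = {"u": ["a", "b"], "a": ["u", "c"], "b": ["u"], "c": ["a"]}
energy_level = {"a": 1, "b": 1, "c": 2}  # hypothetical, precomputed el_u(v)
features = neighborhood_features(adj, energy_level, "u", max_depth=2)
```

For this small example the first layer holds the two direct neighbors of `u` and the second layer holds node `c`.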

J. A secondary graph is constructed from the neighborhood structure features of the nodes; the random walk is later performed on this secondary graph.

Define $distance(u, v)$ as the neighborhood topological structure distance of nodes $u$ and $v$; it is calculated from their neighborhood topological structure feature sets $DE(u)$ and $DE(v)$ as shown in formula (III):

$dist(\cdot)$ is a function that calculates the distance between two sequences;

the distance of the $k$-th hop neighborhood exists if and only if nodes $u$ and $v$ both have neighbors at distance $k$; $w_i$ denotes the proportion that the $i$-th hop neighborhood distance contributes to the total distance and is calculated as shown in formula (IV):

In the present model, $dist(\cdot)$ uses the DTW (Dynamic Time Warping) algorithm; in practice the FastDTW algorithm is adopted to balance calculation speed and accuracy. For the DTW algorithm, the difference between any two elements is defined as shown in formula (V):

The difference takes both the degree and the energy level of the nodes into account: $a_1, b_1$ denote the first elements of the ordered pairs $a, b$, i.e. the node degrees, and $a_2, b_2$ denote the second elements, i.e. the node energy levels;

On this basis the secondary graph is constructed. The weight $s_{u,v}$ of the edge $(u, v)$ between two points $u, v$ in the secondary graph embodies their neighborhood structure similarity and is calculated as shown in formula (VI):

$s_{u,v} = e^{-distance(u,v)}$ (VI)

Note that $s_{u,v} = s_{v,u}$. In this way the neighborhood structure distance of two nodes is mapped to a value in the interval $(0, 1]$;
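The two building blocks of step J, a DTW distance between two feature sequences and the mapping of formula (VI), can be sketched in Python as below. This is plain DTW rather than FastDTW, and the element cost `placeholder_cost` is a hypothetical stand-in for formula (V), which is not reproduced in this text; the per-hop weighting of formulas (III) and (IV) is likewise left out.

```python
import math


def dtw(seq_a, seq_b, cost):
    """Classic dynamic-time-warping distance between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # advance in seq_a only
                                     table[i][j - 1],      # advance in seq_b only
                                     table[i - 1][j - 1])  # advance in both
    return table[n][m]


def placeholder_cost(a, b):
    # Hypothetical element difference over (degree, energy level) pairs;
    # the actual definition is formula (V) of the text.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def similarity(distance):
    """Formula (VI): map a non-negative distance to a weight in (0, 1]."""
    return math.exp(-distance)


same = dtw([(1, 1), (2, 1)], [(1, 1), (2, 1)], placeholder_cost)
different = dtw([(1, 1)], [(2, 1), (2, 1)], placeholder_cost)
```

Identical sequences give distance 0 and therefore similarity 1, the upper end of the interval $(0, 1]$.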

K. A vector representation of each node, i.e. a low-dimensional dense vector, is generated as follows:

A random walk is carried out on the secondary graph to generate a context sequence for each node; the probability of moving from the current node $u$ to another node is calculated from the weights of the secondary graph edges, as shown in formula (VII):

$Z(u)$ is a normalization factor; the walk probability shows that, starting from the current node $u$, the walk tends to move to nodes whose structure is similar to that of the current node, so the context of node $u$ consists of nodes with a similar structure, regardless of their relative location in the network.
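A weighted random walk on the secondary graph can be sketched in Python as follows. Formula (VII) is not reproduced in this text, so the sketch assumes the simplest reading: the probability of moving from $u$ to $v$ is $s_{u,v} / Z(u)$, with $Z(u)$ the sum of the weights of all edges leaving $u$. The function name, the nested-dictionary input format and the example weights are illustrative assumptions.

```python
import random


def random_walk(similarity, start, length, seed=0):
    """Generate one node context sequence by a similarity-weighted walk."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        u = walk[-1]
        candidates = [v for v in similarity[u] if v != u]
        weights = [similarity[u][v] for v in candidates]
        # random.choices normalizes the weights itself, which corresponds to
        # dividing by Z(u) = sum of s_uv under the assumption stated above.
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Hypothetical secondary-graph weights s_uv for three nodes.
similarity = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.9, "c": 0.2},
    "c": {"a": 0.1, "b": 0.2},
}
walk = random_walk(similarity, "a", length=5)
```

Running the walk several times from every node yields the context sequences that the Skip-Gram step consumes.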

Using the Skip-Gram algorithm, a vector representation is learned for each node. The vector representations of nodes whose neighborhood topologies are close together are closer in vector space.

Preferably, according to the invention, a vector representation is learned for each node using the Skip-Gram algorithm, comprising the following steps:

(1) A one-hot vector is generated for each node: for each node $v_i \in V$, a corresponding one-hot vector is generated whose $i$-th element is 1 and whose other elements are 0;

(2) Let $d$ be the dimension of the generated dense vectors and randomly initialize the matrices $U$ and $W$, where $U_i$ represents the node vector of $v_i$ when it acts as a central node and $W_i$ represents the node vector of $v_i$ when it acts as a background node;

(3) Let $v_i \in V$ denote a central node and $context(v_i)$ denote the context of $v_i$, with $|context(v_i)| = wsize$, where $wsize$ is the window length; the objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):

If $v_j \in context(v_i)$, then $v_i \in context(v_j)$, i.e. $v_i$ and $v_j$ are each other's background node and central node; the background nodes of a central node are the nodes that are no more than $wsize$ away from it in the node context sequence;

(4) The matrices $U$ and $W$ are updated with the back-propagation algorithm to maximize $f(\text{Skip-Gram})$; for each node $v_i \in V$, the low-dimensional dense vector generated by the Skip-Gram algorithm is $W_i$, recorded as $vec_i$.

Preferably, in step (3), the PCA method is used to reduce the dimension of the low-dimensional dense vector generated for each account, where the low-dimensional dense vector of $v_i$ is $vec_i$. The method comprises the following steps:

L. constructing a matrix $X$ from the vectors $vec_i$;

M. calculating the covariance matrix $Cov$ of $X$;

N. calculating the eigenvalues and eigenvectors of $Cov$: $(\lambda A - Cov)x = 0$, where $\lambda$ is an eigenvalue of $Cov$, $A$ is the identity matrix and $x$ is an eigenvector of $Cov$;

O. setting the dimension after reduction to $d'$, arranging the eigenvectors as rows from top to bottom in descending order of eigenvalue, and taking the first $d'$ rows to obtain the matrix $P$;

P. denoting the data after dimension reduction as $Y$, $Y = PX$, where $y_i$ is the dimension-reduced low-dimensional dense vector of $v_i$.
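Steps L to P can be illustrated with a dependency-free Python sketch. Two points are assumptions rather than part of the text: the vectors are mean-centered before the covariance is computed, as is usual for PCA, and the leading eigenvectors are obtained by power iteration with deflation instead of solving $(\lambda A - Cov)x = 0$ directly. A numerical library would normally be used in practice; the function name `pca_reduce` is hypothetical.

```python
import math


def pca_reduce(vectors, d_prime, iters=200):
    """Project each vector onto the top d_prime principal directions."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # M. Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rows = []  # rows of the projection matrix P
    for _ in range(d_prime):
        # N. Power iteration converges to the eigenvector with the largest
        #    remaining eigenvalue (assuming the start is not orthogonal to it).
        x = [1.0 / (j + 1) for j in range(d)]
        scale = math.sqrt(sum(t * t for t in x))
        x = [t / scale for t in x]
        for _ in range(iters):
            y = [sum(cov[a][b] * x[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(t * t for t in y))
            if norm < 1e-12:
                break
            x = [t / norm for t in y]
        eigenvalue = sum(x[a] * cov[a][b] * x[b] for a in range(d) for b in range(d))
        rows.append(x)
        # O. Deflate so that the next pass finds the next-largest eigenvalue.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * x[a] * x[b]
    # P. y_i = P * (vec_i - mean)
    return [[sum(p[j] * c[j] for j in range(d)) for p in rows] for c in centered]


vectors = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
reduced = pca_reduce(vectors, d_prime=1)
```

In the example all points lie on one line, so a single component preserves their spacing along that line.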

Preferably, in step (3), the k-means algorithm is used to cluster the dimension-reduced low-dimensional dense vectors and obtain the hierarchical division result of the abnormal financial organization, where the dimension-reduced node vector corresponding to $v_i$ is $y_i$.

The input sample set of the k-means algorithm is $S = \{y_i \mid i \in [1, n]\}$, $Classnum$ is the number of categories and $Iternum$ is the number of iterations. The method comprises the following steps:

Q. randomly selecting $Classnum$ node vectors from $S$ as the initial centroid vectors of the categories; the centroid vectors are $\{\mu_1, \mu_2, \dots, \mu_{Classnum}\}$ and the corresponding categories are $\{class_1, class_2, \dots, class_{Classnum}\}$;

R. for each $y_i \in S$, calculating $dist(y_i, \mu_j)$ and assigning $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; $dist(y_i, \mu_j)$ is calculated as shown in formula (IX):

S. updating the centroid vector of each category according to formula (X):

T. executing step R and step S in sequence, terminating after $Iternum$ executions, and outputting the classification result.
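Steps Q to T can be sketched in Python as follows. Formulas (IX) and (X) are not reproduced in this text, so the sketch assumes the standard k-means choices: Euclidean distance for $dist(y_i, \mu_j)$ and the mean of the assigned vectors for the centroid update. The function name, the `seed` parameter and the handling of empty categories (the centroid is left unchanged) are illustrative assumptions.

```python
import math
import random


def kmeans(samples, classnum, iternum, seed=0):
    """Fixed-iteration k-means returning one label per sample and the centroids."""
    rng = random.Random(seed)
    # Q. Pick Classnum sample vectors as the initial centroids.
    centroids = [list(c) for c in rng.sample(samples, classnum)]
    labels = [0] * len(samples)
    for _ in range(iternum):
        # R. Assign every vector to its nearest centroid (Euclidean, assumed).
        for i, y in enumerate(samples):
            labels[i] = min(range(classnum),
                            key=lambda j: math.dist(y, centroids[j]))
        # S. Move each centroid to the mean of its members (assumed update rule).
        for j in range(classnum):
            members = [samples[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # T. Stop after Iternum rounds and output the classification.
    return labels, centroids


samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(samples, classnum=2, iternum=10)
```

The label numbers themselves depend on the random initialization; only the grouping of the samples is meaningful.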

The invention has the beneficial effects that:

according to the invention, the transaction flow records of the abnormal financial organization can be utilized to construct the transaction network of the organization, the neighborhood topological structure characteristics of the account number nodes in the network are extracted through a self-defined NTF algorithm, and a low-dimensional dense vector is generated for each account number node based on the neighborhood topological structure characteristics. And then, carrying out dimensionality reduction on the vector by using a PCA algorithm, and processing the dimensionality-reduced node vector by using a k-means algorithm to finally obtain the hierarchical information of the account node. The abnormal financial organization hierarchical division algorithm based on the neighborhood topology structure provided by the invention needs less information, only the information of both transfer parties of the transaction record of the abnormal financial organization is needed, and the participation of other additional information items is not needed, so that the manual participation is reduced to a certain extent, the labor cost is reduced, a good hierarchical division result can be obtained, and the automatic processing of the hierarchical division of the abnormal financial organization is realized. The method can assist relevant workers to analyze and judge abnormal financial organizations, and improve the working efficiency of the relevant workers.

Drawings

FIG. 1 is a block diagram of an abnormal financial organization hierarchy partitioning system based on neighborhood topology of the present invention;

FIG. 2 is a schematic sampling of an influence subnetwork of the present invention;

FIG. 3 is a schematic flow chart of a method of operation of the abnormal financial organization hierarchy partitioning system of the present invention;

FIG. 4 (a) is a schematic diagram of the cluster visualization result obtained by the NTF method provided in Example 2;

FIG. 4 (b) is a schematic diagram of the cluster visualization result obtained by the existing BoostNE method;

FIG. 4 (c) is a schematic diagram of the cluster visualization result obtained by the existing Role2Vec method;

FIG. 4 (d) is a schematic diagram of the cluster visualization result obtained by the existing Struc2Vec method.

Detailed Description

The invention is further defined in the following description, without being limited thereto, by reference to the drawings and examples.

Example 1

An abnormal financial organization hierarchy dividing system based on a neighborhood topological structure is shown in figure 1 and comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchy dividing module which are connected in sequence;

the data preprocessing module is used for: sequentially carrying out duplication removal, denoising and incomplete data removal operations on the input abnormal financial organization transaction flow data, extracting transaction account information and transaction counter account information in the abnormal financial organization transaction flow data, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, a summary, a transaction balance and a transaction place;

the neighborhood topology feature extraction module is used for: processing the data of the transaction network of the abnormal financial organization with a customized network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;

the abnormal financial organization hierarchy partitioning module is used for: firstly, carrying out dimension reduction processing on low-dimensional dense vectors generated by each account by using a PCA (principal component analysis) method, and then carrying out clustering operation on the low-dimensional dense vectors subjected to the dimension reduction processing by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization.

Example 2

The working method of the abnormal financial organization hierarchy dividing system of Example 1 runs on a computer and, as shown in FIG. 3, comprises the following steps:

The original financial transaction flow data suffers from redundancy, missing data, non-standard formats, and similar defects, so data cleaning is required first to extract the valid transaction record information. Data cleaning comprises:

B. Denoising: remove the records whose transaction amount is less than 50 yuan; the abnormal transactions of an abnormal financial organization are generally large, so removing small-amount transaction records helps improve the accuracy of the hierarchical division.

C. Deduplication: remove redundant data, where redundant data means that the same transaction in the abnormal financial organization transaction flow data is recorded twice in different ways. For example, when a transaction takes place between $Account_1$ and $Account_2$, the two records "the transaction account is $Account_1$, the counterparty account is $Account_2$" and "the transaction account is $Account_2$, the counterparty account is $Account_1$" describe the same transaction, and one of the two records must be deleted.
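The denoising and deduplication steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Record` fields, the `clean` function, and the rule that two records are "the same transaction" when they share the unordered account pair, the amount, and the time are all assumptions introduced here.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical subset of one transaction flow record."""
    account: str       # transaction account number
    counterparty: str  # counterparty account number
    amount: float      # transaction amount in yuan
    time: str          # transaction time


MIN_AMOUNT = 50.0  # denoising threshold from step B


def clean(records):
    """Apply denoising (step B) and deduplication (step C)."""
    seen = set()
    kept = []
    for r in records:
        # Step B: drop small-amount records.
        if r.amount < MIN_AMOUNT:
            continue
        # Step C (assumed matching rule): a mirrored record has the same
        # unordered account pair, amount, and time.
        key = (frozenset((r.account, r.counterparty)), r.amount, r.time)
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept


records = [
    Record("A1", "A2", 1000.0, "2020-01-01 10:00"),
    Record("A2", "A1", 1000.0, "2020-01-01 10:00"),  # same transfer, mirrored
    Record("A1", "A3", 20.0, "2020-01-02 09:00"),    # below 50 yuan
    Record("A3", "A2", 500.0, "2020-01-03 12:00"),
]
cleaned = clean(records)
```

Keeping the first of two mirrored records is an arbitrary choice in this sketch; the text only requires that one of the two be deleted.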

Extracting the transaction account information and the counterparty account information from the transaction flow data of the abnormal financial organization and constructing the transaction network of the abnormal financial organization comprises the following steps:

D. Obtain the account set involved in the transaction flow data of the abnormal financial organization, $C = \{c_1, c_2, \dots, c_i, \dots, c_n\}$, where n is the total number of accounts, C is the set of all accounts, and $c_i$ is the i-th account in the set;

E. Build the point set of the transaction network, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$, according to $f(c_i) = v_i$, which maps each account $c_i$ to a positive integer in the range $[1, n]$: $v_i = i$, $i = 1, 2, \dots, n$;

F. Construct the edge set E of the transaction network from the point set V, the account set C, and the transaction flow data, where each edge is $e_{i,j} = (v_i, v_j)$;

G. Finally, construct the transaction network: $G = \{V, E\}$.
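Steps D to G can be sketched as below. The function name `build_network`, the alphabetical ordering of accounts before numbering, and the treatment of each record as an ordered (account, counterparty) pair are assumptions made for illustration; the text does not state how accounts are ordered or whether edges are directed.

```python
def build_network(transfers):
    """Build (f, V, E) from an iterable of (account, counterparty) pairs."""
    # Step D: account set C (sorted only to make the numbering reproducible).
    accounts = sorted({a for pair in transfers for a in pair})
    # Step E: f(c_i) = v_i = i, with i in [1, n].
    f = {c: i for i, c in enumerate(accounts, start=1)}
    V = set(f.values())
    # Step F: one edge e_{i,j} = (v_i, v_j) per distinct account pair.
    E = {(f[a], f[b]) for a, b in transfers}
    # Step G: the network is G = {V, E}.
    return f, V, E


transfers = [("acc_b", "acc_a"), ("acc_a", "acc_c"), ("acc_b", "acc_a")]
f, V, E = build_network(transfers)
```

Repeated transfers between the same pair collapse into a single edge here; a weighted variant could instead count them.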

(2) The neighborhood topological structure feature extraction module processes data of a transaction network of an abnormal financial organization by using a user-defined neighborhood topological structure-based network representation learning method, extracts neighborhood topological structure information of account nodes of the transaction network, and generates a corresponding low-dimensional dense vector for each account; the method comprises the following steps:

$E$ represents the edge set of graph G; every edge $e_{i,j}$ has a corresponding weight $w_{i,j}$, with $w_{i,j} \ge 0$; if two nodes $v_i, v_j$ are not connected by an edge, then $w_{i,j} = 0$, otherwise $w_{i,j} > 0$; if G is an undirected graph, then $e_{i,j} \equiv e_{j,i}$ and $w_{i,j} = w_{j,i}$; if G is an unweighted graph, then $w_{i,j} = 1$, otherwise $w_{i,j} \ge 0$;

The residual energy from point u to point v is $Q_u(v)$;

The maximum sampling depth $d^*$ is the largest distance from a sampled node to the central node when sampling expands outward layer by layer, by distance, from that central node; nodes whose distance to the central node exceeds $d^*$ fall outside the sampling range;

is a set of edges that are to be considered,

every edge $e_{i,j}$ has a corresponding weight $s_{i,j} \ge 0$, which reflects the neighborhood structure similarity of node $v_i$ and node $v_j$;

H. Generate an influence sub-network for each node with the NTF algorithm; the subsequent sampling is also performed on the influence sub-network. As shown in FIG. 2, a node is taken from the transaction network G as the central node, the NTF model selects from G the neighbor nodes that are of greater importance to the neighborhood structure of the central node, and these neighbor nodes together with their edges form the influence subgraph, i.e. the influence sub-network, of the central node. Specifically: let the central node be u with initial energy $Q_u(u) = 1$; starting from u, energy propagates along the edges until it is at the lowest energy level or the remaining energy is insufficient to reach the next node. During propagation, the rate of energy dissipation depends on factors such as the distance from a node to u, the degree of the node, and the edge weight: the closer a node is to u, the larger its degree, and the larger the edge weight, the more important the node is to the neighborhood structure of u and the less energy is lost on reaching it. Let the path be $P = \{u, v_1, \dots, v_i, v_j\}$, where the distance from $v_j$ to the central node u is k; starting from u, the energy propagates along the path via $v_i$ to $v_j$, and the residual energy $Q_u(v_j)$ on arrival is shown in formula (I):

representing the degree of node $v_j$;

One node is not allowed to be at multiple energy levels at the same time.

I. Sample the neighborhood topological structure features on the basis of the influence sub-network: for a given sampling depth $d^*$, start from the central node in the influence sub-network, adopt a breadth-first search strategy, and expand the sampling layer by layer according to distance; each node reached during sampling is represented by its energy level and its degree, where the energy level of the node is a relative feature and the degree is an absolute feature;

$$DE_i(u) = \{(d(v),\, el_u(v)) \mid v \in R_i(u)\} \tag{II}$$

the feature is the neighborhood structure of the central node;
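The layer-by-layer sampling of formula (II) can be sketched as a bounded breadth-first search. Several things here are assumptions rather than details from the text: the energy levels $el_u(v)$ are taken as a precomputed input (formula (I) is not reproduced above), $R_i(u)$ is read as the set of nodes at distance i from u, the degree is counted inside the influence sub-network, and each layer is returned as a sorted list so that it can later be compared as a sequence. All function and variable names are hypothetical.

```python
from collections import deque


def sample_layers(adj, energy_level, u, max_depth):
    """Return {i: sorted [(degree, energy level), ...]} for hops i = 0..max_depth.

    adj          -- adjacency lists of the influence sub-network of u
    energy_level -- assumed precomputed el_u(v) for every node v
    max_depth    -- the sampling depth d*
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == max_depth:
            continue  # nodes beyond d* are outside the sampling range
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    layers = {}
    for v, k in dist.items():
        # Each sampled node is represented by the pair (d(v), el_u(v)).
        layers.setdefault(k, []).append((len(adj[v]), energy_level[v]))
    return {k: sorted(pairs) for k, pairs in layers.items()}


adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
energy_level = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
layers = sample_layers(adj, energy_level, u=1, max_depth=2)
```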

J. constructing a quadratic graph according to the neighborhood structure characteristics of the nodes; a random walk operation will be performed on the quadratic graph.

$dist(\cdot)$ is a function that calculates the distance between two sequences;

the k-th hop neighborhood distance exists if and only if nodes u and v both have neighbors at distance k; $w_i$ represents the proportion that the i-th hop neighborhood distance contributes to the total distance and is calculated as shown in formula (IV):

in the present model, the DTW (Dynamic Time Warping) algorithm is used; in practical calculation, the FastDTW algorithm is adopted to balance calculation speed and accuracy. Following the DTW algorithm, the difference between any two elements is defined as shown in formula (V):

the difference value takes into account both the degree and the energy level of the nodes: $a_1, b_1$ denote the first elements of the ordered pairs a and b, i.e. the node degrees, and $a_2, b_2$ denote the second elements of the ordered pairs, i.e. the node energy levels;

$$s_{u,v} = e^{-distance(u,v)} \tag{VI}$$

K. Generate the vector representation of each node, i.e. its low-dimensional dense vector:

perform a random walk on the quadratic graph to generate the context sequence of each node; the probability of moving from the current node u to another node is calculated from the edge weights of the quadratic graph, as shown in formula (VII):

$Z(u)$ is a normalization factor; under this walk probability, a walk starting from the current node tends to move to nodes whose structure is similar to that of the current node, so that the context of node u consists of nodes structurally similar to u, regardless of their relative location in the network.
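The similarity mapping of formula (VI) and the weighted random walk can be sketched together. Formula (VII) is not reproduced above, so the transition rule used here, probability $s_{u,v} / Z(u)$ with $Z(u)$ the sum of the weights of the edges leaving u, is an assumption consistent with the description of $Z(u)$ as a normalization factor. The function names and the nested `distances` dictionary are hypothetical.

```python
import math
import random


def similarity(distance):
    """Formula (VI): s = e^(-distance)."""
    return math.exp(-distance)


def transition_probs(u, dist_to):
    """Assumed form of formula (VII): p(u -> v) = s_{u,v} / Z(u)."""
    weights = {v: similarity(d) for v, d in dist_to.items() if v != u}
    z = sum(weights.values())  # normalization factor Z(u)
    return {v: w / z for v, w in weights.items()}


def random_walk(start, distances, length, rng):
    """Walk on the quadratic graph; distances[u][v] = distance(u, v)."""
    walk = [start]
    for _ in range(length - 1):
        probs = transition_probs(walk[-1], distances[walk[-1]])
        nodes = list(probs)
        step = rng.choices(nodes, weights=[probs[v] for v in nodes])[0]
        walk.append(step)
    return walk


distances = {
    1: {2: 0.0, 3: math.log(3)},
    2: {1: 0.0, 3: 1.0},
    3: {1: math.log(3), 2: 1.0},
}
probs_from_1 = transition_probs(1, distances[1])
walk = random_walk(1, distances, length=5, rng=random.Random(0))
```

In this toy input node 2 is structurally identical to node 1 (distance 0), so it receives three times the probability of node 3.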

Using the Skip-Gram algorithm, a vector representation is learned for each node. The vector representations of nodes whose neighborhood topologies are close are closer in vector space.

(1) generate a one-hot vector for each node: for each node $v_i \in V$, generate the corresponding one-hot vector, in which the element at position i is 1 and the other elements are 0;

(2) let the dimension of the generated dense vector be d, and randomly initialize the matrices U and W,

$U_i$ represents the node vector of $v_i$ when it serves as the central node,

$W_i$ represents the node vector of $v_i$ when it serves as a background node;

(3) let $v_i$ denote the central node, $v_i \in V$; $context(v_i)$ denotes the set of background nodes of $v_i$, with $|context(v_i)| = wsize$, where wsize is the window length; the objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):

The low-dimensional dense vector generated by the Skip-Gram algorithm for $v_i$ is $W_i$, recorded as $vec_i$.
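Two mechanical pieces of the Skip-Gram setup can be illustrated without the objective function of formula (VIII), which is not reproduced above: how a one-hot vector selects a row of an embedding matrix, and how (central node, background node) pairs are read off a walk. The helper names are hypothetical, and splitting the window of length wsize evenly on both sides of the central node is an assumption.

```python
def one_hot(i, n):
    """One-hot vector of node v_i (1-based index) among n nodes."""
    return [1.0 if k == i - 1 else 0.0 for k in range(n)]


def lookup(x, M):
    """Row vector x (length n) times matrix M (n x d)."""
    d = len(M[0])
    return [sum(x[k] * M[k][j] for k in range(len(x))) for j in range(d)]


def context_pairs(walk, wsize):
    """(central, background) pairs, taking wsize // 2 nodes on each side."""
    half = wsize // 2
    pairs = []
    for pos, center in enumerate(walk):
        lo, hi = max(0, pos - half), min(len(walk), pos + half + 1)
        pairs.extend((center, walk[q]) for q in range(lo, hi) if q != pos)
    return pairs


U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 3 x 2 matrix, d = 2
row = lookup(one_hot(2, 3), U)            # selects U_2
pairs = context_pairs([1, 2, 3], wsize=2)
```

Multiplying by a one-hot vector is therefore just a row lookup, which is why practical implementations index the matrix directly instead of materializing the one-hot vectors.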

(3) The abnormal financial organization hierarchical division module first performs dimension reduction on the low-dimensional dense vector generated for each account using the PCA method, and then clusters the dimension-reduced low-dimensional dense vectors using the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization.

The PCA method is used to reduce the dimension of the low-dimensional dense vector generated for each account, where the low-dimensional dense vector of $v_i$ is $vec_i$. This comprises the following steps:

L. Construct the matrix X;

M. Calculate the covariance matrix Cov of X;

N. Calculate the eigenvalues and eigenvectors of Cov: $(\lambda A - Cov)x = 0$, where $\lambda$ is an eigenvalue of Cov, A is the identity matrix, and x is an eigenvector of Cov;

O. Set the dimension after reduction to d', arrange the eigenvectors as rows from top to bottom in descending order of their eigenvalues, and take the first d' rows to obtain the matrix P;

P. Let the data after dimension reduction be Y, $Y = PX$,

$Y_i^T$ is the low-dimensional dense vector of $v_i$ after dimension reduction.
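The PCA steps can be sketched with NumPy as follows. The layout of X (one column per node, so that $Y = PX$ has one column per node), the zero-mean centering of each row, and the 1/n normalization of the covariance are assumptions, since the formula for X is not reproduced above; `pca_reduce` is a hypothetical name.

```python
import numpy as np


def pca_reduce(vecs, d_prime):
    """Reduce n vectors of dimension d to dimension d_prime.

    vecs -- n x d array, row i is vec_i
    Returns an n x d_prime array whose row i is Y_i^T.
    """
    # Matrix X, d x n, one column per node (assumed layout).
    X = np.asarray(vecs, dtype=float).T
    # Zero-mean each row (standard PCA centering, assumed here).
    X = X - X.mean(axis=1, keepdims=True)
    # Covariance matrix Cov, d x d.
    cov = X @ X.T / X.shape[1]
    # Eigen-decomposition; eigh returns ascending eigenvalues and
    # eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the d_prime eigenvectors with the largest eigenvalues as rows of P.
    order = np.argsort(eigvals)[::-1][:d_prime]
    P = eigvecs[:, order].T
    # Y = PX, d_prime x n.
    Y = P @ X
    return Y.T


vecs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
reduced = pca_reduce(vecs, d_prime=1)
```

The four toy points lie on a line, so a single component retains all of their variance; the sign of the component is arbitrary.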

The k-means algorithm is used to cluster the dimension-reduced low-dimensional dense vectors to obtain the hierarchical division result of the abnormal financial organization. The dimension-reduced node vector corresponding to $v_i$ is $y_i$,

R. Calculate $dist(y_i, \mu_j)$ and assign $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; the calculation formula of $dist(y_i, \mu_j)$ is shown in formula (IX):
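The clustering loop can be sketched in plain Python. The names Classnum and Iternum follow claim 5; formula (IX) and step S are not reproduced above, so the Euclidean distance and the recomputation of each centroid as the mean of its members are assumptions (they are the usual k-means choices). The fixed `seed` is added only to make the sketch reproducible.

```python
import math
import random


def kmeans(S, classnum, iternum, seed=0):
    """Cluster the sample set S into classnum categories in iternum rounds."""
    rng = random.Random(seed)
    # Step Q: pick classnum samples as the initial centroid vectors.
    centroids = [list(y) for y in rng.sample(S, classnum)]
    labels = [0] * len(S)
    for _ in range(iternum):
        # Step R: assign each y_i to the nearest centroid
        # (Euclidean distance assumed for formula (IX)).
        labels = [
            min(range(classnum), key=lambda j: math.dist(y, centroids[j]))
            for y in S
        ]
        # Step S (assumed): move each centroid to the mean of its members.
        for j in range(classnum):
            members = [y for y, lab in zip(S, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


S = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centroids = kmeans(S, classnum=2, iternum=10)
```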

The method of the embodiment is adopted to process the transaction flow data set T of an abnormal financial organization input by a certain user.

The transaction flow data set used in this example contains 105 account nodes in 3 categories: 22 high-level accounts, 11 special-function accounts, and 72 bottom-level accounts, with 784 transfer records in total.

This example uses six indices to evaluate the effect of the present embodiment: CHI (Calinski-Harabasz Index), SC (Silhouette Coefficient), DBI (Davies-Bouldin Index), ARI (Adjusted Rand Index), AMI (Adjusted Mutual Information), and V-measure. The first three, CHI, SC, and DBI, are internal criteria, used to evaluate the quality of the clusters obtained after clustering; the last three, ARI, AMI, and V-measure, are external criteria, used to compare how well the clustering result matches the true distribution.

FIG. 4(a) is a schematic diagram of the cluster visualization result obtained by the NTF method provided in this embodiment. FIG. 4(b) is a schematic diagram of the cluster visualization result obtained by the existing BoostNE method; the existing BoostNE method is described in Li J, Wu L, Guo R, et al. Multi-level network embedding with boosted low-rank matrix approximation [C]// Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2019: 49-56. FIG. 4(c) is a schematic diagram of the cluster visualization result obtained by the existing Role2Vec method; the existing Role2Vec method is described in Ahmed N K, Rossi R, Lee J B, et al. FIG. 4(d) is a schematic diagram of the cluster visualization result obtained by the existing Struc2Vec method; the existing Struc2Vec method is described in Ribeiro L F R, Saverese P H P, Figueiredo D R. struc2vec: Learning node representations from structural identity [C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.

In FIGS. 4(a) to 4(d), the abscissa and ordinate are the coordinates on the two-dimensional plane obtained when the dense vector generated for each node by each algorithm (NTF, BoostNE, Role2Vec, and Struc2Vec) is reduced to two dimensions. Because the vectors are normalized before the clustering result is evaluated with the indices (CHI, SC, DBI, ARI, AMI, and V-measure), the absolute magnitude of a node's coordinates does not affect the clustering result.

As can be seen from FIGS. 4(a) to 4(d), the BoostNE and Role2Vec methods do not distinguish the three types of nodes well. Struc2Vec distinguishes category 3 from categories 1 and 2 reasonably well, but does not separate category 1 from category 2 well. The NTF method provided by the invention distinguishes all three types of nodes well and obtains a better clustering effect.

The three indices CHI, SC, and DBI are used to measure the clustering effect of NTF and of the three comparison methods BoostNE, Role2Vec, and Struc2Vec; the results are shown in Table 1.

TABLE 1

The dimension-reduced node vectors generated by the four methods are clustered with the k-means algorithm, and ARI, AMI, and V-measure are then used to measure the clustering effect; the results are shown in Table 2.

TABLE 2

As can be seen from Tables 1 and 2, compared with the three existing methods BoostNE, Role2Vec, and Struc2Vec, the NTF method provided by the invention obtains the best results on all 6 indices. This shows that the NTF method effectively extracts the neighborhood topology information of nodes, gathering nodes of the same type together while separating nodes of different types as far as possible, and so obtains a better clustering result. The NTF method can also better support a subsequent node classification task.

Claims

1. A work method of an abnormal financial organization hierarchy dividing system based on a neighborhood topological structure comprises a data preprocessing module, a neighborhood topological structure feature extraction module and an abnormal financial organization hierarchy dividing module which are connected in sequence;

the data preprocessing module is used for: sequentially carrying out duplicate removal, denoising and incomplete data removal operations on the input transaction flow data of the abnormal financial organization, extracting transaction account information and transaction counter-party account information in the transaction flow data of the abnormal financial organization, and constructing a transaction network of the abnormal financial organization; the abnormal financial organization transaction flow data comprises a transaction account number, an opponent account number, a transaction amount, transaction time, an abstract, a transaction balance and a transaction place;

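the neighborhood topology feature extraction module is used for: processing the data of the transaction network of the abnormal financial organization with a customized network representation learning method based on the neighborhood topological structure, extracting the neighborhood topological structure information of the account nodes of the transaction network, and generating a corresponding low-dimensional dense vector for each account;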

the abnormal financial organization hierarchy partitioning module is used for: firstly, carrying out dimensionality reduction on a low-dimensional dense vector generated by each account by using a PCA (principal component analysis) method, and then carrying out clustering operation on the low-dimensional dense vector subjected to dimensionality reduction by using a k-means algorithm to obtain a hierarchical division result of abnormal financial organization; the method is characterized by operating in a computer and comprising the following steps:

(3) The abnormal financial organization hierarchical division module firstly performs dimension reduction processing on the low-dimensional dense vector generated by each account by using a PCA method, and then performs clustering operation on the low-dimensional dense vector subjected to the dimension reduction processing by using a k-means algorithm to obtain a hierarchical division result of the abnormal financial organization;

in the step (1), the transaction account information and the transaction counter-party account information in the transaction flow data of the abnormal financial organization are extracted, and a transaction network of the abnormal financial organization is constructed, wherein the method comprises the following steps:

E. build the point set of the transaction network, $V = \{v_1, v_2, \dots, v_i, \dots, v_n\}$, according to $f(c_i) = v_i$, which maps each account $c_i$ to a positive integer in the range $[1, n]$: $v_i = i$, $i = 1, 2, \dots, n$;

F. construct the edge set E of the transaction network from the point set V, the account set C, and the transaction flow data, where each edge is $e_{i,j} = (v_i, v_j)$;

G. finally, construct the transaction network: $G = \{V, E\}$;

in the step (2), the method comprises the following steps:

the residual energy from point u to point v is $Q_u(v)$;

the maximum sampling depth $d^*$ is the largest distance from a sampled node to the central node when sampling expands layer by layer, by distance, from that central node;

is a set of edges that are to be considered,

H. generating an influence sub-network for each node with the NTF algorithm; a node is taken from the transaction network G as the central node, the NTF model selects from G the neighbor nodes that are of greater importance to the neighborhood structure of the central node, and these neighbor nodes together with their edges form the influence subgraph, i.e. the influence sub-network, of the central node; specifically: let the central node be u with initial energy $Q_u(u) = 1$; starting from u, energy propagates along the edges until it is at the lowest energy level or the remaining energy is insufficient to reach the next node; let the path be $P = \{u, v_1, \dots, v_i, v_j\}$, where the distance from $v_j$ to the central node u is k; starting from u, the energy propagates along the path via $v_i$ to $v_j$, and the residual energy $Q_u(v_j)$ on arrival is shown in formula (I):

representing the degree of node $v_j$;

I. sampling the neighborhood topological structure features on the basis of the influence sub-network: for a given sampling depth $d^*$, starting from the central node in the influence sub-network, adopting a breadth-first search strategy, and expanding the sampling layer by layer according to distance; each node reached during sampling is represented by its energy level and its degree, where the energy level of the node is a relative feature and the degree is an absolute feature;

$$DE_i(u) = \{(d(v),\, el_u(v)) \mid v \in R_i(u)\} \tag{II}$$

in formula (II), $d(v)$ represents the degree of node v, and $el_u(v)$ represents the energy level of node v relative to node u;

the feature is the neighborhood structure of the central node;

J. constructing a secondary graph according to the neighborhood structure characteristics of the nodes; defining distance (u, v) to represent the neighborhood topological structure distance of the node u, v, wherein the distance is calculated according to the neighborhood topological structure feature set DE (u) and DE (v) of the node u, v, and the calculation method is shown as a formula (III):

$dist(\cdot)$ is a function that calculates the distance between two sequences;

the k-th hop neighborhood distance exists if and only if nodes u and v both have neighbors at distance k, where $w_i$ represents the proportion that the i-th hop neighborhood distance contributes to the total distance and is calculated as shown in formula (IV):

according to the DTW algorithm, a difference calculation definition formula of any two elements is given, as shown in formula (V):

on this basis, the quadratic graph is constructed; in the quadratic graph, the weight $s_{u,v}$ of the edge (u, v) between two points u, v reflects their neighborhood structure similarity, and $s_{u,v}$ is calculated as shown in formula (VI):

$$s_{u,v} = e^{-distance(u,v)} \tag{VI}$$

finally, the neighborhood structure distance of the two nodes is mapped to a decimal in the interval (0, 1);

$Z(u)$ is a normalization factor;

using the Skip-Gram algorithm, a vector representation is learned for each node.

2. The method for operating an abnormal financial organization hierarchy-dividing system according to claim 1, wherein in the step (1), the data cleansing includes:

A. incomplete data removal: removing non-standard-format data, where non-standard-format data means records in which the number of digits of the transaction account number or of the counterparty account number is inconsistent with the number of digits of a standard bank card account number;

B. denoising: removing the records whose transaction amount is less than 50 yuan;

C. deduplication: removing redundant data, where redundant data means that the same transaction in the abnormal financial organization transaction flow data is recorded twice in different ways.

3. The method of claim 2, wherein learning a vector representation for each node using Skip-Gram algorithm comprises the steps of:

(1) generating a one-hot vector for each node: for each node $v_i \in V$, generating the corresponding one-hot vector, in which the element at position i is 1 and the other elements are 0;

(2) setting the dimension of the generated dense vector to d, and randomly initializing the matrices,

$W_i$ represents the node vector of $v_i$ when it serves as a background node;

(3) $v_i$ denotes the central node, $v_i \in V$; $context(v_i)$ denotes the set of background nodes of $v_i$, with $|context(v_i)| = wsize$, where wsize is the window length; the objective function $f(\text{Skip-Gram})$ of Skip-Gram is shown in formula (VIII):

4. The method as claimed in claim 3, wherein in step (3), the PCA method is used to perform dimension reduction on the low-dimensional dense vector generated for each account, the low-dimensional dense vector of $v_i$ being $vec_i$, comprising the following steps:

L. constructing a matrix X,

M. calculating the covariance matrix Cov of X,

O. setting the dimension after reduction to d', arranging the eigenvectors as rows from top to bottom in descending order of their eigenvalues, and taking the first d' rows to obtain the matrix P,

P. the data after dimension reduction is Y, $Y = PX$,

$Y_i^T$ is the low-dimensional dense vector of $v_i$ after dimension reduction.

5. The method as claimed in claim 4, wherein in step (3), the dimension-reduced low-dimensional dense vectors are clustered using the k-means algorithm to obtain the hierarchical division result of the abnormal financial organization; the dimension-reduced node vector corresponding to $v_i$ is $y_i$, $y_i = Y_i^T$; the input sample set of the k-means algorithm is $S = \{y_i \mid i \in [1, n]\}$; Classnum is set as the number of categories and Iternum as the number of iterations; the method comprises the following steps:

Q. randomly selecting Classnum node vectors from S as the initial centroid vectors of the categories, the centroid vectors being $\{\mu_1, \mu_2, \dots, \mu_{Classnum}\}$, corresponding to the categories $\{class_1, class_2, \dots, class_{Classnum}\}$;

R. calculating $dist(y_i, \mu_j)$ and assigning $y_i$ to the category $class_j$ of the centroid $\mu_j$ nearest to $y_i$; the calculation formula of $dist(y_i, \mu_j)$ is shown in formula (IX):

T. executing step R and step S in sequence, stopping after Iternum executions, and outputting the classification result.