Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a group partner detection method based on a financial transaction network. The method uses basic fields such as the transaction account number, the counterparty account number and the transaction time in the original financial transaction flow data, adaptively extracts time sequence features and spatial structure features through a sequence model and a GAE model, and finally calculates the distance between every two nodes from the concatenated features as the network weight, so that each user can be assigned to a potential group. The method reduces manual feature extraction work, automatically determines the number of groups, and effectively improves the accuracy and interpretability of existing methods.
The invention also provides an implementation device of the group partner detection method based on the financial transaction network.
Interpretation of terms:
1. Skip-gram model: a neural network model for training word vectors.
2. GAE model: the graph auto-encoder model, a neural network model that learns an efficient representation of input graph data through unsupervised learning.
3. High-frequency word sampling: in the process of training word vectors, high-frequency words are deleted with a certain probability in order to overcome their influence.
4. Negative sampling: a method for increasing the training speed of a neural network; instead of updating all parameters, only a small number of neuron parameters are updated.
5. GCN: graph convolutional network, a neural network that performs convolution on graph-structured data.
The technical scheme of the invention is as follows:
a method of group detection based on a financial transaction network, comprising:
(1) data preprocessing: performing data cleaning on the transaction data, extracting a transaction sequence of each user and constructing graph data;
(2) generating a user feature vector: acquiring a user time sequence feature vector by using a sequence model, and acquiring a user space feature vector by using a GAE model; normalizing the time sequence feature vector and the space feature vector respectively, and concatenating them to generate a node representation vector R_i, where d'_1, ..., d'_m denote the components of the normalized time sequence feature vector and the remaining components come from the normalized space feature vector;
(3) group detection: and calculating the group to which each node belongs and outputting a group mark of the node.
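The normalization and concatenation of step (2) can be sketched as follows; the use of the L2 norm and the helper name are assumptions, since the text does not fix the normalization:

```python
import numpy as np

def node_representation(d, s):
    """Normalize the time sequence vector d and the space vector s
    separately (L2 norm is an assumption), then concatenate them
    into one node representation vector."""
    d = np.asarray(d, dtype=float)
    s = np.asarray(s, dtype=float)
    return np.concatenate([d / np.linalg.norm(d), s / np.linalg.norm(s)])
```

Each half of the resulting vector has unit length, so neither feature source dominates the later distance computation.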
According to the invention, in the step (1), the transaction data includes a user, a transaction counter account, a transaction time and a transaction amount, and the specific step of performing data cleansing on the transaction data includes:
1-1, missing value filling: if any field of a user, a transaction counter account and transaction time of certain transaction data is missing, discarding the transaction data;
if only the transaction amount of certain transaction data is missing, mean imputation is adopted: the average of all transaction amounts of the current user is calculated and used to fill in the missing amount;
1-2, data inconsistency processing: when dates are expressed in different forms, the datetime library of Python is used for formatting, and all time formats are unified into the year-month-day form; for example, the transaction times "2019/01/07" and "07/01/2019" are both formatted to "20190107" using the Python datetime library;
1-3, feature coding: the account numbers of users and transaction counterparties, which have more than 15 digits, are mapped to label encodings (label-encoding) to finally obtain a transaction sample set; for example, 100 account numbers are mapped to the numbers 0-99; the transaction account numbers include bank card numbers and account numbers; the transaction sample set comprises all transaction data preprocessed through steps 1-1 to 1-3.
Many fields in the original transaction data contain missing or abnormal values; the main purpose of data cleaning is to turn such dirty data into primarily usable input data.
Preferably, in step (1), the specific steps of extracting the transaction sequence of each user and constructing graph data include:
a. generating a transaction sequence for each user in chronological order: the user set U = {u_i | i = 1, ..., n} is obtained by using the unique function of the Pandas library, where n is the total number of users and u_i denotes the i-th transaction user; m is the total number of transaction counterparty account numbers, the j-th counterparty account number being indexed by j = 1, ..., m;
for user u_i, all counterparty account numbers are obtained from the transaction sample set and sorted in ascending order of transaction time, and the sorted account numbers are recombined into the transaction sequence L_i;
with user u_i as key and transaction sequence L_i as value, a key-value pair set S = {s_i | i = 1, ..., n} is built, where s_i = (u_i, L_i); the keys of S are users and the values are transaction sequences, so that the transaction sequence of a user can be looked up;
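Step a can be sketched with a Pandas groupby; the column names are assumptions carried over from the cleaning sketch:

```python
import pandas as pd

def build_sequence_set(df):
    """Build the key-value pair set S: each user u_i maps to its
    transaction sequence L_i, i.e. the counterparty account numbers
    sorted by ascending transaction time."""
    ordered = df.sort_values("time")
    return ordered.groupby("user")["counterparty"].apply(list).to_dict()
```

Because the unified "YYYYMMDD" strings sort lexicographically, no extra date parsing is needed here.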
b. constructing graph data: the graph data comprises an adjacency matrix A and a feature matrix X of the graph nodes;
firstly, in the transaction sample set, the user u_i and the counterparty account number of the same transaction record are extracted to form a pair, i = 1, ..., n, j = 1, ..., m;
then all pairs are de-duplicated, and the de-duplicated pair set is taken as the edge set E of the graph G, E = {e_i | i = 1, ..., m}; the whole user set U is taken as the node set V, V = {v_i | i = 1, ..., n}; using the edge set E and the node set V, the adjacency matrix A ∈ R^(n×n) is generated through the networkx library; the adjacency matrix encodes whether graph nodes are connected and thus represents the topological structure of the graph; users and transaction counterparty accounts serve as nodes, and an edge is added whenever a transaction exists between two nodes;
the feature matrix X of the graph nodes is the degree matrix D of the nodes; D is a diagonal matrix whose diagonal elements are the degrees of the nodes, the degree d_i of node v_i being the number of edges incident to v_i, D = diag(d_1, ..., d_n); the obtained adjacency matrix A and feature matrix X of the graph nodes serve as training data of the GAE model.
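The graph construction of step b can be sketched as follows; the text uses the networkx library for the same purpose, but the adjacency and degree matrices are built here with plain NumPy so the sketch is self-contained:

```python
import numpy as np

def build_graph(pairs):
    """pairs: iterable of (user, counterparty) from the transaction
    sample set. De-duplicates the pairs into the edge set E, takes all
    accounts as the node set V, and returns the adjacency matrix A and
    the feature matrix X = D (the diagonal degree matrix)."""
    edges = sorted(set(pairs))                     # de-duplication -> E
    nodes = sorted({v for e in edges for v in e})  # node set V
    idx = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0  # undirected edge
    X = np.diag(A.sum(axis=1))  # degree matrix: d_i = number of incident edges
    return A, X, nodes
```
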
Preferably, in step (2), the sequence model is used to obtain the user time sequence feature vector: each transaction sequence L_i is treated as a sentence, and the parameters of each layer are optimized by maximizing the probability that the context nodes appear given the central node; the specific steps are as follows:
2-1, preparing training data: firstly, the node list is vectorized by using the OneHotEncoder of the sklearn library to obtain high-dimensional one-hot node vectors, the dimensionality of a one-hot vector being equal to the number of words;
then a window and a skip step size are set to generate training data from the transaction sequence L_i = {L_i^(1), ..., L_i^(k)}, where L_i is the transaction sequence of user u_i and the superscripts 1 to k index its k counterparty account numbers; with a certain node taken as the central node, a training set of (input, output) pairs is constructed, where input is the central node and output is a context node;
specifically, assume that both the window and the step size are 2; taking L_i^(2) as the central node, up to two nodes on each side are selected as window nodes, and the (input, output) pairs (L_i^(2), L_i^(1)), (L_i^(2), L_i^(3)), (L_i^(2), L_i^(4)) are obtained as three sets of training data;
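The (input, output) construction of step 2-1 can be sketched as follows; with window 2 and centre L_i^(2) it yields exactly the three pairs listed above:

```python
def training_pairs(seq, window=2):
    """Generate (centre, context) pairs: for each centre node, every
    node at most `window` positions away becomes one output."""
    pairs = []
    for c in range(len(seq)):
        for o in range(max(0, c - window), min(len(seq), c + window + 1)):
            if o != c:
                pairs.append((seq[c], seq[o]))
    return pairs
```
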
2-2, constructing a Skip-gram model to obtain node vectors: the Skip-gram model comprises an input layer, a hidden layer and an output layer connected in sequence;
the input layer takes the one-hot node vector; the dimensionality of the hidden layer, i.e. the number of hidden-layer neurons, is set according to user requirements; the output layer is a softmax classifier that outputs the probability of each node;
the cross-entropy loss function is calculated and the model weight parameters are updated by gradient descent; finally the weight matrix from the input layer to the hidden layer is taken as the time sequence feature of the nodes, R_sequence = {d'_1, ..., d'_m};
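A minimal dense Skip-gram in NumPy along the lines of step 2-2 (one-hot input picks a row of the input-to-hidden matrix, softmax output, cross-entropy, gradient descent); the dimension, learning rate and epoch count are assumptions, and a real implementation would add the sampling techniques described next:

```python
import numpy as np

def train_skipgram(pairs, vocab, dim=8, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    idx = {w: i for i, w in enumerate(vocab)}
    W1 = rng.normal(scale=0.1, size=(len(vocab), dim))  # input -> hidden
    W2 = rng.normal(scale=0.1, size=(dim, len(vocab)))  # hidden -> output
    for _ in range(epochs):
        for centre, context in pairs:
            c, o = idx[centre], idx[context]
            h = W1[c]                    # one-hot input selects row c
            scores = h @ W2
            p = np.exp(scores - scores.max())
            p /= p.sum()                 # softmax over all nodes
            grad = p.copy()
            grad[o] -= 1.0               # d(cross-entropy)/d(scores)
            g1 = W2 @ grad               # gradient w.r.t. hidden row (old W2)
            W2 -= lr * np.outer(h, grad)
            W1[c] -= lr * g1
    return W1                            # rows = node time sequence features
```
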
Preferably, in the process of generating the training set in step 2-1, a high-frequency word sampling technique is used to sample vector sequence pairs (input, output) in the training samples, so as to reduce the number of the training samples and solve the problem of overlarge scale of the weight matrix and the training samples;
and by adopting a negative sampling technology, only the weight of each part of the model is updated when each sample is trained, so that the calculation load is reduced.
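One common form of the high-frequency word sampling mentioned above is the word2vec subsampling heuristic; the formula and the threshold t are assumptions, since the text does not specify them:

```python
import math

def keep_probability(freq, t=1e-3):
    """freq: relative frequency of the token in the corpus.
    A token is kept with probability sqrt(t/freq), capped at 1, so
    high-frequency nodes are deleted with high probability."""
    return min(1.0, math.sqrt(t / freq))
```

Pairs whose input or output token is dropped are simply removed from the training set, shrinking it as described.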
Preferably, in step (2), the user space feature vector is obtained by using a GAE model, where the GAE model comprises an encoder and a decoder; the encoder comprises two GCN layers, and the decoder calculates the probability that an edge exists between any two nodes and then generates edges to form a reconstructed graph; the specific steps are as follows:
a. inputting an adjacency matrix A and a feature matrix X of a graph node at an input layer of the GAE model;
b. the two GCN layers of the encoder perform feature extraction on the adjacency matrix A and the feature matrix X of the graph nodes to obtain the node embedding vector Z; it is assumed that each input sample adjacency matrix A obeys a Gaussian distribution; through the two GCN layers the mean and variance, i.e. the distribution function of the Gaussian distribution, are determined, and the reconstructed adjacency matrix Â is obtained from this distribution function;
the node embedding vector Z satisfies:
Z = GCN(X, A)   (I),
in formula (I), GCN denotes a graph convolutional neural network model, X is the feature matrix of the graph nodes, and A is the adjacency matrix;
c. the node embedding vector Z is input into the decoder, which generates the connection probability of the edges and reconstructs the graph; finally the output layer outputs the reconstructed adjacency matrix Â, calculated as:
Â = σ(Z Z^T)   (II),
in formula (II), the superscript T denotes transposition, σ(·) denotes the sigmoid function, i.e. the output activation function of a neuron and a common symbol in neural networks, and Â denotes the reconstructed adjacency matrix;
a loss function L is adopted to measure the difference between the reconstructed graph and the original graph, and the reconstructed graph is made closest to the original graph by minimizing L;
in summary, the adjacency matrix A of the graph and the feature matrix X of the nodes are input; the encoder with a two-layer GCN structure extracts features from A and X; the decoder calculates the probability that an edge exists between any two nodes to generate the graph; the loss function L measures the difference between the input graph and the graph generated by the GAE; by optimizing W_0 and W_1 the loss function L is minimized so that the reconstructed graph is closest to the original graph, yielding the node embedding vector matrix Z, which carries the spatial features of the graph; Z is a matrix of n rows, each row vector corresponding to a node;
Further preferably, in step b, the two GCN layers are defined as follows:
Z = GCN(X, A) = Ã ReLU(Ã X W_0) W_1, with Ã = D^(-1/2) A D^(-1/2)   (III),
in formula (III), ReLU(·) denotes the linear rectification function, D denotes the degree matrix, the superscript -1/2 denotes exponentiation, W_0 denotes the first weight matrix, and W_1 denotes the second weight matrix;
Further preferably, in step c, the decoder reconstructs the graph, i.e. the adjacency matrix, by calculating the probability between nodes:
Â_ij = Sigmoid(z_i^T z_j)   (IV),
in formula (IV), Sigmoid(·) is the activation function that maps a variable to between 0 and 1; z_i and z_j are the i-th and j-th rows of the node embedding vector matrix Z; Â_ij represents the probability, reconstructed from the known node embedding vector matrix Z, that a connection exists between any two nodes i and j; if the probability exceeds a threshold, the corresponding element of the adjacency matrix Â is set to 1, representing that the two nodes are connected; Â is a decomposed representation of the matrix A;
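Formulas (I)-(IV) can be sketched as a single NumPy forward pass; adding self-loops before normalization is the usual GCN convention and an assumption here:

```python
import numpy as np

def normalize_adj(A):
    """A_tilde = D^{-1/2} (A + I) D^{-1/2}; self-loops keep degrees positive."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def gae_forward(A, X, W0, W1):
    """Encoder (III): two GCN layers with a ReLU in between give Z;
    decoder (II)/(IV): sigmoid(Z Z^T) gives the reconstructed adjacency."""
    A_t = normalize_adj(A)
    Z = A_t @ np.maximum(A_t @ X @ W0, 0.0) @ W1
    A_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))
    return Z, A_rec
```
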
the loss function measures the distance between the graph reconstructed by the encoder-decoder structure and the original graph:
L = E_q(Z|X,A)[log p(A|Z)]   (V),
in formula (V), L denotes the loss function and E_q(·) denotes the expectation over the distribution q;
the GAE is trained by stochastic gradient descent, and training finishes when the loss function converges, finally yielding the low-dimensional node embedding vector matrix Z of the nodes;
by optimizing W_0 and W_1 to minimize the loss function L, the reconstructed graph is made closest to the original graph and the low-dimensional node embedding vector matrix Z is obtained, which carries the spatial features of the graph;
minimizing L with respect to W requires the gradient of L with respect to W, after which L is minimized by gradient descent.
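The Bernoulli log-likelihood of formula (V) can be written out as a binary cross-entropy over all node pairs; the eps guard is an implementation assumption:

```python
import numpy as np

def reconstruction_loss(A, A_rec, eps=1e-9):
    """Negative E[log p(A|Z)]: each entry of A is treated as a Bernoulli
    variable with probability A_rec, so minimizing this loss maximizes
    the log-likelihood of formula (V)."""
    return float(-np.mean(A * np.log(A_rec + eps)
                          + (1 - A) * np.log(1 - A_rec + eps)))
```
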
In the data preprocessing stage, an adjacency matrix A of a transaction graph and a feature matrix X of a node are generated, the feature matrix X contains degree information of the node, and the module encodes a space representation vector of the node through a GAE model, namely the space representation vector contains the feature of the node and the feature of a neighbor node.
Preferably, in step (3), the distance between every two nodes is calculated and used as the weight of the edge between them to obtain the group of each node, and the group mark of each node is output; the specific steps are as follows:
3-1, firstly, the distance between any two nodes in the graph data structure is calculated from the node representation vectors R_i generated in step (2) and taken as the edge weight; the larger the distance, the farther apart the two nodes are; then each node in the graph data structure is assigned to its own group, the nodes of the network are traversed repeatedly, the change of modularity caused by moving a node into a neighboring group is compared, and the node is moved to the group that increases the modularity the most;
the modularity Q is defined as:
Q = (1/(2m)) Σ_{i,j} [W_ij - k_i k_j/(2m)] δ(c_i, c_j)   (VI),
in formula (VI), Q denotes the modularity, m is the sum of the weights of all edges, W_ij denotes the weight between node i and node j, k_i denotes the sum of the weights of the edges connected to node i, k_j denotes the sum of the weights of the edges connected to node j, c_i is the group to which node i belongs, c_j is the group to which node j belongs, and δ(c_i, c_j) is an indicator function that equals 1 if c_i and c_j are the same group and 0 otherwise;
3-2, merging all nodes belonging to the same group into a new node to construct a hypergraph;
3-3, repeating step 3-1 and step 3-2 to obtain the final grouping and generating the group mark (u_i, c_i), where c_i is the group to which node i belongs.
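Formula (VI) can be checked directly; for two disconnected two-node groups, the modularity of the natural partition works out to 0.5:

```python
import numpy as np

def modularity(W, c):
    """Modularity Q of formula (VI): W is the weighted adjacency matrix,
    c[i] the group label of node i."""
    m = W.sum() / 2.0                # sum of the weights of all edges
    k = W.sum(axis=1)                # weighted degree of each node
    Q = 0.0
    for i in range(len(W)):
        for j in range(len(W)):
            if c[i] == c[j]:         # indicator delta(c_i, c_j)
                Q += W[i, j] - k[i] * k[j] / (2.0 * m)
    return Q / (2.0 * m)
```

The greedy moves of step 3-1 simply pick, for each node, the neighboring group whose adoption raises this Q the most.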
The implementation device of the group partner detection method based on the financial transaction network comprises:
the data preprocessing module is used for carrying out data cleaning on transaction data, extracting a transaction sequence of each user and constructing graph data, and is used for executing the step (1);
the user characteristic vector generation module is used for acquiring a user time sequence characteristic vector by using a sequence model, acquiring a user space characteristic vector by using a GAE model, normalizing the user time sequence characteristic vector and the space characteristic vector respectively and connecting the user time sequence characteristic vector and the space characteristic vector for executing the step (2);
and the group detection module is used for calculating the group of each node and outputting the group mark of the node for executing the step (3).
The invention has the beneficial effects that:
1. The invention mainly provides a group detection method based on the combination of the time sequence features and the spatial structure features of nodes. The method uses basic fields such as the user, the counterparty account number and the transaction time in the original financial transaction flow data, adaptively extracts time sequence features and spatial structure features through a sequence model and a GAE model, finally calculates the distance between every two nodes from the concatenated features as the weight, and can assign each user to a potential group through a group detection algorithm based on modularity optimization.
2. The invention mainly aims to provide an auxiliary decision-making system for case-handling personnel; features are automatically extracted based on the Skip-gram model and the GAE model, which greatly saves manpower, and the generated group marks can also be used for tracking potential suspects.
3. The group partner detection method based on the financial transaction network provided by the invention runs fully automatically: anyone can obtain the final desired result by inputting raw data with only a few fields, which improves working efficiency and saves a large amount of time. As the amount of input transaction data grows, the amount of automatically constructed training data increases, further improving the accuracy of the model.
Example 1
A group detection method based on financial transaction network, as shown in fig. 1 and 4, comprising:
(1) data preprocessing: performing data cleaning on the transaction data, extracting a transaction sequence of each user and constructing graph data;
in the step (1), the transaction data includes a user, a transaction counter account, transaction time and transaction amount, and the specific steps of performing data cleaning on the transaction data include:
1-1, missing value filling: if any field of a user, a transaction counter account and transaction time of certain transaction data is missing, discarding the transaction data;
if only the transaction amount of certain transaction data is missing, mean imputation is adopted: the average of all transaction amounts of the current user is calculated and used to fill in the missing amount;
1-2, data inconsistency processing: when dates are expressed in different forms, the datetime library of Python is used for formatting, and all time formats are unified into the year-month-day form; for example, the transaction times "2019/01/07" and "07/01/2019" are both formatted to "20190107" using the Python datetime library;
1-3, feature coding: the account numbers of users and transaction counterparties, which have more than 15 digits, are mapped to label encodings (label-encoding) to finally obtain a transaction sample set; for example, 100 account numbers are mapped to the numbers 0-99; the transaction account numbers include bank card numbers and account numbers; the transaction sample set comprises all transaction data preprocessed through steps 1-1 to 1-3.
Many fields in the original transaction data contain missing or abnormal values; the main purpose of data cleaning is to turn such dirty data into primarily usable input data.
In the step (1), the specific steps of extracting the transaction sequence of each user and constructing graph data include:
a. generating a transaction sequence for each user in chronological order: the user set U = {u_i | i = 1, ..., n} is obtained by using the unique function of the Pandas library, where n is the total number of users and u_i denotes the i-th transaction user; m is the total number of transaction counterparty account numbers, the j-th counterparty account number being indexed by j = 1, ..., m;
for user u_i, all counterparty account numbers are obtained from the transaction sample set and sorted in ascending order of transaction time, and the sorted account numbers are recombined into the transaction sequence L_i;
with user u_i as key and transaction sequence L_i as value, a key-value pair set S = {s_i | i = 1, ..., n} is built, where s_i = (u_i, L_i); the keys of S are users and the values are transaction sequences, so that the transaction sequence of a user can be looked up;
b. constructing graph data: the graph data comprises an adjacency matrix A and a feature matrix X of the graph nodes;
firstly, in the transaction sample set, the user u_i and the counterparty account number of the same transaction record are extracted to form a pair, i = 1, ..., n, j = 1, ..., m;
then all pairs are de-duplicated, and the de-duplicated pair set is taken as the edge set E of the graph G, E = {e_i | i = 1, ..., m}; the whole user set U is taken as the node set V, V = {v_i | i = 1, ..., n}; using the edge set E and the node set V, the adjacency matrix A ∈ R^(n×n) is generated through the networkx library; the adjacency matrix encodes whether graph nodes are connected and thus represents the topological structure of the graph; users and transaction counterparty accounts serve as nodes, and an edge is added whenever a transaction exists between two nodes;
the feature matrix X of the graph nodes is the degree matrix D of the nodes; D is a diagonal matrix whose diagonal elements are the degrees of the nodes, the degree d_i of node v_i being the number of edges incident to v_i, D = diag(d_1, ..., d_n).
The obtained adjacency matrix A and feature matrix X of the graph nodes serve as training data of the GAE model.
(2) Generating a user feature vector: acquiring a user time sequence feature vector by using a sequence model, and acquiring a user space feature vector by using a GAE model; normalizing the time sequence feature vector and the space feature vector respectively, and concatenating them to generate a node representation vector R_i, where d'_1, ..., d'_m denote the components of the normalized time sequence feature vector and the remaining components come from the normalized space feature vector;
In step (2), the sequence model is used to obtain the user time sequence feature vector: each transaction sequence L_i is treated as a sentence, and the parameters of each layer are optimized by maximizing the probability that the context nodes appear given the central node; the specific steps are as follows:
2-1, preparing training data: firstly, the node list is vectorized by using the OneHotEncoder of the sklearn library to obtain high-dimensional one-hot node vectors, the dimensionality of a one-hot vector being equal to the number of words;
then a window and a skip step size are set to generate training data from the transaction sequence L_i = {L_i^(1), ..., L_i^(k)}, where L_i is the transaction sequence of user u_i and the superscripts 1 to k index its k counterparty account numbers; with a certain node taken as the central node, a training set of (input, output) pairs is constructed, where input is the central node and output is a context node;
specifically, assume that both the window and the step size are 2; taking L_i^(2) as the central node, up to two nodes on each side are selected as window nodes, and the (input, output) pairs (L_i^(2), L_i^(1)), (L_i^(2), L_i^(3)), (L_i^(2), L_i^(4)) are obtained as three sets of training data;
in the process of generating the training set in the step 2-1, a high-frequency word sampling technology is used for sampling vector sequence pairs (input, output) in the training samples so as to reduce the number of the training samples and solve the problem that the weight matrix and the training samples are overlarge in scale;
and by adopting a negative sampling technology, only the weight of each part of the model is updated when each sample is trained, so that the calculation load is reduced.
2-2, constructing a Skip-gram model to obtain node vectors: the Skip-gram model comprises an input layer, a hidden layer and an output layer connected in sequence;
the input layer takes the one-hot node vector; the dimensionality of the hidden layer, i.e. the number of hidden-layer neurons, is set according to user requirements; the output layer is a softmax classifier that outputs the probability of each node;
the cross-entropy loss function is calculated and the model weight parameters are updated by gradient descent; finally the weight matrix from the input layer to the hidden layer is taken as the time sequence feature of the nodes, R_sequence = {d'_1, ..., d'_m};
In step (2), a GAE model is used to obtain the user space feature vector; the GAE model comprises an encoder and a decoder; the encoder comprises two GCN layers, and the decoder calculates the probability that an edge exists between any two nodes and then generates edges to form a reconstructed graph; the specific steps are as follows:
a. inputting an adjacency matrix A and a feature matrix X of a graph node at an input layer of the GAE model;
b. the two GCN layers of the encoder perform feature extraction on the adjacency matrix A and the feature matrix X of the graph nodes to obtain the node embedding vector Z; it is assumed that each input sample adjacency matrix A obeys a Gaussian distribution; through the two GCN layers the mean and variance, i.e. the distribution function of the Gaussian distribution, are determined, and the reconstructed adjacency matrix Â is obtained from this distribution function;
the node embedding vector Z satisfies:
Z = GCN(X, A)   (I),
in formula (I), GCN denotes a graph convolutional neural network model, X is the feature matrix of the graph nodes, and A is the adjacency matrix;
c. the node embedding vector Z is input into the decoder, which generates the connection probability of the edges and reconstructs the graph; finally the output layer outputs the reconstructed adjacency matrix Â, calculated as:
Â = σ(Z Z^T)   (II),
in formula (II), the superscript T denotes transposition, σ(·) denotes the sigmoid function, i.e. the output activation function of a neuron and a common symbol in neural networks, and Â denotes the reconstructed adjacency matrix;
a loss function L is adopted to measure the difference between the reconstructed graph and the original graph, and the reconstructed graph is made closest to the original graph by minimizing L;
in summary, the adjacency matrix A of the graph and the feature matrix X of the nodes are input; the encoder with a two-layer GCN structure extracts features from A and X; the decoder calculates the probability that an edge exists between any two nodes to generate the graph; the loss function L measures the difference between the input graph and the graph generated by the GAE; by optimizing W_0 and W_1 the loss function L is minimized so that the reconstructed graph is closest to the original graph, yielding the node embedding vector matrix Z, which carries the spatial features of the graph; Z is a matrix of n rows, each row vector corresponding to a node;
Further, in step b, the two GCN layers are defined as follows:
Z = GCN(X, A) = Ã ReLU(Ã X W_0) W_1, with Ã = D^(-1/2) A D^(-1/2)   (III),
in formula (III), ReLU(·) denotes the linear rectification function, D denotes the degree matrix, the superscript -1/2 denotes exponentiation, W_0 denotes the first weight matrix, and W_1 denotes the second weight matrix;
Further, in step c, the decoder reconstructs the graph, i.e. the adjacency matrix, by calculating the probability between nodes:
Â_ij = Sigmoid(z_i^T z_j)   (IV),
in formula (IV), Sigmoid(·) is the activation function that maps a variable to between 0 and 1; z_i and z_j are the i-th and j-th rows of the node embedding vector matrix Z; Â_ij represents the probability, reconstructed from the known node embedding vector matrix Z, that a connection exists between any two nodes i and j; if the probability exceeds a threshold, the corresponding element of the adjacency matrix Â is set to 1, representing that the two nodes are connected; Â is a decomposed representation of the matrix A;
the loss function measures the distance between the graph reconstructed by the encoder-decoder structure and the original graph:
L = E_q(Z|X,A)[log p(A|Z)]   (V),
in formula (V), L denotes the loss function and E_q(·) denotes the expectation over the distribution q;
the GAE is trained by stochastic gradient descent, and training finishes when the loss function converges, finally yielding the low-dimensional node embedding vector matrix Z of the nodes;
by optimizing W_0 and W_1 to minimize the loss function L, the reconstructed graph is made closest to the original graph and the low-dimensional node embedding vector matrix Z is obtained, which carries the spatial features of the graph;
minimizing L with respect to W requires the gradient of L with respect to W, after which L is minimized by gradient descent.
In the data preprocessing stage, an adjacency matrix A of a transaction graph and a feature matrix X of a node are generated, the feature matrix X contains degree information of the node, and the module encodes a space representation vector of the node through a GAE model, namely the space representation vector contains the feature of the node and the feature of a neighbor node.
(3) Group detection: and calculating the group to which each node belongs and outputting a group mark of the node.
The step can be applied to the existing algorithms such as K-means, KNN and the like based on clustering and community detection algorithms of characteristic space distance;
In this example, in step (3), the Euclidean distance between every two nodes is calculated and used as the weight of the edge between them to obtain the group to which each node belongs; the specific steps are as follows:
3-1, firstly, the distance between any two nodes in the graph data structure is calculated from the node representation vectors R_i generated in step (2) and taken as the edge weight; the larger the distance, the farther apart the two nodes are; then each node in the graph data structure is assigned to its own group, the nodes of the network are traversed repeatedly, the change of modularity caused by moving a node into a neighboring group is compared, and the node is moved to the group that increases the modularity the most;
the modularity Q is defined as:
Q = (1/(2m)) Σ_{i,j} [W_ij - k_i k_j/(2m)] δ(c_i, c_j)   (VI),
in formula (VI), Q denotes the modularity, m is the sum of the weights of all edges, W_ij denotes the weight between node i and node j, k_i denotes the sum of the weights of the edges connected to node i, k_j denotes the sum of the weights of the edges connected to node j, c_i is the group to which node i belongs, c_j is the group to which node j belongs, and δ(c_i, c_j) is an indicator function that equals 1 if c_i and c_j are the same group and 0 otherwise;
3-2, merging all nodes belonging to the same group into a new node to construct a hypergraph;
3-3, repeating step 3-1 and step 3-2 to obtain the final grouping and generating the group mark (u_i, c_i), where c_i is the group to which node i belongs.
The invention mainly provides a group detection method based on the combination of the time sequence features and the spatial structure features of nodes. The method uses basic fields such as the transaction account number, the counterparty account number and the transaction time in the original financial transaction flow data, adaptively extracts time sequence features and spatial structure features through the sequence Skip-gram model and the GAE model, calculates the distance between every two nodes from the concatenated features as the weight, and can assign each user to a potential group through a group detection algorithm based on modularity optimization. The method reduces the workload of manual feature engineering and makes full use of the temporal and spatial characteristics of the transaction graph.