CN115797041A - Financial credit assessment method based on depth map semi-supervised learning - Google Patents



Publication number
CN115797041A
CN115797041A CN202111048605.4A CN202111048605A CN115797041A CN 115797041 A CN115797041 A CN 115797041A CN 202111048605 A CN202111048605 A CN 202111048605A CN 115797041 A CN115797041 A CN 115797041A
Authority
CN
China
Prior art keywords
graph
node
nodes
layer
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111048605.4A
Other languages
Chinese (zh)
Inventor
邱韵
徐小龙
邬晶
李少远
徐世界
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Electronic Commerce Co Ltd
Priority to CN202111048605.4A
Publication of CN115797041A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a financial credit assessment method based on deep graph semi-supervised learning. Here "deep" does not refer to the depth of the network structure as in deep learning, but to the depth at which graph information is mined. For the graph constructed from financial feature data, information is mined at two levels: first, the graph structure information is mined with a deep graph embedding method; then, the neighborhood information of the graph nodes is aggregated with a graph convolutional neural network. Compared with traditional graph semi-supervised learning methods, the method effectively alleviates the sparsity of information when labels are scarce and mines graph information more deeply, hence the name deep graph semi-supervised learning. The method improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.

Description

Financial credit assessment method based on deep graph semi-supervised learning
Technical Field
The invention relates to the field of credit assessment of financial users, and in particular to a financial credit assessment method based on deep graph semi-supervised learning.
Background
In the financial market, financial fraud occurs from time to time. It disrupts the normal order of financial transactions and causes heavy losses to users, enterprises and institutions. Common forms of financial fraud include bank fraud, insurance fraud, securities fraud and merchandise transaction fraud. To prevent financial fraud, credit assessment of financial users and enterprises has become an urgent need. If the behavior records of financial users are taken as features and the credit assessment results of the users as labels, the credit assessment problem can be abstracted as fitting reasonable labels from the users' feature data. How to build a reasonable and efficient mathematical model that accurately derives the credit assessment label from the behavior-record features of a financial user has therefore become a research hotspot in this field.
In the prior art, supervised machine learning methods are often used to learn the mapping from the feature data of financial users to their credit assessment labels. However, supervised methods usually require a large labeled data set as training samples, and in many practical scenarios obtaining sample labels is laborious and costly. Credit assessment of financial users is a typical example. In traditional financial credit assessment, the rating of a target user is usually determined from expert experience and complex algorithms after long-term tracking and analysis of the user's transactions, investment records and similar information. As a result, labeled samples are scarce in real financial credit assessment problems, which greatly limits the effectiveness of supervised learning models. Existing semi-supervised learning methods also have many limitations. Generative methods need the probability density function of the model to be known in advance, which requires expert knowledge and narrows their range of application. Semi-supervised support vector machines have high model complexity and are generally limited to binary classification. Co-training is sensitive to the data distribution and requires good independence between data attributes. Self-training has poor robustness and no ability to correct itself, so training errors accumulate.
Therefore, this patent proposes a deep-graph-based semi-supervised learning method for credit assessment and feature mining of financial users. Compared with the semi-supervised methods above, graph semi-supervised learning algorithms are suitable for arbitrarily distributed data sets, need no prior knowledge, and offer good robustness and low model complexity. On this basis, the invention combines deep graph embedding with semi-supervised learning based on a graph convolutional neural network. Compared with general graph semi-supervised learning methods, this effectively alleviates the sparsity of information when labels are scarce, mines the structural information of the graph in depth, and significantly improves credit assessment results on an open-source financial data set.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a financial credit assessment method based on deep graph semi-supervised learning. Here "deep" does not refer to the depth of the network structure as in deep learning, but to the depth at which graph information is mined. For the graph constructed from financial feature data, information is mined at two levels: first, the graph structure information is mined with a deep graph embedding method; then, the neighborhood information of the graph nodes is aggregated with a graph convolutional neural network. Compared with traditional graph semi-supervised learning methods, the method therefore effectively alleviates the sparsity of information when labels are scarce and mines graph information more deeply, hence the name deep graph semi-supervised learning.
The invention has the following main beneficial effects:
1. a graph is constructed from the financial feature data, and treating the feature data as graph data mines latent relationships between financial individuals;
2. the deep graph embedding method is applied to graph representation learning on the financial user graph, mining the graph structure information;
3. the graph convolutional neural network is used to train a semi-supervised learning model, aggregating the neighborhood information of the nodes in the graph;
4. compared with traditional financial credit assessment methods, the method based on deep graph semi-supervised learning improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.
To solve the above technical problems, the invention provides the following technical solution:
The invention provides a financial credit assessment method based on deep graph semi-supervised learning, comprising the following steps:
S1. Construct a graph from the original feature data set, i.e. from $X \in \mathbb{R}^{N \times D}$ construct the graph $G(V, E)$. Here X is the original feature data set, represented as an N×D matrix, where N is the number of user samples and D is the feature dimension of each user sample. The graph G(V, E) corresponding to X lies in the D-dimensional feature space. V (vertices) is the set of all nodes in the graph; there are N nodes in total, each node represents one user sample, and the coordinates of a node in the D-dimensional feature space equal the D feature values of its user sample. E (edges) is the set of edges between the nodes. An edge represents the connection between two nodes, and its weight is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node i and node j can be expressed as:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

where σ is the width parameter of the RBF function and $\lVert x_i - x_j \rVert_2^2$ is the squared Euclidean distance between nodes i and j. After the RBF mapping, the weights of all edges lie in the interval (0, 1), and edges between closer points have larger weights. The nodes and edges can be represented uniformly by an adjacency matrix; together they represent the structural information of the graph;
S2. Perform random walks on the constructed graph G(V, E). Taking each node in the graph as a starting point, n random walks are performed, each truncated at length m, which generates a set of node sequences. The walk strategy is based on the Node2Vec algorithm and blends breadth-first search (BFS) and depth-first search (DFS). In a Node2Vec random walk, for a pair of nodes (v, x) connected by an edge, given that the current node is v, the probability that the next node visited is x is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ is the i-th node of the walk, $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is a normalization constant. Two hyperparameters p and q are introduced to control the walk strategy; they act through $\pi_{vx}$, which is set as:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where $w_{vx}$ is the weight of the edge between node v and node x, t is the node visited immediately before v, and:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between node t and node x. The hyperparameter p is called the return parameter. It controls the probability of revisiting the node that was just visited: the smaller p is, the higher this probability. The hyperparameter q is called the in-out parameter. It controls the tendency of the walk: if q > 1, the walk more readily visits nodes near node t (corresponding to BFS); if q < 1, the walk more readily visits nodes far from node t (corresponding to DFS). With the hyperparameters p and q, the walk strategy can be adjusted flexibly to the structure of the graph, so that the model adapts to more kinds of data distribution and is no longer limited to cluster-type distributions based on Euclidean distance.
By performing a number of random walks from each node in the graph, the structural information of the whole graph is captured in the generated node sequences;
S3. Sample node pairs from the node sequences generated by the random walks, using a sliding window model. For each node sequence, sliding-window sampling with window length w yields a number of node pairs $(V_c, V_i)$, where $V_c$ is the center node and $V_i$ is a context (peripheral) node. The sampled pairs are then used as the training set of the skip-gram network model. For each input center node, the training objective is to maximize the co-occurrence probability between the center node and its context nodes. Expressed as a negative log-likelihood loss, the objective is:

$$\min_{\Phi}\; -\log \Pr\left(\{V_{c-w}, \ldots, V_{c+w}\} \setminus V_c \mid \Phi(V_c)\right)$$

where $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedding vector. It is represented by the mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, where N is the number of nodes in the graph (equivalently, the number of samples in the original data) and d is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix Φ are the training result of the skip-gram model. After training, each node in the graph has a d-dimensional vector representation, which gives the embedding of the whole graph. The value of d is set before training, when the skip-gram network model is built, and a suitable value can be chosen according to practical requirements and the resulting embedding quality;
S4. On the basis of the graph embedding obtained, perform semi-supervised learning with a graph convolutional neural network (GCN). The GCN emulates the convolution operation of signal processing: the feature vector of each node in the graph is treated as an input signal and convolved. Using the first-order Chebyshev-polynomial approximation of graph convolution gives the layer-wise forward propagation rule between successive layers of the network:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the l-th and (l+1)-th hidden layers respectively; in particular, $H^{(0)} = X$. $\tilde{A}$ is the adjacency matrix A of graph G with self-connections added, i.e.:

$$\tilde{A} = A + I_N$$

where $I_N$ is the identity matrix of order N. $\tilde{D}$ is the degree matrix of $\tilde{A}$. It is a diagonal matrix whose diagonal element in each row is the sum of the corresponding row of $\tilde{A}$, i.e.:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

$W^{(l)}$ is the trainable weight matrix of the l-th layer and $\sigma(\cdot)$ is the activation function of the hidden layer; the GCN uses the ReLU function as the hidden-layer activation. The graph convolution of each layer can be understood as a weighted aggregation of the first-order neighborhood information (the local structure information of the graph) of each node.
Because the structure information of the graph is already contained in the layer-wise forward propagation rule, the GCN omits the conventional unsupervised loss term that encodes graph structure and keeps only the cross-entropy of the labeled samples as the supervised loss term. The loss function is:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$

where Y is the true label, Z is the label probability distribution predicted by the GCN, l is the index of a labeled sample and $\mathcal{Y}_L$ is the set of such indices, f is the index of a neuron in the GCN output layer, and F is the number of label classes.
During GCN training, each iteration computes the output of every layer according to the layer-wise forward propagation rule (the weight matrices can be initialized randomly) and then updates the weight matrices layer by layer by gradient descent on the loss function. The iterations continue until the maximum number of iterations is reached, at which point training ends;
S5. Verify and compare results on an open-source financial data set. The selected data set is the credit card default data set from the UCI Machine Learning Repository. It contains 30000 samples, each with 23 features. Labels take the values 0 and 1 and indicate whether the user has defaulted on a credit card: 1 means default and 0 means normal. Default samples account for about 22% of the data set and normal samples for about 78%. If default samples are taken as positive and normal samples as negative, classification on this data set is a positive-sample detection problem with an imbalanced ratio of positive to negative samples. In this setting, conventional binary classification accuracy is not suitable as an evaluation metric, so the invention uses recall, precision and the F1 score as evaluation metrics, where the F1 score combines recall and precision (it is their harmonic mean).
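The three evaluation metrics named in step S5 can be illustrated with a minimal sketch. The function name `precision_recall_f1` and the convention of returning 0 when a denominator is empty are assumptions made for illustration; they are not taken from the patent. The second example mimics the roughly 22% / 78% class split to show why plain accuracy is uninformative here.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels, with 1 (default) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaults correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # normal users wrongly flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaults that were missed
    # Assumed convention: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A classifier that always predicts "normal" on a 22% / 78% split
# scores 78% accuracy but detects no defaults at all.
y_true = np.array([1] * 22 + [0] * 78)
y_pred = np.zeros(100, dtype=int)
accuracy = float(np.mean(y_true == y_pred))
print(accuracy, precision_recall_f1(y_true, y_pred))
```

The trivial predictor makes the point of step S5 concrete: its accuracy equals the share of normal samples, while recall and F1 are both zero.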
As a preferred embodiment of the present invention, in step S1, the width parameter σ =0.15 of the rbf mapping function.
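The graph construction of step S1 can be sketched as follows. This is only an illustration under stated assumptions: the kernel is taken in the common form exp(-d²/(2σ²)), the function name `rbf_adjacency` is hypothetical, the diagonal is zeroed because self-connections are added later in step S4, and the features are assumed to be scaled to a comparable range so that a width such as σ = 0.15 is meaningful.

```python
import numpy as np

def rbf_adjacency(X, sigma=0.15):
    """Dense weighted adjacency matrix built from an (N, D) feature matrix."""
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative rounding errors
    # Assumed kernel form: exp(-d^2 / (2 sigma^2))
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-loops at this stage
    return W

# Three toy samples: the first two are close, the third is far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
W = rbf_adjacency(X, sigma=0.15)
print(W)
```

A dense N×N matrix is only practical for small N; for a data set of 30000 samples some form of sparsification would likely be needed, which the source does not specify.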
As a preferred embodiment of the present invention, in step S2, the number of random walks from each node n =20, the truncation length m of each random walk =20, the return parameter p =0.5, and the in-out parameter q =0.25.
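The biased random walk of step S2 can be sketched as below, using the preferred values n = 20, m = 20, p = 0.5 and q = 0.25 as defaults. The function names `node2vec_walk` and `generate_walks` are hypothetical, the truncation length m is interpreted as the number of nodes per walk (an assumption), and the first step of each walk uses the edge weights alone because no previous node t exists yet.

```python
import numpy as np

def node2vec_walk(W, start, length, p=0.5, q=0.25, rng=None):
    """One biased random walk of at most `length` nodes on a weighted adjacency matrix W."""
    rng = np.random.default_rng() if rng is None else rng
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = np.flatnonzero(W[v])
        if neighbors.size == 0:  # isolated node: stop early
            break
        probs = W[v, neighbors].astype(float)  # start from the edge weights w_vx
        if len(walk) > 1:
            t = walk[-2]  # node visited immediately before v
            for k, x in enumerate(neighbors):
                if x == t:            # d_tx = 0: step back to t, alpha = 1/p
                    probs[k] /= p
                elif W[t, x] == 0:    # d_tx = 2: x is not adjacent to t, alpha = 1/q
                    probs[k] /= q
                # otherwise d_tx = 1 and alpha = 1
        probs /= probs.sum()  # division by the normalization constant Z
        walk.append(int(rng.choice(neighbors, p=probs)))
    return walk

def generate_walks(W, n=20, m=20, p=0.5, q=0.25, seed=0):
    """n walks of truncation length m starting from every node."""
    rng = np.random.default_rng(seed)
    return [node2vec_walk(W, v, m, p, q, rng)
            for _ in range(n) for v in range(W.shape[0])]

# Toy path graph 0 - 1 - 2 - 3 with unit edge weights.
W_path = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    W_path[a, b] = W_path[b, a] = 1.0
walks = generate_walks(W_path, n=2, m=5)
print(walks[:2])
```

The per-neighbor loop keeps the correspondence with the three cases of α explicit; a production implementation would typically precompute transition tables instead.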
As a preferred technical solution of the present invention, in step S3, after the graph is embedded, the vector dimension d =100 corresponding to each node, and the window length w =10 in the sliding window model.
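The sliding-window sampling of step S3 can be sketched as below. The function name `window_pairs` is hypothetical, and the window length w is interpreted as the number of positions considered on each side of the center node; the source does not state whether w is the half-width or the full width, so this is an assumption.

```python
def window_pairs(walk, w=10):
    """(center, context) pairs: each node is paired with nodes at most w positions away."""
    pairs = []
    for c, center in enumerate(walk):
        lo = max(0, c - w)
        hi = min(len(walk), c + w + 1)
        for i in range(lo, hi):
            if i != c:  # a node is not its own context
                pairs.append((center, walk[i]))
    return pairs

# Short walk with a window of one position on each side.
pairs = window_pairs([1, 2, 3], w=1)
print(pairs)  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Pairs produced this way would form the training set of the skip-gram model, whose learned mapping matrix has one d-dimensional row per node (d = 100 in the preferred embodiment).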
Compared with the prior art, the invention has the following beneficial effects:
1. a graph is constructed from the financial feature data, and treating the feature data as graph data mines latent relationships between financial individuals;
2. the deep graph embedding method is applied to graph representation learning on the financial user graph, mining the graph structure information;
3. the graph convolutional neural network is used to train a semi-supervised learning model, aggregating the neighborhood information of the nodes in the graph;
4. compared with traditional financial credit assessment methods, the method based on deep graph semi-supervised learning improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows the random walk strategy of the Node2Vec algorithm used for deep graph embedding, where BFS stands for breadth-first search and DFS stands for depth-first search;
FIG. 3 shows the node-pair sampling process of the skip-gram model using a sliding window;
FIG. 4 is a diagram of the skip-gram network model architecture;
FIG. 5 shows the training process of the GCN model on the graph;
FIG. 6 shows the results of the three evaluation metrics for different models at a label rate of 0.1;
FIG. 7 shows the F1 scores of different models at different label rates;
FIG. 8 compares the number of training iterations of the GCN on raw data and the GCN on deep graph embedding results at different label rates.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.
Example 1
The invention provides a financial credit assessment method based on deep graph semi-supervised learning, comprising the following steps:
S1. Construct a graph from the original feature data set, i.e. from $X \in \mathbb{R}^{N \times D}$ construct the graph $G(V, E)$. Here X is the original feature data set, represented as an N×D matrix, where N is the number of user samples and D is the feature dimension of each user sample. The graph G(V, E) corresponding to X lies in the D-dimensional feature space. V (vertices) is the set of all nodes in the graph; there are N nodes in total, each node represents one user sample, and the coordinates of a node in the D-dimensional feature space equal the D feature values of its user sample. E (edges) is the set of edges between the nodes. An edge represents the connection between two nodes, and its weight is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node i and node j can be expressed as:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

where σ is the width parameter of the RBF function and $\lVert x_i - x_j \rVert_2^2$ is the squared Euclidean distance between nodes i and j. After the RBF mapping, the weights of all edges lie in the interval (0, 1), and edges between closer points have larger weights. The nodes and edges can be represented uniformly by an adjacency matrix; together they represent the structural information of the graph;
S2. Perform random walks on the constructed graph G(V, E). Taking each node in the graph as a starting point, n random walks are performed, each truncated at length m, which generates a set of node sequences. The walk strategy is based on the Node2Vec algorithm and blends breadth-first search (BFS) and depth-first search (DFS). In a Node2Vec random walk, for a pair of nodes (v, x) connected by an edge, given that the current node is v, the probability that the next node visited is x is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ is the i-th node of the walk, $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is a normalization constant. Two hyperparameters p and q are introduced to control the walk strategy; they act through $\pi_{vx}$, which is set as:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where $w_{vx}$ is the weight of the edge between node v and node x, t is the node visited immediately before v, and:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between node t and node x. The hyperparameter p is called the return parameter. It controls the probability of revisiting the node that was just visited: the smaller p is, the higher this probability. The hyperparameter q is called the in-out parameter. It controls the tendency of the walk: if q > 1, the walk more readily visits nodes near node t (corresponding to BFS); if q < 1, the walk more readily visits nodes far from node t (corresponding to DFS). With the hyperparameters p and q, the walk strategy can be adjusted flexibly to the structure of the graph, so that the model adapts to more kinds of data distribution and is no longer limited to cluster-type distributions based on Euclidean distance.
By performing a number of random walks from each node in the graph, the structural information of the whole graph is captured in the generated node sequences;
S3. Sample node pairs from the node sequences generated by the random walks, using a sliding window model. For each node sequence, sliding-window sampling with window length w yields a number of node pairs $(V_c, V_i)$, where $V_c$ is the center node and $V_i$ is a context (peripheral) node. The sampled pairs are then used as the training set of the skip-gram network model. For each input center node, the training objective is to maximize the co-occurrence probability between the center node and its context nodes. Expressed as a negative log-likelihood loss, the objective is:

$$\min_{\Phi}\; -\log \Pr\left(\{V_{c-w}, \ldots, V_{c+w}\} \setminus V_c \mid \Phi(V_c)\right)$$

where $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedding vector. It is represented by the mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, where N is the number of nodes in the graph (equivalently, the number of samples in the original data) and d is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix Φ are the training result of the skip-gram model. After training, each node in the graph has a d-dimensional vector representation, which gives the embedding of the whole graph. The value of d is set before training, when the skip-gram network model is built, and a suitable value can be chosen according to practical requirements and the resulting embedding quality;
S4. On the basis of the graph embedding obtained, perform semi-supervised learning with a graph convolutional neural network (GCN). The GCN emulates the convolution operation of signal processing: the feature vector of each node in the graph is treated as an input signal and convolved. Using the first-order Chebyshev-polynomial approximation of graph convolution gives the layer-wise forward propagation rule between successive layers of the network:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the l-th and (l+1)-th hidden layers respectively; in particular, $H^{(0)} = X$. $\tilde{A}$ is the adjacency matrix A of graph G with self-connections added, i.e.:

$$\tilde{A} = A + I_N$$

where $I_N$ is the identity matrix of order N. $\tilde{D}$ is the degree matrix of $\tilde{A}$. It is a diagonal matrix whose diagonal element in each row is the sum of the corresponding row of $\tilde{A}$, i.e.:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

$W^{(l)}$ is the trainable weight matrix of the l-th layer and $\sigma(\cdot)$ is the activation function of the hidden layer; the GCN uses the ReLU function as the hidden-layer activation. The graph convolution of each layer can be understood as a weighted aggregation of the first-order neighborhood information (the local structure information of the graph) of each node.
Because the structure information of the graph is already contained in the layer-wise forward propagation rule, the GCN omits the conventional unsupervised loss term that encodes graph structure and keeps only the cross-entropy of the labeled samples as the supervised loss term. The loss function is:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$

where Y is the true label, Z is the label probability distribution predicted by the GCN, l is the index of a labeled sample and $\mathcal{Y}_L$ is the set of such indices, f is the index of a neuron in the GCN output layer, and F is the number of label classes.
During GCN training, each iteration computes the output of every layer according to the layer-wise forward propagation rule (the weight matrices can be initialized randomly) and then updates the weight matrices layer by layer by gradient descent on the loss function. The iterations continue until the maximum number of iterations is reached, at which point training ends;
S5. Verify and compare results on an open-source financial data set. The selected data set is the credit card default data set from the UCI Machine Learning Repository. It contains 30000 samples, each with 23 features. Labels take the values 0 and 1 and indicate whether the user has defaulted on a credit card: 1 means default and 0 means normal. Default samples account for about 22% of the data set and normal samples for about 78%. If default samples are taken as positive and normal samples as negative, classification on this data set is a positive-sample detection problem with an imbalanced ratio of positive to negative samples. In this setting, conventional binary classification accuracy is not suitable as an evaluation metric, so the invention uses recall, precision and the F1 score as evaluation metrics, where the F1 score combines recall and precision (it is their harmonic mean).
In step S1, the width parameter σ =0.15 of the rbf mapping function.
In step S2, the number of random walks from each node n =20, the truncation length m =20 for each random walk, the return parameter p =0.5, and the in-out parameter q =0.25.
In step S3, after the graph is embedded, the vector dimension d =100 corresponding to each node, and the window length w =10 in the sliding window model.
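The layer-wise propagation rule and the labeled-sample loss of step S4 can be sketched in NumPy as follows. This is a forward pass only, under stated assumptions: the network has two layers, the output layer uses a softmax (the source only specifies ReLU for hidden layers), and all function names are hypothetical. The input `H0` stands for the node feature matrix, which in the proposed method would be the embedding matrix obtained in step S3.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2), the propagation matrix of the GCN layer rule."""
    A_tilde = A + np.eye(A.shape[0])   # add self-connections
    d = A_tilde.sum(axis=1)            # diagonal of the degree matrix D~ (row sums)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A_hat, H0, W0, W1):
    """Two-layer forward pass: ReLU hidden layer, softmax output (assumed)."""
    H1 = np.maximum(A_hat @ H0 @ W0, 0.0)         # H(1) = ReLU(A_hat H(0) W(0))
    logits = A_hat @ H1 @ W1
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    expz = np.exp(logits)
    return expz / expz.sum(axis=1, keepdims=True)

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy summed over labeled samples only; Y is one-hot."""
    Z_l = np.clip(Z[labeled_idx], 1e-12, 1.0)     # avoid log(0)
    return -np.sum(Y[labeled_idx] * np.log(Z_l))

# Toy example: 3 nodes on a path, 4 input features, 5 hidden units, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
A_hat = normalize_adjacency(A)
H0 = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 5))
W1 = rng.normal(size=(5, 2))
Z = gcn_forward(A_hat, H0, W0, W1)
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
loss = masked_cross_entropy(Z, Y, [0, 1])  # only nodes 0 and 1 are labeled
print(Z, loss)
```

Training would repeat this forward pass and update `W0` and `W1` by gradient descent on `loss`; an automatic differentiation framework would normally be used for that step, which is omitted here.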
Specifically, for comparison with the method of the invention, the label propagation algorithm (LPA) is evaluated on the above data set both on the graph embedding results and on the original data without graph embedding, and the GCN is also evaluated on the original data without graph embedding. The verification results are shown in FIG. 6, FIG. 7 and FIG. 8. FIG. 6 shows the results of the three evaluation metrics for different models at a label rate of 0.1. FIG. 7 shows the F1 scores of different models at different label rates. FIG. 8 compares the number of training iterations of the GCN on raw data and the GCN on deep graph embedding results at different label rates. In the three figures, ' marks an experiment based on the graph embedding results.
Analysis of the three result figures shows the following beneficial effects of the invention:
1. applying depth map embedding to financial credit assessment under scarce labels improves the overall effect of the semi-supervised learning model;
2. applying the graph convolutional neural network to financial credit assessment under scarce labels obviously improves the semi-supervised learning effect compared with the traditional graph semi-supervised learning method;
3. combining depth map embedding with the graph convolutional neural network remarkably improves the effect of the model under extremely scarce labels (low label rate);
4. combining depth map embedding with the graph convolutional neural network remarkably increases the training speed of the graph convolutional neural network and reduces the computational cost.
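For orientation, the graph convolution referred to in items 2 to 4 can be sketched as a single propagation step, following the rule described in step S4: self-connections are added to the adjacency matrix, the result is symmetrically normalized by the degree matrix, and a ReLU activation is applied. The sketch uses plain Python lists; the helper names (`matmul`, `normalized_adjacency`, `gcn_layer`) and all numeric values are hypothetical, and no training is shown.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalized_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for a square adjacency matrix A."""
    n = len(A)
    # Add self-connections: A~ = A + I.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degree of node i is the sum of row i of A~.
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Hypothetical 3-node path graph, 2-dimensional node features, 2x2 weights.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.5, -1.0], [0.25, 0.5]]
A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, H0, W0)
```

Each row of `H1` mixes a node's own features with those of its direct neighbors, which is the first-order neighborhood aggregation the method relies on. In the method, the node features fed to the first layer would be the graph embedding vectors rather than these toy values.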
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A financial credit assessment method based on depth map semi-supervised learning, characterized by comprising the following steps:
S1, constructing a graph from the original feature data set, namely constructing a graph G(V, E) for $X \in \mathbb{R}^{N \times D}$, wherein X is the original feature data set, represented by an N×D matrix, N is the number of user samples, and D is the feature dimension of each user sample. The graph G(V, E) corresponding to X is located in the D-dimensional feature space. V (vertices) represents the set of all nodes in the graph; there are N nodes in total, each node represents one user sample, and the coordinates of a node in the D-dimensional feature space equal the D-dimensional feature values of the corresponding user sample. E (edges) is the set of edges between all nodes; the edges represent the connection relations between the nodes, and the weight of each edge is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node i and node j can be expressed as:
Figure FDA0003251876630000011
where σ represents the width parameter of the RBF function, and
Figure FDA0003251876630000012
represents the square of the Euclidean distance between nodes i and j. After RBF mapping, the weights of all edges lie in the interval (0, 1), and edges between closer points have larger weights. The nodes and edges can be uniformly represented by an adjacency matrix, and together they represent the structural information of the graph;
S2, performing random walks on the constructed graph G(V, E): taking each node in the graph as a starting point, n random walks are performed, each truncated at length m, thereby generating a set of node sequences. Based on the Node2Vec algorithm, the random walk strategy is divided into breadth-first search (BFS) and depth-first search (DFS). In the random walk of the Node2Vec algorithm, for a node pair (v, x) connected by an edge, given that the current node is v, the probability that the next node visited by the random walk is x is:
$$P(x \mid v) = \frac{\pi_{vx}}{Z}$$
wherein $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is the normalization constant. The calculation of $\pi_{vx}$ introduces two hyper-parameters p and q to control the walk strategy, which enter $\pi_{vx}$ as follows:
$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$
wherein $w_{vx}$ is the edge weight between node v and node x, t is the node visited immediately before v, and:
$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$
wherein $d_{tx}$ is the shortest-path distance between node t and node x. The hyper-parameter p is called the return parameter and controls the probability of revisiting the node that was just visited during the random walk; the smaller p is, the higher this probability. The hyper-parameter q is called the in-out parameter and controls the tendency of the walk: if q > 1, the random walk more easily visits nodes around node t (corresponding to BFS); if q < 1, the random walk more easily visits nodes far away from node t (corresponding to DFS). With the hyper-parameters p and q, the random walk strategy can be flexibly adjusted according to the structure of the graph, so that the model adapts to more kinds of data distributions and is not limited to cluster-type distributions based on Euclidean distance.
By performing several random walks from each node in the graph, the structural information of the whole graph is contained in the generated node sequences;
S3, first, a sliding window model is used to sample node pairs from the set of node sequences generated by the random walks. For each node sequence, a sliding window of length w is used to sample a number of node pairs $(V_c, V_i)$, wherein $V_c$ represents the center node and $V_i$ represents a peripheral (context) node. The sampled node pairs are then used as the training set of the skip-gram network model. For each input center node, the training goal is to maximize the co-occurrence probability between the center node and its peripheral nodes; this goal is expressed by a negative log-likelihood loss function, whose mathematical expression is:
Figure FDA0003251876630000031
wherein $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedded representation vector and is represented by a mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, N is the number of nodes in the graph (i.e. the number of samples in the original data), and d is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix Φ are the training result of the skip-gram model. Finally, through training of the skip-gram model, a d-dimensional vector representation is obtained for each node in the graph, which yields the embedded representation of the whole graph. The size of d can be preset before training, and a suitable value of d can be selected according to actual requirements and the resulting graph embedding effect when constructing the skip-gram network model;
S4, on the basis of the obtained graph embedding, a graph convolutional neural network (GCN) is used for semi-supervised learning. The GCN emulates the convolution operation in signal processing: the feature vector of each node in the graph is taken as an input signal and convolved. Using the first-order Chebyshev polynomial approximation of the graph convolution, the layer-by-layer forward propagation rule between the neural network layers is obtained:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
wherein $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the l-th hidden layer and the (l+1)-th hidden layer, respectively; in particular, $H^{(0)} = X$. $\tilde{A}$ is the adjacency matrix of the graph G with self-connections added:
$$\tilde{A} = A + I_N$$
in which A is the adjacency matrix of the graph G and $I_N$ is the identity matrix of order N. $\tilde{D}$ is the degree matrix corresponding to $\tilde{A}$; it is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of $\tilde{A}$, i.e.:
$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$
$W^{(l)}$ represents the weight matrix to be trained for the l-th layer, and σ(·) represents the activation function of the hidden layer; the ReLU function is adopted as the activation function of the hidden layers in the GCN. The graph convolution operation of each layer can be understood as a weighted aggregation of the first-order neighborhood information (i.e. the local structure information of the graph) of each node in the graph.
Since the structure information of the graph is contained in the layer-by-layer forward propagation rule, the GCN omits the conventional unsupervised loss term containing the graph structure information during training and retains only the cross-entropy function of the labeled samples as the supervised loss term. The loss function is:
$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$
wherein Y represents the true labels, Z represents the label probability distribution predicted by the GCN, l is the index of a labeled sample ($\mathcal{Y}_L$ being the set of labeled sample indices), f is the index of a neuron in the GCN output layer, and F is the number of label classes.
In the training process of the GCN, in each iteration the output of each layer is calculated according to the layer-by-layer forward propagation rule (the weight matrices can be initialized randomly), and the weight matrices are updated layer by layer by gradient descent on the loss function; this is repeated until the maximum number of iterations is reached, at which point the training is finished;
S5, verifying and comparing results on an open-source financial data set. The selected data set is a credit card default data set from the UCI Machine Learning Repository. It comprises 30000 samples, each with 23-dimensional features, and two classes of labels represented by 0 and 1. The label of each sample indicates whether the user has defaulted on a credit card: 1 represents default and 0 represents normal. In the data set, default samples account for about 22% and normal samples for about 78%. If the default samples are taken as positive samples and the normal samples as negative samples, the classification problem becomes a positive-sample detection problem under an imbalanced ratio of positive to negative samples. In this case the traditional binary classification accuracy is not suitable as an evaluation index, so the invention adopts recall, precision, and F1 score as evaluation indexes, wherein the F1 score combines recall and precision as their harmonic mean.
2. The financial credit assessment method based on depth map semi-supervised learning according to claim 1, wherein in step S1, the width parameter of the RBF mapping function is σ = 0.15.
3. The financial credit assessment method based on depth map semi-supervised learning according to claim 1, wherein in step S2, the number of random walks starting from each node is n = 20, the truncation length of each random walk is m = 20, the return parameter is p = 0.5, and the in-out parameter is q = 0.25.
4. The financial credit assessment method based on depth map semi-supervised learning according to claim 1, wherein in step S3, the dimension of the vector corresponding to each node after graph embedding is d = 100, and the window length in the sliding window model is w = 10.
CN202111048605.4A 2021-09-08 2021-09-08 Financial credit assessment method based on depth map semi-supervised learning Pending CN115797041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048605.4A CN115797041A (en) 2021-09-08 2021-09-08 Financial credit assessment method based on depth map semi-supervised learning

Publications (1)

Publication Number Publication Date
CN115797041A true CN115797041A (en) 2023-03-14

Family

ID=85473411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048605.4A Pending CN115797041A (en) 2021-09-08 2021-09-08 Financial credit assessment method based on depth map semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115797041A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829683A (en) * 2024-03-04 2024-04-05 国网山东省电力公司信息通信公司 Electric power Internet of things data quality analysis method and system based on graph comparison learning

Legal Events

Date Code Title Description
PB01 Publication