CN115797041A

CN115797041A - Financial credit assessment method based on depth map semi-supervised learning

Info

Publication number: CN115797041A
Application number: CN202111048605.4A
Authority: CN
Inventors: 邱韵; 徐小龙; 邬晶; 李少远; 徐世界
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2023-03-14

Abstract

The invention discloses a financial credit assessment method based on depth map semi-supervised learning. Here "depth" does not refer to the depth of the network structure as in deep learning, but to the depth of information mining on the graph. For the graph constructed from financial feature data, two levels of information mining are carried out: first, the graph structure information is mined using a depth map embedding method; then, the neighborhood information of the graph nodes is aggregated using a graph convolutional neural network. Compared with traditional graph semi-supervised learning methods, the proposed method effectively alleviates the problem of sparse information when labels are scarce and mines the graph information more deeply, so it is called deep graph semi-supervised learning. The method improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.

Description

Financial credit assessment method based on depth map semi-supervised learning

Technical Field

The invention relates to the field of financial user credit evaluation, in particular to a financial credit evaluation method based on depth map semi-supervised learning.

Background

In the financial market, financial fraud occurs from time to time. It not only disrupts the normal order of financial transactions but also causes huge losses to users, enterprises and institutions. Common types of financial fraud include bank fraud, insurance fraud, securities fraud, and merchandise transaction fraud. To prevent financial fraud, credit assessment of financial users, enterprises and the like has become an urgent need. If the behavior records of financial users are taken as features and the credit assessment results of the users are taken as labels, the credit assessment problem can be abstracted as how to fit reasonable labels from the users' feature data. Therefore, how to establish a reasonable and efficient mathematical model that accurately derives the corresponding credit assessment label from the behavior-record features of a financial user has become a research hotspot in the field of financial user credit assessment.

In the prior art, supervised machine learning methods are often used to learn the mapping from the feature data of financial users to their credit assessment labels. However, supervised machine learning usually requires a large labeled data set as training samples, and in many practical scenarios acquiring sample labels is laborious and costly. Credit assessment of financial users is a typical example. In traditional financial credit assessment, the rating of a target user is usually determined from expert experience and complex algorithms after long-term tracking and analysis of the user's transactions, investment records and other information. As a result, labeled samples are scarce in practical financial credit assessment problems, which greatly limits the effectiveness of supervised learning models for credit assessment. Existing semi-supervised learning methods also have many limitations. Generative methods need the probability density function of the model to be known in advance, which requires expert knowledge and narrows their field of application. Semi-supervised support vector machines have high model complexity and are generally limited to binary classification. Co-training is sensitive to the data distribution and requires good independence between data attributes. Self-training has poor robustness, has no self-correcting capability, and accumulates training errors.

Therefore, this patent provides a depth map semi-supervised learning method for credit assessment and feature mining of financial users. Compared with the semi-supervised learning methods above, graph semi-supervised learning algorithms are suitable for arbitrarily distributed data sets, need no prior knowledge, and offer good robustness and low model complexity. On this basis, depth map embedding is combined with semi-supervised learning based on a graph convolutional neural network. Compared with general graph semi-supervised learning methods, this effectively alleviates the problem of sparse information when labels are scarce, mines the structural information of the graph in depth, and significantly improves credit assessment performance on an open-source financial data set.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a financial credit assessment method based on depth map semi-supervised learning. Here "depth" does not refer to the depth of the network structure as in deep learning, but to the depth of information mining on the graph. For the graph constructed from financial feature data, two levels of information mining are carried out: first, the graph structure information is mined using a depth map embedding method; then, the neighborhood information of the graph nodes is aggregated using a graph convolutional neural network. Compared with traditional graph semi-supervised learning methods, the proposed method therefore effectively alleviates the problem of sparse information when labels are scarce and mines the graph information more deeply, so it is called deep graph semi-supervised learning.

The main beneficial effects of the invention are as follows:

1. A graph is constructed from the financial feature data. By processing the financial feature data as graph data, potential relationship information between financial individuals is mined;

2. The depth map embedding method is applied to graph representation learning on the financial user graph to mine graph structure information;

3. A graph convolutional neural network is used to train the semi-supervised learning model and to aggregate the neighborhood information of the nodes in the graph;

4. Compared with traditional financial credit assessment methods, the method based on depth map semi-supervised learning improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.

In order to solve the technical problems, the invention provides the following technical scheme:

The invention provides a financial credit assessment method based on depth map semi-supervised learning, which comprises the following steps:

S1. Construct a graph from the original feature data set, i.e. construct a graph $G(V, E)$ from $X \in \mathbb{R}^{N \times D}$. Here $X$ is the original feature data set, represented as an $N \times D$ matrix, where $N$ is the number of user samples and $D$ is the feature dimension of each user sample. The graph $G(V, E)$ corresponding to $X$ lies in the $D$-dimensional feature space. $V$ (vertices) is the set of all nodes in the graph; there are $N$ nodes in total, each node represents one user sample, and the coordinates of a node in the $D$-dimensional feature space equal the $D$-dimensional feature values of the corresponding user sample. $E$ (edges) is the set of edges between all nodes; an edge represents the connection between two nodes, and its weight is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node $i$ and node $j$ can be expressed, in the usual Gaussian RBF form, as:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

where $\sigma$ is the width parameter of the RBF function and $\lVert x_i - x_j \rVert_2^2$ is the squared Euclidean distance between nodes $i$ and $j$. After the RBF mapping, all edge weights lie in $(0, 1)$, and edges between closer points have larger weights. The nodes and edges can be represented uniformly by an adjacency matrix, and together they represent the structural information of the graph.
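The graph construction in step S1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `rbf_adjacency` is hypothetical, the $2\sigma^2$ denominator is the common Gaussian RBF convention rather than something the text confirms, and the graph is kept dense and fully connected, which would not scale to tens of thousands of samples without sparsification.

```python
import numpy as np

def rbf_adjacency(X, sigma=0.15):
    """Dense RBF-weighted adjacency matrix for the rows of X (shape N x D).

    Assumption: w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b,
    # clipped at zero to absorb floating-point round-off.
    sq_dists = np.maximum(
        sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T), 0.0
    )
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops here; step S4 adds them separately
    return W

# Three toy "users" with two features each (hypothetical data).
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
W_demo = rbf_adjacency(X_demo, sigma=0.15)
```

With a width as small as 0.15, the weights decay quickly with distance, so the features would presumably need to be scaled to a comparable range before the graph is built.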

S2. Perform random walks on the constructed graph $G(V, E)$. Taking each node in the graph as a starting point, $n$ random walks are performed, each truncated at length $m$, which generates a set of node sequences. The walk strategy is based on the Node2Vec algorithm and is divided into breadth-first search (BFS) and depth-first search (DFS). During a Node2Vec random walk, for a pair of nodes $(v, x)$ connected by an edge, given that the current node is $v$, the probability that the next node visited by the walk is $x$ is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ denotes the $i$-th node of the walk, $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalization constant. For the calculation of $\pi_{vx}$, two hyper-parameters $p$ and $q$ are introduced to control the walk strategy. With $t$ denoting the node visited immediately before $v$, set:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where $w_{vx}$ is the edge weight between node $v$ and node $x$, and:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between node $t$ and node $x$. The hyper-parameter $p$ is called the return parameter; it controls the probability of revisiting the node just visited during the walk, and a smaller $p$ increases that probability. The hyper-parameter $q$ is called the in-out parameter; it controls the tendency of the walk. If $q > 1$, the walk more readily visits nodes around node $t$ (corresponding to BFS); if $q < 1$, the walk more readily visits nodes far from node $t$ (corresponding to DFS). With the hyper-parameters $p$ and $q$, the walk strategy can be adjusted flexibly according to the structure of the graph, so that the model adapts to more kinds of data distribution and is no longer limited to cluster-type distributions based on Euclidean distance.

By performing a number of random walks from each node in the graph, the structural information of the whole graph is contained in the generated node sequences.
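The biased walk of step S2 can be illustrated with the standard library alone. The sketch below is not the patent's implementation: the helper names (`build_graph`, `alpha_pq`, `node2vec_walk`, `generate_walks`) and the toy edge list are hypothetical, the graph is stored as plain dictionaries, and the first step of each walk is assumed to follow the edge weights only because no previous node exists yet.

```python
import random

def build_graph(weighted_edges):
    """Undirected weighted graph as (neighbor sets, symmetric weight dict)."""
    neighbors, weights = {}, {}
    for u, v, w in weighted_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
        weights[(u, v)] = weights[(v, u)] = w
    return neighbors, weights

def alpha_pq(prev, nxt, neighbors, p, q):
    """Search bias for moving to nxt when the walk arrived from prev."""
    if nxt == prev:               # d_tx = 0: step back to the previous node
        return 1.0 / p
    if nxt in neighbors[prev]:    # d_tx = 1: stay close to the previous node
        return 1.0
    return 1.0 / q                # d_tx = 2: move away from the previous node

def node2vec_walk(start, neighbors, weights, length, p, q, rng):
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        candidates = sorted(neighbors[v])
        if not candidates:
            break
        if len(walk) == 1:
            # Assumption: no previous node yet, so use edge weights only.
            pi = [weights[(v, x)] for x in candidates]
        else:
            t = walk[-2]
            pi = [alpha_pq(t, x, neighbors, p, q) * weights[(v, x)]
                  for x in candidates]
        # random.choices normalizes the weights, which plays the role of Z.
        walk.append(rng.choices(candidates, weights=pi, k=1)[0])
    return walk

def generate_walks(neighbors, weights, n=20, m=20, p=0.5, q=0.25, seed=0):
    """n walks of truncation length m from every node."""
    rng = random.Random(seed)
    return [node2vec_walk(s, neighbors, weights, m, p, q, rng)
            for _ in range(n) for s in sorted(neighbors)]

demo_neighbors, demo_weights = build_graph(
    [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7), (2, 3, 0.5)]
)
demo_walks = generate_walks(demo_neighbors, demo_weights, n=2, m=5)
```

Because `random.choices` accepts unnormalized weights, the normalization constant $Z$ never has to be computed explicitly in this sketch.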

S3. For the set of node sequences generated by the random walks, node pairs are first sampled using a sliding window model. For each node sequence, sliding-window sampling with window length $w$ yields a number of node pairs $(V_c, V_i)$, where $V_c$ is the center node and $V_i$ is a peripheral (context) node. The sampled pairs then serve as the training set of the skip-gram network model. For each input center node, the training goal is to maximize the co-occurrence probability between the center node and its peripheral nodes. Expressed as a negative log-likelihood loss, the objective is:

$$\min_{\Phi} \; -\log \Pr\left(\{V_{c-w}, \ldots, V_{c+w}\} \setminus V_c \mid \Phi(V_c)\right)$$

where $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedding vector. It is represented by a mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, where $N$ is the number of nodes in the graph (equivalently, the number of samples in the original data) and $d$ is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix $\Phi$ are the training result of the skip-gram model. Finally, training the skip-gram model yields a $d$-dimensional vector representation for every node in the graph, and hence an embedded representation of the whole graph. The value of $d$ is set before training and can be chosen according to practical requirements and the resulting embedding quality.
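A small sketch may make the pair sampling and the loss of step S3 concrete. Several things here are assumptions rather than details from the text: the window is taken to extend $w$ positions on each side of the center node, the per-pair loss uses a full softmax over all nodes (practical implementations usually substitute negative sampling or hierarchical softmax), and `Psi` is a hypothetical output-side matrix that the text does not name.

```python
import numpy as np

def window_pairs(walk, w):
    """All (center, context) pairs within w positions of each center node."""
    pairs = []
    for c, center in enumerate(walk):
        lo, hi = max(0, c - w), min(len(walk), c + w + 1)
        for i in range(lo, hi):
            if i != c:
                pairs.append((center, walk[i]))
    return pairs

def skipgram_pair_loss(Phi, Psi, center, context):
    """Negative log-probability of one context node given the center node.

    Phi: (N, d) embedding matrix, the quantity the skip-gram model learns.
    Psi: (N, d) output-side matrix (assumed; not named in the text).
    """
    scores = Psi @ Phi[center]           # one score per candidate node
    scores = scores - scores.max()       # stabilize the softmax numerically
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[context]

demo_pairs = window_pairs([0, 1, 2, 3], w=1)
demo_loss = skipgram_pair_loss(np.zeros((4, 3)), np.zeros((4, 3)), 0, 1)
```

With all-zero matrices every node is equally likely, so the per-pair loss equals $\ln N$, which is a convenient sanity check before training.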

S4. On the basis of the graph embedding obtained above, a graph convolutional neural network (GCN) is used for semi-supervised learning. The GCN imitates the convolution operation in signal processing: the feature vector of each node in the graph is taken as an input signal and convolved. Using the first-order Chebyshev polynomial approximation of the graph convolution, the layer-wise forward propagation rule between the layers of the network is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the $l$-th and $(l+1)$-th hidden layers respectively; in particular, $H^{(0)} = X$. $\tilde{A}$ is the adjacency matrix of graph $G$ with self-connections added, namely:

$$\tilde{A} = A + I_N$$

where $A$ is the adjacency matrix of graph $G$ and $I_N$ is the identity matrix of order $N$. $\tilde{D}$ is the degree matrix corresponding to $\tilde{A}$; it is a diagonal matrix in which each diagonal element is the sum of the corresponding row of $\tilde{A}$, i.e.:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

$W^{(l)}$ is the weight matrix to be trained for the $l$-th layer, and $\sigma(\cdot)$ is the activation function of the hidden layer; the ReLU function is used as the hidden-layer activation in the GCN. The graph convolution of each layer can be understood as a weighted aggregation of the first-order neighborhood information (i.e. the local structure information of the graph) of every node in the graph.
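The propagation rule can be written out directly in NumPy. The sketch below is illustrative only and rests on several assumptions: a two-layer network, a softmax on the output layer to produce class probabilities (the text specifies ReLU for hidden layers only), dense matrices, and hypothetical names and toy shapes throughout.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])       # add self-connections
    deg = A_tilde.sum(axis=1)              # row sums form the diagonal of D~
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_forward(A_hat, H0, W0, W1):
    """Two-layer GCN: ReLU hidden layer, softmax output (assumed)."""
    H1 = np.maximum(A_hat @ H0 @ W0, 0.0)  # hidden layer with ReLU
    logits = A_hat @ H1 @ W1
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy example: 3 nodes, 4 input features, 5 hidden units, 2 classes.
rng = np.random.default_rng(0)
A_demo = np.array([[0.0, 0.8, 0.1],
                   [0.8, 0.0, 0.3],
                   [0.1, 0.3, 0.0]])
H0_demo = rng.normal(size=(3, 4))          # input features, one row per node
W0_demo = rng.normal(size=(4, 5))          # randomly initialized weights
W1_demo = rng.normal(size=(5, 2))
Z_demo = gcn_forward(normalize_adjacency(A_demo), H0_demo, W0_demo, W1_demo)
```

Each row of `Z_demo` is a probability distribution over the two classes for one node. Precomputing the normalized adjacency once, outside the layer loop, reflects the fact that it does not change during training.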

Since the structural information of the graph is already contained in the layer-wise forward propagation rule, the GCN omits the conventional unsupervised loss term that encodes graph structure during training and retains only the cross-entropy of the labeled samples as the supervised loss term. The loss function is:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$

where $Y$ is the true label, $Z$ is the label probability distribution predicted by the GCN, $l$ indexes the labeled samples ($\mathcal{Y}_L$ being the set of labeled sample indices), $f$ indexes the neuron nodes of the GCN output layer, and $F$ is the number of label classes.

During GCN training, each iteration computes the output of every layer according to the layer-wise forward propagation rule (the weight matrices can be initialized randomly) and then updates the weight matrices layer by layer by gradient descent on the loss function. The iterations continue until the maximum number of iterations is reached, at which point training is finished.
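The supervised loss term restricted to labeled samples can be sketched as follows. The function name, the one-hot label encoding, and the small `eps` guard against `log(0)` are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def masked_cross_entropy(Y, Z, labeled_idx, eps=1e-12):
    """Cross-entropy summed over labeled samples only.

    Y: (N, F) one-hot true labels (rows of unlabeled samples are ignored).
    Z: (N, F) class probabilities predicted by the network.
    labeled_idx: indices of the labeled samples.
    """
    Y_l = Y[labeled_idx]
    Z_l = Z[labeled_idx]
    return -np.sum(Y_l * np.log(Z_l + eps))

# Three samples, two classes; only samples 0 and 2 carry labels.
Y_demo = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
Z_pred = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
loss_demo = masked_cross_entropy(Y_demo, Z_pred, labeled_idx=[0, 2])
```

Unlabeled nodes contribute nothing to the loss itself; they influence training only through the graph convolution, which mixes their features into the predictions for labeled neighbors.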

S5. Verify and compare results on an open-source financial data set. The selected data set is a credit card default data set from the UCI Machine Learning Repository. It contains 30000 samples, each with 23 feature dimensions, and the labels are represented by 0 and 1. The label of each sample indicates whether the user has defaulted on a credit card: 1 means default and 0 means normal. In the data set, default samples account for about 22% and normal samples for about 78%. If default samples are taken as positive and normal samples as negative, classification on this data set is a positive-sample detection problem with imbalanced positive and negative classes. In this case conventional binary classification accuracy is not a suitable evaluation index, so the invention uses recall, precision and the F1 score as evaluation indexes, where the F1 score combines recall and precision.
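The three evaluation indexes named in step S5 follow their standard definitions and can be computed without any library. The function name and the toy label vectors below are hypothetical; defaults (label 1) are treated as the positive class, as in the text.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy predictions: 3 defaults among 8 users (hypothetical values).
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
p_demo, r_demo, f1_demo = precision_recall_f1(y_true_demo, y_pred_demo)

# A classifier that never predicts default scores zero on all three
# indexes, even though its plain accuracy on this toy set is 5/8.
always_normal = precision_recall_f1(y_true_demo, [0] * 8)
```

The second call shows why accuracy is a poor index on imbalanced data: predicting "normal" for everyone would reach about 78% accuracy on the data set described above while detecting no defaults at all.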

As a preferred embodiment of the invention, in step S1 the width parameter of the RBF mapping function is σ = 0.15.

As a preferred embodiment of the invention, in step S2 the number of random walks starting from each node is n = 20, the truncation length of each random walk is m = 20, the return parameter is p = 0.5, and the in-out parameter is q = 0.25.

As a preferred embodiment of the invention, in step S3 the dimension of the vector corresponding to each node after graph embedding is d = 100, and the window length of the sliding window model is w = 10.

Compared with the prior art, the invention has the following beneficial effects:

1. A graph is constructed from the financial feature data. By processing the financial feature data as graph data, potential relationship information between financial individuals is mined;

2. The depth map embedding method is applied to graph representation learning on the financial user graph to mine graph structure information;

3. A graph convolutional neural network is used to train the semi-supervised learning model and to aggregate the neighborhood information of the nodes in the graph;

4. Compared with traditional financial credit assessment methods, the method based on depth map semi-supervised learning improves the accuracy of financial credit assessment when labels are scarce and speeds up model training, thereby achieving efficient and accurate credit assessment of label-scarce financial data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of the present invention;

FIG. 2 shows the random walk strategy of the Node2Vec algorithm used for depth map embedding, where BFS stands for breadth-first search and DFS stands for depth-first search;

FIG. 3 shows the process of sampling node pairs for the skip-gram model using a sliding window model;

FIG. 4 is a diagram of the skip-gram network model architecture;

FIG. 5 shows the training process of the GCN model on the graph;

FIG. 6 is a graph showing the results of three evaluation indexes of different models at a label rate of 0.1;

FIG. 7 is a graph showing the F1 scores of different models at different label rates;

FIG. 8 is a graph comparing the number of training iterations of the GCN based on raw data and the GCN based on depth map embedding results at different label rates.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.

Example 1

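The invention provides a financial credit assessment method based on depth map semi-supervised learning, which comprises the following steps:

S1. Construct a graph $G(V, E)$ from the original feature data set $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of user samples and $D$ is the feature dimension of each user sample. Each of the $N$ nodes in $V$ represents one user sample, and its coordinates in the $D$-dimensional feature space equal the feature values of that sample. $E$ is the set of edges between nodes, and the weight of each edge is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node $i$ and node $j$ can be expressed, in the usual Gaussian RBF form, as:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

where $\sigma$ is the width parameter of the RBF function and $\lVert x_i - x_j \rVert_2^2$ is the squared Euclidean distance between nodes $i$ and $j$. After the RBF mapping, all edge weights lie in $(0, 1)$, and edges between closer points have larger weights. The nodes and edges can be represented uniformly by an adjacency matrix, and together they represent the structural information of the graph;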

S2. Perform random walks on the constructed graph $G(V, E)$. Taking each node in the graph as a starting point, $n$ random walks are performed, each truncated at length $m$, which generates a set of node sequences. The walk strategy is based on the Node2Vec algorithm and is divided into breadth-first search (BFS) and depth-first search (DFS). During a Node2Vec random walk, for a pair of nodes $(v, x)$ connected by an edge, given that the current node is $v$, the probability that the next node visited by the walk is $x$ is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ denotes the $i$-th node of the walk, $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalization constant. With $t$ denoting the node visited immediately before $v$, two hyper-parameters $p$ and $q$ control the walk strategy through:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where $w_{vx}$ is the edge weight between node $v$ and node $x$, and:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between node $t$ and node $x$. The hyper-parameter $p$ is called the return parameter; it controls the probability of revisiting the node just visited during the walk, and a smaller $p$ increases that probability. The hyper-parameter $q$ is called the in-out parameter; it controls the tendency of the walk. If $q > 1$, the walk more readily visits nodes around node $t$ (corresponding to BFS); if $q < 1$, the walk more readily visits nodes far from node $t$ (corresponding to DFS). With the hyper-parameters $p$ and $q$, the walk strategy can be adjusted flexibly according to the structure of the graph, so that the model adapts to more kinds of data distribution and is no longer limited to cluster-type distributions based on Euclidean distance;

S3. For the set of node sequences generated by the random walks, node pairs are first sampled using a sliding window model. For each node sequence, sliding-window sampling with window length $w$ yields a number of node pairs $(V_c, V_i)$, where $V_c$ is the center node and $V_i$ is a peripheral (context) node. The sampled pairs then serve as the training set of the skip-gram network model. For each input center node, the training goal is to maximize the co-occurrence probability between the center node and its peripheral nodes. Expressed as a negative log-likelihood loss, the objective is:

$$\min_{\Phi} \; -\log \Pr\left(\{V_{c-w}, \ldots, V_{c+w}\} \setminus V_c \mid \Phi(V_c)\right)$$

where $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedding vector. It is represented by a mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, where $N$ is the number of nodes in the graph (equivalently, the number of samples in the original data) and $d$ is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix $\Phi$ are the training result of the skip-gram model. Finally, training the skip-gram model yields a $d$-dimensional vector representation for every node in the graph, and hence an embedded representation of the whole graph. The value of $d$ is set before training and can be chosen according to practical requirements and the resulting embedding quality;

S4. On the basis of the graph embedding obtained above, a graph convolutional neural network (GCN) is used for semi-supervised learning. The GCN imitates the convolution operation in signal processing: the feature vector of each node in the graph is taken as an input signal and convolved. Using the first-order Chebyshev polynomial approximation of the graph convolution, the layer-wise forward propagation rule between the layers of the network is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the $l$-th and $(l+1)$-th hidden layers respectively. $\tilde{A}$ is the adjacency matrix of graph $G$ with self-connections added:

$$\tilde{A} = A + I_N$$

where $A$ is the adjacency matrix of graph $G$ and $I_N$ is the identity matrix of order $N$. $\tilde{D}$ is the degree matrix corresponding to $\tilde{A}$; it is a diagonal matrix in which each diagonal element is the sum of the corresponding row of $\tilde{A}$, i.e.:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

$W^{(l)}$ is the weight matrix to be trained for the $l$-th layer, and $\sigma(\cdot)$ is the activation function of the hidden layer; the ReLU function is used as the hidden-layer activation in the GCN. The graph convolution of each layer can be understood as a weighted aggregation of the first-order neighborhood information (i.e. the local structure information of the graph) of every node in the graph.

Since the structural information of the graph is already contained in the layer-wise forward propagation rule, the GCN omits the conventional unsupervised loss term that encodes graph structure during training and retains only the cross-entropy of the labeled samples as the supervised loss term. The loss function is:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$

where $Y$ is the true label, $Z$ is the label probability distribution predicted by the GCN, $l$ indexes the labeled samples ($\mathcal{Y}_L$ being the set of labeled sample indices), $f$ indexes the neuron nodes of the GCN output layer, and $F$ is the number of label classes.

In step S1, the width parameter of the RBF mapping function is σ = 0.15.

In step S2, the number of random walks starting from each node is n = 20, the truncation length of each random walk is m = 20, the return parameter is p = 0.5, and the in-out parameter is q = 0.25.

In step S3, the dimension of the vector corresponding to each node after graph embedding is d = 100, and the window length of the sliding window model is w = 10.

Specifically, in addition to the method adopted by the invention, comparison experiments were run on the data set described in step S5: the Label Propagation Algorithm (LPA) was evaluated both on the graph embedding results and on the original data without graph embedding, and the GCN was also evaluated on the original data without graph embedding. The results are shown in FIG. 6, FIG. 7 and FIG. 8. FIG. 6 shows the results of the three evaluation indexes of the different models at a label rate of 0.1. FIG. 7 shows the F1 scores of the different models at different label rates. FIG. 8 compares the number of training iterations of the GCN based on raw data and the GCN based on depth map embedding results at different label rates. In the three figures, the mark ' indicates an experiment based on the graph embedding results.

Through the analysis of the three result graphs, the beneficial effects of the invention are as follows:

1. Applying depth map embedding to financial credit assessment with scarce labels improves the performance of the semi-supervised learning models overall;

2. Using a graph convolutional neural network for financial credit assessment with scarce labels clearly improves semi-supervised learning performance compared with the traditional graph semi-supervised learning method;

3. Combining depth map embedding with the graph convolutional neural network significantly improves model performance when labels are extremely scarce (low label rate);

4. Combining depth map embedding with the graph convolutional neural network significantly speeds up the training of the graph convolutional neural network and reduces the computational cost.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

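1. A financial credit assessment method based on depth map semi-supervised learning, characterized by comprising the following steps:

S1. Construct a graph $G(V, E)$ from the original feature data set $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of user samples and $D$ is the feature dimension of each user sample. Each of the $N$ nodes in $V$ represents one user sample, and its coordinates in the $D$-dimensional feature space equal the feature values of that sample. $E$ is the set of edges between nodes, and the weight of each edge is determined by an RBF mapping function based on Euclidean distance. For example, the weight of the edge between node $i$ and node $j$ can be expressed, in the usual Gaussian RBF form, as:

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)$$

where $\sigma$ is the width parameter of the RBF function and $\lVert x_i - x_j \rVert_2^2$ is the squared Euclidean distance between nodes $i$ and $j$. After the RBF mapping, all edge weights lie in $(0, 1)$, and edges between closer points have larger weights. The nodes and edges can be represented uniformly by an adjacency matrix, and together they represent the structural information of the graph;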

S2. Perform random walks on the constructed graph $G(V, E)$. Taking each node in the graph as a starting point, $n$ random walks are performed, each truncated at length $m$, which generates a set of node sequences. The walk strategy is based on the Node2Vec algorithm and is divided into breadth-first search (BFS) and depth-first search (DFS). For a pair of nodes $(v, x)$ connected by an edge, given that the current node is $v$, the probability that the next node visited by the walk is $x$ is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ denotes the $i$-th node of the walk, $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalization constant. For the calculation of $\pi_{vx}$, two hyper-parameters $p$ and $q$ are introduced to control the walk strategy. With $t$ denoting the node visited immediately before $v$, set:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where $w_{vx}$ is the edge weight between node $v$ and node $x$, and:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

where $d_{tx}$ is the shortest-path distance between node $t$ and node $x$. The hyper-parameter $p$ is called the return parameter; it controls the probability of revisiting the node just visited during the walk, and a smaller $p$ increases that probability. The hyper-parameter $q$ is called the in-out parameter; it controls the tendency of the walk. If $q > 1$, the walk more readily visits nodes around node $t$ (corresponding to BFS); if $q < 1$, the walk more readily visits nodes far from node $t$ (corresponding to DFS). With the hyper-parameters $p$ and $q$, the walk strategy can be adjusted flexibly according to the structure of the graph, so that the model adapts to more kinds of data distribution and is not limited to cluster-type distributions based on Euclidean distance.

By performing a number of random walks from each node in the graph, the structural information of the whole graph is contained in the generated node sequences;

S3. For the set of node sequences generated by the random walks, node pairs are first sampled using a sliding window model. For each node sequence, sliding-window sampling with window length $w$ yields a number of node pairs $(V_c, V_i)$, where $V_c$ is the center node and $V_i$ is a peripheral (context) node. The sampled pairs then serve as the training set of the skip-gram network model. For each input center node, the training goal is to maximize the co-occurrence probability between the center node and its peripheral nodes. Expressed as a negative log-likelihood loss, the objective is:

$$\min_{\Phi} \; -\log \Pr\left(\{V_{c-w}, \ldots, V_{c+w}\} \setminus V_c \mid \Phi(V_c)\right)$$

where $\Phi(V_c)$ is the mapping function that maps node $V_c$ to its embedding vector. It is represented by a mapping matrix $\Phi \in \mathbb{R}^{N \times d}$, where $N$ is the number of nodes in the graph (equivalently, the number of samples in the original data) and $d$ is the dimension of the embedding vector of each node after mapping. The parameters of the mapping matrix $\Phi$ are the training result of the skip-gram model. Finally, training the skip-gram model yields a $d$-dimensional vector representation for every node in the graph, and hence an embedded representation of the whole graph. The value of $d$ is set before training and can be chosen according to practical requirements and the resulting embedding quality;

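S4. On the basis of the graph embedding obtained above, a graph convolutional neural network (GCN) is used for semi-supervised learning. Using the first-order Chebyshev polynomial approximation of the graph convolution, the layer-wise forward propagation rule between the layers of the network is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $H^{(l+1)}$ are the outputs of the $l$-th and $(l+1)$-th hidden layers respectively, $W^{(l)}$ is the weight matrix to be trained for the $l$-th layer, and $\sigma(\cdot)$ is the activation function of the hidden layer. $\tilde{A}$ is the adjacency matrix of graph $G$ with self-connections added:

$$\tilde{A} = A + I_N$$

where $A$ is the adjacency matrix of graph $G$ and $I_N$ is the identity matrix of order $N$. $\tilde{D}$ is the degree matrix corresponding to $\tilde{A}$; it is a diagonal matrix in which each diagonal element is the sum of the corresponding row of $\tilde{A}$, i.e.:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

Only the cross-entropy of the labeled samples is retained as the supervised loss term, and the loss function is:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$

where $Y$ is the true label, $Z$ is the label probability distribution predicted by the GCN, $l$ indexes the labeled samples ($\mathcal{Y}_L$ being the set of labeled sample indices), $f$ indexes the neuron nodes of the GCN output layer, and $F$ is the number of label classes;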

S5. Verify and compare results on an open-source financial data set. The selected data set is a credit card default data set from the UCI Machine Learning Repository. It contains 30000 samples, each with 23 feature dimensions, and two classes of labels represented by 0 and 1. The label of each sample indicates whether the user has defaulted on a credit card: 1 means default and 0 means normal. In the data set, default samples account for about 22% and normal samples for about 78%. If default samples are taken as positive and normal samples as negative, classification on this data set is a positive-sample detection problem with imbalanced positive and negative classes. In this case conventional binary classification accuracy is not a suitable evaluation index, so the invention uses recall, precision and the F1 score as evaluation indexes, where the F1 score combines recall and precision.

2. The financial credit assessment method based on depth map semi-supervised learning of claim 1, wherein in step S1 the width parameter of the RBF mapping function is σ = 0.15.

3. The financial credit assessment method based on depth map semi-supervised learning of claim 1, wherein in step S2 the number of random walks starting from each node is n = 20, the truncation length of each random walk is m = 20, the return parameter is p = 0.5, and the in-out parameter is q = 0.25.

4. The financial credit assessment method based on depth map semi-supervised learning of claim 1, wherein in step S3 the dimension of the vector corresponding to each node after graph embedding is d = 100, and the window length of the sliding window model is w = 10.