CN114417958A

CN114417958A - Credit assessment method for imbalanced financial data based on improved graph convolutional neural network

Info

Publication number: CN114417958A
Application number: CN202111471951.3A
Authority: CN
Inventors: 邱韵; 徐小龙; 邬晶; 李少远; 周松
Original assignee: China Telecom Bestpay Co Ltd
Current assignee: China Telecom Bestpay Co Ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-04-29

Abstract

The invention discloses a credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network (GCN). A graph G(V, E) is first constructed from the current financial feature data, where V is the node set and E is the edge set. A supervised learning model is then trained on the constructed graph with the improved GCN, and the trained model is used to predict the credit of financial users. On the one hand, the method alleviates over-fitting on class-imbalanced financial data through data augmentation. On the other hand, each graph convolution layer of the improved GCN uses both the node vectors and the edge vectors in the first-order neighborhood: every node vector is updated by a weighted aggregation of these node and edge vectors. This improves node representation learning at the model level and, in turn, the classification of financial users with imbalanced classes.

Description

A credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network

Technical field

The invention relates to the field of credit evaluation of financial users, and in particular to a credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network.

Background art

In financial credit evaluation, class imbalance is a common problem. In fraud detection, for example, samples exhibiting fraud or default are far fewer than normal samples. One reason is that bad users make up only a small share of all users; another is that users who commit fraud may conceal or falsify their fraud records. As a result, the number of minority-class samples (fraudulent users) and the number of majority-class samples (normal users) are severely imbalanced. When trained on class-imbalanced data, traditional machine learning models tend to generalize well on majority-class samples, but on minority-class samples the small amount of training data can cause severe over-fitting and therefore poor generalization. In a small training set the number of minority-class samples is limited even further, and some minority classes may be absent from the training set altogether, a problem known as "class missing". How to address the poor generalization caused by imbalanced class distributions in financial data, and the resulting difficulty of learning minority-class samples effectively, is therefore another major challenge in financial credit evaluation.

Existing methods for handling class imbalance mainly include resampling, data synthesis, reweighting, transfer learning, meta-learning, and metric learning. Resampling may aggravate over-fitting to minority-class data. Data synthesis may introduce noise or features of little use for classification, which degrades classifier performance. Metric learning relies on distances between sample points to learn a better decision boundary around minority-class samples, but measuring similarity by distance is often severely limited when labels are scarce. Transfer learning and meta-learning both require modeling majority-class and minority-class samples separately, so model complexity is high when the numbers of samples and classes are large. Each of these methods is limited to some degree when applied to class imbalance in financial data. A simple and efficient algorithmic framework is therefore urgently needed to address the limits that class imbalance in financial data places on the performance of financial credit evaluation models.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the defects of the prior art by providing a credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network.

The invention provides the following technical solution:

The present invention provides a credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network, comprising the following steps:

S1: First, construct a graph from the financial feature data set. Given the input financial feature matrix X ∈ R^(N×D), where N is the total number of training samples and D is the dimension of the feature data, construct a graph G(V, E), where V is the node set and E is the edge set. Each training sample corresponds to one node in the graph; the node lies in D-dimensional Euclidean space, and each coordinate equals the value of the corresponding feature of the sample. Graph construction has two main steps. First, a K-nearest-neighbor algorithm based on Euclidean distance determines the first-order neighborhood of each point, and edges connect the central node to all of its neighbor nodes. Second, an RBF mapping computes the weight of each edge, which yields the weighted adjacency matrix A of the whole graph. The edge weight is calculated as follows:

[Formula image not reproduced]

Here σ is the width parameter of the RBF function, and the distance term in the formula is the square of the Euclidean distance between nodes i and j. After the RBF mapping, all edge weights lie in (0, 1), and edges between closer points have larger weights;
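The two-step construction in S1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `squared_distance`, `build_knn_rbf_graph`, `k`, and `sigma` are hypothetical, the Gaussian form exp(-d^2 / (2σ^2)) is one common RBF mapping chosen because the source formula is not reproduced, and making the adjacency matrix symmetric is also an assumption.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_knn_rbf_graph(X, k, sigma):
    """Return a weighted adjacency matrix A (list of lists) for samples X.

    Each node is linked to its k nearest neighbours by Euclidean distance.
    Each edge gets the weight exp(-d^2 / (2 * sigma^2)); the 2 * sigma^2
    normalisation is an assumption, not taken from the source.
    """
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other nodes by squared distance and keep the k closest.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: squared_distance(X[i], X[j]),
        )
        for j in others[:k]:
            w = math.exp(-squared_distance(X[i], X[j]) / (2.0 * sigma ** 2))
            # Symmetrise so the graph is undirected (an assumption).
            A[i][j] = w
            A[j][i] = w
    return A


# Toy feature matrix: N = 4 samples, D = 2 features.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
A = build_knn_rbf_graph(X, k=1, sigma=1.0)
```

With k = 1 on this toy data, each point is linked only to its closest point, and unconnected pairs keep a weight of 0 in A.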

S2: Augment the training data by random graph augmentation (as shown in Figure 2). In the training data set, for the first-order neighborhood of each node, nodes in the neighborhood and their corresponding edges are randomly removed with a given probability p. For any node v, the original first-order neighborhood can be expressed as:

(u, e) ∈ N(v)

where u is a node in the first-order neighborhood of node v and e is an edge in that neighborhood. After random graph augmentation, the first-order neighborhood of node v is:

N(v)' = N(v) - N(v)_drop

where N(v)' is the neighborhood after graph augmentation and N(v)_drop is the set of nodes and edges randomly deleted from the neighborhood. The neighborhood sizes before and after augmentation satisfy |N(v)'| = (1 - p)|N(v)|;
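The neighborhood augmentation in S2 can be illustrated with a short sketch. It takes one reading of the text: exactly (1 - p)|N(v)| pairs are kept, rounded to an integer, so that the stated size relation holds. The function name `augment_neighborhood`, the representation of a neighborhood as a list of (neighbor, edge weight) pairs, and the rounding rule are assumptions made for illustration.

```python
import random


def augment_neighborhood(neighborhood, p, rng):
    """Return N(v)' = N(v) - N(v)_drop for one node.

    `neighborhood` is a list of (neighbor, edge_weight) pairs. A random
    subset is removed so that |N(v)'| = (1 - p) * |N(v)|, rounded with
    Python's built-in round().
    """
    keep_count = round((1.0 - p) * len(neighborhood))
    # Sample which positions survive, then restore the original order.
    kept_indices = sorted(rng.sample(range(len(neighborhood)), keep_count))
    return [neighborhood[i] for i in kept_indices]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
N_v = [(1, 0.9), (2, 0.7), (3, 0.5), (4, 0.4), (5, 0.2)]
N_v_aug = augment_neighborhood(N_v, p=0.4, rng=rng)
```

An alternative reading drops each pair independently with probability p, in which case the size relation holds only in expectation.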

S3: Train the improved GCN model on the graph-augmented training set. The layer-by-layer update rule for the node representation vectors of the improved GCN (that is, the spatial-domain graph convolution operation) is defined as follows:

[Formula image not reproduced]

In this rule, the two node terms are the representation vectors of node v at layer l and layer (l+1), and the two neighborhood terms are the node representation vectors and the edge representation vectors in the first-order neighborhood N(v)' of node v; the initial representation vectors can be set randomly. W^(l) is the convolution kernel of layer l, that is, the weight matrix to be trained. f(·,·) is the aggregation function applied to the node vectors and edge vectors in the neighborhood; here a vector convolution operation is used, f(a, b) = a*b. σ(·) is the activation function of the hidden layers; here the ReLU function is used;

The representation vector of each edge is updated at every layer simply by introducing a learnable matrix that is trained with the model:

[Formula image not reproduced]

In this rule, the two edge terms are the representation vectors of edge e at layer l and layer (l+1), and the matrix is the edge update matrix of layer l;
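One layer of the node and edge updates described in S3 might look like the sketch below. It is illustrative only: the source formulas are not reproduced, so the sketch assumes that messages f(node vector, edge vector) are summed over N(v)', multiplied by the layer weight matrix, and passed through ReLU, and that f(a, b) = a*b is an element-wise product. The names `matvec`, `relu`, `compose`, `gcn_layer`, `W_node`, and `W_edge` are hypothetical, and whether a node's own vector joins the aggregation is left out.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, xi) for xi in x]


def compose(a, b):
    """Stand-in for f(a, b) = a * b, taken here as an element-wise product."""
    return [ai * bi for ai, bi in zip(a, b)]


def gcn_layer(node_vecs, edge_vecs, neighborhoods, W_node, W_edge):
    """Apply one layer: update every node vector, then every edge vector.

    node_vecs: {node: vector}; edge_vecs: {edge_id: vector};
    neighborhoods: {node: [(neighbor, edge_id), ...]}, playing the role of N(v)'.
    """
    dim = len(W_node[0])
    new_nodes = {}
    for v, pairs in neighborhoods.items():
        agg = [0.0] * dim
        for u, e in pairs:
            # Combine the neighbour's node vector with the connecting edge vector.
            msg = compose(node_vecs[u], edge_vecs[e])
            agg = [s + m for s, m in zip(agg, msg)]
        new_nodes[v] = relu(matvec(W_node, agg))
    # Edge vectors are updated by a separate learnable matrix.
    new_edges = {e: matvec(W_edge, h) for e, h in edge_vecs.items()}
    return new_nodes, new_edges


node_vecs = {0: [1.0, 2.0], 1: [3.0, -1.0]}
edge_vecs = {"e01": [0.5, 0.5]}
neighborhoods = {0: [(1, "e01")], 1: [(0, "e01")]}
W_node = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability
W_edge = [[2.0, 0.0], [0.0, 2.0]]
new_nodes, new_edges = gcn_layer(node_vecs, edge_vecs, neighborhoods, W_node, W_edge)
```

In training, `W_node` and `W_edge` would be learned per layer; they are fixed here so the arithmetic can be followed by hand.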

S4: Because the financial credit evaluation task can be abstracted as a graph node classification task, it suffices to add a softmax mapping to the output layer of the improved GCN. The classification prediction for each node is then obtained from its node representation vector, which completes the training of the final classifier;
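The softmax mapping in S4 can be sketched as follows. The function names `softmax` and `predict` are hypothetical, and treating class index 0 as "normal" and index 1 as "fraudulent" is an assumption for the example only.

```python
import math


def softmax(z):
    """Map a score vector to class probabilities (numerically stable form)."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(node_vec):
    """Return (predicted class index, class probabilities) for one node."""
    probs = softmax(node_vec)
    return probs.index(max(probs)), probs


# Output-layer vector of one node with two classes (0 = normal, 1 = fraudulent).
label, probs = predict([2.0, 0.5])
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged and avoids overflow for large scores.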

S5: After the final model has been trained, evaluate it on the test set to obtain the final financial credit classification predictions. Note that graph augmentation is used only during model training; during testing the original graph is still used as the model input.
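The note in S5 that augmentation applies only during training can be expressed as a small switch in front of the model. This is a sketch under assumptions: `neighborhoods_for_pass` and its `training` flag are hypothetical names, and the exact-count dropping rule is one reading of S2.

```python
import random


def neighborhoods_for_pass(neighborhoods, training, p, rng):
    """Return the neighborhoods to feed the model for one forward pass."""
    if not training:
        # Test time: the original graph is used unchanged.
        return neighborhoods
    augmented = {}
    for v, pairs in neighborhoods.items():
        # Training time: keep (1 - p) of each neighborhood, chosen at random.
        keep = round((1.0 - p) * len(pairs))
        idx = sorted(rng.sample(range(len(pairs)), keep))
        augmented[v] = [pairs[i] for i in idx]
    return augmented


graph = {
    0: [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    1: [(0, "a"), (2, "e")],
}
train_view = neighborhoods_for_pass(graph, training=True, p=0.5, rng=random.Random(1))
test_view = neighborhoods_for_pass(graph, training=False, p=0.5, rng=random.Random(1))
```

Drawing a fresh `train_view` for every training pass would give the model a different sub-neighborhood each time, while `test_view` always equals the original graph.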

Compared with the prior art, the present invention has the following beneficial effects:

1. The training data is augmented by random graph augmentation, which alleviates over-fitting on class-imbalanced financial data at the data level;

2. The classifier is trained with the improved GCN model. In each graph convolution layer, every node vector is updated by a weighted aggregation of the node vectors and edge vectors in its first-order neighborhood. Using both node information and edge information from the neighborhood improves node representation learning at the model level, and in turn improves credit evaluation on imbalanced financial data.

Description of drawings

The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments, they serve to explain the invention and do not limit it. In the drawings:

Figure 1 is a flow chart of the present invention;

Figure 2 is a schematic diagram of augmenting the training data by random graph augmentation;

Figure 3 is a schematic diagram of the structure of the improved GCN model.

Detailed description

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here only illustrate and explain the invention and do not limit it. The same reference numerals in the drawings refer to the same parts throughout.

Example 1

As shown in Figures 1-3, the present invention provides a credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network, comprising the following steps (the flow chart of the method is shown in Figure 1):

S1: First, construct a graph from the financial feature data set. Given the input financial feature matrix X ∈ R^(N×D), where N is the total number of training samples and D is the dimension of the feature data, construct a graph G(V, E), where V is the node set and E is the edge set. Each training sample corresponds to one node in the graph; the node lies in D-dimensional Euclidean space, and each coordinate equals the value of the corresponding feature of the sample. Graph construction has two main steps. First, a K-nearest-neighbor algorithm based on Euclidean distance determines the first-order neighborhood of each point, and edges connect the central node to all of its neighbor nodes. Second, an RBF mapping computes the weight of each edge, which yields the weighted adjacency matrix A of the whole graph. The edge weight is calculated as follows:

[Formula image not reproduced]

Here σ is the width parameter of the RBF function, and the distance term in the formula is the square of the Euclidean distance between nodes i and j. After the RBF mapping, all edge weights lie in (0, 1), and edges between closer points have larger weights;
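Since the edge-weight formula itself is missing from this text, one form consistent with the definitions in S1 is the Gaussian RBF below. It is offered as an assumption: the symbols A_{ij}, x_i, x_j and the 2σ² normalization are not taken from the source, which might equally use σ² alone.

```latex
A_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert_2^{2}}{2\sigma^{2}}\right),
\qquad (i, j) \in E
```

For distinct points the exponent is strictly negative, so the weight lies in (0, 1), and it grows as the distance shrinks, which matches the two properties stated in S1.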

S2: Augment the training data by random graph augmentation (as shown in Figure 2). In the training data set, for the first-order neighborhood of each node, nodes in the neighborhood and their corresponding edges are randomly removed with a given probability p. For any node v, the original first-order neighborhood can be expressed as:

(u, e) ∈ N(v)

where u is a node in the first-order neighborhood of node v and e is an edge in that neighborhood. After random graph augmentation, the first-order neighborhood of node v is:

N(v)' = N(v) - N(v)_drop

In the layer-by-layer node update rule of step S3 (formula image not reproduced), the two node terms are the representation vectors of node v at layer l and layer (l+1), and the two neighborhood terms are the node representation vectors and the edge representation vectors in the first-order neighborhood N(v)' of node v; the initial representation vectors can be set randomly. W^(l) is the convolution kernel of layer l, that is, the weight matrix to be trained. f(·,·) is the aggregation function applied to the node vectors and edge vectors in the neighborhood; here a vector convolution operation is used, f(a, b) = a*b. σ(·) is the activation function of the hidden layers; here the ReLU function is used.

In the edge update rule (formula image not reproduced), the two edge terms are the representation vectors of edge e at layer l and layer (l+1), and the matrix is the edge update matrix of layer l;
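The two update rules are missing from this text. A reconstruction consistent with the surrounding definitions is given below as a hedged sketch: the symbols h_v, h_u, h_e, and W_edge are assumed notation, any per-neighbor weighting coefficients in the original formula are omitted, and σ here denotes the ReLU activation rather than the RBF width parameter of S1.

```latex
h_v^{(l+1)} = \sigma\!\left( W^{(l)} \sum_{(u,e) \in N(v)'} f\!\left( h_u^{(l)},\, h_e^{(l)} \right) \right),
\qquad
h_e^{(l+1)} = W_{\mathrm{edge}}^{(l)}\, h_e^{(l)}
```

The first rule aggregates node and edge vectors over the augmented first-order neighborhood N(v)' before applying the layer weight matrix; the second applies a separate learnable matrix to each edge vector.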

S4: Because the financial credit evaluation task can be abstracted as a graph node classification task, it suffices to add a softmax mapping to the output layer of the improved GCN. The classification prediction for each node is then obtained from its node representation vector, which completes the training of the final classifier (the model structure is shown in Figure 3);

S5: After the final model has been trained, evaluate it on the test set to obtain the final financial credit classification predictions. Note that graph augmentation is used only during model training; during testing the original graph is still used as the model input.

The advantages of the present invention are the beneficial effects stated in the summary of the invention above.

Finally, it should be noted that the above are only preferred embodiments of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

1. A credit evaluation method for imbalanced financial data based on an improved graph convolutional neural network, characterized by comprising the following steps:

S1: first, constructing a graph from the financial feature data set; that is, given the input financial feature matrix X ∈ R^(N×D), where N is the total number of training samples and D is the dimension of the feature data, constructing a graph G(V, E), where V is the node set and E is the edge set; each training sample corresponds to one node in the graph, the node lies in D-dimensional Euclidean space, and each coordinate equals the value of the corresponding feature of the sample; graph construction has two main steps: first, determining the first-order neighborhood of each point with a K-nearest-neighbor algorithm based on Euclidean distance and connecting the central node to all of its neighbor nodes with edges, and then computing the weight of each edge with an RBF mapping, which yields the weighted adjacency matrix A of the whole graph; the edge weight is calculated as follows:

[Formula image not reproduced]

where σ is the width parameter of the RBF function, and the distance term in the formula is the square of the Euclidean distance between nodes i and j; after the RBF mapping, all edge weights lie in (0, 1), and edges between closer points have larger weights;

S2: augmenting the training data by random graph augmentation (as shown in Figure 2); in the training data set, for the first-order neighborhood of each node, randomly removing nodes in the neighborhood and their corresponding edges with a given probability p; for any node v, the original first-order neighborhood can be expressed as:

(u, e) ∈ N(v)

where u is a node in the first-order neighborhood of node v and e is an edge in that neighborhood; after random graph augmentation, the first-order neighborhood of node v is:

N(v)' = N(v) - N(v)_drop

where N(v)' is the neighborhood after graph augmentation and N(v)_drop is the set of nodes and edges randomly deleted from the neighborhood, and the neighborhood sizes before and after augmentation satisfy |N(v)'| = (1 - p)|N(v)|;

S3: training the improved GCN model on the graph-augmented training set, wherein the layer-by-layer update rule for the node representation vectors of the improved GCN (that is, the spatial-domain graph convolution operation) is defined as follows:

[Formula image not reproduced]

where the two node terms are the representation vectors of node v at layer l and layer (l+1), and the two neighborhood terms are the node representation vectors and the edge representation vectors in the first-order neighborhood N(v)' of node v; the initial representation vectors can be set randomly; W^(l) is the convolution kernel of layer l, that is, the weight matrix to be trained; f(·,·) is the aggregation function applied to the node vectors and edge vectors in the neighborhood, for which a vector convolution operation is used, f(a, b) = a*b; σ(·) is the activation function of the hidden layers, for which the ReLU function is used;

the representation vector of each edge is updated at every layer simply by introducing a learnable matrix that is trained with the model:

[Formula image not reproduced]

where the two edge terms are the representation vectors of edge e at layer l and layer (l+1), and the matrix is the edge update matrix of layer l;

S4: because the financial credit evaluation task can be abstracted as a graph node classification task, adding a softmax mapping to the output layer of the improved GCN, so that the classification prediction for each node is obtained from its node representation vector, which completes the training of the final classifier;

S5: after the final model has been trained, evaluating it on the test set to obtain the final financial credit classification predictions; note that graph augmentation is used only during model training, and during testing the original graph is still used as the model input.