CN110992173A - Credit risk assessment model generation method based on multi-instance learning

Info

Publication number: CN110992173A
Application number: CN202010141306.4A
Authority: CN
Inventors: 吴基成; 程宏峰; 陈杰
Original assignee: Hangzhou Sunyard Digital Science Co ltd
Current assignee: Hangzhou Sunyard Digital Science Co ltd
Priority date: 2020-03-04
Filing date: 2020-03-04
Publication date: 2020-04-10

Abstract

A credit risk assessment model generation method based on multi-instance learning comprises the following steps: S1: collecting relevant data source information of the user; S2: extracting a user historical behavior feature vector from the acquired data source information by using minimum Hausdorff distance clustering features; S3: combining the user historical behavior feature vector with personal information data to construct a new vector data set; S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function; S5: evaluating the predictive performance of the credit risk assessment model, and checking the correctness of the model through a fitness function. The invention solves the problems of high dimensionality and missing labels in user data, achieves the goals of minimizing risk, minimizing complexity and maximizing accuracy, and improves both the operating efficiency of the credit risk assessment model and its accuracy and interpretability.

Description

Credit risk assessment model generation method based on multi-instance learning

Technical Field

The invention relates to the field of credit assessment, in particular to a credit risk assessment model generation method based on multi-instance learning.

Background

With the rapid development of the information age, credit risk assessment has become one of the important research issues for financial institutions. It is a process that translates uncertainty into risk control. High-quality risk management enables banks to build robust decision-making systems and reduce losses. According to the Basel II accord of the Basel Committee on Banking Supervision, banking risks are classified into three categories: (1) credit risk, (2) market risk, (3) operational risk. From the banking security perspective, credit risk has therefore become an important research issue for banks. Credit risk assessment is considered a complex multidimensional problem based on a large amount of historical data, such as guarantors, job status, previous credit history and personal account status, with the goal of understanding the applicant's behavior and predicting risk. Accordingly, credit risk measurement and management systems are developed primarily around the classification or credit scoring of applicants.

For example, the patent application entitled 'A credit risk assessment method and device based on text analysis' (application No. 2015106953161, application publication date 2017.05.03) discloses a method comprising the following steps: acquiring a text of a borrower; analyzing the text to obtain basic language features, wherein the basic language features are used for predicting whether the borrower will default; inputting the basic language features into a preset credit risk assessment model to obtain a credit risk value of the borrower output by the credit risk assessment model; and outputting the credit risk value of the borrower.

Methods such as Linear Discriminant Analysis (LDA) and Logistic Regression (LR) have been used in the above patent publications to construct credit assessment models. These statistical methods are widely used because they are simple to operate and implement. However, their relatively poor predictive performance limits their use, especially on large data sets with many feature dimensions. To be used efficiently, a credit assessment model must strike a good balance between classification performance and interpretability.

Disclosure of Invention

The invention provides a credit risk assessment model generation method based on multi-instance learning, which can solve the problems of high dimensionality and missing labels in user data so as to improve the accuracy and recall of the risk assessment model.

The technical scheme of the invention is as follows:

A credit risk assessment model generation method based on multi-instance learning comprises the following steps:

S1: collecting related data source information of a user, wherein the related data source information specifically comprises personal information data and historical dynamic behavior data;

S2: extracting the historical behavior feature vector of the user from the acquired data source information by using minimum Hausdorff distance clustering features;

S3: combining the historical behavior feature vector with the personal information data to construct a new vector data set;

S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function;

S5: evaluating the predictive performance of the credit risk assessment model, and checking the correctness of the model through a fitness function.

Preferably, the extracting process of the user history behavior feature vector in step S2 specifically includes:

S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;

S2.2: mapping the distance to each cluster center to a historical behavior feature vector, and adopting a distance function D as the minimum Hausdorff distance measure, which is specifically defined as follows:

$$D(A, B) = \min_{a \in A,\; b \in B} \lVert a - b \rVert$$

wherein $\lVert a - b \rVert$ is the Euclidean distance between $a$ and $b$, with $a$ and $b$ being points of the sets $A$ and $B$;

S2.3: recording the historical behavior feature vector, in which the i-th variable is expressed as a feature component, and calculating the specific values of the historical behavior feature vector from the following quantities: the n-th user; the distance between the n-th user and the cluster; and a standard deviation that describes the average distance between two cluster centers.

Preferably, the standard deviation described in step S2.3 is calculated from a constant parameter, the distance between each pair of clusters, and the number of clusters K.

Preferably, the specific construction step of the new vector data set in step S3 is as follows:

S3.1: recording the historical behavior feature vector as a vector;

S3.2: recording the personal information data of the user as a vector;

S3.3: constructing a new vector that combines the personal information data of the user with the historical behavior feature vector;

wherein the new vector is the comprehensive feature vector constructed for the n-th user, each feature component corresponds to the i-th variable, and the distance used is the distance between the n-th user and the cluster.

Preferably, the fitness function in step S5 is a quality function and a risk function.

Preferably, the quality function is a score of correctness.

More preferably, the risk function distinguishes high-risk attributes from low-risk attributes by calculating the information value, where a large negative value represents high risk and a large positive value represents low risk. The calculation uses the total number of good samples, the total number of bad samples, the number of good attributes in the feature, and the number of bad attributes in the feature.

The invention has the following beneficial effects: a multi-instance learning method from machine learning is combined with a radial basis function to carry out a comprehensive risk assessment on the user-related data set. This solves the problems of high dimensionality and missing labels in user data, achieves the goals of minimizing risk, minimizing complexity and maximizing accuracy, and improves the operating efficiency, accuracy and interpretability of the credit risk assessment model.

Drawings

FIG. 1 is a flow chart of the generation of a credit risk assessment model in the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

A credit risk assessment model generation method based on multi-instance learning is disclosed, as shown in FIG. 1, and includes the following steps:

S1: personal information data and historical dynamic behavior data of the user are collected.

S2: extracting the historical behavior feature vector of the user from the acquired data source information by using minimum Hausdorff distance clustering features, specifically as follows. The historical dynamic behavior data are aggregated into K clusters. The data set must be preprocessed before aggregation: missing values are first filled with the mean, and duplicate records are deleted. Because the data are high-dimensional, the dimensionality of the data set is reduced first; the invention adopts the LDA algorithm, which yields the feature topics of the data set while reducing the dimensionality. The preprocessed low-dimensional data are then aggregated. The distance to each cluster center is mapped to a feature vector, and a distance function D is adopted as the minimum Hausdorff distance measure, specifically defined as:

$$D(A, B) = \min_{a \in A,\; b \in B} \lVert a - b \rVert$$

wherein $\lVert a - b \rVert$ is the Euclidean distance between $a$ and $b$, and $a$ and $b$ respectively represent cluster points of $A$ and $B$.

In the historical behavior feature vector, the i-th variable is expressed as a feature component. The feature component is calculated from the distance between the n-th user and the cluster and from a standard deviation that describes the average distance between the center points of two clusters.

The standard deviation is in turn calculated from the distance between each pair of clusters, the number of clusters K, and a constant parameter.
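The feature extraction of step S2 can be illustrated with a short Python sketch. It computes the minimum Hausdorff distance between point sets and maps one user's behavior records to one feature component per cluster. The formulas themselves are not reproduced in this text, so the Gaussian radial basis form exp(-D^2 / (2 sigma^2)), the choice of sigma as a constant `mu` times the average pairwise cluster distance, and all function and parameter names are assumptions made for illustration only.

```python
import math
from itertools import combinations


def min_hausdorff(a_points, b_points):
    """Smallest Euclidean distance between any point of a_points and any point of b_points."""
    return min(math.dist(a, b) for a in a_points for b in b_points)


def cluster_sigma(clusters, mu=1.0):
    """Assumed form: mu times the average minimum Hausdorff distance over all cluster pairs."""
    pairs = list(combinations(clusters, 2))
    if not pairs:
        raise ValueError("at least two clusters are required")
    return mu * sum(min_hausdorff(p, q) for p, q in pairs) / len(pairs)


def behavior_features(user_points, clusters, mu=1.0):
    """One feature component per cluster (assumed Gaussian radial basis form)."""
    sigma = cluster_sigma(clusters, mu)
    return [
        math.exp(-min_hausdorff(user_points, c) ** 2 / (2 * sigma ** 2))
        for c in clusters
    ]


# Two toy clusters of 2-D behavior records and one user with a single record.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (4.0, 0.0)]]
features = behavior_features([(0.0, 0.0)], clusters)
```

In this toy example the user coincides with a point of the first cluster, so the first component is 1, while the second component decays with the distance to the second cluster.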

S3: combining the historical behavior feature vector and the personal information data to construct a new vector data set, specifically as follows. The historical behavior feature vector obtained in the above steps is recorded as a vector. The personal information data of the user, including the user identification number, mobile phone number, bank card number, e-mail address and other data, are concatenated and recorded as a vector; missing data are marked as NULL. A new vector is then constructed that combines the personal information data of the user with the historical behavior feature vector, so as to construct a joint vector matrix.
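A minimal sketch of the combination in step S3 follows. The field names (`id_number`, `mobile`, `bank_card`, `email`) and the function names are hypothetical, and Python's `None` stands in for the NULL marker; the text does not specify how the personal fields are encoded.

```python
def build_joint_vector(personal, behavior, fields):
    """Concatenate personal fields (None marks missing data) with behavior features."""
    personal_vec = [personal.get(name) for name in fields]
    return personal_vec + list(behavior)


def build_joint_matrix(users, fields):
    """Stack one joint vector per user; each user is a (personal, behavior) pair."""
    return [build_joint_vector(p, b, fields) for p, b in users]


FIELDS = ["id_number", "mobile", "bank_card", "email"]  # hypothetical field names
users = [
    ({"id_number": "u1", "mobile": "555"}, [0.5, 0.1]),
    ({"id_number": "u2", "email": "x@example.com"}, [0.2, 0.9]),
]
matrix = build_joint_matrix(users, FIELDS)
```

Each row of `matrix` has the same length, so the rows can be treated as the joint vector matrix used for training.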

S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function. The evaluation function is expressed in terms of the feature component of the i-th variable, the distance between the n-th user and the cluster, the standard deviation describing the average distance between the center points of every two clusters, and K, the number of feature vectors.
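The evaluation function of step S4 is not reproduced in this text. As one possible reading, the sketch below trains a linear output layer on top of the radial basis feature components by stochastic gradient descent on squared error, a common arrangement in radial basis function networks. The training rule, the label convention (1 for high risk, 0 for low risk), the threshold and all names are assumptions rather than the method of the invention.

```python
def train_output_layer(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b so that b + w.x approximates the label (assumed rule)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = b + sum(wk * xk for wk, xk in zip(w, xi))
            err = pred - yi
            # Gradient step on the squared error of this single sample.
            w = [wk - lr * err * xk for wk, xk in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_risk(w, b, x, threshold=0.5):
    """Return 1 (assumed: high risk) when the score reaches the threshold, else 0."""
    score = b + sum(wk * xk for wk, xk in zip(w, x))
    return 1 if score >= threshold else 0


# Toy radial basis features: the first component is large for high-risk users.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, 0, 0]
w, b = train_output_layer(X, y)
```

A new user is then scored with `predict_risk(w, b, features)`, where `features` is that user's vector of feature components.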

S5: evaluating the predictive performance of the credit risk assessment model, and checking the correctness of the model through two fitness functions, namely a quality function and a risk function.

The quality function is a score of correctness: a solution with a higher accuracy score is considered to be of higher quality. The objective is to generate high-quality rules on the basis of the collected historical sample data. In practice, maximizing the quality function means maximizing the number of correctly classified rules when validating on the data set.
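Since the quality function is described only as a score of correctness, one simple reading is plain classification accuracy. The sketch below makes that assumption; the function name and the example labels are hypothetical.

```python
def quality_score(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


score = quality_score([1, 0, 1, 1], [1, 0, 0, 1])
```

Under this reading, a model whose predictions match more of the validation labels receives a higher quality score.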

The risk function distinguishes high-risk attributes from low-risk attributes by calculating information values, where a large negative value represents high risk and a large positive value represents low risk. The calculation uses the total number of good samples, the total number of bad samples, the number of good attributes in the feature, and the number of bad attributes in the feature.
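The risk formula is not reproduced in this text. The four counts it names match the inputs of the weight of evidence and information value commonly used in credit scoring, so the sketch below uses those textbook forms as an assumption. The signed weight of evidence is negative when bad samples dominate an attribute and positive when good samples dominate; the function names are hypothetical.

```python
import math


def weight_of_evidence(good_in_attr, bad_in_attr, total_good, total_bad):
    """ln of (share of good samples with the attribute / share of bad samples with it)."""
    if min(good_in_attr, bad_in_attr, total_good, total_bad) <= 0:
        raise ValueError("all counts must be positive")
    return math.log((good_in_attr / total_good) / (bad_in_attr / total_bad))


def information_value(attr_counts, total_good, total_bad):
    """Sum over attributes of (good share - bad share) * weight of evidence."""
    iv = 0.0
    for good_in_attr, bad_in_attr in attr_counts:
        woe = weight_of_evidence(good_in_attr, bad_in_attr, total_good, total_bad)
        iv += (good_in_attr / total_good - bad_in_attr / total_bad) * woe
    return iv


low_risk = weight_of_evidence(30, 10, 100, 100)   # good samples dominate
high_risk = weight_of_evidence(10, 30, 100, 100)  # bad samples dominate
iv = information_value([(30, 10), (70, 90)], 100, 100)
```

Here `low_risk` is positive and `high_risk` is negative, which agrees with the sign convention stated for the risk function.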

Taking the financial field as an example, the historical dynamic behavior data of the user may specifically include the number of credit inquiries, the number of overdue payments, whether the credit card is in arrears, whether the credit card is used abroad, and the like. These data are combined to form clusters, and the clusters are combined with the personal information of the user, including age, occupation, income, whether the user has children, and the like, to form a new vector. The new vector is used as an instance in multi-instance learning and the data set is trained, with the final purpose of predicting the credit risk assessment category of a new user.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention should be determined by the protection scope of the appended claims, and it should be specifically noted that any modifications, equivalent substitutions, improvements and the like made by those skilled in the art within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A credit risk assessment model generation method based on multi-instance learning is characterized by comprising the following steps:

S1: collecting related data source information of a user, wherein the data source information specifically comprises personal information data and historical dynamic behavior data;

2. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein the extracting process of the historical behavior feature vector in step S2 specifically comprises:

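S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;

S2.2: mapping the distance to each cluster center to a historical behavior feature vector, and adopting a distance function D as the minimum Hausdorff distance measure, which is specifically defined as follows:

$$D(A, B) = \min_{a \in A,\; b \in B} \lVert a - b \rVert$$

wherein $\lVert a - b \rVert$ is the Euclidean distance between $a$ and $b$, with $a$ and $b$ being points of the sets $A$ and $B$;

S2.3: recording the historical behavior feature vector, in which the i-th variable is expressed as a feature component, and calculating the feature component from the following quantities: the n-th user; the distance between the n-th user and the cluster; and a standard deviation.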

3. The method for generating a multi-instance learning-based credit risk assessment model according to claim 2, wherein said standard deviation of step S2.3 is calculated by a formula that includes a constant parameter.

4. The method for generating the credit risk assessment model based on multi-instance learning according to claim 2, wherein the step S3 of constructing the new vector data set specifically comprises:

S3.1: recording the historical behavior feature vector as a vector;

S3.2: recording the personal information data of the user as a vector;

S3.3: constructing a new vector.

5. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein said fitness function in step S5 is a quality function and a risk function.

6. The method for generating a credit risk assessment model based on multi-instance learning as claimed in claim 5, wherein said quality function is the score of correctness.

7. The method as claimed in claim 5, wherein the risk function distinguishes high-risk attributes from low-risk attributes by calculating the information value, wherein a large negative value represents high risk and a large positive value represents low risk, and the calculation uses the total number of low risk attribute samples, the total number of high risk attribute samples, the number of low risk attributes in the feature, and the number of high risk attributes in the feature.