CN110992173A - Credit risk assessment model generation method based on multi-instance learning - Google Patents

Credit risk assessment model generation method based on multi-instance learning

Info

Publication number
CN110992173A
Authority
CN
China
Prior art keywords
risk assessment
assessment model
credit risk
function
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010141306.4A
Other languages
Chinese (zh)
Inventor
吴基成
程宏峰
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Sunyard Digital Science Co ltd
Original Assignee
Hangzhou Sunyard Digital Science Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Sunyard Digital Science Co ltd
Priority to CN202010141306.4A
Publication of CN110992173A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Abstract

A credit risk assessment model generation method based on multi-instance learning comprises the following steps: S1: collecting relevant data source information of the user; S2: extracting a user historical behavior feature vector from the collected data source information by clustering with the minimum Hausdorff distance; S3: combining the user historical behavior feature vector with personal information data to construct a new vector data set; S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function; S5: predicting the effect of the credit risk assessment model, and checking the correctness of the model through a fitness function. The invention addresses the high dimensionality and lack of labels in user data, pursues the goals of minimizing risk, minimizing complexity and maximizing accuracy, and improves the operating efficiency of the credit risk assessment model as well as its accuracy and interpretability.

Description

Credit risk assessment model generation method based on multi-instance learning
Technical Field
The invention relates to the field of credit assessment, in particular to a credit risk assessment model generation method based on multi-instance learning.
Background
With the rapid development of the information age, credit risk assessment has become one of the important research topics for financial institutions. It is a process that translates uncertainty into risk control. High-quality risk management enables banks to build robust decision-making systems and reduce losses. According to the Basel II accord of the Basel Committee on Banking Supervision, banking risks fall into three categories: (1) credit risk, (2) market risk, (3) operational risk. From the perspective of banking security, credit risk has therefore become an important research topic. Credit risk assessment is regarded as a complex multidimensional problem. It draws on a large amount of historical data, such as guarantors, job status, previous credit history and personal account status, with the goal of understanding the applicant's behavior and predicting risk. Accordingly, credit risk measurement and management systems are developed primarily around the classification or credit scoring of the applicant.
A patent entitled "a credit risk assessment method and device based on text analysis" (application No. 2015106953161, application publication date 2017.05.03) discloses a method comprising the following steps: acquiring a text of a borrower; analyzing the text to obtain basic language features, which are used to predict whether the borrower will default; inputting the basic language features into a preset credit risk assessment model to obtain the credit risk value of the borrower output by the model; and outputting the credit risk value of the borrower.
Patent publications such as the above construct credit assessment models with methods such as Linear Discriminant Analysis (LDA) and Logistic Regression (LR). These statistical methods are widely used because they are simple to operate and implement. However, their relatively poor predictive performance limits their use, especially on large data sets with many feature dimensions. To be used effectively, a credit assessment model must strike a good balance between classification performance and interpretability.
Disclosure of Invention
The invention innovatively provides a credit risk assessment model generation method based on multi-instance learning, which can solve the problems of high dimensionality and no label of user data so as to improve the accuracy and recall rate of a risk assessment model.
The technical scheme of the invention is as follows:
A credit risk assessment model generation method based on multi-instance learning comprises the following steps:
S1: collecting related data source information of a user, wherein the related data source information specifically comprises personal information data and historical dynamic behavior data;
S2: extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance;
S3: combining the historical behavior feature vector with the personal information data to construct a new vector data set;
S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function;
S5: predicting the effect of the credit risk assessment model, and checking the correctness of the model through a fitness function.
Preferably, the extracting process of the user history behavior feature vector in step S2 specifically includes:
S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;
S2.2: mapping the distance to each cluster center to a historical behavior feature vector, using a distance function D as the minimum Hausdorff distance measure, specifically defined as:
[formula image 001]
where [image 002] is the Euclidean distance between [image 003] and [image 004];
S2.3: recording the historical behavior feature vector as [image 005], where the ith variable [image 006] is expressed as a feature component [image 007], calculated as:
[formula image 008]
The specific value of the historical behavior feature vector is calculated through this formula, where [image 009] represents the nth user; [image 010] represents the distance between the nth user and cluster [image 011]; and [image 012] is the standard deviation, which describes the average distance between pairs of cluster centers.
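The formula images for D and for the feature component are not reproduced in this text. As a minimal sketch, assuming the standard minimum Hausdorff distance (the smallest Euclidean distance between any point of one set and any point of the other) and assuming a Gaussian radial basis form for the feature component, steps S2.2 and S2.3 might look as follows. The function names and the Gaussian form are illustrative assumptions, not taken from the patent.

```python
import numpy as np


def min_hausdorff(bag_a, bag_b):
    """Smallest Euclidean distance between any point of bag_a and any point of bag_b."""
    a = np.asarray(bag_a, dtype=float)
    b = np.asarray(bag_b, dtype=float)
    # Pairwise differences via broadcasting: shape (len(a), len(b), dim).
    diffs = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min())


def feature_component(user_bag, cluster_bag, sigma):
    """ASSUMED Gaussian form: exp(-D^2 / (2 * sigma^2)); the patent's formula image is missing."""
    d = min_hausdorff(user_bag, cluster_bag)
    return float(np.exp(-d ** 2 / (2.0 * sigma ** 2)))


# Hypothetical data: one user's behavior records and one cluster of records.
user = [[0.0, 0.0], [1.0, 0.0]]
cluster = [[4.0, 0.0], [3.0, 4.0]]
print(min_hausdorff(user, cluster))
print(feature_component(user, cluster, sigma=2.0))
```

Under this assumed form, a user whose records lie close to a cluster gets a component near 1 for that cluster, and the component decays toward 0 as the distance grows.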
Preferably, the standard deviation [image 012] described in step S2.3 is calculated as:
[formula image 013]
where [image 014] is a constant parameter, [image 015] represents the distance between clusters [image 016] and [image 017], and K is the number of clusters.
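The formula image for the standard deviation is not reproduced here. One reading consistent with the surrounding definitions is the constant parameter multiplied by the mean distance over all K(K-1)/2 cluster pairs. The sketch below implements that reading only; the function name, the argument `mu` for the constant parameter, and the exact form are assumptions.

```python
import numpy as np


def sigma_from_cluster_distances(dist, mu=1.0):
    """ASSUMED form: mu times the mean pairwise cluster distance.

    dist is a symmetric K x K matrix whose entry (i, j) is the distance
    between cluster i and cluster j.
    """
    dist = np.asarray(dist, dtype=float)
    k = dist.shape[0]
    if k < 2:
        raise ValueError("at least two clusters are required")
    # Upper triangle without the diagonal holds each pair exactly once.
    upper = np.triu_indices(k, k=1)
    return float(mu * dist[upper].mean())


# Hypothetical distances between K = 3 clusters.
distances = [[0.0, 2.0, 4.0],
             [2.0, 0.0, 6.0],
             [4.0, 6.0, 0.0]]
print(sigma_from_cluster_distances(distances, mu=0.5))
```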
Preferably, the new vector data set in step S3 is constructed as follows:
S3.1: recording the historical behavior feature vector [image 005] as [image 018];
S3.2: recording the personal information data of the user as a vector [image 019];
S3.3: constructing a new vector [image 020] that combines the personal information data of the user with the historical behavior feature vector;
where [image 007] represents the feature component of the ith variable; [image 009] represents the comprehensive feature vector constructed for the nth user; [image 010] represents the distance between the nth user and cluster [image 011]; and [image 012] is the standard deviation, which describes the average distance between pairs of cluster centers.
Preferably, the fitness function in step S5 consists of a quality function and a risk function.
Preferably, the quality function is a correctness score, calculated as:
[formula image 021]
More preferably, the risk function distinguishes high-risk attributes from low-risk attributes by calculating information value, where a high negative value represents high risk and a high positive value represents low risk. It is calculated as:
[formula image 022]
where [image 023] represents the total number of good samples; [image 024] represents the total number of bad samples; [image 025] represents the number of good attributes in the feature; and [image 026] represents the number of bad attributes in the feature.
The invention has the beneficial effects that: according to the method, a multi-instance learning method in machine learning is combined with a radial basis function to carry out comprehensive risk assessment on the user related data set, the problems of high dimensionality and no label of user data are solved, the goals of minimizing risk, minimizing complexity and maximizing accuracy are achieved, the operation efficiency of a credit risk assessment model is improved, and the model accuracy and interpretability are improved.
Drawings
FIG. 1 is a flow chart of the generation of a credit risk assessment model in the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
A credit risk assessment model generation method based on multi-instance learning is disclosed, as shown in FIG. 1, and includes the following steps:
S1: Personal information data and historical dynamic behavior data of the user are collected.
S2: Extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance. Specifically, the historical dynamic behavior data are aggregated into K clusters. The data set must be preprocessed before aggregation: missing values are first filled with the mean, and duplicate records are deleted. Because the data are high-dimensional, the dimensionality of the data set is reduced first; the invention adopts the LDA algorithm, which yields the feature topics of the data set while reducing the dimensionality. The preprocessed low-dimensional data are then aggregated. The distance to each cluster center is mapped to a feature vector, using a distance function D as the minimum Hausdorff distance measure, specifically defined as:
[formula image 001]
where [image 002] is the Euclidean distance between [image 003] and [image 004], which respectively represent cluster points. The historical behavior feature vector is recorded as [image 005], where the ith variable [image 006] is expressed as a feature component [image 007], calculated as:
[formula image 008]
The specific value of the historical behavior feature vector is calculated through this formula, where [image 009] represents the nth user; [image 010] represents the distance between the nth user and cluster [image 011]; and [image 012] is the standard deviation, which describes the average distance between the center points of pairs of clusters. [image 012] is calculated as:
[formula image 013]
where [image 015] represents the distance between clusters [image 016] and [image 017], K is the number of clusters, and [image 014] is a constant parameter.
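The preprocessing described in S2 (mean filling followed by deletion of duplicate records) can be sketched as below. This is an illustration only: it assumes the behavior data are already numeric, that missing values are encoded as NaN, and that means are taken per column; the function name is hypothetical. Dimension reduction and clustering are deliberately left out.

```python
import numpy as np


def preprocess(records):
    """Fill missing values with the column mean, then drop duplicate rows."""
    x = np.array(records, dtype=float)  # np.array copies, so the input is untouched
    # Column means computed while ignoring missing entries.
    col_means = np.nanmean(x, axis=0)
    rows, cols = np.where(np.isnan(x))
    x[rows, cols] = col_means[cols]
    # np.unique sorts rows; re-sort the first-occurrence indices to keep input order.
    _, first_idx = np.unique(x, axis=0, return_index=True)
    return x[np.sort(first_idx)]


# Hypothetical records: two behavior variables per row, one missing value.
raw = [[1.0, np.nan],
       [3.0, 4.0],
       [3.0, 4.0],
       [5.0, 8.0]]
print(preprocess(raw))
```

Filling before deduplication follows the order given in the text, which means duplicate rows still contribute to the column means in this sketch.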
S3: Combining the historical behavior feature vector with the personal information data to construct a new vector data set. Specifically, the historical behavior feature vector [image 005] obtained in the preceding steps is recorded as [image 018]. The personal information data of the user, including the user identification number, mobile phone number, bank card number, e-mail address and other data, are concatenated and recorded as a vector [image 019]; missing data are marked as NULL. A new vector [image 020] is constructed by combining the personal information data of the user with the historical behavior feature vector, forming a joint vector matrix.
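A minimal sketch of the joint vector construction in S3 follows. The field names, the use of `None` to stand for NULL, and the ordering (personal fields first, then behavior features) are assumptions made for illustration; the patent's formula images for these vectors are not available.

```python
def build_joint_vector(personal, behavior_features, personal_fields):
    """Concatenate personal fields and behavior features for one user.

    Missing personal fields are marked with None, standing in for NULL.
    """
    personal_part = [personal.get(name) for name in personal_fields]
    return personal_part + list(behavior_features)


# Hypothetical field names and values.
fields = ["age", "occupation", "income"]
user_info = {"age": 35, "income": 5200}
behavior = [0.8, 0.1]

joint = build_joint_vector(user_info, behavior, fields)
print(joint)  # [35, None, 5200, 0.8, 0.1]
```

Stacking one such vector per user gives the joint vector matrix that S4 trains on.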
S4: Training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function, where the evaluation function is:
[formula image 027]
where [image 007] represents the feature component of the ith variable; [image 009] represents the comprehensive feature vector constructed for the nth user; [image 010] represents the distance between the nth user and cluster [image 011]; [image 012] is the standard deviation, which describes the average distance between the center points of pairs of clusters; and K represents the number of feature vectors.
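The evaluation function image is not reproduced, so the exact model form is unknown. A common way to train a radial basis function network, shown here purely as an assumed illustration, is to treat the K feature components of each user as inputs to a linear output layer and fit its weights by least squares. The function names and the linear-plus-bias form are not from the patent.

```python
import numpy as np


def fit_output_weights(phi, labels):
    """Least-squares fit of a linear output layer on RBF feature components.

    phi: (n_users, K) matrix of feature components; labels: (n_users,) targets.
    Returns K weights followed by one bias term.
    """
    phi = np.asarray(phi, dtype=float)
    y = np.asarray(labels, dtype=float)
    design = np.hstack([phi, np.ones((phi.shape[0], 1))])  # append bias column
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights


def score(phi, weights):
    """Model output for each row of feature components."""
    phi = np.asarray(phi, dtype=float)
    design = np.hstack([phi, np.ones((phi.shape[0], 1))])
    return design @ weights


# Hypothetical feature components for four users and K = 2 clusters.
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5], [0.0, 0.2]]
labels = [1.0, 0.0, 1.0, 0.0]  # 1 = low risk, 0 = high risk (assumed coding)
w = fit_output_weights(phi, labels)
print(score(phi, w))
```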
S5: Predicting the effect of the credit risk assessment model, and checking the correctness of the model through two fitness functions, namely a quality function and a risk function.
The quality function is a correctness score: a solution with a higher accuracy score is considered to be of higher quality. The objective is to generate high-quality rules that cover as much of the collected historical sample data as possible. In practice, maximizing the quality function means maximizing the number of correct classifications when the rules are validated on the data set. The calculation formula is:
[formula image 028]
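Since the quality formula image is missing, the sketch below assumes the plainest reading of "a score on correctness": the fraction of samples classified correctly. The function name and the label values are hypothetical.

```python
def quality(predicted, actual):
    """ASSUMED form: share of samples whose predicted class matches the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical validation labels.
print(quality(["good", "bad", "good", "good"],
              ["good", "bad", "bad", "good"]))  # 0.75
```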
The risk function distinguishes high-risk attributes from low-risk attributes by calculating information value, where a high negative value represents high risk and a high positive value represents low risk. The calculation formula is:
[formula image 029]
where [image 023] represents the total number of good samples; [image 024] represents the total number of bad samples; [image 025] represents the number of good attributes in the feature; and [image 026] represents the number of bad attributes in the feature.
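The risk formula image is not reproduced. The four quantities listed and the sign convention (negative for high risk, positive for low risk) match the weight-of-evidence statistic commonly used alongside information value in credit scoring, so the sketch below uses that statistic as an assumption. It may differ from the patent's actual formula.

```python
import math


def weight_of_evidence(good_in_attr, bad_in_attr, total_good, total_bad):
    """ASSUMED form: ln((good_in_attr / total_good) / (bad_in_attr / total_bad)).

    Negative values mean the attribute is over-represented among bad samples
    (high risk); positive values mean it is over-represented among good samples.
    """
    if min(good_in_attr, bad_in_attr, total_good, total_bad) <= 0:
        raise ValueError("all counts must be positive")
    return math.log((good_in_attr / total_good) / (bad_in_attr / total_bad))


# Hypothetical counts: 100 good and 100 bad samples overall.
print(weight_of_evidence(10, 40, 100, 100))  # negative: high-risk attribute
print(weight_of_evidence(40, 10, 100, 100))  # positive: low-risk attribute
```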
Taking the financial field as an example, the historical dynamic behavior data of the user may specifically include the number of credit inquiries, the number of overdue payments, whether the credit card is in arrears, whether the user has been abroad, and the like. These data are combined to form clusters. The clusters are then combined with the personal information of the user, including age, occupation, income, presence or absence of children and the like, to form a new vector. The new vector is used as an instance in multi-instance learning and the data set is trained, with the final purpose of predicting the credit risk assessment category of a new user.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention should be determined by the protection scope of the appended claims, and it should be specifically noted that any modifications, equivalent substitutions, improvements and the like made by those skilled in the art within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A credit risk assessment model generation method based on multi-instance learning is characterized by comprising the following steps:
S1: collecting related data source information of a user, wherein the data source information specifically comprises personal information data and historical dynamic behavior data;
S2: extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance;
S3: combining the historical behavior feature vector with the personal information data to construct a new vector data set;
S4: training the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function;
S5: predicting the effect of the credit risk assessment model, and checking the correctness of the model through a fitness function.
2. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein the extracting process of the historical behavior feature vector in step S2 specifically comprises:
S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;
S2.2: mapping the distance to each cluster center to a historical behavior feature vector, using a distance function D as the minimum Hausdorff distance measure, specifically defined as:
[formula image 001]
where [image 002] is the Euclidean distance between [image 003] and [image 004];
S2.3: recording the historical behavior feature vector as [image 005], where the ith variable [image 006] is expressed as a feature component [image 007], calculated as:
[formula image 008]
The specific value of the historical behavior feature vector is calculated through this formula, where [image 009] represents the nth user; [image 010] represents the distance between the nth user and cluster [image 011]; and [image 012] is the standard deviation, which describes the average distance between pairs of cluster centers.
3. The method for generating a multi-instance learning-based credit risk assessment model according to claim 2, wherein said standard deviation [image 012] of step S2.3 is calculated as:
[formula image 013]
where [image 014] is a constant parameter.
4. The method for generating the credit risk assessment model based on multi-instance learning according to claim 2, wherein the step S3 of constructing the new vector data set specifically comprises:
S3.1: recording the historical behavior feature vector [image 005] as [image 015];
S3.2: recording the personal information data of the user as a vector [image 016];
S3.3: constructing a new vector [image 017] that combines the personal information data of the user with the historical behavior feature vector.
5. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein said fitness function in step S5 is a quality function and a risk function.
6. The method for generating a credit risk assessment model based on multi-instance learning as claimed in claim 5, wherein said quality function is a correctness score, calculated as:
[formula image 018]
7. The method as claimed in claim 5, wherein the risk function distinguishes high-risk attributes from low-risk attributes by calculating information value, wherein a high negative value represents high risk and a high positive value represents low risk, and the calculation formula is:
[formula image 019]
where [image 020] represents the total number of low-risk attribute samples; [image 021] represents the total number of high-risk attribute samples; [image 022] represents the number of low-risk attributes in the feature; and [image 023] represents the number of high-risk attributes in the feature.
CN202010141306.4A 2020-03-04 2020-03-04 Credit risk assessment model generation method based on multi-instance learning Pending CN110992173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010141306.4A CN110992173A (en) 2020-03-04 2020-03-04 Credit risk assessment model generation method based on multi-instance learning

Publications (1)

Publication Number Publication Date
CN110992173A true CN110992173A (en) 2020-04-10

Family

ID=70081447

Country Status (1)

Country Link
CN (1) CN110992173A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396310A (en) * 2020-11-12 2021-02-23 上海京滴信用管理有限公司 Social credit risk assessment system based on machine learning
CN112686749A (en) * 2020-12-31 2021-04-20 上海竞动科技有限公司 Credit risk assessment method and device based on logistic regression technology
CN117132001A (en) * 2023-10-24 2023-11-28 杭银消费金融股份有限公司 Multi-target wind control strategy optimization method and system
CN117132001B (en) * 2023-10-24 2024-01-23 杭银消费金融股份有限公司 Multi-target wind control strategy optimization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200410)