CN110992173A - Credit risk assessment model generation method based on multi-instance learning - Google Patents
Credit risk assessment model generation method based on multi-instance learning
- Publication number
- CN110992173A
- Authority
- CN
- China
- Prior art keywords
- risk assessment
- assessment model
- credit risk
- function
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Abstract
A credit risk assessment model generation method based on multi-instance learning comprises the following steps: S1: collecting relevant data source information of the user; S2: extracting a user historical behavior feature vector from the collected data source information by clustering with the minimum Hausdorff distance; S3: combining the user historical behavior feature vector with personal information data to construct a new vector data set; S4: training on the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function; S5: evaluating the prediction performance of the credit risk assessment model, and checking the correctness of the model through a fitness function. The invention addresses the high dimensionality and lack of labels of user data, pursues the goals of minimizing risk, minimizing complexity and maximizing accuracy, and improves the operating efficiency of the credit risk assessment model as well as its accuracy and interpretability.
Description
Technical Field
The invention relates to the field of credit assessment, in particular to a credit risk assessment model generation method based on multi-instance learning.
Background
With the rapid development of the information age, credit risk assessment has become one of the key research topics for financial institutions. It is a process that translates uncertainty into risk control. High-quality risk management enables banks to build robust decision-making systems and reduce losses. According to the Basel II accord of the Basel Committee on Banking Supervision, banking risks fall into three categories: (1) credit risk, (2) market risk, (3) operational risk. From the perspective of banking security, credit risk has therefore become an important research topic for banks. Credit risk assessment is considered a complex multidimensional problem: it draws on a large amount of historical data, such as guardians, job status, previous credit history and personal account status, with the goal of understanding the applicant's behavior and predicting risk. Accordingly, credit risk measurement and management systems are developed primarily around classifying or scoring the applicant.
A related prior application, "A credit risk assessment method and device based on text analysis" (application No. 2015106953161, application publication date 2017.05.03), discloses a method comprising the following steps: acquiring a text of a borrower; analyzing the text to obtain basic language features, which are used to predict whether the borrower will default; inputting the basic language features into a preset credit risk assessment model to obtain the credit risk value of the borrower output by the model; and outputting the credit risk value of the borrower.
Methods such as Linear Discriminant Analysis (LDA) and Logistic Regression (LR) have been used in the above patent publications to construct credit assessment models. These statistical methods are widely used because they are simple to operate and implement. However, their relatively poor predictive performance limits their use, especially on large data sets with many feature dimensions. To be used effectively, a credit assessment model must strike a good balance between classification performance and interpretability.
Disclosure of Invention
The invention provides a credit risk assessment model generation method based on multi-instance learning, which addresses the high dimensionality and lack of labels of user data so as to improve the accuracy and recall of the risk assessment model.
The technical scheme of the invention is as follows:
A credit risk assessment model generation method based on multi-instance learning comprises the following steps:
S1: collecting related data source information of a user, wherein the related data source information specifically comprises personal information data and historical dynamic behavior data;
S2: extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance;
S3: combining the historical behavior feature vector with the personal information data to construct a new vector data set;
S4: training on the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function;
S5: evaluating the prediction performance of the credit risk assessment model, and checking the correctness of the model through a fitness function.
Preferably, the extraction of the user historical behavior feature vector in step S2 specifically comprises:
S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;
S2.2: mapping the distance to each cluster center to a historical behavior feature vector, using the minimum Hausdorff distance as the distance function D, defined as: $D(A,B)=\min_{a\in A,\ b\in B}\lVert a-b\rVert$, where $\lVert a-b\rVert$ is the Euclidean distance between $a$ and $b$;
S2.3: denoting the historical behavior feature vector as $\varphi(X_n)=(\varphi_1(X_n),\dots,\varphi_K(X_n))$, where the $i$-th variable $\varphi_i(X_n)$ is a feature component calculated as: $\varphi_i(X_n)=\exp\left(-\frac{D(X_n,C_i)^2}{2\sigma^2}\right)$; the specific value of the historical behavior feature vector is calculated through this formula, where $X_n$ represents the $n$-th user; $D(X_n,C_i)$ represents the distance between the $n$-th user and cluster $C_i$; and $\sigma$ is the standard deviation, describing the average distance between pairs of cluster centers.
Preferably, the standard deviation $\sigma$ in step S2.3 is calculated as:
$\sigma=\mu\cdot\frac{\sum_{i=1}^{K-1}\sum_{j=i+1}^{K}D(C_i,C_j)}{K(K-1)/2}$
where $\mu$ is a constant parameter, $D(C_i,C_j)$ represents the distance between clusters $C_i$ and $C_j$, and $K$ is the number of clusters.
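The S2 computation can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes that each user and each cluster is a set of points (a "bag"), that the feature component takes the Gaussian radial basis form exp(-D^2 / (2 sigma^2)), and that sigma is a constant parameter times the mean pairwise cluster distance. The function names, the parameter `mu` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def min_hausdorff(bag_a, bag_b):
    """Minimum Hausdorff distance: the smallest Euclidean distance
    between any point of bag_a and any point of bag_b."""
    return min(math.dist(a, b) for a in bag_a for b in bag_b)

def rbf_sigma(clusters, mu=1.0):
    """Assumed form of the standard deviation: mu times the average
    minimum Hausdorff distance over all pairs of clusters."""
    pairs = list(combinations(clusters, 2))
    return mu * sum(min_hausdorff(p, q) for p, q in pairs) / len(pairs)

def behavior_features(user_bag, clusters, sigma):
    """One Gaussian radial basis component per cluster (assumed form)."""
    return [
        math.exp(-min_hausdorff(user_bag, c) ** 2 / (2 * sigma ** 2))
        for c in clusters
    ]

# Hypothetical toy data: three clusters and one user, all in 2-D.
clusters = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(4.0, 0.0), (5.0, 0.0)],
    [(0.0, 4.0)],
]
user_bag = [(1.0, 0.0), (2.0, 0.0)]

sigma = rbf_sigma(clusters, mu=1.0)
features = behavior_features(user_bag, clusters, sigma)
print(sigma, features)
```

A user that shares a point with a cluster gets a component of exactly 1 for that cluster, and the component decays toward 0 as the user's nearest point moves away from the cluster.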
Preferably, the specific construction step of the new vector data set in step S3 is as follows:
S3.3: constructing a new vector $Z_n=[P_n,\ \varphi(X_n)]$ that combines the personal information data $P_n$ of the user with the historical behavior feature vector $\varphi(X_n)$.
Here $\varphi_i(X_n)$ is the feature component of the $i$-th variable; $Z_n$ is the comprehensive feature vector constructed for the $n$-th user; $D(X_n,C_i)$ is the distance between the $n$-th user and cluster $C_i$; and $\sigma$ is the standard deviation, describing the average distance between pairs of cluster centers.
Preferably, the fitness function in step S5 comprises a quality function and a risk function.
Preferably, the quality function is a score of correctness.
More preferably, the risk function distinguishes high-risk attributes from low-risk attributes by calculating an information value, where a large negative value represents high risk and a large positive value represents low risk. The information value is calculated from four quantities: the total number of good samples, the total number of bad samples, the number of good attributes in the feature, and the number of bad attributes in the feature.
The beneficial effects of the invention are as follows: the method combines a multi-instance learning method from machine learning with a radial basis function to perform a comprehensive risk assessment on the user's data set. It addresses the high dimensionality and lack of labels of user data, pursues the goals of minimizing risk, minimizing complexity and maximizing accuracy, and improves the operating efficiency, accuracy and interpretability of the credit risk assessment model.
Drawings
FIG. 1 is a flow chart of the generation of a credit risk assessment model in the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
A credit risk assessment model generation method based on multi-instance learning is disclosed, as shown in FIG. 1, and includes the following steps:
S1: collecting personal information data and historical dynamic behavior data of the user.
S2: extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance, specifically as follows. The historical dynamic behavior data are aggregated into K clusters. Before aggregation the data set is preprocessed: missing values are first filled with the mean, and duplicate records are deleted. Because the data are high-dimensional, the data set must first be reduced in dimension; the invention adopts the LDA algorithm, which yields the feature topics of the data set while reducing its dimension. The preprocessed low-dimensional data are then aggregated. The distance to the center of each cluster is mapped to a feature vector, using the minimum Hausdorff distance as the distance function D, defined as: $D(A,B)=\min_{a\in A,\ b\in B}\lVert a-b\rVert$, where $\lVert a-b\rVert$ is the Euclidean distance between $a$ and $b$, and $a$, $b$ are cluster points. The historical behavior feature vector is denoted $\varphi(X_n)=(\varphi_1(X_n),\dots,\varphi_K(X_n))$, where the $i$-th variable $\varphi_i(X_n)$ is a feature component calculated as: $\varphi_i(X_n)=\exp\left(-\frac{D(X_n,C_i)^2}{2\sigma^2}\right)$; the specific value of the historical behavior feature vector is calculated through this formula, where $X_n$ represents the $n$-th user and $D(X_n,C_i)$ represents the distance between the $n$-th user and cluster $C_i$. The standard deviation $\sigma$, which describes the average distance between the center points of pairs of clusters, is calculated as:
$\sigma=\mu\cdot\frac{\sum_{i=1}^{K-1}\sum_{j=i+1}^{K}D(C_i,C_j)}{K(K-1)/2}$
where $D(C_i,C_j)$ represents the distance between clusters $C_i$ and $C_j$, $K$ is the number of clusters, and $\mu$ is a constant parameter.
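The preprocessing described in S2 (mean filling of missing values, then deletion of duplicate records) might look like the following sketch. It assumes numeric records stored as lists, with `None` standing in for a missing value; the function name and the toy records are hypothetical, and the LDA dimension-reduction step is deliberately left out.

```python
def preprocess(rows):
    """Fill missing values (None) with the column mean, then drop
    exact duplicate records while keeping the original order."""
    filled = [list(row) for row in rows]
    n_cols = len(filled[0])
    for col in range(n_cols):
        present = [r[col] for r in filled if r[col] is not None]
        # Assumption: a column with no observed values is filled with 0.0.
        mean = sum(present) / len(present) if present else 0.0
        for r in filled:
            if r[col] is None:
                r[col] = mean

    unique, seen = [], set()
    for r in filled:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical behavior records: one duplicate row and two missing values.
raw = [
    [2.0, 0.0, 1.0],
    [4.0, None, 1.0],
    [2.0, 0.0, 1.0],
    [6.0, 3.0, None],
]
clean = preprocess(raw)
print(clean)
```

The order follows the text: imputation runs first, so duplicate rows still contribute to the column means before they are removed.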
S3: combining the historical behavior feature vector and the personal information data to construct a new vector data set, specifically as follows: the historical behavior feature vector obtained by the above steps is $(\varphi_1(X_n),\dots,\varphi_K(X_n))$, denoted $\varphi(X_n)$; the personal information data of the user, including the user identification number, mobile phone number, bank card number, email address and other data, are concatenated and recorded as a vector $P_n$, with missing data marked as NULL; a new vector $Z_n=[P_n,\ \varphi(X_n)]$ is then constructed, combining the personal information data of the user with the historical behavior feature vector to build a joint vector matrix.
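A minimal sketch of the S3 concatenation follows, assuming personal information arrives as a dictionary per user and that Python's `None` plays the role of NULL. The field names, the helper `build_joint_vector` and the sample values are all hypothetical.

```python
# Hypothetical personal-information fields, in a fixed order.
FIELDS = ["id_number", "phone", "bank_card", "email"]

def build_joint_vector(personal_info, behavior_features, fields=FIELDS):
    """Concatenate personal information with the behavior feature vector.
    A missing field is recorded as None, standing in for NULL."""
    personal = [personal_info.get(name) for name in fields]
    return personal + list(behavior_features)

users = [
    ({"id_number": "A001", "phone": "555-0100",
      "bank_card": "C-1", "email": "a@example.com"}, [1.0, 0.89, 0.62]),
    ({"id_number": "A002", "phone": "555-0101",
      "email": "b@example.com"}, [0.40, 0.95, 0.10]),  # no bank card on file
]

# One row per user: the joint vector matrix.
joint_matrix = [build_joint_vector(info, feats) for info, feats in users]
for row in joint_matrix:
    print(row)
```

Keeping the field order fixed means every row of the joint matrix has the same layout, so a NULL always lands in the same column.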
S4: training on the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function.
The evaluation function is expressed in terms of the following quantities: the feature component of the i-th variable; the comprehensive feature vector constructed for the n-th user; the distance between the n-th user and a cluster; the standard deviation, which describes the average distance between the center points of pairs of clusters; and K, the number of feature vectors.
S5: evaluating the prediction performance of the credit risk assessment model, and checking the correctness of the model through two fitness functions, namely a quality function and a risk function.
The quality function is a score of correctness: a solution with a higher accuracy score is considered to be of higher quality. The objective is to generate high-quality rules from the collected historical sample data. In practice, maximizing the quality function means maximizing the number of rules that classify the validation data set correctly.
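One common way to express such a correctness score is plain classification accuracy, the fraction of validation samples whose predicted class matches the true class. The sketch below assumes that reading; the function name `quality` and the labels are hypothetical, and the patent's own formula may differ.

```python
def quality(y_true, y_pred):
    """Accuracy: the share of samples whose predicted label
    equals the true label (assumed form of the quality function)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical validation labels: 1 = good credit, 0 = bad credit.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(quality(y_true, y_pred))
```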
The risk function distinguishes high-risk attributes from low-risk attributes by calculating an information value, where a large negative value represents high risk and a large positive value represents low risk. The information value is calculated from four quantities: the total number of good samples, the total number of bad samples, the number of good attributes in the feature, and the number of bad attributes in the feature.
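A log-ratio of the good and bad shares, often called weight of evidence in credit scoring, uses exactly these four counts and has the sign behavior described: negative when the attribute is over-represented among bad samples, positive when it is over-represented among good ones. The sketch below assumes that form; it is not confirmed to be the patent's formula, and the function name and counts are hypothetical.

```python
import math

def attribute_risk_value(good_attr, bad_attr, total_good, total_bad):
    """Log-ratio of the attribute's share among good samples to its
    share among bad samples (assumed form of the risk function)."""
    if min(good_attr, bad_attr, total_good, total_bad) <= 0:
        raise ValueError("all counts must be positive for the log ratio")
    return math.log((good_attr / total_good) / (bad_attr / total_bad))

# Hypothetical counts with 100 good and 100 bad samples overall.
high_risk = attribute_risk_value(good_attr=10, bad_attr=40,
                                 total_good=100, total_bad=100)
low_risk = attribute_risk_value(good_attr=40, bad_attr=10,
                                total_good=100, total_bad=100)
print(high_risk, low_risk)
```

Zero counts are rejected rather than smoothed here; a production scorecard would usually apply some smoothing, which is a design choice outside what the text specifies.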
Taking the financial field as an example, the historical dynamic behavior data of the user may specifically include the number of credit inquiries, the number of overdue payments, whether a credit card is in arrears, whether the user is abroad, and so on. These data are combined to form clusters, which are then combined with the personal information of the user, including age, occupation, income, whether the user has children, and so on, to form a new vector. The new vector serves as an instance in multi-instance learning, the data set is used for training, and the final purpose is to predict the credit risk assessment category of a new user.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention should be determined by the protection scope of the appended claims, and it should be specifically noted that any modifications, equivalent substitutions, improvements and the like made by those skilled in the art within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A credit risk assessment model generation method based on multi-instance learning is characterized by comprising the following steps:
S1: collecting related data source information of a user, wherein the data source information specifically comprises personal information data and historical dynamic behavior data;
S2: extracting the historical behavior feature vector of the user from the collected data source information by clustering with the minimum Hausdorff distance;
S3: combining the historical behavior feature vector with the personal information data to construct a new vector data set;
S4: training on the combined new vector data set with a multi-instance learning method based on a radial basis function, and constructing a credit risk assessment model based on an evaluation model index function;
S5: evaluating the prediction performance of the credit risk assessment model, and checking the correctness of the model through a fitness function.
2. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein the extracting process of the historical behavior feature vector in step S2 specifically comprises:
S2.1: aggregating the historical dynamic behavior data of S1 into K clusters;
S2.2: mapping the distance to each cluster center to a historical behavior feature vector, using the minimum Hausdorff distance as the distance function D, defined as: $D(A,B)=\min_{a\in A,\ b\in B}\lVert a-b\rVert$, where $\lVert a-b\rVert$ is the Euclidean distance between $a$ and $b$;
S2.3: denoting the historical behavior feature vector as $\varphi(X_n)=(\varphi_1(X_n),\dots,\varphi_K(X_n))$, where the $i$-th variable $\varphi_i(X_n)$ is a feature component calculated as: $\varphi_i(X_n)=\exp\left(-\frac{D(X_n,C_i)^2}{2\sigma^2}\right)$; the specific value of the historical behavior feature vector is calculated through this formula, where $X_n$ represents the $n$-th user; $D(X_n,C_i)$ represents the distance between the $n$-th user and cluster $C_i$; and $\sigma$ is the standard deviation, describing the average distance between pairs of cluster centers.
4. The method for generating the credit risk assessment model based on multi-instance learning according to claim 2, wherein the step S3 of constructing the new vector data set specifically comprises:
5. The method for generating a credit risk assessment model based on multi-instance learning according to claim 1, wherein said fitness function in step S5 is a quality function and a risk function.
7. The method for generating a credit risk assessment model based on multi-instance learning according to claim 5, wherein the risk function distinguishes high-risk attributes from low-risk attributes by calculating an information value, where a large negative value represents high risk and a large positive value represents low risk; the information value is calculated from the total number of low-risk attribute samples, the total number of high-risk attribute samples, the number of low-risk attributes in the feature, and the number of high-risk attributes in the feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010141306.4A CN110992173A (en) | 2020-03-04 | 2020-03-04 | Credit risk assessment model generation method based on multi-instance learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010141306.4A CN110992173A (en) | 2020-03-04 | 2020-03-04 | Credit risk assessment model generation method based on multi-instance learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110992173A (en) | 2020-04-10 |
Family
ID=70081447
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010141306.4A Pending CN110992173A (en) | 2020-03-04 | 2020-03-04 | Credit risk assessment model generation method based on multi-instance learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110992173A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112396310A (en) * | 2020-11-12 | 2021-02-23 | 上海京滴信用管理有限公司 | Social credit risk assessment system based on machine learning |
CN112686749A (en) * | 2020-12-31 | 2021-04-20 | 上海竞动科技有限公司 | Credit risk assessment method and device based on logistic regression technology |
CN117132001A (en) * | 2023-10-24 | 2023-11-28 | 杭银消费金融股份有限公司 | Multi-target wind control strategy optimization method and system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112396310A (en) * | 2020-11-12 | 2021-02-23 | 上海京滴信用管理有限公司 | Social credit risk assessment system based on machine learning |
CN112686749A (en) * | 2020-12-31 | 2021-04-20 | 上海竞动科技有限公司 | Credit risk assessment method and device based on logistic regression technology |
CN117132001A (en) * | 2023-10-24 | 2023-11-28 | 杭银消费金融股份有限公司 | Multi-target wind control strategy optimization method and system |
CN117132001B (en) * | 2023-10-24 | 2024-01-23 | 杭银消费金融股份有限公司 | Multi-target wind control strategy optimization method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data | |
CN110992173A (en) | Credit risk assessment model generation method based on multi-instance learning | |
CN109544190A (en) | A kind of fraud identification model training method, fraud recognition methods and device | |
CN107230108A (en) | The processing method and processing device of business datum | |
CN110956273A (en) | Credit scoring method and system integrating multiple machine learning models | |
US20220215298A1 (en) | Method for training sequence mining model, method for processing sequence data, and device | |
Chen et al. | Predicting default risk on peer-to-peer lending imbalanced datasets | |
CN109726918A (en) | The personal credit for fighting network and semi-supervised learning based on production determines method | |
Teng et al. | Customer credit scoring based on HMM/GMDH hybrid model | |
Doumpos et al. | Model combination for credit risk assessment: A stacked generalization approach | |
Fan et al. | Improved ML-based technique for credit card scoring in internet financial risk control | |
CN110084609B (en) | Transaction fraud behavior deep detection method based on characterization learning | |
Pan et al. | Study on evaluation model of Chinese P2P online lending platform based on hybrid kernel support Vector Machine | |
Jin et al. | A weighting method for feature dimension by semisupervised learning with entropy | |
Naik | Predicting credit risk for unsecured lending: A machine learning approach | |
Dolphin et al. | Industry Classification Using a Novel Financial Time-Series Case Representation | |
CN110033862B (en) | Traditional Chinese medicine quantitative diagnosis system based on weighted directed graph and storage medium | |
Mamun et al. | Predicting Bank Loan Eligibility Using Machine Learning Models and Comparison Analysis | |
CN117217807B (en) | Bad asset estimation method based on multi-mode high-dimensional characteristics | |
Dixon et al. | A Bayesian approach to ranking private companies based on predictive indicators | |
CN117094817B (en) | Credit risk control intelligent prediction method and system | |
US11663668B1 (en) | Apparatus and method for generating a pecuniary program | |
TWI824876B (en) | Marketing system and method by using customer genes | |
US20240020771A1 (en) | Apparatus and method for generating a pecuniary program | |
Shen et al. | Investment time series prediction using a hybrid model based on RBMs and pattern clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200410 |
|
RJ01 | Rejection of invention patent application after publication |