CN110751400B

CN110751400B - Risk assessment method and device

Info

Publication number: CN110751400B
Application number: CN201911006993.2A
Authority: CN
Inventors: 马子俊
Original assignee: Puxin Hengye Technology Development Beijing Co ltd; Yiren Hengye Technology Development Beijing Co ltd
Current assignee: Puxin Hengye Technology Development Beijing Co ltd; Yiren Hengye Technology Development Beijing Co ltd
Priority date: 2019-10-22
Filing date: 2019-10-22
Publication date: 2022-08-02
Anticipated expiration: 2039-10-22
Also published as: CN110751400A

Abstract

The invention provides a risk assessment method and a device, and the method comprises the following steps: grouping data sources according to the risk information quantity of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables; constructing a first risk assessment model according to the weak correlation variable group; performing predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain a sample group with the highest negative sample proportion; constructing a second risk assessment model according to the sample group with the highest negative sample proportion and the strong correlation variable group; and performing risk assessment by using the first risk assessment model and the second risk assessment model. The method solves the problem of low model prediction efficiency caused by unbalance of positive and negative samples, and improves the prediction efficiency of the risk assessment model.

Description

Risk assessment method and device

Technical Field

The invention relates to the technical field of risk control, in particular to a risk assessment method and a risk assessment device.

Background

Risk assessment is the quantification of risk and is a critical technique in risk management. At present, risk assessment is generally carried out by modeling. Building a model mainly involves data extraction, feature generation, feature selection, algorithm model generation, and rationality assessment.

As data source channels become richer, more and more data fields can be used as risk feature variables. Because not every risk feature field holds a valid value in every sample, missing values are inevitable, and the problem becomes more serious as the number of feature fields grows.

When the data are generally sparse, that is, when many risk feature fields are missing values, performing feature selection and subsequent modeling by traditional means yields a model with low prediction efficiency, and risk assessments made with that model have low accuracy.

Disclosure of Invention

In view of this, the present invention provides a risk assessment method and apparatus to improve the prediction efficiency of the model.

In order to achieve the above purpose, the invention provides the following specific technical scheme:

a method of risk assessment, comprising:

grouping data sources according to the risk information quantity of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables;

constructing a first risk assessment model according to the weak correlation variable group;

performing predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain a sample group with the highest negative sample proportion;

constructing a second risk assessment model according to the sample group with the highest negative sample proportion and the strong correlation variable group;

and performing risk assessment by using the first risk assessment model and the second risk assessment model.

Optionally, before constructing the first risk assessment model according to the weak correlation variable group, the method further includes:

and respectively carrying out noise reduction processing on the strong correlation variable group and the weak correlation variable group.

Optionally, the performing, by using the first risk assessment model, predictive probability classification on a full-scale sample only including the weak correlation variable to obtain a sample group with a highest negative sample proportion includes:

performing predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain the probability that each sample in the full-scale samples only containing the weak correlation variables is a negative sample;

and dividing the full samples only containing the weak correlation variables into a sample group with the highest proportion of negative samples and a sample group with the lowest proportion of negative samples according to a preset dividing point and the probability that each sample in the full samples only containing the weak correlation variables is a negative sample.

Optionally, the method further includes:

and calculating the optimal value of the segmentation point by adopting a preset optimization algorithm by taking the highest prediction accuracy of the positive sample and the negative sample as an optimization target.

Optionally, the performing risk assessment by using the first risk assessment model and the second risk assessment model includes:

performing risk assessment by using the first risk assessment model to obtain a first risk assessment value;

performing risk assessment by using the second risk assessment model to obtain a second risk assessment value;

determining a maximum of the first risk assessment value and the second risk assessment value as a final risk assessment value.

A risk assessment device comprising:

the variable group dividing unit is used for grouping the data sources according to the risk information amount of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables;

the first model building unit is used for building a first risk assessment model according to the weak correlation variable group;

the probability classification unit is used for performing prediction probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain a sample group with the highest negative sample proportion;

the second model building unit is used for building a second risk assessment model according to the sample group with the highest negative sample proportion and the strong correlation variable group;

and the risk evaluation unit is used for carrying out risk evaluation by utilizing the first risk evaluation model and the second risk evaluation model.

Optionally, the apparatus further comprises:

and the noise reduction processing unit is used for respectively carrying out noise reduction processing on the strong correlation variable group and the weak correlation variable group.

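Optionally, the probability classification unit is specifically configured to:

perform predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain the probability that each sample in the full-scale samples only containing the weak correlation variables is a negative sample;

and divide the full-scale samples only containing the weak correlation variables into a sample group with the highest proportion of negative samples and a sample group with the lowest proportion of negative samples according to a preset dividing point and the probability that each sample is a negative sample.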

Optionally, the apparatus further comprises:

and the division point setting unit is used for calculating the optimal value of the division point by adopting a preset optimization algorithm by taking the highest prediction accuracy of the positive sample and the negative sample as an optimization target.

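Optionally, the risk assessment unit is specifically configured to:

perform risk assessment by using the first risk assessment model to obtain a first risk assessment value;

perform risk assessment by using the second risk assessment model to obtain a second risk assessment value;

and determine the maximum of the first risk assessment value and the second risk assessment value as a final risk assessment value.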

Compared with the prior art, the invention has the following beneficial effects:

The invention discloses a risk assessment method. First, data sources are grouped according to the risk information amount of the data to obtain a strong correlation variable group and a weak correlation variable group. Then, a first risk assessment model is constructed according to the weak correlation variable group, and the first risk assessment model is used to perform predictive probability classification on the full-scale samples containing only the weak correlation variables, yielding a sample group with the highest negative sample proportion. A second risk assessment model is constructed according to the sample group with the highest negative sample proportion and the strong correlation variable group; because its training data are this sample group and the strong correlation variable group, the training data contain fewer missing values, and the prediction efficiency of the second risk assessment model built on them is higher. Finally, risk assessment is performed using the first risk assessment model and the second risk assessment model, thereby improving the accuracy of the risk assessment.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a schematic flow chart of a risk assessment method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a risk assessment apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

To address the problems of sparse variables and unbalanced positive and negative samples when constructing a risk assessment model, this embodiment provides a risk assessment method that is applied to risk assessment scenarios such as loan risk assessment. Referring to Fig. 1, the risk assessment method specifically includes the following steps:

S101: grouping data sources according to the risk information quantity of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables;

Here, the higher the risk information amount of the data, the higher the correlation between the data and the risk assessment object; conversely, the lower the risk information amount, the lower the correlation. For example, if the distribution of the customer card opening amount is concentrated in a certain range, the information amount of the customer card opening amount data decreases accordingly, to the point where the correlation between that data and the loan risk may be low. It should be noted that the information amount of the data and the statistical distribution of the data are not in a direct linear relationship: when the data distribution is complex but concentrated, the information amount of the data can still be large.

The data source contains various kinds of variable data. The data source is grouped according to a preset grouping rule and the risk information amount of the variable data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables. In the above example, if the customer card opening amount is concentrated within a certain range, the customer card opening amount data is assigned to the weak correlation variable group; otherwise, it is assigned to the strong correlation variable group. This process is generally performed during exploratory data analysis.
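The grouping step can be sketched as follows. The text does not name a specific measure of "risk information amount" or a specific grouping rule, so this sketch assumes mutual information between a discretized variable and the default label, with a hypothetical threshold; the column names and data are illustrative only.

```python
import math
from collections import Counter

def mutual_information(values, labels):
    """Mutual information (in bits) between a discrete variable and a label.

    Assumption: this stands in for the 'risk information amount' of a variable.
    """
    n = len(values)
    joint = Counter(zip(values, labels))
    px = Counter(values)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def group_variables(columns, labels, threshold):
    """Split variable names into (strong, weak) groups using a hypothetical threshold."""
    strong, weak = [], []
    for name, values in columns.items():
        if mutual_information(values, labels) >= threshold:
            strong.append(name)
        else:
            weak.append(name)
    return strong, weak

# Illustrative data: label 1 = default, 0 = non-default.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
columns = {
    "card_count_bucket": ["high", "high", "low", "low", "high", "low", "low", "low"],
    "region": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
strong, weak = group_variables(columns, labels, threshold=0.1)
```

In this toy data, `card_count_bucket` carries nearly one bit of information about the label and lands in the strong group, while `region` carries very little and lands in the weak group. Any other information measure could be substituted without changing the structure of the step.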

To facilitate subsequent processing, noise reduction can be performed on the strong correlation variable group and the weak correlation variable group, which increases the continuity of the variables.

Optionally, a deep learning autoencoder may be used to denoise the strong correlation variable group and the weak correlation variable group.

Denoising the two variable groups with a deep learning autoencoder is an encoding (encoder) and decoding (decoder) process carried out by a neural network.

The neural network model comprises an input layer (input), an intermediate layer (code), a decoding layer (decoder), and an output layer (output). Taking a variable X as an example, the neural network transforms X into Z, where Z is the output of the intermediate layer, and the decoder then maps Z to the output X′. Overall, the optimization objective of this neural network is to minimize:

$$\mathrm{Distance}(X, X') = \lVert X - X' \rVert_2$$

The optimization mainly uses gradient descent, and the details are not repeated here.
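For reference, the encode, decode, and update steps described above can be written out as below. The symbols $f_{\theta}$ (encoder), $g_{\phi}$ (decoder), and $\eta$ (learning rate) are notation assumed here for illustration and do not appear in the original text.

```latex
\begin{aligned}
Z &= f_{\theta}(X), \qquad X' = g_{\phi}(Z),\\
(\theta^{*}, \phi^{*}) &= \arg\min_{\theta,\,\phi}\ \lVert X - g_{\phi}(f_{\theta}(X)) \rVert_{2},\\
\theta &\leftarrow \theta - \eta\,\nabla_{\theta}\lVert X - X' \rVert_{2}, \qquad
\phi \leftarrow \phi - \eta\,\nabla_{\phi}\lVert X - X' \rVert_{2}.
\end{aligned}
```

After training, the reconstruction X′ would serve as the denoised version of the variable X.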

S102: constructing a first risk assessment model according to the weak correlation variable group;

In practice, the algorithm for constructing the first risk assessment model may be selected as required, for example XGBoost.

S103: performing predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain a sample group with the highest negative sample proportion;

Positive and negative samples refer to the risk assessment outcome. For example, in the risk assessment a default customer is recorded as 1 and a non-default customer as 0, so a negative sample is labeled 1 and a positive sample is labeled 0.

Specifically, the first risk assessment model is used to perform predictive probability classification on the full-scale samples containing only the weak correlation variables to obtain, for each of these samples, the probability that it is a negative sample. The samples are then divided by a preset segmentation point.

Let K denote the segmentation point. Samples whose probability of being a negative sample is greater than or equal to K are assigned to the sample group with the highest negative sample proportion, since the proportion of negative samples among them is higher; samples whose probability of being a negative sample is less than K are assigned to the sample group with the lowest negative sample proportion.
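The division by the segmentation point K can be sketched as below. The function name and the sample probabilities are illustrative assumptions; the probabilities stand for the first model's predicted probability that each sample is a negative (default) sample.

```python
def split_by_cut_point(neg_probs, k):
    """Return (high_group, low_group) as lists of sample indices.

    high_group: samples with P(negative) >= k (highest negative proportion).
    low_group:  samples with P(negative) <  k (lowest negative proportion).
    """
    high = [i for i, p in enumerate(neg_probs) if p >= k]
    low = [i for i, p in enumerate(neg_probs) if p < k]
    return high, low

# Illustrative probabilities from a hypothetical first model.
neg_probs = [0.91, 0.12, 0.55, 0.40, 0.78]
high, low = split_by_cut_point(neg_probs, k=0.5)
```

Every sample falls into exactly one of the two groups, so the split is a partition of the full-scale samples.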

To make the preset segmentation point K more reasonable, the optimal value of the segmentation point is calculated with a preset optimization algorithm, taking the highest prediction accuracy for positive and negative samples as the optimization target.

First, a confusion matrix is introduced, as shown in table 1.

TABLE 1

Once the segmentation point K is determined, the first risk assessment model constructed from the weak correlation variable group divides the samples into two classes of predictions: those predicted as positive samples and those predicted as negative samples. Among the samples predicted as negative, the proportion of true negative samples is significantly increased, while among the samples predicted as positive, positive samples are the majority. The optimization goal is therefore as follows:

where a and b are coefficients that must be supplied in practice. K may be determined with any of several optimization methods: a discrete optimization algorithm may be used, or, when the sample set is not large, a simple traversal may be performed.
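A simple traversal over candidate segmentation points might look like the sketch below. The objective used here is an assumption, not the formula of the original: it weights the share of true negatives among samples predicted negative by a, and the share of true positives among samples predicted positive by b. The function names and data are illustrative.

```python
def confusion_counts(neg_probs, labels, k):
    """Count outcomes at cut point k; label 1 = negative (default), 0 = positive."""
    counts = {"neg_as_neg": 0, "pos_as_neg": 0, "neg_as_pos": 0, "pos_as_pos": 0}
    for p, y in zip(neg_probs, labels):
        if p >= k:  # predicted negative
            counts["neg_as_neg" if y == 1 else "pos_as_neg"] += 1
        else:       # predicted positive
            counts["neg_as_pos" if y == 1 else "pos_as_pos"] += 1
    return counts

def objective(counts, a, b):
    """Assumed objective: a * purity of predicted negatives + b * purity of predicted positives."""
    pred_neg = counts["neg_as_neg"] + counts["pos_as_neg"]
    pred_pos = counts["neg_as_pos"] + counts["pos_as_pos"]
    neg_purity = counts["neg_as_neg"] / pred_neg if pred_neg else 0.0
    pos_purity = counts["pos_as_pos"] / pred_pos if pred_pos else 0.0
    return a * neg_purity + b * pos_purity

def best_cut_point(neg_probs, labels, a, b):
    """Traverse every observed probability as a candidate K and keep the best one."""
    candidates = sorted(set(neg_probs))
    return max(
        candidates,
        key=lambda k: objective(confusion_counts(neg_probs, labels, k), a, b),
    )

# Illustrative scores and labels.
neg_probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
k_best = best_cut_point(neg_probs, labels, a=2.0, b=1.0)
```

With a weighted twice as heavily as b, the traversal favors a cut point that keeps the predicted-negative group pure. Traversal costs one pass over the samples per candidate, which is why it suits small sample sets, as noted above.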

S104: constructing a second risk assessment model according to the sample group with the highest negative sample proportion and the strong correlation variable group;

In practice, the algorithm for constructing the second risk assessment model may be selected as required, for example XGBoost.

The algorithms for constructing the first risk assessment model and the second risk assessment model may be the same or different.

The above process does not directly use undersampling to increase the proportion of negative samples in the data source. Instead, it first divides the data source into a strong correlation variable group and a weak correlation variable group, and then divides the full-scale samples containing only weak correlation variables into a sample group with the highest negative sample proportion and a sample group with the lowest negative sample proportion. On this basis, the model probability of the second risk assessment model, constructed from the sample group with the highest negative sample proportion and the strong correlation variable group, is the natural probability. This prevents the introduction of human error to a certain extent and prevents the model overfitting that undersampling can cause.
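The two-stage training flow of S102 to S104 can be sketched as follows. The learner here is a deliberately trivial stand-in (observed default rate per feature tuple), not XGBoost or any algorithm named in the text; `RateModel`, `train_two_stage`, and the data are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

class RateModel:
    """Stand-in learner: predicts the observed negative (default) rate per feature tuple."""

    def fit(self, rows, labels):
        totals, hits = defaultdict(int), defaultdict(int)
        for row, y in zip(rows, labels):
            totals[row] += 1
            hits[row] += y
        self.rates = {key: hits[key] / totals[key] for key in totals}
        self.base = sum(labels) / len(labels)  # fallback for unseen tuples
        return self

    def predict(self, rows):
        return [self.rates.get(row, self.base) for row in rows]

def train_two_stage(weak_rows, strong_rows, labels, k):
    """Train model 1 on weak variables, select samples with P(negative) >= k,
    then train model 2 on the strong variables of the selected samples only."""
    model1 = RateModel().fit(weak_rows, labels)
    neg_probs = model1.predict(weak_rows)
    selected = [i for i, p in enumerate(neg_probs) if p >= k]
    if not selected:
        raise ValueError("no samples at or above the cut point k")
    model2 = RateModel().fit(
        [strong_rows[i] for i in selected],
        [labels[i] for i in selected],
    )
    return model1, model2, selected

# Illustrative data: label 1 = negative (default), 0 = positive.
weak_rows = [("a",), ("a",), ("a",), ("b",), ("b",), ("b",)]
strong_rows = [("x",), ("y",), ("x",), ("x",), ("y",), ("y",)]
labels = [1, 0, 1, 0, 0, 0]
model1, model2, selected = train_two_stage(weak_rows, strong_rows, labels, k=0.5)
```

In this toy data the negative share rises from 2 of 6 in the full set to 2 of 3 in the selected group, without discarding or resampling any rows inside that group, which is the contrast with undersampling drawn above.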

S105: and performing risk assessment by using the first risk assessment model and the second risk assessment model.

Specifically, risk assessment is performed by using the first risk assessment model to obtain a first risk assessment value; performing risk assessment by using the second risk assessment model to obtain a second risk assessment value; determining a maximum of the first risk assessment value and the second risk assessment value as a final risk assessment value.

$$P_{\mathrm{final}}(x) = \max\{P_{\mathrm{model1}}(x),\ P_{\mathrm{model2}}(x)\}$$

where $P_{\mathrm{model1}}(x)$ represents the first risk assessment value, $P_{\mathrm{model2}}(x)$ represents the second risk assessment value, and max denotes the larger of the two.
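The combination rule is small enough to state directly in code. The function name and the input scores below are illustrative assumptions; the two lists stand for the outputs of the first and second risk assessment models on the same samples.

```python
def final_risk_scores(p_model1, p_model2):
    """Element-wise maximum of the two models' risk assessment values."""
    if len(p_model1) != len(p_model2):
        raise ValueError("both models must score the same samples")
    return [max(p1, p2) for p1, p2 in zip(p_model1, p_model2)]

# Illustrative scores for three samples.
p_final = final_risk_scores([0.2, 0.7, 0.4], [0.5, 0.1, 0.4])
```

Taking the maximum is the conservative choice: a sample is treated as risky if either model considers it risky.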

According to the risk assessment method disclosed in this embodiment, data sources are first grouped according to the risk information amount of the data to obtain a strong correlation variable group and a weak correlation variable group. Then, a first risk assessment model is constructed according to the weak correlation variable group, and the first risk assessment model is used to perform predictive probability classification on the full-scale samples containing only the weak correlation variables, yielding a sample group with the highest negative sample proportion. A second risk assessment model is constructed according to the sample group with the highest negative sample proportion and the strong correlation variable group; because its training data are this sample group and the strong correlation variable group, the training data contain fewer missing values, and the prediction efficiency of the second risk assessment model built on them is higher. Finally, risk assessment is performed using the first risk assessment model and the second risk assessment model, thereby improving the accuracy of the risk assessment.

Based on the risk assessment method disclosed in the above embodiment, this embodiment discloses a risk assessment apparatus. Referring to Fig. 2, the apparatus includes:

a variable group dividing unit 201, configured to group data sources according to the risk information amount of the data to obtain a strong correlation variable group including strong correlation variables and a weak correlation variable group including weak correlation variables;

a first model building unit 202, configured to build a first risk assessment model according to the weak correlation variable group;

a probability classification unit 203, configured to perform predictive probability classification on a full-scale sample that only includes the weak correlation variable by using the first risk assessment model, so as to obtain a sample group with a highest negative sample proportion;

a second model building unit 204, configured to build a second risk assessment model according to the sample group with the highest negative sample proportion and the strong correlation variable group;

a risk assessment unit 205 configured to perform risk assessment using the first risk assessment model and the second risk assessment model.

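Optionally, the apparatus further comprises:

a noise reduction processing unit, configured to perform noise reduction processing on the strong correlation variable group and the weak correlation variable group respectively.

Optionally, the probability classification unit is specifically configured to:

perform predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain the probability that each sample in the full-scale samples only containing the weak correlation variables is a negative sample;

and divide the full-scale samples only containing the weak correlation variables into a sample group with the highest proportion of negative samples and a sample group with the lowest proportion of negative samples according to a preset dividing point and the probability that each sample is a negative sample.

Optionally, the apparatus further comprises:

a division point setting unit, configured to calculate the optimal value of the division point by adopting a preset optimization algorithm, taking the highest prediction accuracy of the positive sample and the negative sample as an optimization target.

Optionally, the risk assessment unit is specifically configured to:

perform risk assessment by using the first risk assessment model to obtain a first risk assessment value;

perform risk assessment by using the second risk assessment model to obtain a second risk assessment value;

and determine the maximum of the first risk assessment value and the second risk assessment value as a final risk assessment value.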

According to the risk assessment device disclosed in this embodiment, data sources are first grouped according to the risk information amount of the data to obtain a strong correlation variable group and a weak correlation variable group. Then, a first risk assessment model is constructed according to the weak correlation variable group, and the first risk assessment model is used to perform predictive probability classification on the full-scale samples containing only the weak correlation variables, yielding a sample group with the highest negative sample proportion. A second risk assessment model is constructed according to the sample group with the highest negative sample proportion and the strong correlation variable group; because its training data are this sample group and the strong correlation variable group, the training data contain fewer missing values, and the prediction efficiency of the second risk assessment model built on them is higher. Finally, risk assessment is performed using the first risk assessment model and the second risk assessment model, thereby improving the accuracy of the risk assessment.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is brief, and the relevant points can be found in the description of the method.

It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A risk assessment method is applied to a loan risk assessment scene, and comprises the following steps:

grouping data sources according to the risk information quantity of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables; the data source comprises variable data, and the variable data comprises customer card opening quantity data;

performing risk assessment using the first risk assessment model and the second risk assessment model;

the performing predictive probability classification on the full-scale samples only containing the weak correlation variables by using the first risk assessment model to obtain a sample group with the highest negative sample proportion includes:

2. The method of claim 1, wherein prior to said constructing a first risk assessment model from said set of weakly-relevant variables, said method further comprises:

3. The method of claim 1, further comprising:

4. The method of claim 1, wherein said performing a risk assessment using said first risk assessment model and said second risk assessment model comprises:

5. A risk assessment apparatus for use in a loan risk assessment scenario, the apparatus comprising:

the variable group dividing unit is used for grouping the data sources according to the risk information amount of the data to obtain a strong correlation variable group comprising strong correlation variables and a weak correlation variable group comprising weak correlation variables; the data source comprises variable data, and the variable data comprises customer card opening quantity data;

a risk assessment unit for performing risk assessment using the first risk assessment model and the second risk assessment model;

wherein, the probability classification unit is specifically configured to:

6. The apparatus of claim 5, further comprising:

7. The apparatus of claim 5, further comprising:

8. The device according to claim 5, wherein the risk assessment unit is specifically configured to: