CN117422545A - Credit risk identification method, apparatus, device and storage medium

Info

Publication number: CN117422545A
Application number: CN202311639295.2A
Authority: CN
Inventors: 伏峰; 杨晓旗; 刘世尧; 蔡宇笙; 俞泱
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Priority date: 2023-12-01
Filing date: 2023-12-01
Publication date: 2024-01-19

Abstract

The disclosure provides a credit risk identification method, apparatus, device and storage medium, which can be applied to the technical field of data processing. The method comprises: acquiring customer information; performing a preprocessing operation on the customer information and extracting a plurality of first feature data from it; calculating the importance of each of the plurality of first feature data and determining a target feature quantity based on the importance, the target features being used to predict the risk level of the customer; screening target features from the plurality of first feature data based on the target feature quantity; and passing the target features into a risk identification model, so that the risk identification model performs credit risk identification on the current customer based on the target features and obtains the risk level of the customer.

Description

Credit risk identification method, apparatus, device and storage medium

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a credit risk identification method, apparatus, device, medium, and program product.

Background

As credit financing services become more frequent, the requirements for customer risk control keep rising, and a scientific, systematic and efficient method for customer risk identification is urgently needed. In the related art, credit risk identification for a customer is usually performed by passing relevant features from the customer information into a risk identification model, which then identifies the customer's risk based on those model-input features. The selection of the model-input features often affects the accuracy of the final risk identification result, so how to select them accurately is a problem that technicians must consider.

In the prior art, the model-input features are generally determined by a cutoff method or an exhaustive method. In the cutoff method, the model-input features are chosen manually according to the gaps between feature importance values; because the choice depends on the subjective experience of technicians, it is not objective enough, and inaccurate feature selection easily leads to low accuracy in identifying customer credit risk. In the exhaustive method, all candidate features are tried one by one as model-input features, which is computationally expensive and inefficient.

Disclosure of Invention

In view of the foregoing, the present disclosure provides credit risk identification methods, apparatus, devices, media, and program products.

According to a first aspect of the present disclosure, there is provided a credit risk identification method comprising: acquiring customer information; and performing the following operations on the customer information: performing a preprocessing operation on the customer information and extracting a plurality of first feature data from it; calculating the importance of each of the plurality of first feature data and determining a target feature quantity based on the importance, the target features being used to predict the risk level of the customer; screening target features from the plurality of first feature data based on the target feature quantity; and passing the target features into a risk identification model, so that the risk identification model performs credit risk identification on the current customer based on the target features and obtains the risk level of the customer.

According to an embodiment of the present disclosure, performing a preprocessing operation on customer information, acquiring a plurality of first feature data from the customer information, includes: extracting a plurality of feature data to be screened from the client information; calculating characteristic indexes of the plurality of characteristic data to be screened, and screening the plurality of characteristic data to be screened according to the characteristic indexes to obtain first characteristic data.

According to an embodiment of the present disclosure, calculating feature indexes of the plurality of feature data to be screened and screening the plurality of feature data to be screened according to the feature indexes to obtain the first feature data includes: calculating the information value of each feature data to be screened and the correlation coefficients among the feature data to be screened; screening a plurality of first feature data to be screened from the feature data to be screened according to the information value and the correlation coefficients; calculating a population stability index (PSI) value, a missing rate, a number of feature values and an AUC value for each first feature data to be screened; and screening a plurality of first feature data from the plurality of first feature data to be screened according to the population stability index value, the missing rate, the number of feature values and the AUC value.

According to an embodiment of the present disclosure, calculating the information value of each feature data to be screened and the correlation coefficients among the feature data to be screened includes: selecting, from the feature data to be screened, those whose information value is greater than a threshold; and calculating the correlation coefficients among the selected feature data to be screened.

According to an embodiment of the present disclosure, a method for screening a plurality of first feature data to be screened from a plurality of feature data to be screened according to information value and a correlation coefficient includes: under the condition that the correlation coefficient between the feature data to be screened and other feature data to be screened is smaller than a threshold value, determining the feature data to be screened as first feature data to be screened; and under the condition that the correlation coefficient between the feature data to be screened and other feature data to be screened is larger than a threshold value, determining the feature data to be screened with the maximum information value in the feature data to be screened as first feature data to be screened.

According to an embodiment of the present disclosure, calculating the importance of each of the plurality of first feature data and determining the target feature quantity based on the importance includes: performing the following operations on each first feature data: passing the first feature data into a gradient boosting decision tree model, and obtaining a plurality of importance values for the first feature data by adjusting the hyperparameters of the gradient boosting decision tree model; constructing an importance matrix of the first feature data from its importance values; and determining the target feature quantity based on the importance matrices of the plurality of first feature data.

According to an embodiment of the present disclosure, determining the number of target features based on the importance matrices of the plurality of first feature data includes: respectively calculating first variance corresponding to a plurality of first feature data and comprehensive variance corresponding to all the first feature data based on the feature importance matrix; and determining the target feature quantity according to the magnitude relation between the first variance and the comprehensive variance.

According to an embodiment of the present disclosure, calculating, based on the feature importance matrix, the first variance corresponding to each of the plurality of first feature data and the comprehensive variance corresponding to all the first feature data includes: calculating the feature value of the first feature data in each feature importance matrix; calculating the first variance corresponding to the first feature data from the plurality of feature values of that first feature data; and calculating the comprehensive variance corresponding to all the first feature data from the feature values of all the first feature data.

A second aspect of the present disclosure provides a credit risk identification device, comprising: the acquisition module is used for acquiring the client information; the preprocessing module is used for executing preprocessing operation on the client information and extracting a plurality of first characteristic data from the client information; the determining module is used for respectively calculating the importance degrees of the first characteristic data and determining the number of target characteristics based on the importance degrees; the target features are used for predicting the risk level of the client; the screening module is used for screening target features from the plurality of first feature data based on the target feature quantity; and the identification module is used for transmitting the target characteristics into the risk identification model, so that the risk identification model carries out credit risk identification on the current client based on the target characteristics, and the risk grade of the client is obtained.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.

A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.

According to the credit risk identification method, apparatus, device, medium and program product provided by the present disclosure, the target feature quantity and the target features finally passed into the model are determined based on the customer information, which improves the accuracy of the customer risk identification result. Determining the target feature quantity and the target features based on the customer information comprises: extracting first feature data from a plurality of feature data to be screened in the customer information, determining the target feature quantity by calculating the importance of the first feature data, and screening the target features to be used as model input from the first feature data based on the target feature quantity. This processing effectively ensures the quality of the final model-input features. It avoids the prior-art problems of low accuracy and long processing time in credit risk identification that arise when model-input features are selected inaccurately because of technicians' subjective experience, or when the selection method is overly complex. It thereby improves the efficiency of credit risk identification while ensuring the accuracy of the result.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario diagram of credit risk identification methods, apparatus, devices, media and program products according to embodiments of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a credit risk identification method according to an embodiment of the disclosure;

FIG. 3 schematically illustrates a flow chart for obtaining a plurality of first feature data from customer information according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow chart of screening a plurality of feature data to be screened based on a feature index according to an embodiment of the disclosure;

FIG. 5 schematically illustrates a flowchart of screening first feature data to be screened according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a flow chart of determining a target feature quantity based on a first feature importance in accordance with an embodiment of the disclosure;

FIG. 7 schematically illustrates a block diagram of a credit risk identification device according to an embodiment of the disclosure; and

FIG. 8 schematically illustrates a block diagram of an electronic device adapted to implement a credit risk identification method according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where expressions like "at least one of A, B and C, etc." are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the data involved (including but not limited to users' personal information) all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.

The embodiment of the disclosure provides a credit risk identification method, which comprises the following steps: acquiring client information; the following operations are performed on the customer information: performing preprocessing operation on the client information, and extracting a plurality of first characteristic data from the client information; respectively calculating importance degrees of a plurality of first feature data, and determining the number of target features based on the importance degrees; the target features are used for predicting the risk level of the client; screening target features from the plurality of first feature data based on the target feature quantity; and transmitting the target characteristics into a risk identification model, so that the risk identification model carries out credit risk identification on the current client based on the target characteristics, and the risk grade of the client is obtained.

Fig. 1 schematically illustrates an application scenario diagram of a credit risk identification method according to an embodiment of the present disclosure.

As shown in fig. 1, the application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.

The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the credit risk identification method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the credit risk recognition device provided by the embodiments of the present disclosure may be generally provided in the server 105. The credit risk identification method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105. Accordingly, the credit risk identifying apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster, which is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The credit risk recognition method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 6 based on the scenario described in fig. 1.

Fig. 2 schematically illustrates a flow chart of a credit risk identification method according to an embodiment of the disclosure.

As shown in fig. 2, the credit risk identification method of this embodiment includes operations S210 to S250.

In operation S210, customer information is acquired.

In some embodiments, the customer information may include, for example, basic customer information (e.g., the customer's age, gender, education, occupation, asset status, etc.), transaction information (e.g., the customer's consumption frequency, consumption amount, time of last consumption, investment preference, etc.), and customer risk information (e.g., the customer's credit information, number of overdue payments, etc.).

In operation S220, a preprocessing operation is performed on the client information, and a plurality of first feature data are extracted from the client information.

Customer information often contains a large number of feature data. If all of them are passed into the risk identification model as model-input features, the model's computational cost becomes high, and redundant or irrelevant features may lower the accuracy of the model's final result. Therefore, to accurately screen out the subset of feature data that is finally used as model input, the present disclosure proposes screening the feature data through a preprocessing operation and feature importance to determine the number of features finally passed into the model, and then screening target features for predicting the customer's risk level from the plurality of feature data based on that number, thereby achieving accurate selection of feature data.

In some embodiments, since the plurality of feature data to be screened contained in the customer information are of uneven quality, the feature data may first be screened by a preprocessing operation to filter out in advance the feature data that are unsuitable as model input.

Here, the preprocessing operation refers to screening the plurality of feature data to be screened based on preset feature indexes. The preset feature indexes are shown in the following table.

The plurality of feature data to be screened are screened using the feature indexes in the table to obtain a plurality of first feature data, where the number of first feature data is smaller than the number of feature data to be screened.

In operation S230, the importance degrees of the plurality of first feature data are calculated, respectively, and the target feature number is determined based on the importance degrees; wherein the target feature is used to predict a risk level of the customer.

In some embodiments, the first feature data are passed into a pre-trained gradient boosting decision tree model, and the importance of each first feature data is calculated from the gradient boosting decision tree. Importance is an index that measures the contribution of a feature to the model and can effectively reflect how important each feature is in the model. After the importance of the first feature data has been calculated, the variance of the first feature data is calculated from its importance, and the target feature quantity is then determined from the variance.
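The variance step can be illustrated with a minimal Python sketch. It assumes the importance values have already been collected into a matrix with one row per hyperparameter setting and one column per first feature data; the function name `variance_profile` and the sample numbers are hypothetical. The disclosure states that the target feature quantity follows from the magnitude relation between each per-feature variance and the comprehensive variance, but this passage does not spell out the exact rule, so the sketch only reports the two variances and which features fall below the comprehensive variance.

```python
from statistics import pvariance

def variance_profile(importance_rows):
    """importance_rows: one row per hyperparameter setting, one column per feature.

    Returns (per-feature variances, comprehensive variance, below-comprehensive flags).
    """
    n_features = len(importance_rows[0])
    # Variance of each feature's importance across hyperparameter settings.
    per_feature = [
        pvariance([row[j] for row in importance_rows]) for j in range(n_features)
    ]
    # Variance over every importance value in the matrix.
    comprehensive = pvariance([v for row in importance_rows for v in row])
    # ASSUMPTION: a simple "below the comprehensive variance" comparison; the
    # source does not state how the comparison maps to the target feature quantity.
    below = [v < comprehensive for v in per_feature]
    return per_feature, comprehensive, below

# Hypothetical importance matrix: 3 hyperparameter settings x 3 features.
rows = [
    [0.3, 0.3, 0.4],
    [0.3, 0.6, 0.1],
    [0.3, 0.0, 0.7],
]
per_feature, comprehensive, below = variance_profile(rows)
```

In this made-up matrix the first feature's importance never changes across settings, so its variance is zero, while the other two fluctuate strongly.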

Determining the target feature quantity in this way has two benefits. On one hand, the model can perform efficient and accurate risk identification based on the selected features and obtain an accurate risk identification result; on the other hand, a principled way of determining the target feature quantity avoids both inaccurate model evaluation caused by too few features and excessive computation caused by feature redundancy.

In the prior art, the number of target features is typically determined by the cutoff method or the exhaustive method. The cutoff method ranks features by importance and takes the top N as model-input features, where N is usually set by the technician's subjective experience. The exhaustive method passes each feature into the model in turn and selects features based on the model results. Compared with these approaches, determining the target feature quantity based on the importance and variance of the first feature data avoids the influence of technicians' subjective experience and also makes the determination more efficient, so that the target feature quantity can be determined quickly and accurately.

In operation S240, a target feature is selected from the plurality of first feature data based on the target feature quantity.

In some embodiments, the plurality of first feature data are screened according to the determined target feature quantity, and the target features to be used as model input are selected from the first feature data. Screening the model-input features reduces the running time of the model algorithm and increases the interpretability of the model. Specifically, it reduces the feature dimensionality; lowers the difficulty of the learning task and improves the efficiency of the model; strengthens the generalization ability of the model and reduces overfitting; and improves the understanding of features and feature values, which improves the accuracy of the model.

In a specific implementation, a technician can screen out target features matching the target feature quantity from the plurality of first feature data using, for example, the Pearson correlation coefficient, mutual information, the maximal information coefficient, or the distance correlation coefficient. The present disclosure does not limit the specific way of selecting target features; a technician can choose a method according to the actual situation, so as to screen out irrelevant and redundant features from the first feature data and obtain concise and effective target features.
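As one illustration of the options listed above, the following Python sketch ranks features by the absolute Pearson correlation between each feature column and the risk label, then keeps the top N. It is a sketch under assumptions rather than the disclosed procedure: the helper names `pearson` and `select_top_n`, the numeric encoding of features, and the 0/1 label are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)

def select_top_n(features, labels, n):
    """features: {name: column}; returns the n names most correlated with labels."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical data: 4 customers, label 1 = high risk.
features = {
    "overdue_count": [1, 2, 3, 4],
    "constant_flag": [1, 1, 1, 1],
    "noise": [4, 1, 3, 2],
}
selected = select_top_n(features, [0, 0, 1, 1], n=1)
```

Here `n` stands for the target feature quantity determined in operation S230; mutual information or another of the listed measures could replace `pearson` without changing the ranking logic.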

In operation S250, the target feature is transferred into the risk recognition model, so that the risk recognition model performs credit risk recognition on the current client based on the target feature, and the risk level of the client is obtained.

In some embodiments, the screened target features are transmitted into a risk recognition model, the risk classification of the current client and probability information of each risk classification are calculated by the risk recognition model based on the target features, and then the client risk level is determined according to the calculated risk classification and probability information.
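The final mapping from per-class probabilities to a risk level might look like the following Python sketch. The level names and the "pick the most probable class" rule are assumptions made for illustration; the disclosure only states that the risk level is determined from the risk classes and their probabilities.

```python
def risk_level(probabilities):
    """probabilities: {risk class: probability} as produced by a classifier.

    ASSUMPTION: the customer's risk level is the class with the highest probability.
    """
    level, probability = max(probabilities.items(), key=lambda item: item[1])
    return level, probability

# Hypothetical model output for one customer.
level, probability = risk_level({"low": 0.2, "medium": 0.5, "high": 0.3})
```

A deployment could instead apply stricter thresholds to the higher-risk classes; that choice is outside what this passage specifies.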

Because the target features passed into the risk identification model are obtained through preprocessing, they are closely related to customer risk identification. This can effectively improve the running speed and evaluation accuracy of the model, and thus the efficiency and accuracy of customer credit risk identification.

According to the customer risk identification method provided by the present disclosure, the customer information is preprocessed so that model-input features are selected effectively, which further improves the accuracy of the customer risk identification result. Preprocessing the customer information comprises: extracting first feature data from a plurality of feature data to be screened, determining the target feature quantity by calculating the importance of the first feature data, and screening the target features to be used as model input from the first feature data based on the target feature quantity. This processing can scientifically and efficiently screen out irrelevant and redundant features from the feature data, ensure the quality of the final model-input features, reduce cases in which redundant and irrelevant features cause high computational cost and inaccurate evaluation results in the risk identification model, and improve the accuracy and flexibility of customer risk identification.

Fig. 3 schematically illustrates a flowchart for obtaining a plurality of first feature data from customer information according to an embodiment of the present disclosure.

As shown in fig. 3, the acquisition of the plurality of first feature data from the client information of this embodiment includes operations S310 to S320.

In operation S310, a plurality of feature data to be filtered are extracted from the customer information.

In some embodiments, a text data feature extraction method may be used to extract a plurality of feature data to be screened from the customer information. The text data feature extraction method may include One-Hot encoding, bag-of-words (BoW), TF-IDF, and the like.
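For instance, One-Hot encoding of a single categorical field can be sketched in Python as follows. The function name `one_hot` and the sample education values are hypothetical, and a real pipeline would typically rely on a library encoder rather than this hand-written version.

```python
def one_hot(values):
    """Encode a list of categorical values as 0/1 indicator rows.

    Returns (categories, rows), where rows[i][j] == 1 exactly when
    values[i] == categories[j].
    """
    categories = sorted(set(values))
    position = {category: j for j, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical education field for three customers.
categories, rows = one_hot(["bachelor", "master", "bachelor"])
```

Each resulting column is one candidate feature data to be screened in the later steps.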

In operation S320, feature indexes of the feature data to be screened are calculated, and the feature data to be screened are screened according to the feature indexes, so as to obtain first feature data.

In some embodiments, the feature data to be screened that are extracted by text data feature extraction are of uneven quality, and not all of them are suitable as model-input features. The present disclosure therefore proposes a preliminary screening of the feature data to be screened by calculating feature indexes, so as to filter out in advance some variables that are unsuitable for model input, save the computation time and cost of subsequently determining the target feature quantity and the target features, and improve the efficiency of customer credit risk identification.
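The feature indexes named in this disclosure include a missing (deletion) rate and a population (group) stability index. A minimal Python sketch of these two indexes is given below, using the commonly used PSI formula, the sum over bins of (actual share minus expected share) times ln(actual share / expected share). The function names, the `eps` floor for empty bins, and the sample shares are assumptions; the disclosure does not give thresholds here, so none are shown.

```python
from math import log

def missing_rate(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(value is None for value in values) / len(values)

def population_stability_index(expected_shares, actual_shares, eps=1e-6):
    """PSI between two binned distributions given as per-bin shares.

    ASSUMPTION: shares below eps are floored to eps so that empty bins
    do not produce log(0) or division by zero.
    """
    total = 0.0
    for expected, actual in zip(expected_shares, actual_shares):
        expected, actual = max(expected, eps), max(actual, eps)
        total += (actual - expected) * log(actual / expected)
    return total

# Hypothetical data: one column with gaps, and bin shares from two samples.
rate = missing_rate([35, None, 42, None])
psi = population_stability_index([0.5, 0.5], [0.6, 0.4])
```

A PSI of zero means the two samples have identical bin shares; larger values indicate a larger shift in the feature's distribution.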

Fig. 4 schematically illustrates a flowchart of screening a plurality of feature data to be screened based on a feature index according to an embodiment of the present disclosure.

As shown in fig. 4, the filtering of the feature data to be filtered to obtain the first feature data in this embodiment includes operations S410 to S440.

In operation S410, information value of each feature data to be filtered and correlation coefficients among the feature data to be filtered are calculated.

In some embodiments, the information value of each feature data to be screened and the correlation coefficients among the feature data to be screened are calculated first. The information value (IV) measures the predictive power of a feature data; the larger the IV, the stronger the predictive power. The information value is the basis for selecting feature data: if the predictive power of a feature data does not meet the requirement, that feature data is filtered out directly. When the correlation coefficient between two feature data is large, the information they express is similar and they play the same role in model prediction. Closely correlated feature data can therefore be screened out through the correlation coefficient, which reduces the number of model-input feature data without affecting the prediction accuracy of the model and lowers the model's computational cost.
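A minimal Python sketch of the information value for one already-binned feature is shown below. It uses the conventional formula, the sum over bins of (good share minus bad share) times ln(good share / bad share). The function name, the 0/1 label convention, and the additive smoothing for empty bins are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import Counter
from math import log

def information_value(bins, labels, smoothing=0.5):
    """bins: bin label per customer; labels: 1 = bad (e.g. defaulted), 0 = good.

    ASSUMPTION: additive smoothing keeps bins with no good or no bad
    customers from producing log(0) or division by zero.
    """
    good = Counter(b for b, y in zip(bins, labels) if y == 0)
    bad = Counter(b for b, y in zip(bins, labels) if y == 1)
    all_bins = sorted(set(bins))
    total_good = sum(good.values()) + smoothing * len(all_bins)
    total_bad = sum(bad.values()) + smoothing * len(all_bins)
    iv = 0.0
    for b in all_bins:
        good_share = (good[b] + smoothing) / total_good
        bad_share = (bad[b] + smoothing) / total_bad
        iv += (good_share - bad_share) * log(good_share / bad_share)
    return iv

# Hypothetical feature that separates good and bad customers perfectly.
iv_strong = information_value(["a", "a", "b", "b"], [0, 0, 1, 1])
# Hypothetical feature with the same good/bad mix in every bin.
iv_none = information_value(["a", "a", "b", "b"], [0, 1, 0, 1])
```

The uninformative feature yields an IV of zero, while the perfectly separating one yields a clearly positive IV, which matches the reading that a larger IV means stronger predictive power.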

In a specific implementation process, calculating the correlation coefficients between the feature data to be screened comprises: first selecting, from the plurality of feature data to be screened, those whose information value is greater than a threshold, and then calculating the correlation coefficients among the selected feature data to be screened.
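The information value filter in operation S410 can be illustrated with a minimal sketch. It assumes each feature has already been discretised into bins and uses the common weight-of-evidence form of IV; the function names, the smoothing count `eps` and the threshold `0.02` are illustrative assumptions, not values given in this disclosure.

```python
import math
from collections import defaultdict


def information_value(bins, labels, eps=0.5):
    """Information value of one binned feature.

    bins:   bin label per customer (assumed already discretised)
    labels: 1 for a bad sample, 0 for a good sample
    eps:    smoothing count per cell to avoid log(0) (an assumption)
    """
    good = defaultdict(float)
    bad = defaultdict(float)
    for b, y in zip(bins, labels):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1
    keys = set(good) | set(bad)
    total_good = sum(good[k] for k in keys) + eps * len(keys)
    total_bad = sum(bad[k] for k in keys) + eps * len(keys)
    iv = 0.0
    for k in keys:
        pg = (good[k] + eps) / total_good  # share of good samples in bin k
        pb = (bad[k] + eps) / total_bad    # share of bad samples in bin k
        iv += (pg - pb) * math.log(pg / pb)
    return iv


def filter_by_iv(features, labels, threshold=0.02):
    """Keep only features whose IV exceeds the (hypothetical) threshold."""
    ivs = {name: information_value(col, labels) for name, col in features.items()}
    return {name: iv for name, iv in ivs.items() if iv > threshold}
```

Only the features returned by `filter_by_iv` would then enter the correlation calculation, which keeps the pairwise correlation step small.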

In operation S420, a plurality of first feature data to be screened is screened from the plurality of feature data to be screened according to the information value and the correlation coefficient.

Fig. 5 schematically shows a flowchart of screening first feature data to be screened according to an embodiment of the present disclosure.

As shown in fig. 5, the screening of the first feature data to be screened in this embodiment includes operations S421 to S422.

In operation S421, in the case where the correlation coefficient between the feature data to be screened and other feature data to be screened is smaller than the threshold, the feature data to be screened is determined as the first feature data to be screened.

When the correlation coefficients between a feature data and all other feature data are smaller than the threshold, no other feature data expresses similar information, and the feature data is directly determined as a first feature data to be screened.

In operation S422, in the case that the correlation coefficient between the feature data to be screened and other feature data to be screened is greater than the threshold, the feature data to be screened with the highest information value among them is determined as the first feature data to be screened.

When several feature data have correlation coefficients with one another that are greater than the threshold, they express similar information and play the same role in model prediction, so only the one with the highest information value among the similar feature data is selected as a first feature data to be screened.
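Operations S421 and S422 can be sketched as a greedy pass over the features in descending order of information value. This is one possible reading, not necessarily the procedure used in the disclosure: the use of the Pearson coefficient, its absolute value, and the threshold `0.8` are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated (an assumption)
    return cov / math.sqrt(vx * vy)


def select_by_correlation(columns, iv, threshold=0.8):
    """Keep a feature only if no already-kept feature is highly correlated.

    Visiting features from highest to lowest IV means that, within a group
    of mutually correlated features, the one with the highest IV survives.
    """
    kept = []
    for name in sorted(columns, key=lambda n: iv[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with an `income` column, an exact multiple of it, and a weakly related `age` column, only `income` (higher IV) and `age` would be kept.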

In operation S430, a population stability index value, a deletion rate, a number of feature values, and an AUC value of each first feature data to be screened are calculated.

In operation S440, a plurality of first feature data is selected from the plurality of first feature data to be screened according to the population stability index value, the deletion rate, the number of feature values and the AUC value.

In some embodiments, the population stability index (Population Stability Index, PSI) effectively reflects the stability of the data: the smaller the PSI value, the better the stability of the feature data. If the PSI value of a first feature data to be screened is greater than a threshold, the feature data is filtered out. The deletion rate represents the null (missing-value) rate of the first feature data to be screened; when it is greater than a threshold, the feature data is filtered out. When the number of feature values of the first feature data to be screened is smaller than a threshold, the feature data is filtered out. When the AUC value of the first feature data to be screened is smaller than a threshold, the feature data is filtered out. That is, a first feature data to be screened is determined as first feature data only when its population stability index value, deletion rate, number of feature values and AUC value all meet the requirements, which completes the screening of the feature data to be screened.
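The four indexes of operations S430 and S440 can be sketched as follows. The formulas are the commonly used definitions (PSI over bin proportions, AUC as a pairwise ranking probability); the function names and all threshold values are hypothetical, since the disclosure does not state them.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population stability index between two lists of bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = e + eps, a + eps  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


def missing_rate(values):
    """Share of null entries (the deletion rate in the text)."""
    return sum(v is None for v in values) / len(values)


def distinct_count(values):
    """Number of distinct non-null feature values."""
    return len({v for v in values if v is not None})


def auc(scores, labels):
    """Probability that a bad sample scores above a good one (ties count half).

    Assumes scores contain no missing values and both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def passes_filters(stats, max_psi=0.25, max_missing=0.9,
                   min_distinct=2, min_auc=0.55):
    """All four requirements must hold; every threshold here is hypothetical."""
    return (stats["psi"] <= max_psi
            and stats["missing"] <= max_missing
            and stats["distinct"] >= min_distinct
            and stats["auc"] >= min_auc)
```

A feature is retained as first feature data only if `passes_filters` returns `True` for its four statistics.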

Fig. 6 schematically illustrates a flowchart of determining a target feature quantity based on a first feature importance level according to an embodiment of the present disclosure.

As shown in fig. 6, determining the target feature quantity based on the importance of the first feature data in this embodiment includes operations S610 to S630.

In operation S610, the first feature data is passed into a gradient boosting decision tree (GBDT) model, and a plurality of importance degrees of the first feature data are obtained by adjusting hyperparameters of the gradient boosting decision tree model.

In some embodiments, the first feature data is passed into the gradient boosting decision tree model to obtain the feature importance fp_n corresponding to each first feature data filter_var_n.

In the gradient boosting decision tree model, feature importance is calculated from the amount by which each split point on a feature improves the performance metric within a single decision tree, weighted by the number of samples the node is responsible for. That is, the more a feature improves the performance metric at a node (i.e., the closer the node is to the root), the larger the corresponding weight, the more often the boosted trees select the feature, and the more important the feature is. The performance metric may be the Gini coefficient used to select split nodes, or another metric function. Finally, the results of a feature across all boosted trees are weighted, summed and averaged to obtain the importance of that feature. Feature importance measures the value of a feature in constructing the boosted decision trees of the model: the more a feature is used to build decision trees in the model, the more important it is.

The gradient boosting decision tree model is built in advance from training samples. The training samples may be historical customer information that includes good/bad sample identifiers. After operations S210 to S220 are performed on the training samples, the model is constructed with the first feature data in the training samples as independent variables and the good/bad sample identifier as the dependent variable, and the model produces a feature importance index for each first feature data.

In operation S620, an importance matrix of the first feature data is constructed according to the plurality of importance degrees of the first feature data.

In a specific implementation, in addition to calculating the importance fp_n of the first feature data filter_var_n, the present disclosure obtains a plurality of feature importance degrees by adjusting random_state, a built-in hyperparameter of the gradient boosting decision tree algorithm, and constructs feature importance matrices from them.

Adjusting the hyperparameter random_state includes randomly assigning a value to random_state K × R times, where K is the number of first feature data and R is an adjustable parameter whose default value is 10000. Every K random assignments of random_state yield one K × K feature importance matrix composed of the importance degrees of the first feature data, so that R feature importance matrices are obtained in total.
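The construction of the R importance matrices can be sketched as below. The sketch assumes that each row of a matrix is the importance vector (fp_1, ..., fp_K) returned by one training run with one random_state value; the row layout, the function name and the `importance_fn` callable are assumptions for illustration, and the model training itself is left to the caller.

```python
import random


def build_importance_matrices(importance_fn, k, r, seed=0):
    """Return r matrices, each k x k, as nested lists.

    importance_fn(random_state) is a hypothetical callable that trains the
    gradient boosting model with the given random_state and returns the k
    feature importances. It is called k * r times in total.
    """
    rng = random.Random(seed)
    matrices = []
    for _ in range(r):
        rows = []
        for _ in range(k):
            state = rng.randrange(2 ** 31)  # random assignment of random_state
            row = list(importance_fn(state))
            if len(row) != k:
                raise ValueError("expected one importance per feature")
            rows.append(row)
        matrices.append(rows)
    return matrices


def fake_importance(state, k=3):
    """Stand-in for a real model: seeded random importances summing to 1."""
    rng = random.Random(state)
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [v / total for v in raw]


matrices = build_importance_matrices(fake_importance, k=3, r=4)
```

With a real model, `importance_fn` would fit the model on the training samples and return its per-feature importances; a much smaller R than the default 10000 is advisable when experimenting, since the model is retrained K × R times.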

The specific feature importance matrix is as follows:

In operation S630, the number of target features is determined based on the importance matrices of the plurality of first feature data.

In a specific implementation, determining the number of target features includes operations S631-S632.

In operation S631, first variance corresponding to the plurality of first feature data and the integrated variance corresponding to all the first feature data are calculated based on the feature importance matrix, respectively.

Wherein calculating the first variance and the integrated variance includes operations S6311-S6313.

In operation S6311, the eigenvalues (feature values) corresponding to the first feature data in each feature importance matrix are calculated.

The eigenvalues <λ_{i,1}, λ_{i,2}, λ_{i,3}, ..., λ_{i,j}, ..., λ_{i,k}> of each feature importance matrix are solved, where λ_{i,j} denotes the j-th eigenvalue of the i-th matrix. Solving all the matrices yields the two-dimensional table of matrix eigenvalues shown below.

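| Matrix | Eigenvalue 1 | Eigenvalue 2 | ... | Eigenvalue j | ... | Eigenvalue k |
|---|---|---|---|---|---|---|
| Matrix 1 | λ_{1,1} | λ_{1,2} | ... | λ_{1,j} | ... | λ_{1,k} |
| Matrix 2 | λ_{2,1} | λ_{2,2} | ... | λ_{2,j} | ... | λ_{2,k} |
| Matrix 3 | λ_{3,1} | λ_{3,2} | ... | λ_{3,j} | ... | λ_{3,k} |
| ... | ... | ... | ... | ... | ... | ... |
| Matrix i | λ_{i,1} | λ_{i,2} | ... | λ_{i,j} | ... | λ_{i,k} |
| ... | ... | ... | ... | ... | ... | ... |
| Matrix r | λ_{r,1} | λ_{r,2} | ... | λ_{r,j} | ... | λ_{r,k} |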

In operation S6312, a first variance corresponding to the first feature data is calculated according to the plurality of eigenvalues of the first feature data.

The variance σ_j of each eigenvalue is calculated, where σ_j denotes the variance corresponding to eigenvalue j, i.e., the variance of column j of the eigenvalue table across the matrices.

In operation S6313, the integrated variance corresponding to all the first feature data is calculated according to the eigenvalues of all the first feature data.

In a specific implementation process, the calculation formula of the integrated variance a_σ is as follows:

where λ_{i,j} denotes the j-th eigenvalue of the i-th matrix.

In operation S632, a target feature quantity is determined according to the magnitude relation of the first variance and the integrated variance.

In some embodiments, a counter count = 0 is initialized, where the counter records the target feature quantity. For the first variance σ_j corresponding to each eigenvalue, it is determined whether σ_j is greater than or equal to the integrated variance a_σ: if σ_j ≥ a_σ, the value of count is increased by 1; otherwise, count remains unchanged. The final value of count is the target feature quantity.
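The counting rule of operation S632 can be sketched as follows. The integrated variance is taken as an input rather than computed, and the per-column variance is taken as the population variance across the matrices; both choices, together with the assumption that the eigenvalues are real-valued, are made for this illustration only.

```python
from statistics import pvariance


def count_target_features(eigen_table, integrated_variance):
    """Count eigenvalue columns whose variance reaches the integrated variance.

    eigen_table[i][j] is the j-th eigenvalue of the i-th importance matrix
    (r rows, k columns), assumed real-valued.
    """
    k = len(eigen_table[0])
    count = 0
    for j in range(k):
        # first variance sigma_j: variance of column j across all matrices
        sigma_j = pvariance([row[j] for row in eigen_table])
        if sigma_j >= integrated_variance:
            count += 1
    return count


# Hypothetical eigenvalue table with r = 3 matrices and k = 3 eigenvalues.
table = [
    [1.0, 0.5, 0.1],
    [3.0, 0.5, 0.2],
    [5.0, 0.5, 0.3],
]
target_feature_quantity = count_target_features(table, integrated_variance=0.005)
```

In this made-up table the second column never varies, so it does not contribute to the count, while the first and third columns do.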

The variance effectively reflects the divergence of a feature, that is, how much customer risk differs along that feature; when the variance is too small, the feature is of little use for distinguishing customer risk. Therefore, after the first feature data are screened out, the present disclosure further calculates the eigenvalues corresponding to the first feature data and determines the target feature quantity from the magnitude relation between the first variance corresponding to each eigenvalue and the integrated variance of all eigenvalues. The target feature quantity can thus be determined scientifically and quickly, which provides a good basis for the subsequent selection of target features, enables the target features to be selected accurately, and in turn improves the accuracy and efficiency of customer risk identification.

Based on the credit risk identification method, the disclosure also provides a credit risk identification device. The device will be described in detail below in connection with fig. 7.

Fig. 7 schematically illustrates a block diagram of a credit risk recognition apparatus according to an embodiment of the present disclosure.

As shown in fig. 7, the credit risk identification device 700 of this embodiment includes an acquisition module 710, a preprocessing module 720, a determination module 730, a screening module 740, and an identification module 750.

The acquisition module 710 is configured to acquire client information. In an embodiment, the acquisition module 710 may be configured to perform operation S210 described above, which is not repeated here.

The preprocessing module 720 is configured to perform a preprocessing operation on the client information, and extract a plurality of first feature data from the client information. In an embodiment, the preprocessing module 720 may be configured to perform operation S220 described above, which is not repeated here.

The determination module 730 is configured to calculate importance degrees of the plurality of first feature data, and determine the number of target features based on the importance degrees, wherein the target features are used to predict a risk level of the client. In an embodiment, the determination module 730 may be configured to perform operation S230 described above, which is not repeated here.

The screening module 740 is configured to screen the target features from the plurality of first feature data based on the target feature quantity. In an embodiment, the screening module 740 may be configured to perform operation S240 described above, which is not repeated here.

The identification module 750 is configured to pass the target features into a risk identification model, so that the risk identification model performs credit risk identification on the current client based on the target features and obtains a risk level of the client. In an embodiment, the identification module 750 may be configured to perform operation S250 described above, which is not repeated here.
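The relationship between the five modules and operations S210 to S250 can be pictured with a minimal sketch in which each module is a callable and the device simply chains them. The class name, field names and type signatures are hypothetical and serve only to show the data flow.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CreditRiskIdentifier:
    acquire: Callable[[str], Any]            # acquisition module 710 (S210)
    preprocess: Callable[[Any], dict]        # preprocessing module 720 (S220)
    determine_count: Callable[[dict], int]   # determination module 730 (S230)
    screen: Callable[[dict, int], dict]      # screening module 740 (S240)
    identify: Callable[[dict], str]          # module 750 (S250)

    def run(self, customer_id: str) -> str:
        info = self.acquire(customer_id)
        first_features = self.preprocess(info)
        n = self.determine_count(first_features)
        targets = self.screen(first_features, n)
        return self.identify(targets)


# Toy stand-ins for each module, used only to exercise the flow.
pipeline = CreditRiskIdentifier(
    acquire=lambda cid: {"id": cid, "income": 10, "age": 30},
    preprocess=lambda info: {k: v for k, v in info.items() if k != "id"},
    determine_count=lambda feats: 1,
    screen=lambda feats, n: dict(sorted(feats.items())[:n]),
    identify=lambda targets: "low" if targets.get("age", 0) >= 18 else "high",
)
risk_level = pipeline.run("c1")
```

Keeping each module behind a plain callable mirrors the statement that modules may be merged or split without changing the overall flow.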

According to embodiments of the present disclosure, any number of the acquisition module 710, the preprocessing module 720, the determination module 730, the screening module 740, and the identification module 750 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 710, the preprocessing module 720, the determination module 730, the screening module 740, and the identification module 750 may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, or an application specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuits, or in any one of software, hardware, and firmware or a suitable combination of the three. Alternatively, at least one of the acquisition module 710, the preprocessing module 720, the determination module 730, the screening module 740, and the identification module 750 may be implemented at least in part as a computer program module which, when executed, performs the corresponding functions.

As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 807 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.

In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 807 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it is installed into the storage section 807 as needed.

The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the credit risk identification method provided by embodiments of the present disclosure.

The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, and downloaded and installed via the communication section 809 and/or installed from the removable medium 811. The program code contained in the computer program may be transmitted using any appropriate network medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811.

According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or incorporated without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.

The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims

1. A credit risk identification method, comprising:

acquiring client information;

the following operations are performed on the client information:

performing a preprocessing operation on the client information, and extracting a plurality of first characteristic data from the client information;

calculating importance degrees of the plurality of first feature data respectively, and determining the number of target features based on the importance degrees; wherein the target feature is used for predicting the risk level of the client;

screening target features from the plurality of first feature data based on the target feature quantity;

and transmitting the target characteristics into a risk identification model, so that the risk identification model carries out credit risk identification on the current client based on the target characteristics, and the risk grade of the client is obtained.

2. The credit risk identification method of claim 1, wherein the performing a preprocessing operation on the client information and extracting a plurality of first feature data from the client information includes:

extracting a plurality of feature data to be screened from the client information;

calculating characteristic indexes of the plurality of characteristic data to be screened, and screening the plurality of characteristic data to be screened according to the characteristic indexes to obtain first characteristic data.

3. The credit risk identification method according to claim 2, wherein the calculating the feature index of the feature data to be screened, and the screening the feature data to be screened according to the feature index, to obtain the first feature data, includes:

calculating the information value of each piece of characteristic data to be screened and the correlation coefficient among the pieces of characteristic data to be screened;

screening a plurality of first feature data to be screened from the plurality of feature data to be screened according to the information value and the correlation coefficient;

calculating a group stability index value, a deletion rate, a number of feature values and an AUC value of each first feature data to be screened;

and screening a plurality of first characteristic data from the plurality of first characteristic data to be screened according to the group stability index value, the deletion rate, the characteristic value number and the AUC value.

4. The credit risk identification method according to claim 3, wherein the calculating the information value of each feature data to be screened and the correlation coefficient between the feature data to be screened includes:

screening the feature data to be screened, which correspond to the information value larger than a threshold value, from the feature data to be screened;

and calculating correlation coefficients among the plurality of filtered characteristic data to be filtered.

5. The credit risk identification method according to claim 4, wherein the screening a plurality of first feature data to be screened from the plurality of feature data to be screened according to the information value and the correlation coefficient includes:

under the condition that the correlation coefficient between the feature data to be screened and other feature data to be screened is smaller than a threshold value, determining the feature data to be screened as first feature data to be screened;

and under the condition that the correlation coefficient between the feature data to be screened and other feature data to be screened is larger than a threshold value, determining the feature data to be screened with the maximum information value in the feature data to be screened as first feature data to be screened.

6. The credit risk identification method of claim 1, wherein the calculating importance degrees of the plurality of first feature data respectively, and determining the number of target features based on the importance degrees, comprises:

the following operations are performed on each first feature data:

transmitting the first characteristic data into a gradient lifting decision tree model, and obtaining a plurality of importance degrees of the first characteristic data by adjusting super parameters in the gradient lifting decision tree model;

constructing an importance matrix of the first feature data according to the importance of the first feature data;

the number of target features is determined based on the importance matrix of the plurality of first feature data.

7. The credit risk identification method of claim 6, wherein the determining the number of target features based on the importance matrix of the plurality of first feature data comprises:

respectively calculating first variance corresponding to a plurality of first feature data and comprehensive variance corresponding to all the first feature data based on the feature importance matrix;

and determining the target feature quantity according to the magnitude relation between the first variance and the comprehensive variance.

8. The credit risk identification method according to claim 7, wherein the calculating the first variance corresponding to the plurality of first feature data and the integrated variance corresponding to all the first feature data based on the feature importance matrix includes:

calculating the characteristic value of the first characteristic data in each characteristic importance matrix;

calculating a first variance corresponding to the first characteristic data according to a plurality of characteristic values of the first characteristic data;

and calculating the comprehensive variance corresponding to all the first feature matrixes according to the feature values of all the first feature data.

9. A credit risk identification device, comprising:

the acquisition module is used for acquiring the client information;

the preprocessing module is used for executing preprocessing operation on the client information and extracting a plurality of first characteristic data from the client information;

a determining module, configured to calculate importance degrees of the plurality of first feature data, respectively, and determine a target feature number based on the importance degrees; wherein the target feature is used for predicting the risk level of the client;

the screening module is used for screening target features from the plurality of first feature data based on the target feature quantity; and

and the identification module is used for transmitting the target characteristics into a risk identification model, so that the risk identification model carries out credit risk identification on the current client based on the target characteristics, and the risk grade of the client is obtained.

10. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.

11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-8.

12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.