CN116258579A

CN116258579A - Training method of user credit scoring model and user credit scoring method

Info

Publication number: CN116258579A
Application number: CN202310474439.7A
Authority: CN
Inventors: 刘洪江; 甘元笛; 任晓东; 陈昱任; 吕文勇; 周智杰
Original assignee: Chengdu New Hope Finance Information Co Ltd
Current assignee: Chengdu New Hope Finance Information Co Ltd
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2023-06-13
Anticipated expiration: 2043-04-28
Also published as: CN116258579B

Abstract

The application provides a training method for a user credit scoring model and a user credit scoring method. Training of the user credit scoring model is divided into two stages: the first stage yields a trained neural network encoder, and the second stage trains a meta-model (for example, a GBDT or a generalized linear regression) on the feature vectors output by the trained neural network encoder together with derived features derived from the original data. A user credit scoring model obtained through this staged training can make full use of high-dimensional unstructured data such as images, text, and videos, which improves the accuracy of user credit scoring and thereby reduces risk.

Description

Training method of user credit scoring model and user credit scoring method

Technical Field

The present application relates to the field of credit assessment, and in particular, to a training method for a user credit scoring model and a user credit scoring method.

Background

Retail credit is small, short-term credit extended to a consumer, typically used to purchase consumer products or services. Retail credit is convenient and fast, but it also carries higher risk. Traditional retail credit risk prediction methods are typically based on a single data source, such as a credit score or a historical repayment record. This limitation makes it difficult for such methods to predict retail credit risk accurately.

At present, in risk control for retail credit loans, the number of transactions is huge while the average amount of a single loan is small, and the forms of risk are varied, complex, and changeable, so business experience alone can hardly cover them comprehensively. Risk control in this field is therefore carried out mostly through risk models and strategies, with a relatively low degree of manual intervention. Among these, risk models are the necessary, core tools for coping with complex and changing risks; common risk models include credit assessment models and anti-fraud models.

In existing credit evaluation models, the algorithms used are mainly logistic regression and ensemble decision trees. Logistic regression is the traditional algorithm for building scorecard models; it has a long history and mature practice, and is characterized by few model parameters, high stability, a simple algorithm, and strong interpretability. Ensemble decision tree algorithms, which include random forests and gradient boosting decision trees (GBDT), are the mainstream algorithms for building machine learning risk control models; they are characterized by high model performance, low requirements on input features, nonlinearity, and partial interpretability.

An existing credit evaluation model is built by first performing feature derivation on the original data to generate features with scalar or categorical values, and then using these features to build the model. This development method can hardly make full use of high-dimensional unstructured data such as images, text, and videos. For such high-dimensional data, the existing mainstream scheme is to design a series of feature generation rules from experience and generate features with them. However, features designed in this way can hardly cover all aspects of the information in the high-dimensional data, and most experience-based features extract only a small amount of information. The models best suited to extracting information from high-dimensional data are deep learning models based on deep neural networks. However, deep learning models can hardly replace logistic regression and decision trees in credit models, because neural networks are far less interpretable than logistic regression and decision trees.

Disclosure of Invention

The embodiments of the application aim to provide a training method for a user credit scoring model and a user credit scoring method, which solve the problem that existing credit scoring models can hardly make full use of high-dimensional unstructured data such as images, text, and videos.

The embodiment of the application provides a training method for a user credit scoring model, wherein the user credit scoring model comprises a neural network encoder and a meta model, and the training method comprises the following steps:

inputting the high-dimensional data into a trained neural network encoder to obtain a feature vector; wherein the high-dimensional data includes at least one of image data, video data, and text data;

performing rule-based derivation, grounded in business experience, on the original data to obtain derived features; wherein the original data includes at least one of personal information, device information, credit history, and financial data;

performing feature screening on all feature vectors and derived features to obtain screened features;

and training the meta-model according to the screened features and the corresponding labels to obtain the trained meta-model.

In the above technical solution, training of the user credit scoring model is divided into two stages: the first stage yields a trained neural network encoder, and the second stage trains the meta-model (for example, a GBDT or a generalized linear regression) on the feature vectors output by the trained neural network encoder together with derived features derived from the original data. A user credit scoring model obtained through this staged training can make full use of high-dimensional unstructured data such as images, text, and videos, which improves the accuracy of user credit scoring and thereby reduces risk.

In some alternative embodiments, before inputting the high-dimensional data into the trained neural network encoder, the method further comprises:

training the neural network encoder.

In some alternative embodiments, training a neural network encoder includes:

establishing a neural network structure corresponding to the high-dimensional data; the neural network structure comprises a neural network encoder and a neural network prediction head, wherein the neural network encoder is used for generating and outputting corresponding feature vectors according to the high-dimensional data, and the neural network prediction head is used for generating and outputting corresponding prediction values according to the feature vectors;

and training the neural network structure according to the high-dimensional data and the corresponding labels to obtain the trained neural network encoder.
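The relationship between the neural network encoder and the neural network prediction head can be illustrated with a minimal sketch. The class names, layer sizes, and random initial weights below are assumptions made for illustration only; the application does not prescribe a particular architecture, and the training loop is omitted.

```python
import math
import random


class Encoder:
    """Maps a high-dimensional input vector to a short feature vector."""

    def __init__(self, in_dim, feat_dim, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(feat_dim)]

    def __call__(self, x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)))
                for row in self.w]


class PredictionHead:
    """Maps a feature vector to a predicted value (needed only for training)."""

    def __init__(self, feat_dim, rng):
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
        self.b = 0.0

    def __call__(self, feat):
        z = sum(wi * fi for wi, fi in zip(self.w, feat)) + self.b
        return 1.0 / (1.0 + math.exp(-z))


class NetworkStructure:
    """Encoder followed by prediction head, trained together against a label."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.encoder = Encoder(in_dim, feat_dim, rng)
        self.head = PredictionHead(feat_dim, rng)

    def forward(self, x):
        feat = self.encoder(x)
        return feat, self.head(feat)


net = NetworkStructure(in_dim=8, feat_dim=3)
x = [0.5] * 8                      # hypothetical flattened high-dimensional input
feature_vector, prediction = net.forward(x)

# After training, only the encoder is carried into the next stage;
# the prediction head is no longer needed.
trained_encoder = net.encoder
```

In this sketch the prediction head exists only so that the whole structure can be fitted to a label; the feature vector returned by `trained_encoder` is what later enters the feature pool.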

In some optional embodiments, establishing a neural network structure corresponding to the high-dimensional data includes:

establishing a corresponding neural network structure for each type of high-dimensional data;

training the neural network structure according to the high-dimensional data and the corresponding label to obtain a trained neural network encoder, comprising:

and respectively training the corresponding neural network structure according to each type of high-dimensional data and the corresponding label to obtain a trained neural network encoder corresponding to each type of high-dimensional data.

In the above technical solution, training of the neural network structures covers two cases: first, each type of high-dimensional data is used to train its corresponding neural network independently, with all networks sharing the same label; second, each type of high-dimensional data is used to train its corresponding neural network independently, with different labels used for different networks.

In some optional embodiments, training the neural network structure according to the high-dimensional data and the corresponding labels to obtain a trained neural network encoder further comprises:

integrating the neural network structures corresponding to multiple categories of high-dimensional data into an overall neural network structure;

and performing multimodal training of the overall neural network structure using the multiple categories of high-dimensional data and the corresponding labels to obtain a plurality of trained neural network encoders.

In the above technical solution, when the neural network encoders are trained, all the high-dimensional data planned to enter the model is used, the neural networks corresponding to the different types of high-dimensional data are integrated together, and one label is selected for training on the multimodal data.
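One way to picture the integrated structure is a set of per-modality encoders whose feature vectors are concatenated and passed to a single shared prediction head trained against one label. The sketch below shows only the forward pass and the loss for one sample; the modality names, dimensions, fusion by concatenation, and binary cross-entropy loss are assumptions for illustration, not details stated in the application.

```python
import math
import random


def make_encoder(in_dim, feat_dim, rng):
    """Return a small encoder function with its own (untrained) weights."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
         for _ in range(feat_dim)]

    def encode(x):
        return [math.tanh(sum(a * b for a, b in zip(row, x))) for row in w]

    return encode


rng = random.Random(0)
# One encoder per category of high-dimensional data (hypothetical sizes).
encoders = {
    "image": make_encoder(6, 2, rng),
    "text": make_encoder(4, 3, rng),
}
# A single shared head over the concatenated feature vectors.
head_w = [rng.uniform(-0.1, 0.1) for _ in range(2 + 3)]


def forward(sample):
    feats = {name: enc(sample[name]) for name, enc in encoders.items()}
    joint = feats["image"] + feats["text"]        # concatenate modalities
    z = sum(w * f for w, f in zip(head_w, joint))
    return feats, 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for one sample and one selected label."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


sample = {"image": [0.2] * 6, "text": [1.0, 0.0, 0.0, 1.0]}
feats, p = forward(sample)
loss = bce(p, 1)   # gradients of this loss would update every encoder jointly
```

Because a single loss drives all encoders, each encoder's feature vector is shaped by the other modalities, which differs from the independent training cases described earlier.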

In some alternative embodiments, performing feature screening comprises:

screening features based on predefined criteria and/or based on model performance.

In the above technical solution, feature screening includes: a filter approach, which screens features based on predefined criteria, for example the correlation of each individual feature with the target variable or the information gain of each individual feature; and a wrapper approach, which screens features based on model performance, for example by iteratively eliminating unimportant features using a recursive feature elimination algorithm.
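As an illustration of the filter approach, the following sketch keeps the features whose absolute Pearson correlation with the target variable reaches a threshold. The feature names, sample values, and threshold are hypothetical; information gain or another predefined criterion could be substituted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def filter_features(columns, target, threshold):
    """Keep features whose |correlation| with the target meets the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]


# Hypothetical feature pool: two encoder outputs and one derived feature.
columns = {
    "enc_0": [0.9, 0.8, 0.1, 0.2, 0.85, 0.15],
    "enc_1": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],     # constant, carries no signal
    "device_age_bucket": [1, 3, 2, 2, 2, 2],     # uncorrelated with the target
}
target = [1, 1, 0, 0, 1, 0]

selected = filter_features(columns, target, threshold=0.3)
```

Each feature is judged on its own here, which is what distinguishes the filter approach from the wrapper approach that evaluates features through a fitted model.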

In some alternative embodiments, the meta-model includes a meta-model based on a gradient boosting decision tree or a generalized linear regression algorithm.

In the above technical solution, the features screened in the previous step are used as input data, GBDT or generalized linear regression is selected as the algorithm of the meta-model, and the meta-model is trained against the designed label. With this training method, a stacked model that fuses neural networks with a GBDT or a generalized linear model can be built very flexibly, and the accuracy of the model is improved. In addition, data from the required sources and of different types can be selected flexibly according to the needs of the risk control business, a model adequate for risk control under complex conditions can be developed using a variety of labels, and the interpretable rule-derived features are integrated into the model, so that the model retains a certain degree of interpretability.

The user credit scoring method provided by the embodiment of the application comprises the following steps:

inputting high-dimensional data in the user data into a trained neural network encoder to obtain an actual feature vector; performing rule-based derivation, grounded in business experience, on the original data in the user data to obtain actual derived features;

performing feature screening on all the actual feature vectors and the actual derived features to obtain screened actual features;

and inputting the screened actual features into the trained meta-model to obtain an actual score.

In the above technical solution, the data input to the model includes both high-dimensional data and original data. Because the model is based on multimodal data, it can make full use of internally sourced high-dimensional data, which greatly increases the breadth of information across the model's feature dimensions. The model can still maintain relatively high accuracy when a client lacks credit history, and when facing high-risk clients the additional information helps identify fraud risk, so that an integrated anti-fraud and credit scoring scheme is achieved in high-risk client scenarios where fraud risk and credit risk are difficult to separate.

An electronic device provided in an embodiment of the present application includes: a processor and a memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor, perform any one of the methods described above.

A computer readable storage medium provided by an embodiment of the present application, on which a computer program is stored, which when executed by a processor performs a method as described in any of the above.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of steps of a training method for a credit scoring model for a user according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a user credit scoring model according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating steps of a method for scoring credit of a user according to an embodiment of the present application;

FIG. 4 shows a possible structure of the electronic device provided in an embodiment of the present application.

Reference numerals: 1-processor, 2-memory, 3-communication interface, 4-communication bus.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Credit scoring refers to the process of evaluating and scoring a customer's credit status in retail credit. Such scoring is typically used to predict the customer's future ability to repay and to give a bank or other financial institution a reference for deciding whether to offer loans or financing products such as credit cards. Credit scoring typically takes into account a variety of factors, including the customer's financial status, credit history, and income level.

In the field of retail credit, there are currently a number of schemes for assessing a customer's credit status and predicting the customer's future repayment capacity. These schemes include rule-based methods and statistical methods. Rule-based methods use manually set rules to evaluate the credit status of the customer; for example, the rules may take into account the customer's income level, credit history, and debt ratio. Statistical methods use statistical models to assess the credit status of the customer; for example, a logistic regression algorithm may be used to predict whether a customer will default.

Existing credit scoring schemes, such as rule-based methods and statistical methods, may suffer from several drawbacks when assessing the credit status of high-risk customers. Rule-based methods may be too simple to account adequately for a customer's specific circumstances. Methods based on logistic regression or decision trees rely on rule-based feature derivation engineering, can hardly use high-dimensional data efficiently, comprehensively, and in depth, have little information available when facing customers who are high-risk or lack credit history, and therefore have difficulty identifying credit risk accurately.

The high-dimensional data mainly comprises the following types:

Image and video data, including live face verification videos, identity card photographs, and the like.

Text data, including text filled in at the application stage, text generated by Optical Character Recognition (OCR) technology, and the like. For example, a Natural Language Processing (NLP) model is built with BERT or ERNIE as the backbone, and the large pre-trained NLP model is fine-tuned by fitting the text data to various credit risk labels, so that text information is extracted in a way that takes semantics into account.

Sequence data, including events of predefined types throughout the credit cycle, such as registration and live face verification, or touch behavior on the handheld smart device. Such data may be represented as a sequence of vectors.

Signal data, typically one-dimensional or three-dimensional waveform data sampled at a fixed frequency, including sound signals and motion sensor signals.
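To make the phrase "represented as a sequence of vectors" concrete, the sketch below turns a list of timestamped events into one vector per event, using a one-hot event type followed by the time gap since the previous event. The event names and this particular encoding are assumptions for illustration; the application does not specify how the vectors are constructed.

```python
# Hypothetical event types observed during the credit cycle.
EVENT_TYPES = ["register", "liveness_check", "id_photo", "submit_application"]


def encode_events(events):
    """Turn (event_type, timestamp_seconds) pairs into a sequence of vectors.

    Each vector is a one-hot event type followed by the seconds elapsed
    since the previous event (0.0 for the first event).
    """
    vectors = []
    previous_time = None
    for event_type, timestamp in events:
        one_hot = [1.0 if event_type == name else 0.0 for name in EVENT_TYPES]
        gap = 0.0 if previous_time is None else float(timestamp - previous_time)
        vectors.append(one_hot + [gap])
        previous_time = timestamp
    return vectors


sequence = encode_events([
    ("register", 0),
    ("liveness_check", 42),
    ("submit_application", 130),
])
```

A sequence encoder such as an LSTM or a Transformer could then consume `sequence` one vector at a time.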

The applicant has also found that, in actual financial business, credit risk and fraud risk are in practice difficult to separate, and the role that a pure anti-fraud model can play is often very limited. Existing anti-fraud systems rely mainly on expert experience and discrete rules. The models used in anti-fraud operate in relatively independent and narrow application scenarios, the algorithms they use are mainly deep learning algorithms, and they mainly target anomaly detection for specific targets and specific scenarios. Existing anti-fraud models are therefore applied in narrow, scattered scenarios and are difficult to combine organically with credit models. However, because anti-fraud models use neural networks, they can make fuller use of high-dimensional data such as images and videos than credit models can.

Therefore, one or more embodiments of the present application provide a training method for a user credit scoring model and a user credit scoring method, which incorporate a deep learning model structure into the credit scoring model and thereby solve the problem that existing credit scoring models can hardly make full use of high-dimensional unstructured data such as images, text, and videos.

In the embodiments of the application, the deep learning model structures used are various deep neural networks, including Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and Transformers, and these models are combined with generalized linear regression or GBDT in a stacked manner. The specific model structure is as follows: each type of high-dimensional data enters a neural network structure suited to that type of data, and the neural network encoder in that structure outputs feature vectors. These feature vectors amount to the relevant information extracted from the high-dimensional data entering the model, and they are placed, together with the derived features obtained from empirical rules, into a feature pool as candidate input features for the generalized linear regression or GBDT. Compared with features derived using empirical rules, features extracted automatically from high-dimensional data by neural network encoders are more complete, deeper, and more relevant to the target variable. Adding the feature vectors also greatly widens the dimensions of the feature pool and greatly increases the information available to the generalized linear regression or GBDT model, which increases the performance of the user credit scoring model and greatly broadens the scenarios in which it can be used.

When the features output by the neural network encoder are fed into a subsequent generalized linear regression or GBDT, a stacked model is formed. However, a stacked model consisting of a neural network and a GBDT is difficult to train directly. The most important reason is that the neural network and the GBDT are trained in completely different ways. Both need to iterate during the training stage, but in each iteration of the neural network all of its parameters change, whereas in each iteration of the GBDT a new set of parameters is added and the earlier parameters do not change. It is therefore difficult to iterate the neural network and the GBDT simultaneously.
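The difference between the two iteration styles can be seen in a toy example. The sketch below is a deliberately simplified illustration, assuming one-dimensional inputs, squared-error loss, decision stumps as the boosted trees, and a two-parameter linear model standing in for a neural network; none of these choices come from the application.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:           # last value would leave right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, (t, lv, rv))
    return best[1]


def predict(stumps, x, lr):
    """Sum of the shrunken outputs of all stumps fitted so far."""
    return sum(lr * (lv if x <= t else rv) for t, lv, rv in stumps)


def boost(xs, ys, rounds, lr=0.5):
    """Gradient boosting: each round appends a stump; earlier stumps are frozen."""
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - predict(stumps, x, lr) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(list(stumps))          # snapshot after this round
    return stumps, history


def gradient_step(params, xs, ys, lr=0.05):
    """One gradient-descent step on y = w * x + b: every parameter moves."""
    w, b = params
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return (w - lr * gw, b - lr * gb)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

stumps, history = boost(xs, ys, rounds=3)
new_params = gradient_step((0.0, 0.0), xs, ys)
```

Comparing the snapshots in `history` shows that the stump added in the first round is identical after the third round, while `gradient_step` changes both `w` and `b` at once. This mismatch is why the two parts are trained in separate stages.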

In order to solve the above problems, an embodiment of the present application provides a training method for a user credit scoring model, where the user credit scoring model includes a neural network encoder and a meta-model. Referring to FIG. 1, the training method includes:

step 100, inputting high-dimensional data into a trained neural network encoder to obtain feature vectors; wherein the high-dimensional data includes at least one of image data, video data, and text data;

step 200, performing feature screening on all feature vectors and derived features to obtain screened features;

step 300, training the meta-model according to the screened features and the corresponding labels to obtain the trained meta-model.

In this embodiment, training of the user credit scoring model is divided into two stages: the first stage yields a trained neural network encoder, and the second stage trains the meta-model (for example, a GBDT or a generalized linear regression) on the feature vectors output by the trained neural network encoder together with derived features derived from the original data. A user credit scoring model obtained through this staged training can make full use of high-dimensional unstructured data such as images, text, and videos, which improves the accuracy of user credit scoring and thereby reduces risk.
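The second stage can be sketched as follows, assuming the encoder has already been trained and is used only to produce feature vectors. The stand-in encoder, the derivation rules, the sample data, and the choice of logistic regression (one kind of generalized linear model) as the meta-model are all assumptions for illustration; feature screening is omitted for brevity.

```python
import math


def trained_encoder(high_dim):
    """Stand-in for an encoder trained in the first stage (weights frozen)."""
    half = len(high_dim) // 2
    return [sum(high_dim[:half]) / half,
            sum(high_dim[half:]) / (len(high_dim) - half)]


def derive_features(raw):
    """Hypothetical rule-based derived features from the original data."""
    return [1.0 if raw["has_credit_history"] else 0.0,
            min(raw["monthly_income"] / 10000.0, 1.0)]


def meta_predict(model, x):
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_model(rows, labels, epochs=500, lr=0.5):
    """Logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            p = meta_predict((w, b), x)
            for j, xj in enumerate(x):
                gw[j] += (p - y) * xj
            gb += p - y
        w = [wj - lr * g / len(rows) for wj, g in zip(w, gw)]
        b -= lr * gb / len(rows)
    return w, b


# Hypothetical training samples; label 1 marks a bad-credit client.
samples = [
    {"high_dim": [0.9, 0.8, 0.7, 0.9],
     "raw": {"has_credit_history": False, "monthly_income": 3000}, "label": 1},
    {"high_dim": [0.8, 0.9, 0.9, 0.6],
     "raw": {"has_credit_history": True, "monthly_income": 4000}, "label": 1},
    {"high_dim": [0.1, 0.2, 0.2, 0.1],
     "raw": {"has_credit_history": True, "monthly_income": 9000}, "label": 0},
    {"high_dim": [0.2, 0.1, 0.3, 0.2],
     "raw": {"has_credit_history": False, "monthly_income": 8000}, "label": 0},
]

# Feature pool: encoder feature vector followed by derived features.
rows = [trained_encoder(s["high_dim"]) + derive_features(s["raw"]) for s in samples]
labels = [s["label"] for s in samples]
model = train_meta_model(rows, labels)
scores = [meta_predict(model, r) for r in rows]
```

The encoder is never updated while the meta-model is fitted, so the two parts do not need to share an iteration scheme.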

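Before the high-dimensional data is input into the trained neural network encoder, the method further includes training the neural network encoder in the previous stage, that is, establishing a neural network structure corresponding to the high-dimensional data and training it according to the high-dimensional data and the corresponding labels, specifically as follows: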

In some optional embodiments, establishing a neural network structure corresponding to the high-dimensional data includes: establishing a corresponding neural network structure for each type of high-dimensional data;

correspondingly, training the neural network structure according to the high-dimensional data and the corresponding label to obtain a trained neural network encoder, comprising: and respectively training the corresponding neural network structure according to each type of high-dimensional data and the corresponding label to obtain a trained neural network encoder corresponding to each type of high-dimensional data.

In the embodiments of the present application, training of the neural network structures covers two cases: first, each type of high-dimensional data is used to train its corresponding neural network independently, with all networks sharing the same label; second, each type of high-dimensional data is used to train its corresponding neural network independently, with different labels used for different networks.

In the embodiments of the application, when the neural network encoders are trained, it is also possible to use all the high-dimensional data planned to enter the model, integrate the neural networks corresponding to the different types of high-dimensional data together, and select one label for training on the multimodal data.

In some alternative embodiments, performing feature screening comprises: screening features based on predefined criteria and/or based on model performance.

In this embodiment of the present application, feature screening includes: a filter approach, which screens features based on predefined criteria, for example the correlation of each individual feature with the target variable or the information gain of each individual feature; and a wrapper approach, which screens features based on model performance, for example by iteratively eliminating unimportant features using a recursive feature elimination algorithm.
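The wrapper approach can be sketched with a small recursive feature elimination loop: a model is fitted, the feature with the smallest absolute weight is dropped, and the process repeats until the desired number of features remains. Using logistic regression as the wrapped model and absolute weight as the importance measure are assumptions for illustration; the feature names and data are hypothetical.

```python
import math


def fit_logistic(rows, labels, epochs=300, lr=0.5):
    """Logistic regression fitted by batch gradient descent."""
    n_feat = len(rows[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n_feat, 0.0
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(n_feat):
                gw[j] += (p - y) * x[j]
            gb += p - y
        w = [wi - lr * g / len(rows) for wi, g in zip(w, gw)]
        b -= lr * gb / len(rows)
    return w, b


def recursive_feature_elimination(rows, labels, names, keep):
    """Repeatedly refit and drop the feature with the smallest |weight|."""
    names = list(names)
    rows = [list(r) for r in rows]          # work on copies
    while len(names) > keep:
        w, _ = fit_logistic(rows, labels)
        worst = min(range(len(names)), key=lambda j: abs(w[j]))
        del names[worst]
        for r in rows:
            del r[worst]
    return names


# Hypothetical feature pool on comparable scales; enc_0 separates the classes.
names = ["enc_0", "enc_1", "derived_0"]
rows = [
    [1.0, 0.2, 0.1],
    [0.9, -0.1, 0.0],
    [1.1, 0.0, -0.1],
    [-1.0, 0.1, 0.1],
    [-0.9, -0.2, 0.0],
    [-1.1, 0.0, -0.1],
]
labels = [1, 1, 1, 0, 0, 0]

kept = recursive_feature_elimination(rows, labels, names, keep=2)
```

Weight magnitude is only a meaningful importance measure when features share a scale, so in practice the pool would be standardized first or a model-specific importance score would be used.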

In the embodiments of the application, the features screened in the previous step are used as input data, GBDT or generalized linear regression is selected as the algorithm of the meta-model, and the meta-model is trained against the designed label. With this training method, a stacked model that fuses neural networks with a GBDT or a generalized linear model can be built very flexibly, and the accuracy of the model is improved. In addition, data from the required sources and of different types can be selected flexibly according to the needs of the risk control business, a model adequate for risk control under complex conditions can be developed using a variety of labels, and the interpretable rule-derived features are integrated into the model, so that the model retains a certain degree of interpretability.

Referring to fig. 2, fig. 2 is a schematic diagram of a user credit scoring model provided in an embodiment of the present application, and the working procedure of using the model to score credit is as follows:

The first step is data collection: when a user applies for a loan through the credit product client software on a handheld touchscreen smart device, the client can, with the user's authorization, acquire various data on the device. The application flow includes a liveness verification step, in which the client acquires a liveness verification video; this is video data. During the application, the user must photograph both the front and the back of the identity card on the spot, and the client acquires the photos taken in real time; this is image data. The client also collects attribute information of the device on which it runs and the basic personal information filled in by the user. In addition to the data collected directly by the client, third-party data such as the user's credit history, financial status, and income level is also used.

The second step is data preprocessing: for an identity card photo, an Optical Character Recognition (OCR) technology is used for recognizing and extracting characters on the photo, and text data is generated.

The third step is feature derivation: this step mainly maps the original data, based on rules, to obtain derived features. The information filled in by the user, the text generated by OCR technology, the device attributes, and the third-party data are mapped, according to rules drawn from business experience, into categorical or numerical scalar features. For example, education information is mapped to an education-level category, the text on the identity card is mapped to the ordinal rank of the per-capita disposable income of the corresponding provincial administrative unit, and filled-in text is mapped to an occupation category. The derived features are then added to the feature pool. Because these features are generated from rules, they are interpretable.
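The three example mappings can be sketched as simple lookup rules. Every table below is a placeholder: the category codes, province names, income ranks, and occupation keywords are invented for illustration and do not come from the application or from any real statistics.

```python
# Placeholder rule tables (hypothetical values for illustration only).
EDUCATION_CATEGORY = {"primary": 0, "secondary": 1, "bachelor": 2, "postgraduate": 3}
PROVINCE_INCOME_RANK = {"Province A": 1, "Province B": 2, "Province C": 3}
OCCUPATION_KEYWORDS = [
    ("teacher", "education"),
    ("driver", "transport"),
    ("engineer", "technology"),
]


def derive_features(user):
    """Map filled-in and OCR-extracted fields to categorical or ordinal features."""
    occupation = "other"
    text = user["occupation_text"].lower()
    for keyword, category in OCCUPATION_KEYWORDS:
        if keyword in text:
            occupation = category
            break
    return {
        # -1 and 0 mark values that no rule covers.
        "education_category": EDUCATION_CATEGORY.get(user["education"], -1),
        "province_income_rank": PROVINCE_INCOME_RANK.get(user["id_card_province"], 0),
        "occupation_category": occupation,
    }


features = derive_features({
    "education": "bachelor",
    "id_card_province": "Province B",      # e.g. read from the OCR text
    "occupation_text": "Senior software engineer",
})
```

Each output can be traced back to a named rule, which is the source of the interpretability mentioned above.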

The fourth step is designing the labels used to train the model: good-credit clients and bad-credit clients are defined according to whether the user has defaulted on debt, forming a classification label.
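A label of this kind might be produced as in the sketch below, which marks a client as bad-credit when the longest overdue period reaches a threshold. The 30-day threshold and the record format are assumptions for illustration; the application does not state how default is measured.

```python
def credit_label(max_days_past_due, bad_threshold_days=30):
    """Return 1 for a bad-credit client and 0 for a good-credit client.

    The 30-day default threshold is a hypothetical choice.
    """
    return 1 if max_days_past_due >= bad_threshold_days else 0


# Hypothetical repayment records: days past due for each instalment.
repayment_records = {
    "user_a": [0, 0, 3],
    "user_b": [0, 45, 60],
}
labels = {user: credit_label(max(days)) for user, days in repayment_records.items()}
```

Different thresholds or observation windows would yield different labels, which is one way to obtain the multiple labels mentioned for training different encoders.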

The fifth step is training the model, which uses the staged training approach described above and is not repeated here.

The sixth step is backtesting the trained model, formulating strategy rules according to the test results, and embedding the strategy rules into the application admission strategy system.

Referring to fig. 3, fig. 3 is a flowchart illustrating steps of a user credit scoring method according to an embodiment of the present application, including:

step 400, inputting high-dimensional data in the user data into a trained neural network encoder to obtain an actual feature vector; performing rule-based derivation, grounded in business experience, on the original data in the user data to obtain actual derived features;

step 500, performing feature screening on all the actual feature vectors and the actual derived features to obtain screened actual features;

step 600, inputting the screened actual features into the trained meta-model to obtain an actual score.
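Steps 400 to 600 can be sketched as a single scoring function. The stand-in encoder, the derivation rule, the screened feature positions, and the meta-model weights are all hypothetical; the sketch also assumes that screening at scoring time keeps the same feature positions that were selected during training.

```python
import math


def encode(high_dim):
    """Stand-in for the trained neural network encoder."""
    half = len(high_dim) // 2
    return [sum(high_dim[:half]) / half,
            sum(high_dim[half:]) / (len(high_dim) - half)]


def derive(raw):
    """Hypothetical rule-based derivation from the original data."""
    return [1.0 if raw["has_credit_history"] else 0.0]


SELECTED_POSITIONS = [0, 2]     # feature positions kept by screening (assumed)
META_WEIGHTS = [3.0, -1.0]      # hypothetical trained meta-model weights
META_BIAS = -1.0


def score_user(user):
    # Step 400: feature vector from the encoder plus derived features.
    pool = encode(user["high_dim"]) + derive(user["raw"])
    # Step 500: keep only the screened features.
    screened = [pool[i] for i in SELECTED_POSITIONS]
    # Step 600: the meta-model turns the screened features into a score.
    z = sum(w * f for w, f in zip(META_WEIGHTS, screened)) + META_BIAS
    return 1.0 / (1.0 + math.exp(-z))


risky = score_user({"high_dim": [0.9, 0.8, 0.7, 0.9],
                    "raw": {"has_credit_history": False}})
safe = score_user({"high_dim": [0.1, 0.2, 0.2, 0.1],
                   "raw": {"has_credit_history": True}})
```

Here the score is a predicted probability of bad credit; a deployed system would typically rescale it to a scorecard range, which is outside the scope of this sketch.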

In the embodiments of the application, the data input to the model includes both high-dimensional data and original data. Because the model is based on multimodal data, it can make full use of internally sourced high-dimensional data, which greatly increases the breadth of information across the model's feature dimensions. The model can still maintain relatively high accuracy when a client lacks credit history, and when facing high-risk clients the additional information helps identify fraud risk, so that an integrated anti-fraud and credit scoring scheme is achieved in high-risk client scenarios where fraud risk and credit risk are difficult to separate.

Fig. 4 shows a possible structure of the electronic device provided in the embodiment of the present application. Referring to fig. 4, the electronic device includes: processor 1, memory 2, and communication interface 3, which are interconnected and communicate with each other by a communication bus 4 and/or other forms of connection mechanisms (not shown).

The memory 2 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The processor 1 and possibly other components may access the memory 2 to read and/or write data therein.

The processor 1 includes one or more processors (only one is shown in the figure), each of which may be an integrated circuit chip with signal processing capabilities. The processor 1 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP), or another conventional processor; it may also be a special-purpose processor, including a neural network processor (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. When there are multiple processors 1, some may be general-purpose processors and the others special-purpose processors.

The communication interface 3 comprises one or more (only one is shown) and may be used for direct or indirect communication with other devices for data interaction. The communication interface 3 may comprise an interface for wired and/or wireless communication.

One or more computer program instructions may be stored in the memory 2, which may be read and executed by the processor 1 to implement the methods provided by the embodiments of the present application.

It will be appreciated that the configuration shown in fig. 4 is merely illustrative, and that the electronic device may also include more or fewer components than shown in fig. 4, or have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof. The electronic device may be a physical device such as a PC, a notebook, a tablet, a cell phone, a server, an embedded device, etc., or may be a virtual device such as a virtual machine, a virtualized container, etc. The electronic device is not limited to a single device, and may be a combination of a plurality of devices or a cluster of a large number of devices.

The present embodiments also provide a computer readable storage medium having stored thereon computer program instructions that, when read and executed by a processor of a computer, perform the methods provided by the embodiments of the present application. For example, the computer readable storage medium may be implemented as the memory 2 in the electronic device of fig. 4.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in another form.

Further, the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, the functional modules in the various embodiments of the present application may be integrated together to form a single independent part, each module may exist alone, or two or more modules may be integrated to form a single independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The foregoing are merely exemplary embodiments of the present application and are not intended to limit its scope of protection; for those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims

1. A method of training a user credit scoring model, the user credit scoring model comprising a neural network encoder and a meta-model, the training method comprising:

performing feature screening according to all the feature vectors and the derivative features to obtain screened features;

2. The method of claim 1, further comprising, before inputting the high-dimensional data into the trained neural network encoder: training the neural network encoder.

3. The method of claim 2, wherein training the neural network encoder comprises:

establishing a neural network structure corresponding to the high-dimensional data; the neural network structure comprises a neural network encoder and a neural network prediction head, wherein the neural network encoder is used for generating and outputting corresponding feature vectors according to high-dimensional data, and the neural network prediction head is used for generating and outputting corresponding prediction values according to the feature vectors;
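The encoder-plus-prediction-head structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the application: the class name `EncoderWithHead`, the single linear layer with `tanh` in the encoder, the sigmoid head, the binary cross-entropy loss, and the plain-Python gradient step. A practical implementation would more likely use a deep learning framework and a network type suited to the modality of the high-dimensional data.

```python
import math
import random


class EncoderWithHead:
    """Toy structure: an encoder (linear + tanh) followed by a prediction head (linear + sigmoid)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Encoder weights map in_dim inputs to feat_dim features (hypothetical sizes).
        self.w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                      for _ in range(feat_dim)]
        # Prediction head weights map the feature vector to a single value.
        self.w_head = [rng.uniform(-0.5, 0.5) for _ in range(feat_dim)]

    def encode(self, x):
        """Encoder: produce the feature vector for one input sample."""
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w_enc]

    def predict(self, x):
        """Prediction head: produce a prediction value from the feature vector."""
        z = sum(w * f for w, f in zip(self.w_head, self.encode(x)))
        return 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, y, lr=0.1):
        """One gradient step on binary cross-entropy for a single (sample, label) pair."""
        feats = self.encode(x)
        z = sum(w * f for w, f in zip(self.w_head, feats))
        err = 1.0 / (1.0 + math.exp(-z)) - y  # dLoss/dz
        for j, f in enumerate(feats):
            # Backpropagate through tanh using the head weight before it is updated.
            grad_pre = err * self.w_head[j] * (1.0 - f * f)
            for i, xi in enumerate(x):
                self.w_enc[j][i] -= lr * grad_pre * xi
            self.w_head[j] -= lr * err * f


model = EncoderWithHead(in_dim=3, feat_dim=2)
sample, label = [1.0, 0.0, 1.0], 1
before = model.predict(sample)
for _ in range(50):
    model.train_step(sample, label)
after = model.predict(sample)
features = model.encode(sample)  # only the encoder output is passed on to later stages
```

In this sketch the prediction head exists only to give the encoder a training signal; once training is finished, `encode` alone supplies the feature vectors used downstream.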

4. The method of claim 3, wherein the establishing a neural network structure corresponding to the high-dimensional data comprises:

5. The method of claim 4, wherein the training of the neural network structure based on the high-dimensional data and the corresponding labels to obtain a trained neural network encoder further comprises:

6. The method of claim 1, wherein the performing feature screening comprises:

screening features based on predefined criteria; and/or screening features based on model performance.
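The two screening routes can be sketched as follows. The specific choices are assumptions for illustration only: a minimum-variance rule stands in for a "predefined criterion", a single-feature AUC gap stands in for "model performance", and the feature names, thresholds, and data are hypothetical. The claim itself does not fix any of these.

```python
from statistics import pvariance


def auc(scores, labels):
    """Probability that a random positive sample outranks a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def screen_by_criterion(columns, min_variance=1e-3):
    """Predefined criterion (assumed): drop near-constant features."""
    return [name for name, vals in columns.items() if pvariance(vals) >= min_variance]


def screen_by_performance(columns, labels, min_gap=0.1):
    """Performance-based (assumed): keep features whose single-feature AUC is far from 0.5."""
    return [name for name, vals in columns.items()
            if abs(auc(vals, labels) - 0.5) >= min_gap]


# Hypothetical candidate pool: encoder feature-vector components plus one derived feature.
columns = {
    "emb_0": [0.9, 0.8, 0.2, 0.1],
    "emb_1": [0.5, 0.5, 0.5, 0.5],
    "income_ratio": [0.3, 0.7, 0.4, 0.6],
}
labels = [1, 1, 0, 0]

kept_by_criterion = screen_by_criterion(columns)
kept_by_performance = screen_by_performance(columns, labels)
# "and" reading of "and/or": keep only features that pass both screens.
screened = [name for name in kept_by_criterion if name in kept_by_performance]
```

Here the constant column `emb_1` fails the variance rule, `income_ratio` passes it but has no discriminative power on this toy data, and only `emb_0` survives both screens.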

7. The method of claim 1, wherein the meta-model comprises a meta-model based on a gradient boosting decision tree (GBDT) or a generalized linear regression algorithm.
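As an illustration of the generalized linear option, the sketch below fits a logistic regression (a generalized linear model with a logit link) on screened features by batch gradient descent. The function names `fit_logistic` and `score`, the learning rate, the epoch count, and the toy data are all assumptions for this example. The GBDT option is not shown; in practice it would normally come from an existing gradient boosting library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit weights and bias of a logistic regression by batch gradient descent."""
    n_feat = len(rows[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Average the gradients over the batch before stepping.
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def score(w, b, x):
    """Meta-model output for one user, read here as a probability of the positive label."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical screened features per user: [encoder feature, derived feature].
rows = [[0.9, 0.2], [0.8, 0.4], [0.2, 0.3], [0.1, 0.5]]
labels = [0, 0, 1, 1]  # assumed meaning: 1 marks the outcome the score should flag

w, b = fit_logistic(rows, labels)
low_risk = score(w, b, [0.9, 0.2])
high_risk = score(w, b, [0.1, 0.5])
```

A linear meta-model of this kind keeps the contribution of each screened feature directly readable from its weight, whereas a tree-based meta-model can capture interactions between encoder features and derived features.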

8. A user credit scoring method, comprising:

9. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor, perform the method of any of claims 1-8.

10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the method according to any of claims 1-8.