CN116205726B - Loan risk prediction method and device, electronic equipment and storage medium - Google Patents

Loan risk prediction method and device, electronic equipment and storage medium

Info

Publication number
CN116205726B
Authority
CN
China
Prior art keywords
data
model
training
sub
unstructured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310474449.0A
Other languages
Chinese (zh)
Other versions
CN116205726A (en)
Inventor
甘元笛
刘洪江
任晓东
陈昱任
吕文勇
周智杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu New Hope Finance Information Co Ltd
Original Assignee
Chengdu New Hope Finance Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu New Hope Finance Information Co Ltd
Priority to CN202310474449.0A
Publication of CN116205726A
Application granted
Publication of CN116205726B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a loan risk prediction method, a loan risk prediction device, electronic equipment and a storage medium. The loan risk prediction method comprises the following steps: acquiring user data, wherein the user data comprises unstructured user data and structured user data; and inputting the user data into a preset risk prediction model to obtain a risk prediction result. The risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training on unstructured training data; the second sub-model is obtained by extracting data features with the first sub-model and training on those data features together with structured training data. The first sub-model extracts information from unstructured data, and the risk prediction result is obtained by inputting the data features output by the first sub-model into the second sub-model. The method improves the utilization of information in unstructured data, retains the interpretability of logistic regression or ensemble decision trees, and improves the risk assessment and prediction capability of the risk prediction model.

Description

Loan risk prediction method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a loan risk prediction method, a loan risk prediction device, electronic equipment and a storage medium.
Background
Loan risk control refers to using algorithms to analyze a borrower's credit data and repayment behavior during the loan and repayment process, so as to predict the risk that the borrower will have repayment problems. Because loan businesses take many forms and risks are diverse, risk control is the core foundation of such businesses. At present, most enterprises perform risk control through a risk policy model or manually, analyzing and sorting the information a user fills in when borrowing in order to predict loan risk. This approach does not fully exploit the characteristics and information of the data, resulting in lower prediction accuracy.
Disclosure of Invention
The embodiment of the invention aims at providing a loan risk prediction method, a loan risk prediction device, electronic equipment and a storage medium, which utilize flexibly selected data with different structures to train a risk prediction model, expand the dimension of characteristics and improve the prediction capability of the risk prediction model.
In a first aspect, an embodiment of the present application provides a loan risk prediction method, including: acquiring user data, wherein the user data comprises unstructured user data and structured user data; inputting user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data.
In the implementation process, the risk prediction model comprises a first sub-model and a second sub-model, wherein the first sub-model extracts information from unstructured data, and the data features output by the first sub-model are input into the second sub-model to obtain a risk prediction result. The method improves the utilization of information in unstructured data, retains the interpretability of logistic regression or ensemble decision trees, and improves the risk assessment and prediction capability of the risk prediction model.
Optionally, in an embodiment of the present application, the unstructured training data includes first unstructured training data and second unstructured training data; before inputting the user data into the preset risk prediction model to obtain the risk prediction result, the method further comprises the following steps: training a preset neural network through first unstructured training data to obtain a first sub-model; inputting the second unstructured training data into the first sub-model to obtain data characteristics; and training the preset meta model through the data characteristics and the structured training data to obtain a second sub model.
In the implementation process, the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is trained with unstructured training data and the second sub-model is trained with structured training data, so data with different structures can be flexibly selected to train the risk prediction model according to risk control business requirements. This improves the prediction capability of the risk prediction model and enables accurate risk prediction under relatively complex conditions.
Optionally, in this embodiment of the present application, training, through the first unstructured training data, a preset neural network to obtain a first sub-model includes: obtaining a vector sequence based on the first unstructured training data; adding corresponding labels to the vector sequence; and training the neural network through the vector sequence after adding the label to obtain a first sub-model.
In the implementation process, training is performed on a preset neural network through the first unstructured training data to obtain a first sub-model. The neural network is used for fully utilizing the high-dimensional data, and the dimension of the features is expanded on the basis of the original structural features. The data range of the risk prediction model can be greatly enlarged, and the accuracy of risk prediction is improved.
Optionally, in an embodiment of the present application, the first unstructured training data includes event sequence data; the vector sequence comprises a sequence of event vectors; based on the first unstructured training data, a vector sequence is obtained, comprising: acquiring attribute information of event sequence data; acquiring derivative attribute information of the event sequence data based on attribute information corresponding to the event sequence data and attribute information corresponding to the previous event sequence data; splicing the attribute information of the event sequence data and the derived attribute information to generate a feature vector of the event sequence data; and splicing the feature vectors of each event sequence data according to the time sequence to obtain an event vector sequence.
In the implementation process, the event sequence data is analyzed by collecting and integrating various sources and different types of data, information hidden in the data is found, multidimensional evaluation of clients is realized, and the accuracy of the model is improved.
Optionally, in an embodiment of the present application, the vector sequence includes a behavior vector sequence; based on the first unstructured training data, a vector sequence is obtained, comprising: obtaining behavior time information corresponding to the behavior sequence data; and splicing the behavior sequence data according to the behavior time information to obtain a behavior vector sequence.
In the implementation process, the data of the high-dimensional complex structure such as behavior sequence data are deeply utilized, and the applicability and accuracy of the model are improved.
Optionally, in an embodiment of the present application, training the preset meta-model through the data features and the structured training data to obtain the second sub-model includes: generating structured training data features from the structured training data through a preset feature generation rule; adding the data features and the structured training data features to a feature pool; screening the data features and the structured training data features in the feature pool to obtain modeling features; and training the meta-model through the modeling features to obtain the second sub-model.
In the implementation process, the first sub-model and the second sub-model are trained step by step, and the risk prediction model is a stacked model fused with the first sub-model and the second sub-model; meanwhile, the second sub-model integrates the feature with the interpretability based on rule derivation, so that the model maintains a certain degree of interpretability, and the accuracy of predicting risks by the risk prediction model is improved.
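The feature pool and screening step described above can be sketched as follows. This is a minimal illustration only: the text does not specify the screening criterion, so the sketch assumes a simple one (keep a feature when the absolute Pearson correlation between the feature and the label reaches a threshold). The function names, the threshold `min_abs_corr` and the feature names used in the usage note are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either side has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def build_feature_pool(
    data_features: Dict[str, List[float]],
    structured_features: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Put first-sub-model features and rule-derived features into one pool."""
    pool = dict(data_features)
    pool.update(structured_features)
    return pool


def screen_feature_pool(
    pool: Dict[str, List[float]],
    labels: Sequence[float],
    min_abs_corr: float = 0.3,  # assumed screening threshold, not from the text
) -> Dict[str, List[float]]:
    """Keep only features whose |correlation with the label| passes the threshold."""
    return {
        name: column
        for name, column in pool.items()
        if abs(pearson(column, labels)) >= min_abs_corr
    }
```

For example, pooling `{"nn_feature_0": [0.1, 0.9, 0.2, 0.8]}` with `{"constant": [1, 1, 1, 1]}` and screening against labels `[0, 1, 0, 1]` keeps only `nn_feature_0`, which would then be passed to the meta-model as a modeling feature. In practice a different criterion (such as information value or model-based importance) could be substituted without changing the overall flow.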
Optionally, in the embodiment of the present application, inputting the user data into a preset risk prediction model to obtain a risk prediction result includes: inputting unstructured user data into a first sub-model to obtain unstructured user data characteristics; generating structured user data features based on the structured user data; splicing the unstructured user data features and the structured user data features to generate splicing features; and inputting the spliced characteristic into a second sub-model to obtain a risk prediction result.
In the implementation process, splicing the unstructured user data features and the structured user data features to generate splicing features; and inputting the spliced characteristic into a second sub-model to obtain a risk prediction result. The unstructured user data features are fully utilized, and the dimension of the features is expanded on the basis of the original structured features. The comprehensive risk of credit and fraud is precisely controlled.
In a second aspect, an embodiment of the present application further provides a loan risk prediction apparatus, including: an acquisition module, configured to acquire user data, wherein the user data comprises unstructured user data and structured user data; and a prediction module, configured to input the user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory storing machine-readable instructions executable by the processor to perform the method as described above when executed by the processor.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method described above.
By adopting the loan risk prediction method, the loan risk prediction device, the electronic equipment and the storage medium, the risk prediction model comprises the first sub-model and the second sub-model, the first sub-model extracts information from unstructured data, and the risk prediction result is obtained by inputting the data features output by the first sub-model into the second sub-model. This improves the utilization of information in unstructured data, retains the interpretability of logistic regression or ensemble decision trees, improves the risk assessment and prediction capability of the risk prediction model, and enables accurate risk prediction under relatively complex conditions.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a loan risk prediction method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a risk prediction model according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a loan risk prediction apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the technical solutions of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical solutions of the present application, and thus are only examples, and are not intended to limit the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In the description of the embodiments of the present application, the technical terms "first," "second," etc. are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless explicitly defined otherwise.
Existing risk prediction models for assessing user loan risk are based mainly on logistic regression and ensemble decision trees. To build such a model, feature derivation is performed on the raw data according to feature generation rules to produce data features, and the model is then built on those features. This approach makes it difficult to fully exploit high-dimensional unstructured data such as images, text and sequences. Features produced by feature generation rules extract only a small amount of the information in unstructured data and can hardly cover the information in high-dimensional data comprehensively, so the prediction accuracy of risk prediction models based mainly on logistic regression and ensemble decision trees is low.
Deep learning models based on deep neural networks are better suited to extracting information from high-dimensional data, that is, unstructured data. However, if a deep learning model replaces the logistic regression or ensemble decision tree in an existing risk prediction model, the neural network is significantly less interpretable than logistic regression and decision trees, and the prediction accuracy of the resulting risk prediction model is therefore also low.
In the prior art, although neural networks are used in anti-fraud models and high-dimensional data such as images are utilized, the application scenarios are narrow and scattered, limited to identifying fraud scenarios such as a loan being handled by another person on the user's behalf, operation by someone other than the user, or other illegal operations. The neural network in a prior-art anti-fraud model cannot be effectively combined with the logistic regression or ensemble decision tree in a credit model to predict the user's repayment capability, overdue risk and the like.
To achieve a comprehensive assessment of clients, improve the utilization of information in unstructured data while retaining the interpretability of logistic regression or ensemble decision trees, and improve the risk assessment and prediction capability of the risk prediction model, the embodiments of the present application provide the loan risk prediction method described below.
Please refer to Fig. 1, which illustrates a flowchart of a loan risk prediction method provided in an embodiment of the present application. The loan risk prediction method provided by the embodiment of the application can be applied to electronic equipment, and the electronic equipment can comprise a terminal and a server; the terminal can be a smart phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA) and the like; the server may be an application server or a Web server. The loan risk prediction method may include the steps of:
Step S110: user data is acquired, the user data including unstructured user data and structured user data.
Step S120: inputting user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data.
In step S110, when applying for a loan, the user typically operates credit product client software on a terminal device, for example filling in personal information and completing identity verification in a loan APP (Application) installed on a mobile phone, tablet computer or computer. User data can therefore be collected through the terminal device after the user's authorization is obtained.
Unstructured user data may be high-dimensional data that is not suitable for being expressed and implemented by a two-dimensional database table. By way of example, unstructured user data may include image and video data, text data, sequence data, signal data, and the like. In loan risk prediction scenarios, image and video data include, for example, live face verification videos and identification card photographs; text data include, for example, application information filled in by the user; sequence data include, for example, page operation data and behavior operation data obtained through the terminal device; and signal data include, for example, sound signals and sensor signals. The unstructured user data collected in embodiments of the present application may be one or more of the types listed above, or other unstructured data.
The structured user data may be data logically expressed and implemented by a two-dimensional table structure, which may be stored and managed by a relational database. Illustratively, the structured user data includes user personal information, device information corresponding to a terminal device that collects the user data, and the like.
In step S120, the risk prediction model includes a first sub-model and a second sub-model, and the risk prediction model may be understood as a stacked model composed of the first sub-model and the second sub-model.
The first sub-model may be a deep learning model based on a deep neural network; the first sub-model is used for extracting data feature information from unstructured user data. Each type of unstructured user data may have a corresponding type of first sub-model. For example, if the unstructured user data is image and video data, the first sub-model may be a computer vision model built using a CNN; if the unstructured user data is text data, the first sub-model may be a Natural Language Processing (NLP) model; if the unstructured user data is sequence data, the first sub-model may be a sequence processing model built using an LSTM; if the unstructured user data is signal data, the first sub-model may be a signal processing model built using a Transformer.
The second sub-model comprises a logistic regression model or an ensemble decision tree, for example a GBDT (Gradient Boosting Decision Tree) or a generalized linear regression model. The second sub-model performs risk prediction on the data features that the first sub-model extracts from the unstructured user data together with the structured features generated according to the feature derivation rule, and obtains a risk prediction result.
The process of obtaining the risk prediction result by using the risk prediction model may be: inputting the unstructured user data in the user data into the first sub-model corresponding to that unstructured user data to obtain data features; generating corresponding structured features from the structured user data based on the feature derivation rule; and inputting the data features and the structured features into the second sub-model to obtain a risk prediction result. The risk prediction result may be a user credit score, a fraud prediction, an income prediction, a probability of overdue repayment, or the like.
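The prediction flow just described can be sketched in a few lines. This is an illustrative sketch, not the patented implementation: the two sub-models are passed in as plain callables, and the toy encoder, the toy scoring function, the field names (`age`, `monthly_income`, `monthly_debt`) and the debt-to-income derivation rule are all assumptions made up for the example.

```python
import math
from typing import Callable, Dict, List, Sequence

Encoder = Callable[[Sequence[float]], List[float]]
MetaModel = Callable[[List[float]], float]


def derive_structured_features(record: Dict[str, float]) -> List[float]:
    """Hypothetical feature derivation rule: raw fields plus a debt-to-income ratio."""
    income = record["monthly_income"]
    ratio = record["monthly_debt"] / income if income else 0.0
    return [record["age"], income, ratio]


def splice_features(data_features: List[float], structured: List[float]) -> List[float]:
    """Concatenate first-sub-model features with structured features."""
    return list(data_features) + list(structured)


def predict_risk(
    unstructured: Sequence[float],
    structured_record: Dict[str, float],
    first_sub_model: Encoder,
    second_sub_model: MetaModel,
) -> float:
    data_features = first_sub_model(unstructured)             # step 1: extract features
    structured = derive_structured_features(structured_record)  # step 2: derive features
    spliced = splice_features(data_features, structured)        # step 3: splice
    return second_sub_model(spliced)                            # step 4: risk score


def toy_encoder(sequence: Sequence[float]) -> List[float]:
    """Stand-in for a trained neural encoder: mean and max of the sequence."""
    return [sum(sequence) / len(sequence), max(sequence)]


def toy_meta_model(features: List[float]) -> float:
    """Stand-in for a trained logistic regression with made-up weights."""
    weights = [1.0, 0.5, 0.01, -0.0001, 2.0]
    z = sum(w * x for w, x in zip(weights, features)) - 1.0
    return 1.0 / (1.0 + math.exp(-z))
```

Calling `predict_risk(sequence, record, toy_encoder, toy_meta_model)` returns a value between 0 and 1 that plays the role of an overdue-repayment probability. Keeping the sub-models behind simple callables reflects the stacked structure: the encoder could be swapped for a CNN, LSTM or Transformer without touching the second stage.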
In the implementation process, the risk prediction model comprises a first sub-model and a second sub-model, wherein the first sub-model is used for extracting characteristic information in unstructured data, and a risk prediction result is obtained by inputting data characteristics output by the first sub-model into the second sub-model. The method improves the information utilization rate in unstructured data, effectively utilizes the interpretability of the structured data, and improves the risk assessment and prediction capability of a risk prediction model.
Optionally, in an embodiment of the present application, the unstructured training data includes first unstructured training data and second unstructured training data; before inputting the user data into the preset risk prediction model to obtain the risk prediction result, the method further comprises the following steps: training a preset neural network through first unstructured training data to obtain a first sub-model; inputting the second unstructured training data into the first sub-model to obtain data characteristics; and training the preset meta model through the data characteristics and the structured training data to obtain a second sub model.
In the specific implementation process: the data features output by the neural network encoder in the first sub-model are fed into the logistic regression model or ensemble decision tree in the second sub-model, forming a stacked model, namely the risk prediction model. However, because a neural network is trained in a completely different way from a logistic regression model or an ensemble decision tree, the risk prediction model formed by stacking the first sub-model and the second sub-model is difficult to train directly. Taking GBDT as an example, both the neural network and the GBDT are trained iteratively, but each iteration of the neural network updates all of its parameters, whereas each iteration of the GBDT adds a new set of parameters and leaves the previous parameters unchanged. The two are therefore difficult to iterate simultaneously and cannot be trained jointly in a direct manner.
The embodiment of the application trains the first sub-model and the second sub-model step by step to obtain the risk prediction model. For the training process of the first sub-model, first unstructured training data is collected according to the actual risk prediction scenario. For example, if the risk prediction scenario is predicting a customer's risk of overdue repayment, the first unstructured training data may be data collected on the customer's terminal device, such as touch behavior on a handheld smart device; it may also be data on events predefined across the whole credit cycle, such as registration and live face verification. A preset neural network is trained on the collected first unstructured training data to obtain the first sub-model.
For the training process of the second sub-model, pre-collected second unstructured training data is input into the trained first sub-model, which performs inference on the second unstructured training data to obtain the corresponding data features.
The second unstructured training data is a new training set relative to the first unstructured training data, and the data type, the acquisition mode and the like of the second unstructured training data can be the same as those of the first unstructured training data. The first sub-model and the second sub-model are respectively trained through different training sets, so that the problem of model overfitting caused by the fact that the same training set is adopted in the training process of the first sub-model and the second sub-model is solved.
After the data features corresponding to the second unstructured training data are obtained, the preset meta-model is trained on the data features and the structured training data, in combination with preset labels, to obtain the second sub-model. GBDT or generalized linear regression may be selected as the algorithm of the meta-model.
The training labels of the meta-model may be the same as or different from the training labels of the neural network in the first sub-model. In one embodiment, the labels of the first sub-model may relate to fraud, while the labels of the meta-model may relate to loan repayment and debt default. The first sub-model then extracts fraud-related information from the data, and this information is input into the meta-model, which is trained with loan repayment, debt default and the like as labels. The second sub-model can thus establish the connection between fraud features and debt default, integrating credit evaluation and anti-fraud.
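The step-by-step training described above can be sketched as follows. To stay self-contained, the sketch replaces both the neural network and the GBDT or generalized linear meta-model with a small logistic regression trained by gradient descent; that substitution, the `summarize` helper and all function names are assumptions for illustration. What the sketch does preserve is the structure: the first sub-model is trained on one set with one label (for example fraud), it is then frozen and used only for inference on a second set, and the meta-model is trained on the resulting data features plus structured features with its own label (for example default).

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(model: Model, x: Sequence[float]) -> float:
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logreg(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 300) -> Model:
    """Full-batch gradient descent on the logistic loss."""
    n, d = len(X), len(X[0])
    model: Model = ([0.0] * d, 0.0)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = score(model, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights, bias = model
        model = ([w - lr * g / n for w, g in zip(weights, grad_w)],
                 bias - lr * grad_b / n)
    return model


def summarize(sequence: Sequence[float]) -> List[float]:
    """Stand-in for the network's input encoding: mean and max of the sequence."""
    return [sum(sequence) / len(sequence), max(sequence)]


def train_stepwise(set_a, set_b) -> Tuple[Model, Model]:
    """set_a: (sequence, fraud_label); set_b: (sequence, structured, default_label)."""
    # Step 1: train the first sub-model on the first unstructured training set.
    first = train_logreg([summarize(s) for s, _ in set_a], [y for _, y in set_a])
    # Step 2: inference only on the second set; the first sub-model stays frozen.
    data_features = [[score(first, summarize(s))] for s, _, _ in set_b]
    # Step 3: train the meta-model on data features + structured features.
    X_meta = [f + list(st) for f, (_, st, _) in zip(data_features, set_b)]
    second = train_logreg(X_meta, [y for _, _, y in set_b])
    return first, second


def predict_stacked(first: Model, second: Model,
                    sequence: Sequence[float], structured: Sequence[float]) -> float:
    return score(second, [score(first, summarize(sequence))] + list(structured))
```

Because `set_a` and `set_b` are disjoint, the meta-model never sees first-stage outputs for samples the first sub-model was fitted on, which is the overfitting safeguard the text describes.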
In the implementation process, the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is trained with unstructured training data and the second sub-model is trained with structured training data, so data with different structures can be flexibly selected to train the risk prediction model according to risk control business requirements. This improves the prediction capability of the risk prediction model and enables accurate risk prediction under relatively complex conditions.
Optionally, in this embodiment of the present application, training, through the first unstructured training data, a preset neural network to obtain a first sub-model includes: obtaining a vector sequence based on the first unstructured training data; adding corresponding labels to the vector sequence; and training the neural network through the vector sequence after adding the label to obtain a first sub-model.
In the specific implementation process: the process of training the first sub-model may specifically be: based on the first unstructured training data, a vector sequence corresponding to the first unstructured training data is obtained. Wherein the first unstructured training data comprises event sequence data and behavior sequence data; correspondingly, the vector sequence corresponding to the event sequence data is an event vector sequence, and the vector sequence corresponding to the behavior sequence data is a behavior vector sequence.
Data preprocessing is performed on the acquired first unstructured training data to obtain the vector sequence. Corresponding labels are then added to the vector sequence according to the information it contains; the labels include overdue repayment, normal repayment, loan compensation, fraud substitution and the like. The neural network is trained on the labeled vector sequence to obtain the first sub-model.
The event sequence data and the behavior sequence data may each train their own neural network separately using the same label; they may each train their own neural network separately using different labels; or the neural networks corresponding to the event sequence data and the behavior sequence data may be integrated and trained on the multi-modal data with a single shared label.
In the implementation process, training is performed on a preset neural network through the first unstructured training data to obtain a first sub-model. The high-dimensional data is fully utilized through the neural network, and the dimension of the features is expanded on the basis of the original structured features. The data range of the risk prediction model can be greatly enlarged, and the accuracy of risk prediction is improved.
Optionally, in an embodiment of the present application, the first unstructured training data includes event sequence data; the vector sequence comprises a sequence of event vectors; based on the first unstructured training data, a vector sequence is obtained, comprising: acquiring attribute information of event sequence data; acquiring derivative attribute information of the event sequence data based on attribute information corresponding to the event sequence data and attribute information corresponding to the previous event sequence data; splicing the attribute information of the event sequence data and the derived attribute information to generate a feature vector of the event sequence data; and splicing the feature vectors of each event sequence data according to the time sequence to obtain an event vector sequence.
In the specific implementation process: the event sequence data is the information on each event recorded throughout the operation flow when the user operates the credit product client software on the terminal device. The event sequence data includes information corresponding to page operation events such as registration, liveness authentication, application, withdrawal and/or repayment. The attribute information of the event sequence data includes event time information and/or event space information; the event time information is the time at which the event occurs, and the event space information is the place where the event occurs, such as GPS positioning data.
Derivative attribute information of the event sequence data is obtained based on the attribute information corresponding to the event sequence data and the attribute information corresponding to the previous event sequence data. With the events arranged in chronological order, the previous event sequence data is the event that occurs before the current event. Based on the event time information and/or event space information in the attribute information, the time displacement and/or space displacement between the event sequence data and the previous event sequence data is calculated and used as the derivative attribute information of the event sequence data.
The attribute information and the derivative attribute information of the event sequence data are spliced to generate a feature vector of the current event sequence data, and the feature vectors of all event sequence data are spliced in chronological order to obtain the event vector sequence.
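As a non-limiting illustration, the construction of the event vector sequence may be sketched as follows. The `Event` class, its field names, the use of planar coordinates with Euclidean distance as the space displacement, and the zero displacement assigned to the first event are all assumptions made for the example and are not specified by the present application.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Event:
    """One recorded page-operation event (hypothetical field names)."""
    timestamp: float  # event time information, in seconds
    x: float          # event space information, planar coordinate
    y: float          # event space information, planar coordinate


def event_vector_sequence(events: List[Event]) -> List[List[float]]:
    """Splice attribute and derivative attribute information per event, in time order."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    sequence: List[List[float]] = []
    previous: Optional[Event] = None
    for event in ordered:
        attributes = [event.timestamp, event.x, event.y]
        if previous is None:
            # Assumption: the first event has no predecessor, so both displacements are zero.
            derived = [0.0, 0.0]
        else:
            time_shift = event.timestamp - previous.timestamp
            space_shift = hypot(event.x - previous.x, event.y - previous.y)
            derived = [time_shift, space_shift]
        # Feature vector of this event = attribute information + derivative attribute information.
        sequence.append(attributes + derived)
        previous = event
    return sequence


events = [Event(100.0, 3.0, 4.0), Event(0.0, 0.0, 0.0), Event(160.0, 3.0, 4.0)]
vectors = event_vector_sequence(events)
```

Each inner list is the feature vector of one event, and the outer list is the event vector sequence that would be fed to the neural network. With real GPS data, a great-circle distance would likely replace the planar distance used here.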
In an alternative embodiment, the attribute information of the event sequence data may further include an event type, and the change in type between the event sequence data and the previous event sequence data may be used as derivative attribute information of the event sequence data.
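One possible way to express the type change as a derivative attribute is a binary flag per event, as in the sketch below. The binary encoding, the function name `type_change_flags`, and the choice of 0 for the first event are assumptions for illustration only.

```python
from typing import List, Optional


def type_change_flags(event_types: List[str]) -> List[int]:
    """Return 1 where an event's type differs from the previous event's type, else 0."""
    flags: List[int] = []
    previous: Optional[str] = None
    for event_type in event_types:
        # Assumption: the first event has no predecessor and is flagged 0.
        changed = previous is not None and event_type != previous
        flags.append(1 if changed else 0)
        previous = event_type
    return flags


flags = type_change_flags(["registration", "application", "application", "withdrawal"])
```

The resulting flag could be appended to each event's feature vector alongside the time and space displacements.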
In the implementation process, the event sequence data is analyzed by collecting and integrating various sources and different types of data, the information hidden in the data is deeply mined, the multidimensional evaluation of clients is realized, and the accuracy of the model is improved.
Optionally, in an embodiment of the present application, the vector sequence includes a behavior vector sequence; based on the first unstructured training data, a vector sequence is obtained, comprising: obtaining behavior time information corresponding to the behavior sequence data; and splicing the behavior sequence data according to the behavior time information to obtain a behavior vector sequence.
In the specific implementation process: the behavior sequence data includes information on touch behavior and repayment behavior on the terminal device, such as whether the repayment in each period was overdue, normal or compensated, the repayment mode, and the like. The behavior time information corresponding to the behavior sequence data is obtained; specifically, this may be the time of repayment in each period. The behavior sequence data is then spliced according to the behavior time information to obtain the behavior vector sequence.
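A minimal sketch of the behavior vector sequence is given below, assuming each record is a pair of repayment time and repayment status. The one-hot encoding in `STATUS_ONE_HOT` and the record layout are hypothetical; the present application does not prescribe an encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical one-hot encoding of the repayment status in one period.
STATUS_ONE_HOT: Dict[str, List[float]] = {
    "normal": [1.0, 0.0, 0.0],
    "overdue": [0.0, 1.0, 0.0],
    "compensation": [0.0, 0.0, 1.0],
}


def behavior_vector_sequence(records: List[Tuple[float, str]]) -> List[List[float]]:
    """Order repayment records by behavior time and encode each as a vector."""
    ordered = sorted(records, key=lambda record: record[0])
    return [STATUS_ONE_HOT[status] for _, status in ordered]


# Records arrive unordered as (repayment time, status).
records = [(3.0, "overdue"), (1.0, "normal"), (2.0, "normal")]
sequence = behavior_vector_sequence(records)
```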
In the implementation process, the data of the high-dimensional complex structure such as behavior sequence data are deeply utilized, and the applicability and accuracy of the model are improved.
Optionally, in an embodiment of the present application, training the preset meta-model through the data features and the structured training data to obtain the second sub-model includes: generating structured training data features based on the structured training data through a preset feature generation rule; adding the data features and the structured training data features to a feature pool; screening the data features and the structured training data features in the feature pool to obtain modeling features; and training the meta-model through the modeling features to obtain the second sub-model.
In the specific implementation process: the structured training data is mapped through a preset feature generation rule to generate the structured training data features. Feature mapping is the mapping of data to a high-dimensional space.
The data features and the structured training data features are added to a feature pool, and the features in the feature pool are analyzed and screened. Feature screening may be carried out in the following ways. The first way is a filter method, which filters features based on predefined criteria, such as the correlation of an individual feature with the target variable or the information gain of an individual feature. The second way is a wrapper method, which screens features based on model performance, for example by iteratively eliminating unimportant features with a recursive feature elimination algorithm. After screening, the retained features are used as the modeling features. Corresponding labels are added to the modeling features using pre-designed labels, GBDT or generalized linear regression is selected as the algorithm of the meta-model, and the meta-model is trained through the modeling features to obtain the second sub-model.
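The filter method described above may be sketched as follows, using the absolute Pearson correlation between each feature and the target variable as the predefined criterion. The feature names, the sample values, and the threshold of 0.5 are assumptions chosen for illustration.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def filter_features(pool: Dict[str, List[float]], target: List[float],
                    threshold: float) -> List[str]:
    """Keep the features whose absolute correlation with the target meets the threshold."""
    return [name for name, column in pool.items()
            if abs(pearson(column, target)) >= threshold]


# Hypothetical feature pool: one encoder output and two rule-derived features.
feature_pool = {
    "encoder_dim_0": [0.1, 0.4, 0.5, 0.9],
    "income_bucket": [3.0, 3.0, 3.0, 3.0],
    "device_age": [5.0, 1.0, 4.0, 2.0],
}
labels = [0.0, 0.0, 1.0, 1.0]
modeling_features = filter_features(feature_pool, labels, threshold=0.5)
```

A wrapper method would instead retrain a model repeatedly and drop the least important feature at each round, which costs more computation but accounts for interactions between features.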
In the implementation process, the risk prediction model is a stacked model that fuses the first sub-model and the second sub-model; the first sub-model acquires unstructured data features, and the second sub-model integrates interpretable rule-derived features, so that the model maintains a certain degree of interpretability and the accuracy of risk prediction of the risk prediction model is improved.
Please refer to fig. 2, which shows a schematic diagram of the risk prediction model provided in an embodiment of the present application.
Optionally, in the embodiment of the present application, inputting the user data into a preset risk prediction model to obtain a risk prediction result includes: inputting unstructured user data into a first sub-model to obtain unstructured user data characteristics; generating structured user data features based on the structured user data; splicing the unstructured user data features and the structured user data features to generate splicing features; and inputting the spliced characteristic into a second sub-model to obtain a risk prediction result.
In the specific implementation process: as shown in fig. 2, the unstructured user data includes event sequence data and behavior sequence data, which may specifically be an event sequence and a repayment behavior sequence. The repayment behavior sequence and the event sequence are input into the LSTM neural network encoder in the first sub-model to obtain unstructured user data features, and the unstructured user data features are added to a feature pool.
The structured user data includes the user's personal information and device information. Feature derivation rules are obtained based on business experience, the structured user data is featurized according to these rules to generate structured user data features, and the structured user data features are added to the feature pool.
Feature screening may be performed on the unstructured user data features and the structured user data features added to the feature pool, and the screened unstructured user data features and structured user data features are spliced to generate spliced features. The spliced features are input into the second sub-model, and a risk prediction result is obtained through the GBDT or generalized linear regression algorithm.
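The prediction flow above may be sketched as follows. The function `predict_risk` only wires the steps together; the mean-pooling encoder, the single feature rule, and the fixed-weight logistic model are stand-ins chosen so that the example runs, and are not the LSTM encoder or GBDT of the present application. Feature screening is omitted for brevity.

```python
from math import exp
from typing import Callable, Dict, List, Sequence


def predict_risk(
    unstructured: Sequence[Sequence[float]],
    structured: Dict[str, float],
    encoder: Callable[[Sequence[Sequence[float]]], List[float]],
    feature_rules: Sequence[Callable[[Dict[str, float]], float]],
    second_sub_model: Callable[[List[float]], float],
) -> float:
    """Encode, derive, splice, then score with the second sub-model."""
    unstructured_features = encoder(unstructured)
    structured_features = [rule(structured) for rule in feature_rules]
    spliced = unstructured_features + structured_features
    return second_sub_model(spliced)


def mean_pool_encoder(sequence: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for the first sub-model's encoder: average over time steps."""
    width = len(sequence[0])
    return [sum(step[i] for step in sequence) / len(sequence) for i in range(width)]


def logistic_meta_model(features: List[float]) -> float:
    """Stand-in for the second sub-model, with hypothetical fixed weights."""
    weights = (1.0, -1.0, 0.5)
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical rule derived from business experience: loan-to-income ratio.
rules = [lambda data: data["loan_amount"] / data["monthly_income"]]

risk = predict_risk(
    [[1.0, 0.0], [0.0, 1.0]],
    {"loan_amount": 2000.0, "monthly_income": 1000.0},
    mean_pool_encoder,
    rules,
    logistic_meta_model,
)
```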
The method enables the risk prediction model to use both unstructured features encoded by the neural network and structured features related to credit evaluation, so that the data features are fully utilized while model interpretability is also taken into account.
The label of the first sub-model may be set to fraud-related content, so that the first sub-model extracts fraud-related information from the data. This information is then input into the meta-model, which is trained with labels such as loan repayment and debt default. The trained GBDT can thereby establish the connection between fraud features and debt default, so that credit evaluation and anti-fraud are integrated.
In the implementation process, splicing the unstructured user data features and the structured user data features to generate splicing features; and inputting the spliced characteristic into a second sub-model to obtain a risk prediction result. The unstructured user data features are fully utilized, and the dimension of the features is expanded on the basis of the original structured features. The comprehensive risk of credit and fraud is precisely controlled.
In an alternative embodiment, first unstructured training data and second unstructured training data for training the risk prediction model are collected in advance. The LSTM neural network of the first sub-model is trained with the first unstructured training data. The structure of the neural network is divided into an encoder and a prediction head: the data is input into the encoder, which outputs feature vectors; the feature vectors then enter the prediction head, which outputs a predicted value. The trained first sub-model is finally obtained.
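The division of the network into an encoder and a prediction head may be illustrated with the forward pass below. To keep the sketch dependency-free, a plain tanh recurrent cell stands in for the LSTM, the weights are random and untrained, and backpropagation is omitted; the class names and sizes are assumptions.

```python
import random
from math import exp, tanh
from typing import List, Sequence


class RecurrentEncoder:
    """Tanh recurrent cell standing in for the LSTM encoder (an assumption)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                     for _ in range(hidden_size)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(hidden_size)]

    def encode(self, sequence: Sequence[Sequence[float]]) -> List[float]:
        """Consume the vector sequence step by step; return the final hidden state."""
        hidden = [0.0] * self.hidden_size
        for step in sequence:
            hidden = [
                tanh(sum(w * x for w, x in zip(self.w_in[j], step))
                     + sum(w * h for w, h in zip(self.w_rec[j], hidden)))
                for j in range(self.hidden_size)
            ]
        return hidden


class PredictionHead:
    """Maps the encoder's feature vector to a predicted value in (0, 1)."""

    def __init__(self, hidden_size: int, seed: int = 1) -> None:
        rng = random.Random(seed)
        self.weights = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.bias = 0.0

    def predict(self, features: Sequence[float]) -> float:
        z = self.bias + sum(w * f for w, f in zip(self.weights, features))
        return 1.0 / (1.0 + exp(-z))


encoder = RecurrentEncoder(input_size=5, hidden_size=4)
head = PredictionHead(hidden_size=4)
sequence = [[0.0] * 5, [1.0, 0.5, 0.2, 0.1, 0.0]]
features = encoder.encode(sequence)  # feature vector, later reused by the meta-model
score = head.predict(features)       # predicted value, used only while training
```

Separating the two parts means the prediction head can be discarded after training, while the encoder output is kept as the data features passed to the second sub-model.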
Features are extracted using the trained neural network encoder. Specifically, the second unstructured training data is input into the neural network encoder of the first sub-model, and inference is performed to obtain feature vectors.
The structured training data is obtained and mapped through a preset feature generation rule to generate the structured training data features.
The feature vectors and the structured training data features are added to a feature pool, and the features in the feature pool are screened to obtain screened features.
The screened features are taken as input data and combined with the designed labels; GBDT or generalized linear regression is selected as the algorithm of the meta-model, and the meta-model is trained to obtain the second sub-model, thereby completing the training of the risk prediction model.
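Training of the meta-model on the screened features may be sketched as below, taking logistic regression fitted by batch gradient descent as one instance of a generalized linear model. The sample rows, the labels, the learning rate, and the number of epochs are assumptions; a GBDT could be substituted as the meta-model algorithm.

```python
from math import exp
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + exp(-z))


def score(row: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Predicted probability of the positive label for one feature row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def train_logistic_meta_model(rows: List[List[float]], labels: List[int],
                              epochs: int = 2000,
                              lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = score(row, weights, bias) - label
            for i, value in enumerate(row):
                grad_w[i] += error * value
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical screened features per sample: [encoder feature, rule-derived feature].
rows = [[0.9, 0.8], [0.8, 0.6], [0.2, 0.3], [0.1, 0.1]]
labels = [1, 1, 0, 0]  # 1 = default, 0 = normal repayment (illustrative labels)
weights, bias = train_logistic_meta_model(rows, labels)
predictions = [score(row, weights, bias) for row in rows]
```

The learned weights of a linear meta-model can be read directly, which is one way the stacked model retains some interpretability.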
The process of predicting loan risk for a user by the risk prediction model is as follows: data of a user who needs loan risk prediction is obtained, the user data including unstructured user data and structured user data. The unstructured user data is input into the first sub-model in the trained risk prediction model to obtain unstructured user data features.
The structured user data is mapped based on a preset feature generation rule to generate the structured user data features.
Feature screening is carried out on the unstructured user data features and the structured user data features, and the screened unstructured user data features and structured user data features are spliced to generate spliced features. The spliced features are input into the second sub-model, and a risk prediction result is obtained through the GBDT or generalized linear regression algorithm. The risk prediction result may be a user credit score, a fraud prediction, an income prediction, a repayment overdue risk probability, or the like.
Please refer to fig. 3, which illustrates a schematic structural diagram of a loan risk prediction apparatus provided in an embodiment of the present application; the embodiment of the application provides a loan risk prediction device 200, which comprises:
an acquisition module 210, configured to acquire user data, where the user data includes unstructured user data and structured user data;
The prediction module 220 is configured to input user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data.
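The module structure of the device may be illustrated as follows. The class names mirror the modules named above, while the constructor arguments, the `UserData` container, and the stub data source and model are assumptions made so that the sketch is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UserData:
    """User data split into its unstructured and structured parts."""
    unstructured: List[List[float]]
    structured: Dict[str, float]


class AcquisitionModule:
    """Acquires user data from a caller-supplied source (hypothetical interface)."""

    def __init__(self, source: Callable[[str], UserData]) -> None:
        self._source = source

    def acquire(self, user_id: str) -> UserData:
        return self._source(user_id)


class PredictionModule:
    """Inputs user data into a preset risk prediction model."""

    def __init__(self, risk_model: Callable[[UserData], float]) -> None:
        self._risk_model = risk_model

    def predict(self, user_data: UserData) -> float:
        return self._risk_model(user_data)


class LoanRiskPredictionDevice:
    """Chains the acquisition module and the prediction module."""

    def __init__(self, acquisition: AcquisitionModule, prediction: PredictionModule) -> None:
        self.acquisition = acquisition
        self.prediction = prediction

    def run(self, user_id: str) -> float:
        return self.prediction.predict(self.acquisition.acquire(user_id))


# Stub source and model, used only to exercise the wiring.
device = LoanRiskPredictionDevice(
    AcquisitionModule(lambda user_id: UserData([[1.0, 0.0]], {"age": 30.0})),
    PredictionModule(lambda data: 0.25),
)
result = device.run("user-1")
```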
Optionally, in an embodiment of the present application, in the loan risk prediction device, the unstructured training data includes first unstructured training data and second unstructured training data; the device further comprises a training module, configured to train a preset neural network through the first unstructured training data to obtain a first sub-model; input the second unstructured training data into the first sub-model to obtain data features; and train the preset meta-model through the data features and the structured training data to obtain a second sub-model.
Optionally, in an embodiment of the present application, in the loan risk prediction device, the training module is further configured to obtain a vector sequence based on the first unstructured training data; add corresponding labels to the vector sequence; and train the neural network through the labeled vector sequence to obtain the first sub-model.
Optionally, in an embodiment of the present application, in the loan risk prediction device, the first unstructured training data includes event sequence data; the vector sequence comprises an event vector sequence; the training module is further configured to obtain attribute information of the event sequence data; acquire derivative attribute information of the event sequence data based on the attribute information corresponding to the event sequence data and the attribute information corresponding to the previous event sequence data; splice the attribute information of the event sequence data and the derivative attribute information to generate a feature vector of the event sequence data; and splice the feature vectors of each event sequence data in chronological order to obtain the event vector sequence.
Optionally, in an embodiment of the present application, in the loan risk prediction device, the first unstructured training data includes behavior sequence data; the vector sequence comprises a behavior vector sequence; the training module is further configured to obtain behavior time information corresponding to the behavior sequence data; and splice the behavior sequence data according to the behavior time information to obtain the behavior vector sequence.
Optionally, in an embodiment of the present application, in the loan risk prediction device, the training module is further configured to generate structured training data features based on the structured training data through a preset feature generation rule; add the data features and the structured training data features to a feature pool; screen the data features and the structured training data features in the feature pool to obtain modeling features; and train the meta-model through the modeling features to obtain the second sub-model.
Optionally, in an embodiment of the present application, in the loan risk prediction device, the prediction module is specifically configured to input unstructured user data into the first sub-model to obtain unstructured user data features; generate structured user data features based on the structured user data; splice the unstructured user data features and the structured user data features to generate spliced features; and input the spliced features into the second sub-model to obtain a risk prediction result.
It should be understood that the apparatus corresponds to the loan risk prediction method embodiment described above and can perform the steps of that method embodiment; for the specific functions of the apparatus, reference may be made to the description above, and detailed descriptions are omitted here as appropriate to avoid redundancy. The device includes at least one software functional module that can be stored in memory in the form of software or firmware, or embedded in an Operating System (OS) of the device.
Please refer to fig. 4, which illustrates a schematic structural diagram of an electronic device provided in an embodiment of the present application. An electronic device 300 provided in an embodiment of the present application includes: a processor 310 and a memory 320, the memory 320 storing machine-readable instructions executable by the processor 310, which when executed by the processor 310 perform the method as described above.
The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs a method as above.
The storage medium may be implemented by any type of volatile or nonvolatile Memory device or combination thereof, such as static random access Memory (Static Random Access Memory, SRAM), electrically erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable Programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), programmable Read-Only Memory (PROM), read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The foregoing description is merely an optional implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art may easily think about changes or substitutions within the technical scope of the embodiments of the present application, and the changes or substitutions should be covered in the scope of the embodiments of the present application.

Claims (7)

1. A loan risk prediction method, comprising:
obtaining user data, wherein the user data comprises unstructured user data and structured user data;
inputting the user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data;
the unstructured training data includes first unstructured training data and second unstructured training data; before inputting the user data into a preset risk prediction model to obtain a risk prediction result, the method further comprises:
training a preset neural network through the first unstructured training data to obtain the first sub-model;
inputting the second unstructured training data into the first sub-model to obtain the data features;
training a preset meta model through the data features and the structured training data to obtain the second sub-model;
training a preset neural network through the first unstructured training data to obtain the first sub-model, wherein the training comprises the following steps:
obtaining a vector sequence based on the first unstructured training data;
adding a corresponding label to the vector sequence;
training the neural network through the vector sequence added with the label to obtain the first sub-model;
the first unstructured training data includes event sequence data; the vector sequence comprises an event vector sequence; based on the first unstructured training data, obtaining a vector sequence comprises:
obtaining attribute information of the event sequence data; the attribute information of the event sequence data comprises event time information and/or event space information;
acquiring derivative attribute information of the event sequence data based on the attribute information corresponding to the event sequence data and the attribute information corresponding to the previous event sequence data;
splicing the attribute information of the event sequence data and the derivative attribute information to generate a feature vector of the event sequence data;
and splicing the feature vectors of each event sequence data according to the time sequence to obtain the event vector sequence.
2. The method of claim 1, wherein the first unstructured training data comprises behavior sequence data; the vector sequence comprises a behavior vector sequence; based on the first unstructured training data, obtaining a vector sequence comprises:
obtaining behavior time information corresponding to the behavior sequence data;
and splicing the behavior sequence data according to the behavior time information to obtain the behavior vector sequence.
3. The method of claim 1, wherein training a pre-set meta-model through the data features and the structured training data to obtain the second sub-model comprises:
generating structured training data features based on the structured training data through a preset feature generation rule;
adding the data features and the structured training data features to a feature pool;
screening the data features and the structured training data features in the feature pool to obtain modeling features;
and training the meta model through the modeling features to obtain the second sub-model.
4. A method according to any one of claims 1-3, wherein inputting the user data into a preset risk prediction model to obtain a risk prediction result comprises:
inputting the unstructured user data into the first sub-model to obtain unstructured user data features;
generating structured user data features based on the structured user data;
splicing the unstructured user data features and the structured user data features to generate splicing features;
and inputting the splicing features into the second sub-model to obtain the risk prediction result.
5. A loan risk prediction apparatus, comprising:
an acquisition module, configured to acquire user data, wherein the user data comprises unstructured user data and structured user data;
a prediction module, configured to input the user data into a preset risk prediction model to obtain a risk prediction result; the risk prediction model comprises a first sub-model and a second sub-model; the first sub-model is obtained by training unstructured training data; the second sub-model is obtained by obtaining data features through the first sub-model and training the data features and the structured training data;
the unstructured training data includes first unstructured training data and second unstructured training data; the device further comprises a training module, configured to train a preset neural network through the first unstructured training data to obtain the first sub-model; input the second unstructured training data into the first sub-model to obtain the data features; and train a preset meta model through the data features and the structured training data to obtain the second sub-model;
the training module is further configured to obtain a vector sequence based on the first unstructured training data; adding a corresponding label to the vector sequence; training the neural network through the vector sequence added with the label to obtain the first sub-model;
the first unstructured training data includes event sequence data; the vector sequence comprises an event vector sequence; the training module is further configured to obtain attribute information of the event sequence data; the attribute information of the event sequence data comprises event time information and/or event space information; acquire derivative attribute information of the event sequence data based on the attribute information corresponding to the event sequence data and the attribute information corresponding to the previous event sequence data; splice the attribute information of the event sequence data and the derivative attribute information to generate a feature vector of the event sequence data; and splice the feature vectors of each event sequence data according to the time sequence to obtain the event vector sequence.
6. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor to perform the method of any one of claims 1 to 4 when executed by the processor.
7. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the method according to any of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310474449.0A CN116205726B (en) 2023-04-28 2023-04-28 Loan risk prediction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310474449.0A CN116205726B (en) 2023-04-28 2023-04-28 Loan risk prediction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116205726A CN116205726A (en) 2023-06-02
CN116205726B true CN116205726B (en) 2023-08-01

Family

ID=86513273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310474449.0A Active CN116205726B (en) 2023-04-28 2023-04-28 Loan risk prediction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116205726B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785086A (en) * 2021-02-10 2021-05-11 中国工商银行股份有限公司 Credit overdue risk prediction method and device
CN115983982A (en) * 2023-01-09 2023-04-18 深圳前海微众银行股份有限公司 Credit risk identification method, credit risk identification device, credit risk identification equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN116205726A (en) 2023-06-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant