CN113052512A - Risk prediction method and device and electronic equipment

Info

Publication number: CN113052512A
Application number: CN202110516028.0A
Authority: CN
Inventors: 陈李龙; 王娜; 强锋; 刘华杰
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-06-29

Abstract

The disclosure provides a risk prediction method, a risk prediction apparatus, and an electronic device, which can be used in fields such as artificial intelligence and finance. The risk prediction method comprises: acquiring data to be predicted; acquiring risk characteristics of the data to be predicted; and processing the risk characteristics by using a trained risk prediction model to obtain a risk prediction result, wherein the risk prediction result comprises the category to which the data to be predicted belongs. The sample data of each category comprises boundary sample data, and the objective function of the risk prediction model comprises a boundary sample discrimination constraint term. This term makes a first distance, in the output space, between boundary sample data belonging to different categories greater than a second distance, in the output space, between boundary sample data belonging to the same category.

Description

Risk prediction method and device and electronic equipment

Technical Field

The present disclosure relates to the field of artificial intelligence technology and the field of finance, and more particularly, to a risk prediction method, apparatus, and electronic device.

Background

Risk prediction is a topic of wide concern for many kinds of organizations. For example, with the development of the financial industry, corporate (legal-person) loan business accounts for a growing share of the business of financial institutions, and risk prediction for such loans is becoming increasingly important.

In the course of implementing the disclosed concept, the applicant found that the related art has at least the following problem: some business scenarios (such as loan scenarios) are highly complex, so it is difficult to identify high-risk customers before the business is processed. This leads to abnormal business outcomes, for example a worsening of non-performing loans, which adversely affects the institution.

Disclosure of Invention

In view of the above, the present disclosure provides a risk prediction method, apparatus, and electronic device to at least partially solve the problem that high-risk customers are difficult to identify in advance.

One aspect of the present disclosure provides a risk prediction method, comprising: acquiring data to be predicted; acquiring risk characteristics of the data to be predicted; and processing the risk characteristics by using a trained risk prediction model to obtain a risk prediction result, wherein the risk prediction result comprises the category to which the data to be predicted belongs. The sample data of each category comprises boundary sample data, the objective function of the risk prediction model comprises a boundary sample discrimination constraint term, and the boundary sample discrimination constraint term makes a first distance, in the output space, between boundary sample data belonging to different categories greater than a second distance, in the output space, between boundary sample data belonging to the same category.

According to an embodiment of the present disclosure, the categories include a positive sample category and a negative sample category, and the boundary sample discrimination constraint term comprises a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term. The output of the positive sample sub-constraint term is related to the difference between the sub-result obtained when the model processes positive sample data and the sub-result obtained when the model processes the mean of the boundary positive sample subset. The output of the negative sample sub-constraint term is related to the difference between the sub-result obtained when the model processes negative sample data and the sub-result obtained when the model processes the mean of the boundary negative sample subset. The output of the cross term is related to the product of the sub-result for positive sample data and the sub-result for negative sample data.

According to an embodiment of the present disclosure, acquiring the risk characteristics of the data to be predicted comprises at least one of: acquiring sub-features of the data to be predicted for a basic information view; acquiring sub-features of the data to be predicted for a business information view; or acquiring sub-features of the data to be predicted for a behavior information view.

According to an embodiment of the present disclosure, the boundary sample discrimination constraint term comprises at least one of the following: a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the basic information view; a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the business information view; or a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the behavior information view.

According to the embodiment of the disclosure, the objective function of the risk prediction model further includes an inter-view boundary sample discrimination constraint term, and the inter-view boundary sample discrimination constraint term enables the outputs of the mean values of the boundary samples of the same category in different views to be consistent.

According to an embodiment of the present disclosure, the inter-view boundary sample discrimination constraint term includes at least one of the following: a positive sample sub-constraint term for different views; or a negative sample sub-constraint term for different views.

According to an embodiment of the present disclosure, training the risk prediction model includes: acquiring a training sample data set, wherein the training sample data set comprises a positive training sample data subset and a negative training sample data subset; determining a boundary positive sample subset in the positive training sample data subset based on the heterogeneous neighbor information of the samples, and determining a boundary negative sample subset in the negative training sample data subset based on the heterogeneous neighbor information of the samples; and inputting the positive training sample data subset and/or the negative training sample data subset into the risk prediction model and adjusting the parameters of the risk prediction model until a preset number of iterations is reached or the difference between the objective function values of two successive iterations is smaller than a preset threshold.
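The stopping rule in this training procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `train_until_converged`, `make_toy_step`, the toy quadratic objective, and the values of `max_iters` and `tol` are hypothetical stand-ins, since the disclosure does not fix an optimizer, an iteration count, or a threshold.

```python
from typing import Callable, List, Tuple


def train_until_converged(
    step: Callable[[], float],
    max_iters: int = 100,
    tol: float = 1e-4,
) -> Tuple[int, List[float]]:
    """Call `step` until the iteration cap is hit or the objective stops changing.

    `step` performs one parameter update and returns the current objective value.
    Returns the number of iterations used and the objective history.
    """
    history: List[float] = []
    for iteration in range(1, max_iters + 1):
        history.append(step())
        # Stop when two successive objective values differ by less than `tol`.
        if len(history) >= 2 and abs(history[-1] - history[-2]) < tol:
            return iteration, history
    return max_iters, history


def make_toy_step(lr: float = 0.25) -> Callable[[], float]:
    """Hypothetical stand-in for a model update: gradient descent on (w - 3)^2."""
    state = {"w": 0.0}

    def step() -> float:
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= lr * grad
        return (state["w"] - 3.0) ** 2

    return step


iters_used, losses = train_until_converged(make_toy_step(), max_iters=50, tol=1e-6)
capped_iters, capped_losses = train_until_converged(make_toy_step(), max_iters=3, tol=0.0)
```

The first call ends on the threshold test and the second on the iteration cap, which are the two exit conditions named in the text.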

According to the embodiment of the present disclosure, determining a boundary positive sample subset in a positive training sample data subset based on heterogeneous neighbor information of a sample includes: for any negative sample, adding a first specified number of positive samples adjacent to the negative sample into a boundary positive sample subset; and determining a boundary negative sample subset in the negative training sample data subset based on the heterogeneous neighbor information of the samples comprises: for any positive sample, a second specified number of negative samples that neighbor the positive sample are added to the boundary negative sample subset.
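A minimal sketch of this neighbor-based selection, under stated assumptions: samples are plain numeric vectors, adjacency is measured by Euclidean distance (the disclosure does not name a metric), and `select_boundary_subsets`, `k_pos`, and `k_neg` are hypothetical names for the selection routine and for the first and second specified numbers.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Sequence[float]


def nearest_indices(query: Point, candidates: List[Point], k: int) -> List[int]:
    """Indices of the k candidates closest to `query` (Euclidean distance, assumed)."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))
    return order[:k]


def select_boundary_subsets(
    positives: List[Point],
    negatives: List[Point],
    k_pos: int = 1,
    k_neg: int = 1,
) -> Tuple[List[int], List[int]]:
    """Return indices of the boundary positive and boundary negative samples."""
    boundary_pos: Set[int] = set()
    for neg in negatives:
        # For each negative sample, its k_pos nearest positives join the positive boundary set.
        boundary_pos.update(nearest_indices(neg, positives, k_pos))

    boundary_neg: Set[int] = set()
    for pos in positives:
        # For each positive sample, its k_neg nearest negatives join the negative boundary set.
        boundary_neg.update(nearest_indices(pos, negatives, k_neg))

    return sorted(boundary_pos), sorted(boundary_neg)


positives = [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
negatives = [(-1.0, 0.0), (-2.0, 0.0), (-6.0, 0.0)]

b_pos, b_neg = select_boundary_subsets(positives, negatives, k_pos=1, k_neg=1)
b_pos2, b_neg2 = select_boundary_subsets(positives, negatives, k_pos=2, k_neg=2)
```

In this toy data only the samples nearest the other class are selected, so points far from the class boundary, such as `(5.0, 0.0)`, stay out of the boundary subsets.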

According to an embodiment of the present disclosure, the method further includes: testing the trained risk prediction model with a test sample set to obtain the test accuracy of the risk prediction result.

According to an embodiment of the present disclosure, the risk characteristics of the data to be predicted are acquired by at least one of the following: performing one-hot encoding on categorical data in the data to be predicted to obtain category features; or computing data related to the business information and/or behavior information in the data to be predicted to obtain derived features.

According to an embodiment of the present disclosure, the objective function further includes an empirical loss constraint term and a regularization constraint term.

According to an embodiment of the present disclosure, the sample data of each class further includes non-boundary sample data, and a third distance between the boundary sample data and the class center in the same class is greater than a fourth distance between the non-boundary sample data and the class center.

One aspect of the present disclosure provides a risk prediction apparatus, including a data acquisition module, a risk characteristic acquisition module, and a risk characteristic processing module. The data acquisition module is used for acquiring data to be predicted. The risk characteristic acquisition module is used for acquiring risk characteristics of the data to be predicted. The risk characteristic processing module is used for processing the risk characteristics by using a trained risk prediction model to obtain a risk prediction result, wherein the risk prediction result comprises the category to which the data to be predicted belongs. The sample data of each category comprises boundary sample data, the objective function of the risk prediction model comprises a boundary sample discrimination constraint term, and the boundary sample discrimination constraint term makes a first distance, in the output space, between boundary sample data belonging to different categories greater than a second distance, in the output space, between boundary sample data belonging to the same category.

Another aspect of the present disclosure provides an electronic device comprising one or more processors and a storage device, wherein the storage device is configured to store executable instructions that, when executed by the processors, implement the risk prediction method as above.

Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing a risk prediction method as above when executed.

Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing a risk prediction method as above when executed.

According to the risk prediction method, apparatus, and electronic device, the means of the two classes of boundary sample data are computed from the boundary sample sets, so that the distribution information specific to the boundary samples is mined and learned to optimize the classification hyperplane. The boundary sample discrimination constraint term provided by the embodiments of the disclosure makes boundary sample data of the same class as close as possible in the output space and boundary sample data of different classes as far apart as possible in the output space. The classification hyperplane therefore passes through the middle region between the two classes of boundary sample data as far as possible, which improves the generalization performance of the risk prediction model.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an exemplary system architecture to which the risk prediction method, apparatus and electronic device may be applied, according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a risk prediction method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a logic diagram for a risk prediction method according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow chart of a method of training a risk prediction model according to an embodiment of the present disclosure;

FIG. 5 schematically shows a schematic diagram of training data according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a logic diagram for training a risk prediction model according to an embodiment of the present disclosure;

FIG. 7 schematically shows a schematic diagram of a subset of boundary sample data, according to an embodiment of the present disclosure;

FIG. 8 schematically illustrates a flow chart of a risk prediction method according to another embodiment of the present disclosure;

FIG. 9 schematically illustrates a block diagram of a risk prediction device according to an embodiment of the present disclosure; and

FIG. 10 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components. All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features.

With the development of the financial industry, corporate (legal-person) loan business accounts for a growing share of the business of financial institutions. Because the business scenario is complex, high-risk customers are difficult to identify in advance. If non-performing loans become more serious, the financial institution is adversely affected: its reputation suffers, its profit falls, and so on.

To predict risk better, multidimensional features can be extracted and risk prediction performed on that basis. Taking corporate loan risk prediction as an example, the related art still has shortcomings. Through extensive analysis and study, the applicant found that machine learning methods in the related art treat all training samples equally and do not distinguish the different importance of boundary samples and non-boundary samples for classification. In fact, samples with different spatial distributions matter differently to the classification hyperplane. If all training samples are treated equally, the model may fall short of the expected effect, because the distribution information specific to the boundary samples is not mined and learned to optimize the classification hyperplane.

With the development of big data technology, accumulating customer-related feature information has become easier. The features relevant to corporate loan risk prediction are of many kinds and can form different views of a sample; for example, features can be extracted from a basic information view, a business information view, a behavior information view, and so on. Multi-view learning techniques can use the different views of a sample for modeling and learning, and then predict unknown samples. Applying multi-view learning to corporate loan risk prediction is therefore a solution worth trying.

In the course of such attempts, the applicant found that when a multi-view model is trained with the model training methods of the related art, the learning processes of the individual views are relatively independent, and the sub-results of the views are integrated only in the final risk prediction model. If a view contains too little information to indicate the class of a sample, the presence of that view may degrade the classification performance of the final model.

The risk prediction method comprises a feature acquisition process and a prediction result output process. In the feature acquisition process, data to be predicted is first acquired, and then the risk characteristics of the data to be predicted are acquired. After the feature acquisition process is completed, the prediction result output process begins: the risk characteristics are processed by using a trained risk prediction model to obtain a risk prediction result, wherein the risk prediction result comprises the category to which the data to be predicted belongs. The sample data of each category comprises boundary sample data, the objective function of the risk prediction model comprises a boundary sample discrimination constraint term, and the boundary sample discrimination constraint term makes a first distance, in the output space, between boundary sample data belonging to different categories greater than a second distance, in the output space, between boundary sample data belonging to the same category.

The present disclosure provides a risk prediction model based on multi-view boundary discrimination constraints, such as a corporate loan risk prediction model. On the one hand, within each view, boundary sample sets are selected using the heterogeneous neighbor information of the samples, and the means of the two classes of boundary sample sets are computed. The boundary sample discrimination constraint term of the present disclosure makes same-class boundary samples as close as possible in the output space and different-class boundary samples as far apart as possible, so that the classification hyperplane passes through the middle region between the two classes of boundary samples as far as possible, which improves the generalization performance of the model. On the other hand, the consistency of boundary samples across the output spaces of the basic information view, the business information view, and the behavior information view is mined. The inter-view boundary sample discrimination constraint term provided by the embodiments makes the outputs of the means of same-class boundary samples as consistent as possible across views, so that the views optimize one another and the accuracy of the classification boundary improves.

The risk prediction method, apparatus, and electronic device provided by the embodiments of the present disclosure can be used for risk prediction in the field of artificial intelligence, and can also be used in fields other than artificial intelligence, such as the financial field.

FIG. 1 schematically illustrates an exemplary system architecture to which the risk prediction method, apparatus, and electronic device may be applied, according to an embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure. It does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.

As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and servers 105, 106, 107. The network 104 may include a plurality of gateways, routers, hubs, network cables, etc., to provide a medium for communication links between the terminal devices 101, 102, 103 and the servers 105, 106, 107. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with other terminal devices and the servers 105, 106, 107 via the network 104 to receive or send information, such as a risk prediction request, a model training request, a model maintenance request, a processing result, and the like. The terminal devices 101, 102, 103 may be installed with various communication client applications, such as risk prediction applications, software development applications, banking applications, government affairs applications, monitoring applications, web browser applications, search applications, office applications, instant messaging tools, mailbox clients, social platform software, and the like (for example only). For example, the user may use the terminal device 101 to view the risk prediction result and the related processing suggestions, automatic processing results, and the like fed back by the server side. As another example, a user may request the server to perform iterative model training.

The terminal devices 101, 102, 103 include, but are not limited to, smart phones, virtual reality devices, augmented reality devices, tablets, laptop computers, desktop computers, and the like.

The servers 105, 106, and 107 may receive and process requests, and may specifically be a storage server, a background management server, a server cluster, and the like. For example, the server 105 may store a risk prediction model, the server 106 may act as a model training server that optimizes model parameters, and the server 107 may store business data, training databases, and the like.

It should be noted that the risk prediction method provided by the embodiments of the present disclosure may generally be performed by a server. Accordingly, the risk prediction apparatus provided by the embodiments of the present disclosure may generally be disposed in a server. The risk prediction method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the servers 105, 106, 107.

It should be understood that the number of terminal devices, networks, and servers are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

FIG. 2 schematically shows a flow chart of a risk prediction method according to an embodiment of the present disclosure.

As shown in FIG. 2, the method includes operations S210 to S230.

In operation S210, data to be predicted is acquired. The data to be predicted may be various kinds of business data, such as data generated while a customer transacts business. Taking a financial institution as an example, the data to be predicted may be data generated when a user applies for a loan, applies for a credit card, applies for a credit limit, and so on. Specifically, the data to be predicted may include business information, behavior information, and the like, and may further include customer attribute information. The customer attribute information may include at least one of: name, organization name, address, annual income, and the like. In addition, the data to be predicted may also include statistical information, such as the user's historical spending amounts, credit information, consumption habits, and consumption preferences.

In operation S220, a risk characteristic of the data to be predicted is acquired.

In this embodiment, the risk characteristics may be used to characterize the magnitude of the potential risk of the predicted object. For example, for the same credit limit, a high-income population has a lower risk of default than a low-income population.

In certain embodiments, acquiring the risk characteristics of the data to be predicted comprises at least one of: performing one-hot encoding on categorical data in the data to be predicted to obtain category features; or computing data related to the business information and/or behavior information in the data to be predicted to obtain derived features.

Specifically, during risk feature extraction, category features such as a company's industry category and economic nature are one-hot encoded.

Derived indicators are also processed: derived features such as the mean, standard deviation, maximum, and minimum are built from the relevant raw features of the business information and behavior information.

For example, for a legal person, the features related to corporate loan risk prediction are first classified into three categories: basic information, business information, and behavior information. The data scope, and thus the data tables involved, can be determined by category.

Then, the features for the basic information, business information, and behavior information are extracted from the data tables.

Then, the category features are converted, the derived features are processed, and labels are constructed.
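The category-feature and derived-feature steps can be illustrated with a short sketch. The field names `industry` and `monthly_balance`, the `INDUSTRIES` vocabulary, and the sample values are invented for illustration, and the use of the population standard deviation is an assumption; the disclosure only names the statistics.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


def derived_features(series: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation, maximum, and minimum of a raw numeric feature."""
    return {
        "mean": mean(series),
        "std": pstdev(series),  # population standard deviation (assumed)
        "max": max(series),
        "min": min(series),
    }


# Hypothetical vocabulary and record, for illustration only.
INDUSTRIES = ["manufacturing", "retail", "services"]
record = {
    "industry": "retail",
    "monthly_balance": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
}

category_features = one_hot(record["industry"], INDUSTRIES)
stats = derived_features(record["monthly_balance"])
feature_vector = category_features + [stats["mean"], stats["std"], stats["max"], stats["min"]]
```

Concatenating the encoded category with the derived statistics yields one feature vector per sample, which is the shape a downstream classifier would typically consume.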

In operation S230, the trained risk prediction model is used to process the risk features to obtain a risk prediction result, where the risk prediction result includes a category to which the data to be predicted belongs.

In this embodiment, the sample data of each category includes boundary sample data, and the objective function of the risk prediction model includes a boundary sample discrimination constraint term. This term makes a first distance, in the output space, between boundary sample data belonging to different categories greater than a second distance, in the output space, between boundary sample data belonging to the same category.

The risk prediction model may be a variety of machine learning models, such as a neural network. The risk prediction model can be trained through a back propagation algorithm and the like, so that the prediction accuracy is improved.

The risk prediction model is exemplified below.

In some embodiments, the categories include a positive sample category and a negative sample category.

The boundary sample discrimination constraint term comprises a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term. The output of the positive sample sub-constraint term is related to the difference between the sub-result obtained when the model processes positive sample data and the sub-result obtained when the model processes the mean of the boundary positive sample subset. The output of the negative sample sub-constraint term is related to the difference between the sub-result obtained when the model processes negative sample data and the sub-result obtained when the model processes the mean of the boundary negative sample subset. The output of the cross term is related to the product of the sub-result for positive sample data and the sub-result for negative sample data.

For example, the boundary sample discrimination constraint term may be as shown in equation (1).

where $B_{class-}$ is the negative-class boundary sample set, $B_{class+}$ is the positive-class boundary sample set, $\mu_{class+}$ and $\mu_{class-}$ are the means of the two boundary sample sets, and $f(\cdot)$ is the classifier. It should be noted that the above formula is shown only by way of example; the squaring in the formula may be replaced by a first power, a third power, etc. Coefficients, constant offsets, and the like may also be added, which is not limited herein.
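Because the verbal description leaves the exact form open, the following is only one plausible reading of such a constraint term, not a reproduction of equation (1). It assumes a scalar-valued classifier, squared differences, and a cross term formed by summing the products of positive and negative outputs; `boundary_discrimination_term`, `mean_vector`, and the two toy classifiers are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def mean_vector(samples: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def boundary_discrimination_term(
    f: Callable[[Vector], float],
    boundary_pos: List[Vector],
    boundary_neg: List[Vector],
) -> float:
    """Assumed form: within-class scatter around f(mean) plus a cross-product term."""
    f_mu_pos = f(mean_vector(boundary_pos))
    f_mu_neg = f(mean_vector(boundary_neg))
    out_pos = [f(x) for x in boundary_pos]
    out_neg = [f(x) for x in boundary_neg]

    pos_term = sum((o - f_mu_pos) ** 2 for o in out_pos)   # pulls positives together
    neg_term = sum((o - f_mu_neg) ** 2 for o in out_neg)   # pulls negatives together
    cross_term = sum(p * q for p in out_pos for q in out_neg)  # negative when signs differ
    return pos_term + neg_term + cross_term


def separating_classifier(x: Vector) -> float:
    """Toy linear classifier that reads the first coordinate."""
    return x[0]


def uninformative_classifier(x: Vector) -> float:
    """Toy classifier that reads the second coordinate, which is always 0 here."""
    return x[1]


boundary_pos = [(1.0, 0.0), (2.0, 0.0)]
boundary_neg = [(-1.0, 0.0), (-3.0, 0.0)]

good_value = boundary_discrimination_term(separating_classifier, boundary_pos, boundary_neg)
bad_value = boundary_discrimination_term(uninformative_classifier, boundary_pos, boundary_neg)
```

Under these assumptions a classifier that separates the two boundary sets gets a lower value than one that maps every sample to the same output, so minimizing the term rewards separation of the boundary samples.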

It should be noted that, the above only shows the processing procedure of the boundary sample set, and the non-boundary sample set may also be processed. Specifically, the sample data of each class further includes non-boundary sample data, and a third distance between the boundary sample data and the class center in the same class is greater than a fourth distance between the non-boundary sample data and the class center.

The boundary sample discrimination constraint term provided by the embodiments of the disclosure makes same-class boundary samples as close as possible in the output space and different-class boundary samples as far apart as possible in the output space, so that the classification hyperplane passes through the middle region between the two classes of boundary samples as far as possible, which improves the generalization performance of the model.

In some embodiments, in order to further improve the model prediction accuracy, risk prediction may be performed from multiple views, and then a final risk prediction result is determined based on the prediction sub-results of the views.

For example, acquiring the risk characteristics of the data to be predicted comprises at least one of: acquiring sub-features of the data to be predicted for the basic information view; acquiring sub-features of the data to be predicted for the business information view; or acquiring sub-features of the data to be predicted for the behavior information view. The sub-features for the different views can be extracted as described above, which is not repeated here.

In some embodiments, the boundary sample discrimination constraint term may include at least one of: a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the basic information view; a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the business information view; or a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the behavior information view.

For example, the boundary sample discrimination constraint term may be as shown in equation (2).

where $v$ denotes the view index, $B_{class-}$ is the negative-class boundary sample set, $B_{class+}$ is the positive-class boundary sample set, $\mu_{class+}$ and $\mu_{class-}$ are the means of the two classes of boundary sample sets, and $f_v(\cdot)$ is the sub-classifier for the $v$-th view. It should be noted that the above formula is shown only by way of example; the square in the formula may be replaced by a first power, a third power, or the like. In addition, coefficients, constant offsets, and the like may be set, which is not limited herein.
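As a rough illustration of the ingredients named above, the following Python sketch computes, for a single view, a positive sample sub-constraint, a negative sample sub-constraint, and a cross term. The linear form of the sub-classifier, the helper names, and the toy data are assumptions made for the example; how the three quantities are signed and weighted in the actual formula is not specified by this sketch.

```python
from statistics import fmean


def linear_sub_classifier(weights):
    """Return f_v(x) = w^T x for one view (a linear form is assumed here)."""
    def f(x):
        return sum(w * xi for w, xi in zip(weights, x))
    return f


def mean_vector(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [fmean(column) for column in zip(*samples)]


def boundary_terms(f, b_pos, b_neg):
    """Compute three illustrative quantities for one view.

    pos:   squared gap between f(x) and f(mu_class+), summed over B_class+
    neg:   squared gap between f(x) and f(mu_class-), summed over B_class-
    cross: product f(x+) * f(x-), summed over all positive/negative pairs
    """
    mu_pos, mu_neg = mean_vector(b_pos), mean_vector(b_neg)
    pos = sum((f(x) - f(mu_pos)) ** 2 for x in b_pos)
    neg = sum((f(x) - f(mu_neg)) ** 2 for x in b_neg)
    cross = sum(f(xp) * f(xn) for xp in b_pos for xn in b_neg)
    return pos, neg, cross


# Toy boundary sets (hypothetical data).
f = linear_sub_classifier([1.0, 0.0])
b_pos = [[1.0, 0.0], [3.0, 0.0]]
b_neg = [[-1.0, 0.0], [-3.0, 0.0]]
pos_term, neg_term, cross_term = boundary_terms(f, b_pos, b_neg)
```

Small positive and negative terms indicate that same-class boundary samples sit close to their class mean in the output space, while a strongly negative cross term indicates that the two classes are mapped to opposite sides.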

In some embodiments, the objective function of the risk prediction model may further include an inter-view boundary sample discrimination constraint term, which makes the outputs of the mean of same-category boundary samples consistent across the different views.

The inter-view boundary sample discrimination constraint term provided in this embodiment makes the outputs of the mean of same-class boundary samples as consistent as possible across views, so that the views optimize one another and the accuracy of the classification boundary improves.

Specifically, the inter-view boundary sample discrimination constraint term may include at least one of: positive sample sub-constraint terms for different views; or negative sample sub-constraint terms for different views.

For example, the inter-view boundary sample discrimination constraint term may be as shown in equation (3).

where $v$ and $\omega$ each denote a view index.
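One plausible way to express such a consistency requirement is a pairwise penalty over views, sketched below in Python. The squared pairwise gap and the function name `inter_view_penalty` are assumptions for illustration; the exact form used in the formula may differ.

```python
from itertools import combinations


def inter_view_penalty(mean_outputs):
    """Sum of squared gaps between per-view outputs for one class.

    mean_outputs[v] is the output of view v's sub-classifier on that view's
    boundary-set mean. The squared pairwise gap is an assumed form.
    """
    return sum((a - b) ** 2 for a, b in combinations(mean_outputs, 2))


# Hypothetical outputs of three sub-classifiers on the class means.
pos_outputs = [0.9, 1.1, 1.0]      # positive-class mean, views 1..3
neg_outputs = [-1.0, -1.0, -1.0]   # negative-class mean, views 1..3
r_vbd = inter_view_penalty(pos_outputs) + inter_view_penalty(neg_outputs)
```

The penalty is zero only when every view produces the same output for the class mean, which is the behavior the constraint term is meant to encourage.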

In some embodiments, the objective function may further include an empirical loss constraint term and a regularization constraint term.

For example, the expression of the objective function may be as shown in equation (4).

$L = R_{emp} + \alpha \cdot R_{bd} + \beta \cdot R_{vbd} + \gamma \cdot R_{reg}$    Formula (4)

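where $R_{emp}$ is the empirical loss, $R_{reg}$ is the regularization term, $R_{bd}$ and $R_{vbd}$ are the boundary sample discrimination constraint term and the inter-view boundary sample discrimination constraint term, and $\alpha$, $\beta$, $\gamma$ are hyperparameters that adjust the weights of the respective terms. For example, $R_{emp}$ and $R_{reg}$ may be as shown in formula (5) and formula (6), respectively.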

where $Y$ is the label of the sample, $w_v$ is the parameter of the sub-model $f_v$, and $T$ denotes transposition.
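The weighted combination in formula (4) can be sketched in a few lines of Python. The default hyperparameter values and the L2 form of the regularizer (a sum of $w_v^T w_v$ over views) are assumptions for illustration only.

```python
def l2_regularizer(view_weights):
    """Assumed regularizer: sum over views of w_v^T w_v."""
    return sum(w * w for weights in view_weights for w in weights)


def objective(r_emp, r_bd, r_vbd, r_reg, alpha=1.0, beta=1.0, gamma=0.01):
    """L = R_emp + alpha * R_bd + beta * R_vbd + gamma * R_reg (formula (4))."""
    return r_emp + alpha * r_bd + beta * r_vbd + gamma * r_reg


# Hypothetical term values for one training step.
r_reg = l2_regularizer([[1.0, 2.0], [3.0]])
total = objective(1.0, 2.0, 3.0, 4.0, alpha=0.5, beta=0.5, gamma=0.25)
```

Raising `alpha` or `beta` gives the boundary terms more influence relative to the empirical loss, while `gamma` controls how strongly large weights are penalized.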

Fig. 3 schematically illustrates a logic diagram of a risk prediction method according to an embodiment of the present disclosure.

As shown in fig. 3, feature information related to corporate loan risk prediction, such as basic information, business information, and behavior information, is first obtained from a data warehouse. The basic information includes the industry to which the enterprise belongs, its economic nature, credit rating, enterprise popularity, and the like; the business information includes the inflow and outflow amounts and the number of transactions of the enterprise's business accounts in the past year, and the like; and the behavior information includes transaction amounts, transaction currencies, fund flow directions, and the like. Data preprocessing and feature engineering are performed on the samples to construct the basic information view, the business information view, and the behavior information view of each sample. A test sample is constructed from the features of the data to be predicted. The test sample is then input into the corporate loan risk prediction model based on the multi-view boundary discrimination constraint to obtain a prediction result.

The preprocessing may proceed as follows.

First, data selection may be performed. For example, a positive sample may be a high-quality customer, where the selection criterion for a high-quality customer may be a corporate customer with no problems in its past payment records and with stable operations. The features related to corporate loan risk prediction are divided into three categories: basic information, business information, and behavior information. The data scope, and thus the data tables involved, can be determined by category.

Then, data preprocessing is performed. For example, the data columns relating to basic information, business information, and behavior information are taken from the data tables, and related data columns in different tables are joined by client identifier (id) to form the original features. For columns with missing values, the missing values are filled in a specified manner; for example, missing values of numerical features are filled with the column mean, and missing values of non-numerical features are filled with 'unknown'.
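The join-and-fill step can be sketched as follows. This is a minimal illustration using only the Python standard library; the table layout (dictionaries keyed by client id), the column names, and the fallback of 0.0 for a column with no observed values are assumptions, not part of the described method.

```python
from statistics import fmean


def join_by_client_id(*tables):
    """Merge several {client_id: {column: value}} tables into one row per client."""
    merged = {}
    for table in tables:
        for client_id, columns in table.items():
            merged.setdefault(client_id, {}).update(columns)
    return merged


def fill_missing(rows, numeric_columns, categorical_columns):
    """Fill missing numeric values with the column mean, others with 'unknown'."""
    for col in numeric_columns:
        present = [r[col] for r in rows.values() if r.get(col) is not None]
        mean = fmean(present) if present else 0.0  # assumed fallback
        for r in rows.values():
            if r.get(col) is None:
                r[col] = mean
    for col in categorical_columns:
        for r in rows.values():
            if r.get(col) is None:
                r[col] = "unknown"
    return rows


# Hypothetical tables keyed by client id.
basic = {"c1": {"industry": "retail"}, "c2": {"industry": None}}
business = {"c1": {"inflow": 100.0}, "c2": {"inflow": None}, "c3": {"inflow": 300.0}}
features = fill_missing(join_by_client_id(basic, business), ["inflow"], ["industry"])
```

Client `c3` appears only in the business table, so its `industry` value is treated as missing and filled in the same way as an explicit gap.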

The training process of the risk prediction model is exemplified below.

FIG. 4 schematically shows a flow chart of a method of training a risk prediction model according to an embodiment of the present disclosure.

As shown in fig. 4, training the risk prediction model may include operations S410 to S430.

In operation S410, a training sample data set is obtained, where the training sample data set includes a positive training sample data subset and a negative training sample data subset.

First, a training sample needs to be constructed.

Fig. 5 schematically shows a schematic diagram of training data according to an embodiment of the present disclosure.

As shown in fig. 5, the training data may be taken from a sample set. The sample set may include a negative sample subset and a positive sample subset. The negative sample subset may include a boundary negative sample subset and a non-boundary negative sample subset. The positive sample subset may include a boundary positive sample subset and a non-boundary positive sample subset. Features may be extracted for each sample from the three views, respectively.

The labels of the labeled samples are 1 ($\omega_1$) and -1 ($\omega_2$), which may represent a corporate loan risk customer and a non-risk customer, respectively. Each sample consists of three views, namely a basic information view, a business information view, and a behavior information view.

In operation S420, a boundary positive sample subset in the positive training sample data subset is determined based on the heterogeneous neighbor information of the samples, and/or a boundary negative sample subset in the negative training sample data subset is determined based on the heterogeneous neighbor information of the samples. In order to determine the boundary samples, one or more samples belonging to other classes that are closest to any one sample of the current class may be used as the boundary samples.

In operation S430, the positive training sample data subset and/or the negative training sample data subset are input into the risk prediction model, and the parameters of the risk prediction model are adjusted until a preset number of iterations is reached or the difference between the values of the objective function in two successive iterations is smaller than a preset threshold.

Specifically, the optimization problem is solved by gradient descent until a preset number of iterations is reached or the difference between the loss values of two successive iterations is smaller than a preset threshold. For example, gradient descent is used to minimize the objective function to obtain the sub-classification model for each view and thus the final classification model $f^*$, as shown in formula (7).

$f^* = \arg\min_f L(f, X, Y)$    Formula (7)

The objective function of the risk prediction model may be as shown above, and is not limited herein. The risk prediction model is trained by a gradient descent method.
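The two stopping conditions described for training can be sketched with a generic gradient descent loop. The quadratic toy loss, the learning rate, and the function names below are assumptions for illustration; the actual objective is the one discussed above.

```python
def gradient_descent(loss, grad, w0, lr=0.1, max_iters=1000, tol=1e-8):
    """Minimize loss(w) starting from w0.

    Stops when max_iters is reached or when the loss changes by less than
    tol between two successive iterations.
    """
    w = list(w0)
    prev = loss(w)
    iteration = 0
    for iteration in range(1, max_iters + 1):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
        cur = loss(w)
        if abs(prev - cur) < tol:
            return w, cur, iteration
        prev = cur
    return w, prev, iteration


# Toy objective with its minimum at w = (3, -1) (hypothetical stand-in for L).
def toy_loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w, final_loss, iters = gradient_descent(toy_loss, toy_grad, [0.0, 0.0])
```

On this toy problem the loop exits through the loss-difference test long before the iteration cap, which is the typical case when the learning rate is well chosen.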

FIG. 6 schematically shows a logic diagram for training a risk prediction model according to an embodiment of the present disclosure.

As shown in fig. 6, each training sample obtained through data preprocessing consists of three views, namely a basic information view, a business information view, and a behavior information view. First, within each view, a boundary sample set is selected by means of the heterogeneous neighbor information of the samples, and the means of the two classes of boundary sample sets are calculated from the boundary sample subsets. The boundary sample discrimination constraint term of the present disclosure draws same-class boundary samples as close together as possible in the output space and pushes different-class boundary samples as far apart as possible, so that the classification hyperplane passes through the middle region between the two classes of boundary samples as far as possible, which improves the generalization performance of the model. Across views, the consistency of boundary samples in the different output spaces is exploited: the inter-view boundary sample discrimination constraint term makes the outputs of the mean of same-class boundary samples as consistent as possible across views, so that the views optimize one another and the accuracy of the classification boundary improves. Three sub-classifiers are obtained by minimizing the empirical loss of the model, the boundary sample discrimination constraint term, and the inter-view boundary sample discrimination constraint term. Finally, the results of the three sub-classifiers are integrated to classify the test sample.
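The final integration step could take several forms. The sketch below assumes the simplest one, averaging the scores of the three sub-classifiers and mapping the sign of the average to a label in {1, -1}; the averaging rule, the linear sub-classifiers, and the tie-breaking choice are all assumptions for illustration.

```python
def make_linear(weights):
    """Hypothetical linear sub-classifier f_v(x) = w^T x for one view."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def ensemble_predict(sub_classifiers, views):
    """Average the per-view scores and map the sign to a label in {1, -1}.

    A score of exactly zero is assigned to class 1 (an arbitrary choice).
    """
    scores = [f(x) for f, x in zip(sub_classifiers, views)]
    average = sum(scores) / len(scores)
    return 1 if average >= 0 else -1


# One sub-classifier per view: basic, business, behavior (toy weights).
sub_classifiers = [make_linear([1.0]), make_linear([0.5, 0.5]), make_linear([-1.0])]
label_a = ensemble_predict(sub_classifiers, [[2.0], [1.0, 1.0], [0.5]])
label_b = ensemble_predict(sub_classifiers, [[-2.0], [-1.0, -1.0], [0.5]])
```

Each view is passed only to its own sub-classifier, so the views may have different feature dimensions.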

In some embodiments, determining a boundary positive sample subset of the subset of positive training sample data based on the heterogeneous neighbor information of the samples comprises: for any negative sample, a first specified number of positive samples that neighbor the negative sample are added to the boundary positive sample subset. Wherein the first specified number includes but is not limited to: 1, 2, 3, 4, 5, 7, 8, 10, 11 or more.

Determining a boundary negative sample subset in the negative training sample data subset based on the heterogeneous neighbor information of the samples comprises: for any positive sample, a second specified number of negative samples that neighbor the positive sample are added to the boundary negative sample subset. Wherein the second specified number includes but is not limited to: 1, 2, 3, 4, 5, 7, 8, 10, 11 or more. The first specified number and the second specified number may be the same or different.

Fig. 7 schematically shows a schematic diagram of a subset of boundary sample data according to an embodiment of the present disclosure.

As shown in fig. 7, the boundary sample sets are screened, that is, selected by means of the heterogeneous neighbor information of the samples. For any positive sample $x_+$, its 5 nearest-neighbor negative samples are added to the negative-class boundary sample set $B_{class-}$; for any negative sample $x_-$, its 5 nearest-neighbor positive samples are added to the positive-class boundary sample set $B_{class+}$. The means $\mu_{class+}$ and $\mu_{class-}$ of the two classes of boundary sample sets are then calculated from the boundary sample subsets. The calculation formulas are shown in formulas (8) to (10).

where the two neighbor-set symbols in formula (8) represent, respectively, the 5 positive samples adjacent to the negative sample x and the 5 negative samples adjacent to the positive sample x.

$\mu_{class+} = \mathrm{Mean}(B_{class+})$    Formula (9)

$\mu_{class-} = \mathrm{Mean}(B_{class-})$    Formula (10)

where $\mathrm{Mean}(\cdot)$ denotes the average.
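The selection of boundary sets from heterogeneous neighbors and the computation of their means can be sketched as follows. Euclidean distance, the helper names, and the toy data are assumptions for illustration; the neighbor count `k` corresponds to the specified number (5 in the example above, 2 in the toy data).

```python
import math
from statistics import fmean


def k_nearest(x, candidates, k):
    """Indices of the k candidates closest to x (Euclidean distance assumed)."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(x, candidates[i]))
    return order[:k]


def boundary_sets(pos, neg, k=5):
    """B_class+: positives near any negative; B_class-: negatives near any positive."""
    pos_idx = sorted({i for x in neg for i in k_nearest(x, pos, k)})
    neg_idx = sorted({i for x in pos for i in k_nearest(x, neg, k)})
    return [pos[i] for i in pos_idx], [neg[i] for i in neg_idx]


def mean_vector(samples):
    """Component-wise mean, as in formulas (9) and (10)."""
    return [fmean(column) for column in zip(*samples)]


# Toy data: the samples at +9 and -9 lie far from the other class.
pos = [[1.0, 0.0], [2.0, 0.0], [9.0, 0.0]]
neg = [[-1.0, 0.0], [-2.0, 0.0], [-9.0, 0.0]]
b_pos, b_neg = boundary_sets(pos, neg, k=2)
mu_pos, mu_neg = mean_vector(b_pos), mean_vector(b_neg)
```

Using sets of indices ensures that a sample chosen as a neighbor of several opposite-class samples enters the boundary set only once.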

In some embodiments, after the model training is completed, the method may further include a verification operation.

Fig. 8 schematically shows a flow chart of a risk prediction method according to another embodiment of the present disclosure.

As shown in fig. 8, the method may further include operation S810 after the model training in operation S430 is performed.

In operation S810, the trained risk prediction model is tested using the test sample set, so as to obtain a test accuracy of the risk prediction result.

The test sample x is input into the discriminant function of the classifier to obtain the discrimination result of the model.
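The test accuracy can be computed by comparing the discrimination results with the test labels, as in the sketch below. Recall is included as a companion metric; the function names and the toy labels are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


# Hypothetical test labels and model outputs in {1, -1}.
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
test_accuracy = accuracy(y_true, y_pred)
test_recall = recall(y_true, y_pred)
```

When risk customers are rare, recall on the risk class is often more informative than overall accuracy, since a model that labels every customer as non-risk can still score a high accuracy.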

The risk prediction method provided by the embodiments of the present disclosure is described taking corporate loan risk prediction based on the multi-view boundary discrimination constraint as an example. Each sample consists of three views, namely a basic information view, a business information view, and a behavior information view, and the sample labels 1 and -1 represent a risk user and a general user, respectively. For a training sample, the features are first divided into three subsets by feature category, namely basic information, business information, and behavior information, which correspond to the three views of the sample. During model training, within each view, a boundary sample set is selected by means of the heterogeneous neighbor information of the samples, and the means of the two classes of boundary sample sets are calculated from the boundary sample sets. The boundary sample discrimination constraint term of the present disclosure draws same-class boundary samples as close together as possible in the output space and pushes different-class boundary samples as far apart as possible, so that the classification hyperplane passes through the middle region between the two classes of boundary samples as far as possible, which improves the generalization performance of the model. Across views, the consistency of boundary samples in the different output spaces is exploited: the inter-view boundary sample discrimination constraint term makes the outputs of the mean of same-class boundary samples as consistent as possible across views, so that the views optimize one another and the accuracy of the classification boundary improves. Three sub-classifiers are obtained by minimizing the empirical loss of the model, the boundary sample discrimination constraint term, and the inter-view boundary sample discrimination constraint term. Finally, the results of the three sub-classifiers are integrated to classify the test sample. By optimizing the learning of the model both within each view and across views in this way, and by integrating the three sub-classifiers in the final model, the method can improve the generalization capability of the model.

The embodiments of the present disclosure outperform traditional machine learning algorithms in the accuracy, recall, and comprehensive evaluation value of risk prediction classification (such as corporate loan risk prediction), and can predict business risk more accurately. For example, when the risk prediction model is applied in financial institutions such as banks, accurate prediction is made before a loan is granted to a user, and account managers take corresponding action according to the prediction result of the model, which reduces the issuance of bad loans, reduces losses, and improves the competitiveness of the institution within its industry.

The embodiment of the disclosure also provides a risk prediction device.

Fig. 9 schematically shows a block diagram of a risk prediction device according to an embodiment of the present disclosure.

As shown in fig. 9, the risk prediction apparatus 900 may include: a data acquisition module 910, a risk feature acquisition module 920 and a risk feature processing module 930.

The data obtaining module 910 is configured to obtain data to be predicted.

The risk characteristic obtaining module 920 is configured to obtain a risk characteristic of the data to be predicted.

The risk feature processing module 930 is configured to process the risk features by using the trained risk prediction model to obtain a risk prediction result, where the risk prediction result includes a category to which the data to be predicted belongs.

The sample data of each category includes boundary sample data, the objective function of the risk prediction model includes a boundary sample discrimination constraint term, and the boundary sample discrimination constraint term causes a first distance in the output space between boundary sample data belonging to different categories to be greater than a second distance in the output space between boundary sample data belonging to the same category.

It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described in detail herein.

Any of the modules, units, or at least part of the functionality of any of them according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, units according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by any other reasonable means of hardware or firmware by integrating or packaging the circuits, or in any one of three implementations of software, hardware and firmware, or in any suitable combination of any of them. Alternatively, one or more of the modules, units according to embodiments of the present disclosure may be implemented at least partly as computer program modules, which, when executed, may perform the respective functions.

For example, any of the data obtaining module 910, the risk feature obtaining module 920 and the risk feature processing module 930 may be combined and implemented in one module, or any one of the modules may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the data obtaining module 910, the risk feature obtaining module 920, and the risk feature processing module 930 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented by any one of three implementations of software, hardware, and firmware, or any suitable combination of any of the three. Alternatively, at least one of the data acquisition module 910, the risk feature acquisition module 920 and the risk feature processing module 930 may be at least partially implemented as a computer program module, which when executed may perform the respective functions.

FIG. 10 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1001 may also include onboard memory for caching purposes. The processor 1001 may include a single processing unit or a plurality of processing units for executing different actions of the method flow according to the embodiments of the present disclosure, and the plurality of processing units may be integrated into one processor or distributed across a plurality of processors, which is not limited herein.

In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are communicatively connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may also be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in one or more memories.

According to an embodiment of the present disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, which is also connected to the bus 1004. The electronic device 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read therefrom is installed into the storage section 1008 as necessary.

According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. The computer program performs the above-described functions defined in the system of the embodiment of the present disclosure when executed by the processor 1001. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

The present disclosure also provides a computer-readable storage medium.

Referring to fig. 10, the computer-readable storage medium may be included in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 1002 and/or the RAM 1003 described above and/or one or more memories other than the ROM 1002 and the RAM 1003.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure. When the computer program product is run on an electronic device, the program code causes the electronic device to implement the risk prediction model training method or the risk prediction method provided by the embodiments of the present disclosure.

The computer program, when executed by the processor 1001, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via the communication part 1009, and/or installed from the removable medium 1011. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In accordance with embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or integrated in various ways, even if such combinations are not expressly recited in the present disclosure. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A method of risk prediction, comprising:

acquiring data to be predicted;

acquiring risk characteristics of the data to be predicted; and

processing the risk characteristics by using a trained risk prediction model to obtain a risk prediction result, wherein the risk prediction result comprises the category of the data to be predicted;

wherein sample data of each category comprises boundary sample data, an objective function of the risk prediction model comprises a boundary sample discrimination constraint term, and the boundary sample discrimination constraint term causes a first distance in an output space between boundary sample data belonging to different categories to be greater than a second distance in the output space between boundary sample data belonging to the same category.

2. The method of claim 1, wherein the categories include a positive sample category and a negative sample category; and

the boundary sample discrimination constraint term comprises a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term, wherein the output of the positive sample sub-constraint term is related to the difference between the sub-result of the model processing positive sample data and the sub-result of the model processing the mean of the boundary positive sample subset, the output of the negative sample sub-constraint term is related to the difference between the sub-result of the model processing negative sample data and the sub-result of the model processing the mean of the boundary negative sample subset, and the output of the cross term is related to the product of the sub-result of the model processing positive sample data and the sub-result of the model processing negative sample data.

3. The method of claim 2, wherein the acquiring risk characteristics of the data to be predicted comprises: acquiring at least one of sub-features of the data to be predicted for the basic information view, sub-features of the data to be predicted for the business information view, and sub-features of the data to be predicted for the behavior information view.

4. The method of claim 3, wherein the boundary sample discrimination constraint term comprises at least one of:

a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the basic information view;

a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the business information view; or

a positive sample sub-constraint term, a negative sample sub-constraint term, and a cross term for the behavior information view.

5. The method of claim 3, wherein the objective function of the risk prediction model further comprises an inter-view boundary sample discrimination constraint term that makes the outputs of the mean of same-category boundary samples consistent across the different views.

6. The method of claim 5, wherein the inter-view boundary sample discrimination constraint term comprises at least one of:

positive sample sub-constraint terms for different views; or

negative sample sub-constraint terms for different views.

7. The method of claim 1, wherein training the risk prediction model comprises:

acquiring a training sample data set, wherein the training sample data set comprises a positive training sample data subset and/or a negative training sample data subset;

determining a boundary positive sample subset in the positive training sample data subset based on heterogeneous neighbor information of the samples, and/or determining a boundary negative sample subset in the negative training sample data subset based on heterogeneous neighbor information of the samples; and

inputting the positive training sample data subset and/or the negative training sample data subset into the risk prediction model, and adjusting parameters of the risk prediction model until a preset number of iterations is reached or the difference in the loss value of the objective function between two iterations is less than a preset threshold.
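By way of a non-limiting illustration, the two stopping conditions recited above can be sketched as a plain gradient-descent loop. The optimizer, the learning rate, the comparison of consecutive iterations, and the `loss_and_grad` callback are assumptions made for this sketch; the claim does not name an optimization method.

```python
def train(loss_and_grad, w, lr=0.1, max_iters=1000, tol=1e-8):
    """Adjust parameters until one of the two stopping conditions holds.

    loss_and_grad: hypothetical callback returning (loss, gradient) for w
    Stops when the loss changes by less than `tol` between two consecutive
    iterations, or after `max_iters` iterations, whichever comes first.
    Returns (parameters, iterations used, reason for stopping).
    """
    prev_loss = None
    for it in range(1, max_iters + 1):
        loss, grad = loss_and_grad(w)
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            return w, it, "loss_change_below_threshold"
        # Plain gradient-descent update (an assumed choice of optimizer).
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        prev_loss = loss
    return w, max_iters, "max_iterations_reached"
```

In practice `loss_and_grad` would evaluate the full objective function of the risk prediction model on the training subsets.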

8. The method of claim 7, wherein:

determining the boundary positive sample subset in the positive training sample data subset based on heterogeneous neighbor information of the samples comprises: for any negative sample, adding a first specified number of positive samples adjacent to the negative sample to the boundary positive sample subset; and

determining the boundary negative sample subset in the negative training sample data subset based on heterogeneous neighbor information of the samples comprises: for any positive sample, adding a second specified number of negative samples adjacent to the positive sample to the boundary negative sample subset.
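By way of a non-limiting illustration, the selection of boundary samples from heterogeneous neighbors can be sketched as a nearest-neighbor search across categories. Interpreting "adjacent" as nearest under Euclidean distance, and returning sample indices rather than the samples themselves, are assumptions made for this sketch.

```python
import math


def nearest_indices(query, candidates, k):
    """Indices of the k candidates closest to `query` (Euclidean distance)."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: math.dist(query, candidates[i]),
    )
    return order[:k]


def boundary_subsets(pos, neg, k_pos, k_neg):
    """Return (boundary positive indices, boundary negative indices).

    k_pos: first specified number, positives taken per negative sample
    k_neg: second specified number, negatives taken per positive sample
    """
    boundary_pos = set()
    for x in neg:  # each negative sample nominates its nearest positives
        boundary_pos.update(nearest_indices(x, pos, k_pos))
    boundary_neg = set()
    for x in pos:  # each positive sample nominates its nearest negatives
        boundary_neg.update(nearest_indices(x, neg, k_neg))
    return sorted(boundary_pos), sorted(boundary_neg)
```

Samples never nominated by any sample of the other category remain outside both boundary subsets.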

9. The method of claim 7, further comprising:

testing the trained risk prediction model with a test sample set to obtain a test accuracy of the risk prediction result.

10. The method according to any one of claims 1 to 9, wherein acquiring the risk characteristics of the data to be predicted comprises at least one of:

performing one-hot encoding on category data in the data to be predicted to obtain category features; or

performing calculation on data related to the marketing information and/or the behavior information in the data to be predicted to obtain derived features.
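By way of a non-limiting illustration, the two kinds of features can be sketched as follows. The field names `industry`, `overdue_count` and `loan_count`, and the choice of a ratio as the derived feature, are hypothetical and serve only to make the sketch concrete.

```python
def one_hot(value, categories):
    """One-hot encode `value` against a fixed, ordered list of categories."""
    return [1 if value == c else 0 for c in categories]


def derived_ratio(numerator, denominator):
    """Example derived feature: a ratio, defined as 0.0 for a zero denominator."""
    return numerator / denominator if denominator else 0.0


def risk_features(record, industries):
    """Build a feature vector: category features followed by a derived feature.

    `record` is a hypothetical dict with the keys used below.
    """
    features = one_hot(record["industry"], industries)
    features.append(derived_ratio(record["overdue_count"], record["loan_count"]))
    return features
```

A category value absent from `industries` encodes to an all-zero vector in this sketch.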

11. The method of any of claims 1-9, wherein the objective function further comprises an empirical loss constraint term and a regularization constraint term.
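By way of a non-limiting illustration, an objective function containing the three kinds of terms could take the additive form below. The symbols $\ell$, $\Omega$, $R_{\text{boundary}}$ and the weights $\lambda$, $\gamma$ are assumptions introduced for this sketch; the claim does not specify how the terms are weighted or combined.

```latex
J(\theta) =
  \underbrace{\sum_{i=1}^{n} \ell\bigl(y_i, f_\theta(x_i)\bigr)}_{\text{empirical loss}}
  + \lambda \underbrace{\Omega(\theta)}_{\text{regularization}}
  + \gamma \underbrace{R_{\text{boundary}}(\theta)}_{\text{boundary sample discrimination}}
```

Here $f_\theta$ denotes the risk prediction model with parameters $\theta$, and $(x_i, y_i)$ are the training samples and their category labels.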

12. The method according to any one of claims 1 to 9, wherein the sample data of each category further comprises non-boundary sample data, and, within the same category, a third distance of the boundary sample data relative to the category center is greater than a fourth distance of the non-boundary sample data relative to the category center.
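By way of a non-limiting illustration, the relation between the third and fourth distances can be checked as follows. Taking the category center as the component-wise mean of all samples of the category, and comparing mean Euclidean distances, are assumptions made for this sketch.

```python
import math
from statistics import fmean


def category_center(rows):
    """Assumed category center: component-wise mean of the category's samples."""
    return [fmean(col) for col in zip(*rows)]


def mean_distance_to_center(rows, center):
    """Mean Euclidean distance from each sample in `rows` to `center`."""
    return fmean(math.dist(r, center) for r in rows)


def boundary_is_farther(boundary, non_boundary):
    """True if boundary samples lie farther from the center on average."""
    center = category_center(boundary + non_boundary)
    third = mean_distance_to_center(boundary, center)       # boundary samples
    fourth = mean_distance_to_center(non_boundary, center)  # non-boundary samples
    return third > fourth
```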

13. A risk prediction device comprising:

the data acquisition module is used for acquiring data to be predicted;

a risk characteristic obtaining module, configured to obtain a risk characteristic of the data to be predicted; and

the risk characteristic processing module is used for processing the risk characteristics by utilizing the trained risk prediction model to obtain a risk prediction result, and the risk prediction result comprises the category to which the data to be predicted belongs;

14. An electronic device, comprising:

one or more processors;

a storage device for storing executable instructions which, when executed by the one or more processors, implement the risk prediction method according to any one of claims 1 to 12.

15. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, implement the risk prediction method according to any one of claims 1 to 12.