CN111667356A - Multidimensional big data intelligent risk screening system

Info

Publication number: CN111667356A
Application number: CN202010477432.7A
Authority: CN
Inventors: 陈建; 龙泳先
Original assignee: Beijing Ruizhi Tuyuan Technology Co ltd
Current assignee: Beijing Ruizhi Tuyuan Technology Co ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-09-15

Abstract

The invention discloses a multidimensional big data intelligent risk screening system, which relates to the technical field of data processing and is used for risk screening based on big data. The system comprises a data source module, a data preprocessing module, a data modeling module, a rating module, a risk screening module and an interaction module, wherein the data source module comprises a data collector and securities dealer business data. The data source module obtains relevant data from multiple parties, from which various derived variables, scores and risk levels are calculated; throughout this process the data preprocessing module and the data modeling module ensure the accuracy, integrity and consistency of the data. A bank or other institution compiles the matching key information of applicants into a file, and an operator places the file in a specified directory to request early warning level information. The risk screening module performs data matching, and the interaction module returns the characteristic data of each applicant according to the data service contract. After transmission, the files stored on disk are automatically deleted on a timed schedule.

Description

Multidimensional big data intelligent risk screening system

Technical Field

The invention relates to the technical field of data processing, in particular to a multidimensional big data intelligent risk screening system.

Background

With the rapid development of credit business in recent years, changes in the policy environment and ever-intensifying market competition, customers' circumstances change quickly and post-loan inspection has become increasingly important. To better guard against the deterioration of credit risk, to further improve the quality of post-loan management and to strengthen risk prevention and control, companies have begun to offer post-loan risk screening services; banks, small loan companies and consumer finance institutions all have a need for such screening. A risk screening system can identify high-risk groups by means of scoring services such as customer credit scores, quick-check score bands, customer profiles, customer early warning levels, online duration, online status, offline risk early warning and anti-fraud indicators. It is used to pre-judge the level of fraud risk and provides full-process risk early warning before, during and after the loan.

Through retrieval, Chinese patent application No. 201510620821.X discloses a big data screening system and method for criminal accomplices. In the screening system, an acquisition module acquires signaling data, a preprocessing module performs correlation analysis on the signaling data, a summarizing module summarizes the results of the correlation analysis, and a page display module displays them according to a query result. The corresponding method comprises collecting the signaling data of a designated user, performing correlation analysis and summarization on the signaling data, and displaying the accomplices' tracks through a geographic information platform. The screening system and method of the above patent have the following disadvantage: early warning and risk screening cannot be performed on the basis of the big data obtained.

Disclosure of Invention

The invention aims to solve the defects in the prior art, and provides a multi-dimensional big data intelligent risk screening system.

In order to achieve the purpose, the invention adopts the following technical scheme:

A multidimensional big data intelligent risk screening system comprises a data source module, a data preprocessing module, a data modeling module, a rating module, a risk screening module and an interaction module, wherein the data source module comprises a data collector, securities dealer business data, partner data and a third-party data market; the data preprocessing module comprises a data cleaning technology, a data reduction technology, a data integration technology and a data transformation technology; the data modeling module establishes a mathematical model using logistic regression to predict customer risk; the rating module identifies people with a low probability of repayment according to the data obtained by the data modeling module, and specifically divides people into seven risk levels A, B, C, D, E, F and G; the risk screening module is in communication connection with the rating module and comprises a data matching module; the interaction module is in communication connection with the risk screening module.

Preferably: the data collector collects customer behavior information on a PC or mobile terminal by software means such as API, SDK and JS.

Preferably: the securities dealer business data mainly comprises centralized trading information of securities dealers, such as open centralized auction trading, block trades, negotiated transfers and after-hours trading, and investment system data on users' trading through the securities dealers' online investment platforms, investment analysis and decision systems and other investment systems.

Preferably: the partner data is mainly data information which is provided by organizations having a cooperative relationship with the software developer and which reflects the behavior preferences, consumption status and other relevant circumstances of customers, and comprises official account data, e-commerce site data and media data.

Preferably: the third-party data market comprises blacklist data providers, telecommunication consumption data providers, financial consumption data providers and other data providers.

Preferably: the data cleaning technology removes noise from the data and corrects inconsistencies; the data reduction technology reduces the scale of the data by aggregation, deletion of redundant features or clustering; the data integration technology consolidates data from multiple data sources into a coherent data store, such as a data warehouse; the data transformation technology compresses data into a smaller interval, such as 0.0 to 1.0.

Preferably: the function L in the logistic regression is generally the sigmoid function

$$L(z) = \frac{1}{1 + e^{-z}}$$

The loss function of the logistic regression is

$$L(y_1, y_2) = -\left(y_2 \log(y_1) + (1 - y_2)\log(1 - y_1)\right)$$

where $y_1$ is the predicted result and $y_2$ is the true result. The cost function is defined as the average of the loss function over the m training samples:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L\left(y_1^{(i)}, y_2^{(i)}\right)$$

The cost function measures the average error cost between the predicted results and the true results. The optimization objective is to minimize J(w, b); minimizing the cost function optimizes the model, and the minimization can be carried out by the gradient descent method.

Preferably: the interaction module operates as follows: the customer compiles the matching key information of the applicants into a file, and an operator places the file in a specified directory to request early warning level information; the interaction module passes the request to the risk screening module, obtains the relevant information from the rating module, and returns the characteristic data of the applicants according to the data service contract.

Preferably: during the screening service, the risk screening module generates monitoring data (concurrency, error count, slow queries and the like) and sends it to a monitoring center; the monitoring center raises alarms according to monitoring rules and generates system reports, and alarms by e-mail, short message, telephone call and the like are delivered through a message center; the monitoring center can also configure the service system so as to dynamically configure the service capability of the system.

Preferably: the input and output data of the interaction module are files, each line of a file holds all the input and output data of a single call, and a timed clearing unit is arranged in the interaction module.

Preferably: the data cleaning technology removing noise from the data and correcting inconsistencies comprises:

A1, determining the customer behavior information;

training the collected customer behavior information and arranging the results as S, wherein S can be expressed as:

$$S = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

wherein $x_{ij}$ is the i-th training value of the j-th attribute of the customer behavior information, i ranges from 1 to m, m is the number of training runs of the customer behavior information, j ranges from 1 to n, and n is the number of attributes contained in the customer behavior information;

A2, calculating a cleaning threshold $\beta_j$ for each attribute, wherein $\beta_j$ is the cleaning threshold of the j-th attribute, k is a preset correction parameter, and E is the dynamic range of the standard deviation;

A3, screening noise in the data to obtain a screening result $\lambda_{ij}$, wherein $\lambda_{ij} = 1$ represents that the i-th training value $x_{ij}$ of the j-th attribute of the customer behavior information needs no correction, and $\lambda_{ij} = 0$ represents that $x_{ij}$ must be corrected in step A4;

A4, correcting inconsistent data:

$$t_{ij} = \mathrm{MEDIAN}(x_{1j} : x_{mj}), \quad \lambda_{ij} = 0$$

wherein $t_{ij}$ is the corrected value of $x_{ij}$, and $\mathrm{MEDIAN}(x_{1j} : x_{mj})$ is the median function.

The invention has the following beneficial effects: the data source module acquires relevant data from multiple parties for calculating various derived variables, scores and risk levels, while the data preprocessing module and the data modeling module guarantee the accuracy, integrity and consistency of the data. A bank or other institution compiles the matching key information of applicants into a file, and an operator places the file in a designated directory to request early warning level information; the risk screening module performs data matching, and the interaction module returns the characteristic data of each applicant according to the data service contract. After transmission, a timed automatic deletion mechanism removes the files stored on disk, so the sensitive data in scoring requests submitted by the bank is not retained on disk. The data obtained from each data partner is used only to calculate early warning levels and generate risk indicator variables, and no personal sensitive data is persisted to disk.

Drawings

Fig. 1 is a schematic flow structure diagram of a multidimensional big data intelligent risk screening system provided by the present invention;

fig. 2 is a schematic diagram of a sigmoid function image of a multi-dimensional big data intelligent risk screening system provided by the invention.

Detailed Description

The technical solution of the present patent will be described in further detail with reference to the following embodiments.

Reference will now be made in detail to embodiments of the present patent, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present patent and are not to be construed as limiting the present patent.

In the description of this patent, it is to be understood that the terms "center," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientations and positional relationships indicated in the drawings for the convenience of describing the patent and for the simplicity of description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting of the patent.

In the description of this patent, it is noted that unless otherwise specifically stated or limited, the terms "mounted," "connected," and "disposed" are to be construed broadly and can include, for example, fixedly connected, disposed, detachably connected, disposed, or integrally connected and disposed. The specific meaning of the above terms in this patent may be understood by those of ordinary skill in the art as appropriate.

Example 1:

A multidimensional big data intelligent risk screening system, as shown in Fig. 1 and Fig. 2, comprises a data source module, a data preprocessing module, a data modeling module, a rating module, a risk screening module and an interaction module; the data source module comprises a data collector, securities dealer business data, partner data and a third-party data market; the data preprocessing module comprises a data cleaning technology, a data reduction technology, a data integration technology and a data transformation technology; the data modeling module establishes a mathematical model using logistic regression to predict customer risk; the rating module identifies people with a low probability of repayment according to the data obtained by the data modeling module, and specifically divides people into seven risk levels A, B, C, D, E, F and G; the risk screening module is in communication connection with the rating module and comprises a data matching module; the interaction module is in communication connection with the risk screening module.

The data collector collects customer behavior information on a PC or mobile terminal by software means such as API, SDK and JS.

The securities dealer business data mainly comprises centralized trading information of securities dealers, such as open centralized auction trading, block trades, negotiated transfers and after-hours trading, and investment system data on users' trading through the securities dealers' online investment platforms, investment analysis and decision systems and other investment systems.

The partner data is mainly data information which is provided by organizations having a cooperative relationship with the software developer and which reflects the behavior preferences, consumption status and other relevant circumstances of customers, and comprises official account data, e-commerce site data, media data and the like.

The third-party data market comprises blacklist data providers, telecommunication consumption data providers, financial consumption data providers and other data providers.

The data cleaning technology removes noise from the data and corrects inconsistencies; the data reduction technology reduces the scale of the data by aggregation, deletion of redundant features or clustering; the data integration technology consolidates data from multiple data sources into a coherent data store, such as a data warehouse; the data transformation technology compresses data into a smaller interval, such as 0.0 to 1.0, which can improve the accuracy and efficiency of mining algorithms that involve distance measures.
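By way of non-limiting illustration, the data transformation step, compressing values into the interval 0.0 to 1.0, may be sketched as a min-max rescaling. The function name `min_max_scale` and the sample figures are hypothetical and do not appear in the disclosure, which does not state which transformation is actually used.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale a list of numbers into [lo, hi].

    A constant column has no spread, so every entry is mapped to lo.
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]


# Hypothetical attribute: monthly spend of four customers.
monthly_spend = [200.0, 500.0, 800.0, 1100.0]
scaled = min_max_scale(monthly_spend)
```

After such rescaling, attributes measured in different units contribute on a comparable scale to distance-based mining algorithms.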

In the logistic regression, w and b are the parameters to be solved. The logistic regression maps w·x + b to a hidden state P through a function L, that is, P = L(w·x + b), and then determines the value of the dependent variable by comparing P with 1 - P. When L is the logistic function, the model is a logistic regression.

The function L in the logistic regression is generally the sigmoid function

$$L(z) = \frac{1}{1 + e^{-z}}$$

The loss function of the logistic regression is

$$L(y_1, y_2) = -\left(y_2 \log(y_1) + (1 - y_2)\log(1 - y_1)\right)$$

where $y_1$ is the predicted result and $y_2$ is the true result. The cost function is defined as the average of the loss function over the m training samples:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L\left(y_1^{(i)}, y_2^{(i)}\right)$$

The cost function measures the average error cost between the predicted results and the true results. The optimization objective is to minimize J(w, b); minimizing the cost function optimizes the model, and the minimization can be carried out by the gradient descent method.

In the gradient descent method, w and b are updated as

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w, b)}{\partial b} \tag{6}$$

where $\alpha$ is the learning rate, representing the step size of each move, and the gradients $\partial J(w, b)/\partial w$ and $\partial J(w, b)/\partial b$, that is, the slope at the current point, specify the direction of movement.

To find the minimum, the gradient descent method moves in the negative direction of the gradient. Represented graphically, the curve in the figure is the cost function J and the abscissa is w or b. When the gradient (slope) is positive (the gradient points to the right), formula (6) updates w towards the left, approaching the lowest point of the curve (where the gradient is 0); when the gradient (slope) is negative (the gradient points to the left), formula (6) updates w towards the right, again approaching the lowest point of the curve. When the gradient reaches 0 the minimum is attained, yielding the optimal parameters w and b that minimize J.
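The gradient descent procedure described above may be sketched as follows, purely as an illustration. Each step moves w and b against the gradient of J by a step size `alpha` (the learning rate). The training points, the value of `alpha` and the number of steps are hypothetical choices for the example, and the gradient expressions are the standard ones for the sigmoid with the logarithmic loss, not formulas quoted from the disclosure.

```python
import math


def sigmoid(z):
    """Sigmoid function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost(w, b, xs, ys):
    """Average logarithmic loss J(w, b) over the training samples."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Repeatedly move (w, b) in the negative gradient direction."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        dw = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dw
        db = sum(errors) / m                             # dJ/db
        w -= alpha * dw
        b -= alpha * db
    return w, b


# Hypothetical, deliberately non-separable data so the optimum is finite.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = gradient_descent(xs, ys)
```

A positive gradient makes the update decrease w and a negative gradient makes it increase w, which is the leftward and rightward movement described in the text.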

Levels F and G are extension levels: the user may decide whether these two additional risk levels are needed.
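As a non-limiting sketch, the rating step may be pictured as mapping a predicted probability onto the levels A to G. The cut-off values, the use of a default probability rather than a repayment probability, the convention that A denotes the lowest risk, and the folding of F and G into E when the two optional levels are not used are all assumptions made for this example; none of them is specified in the disclosure.

```python
# Hypothetical upper bounds on the predicted default probability for each level.
LEVEL_UPPER_BOUNDS = [
    (0.05, "A"),
    (0.10, "B"),
    (0.20, "C"),
    (0.35, "D"),
    (0.50, "E"),
    (0.75, "F"),
]


def risk_level(p_default, use_extended=True):
    """Map a default probability in [0, 1] to a risk level.

    When use_extended is False, the optional levels F and G are
    folded into E (an assumed convention for this sketch).
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must lie in [0, 1]")
    level = "G"
    for upper, name in LEVEL_UPPER_BOUNDS:
        if p_default < upper:
            level = name
            break
    if not use_extended and level in ("F", "G"):
        return "E"
    return level
```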

The data matching module matches data from the data source module by means of the matching key of the applicant.

The interaction module operates as follows: the customer compiles the matching key information of the applicants into a file, and an operator places the file in a specified directory to request early warning level information; the interaction module passes the request to the risk screening module, obtains the relevant information from the rating module, and returns the characteristic data of the applicants according to the data service contract.

Further, the customers comprise banks, small loan companies, internet finance companies and other financial service organizations.

Further, the matching key of the applicant comprises the personal identification number, name, commonly used mobile phone number and loan-associated bank account.
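For illustration only, data matching by matching key may be sketched as an exact lookup on the four fields listed above. The type `MatchKey`, the in-memory dictionary standing in for the data source module, and all sample values are hypothetical placeholders; the disclosure does not describe the matching algorithm in detail.

```python
from typing import Dict, NamedTuple, Optional


class MatchKey(NamedTuple):
    id_number: str
    name: str
    mobile: str
    bank_account: str


def normalize(key: MatchKey) -> MatchKey:
    """Strip surrounding whitespace so trivially different spellings still match."""
    return MatchKey(*(field.strip() for field in key))


def match_applicant(key: MatchKey, store: Dict[MatchKey, dict]) -> Optional[dict]:
    """Return the characteristic data stored for the applicant, or None if absent."""
    return store.get(normalize(key))


# Made-up placeholder record, not real data.
store = {
    MatchKey("ID0001", "Zhang San", "13800000000", "6222000000000001"): {
        "warning_level": "B",
        "credit_score": 712,
    },
}

hit = match_applicant(
    MatchKey(" ID0001", "Zhang San", "13800000000 ", "6222000000000001"), store
)
miss = match_applicant(
    MatchKey("ID9999", "Li Si", "13900000000", "6222000000000002"), store
)
```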

During the screening service, the risk screening module generates monitoring data (concurrency, error count, slow queries and the like) and sends it to a monitoring center. The monitoring center raises alarms according to monitoring rules and generates system reports; alarms by e-mail, short message, telephone call and the like are delivered through a message center. The monitoring center can also configure the service system so as to dynamically configure the service capability of the system.
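As a minimal sketch, the evaluation of monitoring rules by the monitoring center may look like the following. The three metric names follow the text, whereas the `Metrics` class, the `RULES` table and its threshold values are hypothetical; delivery of the alarms through the message center is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    concurrency: int
    error_count: int
    slow_queries: int


# Hypothetical monitoring rules: an alarm is raised when a metric exceeds its limit.
RULES = {"concurrency": 200, "error_count": 10, "slow_queries": 5}


def evaluate(metrics, rules=RULES):
    """Return one alarm message for every metric above its configured limit."""
    alarms = []
    for name, limit in rules.items():
        value = getattr(metrics, name)
        if value > limit:
            alarms.append(f"{name}={value} exceeds limit {limit}")
    return alarms


alarms = evaluate(Metrics(concurrency=50, error_count=12, slow_queries=0))
```

Keeping the limits in a table separate from the evaluation logic is one way the rules could be reconfigured at run time.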

The input and output data of the interaction module are files, and each line of a file holds all the input and output data of a single call. A timed clearing unit is arranged in the interaction module and automatically deletes the files on disk on a timed schedule, so that no personal sensitive data is retained, whether it is input data from a credit institution or output data from a data partner.

When the system is used, the data source module acquires relevant data from multiple parties for calculating various derived variables, scores and risk levels, while the data preprocessing module and the data modeling module guarantee the accuracy, integrity and consistency of the data. A bank or other institution compiles the matching key information of applicants into a file, and an operator places the file in a designated directory to request early warning level information; the risk screening module performs data matching, and the interaction module returns the characteristic data of each applicant according to the data service contract. After transmission, a timed automatic deletion mechanism removes the files stored on disk, so the sensitive data in scoring requests submitted by the bank is not retained on disk. The data obtained from each data partner is used only to calculate early warning levels and generate risk indicator variables, and no personal sensitive data is persisted to disk.
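The timed clearing unit may be pictured, as a sketch under stated assumptions, as a routine that removes every file in the exchange directory older than a retention period. The function name `purge_expired`, the one-hour retention period and the file names are hypothetical; the disclosure states only that files are deleted automatically on a timed schedule.

```python
import os
import tempfile
import time
from pathlib import Path


def purge_expired(directory, max_age_seconds, now=None):
    """Delete regular files last modified more than max_age_seconds ago.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Demonstration in a temporary directory with one stale and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "request_old.txt"
    new_file = Path(tmp) / "request_new.txt"
    old_file.write_text("placeholder line\n")
    new_file.write_text("placeholder line\n")
    stale = time.time() - 7200  # pretend the file was written two hours ago
    os.utime(old_file, (stale, stale))
    removed = purge_expired(tmp, max_age_seconds=3600)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

In a deployment such a routine would be invoked periodically by a scheduler; that part is omitted here.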

Example 2:

The function L in the logistic regression is generally the sigmoid function

$$L(z) = \frac{1}{1 + e^{-z}}$$

In the gradient descent method, w and b are updated as

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w, b)}{\partial b} \tag{6}$$

where $\alpha$ is the learning rate and the gradient, that is, the slope at the current point, specifies the direction of movement. To find the minimum, the gradient descent method moves in the negative direction of the gradient. Represented graphically, the curve in the figure is the cost function J and the abscissa is w or b. When the gradient (slope) is positive (the gradient points to the right), formula (6) updates w towards the left, approaching the lowest point of the curve (where the gradient is 0); when the gradient (slope) is negative (the gradient points to the left), formula (6) updates w towards the right, again approaching the lowest point of the curve. When the gradient reaches 0 the minimum is attained, yielding the optimal parameters w and b that minimize J.

When the system is used, the data source module obtains relevant data from multiple parties, from which various derived variables, scores and risk levels are calculated; throughout this process the data preprocessing module and the data modeling module ensure the accuracy, integrity and consistency of the data. A bank or other institution compiles the matching key information of applicants into a file, and an operator places the file in a specified directory to request early warning level information. The risk screening module performs data matching, and the interaction module returns the characteristic data of each applicant according to the data service contract. After transmission, the files stored on disk are automatically deleted on a timed schedule.

Example 3:

In the above embodiments, the data cleaning technology removing noise from the data and correcting inconsistencies comprises:

A1, determining the customer behavior information, wherein $x_{ij}$ is the i-th training value of the j-th attribute of the customer behavior information, i ranges from 1 to m, m is the number of training runs of the customer behavior information, j ranges from 1 to n, and n is the number of attributes contained in the customer behavior information; the attributes include at least customer preferences, consumption behaviors and lifestyle;

A2, calculating a cleaning threshold $\beta_j$ for each attribute, wherein $\beta_j$ is the cleaning threshold of the j-th attribute, k is a preset correction parameter, generally with 0 < k < 1, and E is the dynamic range of the standard deviation;

A3, screening noise in the data;

A4, correcting inconsistent data:

$$t_{ij} = \mathrm{MEDIAN}(x_{1j} : x_{mj}), \quad \lambda_{ij} = 0$$

Beneficial effects: in this technical solution, the collected customer behavior information is first trained, a cleaning threshold is then calculated to screen out the noise in the data, and finally the inconsistent data so identified is corrected.
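Steps A1 to A4 may be sketched as follows, by way of illustration only. The formulas for the cleaning threshold and the screening result are not reproduced in this text, so the sketch substitutes loudly labelled assumptions: the threshold of attribute j is taken as k times the value range of column j, and a value is marked for correction when it deviates from the column median by more than that threshold. Only the final correction, replacing a marked value by the column median, follows the formula given in step A4. The sample matrix is hypothetical.

```python
from statistics import median


def clean(S, k=0.5):
    """Screen and correct an m x n matrix S of training values.

    Returns (T, lam): T is the corrected matrix and lam[i][j] is 1 when
    S[i][j] was kept and 0 when it was replaced by the column median.
    """
    m, n = len(S), len(S[0])
    T = [list(row) for row in S]
    lam = [[1] * n for _ in range(m)]
    for j in range(n):
        column = [S[i][j] for i in range(m)]
        med = median(column)
        # ASSUMPTION: threshold = k * (max - min); the source formula is not available.
        beta_j = k * (max(column) - min(column))
        for i in range(m):
            # ASSUMPTION: a value is noise when it lies further than beta_j from the median.
            if abs(column[i] - med) > beta_j:
                lam[i][j] = 0
                T[i][j] = med  # step A4: replace by MEDIAN(x_1j : x_mj)
    return T, lam


# Hypothetical data: 5 training runs, 2 attributes; 95 is an obvious outlier.
S = [[10, 3], [11, 4], [12, 2], [95, 3], [9, 3]]
T, lam = clean(S)
```

The median is a natural replacement value because, unlike the mean, it is barely affected by the outliers being corrected.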

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A multidimensional big data intelligent risk screening system, comprising a data source module, a data preprocessing module, a data modeling module, a rating module, a risk screening module and an interaction module, characterized in that the data source module comprises a data collector, securities dealer business data, partner data and a third-party data market; the data preprocessing module comprises a data cleaning technology, a data reduction technology, a data integration technology and a data transformation technology; the data modeling module establishes a mathematical model using logistic regression to predict customer risk; the rating module identifies people with a low probability of repayment according to the data obtained by the data modeling module, and specifically divides people into seven risk levels A, B, C, D, E, F and G; the risk screening module is in communication connection with the rating module and comprises a data matching module; the interaction module is in communication connection with the risk screening module.

2. The multidimensional big data intelligent risk screening system according to claim 1, wherein the data collector collects customer behavior information on a PC or mobile terminal by software means such as API, SDK and JS.

3. The multidimensional big data intelligent risk screening system according to claim 2, wherein the securities dealer business data comprises centralized trading information of securities dealers, such as open centralized auction trading, block trades, negotiated transfers and after-hours trading, and investment system data on users' trading through the securities dealers' online investment platforms, investment analysis and decision systems and other investment systems.

4. The multidimensional big data intelligent risk screening system according to claim 3, wherein the partner data is mainly data information which is provided by organizations having a cooperative relationship with the software developer and which reflects the behavior preferences, consumption status and other relevant circumstances of customers, and comprises official account data, e-commerce site data and media data.

5. The multidimensional big data intelligent risk screening system according to claim 4, wherein the third-party data market comprises blacklist data providers, telecommunication consumption data providers, financial consumption data providers and other data providers.

6. The multidimensional big data intelligent risk screening system according to claim 5, wherein the data cleaning technology removes noise from the data and corrects inconsistencies; the data reduction technology reduces the scale of the data by aggregation, deletion of redundant features or clustering; the data integration technology consolidates data from multiple data sources into a coherent data store, such as a data warehouse; the data transformation technology compresses data into a smaller interval, such as 0.0 to 1.0.

7. The multidimensional big data intelligent risk screening system according to claim 1, wherein the function L in the logistic regression is generally the sigmoid function

$$L(z) = \frac{1}{1 + e^{-z}}$$

8. The multidimensional big data intelligent risk screening system according to claim 7, wherein, for the interaction module, the customer compiles the matching key information of the applicants into a file, an operator places the file in a specified directory to request early warning level information, and the interaction module passes the file to the risk screening module, obtains the relevant information from the rating module, and returns the characteristic data of the applicants according to the data service contract.

9. The multidimensional big data intelligent risk screening system according to claim 8, wherein, during the screening service, the risk screening module generates monitoring data (concurrency, error count, slow queries and the like) and sends it to a monitoring center; the monitoring center raises alarms according to monitoring rules and generates system reports, and alarms by e-mail, short message, telephone call and the like are delivered through a message center; the monitoring center can also configure the service system so as to dynamically configure the service capability of the system.

10. The multidimensional big data intelligent risk screening system according to claim 9, wherein the input and output data of the interaction module are files, each line of a file holds all the input and output data of a single call, and a timed clearing unit is arranged in the interaction module.

11. The multidimensional big data intelligent risk screening system according to claim 6, wherein the data cleaning technology removing noise from the data and correcting inconsistencies comprises:

A1, determining the customer behavior information, wherein $x_{ij}$ is the i-th training value of the j-th attribute of the customer behavior information, i ranges from 1 to m, m is the number of training runs of the customer behavior information, j ranges from 1 to n, and n is the number of attributes contained in the customer behavior information;

A2, calculating a cleaning threshold;

A3, screening noise in the data;

A4, correcting inconsistent data:

$$t_{ij} = \mathrm{MEDIAN}(x_{1j} : x_{mj}), \quad \lambda_{ij} = 0$$