CN106408184A

CN106408184A - User credit evaluation model based on multi-source heterogeneous data

Info

Publication number: CN106408184A
Application number: CN201610817430.1A
Authority: CN
Inventors: 郑子彬; 杨亚涛; 黄春振
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2016-09-12
Filing date: 2016-09-12
Publication date: 2017-02-15

Abstract

The invention relates to a user credit evaluation model based on multi-source heterogeneous data, comprising the following steps: (1) acquiring and merging multi-source heterogeneous data; (2) processing user features; and (3) training the model. In the feature expansion and selection stage of the proposed framework, the user's data dimensions are first expanded and useful features are then selected, which reduces the feature dimension and the time complexity of the model. Missing and abnormal data are handled during feature processing, which improves the robustness of the model to missing values.

Description

A user credit evaluation model based on multi-source heterogeneous data

Technical field

The present invention relates to the field of credit evaluation, and in particular to a user credit evaluation model based on multi-source heterogeneous data.

Background art

User credit evaluation means that a credit evaluation agency uses expert judgment or mathematical analysis to comprehensively evaluate the ability of individuals and enterprises to honor their various commitments, together with their degree of trustworthiness, and expresses the result in simple, clear symbols or words to meet the needs of society and the market. Credit evaluation is already widely used in finance. Traditional financial institutions evaluate credit from the user's financial and behavioral records held at that institution. With the development of big data, the limited data used by traditional credit evaluation is due to be updated or replaced.

With the deep development of the Internet, users generate all kinds of behavior records online every day. These data reflect users' real behavior and are therefore also important data for user credit evaluation. Using a user's multi-source heterogeneous data for credit evaluation has become a new trend. We propose making deep use of data in the following dimensions for user credit evaluation:

1) Basic information: basic demographic information such as the user's age, place of origin, and current work address;

2) Network behavior information: information such as the web pages the user browses, the tools used to browse them, and the distribution and duration of browsing;

3) Student status and education information: the user's educational background;

4) Social network information: the user's behavior and social information on public social networks such as Weibo and Zhihu;

5) Third-party payment information: the user's consumption records on third-party payment platforms;

6) Online questionnaire information: credit-related information and basic information collected through questionnaires.

The main data in the six dimensions above all come from the Internet, which differs markedly from the data used in traditional credit evaluation. The data of an Internet user can reach thousands of dimensions and come from different sources, so the user's credit can be evaluated from many angles; more dimensions describe a user's credit standing more fully.

However, the increase from tens of dimensions to thousands also makes the model harder to build. The challenges facing the model can be summarized as follows:

1. High data dimensionality. Traditional credit evaluation models are built on features with only tens of dimensions, so training time is short and data dimensionality has received little attention. Evaluation based on Internet information, however, considers not only information about the user's transactions but also dimensions such as the user's social network and behavioral preferences, so the data can reach thousands of dimensions. Such high-dimensional data requires a good feature selection method that reduces the feature dimension without degrading evaluation performance, improving both the training speed and the practical effectiveness of the model;

2. Missing values and outliers. Because so many user dimensions are considered, a user cannot have a value in every dimension, and in many cases a large share of a user's data is missing. In addition, because some data are obtained implicitly, their correctness during collection or transmission cannot be fully guaranteed, so the data also contain some outliers. Current models seldom propose a specific, detailed solution to this problem, yet the handling of missing values and outliers is particularly important for improving evaluation performance.

Summary of the invention

To solve the above problems, the present invention proposes a user credit evaluation model based on multi-source heterogeneous data, which comprises the following steps:

(1) acquisition and merging of multi-source heterogeneous data;

(2) processing of user features;

(3) training of the model.

Further, the acquisition of the multi-source heterogeneous data includes:

using crawler technology to crawl user-related information from web pages;

user self-provision: as a precondition for obtaining a credit report, the user provides appropriate basic personal information;

access to third-party institution data authorized by the user.

Further, the merging of the multi-source heterogeneous data includes:

matching user-authorized information with user-provided data on any one of e-mail address, mobile phone number, or identity card ID;

merging the information crawled from the web by e-mail address, user name, and user authorization.

Further, the processing of user features includes handling of features with missing or abnormal values, discrete coding of categorical features, deep mining of time-series features, and extraction of statistical features.

Further, the training of the model includes linear model training and decision-tree model training.

Further, the multi-source heterogeneous data include the user's basic information, academic information, payment information, social network information, operation information, and network behavior information.

Further, the handling of features with missing or abnormal values specifically includes:

A. features with a missing rate below 20% are filled: numeric features are filled with the mean, and categorical features are filled with the mode;

B. for a missing rate above 97%, discarding and discrete-coding conversion are applied: discarding removes features whose share of missing values exceeds 97%, and where the missing rate is high, such features are discretely coded;

C. missing-value statistics matrix: from the user feature matrix, missing entries are set to 1 and non-missing entries are set to 0.

Further, the discrete coding of categorical features specifically includes: a feature whose value can take N possible cases is encoded as N binary features; these features are mutually exclusive and only one is active at a time, which makes the data sparse.

Further, the deep mining of time-series features specifically includes:

1. subtracting adjacent periods, which represents the change in difference across periods or intervals;

2. dividing adjacent periods, which represents the period-over-period ratio (slope) change across periods or intervals;

3. accumulating, which represents the change in the running sum.

Further, the extraction of statistical features specifically includes computing the missing rate of the user's information, whether the user has large-amount transaction records, statistics on the user's active time, and the rate of change of the user's location; the statistical method includes global statistics or binned statistics.

Further, the linear model training includes LASSO, Liblinear, and Linear-SVM; the decision-tree model training includes Boosting and XGBoost.

The beneficial effects of the invention are as follows: in the feature expansion and selection stage of the proposed model framework, the user's data dimensions are first expanded and useful features are then selected, which reduces the feature dimension and the time complexity of the model; in addition, missing and abnormal data are handled during feature processing, which makes the model robust to missing values.

Brief description of the drawings

Fig. 1 is a diagram of user credit evaluation based on multi-source heterogeneous data;

Fig. 2 is a diagram of the missing-value statistics matrix;

Fig. 3 is a diagram of how address-type features are constructed;

Fig. 4 is a diagram of the mapping used for the division operation.

Detailed description of the embodiments

The present invention will be described in detail below：

As shown in Fig. 1, user credit evaluation based on multi-source heterogeneous data comprises three major steps:

(1) acquisition and merging of multi-source heterogeneous data;

(2) processing of user features;

(3) training of the model.

Wherein：

(1) Data foundation layer

The data foundation layer includes the user's basic information, academic information, payment information, social network information, operation information, network behavior information, and so on, all gathered in a network environment. These data come from different sources and can effectively express every aspect of the user, which is key to letting the model grasp the user's credit standing more accurately. The data are linked through any one of user ID, identity card ID, e-mail address, and mobile phone number. Linking the multi-source data to the user prepares the data for the next step, the multi-dimensional evaluation of the user's credit.

Specifically, the multi-source heterogeneous data are acquired by:

1) crawler technology: crawling user-related information from web pages;

2) user self-provision: as a precondition for obtaining a credit report, the user provides appropriate basic personal information;

3) access to third-party institution data authorized by the user.

The multi-source heterogeneous data are merged by:

1) matching user-authorized information with user-provided data on any one of e-mail address, mobile phone number, or identity card ID;

2) merging the information crawled from the web by e-mail address, user name, or IP (with user authorization).
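The matching rule in item 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names `email`, `phone`, and `id_card`, the function `merge_sources`, and the rule that existing values take precedence are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical field names; the text only names the three kinds of identifier.
MATCH_KEYS = ("email", "phone", "id_card")


def merge_sources(authorized: List[dict], self_provided: List[dict]) -> List[dict]:
    """Merge self-provided records into authorized records that share
    at least one non-empty identifier; unmatched records are appended."""
    merged = [dict(rec) for rec in authorized]  # copy so inputs stay untouched
    index: Dict[Tuple[str, str], dict] = {}
    for rec in merged:
        for key in MATCH_KEYS:
            if rec.get(key):
                index[(key, rec[key])] = rec

    for rec in self_provided:
        # The first identifier that hits the index decides the target record.
        target = next(
            (index[(k, rec[k])] for k in MATCH_KEYS
             if rec.get(k) and (k, rec[k]) in index),
            None,
        )
        if target is None:
            target = {}
            merged.append(target)
        for field, value in rec.items():
            # Assumption: existing non-empty values win; only gaps are filled.
            if target.get(field) in (None, ""):
                target[field] = value
        # Newly learned identifiers become matchable for later records.
        for key in MATCH_KEYS:
            if target.get(key):
                index[(key, target[key])] = target
    return merged
```

Matching on any one identifier, rather than all three, reflects the "any one of" wording above; a production system would also need to resolve conflicting identifiers, which this sketch does not attempt.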

(2) Data analysis layer

The data analysis layer covers several ways of processing data. In short, it finds ordered, structured features in messy, unordered data so that the user's information is expressed more precisely. The work of this layer includes:

1. Handling of features with missing or abnormal values

Merging multi-source data inevitably produces a large amount of missing data. There are many causes: the user may have no payment record at a certain bank, the user's basic information may not have been collected, or the user may simply not have filled in some fields. Different degrees of missingness call for different forms of data preprocessing.

Missing values appear in forms such as "-1" in numeric features, or an empty string or "NULL" in categorical features. Features can be handled according to their missing rate:

A. features with a missing rate below 0.2 are filled: numeric features are filled with the mean, and categorical features are filled with the mode. This filling threshold and filling method gave the best results in testing;

B. for a missing rate above 97%, discarding and discrete-coding conversion are applied. Discarding removes features whose share of missing values exceeds 97%. Where the missing rate is high, a feature is better suited to discretization, so we also discretely code such features.

C. missing-value statistics matrix: as shown in Fig. 2, from the user feature matrix, missing entries are set to 1 and non-missing entries are set to 0. These features are built because we consider a missing value to be a kind of information in itself.
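Rules A to C can be sketched in a few lines. This is an illustrative reading under stated assumptions: the column-oriented input format, the function names, and the choice to leave features between the two thresholds untouched (for later discrete coding) are not taken from the source.

```python
from collections import Counter
from statistics import mean


def is_missing(value) -> bool:
    # Markers named in the text: -1 for numeric features, "" or "NULL" for
    # categorical ones. Treating None as missing is an added assumption.
    return value is None or value in (-1, "", "NULL")


def process_missing(columns: dict, numeric: set, low: float = 0.20, high: float = 0.97):
    """columns maps feature name -> list of values (one per user).
    Returns (processed columns, 0/1 missing-indicator matrix by column)."""
    processed, indicator = {}, {}
    for name, values in columns.items():
        mask = [1 if is_missing(v) else 0 for v in values]
        indicator[name] = mask                    # rule C: 1 = missing, 0 = present
        rate = sum(mask) / len(values)
        if rate > high:
            continue                              # rule B: discard the feature
        if rate < low:                            # rule A: fill
            present = [v for v, m in zip(values, mask) if not m]
            if name in numeric:
                fill = mean(present)              # numeric: mean
            else:
                fill = Counter(present).most_common(1)[0][0]  # categorical: mode
            processed[name] = [fill if m else v for v, m in zip(values, mask)]
        else:
            processed[name] = list(values)        # assumption: left for discrete coding
    return processed, indicator
```

The indicator matrix is returned even for discarded features, since the text treats missingness itself as information.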

2. Discrete coding of categorical features

The main operation of discrete coding is to encode a feature whose value can take N possible cases as N binary features. These features are mutually exclusive and only one is active at a time, which makes the data sparse. The benefit of this coding is that it strengthens a tree model's ability to discriminate between feature values, and it also augments the feature set. During feature construction, we discretely code the categorical features (excluding addresses) and the numeric features that take fewer than 12 distinct values. Addresses are excluded because encoding them directly would produce too many feature dimensions, increasing model complexity without a corresponding gain. Address features are instead converted in a more fine-grained way, as shown in Fig. 3.
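The N-way binary coding described above is commonly called one-hot encoding. A minimal sketch follows; the function name `one_hot` and the convention of returning `None` for features with 12 or more distinct values are assumptions for illustration.

```python
def one_hot(values, max_levels=12):
    """Encode a feature with N distinct values as N mutually exclusive
    binary features. Returns (levels, rows), or None when the feature has
    max_levels or more distinct values and should not be encoded this way."""
    levels = sorted(set(values), key=str)   # stable column order
    if len(levels) >= max_levels:
        return None
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows
```

For example, `one_hot(["a", "b", "a", "c"])` yields three binary columns, and each row has exactly one active entry.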

3. Deep mining of time-series features

A significant part of the collected data is time-related. For example, a person's payment records contain payment information from different periods, and behavior records differ from period to period. The trend features of these time series can effectively capture the trend of a person's credit. We therefore process the features of different periods in more detail:

1. subtracting adjacent periods, which represents the change in difference across periods or intervals;

2. dividing adjacent periods, which represents the period-over-period ratio (slope) change across periods or intervals;

3. accumulating, which represents the change in the running sum;

For the division operation, some entries of a feature column are missing values (uniformly represented as -1), so the column cannot be divided directly. Cases that cannot be divided directly are handled in the form shown in Fig. 4.
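Fig. 4 is not reproduced in this text, so the exact mapping for undefined divisions is unknown. The sketch below shows the three transformations under an assumed rule: whenever an operand is the missing marker -1, or the denominator is zero, the result is reported as `None`. The function names are hypothetical.

```python
MISSING = -1  # the text's uniform marker for missing values


def period_differences(series):
    """Difference between each period and the previous one."""
    return [None if MISSING in (prev, cur) else cur - prev
            for prev, cur in zip(series, series[1:])]


def period_ratios(series):
    """Period-over-period ratio; None where division is not defined.
    Assumption: this stands in for the Fig. 4 mapping, which is not shown."""
    ratios = []
    for prev, cur in zip(series, series[1:]):
        if prev == MISSING or cur == MISSING or prev == 0:
            ratios.append(None)
        else:
            ratios.append(cur / prev)
    return ratios


def running_sums(series):
    """Cumulative sum; missing periods contribute nothing (an assumption)."""
    total, sums = 0, []
    for value in series:
        if value != MISSING:
            total += value
        sums.append(total)
    return sums
```

Results are reported as `None` rather than -1 because a genuine difference can itself equal -1, which would be indistinguishable from the missing marker.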

4. Statistical features

Statistical features capture global information effectively. For example, suppose someone's bank deposit is 50,000. If deposits across the whole sample are only a few thousand, this person counts as relatively rich; if deposits across the whole sample are all around 100,000, this person counts as relatively poor. Without global statistics, such information is hard to capture, so statistical features are also an important indicator for evaluating users.

Beyond the feature construction above, we propose several statistical features, such as the missing rate of the user's information, whether the user has large-amount transaction records, statistics on the user's active time, and the rate of change of the user's location. All of these contribute substantially to determining a user's credit rating. In addition to global statistics, binned statistics can also be used.
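The deposit example and the per-user missing rate can be made concrete with a small sketch. The function names, the ratio-to-global-mean feature, and the equal-frequency binning scheme are illustrative assumptions; the source does not specify how global or binned statistics are computed.

```python
from statistics import mean

MISSING_MARKERS = (-1, "", "NULL")


def missing_rate(row: dict) -> float:
    """Share of a user's fields that carry a missing marker."""
    flags = [v is None or v in MISSING_MARKERS for v in row.values()]
    return sum(flags) / len(flags)


def relative_to_global(values):
    """Each value divided by the global mean: above 1 means above average."""
    global_mean = mean(values)
    return [v / global_mean for v in values]


def equal_frequency_bins(values, n_bins=4):
    """Bin index per value, with roughly equal numbers of users per bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

With deposits of 50,000 against a sample of a few thousand, the first user's ratio is well above 1; against a sample of 100,000 the same deposit gives a ratio below 1, matching the example in the text.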

(3) Model training layer

During model training, we use linear models and tree models together, so that different models train on the features from all angles. The features are thereby used more effectively and more accurate results are obtained. The models used in the model training layer are:

1. Linear models:

Linear model is the general name for a class of statistical models that includes linear regression models, analysis of variance models, analysis of covariance models, and linear mixed-effects models (also called variance component models). Many phenomena in biology, medicine, economics, management, geology, meteorology, agriculture, industry, engineering, and other fields can be described approximately by linear models. Linear models have therefore become one of the most widely used classes of models in modern statistics.

The linear models adopted by the present invention include:

LASSO

An improved linear algorithm. It is in essence still a kind of linear classifier, but it integrates feature selection and regularization, which improves the accuracy and interpretability of the statistical model.
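How L1 regularization performs feature selection can be seen in a small coordinate-descent sketch. Note the swap: for simplicity it uses the squared-error regression form of LASSO, not a classifier, and it is not the solver used in the patent. The names `soft_threshold` and `lasso_coordinate_descent` and the penalty weight `alpha` are assumptions.

```python
def soft_threshold(rho: float, alpha: float) -> float:
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j.
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho, z = rho / n, z / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w
```

The soft-thresholding step is what sets weakly relevant coefficients exactly to zero, which is the sense in which LASSO integrates feature selection with regularization.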

Liblinear

The algorithm is simple and efficient, is very widely used in practice, is fast, and can handle large volumes of data. It can effectively process continuous-valued data, and discretized features are self-explanatory. Liblinear balances goodness of fit against model interpretability reasonably well.

Linear-SVM

It does not use a kernel matrix, so it is much faster than LIBSVM. If extensive feature engineering has been applied to the training set and the dimensionality is very high, Linear-SVM is the more suitable choice, and it also reduces the risk of overfitting.

2. Decision-tree models:

A decision tree is a decision analysis method that, given the known probabilities of various scenarios, builds a tree to find the probability that the expected net present value is greater than or equal to zero, so as to evaluate project risk and judge feasibility. It is a graphical method that applies probability analysis intuitively. It is called a decision tree because the drawn decision branches resemble the branches of a tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values.

We use the Boosting model among the decision-tree models. During training, this model takes a second-order Taylor expansion of the training loss in the objective function and adds a regularization term to the objective so that the optimum can be sought as a whole. XGBoost also has the advantages of being fast, portable, requiring little code, and being fault tolerant.
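The second-order expansion mentioned above can be written out as follows. The patent gives no formula; the notation below follows the standard XGBoost formulation and is offered as a sketch, with $l$ the training loss, $f_t$ the tree added at round $t$, $T$ the number of leaves, $w_j$ the leaf weights, and $\gamma$, $\lambda$ the regularization parameters.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}

g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}

w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

Here $I_j$ is the set of samples falling in leaf $j$. Because the regularization term sits inside the objective, the optimal leaf weight $w_j^{*}$ is obtained in closed form for the loss and the penalty together.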

The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. A user credit evaluation model based on multi-source heterogeneous data, comprising the following steps:

(1) acquisition and merging of multi-source heterogeneous data;

(2) processing of user features;

(3) training of the model.

2. The user credit evaluation model based on multi-source heterogeneous data according to claim 1, characterized in that the acquisition of the multi-source heterogeneous data includes:

using crawler technology to crawl user-related information from web pages;

access to third-party institution data authorized by the user;

the merging of the multi-source heterogeneous data includes:

3. The user credit evaluation model based on multi-source heterogeneous data according to claim 1, characterized in that the processing of user features includes handling of features with missing or abnormal values, discrete coding of categorical features, deep mining of time-series features, and extraction of statistical features.

4. The user credit evaluation model based on multi-source heterogeneous data according to claim 1, characterized in that the training of the model includes linear model training and decision-tree model training.

5. The user credit evaluation model based on multi-source heterogeneous data according to claim 1, characterized in that the multi-source heterogeneous data include the user's basic information, academic information, payment information, social network information, operation information, and network behavior information.

6. The user credit evaluation model based on multi-source heterogeneous data according to claim 3, characterized in that the handling of features with missing or abnormal values specifically includes:

A. features with a missing rate below 20% are filled: numeric features are filled with the mean, and categorical features are filled with the mode;

B. for a missing rate above 97%, discarding and discrete-coding conversion are applied: discarding removes features whose share of missing values exceeds 97%, and where the missing rate is high, such features are discretely coded;

7. The user credit evaluation model based on multi-source heterogeneous data according to claim 3, characterized in that the discrete coding of categorical features specifically includes: a feature whose value can take N possible cases is encoded as N binary features; these features are mutually exclusive and only one is active at a time, which makes the data sparse.

8. The user credit evaluation model based on multi-source heterogeneous data according to claim 3, characterized in that the deep mining of time-series features specifically includes:

3. accumulating, which represents the change in the running sum.

9. The user credit evaluation model based on multi-source heterogeneous data according to claim 3, characterized in that the extraction of statistical features specifically includes computing the missing rate of the user's information, whether the user has large-amount transaction records, statistics on the user's active time, and the rate of change of the user's location; the statistical method includes global statistics or binned statistics.

10. The user credit evaluation model based on multi-source heterogeneous data according to claim 4, characterized in that the linear model training includes LASSO, Liblinear, and Linear-SVM; the decision-tree model training includes Boosting and XGBoost.