CN107633455A

CN107633455A - Credit estimation method and device based on data model

Info

Publication number: CN107633455A
Application number: CN201710785997.XA
Authority: CN
Inventors: 陈肖黎; 贾西贝
Original assignee: Shenzhen Huaao Data Technology Co Ltd
Current assignee: Shenzhen Huaao Data Technology Co Ltd
Priority date: 2017-09-04
Filing date: 2017-09-04
Publication date: 2018-01-26

Abstract

The invention belongs to the technical field of financial data processing and provides a credit estimation method and device based on a data model. The method includes: obtaining the characteristic variables required by an assessment model from the data to be assessed, and judging whether each characteristic variable of the data to be assessed is a failure variable; if so, completing it using the completion variable corresponding to the failure variable and inputting it into the assessment model; if not, inputting it into the assessment model directly. A failure variable is a characteristic variable whose information is missing or incomplete. The assessment model performs the assessment according to the input characteristic variables and outputs an evaluation result. The credit estimation method and device based on a data model can carry out credit evaluation with a small data set when data is missing or incomplete, improving credit default prediction performance.

Description

Credit estimation method and device based on data model

Technical field

The present invention relates to the technical field of financial data processing, and in particular to a credit estimation method and device based on a data model.

Background art

At present there are many personal lending applications on the market, and different applications target different groups. To reduce risk, the repayment ability of a user must be assessed; to target customers accurately, the user's tendency to borrow must be assessed.

In practice, the big data held by bank lending platforms suits traditional lending models. Internet platforms and the newest mobile platforms also face large volumes of data, but much of it is neither accurate nor comprehensive. If some data variables in a credit scoring model are missing or invalid, the borrower credit level predicted by the model may deviate or even be unpredictable, producing a biased estimate of the borrower. Moreover, during the start-up stage of a loan platform, data is limited, so the finance company may not know which borrower features are important in the credit scoring model. A credit scoring model from a large finance company will not necessarily predict local users accurately. In addition, because customers differ by region, a local credit model cannot be built from data from another region; for example, the same wage income does not imply the same credit level in a first-tier city as in a third-tier city, so user risk cannot be assessed effectively. Therefore, when samples are scarce at the initial stage, or when user data is incomplete or missing, existing assessment models cannot make an assessment. For example, wage income is one variable of a repayment ability assessment model; if the user's wage income or number of family members cannot be obtained, repayment ability cannot be assessed accurately.

How to carry out credit evaluation with a small data set when data is missing or incomplete, and thereby improve credit default prediction performance, is a problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of the defects in the prior art, the present invention provides a credit estimation method and device based on a data model that can carry out credit evaluation with a small data set when data is missing or incomplete, improving credit default prediction performance.

In a first aspect, the present invention provides a credit estimation method based on a data model. The method includes: obtaining the characteristic variables required by an assessment model from the data to be assessed, the assessment model being a model that has been trained and has passed inspection;

Judging whether each characteristic variable of the data to be assessed is a failure variable:

If so, completing it using the completion variable corresponding to the failure variable and inputting it into the assessment model;

If not, inputting it into the assessment model, a failure variable being a characteristic variable whose information is missing or incomplete;

The assessment model performs the assessment according to the input characteristic variables and outputs an evaluation result.

Further, before the characteristic variables required by the assessment model are obtained from the data to be assessed, the method also includes: classifying the sample data in the training set to obtain classification results;

Carrying out logistic regression on the sample data in the training set according to the classification results, and establishing the assessment model;

Obtaining a test result according to the assessment model and the sample data of the test set;

Inspecting the assessment model according to the test result.

Further, after the assessment model is inspected, the method also includes:

Randomly splitting the sample data of the training set and the test set according to a cross validation method;

Training the assessment model using the split sample data.

Further, before logistic regression is carried out on the sample data in the training set according to the classification results, the method also includes: calculating the distances of the sample data in the training set and determining associated variables;

Judging whether the distance value between any two associated variables is less than a distance threshold, and if so, merging the two associated variables.

Further, after the distances of the sample data in the training set are calculated, the method also includes:

Detecting the distance values between a given variable and the other variables;

Setting the variable with the smallest distance value to that variable as its completion variable;

Completion using the completion variable corresponding to the failure variable specifically includes:

Completing the failure variable using the information of its corresponding completion variable.

Further, after the assessment model is established and before replacement using the completion variable corresponding to the failure variable, the method also includes: inputting a target variable into the assessment model;

Checking, according to the information value of the existing characteristic variables of the assessment model, whether each existing characteristic variable is valid;

If a failed characteristic variable exists, setting the target variable as the completion variable of the failed characteristic variable.

Further, checking whether each existing characteristic variable is valid according to the information value of the existing characteristic variables of the assessment model specifically includes:

Calculating the information value of each characteristic variable according to the distribution proportions of the sample data in the training set;

Testing against a predetermined value threshold to judge whether each characteristic variable is valid.

On the basis of any of the above embodiments of the credit estimation method based on a data model, further, after a characteristic variable is judged to be a failure variable and before it is input into the assessment model, the method also includes: performing a statistical calculation on the failure variable to obtain its mean or median;

Completing the failure variable whose information is missing using the mean or median of that failure variable.

In a second aspect, the present invention provides a credit evaluation device based on a data model. The device includes a characteristic variable acquisition module, a failure variable completion module and an evaluation module. The characteristic variable acquisition module is used to obtain the characteristic variables required by an assessment model from the data to be assessed, the assessment model being a model that has been trained and has passed inspection. The failure variable completion module is used to judge whether each characteristic variable of the data to be assessed is a failure variable: if so, it is completed using the completion variable corresponding to the failure variable and input into the assessment model; if not, it is input into the assessment model. A failure variable is a characteristic variable whose information is missing or incomplete. The evaluation module is used to make the assessment model perform the assessment according to the input characteristic variables and output an evaluation result.

Further, the credit evaluation device of this embodiment also includes an assessment model establishing module, which is used to classify the sample data in the training set and obtain classification results; to carry out logistic regression on the sample data in the training set according to the classification results and establish the assessment model; to obtain a test result according to the assessment model and the sample data of the test set; and to inspect the assessment model according to the test result.

As can be seen from the above technical solution, the credit estimation method and device based on a data model provided by this embodiment use a pre-established assessment model to process the user's data to be assessed. Even if there are failure variables whose information is missing or incomplete, the failure variables can be completed using completion variables, which improves credit default prediction performance. Credit evaluation is completed with a small data set, which avoids the situation in which the assessment model cannot make an assessment because the amount of data is small, saves credit analysis cost, provides information support for credit decisions, reduces potential default risk, and improves the efficiency and automation level of credit evaluation.

Therefore, the credit estimation method and device of this embodiment carry out credit evaluation with a small data set when data is missing or incomplete, improving credit default prediction performance and effectively managing the credit risk of users.

Brief description of the drawings

In order to illustrate the specific embodiments of the invention or the technical solutions of the prior art more clearly, the accompanying drawings needed for describing the specific embodiments or the prior art are briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. The elements or parts in the drawings are not necessarily drawn to scale.

Fig. 1 is a flow chart of a credit estimation method based on a data model provided by the present invention;

Fig. 2 is a structural block diagram of a credit evaluation device based on a data model provided by the present invention.

Detailed description of the embodiments

Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly; they are therefore examples only and cannot be used to limit the scope of protection of the present invention.

It should be noted that, unless otherwise indicated, the technical or scientific terms used in this application have the ordinary meaning understood by those of ordinary skill in the art to which the present invention belongs.

In a first aspect, an embodiment of the present invention provides a credit estimation method based on a data model. With reference to Fig. 1, the method includes:

Step S1: Obtain the characteristic variables required by the assessment model from the data to be assessed, the assessment model being a model that has been trained and has passed inspection. For example, to evaluate whether a user can repay on time, the assessment model can use characteristic variables such as monthly pay, annual pay, length of service, region of residence and education background to assess the user's credit and judge whether the user poses a default risk.

In practice, the assessment model is a model obtained in advance through training and inspection.

Step S2: Judge whether each characteristic variable of the data to be assessed is a failure variable:

If so, complete it using the completion variable corresponding to the failure variable and input it into the assessment model.

If not, input it into the assessment model. A failure variable is a characteristic variable whose information is missing or incomplete.

For example, in practice, if the wage information that the assessment model obtains for a user is missing or incomplete, the characteristic variable "wage" is a failure variable, and it can be completed using information such as the user's real estate, length of service and industry.

Step S3: The assessment model performs the assessment according to the input characteristic variables and outputs an evaluation result.
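Steps S1 to S3 can be sketched as follows. This is a minimal illustration only: the names `complete_record` and `assess`, the rule that estimates wage from years of service, and the model weights are all hypothetical and do not come from the patent, and a missing value is represented simply as `None`.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Record = Dict[str, Optional[float]]
# failure-prone variable -> (name of its completion variable, estimator function)
Completers = Dict[str, Tuple[str, Callable[[float], float]]]


def complete_record(record: Record, completers: Completers) -> Dict[str, float]:
    """Step S2: replace each failure variable with a value derived from its completion variable."""
    completed: Dict[str, float] = {}
    for name, value in record.items():
        if value is None:  # failure variable: the information is missing
            source_name, estimate = completers[name]
            source_value = record[source_name]
            if source_value is None:
                raise ValueError(f"{name!r} and its completion variable are both missing")
            value = estimate(source_value)
        completed[name] = value
    return completed


def assess(features: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Step S3: toy logistic model that returns a probability of default."""
    z = bias + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical completion rule: estimate monthly wage from years of service.
completers = {"wage": ("years_of_service", lambda years: 3000.0 + 500.0 * years)}
record = {"wage": None, "years_of_service": 4.0}  # Step S1: extracted variables
features = complete_record(record, completers)    # Step S2: wage is filled in
risk = assess(features, {"wage": -0.0005, "years_of_service": -0.2}, bias=1.0)  # Step S3
```

In this sketch a variable with no registered completion variable raises a `KeyError`; a real system would need an explicit fallback such as the mean or median.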

As can be seen from the above technical solution, the credit estimation method based on a data model provided by this embodiment uses a pre-established assessment model to process the user's data to be assessed. Even if there are failure variables whose information is missing or incomplete, the method can complete the failure variables using completion variables, which improves credit default prediction performance. Credit evaluation is completed with a small data set, which avoids the situation in which the assessment model cannot make an assessment because the amount of data is small, saves credit analysis cost, provides information support for credit decisions, and reduces potential default risk.

Therefore, the credit estimation method of this embodiment carries out credit evaluation with a small data set when data is missing or incomplete, improving credit default prediction performance.

To further improve the accuracy of the credit estimation method of this embodiment, in terms of building the assessment model, before the characteristic variables required by the assessment model are obtained from the data to be assessed, the method may also classify the sample data in the training set to obtain classification results. For example, variables are classified with respect to credit default, which is the dependent variable. For example, according to default status, the variable "age" is divided into groups, each of which then has a corresponding default rate; this improves the grouping of the variables used in the logistic regression.

According to the classification results, logistic regression is carried out on the sample data in the training set and the assessment model is established.

Logistic regression is mainly used to predict credit default. It does not require the data set to be normally distributed or to have equal variances. It also divides borrowers into two groups: one more likely to repay on time and the other likely to default on the loan. With a binary outcome, modeling analysts can easily apply the model and verify the usefulness of the relevant variables.
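The binary prediction described above can be illustrated with a small logistic regression fitted by batch gradient descent. This is only a sketch: the patent does not specify a fitting procedure, and the training data (number of job changes against a default flag), the learning rate and the number of epochs are made-up assumptions.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def default_probability(x, weights, bias):
    """Predicted probability that a borrower with features x defaults."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up training data: number of job changes -> 1 if the borrower defaulted.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(rows, labels)
```

A borrower is assigned to the "likely to default" group when `default_probability` exceeds a chosen cut-off such as 0.5, which gives the two-group split described above.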

Here, the credit estimation method of this embodiment builds the assessment model with logistic regression. Compared with a multilayer perceptron neural network model, logistic regression has better predictive performance and can accurately reveal the features of borrowers in the trustworthy group. The method is simple and easy to understand, and it can provide intuitive, verifiable explanations to the relevant regulatory bodies.

In practice, the assessment model is inspected using the sample data of the test set; that is, after the assessment model is established, a test result is obtained according to the assessment model and the sample data of the test set.

According to the test result, the assessment model is inspected to judge whether it suffers from overfitting. For example, the sample data is divided into a training set (70%) and a test set (30%). The first step of the two-step regression model applies the decision tree and clustering techniques to the training set, and the groupings obtained from the training set are then applied unchanged to the test set. After segmentation, the variables are fitted in the logistic regression. The ROC curves of the training set and the test set serve as the evaluation criterion for predictive performance and are used to check for overfitting.
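The 70/30 split and the ROC-based overfitting check can be sketched as below. The area under the ROC curve (AUC) is used here as a single-number summary of each ROC curve; the `tolerance` of 0.05 for the gap between training and test AUC is an illustrative assumption, not a value from the patent.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def auc(scores, labels):
    """Area under the ROC curve: the share of (defaulter, non-defaulter) pairs
    in which the defaulter receives the higher score; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def looks_overfitted(train_auc, test_auc, tolerance=0.05):
    """Flag the model when it performs much better on the training set."""
    return train_auc - test_auc > tolerance


train, test = split_train_test(range(10))
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

`auc` assumes that both classes are present; with only one class the denominator is zero.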

In addition, the credit estimation method of this embodiment may train the assessment model repeatedly by means of cross validation; that is, after the assessment model is inspected, the method may also randomly split the sample data of the training set and the test set according to a cross validation method.

The assessment model is trained using the split sample data.

For example, the sample data is randomly divided into n subsets of equal size. In each round, one subset is used as the validation set and the other subsets are used as the training set. A distinctive feature of cross validation is that every subset is used for validation exactly once. Repeating the process n times makes effective use of all the information in the sample data. Finally, the average of the test results can be regarded as the accuracy of the model. Splitting the sample data randomly by cross validation and training the assessment model several times addresses the overfitting problem of the assessment model.
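The n-fold procedure above can be sketched as follows. The function names and the `evaluate` callback, which is assumed to train on the training indices and return a score on the validation indices, are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (training indices, validation indices) for each of the n folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]  # near-equal sizes
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples, n_folds, evaluate, seed=0):
    """Average the score returned by evaluate(training, validation) over all folds."""
    scores = [evaluate(training, validation)
              for training, validation in k_fold_indices(n_samples, n_folds, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(10, 5))
```

Each sample index appears in exactly one validation fold, so every subset is used for validation exactly once.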

Specifically, in terms of sample data classification, when the sample data in the training set is classified to obtain classification results, the method of this embodiment is implemented as follows: if the sample data in the training set is a numerical variable, the numerical variable is classified using a decision tree to determine the classification result; if the sample data in the training set is a categorical variable, the categorical variable is classified using a clustering algorithm to determine the classification result.

In practice, the data is split into two parts according to the nature of the variables and analyzed separately. One part consists of numerical variables and the other of categorical variables. For numerical variables, CHAID decision tree classification is applied to divide each variable into different categories. Categorical variables are combined by Ward minimum variance clustering.

For numerical variables, descriptive statistics give an overview of some characteristics of the borrowers. For example, the average age of the borrowers is 28, so they may have had a stable wage since graduating, in most cases from university. The maximum application time is 23, and a borrower can receive a loan quickly, within one day of submitting personal information. The average number of months for which borrowers have paid social insurance is 35, slightly longer than their length of service at their current company, which suggests that borrowers may have changed jobs. In general, the fewer times a borrower has changed jobs, the less likely he or she is to default, because a steadier wage is available to repay the loan.

Because the tree is grown on the relationship between default and the categories, a significance level of 95% or 99% is used as the stopping criterion when selecting groups, after which the categories can form new categories. Categories with small samples are combined according to common sense; for example, in education background, "majoring in" and "scholar" are combined into the new category "bachelor's degree and above".
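The role of the 95% or 99% significance level can be illustrated with a simplified version of the merge test that CHAID applies to two groups of a variable: a chi-squared test of independence between group membership and default. This sketch is an assumption about how the criterion is applied, not the patent's exact procedure; the critical values 3.841 and 6.635 are the standard chi-squared quantiles for one degree of freedom.

```python
# Chi-squared critical values for one degree of freedom.
CRITICAL_DF1 = {0.95: 3.841, 0.99: 6.635}


def chi_square_2x2(a_default, a_repaid, b_default, b_repaid):
    """Pearson chi-squared statistic for two groups against default status."""
    table = [[a_default, a_repaid], [b_default, b_repaid]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic


def should_merge(a_default, a_repaid, b_default, b_repaid, confidence=0.95):
    """Merge two groups when their default rates do not differ significantly."""
    statistic = chi_square_2x2(a_default, a_repaid, b_default, b_repaid)
    return statistic < CRITICAL_DF1[confidence]


similar = should_merge(10, 90, 12, 88)    # default rates of 10% and 12%
different = should_merge(10, 90, 40, 60)  # default rates of 10% and 40%
```

The statistic is undefined when an expected count is zero, for example when no borrower in either group defaulted, so such cases need separate handling.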

Ward minimum variance hierarchical clustering is used to combine small categories of categorical variables. What distinguishes it from other clustering methods is that it clusters categories on the basis of analysis of variance rather than distance. Ward clustering minimizes the total sum of squared deviations within all clusters. As an agglomerative hierarchical method, it works bottom-up: each category starts as its own cluster and is then gradually merged with others. The total within-cluster variance increases with each merge, and the increase is the weighted squared distance between the cluster centres. Dividing these sums of squares by the total sum of squares gives the proportion of variance, so the sums of squares are also easy to interpret.
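One agglomeration step of Ward's method can be sketched as below. The cost of merging clusters A and B is the increase in the within-cluster sum of squares, which equals |A||B| / (|A| + |B|) times the squared distance between the two centroids. Representing each category by a single point holding its default rate is an illustrative assumption, not something the patent specifies.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)


def ward_merge_cost(a, b):
    """Increase in the within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def merge_closest(clusters):
    """Perform one bottom-up step: merge the pair with the smallest Ward cost."""
    _, i, j = min((ward_merge_cost(clusters[i], clusters[j]), i, j)
                  for i in range(len(clusters))
                  for j in range(i + 1, len(clusters)))
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Three categories, each represented by its (hypothetical) default rate.
clusters = [[[0.10]], [[0.12]], [[0.40]]]
after_one_step = merge_closest(clusters)
```

Repeating `merge_closest` until the desired number of groups remains gives the full hierarchy.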

A decision tree is a hierarchical supervised learning model that can handle different types of data, such as interval, nominal and ordinal data. Among decision tree algorithms, C4.5, classification and regression trees (CART) and chi-squared automatic interaction detection (CHAID) are the ones most widely applied in the credit scoring sector.

In most cases, the performance of logistic regression can be improved by segmentation, which divides the population into different homogeneous subgroups. For continuous variables, segmentation is also known as discretization into categorical variables. However, when the relationship between the predicted probability of default and borrower features differs widely between borrower segments, a set of segmented models may be better suited to analyzing the whole data set than a single credit scoring model. Therefore, a decision tree on each continuous variable is used as the segmentation model to optimize the classification of borrower features and to try to improve their fit to logistic regression.

Clustering is an unsupervised learning technique that groups data with similar features into a set of clusters. It can also assign a suitable target variable to associate samples with homogeneous features, which reduces the effect of misclassification between the training and validation data sets. On the other hand, by separating heterogeneous borrowers, clustering the data set can improve prediction efficiency. Therefore, clustering is applied to combine homogeneous data into groups suited to logistic regression, improving credit default prediction performance.

Based on the clusters, small-sample categories of a feature are combined into homogeneous groups according to minimum variance, which avoids the problem of a variable having too few samples in a category for the regression to compute.

Here, the credit estimation method of this embodiment can classify different types of variables. For numerical variables, the method classifies on the basis of a decision tree. Compared with artificial neural networks and k-nearest neighbors, a decision tree has strong predictive ability and can calculate Euclidean distances, which optimizes the classification of borrower features and helps improve their fit to logistic regression. For categorical variables, the method classifies on the basis of clustering: Ward's minimum variance method combines data with similar features into clusters suited to logistic regression, improving credit default prediction performance.

Specifically, the credit estimation method of this embodiment can merge associated variables; that is, before logistic regression is carried out on the sample data in the training set according to the classification results, the method may also calculate the distances of the sample data in the training set and determine the associated variables.

Logistic regression requires that no independent variable be correlated with any other independent variable. Correlation not only violates the assumptions of logistic regression but may also make unimportant variables appear significant and reduce predictive ability.

Here, the credit estimation method of this embodiment can merge variables that are correlated with each other. Specifically, whether two associated variables are merged is judged from the Euclidean distance between the variables, where the distance threshold may be a value calculated from the sample data or an empirical value. Merging associated variables reduces credit evaluation risk; otherwise, the correlated variables would reduce the accuracy of the evaluation result of the logistic regression.
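The distance test for associated variables can be sketched as follows. Standardizing each column before taking the Euclidean distance is an added assumption that makes variables on different scales comparable; the variable names, the sample values and the threshold of 0.5 are hypothetical.

```python
import math
from itertools import combinations


def standardize(values):
    """Rescale a column to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def column_distance(a, b):
    """Euclidean distance between two columns of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def associated_pairs(columns, threshold):
    """Return the pairs of variables whose standardized distance is below the threshold."""
    z = {name: standardize(values) for name, values in columns.items()}
    return [(p, q) for p, q in combinations(sorted(z), 2)
            if column_distance(z[p], z[q]) < threshold]


columns = {
    "monthly_pay": [3000.0, 4000.0, 5000.0, 6000.0],
    "annual_pay": [36000.0, 48000.0, 60000.0, 72000.0],
    "age": [40.0, 22.0, 35.0, 28.0],
}
pairs = associated_pairs(columns, threshold=0.5)
```

The patent does not say how two associated variables are combined once detected; keeping one of them or averaging the standardized columns are two possible choices.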

Specifically, in terms of completion variable processing, the credit estimation method of this embodiment can complete a failure characteristic variable with its mean or median, can determine the completion variable from the distance values between variables, and can also determine the completion variable from information value.

The process of completing a failure characteristic variable with the mean or median is as follows:

After a characteristic variable is judged to be a failure variable and before it is input into the assessment model, the method also includes: performing a statistical calculation on the failure variable to obtain its mean or median.

Here, the credit estimation method of this embodiment can compute statistics on a failure variable, determine its median or mean, and fill in the information missing from the failure variable, so that when the information of the variable is missing or incomplete, the failure variable is replaced and the credit evaluation can be completed.
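Mean or median completion can be sketched with the Python standard library. The function name `fill_missing` and the use of `None` to mark missing information are assumptions made for illustration.

```python
from statistics import mean, median


def fill_missing(values, strategy="median"):
    """Replace missing entries (None) with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a statistic from")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


wages = [3000, None, 5000, 10000]
by_median = fill_missing(wages, "median")
by_mean = fill_missing(wages, "mean")
```

The median is less sensitive to extreme values, which matters for skewed variables such as wages.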

The process of determining the completion variable from the Euclidean distance is as follows:

After the distances of the sample data in the training set are calculated, the method may also detect the distance values between a given variable and the other variables.

The variable with the smallest distance value to that variable is set as its completion variable.

When completion is carried out using the completion variable corresponding to the failure variable, the implementation is as follows:

The failure variable is completed using the information of its corresponding completion variable.

In practice, the Euclidean distance between different variables can be calculated using a decision tree. If, for variable A, the distance to variable B is the shortest, variable B is set as the completion variable of variable A. For example, if the "wage" information of a user is incomplete or missing, the user's "wage" is estimated from "length of service", and the "wage" information is then completed.

Here, the credit estimation method of this embodiment can use the distances between variables to judge the similarity between two variables and determine the completion variable of each variable, so that when the information of a variable is missing or incomplete, the failure variable is replaced using its completion variable and the credit evaluation can be completed.
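Choosing a completion variable by minimum distance and then estimating the missing value from it can be sketched as below. Two points are assumptions rather than the patent's method: distances are computed directly on standardized columns instead of through a decision tree, and the estimate uses an ordinary least squares line. The sample values are made up.

```python
import math


def zscores(values):
    """Rescale a column to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def nearest_variable(target, columns):
    """Return the variable whose standardized column is closest to the target's."""
    z_target = zscores(columns[target])
    others = [name for name in columns if name != target]
    return min(others, key=lambda name: math.dist(z_target, zscores(columns[name])))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


columns = {
    "wage": [3000.0, 4000.0, 5000.0, 6000.0],
    "years_of_service": [1.0, 2.0, 3.0, 4.0],
    "family_size": [4.0, 1.0, 3.0, 2.0],
}
completion = nearest_variable("wage", columns)
slope, intercept = fit_line(columns[completion], columns["wage"])
estimated_wage = slope * 5.0 + intercept  # user with 5 years of service, wage missing
```

`math.dist` requires Python 3.8 or later.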

The process of determining the completion variable from information value is as follows:

After the assessment model is established and before replacement using the completion variable corresponding to the failure variable, the method also includes:

Inputting a target variable into the assessment model.

Checking, according to the information value of the existing characteristic variables of the assessment model, whether each existing characteristic variable is valid.

If a failed characteristic variable exists, the target variable is set as the completion variable of the failed characteristic variable. For example, in the borrower data set, only one variable (arri_sz_time) has missing values. Because it is highly correlated with another variable (arri_sz_yrs), the variable with missing values (arri_sz_time) is dropped from the analysis and only "arrival_sz_yrs" is retained. Therefore, no missing-value treatment is needed for the borrower data set.

The failure variable is completed using the information of its corresponding completion variable. For example, the newly introduced variable is "job position", and "job position" is the completion variable of "wage". If the "wage" information of a user is incomplete or missing, the user's "wage" is estimated from "job position", and the "wage" information is then completed.

Here, the credit estimation method of this embodiment can also keep introducing new target variables and judge, from the information value between characteristic variables, whether a target variable is the completion variable of another characteristic variable, so that when a characteristic variable fails, it is replaced using its completion variable and the credit evaluation can be completed.

In addition, when each existing characteristic variable is checked for validity according to the information value of the existing characteristic variables of the assessment model, the implementation is as follows:

According to the distribution proportions of the sample data in the training set, the information value of each characteristic variable is calculated.

In practice, the weight of evidence is the logarithm of the ratio of the proportion of "good" borrowers to the proportion of "bad" borrowers for a feature, and is used to assess and compare the relative risk of different categories of a variable. The weight of evidence is calculated as follows:

WOE=ln(DistrGoods/DistrBads)

Wherein, WOE represents the weight of evidence of a characteristic variable, DistrGoods represents the distribution proportion of "good" borrowers in the sample data on this characteristic variable, and DistrBads represents the distribution proportion of "bad" borrowers in the sample data on this characteristic variable.

The higher the positive WOE value, the lower the credit default risk of the customer behaviour; the larger the negative WOE value, the higher the credit default risk of the customer behaviour. WOE converts variables into a regular, informative form, so that different types of variables can be handled by the same method. Converting variables into WOE also preserves degrees of freedom more effectively in small-sample problems. Therefore, WOE is used for the different variables in the smaller sample data set.
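The weight of evidence for one category of a variable can be sketched as follows, assuming the conventional definition WOE = ln(DistrGoods / DistrBads), which matches the description of WOE as the logarithm of the ratio of the "good" proportion to the "bad" proportion. The counts are made up.

```python
import math


def weight_of_evidence(good_in_bin, bad_in_bin, total_good, total_bad):
    """WOE of one category: log of the share of good borrowers over the share of bad ones."""
    distr_goods = good_in_bin / total_good
    distr_bads = bad_in_bin / total_bad
    return math.log(distr_goods / distr_bads)


# A category holding 80 of 100 good borrowers but only 20 of 100 bad borrowers.
low_risk = weight_of_evidence(80, 20, 100, 100)   # positive: lower default risk
high_risk = weight_of_evidence(20, 80, 100, 100)  # negative: higher default risk
neutral = weight_of_evidence(50, 50, 100, 100)    # zero: no separation
```

A category with no good or no bad borrowers makes the logarithm undefined, so sparse categories are usually merged first, which is consistent with the small-category merging described earlier.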

Information value can assess the predictive ability of characteristic variable, and specific formula for calculation is as follows：

IV=(DistrGoods-DistrBads) * WOE

Wherein, IV represents the information value of a characteristic variable, DistrGoods represents the distribution proportion of "good" borrowers in the sample data on this characteristic variable, DistrBads represents the distribution proportion of "bad" borrowers in the sample data on this characteristic variable, and WOE represents the weight of evidence of this characteristic variable.

If the information value IV of a characteristic variable is less than 0.02, the predictive ability of this characteristic variable is very poor. If IV is between 0.02 and 0.1, the characteristic variable is considered to have weak predictive ability. If IV is greater than 0.5, the characteristic variable is considered to over-predict. In general, the assessment model can use characteristic variables whose IV is greater than 0.02 and less than 0.5.
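The information value check can be sketched as below. Summing the per-category term (DistrGoods - DistrBads) * WOE over all categories of a variable is the conventional way to obtain a single IV per variable and is an assumption here, since the formula above is stated for one category; the bounds 0.02 and 0.5 are the ones given in the text, and the counts are made up.

```python
import math


def information_value(bins):
    """IV of a variable from a list of (good count, bad count) per category."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        distr_goods = good / total_good
        distr_bads = bad / total_bad
        woe = math.log(distr_goods / distr_bads)
        iv += (distr_goods - distr_bads) * woe
    return iv


def is_usable(iv, lower=0.02, upper=0.5):
    """Keep variables that are neither too weak nor over-predicting."""
    return lower < iv < upper


age_bins = [(40, 10), (35, 15), (25, 25)]  # (good, bad) counts per age group
age_iv = information_value(age_bins)
```

A variable whose good and bad borrowers are distributed identically across its categories has an IV of zero and is rejected by `is_usable`.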

In a second aspect, an embodiment of the present invention provides a credit evaluation device based on a data model. With reference to Fig. 2, the device includes a characteristic variable acquisition module 1, a failure variable completion module 2, and an evaluation module 3. The characteristic variable acquisition module 1 is used to obtain the characteristic variables required by the assessment model from the data to be assessed, the assessment model being a model that has been trained and verified. The failure variable completion module 2 is used to judge whether each characteristic variable of the data to be assessed is a failure variable: if so, the failure variable is completed using its corresponding completion variable and input into the assessment model; if not, the characteristic variable is input into the assessment model directly. A failure variable is a characteristic variable whose information is missing or incomplete. The evaluation module 3 is used to make the assessment model perform an assessment according to the input characteristic variables and output an evaluation result.

As can be seen from the above technical solution, the credit evaluation device based on a data model provided by this embodiment uses a pre-established assessment model to process the user data to be assessed. Even if there are failure variables whose information is missing or incomplete, the device can complete them using completion variables, which improves the credit default prediction effect. The device completes credit evaluation using a small data set, avoids the situation in which the assessment model cannot perform an assessment because the amount of data processed is small, saves credit analysis cost, provides information support for credit decisions, and reduces potential default risk.

Therefore, the credit evaluation device based on a data model of this embodiment can carry out credit evaluation using a small data set when data are missing or incomplete, improving the credit default prediction effect.
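The cooperation of the three modules can be sketched in Python as follows. This is only an illustration under stated assumptions: a failure variable is represented as None or NaN, the completion variable's value is copied in place of the failure variable, and the assessment model is any callable. The variable names (wage_income, consumption_level) and the toy model are hypothetical.

```python
import math

def is_failure(value):
    """Assumption: missing or incomplete information is encoded as None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_record(record, required, completion_map):
    """Collect the model inputs for one record, completing failure variables.

    completion_map maps a variable name to the name of its completion variable.
    """
    inputs = {}
    for name in required:
        value = record.get(name)
        if is_failure(value):
            substitute = completion_map.get(name)
            value = record.get(substitute) if substitute is not None else None
            if is_failure(value):
                raise ValueError(f"cannot complete {name!r}")
        inputs[name] = value
    return inputs

def evaluate(record, required, completion_map, model):
    """Acquire variables, complete failures, then let the model assess."""
    return model(complete_record(record, required, completion_map))

# Hypothetical record whose wage income is missing.
record = {"wage_income": None, "consumption_level": 0.7}
result = evaluate(
    record,
    required=["wage_income"],
    completion_map={"wage_income": "consumption_level"},
    model=lambda x: "low risk" if x["wage_income"] > 0.5 else "high risk",
)
```

Raising an error when no completion is possible is a design choice of this sketch; the text does not say what the device does in that case.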

To further improve the accuracy of the credit evaluation device based on a data model of this embodiment, specifically in terms of assessment model construction, the device further includes an assessment model establishing module. The assessment model establishing module is used to: classify the sample data in the training set to obtain classification results; perform logistic regression on the sample data in the training set according to the classification results to establish the assessment model; obtain test results according to the assessment model and the sample data of the test set; and verify the assessment model according to the test results, to judge whether the assessment model has an overfitting problem.
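The train, test, and verify sequence can be sketched with a small logistic regression written in plain Python. The sketch assumes batch gradient descent on the log loss and uses the gap between training and test accuracy as the overfitting indicator; neither choice is specified in the text, and the data are made up for illustration.

```python
import math

def sigmoid(z):
    # Split by sign so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """w[0] is the bias; w[1:] are the feature weights."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    hits = sum(int((predict_proba(w, xi) >= 0.5) == bool(yi))
               for xi, yi in zip(X, y))
    return hits / len(X)

# Hypothetical one-feature training and test sets (1 = default).
X_train = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test, y_test = [[-1.2], [1.2]], [0, 1]

w = fit_logistic(X_train, y_train)
gap = accuracy(w, X_train, y_train) - accuracy(w, X_test, y_test)
```

A training accuracy far above the test accuracy would suggest overfitting; how large a gap is acceptable is left open here.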

Here, the credit evaluation device based on a data model of this embodiment builds the assessment model using logistic regression. Relative to a multilayer perceptron neural network model, logistic regression has better predictive performance and can accurately reveal the characteristics of borrowers in the trustworthy group. The device is simple and easy to understand, which facilitates risk management.

Specifically, in terms of sample data classification, when the assessment model establishing module classifies the sample data in the training set to obtain classification results, it is specifically used to: if the sample data in the training set are numerical variables, classify the numerical variables using a decision tree and determine the classification results; if the sample data in the training set are categorical variables, classify the categorical variables using a clustering algorithm and determine the classification results.

Here, the credit evaluation device based on a data model of this embodiment can classify variables of different types. For numerical variables, the device classifies based on a decision tree. Relative to artificial neural networks and k-nearest neighbors, a decision tree has strong predictive ability and can calculate Euclidean distances to optimize the classification of borrowing features, which helps improve their suitability for logistic regression. For categorical variables, the device classifies based on clustering, using Ward's minimum variance method to combine data with similar features into clusters that suit logistic regression, improving the credit default prediction effect.
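For the numerical case, the idea of decision-tree-based classification can be illustrated with a depth-one tree that picks a single split threshold. The sketch assumes Gini impurity as the split criterion, which the text does not specify, and the income figures and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return the threshold minimising weighted Gini impurity, or None
    if no split improves on the unsplit data."""
    pairs = sorted(zip(values, labels))
    best_threshold = None
    best_score = gini([label for _, label in pairs])
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_threshold

def assign_bin(value, threshold):
    """Map a numerical value to one of the two resulting classes."""
    return "low" if value <= threshold else "high"

# Hypothetical monthly incomes with default labels (1 = "bad" borrower).
incomes = [2000, 2500, 3000, 6000, 7000, 8000]
defaults = [1, 1, 1, 0, 0, 0]
threshold = best_split(incomes, defaults)
```

The resulting classes can then be fed to the logistic regression in place of the raw numerical value; deeper trees would produce more than two classes.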

Specifically, for associated variables, the credit evaluation device based on a data model of this embodiment can merge them; that is, the assessment model establishing module is further used to: calculate the distances of the sample data in the training set and determine associated variables; and judge whether the distance value between any two associated variables is less than a distance threshold and, if so, merge the two associated variables.

Here, the credit evaluation device based on a data model of this embodiment can merge interrelated variables. Specifically, it judges whether to merge two associated variables according to the Euclidean distance between the variables, where the distance threshold may be a value calculated from the sample data or an empirical value. Merging associated variables reduces credit evaluation risk; otherwise, interrelated variables would reduce the accuracy of the logistic regression's evaluation results.
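The distance test for associated variables can be sketched as follows. The sketch assumes each variable is a column of values already scaled to a comparable range, and it only identifies the pairs to merge, since the text does not say how the merged variable is formed. The column names, values, and threshold are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long value columns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairs_to_merge(columns, threshold):
    """Return the name pairs whose columns lie closer than the threshold."""
    names = sorted(columns)
    merged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if euclidean(columns[first], columns[second]) < threshold:
                merged.append((first, second))
    return merged

# Hypothetical scaled columns over three sample records.
columns = {
    "monthly_income": [0.1, 0.5, 0.9],
    "annual_income": [0.1, 0.5, 1.0],
    "age": [0.9, 0.2, 0.4],
}
to_merge = pairs_to_merge(columns, threshold=0.2)
```

Here only the two income columns fall within the threshold, so they would be treated as associated variables and merged before the regression.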

In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no conflict arises, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

It should be noted that the flowcharts and block diagrams in the accompanying drawings show the architecture, functions, and operations of possible implementations of servers, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based server that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

The configuration device provided by the embodiments of the present invention may be a computer program product, including a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the methods described in the foregoing method embodiments. For the specific implementation, refer to the method embodiments, which are not repeated here.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the server, device, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

In the several embodiments provided herein, it should be understood that the disclosed server, device, and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; for another example, multiple units or components may be combined or integrated into another server, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they should all be covered by the scope of the claims and the specification of the present invention.

Claims

1. A credit estimation method based on a data model, characterized by comprising:

obtaining the characteristic variables required by an assessment model from data to be assessed, the assessment model being a model that has been trained and verified;

judging whether each characteristic variable of the data to be assessed is a failure variable:

if so, completing the failure variable using the completion variable corresponding to the failure variable, and inputting it into the assessment model,

if not, inputting the characteristic variable into the assessment model, the failure variable being a characteristic variable whose information is missing or incomplete;

performing, by the assessment model, an assessment according to the input characteristic variables, and outputting an evaluation result.
2. The credit estimation method based on a data model according to claim 1, characterized in that,

before obtaining the characteristic variables required by the assessment model from the data to be assessed, the method further comprises:

classifying the sample data in a training set to obtain classification results;

performing logistic regression on the sample data in the training set according to the classification results to establish the assessment model;

obtaining test results according to the assessment model and the sample data of a test set;

verifying the assessment model according to the test results.
3. The credit estimation method based on a data model according to claim 2, characterized in that,

after verifying the assessment model, the method further comprises:

splitting the sample data of the training set and the test set at random according to a cross-validation method;

training the assessment model using the split sample data.
4. The credit estimation method based on a data model according to claim 2, characterized in that,

before performing logistic regression on the sample data in the training set according to the classification results, the method further comprises:

calculating the distances of the sample data in the training set to determine associated variables;

judging whether the distance value between any two associated variables is less than a distance threshold and, if so, merging the two associated variables.
5. The credit estimation method based on a data model according to claim 4, characterized in that,

after calculating the distances of the sample data in the training set, the method further comprises:

detecting the distance values between a given variable and the other variables;

setting the variable with the smallest distance value to the given variable as the completion variable of the given variable;

completing using the completion variable corresponding to the failure variable specifically comprises:

completing the failure variable using the information of the completion variable corresponding to the failure variable.
6. The credit estimation method based on a data model according to claim 2, characterized in that,

after establishing the assessment model and before replacing using the completion variable corresponding to the failure variable, the method further comprises:

inputting a target variable into the assessment model;

checking whether each existing characteristic variable is valid according to the information value of the existing characteristic variables of the assessment model;

if a failed characteristic variable exists, setting the target variable as the completion variable of the failed characteristic variable;

completing using the completion variable corresponding to the failure variable specifically comprises:

completing the failure variable using the information of the completion variable corresponding to the failure variable.
7. The credit estimation method based on a data model according to claim 6, characterized in that,

checking whether each existing characteristic variable is valid according to the information value of the existing characteristic variables of the assessment model specifically comprises:

calculating the information value of each characteristic variable according to the distribution proportion of the sample data in the training set;

checking against a predetermined value threshold to judge whether each characteristic variable is valid.
8. The credit estimation method based on a data model according to claim 1, characterized in that,

after a characteristic variable is judged to be a failure variable and before it is input into the assessment model, the method further comprises: performing a statistical calculation on the failure variable to obtain the mean or median of the failure variable;

completing the failure variable whose information is missing using the mean or median of the failure variable.
9. A credit evaluation device based on a data model, characterized by comprising:

a characteristic variable acquisition module, used to obtain the characteristic variables required by an assessment model from data to be assessed, the assessment model being a model that has been trained and verified;

a failure variable completion module, used to judge whether each characteristic variable of the data to be assessed is a failure variable:

if so, completing the failure variable using the completion variable corresponding to the failure variable, and inputting it into the assessment model,

if not, inputting the characteristic variable into the assessment model, the failure variable being a characteristic variable whose information is missing or incomplete;

an evaluation module, used to make the assessment model perform an assessment according to the input characteristic variables and output an evaluation result.
10. The credit evaluation device based on a data model according to claim 9, characterized in that the device further comprises an assessment model establishing module, used to: classify the sample data in a training set to obtain classification results; perform logistic regression on the sample data in the training set according to the classification results to establish the assessment model; obtain test results according to the assessment model and the sample data of a test set; and verify the assessment model according to the test results.