Summary of the invention
The technical problem to be solved by the present invention is: in view of the shortcomings of the existing technology, the present invention provides a feature-selection-based credit rating method that is simple to implement, requires little data processing, grades efficiently with good performance, and conveniently yields rating rules that are easy for users to understand.
In order to solve the above technical problems, the technical solution proposed by the present invention is:
A credit rating method based on feature selection, the steps of which include:
S1. Characteristic attribute set extraction: obtain a user credit information set for model training, and extract the characteristic attribute corresponding to each item of information in the user credit information set to constitute a characteristic attribute set;
S2. Feature-selection-based model training: perform multiple rounds of RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classification on the characteristic attribute set, screening the characteristic attributes in the set according to the classification results after each RIPPER classification and re-running RIPPER classification on the screened characteristic attribute set, until the required RIPPER rating model is generated;
S3. Credit rating: input the credit information of the user to be assessed and extract the corresponding characteristic attributes, input the extracted characteristic attributes into the RIPPER rating model for classification, and obtain the credit rating result as output.
As a further improvement of the present invention: specifically, in step S2, after each RIPPER classification the characteristic attributes whose number of occurrences is less than a specified threshold are deleted, and RIPPER classification is re-run on the screened characteristic attribute set, until the precision or the number of features of the generated RIPPER rating model reaches the preset requirement, whereupon the final RIPPER rating model is obtained as output.
As a further improvement of the present invention, the specific steps of generating the RIPPER rating model in step S2 are:
S21. classify the current characteristic attribute set using a RIPPER classifier, count the weight of each characteristic attribute according to the number of times it occurs in the classification results, and sort the characteristic attributes by the counted weights to obtain a sorted characteristic attribute set;
S22. delete from the sorted characteristic attribute set the characteristic attributes whose number of occurrences is less than a preset threshold, obtaining an updated characteristic attribute set;
S23. perform RIPPER classification on the updated characteristic attribute set obtained in step S22, and judge whether the precision or the number of features of the currently available RIPPER rating model reaches the preset requirement; if so, output the final RIPPER rating model, otherwise return to step S21.
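As an illustration, the S21–S23 screening loop can be sketched in Python. The rule learner itself is abstracted behind a caller-supplied `run_ripper` callback (a placeholder, not part of the patent), since the loop only needs the induced rules and an accuracy figure:

```python
from collections import Counter

def screen_features(train, labels, run_ripper, min_count=2, target_k=None,
                    max_rounds=10):
    """Iterative feature screening around a RIPPER-style rule learner.

    `run_ripper(train, labels, features)` is a placeholder for any rule
    inducer; it must return (rules, accuracy), where each rule is a list
    of (feature, op, value) conditions.
    """
    features = sorted(train[0].keys())
    best_acc = -1.0
    for _ in range(max_rounds):
        rules, acc = run_ripper(train, labels, features)
        # S21: weight each attribute by how often it appears in the rules,
        # then rank attributes by weight
        weight = Counter(f for rule in rules for (f, _, _) in rule)
        ranked = sorted(features, key=lambda f: -weight[f])
        # S22: drop attributes occurring fewer than `min_count` times
        # (but always keep at least the top-ranked attribute)
        kept = [f for f in ranked if weight[f] >= min_count] or ranked[:1]
        # S23: stop once accuracy no longer improves, the target feature
        # count is reached, or screening removed nothing
        if acc <= best_acc or (target_k and len(kept) <= target_k) \
                or kept == features:
            return kept, max(best_acc, acc)
        best_acc = acc
        features = kept
    return features, best_acc
```

In use, `run_ripper` would wrap an actual RIPPER implementation; here any function with that shape works.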
As a further improvement of the present invention: in step S2, ten-fold cross validation is specifically used during training to avoid model over-fitting; that is, the training set is divided into 10 parts, 9 of which serve as training data and the remaining one as test data. After several iterations, the model whose classification precision on the different test sets reaches a specified threshold is chosen as the RIPPER rating model output by the current training.
As a further improvement of the present invention: step S2 further includes assessing the obtained RIPPER rating model using an ROC curve; if the area under the ROC curve calculated for the RIPPER rating model falls within a preset range, the final RIPPER rating model is output, otherwise training is re-started.
As a further improvement of the present invention, the specific steps of step S1 are:
S11. extract the characteristic attribute corresponding to each item of original credit information in the user credit information set to obtain a characteristic attribute set, perform data preprocessing on the characteristic attribute set, and output the result;
S12. unify the attributes of different categories in the characteristic attribute set and output the result;
S13. perform classification grading on the characteristic attribute set output by step S12 to constitute a training set, and output it.
As a further improvement of the present invention: the data preprocessing in step S11 specifically includes filling the missing values in the characteristic attribute set and deleting redundant and abnormal values from the characteristic attribute set; when filling the missing values, continuous missing values are specifically filled using one of the median, the mode, or Lagrange interpolation, while discrete missing values are filled from context.
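A minimal sketch of this filling scheme, using only the Python standard library; the column names are illustrative, and "context filling" for discrete values is approximated here by carrying the nearest previous observed value forward:

```python
from statistics import median, mode

def fill_missing(rows, continuous, discrete, strategy="median"):
    """Fill missing values (None) column by column.

    Continuous columns use the median or mode of the observed values;
    discrete columns are filled "from context", approximated here by
    carrying the nearest previous observed value forward.
    """
    filled = [dict(r) for r in rows]
    for col in continuous:
        seen = [r[col] for r in rows if r[col] is not None]
        fallback = median(seen) if strategy == "median" else mode(seen)
        for r in filled:
            if r[col] is None:
                r[col] = fallback
    for col in discrete:
        last = None
        for r in filled:
            if r[col] is None:
                r[col] = last        # context fill: previous row's value
            else:
                last = r[col]
    return filled
```

Lagrange interpolation, the other filling method named above, would replace the median/mode fallback for continuous columns.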
As a further improvement of the present invention: the user credit information includes one or more of user basic information, user borrowing information, user repayment information, overdue information within a specified historical period, information on repayments due within a specified future period, user bidding information, and user liability information.
As a further improvement of the present invention: in step S3, when the extracted characteristic attributes are input into the RIPPER rating model for classification, the RIPPER rating model specifically outputs an initial credit rating result, and the final rating result is obtained and output according to the initial credit rating result together with the classification rules used by the RIPPER rating model during classification.
As a further improvement of the present invention: when step S2 generates the RIPPER rating model, it is specifically trained based on the Adaboost (Adaptive Boosting) algorithm using multiple RIPPER classifiers as weak classifiers; when each RIPPER classifier is trained, a selected part of the training-set samples is combined with part of the error samples produced by the previous RIPPER classifier to constitute the final training samples, and after training is completed the weak classifiers are combined into an ADB strong classifier, which serves as the final RIPPER rating model.
Compared with the prior art, the advantages of the present invention are as follows:
1) The feature-selection-based credit rating method of the present invention makes full use of the scalability and rule-generating characteristics of RIPPER: characteristic attributes are extracted from user credit information, multiple rounds of RIPPER classification are run to build a RIPPER rating model, and the RIPPER rating model is then used to rate the credit of new users. Rating is efficient and rating performance is good; compared with the traditional scorecard approach, an accurate rating can be produced for each individual, and compared with traditional machine-learning rating methods, when a new user is rated with the RIPPER rating model the classification rules inside it can be conveniently obtained and are easy to understand, which helps the decision maker reach a final decision. At the same time, screening characteristic attributes according to the classification results after each RIPPER classification greatly reduces the workload of multi-dimensional feature training, effectively reducing the amount of rating data to be processed and improving rating efficiency.
2) In the feature-selection-based credit rating method of the present invention, deleting after each RIPPER classification the characteristic attributes whose number of occurrences is less than a specified threshold removes irrelevant and redundant features and reduces the number of features; because the number of features is reduced, duplicate instances can also be removed, so the "curse of dimensionality" and "combinatorial explosion" are effectively avoided. Moreover, because both the feature count and the instance count shrink, the model learning time is reduced, further improving rating efficiency.
3) In the feature-selection-based credit rating method of the present invention, combining the characteristic attributes of the data set with the RIPPER classifier to perform feature screening achieves feature selection from the characteristics of the classifier and the data set themselves, so that the training workload of the RIPPER rating model can be greatly reduced without affecting model performance.
Specific implementation mode
The invention will be further described below in conjunction with the accompanying drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Figure 1, the steps of the feature-selection-based credit rating method of this embodiment include:
S1. Characteristic attribute set extraction: obtain a user credit information set for model training, and extract the characteristic attribute corresponding to each item of information in the user credit information set to constitute a characteristic attribute set;
S2. Feature-selection-based model training: perform multiple rounds of RIPPER classification on the characteristic attribute set, screen the characteristic attributes in the set according to the classification results after each RIPPER classification, and re-run RIPPER classification on the screened characteristic attribute set, until the required RIPPER rating model is generated;
S3. Credit rating: input the credit information of the user to be assessed and extract the corresponding characteristic attributes, input the extracted characteristic attributes into the RIPPER rating model for classification, and obtain the credit rating result as output.
RIPPER (rule induction learning) is a rule-based classification algorithm. With a decision tree built as shown in Fig. 2, the rules can be read off one by one from the leaf nodes up to the root node, as shown in Fig. 3. If the rules shown in Fig. 3(a) are pruned for redundancy according to the ordering of the rules (the rule triggered under the "most demanding" conditions is given the highest priority), then once the first rule has been found not to match, the condition humidity = normal can be removed from the second rule; similarly, outlook = rainy can be removed from the fourth rule, and outlook = rainy and windy = true from the fifth, with the result shown in Fig. 3(b). Each RIPPER rule is composed of several rule antecedents, and RIPPER includes better pruning and stopping criteria as well as post-processing of the rule set. It uses an incremental reduced-error pruning algorithm: the examples of the training set are divided into two data sets, a growing set and a pruning set. The growing set is used to generate rules, adding conditions until the rule meets the requirements; the pruning set is used to refine rules, deleting conditions from a rule until a better rule is obtained. The value of the rule is then evaluated: the final condition is removed and the change in value is observed; if the value does not change, conditions continue to be removed, until the best version of the classifier is obtained.
RIPPER has high accuracy and good rule-creation performance. The efficiency of the RIPPER algorithm is nearly linear in the number of samples of the training data set, with a time complexity of O(n·log²n); more importantly, it maintains very high efficiency even on test sets containing hundreds of thousands of noisy records. At the same time, the decision rules of RIPPER classification are user-oriented: the classifier produces classification rules, and the generated rules are easy for users to understand, i.e., the RIPPER algorithm has the characteristics of scalability and rule generation. This embodiment makes full use of these characteristics of RIPPER: characteristic attributes are extracted from user credit information, multiple rounds of RIPPER classification are run to build a RIPPER rating model, and the RIPPER rating model is then used to rate the credit of new users. Rating is efficient and rating performance is good; compared with the traditional scorecard approach, an accurate rating can be produced for each individual, and compared with traditional machine-learning rating methods, when a new user is rated with the RIPPER rating model the classification rules inside it can be conveniently obtained and are easy to understand, which helps the decision maker reach a final decision. Screening characteristic attributes according to the classification results after each RIPPER classification performs feature selection using the characteristics of the attributes and of the RIPPER classifier itself, greatly reducing the workload of multi-dimensional feature training, effectively reducing the amount of rating data to be processed, and improving rating efficiency.
In this embodiment, the specific steps of step S1 are:
S11. extract the characteristic attribute corresponding to each item of original credit information in the user credit information set to obtain a characteristic attribute set, perform data preprocessing on the characteristic attribute set, and output the result;
S12. unify the attributes of different categories in the characteristic attribute set and output the result;
S13. perform classification grading on the characteristic attribute set output by step S12 to constitute a training set.
In this embodiment, after the data concerning users' credit information are extracted from the original user database, the characteristic attribute corresponding to each item of information, i.e. the characteristic value that characterizes each item of credit information, is first extracted to constitute the characteristic attribute set. After data preprocessing is performed on the characteristic attribute set, the attributes of different categories are unified, and the classification labels of the characteristic attribute set are then given grading marks, for example AA, A, B, C, D, E, F, to constitute a training set that meets the requirements of the RIPPER rating model. The training set is shuffled and partitioned at random, a RIPPER classifier is then used to perform several iterations of classification on the training set, the characteristic attributes are screened according to the classification results after each classification, and this is repeated until the required RIPPER rating model is obtained.
In this embodiment, the user credit information specifically includes user basic information, borrowing information, overdue information within a specified historical period, information on repayments due within a specified future period, user bidding information, user liability information, and the like. The basic information includes name, gender, educational background, etc.; the repayment information includes the number of successful repayments, the number of normal pay-offs, the number of pay-offs within a specified number of overdue days, the number of pay-offs more than a specified number of days overdue, etc.; the borrowing information includes the number of successful borrowings, the time of the first successful borrowing, the accumulated borrowed amount, the amount still to be repaid, the highest single borrowed amount, etc.; and the liability information includes the historical maximum liability, etc. In practice, the user credit information may cover whatever categories of information data characterize the user's credit, extracted according to actual demand.
In this embodiment, the data preprocessing of step S11 specifically includes filling the missing values in the characteristic attribute set and deleting redundant and abnormal values from the characteristic attribute set. When filling missing values, continuous missing values are specifically filled using the median, the mode, Lagrange interpolation, or similar methods, while discrete missing values are filled using methods such as context filling; of course, other filling methods may also be used according to actual demand.
In this embodiment, specifically, after each RIPPER classification in step S2 the characteristic attributes whose number of occurrences is less than a specified threshold are deleted, and RIPPER classification is re-run on the screened characteristic attribute set, until the precision or the number of features of the generated RIPPER rating model reaches the preset requirement, whereupon the final RIPPER rating model is obtained as output. By deleting, after each RIPPER classification, the characteristic attributes whose number of occurrences is less than the specified threshold, i.e. deleting the attributes that do not occur or occur only rarely, this embodiment removes irrelevant and redundant features and reduces the number of features, i.e. the feature count N becomes smaller. Because the number of features is reduced, some duplicate instances can also be removed, so the instance count P also decreases; the "curse of dimensionality" and "combinatorial explosion" are thereby effectively avoided, and because N and P both shrink, the model learning time is reduced, further improving rating efficiency.
In this embodiment, when the RIPPER rating model is generated, the final RIPPER rating model is output specifically when the precision (accuracy) of the generated RIPPER rating model no longer changes or when the number of characteristic attributes reaches a preset quantity, i.e. the precision of the model or the number of characteristic attributes serves as the criterion for the completion of model training.
In this embodiment, the specific steps of generating the RIPPER rating model in step S2 are:
S21. classify the current characteristic attribute set using a RIPPER classifier, count the weight of each characteristic attribute according to the number of times it occurs in the classification results, and sort the characteristic attributes by the counted weights to obtain a sorted characteristic attribute set;
S22. delete from the sorted characteristic attribute set the characteristic attributes whose number of occurrences is less than a preset threshold, obtaining an updated characteristic attribute set;
S23. perform RIPPER classification on the updated characteristic attribute set obtained in step S22, and judge whether the precision or the number of features of the currently available RIPPER rating model reaches the preset requirement; if so, output the final RIPPER rating model, otherwise return to step S21.
When this embodiment trains the RIPPER rating model based on feature extraction, all the characteristic attribute set data are first trained by RIPPER classification; for example, when a user's credit information is judged to satisfy all the specified conditions, the rating result is grade AA, and so on. Then, after each RIPPER classification, the weight of each characteristic attribute in the RIPPER classification rules, i.e. the number of times the attribute occurs, is counted using the Python programming language; the characteristic attributes that do not occur, or occur only rarely, in the generated rules are deleted, a new characteristic attribute set is obtained, and RIPPER classification is re-run. Whether the accuracy of this classification is higher than the previous accuracy is then judged: if so, the current attributes are retained; otherwise the attributes are reset and the rarely occurring attributes are again picked out as deletion candidates. The above steps are repeated until the accuracy can no longer be improved or the required number of features is reached, at which point the training of the model is complete and the final RIPPER rating model is output. This ensures the performance of the final RIPPER rating model while greatly reducing the training workload and complexity.
In a concrete application embodiment, when feature screening is carried out, the original characteristic attribute set D, the number K of attributes to be retained, and the screened characteristic attribute set S are first defined; a RIPPER classifier is built over the attributes Si, yielding the classification-rule result Ci; after classification is completed, the weight of each attribute is counted to generate a dictionary Di, and if the number of occurrences of a target attribute is less than the given threshold, that attribute is deleted, until K attributes remain, yielding the screened characteristic attribute set S.
By combining the characteristic attributes of the data set with the RIPPER classifier to perform feature screening, this embodiment achieves feature selection from the characteristics of the classifier and the data set themselves, so that the training workload of the RIPPER rating model can be greatly reduced without affecting model performance.
In this embodiment, ten-fold cross validation is specifically used during training in step S2 to avoid model over-fitting; that is, the training set is divided into 10 parts, 9 of which serve as training data and the remaining one as test data, and after several iterations the model whose classification precision on the different test sets reaches a specified threshold is chosen as the required RIPPER rating model. For example, in a particular embodiment, the training-set data are divided into parts a1, a2, a3, a4, a5, a6, a7, a8, a9, a10; a1 through a9 serve as training data and a10 as the test set, or other combinations are used, and after several iterations the model that performs well on all the different test sets is chosen as the final model. With ten-fold cross validation, each test set is a part of the original data that is not part of the training data and contains many uncertainties; compared with training directly on the whole training set and then using a part of the already-trained data as the test set, model over-fitting can be avoided.
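The ten-fold split described above can be sketched as follows (a simplified illustration; a real run would shuffle the training set first, as this embodiment does during training-set construction):

```python
def ten_fold_splits(data):
    """Yield (train, test) pairs for ten-fold cross validation.

    The data are divided into 10 parts a1..a10; each part serves once as
    the test set while the other nine parts form the training data.
    """
    n = len(data)
    folds = [data[i * n // 10:(i + 1) * n // 10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        # all folds except fold i, flattened, form the training data
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test
```

A model would be trained on each `train` part and scored on the corresponding `test` part, and the best-scoring model kept.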
In this embodiment, step S2 further includes assessing the obtained RIPPER rating model using an ROC curve: if the area under the ROC curve calculated for the RIPPER rating model falls within a preset range, the final RIPPER rating model is output, otherwise training is re-started. The ROC curve effectively reflects the performance of the model: the larger the area under the ROC curve, the better the corresponding model performance. After the initial training produces a RIPPER rating model, this embodiment computes the model's ROC curve and then uses it to assess the model. The ROC curve calculated in a concrete application embodiment is shown in Figure 4; the area under the ROC curve, AUC, is 0.9403, which meets the model performance requirement. Performing model evaluation by means of the ROC curve in this way is simple and effective to implement and ensures the performance of the RIPPER rating model.
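For illustration, the area under the ROC curve can be computed directly from model scores via the rank-sum (Mann-Whitney) identity, without plotting the curve; this is a generic AUC computation, not code from the patent:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney identity.

    `labels` are 1 for a positive example and 0 otherwise; `scores` are
    the model's scores. AUC equals the probability that a random positive
    outscores a random negative; ties contribute 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The model would be accepted when this value falls within the preset range (e.g. at least some threshold such as the 0.9403 reported above).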
When classifying with the RIPPER classifier in step S2 of this embodiment, the data items in the training set (containing the user credit characteristic attributes) that do not yet belong to any rule are specifically divided at random into two subsets, a growing set and a pruning set. When the rule-growing process is executed on the growing set, the conditions of the rule are first emptied, and conditions of the form of formula (1) below are then added repeatedly, so that the information gain Gain(D, At) reaches a larger value and the coverage of the rule over the data items improves, until the rule covers all data items in the growing data set, where At is a node of the tree:

    Ad = v, An ≤ θ, or An ≥ θ    (1)

where Ad is a character-type attribute, v is a valid value of Ad, An is a real-valued variable, and θ is a value of An that occurs in the training set.
When the rule-pruning process is executed on the pruning set, the last condition is repeatedly removed from the conditions of the rule so that the value of the function v reaches a maximum, where the expression of the function v is:

    v = (p − n) / (p + n)    (2)

where p is the number of positive samples in the pruning set covered by the rule, and n is the number of negative samples in the pruning set covered by the rule. The process of formula (2) is repeated, pruning conditions and deleting rules, until the value of v can no longer be increased, whereupon the RIPPER rating model and its classification rules are generated.
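A minimal sketch of the pruning step driven by the v metric of formula (2). For simplicity the rule conditions here are equality tests only (formula (1) also allows threshold conditions), and the rule/label representation is an assumption for illustration:

```python
def v_metric(rule, prune_set):
    """v = (p - n) / (p + n) over the pruning set, as in formula (2)."""
    covered = [lab for row, lab in prune_set
               if all(row.get(f) == val for f, val in rule)]
    p = sum(1 for lab in covered if lab)   # positives covered by the rule
    n = len(covered) - p                   # negatives covered by the rule
    return (p - n) / (p + n) if covered else -1.0

def prune_rule(rule, prune_set):
    """Repeatedly drop the final condition while v does not decrease."""
    best, best_v = rule, v_metric(rule, prune_set)
    while len(best) > 1:
        shorter = best[:-1]
        sv = v_metric(shorter, prune_set)
        if sv >= best_v:      # removing the last condition helps (or ties)
            best, best_v = shorter, sv
        else:
            break
    return best
```

Here dropping a condition is kept whenever it does not lower v, mirroring the "remove the final condition and see whether the value changes" behaviour described above.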
In this embodiment, when step S2 generates the RIPPER rating model, it is specifically trained based on the Adaboost algorithm using multiple RIPPER classifiers as weak classifiers; when each RIPPER classifier is trained, a selected part of the training-set samples is combined with part of the error samples produced by the previous RIPPER classifier to constitute the final training samples, and after training is completed the weak classifiers are combined into an ADB strong classifier, which serves as the final RIPPER rating model. The Adaboost algorithm has a very strong cyclic learning ability and can combine weak classifiers into a stronger one. By combining the Adaboost algorithm framework with RIPPER classifier training, this embodiment realizes the Ripper-ADB combined classification method, so that the performance advantages of both the Adaboost algorithm and the RIPPER classifier are obtained and classification and rating performance is further improved. During training, the combination of a selected part of the training-subset samples with the partial error samples produced by the previous weak classifier constitutes the final training samples, realizing a training method in which samples cyclically interpenetrate: because an equal-sized part of the samples is selected for each round, the number of augmented error samples is a fixed value and does not grow multiplicatively; and because each equal part of the data is trained with overlap after the total data are divided into equal parts, no sampled data are omitted, ensuring that training is complete. Meanwhile, each error-sample augmentation not only gives the erroneous data the effect of accumulated training but, thanks to the addition of new samples, also avoids over-training on repeatedly erroneous data. In this embodiment this is realized using the NSL-KDD data set (a modified version of the KDD CUP 1999 data mining competition data set).
As shown in Figure 5, the detailed process of training the RIPPER rating model based on Ripper-ADB combined classification in a concrete application embodiment is:
1. First divide the training-set samples into equal parts according to the number of iterations, obtaining N training-subset samples S1, S2, ..., Sn;
2. Perform classification training on the first training sample S1 using the Ripper algorithm, obtaining classifier a1 and error samples R1;
3. Perform statistical calculation on the classification results of a1, obtaining the weight w1 of classifier a1;
4. Resample the samples R1 misclassified by a1, expanding them to a fixed magnitude (50%) relative to an equal-sized part, obtaining the augmented error samples R1p;
5. Add the augmented error samples R1p to the second training sample S2, obtaining the new sample S2R;
6. Perform Ripper classification training again on the new sample S2R, generating classifier a2 and error samples R2;
7. Perform statistical calculation on the classification results of classifier a2, obtaining the weight w2 of classifier a2;
8. Repeat the above steps until all samples have been trained;
9. Superpose all the trained weighted classifiers to constitute the final strong classifier Ripper-ADB, obtaining the final RIPPER rating model.
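Steps 1-9 above can be sketched as the following loop. One round of RIPPER training is abstracted behind a hypothetical `train_weak` callback (the patent does not fix its interface), and the 50% augmentation magnitude is taken relative to the size of one equal part:

```python
import random

def train_ripper_adb(samples, train_weak, n_parts=5, expand_frac=0.5,
                     seed=0):
    """Sketch of the Ripper-ADB loop of steps 1-9 above.

    `train_weak(batch)` stands in for one round of RIPPER training and
    must return (classifier, error_samples, weight).
    """
    rng = random.Random(seed)
    n = len(samples)
    # step 1: divide the training set into equal parts
    parts = [samples[i * n // n_parts:(i + 1) * n // n_parts]
             for i in range(n_parts)]
    ensemble, errors = [], []
    for part in parts:
        # steps 4-5: augment this part with a fixed-size resample of the
        # previous round's error samples (a definite value, so the
        # augmented set cannot grow multiplicatively)
        k = min(len(errors), int(expand_frac * len(part)))
        batch = part + rng.sample(errors, k)
        clf, errors, weight = train_weak(batch)   # steps 2-3 / 6-7
        ensemble.append((clf, weight))            # step 9: weighted vote
    return ensemble
```

The returned (classifier, weight) pairs would be superposed as a weighted vote to form the Ripper-ADB strong classifier.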
In step S3 of this embodiment, when the extracted attribute values are input into the RIPPER rating model for classification, the RIPPER rating model specifically outputs an initial credit rating result, and the final rating result is obtained and output according to the initial credit rating result together with the classification rules used by the RIPPER rating model during classification. Because the classification rules of the RIPPER classification process are easy to understand, this embodiment, after obtaining the initial rating result with the RIPPER rating model, generates the final rating result in combination with the RIPPER classification rules, so that a more reasonable rating can be achieved.
In a concrete application embodiment, the detailed steps of realizing credit rating with the above method of this embodiment are:
Step 1: Extract all data concerning user information from the specified database, to be preprocessed.
Step 2: Associate the data between different tables by the unique key User ID: first read the data tables to be integrated into memory, and after building an array of the data tables, loop over the array and perform association and merge operations according to User ID.
Step 3: Fill the numerical missing values in the data processed in step 2: continuous missing values are handled using the median, the mode, or Lagrange interpolation, while discrete missing values are handled using methods such as context filling.
Step 4: Unify the units of attributes of different categories. For example, the values of the loan-term attribute are not uniform, including both "N months" and "N days", and need to be converted into a unified format (taking the month as the unit): traverse each value in the term attribute; if the number is followed by "months", remove the word after the number; if it is given in days, convert it into a value in months; then save and output.
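Step 4 can be sketched as follows; the English unit suffixes and the 30-days-per-month conversion ratio are illustrative assumptions, not specified by the patent:

```python
import re

def normalize_term(values):
    """Convert mixed "N months"/"N days" loan-term strings to months.

    Assumes each input matches one of the two formats; days are
    converted at an assumed ratio of 30 days per month.
    """
    out = []
    for v in values:
        m = re.fullmatch(r"(\d+)\s*(months?|days?)", v.strip())
        num, unit = int(m.group(1)), m.group(2)
        out.append(num if unit.startswith("month") else num / 30)
    return out
```

After this pass every term value is a number in the single unit (months), as step 4 requires.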
Step 5: Mark the data set with the grading labels AA, A, B, C, D, E, F, obtaining the characteristic attribute set.
Step 6: Perform RIPPER classification on the characteristic attribute set using the RIPPER classifier, obtaining the classification results;
Step 7: Count the weight of each characteristic attribute according to the classification results (rules), sort the attributes by weight, delete the characteristic attributes with smaller weights, and re-run RIPPER classification;
Step 8: Repeat steps 6 and 7, judging whether the classification accuracy still changes or whether the required number of characteristic attributes has been reached, until the final RIPPER rating model is obtained;
Step 9: Assess the RIPPER rating model obtained in step 8 using the ROC curve; the RIPPER rating model includes the code model and the RIPPER rules;
Step 10: Input the information of the new user to be assessed into the RIPPER rating model obtained in step 9 and output the rating result; the decision maker gives the final rating decision according to the rating result, completing the credit rating of the user.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above with preferred embodiments, they are not intended to limit it. Therefore, any simple modification, equivalent change, or variation made to the above embodiments according to the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.