CN106909981A - Model training, sample balance method and device and personal credit points-scoring system - Google Patents

Model training, sample balance method and device and personal credit points-scoring system

Info

Publication number
CN106909981A
Authority
CN
China
Prior art keywords
sample
positive
positive sample
synthesis
uneven
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510981091.6A
Other languages
Chinese (zh)
Other versions
CN106909981B (en
Inventor
席炎
王晓光
赵科科
张柯
毛旭东
杨旭
蔡宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201510981091.6A priority Critical patent/CN106909981B/en
Publication of CN106909981A publication Critical patent/CN106909981A/en
Application granted granted Critical
Publication of CN106909981B publication Critical patent/CN106909981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The application discloses a model training method and device for an imbalanced sample set, in which the imbalanced sample set is balanced before the model is trained, thereby improving the performance of the model. The method includes: obtaining an imbalanced sample set that contains negative samples and positive samples, where the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1; creating synthetic positive samples from the positive samples in the imbalanced sample set and the samples adjacent to those positive samples, where the adjacent samples include negative samples and/or positive samples; when the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples falls within a balance threshold interval, relabeling the synthetic positive samples as positive samples to generate a balanced sample set; and training the model on the balanced sample set. The application also discloses a sample balancing method and device for an imbalanced sample set, and a personal credit scoring system.

Description

Model training and sample balancing method and device, and personal credit scoring system
Technical field
The application relates to the field of Internet technology, and in particular to a model training method and device for an imbalanced sample set, a sample balancing method and device for an imbalanced sample set, and a personal credit scoring system.
Background
With the arrival of the big data era, historical data and the outcomes associated with it can be analyzed to predict events that may occur in the future. For example, a sample set containing at least positive and negative samples is generated from historical data and the corresponding outcomes, and a specific model is trained on that sample set. When the model receives current data, it can predict the outcome corresponding to that data. As a concrete example, a sample set containing positive samples (cancer patients) and negative samples (healthy people) is generated from the historical data of cancer patients and healthy people (including medical history, diet, daily routine and so on), and a cancer prediction model is trained on it. When the cancer prediction model receives the historical data of a suspected cancer patient, it can predict the likelihood of cancer, so that treatment can begin early.
Training a model on a balanced sample set generally yields good performance. A balanced sample set is one in which the classes contain roughly the same number of samples; for example, the male-to-female ratio in a sample of newborns is usually close to 1:1. However, with the development of information technology, predicting low-probability events has become a focus in many industries, such as predicting the probability of cancer, the probability that a user's credit card payment becomes overdue, or the probability of a sudden change in financial markets. The sample sets for these low-probability events share one trait: they are severely imbalanced. Cancer patients are a minority, people with overdue credit card payments are a minority, and sudden changes in financial markets are rare. A model trained on an imbalanced sample set tends to be biased, which degrades its performance.
To balance samples, the prior art generally uses oversampling, that is, randomly duplicating minority-class samples until the number of minority-class samples matches the number of majority-class samples. Random duplication, however, produces at least two identical samples, whereas in practice two completely identical samples rarely occur. Simple copying therefore lowers the authenticity of the sample set, and training a model on samples of low authenticity inevitably degrades the performance of the model.
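The drawback of random duplication described above can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function name, the tuple representation of samples and the fixed random seed are hypothetical and do not come from the application.

```python
import random

def random_duplicate_oversample(positives, target_count, seed=0):
    """Prior-art style oversampling: copy existing minority samples at random."""
    rng = random.Random(seed)
    result = list(positives)
    while len(result) < target_count:
        # Each appended sample is an exact copy of an existing one.
        result.append(rng.choice(positives))
    return result

# Two minority-class samples, each a tuple of feature values (hypothetical data).
positives = [(1.0, 2.0), (3.0, 1.5)]
oversampled = random_duplicate_oversample(positives, target_count=6)

# The set is larger, but it contains no sample that was not already present.
distinct = set(oversampled)
print(len(oversampled), len(distinct))
```

However many copies are added, the number of distinct samples never exceeds the original count, which is the loss of authenticity the background section points out.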
Summary of the invention
The embodiments of the application provide a model training method for an imbalanced sample set, in which the imbalanced sample set is balanced before the model is trained, thereby improving the performance of the model.
The embodiments of the application provide a model training device for an imbalanced sample set, in which the imbalanced sample set is balanced before the model is trained, thereby improving the performance of the model.
The embodiments of the application provide a sample balancing method for an imbalanced sample set, which improves the authenticity of the resulting sample set when the imbalanced sample set is oversampled.
The embodiments of the application provide a sample balancing device for an imbalanced sample set, which improves the authenticity of the resulting sample set when the imbalanced sample set is oversampled.
The embodiments of the application provide a personal credit scoring system for improving the authenticity of personal credit scores.
The embodiments of the application adopt the following technical solutions:
A model training method for an imbalanced sample set, including:
obtaining an imbalanced sample set that contains negative samples and positive samples, where the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1;
creating synthetic positive samples from the positive samples in the imbalanced sample set and the samples adjacent to the positive samples, where the samples adjacent to the positive samples include negative samples and/or positive samples;
when the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within a balance threshold interval, relabeling the synthetic positive samples as positive samples to generate a balanced sample set;
training the model on the balanced sample set.
Preferably, creating synthetic positive samples from the positive samples in the imbalanced sample set and the samples adjacent to the positive samples includes: choosing one positive sample from the imbalanced sample set; taking this positive sample as the reference, choosing from the sample space a sample set adjacent to it, the sample set containing negative samples and/or positive samples; and, according to the features and feature values that this positive sample and the samples in the sample set have in the sample space, creating synthetic positive samples between this positive sample and the samples in the sample set.
Preferably, relabeling the synthetic positive samples as positive samples and generating a balanced sample set when the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval includes: judging whether that ratio is within the balance threshold interval; if so, relabeling the synthetic positive samples as positive samples to generate the balanced sample set.
Preferably, the method further includes: if not, and the ratio is less than the minimum of the balance threshold interval, choosing a positive sample from the imbalanced sample set again and repeating the step of choosing from the sample space, with this positive sample as the reference, a sample set adjacent to it.
Preferably, choosing from the sample space, with the positive sample as the reference, a sample set adjacent to it includes: determining a neighbor distance threshold according to the sample quantity ratio and the distances in the sample space between this positive sample and at least one other positive sample; and, according to the neighbor distance threshold and the sample quantity ratio, choosing from the sample space, with this positive sample as the reference, the sample set adjacent to it.
Preferably, creating synthetic positive samples between the positive sample and the samples in the sample set according to their feature values in the sample space includes: according to the feature values that this positive sample and the samples in the sample set have in the sample space, creating a synthetic positive sample at the midpoint between this positive sample and a sample in the sample set.
Preferably, the method is applied to an imbalanced original personal credit sample set, in which the positive samples are overdue samples and the negative samples are non-overdue samples.
A model training device for an imbalanced sample set, including: a sample set acquisition unit, a sample creation unit, a sample set generation unit and a model training unit, wherein:
the sample set acquisition unit is configured to obtain an imbalanced sample set that contains negative samples and positive samples, where the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1;
the sample creation unit is configured to create synthetic positive samples from the positive samples in the imbalanced sample set and the samples adjacent to the positive samples, where the samples adjacent to the positive samples include negative samples and/or positive samples;
the sample set generation unit is configured to, when the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within a balance threshold interval, relabel the synthetic positive samples as positive samples to generate a balanced sample set;
the model training unit is configured to train the model on the balanced sample set.
Preferably, the sample creation unit includes: a positive sample selection unit, a sample set selection unit and a synthetic positive sample creation unit, wherein:
the positive sample selection unit is configured to choose one positive sample from the imbalanced sample set;
the sample set selection unit is configured to choose from the sample space, with this positive sample as the reference, a sample set adjacent to it, the sample set containing negative samples and/or positive samples;
the synthetic positive sample creation unit is configured to create synthetic positive samples between this positive sample and the samples in the sample set, according to the features and feature values that they have in the sample space.
Preferably, the sample set generation unit includes: a judgment unit, a balanced sample set generation unit and a jump unit, wherein:
the judgment unit is configured to judge whether the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval;
the balanced sample set generation unit is configured to, when the judgment result is yes, relabel the synthetic positive samples as positive samples to generate the balanced sample set;
the jump unit is configured to, when the judgment result is no and the ratio is less than the minimum of the balance threshold interval, jump back to the sample creation unit for execution.
Preferably, the sample set selection unit is specifically configured to: determine a neighbor distance threshold according to the sample quantity ratio and the distances in the sample space between the positive sample and at least one other positive sample; and, according to the neighbor distance threshold and the sample quantity ratio, choose from the sample space, with the positive sample as the reference, the sample set adjacent to it.
Preferably, the synthetic positive sample creation unit is specifically configured to: according to the feature values that the positive sample and the samples in the sample set have in the sample space, create a synthetic positive sample at the midpoint between the positive sample and a sample in the sample set.
A sample balancing method for an imbalanced sample set, characterized in that the imbalanced sample set contains negative samples and positive samples, the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1, the method including:
choosing one positive sample from the imbalanced sample set;
taking this positive sample as the reference, choosing from the sample space a sample set adjacent to it, the sample set containing negative samples and/or positive samples;
according to the features and feature values that this positive sample and the samples in the sample set have in the sample space, creating synthetic positive samples between this positive sample and the samples in the sample set;
judging whether the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within a balance threshold interval;
if so, relabeling the synthetic positive samples as positive samples to generate a balanced sample set.
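The steps of the sample balancing method can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: samples are tuples of numeric feature values, distance is Euclidean, the neighbor radius is passed in as a fixed value, the synthetic sample is placed at the midpoint, and only one synthetic sample is created per round so that the ratio check runs after every addition. The function name and all parameter names are hypothetical.

```python
import math
import random

def balance_samples(positives, negatives, radius, interval=(0.9, 1.1),
                    seed=0, max_rounds=10_000):
    """Add midpoint synthetic positives until the ratio enters the interval."""
    if not positives or not negatives:
        raise ValueError("both classes must be non-empty")
    rng = random.Random(seed)
    low, high = interval
    pool = positives + negatives          # neighbors may be of either class
    synthetic = []
    for _ in range(max_rounds):
        # Ratio of (synthetic + positive) count to negative count.
        ratio = (len(positives) + len(synthetic)) / len(negatives)
        if low < ratio < high:
            # Relabel synthetic samples as positives: balanced set reached.
            return positives + synthetic, negatives
        if ratio >= high:
            break
        anchor = rng.choice(positives)    # choose one positive sample
        neighbors = [s for s in pool
                     if s is not anchor and math.dist(anchor, s) < radius]
        if not neighbors:
            continue                      # nothing nearby, pick another anchor
        neighbor = rng.choice(neighbors)
        # Synthetic positive at the midpoint between anchor and neighbor.
        synthetic.append(tuple((a + b) / 2 for a, b in zip(anchor, neighbor)))
    raise RuntimeError("could not reach the balance threshold interval")

# Hypothetical data: 5 positives and 15 negatives in a 2-D sample plane.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
negatives = [(float(i), 5.0) for i in range(15)]
new_pos, new_neg = balance_samples(positives, negatives, radius=20.0)
print(len(new_pos), len(new_neg))
```

Because the loop stops as soon as the ratio enters the interval, the result here has 14 positive samples against 15 negatives (14/15 is the first ratio above 0.9).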
A sample balancing device for an imbalanced sample set, including: a positive sample selection unit, a sample set selection unit, a synthetic positive sample creation unit, a judgment unit and a balanced sample set generation unit, wherein:
the positive sample selection unit is configured to choose one positive sample from the imbalanced sample set;
the sample set selection unit is configured to choose from the sample space, with this positive sample as the reference, a sample set adjacent to it, the sample set containing negative samples and/or positive samples;
the synthetic positive sample creation unit is configured to create synthetic positive samples between this positive sample and the samples in the sample set, according to the features and feature values that they have in the sample space;
the judgment unit is configured to judge whether the ratio of the combined number of synthetic positive samples and positive samples to the number of negative samples is within a balance threshold interval;
the balanced sample set generation unit is configured to, when the judgment result is yes, relabel the synthetic positive samples as positive samples to generate a balanced sample set.
Preferably, the device further includes a jump unit, specifically configured to: when the judgment result is no and the ratio is less than the minimum of the balance threshold interval, jump back to the positive sample selection unit for execution.
A personal credit scoring system, including: an original personal credit creation system, a sample balancing system, a credit model training system and a personal credit scoring system, wherein:
the original personal credit creation system is configured to create an original personal credit sample set according to the features and feature values corresponding to users;
the sample balancing system is configured to balance the original personal credit sample set;
the credit model training system is configured to train a credit model on the balanced personal credit sample set;
the personal credit scoring system is configured to use the credit model to predict whether a user will become overdue, according to the features and feature values corresponding to the user, and to produce a personal credit score from the prediction.
At least one of the above technical solutions adopted by the embodiments of the application can achieve the following beneficial effects. In an imbalanced sample set the minority-class samples (positive samples) are few, but samples that lie not far from a positive sample often have features that are the same as or similar to those of the positive sample. A sample set adjacent to the positive sample is therefore chosen with the positive sample as the reference, samples are chosen from that sample set, and synthetic positive samples are created from them and the positive sample according to their features and feature values, so that the synthetic positive samples are also similar to the positive sample in their feature values. Compared with the prior-art oversampling method that simply copies some positive samples of the imbalanced sample set, this improves the authenticity of the balanced sample set. When a model is trained on the more authentic balanced sample set generated by the application, the performance of the model also improves.
Brief description of the drawings
The drawings described here are provided for a further understanding of the application and form a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flow chart of a sample balancing method for an imbalanced sample set provided by Embodiment 1 of the application;
Fig. 2 is a schematic diagram of choosing one positive sample, provided by Embodiment 1 of the application;
Fig. 3 is a schematic diagram of choosing an adjacent sample set according to a neighbor distance threshold, provided by Embodiment 1 of the application;
Fig. 4 is a schematic diagram of determining the neighbor distance threshold, provided by Embodiment 1 of the application;
Fig. 5 is a schematic diagram of determining the neighbor distance threshold, provided by Embodiment 1 of the application;
Fig. 6 is a schematic diagram of choosing an adjacent sample set according to the determined neighbor distance threshold, provided by Embodiment 1 of the application;
Fig. 7 is a schematic diagram of creating synthetic positive samples, provided by Embodiment 1 of the application;
Fig. 8 is a schematic diagram of creating synthetic positive samples and reaching sample balance, provided by Embodiment 1 of the application;
Fig. 9 is a schematic diagram of relabeling synthetic positive samples as positive samples and generating a balanced sample set, provided by Embodiment 1 of the application;
Fig. 10 is a schematic flow chart of a method for training a credit model on imbalanced credit samples, provided by Embodiment 2 of the application;
Fig. 11 is a structural block diagram of a sample balancing device for an imbalanced sample set, provided by Embodiment 3 of the application;
Fig. 12 is a schematic flow chart of a model training method for an imbalanced sample set, provided by Embodiment 4 of the application;
Fig. 13 is a structural block diagram of a model training device for an imbalanced sample set, provided by Embodiment 5 of the application;
Fig. 14 is a structural block diagram of a personal credit scoring system provided by Embodiment 6 of the application.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the application clearer, the technical solutions of the application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the application, without creative work, fall within the scope of protection of the application.
Before the technical solutions of the application are described in detail, several terms are briefly explained for clarity. The embodiments involve imbalanced and balanced sample sets, negative and positive samples, and the sample space and feature values. A sample set contains positive samples and negative samples, and each sample represents one object. For example, when a sample set of healthy people and cancer patients is built, each person is a sample; a healthy person is a negative sample and a cancer patient is a positive sample. Here negative samples represent the majority class and positive samples the minority class. Because cancer patients are a minority, the ratio of the number of negative samples to the number of positive samples is necessarily greater than 1. An imbalance threshold can be set, for example 1.2: when the ratio of negative samples to positive samples in a sample set is greater than 1.2, the sample set is considered imbalanced. A balance threshold interval can also be preset. The interval is bounded, for example (0.9, 1.1), meaning that when the ratio of negative samples to positive samples in a sample set lies within (0.9, 1.1), the sample set is considered balanced. Each sample has its own features and feature values. For example, a stomach cancer patient (a positive sample) may have early symptoms such as vomiting and gastric ulcer; "early symptoms" is the feature, and "vomiting, gastric ulcer, etc." is the feature value. Each sample has features and feature values in multiple dimensions, from which a multi-dimensional sample space can be built. Each sample is placed at a position in the sample space according to its feature value in each dimension, and the distance between any two samples can then be determined with a specified distance metric.
The technical solutions provided by the embodiments of the application are described in detail below with reference to the drawings.
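As a brief illustration of the imbalance threshold and the balance threshold interval introduced in the terminology above, the two checks can be written as follows. This is a minimal sketch: the function names and the 0/1 label encoding (0 for negative, 1 for positive) are assumptions, and the defaults 1.2 and (0.9, 1.1) are simply the example values used in the text.

```python
def negative_to_positive_ratio(labels):
    """Ratio of negative (label 0) to positive (label 1) sample counts."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("need at least one positive sample")
    return negatives / positives

def is_imbalanced(labels, imbalance_threshold=1.2):
    # Imbalanced when the ratio exceeds the imbalance threshold (> 1).
    return negative_to_positive_ratio(labels) > imbalance_threshold

def is_balanced(labels, interval=(0.9, 1.1)):
    # Balanced when the ratio lies inside the open, bounded interval.
    low, high = interval
    return low < negative_to_positive_ratio(labels) < high

# 15 negative and 5 positive samples, the counts used later in Fig. 4.
labels = [0] * 15 + [1] * 5
print(negative_to_positive_ratio(labels), is_imbalanced(labels), is_balanced(labels))
```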
Embodiment 1
As described above, with the development of information technology, predicting low-probability events has become a focus in many industries, such as predicting the probability of cancer, the probability that a user's credit card payment becomes overdue, or the probability of a sudden change in financial markets. The sample sets of these low-probability events are all severely imbalanced (for example 1000:1, that is, only one person in every 1000 fails to repay a credit card on time). A model trained on an imbalanced sample set tends to be biased, which degrades its performance. To solve this problem, the prior art generally uses oversampling, that is, randomly duplicating minority-class samples until the number of minority-class samples matches the number of majority-class samples. For example, if sample 1 among the minority-class samples is duplicated twice, three samples (sample 1, sample 1' and sample 1'') occupy the position of sample 1 in the sample space. In practice, however, two completely identical samples rarely occur: the causes of illness and the early symptoms differ from one cancer patient to another, and the historical behavior and personal circumstances of the people who fail to repay their credit cards are not completely identical either. Simply copying a few minority-class samples therefore lowers the authenticity of the sample set. In addition, because several identical samples appear during training, those samples carry a higher weight, which leads to overfitting and in turn harms the training result. In view of this defect, the inventors propose a sample balancing method for an imbalanced sample set, which improves the authenticity of the resulting sample set when the imbalanced sample set is oversampled. The method is an oversampling method for an imbalanced sample set that contains negative samples and positive samples, where the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold. The imbalance threshold can be set in advance (for example 1.2 or 1.5). Because the step of building the sample set is not the focus of this solution, it is not described further. The flow of the method is shown in Fig. 1 and includes the following steps:
Step 11: Choose one positive sample from the imbalanced sample set.
Because oversampling targets the minority-class samples, one positive sample can be chosen from the imbalanced sample set. It can be chosen at random or according to its position in the sample space. For example, in a two-dimensional sample plane, samples can be chosen in ascending order of feature value; in a three-dimensional sample space, samples can be chosen from the center of the sample space outward according to feature value; and so on. Note that the sample space in this embodiment covers both the two-dimensional sample plane and multi-dimensional sample spaces.
Note also that, as stated in the brief explanation of terms, negative samples here represent the majority class and positive samples the minority class. In practice the positive and negative classes can be defined freely; for example, positive samples could instead be defined as the majority class. The definition is set in advance and cannot be changed once set within one flow. If negative samples were defined as the minority class, this step would choose one negative sample. The application defines positive samples as the minority class, and this is not repeated below.
Taking a two-dimensional sample plane as an example, as shown in Fig. 2, "○" denotes a negative sample and the other marker denotes a positive sample. One of the 5 positive samples can be chosen at random, for example positive sample 1.
Step 12: Taking this positive sample as the reference, choose from the sample space a sample set adjacent to it.
In the sample space, the relative positions of samples are determined by their feature values, so two samples that are closer together can be regarded as more closely related, with smaller differences between their feature values. The sample set adjacent to this positive sample can therefore be chosen by distance.
In this step a neighbor distance threshold can be set in advance. Taking the positive sample chosen in Step 11 as the reference, every sample whose distance to the positive sample is less than the neighbor distance threshold can be added to the sample set, which may contain negative samples and/or positive samples. For example, as shown in Fig. 3, if the preset neighbor distance threshold for positive sample 1 is r, the sample set adjacent to this positive sample is chosen from the two-dimensional sample plane within the circle centered on the position of this sample with radius r. Here the sample set contains 3 negative samples (the 3 "○" marked with "√").
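Choosing every sample that lies within a preset radius r of the reference sample can be sketched as below. The sketch assumes Euclidean distance and tuple-valued samples; the function name and the example coordinates are hypothetical.

```python
import math

def neighbors_within(samples, anchor_index, r):
    """Indices of all samples closer than r to the reference sample."""
    anchor = samples[anchor_index]
    return [i for i, s in enumerate(samples)
            if i != anchor_index and math.dist(anchor, s) < r]

# Hypothetical points in a 2-D sample plane; index 0 is the reference sample.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
adjacent = neighbors_within(samples, anchor_index=0, r=2.5)
print(adjacent)
```

The reference sample itself is skipped by index, so a different sample that happens to share its coordinates would still be counted as a neighbor.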
Note that distance in the application is determined with a specified distance metric, for example Euclidean distance, Manhattan distance, standardized Euclidean distance, and so on.
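The three distance metrics named above can be written out directly. This is a plain-Python sketch using the standard definitions; the function names are illustrative, and for the standardized Euclidean distance the per-feature variances are assumed to be supplied by the caller.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each feature dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardized_euclidean(a, b, variances):
    """Euclidean distance with each squared difference divided by that feature's variance."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), standardized_euclidean(p, q, (9.0, 16.0)))
```

Standardizing by variance keeps a feature with a large numeric range from dominating the distance, which matters when features are measured in different units.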
In practice, a preset neighbor distance threshold is not necessarily suitable for every positive sample. For example, if a positive sample is far from all other samples, the preset neighbor distance threshold may fail to select any adjacent sample. Therefore, in some implementations, to adapt the neighbor distance threshold to the position of the positive sample itself, choosing from the sample space, with this positive sample as the reference, a sample set adjacent to it by means of the neighbor distance threshold can include:
determining the neighbor distance threshold according to the sample quantity ratio and the distances in the sample space between this positive sample and at least one other positive sample; and, according to the neighbor distance threshold and the sample quantity ratio, choosing from the sample space, with this positive sample as the reference, the sample set adjacent to it.
Specifically, the neighbor distance threshold D can be determined according to the following formula:
$$D = \frac{1}{N \times K} \sum_{i=1}^{N} \sum_{k=1}^{K} d(i, k)$$
where K is the total number of positive samples chosen, that is, this positive sample plus the at least one other positive sample; N = sample size ratio - 1; and d(i, k) is the distance from the i-th positive sample to the k-th positive sample.
After the neighbor distance threshold is determined, the sample set adjacent to this positive sample can be chosen from the unbalanced sample set according to the neighbor distance threshold and N.
Specifically, as shown in Fig. 4, K can be 3. Since one positive sample (positive sample 1) has already been chosen, two more positive samples are chosen, either at random or by adjacency; here positive sample 2 and positive sample 3 are chosen. In Fig. 4 there are 15 negative samples and 5 positive samples, so N = 15/5 - 1 = 2: it can be considered that sample balance is reached when 2 additional positive samples are generated for each positive sample. Two of the K positive samples can be chosen at random as i = 1 and i = 2, and the K positive samples are taken as k = 1, k = 2 and k = 3.
As shown in Fig. 5, d(1, 1) = L1; d(1, 2) = L3; d(1, 3) = 0;
d(2, 1) = 0; d(2, 2) = L2; d(2, 3) = L1.
So D = (L1 + L1 + L2 + L3) / (2 × 3).
Taking Fig. 5 as an example, L1 = 872 (unit), L2 = 738 (unit) and L3 = 1144 (unit), where "(unit)" denotes the unit of distance in the two-dimensional sample plane. Then D = 3626/6 ≈ 604.3, which the example takes as 605 (unit).
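The worked example can be checked numerically. The sketch below assumes the threshold is the mean of d(i, k) over the N chosen samples and the K positive samples, which is the form the expression D = (L1 + L1 + L2 + L3) / (2 × 3) takes; the function name and the lookup table are illustrative only.

```python
def neighbor_distance_threshold(dist, chosen, k_count):
    """Mean distance from each chosen positive sample i to each of the K positive samples."""
    total = sum(dist(i, k) for i in chosen for k in range(1, k_count + 1))
    return total / (len(chosen) * k_count)

# Distances of the Fig. 5 example, keyed by (i, k); the symbols follow the text.
L1, L2, L3 = 872, 738, 1144
table = {(1, 1): L1, (1, 2): L3, (1, 3): 0,
         (2, 1): 0, (2, 2): L2, (2, 3): L1}

D = neighbor_distance_threshold(lambda i, k: table[(i, k)], chosen=[1, 2], k_count=3)
```

With these values the sum of distances is 3626 and the mean is 3626/6, about 604.3 units, close to the 605 stated in the text.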
As shown in Fig. 6, within the circle centered on positive sample 1 with radius D, N = 2 samples adjacent to this positive sample (the 2 "○" marked with "√") are chosen from the unbalanced sample set, either at random or by distance, to constitute the sample set.
It should be noted that positive samples may also be chosen into the sample set adjacent to this positive sample, because samples close to a positive sample, whether positive or negative, tend to share some identical or similar features with it.
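The selection in step 12 can be sketched as below. This is one possible reading, assuming samples are coordinate tuples, Euclidean distance is used, and the nearest samples are preferred (the text also allows a random choice); the function name is hypothetical.

```python
import math

def choose_adjacent_samples(center, others, radius, n):
    """Return up to n of `others` that lie within `radius` of `center`, nearest first."""
    # Keep only samples strictly closer than the neighbor distance threshold.
    in_range = [s for s in others if math.dist(center, s) < radius]
    # Prefer the closest samples; positive and negative samples are treated alike.
    in_range.sort(key=lambda s: math.dist(center, s))
    return in_range[:n]
```

The caller is expected to pass `others` without the reference positive sample itself, so that the sample is not paired with its own position.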
Step 13: According to the features and feature values that this positive sample and the samples in the sample set respectively correspond to in the sample space, establish synthetic positive samples between this positive sample and the samples in the sample set.
In an unbalanced sample set the number of positive samples is small, and in some scenarios (cancer patients, users with overdue credit cards) it is extremely small, so the distance between two positive samples is generally larger than the distance between two negative samples. As noted above, however, the samples close to a positive sample tend to share some identical or similar features with it. By analogy, the people around a criminal may more or less share features with the criminal, such as educational background, living conditions, family and bad habits; they will not necessarily commit crimes, but the potential exists. In this step, therefore, synthetic positive samples can be established between this positive sample and the samples in the sample set chosen in step 12, according to their respective features and feature values. It should be noted that when a synthetic sample is established, features and feature values must correspond one to one. For example, the annual income of the chosen positive sample (50000) is matched with the annual income of a sample in the sample set (60000).
In one embodiment, the synthetic positive sample can be established at the midpoint between this positive sample and a sample in the sample set. In the "annual income" example above, the annual income of the synthetic positive sample is 55000. Fig. 7 shows two synthetic positive samples "△" established in this way.
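The midpoint embodiment can be illustrated with a short sketch. It assumes each sample is a dict mapping feature names to numeric feature values; the function name and the `age` feature are illustrative additions, while the annual income figures are those of the example above.

```python
def synthesize_midpoint(positive, neighbor):
    """Feature-by-feature midpoint of two samples given as dicts with the same keys."""
    # Features must correspond one to one before their values are combined.
    if positive.keys() != neighbor.keys():
        raise ValueError("features must correspond one to one")
    return {name: (positive[name] + neighbor[name]) / 2 for name in positive}

synthetic = synthesize_midpoint({"annual_income": 50000, "age": 30},
                                {"annual_income": 60000, "age": 40})
```

Here `synthetic["annual_income"]` is 55000.0, matching the figure in the text.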
It should be noted that in actual applications, often N is not integer, in this case, there are two kinds Processing mode:
The first, is to carry out part to round up with multiple positive samples, such as, N is 0.7, then can basis 10 positive samples set up 7 synthesis positive samples.
Second, on the basis of rounding up, when N is 3.3,3 conjunctions are set up according to a positive sample Into positive sample, untill sample set reaches balance, or when N is 1.56, built according to a positive sample Vertical 2 synthesis positive sample, untill sample set reaches balance.
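Both ways of handling a non-integer N can be sketched in a few lines. The helper names are hypothetical, and rounding half up is an assumption chosen because it reproduces the three figures given above; Python's built-in `round` uses round-half-to-even, so an explicit helper is used instead.

```python
import math

def round_half_up(x):
    # Round to the nearest integer, with .5 going up.
    return int(math.floor(x + 0.5))

def total_synthetic_needed(n, positive_count):
    """First approach: spread a fractional N over a group of positive samples."""
    return round_half_up(n * positive_count)

def per_sample_quota(n):
    """Second approach: round N to the nearest integer for each positive sample."""
    return round_half_up(n)
```

For instance, `total_synthetic_needed(0.7, 10)` gives 7, while `per_sample_quota(3.3)` and `per_sample_quota(1.56)` give 3 and 2.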
Step 14: Judge whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval.
Because synthetic positive samples can be regarded as positive samples, they can be counted together with the positive samples to determine the sum, and it is then judged whether the ratio of this sum to the number of negative samples is within the balance threshold interval. When the ratio is not within the interval and is less than the minimum of the balance threshold interval, a positive sample is chosen again from the unbalanced sample set and steps 12 to 14 are repeated, that is, synthetic positive samples continue to be established. The positive sample chosen again may be the same as or different from the one chosen in step 11. In practice it is chosen by executing step 11 again, so when the ratio is not within the interval and is less than the minimum of the balance threshold interval, step 11 can be executed directly, followed by steps 12 to 14.
When the ratio is within the balance threshold interval, the positive samples (including synthetic positive samples) can be considered balanced with the negative samples. As shown in Fig. 8, 2 synthetic positive samples are established for each positive sample, so the number of positive samples (including synthetic positive samples) and the number of negative samples are both 15 and the ratio is exactly 1:1, which is complete balance. The synthetic positive samples can then be relabeled as positive samples to generate the balanced sample set, as shown in Fig. 9.
In practice, multiple positive samples are often selected at once and synthetic samples are established from each of them in parallel, so the number of synthetic positive samples plus positive samples may come to exceed the number of negative samples, leaving the sample set unbalanced again. Taking the example of Figs. 2 to 9, there are initially 15 negative samples and only 5 positive samples; if establishing synthetic positive samples brings the number of synthetic positive samples plus positive samples to 20, the set is unbalanced again. Therefore, in one implementation, if the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is not within the balance threshold interval and is greater than the maximum of the interval, a specified number of synthetic positive samples are deleted, and it is judged again whether the ratio is within the balance threshold interval.
Thus, when sample balancing is applied to an unbalanced sample set in practice, the establishment or deletion of synthetic positive samples is controlled according to the preset balance threshold interval, with the final aim of reaching sample balance. For example, the imbalance threshold is set to 2, that is, the sample balancing operation starts when the negative samples in the acquired sample set are at least twice the positive samples; the balance threshold interval is [0.95, 1.05], and the balanced sample set is generated when 0.95 ≤ number of negative samples / (number of synthetic positive samples + number of positive samples) ≤ 1.05.
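The three outcomes of the judgment (keep establishing, delete, or finish) can be sketched as a single decision function. This is an illustration only: the function name and return strings are hypothetical, the ratio is taken as step 14 defines it (synthetic plus positive over negative, whereas the example above writes the reciprocal), and the default interval is the example value [0.95, 1.05].

```python
def balance_action(positives, synthetic, negatives, interval=(0.95, 1.05)):
    """Decide the next operation from the current sample counts."""
    low, high = interval
    ratio = (positives + synthetic) / negatives
    if ratio < low:
        return "add"        # below the interval: establish more synthetic positive samples
    if ratio > high:
        return "delete"     # above the interval: delete some synthetic positive samples
    return "balanced"       # within the interval: generate the balanced sample set
```

With the counts of Figs. 2 to 9, 5 positive and 15 negative samples call for "add", 10 synthetic samples give "balanced", and 15 synthetic samples (20 in total) give "delete".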
After the sample set has been balanced, the final purpose can be to train a model so that the trained model performs better. Therefore, in one embodiment, the method can also include performing model training according to the balanced sample set. Because the training process is not the focus of this application, it is not described further.
With the method provided by embodiment 1, although minority-class samples (positive samples) are few in an unbalanced sample set, samples close to a positive sample often share identical or similar features with it. The sample set adjacent to a positive sample is chosen with that positive sample as the reference, samples are then chosen from the sample set, and synthetic positive samples are established from their features and feature values together with those of the positive sample, so that the synthetic positive samples are similar to positive samples in their feature values. Compared with the prior-art oversampling method of simply duplicating some positive samples in the unbalanced sample set, this improves the authenticity of the balanced sample set. When a model is trained on the more authentic balanced sample set generated according to this application, the performance of the model can also improve.
In practice there is also another prior-art oversampling method that synthesizes minority-class samples for an unbalanced sample set, namely the SMOTE (Synthetic Minority Over-Sampling Technique) algorithm. In an unbalanced sample set, this algorithm first chooses a positive sample at random, then chooses the positive sample nearest to it, and establishes a synthetic positive sample at a randomly chosen point between the two. Although the algorithm resembles this application, as noted above the distances between positive samples in an unbalanced sample set are relatively large (and the more severe the imbalance, the larger they tend to be), so positive samples mostly share few similar features. A synthetic sample established between two positive samples therefore differs considerably in feature values from either of them, and the oversampling remains rather blind. As a practical example, a citizen living in Beijing, China and a citizen living in Canberra, Australia may both have failed to repay a credit card on time (that is, both are overdue samples), yet they differ considerably in consumption habits, purchasing power of money, ethnicity and social background, so it cannot simply be assumed that a citizen of some city in the Republic of Palau, lying between the two, is an overdue sample. In this application, by contrast, one can find a citizen living in Shanghai, China (an overdue or non-overdue sample) and establish a synthetic overdue sample between these two citizens, for example in Jinan, Shandong Province. Because there is no large difference in consumption habits, purchasing power of money, ethnicity or social background, the synthetic overdue sample established is more authentic and credible.
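For contrast, the SMOTE behavior as this paragraph describes it can be sketched as follows. This is a simplified illustration, not the published algorithm, which interpolates toward one of several nearest minority-class neighbors; the function name is hypothetical and at least two positive samples, given as coordinate tuples, are assumed.

```python
import math
import random

def smote_like_sample(positives, rng):
    """Pick a positive sample, find its nearest other positive, interpolate at a random point."""
    base = rng.choice(positives)
    others = [p for p in positives if p is not base]
    # Only positive samples are considered, however far away the nearest one is.
    nearest = min(others, key=lambda p: math.dist(base, p))
    t = rng.random()  # position along the segment from base to nearest
    return tuple(b + t * (q - b) for b, q in zip(base, nearest))
```

The difference from the method of this application lies in the neighbor search: here the partner is always another positive sample, so when positive samples are far apart the synthetic point can fall far from both.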
Embodiment 2
With the development of personal credit systems, a "credit record" can be established for everyone. The credit record includes the user's historical credit information, such as multidimensional credit-related data (age, education, personal profile, work, wage income and so on). By analyzing historical credit information, it is possible to predict whether a person will be creditworthy in the future. When the sample set is established, however, people who fail to repay credit cards are a minority, so overdue samples are far fewer than non-overdue samples, which forms an unbalanced sample set. As noted above, one prior-art approach simply duplicates some overdue samples for oversampling, but the feature values of each person (as a sample) differ (no two people are identical), so simple duplication makes the resulting overdue samples less authentic. When the SMOTE algorithm is used for oversampling, two overdue samples are generally far apart, so the overdue sample established between them is not highly authentic either; the reasons are described in embodiment 1. Therefore, to address the defects of prior-art methods that balance an unbalanced personal credit sample set by oversampling, and based on the same inventive concept as embodiment 1, embodiment 2 provides a method for training a credit model on unbalanced credit samples, used to improve the performance of the credit model. The flow of the method is shown in Fig. 10 and comprises the following steps:
Step 21: Establish the original personal credit sample set according to the features and feature values corresponding to users.
In this step, all features and feature values corresponding to users can first be acquired and then preprocessed. In this process, data from different sources first undergoes data cleaning, which removes erroneous and irrelevant data; it is then converted into a form that the system can recognize and support; finally, the data of the same user from different sources is fused into one record by means of each user's unique identifier. After preprocessing, the original personal credit sample set can be established according to the features and feature values corresponding to users. The sample set contains overdue samples (users who have not repaid the credit card after the due date) and non-overdue samples (users who have paid off the credit card by the due date). Because overdue samples are bound to be a minority, the original personal credit sample set is bound to be an unbalanced sample set.
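The cleaning and fusion described in this step might look like the sketch below. Everything in it is an assumption made for illustration: records are dicts, `user_id` stands in for the unique identifier, and "cleaning" is reduced to dropping records without an identifier and ignoring missing values.

```python
def fuse_user_records(sources, id_key="user_id"):
    """Merge records from several sources into one record per user."""
    merged = {}
    for records in sources:
        for record in records:
            user_id = record.get(id_key)
            if user_id is None:
                continue  # cleaning: a record without an identifier cannot be fused
            # Fusion: later sources add features to the same user's record.
            merged.setdefault(user_id, {}).update(
                {k: v for k, v in record.items() if v is not None})
    return list(merged.values())
```

A real pipeline would also need format conversion and conflict handling when two sources disagree on a feature value, which this sketch does not attempt.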
Step 22: Choose an overdue sample from the original personal credit sample set.
Step 23: According to the sample size ratio and the distance between the overdue sample and at least one other overdue sample in the sample space, determine the neighbor distance threshold.
Step 24: According to the determined neighbor distance threshold and the sample size ratio, choose the sample set adjacent to the overdue sample from the sample space, with the overdue sample as the reference.
The sample set can contain overdue samples and can also contain non-overdue samples.
Step 25: According to the features and feature values that the overdue sample and the samples in the sample set respectively correspond to in the sample space, establish a synthetic overdue sample at the midpoint between the overdue sample and a sample in the sample set.
Step 26: Judge whether the ratio of the sum of the numbers of synthetic overdue samples and overdue samples to the number of non-overdue samples is within the balance threshold interval.
When the ratio is not within the interval and is less than the minimum of the balance threshold interval, an overdue sample is chosen again from the original personal credit sample set and steps 22 to 26 are repeated, that is, synthetic overdue samples continue to be established.
When the ratio is within the balance threshold interval, the overdue samples (including synthetic overdue samples) can be considered balanced with the non-overdue samples. The synthetic overdue samples can then be relabeled as overdue samples and the balanced personal credit sample set generated.
Step 27: Train the credit model according to the balanced personal credit sample set.
In practice, a year of users' credit data and the corresponding credit records can be acquired; the credit data and credit records of the first three quarters are used to train the credit model, and those of the last quarter are used to verify its performance. If the performance does not meet the expected requirement, the parameters used when choosing the adjacent sample set can be adjusted appropriately (for example by applying coefficients to K, N and/or D).
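The split by quarter can be sketched as below. It assumes, for illustration only, that the year is a calendar year and that each record is a tuple whose first element is the month (1 to 12); the function name is hypothetical.

```python
def split_by_quarter(records):
    """Split (month, features, label) records into training and validation sets."""
    train = [r for r in records if r[0] <= 9]      # first three quarters
    validate = [r for r in records if r[0] >= 10]  # last quarter
    return train, validate
```

Splitting by time rather than at random keeps the validation data strictly later than the training data, which mirrors how the model would be used to predict future overdue behavior.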
With the method provided by embodiment 2, although overdue samples are few in an unbalanced personal credit sample set, samples close to an overdue sample often share identical or similar features with it. The sample set adjacent to an overdue sample is chosen with that overdue sample as the reference, overdue or non-overdue samples are then chosen from the sample set, and synthetic overdue samples are established from their features and feature values together with those of the overdue sample, so that the synthetic overdue samples are similar to overdue samples in their feature values. Compared with the prior-art oversampling method of simply duplicating some overdue samples in the unbalanced personal credit sample set, this improves the authenticity of the personal credit sample set. When the credit model is trained on the more authentic balanced personal credit sample set, its performance can also improve.
Embodiment 3
Based on the same inventive concept, embodiment 3 provides a sample balancing device for unbalanced sample sets, used to improve the authenticity of samples when oversampling an unbalanced sample set. Fig. 11 is a structural block diagram of the device, which includes:
a positive sample selection unit 31, a sample set selection unit 32, a synthetic positive sample establishing unit 33, a judging unit 34 and a balanced sample set generation unit 35, wherein,
the positive sample selection unit 31 can be used to choose a positive sample from the unbalanced sample set;
the sample set selection unit 32 can be used to choose, with the positive sample as the reference, the sample set adjacent to the positive sample from the sample space, the sample set including negative samples and/or positive samples;
the synthetic positive sample establishing unit 33 can be used to establish synthetic positive samples between the positive sample and the samples in the sample set, according to the features and feature values that the positive sample and the samples in the sample set respectively correspond to in the sample space;
the judging unit 34 can be used to judge whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval;
the balanced sample set generation unit 35 can be used to relabel the synthetic positive samples as positive samples and generate the balanced sample set when the judgment result is yes.
In one embodiment, the device also includes a jump unit, which can be used to:
jump to executing the positive sample selection unit when the judgment result is no and the ratio is less than the minimum of the balance threshold interval.
With the device provided by embodiment 3, although minority-class samples (positive samples) are few in an unbalanced sample set, samples close to a positive sample often share identical or similar features with it. The sample set adjacent to a positive sample is chosen with that positive sample as the reference, samples are then chosen from the sample set, and synthetic positive samples are established from their features and feature values together with those of the positive sample, so that the synthetic positive samples are similar to positive samples in their feature values. Compared with the prior-art oversampling method of simply duplicating some positive samples in the unbalanced sample set, this improves the authenticity of the balanced sample set. When a model is trained on the more authentic balanced sample set generated according to this application, the performance of the model can also improve.
Embodiment 4
Embodiment 2 has already described a method for training a credit model on unbalanced credit samples, and in practice the purpose of balancing samples is mostly model training as well. Therefore, based on the same inventive concept, embodiment 4 provides a model training method for unbalanced sample sets, in which the model is trained after the unbalanced sample set has been balanced, so as to improve the performance of the model. The flow of the method is shown in Fig. 12 and comprises the following steps:
Step 41: Obtain the unbalanced sample set.
In this step, the unbalanced sample set can contain negative samples and positive samples, with the sample size ratio of negative samples to positive samples greater than the imbalance threshold. In practice, the content of a sample set can also be used as a condition for judging whether it is an unbalanced sample set. For example, on receiving a sample set, first judge whether it contains only two kinds of samples, then judge whether the sample size ratio of negative samples to positive samples is greater than the imbalance threshold (for example 1.2), and determine from the result whether it is an unbalanced sample set before carrying out subsequent operations.
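The two checks in this step can be sketched as follows. The function name and the label encoding (1 for positive, 0 for negative) are assumptions for illustration; the default threshold is the example value 1.2.

```python
from collections import Counter

def is_unbalanced(labels, positive_label=1, negative_label=0, threshold=1.2):
    """True if the set holds exactly the two classes and negatives outnumber positives enough."""
    counts = Counter(labels)
    # First check: the sample set must contain only the two kinds of samples.
    if set(counts) != {positive_label, negative_label}:
        return False
    # Second check: negative-to-positive ratio above the imbalance threshold.
    return counts[negative_label] / counts[positive_label] > threshold
```

With 15 negative and 5 positive samples the ratio is 3, so the set would be treated as unbalanced and passed on to step 42.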
Step 42: Establish synthetic positive samples according to the positive samples in the unbalanced sample set and the samples adjacent to the positive samples.
This step can be decomposed into three sub-steps, namely steps 11, 12 and 13 introduced in embodiment 1. The purpose is to establish synthetic positive samples from each positive sample and the negative and/or positive samples adjacent to it, so as to reach sample balance. The detailed steps have been introduced in embodiment 1 and are not repeated here.
Step 43: When the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval, relabel the synthetic positive samples as positive samples and generate the balanced sample set.
This step uses the synthetic positive samples established in step 42 and generates the balanced sample set by judging whether the sum of the numbers of synthetic positive samples and positive samples is balanced with the number of negative samples. The detailed steps have been introduced in embodiment 1 and are not repeated here.
Step 44: Perform model training according to the balanced sample set.
With the method provided by embodiment 4, for an acquired unbalanced sample set, samples related to the positive samples are established from the samples adjacent to the positive samples, which improves the authenticity of the balanced sample set. When model training is then carried out on the more authentic balanced sample set, the performance of the model can also improve.
Embodiment 5
Based on the same inventive concept, embodiment 5 provides a model training device for unbalanced sample sets, in which the model is trained after the unbalanced sample set has been balanced, so as to improve the performance of the model. Fig. 13 is a structural block diagram of the device, which includes:
a sample set acquiring unit 51, a sample establishing unit 52, a sample set generation unit 53 and a model training unit 54, wherein,
the sample set acquiring unit 51 can be used to obtain the unbalanced sample set, which contains negative samples and positive samples, the sample size ratio of negative samples to positive samples being greater than the imbalance threshold and the imbalance threshold being greater than 1;
the sample establishing unit 52 can be used to establish synthetic positive samples according to the positive samples in the unbalanced sample set and the samples adjacent to the positive samples, the samples adjacent to the positive samples including negative samples and/or positive samples;
the sample set generation unit 53 can be used to relabel the synthetic positive samples as positive samples and generate the balanced sample set when the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval;
the model training unit 54 can be used to perform model training according to the balanced sample set.
In one embodiment, the sample establishing unit 52 includes a positive sample selection unit 31, a sample set selection unit 32 and a synthetic positive sample establishing unit 33, wherein,
the positive sample selection unit 31 can be used to choose a positive sample from the unbalanced sample set;
the sample set selection unit 32 can be used to choose, with the positive sample as the reference, the sample set adjacent to the positive sample from the sample space, the sample set including negative samples and/or positive samples;
the synthetic positive sample establishing unit 33 can be used to establish synthetic positive samples between the positive sample and the samples in the sample set, according to the features and feature values that the positive sample and the samples in the sample set respectively correspond to in the sample space.
In one embodiment, the sample set generation unit 53 includes a judging unit 34, a balanced sample set generation unit 35 and a jump unit, wherein,
the judging unit 34 can be used to judge whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples is within the balance threshold interval;
the balanced sample set generation unit 35 can be used to relabel the synthetic positive samples as positive samples and generate the balanced sample set when the judgment result is yes;
the jump unit can be used to jump to executing the sample establishing unit when the judgment result is no and the ratio is less than the minimum of the balance threshold interval.
In one embodiment, the sample set selection unit 32 can be used to:
determine the neighbor distance threshold according to the sample size ratio and the distance between the positive sample and at least one other positive sample in the sample space;
choose, according to the neighbor distance threshold and the sample size ratio and with the positive sample as the reference, the sample set adjacent to the positive sample from the sample space.
In one embodiment, the synthetic positive sample establishing unit 33 can be used to:
establish a synthetic positive sample at the midpoint between the positive sample and a sample in the sample set, according to the feature values that the positive sample and the samples in the sample set respectively correspond to in the sample space.
With the device provided by embodiment 5, for an acquired unbalanced sample set, samples related to the positive samples are established from the samples adjacent to the positive samples, which improves the authenticity of the balanced sample set. When model training is then carried out on the more authentic balanced sample set, the performance of the model can also improve.
Embodiment 6
In the prior art, personal credit scoring is based on simple rules. For example, a new individual's credit score is 1; if this month's repayment is made on schedule, 0.1 is added to the existing score, and when all repayments in a quarter, a half year or a full year are made on schedule, bonus points of varying size are added. With the arrival of the big data era, however, such simple scoring no longer meets the requirements of credit scoring with big data, multiple dimensions and multiple scenarios. Therefore, based on the same inventive concept as the previous embodiments, embodiment 6 provides a personal credit scoring system, used to improve the authenticity of personal credit scores. Fig. 14 is a structural block diagram of the system, which includes:
an original personal credit establishing system 61, a sample balance system 62, a credit model training system 63 and a personal credit scoring system 64, wherein,
The original personal credit establishing system 61 can be used to establish the original personal credit sample set according to the features and feature values corresponding to users.
For example, on a given day, the features and feature values for the preceding month of all users whose repayment date was the previous day can be acquired to establish the original personal credit sample set. Specifically, if the repayment date is the 10th, then on September 11 the users' credit data from August 11 to September 10 and the corresponding credit records (overdue or non-overdue) are acquired.
The sample balance system 62 can be used to balance the original personal credit sample set.
Because overdue users are a minority, the original personal credit sample set can be balanced in the manner described in embodiment 1.
The credit model training system 63 can be used to train the credit model according to the balanced personal credit sample set.
The personal credit scoring system 64 can be used to predict, with the credit model and according to the features and feature values corresponding to a user, whether the user will be overdue, and to perform personal credit scoring according to the prediction result.
For example, some days before the repayment date, the credit model can be used with the user's credit data for the current month to predict whether the user will be overdue, for example a 99% or a 72% probability of repaying. According to the predicted result, points can be added to or subtracted from the original score, or a logistic regression algorithm can be used to calculate the score. Specifically, for example, 1 point can be added when the probability is more than 95% and 1 point subtracted when it is less than 60%, and so on.
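The example rule at the end of this paragraph can be written as a small function. This is a sketch of that one rule only, with a hypothetical function name; how probabilities between 60% and 95% affect the score is not stated in the text, so they are assumed here to leave the score unchanged.

```python
def adjust_score(score, repay_probability):
    """Apply the example rule: +1 above 95%, -1 below 60%, otherwise unchanged (assumed)."""
    if repay_probability > 0.95:
        return score + 1
    if repay_probability < 0.60:
        return score - 1
    return score
```

Under this rule a user predicted to repay with 99% probability gains a point, while one at 72% keeps the current score.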
With the system provided by embodiment 6, the unbalanced original personal credit sample set is balanced, that is, corrected, by the sample balancing method of this application, forming more authentic multidimensional balanced samples. Compared with the prior art, which scores only by simple rules, this improves the authenticity of the personal credit score, which therefore reflects the user's credit standing more truthfully.
It should be understood by those skilled in the art that, embodiments herein can be provided as method, system or meter Calculation machine program product.Therefore, the application can be using complete hardware embodiment, complete software embodiment or knot Close the form of the embodiment in terms of software and hardware.And, the application can be used and wherein wrapped at one or more Containing computer usable program code computer-usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) on implement computer program product form.
The application is with reference to the method according to the embodiment of the present application, equipment (system) and computer program product Flow chart and/or block diagram describe.It should be understood that can by computer program instructions realize flow chart and/ Or flow in each flow and/or square frame and flow chart and/or block diagram in block diagram and/or The combination of square frame.These computer program instructions to all-purpose computer, special-purpose computer, embedded can be provided The processor of processor or other programmable data processing devices is producing a machine so that by computer Or the instruction of the computing device of other programmable data processing devices is produced for realizing in flow chart one The device of the function of being specified in flow or multiple one square frame of flow and/or block diagram or multiple square frames.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (16)

1. A model training method for an imbalanced sample set, characterized by comprising:
obtaining an imbalanced sample set, wherein the imbalanced sample set contains negative samples and positive samples, the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1;
establishing synthetic positive samples according to the positive samples in the imbalanced sample set and the samples adjacent to the positive samples, wherein the samples adjacent to the positive samples include negative samples and/or positive samples;
when the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within a balance threshold interval, converting the synthetic positive samples into positive samples to generate a balanced sample set;
performing model training according to the balanced sample set.
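The two numeric conditions in claim 1 can be illustrated with a minimal sketch. The function names and the example threshold values are assumptions made for illustration; the claim itself fixes only that the imbalance threshold is greater than 1 and that a balance threshold interval exists.

```python
from typing import Tuple


def is_imbalanced(n_negative: int, n_positive: int, imbalance_threshold: float) -> bool:
    """True when the negative-to-positive count ratio exceeds the imbalance threshold."""
    if imbalance_threshold <= 1:
        raise ValueError("the claim requires an imbalance threshold greater than 1")
    if n_positive <= 0:
        raise ValueError("at least one positive sample is required")
    return n_negative / n_positive > imbalance_threshold


def is_balanced(
    n_positive: int,
    n_synthetic: int,
    n_negative: int,
    balance_interval: Tuple[float, float],
) -> bool:
    """True when (synthetic + positive) / negative lies in the balance threshold interval."""
    if n_negative <= 0:
        raise ValueError("at least one negative sample is required")
    low, high = balance_interval
    ratio = (n_positive + n_synthetic) / n_negative
    return low <= ratio <= high
```

For instance, 9000 negatives against 100 positives gives a ratio of 90, which is imbalanced under an assumed threshold of 10; adding 8900 synthetic positives brings the second ratio to 1.0, inside an assumed interval of (0.8, 1.2).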
2. The method of claim 1, characterized in that establishing synthetic positive samples according to the positive samples in the imbalanced sample set and the samples adjacent to the positive samples comprises:
selecting one positive sample from the imbalanced sample set;
with the one positive sample as the reference, selecting from the sample space a sample set adjacent to the one positive sample, wherein the sample set contains negative samples and/or positive samples;
establishing a synthetic positive sample between the one positive sample and a sample in the sample set, according to the features and feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space.
3. The method of claim 1, characterized in that, when the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within the balance threshold interval, converting the synthetic positive samples into positive samples to generate a balanced sample set comprises:
judging whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within the balance threshold interval;
if so, converting the synthetic positive samples into positive samples to generate the balanced sample set.
4. The method of claim 3, characterized in that the method further comprises:
if not, and the ratio is less than the minimum of the balance threshold interval, selecting one positive sample from the imbalanced sample set again, and repeating the step of selecting from the sample space, with the one positive sample as the reference, the sample set adjacent to the one positive sample.
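The judge-and-repeat behaviour of claims 3 and 4 can be sketched as a loop. This is an assumption-laden illustration: `balance_by_synthesis` and its `synthesize` callback are hypothetical names, one synthetic sample is produced per round for simplicity, positives are picked at random, and raising an error when the ratio exceeds the interval maximum is an assumption, since the claims only describe the case below the minimum.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[float, ...]


def balance_by_synthesis(
    positives: Sequence[Sample],
    negatives: Sequence[Sample],
    synthesize: Callable[[Sample], Sample],  # hypothetical: builds one synthetic positive
    balance_interval: Tuple[float, float],
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Add synthetic positives until (synthetic + positive) / negative is in the interval."""
    if not positives or not negatives:
        raise ValueError("both classes must be non-empty")
    low, high = balance_interval
    rng = rng or random.Random(0)
    pool = list(positives)
    synthetic: List[Sample] = []
    while True:
        ratio = (len(pool) + len(synthetic)) / len(negatives)
        if low <= ratio <= high:
            # Judgment is "yes": synthetic samples now count as positive samples.
            return pool + synthetic
        if ratio > high:
            # Assumption: this case is not covered by the claims.
            raise ValueError("ratio is above the balance threshold interval")
        # Judgment is "no" and ratio < interval minimum: choose a positive again.
        synthetic.append(synthesize(rng.choice(pool)))
```

Starting from 2 positives and 10 negatives with an assumed interval of (0.5, 1.0), the loop stops after three synthetic samples, when the ratio reaches 0.5.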
5. The method of claim 2, characterized in that selecting from the sample space, with the one positive sample as the reference, the sample set adjacent to the one positive sample comprises:
determining a neighbor distance threshold according to the sample quantity ratio and the distance in the sample space between the one positive sample and at least one other positive sample;
selecting from the sample space, according to the neighbor distance threshold and the sample quantity ratio and with the one positive sample as the reference, the sample set adjacent to the one positive sample.
6. The method of claim 2, characterized in that establishing a synthetic positive sample between the one positive sample and a sample in the sample set, according to the feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space, comprises:
establishing the synthetic positive sample at the middle position between the one positive sample and the sample in the sample set, according to the feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space.
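Neighbor selection (claim 2) and middle-position synthesis (claim 6) can be sketched as below. Several points are assumptions: Euclidean distance is used as the distance in the sample space, the neighbor distance threshold of claim 5 is taken as a given parameter because its derivation is not spelled out here, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[float, ...]


def euclidean(a: Sample, b: Sample) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbors_within(
    seed: Sample, candidates: Sequence[Sample], distance_threshold: float
) -> List[Sample]:
    """Candidates (positive and/or negative) no farther than the threshold from the seed."""
    return [c for c in candidates if euclidean(seed, c) <= distance_threshold]


def midpoint(a: Sample, b: Sample) -> Sample:
    """Feature-wise middle position between two samples."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def synthesize_midpoints(
    seed: Sample, candidates: Sequence[Sample], distance_threshold: float
) -> List[Sample]:
    """One synthetic positive at the midpoint between the seed and each neighbor."""
    return [midpoint(seed, n) for n in neighbors_within(seed, candidates, distance_threshold)]
```

In this sketch the caller passes `candidates` without the seed itself, so the seed is never paired with its own copy.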
7. The method of claim 1, characterized in that the method is applied to an imbalanced original personal credit sample set, wherein a positive sample is an overdue sample and a negative sample is a non-overdue sample.
8. A model training device for an imbalanced sample set, characterized by comprising: a sample set acquiring unit, a sample establishing unit, a sample set generating unit, and a model training unit, wherein
the sample set acquiring unit is configured to obtain an imbalanced sample set, wherein the imbalanced sample set contains negative samples and positive samples, the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1;
the sample establishing unit is configured to establish synthetic positive samples according to the positive samples in the imbalanced sample set and the samples adjacent to the positive samples, wherein the samples adjacent to the positive samples include negative samples and/or positive samples;
the sample set generating unit is configured to, when the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within a balance threshold interval, convert the synthetic positive samples into positive samples to generate a balanced sample set;
the model training unit is configured to perform model training according to the balanced sample set.
9. The device of claim 8, characterized in that the sample establishing unit comprises: a positive sample selecting unit, a sample set selecting unit, and a synthetic positive sample establishing unit, wherein
the positive sample selecting unit is configured to select one positive sample from the imbalanced sample set;
the sample set selecting unit is configured to select from the sample space, with the one positive sample as the reference, a sample set adjacent to the one positive sample, wherein the sample set contains negative samples and/or positive samples;
the synthetic positive sample establishing unit is configured to establish a synthetic positive sample between the one positive sample and a sample in the sample set, according to the features and feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space.
10. The device of claim 8, characterized in that the sample set generating unit comprises: a judging unit, a balanced sample set generating unit, and a jump unit, wherein
the judging unit is configured to judge whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within the balance threshold interval;
the balanced sample set generating unit is configured to, when the judgment result is yes, convert the synthetic positive samples into positive samples to generate the balanced sample set;
the jump unit is configured to, when the judgment result is no and the ratio is less than the minimum of the balance threshold interval, jump to and execute the sample establishing unit.
11. The device of claim 9, characterized in that the sample set selecting unit is specifically configured to:
determine a neighbor distance threshold according to the sample quantity ratio and the distance in the sample space between the one positive sample and at least one other positive sample;
select from the sample space, according to the neighbor distance threshold and the sample quantity ratio and with the one positive sample as the reference, the sample set adjacent to the one positive sample.
12. The device of claim 9, characterized in that the synthetic positive sample establishing unit is specifically configured to:
establish the synthetic positive sample at the middle position between the one positive sample and the sample in the sample set, according to the feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space.
13. A sample balancing method for an imbalanced sample set, characterized in that the imbalanced sample set contains negative samples and positive samples, the ratio of the number of negative samples to the number of positive samples is greater than an imbalance threshold, and the imbalance threshold is greater than 1, the method comprising:
selecting one positive sample from the imbalanced sample set;
with the one positive sample as the reference, selecting from the sample space a sample set adjacent to the one positive sample, wherein the sample set contains negative samples and/or positive samples;
establishing a synthetic positive sample between the one positive sample and a sample in the sample set, according to the features and feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space;
judging whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within a balance threshold interval;
if so, converting the synthetic positive samples into positive samples to generate a balanced sample set.
14. A sample balancing device for an imbalanced sample set, characterized by comprising: a positive sample selecting unit, a sample set selecting unit, a synthetic positive sample establishing unit, a judging unit, and a balanced sample set generating unit, wherein
the positive sample selecting unit is configured to select one positive sample from the imbalanced sample set;
the sample set selecting unit is configured to select from the sample space, with the one positive sample as the reference, a sample set adjacent to the one positive sample, wherein the sample set contains negative samples and/or positive samples;
the synthetic positive sample establishing unit is configured to establish a synthetic positive sample between the one positive sample and a sample in the sample set, according to the features and feature values respectively corresponding to the one positive sample and the sample in the sample set in the sample space;
the judging unit is configured to judge whether the ratio of the sum of the numbers of synthetic positive samples and positive samples to the number of negative samples falls within a balance threshold interval;
the balanced sample set generating unit is configured to, when the judgment result is yes, convert the synthetic positive samples into positive samples to generate a balanced sample set.
15. The device of claim 14, characterized in that the device further comprises a jump unit, specifically configured to:
when the judgment result is no and the ratio is less than the minimum of the balance threshold interval, jump to and execute the positive sample selecting unit.
16. A personal credit scoring system, characterized by comprising: an original personal credit establishing system, a sample balancing system, a credit model training system, and a personal credit scoring system, wherein
the original personal credit establishing system is configured to establish an original personal credit sample set according to the features and feature values corresponding to users;
the sample balancing system is configured to perform sample balancing on the original personal credit sample set;
the credit model training system is configured to train a credit model according to the balanced personal credit sample set;
the personal credit scoring system is configured to predict the overdue status of a user with the credit model according to the features and feature values corresponding to the user, and to perform personal credit scoring according to the prediction result.
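The data flow between the four subsystems of claim 16 can be pictured with a small composition sketch. Everything here is hypothetical: the class `CreditScoringPipeline`, its field names, and the callable signatures are illustrative stand-ins and do not describe the patented implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

Model = Callable[[Any], float]  # assumed: maps user features to a repayment probability


@dataclass
class CreditScoringPipeline:
    build_sample_set: Callable[[Sequence[Any]], List[Any]]  # original personal credit establishing system
    balance: Callable[[List[Any]], List[Any]]               # sample balancing system
    train: Callable[[List[Any]], Model]                     # credit model training system
    score: Callable[[float], int]                           # personal credit scoring system

    def fit(self, users: Sequence[Any]) -> Model:
        """Build the original sample set, balance it, then train the credit model."""
        samples = self.build_sample_set(users)
        balanced = self.balance(samples)
        return self.train(balanced)

    def score_user(self, model: Model, features: Any) -> int:
        """Predict with the credit model and turn the prediction into a score."""
        return self.score(model(features))
```

Keeping each subsystem behind a plain callable is only a design convenience for the sketch; it lets the balancing step be swapped without touching training or scoring.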
CN201510981091.6A 2015-12-23 2015-12-23 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system Active CN106909981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510981091.6A CN106909981B (en) 2015-12-23 2015-12-23 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510981091.6A CN106909981B (en) 2015-12-23 2015-12-23 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system

Publications (2)

Publication Number Publication Date
CN106909981A true CN106909981A (en) 2017-06-30
CN106909981B CN106909981B (en) 2020-08-25

Family

ID=59200037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510981091.6A Active CN106909981B (en) 2015-12-23 2015-12-23 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system

Country Status (1)

Country Link
CN (1) CN106909981B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242672A (en) * 2018-09-29 2019-01-18 北京京东金融科技控股有限公司 Refund information forecasting method, device and the computer readable storage medium of loan
CN109446324A (en) * 2018-10-16 2019-03-08 北京字节跳动网络技术有限公司 Processing method, device, storage medium and the electronic equipment of sample data
CN109740750A (en) * 2018-12-17 2019-05-10 北京深极智能科技有限公司 Method of data capture and device
WO2019127924A1 (en) * 2017-12-29 2019-07-04 深圳云天励飞技术有限公司 Sample weight allocation method, model training method, electronic device, and storage medium
CN110108486A (en) * 2018-01-31 2019-08-09 阿里巴巴集团控股有限公司 Bearing fault prediction technique, equipment and system
CN110175247A (en) * 2019-03-13 2019-08-27 北京邮电大学 A method of abnormality detection model of the optimization based on deep learning
CN110310162A (en) * 2019-07-09 2019-10-08 西安点告网络科技有限公司 The method and device that sample generates
CN110998648A (en) * 2018-08-09 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for distributing orders
CN111666997A (en) * 2020-06-01 2020-09-15 安徽紫薇帝星数字科技有限公司 Sample balancing method and target organ segmentation model construction method
CN111709468A (en) * 2020-06-05 2020-09-25 内蒙古中孚明丰农业科技有限公司 Training method and device for directional artificial intelligence and storage medium
CN113762519A (en) * 2020-06-03 2021-12-07 杭州海康威视数字技术股份有限公司 Data cleaning method, device and equipment
CN113870013A (en) * 2021-10-14 2021-12-31 浙江孚临科技有限公司 Credit default prediction method based on unbalanced data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0519739A1 (en) * 1991-06-19 1992-12-23 Tektronix Inc. Edge integrating phase detector
CN101046876A (en) * 2006-03-31 2007-10-03 探宇科技股份有限公司 Credit scoring system and method of using data mining method
CN102521656A (en) * 2011-12-29 2012-06-27 北京工商大学 Integrated transfer learning method for classification of unbalance samples
CN104239351A (en) * 2013-06-20 2014-12-24 阿里巴巴集团控股有限公司 User behavior machine learning model training method and device
WO2015141724A1 (en) * 2014-03-20 2015-09-24 日本電気株式会社 Device and method for extracting adverse events of drug

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
楼晓俊 等: ""聚类边界过采样不平衡数据分类方法"", 《浙江大学学报》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019127924A1 (en) * 2017-12-29 2019-07-04 深圳云天励飞技术有限公司 Sample weight allocation method, model training method, electronic device, and storage medium
CN110108486A (en) * 2018-01-31 2019-08-09 阿里巴巴集团控股有限公司 Bearing fault prediction technique, equipment and system
CN110998648A (en) * 2018-08-09 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for distributing orders
CN109242672A (en) * 2018-09-29 2019-01-18 北京京东金融科技控股有限公司 Refund information forecasting method, device and the computer readable storage medium of loan
CN109446324B (en) * 2018-10-16 2020-12-15 北京字节跳动网络技术有限公司 Sample data processing method and device, storage medium and electronic equipment
CN109446324A (en) * 2018-10-16 2019-03-08 北京字节跳动网络技术有限公司 Processing method, device, storage medium and the electronic equipment of sample data
CN109740750A (en) * 2018-12-17 2019-05-10 北京深极智能科技有限公司 Method of data capture and device
CN110175247A (en) * 2019-03-13 2019-08-27 北京邮电大学 A method of abnormality detection model of the optimization based on deep learning
CN110175247B (en) * 2019-03-13 2021-06-08 北京邮电大学 Method for optimizing anomaly detection model based on deep learning
CN110310162B (en) * 2019-07-09 2021-09-17 西安点告网络科技有限公司 Sample generation method and device
CN110310162A (en) * 2019-07-09 2019-10-08 西安点告网络科技有限公司 The method and device that sample generates
CN111666997A (en) * 2020-06-01 2020-09-15 安徽紫薇帝星数字科技有限公司 Sample balancing method and target organ segmentation model construction method
CN111666997B (en) * 2020-06-01 2023-10-27 安徽紫薇帝星数字科技有限公司 Sample balancing method and target organ segmentation model construction method
CN113762519A (en) * 2020-06-03 2021-12-07 杭州海康威视数字技术股份有限公司 Data cleaning method, device and equipment
CN111709468A (en) * 2020-06-05 2020-09-25 内蒙古中孚明丰农业科技有限公司 Training method and device for directional artificial intelligence and storage medium
CN111709468B (en) * 2020-06-05 2021-10-26 内蒙古中孚明丰农业科技有限公司 Training method and device for directional artificial intelligence and storage medium
CN113870013A (en) * 2021-10-14 2021-12-31 浙江孚临科技有限公司 Credit default prediction method based on unbalanced data

Also Published As

Publication number Publication date
CN106909981B (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant