CN109325020A - Small sample application method, device, computer equipment and storage medium - Google Patents
Small sample application method, device, computer equipment and storage medium
- Publication number
- CN109325020A (application number CN201810949574.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- woe
- data characteristics
- modeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Landscapes
- Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Finance (AREA)
- Entrepreneurship & Innovation (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to the field of big data, and in particular to a small sample application method, device, computer equipment and storage medium. The method includes: obtaining the data features in an original file; performing IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning; building, according to multiple model information entries in a preset configuration table and the modeling data, the model corresponding to each model information entry; and obtaining the data performance of the multiple models and generating a model report for each model. By performing WOE and IV calculations on the data features to generate modeling data, and generating model reports according to the modeling data and the model information in the preset configuration table, the invention aims to solve the problem that small samples are rarely used in existing modeling.
Description
Technical field
The present invention relates to the field of big data, and in particular to a small sample application method, device, computer equipment and storage medium.
Background art
In the prior art, data modeling on a machine learning platform suffers from drawbacks such as insufficient cluster resource allocation, a long turnaround for distributing small-sample data sets, and long development cycles for high-level models. Because small-sample modeling takes a long time and is inconvenient to debug, small samples are rarely used in existing modeling.
Summary of the invention
In view of the shortcomings of the prior art, the present invention proposes a small sample application method, device, computer equipment and storage medium. WOE and IV calculations are performed on the data features to generate modeling data, and model reports are generated according to the modeling data and the model information in a preset configuration table, with the aim of solving the problem that small samples are rarely used in existing modeling.
The technical solution proposed by the present invention is as follows.
A small sample application method, the method comprising:
obtaining the data features in an original file;
performing IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning;
building, according to multiple model information entries in a preset configuration table and the modeling data, the model corresponding to each model information entry;
obtaining the data performance of the multiple models and generating a model report for each of the models.
Further, after the step of obtaining the data performance of the multiple models and generating a model report for each of the models, the method comprises:
iterating each model and optimizing its parameters according to the data performance of the multiple models.
Further, the step of obtaining the data features in the original file comprises:
converting the HDFS data files on the Hadoop cluster into a CSV file;
reading the data features in the CSV file.
Further, the step of reading the data features in the CSV file comprises:
configuring the parameter file that the single-machine program needs in order to run on the CSV file, the parameter file including the model ID, data file name, data ID column, feature columns to exclude, target feature column and model algorithm;
feeding the CSV file into the single-machine program and running it;
reading the data features in the CSV file.
Further, the step of performing IV and WOE calculations on the data features in sequence to generate the modeling data after preprocessing and cleaning comprises:
performing the IV calculation on each data feature to obtain the IV value of each data feature;
sorting the IV values of the data features by magnitude, selecting a first number of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;
performing the WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
converting the target data features into modeling data according to the WOE mapping relationship of the target data features.
Further, the multiple model information entries in the preset configuration table include xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.
Further, after the step of obtaining the data performance of the multiple models and generating a model report for each of the models, the method comprises:
classifying the multiple model reports according to the data performance of the multiple models;
storing the model reports that belong to the same category in the same folder.
The present invention also provides a small sample use device, the device comprising:
an obtaining module, configured to obtain the data features in an original file;
a processing module, configured to perform IV and WOE calculations on the data features in sequence and generate modeling data, that is, the feature data after preprocessing and cleaning;
a model building module, configured to build, according to multiple model information entries in a preset configuration table and the modeling data, the model corresponding to each model information entry;
a model report generation module, configured to obtain the data performance of the multiple models and generate a model report for each of the models.
The present invention also provides computer equipment comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of any of the methods described above when executing the computer program.
The present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of any of the methods described above when executed by a processor.
With the above technical solution, the beneficial effect of the present invention is as follows: WOE and IV calculations are performed on the data features to generate modeling data, and model reports are generated according to the modeling data and the model information in the preset configuration table, with the aim of solving the problem that small samples are rarely used in existing modeling.
Brief description of the drawings
Fig. 1 is a flowchart of the small sample application method provided by an embodiment of the present invention;
Fig. 2 is a functional block diagram of the small sample use device provided by an embodiment of the present invention;
Fig. 3 is a schematic structural block diagram of the computer equipment provided by an embodiment of the present application.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
As shown in Fig. 1, an embodiment of the present invention proposes a small sample application method comprising the following steps:
Step S101: obtain the data features in the original file.
Obtaining the data features in the original file includes obtaining both the data that can be modeled directly and the data that cannot be modeled directly.
In this embodiment, the data that cannot be modeled directly is mainly the raw dirty data in the original file, including missing values, categorical variables, date variables and the like.
In this embodiment, step S101 comprises:
converting the HDFS data files on the Hadoop cluster into a CSV file;
reading the data features in the CSV file.
The step of reading the data features in the CSV file comprises:
configuring the parameter file that the single-machine program needs in order to run on the CSV file, the parameter file including the model ID, data file name, data ID column, feature columns to exclude, target feature column and model algorithm;
feeding the CSV file into the single-machine program and running it;
reading the data features in the CSV file.
Specifically, the raw dirty data in the original file, including missing values, categorical variables and date variables, is obtained as follows. The HDFS data files on the Hadoop cluster are converted into a CSV file, which serves as the input to the offline single-machine program. The parameter file required to run the program is then configured, including the model ID, data file name, data ID column, feature columns to exclude, target feature column and model algorithm, and the data features in the CSV file are read.
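The parameter file and the CSV read described above could look roughly like the following minimal sketch. The JSON layout, the key names (`model_id`, `id_column`, `drop_columns` and so on) and the sample columns are illustrative assumptions; the source only lists which items the parameter file contains, not its format.

```python
import csv
import io
import json

# Hypothetical parameter file; the key names are assumptions, not from the patent.
PARAMS_JSON = """
{
  "model_id": "demo_001",
  "data_file": "sample.csv",
  "id_column": "cust_id",
  "drop_columns": ["phone"],
  "target_column": "responded",
  "algorithm": "xgboost"
}
"""


def read_features(csv_text, params):
    """Split CSV text into IDs, feature rows and targets as the parameters direct."""
    excluded = {params["id_column"], params["target_column"], *params["drop_columns"]}
    ids, rows, targets = [], [], []
    for record in csv.DictReader(io.StringIO(csv_text)):
        ids.append(record[params["id_column"]])
        targets.append(int(record[params["target_column"]]))
        # Keep only the columns that are neither ID, target nor excluded features.
        rows.append({k: v for k, v in record.items() if k not in excluded})
    return ids, rows, targets


params = json.loads(PARAMS_JSON)
sample_csv = (
    "cust_id,phone,is_vip,last_amount,responded\n"
    "1,555,yes,120,1\n"
    "2,556,no,80,0\n"
)
ids, rows, targets = read_features(sample_csv, params)
print(ids, rows, targets)
```

In a real run the CSV text would come from the file named by `data_file` rather than from an inline string.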
Step S102: perform IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning.
After the data features are obtained from the original file, they are processed, mainly by WOE and IV calculation. IV and then WOE are calculated in sequence, and the resulting data is the modeling data obtained by preprocessing and cleaning the feature data.
Step S102 comprises:
performing the IV calculation on each data feature to obtain the IV value of each data feature;
sorting the IV values of the data features by magnitude, selecting a first number of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;
performing the WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
converting the target data features into modeling data according to the WOE mapping relationship of the target data features.
The IV calculation is performed on each data feature, yielding the IV value of each feature. For a feature discretized into groups $i = 1, \dots, n$, the IV is calculated as:
$$IV = \sum_{i=1}^{n} (py_i - pn_i) \ln\frac{py_i}{pn_i}$$
where $py_i$ is the proportion of all responding samples that fall in group $i$ and $pn_i$ is the proportion of all non-responding samples that fall in group $i$.
IV can be understood as a function IV = f(x1, y) of the selected feature and the target feature. Its value is non-negative, and the larger it is, the more strongly the feature influences the target variable. The IV values of the data features are sorted by magnitude, in this embodiment from largest to smallest. A first number of target IV values is selected in sorted order, and the target data features corresponding to those IV values are obtained. The WOE calculation is then performed on the target data features, yielding their WOE mapping relationship. For group $i$, the WOE is calculated as:
$$WOE_i = \ln\frac{py_i}{pn_i}$$
The target data features are converted according to this WOE mapping relationship. The purpose of the conversion is to turn the raw data into data that can be modeled directly; once the conversion is complete, the modeling data has been generated.
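The screening and conversion steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the English feature names are hypothetical, the IV values and the WOE values for the purchase-amount bins are the ones worked out in the example later in this description, and only one feature's WOE map is shown for brevity.

```python
def select_top_features(iv_by_feature, k):
    """Sort features by IV, largest first, and keep the first k names."""
    ranked = sorted(iv_by_feature, key=iv_by_feature.get, reverse=True)
    return ranked[:k]


def apply_woe(rows, woe_maps):
    """Replace each raw bin label with its WOE value for every mapped feature."""
    return [{name: woe_maps[name][row[name]] for name in woe_maps} for row in rows]


# IV values taken from the worked example in this description.
iv_by_feature = {
    "last_amount": 0.49270645,
    "bought_last_month": 0.250224725,
    "last_category": 0.615275563,
    "is_vip": 1.56550367,
}
selected = select_top_features(iv_by_feature, k=3)

# WOE mapping table for one selected feature (bin label -> WOE).
woe_maps = {
    "last_amount": {
        "<100": -0.74721,
        "[100,200)": 0.0,
        "[200,500)": 0.81093,
        ">=500": 1.349927,
    }
}
rows = [{"last_amount": "<100"}, {"last_amount": ">=500"}]
usable_maps = {name: woe_maps[name] for name in selected if name in woe_maps}
modeling_rows = apply_woe(rows, usable_maps)
print(selected)
print(modeling_rows)
```

The value of `k` plays the role of the "first number" in the text; the source does not say how it is chosen.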
WOE and IV calculations are performed on each data feature to obtain the feature's IV value and WOE mapping table. IV is the abbreviation of Information Value. When a classification model is built with methods such as logistic regression or decision trees, the independent variables usually need to be screened. For example, given 200 candidate independent variables, one would not normally put all 200 directly into the model for fitting and training. Instead, some method is used to select a subset of them, which forms the list of variables entering the model. Selecting these variables is a fairly complex process with many factors to consider, such as the predictive power of a variable, the correlation between variables, the simplicity of a variable (easy to generate and use), its robustness (not easily circumvented), and its interpretability in business terms (it can be explained when challenged). The most important and most direct criterion, however, is predictive power. "Predictive power" is a vague, subjective and unquantified notion, so specific quantitative indicators are used to measure the predictive power of each independent variable, and their magnitudes determine which variables enter the model. IV is one such indicator and can be used to measure the predictive power of an independent variable. Similar indicators include information gain and the Gini coefficient.
The full name of WOE is "Weight of Evidence". WOE is a way of encoding an original independent variable. To WOE-encode a variable, the variable must first be grouped (also called discretization or binning; these all mean the same thing). After grouping, the WOE of the i-th group is calculated as follows:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T}$$
The calculation formula for IV is as follows:
$$IV_i = (py_i - pn_i) \times WOE_i, \qquad IV = \sum_{i} IV_i$$
Here, $py_i$ is the proportion of all responding customers in the sample that fall in this group (in a risk model these are the defaulting customers; in short, the individuals whose predicted variable takes the value "yes", or 1), $pn_i$ is the proportion of all non-responding customers in the sample that fall in this group, $\#y_i$ is the number of responding customers in this group, $\#n_i$ is the number of non-responding customers in this group, $\#y_T$ is the number of all responding customers in the sample, and $\#n_T$ is the number of all non-responding customers in the sample.
From the formula above, WOE expresses the difference between "the proportion of all responding customers that fall in the current group" and "the proportion of all non-responding customers that fall in the current group".
A simple transformation of this formula gives:
$$WOE_i = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T} = \ln\frac{\#y_i / \#n_i}{\#y_T / \#n_T}$$
After the transformation, WOE can also be understood as expressing the difference between the ratio of responding to non-responding customers in the current group and the same ratio over all samples. The difference is expressed as the quotient of the two ratios, followed by a logarithm. The larger the WOE, the larger this difference and the more likely the samples in the group are to respond; the smaller the WOE, the smaller the difference and the less likely the samples in the group are to respond.
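As a quick check of the two equivalent forms of WOE described above, the following sketch computes both for one group. The function names are illustrative, and the counts are those of the [200,500) purchase-amount group from the example later in this description (3000 responders, 12000 non-responders, out of 10000 and 90000 in total).

```python
import math


def woe_share_form(y_i, n_i, y_total, n_total):
    """WOE as ln(py_i / pn_i): share of responders over share of non-responders."""
    return math.log((y_i / y_total) / (n_i / n_total))


def woe_ratio_form(y_i, n_i, y_total, n_total):
    """WOE as ln of the group's responder/non-responder ratio over the overall ratio."""
    return math.log((y_i / n_i) / (y_total / n_total))


def iv_term(y_i, n_i, y_total, n_total):
    """Contribution of one group to the feature's IV: (py_i - pn_i) * WOE_i."""
    return (y_i / y_total - n_i / n_total) * woe_share_form(y_i, n_i, y_total, n_total)


counts = (3000, 12000, 10000, 90000)
woe_a = woe_share_form(*counts)
woe_b = woe_ratio_form(*counts)
print(round(woe_a, 5), round(woe_b, 5), round(iv_term(*counts), 5))
```

Both forms give the same value because the transformation only regroups the same four counts.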
The data in the original file is converted according to the generated WOE mapping table, which completes the preprocessing and cleaning of the data. The data in the original file thereby becomes data that can be used directly for modeling, which improves the accuracy of modeling.
Step S103: build, according to the multiple model information entries in the preset configuration table and the modeling data, the model corresponding to each model information entry.
Step S104: obtain the data performance of the multiple models and generate a model report for each of the models.
Each model information entry in the preset configuration table, together with the modeling data, is used to build the model corresponding to that entry, so that a model is built for each of the multiple entries. The data performance of the multiple models is then obtained, and a model report is generated for each model according to that performance.
In this embodiment, the model information in the preset configuration table includes xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.
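A configuration-driven build of several models could be organised roughly as in the sketch below. Everything here is an assumption made for illustration: the table layout, the registry, and above all the placeholder trainer, which simply predicts the majority class so that the example runs without third-party libraries. In practice each registry entry would wrap the corresponding library (xgboost, lightGBM and so on), and the report would hold richer metrics than accuracy.

```python
from collections import Counter


def train_majority_baseline(rows, targets):
    """Placeholder trainer: returns a model that always predicts the majority class."""
    majority = Counter(targets).most_common(1)[0][0]
    return lambda row: majority


# Hypothetical registry; real entries would call the respective modeling libraries.
TRAINERS = {
    name: train_majority_baseline
    for name in ("xgboost", "gbdt", "lightgbm", "catboost", "tensorflow")
}

# Hypothetical preset configuration table: one entry per model to build.
CONFIG_TABLE = [
    {"model_id": "m1", "algorithm": "xgboost"},
    {"model_id": "m2", "algorithm": "gbdt"},
]


def build_reports(config_table, rows, targets):
    """Train one model per configuration entry and record its data performance."""
    reports = {}
    for entry in config_table:
        model = TRAINERS[entry["algorithm"]](rows, targets)
        hits = sum(model(row) == target for row, target in zip(rows, targets))
        reports[entry["model_id"]] = {
            "algorithm": entry["algorithm"],
            "accuracy": hits / len(targets),
        }
    return reports


rows = [{"woe": -0.74721}, {"woe": 0.0}, {"woe": 1.349927}]
targets = [0, 0, 1]
reports = build_reports(CONFIG_TABLE, rows, targets)
print(reports)
```

Keeping the algorithm choice in a table means adding a model is a configuration change rather than a code change, which matches the role the preset configuration table plays in the text.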
Modeling is performed with the model information selected from the preset configuration table, including xgboost model information, gbdt model information, lightGBM model information, catboost model information, tensorflow model information and the like, and the concrete performance of each model is output to generate the model reports. As a specific example, consider building a model that predicts whether each customer in the company's customer base will respond to one of our marketing campaigns. First, 100,000 customers are randomly drawn from the company's customer list for a marketing campaign test, and their response results are collected as our modeling data set; 10,000 of these customers responded. Some variables are extracted for these customers as the candidate variable set of our model, including the following:
1. whether there was a purchase in the last month;
2. the amount of the most recent purchase;
3. the product category of the most recent purchase;
4. whether the customer is a company VIP customer.
These variables are first discretized; the resulting statistics are shown in the tables below.
(1) Whether there was a purchase in the last month:
Purchase in the last month | Responded | Did not respond | Total | Response rate |
Yes | 4000 | 16000 | 20000 | 20% |
No | 6000 | 74000 | 80000 | 7.5% |
Total | 10000 | 90000 | 100000 | 10% |
(2) Amount of the most recent purchase:
Amount of the most recent purchase | Responded | Did not respond | Total | Response rate |
< 100 yuan | 2500 | 47500 | 50000 | 5% |
[100,200) | 3000 | 27000 | 30000 | 10% |
[200,500) | 3000 | 12000 | 15000 | 20% |
>= 500 yuan | 1500 | 3500 | 5000 | 30% |
Total | 10000 | 90000 | 100000 | 10% |
(3) Product category of the most recent purchase:
Product category of the most recent purchase | Responded | Did not respond | Total | Response rate |
3C | 3000 | 57000 | 60000 | 5% |
Cosmetics | 2000 | 18000 | 20000 | 10% |
Mother and baby | 5000 | 15000 | 20000 | 25% |
Total | 10000 | 90000 | 100000 | 10% |
(4) Whether the customer is a company VIP customer:
Company VIP customer | Responded | Did not respond | Total | Response rate |
Yes | 5500 | 4500 | 10000 | 55% |
No | 4500 | 85000 | 90000 | 5% |
Total | 10000 | 90000 | 100000 | 10% |
Take the variable "amount of the most recent purchase" as an example:
Amount of the most recent purchase | Responded | Did not respond | Total | Response rate |
< 100 yuan | 2500 | 47500 | 50000 | 5% |
[100,200) | 3000 | 27000 | 30000 | 10% |
[200,500) | 3000 | 12000 | 15000 | 20% |
>= 500 yuan | 1500 | 3500 | 5000 | 30% |
Total | 10000 | 90000 | 100000 | 10% |
The calculation formula for WOE is:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T}$$
This variable is discretized into 4 segments: < 100 yuan, [100,200), [200,500) and >= 500 yuan. First, according to the WOE formula, the WOE values of these four segments are:
Amount of the most recent purchase | Responded | Did not respond | Total | Response rate | WOE |
< 100 yuan | 2500 | 47500 | 50000 | 5% | -0.74721 |
[100,200) | 3000 | 27000 | 30000 | 10% | 0 |
[200,500) | 3000 | 12000 | 15000 | 20% | 0.81093 |
>= 500 yuan | 1500 | 3500 | 5000 | 30% | 1.349927 |
Total | 10000 | 90000 | 100000 | 10% | 0 |
The results above show the basic characteristics of WOE:
First, the larger the response rate in the current group, the larger the WOE value.
Second, the sign of the current group's WOE is determined by how the ratio of responders to non-responders in the group compares with the same ratio over the whole sample. When the group's ratio is smaller than the overall ratio, WOE is negative; when it is larger, WOE is positive; when the two are equal, WOE is 0.
Third, the value range of WOE is all real numbers.
WOE describes the direction and size of the influence that the current group of a variable has on judging whether an individual will respond. When WOE is positive, the current value of the variable has a positive influence on that judgment; when WOE is negative, it has a negative influence. The magnitude of the WOE value reflects the size of this influence.
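The IV values of the four groups are then calculated separately, using $IV_i = (py_i - pn_i) \times WOE_i$:
$$IV_1 = \left(\tfrac{2500}{10000} - \tfrac{47500}{90000}\right) WOE_1 \approx 0.20756$$
$$IV_2 = \left(\tfrac{3000}{10000} - \tfrac{27000}{90000}\right) WOE_2 = 0$$
$$IV_3 = \left(\tfrac{3000}{10000} - \tfrac{12000}{90000}\right) WOE_3 \approx 0.13516$$
$$IV_4 = \left(\tfrac{1500}{10000} - \tfrac{3500}{90000}\right) WOE_4 \approx 0.14999$$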
The IV results above show the following characteristics of IV:
First, for a given group of a variable, the more the group's ratio of responders to non-responders differs from the same ratio over the whole sample, the larger the IV value; otherwise, the smaller the IV value.
Second, in the extreme case where the group's ratio of responders to non-responders equals the ratio over the whole sample, the IV value is 0.
Third, the value range of IV is [0, +∞), and when the current group contains only responding customers or only non-responding customers, IV = +∞.
The total IV value of the variable is then:
IV = IV1 + IV2 + IV3 + IV4 = 0.492706
The other three variables are calculated in the same way. The IV results for the four variables are as follows.
(1) Amount of the most recent purchase: 0.49270645;
(2) Whether there was a purchase in the last month: 0.250224725;
(3) Product category of the most recent purchase: 0.615275563;
(4) Whether the customer is a company VIP customer: 1.56550367.
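The four IV values listed above can be reproduced with a short script. This is a sketch, not the patent's program: the function and the English variable names are illustrative, the per-group counts are copied as printed in the tables above, and the totals are the stated sample totals (10000 responders, 90000 non-responders).

```python
import math

Y_TOTAL, N_TOTAL = 10000, 90000  # responders and non-responders in the whole sample


def information_value(groups, y_total=Y_TOTAL, n_total=N_TOTAL):
    """Sum (py_i - pn_i) * ln(py_i / pn_i) over (responded, not responded) groups."""
    total = 0.0
    for y_i, n_i in groups:
        py, pn = y_i / y_total, n_i / n_total
        total += (py - pn) * math.log(py / pn)
    return total


# (responded, did not respond) per group, as printed in the tables.
tables = {
    "bought_last_month": [(4000, 16000), (6000, 74000)],
    "last_amount": [(2500, 47500), (3000, 27000), (3000, 12000), (1500, 3500)],
    "last_category": [(3000, 57000), (2000, 18000), (5000, 15000)],
    "is_vip": [(5500, 4500), (4500, 85000)],
}
iv = {name: information_value(groups) for name, groups in tables.items()}
ranking = sorted(iv, key=iv.get, reverse=True)

for name in ranking:
    print(f"{name}: {iv[name]:.6f}")
```

Sorting by IV from largest to smallest gives the order in which variables would be considered for the model.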
Sorting the four variables by IV gives: whether the customer is a company VIP customer > product category of the most recent purchase > amount of the most recent purchase > whether there was a purchase in the last month. "Whether the customer is a company VIP customer" is therefore the variable with the highest predictive power, and "whether there was a purchase in the last month" is the variable with the lowest. Variables can be selected from these four in descending order of IV, and unnecessary or unrepresentative data can be cleaned out, which improves the generality and accuracy of the modeling. With fewer than 500,000 data samples, modeling does not need to go through the Spark big data platform. The small sample size reduces the computation time for modeling, improves the efficiency of model iteration and greatly shortens debugging time. Once packaged, the program runs directly in a Linux or Windows environment, so it is easy to port, depends little on the system environment and is easy to maintain. Models can be brought into actual business scenarios according to how each performs on the data. Compared with the packaged interfaces on Spark, which cannot be changed, the development process is freer and the implementation is customizable, which improves the efficiency of business use while improving model accuracy.
After step S104, the method comprises:
iterating each model and optimizing its parameters according to the data performance of the multiple models.
The model report includes the data performance of the model, and each model is iterated and its parameters optimized according to the data performance of the multiple models as recorded in the model reports. Specifically, the data performance includes parameter plateaus and parameter islands. A parameter plateau is a relatively wide parameter range within which the model achieves good results, generally forming an approximately normal distribution centered on the plateau. A parameter island means that the model performs well only when the parameter value lies within a very small range, and that its performance degrades markedly once the parameter deviates from that value. An important principle in parameter optimization is therefore to aim for a parameter plateau rather than a parameter island. Under this principle, when a model has multiple parameter arrays, the value of one parameter array often affects the distribution of another parameter's plateau. Specifically, parameter optimization can use the stepwise convergence method: one parameter is optimized on its own and fixed at its optimum, then another parameter is optimized and fixed at its optimum, and the cycle repeats until the optimization result no longer changes. Take a moving-average crossover trading model as an example, whose two independent parameters are the short moving-average period N1 and the long moving-average period N2. First, N2 is fixed at 1 and N1 is tested over the range 1 to 100 to find its best value, which turns out to be 8 and is fixed. Next, N2 is optimized between 1 and 200, giving an optimum of 26, which is fixed. N1 is then optimized in a second round, giving a new optimum of 10, which is fixed. Finally, N2 is optimized again, giving an optimum of 28, which is fixed. The screening continues in this cycle until the optimization result no longer changes. If the optimal parameter values finally obtained are N1 = 10 and N2 = 30, the parameter optimization work is complete.
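The stepwise convergence method described above amounts to optimizing one parameter at a time over a grid until a full round changes nothing. A minimal sketch follows. The function name and the toy objective are assumptions: the objective is a made-up surface whose best point sits at N1 = 10, N2 = 30, standing in for the back-tested performance of a real trading model, so the intermediate values will not match the ones in the text.

```python
def stepwise_optimize(score, ranges, start, max_rounds=20):
    """Optimize one parameter at a time until a full round leaves all values unchanged."""
    current = dict(start)
    for _ in range(max_rounds):
        previous = dict(current)
        for name, candidates in ranges.items():
            # Hold the other parameters fixed and pick the best value for this one.
            current[name] = max(candidates, key=lambda v: score({**current, name: v}))
        if current == previous:
            break
    return current


def toy_score(params):
    """Made-up objective with coupled parameters; its maximum is at n1=10, n2=30."""
    dx, dy = params["n1"] - 10, params["n2"] - 30
    return -(dx * dx) - (dy * dy) - 0.5 * dx * dy


ranges = {"n1": range(1, 101), "n2": range(1, 201)}
best = stepwise_optimize(toy_score, ranges, start={"n1": 1, "n2": 1})
print(best)
```

Because the two parameters interact in the toy objective, the first pass does not land on the final values, which is why the loop repeats until the result stops changing. Note that this procedure can settle on a local optimum when the surface has several peaks.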
Another method of parameter optimization is to use a programmable software platform with strong computing capability to calculate directly the distribution of the objective function over the parameter arrays, then compute the multidimensional difference of that distribution and define a difference threshold. Among the regions where the absolute difference is below the threshold, the one with the largest multidimensional volume and the largest inscribed-sphere radius is selected as the most stable parameter values, which completes model iteration and parameter optimization.
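The idea of picking the most stable region by thresholding differences can be illustrated in one dimension. The sketch below is a deliberate simplification and an assumption on my part: the text describes a multidimensional volume and inscribed-sphere radius, whereas this version only scans a single parameter, finds the longest run of neighbouring scores whose differences stay below the threshold, and returns the parameter at its centre. The scores are made-up numbers.

```python
def widest_plateau(values, scores, threshold):
    """Return the parameter at the centre of the longest run of small score differences."""
    best_start, best_len = 0, 1
    run_start = 0
    for i in range(1, len(scores)):
        if abs(scores[i] - scores[i - 1]) >= threshold:
            run_start = i  # the difference is too large, so a new run starts here
        run_len = i - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return values[best_start + (best_len - 1) // 2]


values = list(range(1, 11))
scores = [0.10, 0.50, 0.90, 0.52, 0.80, 0.81, 0.82, 0.81, 0.80, 0.30]
stable_value = widest_plateau(values, scores, threshold=0.05)
print(stable_value)
```

In this toy data the isolated peak at parameter 3 is an island and is ignored, while the flat stretch from 5 to 9 is the plateau. A practical version would also require the plateau's score to be acceptably high, since a flat region of poor performance is stable too.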
In this embodiment, after step S104, the method comprises:
classifying the multiple model reports according to the data performance of the multiple models;
storing the model reports that belong to the same category in the same folder.
The data performance of the multiple models on the modeling data may differ. In the present embodiment, the data performance of a model is graded as excellent, good, or poor. The model reports are classified into these three categories according to the data performance of the models, and reports in the same category are stored in the same folder so that the data performance of the models can later be reviewed by category. For example, if the data performance of the xgboost model is excellent and that of the gbdt model is also excellent, the model reports of the xgboost model and the gbdt model are stored in the same folder.
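The grading and filing step can be sketched as below. The function name `file_reports_by_grade`, the report file names, and the grade assigned to lightGBM are assumptions made for illustration; only the xgboost and gbdt grades come from the example above.

```python
import shutil
import tempfile
from pathlib import Path


def file_reports_by_grade(reports: dict, grades: dict, out_dir: Path) -> dict:
    """Copy each model report into a folder named after its grade."""
    placed = {}
    for model, report_path in reports.items():
        folder = out_dir / grades[model]
        folder.mkdir(parents=True, exist_ok=True)
        # shutil.copy returns the path of the copied file inside the folder.
        placed[model] = Path(shutil.copy(report_path, folder))
    return placed


# Create three placeholder reports in a temporary working directory.
workdir = Path(tempfile.mkdtemp())
reports = {}
for model in ("xgboost", "gbdt", "lightGBM"):
    path = workdir / f"{model}_report.txt"
    path.write_text(f"report for {model}\n")
    reports[model] = path

grades = {"xgboost": "excellent", "gbdt": "excellent", "lightGBM": "good"}
placed = file_reports_by_grade(reports, grades, workdir / "classified")
```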
In some embodiments, the generated model reports are classified and stored using missing values, categorical variables, date variables, or user information as labels, so that the differences and connections among the model reports can be understood clearly and directly. Model iteration and parameter optimization can then be carried out on the model corresponding to the data features, according to the data features in the different original files.
In conclusion obtaining the data characteristics in original document;WOE, IV are carried out to data characteristics to calculate, and generate modeling number
According to;According to the model information and modeling data in preset configuration table, model report is generated.Less than 500,000 data samples
Under the conditions of, it does not need to be modeled by spark big data platform, data sample is few to reduce the Modeling Calculation used time, improves mould
Type iteration efficiency, substantially reduces the time of debugging;And it is direct under Linux/Windows environment after being packaged program
Operation has the characteristic for being easy to transplant, and less to the dependence of system environments, facilitates and is safeguarded, can be according to each mould
Type shows in data and practical business scene is imported, the packaged interface that can not be changed on opposite spark, exploitation stream
Journey is freer, implements customizable, improves business service efficiency while improving model accuracy.
As shown in Fig. 2, an embodiment of the present invention provides a small sample application device 1. The device 1 includes an obtaining module 11, a processing module 12, a model building module 13, and a model report generation module 14.
The obtaining module 11 is configured to obtain the data features in the original file.
In the present embodiment, the raw dirty data in the original file is obtained, including missing values, categorical variables, date variables, and the like.
In the present embodiment, the obtaining module 11 includes:
A conversion module, configured to convert the HDFS data file on the Hadoop cluster into a csv file;
A reading module, configured to read the data features in the csv file.
The reading module is configured to:
Configure the parameter file that the stand-alone program requires to run on the csv file, the parameter file including the model ID, data file name, data ID column, feature columns to be excluded, target feature column, and model algorithm;
Input the csv file into the stand-alone program for execution;
Read the data features in the csv file.
Specifically, the raw dirty data in the original file, including missing values, categorical variables, date variables, and the like, is obtained as follows: the HDFS data file on the Hadoop cluster is converted into a csv file, which serves as the input of the offline stand-alone program; the parameter file required to run the program is then configured, including the model ID, data file name, data ID column, feature columns to be excluded, target feature column, and model algorithm; and the data features in the csv file are read.
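As a rough sketch of how such a parameter file could drive the reading step, the example below keeps the six fields named above in a dictionary and uses them to split a csv into IDs, features, and targets. The key names, column names, and sample rows are all hypothetical, and an in-memory string stands in for the csv file exported from HDFS.

```python
import csv
import io

# Hypothetical parameter file contents; the key names are illustrative only.
params = {
    "model_id": "m001",
    "data_file": "customers.csv",
    "id_column": "customer_id",
    "drop_columns": ["phone"],
    "target_column": "responded",
    "algorithm": "xgboost",
}


def read_features(csv_text: str, params: dict):
    """Return (ids, feature rows, targets) as directed by the parameter file."""
    skip = {params["id_column"], params["target_column"], *params["drop_columns"]}
    ids, rows, targets = [], [], []
    for record in csv.DictReader(io.StringIO(csv_text)):
        ids.append(record[params["id_column"]])
        targets.append(int(record[params["target_column"]]))
        # Everything that is not an ID, target, or excluded column is a feature.
        rows.append({k: v for k, v in record.items() if k not in skip})
    return ids, rows, targets


sample = (
    "customer_id,phone,is_vip,last_amount,responded\n"
    "1,555,yes,120,1\n"
    "2,556,no,80,0\n"
)
ids, rows, targets = read_features(sample, params)
```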
The processing module 12 is configured to perform IV and WOE calculations on the data features in sequence and to generate modeling data in which the feature data has been preprocessed and cleaned.
The data features are obtained from the original file. Once obtained, they are processed, mainly by WOE and IV calculation: IV and WOE are computed in sequence, and the resulting data forms the modeling data in which the feature data has been preprocessed and cleaned.
The processing module 12 includes:
A computing module, configured to perform IV calculation on each data feature to obtain the IV value of each data feature;
A selection module, configured to sort the IV values of the data features by magnitude, select a first number of target IV values in sorted order, and obtain the target data features corresponding to the target IV values;
A mapping module, configured to perform WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
A conversion module, configured to convert the target data features into modeling data according to the WOE mapping relationship of the target data features.
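IV calculation is performed on each data feature; once it is complete, the IV value of each data feature is obtained. The IV calculation formula is:
$$IV_i = (py_i - pn_i) \cdot \ln\frac{py_i}{pn_i}, \qquad IV = \sum_{i=1}^{n} IV_i$$
where py_i and pn_i are the proportions of responding and non-responding customers that fall in group i of the feature, and n is the number of groups.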
The IV value of each data feature is obtained from this formula. IV can be understood as a function IV = f(x1, y) of the selected feature and the target feature; its value is non-negative, and the larger the value, the greater the influence of the feature on the target variable. The IV values of the data features are sorted by magnitude, in the present embodiment from largest to smallest. The first number of target IV values is selected in sorted order, and the target data features corresponding to the target IV values are obtained. WOE calculation is then performed on the target data features; once it is complete, the WOE mapping relationship of the target data features is obtained. The WOE calculation formula is:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T}$$
The target data features are converted according to their WOE mapping relationship. The purpose of the conversion is to turn the raw data into data that can be modeled directly. Once the conversion is complete, the modeling data is generated.
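The sequence described above (compute IV per feature, sort, keep the top features, then replace each category with its WOE) can be sketched as follows. The function names, the two toy features, and the choice `top_k=1` are assumptions for illustration; the sketch also assumes that every category contains both responders and non-responders, since otherwise the logarithm is undefined.

```python
import math
from collections import Counter


def woe_iv(values, targets):
    """Return ({category: WOE}, IV) for one categorical feature and a 0/1 target."""
    pos = Counter(v for v, t in zip(values, targets) if t == 1)
    neg = Counter(v for v, t in zip(values, targets) if t == 0)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    table, iv = {}, 0.0
    for cat in set(values):
        py, pn = pos[cat] / pos_total, neg[cat] / neg_total
        # Assumption: py > 0 and pn > 0; empty cells would need re-binning.
        table[cat] = math.log(py / pn)
        iv += (py - pn) * table[cat]
    return table, iv


def build_modeling_data(features, targets, top_k):
    """Rank features by IV, keep the top_k, and map categories to their WOE."""
    stats = {name: woe_iv(col, targets) for name, col in features.items()}
    ranked = sorted(stats, key=lambda name: stats[name][1], reverse=True)
    kept = ranked[:top_k]
    modeling = {name: [stats[name][0][v] for v in features[name]] for name in kept}
    return modeling, ranked


# Hypothetical toy data: two categorical features and a 0/1 response.
features = {
    "is_vip": ["yes", "yes", "yes", "no", "no", "no", "no", "no"],
    "channel": ["web", "app", "web", "app", "web", "app", "web", "app"],
}
targets = [1, 1, 0, 1, 0, 0, 0, 0]
modeling, ranked = build_modeling_data(features, targets, top_k=1)
```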
WOE and IV calculations are performed on each data feature to obtain the IV values and the WOE mapping table of the data features. IV is the abbreviation of Information Value. When a classification model is built with methods such as logistic regression or decision trees, the independent variables often need to be screened. For example, given 200 candidate independent variables, one would not normally put all 200 directly into the model for fitting; instead, some method is used to select a subset of them, which forms the list of model-entry variables. Selecting model-entry variables is a fairly complex process with many factors to consider, such as the predictive power of a variable, the correlation between variables, the simplicity of a variable (easy to generate and use), the robustness of a variable (not easily circumvented), and the interpretability of a variable in business terms (it can be explained when challenged). Of these, the main and most direct criterion is the predictive power of the variable. "Predictive power of a variable" is a general, subjective, and unquantified notion; it can be measured with specific quantitative indicators, and the size of those indicators determines which variables enter the model. IV is one such indicator and can be used to measure the predictive power of an independent variable. Similar indicators include information gain and the Gini coefficient.
The full name of WOE is "Weight of Evidence". WOE is a form of encoding for the original independent variable. To WOE-encode a variable, the variable must first be grouped (also called discretization or binning; these all mean the same thing). After grouping, the WOE of the i-th group is calculated as follows:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T}$$
The calculation formula of IV is as follows:
$$IV_i = (py_i - pn_i) \cdot WOE_i, \qquad IV = \sum_{i=1}^{n} IV_i$$
Here, py_i is the proportion of the responding customers in this group (in a risk model these are the defaulting customers; in short, the individuals whose predicted variable in the model takes the value "yes", or 1) among all responding customers in the sample; pn_i is the proportion of the non-responding customers in this group among all non-responding customers in the sample; #y_i is the number of responding customers in this group; #n_i is the number of non-responding customers in this group; #y_T is the number of all responding customers in the sample; and #n_T is the number of all non-responding customers in the sample.
From the formula above, WOE expresses the difference between "the proportion of the responding customers in the current group among all responding customers" and "the proportion of the non-responding customers in the current group among all non-responding customers".
A simple transformation of this formula gives:
$$WOE_i = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T} = \ln\frac{\#y_i / \#n_i}{\#y_T / \#n_T}$$
After the transformation, WOE can also be understood as the difference between the ratio of responding to non-responding customers in the current group and the same ratio in the whole sample. This difference is expressed as the quotient of the two ratios, followed by a logarithm. The larger the WOE, the larger this difference and the more likely the samples in this group are to respond; the smaller the WOE, the smaller the difference and the less likely the samples in this group are to respond.
The data in the original file is converted according to the generated WOE mapping table, which completes the preprocessing and cleaning of the data, so that the data in the original file becomes data that can be used directly for modeling, improving the accuracy of modeling.
The model building module 13 is configured to build the model corresponding to each of the multiple pieces of model information according to the multiple pieces of model information in the preset configuration table and the modeling data.
The model report generation module 14 is configured to obtain the data performance of the multiple models and to generate the model report corresponding to each of the multiple models.
According to the multiple pieces of model information in the preset configuration table and the modeling data, each piece of model information together with the modeling data is used to build the model corresponding to that model information, so that the models corresponding to all of the model information are built. The data performance of the multiple models is then obtained, and the model report corresponding to each model is generated from it.
In the present embodiment, the model information in the preset configuration table includes xgboost model information, gbdt model information, lightGBM model information, catboost model information, and tensorflow model information.
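One way to picture configuration-driven modeling is a table of model information rows that is looped over, with each row selecting a builder and producing one report entry. Everything in the sketch below is hypothetical: the table layout, the `BUILDERS` registry, and the toy threshold "models", which merely stand in for real learners such as xgboost or lightGBM so that the example runs without third-party libraries.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple[float, int]]  # (one WOE-encoded feature, 0/1 target)


def threshold_model(cut: float) -> Callable[[float], int]:
    """Toy stand-in for a real learner: predict 1 when the feature exceeds cut."""
    return lambda x: int(x > cut)


# Hypothetical registry; real entries would wrap the libraries named in the table.
BUILDERS: Dict[str, Callable[[dict], Callable[[float], int]]] = {
    "toy_threshold": lambda cfg: threshold_model(cfg["cut"]),
}

# Hypothetical preset configuration table: one row of model information per model.
config_table = [
    {"model_id": "m1", "algorithm": "toy_threshold", "params": {"cut": -0.5}},
    {"model_id": "m2", "algorithm": "toy_threshold", "params": {"cut": 0.5}},
]


def build_and_report(config_table, data: Rows) -> Dict[str, dict]:
    """Build one model per configuration row and record its accuracy."""
    reports = {}
    for row in config_table:
        model = BUILDERS[row["algorithm"]](row["params"])
        hits = sum(model(x) == y for x, y in data)
        reports[row["model_id"]] = {
            "algorithm": row["algorithm"],
            "accuracy": hits / len(data),
        }
    return reports


data = [(-0.75, 0), (0.0, 0), (0.81, 1), (1.35, 1)]
reports = build_and_report(config_table, data)
```

Keeping the algorithm choice in the table rather than in code means a new model can be added by appending a row and registering a builder.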
Modeling is carried out by selecting model information in the preset configuration table, including xgboost model information, gbdt model information, lightGBM model information, catboost model information, tensorflow model information, and the like, and the specific performance of each model is output to generate the model report. Specifically, take as an example building a model that predicts whether each customer in a company's customer base will respond to a certain marketing campaign. First, 100,000 customers are randomly selected from the company's customer list for a marketing campaign test, and their responses are collected as the modeling data set; 10,000 of these customers responded. Some variables of these customers are extracted as the candidate variable set of the model, including the following:
1. Whether there was a purchase in the last month;
2. The amount of the most recent purchase;
3. The product category of the most recent purchase;
4. Whether the customer is a company VIP customer.
These variables are first discretized; the resulting statistics are shown in the following tables.
(1) Whether there was a purchase in the last month:
Purchase in the last month | Responded | Not responded | Total | Response rate |
Yes | 4000 | 16000 | 20000 | 20% |
No | 6000 | 74000 | 80000 | 7.5% |
Total | 10000 | 90000 | 100000 | 10% |
(2) The amount of the most recent purchase:
Amount of the most recent purchase | Responded | Not responded | Total | Response rate |
< 100 yuan | 2500 | 47500 | 50000 | 5% |
[100,200) | 3000 | 27000 | 30000 | 10% |
[200,500) | 3000 | 12000 | 15000 | 20% |
>= 500 yuan | 1500 | 3500 | 5000 | 30% |
Total | 10000 | 90000 | 100000 | 10% |
(3) The product category of the most recent purchase:
Product category of the most recent purchase | Responded | Not responded | Total | Response rate |
3C | 3000 | 57000 | 60000 | 5% |
Cosmetics | 2000 | 18000 | 20000 | 10% |
Mother and baby | 5000 | 15000 | 20000 | 25% |
Total | 10000 | 90000 | 100000 | 10% |
(4) Whether the customer is a company VIP customer:
Company VIP customer | Responded | Not responded | Total | Response rate |
Yes | 5500 | 4500 | 10000 | 55% |
No | 4500 | 85000 | 90000 | 5% |
Total | 10000 | 90000 | 100000 | 10% |
Take the variable "amount of the most recent purchase" as an example:
Amount of the most recent purchase | Responded | Not responded | Total | Response rate |
< 100 yuan | 2500 | 47500 | 50000 | 5% |
[100,200) | 3000 | 27000 | 30000 | 10% |
[200,500) | 3000 | 12000 | 15000 | 20% |
>= 500 yuan | 1500 | 3500 | 5000 | 30% |
Total | 10000 | 90000 | 100000 | 10% |
The calculation formula of WOE is:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T}$$
This variable is discretized into four segments: < 100 yuan, [100,200), [200,500), and >= 500 yuan. First, according to the WOE calculation formula, the WOE values of these four segments are:
Amount of the most recent purchase | Responded | Not responded | Total | Response rate | WOE |
< 100 yuan | 2500 | 47500 | 50000 | 5% | -0.74721 |
[100,200) | 3000 | 27000 | 30000 | 10% | 0 |
[200,500) | 3000 | 12000 | 15000 | 20% | 0.81093 |
>= 500 yuan | 1500 | 3500 | 5000 | 30% | 1.349927 |
Total | 10000 | 90000 | 100000 | 10% | 0 |
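The WOE column of the table can be checked with a few lines of code. The sketch below copies the responded and not-responded counts from the table and applies WOE_i = ln((#y_i / #y_T) / (#n_i / #n_T)); the variable and function names are illustrative.

```python
import math

# (responded, not responded) per segment, copied from the table above.
bins = {
    "< 100": (2500, 47500),
    "[100,200)": (3000, 27000),
    "[200,500)": (3000, 12000),
    ">= 500": (1500, 3500),
}
total_yes = sum(y for y, _ in bins.values())  # all responding customers
total_no = sum(n for _, n in bins.values())   # all non-responding customers


def woe(yes: int, no: int) -> float:
    """WOE of one segment: log of its share of responders over its share of non-responders."""
    return math.log((yes / total_yes) / (no / total_no))


woe_by_bin = {label: woe(y, n) for label, (y, n) in bins.items()}
```

The segment [100,200) has the same response rate as the whole sample, which is why its WOE is 0.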
From the results above, the basic characteristics of WOE can be seen:
First, the larger the response rate in the current group, the larger the WOE value;
Second, the sign of the WOE of the current group is determined by how the ratio of responding to non-responding customers in the current group compares with the same ratio in the whole sample: when the ratio of the current group is less than the overall ratio, WOE is negative; when it is greater than the overall ratio, WOE is positive; and when the two ratios are equal, WOE is 0.
Third, the value range of WOE is all real numbers.
WOE describes the direction and size of the influence that the current group of a variable has on judging whether an individual will respond. When WOE is positive, the current value of the variable has a positive influence on that judgment; when WOE is negative, it has a negative influence. The magnitude of the WOE value reflects the size of this influence.
Then the IV values of the four groups are calculated separately:
$$IV_1 = (0.25 - 0.52778) \times (-0.74721) \approx 0.20756$$
$$IV_2 = (0.3 - 0.3) \times 0 = 0$$
$$IV_3 = (0.3 - 0.13333) \times 0.81093 \approx 0.13516$$
$$IV_4 = (0.15 - 0.03889) \times 1.349927 \approx 0.14999$$
From these IV results, the following characteristics of IV can be seen:
First, for a given group of a variable, the more the ratio of responding to non-responding customers in the group differs from the same ratio in the whole sample, the larger the IV value; otherwise, the smaller the IV value;
Second, in the extreme case where the ratio of responding to non-responding customers in the current group equals that of the whole sample, the IV value is 0;
Third, the value range of IV is [0, +∞), and when the current group contains only responding customers or only non-responding customers, IV = +∞.
The total IV value of the variable is then:
IV=IV1+IV2+IV3+IV4=0.492706
The other three variables are calculated on the same principle. The IV results of the four variables are as follows.
(1) Amount of the most recent purchase: 0.49270645;
(2) Whether there was a purchase in the last month: 0.250224725;
(3) Product category of the most recent purchase: 0.615275563;
(4) Whether the customer is a company VIP customer: 1.56550367.
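The four IV values listed above can be reproduced from the count tables and then sorted. In this sketch the group counts are copied from the four tables, the totals 10000 and 90000 are taken from their "Total" rows, and the dictionary keys are shortened labels chosen for illustration.

```python
import math

TOTAL_YES, TOTAL_NO = 10000, 90000  # from the "Total" rows of the tables

# (responded, not responded) per group, copied from the four tables.
variables = {
    "purchase in last month": [(4000, 16000), (6000, 74000)],
    "last purchase amount": [(2500, 47500), (3000, 27000), (3000, 12000), (1500, 3500)],
    "last purchase category": [(3000, 57000), (2000, 18000), (5000, 15000)],
    "company VIP": [(5500, 4500), (4500, 85000)],
}


def information_value(groups) -> float:
    """Sum (py_i - pn_i) * ln(py_i / pn_i) over the groups of one variable."""
    iv = 0.0
    for yes, no in groups:
        py, pn = yes / TOTAL_YES, no / TOTAL_NO
        iv += (py - pn) * math.log(py / pn)
    return iv


iv_by_variable = {name: information_value(g) for name, g in variables.items()}
ranking = sorted(iv_by_variable, key=iv_by_variable.get, reverse=True)
```

A selection step would then keep the first few entries of `ranking` and WOE-encode only those variables.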
The IV ranking of these four variables is: whether the customer is a company VIP customer > product category of the most recent purchase > amount of the most recent purchase > whether there was a purchase in the last month. Thus "whether the customer is a company VIP customer" is the variable with the highest predictive power, and "whether there was a purchase in the last month" is the variable with the lowest. Variables can be selected from these four in descending order of IV, cleaning out unnecessary or unrepresentative data and thereby improving the generality and accuracy of modeling. With fewer than 500,000 data samples, modeling does not need to go through the Spark big data platform; the small sample size reduces modeling computation time, improves model iteration efficiency, and greatly shortens debugging time. After packaging, the program runs directly in a Linux/Windows environment, so it is easy to port, depends little on the system environment, and is easy to maintain. Models can be brought into actual business scenarios according to the data performance of each model. Compared with the packaged, unmodifiable interfaces on Spark, the development process is freer and the implementation can be customized, which improves business service efficiency while improving model accuracy.
The device 1 includes:
An optimization module, configured to iterate each model and optimize its parameters according to the data performance of the multiple models.
The model report includes the data performance of the model, and each model is iterated and its parameters are optimized according to the data performance of the multiple models. Specifically, the data performance includes parameter plateaus and parameter islands. A parameter plateau is a relatively wide parameter range within which the model performs well; performance generally forms an approximately normal distribution centered on the plateau. A parameter island is the case in which the model performs well only when a parameter value lies within a very narrow range, and performance deteriorates markedly once the parameter deviates from that value. An important principle of parameter optimization is therefore to seek parameter plateaus rather than parameter islands. Under this principle, when a model has multiple parameter arrays, the value of one parameter array often affects the plateau distribution of another. Specifically, parameter optimization can use the stepwise convergence method: one parameter is first optimized alone and fixed at its optimum value, then another parameter is optimized and fixed at its optimum value, and so on in turn until the optimization result no longer changes. Take a moving-average crossover trading model as an example. Its two independent parameters are the short moving-average period N1 and the long moving-average period N2. First, N2 is fixed at 1 and N1 is tested over the range 1 to 100; the best value found, 8, is then fixed. Next, N2 is optimized over the range 1 to 200, and the optimum, 26, is fixed. N1 is then optimized in a second round, and the new optimum, 10, is fixed. Finally, N2 is optimized again, and the optimum, 28, is fixed. The screening continues in this cyclic fashion until the optimization result no longer changes. If the final optimal parameter values are N1 = 10 and N2 = 30, the parameter optimization work is complete.
Alternatively, parameter optimization can use a programmable software platform with stronger computing capability to compute directly the distribution of the objective function over the parameter arrays, then compute the multidimensional difference of that distribution and define a difference threshold. Among the regions whose absolute difference is below the threshold, the one with the largest multidimensional volume and the largest inscribed-sphere radius is selected as the most stable parameter values, which completes model iteration and parameter optimization.
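The preference for a parameter plateau over a parameter island can be illustrated in one dimension. The sketch below is a deliberate simplification, not the multidimensional difference method described above: it scores each parameter value by the mean performance of its neighbourhood, so a broad plateau beats an isolated spike. The function name, the window size, and the score list are hypothetical.

```python
def plateau_center(scores, window=2):
    """Return the index whose neighbourhood (+/- window) has the highest mean score."""
    def neighbourhood_mean(i):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        return sum(scores[lo:hi]) / (hi - lo)
    return max(range(len(scores)), key=neighbourhood_mean)


# Hypothetical scores for parameter values 0..11: a broad plateau around index 3
# and an isolated spike (a parameter island) at index 9.
scores = [1, 6, 7, 8, 7, 6, 1, 1, 1, 10, 1, 1]

best_single = max(range(len(scores)), key=scores.__getitem__)  # picks the spike
stable_choice = plateau_center(scores)                         # picks the plateau
```

Choosing by raw score lands on the spike, where a small drift in the parameter collapses performance; choosing by neighbourhood mean lands in the middle of the plateau.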
In the present embodiment, the device 1 includes:
A classification module, configured to classify the multiple model reports according to the data performance of the multiple models;
A storage module, configured to store model reports that belong to the same category in the same folder.
The data performance of the multiple models on the modeling data may differ. In the present embodiment, the data performance of a model is graded as excellent, good, or poor. The model reports are classified into these three categories according to the data performance of the models, and reports in the same category are stored in the same folder so that the data performance of the models can later be reviewed by category. For example, if the data performance of the xgboost model is excellent and that of the gbdt model is also excellent, the model reports of the xgboost model and the gbdt model are stored in the same folder.
In some embodiments, the generated model reports are classified and stored using missing values, categorical variables, date variables, or user information as labels, so that the differences and connections among the model reports can be understood clearly and directly. Model iteration and parameter optimization can then be carried out on the model corresponding to the data features, according to the data features in the different original files.
In conclusion obtaining the data characteristics in original document;WOE, IV are carried out to data characteristics to calculate, and generate modeling number
According to;According to the model information and modeling data in preset configuration table, model report is generated.Less than 500,000 data samples
Under the conditions of, it does not need to be modeled by spark big data platform, data sample is few to reduce the Modeling Calculation used time, improves mould
Type iteration efficiency, substantially reduces the time of debugging;And it is direct under Linux/Windows environment after being packaged program
Operation has the characteristic for being easy to transplant, and less to the dependence of system environments, facilitates and is safeguarded, can be according to each mould
Type shows in data and practical business scene is imported, the packaged interface that can not be changed on opposite spark, exploitation stream
Journey is freer, implements customizable, improves business service efficiency while improving model accuracy.
As shown in Fig. 3, an embodiment of the present application also provides a computer device. The computer device may be a server, and its internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as the models of the small sample application method. The network interface of the computer device communicates with external terminals over a network connection. The computer program, when executed by the processor, implements the small sample application method.
The processor performs the steps of the small sample application method: obtaining the data features in the original file; performing IV and WOE calculations on the data features in sequence to generate modeling data in which the feature data has been preprocessed and cleaned; building the model corresponding to each of the multiple pieces of model information according to the multiple pieces of model information in the preset configuration table and the modeling data; and obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models.
In one embodiment, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models, the method comprises:
Iterating each model and optimizing its parameters according to the data performance of the multiple models.
In one embodiment, the step of obtaining the data features in the original file comprises:
Converting the HDFS data file on the Hadoop cluster into a csv file;
Reading the data features in the csv file.
In one embodiment, the step of reading the data features in the csv file comprises:
Configuring the parameter file that the stand-alone program requires to run on the csv file, the parameter file including the model ID, data file name, data ID column, feature columns to be excluded, target feature column, and model algorithm;
Inputting the csv file into the stand-alone program for execution;
Reading the data features in the csv file.
In one embodiment, the step of performing IV and WOE calculations on the data features in sequence to generate modeling data in which the feature data has been preprocessed and cleaned comprises:
Performing IV calculation on each data feature to obtain the IV value of each data feature;
Sorting the IV values of the data features by magnitude, selecting a first number of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;
Performing WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
Converting the target data features into modeling data according to the WOE mapping relationship of the target data features.
In one embodiment, the multiple pieces of model information in the preset configuration table include xgboost model information, gbdt model information, lightGBM model information, catboost model information, and tensorflow model information.
In one embodiment, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models, the method comprises:
Classifying the multiple model reports according to the data performance of the multiple models;
Storing model reports that belong to the same category in the same folder.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution of the present application is applied.
The computer device of the embodiment of the present application obtains the data features in the original file; performs WOE and IV calculations on the data features to generate modeling data; and generates model reports according to the model information in the preset configuration table and the modeling data. With fewer than 500,000 data samples, modeling does not need to go through the Spark big data platform; the small sample size reduces modeling computation time, improves model iteration efficiency, and greatly shortens debugging time. After packaging, the program runs directly in a Linux/Windows environment, so it is easy to port, depends little on the system environment, and is easy to maintain. Models can be brought into actual business scenarios according to the data performance of each model. Compared with the packaged, unmodifiable interfaces on Spark, the development process is freer and the implementation can be customized, which improves business service efficiency while improving model accuracy.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the small sample application method, specifically: obtaining the data features in the original file; performing IV and WOE calculations on the data features in sequence to generate modeling data in which the feature data has been preprocessed and cleaned; building the model corresponding to each of the multiple pieces of model information according to the multiple pieces of model information in the preset configuration table and the modeling data; and obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models.
In one embodiment, the Data Representation of the multiple model of above-mentioned acquisition, it is right respectively to generate the multiple model
After the step of model report answered, comprising:
According to the Data Representation of the multiple model, each model is iterated respectively and parameter optimization.
In one embodiment, in the step of data characteristics in above-mentioned acquisition original document, comprising:
By the hdfs data file transition on Hadoop cluster at csv file;
Read the data characteristics in the csv file.
In one embodiment, in the step of data characteristics in the above-mentioned reading csv file, comprising:
Configure stand-alone program to Parameter File required for the csv file operation, the Parameter File include model ID,
Data Filename, data ID column, data reject characteristic series, target signature column and model algorithm;
The csv file is inputted into the stand-alone program operation;
Read the data characteristics in the csv file.
In one embodiment, the above step of successively performing IV (information value) and WOE (weight of evidence) calculations on the data features to generate modeling data, that is, the feature data after preprocessing and cleaning, includes:
Performing an IV calculation on each data feature to obtain the IV value of each data feature;
Sorting the IV values of the data features by magnitude, selecting a first number of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;
Performing a WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
Converting the target data features into modeling data according to the WOE mapping relations of the target data features.
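The steps above can be sketched as follows for features that are already binned or categorical. The sketch uses the common definitions WOE = ln(share of positive samples in a bin / share of negative samples in that bin) and IV = sum over bins of (positive share - negative share) × WOE. The orientation of the ratio, the additive smoothing that avoids division by zero, and all function names are assumptions; the application does not fix them.

```python
import math
from collections import Counter


def woe_table(values, labels, smoothing=0.5):
    """Return ({bin: WOE}, IV) for one binned feature; labels are 1 or 0."""
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    bins = sorted(set(values))
    # Smoothed totals keep every share strictly positive (an assumption).
    pos_total = sum(pos.values()) + smoothing * len(bins)
    neg_total = sum(neg.values()) + smoothing * len(bins)
    table, iv = {}, 0.0
    for b in bins:
        pos_share = (pos[b] + smoothing) / pos_total
        neg_share = (neg[b] + smoothing) / neg_total
        woe = math.log(pos_share / neg_share)
        table[b] = woe
        iv += (pos_share - neg_share) * woe
    return table, iv


def select_and_encode(features, labels, top_n):
    """Keep the top_n features by IV and replace each value by its WOE.

    features: dict mapping a feature name to its list of binned values.
    """
    scored = {name: woe_table(column, labels) for name, column in features.items()}
    ranked = sorted(scored, key=lambda name: scored[name][1], reverse=True)[:top_n]
    return {name: [scored[name][0][v] for v in features[name]] for name in ranked}
```

Here `top_n` plays the role of the "first number" in the text, and the dictionary returned by `woe_table` corresponds to the WOE mapping table for one feature.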
In one embodiment, the multiple pieces of model information in the above preset configuration table include XGBoost model information, GBDT model information, LightGBM model information, CatBoost model information, and TensorFlow model information.
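The layout of the preset configuration table is not given in the text. As an illustration only, the sketch below assumes a CSV table with one row per model, holding a model ID, an algorithm name, and a JSON string of hyperparameters. The column names, the sample rows, and the loader function are hypothetical.

```python
import csv
import io
import json

# The five algorithm families named in the embodiment.
SUPPORTED = {"xgboost", "gbdt", "lightgbm", "catboost", "tensorflow"}

# Hypothetical configuration table; the parameter values are placeholders.
CONFIG_TABLE = '''model_id,algorithm,params
m1,xgboost,"{""max_depth"": 3}"
m2,lightGBM,"{""num_leaves"": 15}"
'''


def load_model_infos(table_text):
    """Parse the table into a list of model descriptions, rejecting unknown algorithms."""
    infos = []
    for row in csv.DictReader(io.StringIO(table_text)):
        algorithm = row["algorithm"].strip().lower()
        if algorithm not in SUPPORTED:
            raise ValueError("unsupported algorithm: " + row["algorithm"])
        infos.append({
            "model_id": row["model_id"],
            "algorithm": algorithm,
            "params": json.loads(row["params"]),
        })
    return infos
```

A driver program would then loop over the returned list and build one model per entry from the same modeling data.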
In one embodiment, after the above step of obtaining the data performance of the multiple models and generating the model report corresponding to each model, the method includes:
Classifying the model reports according to the data performance of the multiple models;
Storing model reports that belong to the same category in the same folder.
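A minimal sketch of this classification step follows. It assumes each report is a dictionary carrying an `auc` field and that two categories separated by a fixed threshold are enough. The metric, the threshold, the category names, and the JSON file layout are all assumptions.

```python
import json
from pathlib import Path


def classify(auc, threshold=0.75):
    """Assign a report to a category from its score (hypothetical rule)."""
    return "qualified" if auc >= threshold else "needs_tuning"


def store_reports(reports, root):
    """Write each report as JSON into the folder named after its category.

    reports: dict mapping a model ID to a report dict with an "auc" key.
    Returns a dict mapping each model ID to the path that was written.
    """
    root = Path(root)
    saved = {}
    for model_id, report in reports.items():
        folder = root / classify(report["auc"])
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / (model_id + ".json")
        path.write_text(json.dumps(report, indent=2), encoding="utf-8")
        saved[model_id] = path
    return saved
```

Grouping reports this way lets a reviewer open a single folder to see every model that met the bar.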
The storage medium of the embodiment of the present application obtains the data features in the original file; performs WOE and IV calculations on the data features to generate modeling data; and generates model reports according to the model information in the preset configuration table and the modeling data. With fewer than 500,000 data samples, modeling does not need to go through the Spark big data platform; the small sample size reduces the modeling computation time, improves model iteration efficiency, and greatly shortens debugging time. Once packaged, the program runs directly in a Linux or Windows environment, is easy to port, depends little on the system environment, and is easy to maintain. Models can be introduced according to each model's data performance and the actual business scenario. Compared with the fixed interfaces packaged on Spark, the development process is freer and the implementation is customizable, which improves both model accuracy and business service efficiency.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A small sample application method, characterized in that the method comprises:
Obtaining the data features in an original file;
Successively performing IV and WOE calculations on the data features to generate modeling data, the modeling data being the feature data after preprocessing and cleaning;
Establishing, according to multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each piece of model information;
Obtaining the data performance of the multiple models and generating the model report corresponding to each model.
2. The small sample application method according to claim 1, characterized in that, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each model, the method comprises:
Iterating each model and optimizing its parameters according to the data performance of the multiple models.
3. The small sample application method according to claim 1, characterized in that the step of obtaining the data features in the original file comprises:
Converting the HDFS data file on a Hadoop cluster into a CSV file;
Reading the data features in the CSV file.
4. The small sample application method according to claim 3, characterized in that the step of reading the data features in the CSV file comprises:
Configuring the parameter file that a stand-alone program needs in order to run on the CSV file, the parameter file including a model ID, a data file name, a data ID column, the feature columns to be excluded from the data, a target feature column, and a model algorithm;
Inputting the CSV file into the stand-alone program and running it;
Reading the data features in the CSV file.
5. The small sample application method according to claim 1, characterized in that the step of successively performing IV and WOE calculations on the data features to generate modeling data, the modeling data being the feature data after preprocessing and cleaning, comprises:
Performing an IV calculation on each data feature to obtain the IV value of each data feature;
Sorting the IV values of the data features by magnitude, selecting a first number of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;
Performing a WOE calculation on the target data features to obtain the WOE mapping table of the target data features;
Converting the target data features into modeling data according to the WOE mapping relations of the target data features.
6. The small sample application method according to claim 1, characterized in that the multiple pieces of model information in the preset configuration table include XGBoost model information, GBDT model information, LightGBM model information, CatBoost model information, and TensorFlow model information.
7. The small sample application method according to claim 1, characterized in that, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each model, the method comprises:
Classifying the model reports according to the data performance of the multiple models;
Storing model reports that belong to the same category in the same folder.
8. A small sample application device, characterized in that the device comprises:
An obtaining module, configured to obtain the data features in an original file;
A processing module, configured to successively perform IV and WOE calculations on the data features to generate modeling data, the modeling data being the feature data after preprocessing and cleaning;
A model building module, configured to establish, according to multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each piece of model information;
A model report generation module, configured to obtain the data performance of the multiple models and generate the model report corresponding to each model.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the steps of the method according to any one of claims 1 to 7 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810949574.1A CN109325020A (en) | 2018-08-20 | 2018-08-20 | Small sample application method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810949574.1A CN109325020A (en) | 2018-08-20 | 2018-08-20 | Small sample application method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109325020A true CN109325020A (en) | 2019-02-12 |
Family
ID=65264304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810949574.1A Pending CN109325020A (en) | 2018-08-20 | 2018-08-20 | Small sample application method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325020A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704706A (en) * | 2019-09-11 | 2020-01-17 | 北京海益同展信息科技有限公司 | Training method and classification method of classification model, related equipment and classification system |
CN111638948A (en) * | 2020-06-03 | 2020-09-08 | 重庆银行股份有限公司 | Multi-channel high-availability big data real-time decision making system and decision making method |
CN112395349A (en) * | 2020-11-17 | 2021-02-23 | 平安普惠企业管理有限公司 | Early warning method, device, equipment and storage medium for visual report |
CN114757291A (en) * | 2022-04-26 | 2022-07-15 | 国网四川省电力公司电力科学研究院 | Single-phase fault identification optimization method, system and equipment based on machine learning algorithm |
CN111984637B (en) * | 2020-07-06 | 2023-04-18 | 苏州研数信息科技有限公司 | Missing value processing method and device in data modeling, equipment and storage medium |
CN111984636B (en) * | 2020-07-06 | 2023-06-16 | 苏州研数信息科技有限公司 | Data modeling method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239539A (en) * | 2017-06-02 | 2017-10-10 | 山东浪潮商用系统有限公司 | A kind of user-defined model method based on relevant database |
CN107633265A (en) * | 2017-09-04 | 2018-01-26 | 深圳市华傲数据技术有限公司 | For optimizing the data processing method and device of credit evaluation model |
CN107977351A (en) * | 2017-12-28 | 2018-05-01 | 平安科技(深圳)有限公司 | Electronic report forms generation method, device, computer equipment and storage medium |
CN108334954A (en) * | 2018-01-22 | 2018-07-27 | 中国平安人寿保险股份有限公司 | Construction method, device, storage medium and the terminal of Logic Regression Models |
CN108388924A (en) * | 2018-03-08 | 2018-08-10 | 平安科技(深圳)有限公司 | A kind of data classification method, device, equipment and computer readable storage medium |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704706A (en) * | 2019-09-11 | 2020-01-17 | 北京海益同展信息科技有限公司 | Training method and classification method of classification model, related equipment and classification system |
CN110704706B (en) * | 2019-09-11 | 2021-09-03 | 北京海益同展信息科技有限公司 | Training method and classification method of classification model, related equipment and classification system |
CN111638948A (en) * | 2020-06-03 | 2020-09-08 | 重庆银行股份有限公司 | Multi-channel high-availability big data real-time decision making system and decision making method |
CN111638948B (en) * | 2020-06-03 | 2023-04-07 | 重庆银行股份有限公司 | Multi-channel high-availability big data real-time decision making system and decision making method |
CN111984637B (en) * | 2020-07-06 | 2023-04-18 | 苏州研数信息科技有限公司 | Missing value processing method and device in data modeling, equipment and storage medium |
CN111984636B (en) * | 2020-07-06 | 2023-06-16 | 苏州研数信息科技有限公司 | Data modeling method, device, equipment and storage medium |
CN112395349A (en) * | 2020-11-17 | 2021-02-23 | 平安普惠企业管理有限公司 | Early warning method, device, equipment and storage medium for visual report |
CN114757291A (en) * | 2022-04-26 | 2022-07-15 | 国网四川省电力公司电力科学研究院 | Single-phase fault identification optimization method, system and equipment based on machine learning algorithm |
CN114757291B (en) * | 2022-04-26 | 2023-05-23 | 国网四川省电力公司电力科学研究院 | Single-phase fault identification optimization method, system and equipment based on machine learning algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109325020A (en) | Small sample application method, device, computer equipment and storage medium | |
Chang et al. | Trend discovery in financial time series data using a case based fuzzy decision tree | |
CN111178611B (en) | Method for predicting daily electric quantity | |
CN105718490A (en) | Method and device for updating classifying model | |
CN110930198A (en) | Electric energy substitution potential prediction method and system based on random forest, storage medium and computer equipment | |
CN105354595A (en) | Robust visual image classification method and system | |
CN110991474A (en) | Machine learning modeling platform | |
CN112700324A (en) | User loan default prediction method based on combination of Catboost and restricted Boltzmann machine | |
CN111626821A (en) | Product recommendation method and system for realizing customer classification based on integrated feature selection | |
CN105046323B (en) | Regularization-based RBF network multi-label classification method | |
Durica et al. | Business failure prediction using cart-based model: A case of Slovak companies. | |
CN111143685A (en) | Recommendation system construction method and device | |
Liu et al. | A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge | |
Cao et al. | Bond rating using support vector machine | |
Xu et al. | Novel key indicators selection method of financial fraud prediction model based on machine learning hybrid mode | |
CN114881343B (en) | Short-term load prediction method and device for power system based on feature selection | |
CN115358481A (en) | Early warning and identification method, system and device for enterprise ex-situ migration | |
Ruhkopf et al. | Masif: Meta-learned algorithm selection using implicit fidelity information | |
CN110472659A (en) | Data processing method, device, computer readable storage medium and computer equipment | |
CN113506160A (en) | Risk early warning method and system for unbalanced financial text data | |
CN115795131A (en) | Electronic file classification method and device based on artificial intelligence and electronic equipment | |
CN114048854B (en) | Deep neural network big data internal data file management method | |
Lin et al. | Applying the random forest model to forecast the market reaction of start-up firms: case study of GISA equity crowdfunding platform in Taiwan | |
Ruciński | Neural modelling of electricity prices quoted on the Day-Ahead Market of TGE SA shaped by environmental and economic factors | |
WO1992017853A2 (en) | Direct data base analysis, forecasting and diagnosis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190212 |