CN109325020A

CN109325020A - Small sample application method, device, computer equipment and storage medium

Info

Publication number: CN109325020A
Application number: CN201810949574.1A
Authority: CN
Inventors: 周南光
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2019-02-12

Abstract

The present invention relates to the field of big data, and in particular to a small sample application method, device, computer equipment and storage medium. The method includes: obtaining data features from an original file; performing IV and WOE calculations on the data features in sequence to generate modeling data in which the feature data has been preprocessed and cleaned; establishing, according to multiple pieces of model information in a preset configuration table and the modeling data, a model corresponding to each piece of model information; and obtaining the data performance of the multiple models to generate a model report for each model. By performing IV and WOE calculations on the data features to generate modeling data, and generating model reports from the modeling data and the model information in the preset configuration table, the invention aims to solve the problem that small samples are rarely used in existing modeling.

Description

Small sample application method, device, computer equipment and storage medium

Technical field

The present invention relates to the field of big data, and in particular to a small sample application method, device, computer equipment and storage medium.

Background art

In the prior art, data modeling on a machine learning platform suffers from drawbacks such as insufficient cluster resource allocation, a long time chain for distributing small-sample data sets, and a long development cycle for high-level models. Because modeling with small samples takes a long time and is inconvenient to debug, small samples are rarely used in existing modeling.

Summary of the invention

In view of the shortcomings of the prior art, the present invention proposes a small sample application method, device, computer equipment and storage medium. IV and WOE calculations are performed on the data features to generate modeling data, and model reports are generated from the modeling data and the model information in a preset configuration table, with the aim of solving the problem that small samples are rarely used in existing modeling.

The technical solution proposed by the present invention is:

A small sample application method, the method comprising:

obtaining data features from an original file;

performing IV and WOE calculations on the data features in sequence, and generating modeling data in which the feature data has been preprocessed and cleaned;

establishing, according to multiple pieces of model information in a preset configuration table and the modeling data, a model corresponding to each piece of model information;

obtaining the data performance of the multiple models, and generating a model report corresponding to each model.

Further, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each model, the method comprises:

iterating each model and optimizing its parameters separately, according to the data performance of the multiple models.

Further, the step of obtaining the data features from the original file comprises:

converting the hdfs data file on the Hadoop cluster into a csv file;

reading the data features in the csv file.

Further, the step of reading the data features in the csv file comprises:

configuring the parameter file that the stand-alone program requires to run on the csv file, the parameter file including the model ID, data file name, data ID column, feature columns to be excluded, target feature column and model algorithm;

inputting the csv file into the stand-alone program and running it;

reading the data features in the csv file.

Further, the step of performing IV and WOE calculations on the data features in sequence and generating modeling data in which the feature data has been preprocessed and cleaned comprises:

performing the IV calculation on each data feature to obtain the IV value of each data feature;

sorting the IV values of the data features by numerical value, screening out a first quantity of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;

performing the WOE calculation on the target data features to obtain the WOE mapping table of the target data features;

converting the target data features into modeling data according to the WOE mapping table of the target data features.

Further, the multiple pieces of model information in the preset configuration table include xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.

Further, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each model, the method comprises:

classifying the multiple model reports according to the data performance of the multiple models;

storing model reports that belong to the same category in the same folder.

The present invention also provides a small sample use device, the device comprising:

an obtaining module, for obtaining data features from an original file;

a processing module, for performing IV and WOE calculations on the data features in sequence and generating modeling data in which the feature data has been preprocessed and cleaned;

a model building module, for establishing, according to multiple pieces of model information in a preset configuration table and the modeling data, a model corresponding to each piece of model information;

a model report generation module, for obtaining the data performance of the multiple models and generating a model report corresponding to each model.

The present invention also provides computer equipment, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of any of the methods described above when executing the computer program.

The present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of any of the methods described above when executed by a processor.

According to the above technical solution, the beneficial effects of the invention are as follows: IV and WOE calculations are performed on the data features to generate modeling data, and model reports are generated from the modeling data and the model information in the preset configuration table, with the aim of solving the problem that small samples are rarely used in existing modeling.

Brief description of the drawings

Fig. 1 is a flow chart of the small sample application method provided by an embodiment of the present invention;

Fig. 2 is a functional block diagram of the small sample use device provided by an embodiment of the present invention;

Fig. 3 is a schematic structural block diagram of the computer equipment provided by an embodiment of the present application.

Detailed description of the embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

As shown in Fig. 1, an embodiment of the present invention proposes a small sample application method, the method comprising the following steps:

Step S101: obtain the data features from the original file.

Obtaining the data features from the original file includes obtaining both the data in the original file that can be modeled directly and the data that cannot be modeled directly.

In this embodiment, the data that cannot be modeled directly is mainly the raw dirty data in the original file, including missing values, categorical variables, date variables and the like.

In this embodiment, step S101 comprises:

converting the hdfs data file on the Hadoop cluster into a csv file;

reading the data features in the csv file.

The step of reading the data features in the csv file comprises:

configuring the parameter file that the stand-alone program requires to run on the csv file, the parameter file including the model ID, data file name, data ID column, feature columns to be excluded, target feature column and model algorithm;

inputting the csv file into the stand-alone program and running it;

reading the data features in the csv file.

Specifically, the raw dirty data in the original file, including missing values, categorical variables and date variables, is obtained as follows. The hdfs data file on the Hadoop cluster is converted into a csv file, which serves as the input of the offline stand-alone program. The parameter file required to run the program is then configured, including the model ID, data file name, data ID column, feature columns to be excluded, target feature column and model algorithm. Finally, the data features in the csv file are read.
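The reading step described above can be sketched as follows. This is a minimal illustration only: the JSON layout of the parameter file, its key names and the sample CSV content are assumptions made for the example, not something the source specifies, and the HDFS-to-CSV conversion itself is not shown.

```python
import csv
import io
import json

# Hypothetical parameter file; the key names are illustrative assumptions.
PARAMS_JSON = """
{
  "model_id": "demo_001",
  "data_file": "sample.csv",
  "id_column": "customer_id",
  "drop_columns": ["phone"],
  "target_column": "responded",
  "algorithm": "xgboost"
}
"""

# Stand-in for the CSV file exported from HDFS (made-up rows).
CSV_TEXT = """customer_id,phone,is_vip,last_amount,responded
1,555-0001,yes,120,1
2,555-0002,no,,0
3,555-0003,no,80,0
"""


def read_features(csv_file, params):
    """Return (feature_names, rows, targets) as directed by the parameter file."""
    reader = csv.DictReader(csv_file)
    excluded = {params["id_column"], params["target_column"], *params["drop_columns"]}
    feature_names = [c for c in reader.fieldnames if c not in excluded]
    rows, targets = [], []
    for record in reader:
        # Empty strings become None so that missing values stay visible.
        rows.append({c: (record[c] or None) for c in feature_names})
        targets.append(int(record[params["target_column"]]))
    return feature_names, rows, targets


params = json.loads(PARAMS_JSON)
feature_names, rows, targets = read_features(io.StringIO(CSV_TEXT), params)
print(feature_names)
```

Keeping the ID column, the excluded columns and the target column in the parameter file means the same program can be pointed at a different CSV file without code changes.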

Step S102: IV and WOE calculations are performed on the data features in sequence, generating modeling data in which the feature data has been preprocessed and cleaned.

After the data features have been obtained from the original file, they are subjected to calculation, mainly the IV calculation followed by the WOE calculation. The data obtained from these calculations forms the modeling data in which the feature data has been preprocessed and cleaned.

Step S102 comprises:

performing the IV calculation on each data feature to obtain the IV value of each data feature;

sorting the IV values of the data features by numerical value, screening out a first quantity of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;

performing the WOE calculation on the target data features to obtain the WOE mapping table of the target data features;

converting the target data features into modeling data according to the WOE mapping table of the target data features.

The IV calculation is performed on each data feature, giving the IV value of each data feature. The formula for the IV calculation is:

$$IV = \sum_{i} (py_i - pn_i)\,\ln\frac{py_i}{pn_i}$$

where $py_i$ and $pn_i$ are the proportions of responding and non-responding customers that fall in group $i$ of the feature, as defined with the WOE formula below.

IV can be understood as a function IV = f(x1, y) of the selected feature and the target feature. Its value is non-negative, and the larger the value, the greater the influence of the feature on the target variable. The IV values of the data features are sorted by numerical value, in this embodiment from largest to smallest. A first quantity of target IV values is screened out in sorted order, and the corresponding target data features are obtained. The WOE calculation is then performed on the target data features to obtain their WOE mapping table. The formula for the WOE of group $i$ is:

$$WOE_i = \ln\frac{py_i}{pn_i}$$

The target data features are converted according to their WOE mapping table. The purpose of the conversion is to turn the raw data into data that can be modeled directly; once the conversion is complete, the modeling data has been generated.

WOE and IV calculations are performed on each data feature to obtain its IV value and WOE mapping table. IV is the abbreviation of Information Value. When a classification model is built with methods such as logistic regression or decision trees, the independent variables usually need to be screened. For example, given 200 candidate independent variables, one would not normally put all 200 directly into the model for fitting. Instead, some method is used to select a subset of them, which forms the list of variables entering the model. Selecting these variables is a fairly complex process with many factors to consider, such as the predictive power of a variable, the correlation between variables, the simplicity of a variable (easy to generate and use), its robustness (not easily circumvented), and its interpretability in business terms (it can be explained when challenged). Among these, the main and most direct criterion is predictive power. "Predictive power of a variable" is a general, subjective and unquantified notion, so it is measured with specific quantitative indicators, and the size of those indicators determines which variables enter the model. IV is one such indicator: it measures the predictive power of an independent variable. Similar indicators include information gain and the Gini coefficient.

The full name of WOE is "Weight of Evidence". WOE is a form of encoding for an original independent variable. To WOE-encode a variable, the variable must first be grouped (also called discretization or binning; these terms all mean the same thing). After grouping, the WOE of group $i$ is calculated as follows:

$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i/\#y_T}{\#n_i/\#n_T}$$

The formula for IV is as follows:

$$IV_i = (py_i - pn_i)\cdot WOE_i = (py_i - pn_i)\,\ln\frac{py_i}{pn_i}, \qquad IV = \sum_{i=1}^{n} IV_i$$

Here $py_i$ is the proportion of all responding customers in the sample that fall in this group (in a risk model, the responding customers are the defaulting customers; in general, they are the individuals whose target variable takes the value "yes", or 1). $pn_i$ is the proportion of all non-responding customers in the sample that fall in this group. $\#y_i$ is the number of responding customers in this group, $\#n_i$ is the number of non-responding customers in this group, $\#y_T$ is the number of all responding customers in the sample, and $\#n_T$ is the number of all non-responding customers in the sample.

From the formula above, WOE expresses the difference between "the proportion of all responding customers that fall in the current group" and "the proportion of all non-responding customers that fall in the current group", measured as the logarithm of the ratio of the two.

A simple transformation of this formula gives:

$$WOE_i = \ln\frac{\#y_i/\#n_i}{\#y_T/\#n_T}$$

After the transformation, WOE can also be understood as follows: it expresses the difference between the ratio of responding to non-responding customers in the current group and the same ratio in the whole sample. The difference is expressed by taking the ratio of these two ratios and then taking the logarithm. The larger the WOE, the larger this difference and the more likely the samples in this group are to respond; the smaller the WOE, the smaller the difference and the less likely the samples in this group are to respond.
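As a quick numerical check of the two readings of WOE described above (ratio of the two proportions, and ratio of the group odds to the overall odds), the following sketch computes both and confirms they agree. The function names and the counts passed to them are illustrative assumptions.

```python
import math


def woe_from_proportions(resp_in_group, nonresp_in_group, resp_total, nonresp_total):
    """WOE_i = ln(py_i / pn_i), using the within-class proportions."""
    py_i = resp_in_group / resp_total
    pn_i = nonresp_in_group / nonresp_total
    return math.log(py_i / pn_i)


def woe_from_odds(resp_in_group, nonresp_in_group, resp_total, nonresp_total):
    """WOE_i = ln((#y_i / #n_i) / (#y_T / #n_T)), using group odds vs. overall odds."""
    group_odds = resp_in_group / nonresp_in_group
    overall_odds = resp_total / nonresp_total
    return math.log(group_odds / overall_odds)


# Illustrative counts only: 30 responders and 70 non-responders in the group,
# out of 100 responders and 900 non-responders overall.
a = woe_from_proportions(30, 70, 100, 900)
b = woe_from_odds(30, 70, 100, 900)
print(a, b)
```

Both forms give the same value up to floating-point rounding, because the second is an algebraic rearrangement of the first.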

The data in the original file is converted according to the generated WOE mapping table, which completes the preprocessing and cleaning of the data. The data in the original file thereby becomes data that can be used directly for modeling, which improves the accuracy of modeling.
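A minimal sketch of this conversion step is shown below, assuming a categorical feature whose missing values are treated as a group of their own. The feature, its group labels and all counts are made up for illustration.

```python
import math

# Illustrative counts per group: (responders, non-responders). Not from the source.
group_counts = {
    "web": (30, 10),
    "store": (50, 50),
    "missing": (20, 40),  # missing values are treated as their own group
}
resp_total = sum(r for r, _ in group_counts.values())
nonresp_total = sum(n for _, n in group_counts.values())

# WOE mapping table: group label -> ln(py_i / pn_i)
woe_map = {
    label: math.log((r / resp_total) / (n / nonresp_total))
    for label, (r, n) in group_counts.items()
}


def to_modeling_data(raw_values):
    """Replace each raw value (None means missing) with the WOE of its group."""
    return [woe_map["missing" if v is None else v] for v in raw_values]


raw_column = ["web", None, "store", "web"]
encoded = to_modeling_data(raw_column)
print(encoded)
```

After the mapping, categorical values and missing values are all plain numbers, which is what allows the column to be passed to a model directly.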

Step S103: according to the multiple pieces of model information in the preset configuration table and the modeling data, establish a model corresponding to each piece of model information.

Step S104: obtain the data performance of the multiple models and generate a model report corresponding to each model.

Each piece of model information in the preset configuration table, together with the modeling data, is used to establish the model corresponding to that model information, so that a model is established for every piece of model information. The data performance of the multiple models is then obtained, and a model report is generated for each model from its data performance.

In this embodiment, the model information in the preset configuration table includes xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.
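The configuration-table-driven loop of steps S103 and S104 might be organized roughly as in the sketch below. To keep the example self-contained, the trainers are deliberately trivial stand-ins rather than the xgboost, gbdt, lightGBM, catboost or tensorflow models named above; the registry, the table layout, the accuracy metric and the data are all assumptions.

```python
def train_majority(X, y):
    """Stand-in trainer: always predicts the most common label."""
    majority = max(set(y), key=y.count)
    return lambda row: majority


def train_woe_threshold(X, y):
    """Stand-in trainer: predicts 1 when the first WOE-encoded feature is positive."""
    return lambda row: 1 if row[0] > 0 else 0


# Hypothetical registry mapping an algorithm name to a trainer function.
# In practice each entry would wrap a real library model.
TRAINERS = {
    "majority": train_majority,
    "woe_threshold": train_woe_threshold,
}

# Hypothetical preset configuration table: one row per model to build.
CONFIG_TABLE = [
    {"model_id": "m1", "algorithm": "majority"},
    {"model_id": "m2", "algorithm": "woe_threshold"},
]


def build_reports(config_table, X, y):
    """Train one model per configuration row and record its accuracy."""
    reports = []
    for entry in config_table:
        model = TRAINERS[entry["algorithm"]](X, y)
        correct = sum(model(row) == label for row, label in zip(X, y))
        reports.append({**entry, "accuracy": correct / len(y)})
    return reports


# Made-up modeling data: one WOE-encoded feature per row.
X = [[-0.75], [0.0], [0.81], [1.35]]
y = [0, 0, 1, 1]
reports = build_reports(CONFIG_TABLE, X, y)
for report in reports:
    print(report)
```

Adding a model then only requires a new row in the table and a matching registry entry; the loop itself stays unchanged.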

Modeling is performed by selecting model information in the preset configuration table, including xgboost model information, gbdt model information, lightGBM model information, catboost model information, tensorflow model information and the like, and the specific performance of each model is output to generate the model reports. As a concrete example, suppose we build a model that predicts, for each customer in the company's customer base, whether the customer will respond to one of our marketing campaigns. First, 100,000 customers are randomly drawn from the company's customer list for a test campaign, and their responses are collected as our modeling data set; 10,000 of these customers responded. Some variables are extracted for these customers as the candidate variable set of our model, including the following:

1. Whether there was a purchase in the last month;

2. Amount of the most recent purchase;

3. Category of the most recent purchase;

4. Whether the customer is a company VIP.

These variables are first discretized; the resulting statistics are shown in the following tables.

(1) Whether there was a purchase in the last month:

Purchase in the last month	Responded	Did not respond	Total	Response rate
Yes	4000	16000	20000	20%
No	6000	74000	80000	7.5%
Total	10000	90000	100000	10%

(2) Amount of the most recent purchase:

Amount of the most recent purchase	Responded	Did not respond	Total	Response rate
< 100 yuan	2500	47500	50000	5%
[100,200)	3000	27000	30000	10%
[200,500)	3000	12000	15000	20%
>= 500 yuan	1500	3500	5000	30%
Total	10000	90000	100000	10%

(3) Category of the most recent purchase:

Category of the most recent purchase	Responded	Did not respond	Total	Response rate
3C	3000	57000	60000	5%
Cosmetics	2000	18000	20000	10%
Mother and baby	5000	15000	20000	25%
Total	10000	90000	100000	10%

(4) Whether the customer is a company VIP:

Company VIP	Responded	Did not respond	Total	Response rate
Yes	5500	4500	10000	55%
No	4500	85000	90000	5%
Total	10000	90000	100000	10%

Take the variable "amount of the most recent purchase" as an example.

The formula for WOE is:

$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i/\#y_T}{\#n_i/\#n_T}$$

This variable is discretized into 4 segments: < 100 yuan, [100,200), [200,500) and >= 500 yuan. First, according to the WOE formula, the WOE of each of the four segments is:

Amount of the most recent purchase	Responded	Did not respond	Total	Response rate	WOE
< 100 yuan	2500	47500	50000	5%	-0.74721
[100,200)	3000	27000	30000	10%	0
[200,500)	3000	12000	15000	20%	0.81093
>= 500 yuan	1500	3500	5000	30%	1.349927
Total	10000	90000	100000	10%	0
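The WOE column of this table can be reproduced with a few lines of Python. The sketch below takes the responder and non-responder counts from the table and applies WOE_i = ln(py_i / pn_i); the variable names are illustrative.

```python
import math

# (segment label, responders, non-responders), taken from the table above
segments = [
    ("< 100 yuan", 2500, 47500),
    ("[100,200)", 3000, 27000),
    ("[200,500)", 3000, 12000),
    (">= 500 yuan", 1500, 3500),
]
RESP_TOTAL = 10000     # all responding customers in the sample
NONRESP_TOTAL = 90000  # all non-responding customers in the sample

woe_by_segment = {
    label: math.log((resp / RESP_TOTAL) / (nonresp / NONRESP_TOTAL))
    for label, resp, nonresp in segments
}

for label, value in woe_by_segment.items():
    print(f"{label:<12} {value: .6f}")
```

The segment whose response rate equals the overall 10% gets a WOE of 0, segments below it are negative, and segments above it are positive.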

From the calculation results above we can see the basic characteristics of WOE:

First, the higher the response rate in the current group, the larger the WOE value.

Second, the sign of the WOE of the current group is determined by comparing the ratio of responding to non-responding customers in the current group with the same ratio in the whole sample. When the ratio in the current group is smaller than the overall ratio, WOE is negative; when it is larger than the overall ratio, WOE is positive; when the two are equal, WOE is 0.

Third, the value range of WOE is all real numbers.

WOE describes the direction and size of the influence that the current group of a variable has on judging whether an individual will respond. When WOE is positive, the current value of the variable has a positive influence on the judgment that the individual will respond; when WOE is negative, it has a negative influence. The magnitude of the WOE value reflects the size of this influence.

Then the IV value of each of the four groups is calculated separately:

$$IV_i = (py_i - pn_i)\cdot WOE_i, \qquad i = 1, \dots, 4$$

From these IV calculations we can see the following characteristics of IV:

First, for a given group of a variable, the more the ratio of responding to non-responding customers in the group differs from the same ratio in the whole sample, the larger the IV value; otherwise, the IV value is smaller.

Second, in the extreme case where the ratio of responding to non-responding customers in the current group equals the ratio in the whole sample, the IV value is 0.

Third, the value range of IV is [0, +∞), and when the current group contains only responding customers or only non-responding customers, IV = +∞.

Calculate the total IV value of variable:

IV=IV₁+IV₂+IV₃+IV₄=0.492706
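The per-group IV values and their sum for "amount of the most recent purchase" can be checked with the sketch below, which applies IV_i = (py_i - pn_i) * ln(py_i / pn_i) to the counts of the four segments. The helper name `group_iv` is illustrative.

```python
import math

RESP_TOTAL = 10000     # all responding customers in the sample
NONRESP_TOTAL = 90000  # all non-responding customers in the sample

# (responders, non-responders) for < 100, [100,200), [200,500), >= 500 yuan
segments = [(2500, 47500), (3000, 27000), (3000, 12000), (1500, 3500)]


def group_iv(resp, nonresp):
    """IV contribution of one group: (py_i - pn_i) * ln(py_i / pn_i)."""
    py_i = resp / RESP_TOTAL
    pn_i = nonresp / NONRESP_TOTAL
    return (py_i - pn_i) * math.log(py_i / pn_i)


group_ivs = [group_iv(resp, nonresp) for resp, nonresp in segments]
total_iv = sum(group_ivs)

for index, value in enumerate(group_ivs, start=1):
    print(f"IV_{index} = {value:.6f}")
print(f"IV = {total_iv:.6f}")
```

Each contribution is non-negative because the two factors always share the same sign, which is why the total IV can only grow as groups are added.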

The other three variables are calculated in the same way. The IV results of the four variables are as follows.

(1) Amount of the most recent purchase: 0.49270645;

(2) Whether there was a purchase in the last month: 0.250224725;

(3) Category of the most recent purchase: 0.615275563;

(4) Whether the customer is a company VIP: 1.56550367.

The IV ranking of these four variables is: whether the customer is a company VIP > category of the most recent purchase > amount of the most recent purchase > whether there was a purchase in the last month. Thus "whether the customer is a company VIP" is the variable with the highest predictive power, and "whether there was a purchase in the last month" is the variable with the lowest. Variables can be selected from these four in descending order of IV, cleaning out unnecessary or less representative data and thereby improving the generality and accuracy of modeling. With fewer than 500,000 data samples, modeling does not need to go through the spark big data platform; the small number of samples reduces the modeling computation time, improves model iteration efficiency and greatly shortens debugging time. In addition, once packaged, the program runs directly under Linux/Windows, is easy to port, depends little on the system environment and is easy to maintain. Practical business scenarios can be introduced according to each model's data performance. Compared with the packaged interfaces on spark, which cannot be changed, the development process is freer and the implementation is customizable, which improves business service efficiency while improving model accuracy.
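The IV values and the ranking discussed above can be reproduced with the following sketch, which also shows the screening of a "first quantity" of features. The counts are taken as they appear in the four tables, with the sample-wide totals of 10,000 responders and 90,000 non-responders; the English variable names and the value `TOP_K = 3` are assumptions made for illustration.

```python
import math

RESP_TOTAL = 10000     # all responding customers in the sample
NONRESP_TOTAL = 90000  # all non-responding customers in the sample

# (responders, non-responders) per group, as they appear in the four tables
variables = {
    "purchased_last_month": [(4000, 16000), (6000, 74000)],
    "last_purchase_amount": [(2500, 47500), (3000, 27000), (3000, 12000), (1500, 3500)],
    "last_purchase_category": [(3000, 57000), (2000, 18000), (5000, 15000)],
    "is_vip": [(5500, 4500), (4500, 85000)],
}


def information_value(groups):
    """IV = sum over groups of (py_i - pn_i) * ln(py_i / pn_i)."""
    total = 0.0
    for resp, nonresp in groups:
        py_i = resp / RESP_TOTAL
        pn_i = nonresp / NONRESP_TOTAL
        total += (py_i - pn_i) * math.log(py_i / pn_i)
    return total


iv_by_variable = {name: information_value(groups) for name, groups in variables.items()}

# Sort from largest to smallest IV and keep the first TOP_K features.
ranking = sorted(iv_by_variable, key=iv_by_variable.get, reverse=True)
TOP_K = 3  # hypothetical "first quantity"
selected = ranking[:TOP_K]

for name in ranking:
    print(f"{name:<24} {iv_by_variable[name]:.6f}")
print("selected:", selected)
```

With this cutoff the lowest-ranked variable, whether there was a purchase in the last month, is dropped before the WOE conversion.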

After step S104, the method comprises:

iterating each model and optimizing its parameters separately, according to the data performance of the multiple models.

The model report includes the data performance of the model, and each model is iterated and its parameters optimized according to the data performance in the model reports. Specifically, the data performance includes parameter plateaus and parameter islands. A parameter plateau means that there is a fairly wide parameter range within which the model obtains good results; the results generally form an approximately normal distribution centered on the plateau. A parameter island means that the model performs well only when the parameter value lies within a very small range, and that its performance deteriorates markedly once the parameter deviates from that value. An important principle in parameter optimization is therefore to seek parameter plateaus rather than parameter islands. When a model has multiple parameter arrays, the value of one parameter array often affects the distribution of another parameter's plateau. Specifically, parameter optimization can use the stepwise convergence method: one parameter is first optimized on its own and fixed at its best value, then another parameter is optimized and fixed at its best value, and so on in a cycle until the optimization result no longer changes. Take a moving-average crossover trading model as an example, in which the two independent parameters are the short period N1 and the long period N2 of the moving average. First, N2 is fixed at 1 and N1 is tested over the range 1 to 100 to find its best value, which turns out to be 8 and is fixed. Next, N2 is optimized between 1 and 200, giving a best value of 26, which is fixed. N1 is then optimized in a second round, giving a new best value of 10, which is fixed. Finally, N2 is optimized again, giving a best value of 28, which is fixed. The screening continues in this cycle until the optimization result no longer changes. If the optimal parameter values finally obtained are N1 = 10 and N2 = 30, the parameter optimization is complete.
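The stepwise convergence method described above amounts to optimizing one parameter at a time while the other is held fixed. The sketch below illustrates it with a toy scoring function whose peak is placed at N1 = 10, N2 = 30; that function is an assumption standing in for a real evaluation of the trading model, and the intermediate values it produces differ from those quoted in the text.

```python
def toy_score(n1, n2):
    """Hypothetical objective with its maximum at (10, 30).

    The cross term makes the best N1 depend on the current N2,
    so several rounds are needed before the result stops changing.
    """
    x, y = n1 - 10, n2 - 30
    return -(x * x + y * y + x * y)


def stepwise_search(score, n1_range, n2_range, n2_start=1, max_rounds=50):
    """Alternately optimize N1 and N2 until neither value changes."""
    n1, n2 = None, n2_start
    for _ in range(max_rounds):
        new_n1 = max(n1_range, key=lambda a: score(a, n2))      # N2 fixed
        new_n2 = max(n2_range, key=lambda b: score(new_n1, b))  # N1 fixed
        if (new_n1, new_n2) == (n1, n2):
            break  # the optimization result no longer changes
        n1, n2 = new_n1, new_n2
    return n1, n2


best = stepwise_search(toy_score, range(1, 101), range(1, 201))
print(best)
```

Each round costs 100 + 200 evaluations rather than the 20,000 needed for a full grid, at the price of possibly stopping at a local optimum on less well-behaved objectives.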

Of course, another method of parameter optimization is to use a programming platform with stronger computing capability to compute the distribution of the objective function over the parameter arrays directly, and then take the multidimensional difference of that distribution. A difference threshold is defined, and among the regions where the absolute difference is below the threshold, the one with the largest multidimensional volume and the largest inscribed-sphere radius is selected as the most stable parameter values, which completes the model iteration and parameter optimization.
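A one-dimensional simplification of this idea is sketched below: it takes first differences of the score along a single parameter, finds the widest run in which neighbouring scores differ by less than a threshold, and returns the parameter at the middle of that run. The multidimensional volume and inscribed-sphere criteria of the text are not implemented, and the scores and threshold are made-up values.

```python
def most_stable_parameter(values, scores, threshold):
    """Return the parameter at the centre of the widest flat run of scores.

    A run is "flat" while the absolute difference between neighbouring
    scores stays below the threshold.
    """
    best_start, best_len = 0, 0
    start = 0
    for i in range(1, len(scores) + 1):
        flat = i < len(scores) and abs(scores[i] - scores[i - 1]) < threshold
        if not flat:
            run_len = i - start
            if run_len > best_len:
                best_start, best_len = start, run_len
            start = i
    return values[best_start + (best_len - 1) // 2]


# Made-up scores: isolated peaks at parameters 2 and 10, a plateau over 4..8.
values = list(range(1, 11))
scores = [0.10, 0.90, 0.20, 0.60, 0.62, 0.61, 0.63, 0.62, 0.30, 0.95]
stable = most_stable_parameter(values, scores, threshold=0.05)
print(stable)
```

The isolated peaks score higher than the plateau, but the function still prefers the plateau, which is the behaviour the plateau-over-island principle asks for.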

In this embodiment, after step S104, the method comprises:

classifying the multiple model reports according to the data performance of the multiple models;

storing model reports that belong to the same category in the same folder.

Different models may perform differently on the modeling data. In this embodiment, the data performance of a model is graded as excellent, good or poor. The model reports are classified according to the data performance of the models, mainly into these three categories, and reports belonging to the same category are stored in the same folder so that the data performance of the models can later be reviewed by category. For example, if the data performance of the xgboost model is excellent and the data performance of the gbdt model is also excellent, the model reports of the xgboost model and the gbdt model are stored in the same folder.
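One possible way to store the reports by grade is sketched below. The use of AUC as the performance measure, the grade thresholds, the report contents and the JSON file format are all assumptions for illustration; the source only states that reports are graded excellent, good or poor and stored in one folder per category.

```python
import json
import tempfile
from pathlib import Path


def grade(auc):
    """Map a performance score to a category (hypothetical thresholds)."""
    if auc >= 0.80:
        return "excellent"
    if auc >= 0.70:
        return "good"
    return "poor"


def store_reports(reports, root):
    """Write each report into the folder named after its category."""
    for report in reports:
        folder = root / grade(report["auc"])
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{report['model']}.json").write_text(json.dumps(report))


# Made-up reports for three of the model types named in the text.
reports = [
    {"model": "xgboost", "auc": 0.86},
    {"model": "gbdt", "auc": 0.83},
    {"model": "lightgbm", "auc": 0.74},
]

root = Path(tempfile.mkdtemp())
store_reports(reports, root)
layout = {p.name: sorted(f.name for f in p.iterdir()) for p in root.iterdir()}
print(layout)
```

With these made-up scores, the xgboost and gbdt reports end up together in the "excellent" folder, mirroring the example in the text.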

In some embodiments, the generated model reports are classified and stored using missing values, categorical variables, date variables or user information as labels. This makes the differences and connections between the model reports clear and directly visible, so that model iteration and parameter optimization can be carried out on the model corresponding to the data features of each original file.

In summary, the data features in the original file are obtained; IV and WOE calculations are performed on the data features to generate modeling data; and model reports are generated according to the model information in the preset configuration table and the modeling data. With fewer than 500,000 data samples, modeling does not need to go through the spark big data platform; the small number of samples reduces the modeling computation time, improves model iteration efficiency and greatly shortens debugging time. In addition, once packaged, the program runs directly under Linux/Windows, is easy to port, depends little on the system environment and is easy to maintain. Practical business scenarios can be introduced according to each model's data performance. Compared with the packaged interfaces on spark, which cannot be changed, the development process is freer and the implementation is customizable, which improves business service efficiency while improving model accuracy.

As shown in Fig. 2, an embodiment of the present invention proposes a small sample use device 1. The device 1 includes an obtaining module 11, a processing module 12, a model building module 13 and a model report generation module 14.

The obtaining module 11 is used for obtaining the data features from the original file.

In this embodiment, the raw dirty data in the original file is obtained, including missing values, categorical variables, date variables and the like.

In this embodiment, the obtaining module 11 includes:

a conversion module, for converting the hdfs data file on the Hadoop cluster into a csv file;

a read module, for reading the data features in the csv file.

The read module is used for:

inputting the csv file into the stand-alone program and running it;

reading the data features in the csv file.

The processing module 12 is used for performing IV and WOE calculations on the data features in sequence and generating modeling data in which the feature data has been preprocessed and cleaned.

The processing module 12 includes:

a computing module, for performing the IV calculation on each data feature to obtain the IV value of each data feature;

an obtaining module, for sorting the IV values of the data features by numerical value, screening out a first quantity of target IV values in sorted order, and obtaining the target data features corresponding to the target IV values;

a mapping module, for performing the WOE calculation on the target data features to obtain the WOE mapping table of the target data features;

a conversion module, for converting the target data features into modeling data according to the WOE mapping table of the target data features.

The model building module 13 is used for establishing, according to the multiple pieces of model information in the preset configuration table and the modeling data, a model corresponding to each piece of model information.

The model report generation module 14 is used for obtaining the data performance of the multiple models and generating a model report corresponding to each model.

1, whether there is purchase within nearest one month；

2, the last purchase amount of money；

3, the merchandise classification of a nearest purchase；

It 4, whether is company VIP client；

(1) whether there is purchase within nearest one month:

(2) the last purchase amount of money:

(3) merchandise classification of a nearest purchase:

(4) whether it is company VIP client:

By taking " the last time purchase amount of money " variable as an example:

The calculation formula of WOE are as follows:

Third, the value range of WOE is all real numbers.

Then the IV value of each of the four groups is calculated separately:

Calculate the total IV value of the variable:

IV=IV₁+IV₂+IV₃+IV₄=0.492706
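The per-group and total IV calculation can be sketched as follows. The group boundaries and the counts of responders and non-responders are illustrative assumptions, chosen so that the total is consistent with the value stated above; they are not given in the text. The sketch uses the common definition IV_i = (py_i - pn_i) * ln(py_i / pn_i).

```python
import math

# Illustrative counts (assumed): (responders, non-responders) per group of
# the "amount of the most recent purchase" variable.
groups = {
    "<100": (2500, 47500),
    "[100,200)": (3000, 27000),
    "[200,500)": (3000, 12000),
    ">=500": (1500, 3500),
}

total_pos = sum(p for p, _ in groups.values())
total_neg = sum(n for _, n in groups.values())


def group_iv(pos, neg):
    """IV contribution of one group: (py - pn) * ln(py / pn)."""
    py = pos / total_pos
    pn = neg / total_neg
    return (py - pn) * math.log(py / pn)


iv_by_group = {name: group_iv(p, n) for name, (p, n) in groups.items()}
total_iv = sum(iv_by_group.values())
print(iv_by_group, total_iv)
```

With these assumed counts the total comes out at about 0.4927. A group whose share of responders equals its share of non-responders contributes nothing to the total.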

(1) Amount of the most recent purchase: 0.49270645;

(2) Whether there was a purchase within the most recent month: 0.250224725;

(3) Merchandise category of the most recent purchase: 0.615275563;

(4) Whether the customer is a company VIP customer: 1.56550367.
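Using the four IV values listed above, the sorting and screening step can be sketched as follows. The descending order, the identifier names and the first quantity of 2 are assumptions made for illustration.

```python
# IV values as listed above, keyed by hypothetical feature names.
iv_values = {
    "amount_of_last_purchase": 0.49270645,
    "purchased_in_last_month": 0.250224725,
    "category_of_last_purchase": 0.615275563,
    "is_company_vip": 1.56550367,
}


def select_top_features(iv_by_feature, first_quantity):
    """Sort features by IV, largest first, and keep the first `first_quantity`."""
    ranked = sorted(iv_by_feature, key=iv_by_feature.get, reverse=True)
    return ranked[:first_quantity]


FIRST_QUANTITY = 2  # hypothetical value of the "first quantity"
target_features = select_top_features(iv_values, FIRST_QUANTITY)
print(target_features)
```

With these values the VIP flag and the merchandise category are kept, and the other two variables are dropped before the WOE conversion.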

Device 1 includes:

An optimization module, configured to iterate each model and optimize its parameters according to the data performance of the multiple models.
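One simple way to picture iteration and parameter optimization is a search over candidate parameter values that keeps the best-scoring one. The sketch below is an assumption about how such a loop might look, using a stand-in threshold model and accuracy as the data performance measure; it is not the optimization procedure of the embodiment.

```python
def accuracy(cutoff, features, labels):
    """Score a stand-in threshold model on the modeling data."""
    preds = [1 if x > cutoff else 0 for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def optimize_cutoff(candidates, features, labels):
    """Try each candidate parameter and keep the best-scoring one."""
    best_cutoff, best_score = None, -1.0
    for cutoff in candidates:  # one iteration per candidate value
        score = accuracy(cutoff, features, labels)
        if score > best_score:
            best_cutoff, best_score = cutoff, score
    return best_cutoff, best_score


# Toy modeling data: one WOE-encoded feature and a binary target.
features = [-0.8, -0.2, 0.3, 0.9]
labels = [0, 0, 1, 1]

best_cutoff, best_score = optimize_cutoff([-0.5, 0.0, 0.5], features, labels)
```

On a small sample such a loop finishes quickly on a single machine, which is the situation the embodiment targets.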

In the present embodiment, device 1 includes:

A classification module, configured to classify the multiple model reports according to the data performance of the multiple models;

A storage module, configured to store model reports belonging to the same category in the same folder.
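The classification and storage modules can be sketched as follows. The accuracy threshold, the category names and the JSON report format are hypothetical; the only behaviour taken from the text is that reports of the same category end up in the same folder.

```python
import json
import tempfile
from pathlib import Path


def classify_report(report, threshold=0.8):
    """Hypothetical rule: bucket a report by its accuracy."""
    return "good" if report["accuracy"] >= threshold else "needs_tuning"


def store_reports(reports, root):
    """Write each report as JSON into a folder named after its category."""
    paths = []
    for report in reports:
        folder = Path(root) / classify_report(report)
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{report['model_id']}.json"
        path.write_text(json.dumps(report), encoding="utf-8")
        paths.append(path)
    return paths


# Toy reports with made-up scores.
reports = [
    {"model_id": "m1", "accuracy": 0.91},
    {"model_id": "m2", "accuracy": 0.88},
    {"model_id": "m3", "accuracy": 0.62},
]

root = tempfile.mkdtemp()  # temporary folder so the sketch leaves no files behind in the project
saved = store_reports(reports, root)
```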

As shown in Fig. 3, an embodiment of the present application also provides a piece of computer equipment, which may be a server and whose internal structure may be as shown in Fig. 3. The computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer equipment provides computing and control capability. The memory of the computer equipment includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer equipment stores data such as the models of the small sample application method. The network interface of the computer equipment communicates with external terminals through a network connection. The computer program, when executed by the processor, implements the small sample application method.

The processor executes the steps of the small sample application method: obtaining the data features in an original file; performing IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning; establishing, according to the multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each piece of model information; and obtaining the data performance of the multiple models to generate the model report corresponding to each model.

In one embodiment, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models, the method includes:

In one embodiment, the step of obtaining the data features in the original file includes:

Convert the hdfs data file on the Hadoop cluster into a csv file;

Read the data features in the csv file.

In one embodiment, the step of reading the data features in the csv file includes:

Input the csv file into the stand-alone program to run;

Read the data features in the csv file.
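The reading step can be sketched with the parameter file fields named in claim 4 (model ID, data file name, data ID column, rejected feature columns, target feature column and model algorithm). The key names, the column names and the sample rows below are hypothetical, and an in-memory string stands in for the csv file so that the sketch runs as written.

```python
import csv
import io

# Hypothetical parameter file contents, using the fields named in claim 4.
params = {
    "model_id": "demo_001",
    "data_file": "sample.csv",
    "id_column": "customer_id",
    "rejected_columns": ["phone"],
    "target_column": "responded",
    "algorithm": "xgboost",
}

# In-memory stand-in for the csv file (made-up rows).
csv_text = """customer_id,phone,last_amount,is_vip,responded
1,555-0100,120,yes,1
2,555-0101,40,no,0
"""


def read_features(handle, params):
    """Return (feature_names, feature_rows, targets) from a csv handle."""
    skip = {
        params["id_column"],
        params["target_column"],
        *params["rejected_columns"],
    }
    reader = csv.DictReader(handle)
    feature_names = [c for c in reader.fieldnames if c not in skip]
    rows, targets = [], []
    for record in reader:
        rows.append({c: record[c] for c in feature_names})
        targets.append(int(record[params["target_column"]]))
    return feature_names, rows, targets


feature_names, rows, targets = read_features(io.StringIO(csv_text), params)
```

For a real file, the handle would come from `open(params["data_file"], newline="")` instead of `io.StringIO`.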

In one embodiment, the step of performing IV and WOE calculations on the data features in sequence and generating modeling data, that is, the feature data after preprocessing and cleaning, includes:

In one embodiment, the multiple pieces of model information in the preset configuration table include xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.

Store model reports belonging to the same category in the same folder.

Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of the present application, and does not limit the computer equipment to which the solution of the present application is applied.

The computer equipment of the embodiment of the present application obtains the data features in the original file, performs WOE and IV calculations on the data features to generate modeling data, and generates model reports according to the model information in the preset configuration table and the modeling data. With fewer than 500,000 data samples, modeling does not need to go through the spark big data platform; the small number of samples reduces the modeling computation time, improves model iteration efficiency and greatly shortens debugging time. After being packaged, the program runs directly in a Linux/Windows environment, so it is easy to port, depends little on the system environment and is convenient to maintain. Models can be imported according to their data performance and the actual business scenario. Compared with the packaged, unchangeable interfaces on spark, the development process is freer and the implementation is customizable, which improves business service efficiency while improving model accuracy.

An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the small sample application method, specifically: obtaining the data features in an original file; performing IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning; establishing, according to the multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each piece of model information; and obtaining the data performance of the multiple models to generate the model report corresponding to each model.

Convert the hdfs data file on the Hadoop cluster into a csv file;

Read the data features in the csv file.

Input the csv file into the stand-alone program to run;

Read the data features in the csv file.

The storage medium of the embodiment of the present application obtains the data features in the original file, performs WOE and IV calculations on the data features to generate modeling data, and generates model reports according to the model information in the preset configuration table and the modeling data. With fewer than 500,000 data samples, modeling does not need to go through the spark big data platform; the small number of samples reduces the modeling computation time, improves model iteration efficiency and greatly shortens debugging time. After being packaged, the program runs directly in a Linux/Windows environment, so it is easy to port, depends little on the system environment and is convenient to maintain. Models can be imported according to their data performance and the actual business scenario. Compared with the packaged, unchangeable interfaces on spark, the development process is freer and the implementation is customizable, which improves business service efficiency while improving model accuracy.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the present application and its embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A small sample application method, characterized in that the method comprises:

Obtaining the data features in an original file;

Performing IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning;

Establishing, according to the multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each of the multiple pieces of model information;

2. The small sample application method according to claim 1, characterized in that, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models, the method comprises:

3. The small sample application method according to claim 1, characterized in that the step of obtaining the data features in the original file comprises:

Converting the hdfs data file on the Hadoop cluster into a csv file;

Reading the data features in the csv file.

4. The small sample application method according to claim 3, characterized in that the step of reading the data features in the csv file comprises:

Configuring the parameter file required for the stand-alone program to run on the csv file, the parameter file including a model ID, a data file name, a data ID column, feature columns to be rejected from the data, a target feature column and a model algorithm;

Inputting the csv file into the stand-alone program to run;

Reading the data features in the csv file.

5. The small sample application method according to claim 1, characterized in that the step of performing IV and WOE calculations on the data features in sequence and generating modeling data, that is, the feature data after preprocessing and cleaning, comprises:

Sorting the IV values of the data features by magnitude, selecting a first quantity of target IV values according to the sorted order, and obtaining the target data features corresponding to the target IV values;

6. The small sample application method according to claim 1, characterized in that the multiple pieces of model information in the preset configuration table include xgboost model information, gbdt model information, lightGBM model information, catboost model information and tensorflow model information.

7. The small sample application method according to claim 1, characterized in that, after the step of obtaining the data performance of the multiple models and generating the model report corresponding to each of the multiple models, the method comprises:

Classifying the multiple model reports according to the data performance of the multiple models;

Storing model reports belonging to the same category in the same folder.

8. A small sample application device, characterized in that the device comprises:

A processing module, configured to perform IV and WOE calculations on the data features in sequence to generate modeling data, that is, the feature data after preprocessing and cleaning;

A model building module, configured to establish, according to the multiple pieces of model information in a preset configuration table and the modeling data, the model corresponding to each of the multiple pieces of model information;

A model report generation module, configured to obtain the data performance of the multiple models and generate the model report corresponding to each of the multiple models.

9. A computer equipment, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.