CN111079937A - Rapid modeling method - Google Patents

Rapid modeling method

Info

Publication number
CN111079937A
CN111079937A
Authority
CN
China
Prior art keywords
model
data
processing
configuration file
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911121863.3A
Other languages
Chinese (zh)
Inventor
盛森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jinzhiqu Information Technology Co Ltd
Original Assignee
Suzhou Jinzhiqu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jinzhiqu Information Technology Co Ltd filed Critical Suzhou Jinzhiqu Information Technology Co Ltd
Priority to CN201911121863.3A priority Critical patent/CN111079937A/en
Publication of CN111079937A publication Critical patent/CN111079937A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a rapid modeling method comprising the following steps: reading configuration file parameters, checking the input data, and performing data type conversion; preprocessing: handling missing values and outliers, encoding categorical variables, processing time variables, and correcting data imbalance; feature derivation: deriving new variables from the input raw variables according to the configuration file; feature selection: performing cascaded feature filtering; training the algorithm models; evaluating the models; measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets. The invention reduces poor learning results caused by differences in the experience and skill of algorithm engineers, greatly lowers the barrier to applying machine learning, offers high extensibility and usability, and allows its functions to be combined flexibly to meet changing practical requirements.

Description

Rapid modeling method
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a rapid modeling method.
Background
With the development of big data and machine learning, more and more companies mine valuable information from their data and search for patterns in big data through machine learning. Traditionally, however, professional algorithm engineers are required to clean the data, generate features, select features, tune classifiers, and choose suitable metrics, and the whole process is carried out manually, step by step. Because data formats differ and the useful features differ between businesses, engineers must adapt to each situation, and the personal experience of each engineer influences the final result. The specific technical drawbacks are:
1. Algorithm engineers are in seriously short supply on the market, their skill levels are uneven, and experienced engineers are rarer still.
2. Although the processing steps involved in machine learning have much in common, most engineers still repeat them manually.
3. Practical experience in applying algorithms is crucial to machine learning results, but such hard-won experience is usually difficult to obtain.
Disclosure of Invention
The invention aims to provide a rapid modeling method that standardizes machine learning by extracting the common steps of the machine learning process and driving them through a configuration file, so as to lower the demands on algorithm engineers and quickly obtain a machine learning result on a given data set.
The invention provides a rapid modeling method, characterized by comprising the following steps:
step 1, reading configuration file parameters, checking the input data, and performing data type conversion;
step 2, preprocessing the data, the preprocessing comprising missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling;
step 3, feature derivation: deriving new variables from the input raw variables according to the configuration file, the raw variables comprising the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior;
step 4, feature selection: performing cascaded feature filtering using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model;
step 5, training the algorithm models, the models comprising a random forest, XGBoost, an SVM, and an artificial neural network;
step 6, evaluating the models of step 5 and scoring the prediction samples according to each model's output and weight;
step 7, measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
Compared with the prior art, the invention has the following beneficial effects:
poor learning results caused by differences in the experience and skill of algorithm engineers are reduced, the barrier to applying machine learning is greatly lowered, the framework is highly extensible and usable, and all of its functions can be combined flexibly to meet changing practical requirements.
Drawings
FIG. 1 is a distribution plot of two data sets of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
This embodiment provides a rapid modeling method comprising the following steps:
Step 1: read the configuration file parameters, check the input data, and perform data type conversion.
The configuration file parameters are set as follows: model parameters are chosen according to backtest results (for example, parameter values obtained by grid search are configured for some of the models).
The basis for the parameter configuration: during the backtest phase, each set of parameters is evaluated and selected by its recall and f1 score.
The parameters specifically comprise: operating-environment parameters, data-reading limits (time window, amount of data required for model training), and model parameter settings (including feature function switches, data preprocessing parameters, dimension expansion and reduction parameters, encoding parameters, and derivation parameters), about 50 parameters in total.
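As an illustration of how such a configuration file might be organized, the sketch below models the three parameter groups as a nested dictionary. All section and key names here are hypothetical; the patent only names the categories and the approximate parameter count.

```python
# Hypothetical configuration layout for step 1. Every key name below is an
# illustrative assumption; only the three categories come from the text.
config = {
    "environment": {"n_jobs": 4, "random_seed": 42},
    "data_reading": {"time_window_days": 90, "max_rows": 1_000_000},
    "model": {
        "feature_switches": {"time_weighting": True, "one_hot": True},
        "preprocessing": {"outlier_sigma": 6, "missing_strategy": "mean"},
        "derivation": {"expand_statistics": True},
    },
}

def get_param(cfg, section, key, default=None):
    """Read one parameter from a section, with a fallback default."""
    return cfg.get(section, {}).get(key, default)

print(get_param(config, "model", "preprocessing")["outlier_sigma"])  # 6
```

A flat file (YAML, INI) would serve equally well; the point is that every knob the pipeline exposes lives in one declarative place.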
The data types fall into three categories: numeric, character, and time. Numeric values are subdivided into floating-point and integer and converted accordingly in code; when numeric conversion fails, the value is converted as a character type. Finally, time values are converted to timestamps, because all data fed into the model must be numeric (or 0-1 encoded).
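The three-way type split described above could be sketched with pandas as follows; the helper name and the sample frame are illustrative, not part of the patent.

```python
import pandas as pd

def coerce_column(s: pd.Series) -> pd.Series:
    """Convert a column to numeric where possible, datetimes to Unix
    timestamps, and fall back to character type otherwise, mirroring
    the step-1 rule that model inputs must end up numeric."""
    if pd.api.types.is_datetime64_any_dtype(s):
        return s.astype("int64") // 10**9        # datetime -> Unix seconds
    num = pd.to_numeric(s, errors="coerce")
    if num.notna().sum() == s.notna().sum():     # every non-null value parsed
        return num
    return s.astype(str)                          # character-type fallback

df = pd.DataFrame({
    "a": ["1", "2", "3"],                         # parses fully -> numeric
    "b": ["x", "2", "y"],                         # mixed -> stays character
    "t": pd.to_datetime(["2019-11-15", "2019-11-16", "2019-11-17"]),
})
df = df.apply(coerce_column)
```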
Step 2: preprocess the data. Preprocessing comprises missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling.
Rows with missing values in important fields are removed directly; missing values in non-important fields are filled with the column mean or with 0.
Outliers lying outside six standard deviations (6σ) are removed.
Categorical variables are encoded in two ways: direct label encoding and one-hot encoding.
Time variables are converted into weights along the time dimension.
Data balancing uses two methods, oversampling and downsampling, together with weighting of the minority class. For example, where the original positive-to-negative sample ratio is 0.025:1, downsampling brings it to 0.8:1.
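A minimal sketch of the preprocessing rules above, assuming a pandas DataFrame with a 0/1 label column. The function name, the column roles, and the exact order of imputation are assumptions; the drop/mean-fill rule, the 6σ cut, and the 0.8:1 downsampling target come from the text.

```python
import pandas as pd

def preprocess(df, important_cols, label_col, neg_pos_target=1 / 0.8):
    """Drop rows missing important fields, mean/0-fill the rest, remove
    6-sigma outliers, then downsample negatives toward a 0.8:1
    positive-to-negative ratio (i.e. ~1.25 negatives per positive)."""
    df = df.dropna(subset=important_cols)
    df = df.fillna(df.mean(numeric_only=True)).fillna(0)
    num_cols = df.select_dtypes("number").columns.drop(label_col)
    for c in num_cols:                      # keep values within mean ± 6σ
        m, s = df[c].mean(), df[c].std()
        if s > 0:
            df = df[(df[c] - m).abs() <= 6 * s]
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    neg = neg.sample(n=min(len(neg), int(len(pos) * neg_pos_target)),
                     random_state=0)        # downsample the majority class
    return pd.concat([pos, neg])
```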
Step 3: feature derivation. New variables are derived from the input raw variables according to the configuration file; the raw variables include the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior (aggregated by time period).
Derivation method: a time weight is constructed from the time period of each user access; per-tag access fields are constructed from the user tags; and finally the fields are expanded with statistical aggregates using common sklearn utilities. There are about 120 fields before expansion and up to about 1000 afterwards.
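The derivation recipe above (time weights, per-tag fields, statistical expansion) could look roughly like this; the event schema, the recency-weight formula, and the aggregate choices are invented for illustration.

```python
import pandas as pd

# Hypothetical access-event log: user, tag of the visited content, timestamp.
events = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "tag": ["news", "sport", "news"],
    "ts":  pd.to_datetime(["2019-11-01", "2019-11-10", "2019-11-14"]),
})
now = pd.Timestamp("2019-11-15")
events["time_weight"] = 1 / (1 + (now - events["ts"]).dt.days)  # recent = heavier

# Per-tag visit counts per user, then statistical expansion of the weight.
per_tag = events.pivot_table(index="user", columns="tag",
                             values="time_weight", aggfunc="count",
                             fill_value=0)
stats = events.groupby("user")["time_weight"].agg(["mean", "max", "sum"])
features = per_tag.join(stats)   # a few raw fields expand into many derived ones
```

Applying several aggregates over several groupings is what multiplies ~120 raw fields into ~1000 derived ones.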
Step 4: feature selection. Cascaded feature filtering is performed using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model.
First, highly collinear fields are removed using sklearn's collinearity utilities. Second, fields are filtered by their variance: fields with no variability are removed directly, and according to the backtest results a variance threshold of about 0.1 works well. Next, each variable is randomly split into two parts for a chi-square test: if the Q-statistic is not significant, the variable is considered usable and drawn from the same distribution; if it is significant, the variable is removed. Then the IV value is used: the data are binned, and if the predictive power of each bin for the dependent variable is generally weak, the IV value is low and the variable can be removed. Variables are then measured by mutual information, and those above the 95th percentile of the mutual-information distribution are removed; clustering and stepwise regression screen the independent variables to remove highly correlated ones. Finally, the tree models of the modeling stage rank the remaining variables, and the variables with higher weights proceed to the next modeling step.
In addition, a decision-tree algorithm (CART/C5.0) is used to compute information gain, and fields with larger gain are retained.
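Two stages of the cascade above, variance filtering at the reported ~0.1 threshold followed by collinearity removal, can be sketched with scikit-learn and NumPy; the remaining stages (chi-square, IV, mutual information, stepwise regression, tree ranking) would chain on in the same style. The function name and the 0.95 correlation cutoff are assumptions.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def cascade_filter(X, var_thresh=0.1, corr_thresh=0.95):
    """First stage: drop low-variance columns. Second stage: for each
    highly correlated column pair, drop the later column."""
    keep = VarianceThreshold(var_thresh).fit(X).get_support(indices=True)
    X = X[:, keep]
    corr = np.corrcoef(X, rowvar=False)
    drop = set()
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if j not in drop and abs(corr[i, j]) > corr_thresh:
                drop.add(j)                 # drop the later of the pair
    cols = [c for c in range(X.shape[1]) if c not in drop]
    return X[:, cols]
```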
Step 5: train the algorithm models. There are four algorithm modules (models): random forest, XGBoost, SVM, and artificial neural network.
During training, the samples are divided into a training set, a test set, and a validation set. For the random forest, 1500 trees are trained; each tree draws a random subset of samples (ratio α) and of fields (ratio β), and the split criterion for information gain is the Gini coefficient. The out-of-bag samples of the training set are then backtested; after the backtest, the validation set is used to observe the model's performance, and if the backtest is poor, the number of trees, α, and β are adjusted. Finally, once the metrics are stable (recall and f1 no longer fluctuate excessively), the test set is evaluated.
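The random-forest recipe above maps naturally onto scikit-learn's `RandomForestClassifier`, with `max_samples` and `max_features` standing in for the α and β ratios. The dataset, the split sizes, and the α, β values below are illustrative; only the 1500 trees and the Gini criterion come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

alpha, beta = 0.8, 0.5   # sample and field ratios; tunable, as in the text

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=0)

rf = RandomForestClassifier(
    n_estimators=1500, criterion="gini",     # 1500 trees, Gini splits
    max_samples=alpha, max_features=beta,    # per-tree row/column sampling
    oob_score=True, random_state=0, n_jobs=-1,
).fit(X_tr, y_tr)

# Validation guides tuning; the test set is scored only once metrics settle.
val_f1 = f1_score(y_val, rf.predict(X_val))
test_recall = recall_score(y_te, rf.predict(X_te))
```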
Step 6: evaluate the models of step 5 and score the prediction samples according to each model's output and weight.
Evaluation uses the AUC, ROC, and f1 values. The evaluation process groups the data set N times; each group is treated as a brand-new sample set, split into training, test, and validation parts, and fed into the models for testing. The models are then scored from the N evaluation results (for example, if the mean f1 over N runs is 0.4 for the random forest, 0.5 for XGBoost, 0.6 for the SVM, and 0.7 for the artificial neural network, softmax is applied to the four means to obtain the model weights). Finally, the prediction samples are scored according to each model's output and weight.
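The softmax weighting in the example above can be sketched directly. The four f1 means are the ones the text gives; the per-sample model scores in the last lines are invented to show how the weights are applied.

```python
import numpy as np

def softmax_weights(scores):
    """Turn per-model mean f1 scores into ensemble weights that sum to 1."""
    z = np.exp(np.asarray(scores) - np.max(scores))  # subtract max for stability
    return z / z.sum()

# Mean f1 over N runs: random forest, XGBoost, SVM, ANN (values from the text).
f1_means = np.array([0.4, 0.5, 0.6, 0.7])
w = softmax_weights(f1_means)

# Final score of one prediction sample: weighted sum of the four model scores.
model_scores = np.array([0.30, 0.55, 0.62, 0.80])  # illustrative model outputs
final = float(w @ model_scores)
```

Softmax keeps every model in the ensemble while giving stronger models proportionally more say.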
Step 7: measure the distance between pairs of data sets. Several distance metrics are applied to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
The pairs of data sets are training vs. test, training vs. prediction, and test vs. prediction. The KL divergence (also called relative entropy) can be understood as a measure of the similarity of two distributions. As shown in FIG. 1, the two data sets have visibly different distributions, and the KL divergence between them is large, indicating that the two distributions are dissimilar. Variable filtering is the same as in step 4.
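A histogram-based KL divergence between one variable's distribution in two data sets, as used in this step, might be sketched as follows; the shared binning scheme and the epsilon smoothing of empty bins are implementation assumptions.

```python
import numpy as np

def kl_divergence(a, b, bins=20):
    """KL(P || Q) between the empirical distributions of one variable in
    two data sets, estimated over a shared histogram grid."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    eps = 1e-9                      # avoid log(0) and division by zero
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
same = kl_divergence(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = kl_divergence(rng.normal(0, 1, 5000), rng.normal(3, 1, 5000))
# A large value flags a big distributional gap between the two sets,
# e.g. between the training set and the prediction set.
```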
The technical effects of the invention are:
1. The combination of feature selection methods and the default parameter settings within the software system.
Different models combine features differently: all features form a feature pool; each tree in a random forest uses a different feature combination; the gradient-boosted trees also select features randomly; the SVM uses the full feature set; and the artificial neural network predicts from high-level semantic features obtained by nonlinear combination of the input features.
2. For the distance comparison between two data sets, multiple distance metrics are first applied to the same variable, and then cosine distances are computed over all variables, yielding an overall comparison of the two data sets.
For example, to measure how a variable A is distributed across different data sets, the KL divergence, Euclidean distance, and cross entropy are computed, and the consistency of the variable's distribution across data sets is observed. Whether the distance between variables changes with the data set is verified as follows: if the cosine distance between variables A and B differs greatly between data set 1 and data set 2, the split of the data is considered uneven; the distances are generally judged at the 95% statistical significance level.
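The cosine-distance check described above can be sketched as follows. Variables A and B and the random split are synthetic, and the 95% significance test is omitted for brevity; a small gap between the two halves is what an even split should produce.

```python
import numpy as np

def cosine_dist(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)     # B strongly related to A

idx = rng.permutation(1000)
s1, s2 = idx[:500], idx[500:]           # a random, hence even, split
d1 = cosine_dist(a[s1], b[s1])          # A-B distance inside data set 1
d2 = cosine_dist(a[s2], b[s2])          # A-B distance inside data set 2
gap = abs(d1 - d2)                      # small gap -> the split looks even
```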
3. The feature derivation section proposes group-wise derivation to reduce combinatorial explosion and improve the effectiveness of combination.
For example, for a random forest, the model not only randomly selects and combines the user-supplied dimensions when growing trees, but the dimensions are also grouped on input: all data fed back from the user side form one group, including behavior data and the time and frequency of the behavior; all manually labeled or constructed dimensions form another group, including time weights and frequency statistics; and all binary dimensions form a further factor group, including all the categorical groups. Grouping effectively reduces cross-group interaction, which reduces combinatorial explosion and improves the model's operational efficiency.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (1)

1. A rapid modeling method, comprising:
step 1, reading configuration file parameters, checking the input data, and performing data type conversion;
step 2, preprocessing the data, the preprocessing comprising missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling;
step 3, feature derivation: deriving new variables from the input raw variables according to the configuration file, the raw variables comprising the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior;
step 4, feature selection: performing cascaded feature filtering using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model;
step 5, training the algorithm models, the models comprising a random forest, XGBoost, an SVM, and an artificial neural network;
step 6, evaluating the models of step 5 and scoring the prediction samples according to each model's output and weight;
step 7, measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
CN201911121863.3A 2019-11-15 2019-11-15 Rapid modeling method Pending CN111079937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911121863.3A CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911121863.3A CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Publications (1)

Publication Number Publication Date
CN111079937A true CN111079937A (en) 2020-04-28

Family

ID=70311108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911121863.3A Pending CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Country Status (1)

Country Link
CN (1) CN111079937A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241892A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 A kind of Data Modeling Method and device
KR101864286B1 (en) * 2017-11-10 2018-07-04 주식회사 한컴엠디에스 Method and apparatus for using machine learning algorithm
CN108363714A (en) * 2017-12-21 2018-08-03 北京至信普林科技有限公司 A kind of method and system for the ensemble machine learning for facilitating data analyst to use
CN109523316A (en) * 2018-11-16 2019-03-26 杭州珞珈数据科技有限公司 The automation modeling method of commerce services model


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101950A (en) * 2020-09-27 2020-12-18 中国建设银行股份有限公司 Suspicious transaction monitoring model feature extraction method and device
CN112101950B (en) * 2020-09-27 2024-05-10 中国建设银行股份有限公司 Suspicious transaction monitoring model feature extraction method and suspicious transaction monitoring model feature extraction device
CN112132111A (en) * 2020-10-10 2020-12-25 安徽江淮汽车集团股份有限公司 Parking typical scene extraction method, device, storage medium and device
CN112734560A (en) * 2020-12-31 2021-04-30 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN112734560B (en) * 2020-12-31 2024-05-14 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN113782212A (en) * 2021-04-19 2021-12-10 东华医为科技有限公司 Data processing system
CN113553256A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 AB test method and device and electronic equipment
CN113553256B (en) * 2021-06-18 2023-07-14 北京百度网讯科技有限公司 AB test method and device and electronic equipment
CN114722746A (en) * 2022-05-24 2022-07-08 苏州浪潮智能科技有限公司 Chip aided design method, device, equipment and readable medium
CN114722746B (en) * 2022-05-24 2022-11-01 苏州浪潮智能科技有限公司 Chip aided design method, device and equipment and readable medium

Similar Documents

Publication Publication Date Title
CN111079937A (en) Rapid modeling method
CN110866819A (en) Automatic credit scoring card generation method based on meta-learning
CN111738462B (en) Fault first-aid repair active service early warning method for electric power metering device
CN104503874A (en) Hard disk failure prediction method for cloud computing platform
CN110930198A (en) Electric energy substitution potential prediction method and system based on random forest, storage medium and computer equipment
CN111614491A (en) Power monitoring system oriented safety situation assessment index selection method and system
CN111461216A (en) Case risk identification method based on machine learning
CN111583012B (en) Method for evaluating default risk of credit, debt and debt main body by fusing text information
Umayaparvathi et al. Attribute selection and customer churn prediction in telecom industry
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN112966962A (en) Electric business and enterprise evaluation method
CN111986027A (en) Abnormal transaction processing method and device based on artificial intelligence
Zhang et al. A fault diagnosis method of power transformer based on cost sensitive one-dimensional convolution neural network
CN113450004A (en) Power credit report generation method and device, electronic equipment and readable storage medium
CN114155025A (en) Random forest based electric vehicle charging station charging loss user prediction method
CN113779785A (en) Deconstruction model and deconstruction method of digital twin complex equipment
Ackermann et al. Black-box learning of parametric dependencies for performance models
CN117674119A (en) Power grid operation risk assessment method, device, computer equipment and storage medium
CN103136440A (en) Method and device of data processing
CN117131425A (en) Numerical control machine tool processing state monitoring method and system based on feedback data
CN116091206A (en) Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN114626940A (en) Data analysis method and device and electronic equipment
CN113935819A (en) Method for extracting checking abnormal features
CN113886592A (en) Quality detection method for operation and maintenance data of power information communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination