CN111079937A - Rapid modeling method - Google Patents

Rapid modeling method

Info

Publication number
CN111079937A
CN111079937A
Authority
CN
China
Prior art keywords
model
data
processing
configuration file
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911121863.3A
Other languages
Chinese (zh)
Inventor
盛森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Jinzhiqu Information Technology Co Ltd
Original Assignee
Suzhou Jinzhiqu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jinzhiqu Information Technology Co Ltd filed Critical Suzhou Jinzhiqu Information Technology Co Ltd
Priority to CN201911121863.3A priority Critical patent/CN111079937A/en
Publication of CN111079937A publication Critical patent/CN111079937A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a rapid modeling method comprising the following steps: reading configuration file parameters, checking the input data, and performing data type conversion; preprocessing: handling missing values and outliers, encoding categorical variables, processing time variables, and correcting data imbalance; feature derivation: deriving new variables from the input raw variables according to the configuration file; feature selection: performing cascaded feature filtering; training the algorithm models; evaluating the models; measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets. The invention reduces poor learning results caused by differences in the experience and skill of algorithm engineers, greatly lowers the barrier to applying machine learning, offers high extensibility and usability, and allows its functions to be combined flexibly to meet changing practical requirements.

Description

Rapid modeling method
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a rapid modeling method.
Background
With the development of big data and machine learning, more and more companies mine valuable information from their data and search for patterns in big data through machine learning. Traditionally, however, professional algorithm engineers are required to clean the data, generate features, select features, tune classifiers, and choose suitable metrics, and the whole process is carried out manually, step by step. Because data formats differ and the useful features differ between businesses, engineers must adapt to each situation, and the personal experience of each engineer influences the final result. The specific technical drawbacks are:
1. Algorithm engineers are in seriously short supply on the market, their skill levels are uneven, and experienced engineers are rarer still.
2. Although the processing steps involved in machine learning have much in common, most engineers still repeat them manually.
3. Practical experience in applying algorithms is crucial to machine learning results, but such hard-won experience is usually difficult to obtain.
Disclosure of Invention
The invention aims to provide a rapid modeling method that standardizes machine learning by extracting the common steps of the machine learning process and driving them through a configuration file, so as to lower the demands on algorithm engineers and quickly obtain a machine learning result on a given data set.
The invention provides a rapid modeling method, characterized by comprising the following steps:
step 1, reading configuration file parameters, checking the input data, and performing data type conversion;
step 2, preprocessing the data, the preprocessing comprising missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling;
step 3, feature derivation: deriving new variables from the input raw variables according to the configuration file, the raw variables comprising the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior;
step 4, feature selection: performing cascaded feature filtering using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model;
step 5, training the algorithm models, the models comprising a random forest, XGBoost, an SVM, and an artificial neural network;
step 6, evaluating the models of step 5 and scoring the prediction samples according to each model's output and weight;
step 7, measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
Compared with the prior art, the invention has the following beneficial effects:
poor learning results caused by differences in the experience and skill of algorithm engineers are reduced, the barrier to applying machine learning is greatly lowered, the framework is highly extensible and usable, and all of its functions can be combined flexibly to meet changing practical requirements.
Drawings
FIG. 1 is a distribution plot of two data sets of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
This embodiment provides a rapid modeling method comprising the following steps:
Step 1: read the configuration file parameters, check the input data, and perform data type conversion.
The configuration file parameters are set as follows: model parameters are chosen according to backtest results (for example, parameter values obtained by grid search are configured for some of the models).
The basis for the parameter configuration: during the backtest phase, each set of parameters is evaluated and selected by its recall and f1 score.
The parameters specifically comprise: operating-environment parameters, data-reading limits (time window, amount of data required for model training), and model parameter settings (including feature function switches, data preprocessing parameters, dimension expansion and reduction parameters, encoding parameters, and derivation parameters), about 50 parameters in total.
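As an illustration of how such a configuration file might be organized, the sketch below models the three parameter groups as a nested dictionary. All section and key names here are hypothetical; the patent only names the categories and the approximate parameter count.

```python
# Hypothetical configuration layout for step 1. Every key name below is an
# illustrative assumption; only the three categories come from the text.
config = {
    "environment": {"n_jobs": 4, "random_seed": 42},
    "data_reading": {"time_window_days": 90, "max_rows": 1_000_000},
    "model": {
        "feature_switches": {"time_weighting": True, "one_hot": True},
        "preprocessing": {"outlier_sigma": 6, "missing_strategy": "mean"},
        "derivation": {"expand_statistics": True},
    },
}

def get_param(cfg, section, key, default=None):
    """Read one parameter from a section, with a fallback default."""
    return cfg.get(section, {}).get(key, default)

print(get_param(config, "model", "preprocessing")["outlier_sigma"])  # 6
```

A flat file (YAML, INI) would serve equally well; the point is that every knob the pipeline exposes lives in one declarative place.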
The data types fall into three categories: numeric, character, and time. Numeric values are subdivided into floating-point and integer and converted accordingly in code; when numeric conversion fails, the value is converted as a character type. Finally, time values are converted to timestamps, because all data fed into the model must be numeric (or 0-1 encoded).
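The three-way type split described above could be sketched with pandas as follows; the helper name and the sample frame are illustrative, not part of the patent.

```python
import pandas as pd

def coerce_column(s: pd.Series) -> pd.Series:
    """Convert a column to numeric where possible, datetimes to Unix
    timestamps, and fall back to character type otherwise, mirroring
    the step-1 rule that model inputs must end up numeric."""
    if pd.api.types.is_datetime64_any_dtype(s):
        return s.astype("int64") // 10**9        # datetime -> Unix seconds
    num = pd.to_numeric(s, errors="coerce")
    if num.notna().sum() == s.notna().sum():     # every non-null value parsed
        return num
    return s.astype(str)                          # character-type fallback

df = pd.DataFrame({
    "a": ["1", "2", "3"],                         # parses fully -> numeric
    "b": ["x", "2", "y"],                         # mixed -> stays character
    "t": pd.to_datetime(["2019-11-15", "2019-11-16", "2019-11-17"]),
})
df = df.apply(coerce_column)
```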
Step 2: preprocess the data. Preprocessing comprises missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling.
Rows with missing values in important fields are removed directly; missing values in non-important fields are filled with the column mean or with 0.
Outliers lying outside six standard deviations (6σ) are removed.
Categorical variables are encoded in two ways: direct label encoding and one-hot encoding.
Time variables are converted into weights along the time dimension.
Data balancing uses two methods, oversampling and downsampling, together with weighting of the minority class. For example, where the original positive-to-negative sample ratio is 0.025:1, downsampling brings it to 0.8:1.
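A minimal sketch of the preprocessing rules above, assuming a pandas DataFrame with a 0/1 label column. The function name, the column roles, and the exact order of imputation are assumptions; the drop/mean-fill rule, the 6σ cut, and the 0.8:1 downsampling target come from the text.

```python
import pandas as pd

def preprocess(df, important_cols, label_col, neg_pos_target=1 / 0.8):
    """Drop rows missing important fields, mean/0-fill the rest, remove
    6-sigma outliers, then downsample negatives toward a 0.8:1
    positive-to-negative ratio (i.e. ~1.25 negatives per positive)."""
    df = df.dropna(subset=important_cols)
    df = df.fillna(df.mean(numeric_only=True)).fillna(0)
    num_cols = df.select_dtypes("number").columns.drop(label_col)
    for c in num_cols:                      # keep values within mean ± 6σ
        m, s = df[c].mean(), df[c].std()
        if s > 0:
            df = df[(df[c] - m).abs() <= 6 * s]
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    neg = neg.sample(n=min(len(neg), int(len(pos) * neg_pos_target)),
                     random_state=0)        # downsample the majority class
    return pd.concat([pos, neg])
```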
Step 3: feature derivation. New variables are derived from the input raw variables according to the configuration file; the raw variables include the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior (aggregated by time period).
Derivation method: a time weight is constructed from the time period of each user access; per-tag access fields are constructed from the user tags; and finally the fields are expanded with statistical aggregates using common sklearn utilities. There are about 120 fields before expansion and up to about 1000 afterwards.
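The derivation recipe above (time weights, per-tag fields, statistical expansion) could look roughly like this; the event schema, the recency-weight formula, and the aggregate choices are invented for illustration.

```python
import pandas as pd

# Hypothetical access-event log: user, tag of the visited content, timestamp.
events = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "tag": ["news", "sport", "news"],
    "ts":  pd.to_datetime(["2019-11-01", "2019-11-10", "2019-11-14"]),
})
now = pd.Timestamp("2019-11-15")
events["time_weight"] = 1 / (1 + (now - events["ts"]).dt.days)  # recent = heavier

# Per-tag visit counts per user, then statistical expansion of the weight.
per_tag = events.pivot_table(index="user", columns="tag",
                             values="time_weight", aggfunc="count",
                             fill_value=0)
stats = events.groupby("user")["time_weight"].agg(["mean", "max", "sum"])
features = per_tag.join(stats)   # a few raw fields expand into many derived ones
```

Applying several aggregates over several groupings is what multiplies ~120 raw fields into ~1000 derived ones.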
Step 4: feature selection. Cascaded feature filtering is performed using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model.
First, highly collinear fields are removed using sklearn's collinearity utilities. Second, fields are filtered by their variance: fields with no variability are removed directly, and according to the backtest results a variance threshold of about 0.1 works well. Next, each variable is randomly split into two parts for a chi-square test: if the Q-statistic is not significant, the variable is considered usable and drawn from the same distribution; if it is significant, the variable is removed. Then the IV value is used: the data are binned, and if the predictive power of each bin for the dependent variable is generally weak, the IV value is low and the variable can be removed. Variables are then measured by mutual information, and those above the 95th percentile of the mutual-information distribution are removed; clustering and stepwise regression screen the independent variables to remove highly correlated ones. Finally, the tree models of the modeling stage rank the remaining variables, and the variables with higher weights proceed to the next modeling step.
In addition, a decision-tree algorithm (CART/C5.0) is used to compute information gain, and fields with larger gain are retained.
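Two stages of the cascade above, variance filtering at the reported ~0.1 threshold followed by collinearity removal, can be sketched with scikit-learn and NumPy; the remaining stages (chi-square, IV, mutual information, stepwise regression, tree ranking) would chain on in the same style. The function name and the 0.95 correlation cutoff are assumptions.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def cascade_filter(X, var_thresh=0.1, corr_thresh=0.95):
    """First stage: drop low-variance columns. Second stage: for each
    highly correlated column pair, drop the later column."""
    keep = VarianceThreshold(var_thresh).fit(X).get_support(indices=True)
    X = X[:, keep]
    corr = np.corrcoef(X, rowvar=False)
    drop = set()
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if j not in drop and abs(corr[i, j]) > corr_thresh:
                drop.add(j)                 # drop the later of the pair
    cols = [c for c in range(X.shape[1]) if c not in drop]
    return X[:, cols]
```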
Step 5: train the algorithm models. There are four algorithm modules (models): random forest, XGBoost, SVM, and artificial neural network.
During training, the samples are divided into a training set, a test set, and a validation set. For the random forest, 1500 trees are trained; each tree draws a random subset of samples (ratio α) and of fields (ratio β), and the split criterion for information gain is the Gini coefficient. The out-of-bag samples of the training set are then backtested; after the backtest, the validation set is used to observe the model's performance, and if the backtest is poor, the number of trees, α, and β are adjusted. Finally, once the metrics are stable (recall and f1 no longer fluctuate excessively), the test set is evaluated.
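The random-forest recipe above maps naturally onto scikit-learn's `RandomForestClassifier`, with `max_samples` and `max_features` standing in for the α and β ratios. The dataset, the split sizes, and the α, β values below are illustrative; only the 1500 trees and the Gini criterion come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

alpha, beta = 0.8, 0.5   # sample and field ratios; tunable, as in the text

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=0)

rf = RandomForestClassifier(
    n_estimators=1500, criterion="gini",     # 1500 trees, Gini splits
    max_samples=alpha, max_features=beta,    # per-tree row/column sampling
    oob_score=True, random_state=0, n_jobs=-1,
).fit(X_tr, y_tr)

# Validation guides tuning; the test set is scored only once metrics settle.
val_f1 = f1_score(y_val, rf.predict(X_val))
test_recall = recall_score(y_te, rf.predict(X_te))
```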
Step 6: evaluate the models of step 5 and score the prediction samples according to each model's output and weight.
Evaluation uses the AUC, ROC, and f1 values. The evaluation process groups the data set N times; each group is treated as a brand-new sample set, split into training, test, and validation parts, and fed into the models for testing. The models are then scored from the N evaluation results (for example, if the mean f1 over N runs is 0.4 for the random forest, 0.5 for XGBoost, 0.6 for the SVM, and 0.7 for the artificial neural network, softmax is applied to the four means to obtain the model weights). Finally, the prediction samples are scored according to each model's output and weight.
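The softmax weighting in the example above can be sketched directly. The four f1 means are the ones the text gives; the per-sample model scores in the last lines are invented to show how the weights are applied.

```python
import numpy as np

def softmax_weights(scores):
    """Turn per-model mean f1 scores into ensemble weights that sum to 1."""
    z = np.exp(np.asarray(scores) - np.max(scores))  # subtract max for stability
    return z / z.sum()

# Mean f1 over N runs: random forest, XGBoost, SVM, ANN (values from the text).
f1_means = np.array([0.4, 0.5, 0.6, 0.7])
w = softmax_weights(f1_means)

# Final score of one prediction sample: weighted sum of the four model scores.
model_scores = np.array([0.30, 0.55, 0.62, 0.80])  # illustrative model outputs
final = float(w @ model_scores)
```

Softmax keeps every model in the ensemble while giving stronger models proportionally more say.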
Step 7: measure the distance between pairs of data sets. Several distance metrics are applied to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
The pairs of data sets are training vs. test, training vs. prediction, and test vs. prediction. The KL divergence (also called relative entropy) can be understood as a measure of the similarity of two distributions. As shown in FIG. 1, the two data sets have visibly different distributions, and the KL divergence between them is large, indicating that the two distributions are dissimilar. Variable filtering is the same as in step 4.
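A histogram-based KL divergence between one variable's distribution in two data sets, as used in this step, might be sketched as follows; the shared binning scheme and the epsilon smoothing of empty bins are implementation assumptions.

```python
import numpy as np

def kl_divergence(a, b, bins=20):
    """KL(P || Q) between the empirical distributions of one variable in
    two data sets, estimated over a shared histogram grid."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    eps = 1e-9                      # avoid log(0) and division by zero
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
same = kl_divergence(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = kl_divergence(rng.normal(0, 1, 5000), rng.normal(3, 1, 5000))
# A large value flags a big distributional gap between the two sets,
# e.g. between the training set and the prediction set.
```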
The technical effects of the invention are:
1. The combination of feature selection methods and the default parameter settings within the software system.
Different models combine features differently: all features form a feature pool; each tree in a random forest uses a different feature combination; the gradient-boosted trees also select features randomly; the SVM uses the full feature set; and the artificial neural network predicts from high-level semantic features obtained by nonlinear combination of the input features.
2. For the distance comparison between two data sets, multiple distance metrics are first applied to the same variable, and then cosine distances are computed over all variables, yielding an overall comparison of the two data sets.
For example, to measure how a variable A is distributed across different data sets, the KL divergence, Euclidean distance, and cross entropy are computed, and the consistency of the variable's distribution across data sets is observed. Whether the distance between variables changes with the data set is verified as follows: if the cosine distance between variables A and B differs greatly between data set 1 and data set 2, the split of the data is considered uneven; the distances are generally judged at the 95% statistical significance level.
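The cosine-distance check described above can be sketched as follows. Variables A and B and the random split are synthetic, and the 95% significance test is omitted for brevity; a small gap between the two halves is what an even split should produce.

```python
import numpy as np

def cosine_dist(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)     # B strongly related to A

idx = rng.permutation(1000)
s1, s2 = idx[:500], idx[500:]           # a random, hence even, split
d1 = cosine_dist(a[s1], b[s1])          # A-B distance inside data set 1
d2 = cosine_dist(a[s2], b[s2])          # A-B distance inside data set 2
gap = abs(d1 - d2)                      # small gap -> the split looks even
```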
3. The feature derivation section proposes group-wise derivation to reduce combinatorial explosion and improve the effectiveness of combination.
For example, for a random forest, the model not only randomly selects and combines the user-supplied dimensions when growing trees, but the dimensions are also grouped on input: all data fed back from the user side form one group, including behavior data and the time and frequency of the behavior; all manually labeled or constructed dimensions form another group, including time weights and frequency statistics; and all binary dimensions form a further factor group, including all the categorical groups. Grouping effectively reduces cross-group interaction, which reduces combinatorial explosion and improves the model's operational efficiency.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (1)

1. A rapid modeling method, comprising:
step 1, reading configuration file parameters, checking the input data, and performing data type conversion;
step 2, preprocessing the data, the preprocessing comprising missing-value and outlier handling, categorical variable encoding, time variable processing, and data imbalance handling;
step 3, feature derivation: deriving new variables from the input raw variables according to the configuration file, the raw variables comprising the user's access behavior, counts, time periods, and user tags, together with basic statistical fields over the access behavior;
step 4, feature selection: performing cascaded feature filtering using variance filtering, chi-square tests, IV values, mutual information, the maximal information coefficient, clustering-based decorrelation, stepwise regression, and a tree ensemble model;
step 5, training the algorithm models, the models comprising a random forest, XGBoost, an SVM, and an artificial neural network;
step 6, evaluating the models of step 5 and scoring the prediction samples according to each model's output and weight;
step 7, measuring the distance between pairs of data sets: applying several distance metrics to the training, test, and prediction sets to assist modeling, variable filtering, and the assessment of differences between data sets.
CN201911121863.3A 2019-11-15 2019-11-15 Rapid modeling method Pending CN111079937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911121863.3A CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911121863.3A CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Publications (1)

Publication Number Publication Date
CN111079937A true CN111079937A (en) 2020-04-28

Family

ID=70311108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911121863.3A Pending CN111079937A (en) 2019-11-15 2019-11-15 Rapid modeling method

Country Status (1)

Country Link
CN (1) CN111079937A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241892A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 A kind of Data Modeling Method and device
KR101864286B1 (en) * 2017-11-10 2018-07-04 주식회사 한컴엠디에스 Method and apparatus for using machine learning algorithm
CN108363714A (en) * 2017-12-21 2018-08-03 北京至信普林科技有限公司 A kind of method and system for the ensemble machine learning for facilitating data analyst to use
CN109523316A (en) * 2018-11-16 2019-03-26 杭州珞珈数据科技有限公司 The automation modeling method of commerce services model


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101950A (en) * 2020-09-27 2020-12-18 中国建设银行股份有限公司 Suspicious transaction monitoring model feature extraction method and device
CN112101950B (en) * 2020-09-27 2024-05-10 中国建设银行股份有限公司 Suspicious transaction monitoring model feature extraction method and suspicious transaction monitoring model feature extraction device
CN112132111A (en) * 2020-10-10 2020-12-25 安徽江淮汽车集团股份有限公司 Parking typical scene extraction method, device, storage medium and device
CN112734560A (en) * 2020-12-31 2021-04-30 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN112734560B (en) * 2020-12-31 2024-05-14 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN113782212A (en) * 2021-04-19 2021-12-10 东华医为科技有限公司 Data processing system
CN113553256A (en) * 2021-06-18 2021-10-26 北京百度网讯科技有限公司 AB test method and device and electronic equipment
CN113553256B (en) * 2021-06-18 2023-07-14 北京百度网讯科技有限公司 AB test method and device and electronic equipment
CN114722746A (en) * 2022-05-24 2022-07-08 苏州浪潮智能科技有限公司 Chip aided design method, device, equipment and readable medium
CN114722746B (en) * 2022-05-24 2022-11-01 苏州浪潮智能科技有限公司 Chip aided design method, device and equipment and readable medium

Similar Documents

Publication Publication Date Title
CN111079937A (en) Rapid modeling method
CN110866819A (en) Automatic credit scoring card generation method based on meta-learning
CN111738462B (en) Fault first-aid repair active service early warning method for electric power metering device
CN104503874A (en) Hard disk failure prediction method for cloud computing platform
CN110930198A (en) Electric energy substitution potential prediction method and system based on random forest, storage medium and computer equipment
CN111614491A (en) Power monitoring system oriented safety situation assessment index selection method and system
CN111461216A (en) Case risk identification method based on machine learning
CN111583012B (en) Method for evaluating default risk of credit, debt and debt main body by fusing text information
Umayaparvathi et al. Attribute selection and customer churn prediction in telecom industry
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN112966962A (en) Electric business and enterprise evaluation method
CN111986027A (en) Abnormal transaction processing method and device based on artificial intelligence
Zhang et al. A fault diagnosis method of power transformer based on cost sensitive one-dimensional convolution neural network
CN113450004A (en) Power credit report generation method and device, electronic equipment and readable storage medium
CN114155025A (en) Random forest based electric vehicle charging station charging loss user prediction method
CN113779785A (en) Deconstruction model and deconstruction method of digital twin complex equipment
Ackermann et al. Black-box learning of parametric dependencies for performance models
CN117674119A (en) Power grid operation risk assessment method, device, computer equipment and storage medium
CN103136440A (en) Method and device of data processing
CN117131425A (en) Numerical control machine tool processing state monitoring method and system based on feedback data
CN116091206A (en) Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN114626940A (en) Data analysis method and device and electronic equipment
CN113935819A (en) Method for extracting checking abnormal features
CN113886592A (en) Quality detection method for operation and maintenance data of power information communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination