CN109146080A - Method for a model implementation architecture based on supervised machine learning algorithms - Google Patents

Method for a model implementation architecture based on supervised machine learning algorithms

Info

Publication number
CN109146080A
Authority
CN
China
Prior art keywords
data
model
sample
machine learning
learning algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811072255.3A
Other languages
Chinese (zh)
Inventor
郭益民
石乾坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zheng Load Information Technology Co Ltd
Original Assignee
Suzhou Zheng Load Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zheng Load Information Technology Co Ltd
Priority to CN201811072255.3A
Publication of CN109146080A
Legal status: Pending


Abstract

The present invention relates to a method for a model implementation architecture based on supervised machine learning algorithms, comprising the following steps. Step 1: overall design of the model data framework, mainly to define the model input data explicitly. Step 2: data preprocessing design, mainly to further process the generated model input matrix. Step 3: sample control design, mainly addressing the sample data and label data in supervised machine learning. Step 4: model training design, mainly establishing an algorithm library, taking the training data processed in step 2 as input, and then calling the algorithms in the library to produce the corresponding machine learning models. Step 5: model evaluation design, in which the test data is fed into each trained model to obtain prediction results, and the difference between the target item in the test data and the prediction results is compared. By modeling with supervised machine learning algorithms, the invention implements the overall architecture and helps simplify later operations.

Description

Method for a model implementation architecture based on supervised machine learning algorithms
Technical field
The present invention relates to a method for a model implementation architecture based on supervised machine learning algorithms.
Background
In the current technical environment, machine learning is one of the most popular and exciting fields. It has given people stable spam filters, convenient text and speech recognition, reliable web search engines and strong chess programs, and safe, efficient self-driving cars are expected to follow. Machine learning has undoubtedly become a popular field, but it is easy to lose sight of the whole among the details, and the field demands continuous innovation and continuous learning.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a method for a model implementation architecture based on supervised machine learning algorithms.
To achieve this object, the present invention adopts the following technical solution:
A method for a model implementation architecture based on supervised machine learning algorithms, comprising the following steps:
Step 1: overall design of the model data framework, mainly to define the model input data explicitly;
Step 2: data preprocessing design, mainly to further process the generated model input matrix;
Step 3: sample control design, mainly addressing the sample data and label data in supervised machine learning;
Step 4: model training design, mainly establishing an algorithm library, taking the training data processed in step 2 as input, and then calling the algorithms in the library to produce the corresponding machine learning models;
Step 5: model evaluation design, in which the test data is fed into each trained model to obtain prediction results, and the difference between the target item in the test data and the prediction results is compared.
Further, the method for the model realization framework based on supervision class machine learning algorithm, wherein the step Mode input data are divided into target item and characteristic item in rapid 1, wherein target item is the object that model needs to predict, passes through industry Business demand confirms such object;Characteristic item be then for carrying out model training multi-dimensional matrix, it is every in characteristic item One dimension all has certain influence to prediction target item.
Further, the method for the model realization framework based on supervision class machine learning algorithm, wherein described Processing mode in step 2 the following steps are included:
1, deletion row records duplicate data sample or any one column missing values are more than 50% characteristic series;
2, the basic conversion of correlated characteristic column;
3, pass through the characteristic series of some continuous types of the related dummy variable discretization of design or classifying text type;
4, the processing of exceptional value deviates excessive data point for arranging, and is directly deleted or assignment again;
5, it is calculated with the multiple characteristic series of specific logical association, generates new characteristic series;
6, data are carried out by lateral division with certain rule, is respectively defined as training data and test data.
Further, the method for the model realization framework based on supervision class machine learning algorithm, wherein described Basic conversion includes that LOG, EXP, SQRT are converted.
Further still, in the method, a correction column named "weight" or "offset" must be added to the sample data in step 3, assigned as follows:
Samples with label 1 are assigned the weight p1/r1;
Samples with label 0 are assigned the weight (1-p1)/(1-r1);
where p1 is the proportion of label-1 samples in the original full sample data, and r1 is the proportion of label-1 samples in the sample data after the sampling adjustment.
Further still, in the method, the algorithm library in step 4 is an algorithm package in R, the Scipy algorithm library in Python, or the MLlib algorithm library in Spark.
Further still, in the method, the target item and the prediction result in step 5 are compared using two reference measures, the mean squared error and the classification accuracy, wherein
MSE denotes the mean squared error, calculated as:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where N is the number of test samples, $y_i$ is the target item in the test data, and $\hat{y}_i$ is the model's predicted value.
The classification accuracy is calculated as:
$$\mathrm{ACCURACY} = \frac{p + q}{N}$$
where N is the number of test samples, p is the number of samples for which the model predicts 1 and the actual target item is also 1, and q is the number of samples for which the model predicts 0 and the actual target item is also 0.
Further still, in the method, the fixed rule is to split the data into training and test data at a ratio of 7:3, i.e. 70% of the data samples are used to train the model and 30% are used for testing.
With the above solution, the present invention has at least the following advantages:
1. The invention covers the complete process of a supervised machine learning algorithm, is general and reproducible, and can be used for machine learning work in any field.
2. The invention treats data preprocessing thoroughly, providing reliable input for building machine learning models.
3. The invention is applicable to all kinds of machine learning frameworks and all kinds of machine learning models.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings used in the embodiments are briefly introduced below. The following drawings show only certain embodiments of the present invention and should not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a structural schematic diagram of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following embodiments illustrate the invention and are not intended to limit its scope.
So that those skilled in the art can better understand the solution, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. The components of the embodiments, as generally described and illustrated in the drawings, may be arranged and designed in a variety of configurations. The detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment
As shown in Fig. 1, the method for a model implementation architecture based on supervised machine learning algorithms comprises the following steps:
Step 1: overall design of the model data framework, mainly to define the model input data explicitly;
The model input data is divided into a target item and feature items. The target item is the object the model needs to predict and is determined by the business requirements. The feature items form the multi-dimensional matrix used for model training, and each dimension of the feature items has some influence on predicting the target item.
The main task of the overall model data framework is therefore to determine the target item and the feature item data through specific logical definitions, based on the business requirements of the actual project and the overall data available, and to merge them into one complete model input matrix.
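Merging feature items and a target item into one input matrix could look like the sketch below. The record layout, the field names (`age`, `balance`) and the helper `build_input_matrix` are all hypothetical; the patent does not prescribe a data structure.

```python
# Hypothetical source data: feature items and target items arrive separately,
# keyed by a shared sample identifier (all names and values are made up).
features_by_id = {
    1: {"age": 34, "balance": 1200.0},
    2: {"age": 51, "balance": 80.5},
    3: {"age": 27, "balance": 430.0},
}
target_by_id = {1: 0, 2: 1}  # sample 3 has no label yet

def build_input_matrix(features_by_id, target_by_id, feature_names):
    """Merge feature items and the target item into one model input matrix.

    Returns (header, rows); the last column of every row is the target item.
    Samples without a target value are skipped, because supervised training
    cannot use them.
    """
    header = list(feature_names) + ["target"]
    rows = []
    for key in sorted(features_by_id):
        if key not in target_by_id:
            continue
        feats = features_by_id[key]
        rows.append([feats[name] for name in feature_names] + [target_by_id[key]])
    return header, rows

header, rows = build_input_matrix(features_by_id, target_by_id, ["age", "balance"])
```

Keeping the target as the last column is only a convention chosen for this sketch; any fixed position works as long as later steps agree on it.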
Step 2: data preprocessing design, mainly to further process the generated model input matrix;
The processing in step 2 comprises the following steps:
1. Delete duplicated rows (data samples) and any feature column in which more than 50% of the values are missing;
2. Apply basic transformations to the relevant feature columns, for example the LOG, EXP and SQRT transformations;
3. Discretize certain continuous or categorical-text feature columns by designing corresponding dummy variables;
4. Handle outliers: data points in a column that deviate too far are either deleted directly or reassigned;
5. Combine multiple feature columns using specific logic to compute new feature columns;
6. Split the data row-wise according to a fixed rule into training data and test data.
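Items 1, 3 and 4 of the list above can be sketched in plain Python as follows. The helper names, the use of `None` to mark a missing value, and the clipping bounds are assumptions made for illustration; the patent does not specify them.

```python
def drop_duplicate_rows(rows):
    """Item 1 (rows): keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def drop_sparse_columns(header, rows, max_missing=0.5):
    """Item 1 (columns): drop columns whose share of missing (None) values exceeds max_missing."""
    keep = []
    for j in range(len(header)):
        missing = sum(1 for row in rows if row[j] is None)
        if missing / len(rows) <= max_missing:
            keep.append(j)
    return [header[j] for j in keep], [[row[j] for j in keep] for row in rows]

def dummy_encode(values):
    """Item 3: one 0/1 dummy column per category of a text-valued column."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def clip_outliers(values, lower, upper):
    """Item 4: reassign points outside [lower, upper] to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

# Illustrative data (made up for this sketch).
header = ["age", "city", "fax"]
rows = [
    [30, "Suzhou", None],
    [30, "Suzhou", None],    # exact duplicate of the first row
    [45, "Beijing", None],
    [250, "Suzhou", "n/a"],  # age 250 is an implausible outlier
]
rows = drop_duplicate_rows(rows)
header, rows = drop_sparse_columns(header, rows)  # "fax" is 2/3 missing
ages = clip_outliers([r[0] for r in rows], lower=0, upper=100)
cities, city_dummies = dummy_encode([r[1] for r in rows])
```

Clipping is only one way to reassign outliers; the text equally allows deleting the affected rows.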
The fixed rule is to split the data into training and test data at a ratio of 7:3, i.e. 70% of the data samples are used to train the model and 30% are used for testing.
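A row-wise 7:3 split can be sketched as below. Shuffling with a fixed seed before cutting is an assumption of this example; the text only fixes the ratio, not how rows are assigned to each part.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows (assumption) and cut them into training and test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Ten illustrative rows of the form [feature, label].
data = [[i, i % 2] for i in range(10)]
train, test = train_test_split(data)
```

For time-ordered data a chronological cut may be preferable to a random shuffle, which is why the shuffle is flagged as an assumption here.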
Step 3: sample control design, mainly addressing the sample data and label data in supervised machine learning;
An imbalance between label-1 and label-0 samples often occurs in actual projects; in most cases the samples labeled 1 are far fewer than the samples labeled 0. A sample control process is therefore needed: either duplicate the samples labeled 1 or randomly sample the samples labeled 0, so that the numbers of label-1 and label-0 samples end up on the same order of magnitude.
A correction column named "weight" or "offset" must be added to the sample data in step 3, assigned as follows:
Samples with label 1 are assigned the weight p1/r1;
Samples with label 0 are assigned the weight (1-p1)/(1-r1);
where p1 is the proportion of label-1 samples in the original full sample data, and r1 is the proportion of label-1 samples in the sample data after the sampling adjustment.
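The sample control and weighting rule can be illustrated with a small sketch. It covers only the random-sampling variant for label-0 samples (not duplication of label-1 samples); the function names, the `ratio` parameter and the example data are assumptions.

```python
import random

def undersample_negatives(rows, label_index, ratio=1.0, seed=0):
    """Keep every label-1 row and randomly sample about ratio * (#label-1) label-0 rows."""
    pos = [r for r in rows if r[label_index] == 1]
    neg = [r for r in rows if r[label_index] == 0]
    k = min(len(neg), int(len(pos) * ratio))
    return pos + random.Random(seed).sample(neg, k)

def correction_weights(p1, r1):
    """Return (weight for label 1, weight for label 0) following the stated rule."""
    return p1 / r1, (1 - p1) / (1 - r1)

# Illustrative full data: 1000 rows of [id, label], 20 of them labeled 1.
full = [[i, 1] for i in range(20)] + [[i, 0] for i in range(20, 1000)]
sampled = undersample_negatives(full, label_index=1)

p1 = sum(r[1] for r in full) / len(full)        # share of label 1 before sampling
r1 = sum(r[1] for r in sampled) / len(sampled)  # share of label 1 after sampling
w1, w0 = correction_weights(p1, r1)

# Append the "weight" correction column to each sampled row.
weighted = [r + [w1 if r[1] == 1 else w0] for r in sampled]
```

In this example p1 = 0.02 and r1 = 0.5, so the weights are 0.04 for label 1 and 1.96 for label 0. With these weights the weighted share of label-1 samples in the sampled data equals p1 again, which is the purpose of the correction.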
Step 4: model training design, mainly establishing an algorithm library, taking the training data processed in step 2 as input, and then calling the algorithms in the library to produce the corresponding machine learning models;
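One way to picture an "algorithm library" is a registry that maps algorithm names to training functions, so that every algorithm can be called on the same training data. The two toy algorithms below (a mean baseline and a one-feature least-squares line) and all names are stand-ins chosen for this sketch; in practice the library would be one of the packages named in the text.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_simple_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

# Hypothetical algorithm library: name -> training function.
ALGORITHM_LIBRARY = {
    "mean_baseline": fit_mean,
    "simple_linear": fit_simple_linear,
}

def train_all(xs, ys, library=ALGORITHM_LIBRARY):
    """Call every algorithm in the library and return the trained models by name."""
    return {name: fit(xs, ys) for name, fit in library.items()}

# Illustrative training data (made up).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.0, 4.0, 6.0, 8.0]
models = train_all(train_x, train_y)
```

Each trained model is a plain callable here, so later evaluation code can treat all models uniformly.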
The following points need attention in the modeling process of step 4:
1. Where no theory proves a particular algorithm to be optimal, every algorithm in the library that suits the input data needs to be tried;
2. The key parameters differ from algorithm to algorithm; the main parameters of each algorithm must be configured specifically for the model to fit best;
3. To prevent over-fitting, K-fold cross-validation is needed to tune the parameters being estimated;
4. After modeling, model diagnostics are required. For regression algorithms, check whether the model's R² is large; the closer it is to 1, the better the fit. Also check whether the residuals pass normality and randomness tests, and whether there is obvious multicollinearity between the dimensions. For classification problems, check whether the AUC under the plotted ROC curve is large; the closer it is to 1, the better the fit.
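The K-fold splitting from point 3 and the R² and AUC diagnostics from point 4 can be sketched with the standard library alone. The function names are illustrative, the folds are contiguous (the data is assumed to be shuffled already), and the residual and multicollinearity checks are not covered.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    label-1 sample scores above a random label-0 sample (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

folds = list(k_fold_indices(10, 3))
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

During tuning, each candidate parameter setting would be trained on the `train` indices of every fold and scored on the matching `val` indices, and the setting with the best average score would be kept.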
The algorithm library in step 4 is an algorithm package in R, the Scipy algorithm library in Python, or the MLlib algorithm library in Spark.
Step 5: model evaluation design, in which the test data is fed into each trained model to obtain prediction results, and the difference between the target item in the test data and the prediction results is compared.
The target item and the prediction result in step 5 are compared using two reference measures, the mean squared error and the classification accuracy, wherein
MSE denotes the mean squared error, calculated as:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where N is the number of test samples, $y_i$ is the target item in the test data, and $\hat{y}_i$ is the model's predicted value.
The classification accuracy is calculated as:
$$\mathrm{ACCURACY} = \frac{p + q}{N}$$
where N is the number of test samples, p is the number of samples for which the model predicts 1 and the actual target item is also 1, and q is the number of samples for which the model predicts 0 and the actual target item is also 0.
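Both measures are short to compute. The sketch below uses the usual definitions, which match the symbols defined in the text: MSE is the average squared difference between target and prediction over the N test samples, and accuracy is (p + q) / N. The function names and the example values are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def accuracy(y_true, y_pred):
    """(p + q) / N, with p the correct label-1 and q the correct label-0 predictions."""
    p = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    q = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 0)
    return (p + q) / len(y_true)

# Made-up test targets and predictions.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])  # (1 + 0 + 4) / 3
acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])                  # (1 + 2) / 4
```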
In addition, some real projects may define their own model performance metrics. Taking the MSE value, the ACCURACY value and any custom metrics into account together, the model with the smallest possible MSE and the largest possible ACCURACY is chosen as the final model.
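Choosing the final model from the evaluation results can be sketched as a simple lookup. The text does not say how to combine several metrics, so this example treats regression (smaller MSE is better) and classification (larger accuracy is better) separately. The model names and scores are made-up placeholders, not results from the patent.

```python
def select_best(scores, higher_is_better):
    """Return the name of the model with the best score."""
    chooser = max if higher_is_better else min
    return chooser(scores, key=scores.get)

# Hypothetical evaluation results on the test data.
regression_mse = {"mean_baseline": 5.2, "simple_linear": 0.8}
classifier_accuracy = {"model_a": 0.91, "model_b": 0.88}

best_regressor = select_best(regression_mse, higher_is_better=False)
best_classifier = select_best(classifier_accuracy, higher_is_better=True)
```

A custom project metric could be handled the same way by passing its scores and stating whether larger or smaller values are better.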
The present invention has at least the following advantages:
1. The invention covers the complete process of a supervised machine learning algorithm, is general and reproducible, and can be used for machine learning work in any field.
2. The invention treats data preprocessing thoroughly, providing reliable input for building machine learning models.
3. The invention is applicable to all kinds of machine learning frameworks and all kinds of machine learning models.
The above is only a preferred embodiment of the present invention and is not intended to limit it. A person of ordinary skill in the art may make improvements and modifications without departing from the technical principles of the invention, and these improvements and modifications also fall within its protection scope.

Claims (8)

1. the method for the model realization framework based on supervision class machine learning algorithm, which comprises the following steps:
Step 1: the design of model data frame entirety, mainly for explicitly defining for mode input data;
Step 2: data prediction design carries out further processing processing mainly for mode input matrix is generated;
Step 3: sample control design case, mainly for the sample data and label data in supervision machine study;
Step 4: an algorithms library is mainly established in model training design, and the training data that step 2 is completed the process is as input, so Afterwards, the algorithm in algorithms library is called, that is, produces corresponding machine learning model;
Step 5: test data is inputted in trained each model and calculates acquisition prediction result by model evaluation design, than Compared with the otherness of target item and prediction result in test data.
2. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: mode input data are divided into target item and characteristic item in the step 1, wherein target item is pair that model needs to predict As confirming such object by business demand;Characteristic item is then for carrying out model training multi-dimensional matrix, feature Each of item dimension all has certain influence to prediction target item.
3. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: processing mode in the step 2 the following steps are included:
1, deletion row records duplicate data sample or any one column missing values are more than 50% characteristic series;
2, the basic conversion of correlated characteristic column;
3, pass through the characteristic series of some continuous types of the related dummy variable discretization of design or classifying text type;
4, the processing of exceptional value deviates excessive data point for arranging, and is directly deleted or assignment again;
5, it is calculated with the multiple characteristic series of specific logical association, generates new characteristic series;
6, data are carried out by lateral division with certain rule, is respectively defined as training data and test data.
4. the method for the model realization framework according to claim 3 based on supervision class machine learning algorithm, feature exist In: the basic conversion includes that LOG, EXP, SQRT are converted.
5. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: sample data in the step 3 needs to increase a column entitled " weight " or the amendment column of " offset ", assignment rule are as follows:
The sample that label is 1, weight are assigned a value of p1/r1;
The sample that label is 0, weight are assigned a value of (1-p1)/(1-r1);
Wherein p1 is ratio shared by label is 1 in initial bulk sample notebook data sample, and r1 is sample data adjusted of sampling Ratio shared by the sample that middle label is 1.
6. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: the algorithms library in the step 4 is the algorithm packet in R, or is the Scipy algorithms library in Python, or in Spark MLlib algorithms library.
7. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: target item involved in the step 5 and prediction result are equipped with reference quantity, respectively mean square error and classification accuracy, In,
MSE is known as mean square error, calculation formula are as follows:
Wherein, N is test sample amount, yiFor the target item in test data,For model predication value.
Classification accuracy, calculation formula are as follows:
Wherein, N is test sample amount, and p is that model prediction is 1 and realistic objective item is also 1 quantity, q be model prediction be 0 and Realistic objective item is also 0 quantity.
8. the method for the model realization framework according to claim 3 based on supervision class machine learning algorithm, feature exist In: the rule of certain rule is the ratio cut partition training test data with 7:3, i.e., 70% data sample is for training mould Type, 30% data sample are used to test.
CN201811072255.3A 2018-09-14 2018-09-14 The method of model realization framework based on supervision class machine learning algorithm Pending CN109146080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811072255.3A CN109146080A (en) 2018-09-14 2018-09-14 The method of model realization framework based on supervision class machine learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811072255.3A CN109146080A (en) 2018-09-14 2018-09-14 The method of model realization framework based on supervision class machine learning algorithm

Publications (1)

Publication Number Publication Date
CN109146080A true CN109146080A (en) 2019-01-04

Family

ID=64825268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811072255.3A Pending CN109146080A (en) 2018-09-14 2018-09-14 The method of model realization framework based on supervision class machine learning algorithm

Country Status (1)

Country Link
CN (1) CN109146080A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008347A (en) * 2019-11-25 2020-04-14 杭州安恒信息技术股份有限公司 Website identification method, device and system and computer readable storage medium
CN111047049A (en) * 2019-12-05 2020-04-21 北京小米移动软件有限公司 Method, apparatus and medium for processing multimedia data based on machine learning model
CN113869342A (en) * 2020-06-30 2021-12-31 微软技术许可有限责任公司 Mark offset detection and adjustment in predictive modeling
CN114254588A (en) * 2021-12-16 2022-03-29 马上消费金融股份有限公司 Data tag processing method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008347A (en) * 2019-11-25 2020-04-14 杭州安恒信息技术股份有限公司 Website identification method, device and system and computer readable storage medium
CN111047049A (en) * 2019-12-05 2020-04-21 北京小米移动软件有限公司 Method, apparatus and medium for processing multimedia data based on machine learning model
CN111047049B (en) * 2019-12-05 2023-08-11 北京小米移动软件有限公司 Method, device and medium for processing multimedia data based on machine learning model
CN113869342A (en) * 2020-06-30 2021-12-31 微软技术许可有限责任公司 Mark offset detection and adjustment in predictive modeling
CN114254588A (en) * 2021-12-16 2022-03-29 马上消费金融股份有限公司 Data tag processing method and device
CN114254588B (en) * 2021-12-16 2023-10-13 马上消费金融股份有限公司 Data tag processing method and device

Similar Documents

Publication Publication Date Title
CN109146080A (en) The method of model realization framework based on supervision class machine learning algorithm
Li et al. Random search and reproducibility for neural architecture search
CN103729678B (en) A kind of based on navy detection method and the system of improving DBN model
CN107688825B (en) Improved integrated weighted extreme learning machine sewage treatment fault diagnosis method
Effendy et al. Handling imbalanced data in customer churn prediction using combined sampling and weighted random forest
CN106503689A (en) Neutral net local discharge signal mode identification method based on particle cluster algorithm
CN111063194A (en) Traffic flow prediction method
Putra et al. Estimation of parameters in the SIR epidemic model using particle swarm optimization
CN112578089B (en) Air pollutant concentration prediction method based on improved TCN
CN108062566A (en) A kind of intelligent integrated flexible measurement method based on the potential feature extraction of multinuclear
Özsoy et al. Estimating the parameters of nonlinear regression models through particle swarm optimization
Kavitha et al. Real time credit card fraud detection on huge imbalanced data using meta-classifiers
Pourchot et al. Importance mixing: Improving sample reuse in evolutionary policy search methods
CN111753751A (en) Fan fault intelligent diagnosis method for improving firework algorithm
Regazzoni et al. A physics-informed multi-fidelity approach for the estimation of differential equations parameters in low-data or large-noise regimes
CN113657452A (en) Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning
CN116029221B (en) Power equipment fault diagnosis method, device, equipment and medium
Namazi et al. Surrogate assisted optimisation for travelling thief problems
CN116258899A (en) Corn ear classification method based on custom light convolutional neural network
CN114861364A (en) Intelligent sensing and suction regulation and control method for air inlet flow field of air-breathing engine
CN109697511A (en) Data reasoning method, apparatus and computer equipment
Ye Linear conic programming
JP2021012600A (en) Method for diagnosis, method for learning, learning device, and program
Reena et al. Software defect prediction system–decision tree algorithm with two level data pre-processing
Tan Using supervised attribute selection for unsupervised learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190104