CN109146080A

CN109146080A - Method for a model implementation framework based on supervised machine learning algorithms

Info

Publication number: CN109146080A
Application number: CN201811072255.3A
Authority: CN
Inventors: 郭益民; 石乾坤
Original assignee: Suzhou Zheng Load Information Technology Co Ltd
Current assignee: Suzhou Zheng Load Information Technology Co Ltd
Priority date: 2018-09-14
Filing date: 2018-09-14
Publication date: 2019-01-04

Abstract

The present invention relates to a method for a model implementation framework based on supervised machine learning algorithms, comprising the following steps. Step 1: overall model data framework design, mainly to define the model input data explicitly. Step 2: data preprocessing design, mainly to further process the generated model input matrix. Step 3: sample control design, mainly addressing the sample data and label data in supervised machine learning. Step 4: model training design, mainly establishing an algorithm library, taking the training data processed in step 2 as input, and then calling the algorithms in the library to produce the corresponding machine learning models. Step 5: model evaluation design, in which the test data are fed into each trained model to compute prediction results, and the difference between the target item in the test data and the prediction results is compared. By modeling with supervised machine learning algorithms, the present invention implements the overall architecture and helps simplify later operation.

Description

Method for a model implementation framework based on supervised machine learning algorithms

Technical field

The present invention relates to a method for a model implementation framework based on supervised machine learning algorithms.

Background technique

In the current technological environment, machine learning is one of the most popular and exciting fields. Machine learning has given people stable spam filters, convenient text and speech recognition, reliable web search engines, and strong chess-playing programs, and safe and efficient self-driving cars are expected to appear. Machine learning has undoubtedly become a popular field, but it is sometimes easy to lose sight of the whole by focusing on details; the field requires continual innovation and continuous learning.

Summary of the invention

To solve the above technical problems, the object of the present invention is to provide a method for a model implementation framework based on supervised machine learning algorithms.

To achieve the above object, the present invention adopts the following technical scheme:

A method for a model implementation framework based on supervised machine learning algorithms, comprising the following steps:

Step 1: overall model data framework design, mainly to define the model input data explicitly;

Step 2: data preprocessing design, mainly to further process the generated model input matrix;

Step 3: sample control design, mainly addressing the sample data and label data in supervised machine learning;

Step 4: model training design, mainly establishing an algorithm library, taking the training data processed in step 2 as input, and then calling the algorithms in the library to produce the corresponding machine learning models;

Step 5: model evaluation design, in which the test data are fed into each trained model to compute prediction results, and the difference between the target item in the test data and the prediction results is compared.

Further, in the method for a model implementation framework based on supervised machine learning algorithms, the model input data in step 1 are divided into a target item and feature items. The target item is the object the model needs to predict, and it is determined from the business requirements. The feature items form the multi-dimensional matrix used for model training, and each dimension of the feature items has some influence on predicting the target item.

Further, in the method for a model implementation framework based on supervised machine learning algorithms, the processing in step 2 comprises the following operations:

1. Delete data samples whose row records are duplicated, and delete any feature column in which more than 50% of the values are missing;

2. Apply basic transformations to the relevant feature columns;

3. Discretize certain continuous or categorical-text feature columns by designing corresponding dummy variables;

4. Handle outliers: data points in a column that deviate excessively are either deleted directly or reassigned;

5. Combine multiple feature columns through specific logic to compute new feature columns;

6. Split the data horizontally (by rows) according to a certain rule, defining one part as training data and the other as test data.
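As an illustration only, the first five preprocessing operations might be sketched in plain Python as follows. The row-as-dictionary representation, the function names, the use of `None` for missing values, clipping as the form of outlier reassignment, and the sample values are all assumptions made for this sketch; the patent does not prescribe an implementation.

```python
def drop_duplicate_rows(rows):
    """Operation 1 (rows): keep only the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_sparse_columns(rows, max_missing=0.5):
    """Operation 1 (columns): drop columns whose share of None values exceeds max_missing."""
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / len(rows) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


def one_hot(rows, column):
    """Operation 3: replace a categorical column with 0/1 dummy variables."""
    levels = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new[f"{column}_{level}"] = 1 if r[column] == level else 0
        out.append(new)
    return out


def clip_outliers(rows, column, low, high):
    """Operation 4: reassign values outside [low, high] to the nearest bound."""
    return [{**r, column: min(max(r[column], low), high)} for r in rows]


def add_ratio(rows, numerator, denominator, name):
    """Operation 5: derive a new feature column from two existing ones."""
    return [{**r, name: r[numerator] / r[denominator]} for r in rows]


# Hypothetical sample data, for illustration only.
rows = [
    {"age": 30, "income": 5000.0, "city": "A", "note": None},
    {"age": 30, "income": 5000.0, "city": "A", "note": None},  # duplicate row
    {"age": 45, "income": 900000.0, "city": "B", "note": None},
    {"age": 52, "income": 7000.0, "city": "A", "note": "x"},
]
rows = drop_duplicate_rows(rows)      # 3 rows remain
rows = drop_sparse_columns(rows)      # "note" is missing in 2 of 3 rows, so it is dropped
rows = one_hot(rows, "city")          # adds city_A and city_B
rows = clip_outliers(rows, "income", 0.0, 20000.0)
rows = add_ratio(rows, "income", "age", "income_per_age")
```

Operations 2 and 6 (basic transformations and the training/test split) are left out of this sketch to keep it short.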

Further, in the method for a model implementation framework based on supervised machine learning algorithms, the basic transformations include the LOG, EXP, and SQRT transformations.
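A minimal sketch of the three basic transformations, using Python's standard `math` module. The lookup table `TRANSFORMS`, the helper `transform_column`, and the choice of the natural logarithm for LOG are assumptions made for illustration; the source does not state the logarithm base.

```python
import math

# Name-to-function table for the basic transformations (hypothetical helper).
TRANSFORMS = {
    "LOG": math.log,    # natural logarithm; requires strictly positive values
    "EXP": math.exp,
    "SQRT": math.sqrt,  # requires non-negative values
}


def transform_column(values, kind):
    """Apply one basic transformation to every value of a feature column."""
    return [TRANSFORMS[kind](v) for v in values]


# Illustrative values only.
incomes = [1.0, 100.0, 10000.0]
log_incomes = transform_column(incomes, "LOG")
```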

Still further, in the method for a model implementation framework based on supervised machine learning algorithms, the sample data in step 3 require an additional correction column named "weight" or "offset", assigned according to the following rule:

For samples labeled 1, the weight is assigned the value p1/r1;

For samples labeled 0, the weight is assigned the value (1-p1)/(1-r1);

where p1 is the proportion of samples labeled 1 in the original full sample data, and r1 is the proportion of samples labeled 1 in the sample data after the sampling adjustment.
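The weight rule can be sketched as follows. The function name `correction_weights` and the example label lists are hypothetical; only the two formulas p1/r1 and (1-p1)/(1-r1) come from the text.

```python
def correction_weights(labels_full, labels_sampled):
    """Return one weight per sampled row, following the p1/r1 rule."""
    p1 = sum(labels_full) / len(labels_full)        # share of label 1 before sampling
    r1 = sum(labels_sampled) / len(labels_sampled)  # share of label 1 after sampling
    w1 = p1 / r1                  # weight for samples labeled 1
    w0 = (1 - p1) / (1 - r1)      # weight for samples labeled 0
    return [w1 if y == 1 else w0 for y in labels_sampled]


# Illustrative data: 10% positives originally, 50% after undersampling label 0.
full = [1] * 10 + [0] * 90
sampled = [1] * 10 + [0] * 10
weights = correction_weights(full, sampled)

# Weighted share of positives in the adjusted sample.
weighted_share = sum(w for w, y in zip(weights, sampled) if y == 1) / sum(weights)
```

In this example the weights are 0.2 for label 1 and 1.8 for label 0, and the weighted share of positives in the adjusted sample returns to the original 10%, which is the purpose of the correction column.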

Still further, in the method for a model implementation framework based on supervised machine learning algorithms, the algorithm library in step 4 is the algorithm packages in R, or the Scipy algorithm library in Python, or the MLlib algorithm library in Spark.

Still further, in the method for a model implementation framework based on supervised machine learning algorithms, the target item and the prediction results in step 5 are compared using reference metrics, namely the mean squared error and the classification accuracy, wherein

MSE denotes the mean squared error, calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $N$ is the number of test samples, $y_i$ is the target item in the test data, and $\hat{y}_i$ is the model's predicted value.
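For illustration, the mean squared error over N test samples, that is, the average of the squared differences between each target value and the corresponding model prediction, might be computed as in this sketch. The function name and the sample values are assumptions.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between target values and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Illustrative values: only the third prediction is off, by 2.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```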

The classification accuracy is calculated as:

$$\mathrm{ACCURACY} = \frac{p + q}{N}$$

where $N$ is the number of test samples, $p$ is the number of samples for which the model predicts 1 and the actual target item is also 1, and $q$ is the number of samples for which the model predicts 0 and the actual target item is also 0.
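A minimal sketch of the classification accuracy using the quantities named in the text: p counts samples predicted 1 whose actual target is 1, q counts samples predicted 0 whose actual target is 0, and N is the test sample size. The function name and the example labels are hypothetical.

```python
def classification_accuracy(y_true, y_pred):
    """Share of test samples whose predicted label matches the actual label."""
    p = sum(1 for y, y_hat in zip(y_true, y_pred) if y == 1 and y_hat == 1)
    q = sum(1 for y, y_hat in zip(y_true, y_pred) if y == 0 and y_hat == 0)
    return (p + q) / len(y_true)


# Illustrative labels: three of four predictions are correct.
accuracy = classification_accuracy([1, 0, 1, 0], [1, 0, 0, 0])
```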

Still further, in the method for a model implementation framework based on supervised machine learning algorithms, the certain rule is to split the data into training and test data at a ratio of 7:3, that is, 70% of the data samples are used to train the model and 30% are used for testing.
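The 7:3 split could be sketched as below. Shuffling the rows with a fixed seed before cutting is an assumption of this sketch; the text only specifies the ratio, not how rows are assigned to each part.

```python
import random


def train_test_split_rows(rows, train_ratio=0.7, seed=0):
    """Split rows into training and test parts at the given ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # assumed: random assignment of rows
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Illustrative data: ten row identifiers.
train_rows, test_rows = train_test_split_rows(list(range(10)))
```

With ten rows, seven go to the training part and three to the test part, and every row appears in exactly one of the two.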

With the above scheme, the present invention has at least the following advantages:

1. The invention covers the complete process of a supervised machine learning algorithm; it is general and reproducible and can be used for machine learning tasks in any field.

2. The invention treats the data preprocessing process thoroughly, providing reliable input for building machine learning models.

3. The invention is applicable to all kinds of machine learning frameworks and all kinds of machine learning models.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Fig. 1 is structural schematic diagram of the invention.

Specific embodiment

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments generally described and illustrated in the drawings can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment

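As shown in Fig. 1, the method for a model implementation framework based on supervised machine learning algorithms comprises the five steps set out above: overall model data framework design (step 1), data preprocessing design (step 2), sample control design (step 3), model training design (step 4), and model evaluation design (step 5).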

The model input data are divided into a target item and feature items. The target item is the object the model needs to predict, and it is determined from the business requirements. The feature items form the multi-dimensional matrix used for model training, and each dimension of the feature items has some influence on predicting the target item.

The main task of the overall model data framework is therefore to determine the target item and the feature item data through specific logical definitions, based on the business requirements of the actual project and the overall data available, and to merge them into one complete model input matrix.
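As a rough sketch of merging the target item and the feature items into one model input matrix, the following joins two hypothetical tables on a shared sample identifier. The identifier key, the dictionary layout, and the feature names `age` and `visits` are assumptions made for illustration; the patent leaves the merging logic to the project.

```python
def build_input_matrix(targets, features, feature_names):
    """Join targets and features on sample id; return ids, feature matrix X, target vector y.

    targets:  dict mapping sample id -> target value
    features: dict mapping sample id -> dict of feature values
    """
    ids = sorted(set(targets) & set(features))  # keep samples present in both tables
    X = [[features[i][name] for name in feature_names] for i in ids]
    y = [targets[i] for i in ids]
    return ids, X, y


# Hypothetical data: "u3" has no features and "u4" has no target, so both are dropped.
targets = {"u1": 1, "u2": 0, "u3": 1}
features = {
    "u1": {"age": 30, "visits": 4},
    "u2": {"age": 45, "visits": 1},
    "u4": {"age": 22, "visits": 9},
}
ids, X, y = build_input_matrix(targets, features, ["age", "visits"])
```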

The processing in step 2 comprises the six operations listed above. For operation 2, the basic transformation of the relevant feature columns includes, for example, the LOG, EXP, and SQRT transformations.

The certain rule is to split the data into training and test data at a ratio of 7:3, that is, 70% of the data samples are used to train the model and 30% are used for testing.

Imbalance between samples labeled 1 and samples labeled 0 often occurs in actual projects; in most cases there are far fewer samples labeled 1 than samples labeled 0. A sample control process is therefore needed: the samples labeled 1 are replicated, or the samples labeled 0 are randomly sampled, so that the number of samples labeled 1 and the number of samples labeled 0 end up on the same order of magnitude.
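Both sample control options described here, random sampling of the label-0 samples and replication of the label-1 samples, might look like the sketch below. Reducing the label-0 samples to exactly the number of label-1 samples is an assumption; the text only asks for the same order of magnitude. The function names and the example data are hypothetical.

```python
import random


def undersample_majority(samples, labels, seed=0):
    """Keep all label-1 samples and an equally sized random subset of label-0 samples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(neg), len(pos)))
    keep = sorted(pos + kept_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]


def oversample_minority(samples, labels, factor):
    """Replicate every label-1 sample so that it appears `factor` times in total."""
    out_samples, out_labels = [], []
    for s, y in zip(samples, labels):
        reps = factor if y == 1 else 1
        out_samples.extend([s] * reps)
        out_labels.extend([y] * reps)
    return out_samples, out_labels


# Illustrative data: 5 positives among 100 samples.
samples = list(range(100))
labels = [1] * 5 + [0] * 95
under_s, under_l = undersample_majority(samples, labels)
over_s, over_l = oversample_minority(samples, labels, factor=10)
```

After either adjustment the proportion of label 1 in the data differs from the original proportion, which is what the weight column of step 3 is meant to correct.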

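The sample data in step 3 require an additional correction column named "weight" or "offset", assigned according to the following rule:

For samples labeled 1, the weight is assigned the value p1/r1;

For samples labeled 0, the weight is assigned the value (1-p1)/(1-r1);

where p1 is the proportion of samples labeled 1 in the original full sample data, and r1 is the proportion of samples labeled 1 in the sample data after the sampling adjustment.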

For step 4, the following points require attention during modeling:

1. When no algorithm has been theoretically proven to be optimal, every algorithm in the algorithm library that suits the input data must be tried in turn to build a model;

2. The key parameters differ from algorithm to algorithm; the main parameters of each algorithm must be configured specifically for it so that the fit of the model is as good as possible;

3. To prevent the model from overfitting, K-fold cross-validation is needed to tune and fit the parameters to be estimated;

4. After modeling is complete, model diagnostics are required. For regression algorithms, check whether the R² of the model is large; the closer it is to 1, the better the fit. Also check whether the residuals pass the normality and randomness tests, and whether there is obvious multicollinearity among the dimensions. For classification problems, check whether the AUC under the plotted ROC curve is large; the closer it is to 1, the better the fit.
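Points 1 and 3 can be illustrated with a toy sketch that tries every entry of a small algorithm library and scores each by K-fold cross-validation. The two "algorithms" (a constant mean predictor and a one-variable least-squares line), the dictionary `ALGORITHM_LIBRARY`, the fold assignment by index, and the synthetic data are all assumptions made to keep the example self-contained; a real project would call the R, Python, or Spark libraries named in the text. Here cross-validation is used to compare algorithms, whereas the text uses it for tuning parameters; the mechanics are the same.

```python
def fit_mean(xs, ys):
    """Toy algorithm: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Toy algorithm: one-variable least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


# Hypothetical algorithm library: name -> fit function returning a predictor.
ALGORITHM_LIBRARY = {"mean": fit_mean, "line": fit_line}


def k_fold_mse(fit, xs, ys, k=5):
    """Average test MSE over k folds; fold f holds indices f, f+k, f+2k, ..."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        test = [(xs[i], ys[i]) for i in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        fold_errors.append(sum((y - model(x)) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k


# Synthetic data lying exactly on a line, for illustration only.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

# Traverse the library and keep the algorithm with the lowest cross-validated MSE.
scores = {name: k_fold_mse(fit, xs, ys) for name, fit in ALGORITHM_LIBRARY.items()}
best = min(scores, key=scores.get)
```

On this synthetic data the line model reproduces the targets almost exactly, so it is selected over the mean predictor.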

The algorithm library in step 4 is the algorithm packages in R, or the Scipy algorithm library in Python, or the MLlib algorithm library in Spark.

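The target item and the prediction results in step 5 are compared using reference metrics, namely the mean squared error and the classification accuracy, wherein

MSE denotes the mean squared error, calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $N$ is the number of test samples, $y_i$ is the target item in the test data, and $\hat{y}_i$ is the model's predicted value.

The classification accuracy is calculated as:

$$\mathrm{ACCURACY} = \frac{p + q}{N}$$

where $p$ is the number of samples for which the model predicts 1 and the actual target item is also 1, and $q$ is the number of samples for which the model predicts 0 and the actual target item is also 0.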

In addition, some actual projects may define their own model prediction performance metrics. Taking the MSE value, the ACCURACY value, and any custom model evaluation metrics into account together, the model with the smallest possible MSE value and the largest possible ACCURACY value is selected as the final model to use.
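One simple way to express this selection is sketched below: among candidate models, pick the one with the lowest MSE for a regression task or the highest ACCURACY for a classification task. The function `select_best`, the model names, and the metric values are invented for illustration, and how a project combines these metrics with custom ones is left open by the text.

```python
def select_best(results, metric):
    """Pick the model name with the best value of the given metric.

    results: dict mapping model name -> metric value on the test data
    metric:  "MSE" (lower is better) or "ACCURACY" (higher is better)
    """
    if metric == "MSE":
        return min(results, key=results.get)
    if metric == "ACCURACY":
        return max(results, key=results.get)
    raise ValueError(f"unknown metric: {metric}")


# Hypothetical evaluation results, not taken from the source.
regression_results = {"model_a": 4.2, "model_b": 3.1, "model_c": 5.0}
classification_results = {"model_x": 0.81, "model_y": 0.88}

best_regressor = select_best(regression_results, "MSE")
best_classifier = select_best(classification_results, "ACCURACY")
```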

The present invention has at least the advantages set out in the summary above.

The above is only a preferred embodiment of the present invention and is not intended to limit the invention. It should be noted that those of ordinary skill in the art can make further improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. the method for the model realization framework based on supervision class machine learning algorithm, which comprises the following steps:

Step 4: an algorithms library is mainly established in model training design, and the training data that step 2 is completed the process is as input, so Afterwards, the algorithm in algorithms library is called, that is, produces corresponding machine learning model；

Step 5: test data is inputted in trained each model and calculates acquisition prediction result by model evaluation design, than Compared with the otherness of target item and prediction result in test data.

2. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: mode input data are divided into target item and characteristic item in the step 1, wherein target item is pair that model needs to predict As confirming such object by business demand；Characteristic item is then for carrying out model training multi-dimensional matrix, feature Each of item dimension all has certain influence to prediction target item.

3. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: processing mode in the step 2 the following steps are included:

2, the basic conversion of correlated characteristic column；

4. the method for the model realization framework according to claim 3 based on supervision class machine learning algorithm, feature exist In: the basic conversion includes that LOG, EXP, SQRT are converted.

5. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: sample data in the step 3 needs to increase a column entitled " weight " or the amendment column of " offset ", assignment rule are as follows:

The sample that label is 1, weight are assigned a value of p1/r1；

The sample that label is 0, weight are assigned a value of (1-p1)/(1-r1)；

Wherein p1 is ratio shared by label is 1 in initial bulk sample notebook data sample, and r1 is sample data adjusted of sampling Ratio shared by the sample that middle label is 1.

6. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: the algorithms library in the step 4 is the algorithm packet in R, or is the Scipy algorithms library in Python, or in Spark MLlib algorithms library.

7. the method for the model realization framework according to claim 1 based on supervision class machine learning algorithm, feature exist In: target item involved in the step 5 and prediction result are equipped with reference quantity, respectively mean square error and classification accuracy, In,

MSE is known as mean square error, calculation formula are as follows:

Classification accuracy, calculation formula are as follows:

Wherein, N is test sample amount, and p is that model prediction is 1 and realistic objective item is also 1 quantity, q be model prediction be 0 and Realistic objective item is also 0 quantity.

8. the method for the model realization framework according to claim 3 based on supervision class machine learning algorithm, feature exist In: the rule of certain rule is the ratio cut partition training test data with 7:3, i.e., 70% data sample is for training mould Type, 30% data sample are used to test.