CN117648646A - Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning - Google Patents
Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning
- Publication number
- CN117648646A (application number CN202410122065.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- drilling
- learning
- data
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The invention discloses a drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning, belonging to the technical field of oil and gas drilling and production, comprising the following steps. S1: preprocess a petroleum drilling engineering cost data set to obtain a data set SD. S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set. S3: build and train the learning models, namely a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyper-parameter optimization on each learning model with a Bayesian parameter optimization algorithm. S4: combine the base models through a second-level learning model to obtain the final drilling engineering cost prediction value. The invention provides a training framework of feature selection and two-level stacked heterogeneous ensemble learning, and applies Bayesian hyper-parameter optimization to the training of the two-level model, which can improve the accuracy of the overall model prediction.
Description
Technical Field
The invention relates to the technical field of petroleum and natural gas drilling and production, in particular to a drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning.
Background
Machine learning is a method of learning general rules from limited observation data, establishing a learning model, and predicting unknown data with the established model. In a typical machine learning workflow, the raw data are first turned into a set of features, by manual experience or by feature-transformation methods; the features are then fed into a learning model, which outputs a prediction. The prediction accuracy of the machine learning model can be further improved by performing feature engineering and data screening on the data set.
Rather than relying on a single powerful model to make accurate predictions, ensemble learning in machine learning is defined as a technique that combines multiple weak models; these weak models are integrated in some way to obtain the final prediction. The bagging algorithm randomly creates data samples from a single data set, and the random forest is a bagging algorithm specialized for decision trees; the random forest introduces diversity not only in the data but also in the variables, since each tree sees a random subset of the features. A boosting algorithm improves accuracy by combining weak learners sequentially, each new model being fitted to correct the errors of the previous models; the categorical boosting model (CatBoost) improves the traditional boosting method with techniques such as ordered target statistics and ordered boosting. Compared with other ensemble learning algorithms, the stacking algorithm combines different learning algorithms on a single data set: in a first step, a set of base-level learning models is generated; in a second step, a meta-level model is trained on the outputs of the base-level learning models. In stacked ensembles, because the prediction results of the multiple first-step models are combined and used as the input of the meta-learner, prediction accuracy can be improved while bias is reduced. An ensemble model that uses multiple distinct algorithms is known as heterogeneous, and heterogeneous integration of different algorithms can achieve good predictive performance. For the stacking algorithm it is important to apply different models, use different learning strategies or parameters, determine the combination of base learner and meta-learner through repeated experiments, and properly determine the optimal hyper-parameter values of the base learners to realize a good training effect.
Bayesian optimization attempts to collect the most informative observation in each iteration by balancing the exploration of uncertain hyper-parameter regions against the collection of observations near the current optimal hyper-parameters.
Therefore, how to improve the accuracy of the model in the training stage of the basic learner of the stacked integrated learning is a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning, so as to solve the technical problem of how to improve the accuracy of petroleum drilling and production engineering cost prediction.
The invention is realized by adopting the following technical scheme: the drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning comprises the following steps:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training the learning models: a random forest model, a support vector regression model and a categorical boosting model are built, and hyper-parameter optimization is performed on each learning model with a Bayesian parameter optimization algorithm;
s4: weighting and combining the base-model outputs through a second-level learning model to obtain the final drilling engineering cost prediction value.
Further, step S1 includes the following sub-steps:
s11: performing missing-value imputation and filtering on the petroleum drilling engineering cost data set with a machine learning method;
s12: performing target encoding on the categorical parameter data in the imputed and filtered data set;
s13: the petroleum drilling data set is recorded as a data set SD, and the data in the data set SD is normalized.
Further, step S2 includes the following sub-steps:
s21: carrying out Spearman correlation coefficient calculation between each single feature and the label cost value in the data set SD, and screening features accordingly to obtain an optimal feature subset PD;
s22: data screening is carried out on the optimal feature subset PD by utilizing a greedy strategy, and a core data subset CD is obtained;
s23: the core data subset CD is divided into a training set TrainD, a test set TestD and a verification set VD.
Further, step S22 includes the following sub-steps:
s221: adopting a greedy algorithm strategy on the optimal feature subset PD, the tuple $t$ with the maximum utility is iteratively added to the data subset $CD$, $K$ times in total; in each iteration, $h$ tuples not currently in the data subset $CD$ are uniformly sampled as the elements of a candidate sample set $S$;
s222: computing the utility of each tuple of the sample set $S$:

$u(t) = U(t, CD)$

where $u(t)$ is the utility of $t$, $U(\cdot)$ is the utility calculation formula, and $t$ is an element of $S$;
s223: selecting from the sample set $S$ the tuple $t^{*}$ with the maximum utility, and joining $t^{*}$ to the data subset $CD$:

$t^{*} = \arg\max_{t \in S} u(t)$

$CD \leftarrow CD \cup \{t^{*}\}$

where $\arg\max$ is the maximum-utility selection formula and $CD$ is the data subset;
s224: repeating the above steps until $|CD| = K$, finally obtaining the core data subset CD.
Further, step S3 includes the following sub-steps:
s31: constructing a random forest model, a support vector regression model and a classification lifting model, and carrying out model training by adopting a training set TrainD and a test set TestD;
s32: model evaluation, namely evaluating the generalization capability of the trained learning model by using a mean square error and an R square evaluation index;
s33: performing model tuning, namely performing hyper-parameter optimization on the learning models with a Bayesian parameter optimization algorithm, while monitoring whether the learning models over-fit so as to decide whether to stop training;
s34: and respectively setting the super parameters of the learning model as the parameter combinations output in the step S33, putting the verification set VD divided in the step S2 into the learning model, and outputting the drilling and production engineering cost predicted value.
Further, step S34 includes the following sub-steps:
s341: putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a first drilling engineering cost prediction value;
s342: putting the verification set VD into a support vector regression model after Bayesian optimization to obtain a second drilling engineering cost prediction value;
s343: and putting the verification set VD into a classification lifting model subjected to Bayesian optimization to obtain a third drilling engineering cost prediction value.
Further, step S4 includes the following sub-steps:
s41: selecting a linear regression model as a secondary learning model, and training the outputs of the random forest model, the support vector regression model and the classification lifting model as inputs of the linear regression model;
s42: and (3) taking the first drilling engineering cost predicted value, the second drilling engineering cost predicted value and the third drilling engineering cost predicted value obtained in the step (S3) as inputs of a trained linear regression model, and outputting to obtain a final drilling engineering cost predicted value.
The invention has the beneficial effects that:
firstly, carrying out data preprocessing on acquired drilling parameters, wherein the data preprocessing comprises missing value processing, illegal data filtering, characteristic data target coding and the like; performing feature selection and data screening on the processed data set to obtain a high-quality feature data subset; constructing and training three basic learning models of random forest, support vector regression and classification lifting (Catboost) for the screened data set, and performing parameter tuning on the basic learning models by adopting a Bayesian super-parameter optimization algorithm; and finally, constructing a linear regression model as a secondary learning model to obtain a final cost prediction result.
According to the invention, a machine learning algorithm is used for predicting the cost of petroleum drilling engineering, a training framework of feature selection and two-stage stacking heterogeneous integrated learning is provided, and Bayesian super-parameter optimization is used for training optimization of a two-stage model, so that the accuracy of overall model prediction can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a feature selection and data screening flow chart;
FIG. 3 is a schematic diagram of a data screening framework;
FIG. 4 is a flowchart of a Bayesian parameter optimization algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1 to 4, the drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning includes the steps of:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training the learning models: a random forest model, a support vector regression model and a categorical boosting model are built, and hyper-parameter optimization is performed on each learning model with a Bayesian parameter optimization algorithm;
s4: weighting and combining the base-model outputs through a second-level learning model to obtain the final drilling engineering cost prediction value. In fig. 3, Loop represents a cycle and Tuple represents a tuple; Abolo N-1 to Abolo N-6 are well names, Vertical and Directional are well types, and Production and Exploration are well categories.
In this embodiment, step S1 includes the following sub-steps:
S11: missing values in the data set are imputed with a machine learning method and illegal data are filtered out.
S12: target encoding is applied to the categorical parameter data (such as the well-type and well-category parameters) in the cleaned data set: each category of a discrete attribute is encoded as the average value (or other statistic) of the target variable over that category, which captures the association between the category and the target variable.
S13: the petroleum drilling data set is named as SD (the data set SD comprises characteristic parameters such as drilling depth, drilling period, well completion period, well type and the like), the data set sample number is named as n, the characteristic number is named as m, and the characteristic space of the data set SD is defined asWherein->Is a petroleum drilling engineering feature, which is->. The data is normalized, and the calculation formula is as follows:
;
wherein the method comprises the steps ofIs->Normalized value of>Representing the ith feature in the data set SD, and (2)>Is a given dataset +.>Minimum value of characteristic, ++>Is a given dataset +.>Maximum value of the feature.
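The normalization formula above can be sketched per feature column as follows (the drilling-depth values are hypothetical):

```python
def min_max_normalize(values):
    """Min-max normalize a feature column to [0, 1], per the formula above."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

depths = [1500.0, 3000.0, 4500.0]     # hypothetical drilling-depth values
print(min_max_normalize(depths))      # [0.0, 0.5, 1.0]
```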
In this embodiment, step S2 includes the following sub-steps:
S21: correlation calculation: the Spearman correlation coefficient between each single feature and the label cost value in the data set is computed. First, the $i$th feature $X_i$ in the original data set and the target feature $Y$ are rearranged in ascending order of $Y$, and each value of $X_i$ and $Y$ is assigned a rank, giving two sets of rank values $R(X_i)$ and $R(Y)$. The calculation formula is:

$\rho_i = \dfrac{\sum_{j}\big(R(x_{ij}) - \overline{R(X_i)}\big)\big(R(y_j) - \overline{R(Y)}\big)}{\sqrt{\sum_{j}\big(R(x_{ij}) - \overline{R(X_i)}\big)^2 \sum_{j}\big(R(y_j) - \overline{R(Y)}\big)^2}}$

where $\rho_i$ is the Spearman correlation coefficient of the $i$th feature in the original data set SD, $R(x_{ij})$ is the rank value of the $j$th sample of the $i$th feature, $\overline{R(X_i)}$ is its average value, $R(y_j)$ is the rank value of the $j$th sample of the target feature, and $\overline{R(Y)}$ is its average value.
S22: the larger $|\rho_i|$ is, the more strongly the feature correlates with cost, so the features whose correlation exceeds a given threshold are selected as effective features, and the weakly correlated features are deleted, obtaining the optimal feature subset PD.
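The rank-correlation computation of S21 can be sketched in plain Python as the Pearson correlation of the rank values; for simplicity the sketch assumes no tied values (ties would require average ranks):

```python
import math

def ranks(xs):
    """Rank values in ascending order (1-based); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank values."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

print(spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))  # 1.0 (perfectly monotone)
```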
S23: in order to improve the quality of the data set, the data subset is further subjected to data screening by utilizing a greedy strategy to obtain a core data subset CD, the core data set CD is initialized to be an empty set, and the size of the core data set CD is given to be K #Setting +.>When the data set is small +.>) Setting a sample set (++>) The size is h (the default size of h is set to 200 and can be adjusted according to actual conditions).
S24: employing a greedy algorithm strategy for dataset PD will have maximum utility (the "utility" of tuple t represents the expected reduction of gradient approximation errors after tuple t is added to core set CDIs added K times to the core set. Uniformly sampling h tuples as tuples not in the core set CD (one tuple is a piece of data in the dataset)Elements of (a) and (b);
s25: calculation ofThe utility of each tuple in (a) is calculated as follows:
;
for the effect of t, effect calculation +.>The method comprises the following specific steps:
(1) Considering the expectations of the core set as a utility function, the utility of the core set CD may be expressed as E [ CD ], and K may be converted to a sum of the expected values of each tuple in the computation CD, as follows:
;
in the middle ofThe j-th piece of data in the core set CD, which is also a tuple; />Representing the Euclidean distance between the feature vectors of two tuples, i.e. the feature distance, wherein +.>Is an index mapping from core set CD to data set PD, with +.>To represent the j-th tuple in CD (i.e +.>) Is the i-th tuple in PD (i.e.)>);
(2) Adding tuple t to core set CD to obtain new core setInitializing the utility as 0, and the expression is as follows:
;
;
(3) Computing each tuple in the PDFor core set->The utility of (2) is calculated by the following formula:
;
in the middle ofRepresenting core set->Is also a tuple; />Representing the euclidean distance between feature vectors of two tuples, i.e. the feature distance; />Is a slave core set->Index mapping to dataset PD with +.>To express +.>Is the j-th tuple of (i.e +)>) Is the i-th tuple in PD (i.e.)>);
(4) Repeating step (3) until the value of i is equal to n, i.e., each tuple in the dataset PD is traversed;
(5) The utility of tuple t on core set CD is then obtained, as calculated by:
。
s26: from the slaveIn selecting the most effective tuple +.>And adds it to the core set CD, whose expression is as follows:
;
。
s27: steps S24-S26 are repeated until the size of the core set |cd| is equal to K, resulting in a final core set CD.
S28: the core data set CD is divided into a training set TrainD, a test set TestD and a verification set VD according to a certain proportion.
In this embodiment, step S3 includes the following sub-steps:
S31: a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, 3 base learning models in total, are constructed and trained with the training set TrainD.
(1) Training the random forest base learning model: $K$ new sample sets are randomly drawn from the original data set by bootstrap sampling; for each decision tree, a sample subset is randomly selected as its training set and a subset of the features (of size equal to the square root of the total number of features) is randomly selected as its feature set. A decision tree is constructed from this training set and feature set, and the sampling-and-building step is repeated until the predetermined number of leaf nodes is reached or the leaf nodes can no longer be split, producing multiple decision trees. The prediction function of each decision tree is expressed as:

$h_k(x) = \sum_{j=1}^{T_k} \hat{y}_{k,j}\, I\big(x \in R_{k,j}\big)$

where $h_k(x)$ represents the prediction result of the $k$th decision tree, $k$ denotes the $k$th decision tree, $x$ represents the input sample, $T_k$ represents the number of leaf nodes of the $k$th decision tree, $\hat{y}_{k,j}$ represents the predicted value of the $j$th leaf node of the $k$th decision tree, $R_{k,j}$ represents the sample set of the $j$th leaf node of the $k$th decision tree, and $I(\cdot)$ is the indicator function.
For a new sample, the sample is input into each decision tree to obtain multiple prediction results, which are averaged to obtain the final prediction result. The calculation formula of the prediction function of the multiple decision trees is:

$H(x) = \dfrac{1}{K} \sum_{k=1}^{K} h_k(x)$

where $k$ denotes the $k$th decision tree, $K$ represents the number of decision trees, $h_k(x)$ represents the prediction result of the $k$th decision tree, and $H(x)$ represents the averaged prediction result of all decision trees.
(2) The training set TrainD divided in step S2 is imported into the support vector regression base learning model for training, obtaining a trained support vector regression model;
(3) The training set TrainD divided in step S2 is imported into the CatBoost base learning model for training, obtaining a trained categorical boosting (CatBoost) model.
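The bootstrap-and-average mechanism of the random forest in (1) can be sketched without a full decision-tree implementation; as a deliberate simplification, each "tree" below is replaced by a trivial base learner (the mean of its bootstrap sample) purely to show the bagging structure:

```python
import random

def bootstrap_sample(data, rng):
    """Draw |data| samples with replacement (the bootstrap)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(train_y, n_models=50, seed=0):
    """Average the predictions of n_models base learners, each fit on a bootstrap sample.
    Each base learner here is a trivial mean predictor standing in for a decision tree."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = bootstrap_sample(train_y, rng)
        preds.append(sum(sample) / len(sample))   # h_k(x): one base learner's prediction
    return sum(preds) / len(preds)                # H(x): the bagged average

y = [100.0, 120.0, 110.0, 130.0]
print(bagged_predict(y))  # close to 115.0, the overall mean
```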
S32: model evaluation: the generalization capability of the trained basic learning model is evaluated by using a mean square error and an R square evaluation index, when the value of the mean square error is smaller and the value of the R square is larger, the predicted value of the predicted model is closer to the true value, the effect is better, and the calculation formula is as follows:
;
;
wherein: MSE represents the mean square error result, i represents the ith sample, m represents the number of samples,representing the sample true tag value,/->Predictive value of representative model on sample, +.>Representing the average of the sample true tag values.
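The two evaluation indices of S32 can be computed directly from their formulas:

```python
def mse(y_true, y_pred):
    """Mean square error over m samples."""
    m = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / m

def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(mse(y_true, y_pred))        # ~0.1667
print(r_squared(y_true, y_pred))  # 0.9375
```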
(1) The test set TestD is input into the random forest model trained in S31, and the generalization ability of the random forest model is evaluated with the mean square error and the R-squared evaluation index;
(2) The test set TestD is input into the support vector regression model trained in S31, and its generalization ability is evaluated with the mean square error and the R-squared evaluation index;
(3) The test set TestD is input into the categorical boosting (CatBoost) model trained in S31, and its generalization ability is evaluated with the mean square error and the R-squared evaluation index.
S33: model tuning: the hyper-parameters (the models' built-in parameters) of the 3 base models are respectively set, hyper-parameter tuning is performed on each base learning model with a Bayesian parameter optimization algorithm, and each model is monitored for over-fitting (to decide whether to stop training). The Bayesian parameter optimization algorithm comprises the following steps:
(1) The hyper-parameters of the 3 base models to be optimized by the Bayesian parameter optimization algorithm are respectively set, a parameter range is given for each, and initialization points are randomly generated within the parameter range;
(2) The probabilistic surrogate model of the Bayesian optimization algorithm is Bayes' rule:

$p(f \mid D) = \dfrac{p(D \mid f)\, p(f)}{p(D)}$

where $f$ is the unknown objective function; $D$ represents the set of collected sample points; $y$ is the current sampled value; $p(D \mid f)$ is the likelihood distribution of $y$; $p(f)$ is the prior probability distribution model of $f$; $p(D)$ is the marginal likelihood distribution that marginalizes $f$; and $p(f \mid D)$ is the posterior probability distribution of $f$, i.e. the confidence in the unknown function after the prior probability distribution has been corrected by the observations.
(3) Let the parameter combination to be optimized be $x$, and let the Bayesian optimization objective be the error of the trained model on the test set, expressed as $y = L(x)$; a set of data $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$ is thereby obtained. To predict the observation $f_{t+1}$ at $x_{t+1}$, the observation points are regarded as a sample of a Gaussian process, which is distributed as follows:

$f(x) \sim \mathrm{GP}\big(\mu(x), k(x, x')\big)$

where $\mathrm{GP}$ represents a Gaussian process, $\mu$ represents the mean function, and $k$ is the covariance function; accordingly, the joint distribution of the observed values $y_{1:t}$ and the new point is:

$\begin{bmatrix} y_{1:t} \\ f_{t+1} \end{bmatrix} \sim N\left(\mu, \begin{bmatrix} \mathbf{K} & \mathbf{k} \\ \mathbf{k}^{T} & k(x_{t+1}, x_{t+1}) \end{bmatrix}\right)$

where $\mathbf{K} = [k(x_i, x_j)]_{t \times t}$ is the covariance matrix of the observed points, $\mathbf{k} = [k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)]^{T}$, and $\mathbf{k}^{T}$ represents its transposed matrix. The distribution of $f_{t+1}$ can thereby be obtained:

$P(f_{t+1} \mid D_{1:t}, x_{t+1}) = N\big(\mu_{t+1}(x_{t+1}), \sigma_{t+1}^{2}(x_{t+1})\big)$

which is the distribution result of $f_{t+1}$ given the data $D_{1:t}$ of the $t$ observation points, with posterior mean $\mu_{t+1} = \mathbf{k}^{T}\mathbf{K}^{-1} y_{1:t}$ and posterior variance $\sigma_{t+1}^{2} = k(x_{t+1}, x_{t+1}) - \mathbf{k}^{T}\mathbf{K}^{-1}\mathbf{k}$, the likelihood function entering through the Gaussian observation model;
(4) EI (Expected Improvement) is selected as the acquisition function to determine the hyper-parameters of the next iteration. The $x$ with the greatest expected improvement of the objective function, namely the $x$ that maximizes the acquisition function $\mathrm{EI}(x)$, is the next hyper-parameter selected. The calculation formula is:

$x_{t+1} = \arg\max_{x \in X} \mathrm{EI}(x)$

where $x_{t+1}$ is the next hyper-parameter and the function $\mathrm{EI}$ is obtained by mapping the decision space $X$, the observation space and the hyper-parameter space to the real space.
(5) The newly collected sample $(x_{t+1}, y_{t+1})$ is added to the historical sample set $D_{1:t}$, and the Gaussian model is updated and corrected so that it better approximates the true distribution of the objective function;
(6) Repeating the steps (2) - (5), stopping updating the model when the iteration reaches the maximum number of times, and outputting the optimal parameter combination of the model.
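The loop in steps (1)-(6) can be sketched in a compact, self-contained form. The snippet below is a minimal illustration, not the patent's implementation: it uses a numpy Gaussian-process surrogate with an RBF kernel (so the posterior mean and variance follow the formulas above), the EI acquisition function of step (4), and a stand-in 1-D objective in place of the real model-error surface; the hyperparameter range, candidate grid and iteration count are all assumptions.

```python
# Minimal Bayesian optimization sketch: GP surrogate + EI acquisition.
import numpy as np
from math import erf, sqrt

def rbf_kernel(a, b, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)) for 1-D inputs."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean k^T K^-1 y and variance k(x,x) - k^T K^-1 k."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k = rbf_kernel(x_train, x_query)                 # shape (t, q)
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y_train
    var = 1.0 - np.sum(k * (K_inv @ k), axis=0)      # k(x,x) = 1 for RBF
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """EI for minimization: E[max(y_best - y, 0)] under N(mu, var)."""
    sigma = np.sqrt(var)
    z = (y_best - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    return (y_best - mu) * cdf + sigma * pdf

def objective(x):            # stand-in for the model's test-set error f(x)
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201)     # assumed hyperparameter range
x_obs = rng.uniform(0, 1, 3)                # (1) random initial points
y_obs = objective(x_obs)
for _ in range(15):                         # (6) iterate to a max count
    mu, var = gp_posterior(x_obs, y_obs, candidates)  # (2)-(3) surrogate
    ei = expected_improvement(mu, var, y_obs.min())   # (4) acquisition
    x_next = candidates[np.argmax(ei)]                # next hyperparameter
    x_obs = np.append(x_obs, x_next)                  # (5) update samples
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[np.argmin(y_obs)]
```

In the patent's setting, `objective` would instead train a base model with hyperparameters `x` and return its test-set error.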
S34: The hyperparameters of the 3 models are respectively set to the parameter combinations output in step S33, and the verification set VD divided in step S2 is put into the 3 base learning models to obtain their predictions:
(1) Putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a drilling and production engineering cost prediction value 1 output by a primary learner;
(2) Putting the verification set VD into a support vector regression model subjected to Bayesian optimization to obtain a drilling and production engineering cost prediction value 2 output by a primary learner;
(3) Putting the verification set VD into a classification lifting (Catboost) model after Bayesian optimization to obtain a drilling and production engineering cost prediction value 3 output by a primary learner;
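Step S34 can be sketched as follows. This is an illustrative stand-in, not the patent's code: scikit-learn models replace the tuned base learners, CatBoost is swapped for sklearn's GradientBoostingRegressor to avoid an extra dependency, the features and cost label are synthetic, and the hyperparameters shown are placeholders rather than the Bayesian-optimized values.

```python
# Three heterogeneous base learners each emit a level-1 cost prediction on VD.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 5))         # stand-in drilling features
y = 100 + 40 * X[:, 0] + 25 * X[:, 1] ** 2 + rng.normal(0, 2, 300)  # cost label

X_train, X_vd, y_train, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=0)     # VD plays the verification set

base_models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(C=10.0, epsilon=0.1),
    "boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}
level1_preds = {}
for name, model in base_models.items():
    model.fit(X_train, y_train)              # (1)-(3): fit each base learner
    level1_preds[name] = model.predict(X_vd) # predicted values 1, 2, 3 on VD
```

Each array in `level1_preds` corresponds to one of the three cost prediction values fed to the secondary learner.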
In the present embodiment, step S4 comprises the following steps:
S41: A linear regression model is selected as the secondary learning model and is trained with the outputs (cost prediction values) of the 3 base learning models as its inputs;
S42: Cost prediction value 1, cost prediction value 2 and cost prediction value 3 obtained in step S3 are taken as inputs of the trained linear regression model, and the final drilling and production engineering cost prediction value is output.
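Steps S41-S42 amount to fitting an ordinary least-squares model over the three level-1 prediction vectors. The numpy-only sketch below simulates those three vectors (they are not real base-model outputs) and shows the combine step; in-sample, the stacked RMSE can never exceed that of the best single base prediction, since OLS minimizes squared error over all linear combinations of the inputs.

```python
# Secondary learner: OLS linear regression over three base-model predictions.
import numpy as np

rng = np.random.default_rng(1)
true_cost = rng.uniform(100, 200, 80)            # validation-set cost labels
# Simulated level-1 predictions: each base model = truth + its own error.
pred1 = true_cost + rng.normal(0, 5, 80)         # random forest stand-in
pred2 = true_cost + rng.normal(0, 8, 80)         # SVR stand-in
pred3 = true_cost + rng.normal(0, 6, 80)         # CatBoost stand-in

# Design matrix [1, pred1, pred2, pred3]; solve for intercept and weights.
A = np.column_stack([np.ones_like(pred1), pred1, pred2, pred3])
coef, *_ = np.linalg.lstsq(A, true_cost, rcond=None)

final_pred = A @ coef                            # stacked final prediction
rmse_stack = np.sqrt(np.mean((final_pred - true_cost) ** 2))
rmse_best_single = min(np.sqrt(np.mean((p - true_cost) ** 2))
                       for p in (pred1, pred2, pred3))
```

The learned weights effectively down-weight the noisier base model, which is the benefit the stacking layer provides.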
First, data preprocessing is performed on the acquired drilling parameters, including missing-value processing, filtering of invalid data, and target encoding of categorical features; feature selection and data screening are then performed on the processed data set to obtain a high-quality feature data subset; three base learning models (random forest, support vector regression and categorical boosting (CatBoost)) are constructed and trained on the screened data set, with a Bayesian hyperparameter optimization algorithm used for parameter tuning; finally, a linear regression model is constructed as the secondary learning model to obtain the final cost prediction result.
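The preprocessing stage described above can be illustrated on a toy table. The field names (`depth`, `well_type`) and values below are invented for the example, not taken from the patent's data set; the three steps shown are mean imputation of missing values, target encoding of a categorical feature by the mean cost of its category, and min-max normalization.

```python
# Toy preprocessing: imputation, target encoding, min-max normalization.
import numpy as np

depth = np.array([1500.0, np.nan, 3200.0, 2100.0, np.nan, 2800.0])  # metres
well_type = ["vertical", "horizontal", "vertical",
             "slanted", "horizontal", "slanted"]
cost = np.array([120.0, 180.0, 210.0, 150.0, 175.0, 160.0])         # label

# 1) Missing-value processing: fill NaN depths with the column mean.
depth_filled = np.where(np.isnan(depth), np.nanmean(depth), depth)

# 2) Target encoding: category -> mean cost of the rows in that category.
cats = sorted(set(well_type))
cat_mean = {c: cost[[t == c for t in well_type]].mean() for c in cats}
well_type_enc = np.array([cat_mean[t] for t in well_type])

# 3) Min-max normalization to [0, 1].
def minmax(col):
    return (col - col.min()) / (col.max() - col.min())

depth_norm = minmax(depth_filled)
well_type_norm = minmax(well_type_enc)
```

After these steps every column is numeric and on a common scale, ready for the feature-selection and screening stage.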
The invention uses machine learning algorithms to predict petroleum drilling engineering cost, provides a training framework of feature selection and two-stage stacked heterogeneous ensemble learning, and applies Bayesian hyperparameter optimization to the training of the two-stage model, thereby improving the accuracy of the overall model's predictions.
For simplicity of explanation, the foregoing embodiments are presented as a series of acts, but those skilled in the art will understand that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts referred to are not necessarily required by the present application.
The above embodiments describe the basic principles, main features and advantages of the present invention. Those skilled in the art will appreciate that the present invention is not limited by the foregoing embodiments, which merely illustrate its principles; various modifications and changes can be made without departing from the spirit and scope of the invention, and all such modifications fall within the scope of the appended claims.
Claims (7)
1. The drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning is characterized by comprising the following steps of:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training learning models: a random forest model, a support vector regression model and a classification lifting model are respectively constructed, and hyperparameter optimization is performed on each learning model by using a Bayesian parameter optimization algorithm;
s4: weighting and combining the outputs through the two-level learning model to obtain a final drilling and production engineering cost prediction value.
2. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S1 includes the sub-steps of:
s11: performing blank value supplementation and filtering treatment on the petroleum drilling engineering cost data set by a machine learning method;
s12: performing target coding processing on the type parameter data in the data set after the vacancy value supplementation and the filtering processing;
s13: the petroleum drilling data set is recorded as a data set SD, and the data in the data set SD is normalized.
3. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S2 includes the sub-steps of:
s21: carrying out spearman correlation coefficient calculation on the single feature and the tag cost value in the data set SD to obtain an optimal feature subset PD;
s22: data screening is carried out on the optimal feature subset PD by utilizing a greedy strategy, and a core data subset CD is obtained;
s23: the core data subset CD is divided into a training set TrainD, a test set TestD and a verification set VD.
4. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S22 includes the sub-steps of:
s221: a greedy algorithm strategy is adopted for the optimal feature subset PD: over K iterations, the tuple t with the greatest utility is selected and added to the data subset $D'$; from the tuples not yet in $D'$, h tuples are uniformly sampled as the elements of a sample set $S$;
s222: the utility of each tuple in the sample set $S$ is computed:

$$u_t = U(t), \quad t \in S$$

where $u_t$ is the utility of $t$, $U(\cdot)$ is the utility calculation formula, and $t$ is an element of $S$;
s223: from the sample set $S$, the tuple $t^{*}$ with the greatest utility is selected and added to the data subset $D'$:

$$t^{*} = \arg\max_{t \in S} U(t)$$

$$D' \leftarrow D' \cup \{t^{*}\}$$

where $\arg\max$ is the maximum-utility selection formula and $D'$ is the data subset;
s224: repeating the steps to finally obtain the core data subset CD.
5. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S3 includes the sub-steps of:
s31: constructing a random forest model, a support vector regression model and a classification lifting model, and carrying out model training by adopting a training set TrainD and a test set TestD;
s32: model evaluation, namely evaluating the generalization capability of the trained learning model by using a mean square error and an R square evaluation index;
s33: performing model optimization, namely performing super-parameter optimization on the learning model by using a Bayesian parameter optimization algorithm, and simultaneously monitoring whether the learning model is fitted or not so as to determine whether training is stopped;
s34: and respectively setting the super parameters of the learning model as the parameter combinations output in the step S33, putting the verification set VD divided in the step S2 into the learning model, and outputting the drilling and production engineering cost predicted value.
6. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 5, wherein step S34 includes the sub-steps of:
s341: putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a first drilling engineering cost prediction value;
s342: putting the verification set VD into a support vector regression model after Bayesian optimization to obtain a second drilling engineering cost prediction value;
s343: and putting the verification set VD into a classification lifting model subjected to Bayesian optimization to obtain a third drilling engineering cost prediction value.
7. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 6, wherein step S4 includes the sub-steps of:
s41: selecting a linear regression model as a secondary learning model, and training the outputs of the random forest model, the support vector regression model and the classification lifting model as inputs of the linear regression model;
s42: and (3) taking the first drilling engineering cost predicted value, the second drilling engineering cost predicted value and the third drilling engineering cost predicted value obtained in the step (S3) as inputs of a trained linear regression model, and outputting to obtain a final drilling engineering cost predicted value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410122065.7A CN117648646B (en) | 2024-01-30 | 2024-01-30 | Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117648646A true CN117648646A (en) | 2024-03-05 |
CN117648646B CN117648646B (en) | 2024-04-26 |
Family
ID=90046383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410122065.7A Active CN117648646B (en) | 2024-01-30 | 2024-01-30 | Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117648646B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7660705B1 (en) * | 2002-03-19 | 2010-02-09 | Microsoft Corporation | Bayesian approach for learning regression decision graph models and regression models for time series analysis |
CN113626315A (en) * | 2021-07-27 | 2021-11-09 | 江苏大学 | Dual-integration software defect prediction method combined with neural network |
WO2022151654A1 (en) * | 2021-01-14 | 2022-07-21 | 新智数字科技有限公司 | Random greedy algorithm-based horizontal federated gradient boosted tree optimization method |
CN117349751A (en) * | 2023-10-24 | 2024-01-05 | 中国地质大学(武汉) | Loess landslide slip distance prediction method and system based on meta-learning and Bayesian optimization |
CN117439053A (en) * | 2023-10-15 | 2024-01-23 | 国网天津市电力公司电力科学研究院 | Method, device and storage medium for predicting electric quantity of Stacking integrated model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||