CN117648646A - Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning - Google Patents

Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning

Info

Publication number
CN117648646A
CN117648646A
Authority
CN
China
Prior art keywords
model
drilling
learning
data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410122065.7A
Other languages
Chinese (zh)
Other versions
CN117648646B (en)
Inventor
赵莉
李皋
任冬梅
蒋俊
肖东
夏文鹤
李红涛
刘厚彬
方潘
杨旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Petroleum University
Original Assignee
Southwest Petroleum University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Petroleum University filed Critical Southwest Petroleum University
Priority to CN202410122065.7A priority Critical patent/CN117648646B/en
Publication of CN117648646A publication Critical patent/CN117648646A/en
Application granted granted Critical
Publication of CN117648646B publication Critical patent/CN117648646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning, belonging to the technical field of oil and gas drilling and production and comprising the following steps. S1: preprocess the petroleum drilling engineering cost data set to obtain a data set SD; S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set; S3: build and train the learning models: construct a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyperparameter optimization on each learning model using a Bayesian optimization algorithm; S4: weight and combine the outputs through the second-level learning model to obtain the final drilling engineering cost prediction value. The invention provides a training framework of feature selection and two-level stacked heterogeneous ensemble learning, and uses Bayesian hyperparameter optimization when training both levels of the model, which improves the accuracy of the overall model prediction.

Description

Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning
Technical Field
The invention relates to the technical field of oil and gas drilling and production, and in particular to a drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning.
Background
Machine learning is a method of learning general rules from limited observed data, building a learning model, and using the established model to predict unknown data. In a typical machine learning process, the raw data is first distilled into a set of features, either through manual experience or through feature transformation methods; the features are then input into a learning model, which outputs a prediction result. Performing feature engineering and data screening on the data set can further improve the prediction accuracy of a machine learning model.
Instead of relying on a single powerful model to make accurate predictions, ensemble learning in machine learning is defined as a technique that combines multiple weak models, integrating them in some way to obtain the final prediction. The bagging algorithm randomly creates data samples from a single data set; random forest is a bagging algorithm specialized for decision trees, and it introduces diversity not only in the data but also in the variables. The boosting algorithm combines weak learners sequentially, with each new model fitting the errors left by the previous model, in order to improve accuracy; the categorical boosting model (CatBoost) improves on traditional boosting with techniques such as ordered target statistics and ordered boosting. Compared with other ensemble learning algorithms, the stacking algorithm combines different learning algorithms on a single data set: in a first step, a set of base-level learning models is generated; in a second step, a meta-level model is trained on the outputs of the base-level learning models. In stacked integration, because the prediction results of the multiple first-step models are combined and used as the input of the meta-learner, prediction accuracy can be improved while bias is reduced. An ensemble model that uses multiple different underlying algorithms is known as heterogeneous, and heterogeneous integration of different algorithms can achieve good predictive performance. For the stacking algorithm it is important to apply different models, use different learning strategies or parameters, determine the combination of base learners and meta-learner through repeated experiments, and properly determine the optimal hyperparameter values of the base learners to achieve a good training effect. Bayesian optimization tries to collect the most informative observation in each iteration by balancing the exploration of uncertain hyperparameters against the collection of observations near the optimal hyperparameters.
Therefore, how to improve the accuracy of the model during the training stage of the base learners of stacked ensemble learning is a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning, so as to solve the technical problem of improving the accuracy of petroleum drilling and production engineering cost prediction.
The invention is realized by adopting the following technical scheme: the drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning comprises the following steps:
S1: preprocess the petroleum drilling engineering cost data set to obtain a data set SD;
S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set;
S3: build and train the learning models: construct a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyperparameter optimization on each learning model using a Bayesian optimization algorithm;
S4: weight and combine the outputs through the second-level learning model to obtain the final drilling engineering cost prediction value.
Further, step S1 includes the following sub-steps:
S11: fill the missing values in the petroleum drilling engineering cost data set and filter it using machine learning methods;
S12: apply target encoding to the categorical parameter data in the data set after missing-value filling and filtering;
S13: record the petroleum drilling data set as data set SD, and normalize the data in data set SD.
Further, step S2 includes the following sub-steps:
S21: compute the Spearman correlation coefficient between each single feature and the label cost value in data set SD to obtain the optimal feature subset PD;
S22: perform data screening on the optimal feature subset PD using a greedy strategy to obtain the core data subset CD;
S23: divide the core data subset CD into a training set TrainD, a test set TestD and a verification set VD.
Further, step S22 includes the following sub-steps:
S221: using a greedy algorithm strategy on the optimal feature subset PD, the maximum-utility tuple $t$ is added to the data subset CD over $K$ iterations; in each iteration, $h$ tuples that are not in the data subset CD are uniformly sampled as the elements of the sample set $S$;
S222: the utility of each tuple in the sample set $S$ is computed as

$$u(t) = E[CD] - E[CD \cup \{t\}], \quad t \in S$$

where $u(t)$ is the utility of $t$ and $E[\cdot]$ is the utility calculation function;
S223: the maximum-utility tuple $t^*$ is selected from the sample set $S$ and added to the data subset CD:

$$t^* = \arg\max_{t \in S} u(t), \qquad CD \leftarrow CD \cup \{t^*\}$$

where $t^*$ is the maximum-utility selection and CD is the data subset;
S224: the above steps are repeated to finally obtain the core data subset CD.
Further, step S3 includes the following sub-steps:
S31: construct a random forest model, a support vector regression model and a CatBoost model, and perform model training using the training set TrainD and the test set TestD;
S32: model evaluation: evaluate the generalization ability of the trained learning models using the mean square error and R-squared evaluation indexes;
S33: model tuning: perform hyperparameter optimization on the learning models using the Bayesian optimization algorithm, while monitoring whether the learning models overfit in order to decide whether to stop training;
S34: set the hyperparameters of the learning models to the parameter combinations output in step S33, feed the verification set VD divided in step S2 into the learning models, and output the drilling and production engineering cost prediction values.
Further, step S34 includes the following sub-steps:
S341: feed the verification set VD into the Bayesian-optimized random forest model to obtain the first drilling engineering cost prediction value;
S342: feed the verification set VD into the Bayesian-optimized support vector regression model to obtain the second drilling engineering cost prediction value;
S343: feed the verification set VD into the Bayesian-optimized CatBoost model to obtain the third drilling engineering cost prediction value.
Further, step S4 includes the following sub-steps:
S41: select a linear regression model as the secondary learning model, and train it with the outputs of the random forest model, the support vector regression model and the CatBoost model as its inputs;
S42: take the first, second and third drilling engineering cost prediction values obtained in step S3 as the inputs of the trained linear regression model, and output the final drilling engineering cost prediction value.
The invention has the beneficial effects that:
The invention first preprocesses the acquired drilling parameters, including missing-value handling, illegal-data filtering and target encoding of categorical features; feature selection and data screening are then performed on the processed data set to obtain a high-quality feature data subset; three base learning models, namely random forest, support vector regression and categorical boosting (CatBoost), are constructed and trained on the screened data set, and a Bayesian hyperparameter optimization algorithm is used to tune the base learning models; finally, a linear regression model is constructed as the secondary learning model to obtain the final cost prediction result.
According to the invention, machine learning algorithms are used to predict the cost of petroleum drilling engineering; a training framework of feature selection and two-level stacked heterogeneous ensemble learning is provided, and Bayesian hyperparameter optimization is used when training both levels of the model, which improves the accuracy of the overall model prediction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a feature selection and data screening flow chart;
FIG. 3 is a schematic diagram of a data screening framework;
FIG. 4 is a flowchart of a Bayesian parameter optimization algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1 to 4, the drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning includes the steps of:
S1: preprocess the petroleum drilling engineering cost data set to obtain a data set SD;
S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set;
S3: build and train the learning models: construct a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyperparameter optimization on each learning model using a Bayesian optimization algorithm;
S4: weight and combine the outputs through the second-level learning model to obtain the final drilling engineering cost prediction value. In fig. 3, Loop denotes a cycle and Tuple denotes a tuple; Abolo N-1 to Abolo N-6 are well names; Vertical and Directional denote well types; Production and Exploration denote well categories.
In this embodiment, step S1 includes the following sub-steps:
s11: the machine learning method supplements the data set with blank values and filters illegal data.
S12: and (3) carrying out target coding processing on category type parameter data (such as parameters of well type, well type and the like) in the completed data set, wherein each category of the discrete attribute is coded into an average value or other statistical information of the discrete attribute on a target variable by target coding, and the association between the category and the target variable can be captured.
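To make S12 concrete, the following minimal Python sketch encodes a categorical drilling parameter by the mean cost within each category. The column names and the smoothing term are assumptions for illustration; the patent only specifies encoding each category as the mean (or another statistic) of the target variable.

```python
import pandas as pd

def target_encode(df, cat_col, target_col, smoothing=10.0):
    """Replace each category (e.g. a well type) with the smoothed mean of the
    target (cost) inside that category. The smoothing toward the global mean
    is our addition to stabilise rare categories; the patent does not specify it."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1.0 - weight) * global_mean
    return df[cat_col].map(encoding)

# Hypothetical usage with illustrative column names:
df = pd.DataFrame({"well_type": ["vertical", "directional", "vertical"],
                   "cost": [100.0, 150.0, 110.0]})
df["well_type_enc"] = target_encode(df, "well_type", "cost")
```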
S13: The petroleum drilling data set is named SD (data set SD contains characteristic parameters such as drilling depth, drilling period, well completion period and well type), the number of samples is denoted $n$ and the number of features $m$, and the feature space of data set SD is defined as $F = \{X_1, X_2, \dots, X_m\}$, where $X_i$ is a petroleum drilling engineering feature and $i = 1, 2, \dots, m$. The data is normalized by the formula

$$x_i' = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$

where $x_i'$ is the normalized value of $x_i$, $x_i$ represents the $i$-th feature in the data set SD, $\min(x_i)$ is the minimum value of the given feature in the data set, and $\max(x_i)$ is the maximum value of the given feature in the data set.
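A direct implementation of the S13 normalization might look like the sketch below, applied column-wise to the feature matrix of data set SD (the toy data is illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max normalization: x' = (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # The small epsilon is our addition to guard against constant columns.
    return (X - x_min) / (x_max - x_min + 1e-12)

SD = np.array([[3200.0, 45.0], [5100.0, 80.0], [4000.0, 60.0]])  # toy features
print(min_max_normalize(SD))
```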
In this embodiment, step S2 includes the following sub-steps:
s21: correlation calculation, namely, carrying out Szelman correlation coefficient calculation on single feature and tag cost value in a data set, and firstly, carrying out ith feature in the original data setAnd target feature Y rearranges the two sets of data in an ascending order of Y, and gives each +.>And Y assignment rank value +.>Two sets of gradation value data are obtained>And->WhereinThe calculation formula is:
wherein the method comprises the steps ofIs the Pi Erman correlation coefficient of the ith feature in the original dataset SD,/for>A class value representing the jth sample of the ith feature,/->Represents->Average value of>A class value representing the j-th sample of the target feature, a class value representing the j-th sample of the target feature>Represents->Average value of (2).
S22: The larger $|\rho_i|$ is, the more strongly the feature is correlated with cost, so features whose $|\rho_i|$ exceeds a given threshold are selected as effective features and the features with smaller correlation are deleted, giving the optimal feature subset PD.
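A sketch of the S21-S22 feature selection using SciPy's spearmanr follows; the threshold value 0.2 and the toy data are assumptions, since the concrete threshold used by the method is not legible in the text:

```python
import numpy as np
from scipy.stats import spearmanr

def select_features(X, y, threshold=0.2):
    """Keep feature i when |rho_i| >= threshold (threshold value assumed here)."""
    kept = []
    for i in range(X.shape[1]):
        rho, _ = spearmanr(X[:, i], y)
        if abs(rho) >= threshold:
            kept.append(i)
    return np.asarray(kept, dtype=int)

rng = np.random.default_rng(0)
X = rng.random((100, 6))             # stand-in for the normalized data set SD
y = 3.0 * X[:, 0] + rng.random(100)  # toy cost label correlated with feature 0
PD = X[:, select_features(X, y)]     # optimal feature subset
```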
S23: To improve the quality of the data set, the data subset is further screened with a greedy strategy to obtain the core data subset CD. The core data set CD is initialized as an empty set and given a target size $K$ ($K < n$, chosen according to the scale of the data set), and the sample set $S$ is given size $h$ (the default size of $h$ is 200 and can be adjusted according to the actual situation).
S24: A greedy algorithm strategy is adopted for the data set PD: the tuple with maximum utility (the "utility" of a tuple $t$ is the expected reduction of the gradient approximation error after $t$ is added to the core set CD) is added to the core set, $K$ times in total. In each iteration, $h$ tuples that are not in the core set CD (a tuple is one piece of data in the data set) are uniformly sampled as the elements of the sample set $S$;
s25: calculation ofThe utility of each tuple in (a) is calculated as follows:
for the effect of t, effect calculation +.>The method comprises the following specific steps:
(1) Taking the expectation of the core set as the utility function, the utility of the core set CD can be expressed as $E[CD]$, which can be computed as a sum over the tuples of the data set of their feature distance to the nearest tuple in CD:

$$E[CD] = \sum_{i=1}^{n} \min_{1 \le j \le |CD|} d\left(p_i, c_j\right)$$

where $c_j$ is the $j$-th piece of data in the core set CD, which is also a tuple; $d(\cdot,\cdot)$ is the Euclidean distance between the feature vectors of two tuples, i.e. the feature distance; and $\pi$ is an index mapping from the core set CD to the data set PD, where $\pi(j)=i$ indicates that the $j$-th tuple in CD (i.e. $c_j$) is the $i$-th tuple in PD (i.e. $p_i$);
(2) Tuple $t$ is added to the core set CD to obtain the new core set $CD' = CD \cup \{t\}$, and the utility is initialized to 0:

$$u(t) \leftarrow 0$$
(3) The contribution of each tuple $p_i$ in PD to the utility of the core set $CD'$ is computed and accumulated by

$$u(t) \leftarrow u(t) + \left[\min_{c_j \in CD} d\left(p_i, c_j\right) - \min_{c'_j \in CD'} d\left(p_i, c'_j\right)\right]$$

where $c'_j$ is a tuple of the core set $CD'$; $d(\cdot,\cdot)$ is the Euclidean distance between the feature vectors of two tuples, i.e. the feature distance; and $\pi'$ is the index mapping from the core set $CD'$ to the data set PD, where $\pi'(j)=i$ indicates that the $j$-th tuple of $CD'$ (i.e. $c'_j$) is the $i$-th tuple in PD (i.e. $p_i$);
(4) Repeating step (3) until the value of i is equal to n, i.e., each tuple in the dataset PD is traversed;
(5) The utility of tuple $t$ on the core set CD is then obtained as

$$u(t) = E[CD] - E[CD \cup \{t\}]$$
s26: from the slaveIn selecting the most effective tuple +.>And adds it to the core set CD, whose expression is as follows:
s27: steps S24-S26 are repeated until the size of the core set |cd| is equal to K, resulting in a final core set CD.
S28: the core data set CD is divided into a training set TrainD, a test set TestD and a verification set VD according to a certain proportion.
In this embodiment, step S3 includes the following sub-steps:
s31: a random forest model, a support vector regression model and a classification boost (Catboost) model were constructed for a total of 3 basic learning models and trained with a test set.
(1) To train the random forest base learning model, K new sample sets are randomly drawn from the original data set by bootstrap sampling; each sampled subset is used as the training set of one decision tree, and a randomly selected part of the features (the number of selected features is the square root of the total number of features) is used as that tree's feature set. A decision tree is constructed from the training set and feature set, and the sample-selection and tree-building step is repeated until the predetermined number of leaf nodes is reached or the nodes can no longer be split, so that multiple decision trees are built. The prediction function of each decision tree is

$$h_k(x) = \sum_{j=1}^{T_k} w_{k,j}\, I\left(x \in R_{k,j}\right)$$

where $h_k(x)$ is the prediction result of the $k$-th decision tree, $k$ denotes the $k$-th decision tree, $x$ is the input sample, $T_k$ is the number of leaf nodes of the $k$-th decision tree, $w_{k,j}$ is the predicted value of the $j$-th leaf node of the $k$-th decision tree, $R_{k,j}$ is the sample set of the $j$-th leaf node of the $k$-th decision tree, and $I(\cdot)$ is the indicator function.
For a new sample, the sample is input into each decision tree to obtain multiple prediction results, and these are averaged to obtain the final prediction result. The prediction function of the ensemble of decision trees is

$$H(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x)$$

where $k$ denotes the $k$-th decision tree, $K$ is the number of decision trees, $h_k(x)$ is the prediction result of the $k$-th decision tree, and $H(x)$ is the combined prediction result of the decision trees.
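In practice the random forest of S31 can be built with scikit-learn, whose RandomForestRegressor already performs the bootstrap sampling, square-root feature subsampling and prediction averaging described above; the tree count and stand-in data are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 8)), rng.random(200)  # stand-ins for TrainD
X_test = rng.random((50, 8))                              # stand-in for TestD

rf = RandomForestRegressor(
    n_estimators=100,      # number of decision trees K (value assumed)
    max_features="sqrt",   # features per split = sqrt of total features
    bootstrap=True,        # bootstrap sampling of the training set
    random_state=0,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)  # H(x): average over the K trees
```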
(2) The training set TrainD divided in step S2 is imported into the support vector regression base learning model for training, giving the trained support vector regression model;
(3) The training set TrainD divided in step S2 is imported into the CatBoost base learning model for training, giving the trained categorical boosting (CatBoost) model.
S32: Model evaluation: the generalization ability of the trained base learning models is evaluated with the mean square error and the R-squared evaluation index. The smaller the mean square error and the larger the R-squared value, the closer the model's predictions are to the true values and the better the effect. The calculation formulas are

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}$$

where $MSE$ is the mean square error, $i$ denotes the $i$-th sample, $m$ is the number of samples, $y_i$ is the true label value of the sample, $\hat{y}_i$ is the model's predicted value for the sample, and $\bar{y}$ is the mean of the true label values.
(1) The test set TestD is input into the random forest model trained in S31, and the generalization ability of the random forest model is evaluated with the mean square error and R-squared evaluation indexes;
(2) The test set TestD is input into the support vector regression model trained in S31, and the generalization ability of the support vector regression model is evaluated with the mean square error and R-squared evaluation indexes;
(3) The test set TestD is input into the categorical boosting (CatBoost) model trained in S31, and the generalization ability of the CatBoost model is evaluated with the mean square error and R-squared evaluation indexes.
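The two S32 evaluation indexes translate directly to code; this sketch matches the formulas above (scikit-learn's mean_squared_error and r2_score compute the same quantities):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error: average squared deviation of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """R-squared: 1 - residual sum of squares / total sum of squares."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # small error: MSE near 0
print(r2([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # good fit: R^2 near 1
```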
S33: Model tuning: the hyperparameters (the models' built-in parameters) are set for each of the 3 base models, hyperparameter tuning is performed on each base learning model with the Bayesian optimization algorithm, and the models are monitored for overfitting at the same time (to decide whether to stop training). The Bayesian optimization algorithm comprises the following steps:
(1) To optimize the hyperparameters of the base learning models with the Bayesian optimization algorithm, the hyperparameters of the 3 base models are set, a parameter range is given for each, and initialization points are randomly generated within the parameter range;
(2) The probability surrogate model of the Bayesian optimization algorithm is

$$p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$$

where $f$ is the unknown objective function; $D$ is the set of collected sample points; $y$ is the current sampled value; $p(D \mid f)$ is the likelihood distribution of $y$; $p(f)$ is the prior probability distribution model of $f$; $p(D)$ is the marginal likelihood distribution obtained by marginalizing $f$; and $p(f \mid D)$ is the posterior probability distribution of $f$, i.e. the confidence in the unknown function after the prior distribution has been corrected by the observations.
(3) Let the parameter combination to be optimized be $x$, and let the Bayesian optimization objective be the error of the trained model on the test set, expressed as $y = f(x)$, giving a set of observed data $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$. To predict the observation of $f$ at $x_{t+1}$, the observation points are assumed to be samples of a Gaussian process, distributed as

$$f(x) \sim GP\left(\mu(x),\, K\right)$$

where $GP$ denotes a Gaussian process, $\mu$ is the mean, and $K$ is the covariance matrix

$$K = \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_t) \\ \vdots & \ddots & \vdots \\ k(x_t,x_1) & \cdots & k(x_t,x_t) \end{bmatrix}$$

with $k^{T}$ the transpose of the covariance vector $k = [k(x_{t+1},x_1), \dots, k(x_{t+1},x_t)]^{T}$. The distribution of $f$ at $x_{t+1}$ can thereby be obtained:

$$P\left(f_{t+1} \mid D_{1:t}, x_{t+1}\right) = \mathcal{N}\left(k^{T}K^{-1}y_{1:t},\; k(x_{t+1},x_{t+1}) - k^{T}K^{-1}k\right)$$

where $P(f_{t+1} \mid D_{1:t}, x_{t+1})$ is the distribution result of $f_{t+1}$, $D_{1:t}$ is the data of the observation points, and this predictive distribution plays the role of the likelihood function;
(4) Expected Improvement (EI) is selected as the acquisition function to determine the hyperparameters of the next iteration. The $x$ with the greatest probability of improving the objective function, i.e. the $x$ that maximizes the acquisition function $EI(x)$, is the next hyperparameter combination selected:

$$x_{t+1} = \arg\max_{x \in X} EI(x), \qquad EI(x) = \mathbb{E}\left[\max\left(y^{*} - f(x),\, 0\right)\right]$$

where $x_{t+1}$ is the next hyperparameter combination, $y^{*}$ is the best objective value observed so far, and the acquisition function is obtained by mapping the decision space $X$, the observation space and the hyperparameter space to the real space.
(5) The newly collected sample $(x_{t+1}, y_{t+1})$ is added to the historical samples to form $D_{1:t+1}$, and the Gaussian model is updated and corrected so that it better approximates the true distribution of the objective function;
(6) Steps (2)-(5) are repeated; when the maximum number of iterations is reached, updating of the model stops and the optimal parameter combination of the model is output.
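As a concrete stand-in for steps (1)-(6), the scikit-optimize package implements the Gaussian-process surrogate and EI acquisition loop. Below is a hedged sketch tuning two random forest hyperparameters; the search ranges, iteration budget and stand-in data are all assumptions:

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_tr, y_tr = rng.random((200, 8)), rng.random(200)  # stand-in for TrainD
X_te, y_te = rng.random((50, 8)), rng.random(50)    # stand-in for TestD

def objective(params):
    """y = f(x): test-set error of the model trained with hyperparameters x."""
    n_estimators, max_depth = params
    model = RandomForestRegressor(n_estimators=int(n_estimators),
                                  max_depth=int(max_depth), random_state=0)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

result = gp_minimize(
    objective,
    dimensions=[Integer(50, 300), Integer(2, 12)],  # assumed search ranges
    acq_func="EI",      # expected improvement acquisition, as in step (4)
    n_calls=30,         # assumed maximum number of iterations, step (6)
    random_state=0,
)
print("optimal parameter combination:", result.x)
```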
S34: The hyperparameters of the 3 models are respectively set to the parameter combinations output in step S33, and the verification set VD divided in step S2 is fed into the 3 base learning models:
(1) The verification set VD is fed into the Bayesian-optimized random forest model to obtain drilling and production engineering cost prediction value 1, output by the first-level learner;
(2) The verification set VD is fed into the Bayesian-optimized support vector regression model to obtain drilling and production engineering cost prediction value 2, output by the first-level learner;
(3) The verification set VD is fed into the Bayesian-optimized categorical boosting (CatBoost) model to obtain drilling and production engineering cost prediction value 3, output by the first-level learner.
In the present embodiment, step S4 includes the following steps:
s41: selecting a linear regression model as a secondary learning model, and training the output (cost predicted value) of the 3 basic learning models as the input of the linear regression model;
s42: and (3) taking the cost predicted value 1, the cost predicted value 2 and the cost predicted value 3 obtained in the step (S3) as the input of a trained linear regression model, and outputting to obtain the final drilling and production engineering cost predicted value.
For the foregoing embodiments, for simplicity of description, the method is presented as a series of actions, but those skilled in the art should understand that the present application is not limited by the order of the actions described, as some steps may be performed in other orders or concurrently in accordance with the application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present application.
The above embodiments describe the basic principles, main features and advantages of the present invention. It will be appreciated by persons skilled in the art that the present invention is not limited by the foregoing embodiments, which merely illustrate the principles of the invention; various modifications and changes can be made by those skilled in the art without departing from the spirit and scope of the invention, and all such modifications fall within the scope of the appended claims.

Claims (7)

1. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning is characterized by comprising the following steps of:
S1: preprocess the petroleum drilling engineering cost data set to obtain a data set SD;
S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set;
S3: build and train the learning models: construct a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyperparameter optimization on each learning model using a Bayesian optimization algorithm;
S4: weight and combine the outputs through the second-level learning model to obtain the final drilling engineering cost prediction value.
2. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S1 includes the sub-steps of:
S11: fill the missing values in the petroleum drilling engineering cost data set and filter it using machine learning methods;
S12: apply target encoding to the categorical parameter data in the data set after missing-value filling and filtering;
S13: record the petroleum drilling data set as data set SD, and normalize the data in data set SD.
3. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S2 includes the sub-steps of:
S21: compute the Spearman correlation coefficient between each single feature and the label cost value in data set SD to obtain the optimal feature subset PD;
S22: perform data screening on the optimal feature subset PD using a greedy strategy to obtain the core data subset CD;
S23: divide the core data subset CD into a training set TrainD, a test set TestD and a verification set VD.
4. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S22 includes the sub-steps of:
S221: using a greedy algorithm strategy on the optimal feature subset PD, the maximum-utility tuple $t$ is added to the data subset CD over $K$ iterations; in each iteration, $h$ tuples that are not in the data subset CD are uniformly sampled as the elements of the sample set $S$;
S222: the utility of each tuple in the sample set $S$ is computed as

$$u(t) = E[CD] - E[CD \cup \{t\}], \quad t \in S$$

where $u(t)$ is the utility of $t$ and $E[\cdot]$ is the utility calculation function;
S223: the maximum-utility tuple $t^*$ is selected from the sample set $S$ and added to the data subset CD:

$$t^* = \arg\max_{t \in S} u(t), \qquad CD \leftarrow CD \cup \{t^*\}$$

where $t^*$ is the maximum-utility selection and CD is the data subset;
S224: the above steps are repeated to finally obtain the core data subset CD.
5. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S3 includes the sub-steps of:
S31: construct a random forest model, a support vector regression model and a CatBoost model, and perform model training using the training set TrainD and the test set TestD;
S32: model evaluation: evaluate the generalization ability of the trained learning models using the mean square error and R-squared evaluation indexes;
S33: model tuning: perform hyperparameter optimization on the learning models using the Bayesian optimization algorithm, while monitoring whether the learning models overfit in order to decide whether to stop training;
S34: set the hyperparameters of the learning models to the parameter combinations output in step S33, feed the verification set VD divided in step S2 into the learning models, and output the drilling and production engineering cost prediction values.
6. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 5, wherein step S34 includes the sub-steps of:
S341: feed the verification set VD into the Bayesian-optimized random forest model to obtain the first drilling engineering cost prediction value;
S342: feed the verification set VD into the Bayesian-optimized support vector regression model to obtain the second drilling engineering cost prediction value;
S343: feed the verification set VD into the Bayesian-optimized CatBoost model to obtain the third drilling engineering cost prediction value.
7. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 6, wherein step S4 includes the sub-steps of:
S41: select a linear regression model as the secondary learning model, and train it with the outputs of the random forest model, the support vector regression model and the CatBoost model as its inputs;
S42: take the first, second and third drilling engineering cost prediction values obtained in step S3 as the inputs of the trained linear regression model, and output the final drilling engineering cost prediction value.
CN202410122065.7A 2024-01-30 2024-01-30 Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning Active CN117648646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410122065.7A CN117648646B (en) 2024-01-30 2024-01-30 Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning


Publications (2)

Publication Number Publication Date
CN117648646A true CN117648646A (en) 2024-03-05
CN117648646B CN117648646B (en) 2024-04-26

Family

ID=90046383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410122065.7A Active CN117648646B (en) Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning

Country Status (1)

Country Link
CN (1) CN117648646B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7660705B1 (en) * 2002-03-19 2010-02-09 Microsoft Corporation Bayesian approach for learning regression decision graph models and regression models for time series analysis
CN113626315A (en) * 2021-07-27 2021-11-09 江苏大学 Dual-integration software defect prediction method combined with neural network
WO2022151654A1 (en) * 2021-01-14 2022-07-21 新智数字科技有限公司 Random greedy algorithm-based horizontal federated gradient boosted tree optimization method
CN117349751A (en) * 2023-10-24 2024-01-05 中国地质大学(武汉) Loess landslide slip distance prediction method and system based on meta-learning and Bayesian optimization
CN117439053A (en) * 2023-10-15 2024-01-23 国网天津市电力公司电力科学研究院 Method, device and storage medium for predicting electric quantity of Stacking integrated model


Also Published As

Publication number Publication date
CN117648646B (en) 2024-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant