CN117648646A - Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning - Google Patents
Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning
- Publication number
- CN117648646A (application number CN202410122065.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- drilling
- learning
- data
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The invention discloses a drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning, belonging to the technical field of oil and gas drilling and production, comprising the following steps. S1: preprocess a petroleum drilling engineering cost data set to obtain a data set SD. S2: perform feature selection and data screening on the data set SD to obtain a core data set, and divide the core data set into a training set, a test set and a verification set. S3: build and train the learning models, namely a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, and perform hyper-parameter optimization on each learning model with a Bayesian parameter optimization algorithm. S4: combine the base models through a second-level learning model to obtain the final drilling engineering cost prediction value. The invention provides a training framework of feature selection and two-level stacked heterogeneous ensemble learning, and applies Bayesian hyper-parameter optimization to the training of the two-level model, which can improve the accuracy of the overall model prediction.
Description
Technical Field
The invention relates to the technical field of petroleum and natural gas drilling and production, in particular to a drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning.
Background
Machine learning is a method of learning general rules from limited observation data, establishing a learning model, and predicting unknown data with the established model. In a typical machine learning workflow, the raw data are first turned into a set of features, by manual experience or by feature-transformation methods; the features are then fed into a learning model, which outputs a prediction. The prediction accuracy of the machine learning model can be further improved by performing feature engineering and data screening on the data set.
Rather than relying on a single powerful model to make accurate predictions, ensemble learning in machine learning is defined as a technique that combines multiple weak models; these weak models are integrated in some way to obtain the final prediction. The bagging algorithm randomly creates data samples from a single data set, and the random forest is a bagging algorithm specialized for decision trees; the random forest introduces diversity not only in the data but also in the variables, since each tree sees a random subset of the features. A boosting algorithm improves accuracy by combining weak learners sequentially, each new model being fitted to correct the errors of the previous models; the categorical boosting model (CatBoost) improves the traditional boosting method with techniques such as ordered target statistics and ordered boosting. Compared with other ensemble learning algorithms, the stacking algorithm combines different learning algorithms on a single data set: in a first step, a set of base-level learning models is generated; in a second step, a meta-level model is trained on the outputs of the base-level learning models. In stacked ensembles, because the prediction results of the multiple first-step models are combined and used as the input of the meta-learner, prediction accuracy can be improved while bias is reduced. An ensemble model that uses multiple distinct algorithms is known as heterogeneous, and heterogeneous integration of different algorithms can achieve good predictive performance. For the stacking algorithm it is important to apply different models, use different learning strategies or parameters, determine the combination of base learner and meta-learner through repeated experiments, and properly determine the optimal hyper-parameter values of the base learners to realize a good training effect.
Bayesian optimization attempts to collect the most informative observation in each iteration by balancing the exploration of uncertain hyper-parameter regions against the collection of observations near the current optimal hyper-parameters.
Therefore, how to improve the accuracy of the model in the training stage of the basic learner of the stacked integrated learning is a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning, so as to solve the technical problem of how to improve the accuracy of petroleum drilling and production engineering cost prediction.
The invention is realized by adopting the following technical scheme: the drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning comprises the following steps:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training the learning models: a random forest model, a support vector regression model and a categorical boosting model are built, and hyper-parameter optimization is performed on each learning model with a Bayesian parameter optimization algorithm;
s4: weighting and combining the base-model outputs through a second-level learning model to obtain the final drilling engineering cost prediction value.
Further, step S1 includes the following sub-steps:
s11: performing missing-value imputation and filtering on the petroleum drilling engineering cost data set with a machine learning method;
s12: performing target encoding on the categorical parameter data in the imputed and filtered data set;
s13: the petroleum drilling data set is recorded as a data set SD, and the data in the data set SD is normalized.
Further, step S2 includes the following sub-steps:
s21: carrying out Spearman correlation coefficient calculation between each single feature and the label cost value in the data set SD, and screening features accordingly to obtain an optimal feature subset PD;
s22: data screening is carried out on the optimal feature subset PD by utilizing a greedy strategy, and a core data subset CD is obtained;
s23: the core data subset CD is divided into a training set TrainD, a test set TestD and a verification set VD.
Further, step S22 includes the following sub-steps:
s221: adopting a greedy algorithm strategy on the optimal feature subset PD, the tuple $t$ with the maximum utility is iteratively added to the data subset $CD$, $K$ times in total; in each iteration, $h$ tuples not currently in the data subset $CD$ are uniformly sampled as the elements of a candidate sample set $S$;
s222: computing the utility of each tuple of the sample set $S$:

$u(t) = U(t, CD)$

where $u(t)$ is the utility of $t$, $U(\cdot)$ is the utility calculation formula, and $t$ is an element of $S$;
s223: selecting from the sample set $S$ the tuple $t^{*}$ with the maximum utility, and joining $t^{*}$ to the data subset $CD$:

$t^{*} = \arg\max_{t \in S} u(t)$

$CD \leftarrow CD \cup \{t^{*}\}$

where $\arg\max$ is the maximum-utility selection formula and $CD$ is the data subset;
s224: repeating the above steps until $|CD| = K$, finally obtaining the core data subset CD.
Further, step S3 includes the following sub-steps:
s31: constructing a random forest model, a support vector regression model and a classification lifting model, and carrying out model training by adopting a training set TrainD and a test set TestD;
s32: model evaluation, namely evaluating the generalization capability of the trained learning model by using a mean square error and an R square evaluation index;
s33: performing model tuning, namely performing hyper-parameter optimization on the learning models with a Bayesian parameter optimization algorithm, while monitoring whether the learning models over-fit so as to decide whether to stop training;
s34: and respectively setting the super parameters of the learning model as the parameter combinations output in the step S33, putting the verification set VD divided in the step S2 into the learning model, and outputting the drilling and production engineering cost predicted value.
Further, step S34 includes the following sub-steps:
s341: putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a first drilling engineering cost prediction value;
s342: putting the verification set VD into a support vector regression model after Bayesian optimization to obtain a second drilling engineering cost prediction value;
s343: and putting the verification set VD into a classification lifting model subjected to Bayesian optimization to obtain a third drilling engineering cost prediction value.
Further, step S4 includes the following sub-steps:
s41: selecting a linear regression model as a secondary learning model, and training the outputs of the random forest model, the support vector regression model and the classification lifting model as inputs of the linear regression model;
s42: and (3) taking the first drilling engineering cost predicted value, the second drilling engineering cost predicted value and the third drilling engineering cost predicted value obtained in the step (S3) as inputs of a trained linear regression model, and outputting to obtain a final drilling engineering cost predicted value.
The invention has the beneficial effects that:
firstly, carrying out data preprocessing on acquired drilling parameters, wherein the data preprocessing comprises missing value processing, illegal data filtering, characteristic data target coding and the like; performing feature selection and data screening on the processed data set to obtain a high-quality feature data subset; constructing and training three basic learning models of random forest, support vector regression and classification lifting (Catboost) for the screened data set, and performing parameter tuning on the basic learning models by adopting a Bayesian super-parameter optimization algorithm; and finally, constructing a linear regression model as a secondary learning model to obtain a final cost prediction result.
According to the invention, a machine learning algorithm is used for predicting the cost of petroleum drilling engineering, a training framework of feature selection and two-stage stacking heterogeneous integrated learning is provided, and Bayesian super-parameter optimization is used for training optimization of a two-stage model, so that the accuracy of overall model prediction can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a feature selection and data screening flow chart;
FIG. 3 is a schematic diagram of a data screening framework;
FIG. 4 is a flowchart of a Bayesian parameter optimization algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1 to 4, the drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning includes the steps of:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training the learning models: a random forest model, a support vector regression model and a categorical boosting model are built, and hyper-parameter optimization is performed on each learning model with a Bayesian parameter optimization algorithm;
s4: weighting and combining the base-model outputs through a second-level learning model to obtain the final drilling engineering cost prediction value. In fig. 3, Loop represents a cycle and Tuple represents a tuple; Abolo N-1 to Abolo N-6 are well names, Vertical and Directional are well types, and Production and Exploration are well categories.
In this embodiment, step S1 includes the following sub-steps:
S11: missing values in the data set are imputed with a machine learning method and illegal data are filtered out.
S12: target encoding is applied to the categorical parameter data (such as the well-type and well-category parameters) in the cleaned data set: each category of a discrete attribute is encoded as the average value (or other statistic) of the target variable over that category, which captures the association between the category and the target variable.
S13: the petroleum drilling data set is named as SD (the data set SD comprises characteristic parameters such as drilling depth, drilling period, well completion period, well type and the like), the data set sample number is named as n, the characteristic number is named as m, and the characteristic space of the data set SD is defined asWherein->Is a petroleum drilling engineering feature, which is->. The data is normalized, and the calculation formula is as follows:
;
wherein the method comprises the steps ofIs->Normalized value of>Representing the ith feature in the data set SD, and (2)>Is a given dataset +.>Minimum value of characteristic, ++>Is a given dataset +.>Maximum value of the feature.
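The normalization formula above can be sketched per feature column as follows (the drilling-depth values are hypothetical):

```python
def min_max_normalize(values):
    """Min-max normalize a feature column to [0, 1], per the formula above."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

depths = [1500.0, 3000.0, 4500.0]     # hypothetical drilling-depth values
print(min_max_normalize(depths))      # [0.0, 0.5, 1.0]
```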
In this embodiment, step S2 includes the following sub-steps:
S21: correlation calculation: the Spearman correlation coefficient between each single feature and the label cost value in the data set is computed. First, the $i$th feature $X_i$ in the original data set and the target feature $Y$ are rearranged in ascending order of $Y$, and each value of $X_i$ and $Y$ is assigned a rank, giving two sets of rank values $R(X_i)$ and $R(Y)$. The calculation formula is:

$\rho_i = \dfrac{\sum_{j}\big(R(x_{ij}) - \overline{R(X_i)}\big)\big(R(y_j) - \overline{R(Y)}\big)}{\sqrt{\sum_{j}\big(R(x_{ij}) - \overline{R(X_i)}\big)^2 \sum_{j}\big(R(y_j) - \overline{R(Y)}\big)^2}}$

where $\rho_i$ is the Spearman correlation coefficient of the $i$th feature in the original data set SD, $R(x_{ij})$ is the rank value of the $j$th sample of the $i$th feature, $\overline{R(X_i)}$ is its average value, $R(y_j)$ is the rank value of the $j$th sample of the target feature, and $\overline{R(Y)}$ is its average value.
S22: the larger $|\rho_i|$ is, the more strongly the feature correlates with cost, so the features whose correlation exceeds a given threshold are selected as effective features, and the weakly correlated features are deleted, obtaining the optimal feature subset PD.
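The rank-correlation computation of S21 can be sketched in plain Python as the Pearson correlation of the rank values; for simplicity the sketch assumes no tied values (ties would require average ranks):

```python
import math

def ranks(xs):
    """Rank values in ascending order (1-based); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank values."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

print(spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))  # 1.0 (perfectly monotone)
```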
S23: in order to improve the quality of the data set, the data subset is further subjected to data screening by utilizing a greedy strategy to obtain a core data subset CD, the core data set CD is initialized to be an empty set, and the size of the core data set CD is given to be K #Setting +.>When the data set is small +.>) Setting a sample set (++>) The size is h (the default size of h is set to 200 and can be adjusted according to actual conditions).
S24: employing a greedy algorithm strategy for dataset PD will have maximum utility (the "utility" of tuple t represents the expected reduction of gradient approximation errors after tuple t is added to core set CDIs added K times to the core set. Uniformly sampling h tuples as tuples not in the core set CD (one tuple is a piece of data in the dataset)Elements of (a) and (b);
s25: calculation ofThe utility of each tuple in (a) is calculated as follows:
;
for the effect of t, effect calculation +.>The method comprises the following specific steps:
(1) Considering the expectations of the core set as a utility function, the utility of the core set CD may be expressed as E [ CD ], and K may be converted to a sum of the expected values of each tuple in the computation CD, as follows:
;
in the middle ofThe j-th piece of data in the core set CD, which is also a tuple; />Representing the Euclidean distance between the feature vectors of two tuples, i.e. the feature distance, wherein +.>Is an index mapping from core set CD to data set PD, with +.>To represent the j-th tuple in CD (i.e +.>) Is the i-th tuple in PD (i.e.)>);
(2) Adding tuple t to core set CD to obtain new core setInitializing the utility as 0, and the expression is as follows:
;
;
(3) Computing each tuple in the PDFor core set->The utility of (2) is calculated by the following formula:
;
in the middle ofRepresenting core set->Is also a tuple; />Representing the euclidean distance between feature vectors of two tuples, i.e. the feature distance; />Is a slave core set->Index mapping to dataset PD with +.>To express +.>Is the j-th tuple of (i.e +)>) Is the i-th tuple in PD (i.e.)>);
(4) Repeating step (3) until the value of i is equal to n, i.e., each tuple in the dataset PD is traversed;
(5) The utility of tuple t on core set CD is then obtained, as calculated by:
。
s26: from the slaveIn selecting the most effective tuple +.>And adds it to the core set CD, whose expression is as follows:
;
。
s27: steps S24-S26 are repeated until the size of the core set |cd| is equal to K, resulting in a final core set CD.
S28: the core data set CD is divided into a training set TrainD, a test set TestD and a verification set VD according to a certain proportion.
In this embodiment, step S3 includes the following sub-steps:
S31: a random forest model, a support vector regression model and a categorical boosting (CatBoost) model, 3 base learning models in total, are constructed and trained with the training set TrainD.
(1) Training the random forest base learning model: $K$ new sample sets are randomly drawn from the original data set by bootstrap sampling; for each decision tree, a sample subset is randomly selected as its training set and a subset of the features (of size equal to the square root of the total number of features) is randomly selected as its feature set. A decision tree is constructed from this training set and feature set, and the sampling-and-building step is repeated until the predetermined number of leaf nodes is reached or the leaf nodes can no longer be split, producing multiple decision trees. The prediction function of each decision tree is expressed as:

$h_k(x) = \sum_{j=1}^{T_k} \hat{y}_{k,j}\, I\big(x \in R_{k,j}\big)$

where $h_k(x)$ represents the prediction result of the $k$th decision tree, $k$ denotes the $k$th decision tree, $x$ represents the input sample, $T_k$ represents the number of leaf nodes of the $k$th decision tree, $\hat{y}_{k,j}$ represents the predicted value of the $j$th leaf node of the $k$th decision tree, $R_{k,j}$ represents the sample set of the $j$th leaf node of the $k$th decision tree, and $I(\cdot)$ is the indicator function.
For a new sample, the sample is input into each decision tree to obtain multiple prediction results, which are averaged to obtain the final prediction result. The calculation formula of the prediction function of the multiple decision trees is:

$H(x) = \dfrac{1}{K} \sum_{k=1}^{K} h_k(x)$

where $k$ denotes the $k$th decision tree, $K$ represents the number of decision trees, $h_k(x)$ represents the prediction result of the $k$th decision tree, and $H(x)$ represents the averaged prediction result of all decision trees.
(2) The training set TrainD divided in step S2 is imported into the support vector regression base learning model for training, obtaining a trained support vector regression model;
(3) The training set TrainD divided in step S2 is imported into the CatBoost base learning model for training, obtaining a trained categorical boosting (CatBoost) model.
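The bootstrap-and-average mechanism of the random forest in (1) can be sketched without a full decision-tree implementation; as a deliberate simplification, each "tree" below is replaced by a trivial base learner (the mean of its bootstrap sample) purely to show the bagging structure:

```python
import random

def bootstrap_sample(data, rng):
    """Draw |data| samples with replacement (the bootstrap)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(train_y, n_models=50, seed=0):
    """Average the predictions of n_models base learners, each fit on a bootstrap sample.
    Each base learner here is a trivial mean predictor standing in for a decision tree."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = bootstrap_sample(train_y, rng)
        preds.append(sum(sample) / len(sample))   # h_k(x): one base learner's prediction
    return sum(preds) / len(preds)                # H(x): the bagged average

y = [100.0, 120.0, 110.0, 130.0]
print(bagged_predict(y))  # close to 115.0, the overall mean
```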
S32: model evaluation: the generalization capability of the trained basic learning model is evaluated by using a mean square error and an R square evaluation index, when the value of the mean square error is smaller and the value of the R square is larger, the predicted value of the predicted model is closer to the true value, the effect is better, and the calculation formula is as follows:
;
;
wherein: MSE represents the mean square error result, i represents the ith sample, m represents the number of samples,representing the sample true tag value,/->Predictive value of representative model on sample, +.>Representing the average of the sample true tag values.
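The two evaluation indices of S32 can be computed directly from their formulas:

```python
def mse(y_true, y_pred):
    """Mean square error over m samples."""
    m = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / m

def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(mse(y_true, y_pred))        # ~0.1667
print(r_squared(y_true, y_pred))  # 0.9375
```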
(1) The test set TestD is input into the random forest model trained in S31, and the generalization ability of the random forest model is evaluated with the mean square error and the R-squared evaluation index;
(2) The test set TestD is input into the support vector regression model trained in S31, and its generalization ability is evaluated with the mean square error and the R-squared evaluation index;
(3) The test set TestD is input into the categorical boosting (CatBoost) model trained in S31, and its generalization ability is evaluated with the mean square error and the R-squared evaluation index.
S33: model tuning: the hyper-parameters (the models' built-in parameters) of the 3 base models are respectively set, hyper-parameter tuning is performed on each base learning model with a Bayesian parameter optimization algorithm, and each model is monitored for over-fitting (to decide whether to stop training). The Bayesian parameter optimization algorithm comprises the following steps:
(1) The hyper-parameters of the 3 base models to be optimized by the Bayesian parameter optimization algorithm are respectively set, a parameter range is given for each, and initialization points are randomly generated within the parameter range;
(2) The probabilistic surrogate model of the Bayesian optimization algorithm is Bayes' rule:

$p(f \mid D) = \dfrac{p(D \mid f)\, p(f)}{p(D)}$

where $f$ is the unknown objective function; $D$ represents the set of collected sample points; $y$ is the current sampled value; $p(D \mid f)$ is the likelihood distribution of $y$; $p(f)$ is the prior probability distribution model of $f$; $p(D)$ is the marginal likelihood distribution that marginalizes $f$; and $p(f \mid D)$ is the posterior probability distribution of $f$, i.e. the confidence in the unknown function after the prior probability distribution has been corrected by the observations.
(3) Let the parameter combination to be optimized be $x$, and let the Bayesian optimization objective be the error of the trained model on the test set, expressed as $y = L(x)$; a set of data $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$ is thereby obtained. To predict the observation $f_{t+1}$ at $x_{t+1}$, the observation points are regarded as a sample of a Gaussian process, which is distributed as follows:

$f(x) \sim \mathrm{GP}\big(\mu(x), k(x, x')\big)$

where $\mathrm{GP}$ represents a Gaussian process, $\mu$ represents the mean function, and $k$ is the covariance function; accordingly, the joint distribution of the observed values $y_{1:t}$ and the new point is:

$\begin{bmatrix} y_{1:t} \\ f_{t+1} \end{bmatrix} \sim N\left(\mu, \begin{bmatrix} \mathbf{K} & \mathbf{k} \\ \mathbf{k}^{T} & k(x_{t+1}, x_{t+1}) \end{bmatrix}\right)$

where $\mathbf{K} = [k(x_i, x_j)]_{t \times t}$ is the covariance matrix of the observed points, $\mathbf{k} = [k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)]^{T}$, and $\mathbf{k}^{T}$ represents its transposed matrix. The distribution of $f_{t+1}$ can thereby be obtained:

$P(f_{t+1} \mid D_{1:t}, x_{t+1}) = N\big(\mu_{t+1}(x_{t+1}), \sigma_{t+1}^{2}(x_{t+1})\big)$

which is the distribution result of $f_{t+1}$ given the data $D_{1:t}$ of the $t$ observation points, with posterior mean $\mu_{t+1} = \mathbf{k}^{T}\mathbf{K}^{-1} y_{1:t}$ and posterior variance $\sigma_{t+1}^{2} = k(x_{t+1}, x_{t+1}) - \mathbf{k}^{T}\mathbf{K}^{-1}\mathbf{k}$, the likelihood function entering through the Gaussian observation model;
(4) EI (Expected Improvement) is selected as the acquisition function to determine the hyper-parameters of the next iteration. The $x$ with the greatest expected improvement of the objective function, namely the $x$ that maximizes the acquisition function $\mathrm{EI}(x)$, is the next hyper-parameter selected. The calculation formula is:

$x_{t+1} = \arg\max_{x \in X} \mathrm{EI}(x)$

where $x_{t+1}$ is the next hyper-parameter and the function $\mathrm{EI}$ is obtained by mapping the decision space $X$, the observation space and the hyper-parameter space to the real space.
(5) The newly collected sample $(x_{t+1}, y_{t+1})$ is added to the historical sample set $D_{1:t}$, and the Gaussian model is updated and corrected so that it better approximates the true distribution of the objective function;
(6) Repeating the steps (2) - (5), stopping updating the model when the iteration reaches the maximum number of times, and outputting the optimal parameter combination of the model.
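The loop in steps (1)-(6) can be sketched in a compact, self-contained form. The snippet below is a minimal illustration, not the patent's implementation: it uses a numpy Gaussian-process surrogate with an RBF kernel (so the posterior mean and variance follow the formulas above), the EI acquisition function of step (4), and a stand-in 1-D objective in place of the real model-error surface; the hyperparameter range, candidate grid and iteration count are all assumptions.

```python
# Minimal Bayesian optimization sketch: GP surrogate + EI acquisition.
import numpy as np
from math import erf, sqrt

def rbf_kernel(a, b, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)) for 1-D inputs."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean k^T K^-1 y and variance k(x,x) - k^T K^-1 k."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k = rbf_kernel(x_train, x_query)                 # shape (t, q)
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y_train
    var = 1.0 - np.sum(k * (K_inv @ k), axis=0)      # k(x,x) = 1 for RBF
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """EI for minimization: E[max(y_best - y, 0)] under N(mu, var)."""
    sigma = np.sqrt(var)
    z = (y_best - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    return (y_best - mu) * cdf + sigma * pdf

def objective(x):            # stand-in for the model's test-set error f(x)
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201)     # assumed hyperparameter range
x_obs = rng.uniform(0, 1, 3)                # (1) random initial points
y_obs = objective(x_obs)
for _ in range(15):                         # (6) iterate to a max count
    mu, var = gp_posterior(x_obs, y_obs, candidates)  # (2)-(3) surrogate
    ei = expected_improvement(mu, var, y_obs.min())   # (4) acquisition
    x_next = candidates[np.argmax(ei)]                # next hyperparameter
    x_obs = np.append(x_obs, x_next)                  # (5) update samples
    y_obs = np.append(y_obs, objective(x_next))
best_x = x_obs[np.argmin(y_obs)]
```

In the patent's setting, `objective` would instead train a base model with hyperparameters `x` and return its test-set error.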
S34: The hyperparameters of the 3 models are respectively set to the parameter combinations output in step S33, and the verification set VD divided in step S2 is put into the 3 base learning models to obtain their predictions:
(1) Putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a drilling and production engineering cost prediction value 1 output by a primary learner;
(2) Putting the verification set VD into a support vector regression model subjected to Bayesian optimization to obtain a drilling and production engineering cost prediction value 2 output by a primary learner;
(3) Putting the verification set VD into a classification lifting (Catboost) model after Bayesian optimization to obtain a drilling and production engineering cost prediction value 3 output by a primary learner;
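Step S34 can be sketched as follows. This is an illustrative stand-in, not the patent's code: scikit-learn models replace the tuned base learners, CatBoost is swapped for sklearn's GradientBoostingRegressor to avoid an extra dependency, the features and cost label are synthetic, and the hyperparameters shown are placeholders rather than the Bayesian-optimized values.

```python
# Three heterogeneous base learners each emit a level-1 cost prediction on VD.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 5))         # stand-in drilling features
y = 100 + 40 * X[:, 0] + 25 * X[:, 1] ** 2 + rng.normal(0, 2, 300)  # cost label

X_train, X_vd, y_train, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=0)     # VD plays the verification set

base_models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(C=10.0, epsilon=0.1),
    "boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}
level1_preds = {}
for name, model in base_models.items():
    model.fit(X_train, y_train)              # (1)-(3): fit each base learner
    level1_preds[name] = model.predict(X_vd) # predicted values 1, 2, 3 on VD
```

Each array in `level1_preds` corresponds to one of the three cost prediction values fed to the secondary learner.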
In the present embodiment, step S4 comprises the following steps:
S41: A linear regression model is selected as the secondary learning model and is trained with the outputs (cost prediction values) of the 3 base learning models as its inputs;
S42: Cost prediction value 1, cost prediction value 2 and cost prediction value 3 obtained in step S3 are taken as inputs of the trained linear regression model, and the final drilling and production engineering cost prediction value is output.
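Steps S41-S42 amount to fitting an ordinary least-squares model over the three level-1 prediction vectors. The numpy-only sketch below simulates those three vectors (they are not real base-model outputs) and shows the combine step; in-sample, the stacked RMSE can never exceed that of the best single base prediction, since OLS minimizes squared error over all linear combinations of the inputs.

```python
# Secondary learner: OLS linear regression over three base-model predictions.
import numpy as np

rng = np.random.default_rng(1)
true_cost = rng.uniform(100, 200, 80)            # validation-set cost labels
# Simulated level-1 predictions: each base model = truth + its own error.
pred1 = true_cost + rng.normal(0, 5, 80)         # random forest stand-in
pred2 = true_cost + rng.normal(0, 8, 80)         # SVR stand-in
pred3 = true_cost + rng.normal(0, 6, 80)         # CatBoost stand-in

# Design matrix [1, pred1, pred2, pred3]; solve for intercept and weights.
A = np.column_stack([np.ones_like(pred1), pred1, pred2, pred3])
coef, *_ = np.linalg.lstsq(A, true_cost, rcond=None)

final_pred = A @ coef                            # stacked final prediction
rmse_stack = np.sqrt(np.mean((final_pred - true_cost) ** 2))
rmse_best_single = min(np.sqrt(np.mean((p - true_cost) ** 2))
                       for p in (pred1, pred2, pred3))
```

The learned weights effectively down-weight the noisier base model, which is the benefit the stacking layer provides.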
First, data preprocessing is performed on the acquired drilling parameters, including missing-value processing, filtering of invalid data, and target encoding of categorical features; feature selection and data screening are then performed on the processed data set to obtain a high-quality feature data subset; three base learning models (random forest, support vector regression and categorical boosting (CatBoost)) are constructed and trained on the screened data set, with a Bayesian hyperparameter optimization algorithm used for parameter tuning; finally, a linear regression model is constructed as the secondary learning model to obtain the final cost prediction result.
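The preprocessing stage described above can be illustrated on a toy table. The field names (`depth`, `well_type`) and values below are invented for the example, not taken from the patent's data set; the three steps shown are mean imputation of missing values, target encoding of a categorical feature by the mean cost of its category, and min-max normalization.

```python
# Toy preprocessing: imputation, target encoding, min-max normalization.
import numpy as np

depth = np.array([1500.0, np.nan, 3200.0, 2100.0, np.nan, 2800.0])  # metres
well_type = ["vertical", "horizontal", "vertical",
             "slanted", "horizontal", "slanted"]
cost = np.array([120.0, 180.0, 210.0, 150.0, 175.0, 160.0])         # label

# 1) Missing-value processing: fill NaN depths with the column mean.
depth_filled = np.where(np.isnan(depth), np.nanmean(depth), depth)

# 2) Target encoding: category -> mean cost of the rows in that category.
cats = sorted(set(well_type))
cat_mean = {c: cost[[t == c for t in well_type]].mean() for c in cats}
well_type_enc = np.array([cat_mean[t] for t in well_type])

# 3) Min-max normalization to [0, 1].
def minmax(col):
    return (col - col.min()) / (col.max() - col.min())

depth_norm = minmax(depth_filled)
well_type_norm = minmax(well_type_enc)
```

After these steps every column is numeric and on a common scale, ready for the feature-selection and screening stage.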
The invention uses machine learning algorithms to predict petroleum drilling engineering cost, provides a training framework of feature selection and two-stage stacked heterogeneous ensemble learning, and applies Bayesian hyperparameter optimization to the training of the two-stage model, thereby improving the accuracy of the overall model's predictions.
For simplicity of explanation, the foregoing embodiments are presented as a series of acts, but those skilled in the art will understand that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts referred to are not necessarily required by the present application.
The above embodiments describe the basic principles, main features and advantages of the present invention. Those skilled in the art will appreciate that the present invention is not limited by the foregoing embodiments, which merely illustrate its principles; various modifications and changes can be made without departing from the spirit and scope of the invention, and all such modifications fall within the scope of the appended claims.
Claims (7)
1. The drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning is characterized by comprising the following steps of:
s1: preprocessing a petroleum drilling engineering cost data set to obtain a data set SD;
s2: performing feature selection and data screening on the data set SD to obtain a core data set, and dividing the core data set into a training set, a test set and a verification set;
s3: building and training learning models: a random forest model, a support vector regression model and a classification lifting model are respectively constructed, and hyperparameter optimization is performed on each learning model by using a Bayesian parameter optimization algorithm;
s4: weighting and combining the outputs through the two-level learning model to obtain a final drilling and production engineering cost prediction value.
2. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S1 includes the sub-steps of:
s11: performing blank value supplementation and filtering treatment on the petroleum drilling engineering cost data set by a machine learning method;
s12: performing target coding processing on the type parameter data in the data set after the vacancy value supplementation and the filtering processing;
s13: the petroleum drilling data set is recorded as a data set SD, and the data in the data set SD is normalized.
3. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 1, wherein step S2 includes the sub-steps of:
s21: carrying out spearman correlation coefficient calculation on the single feature and the tag cost value in the data set SD to obtain an optimal feature subset PD;
s22: data screening is carried out on the optimal feature subset PD by utilizing a greedy strategy, and a core data subset CD is obtained;
s23: the core data subset CD is divided into a training set TrainD, a test set TestD and a verification set VD.
4. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S22 includes the sub-steps of:
s221: a greedy algorithm strategy is adopted for the optimal feature subset PD: over K iterations, the tuple t with the greatest utility is selected and added to the data subset $D'$; from the tuples not yet in $D'$, h tuples are uniformly sampled as the elements of a sample set $S$;
s222: the utility of each tuple in the sample set $S$ is computed:

$$u_t = U(t), \quad t \in S$$

where $u_t$ is the utility of $t$, $U(\cdot)$ is the utility calculation formula, and $t$ is an element of $S$;
s223: from the sample set $S$, the tuple $t^{*}$ with the greatest utility is selected and added to the data subset $D'$:

$$t^{*} = \arg\max_{t \in S} U(t)$$

$$D' \leftarrow D' \cup \{t^{*}\}$$

where $\arg\max$ is the maximum-utility selection formula and $D'$ is the data subset;
s224: repeating the steps to finally obtain the core data subset CD.
5. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 3, wherein step S3 includes the sub-steps of:
s31: constructing a random forest model, a support vector regression model and a classification lifting model, and carrying out model training by adopting a training set TrainD and a test set TestD;
s32: model evaluation, namely evaluating the generalization capability of the trained learning model by using a mean square error and an R square evaluation index;
s33: performing model optimization, namely performing super-parameter optimization on the learning model by using a Bayesian parameter optimization algorithm, and simultaneously monitoring whether the learning model is fitted or not so as to determine whether training is stopped;
s34: and respectively setting the super parameters of the learning model as the parameter combinations output in the step S33, putting the verification set VD divided in the step S2 into the learning model, and outputting the drilling and production engineering cost predicted value.
6. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as claimed in claim 5, wherein step S34 includes the sub-steps of:
s341: putting the verification set VD into a random forest model subjected to Bayesian optimization to obtain a first drilling engineering cost prediction value;
s342: putting the verification set VD into a support vector regression model after Bayesian optimization to obtain a second drilling engineering cost prediction value;
s343: and putting the verification set VD into a classification lifting model subjected to Bayesian optimization to obtain a third drilling engineering cost prediction value.
7. The drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning as set forth in claim 6, wherein step S4 includes the sub-steps of:
s41: selecting a linear regression model as a secondary learning model, and training the outputs of the random forest model, the support vector regression model and the classification lifting model as inputs of the linear regression model;
s42: and (3) taking the first drilling engineering cost predicted value, the second drilling engineering cost predicted value and the third drilling engineering cost predicted value obtained in the step (S3) as inputs of a trained linear regression model, and outputting to obtain a final drilling engineering cost predicted value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410122065.7A CN117648646B (en) | 2024-01-30 | 2024-01-30 | Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117648646A true CN117648646A (en) | 2024-03-05 |
CN117648646B CN117648646B (en) | 2024-04-26 |
Family
ID=90046383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410122065.7A Active CN117648646B (en) | 2024-01-30 | 2024-01-30 | Drilling and production cost prediction method based on feature selection and stacked heterogeneous integrated learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117648646B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7660705B1 (en) * | 2002-03-19 | 2010-02-09 | Microsoft Corporation | Bayesian approach for learning regression decision graph models and regression models for time series analysis |
CN113626315A (en) * | 2021-07-27 | 2021-11-09 | 江苏大学 | Dual-integration software defect prediction method combined with neural network |
WO2022151654A1 (en) * | 2021-01-14 | 2022-07-21 | 新智数字科技有限公司 | Random greedy algorithm-based horizontal federated gradient boosted tree optimization method |
CN117349751A (en) * | 2023-10-24 | 2024-01-05 | 中国地质大学(武汉) | Loess landslide slip distance prediction method and system based on meta-learning and Bayesian optimization |
CN117439053A (en) * | 2023-10-15 | 2024-01-23 | 国网天津市电力公司电力科学研究院 | Method, device and storage medium for predicting electric quantity of Stacking integrated model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||