CN117093922A

CN117093922A - Improved SVM-based complex fluid identification method for unbalanced sample oil reservoir

Info

Publication number: CN117093922A
Application number: CN202311068108.XA
Authority: CN
Inventors: 毛敏; 刘娟霞; 徐长敏; 何理鹏; 杨毅; 印森林; 罗思雨
Original assignee: China France Bohai Geoservices Co Ltd
Current assignee: China France Bohai Geoservices Co Ltd
Priority date: 2023-08-23
Filing date: 2023-08-23
Publication date: 2023-11-21

Abstract

The application relates to a method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM. The method first expands the data with an ADASYN model, which reduces the likelihood of overfitting, and then combines multiple weak SVM classifiers through AdaBoost.M2-SVM, which improves the generalization capability of the model and addresses the poor adaptability and non-unique interpretations of traditional methods in unconventional reservoir fluid identification. In addition, by combining logging data with an artificial intelligence algorithm, the method enables intelligent identification of the oil, gas and water layers of a reservoir. The method comprises data processing, algorithm design, model training and model deployment, and forms an intelligent reservoir oil-gas-water layer identification system that incorporates logging data.

Description

Improved SVM-based complex fluid identification method for unbalanced sample oil reservoir

Technical Field

The application relates to a method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM, and belongs to the technical field of complex reservoir fluid identification.

Background

In recent years, multiple sets of low-resistivity reservoirs have been encountered during drilling in offshore oil field exploration. Identifying the fluid properties of low-resistivity reservoirs is an important part of oil and gas exploration and development research, and its accuracy directly affects the discovery and efficient development of oil and gas reservoirs. Currently, reservoir fluid property identification relies primarily on mud logging data and well logging data. Mud logging data are first-hand exploration data that reflect the original formation properties directly and clearly. They are among the most important data for helping specialists find and evaluate reservoirs, and through various techniques (such as cuttings logging, core logging and X-ray element logging) they can effectively identify the lithology, physical properties and fluid properties of a reservoir. Mud logging data are obtained directly from downhole oil and gas information and are therefore intuitive and accurate, but they have some drawbacks. First, the sampling interval is large, which may make the information discontinuous. Second, the layering accuracy is low, so the interpretation of reservoir thickness is not accurate enough.

Well logging techniques mainly distinguish hydrocarbon layers from water layers by differences in how logging curves, such as resistivity and sonic logs, respond to fluid properties. Resistivity logging distinguishes hydrocarbon layers from water layers by the different resistance that different media present to electric current, while sonic logging distinguishes media by the different propagation velocities of acoustic waves in them, which allows the presence of hydrocarbons to be judged. Auxiliary logging curves such as natural gamma and neutron logs may also be used to identify reservoir fluid properties. However, in a low-resistivity reservoir the logging responses of oil layers and water layers in the same reservoir do not differ clearly, and there is no distinct resistivity boundary between oil layers and water layers. Low-resistivity oil layers are therefore difficult to identify by conventional logging and are often misinterpreted as water layers and missed.

Well logging interpretation methods mainly comprise crossplot methods based on mathematical statistics and methods based on volume models and theoretical formulas. However, for unconventional reservoirs such as low-resistivity reservoirs, the simplified volume model adapts poorly in practice, and the established empirical formulas have poor accuracy and generalize poorly.

With the development of artificial intelligence technology, researchers have gradually begun to try data-driven reservoir fluid identification methods, for example algorithms based on convolutional neural networks and long short-term memory networks. However, logging data have an imbalanced class distribution, and deep networks easily overfit them, so the resulting models generalize poorly and are difficult to apply in actual development. Therefore, to solve these problems, it is necessary to develop a method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM (Support Vector Machine).

Disclosure of Invention

The purpose of the application is to provide a method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM, so as to solve the poor accuracy and poor generalization of existing reservoir complex fluid identification methods.

The technical scheme of the application is as follows:

A method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM, characterized in that the identification method comprises the following steps:

step 1, collecting and organizing logging data of the research area, and establishing a data layer;

step 2, cleaning the data in the data layer to obtain a final data set, wherein the cleaning comprises data sorting, abnormal data deletion, filling of empty data, data interpolation and data dimension reduction;

step 3, because the lithology data and the gas measurement data have a different sampling resolution from the other data in step 2.1, interpolating the lithology data and the gas measurement data;

step 4, for the 15 factors affecting reservoir fluid properties counted in step 1.2, using principal component analysis to reduce the data dimension, converting multiple correlated indicators into fewer mutually uncorrelated indicators; the principal component analysis is calculated as follows:

step 5, in the data processed in step 4, the proportions of oil layers, oil-water layers and non-oil layers differ too much, and oil layers account for a relatively low proportion; a model trained on such imbalanced data tends to perform poorly, so an improved Adaptive Synthetic Sampling (ADASYN) algorithm is used for data augmentation; step 5.1, for the samples obtained in step 4, the improved ADASYN is used to randomly oversample the minority-class samples; the algorithm comprises the following calculation steps:

step 6, constructing an AdaBoost.M2-SVM model;

step 7, inputting the data processed in step 5 into the model for training, and identifying the oil, gas and water layers from the logging data with the trained AdaBoost.M2-SVM algorithm.

The application has the advantages that:

The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples first expands the data with an ADASYN model, which reduces the likelihood of overfitting, and then combines multiple weak SVM classifiers through AdaBoost.M2-SVM, which improves the generalization capability of the model and addresses the poor adaptability and non-unique interpretations of traditional methods in unconventional reservoir fluid identification. In addition, by combining logging data with an artificial intelligence algorithm, the method enables intelligent identification of the oil, gas and water layers of a reservoir. The method comprises data processing, algorithm design, model training and model deployment, and forms an intelligent reservoir oil-gas-water layer identification system that incorporates logging data.

Drawings

FIG. 1 is a flow chart of the present application;

FIG. 2 is a sample graph obtained when the present application is applied;

FIG. 3 is a graph showing the results of inputting a training set and a validation set into an AdaBoost.M2-SVM model when the present application is applied.

Detailed Description

The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples comprises the following steps:

step 1.1, selecting logging data of a plurality of wells in the work area to form a data set, and taking one group of data every 0.1 m to obtain the final required samples;

step 1.2, because samples collected by different downhole sensors are at inconsistent depths and cannot meet the training requirements of the model, the data are matched and spliced according to their depth so that samples from different depths are unified to a common depth. Lithology data, fluorescence level, gas measurement data (C1, C2, C3, iC4, nC4, iC5, nC5, CO2 and H2S) and induction data (resistivity, natural gamma, neutron and density) are selected as feature data; gas layer, oil-water layer, water layer and dry layer are used as label data; and the data layer is established from these data.

step 2.1, the "fluorescence level" and "logging comprehensive interpretation" columns in the original Excel table of the data layer contain blank values, which need to be filled. For "fluorescence level", the value is too low in some non-reservoir intervals, so staff did not record it in the original Excel table; these blanks should be filled with 0. "Logging comprehensive interpretation" is treated in the same way, with blank entries generally filled with "water layer". Records with genuinely missing values are deleted in their entirety. For the gas measurement data, the invalid value in the original data is -999.25, and any record containing a negative value is deleted in its entirety.
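The filling and deletion rules of step 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the record layout (a list of dicts), the field names `fluorescence`, `interpretation`, `C1`, `C2`, `C3`, and the helper name `clean_rows` are all hypothetical, and blanks are represented by `None`.

```python
INVALID = -999.25  # invalid-value marker quoted in step 2.1


def clean_rows(rows, gas_keys=("C1", "C2", "C3")):
    """Apply the step 2.1 rules to a list of dict records.

    Field names are hypothetical; adapt them to the real spreadsheet headers.
    """
    cleaned = []
    for row in rows:
        row = dict(row)  # work on a copy so the input is not modified
        # A blank fluorescence level was simply not recorded: fill with 0.
        if row.get("fluorescence") is None:
            row["fluorescence"] = 0
        # A blank comprehensive interpretation is treated as a water layer.
        if row.get("interpretation") is None:
            row["interpretation"] = "water layer"
        # Drop the whole record if any gas reading is negative (e.g. -999.25).
        if any(row[k] is not None and row[k] < 0 for k in gas_keys):
            continue
        # Drop the whole record if any other value is genuinely missing.
        if any(value is None for value in row.values()):
            continue
        cleaned.append(row)
    return cleaned


rows = [
    {"depth": 703.0, "fluorescence": None, "interpretation": None,
     "C1": 0.12, "C2": 0.01, "C3": 0.0},
    {"depth": 703.1, "fluorescence": 3, "interpretation": "oil layer",
     "C1": INVALID, "C2": 0.02, "C3": 0.0},
    {"depth": 703.2, "fluorescence": 2, "interpretation": "oil layer",
     "C1": 0.30, "C2": None, "C3": 0.0},
]
result = clean_rows(rows)
```

In this sample only the first record survives: its blanks are filled, while the second record is dropped for the invalid gas value and the third for a genuinely missing reading.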

step 3.1, because the gas measurement data are sampled every 1 m while the induction data are sampled every 0.1 m, the gas measurement data are interpolated using several classical interpolation algorithms (nearest-neighbor, bilinear and bicubic interpolation) so that they become more finely sampled;

the principle of the bilinear interpolation algorithm is as follows:

assuming P is the point of the gas measurement data to be solved and Q11, Q12, Q21 and Q22 are its neighboring points, linear interpolation is performed between Q11 and Q21 and between Q12 and Q22 to obtain R1 and R2 respectively, and linear interpolation between R1 and R2 then gives the value at point P;
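The bilinear principle described here can be written as two linear interpolations followed by a third. The sketch below is illustrative only: the text does not say which two coordinates the grid spans, so the axis names `x` and `y` and the function names `lerp` and `bilinear` are assumptions.

```python
def lerp(x, x0, v0, x1, v1):
    """Linear interpolation between (x0, v0) and (x1, v1), evaluated at x."""
    if x1 == x0:
        return v0
    return v0 + (v1 - v0) * (x - x0) / (x1 - x0)


def bilinear(x, y, x1, x2, y1, y2, q11, q21, q12, q22):
    """Value at P = (x, y) from four neighbors.

    q11 = f(x1, y1), q21 = f(x2, y1), q12 = f(x1, y2), q22 = f(x2, y2).
    """
    r1 = lerp(x, x1, q11, x2, q21)  # R1: between Q11 and Q21
    r2 = lerp(x, x1, q12, x2, q22)  # R2: between Q12 and Q22
    return lerp(y, y1, r1, y2, r2)  # P: between R1 and R2


p = bilinear(0.5, 0.5, 0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 3.0, 4.0)

# Along depth alone the same helper refines a 1 m gas reading to 0.1 m,
# e.g. a value at 703.3 m from readings at 703 m and 704 m.
c1_at_703_3 = lerp(703.3, 703.0, 0.2, 704.0, 0.4)
```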

step 3.2, lithology data are interval data, and the lithology of a depth interval is reflected.

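step 4.1, assuming there are n factors affecting the reservoir fluid properties and m samples in total, the sample matrix can be expressed as:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

where $x_{ij}$ is the value of the j-th variable in the i-th group of sample data;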

step 4.2, the low-resistivity reservoir fluid properties of the study block are affected by many factors with different dimensions (units), which affects both the analysis of the main controlling factors and the prediction results of subsequent steps; to eliminate the dimensional influence among factors, the matrix X is standardized to obtain the matrix Z by subtracting the mean and dividing by the standard deviation:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $\bar{x}_j$ and $s_j$ are the mean and the standard deviation of the j-th factor over the m samples;

Step 4.3, computing the correlation coefficient matrix R of the standardized matrix Z, and solving the characteristic equation $|R - \lambda I_n| = 0$ of the sample correlation matrix R to obtain n eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$;

Step 4.4, determining the value of k so that the cumulative contribution rate of the information reaches at least 85%:

$$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j} \ge 85\%$$

Finally, 12 distinct influencing factors P1, P2, …, P12 are obtained.
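Steps 4.2 to 4.4 can be illustrated with a short standard-library sketch. It assumes the sample standard deviation (divisor m - 1) for the standardization and for the correlation matrix, which the text does not specify. It takes the eigenvalues as given, because solving the characteristic equation would normally be delegated to a numerical library. The function names are hypothetical.

```python
import statistics


def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the sample std dev.

    A constant column has zero deviation and would need special handling.
    """
    cols = list(zip(*X))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    return [[(v - mu) / s for v, mu, s in zip(row, means, stds)] for row in X]


def correlation_matrix(Z):
    """Correlation matrix of standardized data, R = Z^T Z / (m - 1)."""
    m, n = len(Z), len(Z[0])
    return [[sum(Z[i][a] * Z[i][b] for i in range(m)) / (m - 1)
             for b in range(n)] for a in range(n)]


def choose_k(eigenvalues, threshold=0.85):
    """Smallest k whose leading eigenvalues reach the cumulative contribution."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    running = 0.0
    for k, lam in enumerate(vals, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(vals)


Z = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
R = correlation_matrix(Z)
k = choose_k([3.0, 1.5, 0.4, 0.1])  # made-up eigenvalues for illustration
```

With the made-up eigenvalues above, the first component explains 60% and the first two explain 90%, so two components are kept.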

Step 5, in the data processed in step 4, the proportions of oil layers, oil-water layers and non-oil layers differ too much, and oil layers account for a relatively low proportion; a model trained on such imbalanced data tends to perform poorly, so an improved Adaptive Synthetic Sampling (ADASYN) algorithm is used for data augmentation;

step 5.1, for the samples obtained in step 4, the improved ADASYN is used to randomly oversample the minority-class samples; the algorithm comprises the following calculation steps:

(1) training a classifier on the training set of samples, and testing the identification result on the validation set;

(2) calculating the misclassification rate $\sigma_i$ of each class from the resulting confusion matrix, where $i, j \in \{a, b, c, \dots\}$; $TP_i$ is the number of samples whose true class is i and whose predicted class is also i; $FN_j$ is the number of samples whose true class is i but which are predicted as class j; when $\sigma_i$ is greater than the threshold, data augmentation is performed between class i and class j, otherwise no data augmentation is performed; of class i and class j, the number of samples in the class with more samples is denoted $s_m$ and that in the class with fewer samples is denoted $s_l$;

(3) calculating the number G of samples to be synthesized between $s_m$ and $s_l$: $G = (s_m - s_l) \times \beta$, where $s_m$ and $s_l$ are the numbers of samples of the larger and the smaller class respectively, and $\beta$ is a random number between 0 and 1;

(4) for each sample $x_i$ of the smaller class, calculating the proportion $r_i$ of majority-class samples among its k nearest samples by Euclidean distance: $r_i = \Delta_i / k$, where $\Delta_i$ is the number of majority-class samples among the k nearest samples by Euclidean distance, $i = 1, 2, 3, \dots, s_l$;

(5) normalizing $r_i$: $\hat{r}_i = r_i / \sum_{i=1}^{s_l} r_i$;

(6) calculating the number of samples $g_i$ to be generated for each minority-class sample: $g_i = \hat{r}_i \times G$;

(7) generating new samples with the traditional SMOTE algorithm according to the number $g_i$ to be generated for each minority-class sample;

(8) adding the newly generated samples $N_i$ to the training set to obtain a new training set, and repeating steps (1) and (2) until the misclassification rate $\sigma$ is smaller than the threshold.
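Steps (3) to (7) follow the usual ADASYN allocation, which the sketch below illustrates for one pair of classes. It is a simplified, standard-library illustration rather than the applicant's improved variant: the confusion-matrix trigger of steps (1), (2) and (8) is omitted, `beta` is passed in instead of being drawn at random, and the function names are hypothetical.

```python
import math
import random


def adasyn_allocation(minority, majority, k=5, beta=1.0):
    """Number of synthetic samples to generate for each minority point."""
    G = (len(majority) - len(minority)) * beta  # step (3)
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        # Distances to every other sample; label 1 marks the majority class.
        dists = sorted((math.dist(x, p), lab)
                       for j, (p, lab) in enumerate(labelled) if j != i)
        delta = sum(lab for _, lab in dists[:k])
        ratios.append(delta / k)  # step (4): r_i = delta_i / k
    total = sum(ratios)
    if total == 0:  # no minority point has majority-class neighbors
        return [0] * len(minority)
    # Steps (5) and (6): normalize r_i, then scale by G.
    return [round(r / total * G) for r in ratios]


def smote_point(x, neighbor, rng=random):
    """Step (7): a random point on the segment between x and a neighbor."""
    lam = rng.random()
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))


minority = [(0.0, 0.0), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
majority = [(0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (-0.1, 0.0),
            (0.0, -0.1), (-0.1, -0.1), (0.2, 0.2), (-0.2, 0.2)]
g = adasyn_allocation(minority, majority, k=2, beta=1.0)
new_point = smote_point((0.0, 0.0), (1.0, 2.0), random.Random(0))
```

In this toy data only the minority point at the origin is surrounded by majority samples, so all four synthetic samples are allocated to it. That is the adaptive behavior that distinguishes ADASYN from plain SMOTE.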

Step 6, constructing an AdaBoost.M2-SVM model;

as a generalization of the AdaBoost algorithm, AdaBoost.M2 converts a K-class classification problem into K-1 classification problems, so that the AdaBoost algorithm can be applied to multi-class problems. The AdaBoost.M2 algorithm maintains a weight distribution over the training data in the training set, computes classifier weights, and combines multiple linear-kernel SVM classifiers with different weights;

the AdaBoost.M2 algorithm is as follows:

Input: training set $S = \{(x_i, y_i)\},\ i = 1, \dots, N$, where N is the number of samples, the vector $x_i$ is the i-th training sample, and the label $y_i \in Y = \{1, \dots, K\}$, with K the number of classes; number of iterations T;

In the t-th iteration, the weight distribution of sample $(x_i, y_i)$ is $D_t(i)$. The sample weights are equal at the start; in each iteration the weights of misclassified samples increase, so that they receive more training. For a sample $x_i$, the correct class is $y_i$ and the incorrect classes are y (the K-1 classes other than $y_i$). Assume that training yields a weak classifier $h_t$ whose output takes values in [0, 1]. For sample $(x_i, y_i)$, the classifier $h_t$ makes K-1 judgments, each with three possible outcomes: correct classification, incorrect classification, or a random choice between y and $y_i$. The probability of error in each judgment is then:

$$\tfrac{1}{2}\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$$

For a K-class problem, the K-1 different y have different importance in different cases, so each y is given a weight $q_t(i, y)$; the pseudo-loss $\varepsilon_t$ of AdaBoost.M2 is then:

$$\varepsilon_t = \tfrac{1}{2} \sum_{i=1}^{N} D_t(i) \Bigl(1 - h_t(x_i, y_i) + \sum_{y \ne y_i} q_t(i, y)\, h_t(x_i, y)\Bigr)$$

(1) Initialize the sample weights to $D_1(i) = 1/N$; the weight of an incorrect label y of sample i in the first iteration is $w^1_{i,y} = D_1(i)/(K-1)$;

(2) Loop over iterations t = 1, …, T:

① the sum of the weights of the incorrect labels of sample i in the t-th iteration is $W^t_i = \sum_{y \ne y_i} w^t_{i,y}$; for $y \ne y_i$, $q_t(i, y) = w^t_{i,y} / W^t_i$; the sample distribution is $D_t(i) = W^t_i / \sum_{i=1}^{N} W^t_i$;

② resample according to the sample distribution $D_t(i)$ and call the SVM to train on the samples, obtaining the sub-classifier $h_t$;

③ compute the pseudo-loss $\varepsilon_t$ of $h_t$;

④ let $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and update the weights: $w^{t+1}_{i,y} = w^t_{i,y}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$, where $i = 1, \dots, N$ and $y \ne y_i$;

(3) Obtain the final combined classifier:

$$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \Bigl(\log \frac{1}{\beta_t}\Bigr) h_t(x, y)$$
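One boosting round can be traced numerically with the sketch below. It follows the commonly published form of AdaBoost.M2 (mislabel weights, pseudo-loss, $\beta_t = \varepsilon_t/(1-\varepsilon_t)$), which is assumed here to be what the text describes. The SVM weak learner is not trained: its outputs `h[i][y]` are supplied directly, labels are numbered from 0, and all function names are hypothetical.

```python
import math


def init_weights(y_true, K):
    """w[i][y] = D_1(i) / (K - 1) for every wrong label y, with D_1(i) = 1/N."""
    N = len(y_true)
    return [[0.0 if y == y_true[i] else (1.0 / N) / (K - 1) for y in range(K)]
            for i in range(N)]


def m2_round(w, h, y_true, K):
    """One round: returns (D, pseudo_loss, beta, updated weights).

    h[i][y] is the weak classifier's plausibility in [0, 1] for label y.
    """
    N = len(w)
    W = [sum(w[i][y] for y in range(K) if y != y_true[i]) for i in range(N)]
    D = [Wi / sum(W) for Wi in W]  # sample distribution D_t(i)
    eps = 0.0
    for i in range(N):
        yi = y_true[i]
        # sum over wrong labels of q_t(i, y) * h_t(x_i, y)
        wrong = sum(w[i][y] / W[i] * h[i][y] for y in range(K) if y != yi)
        eps += 0.5 * D[i] * (1.0 - h[i][yi] + wrong)
    beta = eps / (1.0 - eps)
    new_w = [[0.0 if y == y_true[i]
              else w[i][y] * beta ** (0.5 * (1.0 + h[i][y_true[i]] - h[i][y]))
              for y in range(K)] for i in range(N)]
    return D, eps, beta, new_w


def combine(hs, betas, K):
    """Final label for one input: argmax_y of sum_t log(1/beta_t) * h_t(x, y)."""
    scores = [sum(math.log(1.0 / b) * h[y] for h, b in zip(hs, betas))
              for y in range(K)]
    return max(range(K), key=scores.__getitem__)


y_true = [0, 1]
h = [[1.0, 0.0, 0.0],  # sample 0: confidently correct
     [0.0, 0.0, 1.0]]  # sample 1: confidently wrong (predicts label 2)
D, eps, beta, new_w = m2_round(init_weights(y_true, 3), h, y_true, 3)
label = combine([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.5, 0.9], 3)
```

After the round, the wrong label that the weak classifier actually chose for sample 1 keeps the largest weight, so the next weak classifier is pushed to separate exactly that pair of classes.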

To verify the validity of the application, the applicant applied it to 32 wells of an oil field. The application process is as follows; the improved-SVM-based complex reservoir fluid identification method comprises the following steps:

step 1.1, selecting logging data of 32 wells in the work area to form a data set, and taking one group of data every 0.1 m to obtain the final required samples (see FIG. 2);

step 1.2, because samples collected by different sensors are at inconsistent depths and cannot meet the training requirements of the model, the data are matched and spliced according to their depth so that samples from different depths are unified to a common depth. Lithology, fluorescence level, gas measurement data (C1, C2, C3, iC4, nC4, iC5, nC5, CO2 and H2S) and induction data (resistivity, natural gamma, neutron and density) are selected as feature data; gas layer, oil-water layer, water layer and dry layer are used as label data; and the data layer is established from these data.

step 2.1, the "fluorescence level" and "logging comprehensive interpretation" columns in the original Excel table contain blank values, which need to be filled. For "fluorescence level", the value is too low in some non-reservoir intervals, so the operator did not record it in the original Excel table; these blanks should be filled with 0. "Logging comprehensive interpretation" is treated in the same way, with blank entries generally filled with "water layer". Records with genuinely missing values are deleted in their entirety. For the gas measurement values, the invalid value in the original data is -999.25, and any record containing a negative value is deleted in its entirety.

step 3.1, in this application the gas measurement data are sampled every 1 m while the induction data are sampled every 0.1 m, so the gas measurement data are interpolated using several classical interpolation algorithms (nearest-neighbor, bilinear and bicubic interpolation) so that they become more finely sampled. The principle of the bilinear interpolation algorithm is as follows: assuming P is the point of the gas measurement data to be solved and Q11, Q12, Q21 and Q22 are its neighboring points, linear interpolation is performed between Q11 and Q21 and between Q12 and Q22 to obtain R1 and R2 respectively, and linear interpolation between R1 and R2 then gives the value at point P;

step 3.2, lithology data are interval data that describe the lithology of a depth interval. For example, in well EP10-2-2 the lithology at a depth of 703-720 m is pebbly coarse sandstone; the interval data are converted into data at 0.1 m spacing, so that the lithology at 703.1 m, 703.2 m, …, 719.9 m is pebbly coarse sandstone.
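The interval-to-point conversion in this example can be sketched as follows. The function name `expand_interval` is hypothetical, and including both end points (703.0 m and 720.0 m) is an assumption, since the text only lists the interior depths.

```python
def expand_interval(top_m, bottom_m, lithology, step_m=0.1):
    """Expand one (top, bottom, lithology) interval into step-spaced samples.

    Depths are derived from integer step counts and rounded, which avoids
    the drift that repeated floating-point addition would introduce.
    """
    n_steps = round((bottom_m - top_m) / step_m)
    return [(round(top_m + i * step_m, 3), lithology)
            for i in range(n_steps + 1)]


samples = expand_interval(703.0, 720.0, "pebbly coarse sandstone")
```

Each resulting pair can then be joined to the other 0.1 m data on the depth value.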

Step 4, for the 15 factors affecting reservoir fluid properties counted in step 1.2, principal component analysis is used to reduce the data dimension, converting multiple correlated indicators into fewer mutually uncorrelated indicators. The principal component analysis is calculated as follows: step 4.1, assuming there are n factors affecting the reservoir fluid properties and m samples in total, the sample matrix can be expressed as:

$$X = (x_{ij})_{m \times n}$$

where $x_{ij}$ is the value of the j-th variable in the i-th group of sample data;

step 4.2, the low-resistivity reservoir fluid properties of the study block are affected by many factors with different dimensions (units), which affects both the analysis of the main controlling factors and the prediction results of subsequent steps. To eliminate the dimensional influence among factors, the matrix X is standardized to obtain the matrix Z by subtracting the mean and dividing by the standard deviation:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $\bar{x}_j$ and $s_j$ are the mean and the standard deviation of the j-th factor over the m samples.

Finally, 12 distinct influencing factors P1, P2, …, P12 are obtained.

Step 5, in the data processed in step 4, the proportions of oil layers, oil-water layers and non-oil layers differ too much, and oil layers account for a relatively low proportion; a model trained on such imbalanced data tends to perform poorly, so an improved Adaptive Synthetic Sampling (ADASYN) algorithm is used for data augmentation;

(2) calculating the misclassification rate $\sigma_i$ of each class from the resulting confusion matrix, where $i, j \in \{a, b, c, \dots\}$; $TP_i$ is the number of samples whose true class is i and whose predicted class is also i; $FN_j$ is the number of samples whose true class is i but which are predicted as class j. When $\sigma_i$ is greater than the threshold, data augmentation is performed between class i and class j; otherwise no data augmentation is performed. Of class i and class j, the number of samples in the class with more samples is denoted $s_m$ and that in the class with fewer samples is denoted $s_l$;

(5) normalizing $r_i$: $\hat{r}_i = r_i / \sum_{i=1}^{s_l} r_i$;

(8) adding the newly generated samples $N_i$ to the training set to obtain a new training set, and repeating steps (1) and (2) until the misclassification rate $\sigma$ is smaller than the threshold.

Step 6, constructing an AdaBoost.M2-SVM model;

the AdaBoost.M2 algorithm is as follows:

Input: training set $S = \{(x_i, y_i)\},\ i = 1, \dots, N$, where N is the number of samples, the vector $x_i$ is the i-th training sample, and the label $y_i \in Y = \{1, \dots, K\}$, with K the number of classes; number of iterations T;

In the t-th iteration, the weight distribution of sample $(x_i, y_i)$ is $D_t(i)$. The sample weights are equal at the start. In each iteration the weights of misclassified samples increase, so that they receive more training. For a sample $x_i$, the correct class is $y_i$ and the incorrect classes are y (the K-1 classes other than $y_i$). Assume that training yields a weak classifier $h_t$ whose output takes values in [0, 1]. For sample $(x_i, y_i)$, the classifier $h_t$ makes K-1 judgments, each with three possible outcomes: correct classification, incorrect classification, or a random choice between y and $y_i$. The probability of error in each judgment is then

$$\tfrac{1}{2}\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$$

For a K-class problem, the K-1 different y have different importance in different cases, so each y is given a weight $q_t(i, y)$; the pseudo-loss of AdaBoost.M2 is then

$$\varepsilon_t = \tfrac{1}{2} \sum_{i=1}^{N} D_t(i) \Bigl(1 - h_t(x_i, y_i) + \sum_{y \ne y_i} q_t(i, y)\, h_t(x_i, y)\Bigr)$$

(1) Initialize the sample weights to $D_1(i) = 1/N$; the weight of an incorrect label y of sample i in the first iteration is $w^1_{i,y} = D_1(i)/(K-1)$;

(2) Loop over iterations t = 1, …, T:

③ compute the pseudo-loss $\varepsilon_t$ of $h_t$;

④ let $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and update the weights: $w^{t+1}_{i,y} = w^t_{i,y}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$, where $i = 1, \dots, N$ and $y \ne y_i$;

(3) Obtain the final combined classifier:

$$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \Bigl(\log \frac{1}{\beta_t}\Bigr) h_t(x, y)$$

Step 6.1, inputting the data processed in step 5 into the model for training, and identifying the oil, gas and water layers from the logging data with the trained AdaBoost.M2-SVM algorithm;

the AdaBoost.M2-SVM classification model interprets the logged oil, gas and water layers through the following steps:

(1) dividing the data expanded in step 5 into a training set, a validation set and a test set in the ratio 7:2:1;

(2) inputting the training set and the validation set into the AdaBoost.M2-SVM model, and calculating Accuracy, Precision, Recall and F1-score from the confusion matrix as evaluation indicators of the model (see FIG. 3). Comparing the traditional SVM model with AdaBoost.M2-SVM under the same data augmentation algorithm shows that AdaBoost.M2-SVM achieves higher accuracy. AdaBoost.M2-SVM also achieves higher accuracy on different logging databases, indicating better generalization capability.
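The 7:2:1 split and the four evaluation indicators can be sketched with the standard library alone. The function names, the fixed shuffling seed and the confusion-matrix values below are illustrative assumptions, not results from the application.

```python
import random


def split_721(samples, seed=0):
    """Shuffle, then split into training, validation and test sets (7:2:1)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def metrics_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j.

    Returns overall accuracy and per-class (precision, recall, F1).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # true i, predicted other
        fp = sum(cm[r][i] for r in range(n)) - tp   # true other, predicted i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    return accuracy, per_class


train, val, test = split_721(range(100))
accuracy, per_class = metrics_from_confusion([[8, 2], [1, 9]])
```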

Claims

1. A method for identifying complex reservoir fluids from imbalanced samples based on an improved SVM, characterized in that the identification method comprises the following steps:

step 2.1, the "fluorescence level" and "logging comprehensive interpretation" columns in the original Excel table of the data layer contain blank values, which need to be filled; for "fluorescence level", the value is too low in some non-reservoir intervals, so staff did not record it in the original Excel table, and these blanks should be filled with 0; "logging comprehensive interpretation" is treated in the same way, with blank entries generally filled with "water layer"; records with genuinely missing values are deleted in their entirety; for the gas measurement data, the invalid value in the original data is -999.25, and any record containing a negative value is deleted in its entirety;

step 3.2, lithology data are interval data that describe the lithology of a depth interval;

step 4, for the 15 factors affecting reservoir fluid properties counted in step 1.2, principal component analysis is used to reduce the data dimension, converting multiple correlated indicators into fewer mutually uncorrelated indicators;

(5) normalizing $r_i$: $\hat{r}_i = r_i / \sum_{i=1}^{s_l} r_i$;

(8) adding the newly generated samples $N_i$ to the training set to obtain a new training set, and repeating steps (1) and (2) until the misclassification rate $\sigma$ is smaller than the threshold;

step 6, constructing an AdaBoost.M2-SVM model;

as a generalization of the AdaBoost algorithm, AdaBoost.M2 converts a K-class classification problem into K-1 classification problems, so that the AdaBoost algorithm can be applied to multi-class problems; the AdaBoost.M2 algorithm maintains a weight distribution over the training data in the training set, computes classifier weights, and combines multiple linear-kernel SVM classifiers with different weights;

2. The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples of claim 1, wherein collecting and organizing the logging data of the research area and establishing the data layer comprises the following steps:

step 1.2, because samples collected by different downhole sensors are at inconsistent depths and cannot meet the training requirements of the model, the data are matched and spliced according to their depth so that samples from different depths are unified to a common depth; lithology data, fluorescence level, gas measurement data (C1, C2, C3, iC4, nC4, iC5, nC5, CO2 and H2S) and induction data (resistivity, natural gamma, neutron and density) are selected as feature data; gas layer, oil-water layer, water layer and dry layer are used as label data; and the data layer is established from these data.

3. The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples of claim 1, wherein the principle of the bilinear interpolation algorithm is as follows:

assuming P is the point of the gas measurement data to be solved and Q11, Q12, Q21 and Q22 are its neighboring points, linear interpolation is performed between Q11 and Q21 and between Q12 and Q22 to obtain R1 and R2 respectively, and linear interpolation between R1 and R2 then gives the value at point P.

4. The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples of claim 1, wherein the principal component analysis comprises the following calculation steps:

finally, 12 distinct influencing factors P1, P2, …, P12 are obtained.

5. The improved-SVM-based method for identifying complex reservoir fluids from imbalanced samples of claim 1, wherein the AdaBoost.M2 algorithm comprises the following steps:

in the t-th iteration, the weight distribution of sample $(x_i, y_i)$ is $D_t(i)$; the sample weights are equal at the start; in each iteration the weights of misclassified samples increase, so that they receive more training; for a sample $x_i$, the correct class is $y_i$ and the incorrect classes are y (the K-1 classes other than $y_i$); assume that training yields a weak classifier $h_t$ whose output takes values in [0, 1]; for sample $(x_i, y_i)$, the classifier $h_t$ makes K-1 judgments, each with three possible outcomes: correct classification, incorrect classification, or a random choice between y and $y_i$; the probability of error in each judgment is then

$$\tfrac{1}{2}\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$$

(2) loop over iterations t = 1, …, T:

③ compute the pseudo-loss $\varepsilon_t$ of $h_t$;

④ let $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and update the weights: $w^{t+1}_{i,y} = w^t_{i,y}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$, where $i = 1, \dots, N$ and $y \ne y_i$;

(3) obtain the final combined classifier:

$$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \Bigl(\log \frac{1}{\beta_t}\Bigr) h_t(x, y)$$