Disclosure of Invention
To solve the above problems, the invention aims to provide a random-forest-based method for quantitatively evaluating the main controlling factors of tight sandstone gas reservoir quality.
The technical scheme of the invention is as follows:
A random-forest-based method for quantitatively evaluating the main controlling factors of tight sandstone gas reservoir quality comprises the following steps:
S1: collecting the influence factors that affect reservoir quality in the study area, and processing each parameter according to its type:
if the influence factor is a numerical parameter, checking whether any value is missing; if a value is missing, removing all influence factor data of the same batch; if nothing is missing, retaining the data;
if the influence factor is a text parameter, assigning numerical values to the text parameter;
S2: saving the processed parameters as a comma-separated values file;
S3: taking the characterization parameter of reservoir quality as the dependent variable of the reservoir quality analysis, and taking the influence factors as the independent variables;
S4: based on the dependent and independent variables, drawing training data by random sampling with replacement using the random forest algorithm, and constructing decision trees and a random forest;
S5: calculating, for each influence factor, the out-of-bag data error of the decision trees, and selecting the main controlling influence factors according to the out-of-bag data error.
Preferably, in step S1, the influence factors include depositional, diagenetic and structural influence factors.
Preferably, the depositional influence factors comprise grain-size lithology type, mineral lithology type, sorting, roundness, various grain-size parameters, contents of different types of mineral grains, matrix content and primary pore content; the diagenetic influence factors comprise cementation type, contents of different types of cement, contents of different types of dissolution pores, contents of different types of replacement minerals and compaction intensity; the structural influence factors comprise fracture type and contents of different types of fractures.
Preferably, in step S3, the characterization parameter of reservoir quality is the flow unit index.
Preferably, step S4 specifically includes the following sub-steps:
S41: using the random forest algorithm, taking the processed data set as the input sample set and drawing from it at random, with replacement, to form a data subset, the number of draws being equal to the number of samples; the data subset obtained by sampling is used to construct a decision tree;
S42: randomly selecting a portion of the influence factors from the data subset to form a candidate split set, selecting from the candidate split set one influence factor as the split point of the decision tree according to the principle of minimum node impurity, and continuing to split on the same principle until all samples of the node have reached leaf nodes, at which point splitting ends;
S43: repeating steps S41 and S42 to build a plurality of decision trees, which together form the random forest.
Preferably, step S5 specifically includes the following sub-steps:
S51: for a given decision tree, feeding all influence factors of its out-of-bag data into the constructed random forest, and calculating the predicted value of each out-of-bag sample;
S52: calculating mean square error I between the predicted values and the true values of the out-of-bag data;
S53: selecting one influence factor in the out-of-bag data, adding random noise to that influence factor, feeding the perturbed data into the random forest, and calculating the predicted values after the noise is added; calculating mean square error II between these predicted values and the true values;
S54: judging the importance of the influence factor from its mean square errors I and II:
if, after random noise is added, mean square error II is larger than mean square error I and their difference exceeds a threshold, the influence factor is important; otherwise it is unimportant;
S55: repeating steps S53 and S54 for each of the remaining influence factors in the out-of-bag data;
S56: repeating steps S51 to S55 for each remaining decision tree in the random forest, calculating the out-of-bag data error of each influence factor, taking the average of the out-of-bag data errors as the importance value of that influence factor, and sorting the importance values in descending order; the top-ranked influence factors are the main controlling factors of reservoir quality.
Preferably, the method further comprises the following steps:
S6: reconstructing the decision trees and the random forest using the main controlling influence factors selected by the out-of-bag data error;
S7: calculating the out-of-bag data errors of the selected main controlling influence factors, and taking the average out-of-bag data error of each as its importance value;
S8: calculating the importance value of each selected main controlling influence factor as a percentage of the sum of the importance values of all selected main controlling factors, thereby quantifying the degree of influence of each main controlling factor on reservoir quality.
The invention has the beneficial effects that:
The method uses the random forest algorithm to quantitatively evaluate the influence factors against a characterization parameter that reflects reservoir quality, thereby extracting the key factors that control reservoir quality, identifying the depositional and diagenetic processes that shape the reservoir, and supporting effective prediction of the formation and distribution of high-quality tight sandstone reservoirs. Conventional approaches rely mainly on human experience, numerical simulation, scanning electron microscope analysis, core test analysis and similar methods to determine the main controlling factors, and cannot quantify the importance of each influence factor. By contrast, the method applies the random forest algorithm to the evaluation of the main controlling factors of tight sandstone gas reservoir quality, so that the main controlling factors are both screened automatically and evaluated quantitatively. The results are general, objective and accurate, and can provide a geological basis for subsequent gas reservoir description and for improving oilfield development performance.
Detailed Description
The invention is further illustrated with reference to the following figures and examples. It should be noted that, in the present application, the embodiments and the technical features of the embodiments may be combined with each other without conflict. It is noted that, unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "comprising" or "including" and the like in the present disclosure is intended to mean that the elements or items listed before the term cover the elements or items listed after the term and their equivalents, but not to exclude other elements or items.
As shown in Figs. 1-5, the invention provides a random-forest-based method for quantitatively evaluating the main controlling factors of tight sandstone gas reservoir quality, comprising the following steps:
S1: collecting the influence factors that affect reservoir quality in the study area, including depositional, diagenetic and structural influence factors.
In a specific embodiment, the depositional influence factors comprise grain-size lithology type, mineral lithology type, sorting, roundness, various grain-size parameters, contents of different types of mineral grains, matrix content and primary pore content; the diagenetic influence factors comprise cementation type, contents of different types of cement, contents of different types of dissolution pores, contents of different types of replacement minerals and compaction intensity; the structural influence factors comprise fracture type and contents of different types of fractures. It should be noted that the relevant influence factors may differ between study areas: in other study areas some of the above factors may not apply, and other factors may be present in addition to those listed.
S2: processing each parameter according to its type, and saving the processed parameters as a comma-separated values file (.csv) that can be read by the R language; the specific processing is as follows:
If the influence factor is a numerical parameter, check whether any value is missing. If a value is missing, remove all influence factor data of the same batch; if nothing is missing, retain the data. For example, suppose there are X influence factors and Y batches of tests, giving X·Y data values in total; if the i-th batch yields data for only X-j (j ≥ 1) influence factors, all data of the i-th batch are removed, leaving X(Y-1) data values. This ensures data completeness and improves the accuracy of the subsequent calculations.
If the influence factor is a text parameter, assign numerical values to it. For example, the sorting grades "poor", "medium" and "good" are assigned 0, 1 and 2 respectively, and the support types "grain-supported" and "matrix-supported" are assigned 0 and 1, so that the model can process the text parameters. It should be noted that the magnitude of the assigned values has no influence on the result.
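The parameter processing of S2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the field names (`porosity`, `quartz`, `sorting`), the helper names `preprocess` and `to_csv`, and the sample records are hypothetical; only the sorting codes 0, 1 and 2 come from the text above.

```python
import csv
import io

# Assignment table for text parameters; the sorting codes follow the text,
# the field name "sorting" itself is a hypothetical column label.
TEXT_CODES = {"sorting": {"poor": 0, "medium": 1, "good": 2}}


def preprocess(records, numeric_fields, text_codes=TEXT_CODES):
    """Drop every batch with a missing value, then encode text parameters."""
    cleaned = []
    fields = list(numeric_fields) + list(text_codes)
    for rec in records:
        if any(rec.get(f) in (None, "") for f in fields):
            continue  # one missing parameter removes the whole batch
        row = {f: float(rec[f]) for f in numeric_fields}
        for field, codes in text_codes.items():
            row[field] = codes[rec[field]]
        cleaned.append(row)
    return cleaned


def to_csv(rows):
    """Serialize the processed rows as comma-separated values text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical records: three batches, the second lacks a quartz value.
records = [
    {"porosity": "8.5", "quartz": "62", "sorting": "good"},
    {"porosity": "9.1", "quartz": "", "sorting": "medium"},
    {"porosity": "7.2", "quartz": "58", "sorting": "poor"},
]
rows = preprocess(records, ["porosity", "quartz"])
csv_text = to_csv(rows)
```

With three batches and one incomplete batch, two batches remain, which mirrors the X(Y-1) count in the example above.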
S3: taking the characterization parameter of reservoir quality as the dependent variable of the reservoir quality analysis, and taking the influence factors as the independent variables; the characterization parameter of reservoir quality is determined by the following steps:
(1) collecting and collating production test data, including oil production, gas production, water production and liquid production;
(2) collecting and collating (or calculating) parameter data reflecting reservoir quality, including porosity, permeability, flow unit index (FZI) and pore-throat structure; the flow unit index (FZI) is calculated by the following formula, given here in its conventional form:
$$FZI = \frac{RQI}{\varphi_z}, \qquad RQI = 0.0314\sqrt{\frac{K}{\varphi}}, \qquad \varphi_z = \frac{\varphi}{1-\varphi}$$
in the formula: FZI is the flow unit index, dimensionless; RQI is the reservoir quality index, dimensionless; K is the permeability, D (the coefficient 0.0314 of the conventional form applies when K is expressed in mD);
φ is the porosity, % (used as a fraction in the formula);
φ_z is the ratio of pore volume to grain volume.
(3) taking the production test data as the standard for evaluating reservoir quality, analyzing the correlation between the production test data and the porosity, permeability, flow unit index (FZI) and pore-throat structure parameters, and selecting the parameter that best reflects reservoir quality as the characterization parameter of reservoir quality.
In a specific embodiment, the characterization parameter is determined by the above method, and the flow unit index is selected as the characterization parameter of reservoir quality.
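As an illustration of how the flow unit index might be computed, the sketch below uses the conventional definition FZI = RQI / φ_z with RQI = 0.0314·sqrt(K/φ) and φ_z = φ/(1-φ). It assumes permeability in mD and porosity as a fraction, which is the unit convention of the standard formula rather than a statement of the units used in the invention; the function name `flow_unit_index` is hypothetical.

```python
import math


def flow_unit_index(k_md, phi):
    """Return FZI from permeability (assumed mD) and fractional porosity.

    Uses the conventional form: RQI = 0.0314 * sqrt(K / phi),
    phi_z = phi / (1 - phi), FZI = RQI / phi_z.
    """
    if not 0.0 < phi < 1.0:
        raise ValueError("porosity must be a fraction between 0 and 1")
    if k_md < 0.0:
        raise ValueError("permeability must be non-negative")
    rqi = 0.0314 * math.sqrt(k_md / phi)   # reservoir quality index
    phi_z = phi / (1.0 - phi)              # pore volume / grain volume
    return rqi / phi_z


# A porosity reported as 10 % is passed as 0.10.
example_fzi = flow_unit_index(1.0, 0.10)
```

Porosity values reported in percent would need to be divided by 100 before being passed to the function.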
S4: based on the dependent and independent variables, drawing training data by random sampling with replacement using the random forest algorithm, and constructing decision trees and a random forest; this specifically includes the following sub-steps:
S41: using the random forest algorithm, the processed data set is taken as the input sample set, and samples are drawn from it at random with replacement to form a data subset, the number of draws being equal to the number of samples; the data subset obtained by sampling is used to construct a decision tree.
S42: a portion of the influence factors is randomly selected from the data subset to form a candidate split set; one influence factor is selected from the candidate split set as the split point of the decision tree according to the principle of minimum node impurity, and splitting continues on the same principle until all samples of the node have reached leaf nodes, at which point splitting ends.
The principle of minimum node impurity is the principle of minimum Gini coefficient: the Gini coefficient of each candidate split point is calculated, and the candidate with the smallest Gini coefficient is selected as the split node, which gives the lowest node impurity after splitting.
During construction of the decision tree, every node is split in this way. For example, if in a given data subset the randomly selected influence factors include primary intergranular pores, main grain size and moldic pores, their Gini coefficients are calculated and, in this example, the primary intergranular pores are selected as the split node; splitting continues until all samples of the node have reached leaf nodes.
S43: steps S41 and S42 are repeated to build a plurality of decision trees, which together form the random forest.
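The sampling in S41 and the random selection of candidate influence factors in S42 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the feature labels `x1` to `x34` and the fixed random seed are assumptions made for the example, and the one-third fraction is taken from the embodiment described later in this document.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Samples never drawn for this tree form its out-of-bag data.
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(features, rng, fraction=1 / 3):
    """Randomly pick a fraction of the influence factors as split candidates."""
    k = max(1, int(len(features) * fraction))
    return rng.sample(features, k)


rng = random.Random(0)                      # fixed seed only for repeatability
features = [f"x{i}" for i in range(1, 35)]  # hypothetical labels for 34 factors
in_bag, out_of_bag = bootstrap_indices(340, rng)
candidates = candidate_features(features, rng)
```

Because the draws are made with replacement, some indices repeat in `in_bag` and the indices that never appear make up `out_of_bag`, which is the data later used to compute the out-of-bag error.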
S5: calculating, for each influence factor, the out-of-bag data error of the decision trees, and selecting the main controlling influence factors according to the out-of-bag data error.
Because the decision trees are built by random sampling with replacement, some samples are never drawn for a given tree; these are called out-of-bag data. The error between the values predicted for the out-of-bag data of a decision tree and their true values is the out-of-bag data error. Step S5 specifically includes the following sub-steps:
S51: for a given decision tree, feeding all influence factors of its out-of-bag data into the constructed random forest, and calculating the predicted value of each out-of-bag sample;
S52: calculating mean square error I between the predicted values and the true values of the out-of-bag data;
S53: selecting one influence factor in the out-of-bag data, adding random noise to that influence factor, feeding the perturbed data into the random forest, and calculating the predicted values after the noise is added; calculating mean square error II between these predicted values and the true values;
S54: judging the importance of the influence factor from its mean square errors I and II:
if, after random noise is added, mean square error II is larger than mean square error I and their difference exceeds a threshold, the influence factor is important; otherwise it is unimportant;
S55: repeating steps S53 and S54 for each of the remaining influence factors in the out-of-bag data;
S56: repeating steps S51 to S55 for each remaining decision tree in the random forest, calculating the out-of-bag data error of each influence factor, taking the average of the out-of-bag data errors as the importance value of that influence factor, and sorting the importance values in descending order; the top-ranked influence factors are the main controlling factors of reservoir quality.
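Sub-steps S51 to S56 can be sketched in Python as below. The sketch makes several assumptions that are not stated in the text: the "random noise" is implemented by shuffling the values of the selected influence factor among the out-of-bag samples (a common choice for this kind of importance measure), each tree is represented by a hypothetical triple of a prediction function, its out-of-bag rows and their true values, and the toy predictor at the end stands in for a trained tree.

```python
import random


def mse(y_true, y_pred):
    """Mean square error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def oob_error_increase(predict, oob_rows, oob_targets, feature, rng):
    """Mean square error II minus mean square error I for one factor, one tree."""
    error1 = mse(oob_targets, [predict(r) for r in oob_rows])
    # Assumption: noise is added by permuting the factor across OOB samples.
    shuffled = [r[feature] for r in oob_rows]
    rng.shuffle(shuffled)
    noisy_rows = [dict(r, **{feature: v}) for r, v in zip(oob_rows, shuffled)]
    error2 = mse(oob_targets, [predict(r) for r in noisy_rows])
    return error2 - error1


def importance(trees, feature, rng):
    """Average the error increase of one factor over all trees (S56)."""
    scores = [oob_error_increase(p, rows, ys, feature, rng)
              for p, rows, ys in trees]
    return sum(scores) / len(scores)


# Toy stand-in for a trained tree: the target depends on x1 only.
rng = random.Random(0)
rows = [{"x1": float(i), "x2": float(i % 3)} for i in range(20)]
targets = [2.0 * r["x1"] for r in rows]
toy_tree = (lambda r: 2.0 * r["x1"], rows, targets)
w_x1 = importance([toy_tree], "x1", rng)
w_x2 = importance([toy_tree], "x2", rng)
```

In this toy case the factor that drives the prediction receives a positive importance value, while the factor the predictor ignores receives zero, which is the behaviour the ranking in S56 relies on.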
In a specific embodiment, the method for quantitatively evaluating the main controlling factors of reservoir quality further comprises the following steps:
S6: reconstructing the decision trees and the random forest using the main controlling influence factors selected by the out-of-bag data error;
S7: calculating the out-of-bag data errors of the selected main controlling influence factors, and taking the average out-of-bag data error of each as its importance value;
S8: calculating the importance value of each selected main controlling influence factor as a percentage of the sum of the importance values of all selected main controlling factors, thereby quantifying the degree of influence of each main controlling factor on reservoir quality.
In a specific embodiment, a tight sandstone gas reservoir study area is taken as an example. The study area is the Huangyan structure in the Xihu Sag of the East China Sea Basin, and the target interval is the lower member of the Oligocene Huagang Formation. The structure lies in the central-southern part of the central inversion structural belt of the Xihu Sag, East China Sea shelf basin; it is an NE-SW-trending anticlinal structure with relatively gentle strata. Huangyan 1-1 mainly develops anticlinal traps, and Huangyan 2-2 mainly develops a group of low-amplitude anticlines and faulted anticlines on a secondary compressional belt. The Huagang Formation was deposited and filled during the Oligocene rift-to-depression transition stage and mainly develops a shallow lake-delta depositional system in a continental setting, with a depositional thickness of 1000 to 2000 m; the lower Huagang member is thinner overall than the upper member. The main lithologies are subarkose, sublitharenite, feldspathic litharenite and lithic arkose. Porosity ranges from 2.1% to 12.5% and is concentrated mainly between 8% and 10%; permeability ranges from 0.02 to 22.72 mD, is below 2 mD except at one fractured location where it is higher, and is concentrated mainly between 0.1 and 0.4 mD. The reservoir is therefore a typical low-porosity, low-permeability tight sandstone reservoir.
Early development practice shows that although the sand bodies in the study area are vertically thick and laterally continuous, reservoir heterogeneity is extremely strong: productivity differs greatly between intervals of the same development well and between adjacent development wells. Some intervals produce tens of thousands of cubic meters of gas, while others have no productivity at all, so the key to the production problem is the difference in reservoir quality. Identifying the main controlling factors of reservoir quality is therefore the core basic problem for achieving efficient development of the tight sandstone gas reservoir in the study area.
The quantitative evaluation of the main controlling factors of reservoir quality in the study area comprises the following steps:
The first step: collecting the depositional, diagenetic and structural parameters that control reservoir quality.
(1) Collecting core description and experimental analysis data, comprising: 636 records of depositional parameters, such as grain-size lithology type, mineral lithology type, sorting, roundness, contents of different types of mineral grains, matrix content, various grain-size parameters (mean, standard deviation, skewness and kurtosis), percentage contents of different types of minerals and primary pore content; 345 records of diagenetic parameters, such as cementation type, contents of different types of cement (siliceous cement, calcareous cement, argillaceous cement and iron ore contents), contents of different types of dissolution pores (intergranular dissolution pores, intragranular dissolution pores and moldic pores), contents of different types of clay minerals, contents of different types of replacement minerals and compaction intensity; and 640 records of structural parameters, such as fracture type and contents of different types of fractures;
(2) for text-type parameter data, classifying the text data and assigning numerical values; for example, the sorting grades "poor", "medium" and "good" are assigned 0, 1 and 2 respectively, and the support types "grain-supported" and "matrix-supported" are assigned 0 and 1;
(3) applying the above steps and removing samples with missing parameters to obtain 340 samples; the data set is denoted Q and saved as a comma-separated values file (.csv) suitable for reading in the R language.
The second step: determining the characterization parameter for evaluating reservoir quality.
(1) Collecting and collating production test data, including oil production, gas production, water production and liquid production;
(2) collecting and collating (or calculating) parameter data reflecting reservoir quality, including porosity, permeability, flow unit index (FZI) and pore-throat structure;
(3) taking the production data as the standard of reservoir quality, analyzing the correlation between the production data and parameters such as porosity, permeability, flow unit index (FZI) and pore-throat structure; FZI is found to reflect reservoir quality best and is selected as the reservoir quality characterization parameter;
(4) taking the flow unit index (FZI) as the dependent variable for the quality of the tight sandstone gas reservoir, and taking the other influence factors of reservoir quality, such as intergranular dissolution pore content, main grain size, moldic pore content and primary intergranular pore content, as the independent variables.
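The correlation analysis in item (3) is not specified further in the text. One plausible reading is a Pearson correlation between production and each candidate parameter, sketched below. The choice of Pearson correlation, the function names and all numbers in the example are assumptions made for illustration only; the numbers are invented and are not data from the study area.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_characterization(production, candidates):
    """Rank candidate parameters by |correlation| with production."""
    scores = {name: abs(pearson(values, production))
              for name, values in candidates.items()}
    return max(scores, key=scores.get), scores


# Invented illustrative values, one entry per tested interval.
production = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "FZI": [2.0, 4.0, 6.0, 8.0],
    "porosity": [1.0, 3.0, 2.0, 4.0],
}
best, scores = best_characterization(production, candidates)
```

In this invented example `best` is the candidate with the strongest absolute correlation to production; with real data the ranking would of course depend on the measurements.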
The third step: constructing the decision trees and the random forest.
(1) Using the random forest algorithm, the processed data set Q of 340 samples is taken as the input sample set, and 340 draws are made from it at random with replacement to form a data subset M, which is used to construct a decision tree.
(2) Each sample in M contains 34 feature values, i.e. 34 influence factors, such as primary intergranular pore content, argillaceous matrix content and quartz content. One third of the influence factors are randomly selected from M to form the candidate split point set of the decision tree, denoted C_1. For example, the 11 randomly selected influence factors, such as K-feldspar content, plagioclase content, quartz content, roundness, sorting, support type, primary intergranular pore content and intergranular dissolution pore content, constitute C_1. The Gini coefficient of each influence factor in C_1 is calculated, and the factor with the smallest Gini coefficient, here the intergranular dissolution pores, is selected as the split point of the decision tree, dividing M into a left set and a right set, denoted M_L and M_R respectively;
The Gini coefficient is calculated as follows. The random forest uses CART decision trees. Because CART builds binary trees, if the sample subset M is divided into M_L and M_R according to whether feature A takes a certain possible value b, then the Gini coefficient of the set M under the condition of feature A is:
$$Gini(M, A) = \frac{|M_L|}{|M|}\,Gini(M_L) + \frac{|M_R|}{|M|}\,Gini(M_R)$$
in the formula: Gini(M) represents the uncertainty of the set M; Gini(M, A) represents the uncertainty of the set M after splitting on A = b. The larger the Gini coefficient, the greater the uncertainty of the samples and the lower the node purity after splitting.
(3) For the set M_L, the same method is applied: 11 influence factors are randomly selected to form the candidate split point set C_2 of the decision tree, for example argillaceous content, iron ore content, roundness and quartz content; the Gini coefficient of each influence factor in C_2 is calculated, and the factor with the smallest Gini coefficient, here the iron ore content, is selected as the split node, dividing M_L further into a left set and a right set. This process is repeated until every sample in M_L reaches a leaf node, at which point splitting ends.
(4) M_R is processed in the same way as M_L until every sample in M_R reaches a leaf node, at which point splitting ends and the decision tree for the data subset M is complete.
(5) Steps (1) to (4) are repeated many times to build a plurality of decision trees, which together form the random forest.
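The split selection described in the third step can be illustrated with a small sketch of the weighted Gini coefficient, Gini(M, A) = |M_L|/|M|·Gini(M_L) + |M_R|/|M|·Gini(M_R), with Gini(M) = 1 - Σ p_k² over the classes in M. The sketch assumes that the dependent variable has been grouped into discrete classes and that a split takes the form "feature value ≤ threshold"; neither detail is stated in the text, and the feature names, class labels and values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(M) = 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini of the left/right sets produced by one split."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels, candidates):
    """Return (gini, feature, threshold) with the smallest weighted Gini."""
    best = None
    for feature in candidates:
        for threshold in sorted({r[feature] for r in rows}):
            g = split_gini(rows, labels, feature, threshold)
            if best is None or g < best[0]:
                best = (g, feature, threshold)
    return best


# Hypothetical four-sample subset with two candidate factors "a" and "b".
rows = [{"a": 1, "b": 5}, {"a": 2, "b": 1}, {"a": 3, "b": 6}, {"a": 4, "b": 2}]
labels = ["low", "low", "high", "high"]
chosen = best_split(rows, labels, ["a", "b"])
```

Here factor "a" separates the two classes perfectly at the threshold 2, so it yields a weighted Gini coefficient of zero and is chosen as the split point.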
The fourth step: determining the main controlling factors of reservoir quality and expressing them quantitatively.
(1) Let the out-of-bag data of decision tree T_1 be O_1. O_1 is fed into the constructed random forest to obtain the predicted value e_1; the mean square error between the predicted value e_1 of the out-of-bag data and the true value E is calculated and denoted error_1:
$$error_1 = \mathrm{mean}\left((E - e_1)^2\right) \tag{5}$$
(2) In the out-of-bag data O_1, random noise is added, for every sample, to the feature intergranular dissolution pore content (denoted x_1) while the other feature values remain unchanged; the out-of-bag data after the noise is added are denoted O_2 and fed into the random forest, the resulting predicted value is denoted e_2, and the mean square error between the true value E and the predicted value e_2 is calculated and denoted error_2:
$$error_2 = \mathrm{mean}\left((E - e_2)^2\right) \tag{6}$$
(3) For the intergranular dissolution pore content (x_1), the difference between the two mean square errors is calculated and denoted S_{x_1}:
$$S_{x_1} = error_2 - error_1 \tag{7}$$
(4) The above steps are repeated to calculate the mean-square-error differences for the remaining 33 features in the out-of-bag data, such as argillaceous content, primary intergranular pore content and moldic pore content, denoted S_{x_i} (i = 2, ..., 34);
(5) Steps (1) to (4) are repeated for every decision tree in the random forest to calculate the out-of-bag data error of each influence factor; the average is taken as the importance value of each influence factor and denoted W_{x_i} (i = 1, 2, ..., 34):
$$W_{x_i} = \frac{1}{k}\sum_{t=1}^{k} S_{x_i}^{(t)} \tag{8}$$
In the formula: k is the number of decision trees in the random forest, and S_{x_i}^{(t)} is the value of S_{x_i} obtained for the t-th decision tree.
(6) W_{x_i} is sorted in descending order and the low-ranked influence factors are eliminated; for example, 20 influence factors including plagioclase content, siliceous cement content, K-feldspar content, sorting, quartz content and intragranular dissolution pore content are removed, and 14 factors including intergranular dissolution pore content, moldic pore content, graphic mean grain size, primary intergranular pore content and main grain size are preliminarily selected as the main controlling factors of reservoir quality.
(7) Based on the preliminarily selected main controlling influence factors, the decision trees and the random forest are reconstructed by the above steps and the out-of-bag data error is calculated; the experiment is repeated several times, the average error is taken as the importance value of each influence factor, and the importance values are converted to percentages. The results are shown in Table 1:
TABLE 1 Quantitative evaluation results of reservoir quality main controlling factors
Serial number | Influencing factor | Mean square error | Percentage of importance
1 | Intergranular dissolution pore content | 23.78726498 | 44.56%
2 | Moldic pore content | 5.978337387 | 11.20%
3 | Graphic mean grain size | 5.916006387 | 11.08%
4 | Primary intergranular pore content | 3.495146425 | 6.55%
5 | Kaolinite content | 3.256391689 | 6.10%
6 | Main grain size | 2.980834705 | 5.58%
7 | Calcareous cement content (calcite, dolomite) | 2.246414229 | 4.21%
8 | Illite content | 1.420160649 | 2.66%
9 | Argillaceous content | 1.340817622 | 2.51%
10 | Maximum grain size | 1.33884933 | 2.51%
11 | X.S | 0.616936967 | 1.16%
12 | Illite-smectite mixed-layer content | 0.508265171 | 0.95%
13 | Grain-size lithology type | 0.406021903 | 0.76%
14 | Cementation type | 0.097017516 | 0.18%
The importance percentages in Table 1 describe the degree of influence of each factor on reservoir quality, and thus provide a quantitative evaluation of the main controlling factors of tight sandstone gas reservoir quality. In this example, the intergranular dissolution pore content has the greatest influence on the tight sandstone reservoir, with a quantified importance of 44.56%; cementation type, grain-size lithology and illite-smectite mixed-layer content have the least influence, each below 1%. Compared with primary deposition (mean grain size, primary intergranular pores, main grain size, etc.) and cementation (kaolinite, calcareous and illite cementation, etc.), the cumulative influence of intergranular dissolution pores and moldic pores on reservoir quality reaches 55.8%; that is, dissolution is the key factor controlling reservoir quality and determines the formation and distribution of high-quality tight sandstone reservoirs.
The method can quantitatively evaluate the main controlling factors of reservoir quality, which is a significant advance over the prior art that relies on human experience, numerical simulation, scanning electron microscope analysis, core test analysis and similar methods.
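The conversion from importance values to percentages can be checked with a few lines of Python. The values below are the mean square errors listed in Table 1, in table order; the variable names are chosen for the example only.

```python
# Mean square error values from Table 1, rows 1 to 14.
importance_values = [
    23.78726498, 5.978337387, 5.916006387, 3.495146425, 3.256391689,
    2.980834705, 2.246414229, 1.420160649, 1.340817622, 1.33884933,
    0.616936967, 0.508265171, 0.406021903, 0.097017516,
]

total = sum(importance_values)
# Each factor's share of the summed importance, in percent.
percentages = [100.0 * v / total for v in importance_values]

# Combined share of the two dissolution-related factors (rows 1 and 2).
dissolution_share = percentages[0] + percentages[1]
```

The first two entries of `percentages` reproduce the 44.56% and 11.20% of Table 1 after rounding, and their sum corresponds to the cumulative influence of roughly 55.8% quoted above.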
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.