CN110119394B

CN110119394B - Improved data cleaning method for separate layer water injection

Info

Publication number: CN110119394B
Application number: CN201910415497.6A
Authority: CN
Inventors: 王海英; 赵国堡
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2019-05-18
Filing date: 2019-05-18
Publication date: 2023-10-27
Anticipated expiration: 2039-05-18
Also published as: CN110119394A

Abstract

An improved data cleaning method for separate-layer water injection addresses two problems in the prior art: existing methods for cleaning oilfield layered data perform particularly poorly on high-dimensional data in a big-data environment, and existing interpolation strategies cannot judge how the data are missing. The method comprises: S1, determining t factors affecting the instantaneous flow according to the original data; S2, testing the t factors determined in step S1 and performing significance verification; S3, classifying the original data and performing iterative analysis through an association equation to obtain a complete data set; and S4, verifying the interpolation precision of step S3. Applied to the interpolation of separate-layer water injection data, the method completes the cleaning of the missing-value data.

Description

Improved data cleaning method for separate layer water injection

Technical Field

The application relates to the field of data analysis pretreatment, in particular to an improved data cleaning method for separate-layer water injection.

Background

With the continuing development of oilfield informatization, a large volume of production data has accumulated during separate-layer water injection development. Considering the relationships among the important parameters in these data improves the accuracy of the predicted key indices, and with it the production efficiency and safety factor of the oilfield. During separate-layer water injection, unstable logging sensors, communication equipment failures, well-site power outages and similar causes can distort or lose data, which lowers the accuracy of anomaly prediction and hampers the formulation of the injection allocation scheme. This poses a significant challenge for the analysis, diagnosis and optimization of the water injection system.

Existing mainstream methods for cleaning oilfield layered data perform particularly poorly on high-dimensional data in a big-data environment, and their limited calculation speed makes them hard to apply in practical research. In data cleaning, the interpolation of missing values is particularly important. The original interpolation strategy cannot judge how the data are missing, and in practice the interpolation method is often chosen subjectively on the basis of past experience, so an improved interpolation strategy needs to be studied.

Disclosure of Invention

To overcome these technical defects, the application provides a separate-layer water injection data cleaning method with an improved interpolation strategy. The branching steps of the original algorithm are improved so that the algorithm can process a mixed-type data set directly, without splitting it into categorical and continuous variables handled by different methods, and the running time is greatly reduced while the interpolation precision is maintained.

The algorithm comprises the following steps:

S1, determining t factors influencing the instantaneous flow according to the original data;

S2, testing the t factors determined in step S1, and performing significance verification;

S3, classifying the original data, and performing iterative analysis through an association equation to obtain a complete data set;

S4, verifying the interpolation precision of step S3.

Further, step S1 includes:

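S1-1, constructing an original data matrix, as shown in the following formula:

$$|G| = \begin{vmatrix} g_{11} & g_{12} & \cdots & g_{1m} \\ g_{21} & g_{22} & \cdots & g_{2m} \\ \vdots & \vdots & & \vdots \\ g_{n1} & g_{n2} & \cdots & g_{nm} \end{vmatrix}$$

wherein: $|G|$ represents the raw data matrix, and $g_{ij}$ (i = 1, 2, …, n; j = 1, 2, …, m) is the measured data of the i-th water injection well for the j-th factor;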

S1-2, averaging the original data;

S1-3, calculating an absolute difference matrix from the original data and the averaged data;

S1-4, solving an association coefficient matrix;

S1-5, judging the association degree, screening out the factors whose association degree meets the requirement, and eliminating irrelevant variables.

Further, the association coefficient matrix in step S1-4 is composed of association coefficient elements, which are obtained by the following formula:

$$r_{ij} = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_{ij} + \rho\,\Delta_{\max}}$$

wherein $r_{ij}$ represents the elements of the association coefficient matrix, $\Delta_{\min}$ represents the minimum value in the absolute difference matrix, $\Delta_{\max}$ represents the maximum value in the absolute difference matrix, $\Delta_{ij}$ represents the element in row i and column j of the absolute difference matrix, and ρ is the resolution coefficient.

Further, step S2 includes:

S2-1, constructing an association equation and determining suitable coefficients by minimizing the residual sum of squares Q, wherein the association equation is as follows:

$$\hat{h}_i = b_0 + b_1 g_{1i} + \cdots + b_p g_{pi}, \qquad Q = \sum_{i=1}^{n} \left(h_i - \hat{h}_i\right)^2$$

wherein: $h_i$ represents the dependent variable, $\hat{h}_i$ represents the observed value, Q is the residual sum of squares, and $b_p$ represents the coefficients of the equation;

S2-2, verifying the significance of the association equation to determine whether a functional relationship exists between the variables $g_{pi}$ and the variable $h_i$;

S2-3, selecting the independent variables by a stepwise analysis method.

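Further, step S2-2 decomposes the sum of squared deviations by the following formula:

$$E_{\text{total}} = \sum_{i=1}^{n}\left(h_i - \bar{h}\right)^2 = \sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^2 + \sum_{i=1}^{n}\left(\hat{h}_i - \bar{h}\right)^2 = E_{\text{residual}} + E_{\text{association}}$$

wherein: $E_{\text{total}}$ represents the fluctuation about the mean $\bar{h}$ and is decomposed into $E_{\text{residual}}$ and $E_{\text{association}}$; $E_{\text{residual}}$ is the residual sum of squares, $E_{\text{association}}$ is the association sum of squares, $\bar{h}$ is the mean, $h_i$ represents the dependent variable, and $\hat{h}_i$ represents the observed value.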

Correlation analysis was performed by SPSS software.

Further, step S3 includes:

S3-1, defining an original data set $G = (G_1, G_2, …, G_k)$;

S3-2, first calling a baseline interpolation to quickly fill in the initial data set, and randomly dividing the k variables contained in the data set G into mutually exclusive groups of size αk, wherein 0 < α < 1;

each group is used in turn as the independent variables for multiple association analysis with the association equation;

performing the 1/α multiple analyses in a loop completes one iteration;

the procedure stops once the tolerance ε on the overall precision of the interpolated data is reached, and otherwise repeats until convergence.

Further, step S4 includes:

S4-1, under the various missing-data assumptions, letting the 0-1 vector $(1_{1,j}, …, 1_{n,j})$ mark the artificially induced missing values of $G_j$, with $1_{i,j} = 1$ when $G_{i,j}$ is artificially set missing and $1_{i,j} = 0$ otherwise; letting $n_j = \sum_{i} 1_{i,j}$ represent the number of artificially induced missing values of $G_j$; and defining θ and η as the sets of continuous and categorical data variables, respectively, that have artificially induced missing values;

S4-2, determining an interpolation error by the following formula:

wherein: $\hat{G}_j$ is the n-dimensional vector $G_j$ after interpolation by the improved interpolation strategy τ, $\hat{G}_{i,j}$ represents an element interpolated by the improved interpolation strategy, $G_{i,j}$ represents the original, untreated element, $\bar{G}_j$ represents the mean of the original n-dimensional vector, $n_j$ represents the number of artificially induced missing values of $G_j$, $1_{i,j}$ marks an artificially induced missing value, ε(τ) represents the error of the improved interpolation strategy, ε(θ) represents the error in the continuous data case, and ε(η) represents the error in the categorical data case.

Compared with the prior art, the application has the following beneficial effects: an association analysis method is used to determine the important factors influencing the instantaneous flow of water injection; the optimal factor variables of the separate-layer water injection data are grouped, and each group is used in turn as the dependent variables for multiple analysis; the calculation speed is improved while the interpolation precision is maintained; and a mixed-type data set can be processed without being split into categorical and continuous variables handled by different methods, so the running time is greatly reduced while the interpolation precision is ensured.

The parameters can be set and adjusted to the user's own needs so as to meet either a time requirement or a precision requirement, which improves the flexibility of the algorithm. The improved algorithm is applied to the interpolation of separate-layer water injection data and completes the cleaning of its missing-value data.

By analyzing the association degree of the important factors and establishing the association equation, the method improves the original interpolation strategy, raises the interpolation efficiency while maintaining the accuracy of the interpolation strategy, and performs stably under different missing mechanisms, so that the method as a whole meets expectations.

Drawings

FIG. 1 is a general flow chart of the present application;

FIG. 2 is a statistical chart of the association degrees calculated for the important factors affecting the instantaneous flow, wherein the abscissa lists the 9 important factors affecting the instantaneous flow (temperature, water flow speed, porosity, surface pressure, layer depth, layer thickness, pipe pressure, casing pressure and skin coefficient), and the ordinate is the association degree of each factor.

Detailed Description

The above and further technical features and advantages of the present application are described in more detail below with reference to the accompanying drawings.

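S1-1, constructing an original data matrix:

the original data matrix formed from the m factors affecting the instantaneous flow of water injection is shown in formula (1):

$$|G| = \begin{vmatrix} g_{11} & g_{12} & \cdots & g_{1m} \\ g_{21} & g_{22} & \cdots & g_{2m} \\ \vdots & \vdots & & \vdots \\ g_{n1} & g_{n2} & \cdots & g_{nm} \end{vmatrix} \tag{1}$$

wherein: $|G|$ represents the raw data matrix, and $g_{ij}$ (i = 1, 2, …, 18; j = 1, 2, …, 9) is the measured data of the i-th water injection well for the j-th factor;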

S1-2, eliminating the influence of dimension by averaging the original data:

to eliminate the influence of dimension and transform the original data into relative values near 1, the original data are processed by an averaging method; the averaging transformation according to formula (2) is:

$$g'_{ij} = \frac{g_{ij}}{\frac{1}{n}\sum_{k=1}^{n} g_{kj}} \tag{2}$$

wherein: i = 1, 2, …, n; j = 1, 2, …, m+1; $g_{ij}$ represents the measured data, and $g'_{ij}$ represents the data after the averaging process;

S1-3, calculating an absolute difference matrix:

the elements of the absolute difference matrix according to formula (3) are as follows:

$$\Delta_{ij} = \left| g'_{ij} - g'_{i0} \right| \tag{3}$$

wherein: i = 1, 2, …, n; j = 1, 2, …, m+1; $\Delta_{ij}$ represents the element in row i and column j of the absolute difference matrix, and $g'_{ij}$ represents the data after the averaging process.

S1-4, solving an association coefficient matrix according to formula (4):

$$r_{ij} = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_{ij} + \rho\,\Delta_{\max}} \tag{4}$$

wherein: $r_{ij}$ represents the elements of the association coefficient matrix, $\Delta_{\min}$ represents the minimum value in the absolute difference matrix, $\Delta_{\max}$ represents the maximum value in the absolute difference matrix, and $\Delta_{ij}$ represents the element in row i and column j of the absolute difference matrix; ρ is the resolution coefficient, whose size controls the influence of the maximum difference on the data transformation; generally, ρ takes a value between 0 and 1, and ρ=0 is selected according to the actual engineering background;

S1-5, calculating the association degree and selecting the factors whose association degree is greater than 0.7, namely porosity, surface pressure, pipe pressure, casing pressure and skin coefficient, as predictors, while excluding the other irrelevant variables.

The mean of the association coefficient sequence between the mother factor and each sub-factor is called the association degree. To compare and analyze the association between the mother factor and each sub-factor, the association degree is calculated according to formula (5), namely:

$$\chi_{0j} = \frac{1}{n}\sum_{i=1}^{n} r_{ij} \tag{5}$$

wherein: $\chi_{0j}$ is the association degree of the factor $g_j$ to the mother factor $g_0$; the closer $\chi_{0j}$ is to 1, the higher the association between the two; $r_{ij}$ represents the elements of the association coefficient matrix.

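S2-1, parameter estimation:

the t important factors obtained by the association analysis are tested, that is, the most suitable coefficients are selected according to formula (6) so as to minimize the residual sum of squares Q, and the corresponding association equation is solved:

$$\hat{h}_i = b_0 + b_1 g_{1i} + \cdots + b_p g_{pi}, \qquad Q = \sum_{i=1}^{n} \left(h_i - \hat{h}_i\right)^2 \tag{6}$$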

S2-2, significance verification of the association equation:

In the significance test, the significance value expresses how significant the test result is. Statistically, a significance value below 0.05 is generally taken to mean that the coefficient test is significant: the absolute value of the regression coefficient is significantly greater than 0, so the independent variable can effectively predict the variation of the dependent variable. The probability of this conclusion being wrong is 5%, i.e., the conclusion is held with 95% confidence.

To determine whether a functional relationship exists between the variables $g_{pi}$ and the variable $h_i$, the sum of squared deviations is first decomposed according to formula (7):

$$E_{\text{total}} = \sum_{i=1}^{n}\left(h_i - \bar{h}\right)^2 = \sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^2 + \sum_{i=1}^{n}\left(\hat{h}_i - \bar{h}\right)^2 = E_{\text{residual}} + E_{\text{association}} \tag{7}$$

wherein: $E_{\text{total}}$ represents the fluctuation about the mean $\bar{h}$ and can be decomposed into $E_{\text{residual}}$ and $E_{\text{association}}$; $E_{\text{residual}}$ is the residual sum of squares, $E_{\text{association}}$ is the association sum of squares, $\bar{h}$ is the mean, $h_i$ represents the dependent variable, and $\hat{h}_i$ represents the observed value;

in SPSS, the factors affecting the instantaneous flow are set as variables and, under "Options", the confidence percentage is set to 95%; the significance analysis data obtained with SPSS are shown in the table.
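For readers without SPSS, the parameter estimation of S2-1 and the decomposition used in S2-2 can be sketched with NumPy. The sketch assumes an ordinary least-squares fit with a constant term, in which case the total sum of squares splits exactly into a residual part and an association part; the function name `fit_association_equation` and the overall F statistic are illustrative assumptions, and no p-value is computed here.

```python
import numpy as np


def fit_association_equation(X, h):
    """Least-squares fit of h on the columns of X, with sum-of-squares decomposition."""
    X = np.asarray(X, dtype=float)
    h = np.asarray(h, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # constant term b_0 plus p factors
    b, *_ = np.linalg.lstsq(A, h, rcond=None)     # coefficients minimizing Q
    fitted = A @ b
    e_total = float(np.sum((h - h.mean()) ** 2))
    e_residual = float(np.sum((h - fitted) ** 2))         # equals the minimized Q
    e_association = float(np.sum((fitted - h.mean()) ** 2))
    # Overall F statistic (assumed test statistic; compare with an F(p, n-p-1) table)
    f_stat = (e_association / p) / (e_residual / (n - p - 1))
    return b, e_total, e_residual, e_association, f_stat
```

A large F statistic relative to the tabulated critical value at the 5% level corresponds to the significance value below 0.05 described above.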

S2-3, selection of the independent variables:

the independent variables are selected by a stepwise analysis method: the equation initially contains only a constant term, and the independent variables are introduced one after another in descending order of their contribution to the dependent variable; each time an independent variable is introduced, the variables already in the equation are checked, and those meeting the elimination criterion are eliminated one by one;

independent variables with a larger influence are added to the model as far as possible, while insignificant variables are kept out of it, so that a good model can be constructed.
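The stepwise procedure can be illustrated with the following sketch, which assumes that "contribution" is measured by a partial F statistic and that entry and elimination are decided by two thresholds. The names `stepwise_select`, `f_enter` and `f_remove` and their default values are hypothetical; the source does not state the criterion used.

```python
import numpy as np


def residual_ss(X, h, cols):
    """Residual sum of squares of h regressed on a constant and X[:, cols]."""
    A = np.column_stack([np.ones(len(h))] + [X[:, c] for c in cols])
    b, *_ = np.linalg.lstsq(A, h, rcond=None)
    res = h - A @ b
    return float(res @ res)


def partial_f(rss_without, rss_with, n, n_params_with):
    """F statistic for the single variable that separates the two models."""
    return (rss_without - rss_with) / (max(rss_with, 1e-12) / (n - n_params_with))


def stepwise_select(X, h, f_enter=4.0, f_remove=3.9):
    X = np.asarray(X, dtype=float)
    h = np.asarray(h, dtype=float)
    n, m = X.shape
    selected = []                      # start from the constant term only
    for _ in range(2 * m):             # hard cap on the number of passes
        changed = False
        # Entry: add the candidate with the largest contribution, if significant
        base = residual_ss(X, h, selected)
        best, best_f = None, f_enter
        for c in range(m):
            if c in selected:
                continue
            f = partial_f(base, residual_ss(X, h, selected + [c]), n, len(selected) + 2)
            if f > best_f:
                best, best_f = c, f
        if best is not None:
            selected.append(best)
            changed = True
        # Check: drop one variable already in the equation if it is no longer significant
        full = residual_ss(X, h, selected)
        for c in list(selected):
            rest = [s for s in selected if s != c]
            f = partial_f(residual_ss(X, h, rest), full, n, len(selected) + 1)
            if f < f_remove:
                selected.remove(c)
                changed = True
                break
        if not changed:
            break
    return sorted(selected)
```

Keeping `f_remove` slightly below `f_enter` prevents a variable from being removed and re-entered in an endless cycle.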

S3-1, classifying the original data:

defining the original data set $G = (G_1, G_2, …, G_k)$ as an n × k matrix;

S3-2, obtaining a complete data set by interpolation:

first, a baseline interpolation is called to quickly fill in the initial data set, and the k variables contained in the data set G are randomly divided into mutually exclusive groups of size αk, wherein 0 < α < 1;

the procedure stops once the tolerance ε on the overall precision of the interpolated data is reached, and otherwise repeats until convergence; the tolerance ε is set according to actual requirements and is 0.05 in this embodiment.
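The grouping and iteration of step S3-2 can be sketched as follows. Several choices in this sketch are assumptions rather than statements of the source: the baseline interpolation is taken to be a column-mean fill, each group is treated as the set of dependent variables and regressed on the remaining variables by least squares, and the stopping rule compares the relative squared change between iterations with ε. The function name `grouped_regression_impute` is hypothetical, and missing entries are encoded as NaN.

```python
import numpy as np


def grouped_regression_impute(G, alpha=0.5, eps=0.05, max_iter=20, seed=0):
    G = np.asarray(G, dtype=float)
    n, k = G.shape
    missing = np.isnan(G)
    # Baseline interpolation: fill each gap with the observed mean of its column
    filled = np.where(missing, np.nanmean(G, axis=0), G)
    rng = np.random.default_rng(seed)
    size = max(1, int(round(alpha * k)))           # group size alpha * k
    for _ in range(max_iter):
        previous = filled.copy()
        order = rng.permutation(k)
        # Mutually exclusive groups of (at most) `size` variables
        groups = [order[s:s + size] for s in range(0, k, size)]
        for grp in groups:
            rest = np.setdiff1d(np.arange(k), grp)
            if rest.size == 0:
                continue
            A = np.column_stack([np.ones(n), filled[:, rest]])
            # One least-squares fit with the whole group as the response
            B, *_ = np.linalg.lstsq(A, filled[:, grp], rcond=None)
            # Only the originally missing entries are overwritten
            filled[:, grp] = np.where(missing[:, grp], A @ B, filled[:, grp])
        # Assumed stopping rule: relative squared change between iterations
        change = np.sum((filled - previous) ** 2) / max(np.sum(previous ** 2), 1e-12)
        if change < eps:
            break
    return filled
```

With a group size of αk there are about 1/α fits per iteration, so a larger α means fewer fits per pass, which is the trade-off between running time and precision that the choice of α controls.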

S4, verifying the interpolation precision of step S3.

S4-1, defining the induced missing variables:

under the various missing-data assumptions, the 0-1 vector $(1_{1,j}, …, 1_{n,j})$ marks the artificially induced missing values of $G_j$, with $1_{i,j} = 1$ when $G_{i,j}$ is artificially set missing and $1_{i,j} = 0$ otherwise; $n_j = \sum_{i} 1_{i,j}$ represents the number of artificially induced missing values of $G_j$; θ and η are defined as the sets of continuous and categorical data variables, respectively, that have artificially induced missing values;
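The 0-1 indicator of S4-1 can be illustrated by the sketch below, which deletes a fixed fraction of one column at random and records which entries were removed. The function name `induce_missing`, the `fraction` parameter and the use of NaN to encode a deleted value are assumptions made for the example.

```python
import numpy as np


def induce_missing(G, j, fraction, seed=0):
    """Artificially delete a fraction of column j and return the 0-1 indicator."""
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    rng = np.random.default_rng(seed)
    n_j = int(round(fraction * n))                 # number of induced missing values
    indicator = np.zeros(n, dtype=int)
    indicator[rng.choice(n, size=n_j, replace=False)] = 1
    masked = G.copy()
    masked[indicator == 1, j] = np.nan             # the original G keeps the true values
    return masked, indicator
```

Because the original matrix is left untouched, the true values of the deleted entries remain available for the precision test of S4-2.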

S4-2, design of the precision test experiment:

the model performance is evaluated with the standardized root mean square error for continuous variables and with the misclassification error for categorical variables; the error ε(τ) of the improved interpolation strategy is calculated by formula (8):

wherein: $\hat{G}_j$ is the n-dimensional vector $G_j$ after interpolation by the improved interpolation strategy τ, $\hat{G}_{i,j}$ represents an element interpolated by the improved interpolation strategy, $G_{i,j}$ represents the original, untreated element, $\bar{G}_j$ represents the mean of the original n-dimensional vector, $n_j$ represents the number of artificially induced missing values of $G_j$, $1_{i,j}$ marks an artificially induced missing value, ε(τ) represents the error of the improved interpolation strategy, ε(θ) represents the error in the continuous data case, and ε(η) represents the error in the categorical data case;

the relative interpolation error of strategy τ with respect to the unimproved interpolation strategy v is given by formula (9):

when the value of $X_R(τ)$ is less than 100, strategy τ performs better than strategy v;
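A minimal sketch of the two error measures and of the relative error follows. It rests on assumptions that the text only implies: the continuous error is a root mean square error over the induced missing entries, standardized by their spread about the mean of the original vector; the categorical error is the proportion of induced missing entries that were imputed wrongly; and the relative error is 100 times the ratio of the two strategy errors, which is consistent with the stated threshold of 100. All function names are hypothetical.

```python
import numpy as np


def continuous_error(original, imputed, indicator):
    """Standardized RMSE over the artificially induced missing entries."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    sel = np.asarray(indicator) == 1
    num = np.mean((imputed[sel] - original[sel]) ** 2)
    den = np.mean((original[sel] - original.mean()) ** 2)
    return float(np.sqrt(num / den))


def categorical_error(original, imputed, indicator):
    """Share of artificially induced missing entries that were imputed wrongly."""
    original = np.asarray(original)
    imputed = np.asarray(imputed)
    sel = np.asarray(indicator) == 1
    return float(np.mean(imputed[sel] != original[sel]))


def relative_error(err_tau, err_v):
    """Assumed form of the relative error: below 100 means tau beats v."""
    return 100.0 * err_tau / err_v
```

For example, `relative_error(0.5, 1.0)` gives 50.0, i.e. under this assumed form the improved strategy would have half the error of the unimproved one.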

For the parameter settings of the interpolation accuracy test, the average interpolation accuracy is calculated over 10 repeated experiments for both the improved and the unimproved algorithm. The number of variables preselected at each node is $M_j$ (number of all variables), and the number of nodes in the algorithm, $M_t$, is set to 1000, i.e., each algorithm contains 1000 nodes. The parameter settings for the interpolation accuracy test are shown in Table 1.

TABLE 1

Through repeated comparative experiments on the value of α in the association equation, the improved algorithm finds that α = 0.06 and α = 0.20 give good calculation accuracy and calculation speed. According to the range of the data correlation, the data are divided into three groups lying in the [0,50], [50,75] and [75,100] percentiles. The algorithm achieves its highest interpolation precision under the missing-completely-at-random (MCAR) mechanism. The experimental results for the three groups are shown in Tables 2, 3 and 4.

TABLE 2

TABLE 3

TABLE 4

In Tables 2, 3 and 4, TRQS denotes data missing completely at random: the missingness is entirely random and depends on neither the observed values nor the missing values; RQS denotes data missing at random: the missingness depends on the observed values but not on the missing values; NRQS denotes data missing not at random: the missingness depends on both the observed values and the missing values; RO denotes the original interpolation strategy, and $RG_α$ denotes the improved interpolation strategy, compared for α = 0.06 and α = 0.20.
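The three missing mechanisms can be simulated with a simple sketch such as the one below. It is a simplification under stated assumptions: under RQS the missingness of column j is driven by one other, fully observed column; under NRQS it is driven by the value of column j itself; and in both cases the entries with the largest driving values are the ones deleted. The function name `missing_mask` and the `rate` parameter are hypothetical, and the source does not describe how its test data were generated.

```python
import numpy as np


def missing_mask(G, j, mechanism, rate=0.25, seed=0):
    """Boolean mask of the rows of column j to delete under a given mechanism."""
    G = np.asarray(G, dtype=float)
    n, k = G.shape
    rng = np.random.default_rng(seed)
    if mechanism == "TRQS":        # completely at random: independent of all values
        score = rng.random(n)
    elif mechanism == "RQS":       # at random: driven by another, observed column
        score = G[:, (j + 1) % k] + 1e-9 * rng.random(n)
    elif mechanism == "NRQS":      # not at random: driven by the value itself
        score = G[:, j] + 1e-9 * rng.random(n)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    n_missing = int(round(rate * n))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(score)[n - n_missing:]] = True   # largest scores go missing
    return mask
```

The tiny random term only breaks ties between equal values; it does not change which entries are deleted when the driving values are distinct.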

Through repeated comparative experiments on the value of α in the association equation, the improved algorithm of this embodiment finds that α = 0.06 and α = 0.20 give good calculation accuracy and calculation speed. According to the range of the data correlation, the data are divided into three groups lying in the [0,50], [50,75] and [75,100] percentiles. The algorithm achieves its highest interpolation precision under the mechanism of data missing completely at random.

The foregoing description of the preferred embodiment of the application is merely illustrative of the application and is not intended to be limiting. It will be appreciated by persons skilled in the art that many variations, modifications, and even equivalents may be made thereto without departing from the spirit and scope of the application as defined in the appended claims.

Claims

1. An improved data cleaning method for separate layer water injection is characterized in that: the method comprises the following steps:

s1, determining t factors influencing the instantaneous flow according to the original data, wherein the t factors comprise:

wherein: $|G|$ represents the raw data matrix, and $g_{ij}$ is the measured data of the i-th water injection well for the j-th factor, wherein i = 1, 2, …, n and j = 1, 2, …, 9; $g_{i1}$ is the temperature of the i-th water injection well, $g_{i2}$ is the water flow speed of the i-th water injection well, $g_{i3}$ is the porosity of the i-th water injection well, $g_{i4}$ is the surface pressure of the i-th water injection well, $g_{i5}$ is the layer depth of the i-th water injection well, $g_{i6}$ is the layer thickness of the i-th water injection well, $g_{i7}$ is the pipe pressure of the i-th water injection well, $g_{i8}$ is the casing pressure of the i-th water injection well, and $g_{i9}$ is the skin coefficient of the i-th water injection well;

s1-2, carrying out averaging treatment on the original data;

S1-4, solving an association coefficient matrix, wherein the association coefficient matrix is composed of association coefficient elements, and the association coefficient elements are obtained by the following formula:

$$r_{ij} = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_{ij} + \rho\,\Delta_{\max}}$$

wherein $r_{ij}$ represents the elements of the association coefficient matrix, $\Delta_{\min}$ represents the minimum value in the absolute difference matrix, $\Delta_{\max}$ represents the maximum value in the absolute difference matrix, $\Delta_{ij}$ represents the element in row i and column j of the absolute difference matrix, and ρ is the resolution coefficient;

s1-5, calculating the association degree of each factor and the instantaneous flow of water injection, screening out factors with the association degree meeting the requirements, and eliminating irrelevant variables;

s2, detecting t factors determined in the step S1, and performing significance verification, wherein the significance verification method comprises the following steps:

S2-2, verifying the significance of the association equation, wherein the sum of squared deviations is decomposed by the following formula to determine whether a functional relationship exists between the variables $g_{pi}$ and the variable $h_i$:

$$E_{\text{total}} = \sum_{i=1}^{n}\left(h_i - \bar{h}\right)^2 = \sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^2 + \sum_{i=1}^{n}\left(\hat{h}_i - \bar{h}\right)^2 = E_{\text{residual}} + E_{\text{association}}$$

wherein: $E_{\text{total}}$ represents the fluctuation about the mean $\bar{h}$ and is decomposed into $E_{\text{residual}}$ and $E_{\text{association}}$; $E_{\text{residual}}$ is the residual sum of squares, $E_{\text{association}}$ is the association sum of squares, $\bar{h}$ is the mean, $h_i$ represents the dependent variable, and $\hat{h}_i$ represents the observed value;

in SPSS, taking the factors affecting the instantaneous flow as variables, setting a confidence percentage, and carrying out significance analysis through the SPSS software;

and S4, verifying the interpolation precision of the step S3.

2. The improved data cleaning method for separate-layer water injection as claimed in claim 1, wherein:

the step S3 comprises the following steps:

S3-1, defining an original data set $G = (G_1, G_2, …, G_k)$;

S3-2, obtaining a complete data set by interpolation:

first calling a baseline interpolation to quickly fill in the initial data set, and randomly dividing the k variables contained in the data set G into mutually exclusive groups of size αk, wherein 0 < α < 1;

stopping once the tolerance ε on the overall precision of the interpolated data is reached, and otherwise repeating until convergence.

3. The improved data cleaning method for separate-layer water injection as claimed in claim 1, wherein step S4 includes:

s4-2, determining an interpolation error by the following formula: