CN116821733A - Large-scale flow data sampling evaluation method based on dynamic drilling - Google Patents

Info

Publication number
CN116821733A
Authority
CN
China
Prior art keywords
well
stream data
data
sampling
accuracy
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310376603.0A
Other languages
Chinese (zh)
Inventor
章昭辉
章鹏
王鹏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Application filed by Donghua University
Priority to CN202310376603.0A
Publication of CN116821733A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a large-scale stream data sampling method based on dynamic drilling. The application further provides a large-scale stream data evaluation method based on dynamic drilling, characterized in that a sample set is obtained by sampling an original stream data set with the large-scale stream data sampling method, and the value characteristics of the original data set are evaluated based on the sample set. Drawing on the idea of mineral drilling exploration, the application provides a dynamic drilling sampling method that takes a well as the analysis unit and dynamically changes the size and position of the well to accurately locate the position and range of discrete data. A new stream data value evaluation model is further provided, which uses the sample set obtained by the dynamic drilling sampling method to evaluate the original stream data set in three dimensions (discrete, centralized, and overall), and is therefore of significant research value for big data value evaluation.

Description

Large-scale flow data sampling evaluation method based on dynamic drilling
Technical Field
The application relates to a large-scale stream data sampling method based on dynamic drilling, and belongs to the technical field of information.
Background
In the big data era, realizing the value of data is one of the core demands of big data. The data element market has become an integral part of the construction of Digital China, and the age of data assets has arrived. Stakeholders are now working to clear the obstacles to turning data into assets.
With the development of the industrial internet, big data technology has advanced rapidly, and data has become a commodity that major vendors compete to obtain. Data volumes are growing rapidly and dynamically, and fields such as network security, daily transactions, social media, and transportation continuously generate data in the form of stream data. Sampling is an indispensable data mining technique and is widely used in practical applications such as fraud detection, data mining, and transportation. Sampling extracts from a large amount of data a sample set that retains the characteristics of the original data, so that the data quality and value of the original data can be evaluated and predicted while computation cost and storage resources are reduced.
At present, sampling methods for stream data fall into three main categories. The first is unbiased sampling, such as stratified sampling, random sampling, and reservoir sampling. Because unbiased sampling is random, the sampled stream data can lose part of the key information, which ultimately leads to inaccurate stream data value evaluation. The second is biased sampling, such as probability density sampling, which addresses the data loss of unbiased sampling. Biased sampling retains a large amount of the discrete data in the stream data but amplifies the effect of the discrete data in the sample set. The third is mixed sampling, which samples the data set as a whole well but has lower sampling accuracy for discrete data. In addition, in anti-fraud financial risk control systems, anomalous data plays a very important role and carries a great deal of valuable information; for example, abnormal financial transaction data may indicate fraud or money laundering, so anomalous data needs to be kept in the sample.
At present, value evaluation of big data is mainly carried out in the field of economics, and value evaluation of stream data remains to be studied.
In summary, because unbiased sampling, biased sampling, and mixed sampling pursue different goals, it is difficult for any of them to support a comprehensive, accurate, and efficient value evaluation of stream data. Thus, although a wide variety of current sampling methods retain discrete data in stream data, challenges remain.
Disclosure of Invention
The technical problems to be solved by the application are: accurately predicting the position and range of discrete data in stream data; and efficiently and accurately evaluating the value characteristics of stream data.
In order to solve the above technical problems, one aspect of the present application provides a method for sampling large-scale stream data based on dynamic drilling, characterized in that the method is used for sampling stream data S, where the stream data S is expressed as: $S = \{(id_i, time_i, value_i) \mid 1 \le i \le N \text{ and } i \in \mathbb{N}^+\}$, in which $id_i$ is the arrival order of the ith stream data item, $time_i$ is the arrival time of the ith stream data item, and $value_i$ is the value of the ith stream data item; the large-scale stream data sampling method includes the following steps:
step 1, determining the position and the range of discrete data in the stream data by taking a well as the analysis unit, wherein the ith well is denoted $W_i$, the size of the ith well is denoted $WS_i$, and the number of sampling wells in the original stream data set is denoted WN; the data in $W_i$ are then expressed as:
$W_i = \{(id_j, time_j, value_j) \mid 1 \le j \le WS_i \text{ and } j \in \mathbb{N}^+\} \quad (1 \le i \le WN)$
step 2, calculating the well interval, wherein the ith well interval is denoted $WI_i$ and the size of the ith well interval is denoted $WIS_i$; the ith well interval $WI_i$ is then expressed as:
$WI_i = \{(id_j, time_j, value_j) \mid id_{w_i\_max} + 1 \le id_j \le id_{w_{i+1}\_min} - 1\}$
where $id_{w_i\_max}$ is the maximum id of all stream data in the ith well, and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)th well;
step 3, determining the degree of dispersion of the well by using the skewness coefficient SK: if the distribution of the stream data in the well is symmetrical, the skewness coefficient SK is equal to 0; if the skewness coefficient SK differs markedly from 0, the distribution of the stream data in the well is asymmetric, where a positive SK indicates a right-skewed distribution and a negative SK indicates a left-skewed distribution;
denoting the skewness coefficient of the ith well as $SK_i$: if $SK_i \in [-0.5, 0.5]$, the ith well has a small degree of dispersion; if $SK_i \in (-\infty, -1)$ or $(1, +\infty)$, the ith well has a large degree of dispersion, referred to as a highly skewed distribution; if $SK_i \in (-1, -0.5)$ or $(0.5, 1)$, the distribution is considered moderately skewed;
step 4, dynamically adjusting the sampling rate and the well interval by using the skewness coefficient: assuming that the current well is the ith well and the initial sampling rate is $p_{init}$, the adjusted sampling rate p is expressed as:
$$p = \begin{cases} p_{init}, & SK_i \in [-0.5, 0.5] \\ 2 \times p_{init} \times |SK_i|, & \text{otherwise} \end{cases}$$
the adjusted ith well interval size $WIS_i$ is expressed as:
where $WIS_{init}$ represents the initial well interval size for the ith well;
step 5, dynamically adjusting the size of the well through an algorithm combining the Pearson correlation coefficient and the coefficient of variation:
recording representative wells in a well set, then receiving new stream data by using a sliding window, traversing the well set while setting the size of the sliding window in turn to the size of each well in the well set, and determining the size of the well by calculating the Pearson correlation coefficient and the coefficient of variation between each well in the well set and the sliding window;
step 6, dynamic drilling sampling:
performing in-class unbiased sampling and inter-class biased sampling in a well, wherein, among classes, the sampling rate of the minority classes is dynamically increased according to the magnitude of the skewness coefficient; adopting equidistant sampling in the well interval, which reduces the rate at which stream data is accessed during sampling; accurately locating the range of discrete data according to the well-size adjustment algorithm described in step 5; and then dynamically adjusting the size of the well interval according to the skewness coefficient so as to dynamically locate the position of discrete data.
Preferably, in step 3, the skewness coefficient $SK_i$ of the ith well is expressed as:
where the mean term in the formula represents the mean of the data in the ith well.
Preferably, in step 5, the Pearson correlation coefficient $\rho(X, Y)$ of two sets of variables X and Y is expressed as:
$$\rho(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the two sets of variables X and Y, respectively, and $cov(X, Y)$ is the covariance of X and Y, that is, the correlation measure before normalization, expressed by the following formula:
where n represents the number of data items in each of the variables X and Y, $x_i$ represents the ith value in the current set of variables X, $y_i$ represents the ith value in the current set of variables Y, $\bar{x}$ represents the mean of the current set of variables X, and $\bar{y}$ represents the mean of the current set of variables Y.
Preferably, in step 5, the coefficient of variation CV is expressed as:
$$CV = \frac{\sigma}{\mu}$$
where σ is the standard deviation of the current set of variables and μ is the mean of the current set of variables.
The application further provides a large-scale stream data evaluation method based on dynamic drilling, characterized in that a sample set is obtained by sampling an original stream data set with the above large-scale stream data sampling method, and, based on the sample set, the value characteristics of the original data set are evaluated using discrete mean accuracy, discrete coefficient of variation accuracy, discrete sampling accuracy, centralized mean accuracy, centralized coefficient of variation accuracy, overall mean accuracy, overall coefficient of variation accuracy, and the JSD index, wherein:
the discrete mean accuracy is the accuracy of estimating the mean of the discrete data set attribute value of the original stream data set by using the mean of the discrete data set attribute value of the sample set;
the discrete coefficient of variation accuracy is the accuracy of estimating the coefficient of variation of the discrete data set attribute value of the original stream data set by using the coefficient of variation of the discrete data set attribute value of the sample set;
the discrete sampling accuracy is the ratio of the size of the intersection of the discrete data set of the sample set and the discrete data set of the original stream data set to the length of the discrete data set of the sample set;
the centralized mean accuracy is the accuracy of estimating the mean of the centralized data set attribute value of the original stream data set by using the mean of the centralized data set attribute value of the sample set;
the centralized coefficient of variation accuracy is the accuracy of estimating the coefficient of variation of the centralized data set attribute value of the original stream data set by using the coefficient of variation of the centralized data set attribute value of the sample set;
the overall mean accuracy is the accuracy of estimating the mean of the original stream data set attribute value by using the mean of the sample set attribute value;
the overall coefficient of variation accuracy is the accuracy of estimating the coefficient of variation of the original stream data set attribute value by using the coefficient of variation of the sample set attribute value;
the JSD index measures the distance between two probability distributions by calculating their KL divergences from the average of the two distributions; the smaller the JSD index, the greater the similarity between the distributions of the sample set and the original stream data set, and the larger the JSD index, the greater the difference between them, wherein the JSD index is calculated using the following formula:
$$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad M = \frac{1}{2}(P + Q)$$
where P(X) is the probability distribution of the sample set, Q(X) is the probability distribution of the original stream data set, and $D(P \| Q)$ is the KL divergence of P(X) and Q(X), calculated as:
$$D(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$
Current sampling methods for stream data that changes in real time and at high speed tend to lose much of the value and information carried by discrete data, which makes it difficult to evaluate the value characteristics of the stream data efficiently and accurately. Drawing on the idea of mineral drilling exploration, the application provides a dynamic drilling sampling method that takes a well as the analysis unit and dynamically changes the size and position of the well to accurately locate the position and range of discrete data. A new stream data value evaluation model is further provided, which uses the sample set obtained by the dynamic drilling sampling method to evaluate the original stream data set in three dimensions (discrete, centralized, and overall), and is therefore of significant research value for big data value evaluation.
Drawings
FIG. 1 is a block diagram of the large-scale stream data sampling framework based on dynamic drilling;
FIG. 2 is a diagram of the stream data value evaluation model;
FIG. 3 is a schematic diagram of stream data;
FIGS. 4(a) and 4(b) are classification diagrams of stream data peaks and troughs.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The application relates to a large-scale stream data sampling method based on dynamic drilling, and also relates to an evaluation model establishment method for evaluating stream data value obtained by using the large-scale stream data sampling method.
The large-scale stream data sampling method based on dynamic drilling disclosed by the application mainly comprises the following steps: first, an initial well and an initial well interval are set, and the data in the well are clustered with k-means; the skewness coefficient is then used to determine the degree of dispersion of the data in the well, the sampling rate of each class, and the well interval, with in-class unbiased sampling and inter-class biased sampling performed in the well and equidistant sampling performed in the well interval; finally, the size of the next well is determined jointly by the correlation coefficient and the coefficient of variation. The sampling framework is shown in FIG. 1 and comprises the following steps:
The application targets real-time, high-speed discrete stream data, whose size is denoted N and grows without bound over time.
S101: the stream data S is expressed as:
$S = \{(id_i, time_i, value_i) \mid 1 \le i \le N \text{ and } i \in \mathbb{N}^+\}$
where $id_i$ is the arrival order of the ith stream data item, $time_i$ is the arrival time of the ith stream data item, and $value_i$ is the value of the ith stream data item. The stream data distribution is shown in FIG. 3.
S102: a well:
Because the position and range of discrete data in stream data are uncertain, the concept of a well is introduced based on the idea of drilling sampling. By taking the well as the analysis unit, the position and range of discrete data in the stream data are determined. The ith well is denoted $W_i$, the size of the ith well is denoted $WS_i$, and the number of sampling wells in the original data set is denoted WN; the data in $W_i$ are then expressed as:
$W_i = \{(id_j, time_j, value_j) \mid 1 \le j \le WS_i \text{ and } j \in \mathbb{N}^+\} \quad (1 \le i \le WN)$
S103: well interval:
To reduce the rate at which stream data is accessed during sampling, a well interval is placed between wells. The ith well interval is denoted $WI_i$ and the size of the ith well interval is denoted $WIS_i$; the ith well interval $WI_i$ is then expressed as:
$WI_i = \{(id_j, time_j, value_j) \mid id_{w_i\_max} + 1 \le id_j \le id_{w_{i+1}\_min} - 1\}$
where $id_{w_i\_max}$ is the maximum id of all stream data in the ith well, and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)th well.
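By way of illustration only, the notation of S101 to S103 can be sketched in Python as follows. The names `Record` and `split_wells`, and the fixed `well_size` and `interval_size` parameters, are assumptions introduced for this sketch; in the method itself both sizes are adjusted dynamically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """One stream data item (id_i, time_i, value_i)."""
    id: int       # arrival order
    time: float   # arrival time
    value: float  # data value


def split_wells(stream: List[Record], well_size: int, interval_size: int
                ) -> Tuple[List[List[Record]], List[List[Record]]]:
    """Alternate wells W_i and well intervals WI_i over the stream.

    Hypothetical helper: sizes are fixed here for clarity only.
    """
    wells: List[List[Record]] = []
    intervals: List[List[Record]] = []
    pos = 0
    while pos < len(stream):
        wells.append(stream[pos:pos + well_size])
        pos += well_size
        gap = stream[pos:pos + interval_size]
        if gap:  # the last well may have no data after it
            intervals.append(gap)
        pos += interval_size
    return wells, intervals


stream = [Record(i, float(i), float(i % 5)) for i in range(1, 21)]
wells, intervals = split_wells(stream, well_size=4, interval_size=3)
```

With these inputs, the first interval starts at the id following the largest id of the first well and ends just before the smallest id of the second well, matching the definition of $WI_i$.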
S104: determining the degree of dispersion of the well:
In order to determine the position of discrete data, the application introduces the statistical concept of the skewness coefficient, denoted SK. The skewness coefficient $SK_i$ of the ith well is expressed as:
where the mean term in the formula represents the mean of the data in the ith well.
If the distribution of the stream data in the well is symmetrical, the skewness coefficient SK is equal to 0; if the skewness coefficient SK differs markedly from 0, the distribution of the stream data in the well is asymmetric, where a positive SK indicates a right-skewed distribution and a negative SK indicates a left-skewed distribution. If $SK_i \in [-0.5, 0.5]$, the degree of dispersion of the ith well is small; if $SK_i \in (-\infty, -1)$ or $(1, +\infty)$, the distribution is called highly skewed; if $SK_i \in (-1, -0.5)$ or $(0.5, 1)$, the distribution is considered moderately skewed.
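The classification in S104 can be sketched as below. This is a minimal sketch that assumes the common moment-based skewness (third central moment divided by the variance raised to the power 1.5); the application's own expression for $SK_i$ may differ, for example by a sample-size correction. The function names are hypothetical, and assigning $|SK| = 1$ to the high class is an assumption, because the stated intervals leave that boundary open.

```python
from typing import Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment-based skewness m3 / m2**1.5 (population form, an assumption)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data: treated as symmetric
    return m3 / m2 ** 1.5


def classify_skew(sk: float) -> str:
    """Map SK_i to the three degrees of dispersion described in S104."""
    a = abs(sk)
    if a <= 0.5:
        return "low"
    if a < 1:
        return "medium"
    return "high"  # |SK| == 1 placed here by assumption
```

For example, a well holding the values 1, 1, 1, 1, 10 is strongly right-skewed and falls in the high class, whereas 1, 2, 3, 4, 5 is symmetric and falls in the low class.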
S105: dynamically adjusting the sampling rate and the well position:
the application dynamically adjusts the sampling rate and well position using the bias factor, assuming the current i-th well, the initial sampling rate is p init The adjusted sampling rate p is expressed as:
by clustering three different classes of data k-means in a well, the present application represents them as: class0, class1, class2. Specifically, three cases are classified into SK i ∈[-0.5,0.5]Sampling rate p=p for all classes init The method comprises the steps of carrying out a first treatment on the surface of the If SK i E (- ≡1) or (1, ++), increasing the sampling rate p=2×p for a smaller number of two classes init ×|SK i I (I); if SK i E (-1, -0.5) or (0.5, 1), increasing the sampling rate p=2×p for the least number of classes init ×|SK i |。
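The three cases of S105 can be sketched as a small rate-assignment function. The sketch takes the class sizes as given (the k-means clustering step itself is omitted); the function name, the cap of the rate at 1.0, and the handling of $|SK_i| = 1$ as highly skewed are assumptions not stated in the application.

```python
from typing import Dict


def class_sampling_rates(sk: float, class_sizes: Dict[str, int],
                         p_init: float) -> Dict[str, float]:
    """Per-class sampling rates for one well, following the three cases of S105."""
    rates = {name: p_init for name in class_sizes}
    a = abs(sk)
    if a <= 0.5:
        return rates  # low dispersion: every class keeps p_init
    boosted = min(1.0, 2 * p_init * a)  # cap at 1.0 is an assumption
    smallest_first = sorted(class_sizes, key=lambda name: class_sizes[name])
    n_boost = 2 if a >= 1 else 1  # high skew: two minority classes; medium: one
    for name in smallest_first[:n_boost]:
        rates[name] = boosted
    return rates


sizes = {"class0": 100, "class1": 10, "class2": 5}
high = class_sampling_rates(1.5, sizes, p_init=0.1)
medium = class_sampling_rates(0.8, sizes, p_init=0.1)
low = class_sampling_rates(0.2, sizes, p_init=0.1)
```

Because $|SK_i| > 0.5$ whenever a class is boosted, the boosted rate $2 \times p_{init} \times |SK_i|$ always exceeds $p_{init}$, so minority classes are never sampled less often than the majority class.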
The application dynamically adjusts the well interval by using the skewness coefficient; the size of the ith well interval, $WIS_i$, is expressed as:
where $WIS_{init}$ indicates the initial well interval size. Two cases are distinguished: if $SK_i \in [-0.5, 0.5]$, the well interval is increased to $WIS_i = 2 \times WIS_{init}$; if $SK_i \in (-\infty, -1)$, $(1, +\infty)$, $(-1, -0.5)$, or $(0.5, 1)$, the well interval is reduced.
S106: dynamically adjusting the size of the well:
The peaks and troughs of stream data exhibit three distinct types of features: 1. the slope changes very sharply; 2. the values change periodically; 3. the degree of dispersion is low. According to these three features, peaks are divided into shock peaks SP (shock-peak), oscillation peaks OP (oscillation-peak), and buffer peaks BP (buffer-peak), as shown in FIG. 4(a); troughs are correspondingly divided into shock troughs ST (shock-trough), oscillation troughs OT (oscillation-trough), and buffer troughs BT (buffer-trough), as shown in FIG. 4(b). Shock peaks and shock troughs have very steep slopes, so the degree of dispersion is very high when a well contains such a peak or trough; oscillation peaks and oscillation troughs vary periodically, and the degree of dispersion is relatively high when a well contains one; buffer peaks and buffer troughs have a low degree of dispersion, so a well containing one has a lower degree of dispersion than wells containing the other two types. Furthermore, when two different wells contain the same kind of peak or trough, the two wells exhibit a degree of self-similarity and their degrees of dispersion can be very close.
Based on the above findings, the application proposes an algorithm that combines the Pearson correlation coefficient and the coefficient of variation to dynamically adjust the well size, so as to accurately determine the range of discrete data in the stream data.
The Pearson correlation coefficient represents the degree of linear correlation between two sets of variables X and Y, and its value lies in [-1, 1]. If the coefficient approaches 1, X and Y have a strong positive correlation; if it approaches -1, X and Y have a strong negative correlation; if it approaches 0, there is no apparent linear correlation between the two sets of variables. The formula is as follows:
The above formula is the covariance formula. Dividing the covariance by the standard deviations $\sigma_X$ and $\sigma_Y$ of the two variables compensates for the covariance's weakness as a measure of correlation, namely its dependence on the scale of the data:
$$\rho(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$
The coefficient of variation CV is the standard deviation σ divided by the mean μ. It describes the relative degree of dispersion of the data independently of the magnitude of the data's mean, so it can be used to compare the degrees of dispersion of two sets of data, as follows:
$$CV = \frac{\sigma}{\mu}$$
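Both statistics can be computed with a few lines of Python. The sketch below assumes the population normalization (division by n) for the covariance and the standard deviations; the normalization cancels in the Pearson coefficient, but it does affect the coefficient of variation slightly. The function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation rho(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (sx * sy)


def coefficient_of_variation(x: Sequence[float]) -> float:
    """CV = sigma / mu (population standard deviation, an assumption)."""
    n = len(x)
    mu = sum(x) / n
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    sigma = sqrt(sum((a - mu) ** 2 for a in x) / n)
    return sigma / mu
```

Two series with different scales but the same shape, such as 1, 2, 3 and 2, 4, 6, have a correlation of 1 and identical coefficients of variation, which is the property the well-size adjustment relies on.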
S107: algorithm for dynamically adjusting the well size:
Representative wells are first recorded in a well set. New stream data is then received using a sliding window, and the well set is traversed, with the size of the sliding window set in turn to the size of each well in the well set. The size of the well is determined by calculating the self-similarity (Pearson correlation coefficient) and the coefficient of variation between each well in the well set and the sliding window.
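The application does not spell out the decision rule of S107, so the following is only one possible reading, sketched under loudly stated assumptions: a candidate size is accepted when the window's Pearson correlation with a representative well is at least `min_corr` and their coefficients of variation differ by at most `max_cv_gap`, and the best-correlated candidate wins. The function name, both thresholds, and the tie-breaking rule are hypothetical.

```python
from math import sqrt
from typing import List, Optional, Sequence


def _mean(v: Sequence[float]) -> float:
    return sum(v) / len(v)


def _std(v: Sequence[float]) -> float:
    m = _mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / len(v))


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    sx, sy = _std(x), _std(y)
    if sx == 0 or sy == 0:
        return 0.0  # constant data: treated as uncorrelated
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sx * sy)


def _cv(v: Sequence[float]) -> float:
    m = _mean(v)
    return _std(v) / m if m != 0 else float("inf")


def choose_well_size(well_set: List[Sequence[float]], incoming: Sequence[float],
                     min_corr: float = 0.8, max_cv_gap: float = 0.1
                     ) -> Optional[int]:
    """Return the size of the representative well most similar to the new data.

    Returns None when no representative well matches (hypothetical rule).
    """
    best_size, best_corr = None, min_corr
    for well in well_set:
        size = len(well)
        if size > len(incoming):
            continue  # not enough new data to fill a window of this size
        window = incoming[:size]  # sliding window set to this well's size
        corr = _pearson(well, window)
        cv_gap = abs(_cv(well) - _cv(window))
        if corr >= best_corr and cv_gap <= max_cv_gap:
            best_size, best_corr = size, corr
    return best_size
```

When no representative well matches, a caller could fall back to an initial well size and add the new well to the well set; that fallback is likewise an assumption.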
S108: dynamic drilling sampling algorithm:
The application provides a dynamic drilling sampling algorithm that performs in-class unbiased sampling and inter-class biased sampling in a well. Within each class, a classical sampling algorithm, reservoir sampling, is used; among classes, the sampling rate of the minority classes is dynamically increased according to the magnitude of the skewness coefficient. Equidistant sampling is adopted in the well interval, which reduces the rate at which stream data is accessed during sampling. The range of discrete data is accurately located by the algorithm for dynamically adjusting the well size, and the size of the well interval is then dynamically adjusted according to the skewness coefficient so as to dynamically locate the position of discrete data.
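The two elementary samplers named in S108 can be sketched as follows. `reservoir_sample` is the classical Algorithm R; `equidistant_sample` simply keeps every `step`-th item. The function names, the `step` parameter, and the use of an explicit `random.Random` instance are assumptions made for this sketch, and the way the samplers are combined per class and per well is left out.

```python
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


def equidistant_sample(items: Sequence[T], step: int) -> List[T]:
    """Keep every step-th item of a well interval, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(items[::step])


in_class = reservoir_sample(range(100), k=10, rng=random.Random(0))
in_interval = equidistant_sample(list(range(10)), step=3)
```

Reservoir sampling suits the in-class step because it needs a single pass and constant memory, which matches stream data whose length is not known in advance.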
The application provides a new stream data value evaluation model which, starting from three dimensions (discrete, centralized, and overall), evaluates the value characteristics of the original data set by using the sample set obtained by the above large-scale stream data sampling method. The evaluation model is shown in FIG. 2.
Discrete mean value accuracy:
refers to the accuracy of estimating the mean of the discrete data set attribute value of the original stream data set with the mean of the discrete data set attribute value of the sample set. The discrete mean accuracy DMA (Discrete Mean Accuracy) is calculated as follows:
where the two mean terms in the formula represent, respectively, the mean of the discrete data set attribute value of the original stream data set and the mean of the discrete data set attribute value of the sample set.
Discrete coefficient of variation accuracy:
refers to the accuracy of estimating the coefficient of variation of the discrete data set attribute value of the original stream data set with the coefficient of variation of the discrete data set attribute value of the sample set. The accuracy of the discrete coefficient of variation ADCV (Accuracy Of Discrete Coefficient Of Variation) is calculated as follows:
discrete sampling accuracy:
refers to the ratio of the size of the intersection of the discrete data set of the sample set and the discrete data set of the original stream data set to the length of the discrete data set of the sample set. The discrete sampling accuracy DSA (Discrete Sampling Accuracy) is calculated as follows:
$$DSA = \frac{len(DDSS \cap DDRD)}{len(DDSS)}$$
where DDSS represents the discrete data set of the sample set, DDRD represents the discrete data set of the original stream data set, and len(·) returns the length of a set.
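Read literally, the definition of DSA is a set-overlap ratio and can be sketched as below. The sketch assumes that each discrete data item is identified by its stream id and that the two discrete data sets have already been determined by some other means; returning 0.0 for an empty sample-side set is also an assumption, as is the function name.

```python
from typing import Iterable


def discrete_sampling_accuracy(ddss_ids: Iterable[int],
                               ddrd_ids: Iterable[int]) -> float:
    """DSA = len(DDSS ∩ DDRD) / len(DDSS), with items identified by id."""
    ddss = set(ddss_ids)  # discrete data set of the sample set
    ddrd = set(ddrd_ids)  # discrete data set of the original stream data set
    if not ddss:
        return 0.0  # assumption: no discrete samples means zero accuracy
    return len(ddss & ddrd) / len(ddss)


dsa = discrete_sampling_accuracy([3, 7, 9, 12], [7, 9, 12, 20, 31])
```

Here three of the four items that the sample set marks as discrete are also discrete in the original data, giving a DSA of 0.75.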
Centralized mean value accuracy:
refers to the accuracy of estimating the mean of the centralized data set attribute value of the original stream data set with the mean of the centralized data set attribute value of the sample set. The centralized mean accuracy CMA (Centralized Mean Accuracy) is calculated as follows:
where the two mean terms in the formula represent, respectively, the mean of the centralized data set attribute value of the original stream data set and the mean of the centralized data set attribute value of the sample set.
Centralized coefficient of variation accuracy:
refers to the accuracy of estimating the coefficient of variation of the centralized data set attribute value of the original stream data set with the coefficient of variation of the centralized data set attribute value of the sample set. The accuracy of the centralized coefficient of variation ACCV (Accuracy Of Centralized Coefficient Of Variation) is calculated as follows:
where $CV_{CDRD}$ represents the coefficient of variation of the centralized data set attribute value of the original stream data set, and $CV_{CDSS}$ represents the coefficient of variation of the centralized data set attribute value of the sample set.
Overall mean value accuracy:
refers to the accuracy of estimating the mean of the attribute value of the original stream data set RD (Raw Data) with the mean of the attribute value of the sample set SS (Sample Set). The overall mean accuracy OMA (Overall Mean Accuracy) is calculated as follows:
where the two mean terms in the formula represent, respectively, the mean of the original stream data set attribute value and the mean of the sample set attribute value.
Overall coefficient of variation accuracy:
refers to the accuracy of estimating the coefficient of variation of the original stream data set attribute value with the coefficient of variation of the sample set attribute value. The accuracy of the overall coefficient of variation AOCV (Accuracy Of Overall Coefficient Of Variation) is calculated as follows:
where $CV_{RD}$ represents the coefficient of variation of the original stream data set attribute value, and $CV_{SS}$ represents the coefficient of variation of the sample set attribute value.
JSD index:
The JSD index, also called the JS divergence, is an evaluation index for measuring the difference between two probability distributions. It measures the distance between two probability distributions by calculating their KL divergences from the average of the two distributions, and its value range is [0, 1]. The smaller the JSD value, the greater the similarity between the distributions of the sample set and the original stream data set; the larger the JSD value, the greater the difference between them. Assuming that the probability distribution of the sample set is P(X) and the probability distribution of the original stream data set is Q(X), the KL divergence is calculated as follows:
$$D(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$
The JS divergence is calculated as follows:
$$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad M = \frac{1}{2}(P + Q)$$
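A minimal sketch of the JSD computation for discrete distributions follows. It assumes base-2 logarithms, which is the choice that bounds the result by 1 as stated above, and it assumes that P and Q are already given as probability lists over the same bins; how the sample set and the original data are binned is not specified here. The function names are hypothetical.

```python
from math import log2
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(P || Q) in bits; terms with P(x) = 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(P || Q) = 0.5 * D(P || M) + 0.5 * D(Q || M), with M = (P + Q) / 2."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # M(x) > 0 wherever P(x) > 0 or Q(x) > 0, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the KL divergence, the JS divergence is symmetric and stays finite even when one distribution assigns zero probability to a bin that the other uses, which makes it a convenient overall similarity score between the sample set and the original stream data.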

Claims (5)

1. A method for sampling large-scale stream data based on dynamic drilling, characterized in that the method is used for sampling stream data S, where the stream data S is expressed as: $S = \{(id_i, time_i, value_i) \mid 1 \le i \le N \text{ and } i \in \mathbb{N}^+\}$, in which $id_i$ is the arrival order of the ith stream data item, $time_i$ is the arrival time of the ith stream data item, and $value_i$ is the value of the ith stream data item; the large-scale stream data sampling method includes the following steps:
step 1, determining the position and the range of discrete data in the stream data by taking a well as the analysis unit, wherein the ith well is denoted $W_i$, the size of the ith well is denoted $WS_i$, and the number of sampling wells in the original stream data set is denoted WN; the data in $W_i$ are then expressed as:
$W_i = \{(id_j, time_j, value_j) \mid 1 \le j \le WS_i \text{ and } j \in \mathbb{N}^+\} \quad (1 \le i \le WN)$
step 2, calculating the well interval, wherein the ith well interval is denoted $WI_i$ and the size of the ith well interval is denoted $WIS_i$; the ith well interval $WI_i$ is then expressed as:
$WI_i = \{(id_j, time_j, value_j) \mid id_{w_i\_max} + 1 \le id_j \le id_{w_{i+1}\_min} - 1\}$
where $id_{w_i\_max}$ is the maximum id of all stream data in the ith well, and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)th well;
step 3, determining the degree of dispersion of a well by using the skewness coefficient SK: if the distribution of the stream data in the well is symmetric, the skewness coefficient SK equals 0; if SK differs markedly from 0, the distribution of the stream data in the well is asymmetric, where a positive SK indicates a right-skewed distribution and a negative SK indicates a left-skewed distribution;
if the distribution of the stream data in the i-th well is asymmetric and the skewness coefficient of the i-th well is denoted $SK_i$, then: if $SK_i \in [-0.5, 0.5]$, the i-th well has a small degree of dispersion; if $SK_i \in (-\infty, -1) \cup (1, +\infty)$, the i-th well has a large degree of dispersion, referred to as a highly skewed distribution; if $SK_i \in (-1, -0.5) \cup (0.5, 1)$, the distribution is regarded as moderately skewed;
step 4, dynamically adjusting the sampling rate and the well interval by using the skewness coefficient: assuming that the current well is the i-th well and the initial sampling rate is $p_{init}$, the adjusted sampling rate p is expressed as:
the adjusted size $WIS_i$ of the i-th well interval is expressed as:
where $WIS_{init}$ denotes the initial well interval size of the i-th well;
step 5, dynamically adjusting the size of the well through an algorithm combining the Pearson correlation coefficient and the coefficient of variation:
recording representative wells in a well set, then receiving new stream data with a sliding window and traversing the well set, setting the size of the sliding window in turn to the size of each well in the well set, and determining the size of the well by calculating the Pearson correlation coefficient and the coefficient of variation between each well in the well set and the sliding window;
step 6, dynamic drilling sampling:
performing unbiased sampling within classes and biased sampling between classes inside a well, wherein the sampling rate of minority classes is dynamically increased according to the magnitude of the skewness coefficient; equidistant sampling is adopted in the well interval to reduce the access rate of the sampled stream data; the range of the discrete data is accurately located by the dynamic well-size adjustment algorithm described in step 5, and the size of the well interval is then dynamically adjusted according to the skewness coefficient so as to dynamically locate the position of the discrete data.
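The well interval of step 2 and the equidistant sampling of step 6 can be illustrated with a minimal sketch. The record layout `(id, time, value)` follows the claim; the function names `well_interval` and `equidistant_sample` and the fixed `step` parameter are hypothetical, and the claim does not state how the sampling step is chosen.

```python
def well_interval(stream, well_i, well_next):
    """Return the records lying strictly between two consecutive wells.

    Each record is a tuple (id, time, value). The interval covers ids from
    (max id in well_i) + 1 up to (min id in well_next) - 1, as in step 2.
    """
    lo = max(rec[0] for rec in well_i) + 1
    hi = min(rec[0] for rec in well_next) - 1
    return [rec for rec in stream if lo <= rec[0] <= hi]


def equidistant_sample(records, step):
    """Keep every `step`-th record (assumed form of equidistant sampling)."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return records[::step]


# Toy stream with ids 1..20; the well positions are made up for illustration.
stream = [(i, float(i), i * 10) for i in range(1, 21)]
well_1 = stream[0:3]    # ids 1..3
well_2 = stream[9:12]   # ids 10..12
interval = well_interval(stream, well_1, well_2)   # ids 4..9
sampled = equidistant_sample(interval, 2)          # ids 4, 6, 8
```

Scanning the whole stream for each interval is done here only for clarity; a streaming implementation would track the id bounds as records arrive.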
2. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 3, the skewness coefficient $SK_i$ of the i-th well is expressed as:
where the mean appearing in the formula is the mean of the data in the i-th well.
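A minimal sketch of computing a skewness coefficient and applying the thresholds of step 3 is given below. It assumes the standard moment-based (Fisher-Pearson) skewness, which may differ from the formula of this claim. The names `skewness` and `classify_skew` are hypothetical, and assigning the boundary values |SK| = 1 to the moderate band is also an assumption, since the stated ranges leave them unassigned.

```python
import statistics


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data: treat as symmetric
    return m3 / m2 ** 1.5


def classify_skew(sk):
    """Map a skewness coefficient to the dispersion bands of step 3."""
    magnitude = abs(sk)
    if magnitude <= 0.5:
        return "low"
    if magnitude > 1:
        return "high"
    return "moderate"  # includes |SK| == 1 by assumption


symmetric_well = [1, 2, 3, 4, 5]
skewed_well = [1, 1, 1, 1, 10]
labels = [classify_skew(skewness(w)) for w in (symmetric_well, skewed_well)]
```

The sign of the result distinguishes right-skewed (positive) from left-skewed (negative) wells, while the magnitude selects the band.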
3. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 5, the Pearson correlation coefficient ρ(X, Y) of two sets of variables X and Y is expressed as:
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the two sets of variables X and Y, respectively, and cov(X, Y) is the covariance of X and Y, expressed by the following formula:
where n is the number of data points in X and Y, $x_i$ is the i-th value in the current set of variables X, $y_i$ is the i-th value in the current set of variables Y, $\bar{x}$ is the mean of the current set of variables X, and $\bar{y}$ is the mean of the current set of variables Y.
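The Pearson correlation coefficient described in this claim can be sketched as the covariance divided by the product of the two standard deviations. The function name `pearson` is hypothetical, and the 1/n (population) normalization is an assumption; it cancels in the ratio as long as the covariance and standard deviations use the same convention.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (sigma_x * sigma_y)


# Comparing a stored well with a sliding window of the same size.
well_values = [1.0, 2.0, 3.0]
window_values = [2.0, 4.0, 6.0]
rho = pearson(well_values, window_values)
```

In the context of step 5, `x` would hold the values of a representative well and `y` the values currently in the sliding window.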
4. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 5, the coefficient of variation CV is expressed as:
where σ is the standard deviation of the current set of variables and μ is the mean of the current set of variables.
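As a minimal sketch, the coefficient of variation can be computed as the ratio σ/μ of the two quantities named above. The function name `coefficient_of_variation` is hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    mu = statistics.fmean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    return sigma / mu


cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Because the ratio is unitless, wells and sliding windows with different value scales can be compared directly.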
5. A method for evaluating large-scale stream data based on dynamic drilling, characterized in that a sample set is obtained by sampling an original stream data set with the large-scale stream data sampling method according to claim 1, and the value characteristics of the original data set are evaluated, based on the sample set, by using the discrete mean accuracy, the discrete coefficient-of-variation accuracy, the discrete sampling accuracy, the concentrated mean accuracy, the concentrated coefficient-of-variation accuracy, the overall mean accuracy, the overall coefficient-of-variation accuracy, and the JSD index, wherein:
the discrete mean accuracy is the accuracy of estimating the mean of the attribute values of the discrete data set of the original stream data set by using the mean of the attribute values of the discrete data set of the sample set;
the discrete coefficient-of-variation accuracy is the accuracy of estimating the coefficient of variation of the attribute values of the discrete data set of the original stream data set by using the coefficient of variation of the attribute values of the discrete data set of the sample set;
the discrete sampling accuracy is the ratio of the number of elements in the intersection of the discrete data set of the sample set and the discrete data set of the original stream data set to the length of the discrete data set of the sample set;
the concentrated mean accuracy is the accuracy of estimating the mean of the attribute values of the concentrated data set of the original stream data set by using the mean of the attribute values of the concentrated data set of the sample set;
the concentrated coefficient-of-variation accuracy is the accuracy of estimating the coefficient of variation of the attribute values of the concentrated data set of the original stream data set by using the coefficient of variation of the attribute values of the concentrated data set of the sample set;
the overall mean accuracy is the accuracy of estimating the mean of the attribute values of the original stream data set by using the mean of the attribute values of the sample set;
the overall coefficient-of-variation accuracy is the accuracy of estimating the coefficient of variation of the attribute values of the original stream data set by using the coefficient of variation of the attribute values of the sample set;
the JSD index measures the distance between two probability distributions by calculating their KL divergences with respect to the average of the two distributions; the smaller the JSD value, the greater the similarity between the distributions of the sample set and the original stream data set, and the greater the JSD value, the greater the difference between them, wherein the JSD index is calculated using the following formula:
where P(X) is the probability distribution of the sample set, Q(X) is the probability distribution of the original stream data set, and D(P||Q) denotes the KL divergence of P(X) and Q(X), which is calculated as follows:
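Several of the evaluation indices in this claim can be sketched as follows. The JSD and KL functions use the standard definitions with a base-2 logarithm (an assumption that keeps the JSD in [0, 1]), and the discrete sampling accuracy follows the ratio stated in the claim. The `relative_accuracy` helper is purely hypothetical, because the claim does not say how the "accuracy of estimating" a mean or a coefficient of variation is computed. All function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(P||Q) of two discrete distributions (base-2 log)."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # terms with P(x) = 0 contribute nothing
        if q_x == 0:
            return math.inf  # P has mass where Q has none
        total += p_x * math.log2(p_x / q_x)
    return total


def js_divergence(p, q):
    """JS divergence: average KL divergence to the mean distribution M."""
    m = [(p_x + q_x) / 2 for p_x, q_x in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def discrete_sampling_accuracy(sample_discrete_ids, original_discrete_ids):
    """|sample ∩ original| / |sample| over the ids of the discrete data."""
    sample = set(sample_discrete_ids)
    if not sample:
        raise ValueError("the sample set contains no discrete data")
    return len(sample & set(original_discrete_ids)) / len(sample)


def relative_accuracy(estimate, truth):
    """Hypothetical accuracy measure: 1 - relative error (not from the claim)."""
    if truth == 0:
        raise ValueError("relative accuracy is undefined for a zero reference")
    return 1 - abs(estimate - truth) / abs(truth)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])   # same distribution
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # no overlap
hit_rate = discrete_sampling_accuracy({2, 3, 9}, {1, 2, 3, 4})
```

To apply `js_divergence` to raw stream values, both data sets would first have to be binned over the same set of intervals so that the two lists describe the same outcomes.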

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310376603.0A CN116821733A (en) 2023-04-07 2023-04-07 Large-scale flow data sampling evaluation method based on dynamic drilling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310376603.0A CN116821733A (en) 2023-04-07 2023-04-07 Large-scale flow data sampling evaluation method based on dynamic drilling

Publications (1)

Publication Number Publication Date
CN116821733A true CN116821733A (en) 2023-09-29

Family

ID=88128217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310376603.0A Pending CN116821733A (en) 2023-04-07 2023-04-07 Large-scale flow data sampling evaluation method based on dynamic drilling

Country Status (1)

Country Link
CN (1) CN116821733A (en)

Similar Documents

Publication Publication Date Title
Yu et al. Unsupervised online anomaly detection with parameter adaptation for KPI abrupt changes
CN110134719B (en) Identification and classification method for sensitive attribute of structured data
Branisavljević et al. Improved real-time data anomaly detection using context classification
CN109829721B (en) Online transaction multi-subject behavior modeling method based on heterogeneous network characterization learning
CN105426441B (en) A kind of automatic preprocess method of time series
CN113670616B (en) Bearing performance degradation state detection method and system
CN115309963B (en) Intelligent archive management method, system and storage medium
Toth et al. Group deviation detection methods: a survey
CN109740044B (en) Enterprise transaction early warning method based on time series intelligent prediction
Lahmiri et al. Multi-fluctuation nonlinear patterns of European financial markets based on adaptive filtering with application to family business, green, Islamic, common stocks, and comparison with Bitcoin, NASDAQ, and VIX
CN111145027A (en) Suspected money laundering transaction identification method and device
Gu et al. Application of fuzzy decision tree algorithm based on mobile computing in sports fitness member management
CN107579839A (en) A kind of online service measures of reputation method based on various dimensions evaluation information
Jung et al. Multivariate neighborhood trajectory analysis: an exploration of the functional data analysis approach
CN116821733A (en) Large-scale flow data sampling evaluation method based on dynamic drilling
Ahani et al. A hybrid regionalization method based on canonical correlation analysis and cluster analysis: a case study in northern Iran
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
Falconi et al. A robust approach to ARMA factor modeling
Zhong et al. Analysis and improvement of evaluation indexes for clustering results
CN114925975A (en) Source load power typical daily set generation method considering time sequence curve characteristics
CN111428510B (en) Public praise-based P2P platform risk analysis method
CN113506007B (en) Well drilling type data sampling method and application thereof in big data value risk assessment
Kakinaka et al. Flexible two-point selection approach for characteristic function-based parameter estimation of stable laws
Kumar et al. Clustering the Various Categorical Data: An Exploration of Algorithms and Performance Analysis
Yue et al. Gas flow meter anomaly data detection based on fused LOF-DBSCAN algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination