CN116821733A

CN116821733A - Large-scale flow data sampling evaluation method based on dynamic drilling

Info

Publication number: CN116821733A
Application number: CN202310376603.0A
Authority: CN
Inventors: 章昭辉; 章鹏; 王鹏伟
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2023-04-07
Filing date: 2023-04-07
Publication date: 2023-09-29

Abstract

The application provides a large-scale stream data sampling method based on dynamic drilling. It further provides a large-scale stream data evaluation method based on dynamic drilling, in which a sample set is obtained by sampling an original stream data set with the sampling method, and the value characteristics of the original data set are evaluated from that sample set. Borrowing the idea of mineral drilling exploration, the dynamic drilling sampling method takes a well as the analysis unit, dynamically changes the size and the position of the well, and accurately locates the position and the range of discrete data. A new stream data value evaluation model is also provided; it uses the sample set obtained by dynamic drilling sampling to evaluate the original stream data set in three dimensions (discrete, concentrated and overall), which is of research significance for big data value evaluation.

Description

Large-scale flow data sampling evaluation method based on dynamic drilling

Technical Field

The application relates to a large-scale stream data sampling method based on dynamic drilling, and belongs to the technical field of information.

Background

In the big data era, realizing the value of data is one of the core demands of big data. The data element market has become an integral part of building Digital China, and the era of data assets has arrived. Stakeholders are currently working to clear the obstacles to turning data into assets.

With the development of the industrial internet, big data technology is advancing rapidly. Data has become a commodity that major vendors compete to acquire, data volumes are growing quickly and dynamically, and fields such as network security, daily transactions, social media and transportation continuously generate data in the form of streams. Sampling is an indispensable data mining technique and is widely used in practical applications such as fraud detection and transportation. Sampling extracts, from a large amount of data, a sample set that retains the characteristics of the original data, so that the quality and value of the original data can be evaluated and predicted while computation cost and storage are reduced.

At present, sampling methods for stream data fall into three main categories. The first is unbiased sampling: stratified sampling, random sampling, reservoir sampling and the like. Because unbiased sampling is random, the sampled stream data can lose part of the key information, which ultimately makes the stream data value evaluation inaccurate. The second is biased sampling, such as probability density sampling, which addresses the data loss of unbiased sampling. Biased sampling retains a large amount of the discrete data in the stream, but it amplifies the effect of the discrete data in the sample set. The third is mixed sampling, which samples the data set as a whole well but has lower sampling accuracy for discrete data. In addition, in financial anti-fraud risk control systems, anomalous data plays a very important role and carries a great deal of value information; for example, abnormal financial transaction data may indicate fraud or money laundering, so anomalous data needs to be kept in the sample.

At present, the value evaluation for big data is mainly embodied in the field of economics, and the value evaluation for stream data is yet to be studied.

In summary, because unbiased sampling, biased sampling and mixed sampling pursue different targets, it remains difficult to evaluate the value of stream data both accurately and efficiently. Thus, although a wide variety of current sampling methods retain discrete data in stream data, challenges remain.

Disclosure of Invention

The application aims to solve the following technical problems: accurately predicting the position and the range of discrete data in stream data, and evaluating the value characteristics of stream data efficiently and accurately.

In order to solve the above technical problems, one aspect of the present application provides a large-scale stream data sampling method based on dynamic drilling, characterized in that the method is used for sampling stream data S, where the stream data S is expressed as $S = \{(id_i, time_i, value_i) \mid 1 \le i \le N,\ i \in \mathbb{N}^+\}$, in which $id_i$ is the arrival order of the i-th stream data, $time_i$ is the arrival time of the i-th stream data, and $value_i$ is the value of the i-th stream data; the large-scale stream data sampling method includes the following steps:

step 1, determining the position and the range of discrete data in the stream data by taking a well as the analysis unit, wherein the i-th well is denoted $W_i$, the size of the i-th well is denoted $WS_i$, and the number of sampling wells in the original stream data set is denoted WN; the data in $W_i$ are then expressed as:

$$W_i = \{(id_j, time_j, value_j) \mid 1 \le j \le WS_i,\ j \in \mathbb{N}^+\} \quad (1 \le i \le WN)$$

step 2, calculating the well intervals, wherein the i-th well interval is denoted $WI_i$ and its size is denoted $WIS_i$; the i-th well interval $WI_i$ is expressed as:

$$WI_i = \{(id_j, time_j, value_j) \mid id_{w_i\_max} + 1 \le id_j \le id_{w_{i+1}\_min} - 1\}$$

where $id_{w_i\_max}$ is the maximum id of all stream data in the i-th well and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)-th well;

step 3, determining the degree of dispersion of the well by using the skewness coefficient SK: if the distribution of the stream data in the well is symmetric, the skewness coefficient SK is equal to 0; if SK differs markedly from 0, the distribution of the stream data in the well is asymmetric, where a positive SK indicates a right-skewed distribution and a negative SK indicates a left-skewed distribution;

with the skewness coefficient of the i-th well denoted $SK_i$: if $SK_i \in [-0.5, 0.5]$, the i-th well has a small degree of dispersion; if $SK_i \in (-\infty, -1) \cup (1, +\infty)$, the i-th well has a large degree of dispersion, referred to as a highly skewed distribution; if $SK_i \in (-1, -0.5) \cup (0.5, 1)$, the distribution is considered moderately skewed;

step 4, dynamically adjusting the sampling rate and the well interval by using the skewness coefficient: assuming the current well is the i-th well and the initial sampling rate is $p_{init}$, the adjusted sampling rate p is expressed as:

the adjusted i-th well interval size $WIS_i$ is expressed as:

where $WIS_{init}$ represents the initial well interval size for the i-th well;

step 5, dynamically adjusting the size of the well through an algorithm combining the Pearson correlation coefficient and the coefficient of variation:

recording representative wells in a well set, then receiving new stream data with a sliding window, traversing the well set, setting the size of the sliding window in turn to the size of each well in the well set, and determining the size of the well by calculating the Pearson correlation coefficient and the coefficient of variation between each well in the well set and the sliding window;

step 6, dynamic drilling sampling:

performing unbiased sampling within each class and biased sampling between classes inside a well, wherein the sampling rate of minority classes is raised dynamically between classes according to the magnitude of the skewness coefficient; equidistant sampling is used within the well interval, which reduces the access rate of sampling the stream data; the range of discrete data is located by the well-size adjustment algorithm of step 5, and the well interval size is then adjusted dynamically according to the skewness coefficient so as to locate the position of discrete data dynamically.

Preferably, in step 3, the skewness coefficient $SK_i$ of the i-th well is expressed as:

where $\bar{x}$ represents the mean of the data in the i-th well.

Preferably, in step 5, the Pearson correlation coefficient $\rho(X, Y)$ of the two sets of variables X and Y is expressed as:

$$\rho(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the two sets of variables X and Y, respectively, and $cov(X, Y)$ is the covariance, that is, the correlation measure before normalization, expressed by the following formula:

where n represents the number of data points in the variables X and Y, $x_i$ represents the i-th value in the current set of variables X, $y_i$ represents the i-th value in the current set of variables Y, $\bar{x}$ represents the mean of the current set of variables X, and $\bar{y}$ represents the mean of the current set of variables Y.

Preferably, in step 5, the coefficient of variation CV is expressed as:

$$CV = \frac{\sigma}{\mu}$$

where σ is the standard deviation of the current set of variables and μ is the mean of the current set of variables.

The application further provides a large-scale stream data evaluation method based on dynamic drilling, which is characterized in that a sample set is obtained by sampling an original stream data set by using the large-scale stream data sampling method, and the value characteristics of the original data set are evaluated by adopting discrete mean value accuracy, discrete variation coefficient accuracy, discrete sampling accuracy, centralized mean value accuracy, centralized variation coefficient accuracy, overall mean value accuracy, overall variation coefficient accuracy and JSD index based on the sample set, wherein:

the discrete mean value accuracy is the accuracy of estimating the mean value of the discrete data set attribute value of the original stream data set by using the mean value of the discrete data set attribute value of the sample set;

the discrete coefficient of variation accuracy is the accuracy of estimating the coefficient of variation of the discrete data set attribute value of the original stream data set by using the coefficient of variation of the discrete data set attribute value of the sample set;

the discrete sampling accuracy is the ratio of the number of intersections of the discrete data sets of the sample set and the discrete data sets of the original stream data set to the length of the discrete data sets of the sample set;

the centralized average value accuracy is the accuracy of estimating the average value of the centralized data set attribute value of the original stream data set by using the average value of the centralized data set attribute value of the sample set;

the centralized coefficient of variation accuracy is the accuracy of estimating the coefficient of variation of the centralized data set attribute value of the original stream data set by using the coefficient of variation of the centralized data set attribute value of the sample set;

the overall mean value accuracy is the accuracy of estimating the mean value of the original stream data set attribute value by using the mean value of the sample set attribute value;

the overall variation coefficient accuracy is the accuracy of estimating the variation coefficient of the original stream data set attribute value by using the variation coefficient of the sample set attribute value;

the JSD index measures the distance between two probability distributions by calculating the KL divergence between each distribution and the average of the two distributions; the smaller the JSD index, the greater the similarity between the distributions of the sample set and the original stream data set, and the larger the JSD index, the greater the difference between them, wherein the JSD index is calculated using the following formula:

$$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad M = \frac{1}{2}(P + Q)$$

where P(X) is the probability distribution of the sample set, Q(X) is the probability distribution of the original stream data set, and $D(P \| Q)$ is the KL divergence of P(X) from Q(X), calculated as:

$$D(P \| Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$

Current sampling methods for stream data that changes in real time and at high speed easily lose much of the value and information carried by discrete data, which makes efficient and accurate evaluation of the value characteristics of the stream data difficult. Borrowing the idea of mineral drilling exploration, the application provides a dynamic drilling sampling method that takes a well as the analysis unit, dynamically changes the size and the position of the well, and accurately locates the position and the range of discrete data. A new stream data value evaluation model is also provided; it uses the sample set obtained by dynamic drilling sampling to evaluate the original stream data set in three dimensions (discrete, concentrated and overall), which is of research significance for big data value evaluation.

Drawings

FIG. 1 is a block diagram of a large scale stream data sampling framework for dynamic drilling;

FIG. 2 is a stream data value assessment model diagram;

FIG. 3 is a flow data schematic;

FIG. 4(a) and FIG. 4(b) are classification diagrams of stream data peaks and troughs.

Detailed Description

The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

The application relates to a large-scale stream data sampling method based on dynamic drilling, and also relates to an evaluation model establishment method for evaluating stream data value obtained by using the large-scale stream data sampling method.

The large-scale stream data sampling method based on dynamic drilling mainly proceeds as follows: first, an initial well and an initial well interval are set, and the data in the well are clustered with k-means; the skewness coefficient is then used to determine the degree of dispersion of the data in the well, the sampling rate of each class and the well interval, with unbiased sampling within each class and biased sampling between classes inside the well, and equidistant sampling within the well interval; finally, the size of the next well is determined jointly by the correlation coefficient and the coefficient of variation. The framework of the sampling method is shown in FIG. 1 and comprises the following steps:

The application targets real-time, high-speed discrete stream data, whose size is denoted N and grows without bound over time.

S101: the stream data S is expressed as:

$$S = \{(id_i, time_i, value_i) \mid 1 \le i \le N,\ i \in \mathbb{N}^+\}$$

where $id_i$ is the arrival order of the i-th stream data, $time_i$ is the arrival time of the i-th stream data, and $value_i$ is the value of the i-th stream data. The distribution of the stream data is shown in FIG. 3.

S102: well:

Because the position and the range of discrete data in the stream data are uncertain, the concept of a well is introduced, based on the idea of drilling sampling. By taking the well as the analysis unit, the position and the range of discrete data in the stream data are determined. The i-th well is denoted $W_i$, the size of the i-th well is denoted $WS_i$, and the number of sampling wells in the original data set is denoted WN; the data in $W_i$ are then expressed as:

$$W_i = \{(id_j, time_j, value_j) \mid 1 \le j \le WS_i,\ j \in \mathbb{N}^+\} \quad (1 \le i \le WN)$$

S103: well interval:

To reduce the access rate of sampling the stream data, well intervals are placed between wells. The i-th well interval is denoted $WI_i$ and its size is denoted $WIS_i$; the i-th well interval $WI_i$ is expressed as:

$$WI_i = \{(id_j, time_j, value_j) \mid id_{w_i\_max} + 1 \le id_j \le id_{w_{i+1}\_min} - 1\}$$

where $id_{w_i\_max}$ is the maximum id of all stream data in the i-th well and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)-th well.
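The definitions in S101 to S103 can be illustrated with a minimal Python sketch. The names `Record` and `split_wells_and_intervals`, and the use of fixed well and interval sizes, are assumptions made for illustration only; in the method itself $WS_i$ and $WIS_i$ are adjusted dynamically.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """One stream element (id_i, time_i, value_i)."""
    id: int        # arrival order, starting at 1
    time: float    # arrival time
    value: float   # data value


def split_wells_and_intervals(
    stream: List[Record], well_size: int, interval_size: int
) -> Tuple[List[List[Record]], List[List[Record]]]:
    """Partition a finite stream prefix into wells W_i and well intervals WI_i.

    Fixed sizes are an illustrative assumption, not part of the source method.
    """
    if well_size <= 0 or interval_size < 0:
        raise ValueError("well_size must be positive and interval_size non-negative")
    wells: List[List[Record]] = []
    intervals: List[List[Record]] = []
    pos = 0
    while pos < len(stream):
        wells.append(stream[pos:pos + well_size])   # records of the current well
        pos += well_size
        gap = stream[pos:pos + interval_size]       # records between this well and the next
        if gap:
            intervals.append(gap)
        pos += interval_size
    return wells, intervals


stream = [Record(i, float(i), float(i % 7)) for i in range(1, 21)]
wells, intervals = split_wells_and_intervals(stream, well_size=4, interval_size=3)
```

With this layout the first interval starts at the id following the largest id of the first well and ends at the id preceding the smallest id of the second well, which matches the definition of $WI_i$.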

S104: determining the degree of dispersion of a well:

In order to determine the position of discrete data, the application introduces the statistical concept of the skewness coefficient, denoted SK. The skewness coefficient $SK_i$ of the i-th well is expressed as:

If the distribution of the stream data in a well is symmetric, the skewness coefficient SK is equal to 0; if SK differs markedly from 0, the distribution of the stream data in the well is asymmetric, where a positive SK indicates a right-skewed distribution and a negative SK indicates a left-skewed distribution. If $SK_i \in [-0.5, 0.5]$, the degree of dispersion of the i-th well is small; if $SK_i \in (-\infty, -1) \cup (1, +\infty)$, the distribution is called highly skewed; if $SK_i \in (-1, -0.5) \cup (0.5, 1)$, the distribution is considered moderately skewed.
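The classification in S104 can be sketched in Python as follows. The skewness estimator used here, the moment coefficient $m_3 / m_2^{3/2}$, is an assumption, since the formula for $SK_i$ is not reproduced in this text; the function names are also illustrative. The value $|SK_i| = 1$ falls in none of the stated ranges, and the sketch treats it as moderate.

```python
from typing import Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment coefficient of skewness m3 / m2**1.5 (an assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    if m2 == 0:
        return 0.0  # constant data: treated as symmetric
    return m3 / m2 ** 1.5


def dispersion_level(sk: float) -> str:
    """Map SK_i to the three levels described in S104."""
    a = abs(sk)
    if a <= 0.5:
        return "low"        # SK_i in [-0.5, 0.5]
    if a > 1:
        return "high"       # highly skewed
    return "moderate"       # moderately skewed (|SK_i| == 1 included by assumption)
```

For example, a symmetric well such as `[1, 2, 3, 4, 5]` is classified as `"low"`, while a well with one large outlier such as `[1, 1, 1, 1, 10]` is classified as `"high"`.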

S105: dynamically adjusting the sampling rate and the well position:

The application dynamically adjusts the sampling rate and the well position by using the skewness coefficient. Assuming the current well is the i-th well and the initial sampling rate is $p_{init}$, the adjusted sampling rate p is expressed as:

$$p = \begin{cases} p_{init}, & SK_i \in [-0.5, 0.5] \\ 2 \times p_{init} \times |SK_i|, & SK_i \in (-\infty, -1) \cup (1, +\infty) \\ 2 \times p_{init} \times |SK_i|, & SK_i \in (-1, -0.5) \cup (0.5, 1) \end{cases}$$

The data in a well are clustered with k-means into three classes, denoted class0, class1 and class2. Three cases are distinguished: if $SK_i \in [-0.5, 0.5]$, the sampling rate of all classes is $p = p_{init}$; if $SK_i \in (-\infty, -1) \cup (1, +\infty)$, the sampling rate of the two classes with fewer members is raised to $p = 2 \times p_{init} \times |SK_i|$; if $SK_i \in (-1, -0.5) \cup (0.5, 1)$, the sampling rate of the class with the fewest members is raised to $p = 2 \times p_{init} \times |SK_i|$.
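A minimal Python sketch of the three cases above follows. It assumes the k-means class sizes are already known; the function name `class_sampling_rates` and the cap of the rate at 1.0 are illustrative additions, not part of the source.

```python
from typing import Dict


def class_sampling_rates(
    class_sizes: Dict[str, int], sk: float, p_init: float
) -> Dict[str, float]:
    """Per-class sampling rates for one well, following the three cases of S105.

    Capping the boosted rate at 1.0 is an added safeguard (an assumption).
    """
    rates = {name: p_init for name in class_sizes}
    a = abs(sk)
    if a <= 0.5:
        return rates                         # low dispersion: p = p_init for all classes
    boosted = min(1.0, 2 * p_init * a)       # p = 2 * p_init * |SK_i|
    by_size = sorted(class_sizes, key=class_sizes.get)   # smallest class first
    n_boost = 2 if a > 1 else 1              # highly skewed: two smaller classes; otherwise the smallest
    for name in by_size[:n_boost]:
        rates[name] = boosted
    return rates
```

For a well with class sizes 80, 15 and 5, $p_{init} = 0.1$ and $SK_i = 1.5$, the two smaller classes are sampled at about 0.3 while the largest class stays at 0.1.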

The application dynamically adjusts the well interval by using the skewness coefficient. The adjusted size $WIS_i$ of the i-th well interval is expressed as:

where $WIS_{init}$ denotes the initial well interval size. Two cases are distinguished: if $SK_i \in [-0.5, 0.5]$, the well interval is enlarged to $WIS_i = 2 \times WIS_{init}$; if $SK_i \in (-\infty, -1) \cup (1, +\infty)$ or $SK_i \in (-1, -0.5) \cup (0.5, 1)$, the well interval is reduced.
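The two cases for the well interval can be sketched in Python as below. Only the doubling rule is stated in this text; the amount by which a skewed well shrinks the interval is not reproduced here, so `reduce_factor` is a hypothetical caller-supplied placeholder rather than the application's formula.

```python
def adjusted_interval_size(sk: float, wis_init: int, reduce_factor: float) -> int:
    """Return the adjusted well interval size WIS_i.

    reduce_factor (0 < reduce_factor < 1) is a hypothetical placeholder:
    the source's reduction formula is not available in this text.
    """
    if not 0 < reduce_factor < 1:
        raise ValueError("reduce_factor must lie strictly between 0 and 1")
    if abs(sk) <= 0.5:
        return 2 * wis_init                        # low dispersion: WIS_i = 2 * WIS_init
    return max(1, int(wis_init * reduce_factor))   # skewed well: shrink the interval
```

The design intent follows the text: a calm well lets the sampler skip further ahead, while a skewed well brings the next well closer so that nearby discrete data are not missed.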

S106: dynamically adjusting the size of the well:

The peaks and troughs of stream data show three distinct kinds of feature: 1. the slope changes very sharply; 2. the values change periodically; 3. the degree of dispersion is low. According to these three features, peaks are divided into the shock peak SP (shock-peak), the oscillation peak OP (oscillation-peak) and the buffer peak BP (buffer-peak), as shown in FIG. 4(a); troughs are correspondingly divided into the shock trough ST (shock-trough), the oscillation trough OT (oscillation-trough) and the buffer trough BT (buffer-trough), as shown in FIG. 4(b). Shock peaks and shock troughs have a very steep slope, so a well that contains one has a very high degree of dispersion; oscillation peaks and oscillation troughs change periodically, and a well that contains one has a fairly high degree of dispersion; buffer peaks and buffer troughs have a low degree of dispersion, and a well that contains one is less dispersed than a well containing either of the other two types. Furthermore, when two different wells contain the same kind of peak or trough, the two wells show a degree of self-similarity and their degrees of dispersion can be very close.

Based on the above findings, the present application proposes an algorithm that combines the Pearson correlation coefficient and the coefficient of variation to dynamically adjust the well size, in order to accurately determine the range of discrete data in the stream data.

The Pearson correlation coefficient represents the degree of linear correlation between two sets of variables X and Y, and its value lies in [-1, 1]. A coefficient close to 1 indicates that X and Y are strongly positively correlated; a coefficient close to -1 indicates that X and Y are strongly negatively correlated; a coefficient close to 0 indicates that there is no apparent linear correlation between the two sets of variables. The formula is as follows:

The formula above is the covariance. Dividing the covariance by the standard deviations $\sigma_X$ and $\sigma_Y$ of the two variables compensates for the weak expressiveness of the raw covariance:

$$\rho(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$

The coefficient of variation CV is the standard deviation σ divided by the mean μ. Because it is normalized by the mean, it describes the relative degree of dispersion of the data independently of the magnitude of the mean, and can therefore be used to compare the degree of dispersion of two sets of data:

$$CV = \frac{\sigma}{\mu}$$
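The two statistics can be computed with a few lines of Python. This sketch uses the population (1/n) form of the covariance and standard deviation, which is an assumption; the value of ρ is the same for the 1/n and 1/(n-1) forms as long as one form is used consistently. Returning 0 for a constant series is also an assumption.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """rho(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), population (1/n) form."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("x and y must be non-empty and equally long")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (sx * sy)


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = sigma / mu with the population standard deviation; mu must be non-zero."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sigma / mu
```

Because CV is scale-free, two wells whose values differ only by a constant factor have the same CV, which is what makes it usable for comparing wells of different magnitude.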

S107: algorithm for dynamically adjusting the well size:

Representative wells are first recorded in a well set. New stream data are then received with a sliding window: the well set is traversed, the size of the sliding window is set in turn to the size of each well in the well set, and the size of the well is determined by calculating the self-similarity (Pearson correlation coefficient) and the coefficient of variation between each well in the well set and the sliding window.
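One possible reading of S107 is sketched below in Python. The text does not state the decision rule, so everything about the selection is an assumption: the thresholds `rho_min` and `cv_tol`, the fallback `default_size`, and the choice of the most correlated matching well are hypothetical and only illustrate how the sliding window, the Pearson coefficient and the coefficient of variation could be combined.

```python
import math
from typing import List, Sequence


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def _cv(v: Sequence[float]) -> float:
    mu = sum(v) / len(v)
    sigma = math.sqrt(sum((a - mu) ** 2 for a in v) / len(v))
    return sigma / mu if mu != 0 else math.inf


def choose_well_size(
    well_set: List[List[float]],
    incoming: Sequence[float],
    default_size: int,
    rho_min: float = 0.8,   # hypothetical similarity threshold
    cv_tol: float = 0.2,    # hypothetical tolerance on the CV difference
) -> int:
    """Pick the next well size by matching incoming data against representative wells."""
    best_size, best_rho = default_size, rho_min
    for well in well_set:
        window = incoming[:len(well)]    # sliding window sized like this well
        if len(window) < len(well):
            continue                     # not enough new data for this window size
        rho = _pearson(well, window)
        if rho >= best_rho and abs(_cv(well) - _cv(window)) <= cv_tol:
            best_size, best_rho = len(well), rho
    return best_size
```

When no representative well resembles the incoming data, the sketch simply keeps `default_size`; how the well set itself is populated and updated is left out.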

S108: dynamic drilling sampling algorithm:

The application provides a dynamic drilling sampling algorithm that performs unbiased sampling within each class and biased sampling between classes inside a well. Within a class, a classical sampling algorithm, reservoir sampling, is used; between classes, the sampling rate of minority classes is raised dynamically according to the magnitude of the skewness coefficient. Equidistant sampling is used within the well interval, which reduces the access rate of sampling the stream data. The range of discrete data is located by the algorithm for dynamically adjusting the well size, and the well interval size is then adjusted dynamically according to the skewness coefficient so as to locate the position of discrete data dynamically.
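The sampling primitives named in S108 can be sketched in Python as follows. Reservoir sampling is shown in its classical single-pass form (Algorithm R); the helper names, the `step` parameter of the equidistant sampler and the rounding of the per-class sample size are illustrative assumptions. Class membership and per-class rates are taken as given inputs.

```python
import random
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Sequence[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Algorithm R: uniform random sample of up to k items in a single pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive bounds
            if j < k:
                reservoir[j] = item      # replace with probability k / (i + 1)
    return reservoir


def equidistant_sample(items: Sequence[T], step: int) -> List[T]:
    """Keep every step-th item of a well interval (step is a hypothetical parameter)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(items[::step])


def sample_well(
    classes: Dict[str, Sequence[T]],
    rates: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Unbiased sampling within each class, biased across classes via per-class rates."""
    sample: List[T] = []
    for name, members in classes.items():
        k = min(len(members), max(0, round(rates[name] * len(members))))
        sample.extend(reservoir_sample(members, k, rng))
    return sample
```

Giving a minority class a higher rate than the majority class keeps proportionally more of its records, which is how discrete data are retained in the sample.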

The application provides a new stream data value evaluation model which, starting from the three dimensions of discrete, concentrated and overall, evaluates the value characteristics of the original data set by using the sample set obtained by the large-scale stream data sampling method. The specific evaluation model is shown in FIG. 2.

Discrete mean value accuracy:

This is the accuracy of estimating the mean of the discrete data set attribute value of the original stream data set by using the mean of the discrete data set attribute value of the sample set. The discrete mean accuracy DMA (Discrete Mean Accuracy) is calculated as follows:

where the two mean terms denote, respectively, the mean of the discrete data set attribute value of the original stream data set and the mean of the discrete data set attribute value of the sample set.

Discrete coefficient of variation accuracy:

This is the accuracy of estimating the coefficient of variation of the discrete data set attribute value of the original stream data set by using the coefficient of variation of the discrete data set attribute value of the sample set. The discrete coefficient of variation accuracy ADCV (Accuracy Of Discrete Coefficient Of Variation) is calculated as follows:

Discrete sampling accuracy:

This is the ratio of the size of the intersection of the discrete data set of the sample set and the discrete data set of the original stream data set to the length of the discrete data set of the sample set. The discrete sampling accuracy DSA (Discrete Sampling Accuracy) is calculated as follows:

$$DSA = \frac{len(DDSS \cap DDRD)}{len(DDSS)}$$

where DDSS denotes the discrete data set of the sample set, DDRD denotes the discrete data set of the original stream data set, and len(·) denotes the length.
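The ratio described above can be computed directly once the two discrete data sets are known. The Python sketch below assumes that records are identified by their id and that both sets have already been extracted; how discrete data are detected is outside this sketch, and returning 0 for an empty sample-side set is an assumption.

```python
from typing import Iterable


def discrete_sampling_accuracy(ddss_ids: Iterable[int], ddrd_ids: Iterable[int]) -> float:
    """DSA = len(DDSS ∩ DDRD) / len(DDSS), with records identified by id."""
    ddss, ddrd = set(ddss_ids), set(ddrd_ids)
    if not ddss:
        return 0.0  # assumption: an empty sample-side discrete set scores 0
    return len(ddss & ddrd) / len(ddss)
```

For instance, if the sample's discrete set has ids 3, 8, 15 and 40 and the original's has ids 3, 8, 9, 15 and 21, three of the four sampled discrete records are genuine, giving a DSA of 0.75.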

Centralized mean value accuracy:

This is the accuracy of estimating the mean of the centralized data set attribute value of the original stream data set by using the mean of the centralized data set attribute value of the sample set. The centralized mean accuracy CMA (Centralized Mean Accuracy) is calculated as follows:

where the two mean terms denote, respectively, the mean of the centralized data set attribute value of the original stream data set and the mean of the centralized data set attribute value of the sample set.

Centralized coefficient of variation accuracy:

This is the accuracy of estimating the coefficient of variation of the centralized data set attribute value of the original stream data set by using the coefficient of variation of the centralized data set attribute value of the sample set. The centralized coefficient of variation accuracy ACCV (Accuracy Of Centralized Coefficient Of Variation) is calculated as follows:

where $CV_{CDRD}$ denotes the coefficient of variation of the centralized data set attribute value of the original stream data set and $CV_{CDSS}$ denotes the coefficient of variation of the centralized data set attribute value of the sample set.

Overall mean value accuracy:

This is the accuracy of estimating the mean of the attribute value of the original stream data set RD (Raw Data) by using the mean of the attribute value of the sample set SS (Sample Set). The overall mean accuracy OMA (Overall Mean Accuracy) is calculated as follows:

where the two mean terms denote, respectively, the mean of the original stream data set attribute value and the mean of the sample set attribute value.

Overall coefficient of variation accuracy:

This is the accuracy of estimating the coefficient of variation of the original stream data set attribute value by using the coefficient of variation of the sample set attribute value. The overall coefficient of variation accuracy AOCV (Accuracy Of Overall Coefficient Of Variation) is calculated as follows:

where $CV_{RD}$ denotes the coefficient of variation of the original stream data set attribute value and $CV_{SS}$ denotes the coefficient of variation of the sample set attribute value.

JSD index:

The JSD index, also called the JS divergence, is an evaluation index that measures the difference between two probability distributions. It is computed from the KL divergence between each distribution and the average of the two distributions, and with base-2 logarithms its value lies in [0, 1]. The smaller the JSD index, the greater the similarity between the distributions of the sample set and the original stream data set; the larger the JSD index, the greater the difference between them. Assuming that the probability distribution of the sample set is P(X) and the probability distribution of the original stream data set is Q(X), the KL divergence is calculated as follows:

$$D(P \| Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$

The JS divergence is calculated as follows:

$$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad M = \frac{1}{2}(P + Q)$$
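A minimal Python sketch of the KL and JS divergences for discrete distributions follows. It assumes both distributions are given as probability lists over the same bins (for example, normalized histograms of the attribute value) and uses base-2 logarithms so that the JS divergence stays within [0, 1]; the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(P||Q) = sum P(x) * log2(P(x) / Q(x)); terms with P(x) = 0 contribute 0.

    Q(x) must be positive wherever P(x) is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(P||Q) = 0.5 * D(P||M) + 0.5 * D(Q||M), with M the average distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because the average distribution M is positive wherever P or Q is, the JS divergence is defined even when the sample set misses bins that occur in the original data, which the plain KL divergence cannot handle.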

Claims

1. A method for sampling large-scale stream data based on dynamic drilling, characterized in that the method is used for sampling stream data S, where the stream data S is expressed as $S = \{(id_i, time_i, value_i) \mid 1 \le i \le N,\ i \in \mathbb{N}^+\}$, in which $id_i$ is the arrival order of the i-th stream data, $time_i$ is the arrival time of the i-th stream data, and $value_i$ is the value of the i-th stream data; the large-scale stream data sampling method includes the following steps:

where $id_{w_i\_max}$ is the maximum id of all stream data in the i-th well and $id_{w_{i+1}\_min}$ is the minimum id of all stream data in the (i+1)-th well;

the adjusted i-th well interval size $WIS_i$ is expressed as:

where $WIS_{init}$ represents the initial well interval size for the i-th well;

step 6, dynamic drilling sampling:

2. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 3, the skewness coefficient $SK_i$ of the i-th well is expressed as:

3. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 5, the Pearson correlation coefficient $\rho(X, Y)$ for two sets of variables X and Y is expressed as:

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the two sets of variables X and Y, respectively, and $cov(X, Y)$ is the covariance, that is, the correlation measure before normalization, expressed by the following formula:

where n represents the number of data points in the variables X and Y, $x_i$ represents the i-th value in the current set of variables X, $y_i$ represents the i-th value in the current set of variables Y, $\bar{x}$ represents the mean of the current set of variables X, and $\bar{y}$ represents the mean of the current set of variables Y.

4. The method for sampling large-scale stream data based on dynamic drilling as recited in claim 1, wherein in step 5, the coefficient of variation CV is expressed as:

5. The method for evaluating large-scale stream data based on dynamic drilling is characterized in that a sample set is obtained by sampling an original stream data set by using the large-scale stream data sampling method according to claim 1, and the value characteristics of the original data set are evaluated by using a discrete mean value accuracy, a discrete variation coefficient accuracy, a discrete sampling accuracy, a centralized mean value accuracy, a centralized variation coefficient accuracy, an overall mean value accuracy, an overall variation coefficient accuracy and a JSD index based on the sample set, wherein: