WO2024016731A1 - Data point query method and apparatus, device cluster, program product and storage medium - Google Patents


Info

Publication number
WO2024016731A1
WO2024016731A1 (PCT/CN2023/086007; CN2023086007W)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
target
sketch
quantile
data point
Prior art date
Application number
PCT/CN2023/086007
Other languages
English (en)
Chinese (zh)
Inventor
刘超
叶冠宇
李云川
李仕林
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd. (华为云计算技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211091505.4A external-priority patent/CN117472975A/zh
Application filed by Huawei Cloud Computing Technologies Co., Ltd. (华为云计算技术有限公司)
Publication of WO2024016731A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Definitions

  • Embodiments of the present application relate to the field of cloud computing technology, and in particular to a data point query method, device, equipment cluster, program product and storage medium.
  • Data points refer to data collected by relevant devices in Internet of Things technology, such as temperatures collected by temperature sensing devices.
  • Data point query is used to query the characteristics of a certain data point in a batch of data points, such as querying the quantile of a data point in the batch based on its data value, or querying the data value of a data point based on its quantile.
  • the quantile indicates the position of the data point in a batch of data points sorted by size.
  • Embodiments of the present application provide a data point query method, device, equipment cluster, program product and storage medium, which can efficiently and accurately query a certain data point from massive data points.
  • the technical solutions are as follows:
  • a data point query method is provided.
  • based on the target quantile corresponding to the target data point to be queried, the target scale function is determined from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions differs, and the target quantile indicates the position of the target data point among multiple data points sorted by size;
  • the target sketch is constructed based on the target scale function and multiple data points. The target sketch includes multiple clusters, and each cluster includes a cluster mean and a cluster weight.
  • the cluster mean indicates the mean value of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster; the target data point is queried based on the target sketch.
  • in this way, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile. When the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data point based on the sketch.
  • the multiple scale functions include a first scale function and a second scale function.
  • the density of the clusters in the sketch constructed based on the first scale function on the first quantile interval is greater than the density of the clusters in the sketch constructed based on the second scale function on the first quantile interval.
  • the density of the clusters in the sketch constructed based on the first scale function on the second quantile interval is less than the density of the clusters in the sketch constructed based on the second scale function on the second quantile interval.
  • in this case, determining the target scale function from the multiple scale functions can be implemented as follows: if the target quantile is located in the first quantile interval, the first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • the sketches constructed based on the first scale function have denser clusters on the first quantile interval
  • the sketches constructed based on the second scale function have denser clusters on the second quantile interval
  • the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, where both x1 and x2 are greater than 0 and less than 1, and x1 is less than x2;
  • the second quantile interval includes the interval from x1 to x2.
  • the method provided by the embodiment of the present application can realize accurate query of the data points corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query in the entire range can be achieved.
  • querying the target data point based on the target sketch may be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. That is to say, the method provided by the embodiment of the present application is suitable for data point query in various scenarios, which improves the flexibility of the embodiment of the present application.
  • the data point query request is used to query the data value of the target data point among multiple data points.
  • the data point query request carries the standard quantile of the target data point; the standard quantile carried in the data point query request is determined as the target quantile.
  • a data point query request which is used to query the data value of a target data point among multiple data points
  • the data point query request carries the standard quantile of the target data point.
  • the standard quantile carried in the data point query request is determined as the target quantile, so that the target sketch can be constructed based on the target quantile, and then the data value of the target data point can be queried. In this case, the accuracy of the queried data values can be improved.
  • an equal-height histogram query request may also be received. The equal-height histogram query request is used to query an equal-height histogram constructed based on multiple data points, and carries the number of buckets h, where h is an integer greater than 1. Based on the number of buckets h and the total number of the multiple data points, the quantiles of the boundaries from the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, are determined to obtain h-1 quantiles.
  • each of the h-1 quantiles is used as the target quantile, and the operation of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried is performed, to obtain h-1 data values corresponding to the h-1 quantiles one-to-one.
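The step above reduces to simple arithmetic: with h equal-height buckets, the boundary between bucket k and bucket k+1 sits at quantile k/h. A minimal Python sketch of this derivation:

```python
def equal_height_quantiles(h):
    """Return the h-1 bucket-boundary quantiles of an equal-height histogram
    with h buckets. The boundary after bucket k (k = 1 .. h-1) sits at
    quantile k/h; the total point count is only needed later, when each
    quantile is mapped to a data value via the sketch."""
    assert h > 1, "the query request carries h > 1"
    return [k / h for k in range(1, h)]
```

Each returned quantile is then used as a target quantile for the scale-function selection and data-value lookup described in the text.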
  • the target scale function may also be determined based on the target quantile corresponding to the target data point to be queried.
  • querying the target data point based on the target sketch can also be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • that is, the standard quantile of the target data point can be queried based on the data value of the target data point.
  • in this scenario, a quantile can first be estimated based on the data value of the data point, and the estimated quantile is used as the target quantile to adaptively select the scale function for constructing the sketch, so as to improve the accuracy of the standard quantile obtained by the subsequent query.
  • an estimate of the target data point is determined based on the data value of the target data point, and the data value of the largest data point and the data value of the smallest data point among the plurality of data points.
  • the quantile query request is used to query the standard quantile of the target data point among multiple data points.
  • the quantile query request carries the data value of the target data point.
  • querying the standard quantile of the target data point based on its data value can be applied in the scenario where a quantile query request is received, which improves the accuracy of the standard quantile queried in this scenario.
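One plausible reading of the estimate described above is a linear interpolation between the smallest and largest data values; the excerpt does not give the exact formula, so this Python sketch is an assumption:

```python
def estimate_quantile(value, vmin, vmax):
    """Estimate a quantile for `value` from the smallest (vmin) and largest
    (vmax) data values, clamped to the global quantile interval [0, 1].
    Illustrative assumption: a simple linear placement between min and max."""
    if vmax == vmin:
        return 0.0  # degenerate batch: every data point has the same value
    return min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
```

The estimated quantile is only used to pick the scale function; the standard quantile returned to the caller is still computed from the constructed sketch.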
  • an equal-width histogram query request may also be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points, and carries a bucket boundary array.
  • the bucket boundary array includes n boundary values.
  • the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals; each of the n boundary values is used as the data value of the target data point, and the operation of determining the estimated quantile of the target data point based on the data value of the target data point, as well as the data values of the largest data point and the smallest data point among the multiple data points, is performed to obtain n standard quantiles corresponding to the n boundary values one-to-one.
  • querying the standard quantile of the target data point based on its data value can also be applied in the scenario where an equal-width histogram query request is received, which improves the accuracy of the equal-width histogram queried in this scenario.
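Concretely, each boundary value maps to a quantile, and successive quantile differences give the fraction of points in each of the n+1 buckets. In this sketch the quantile for each boundary is estimated linearly from the min and max data values, which is an illustrative assumption; a real implementation would query the sketch for each boundary:

```python
def equal_width_histogram(boundaries, vmin, vmax, total):
    """Approximate per-bucket counts of an equal-width histogram.
    `boundaries` holds the n boundary values from the query request; they
    delimit n+1 buckets between vmin and vmax over `total` data points."""
    # map each boundary value to an estimated quantile (linear assumption)
    qs = [(b - vmin) / (vmax - vmin) for b in sorted(boundaries)]
    edges = [0.0] + qs + [1.0]  # n boundaries -> n+1 quantile intervals
    # difference the quantiles to get the share of points per bucket
    return [round((hi - lo) * total) for lo, hi in zip(edges, edges[1:])]
```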
  • clusters to be updated corresponding to the data points to be updated in the cache can also be generated, and each cluster to be updated includes a cluster mean, a cluster weight and a cluster mark.
  • the cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
  • the cluster weight of the cluster to be updated indicates the number of data points to be updated.
  • the cluster mark of the cluster to be updated indicates the update type of the data point to be updated; the target sketch is updated based on the clusters to be updated.
  • in this way, the data points in the cache are expressed as clusters to be updated in the form of the triples described above, so that the target sketch can subsequently be updated based on the data points to be updated in the cache.
  • the target sketch can be updated as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-merged mark, to obtain the clusters to be merged; then merge the clusters to be merged into the target sketch.
  • in this way, the clusters to be merged, that is, the data points that need to be added, can be filtered out from the cache based on the cluster marks, and then merged into the target sketch.
  • merging the clusters to be merged into the target sketch can be implemented as follows: sort the clusters in the target sketch and the clusters to be merged in ascending order of cluster mean; for the first cluster after sorting, determine the quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting and perform the following operations in turn. For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weights, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from that cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
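The merge procedure above follows the general t-digest pattern. The Python sketch below is an illustrative rendering under assumptions: clusters are (mean, weight) pairs, and the classic t-digest arcsine scale function stands in for the patent's target scale function, since formulas (1) and (2) are not reproduced in this excerpt:

```python
import math

DELTA = 10  # hyperparameter controlling the number of clusters (assumed value)

def k(q):
    # illustrative scale function (classic t-digest arcsine form, an assumption)
    return DELTA / (2 * math.pi) * math.asin(2 * q - 1)

def k_inv(y):
    # inverse of k: turns a scale budget back into a quantile threshold
    return (math.sin(2 * math.pi * y / DELTA) + 1) / 2

def merge_into_sketch(sketch, to_merge):
    """Merge (mean, weight) clusters into the sketch: sort by cluster mean,
    then walk the clusters, absorbing each into the previous cluster while
    the running quantile stays under the threshold from the scale function."""
    clusters = sorted(sketch + to_merge)
    if not clusters:
        return []
    total = sum(w for _, w in clusters)
    merged = [list(clusters[0])]
    seen = clusters[0][1]                 # cumulative weight already placed
    q_limit = k_inv(k(0.0) + 1.0)         # quantile threshold of the open cluster
    for mean, weight in clusters[1:]:
        q_right = (seen + weight) / total
        if q_right <= q_limit:            # under threshold: absorb into previous
            m, w = merged[-1]
            merged[-1] = [(m * w + mean * weight) / (w + weight), w + weight]
        else:                             # threshold exceeded: update it, open new
            q_limit = k_inv(k(seen / total) + 1.0)
            merged.append([mean, weight])
        seen += weight
    return merged
```

Note how the threshold is recomputed from the scale function each time a new cluster is opened, which is what makes cluster sizes follow the chosen density profile.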
  • in this way, the clusters to be merged can be merged into the other clusters of the target sketch to update the target sketch.
  • the target sketch can also be updated as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-deleted mark, to obtain the clusters to be deleted; then remove the clusters to be deleted from the target sketch.
  • in this way, the clusters to be deleted, that is, the data points that need to be deleted, can be filtered out from the cache based on the cluster marks, and then deleted from the target sketch.
  • deleting the clusters to be deleted from the target sketch can be implemented as follows: sort the clusters in the target sketch and the clusters to be deleted in ascending order of cluster mean; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster in turn: for the j-th cluster, determine the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • the cluster to be deleted can be deleted from the target sketch to update the target sketch.
  • updating the cluster weights of the clusters adjacent to the j-th cluster can be implemented as follows: if the j-th cluster is an intermediate cluster after sorting, obtain the cluster mean and cluster weight of the left adjacent cluster of the j-th cluster and the cluster mean and cluster weight of the right adjacent cluster of the j-th cluster, and update the cluster weights of the left adjacent cluster and the right adjacent cluster based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster.
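The deletion pass above can be sketched as follows. The redistribution rule used here, splitting the deleted weight between the two neighbors in proportion to how close the deleted value lies to each neighbor's mean, is an illustrative assumption; the excerpt does not give the exact update formula:

```python
def delete_from_sketch(sketch, to_delete):
    """Remove to-be-deleted clusters and adjust neighbor weights.
    `sketch` and `to_delete` are lists of {"mean": float, "weight": float}."""
    merged = sorted(
        [dict(c) for c in sketch] + [dict(c, delete=True) for c in to_delete],
        key=lambda c: c["mean"],
    )
    for j, c in enumerate(merged):
        if not c.get("delete"):
            continue
        # nearest non-deleted neighbors on each side of the j-th cluster
        left = next((x for x in reversed(merged[:j]) if not x.get("delete")), None)
        right = next((x for x in merged[j + 1:] if not x.get("delete")), None)
        if left is not None and right is not None and right["mean"] != left["mean"]:
            # split the deleted weight by proximity to each neighbor (assumption)
            frac = (c["mean"] - left["mean"]) / (right["mean"] - left["mean"])
            left["weight"] = max(0.0, left["weight"] - c["weight"] * (1 - frac))
            right["weight"] = max(0.0, right["weight"] - c["weight"] * frac)
        elif left is not None:
            left["weight"] = max(0.0, left["weight"] - c["weight"])
        elif right is not None:
            right["weight"] = max(0.0, right["weight"] - c["weight"])
    return [c for c in merged if not c.get("delete")]
```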
  • constructing the target sketch based on the target scale function and multiple data points can be implemented as follows: obtain a sketch that has been cached based on some of the multiple data points and the target scale function, to obtain a first sketch; construct a sketch based on the data points other than those data points among the multiple data points and the target scale function, to obtain a second sketch; and aggregate the first sketch and the second sketch to obtain the target sketch.
  • in this way, when a cached sketch exists, only the remaining data points need a newly constructed sketch, and merging the currently constructed sketch with the previously cached sketch yields the target sketch. This avoids building the target sketch from the full set of data points for every query, thereby saving computing resources.
  • obtaining the first sketch, that is, a sketch already cached based on some of the multiple data points and the target scale function, can be implemented as follows: obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window; and obtain a metadata set, which includes the metadata of multiple sketches in the cache.
  • the multiple sketches are sketches built based on the target scale function.
  • the metadata includes a sketch time window and a sketch timeline identifier.
  • the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed.
  • the sketch timeline identifier is the identifier of the timeline to which the data points of the corresponding sketch belong. Based on the target time window and the timeline to which the target data point belongs, first metadata is determined from the metadata set: the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs. The sketch corresponding to the first metadata is determined as the first sketch.
  • the cached sketches can be managed through the metadata set, so that when querying a certain data point, the cached sketches can be obtained based on the metadata set, which improves the efficiency of obtaining cached sketches.
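The metadata lookup described above can be sketched as a simple filter. The record layout below is hypothetical (the field names and the `sketch_id` handle are assumptions), but it follows the fields named in the text, a time window plus a timeline identifier:

```python
from dataclasses import dataclass

@dataclass
class SketchMeta:
    window: tuple      # (start, end) of the sketch time window
    timeline_id: str   # timeline the sketch's data points belong to
    sketch_id: int     # handle to the cached sketch itself (hypothetical)

def find_first_sketches(meta_set, target_window, timeline_id):
    """Select the 'first metadata': records whose sketch time window is part
    or all of the target window and whose timeline identifier matches."""
    start, end = target_window
    return [m for m in meta_set
            if m.timeline_id == timeline_id
            and start <= m.window[0] and m.window[1] <= end]
```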
  • after a sketch is constructed based on the data points other than some of the multiple data points and the target scale function to obtain the second sketch, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set.
  • the metadata set can also be updated based on the second sketch, so that subsequent query operations can be performed based on the updated metadata set.
  • the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of its timeline match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.
  • in this way, when newly written data points affect a cached sketch, the cached sketch is invalidated to avoid inconsistency between query results and the actual data.
  • the metadata set further includes first usage information corresponding to each sketch timeline identifier, and the first usage information records the usage time of each of the multiple sketches matching that sketch timeline identifier.
  • the sketch to be eliminated among the multiple sketches matching any sketch timeline identifier can also be determined based on the first usage information, and the sketch to be eliminated can be deleted.
  • the metadata set further includes second usage information, and the second usage information records the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set.
  • the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching that sketch timeline identifier.
  • the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers can also be determined based on the second usage information; and the sketch matching the sketch timeline identifier to be eliminated is deleted.
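One plausible elimination policy based on the recorded usage times is least-recently-used eviction; the excerpt only says elimination is determined from usage information, so the specific policy below is an assumption:

```python
def evict_stale(last_used, keep):
    """last_used: {sketch_id: last usage time}. Keep the `keep` most recently
    used sketches and return (kept, evicted) id sets -- one plausible reading
    of determining the sketch to be eliminated from its usage time."""
    ranked = sorted(last_used, key=last_used.get, reverse=True)
    return set(ranked[:keep]), set(ranked[keep:])
```

The same policy can be applied one level up, to whole sketch timeline identifiers, using the second usage information.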
  • in a second aspect, a data point query device is provided, which has the function of implementing the behavior of the data point query method in the first aspect.
  • the data point query device includes at least one module, and the at least one module is used to implement the data point query method provided in the first aspect.
  • in a third aspect, a computing device cluster is provided, which includes at least one computing device, and each computing device includes a processor and a memory; the processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the data point query method provided in the first aspect.
  • in a fourth aspect, a computer-readable storage medium is provided, which stores instructions that, when run on a computer, cause the computer to execute the data point query method described in the first aspect.
  • in a fifth aspect, a computer program product containing instructions is provided; when the instructions are run on a computer, they cause the computer to execute the data point query method described in the first aspect.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application
  • Figure 2 is a schematic diagram of the curve change trend of the first scale function S1(q) and the derivative of S1(q) provided by an embodiment of the present application;
  • Figure 3 is a schematic diagram of the curve change trend of a second scale function S2(q) and the derivative of S2(q) provided by an embodiment of the present application;
  • Figure 4 is a schematic diagram of a query process for querying data values based on target sketches and target quantiles provided by an embodiment of the present application
  • Figure 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a query process for querying the standard quantile q of a target data point based on the target sketch and the data value Q of the target data point provided by the embodiment of the present application;
  • Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application.
  • Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application.
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application.
  • Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application.
  • Figure 12 is a flow chart of an incremental update method provided by an embodiment of the present application.
  • Figure 13 is a schematic diagram of managing metadata from the spatial and temporal dimensions provided by the embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a data point query device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 17 is a schematic diagram of a connection method between computing device clusters provided by an embodiment of the present application.
  • a quantile characterizes the position of a certain data point in a sequence of a large number of data points sorted by size. Compared with using extreme values (maximum and/or minimum values) to characterize a large number of data points, quantiles can shield false extreme-value information caused by abnormal data points, thereby representing the real information at each stage of a large number of data points. For this reason, for companies that provide Internet services, quantiles can serve as one of the important indicators for measuring a company's network operating status. In addition, quantile queries are also used in weather temperature trends, log mining, stock trend analysis, virtual currency volume and price indicators, financial data analysis and other fields.
  • in the exact quantile calculation technique, all data points need to be sorted, and then the quantile corresponding to each data point is calculated based on the position of each data point after sorting.
  • the value range of q is a real number between 0 and 1, that is, q ⁇ [0,1].
  • the time and space complexity of quantiles determined by this technique is O(NlogN), where N is the total number of full data points.
  • when the quantile of each data point is known, if the quantile of the data point to be queried is q, the item at the position corresponding to q among all the sorted data points is determined, and the result obtained is the data value of that data point, that is, the query result.
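The exact approach above can be sketched in a few lines; the nearest-rank rounding used here is one common indexing convention (an assumption, since the text does not fix one):

```python
def exact_quantile_value(data, q):
    """Exact quantile lookup: sort all N data points (the O(N log N) step
    described above) and index by the quantile q in [0, 1]."""
    s = sorted(data)
    idx = round(q * (len(s) - 1))  # nearest-rank style indexing
    return s[idx]
```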
  • the t-digest algorithm (an online clustering algorithm) is currently a commonly used algorithm in approximate quantile calculation technology.
  • the basic principle of this algorithm is to cluster all data points to obtain multiple clusters.
  • each cluster has a corresponding cluster mean and cluster weight.
  • the cluster mean indicates the average value of the data points aggregated into the corresponding cluster, and the cluster weight indicates the number of data points aggregated into the corresponding cluster.
  • the multiple clusters built in this way are often called a sketch.
  • the quantile of each cluster can be determined based on the cluster mean and cluster weight corresponding to each cluster in the sketch.
  • linear interpolation is used to calculate the approximate data value of the data point based on the quantile and cluster mean of each cluster in the sketch.
  • the accuracy and efficiency of queries in this algorithm can be adjusted by the number of clusters in the sketch.
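The linear-interpolation query described above can be sketched as follows. Placing each cluster's centroid at the midpoint of its weight range is a common simplification and is an assumption here:

```python
def query_value(sketch, q):
    """Approximate the data value at quantile q from a t-digest style sketch:
    a list of (mean, weight) pairs sorted by mean. Each cluster's centroid is
    placed at the midpoint of its weight range, then the value is linearly
    interpolated between neighboring centroids."""
    total = sum(w for _, w in sketch)
    target = q * total
    centers, cum = [], 0.0
    for mean, w in sketch:
        centers.append((cum + w / 2.0, mean))  # centroid position in weight units
        cum += w
    if target <= centers[0][0]:
        return centers[0][1]
    for (p0, m0), (p1, m1) in zip(centers, centers[1:]):
        if target <= p1:
            return m0 + (m1 - m0) * (target - p0) / (p1 - p0)
    return centers[-1][1]
```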
  • histograms can intuitively describe the data distribution characteristics of multiple data points. Therefore, histograms are widely used in the field of network monitoring and operation and maintenance.
  • the abscissa of the histogram represents the data value of the data points, and the ordinate represents the number of data points.
  • the histogram includes multiple bars, and each bar can be called a bucket. The height of each bucket represents the number of data points whose data values fall into the data value interval corresponding to that bucket.
  • histograms include equal-height histograms and equal-width histograms.
  • an equal-height histogram is a histogram in which the heights of the buckets are approximately equal.
  • an equal-width histogram is a histogram in which every bucket has the same width.
  • embodiments of this application provide a data point query method.
  • the method provided by the embodiments of the present application can achieve the following technical effects: first, high-precision querying of the quantiles of data points over the entire range; second, deletion of data points from sketches; and third, incremental updating, which avoids rebuilding the sketch for every query and thus avoids wasting resources.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application. As shown in Figure 1, the method includes the following steps 101 to 103.
  • Step 101: Based on the target quantile corresponding to the target data point to be queried, determine the target scale function from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions differs, and the target quantile indicates the position of the target data point among multiple data points sorted by size.
  • the scale function is used to control the density of each cluster in the sketch.
  • the density of the clusters in a sketch is related to the size of each cluster.
  • the size of a cluster indicates the number of data points aggregated into it. The larger the cluster, the more data points it aggregates; in that case the cluster mean of the cluster represents the data value of a large number of data points, the clusters in the sketch are relatively sparse, it is difficult to distinguish the data values of individual data points from the sketch, and the accuracy of the sketch is therefore lower.
  • the smaller the cluster, the fewer data points it aggregates; in that case the cluster mean of the cluster represents the data value of a small number of data points, the clusters in the sketch are denser, it is easier to distinguish the data value of each data point from the sketch, and the accuracy of the sketch is therefore higher.
  • the scale function can be used to control the accuracy of the sketch to improve the accuracy of subsequent queries.
  • in the method provided by the embodiment of the present application, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile.
  • when the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data point based on the sketch.
  • the multiple scale functions include a first scale function and a second scale function
  • the clusters in the sketch constructed based on the first scale function are denser on the first quantile interval than the clusters in the sketch constructed based on the second scale function are on the first quantile interval.
  • the clusters in the sketch constructed based on the first scale function are less dense on the second quantile interval than the clusters in the sketch constructed based on the second scale function are on the second quantile interval.
  • the implementation process of determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point in step 101 can be as follows: if the target quantile is located in the first quantile interval, the first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • the sketches constructed based on the first scale function have denser clusters on the first quantile interval
  • the sketches constructed based on the second scale function have denser clusters on the second quantile interval
  • the first quantile interval and the second quantile interval can be any intervals in the global quantile interval [0,1].
  • the union of the first quantile interval and the second quantile interval is the global quantile interval [0,1].
  • in this way, the method provided by the embodiment of the present application can accurately query the data point corresponding to any quantile in the global quantile interval [0,1], that is, high-precision querying over the entire range is achieved.
  • the first quantile interval includes the interval from 0 to x1, and the interval from x2 to 1, x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2 interval. That is, the first quantile interval is the interval near both ends of the global quantile interval [0,1], and the second quantile interval is the middle interval of the global quantile interval [0,1].
  • for example, x1 can be 0.2 and x2 can be 0.8.
  • in this case, the first quantile interval corresponding to the first scale function is [0,0.2] and [0.8,1],
  • and the second quantile interval corresponding to the second scale function is [0.2,0.8].
  • x1 and x2 can also be other real numbers in the global quantile interval [0,1]; the embodiments of this application do not enumerate them one by one here.
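With the example values x1 = 0.2 and x2 = 0.8, the adaptive selection reduces to an interval test. A minimal Python sketch, where the returned labels stand in for the patent's S1/S2 (whose formulas are not reproduced in this excerpt):

```python
def select_scale_function(q, x1=0.2, x2=0.8):
    """Pick the scale function for the target quantile q: the first scale
    function for the tail intervals [0, x1] and [x2, 1] (dense clusters at
    the tails), the second for the middle interval (x1, x2)."""
    assert 0.0 <= q <= 1.0
    return "S1" if q <= x1 or q >= x2 else "S2"
```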
  • the first scale function can be designed as the function shown in the following formula (1)
  • the second scale function can be designed as the function shown in the following formula (2):
  • q in formula (1) and formula (2) represents the quantile
• the hyperparameter in formula (1) and formula (2) indicates the number of clusters
  • S 1 (q) and S 2 (q) represent the first scale function and the second scale function respectively
  • the derivatives of S 1 (q) and S 2 (q) can characterize the density of clusters in the constructed sketch.
  • FIG. 2 is a schematic diagram of the curve change trend of the derivative of the first scale function S 1 (q) and S 1 (q) provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of the curve change trend of the second scale function S 2 (q) and the derivative of S 2 (q) provided by the embodiment of the present application.
  • the first scale function S 1 (q) can be selected to construct the sketch.
  • the second scale function S 2 (q) can be selected to construct the sketch, to improve the accuracy of the constructed sketch, thereby Improve the accuracy of querying data points. That is to say, the embodiment of the present application provides a method for adaptively selecting a scale function to construct a sketch based on the query environment.
• different scale functions correspond to different cluster density levels in different intervals of the global quantile interval [0,1]; that is, these scale functions perform differently in different intervals of the global quantile interval [0,1], which is what makes it possible to adaptively select a scale function to construct a sketch based on the query environment, as provided in the embodiment of this application.
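The exact formulas (1) and (2) are given only as images in the source. As a hedged illustration, the following sketch uses two stand-in scale functions with the stated density properties: the first has a derivative that grows near q=0 and q=1 (dense clusters at both ends), the second has a derivative that peaks at q=0.5 (dense clusters in the middle). The function forms, the hyperparameter name `delta`, and the thresholds `x1`/`x2` are assumptions, not the patented formulas.

```python
import math

def s1(q: float, delta: float = 100.0) -> float:
    """Illustrative first scale function: its derivative ~ 1/sqrt(q(1-q))
    is largest near q=0 and q=1, so clusters are densest at both ends
    (hypothetical stand-in for formula (1))."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def s2(q: float, delta: float = 100.0) -> float:
    """Illustrative second scale function: its derivative ~ sin(pi*q)
    peaks at q=0.5, so clusters are densest in the middle interval
    (hypothetical stand-in for formula (2))."""
    return delta / 2 * (1 - math.cos(math.pi * q))

def pick_scale_function(q: float, x1: float = 0.2, x2: float = 0.8):
    """Adaptive selection described in the text: ends -> s1, middle -> s2."""
    return s1 if (q <= x1 or q >= x2) else s2
```

Finite differences of `s1` and `s2` confirm where each function packs its clusters, mirroring the curve trends described for Figures 2 and 3.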
  • Step 102 Construct a target sketch based on the target scale function and multiple data points.
  • the target sketch includes multiple clusters.
  • Each cluster includes a cluster mean and a cluster weight.
• the cluster mean indicates the mean of the data points clustered into the corresponding cluster.
• the cluster weight indicates the number of data points clustered into the corresponding cluster.
  • the implementation method of constructing the target sketch based on the target scale function and multiple data points may refer to the t-digest algorithm or other clustering methods, which is not limited in the embodiments of the present application.
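Since the embodiment leaves the construction method open (t-digest or other clustering), here is a minimal t-digest-style construction sketch under stated assumptions: data points are sorted, and a cluster is closed once it would span more than one unit of the scale function k. The simple linear scale function `linear_scale` is an assumption standing in for the patent's formulas.

```python
def linear_scale(q, delta=100.0):
    """Uniform scale function k(q) = delta*q (assumption: yields roughly
    equal-weight clusters; stands in for the patent's scale functions)."""
    return delta * q

def build_sketch(points, scale_fn, delta=100.0):
    """Minimal t-digest-style construction (illustrative, not the full
    patented procedure): sort the data points, then greedily grow a cluster
    until adding the next point would make it span more than one unit of
    the scale function; each cluster keeps (cluster_mean, cluster_weight)."""
    pts = sorted(points)
    n = len(pts)
    clusters = []                     # list of [cluster_mean, cluster_weight]
    w_sum = 0                         # weight committed to earlier clusters
    cur_sum, cur_w = 0.0, 0
    k_lo = scale_fn(0.0, delta)
    for x in pts:
        q_hi = (w_sum + cur_w + 1) / n    # quantile if x joins this cluster
        if cur_w > 0 and scale_fn(q_hi, delta) - k_lo > 1.0:
            clusters.append([cur_sum / cur_w, cur_w])   # close the cluster
            w_sum += cur_w
            cur_sum, cur_w = 0.0, 0
            k_lo = scale_fn(w_sum / n, delta)
        cur_sum += x
        cur_w += 1
    if cur_w:
        clusters.append([cur_sum / cur_w, cur_w])
    return clusters
```

With the linear scale and `delta=10`, one thousand evenly spread points compress to roughly ten equal-weight clusters, each storing a mean and a weight as the text describes.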
  • Step 103 Query target data points based on the target sketch.
  • the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. This is explained below in two application scenarios.
  • the first application scenario querying data values based on quantile
  • step 103 can be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • the target quantile is marked as q
  • q is a decimal between 0 and 1
• the query result obtained based on the target sketch and q is an approximate estimate of the corresponding element in the sorted sequence of all data points; the query result is the data value of the target data point.
  • C 1 weight is the cluster weight of the first cluster in the target sketch
  • C 1 value is the cluster mean of the first cluster in the target sketch
• the first cluster in the target sketch refers to the first cluster after the clusters are sorted in ascending order of cluster mean.
  • C m weight is the cluster weight of the last cluster in the target sketch
  • C m value is the cluster mean of the last cluster in the target sketch.
• the last cluster in the target sketch refers to the last cluster after the clusters are sorted in ascending order of cluster mean.
• W i represents the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster). W i can be expressed as follows:
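The exact interpolation formula is given only as an image in the source. The traversal with cumulative weights W_i can nevertheless be sketched as follows (a simplified reading that returns the mean of the cluster covering the target rank, rather than the patent's exact interpolation):

```python
def query_value(sketch, q):
    """Query the data value at target quantile q from a sketch given as a
    list of (cluster_mean, cluster_weight) pairs: sort clusters by mean,
    accumulate cluster weights W_i, and stop at the cluster whose
    cumulative weight covers rank q * N (simplified; no interpolation)."""
    clusters = sorted(sketch)              # ascending cluster mean
    total = sum(w for _, w in clusters)
    target_rank = q * total
    w_cum = 0                              # W_i: weights traversed so far
    for mean, w in clusters:
        w_cum += w
        if w_cum >= target_rank:
            return mean
    return clusters[-1][0]
```

For a sketch of four equal-weight clusters, q=0.3 lands in the second cluster, matching the cumulative-weight rule described above.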
  • Case 1 Query in response to a data point query request
  • a data point query request can also be received.
• the data point query request is used to query the data value of a target data point among multiple data points, and the data point query request carries the standard quantile of the target data point.
• in this case, the standard quantile carried in the data point query request is determined as the target quantile.
• the standard quantile can be a quantile input by the user; that is, when the user triggers the data point query request, the user also inputs a quantile, so that a specific data value can subsequently be queried, through the method provided by the embodiment of the application, based on the quantile input by the user.
  • the scale function can be adaptively selected according to the quantile input by the user, and a sketch can be constructed.
• the constructed sketch is relatively dense in the interval near the quantile input by the user, thereby improving the accuracy of the query results.
• an equal-height histogram query request, which is used to query an equal-height histogram constructed based on multiple data points; the request carries the number of buckets h.
• the way to determine the target quantile is: based on the number of buckets h and the total number of the multiple data points, determine the quantiles of the first bucket through the (h-1)-th bucket, from left to right, in the equal-height histogram, obtaining h-1 quantiles; use each of the h-1 quantiles as the target quantile and perform steps 101 to 103 to obtain h-1 data values that correspond one-to-one to the h-1 quantiles.
• an equal-height histogram can then be drawn based on the h-1 data values, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • the height of each bucket in the equal-height histogram is equal, which is the ratio of the total number N to the number of buckets h.
• the coordinates on the horizontal axis of the equal-height histogram increase from left to right.
  • the h buckets from left to right in the equal-height histogram are marked as the first bucket, the second bucket, ..., and the h-th bucket.
  • the implementation method of determining the quantile from the first bucket to the h-1th bucket in the equal-height histogram from left to right can be:
• the quantile of the i-th bucket can be expressed as i/h, where i is an integer greater than or equal to 1 and less than h.
  • each bucket in the equal-height histogram has a corresponding left boundary value and a right boundary value on the abscissa.
• the quantile of each bucket mentioned above specifically refers to the quantile corresponding to the right boundary value of the bucket. Therefore, the quantile corresponding to the h-th bucket is 1.
• the h-1 data values that correspond one-to-one to the h-1 quantiles serve as the left and right boundary values of the buckets, and the height of each bucket is the ratio between the total number and the number of buckets h.
  • FIG. 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application. As shown in Figure 5, the process of querying the equal height histogram includes the following steps:
  • the second application scenario querying quantiles based on data values
• in this scenario, the quantile of the data point is not known in advance; an estimated quantile is therefore used as the target quantile, and the scale function is adaptively selected to build the sketch.
  • the implementation of determining the target quantile may be: based on the data value of the target data point, and the data value of the largest data point and the data value of the smallest data point among the multiple data points, determine The estimated quantile of the target data point, using the estimated quantile as the target quantile.
• the estimated quantile of the target data point can be determined by the following formula:
  • Q is the data value of the target data point to be queried.
  • determining the estimated quantile of the target data point can also be implemented in other ways.
  • the embodiments of the present application do not limit this.
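The estimation formula itself appears only as an image in the source. One plausible realization, consistent with the stated inputs (the target value and the smallest and largest data values) but not necessarily the patent's exact formula, is a linear interpolation:

```python
def estimate_quantile(q_value, v_min, v_max):
    """Plausible linear estimate (assumption, not the source's exact
    formula): interpolate the target data value Q between the smallest
    and largest data values, clamped to [0, 1]."""
    if v_max == v_min:
        return 0.5
    est = (q_value - v_min) / (v_max - v_min)
    return min(1.0, max(0.0, est))
```

The estimate only needs to be good enough to pick a scale function; the standard quantile is then obtained from the constructed sketch.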
  • step 103 can be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • the quantiles obtained by the query are called standard quantiles.
  • the data value of the target data point is marked as Q
  • the standard quantile is marked as q
  • the query result obtained based on the target sketch and Q is q.
  • C 1 weight is the cluster weight of the first cluster in the target sketch
  • C 1 value is the cluster mean of the first cluster in the target sketch
• the first cluster in the target sketch refers to the first cluster after the clusters are sorted in ascending order of cluster mean.
  • C m weight is the cluster weight of the last cluster in the target sketch
  • C m value is the cluster mean of the last cluster in the target sketch.
• the last cluster in the target sketch refers to the last cluster after the clusters are sorted in ascending order of cluster mean.
• W i represents the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster). W i can be expressed as follows:
  • the queried standard quantile q can be obtained by the following formula:
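The formula referenced above is not reproduced in this text. As an illustration, a simplified traversal consistent with the surrounding description (an assumption, without the patent's exact interpolation) accumulates the weights W_i of clusters whose mean does not exceed Q and divides by the total weight:

```python
def query_quantile(sketch, q_value):
    """Query the standard quantile of a data value Q from a sketch of
    (cluster_mean, cluster_weight) pairs: accumulate the cluster weights
    of clusters whose mean does not exceed Q, then divide by the total
    weight (simplified reading, no interpolation between clusters)."""
    clusters = sorted(sketch)
    total = sum(w for _, w in clusters)
    w_cum = 0
    for mean, w in clusters:
        if mean > q_value:
            break
        w_cum += w
    return w_cum / total
```

For four equal-weight clusters with means 10..40, the value 25 sits above two of the four clusters, giving q = 0.5.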
• a quantile can be estimated based on the data value input by the user, the scale function can then be adaptively selected based on the estimated quantile, and a sketch can be constructed; the constructed sketch is relatively dense in the interval near the corresponding quantile, thereby improving the accuracy of the query results.
• the equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points, and the equal-width histogram query request carries a bucket boundary array.
  • the bucket boundary array includes n boundary values.
• the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals.
• each of the n boundary values is used as the data value of the target data point, and steps 101 to 103 are performed to obtain n standard quantiles corresponding one-to-one to the n boundary values.
• an equal-width histogram can be drawn based on the n standard quantiles that correspond one-to-one to the n boundary values.
  • the n boundary values in the bucket boundary array are arranged in order from small to large, and the n boundary values constitute an arithmetic sequence to achieve the equal width of each bucket in the equal-width histogram.
  • the coordinates on the horizontal axis in the equal-width histogram become larger from left to right.
• the n+1 buckets from left to right in the equal-width histogram are marked as the first bucket, the second bucket, ..., and the (n+1)-th bucket.
  • the left boundary value of the first bucket is the data value of the smallest data point among all the data points
• the left boundary value of the second bucket (that is, the right boundary value of the first bucket) is the first boundary value in the bucket boundary array.
  • the left boundary value of the third bucket (that is, the right boundary value of the second bucket) is the second boundary value in the bucket boundary array,..., and so on, the left boundary value of the n+1th bucket (that is, the right boundary value of the nth bucket) is the nth boundary value in the bucket boundary array, and the right boundary value of the n+1th bucket is the data value of the largest data point among all data points.
• the specific implementation process of drawing an equal-width histogram can be: after the quantile corresponding to each boundary value in the bucket boundary array is determined, the number of data points falling between two adjacent boundary values can be determined based on the total number and the quantile corresponding to each boundary value; based on these counts, the height of each bucket in the equal-width histogram can be obtained.
  • the specific implementation method will be explained in detail later.
  • FIG. 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application. As shown in Figure 7, the process of querying an equal-width histogram includes the following steps:
• each element in the array C obtained in this way is the height of a bucket; the height of a bucket represents the ratio between the number of data points whose data values fall within the boundaries of that bucket and the total number N.
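The height computation just described can be sketched directly: given the standard quantiles of the n boundary values (with the implicit quantiles 0 and 1 at the global minimum and maximum), the count between two adjacent boundaries is N times the quantile difference, and dividing by N yields the ratio stored in array C.

```python
def equal_width_heights(quantiles, total_n):
    """Heights of the equal-width histogram buckets: pad the n boundary
    quantiles with 0 and 1 (global min and max), compute the count per
    bucket as N * (q_right - q_left), then divide by N to get the ratio
    used as the bucket height (the array C in the text)."""
    qs = [0.0] + list(quantiles) + [1.0]
    counts = [total_n * (qs[i + 1] - qs[i]) for i in range(len(qs) - 1)]
    return [c / total_n for c in counts]
```

For boundary quantiles [0.25, 0.5, 0.75] over 100 points, the four buckets each receive height 0.25, and the heights sum to 1.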
  • the scale function can be adaptively selected according to the target quantile corresponding to the target data point to be queried, so as to improve the accuracy of the constructed target sketch near the target quantile, thereby improving the accuracy of the query results.
• This method of adaptively selecting scale functions can be applied in the scenario of querying data values based on quantiles, in the scenario of querying quantiles based on data values, in the scenario of querying equal-height histograms, and in the scenario of querying equal-width histograms. Therefore, the method provided by the embodiments of the present application can improve the accuracy of query results in various query scenarios.
  • the above embodiment is used to explain how to adaptively select a scale function to construct a target sketch.
• a method of inserting data points into, or deleting data points from, the target sketch is also provided to update the target sketch.
  • FIG. 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps 801 to 802.
  • Step 801 Generate a cluster to be updated corresponding to the data point to be updated in the cache.
• the cluster to be updated includes a cluster mean, a cluster weight and a cluster tag.
• the cluster mean of the cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data point to be updated.
  • Step 802 Update the target sketch based on the cluster to be updated.
  • a triplet may be used to represent a cluster.
  • This triplet can be expressed as ⁇ v, w, f>, where v represents the cluster mean of the cluster, w represents the cluster weight of the cluster, and f represents the cluster label of the cluster.
  • the cluster mark indicates whether the cluster is to be deleted or merged.
  • the data points in the cache are expressed as clusters to be updated in the form of triples as above. That is, the data point to be updated in the cache corresponds to the cluster to be updated.
  • the cluster to be updated includes the cluster mean, cluster weight and cluster mark.
  • the cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
• the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data point to be updated.
• the cluster tag of the cluster to be updated includes a to-be-merged tag and a to-be-deleted tag.
• a to-be-merged tag indicates that the corresponding cluster is a cluster to be merged into the target sketch.
• a to-be-deleted tag indicates that the corresponding cluster is a cluster to be deleted from the target sketch.
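The triplet ⟨v, w, f⟩ and the tag-based filtering described above can be sketched as follows. The tag values `"merge"`/`"delete"` and the helper name `split_pending` are illustrative assumptions; the patent only specifies that the tag distinguishes the two update types.

```python
from dataclasses import dataclass

TO_MERGE, TO_DELETE = "merge", "delete"   # illustrative tag values

@dataclass
class Cluster:
    """Triplet <v, w, f>: cluster mean, cluster weight, cluster tag."""
    v: float   # cluster mean: data value of the data point(s) to update
    w: int     # cluster weight: number of data points to update
    f: str     # cluster tag: update type (to-be-merged or to-be-deleted)

def split_pending(cache):
    """Filter the cached clusters-to-update by their cluster tag, yielding
    the clusters to merge into the target sketch and the clusters to
    delete from it."""
    to_merge = [c for c in cache if c.f == TO_MERGE]
    to_delete = [c for c in cache if c.f == TO_DELETE]
    return to_merge, to_delete
```

This separation is the entry point for both update cases: the to-merge list feeds the merge procedure and the to-delete list feeds the deletion procedure.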
  • the current update operation of the target sketch includes inserting data points into the target sketch or deleting data points from the target sketch. This is explained in two cases below.
• step 802 is implemented by: obtaining, from the clusters to be updated, the clusters whose cluster tag is the to-be-merged tag to obtain the clusters to be merged; and merging the clusters to be merged into the target sketch.
• the clusters to be merged can be filtered out from the cache based on the cluster tags, that is, the data points that need to be added are filtered out, and the clusters to be merged are then merged into the target sketch.
  • the implementation process of merging clusters to be merged into the target sketch may be: sorting the clusters in the target sketch and the clusters to be merged in order from small to large cluster mean values; for the first cluster after sorting , determine the quantile threshold based on the target scale function, traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in sequence:
  • For the i-th cluster based on the cluster weight of the i-th cluster, determine the current quantile of the i-th cluster, where i is an integer greater than 1; if the current quantile of the i-th cluster is lower than the quantile threshold, Then merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, based on the current quantile of the i-th cluster and the target The scale function updates the quantile threshold and traverses the next cluster.
  • the quantile threshold can indicate the limited capacity of the corresponding cluster.
  • k(q 0 ) represents the scale function
• the implementation method of determining the current quantile of the i-th cluster can be: determine the sum of the cluster weights of the clusters that have been traversed (including the i-th cluster), determine the sum of the cluster weights of all clusters after sorting, and use the ratio between the two sums as the current quantile of the i-th cluster.
  • the i-th cluster is merged into the previous cluster.
  • merging the i-th cluster into the previous cluster means updating the cluster weight and cluster mean of the previous cluster based on the cluster weight and cluster mean of the i-th cluster.
• the cluster mean of the i-th cluster and the cluster mean of the previous cluster are weighted by their respective cluster weights, and the resulting value is used as the updated cluster mean of the previous cluster; the cluster weight of the i-th cluster is added to the cluster weight of the previous cluster, and the resulting value is used as the updated cluster weight of the previous cluster.
  • the implementation method of updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function can also refer to the above-mentioned formula for determining the quantile threshold q threshold , which will not be described again here.
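The merge traversal described above can be sketched as follows. The quantile-threshold formula is given only as an image in the source; this sketch assumes a linear scale function k(q) = delta*q, for which k⁻¹(k(q₀)+1) = q₀ + 1/delta. Both that assumption and the merge-by-weighted-mean rule match the surrounding text, but the threshold arithmetic for the patent's actual scale functions would differ.

```python
def merge_into_sketch(sketch, to_merge, delta=10.0):
    """Merge clusters (mean, weight) into a sketch: sort all clusters by
    mean, then traverse from the second cluster; a cluster whose current
    quantile stays within the threshold is folded into the previous
    cluster, otherwise it starts a new cluster and the threshold is
    recomputed. Assumes a linear scale function (threshold = q + 1/delta)."""
    clusters = sorted(sketch + to_merge)          # ascending cluster mean
    total = sum(w for _, w in clusters)
    merged = [list(clusters[0])]
    w_cum = clusters[0][1]
    q_threshold = w_cum / total + 1.0 / delta     # threshold from 1st cluster
    for mean, w in clusters[1:]:
        w_cum += w
        q_cur = w_cum / total                     # current quantile
        prev = merged[-1]
        if q_cur <= q_threshold:
            # merge into previous: weighted mean, summed weight
            prev[0] = (prev[0] * prev[1] + mean * w) / (prev[1] + w)
            prev[1] += w
        else:
            merged.append([mean, w])
            q_threshold = q_cur + 1.0 / delta     # update the threshold
    return [tuple(c) for c in merged]
```

The total weight is preserved and cluster means stay sorted, which is what keeps the sketch queryable after an insert.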
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • the newly added data points are first placed in the cache (that is, the buffer), and the new data points in the cache are expressed in the form of triples to obtain the clusters to be merged.
  • the quantile threshold is recalculated based on the quantile of the current cluster and the next cluster is traversed.
• step 802 is implemented by: obtaining, from the clusters to be updated, the clusters whose cluster tag is the to-be-deleted tag to obtain the clusters to be deleted; and deleting the clusters to be deleted from the target sketch.
• the clusters to be deleted can be filtered out from the cache based on the cluster tags, that is, the data points that need to be deleted are filtered out, and the clusters to be deleted are then deleted from the target sketch.
• clusters with the same cluster mean among the clusters to be deleted can be merged; the cluster weight of a merged cluster is the sum of the cluster weights of the clusters before merging. The target sketch is then updated based on the merged clusters to be deleted.
• the implementation process of deleting clusters to be deleted from the target sketch may be: sort the clusters in the target sketch and the clusters to be deleted in ascending order of cluster mean; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster traversed: for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is the to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • updating the cluster weights of clusters adjacent to j clusters includes the following situations:
  • the cluster weight of the first cluster is subtracted from the cluster weight of the right adjacent cluster of the first cluster, and the value obtained is used as the updated cluster weight of the right adjacent cluster of the first cluster.
• if the cluster weight of the right adjacent cluster of the first cluster is less than the cluster weight of the first cluster, delete the right adjacent cluster of the first cluster, determine the difference between the cluster weight of the right adjacent cluster and the cluster weight of the first cluster, and update the cluster weight of the next right adjacent cluster based on that difference. If the difference is still greater than the cluster weight of the next right adjacent cluster, continue to update the cluster weight of the following right adjacent cluster in the same way, until the most recently obtained right adjacent cluster's cluster weight is greater than the last determined difference. This approach can be called recursively updating cluster weights to the right.
• in this case, the minimum value of the target sketch (that is, the data value of the smallest data point among all the data points in the target sketch) has changed and needs to be updated; for example, the cluster mean of the first cluster in the updated target sketch can be used as the minimum value of the target sketch.
  • the cluster weight of the last cluster is subtracted from the cluster weight of the left adjacent cluster of the last cluster, and the value obtained is used as the updated cluster weight of the left adjacent cluster of the last cluster.
• if the cluster weight of the left adjacent cluster of the last cluster is less than the cluster weight of the last cluster, delete the left adjacent cluster of the last cluster, determine the difference between the cluster weight of the left adjacent cluster and the cluster weight of the last cluster, and update the cluster weight of the next left adjacent cluster based on that difference. If the difference is still greater than the cluster weight of the next left adjacent cluster, continue to update the cluster weight of the following left adjacent cluster in the same way, until the most recently obtained left adjacent cluster's cluster weight is greater than the last determined difference. This approach can be called recursively updating cluster weights to the left.
• in this case, the maximum value of the target sketch (that is, the data value of the largest data point among all the data points in the target sketch) has changed and can be updated; for example, the cluster mean of the last cluster in the updated target sketch can be used as the maximum value of the target sketch.
  • Case 3 If the jth cluster is the middle cluster after sorting, the cluster weight of the left adjacent cluster and the cluster weight of the right adjacent cluster of the jth cluster need to be updated.
• the implementation process of updating the cluster weights of the clusters adjacent to the j-th cluster can be: obtain the cluster mean of the left adjacent cluster and the cluster mean of the right adjacent cluster of the j-th cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively; update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
• determining the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively can be achieved through the following formula:
  • d l represents the deletion weight corresponding to the left adjacent cluster
  • d r represents the deletion weight corresponding to the right adjacent cluster
  • w c represents the cluster weight of the jth cluster
  • v c represents the cluster mean of the jth cluster
• v l represents the cluster mean of the left adjacent cluster
• v r represents the cluster mean of the right adjacent cluster.
• updating the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster can be, for example: subtract the deletion weight corresponding to the left adjacent cluster from the cluster weight of the left adjacent cluster, and use the resulting value as the updated cluster weight of the left adjacent cluster.
• similarly, updating the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster can be: subtract the deletion weight corresponding to the right adjacent cluster from the cluster weight of the right adjacent cluster, and use the resulting value as the updated cluster weight of the right adjacent cluster.
• in addition, updating the cluster weight of the left adjacent cluster can also refer to the aforementioned leftward recursive update of cluster weights, and updating the cluster weight of the right adjacent cluster can refer to the aforementioned rightward recursive update of cluster weights; the explanation will not be repeated here.
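The deletion-weight formula above is given only as an image in the source. One plausible reading (an assumption, not the patent's exact formula) splits the deleted cluster's weight w_c between the two neighbours by linear interpolation over the cluster means, so that the two deletion weights always sum to w_c:

```python
def deletion_weights(w_c, v_c, v_l, v_r):
    """Split the weight w_c of a deleted middle cluster between its left
    and right adjacent clusters. Linear-interpolation split (assumption):
    w_c is divided in proportion to how close the deleted mean v_c lies
    to each neighbour's mean, with d_l + d_r == w_c."""
    span = v_r - v_l
    d_l = w_c * (v_r - v_c) / span   # weight removed from the left neighbour
    d_r = w_c * (v_c - v_l) / span   # weight removed from the right neighbour
    return d_l, d_r
```

Each neighbour's cluster weight is then reduced by its deletion weight, falling back to the recursive leftward/rightward updates when a neighbour's weight is exhausted.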
  • Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application. As shown in Figure 10,
• each data point to be deleted in the buffer is represented by the aforementioned triplet, that is, each data point to be deleted is expressed in the form of a cluster to construct the clusters to be deleted. The clusters to be deleted and the clusters in the target sketch are then sorted in ascending order of cluster mean.
• if the cluster weight of the right adjacent cluster is not sufficient to cover the cluster weight of the cluster to be deleted, the deletion continues through the above-mentioned method of recursively updating cluster weights to the right. If deleting the first cluster of the target sketch affects the minimum value of the target sketch, the minimum value of the target sketch needs to be updated based on the updated first cluster of the target sketch. After the cluster weight of the right adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the current cluster to be deleted is deleted, the first cluster is marked as the current cluster, and the backward traversal continues.
• if the current cluster is the last cluster, delete from the left adjacent cluster, that is, modify the cluster weight of the left adjacent cluster. If the cluster weight of the left adjacent cluster is not sufficient to cover the cluster weight of the current cluster, the deletion continues through the above-mentioned method of recursively updating cluster weights to the left. If deleting the last cluster of the target sketch affects the maximum value of the target sketch, the maximum value of the target sketch can be updated based on the last cluster of the updated target sketch. After the cluster weight of the left adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the cluster to be deleted is deleted and the deletion operation is completed.
• if the current cluster is located in a middle position, determine the deletion weight of the left adjacent cluster and the deletion weight of the right adjacent cluster of the current cluster, and then delete recursively to the left and to the right; that is, the cluster weight of the left adjacent cluster is updated based on the deletion weight of the left adjacent cluster, and the cluster weight of the right adjacent cluster is updated based on the deletion weight of the right adjacent cluster. Afterwards, the left adjacent cluster of the cluster to be deleted is marked as the current cluster, the cluster to be deleted is deleted, and the backward traversal continues.
• the data points to be updated in the cache can be expressed as clusters to be updated in the form of triplets; because the cluster tag in a cluster to be updated can indicate whether it is a cluster to be deleted or a cluster to be merged, data points to be inserted in the cache can be inserted into the target sketch, or data points to be deleted in the cache can be deleted from the target sketch, based on the cluster tags.
• the target sketch is temporarily constructed in the manner shown in Figure 1. This embodiment of the application further provides an incremental update method: when querying data points, a sketch is constructed based only on the newly added data points, and the constructed sketch is then aggregated with the existing sketches in the cache to obtain the target sketch, thus avoiding the waste of computing resources.
  • the data points stored in the time series database have corresponding timestamps.
• the timestamp of each data point can represent the collection time of the data point; therefore, the data points stored in the time series database have time series characteristics.
  • the data points stored in the time series database can usually include data points on different indicators, such as data points collected for temperature and data points collected for humidity, etc.
• the data points on each indicator are called the data points on a timeline. Based on this, the data points in the time series database include data points corresponding to multiple timelines, and each timeline represents an indicator.
  • the embodiments of the present application also provide an incremental update system.
• the incremental update system provided by the embodiments of the present application is explained here.
  • FIG 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application. As shown in Figure 11, the incremental update system includes the following components.
  • the single timeline component (seriesCusor), also known as the single timeline read data executor, is responsible for reading the original data points within the specified time range of a timeline in response to the query statement.
  • the single-timeline aggregation component also known as the single-timeline aggregation executor, is responsible for calculating the data points of the timeline according to a specific aggregation method and outputting the aggregation results.
  • the data points of the timeline are constructed as sketches, and the insertion and deletion operations of the sketches in the aforementioned embodiments can be implemented through this component.
  • the single-timeline sketch cache component also known as the single-timeline sketch cache executor, is responsible for caching the already built sketches.
  • the incremental update system as shown in Figure 11, also includes a data cache (CacheData) and a metadata cache (CacheMeta). These two caches are used to store the built sketches and the metadata of the sketches respectively.
  • the metadata of the sketches is used to index the sketches.
  • the multi-timeline sorting component (tagSetCursor), also known as the multi-timeline sorting and merging executor, is responsible for sorting the sketches that are aggregated based on the data points of multiple timelines according to the space and time dimensions to ensure the orderliness of the cached sketches.
  • the multi-timeline inter-group component, also called the multi-timeline inter-group executor, is responsible for aggregating the output results of multiple multi-timeline sorting components to achieve serial scheduling across different multi-timeline sorting components.
  • the logical concurrent component also known as the logical concurrent executor, serves as the smallest granular parallel scheduling unit and is responsible for the conversion of data structures and the assembly of metadata.
  • data structure conversion refers to converting the storage layer data structure into a query data structure to output query results.
  • Assembly of metadata is used to generate metadata for sketches.
  • the Aggregation Transformation component also known as the multi-timeline aggregation executor, is responsible for further aggregating the output results of components between multi-timeline groups, such as the merging of sketches.
  • Figure 12 is a flow chart of an incremental update method provided by an embodiment of the present application. As shown in Figure 12, the method includes the following steps 1201 to 1203.
  • Step 1201: Obtain a cached sketch that was built based on some of the multiple data points and the target scale function, to obtain a first sketch.
  • Step 1202: Construct a sketch based on the data points other than those some data points among the multiple data points and the target scale function, to obtain a second sketch.
  • Step 1203: Aggregate the first sketch and the second sketch to obtain the target sketch.
  • in this way, when the target data point needs to be queried, if some sketches have already been constructed in advance based on some data points and the target scale function, a sketch currently only needs to be constructed based on the other data points, and the target sketch can be obtained by merging the currently constructed sketch with the previously constructed sketches. This avoids building the target sketch from the full set of data points for each query, thereby saving computing resources.
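As an illustration of the incremental update described above, the following Python sketch merges a cached sketch with a sketch built only from newly added points. The (mean, weight) cluster representation and the helper names are assumptions for illustration; a real implementation would also compress clusters according to the target scale function.

```python
def build_sketch(points):
    """Build a minimal sketch: one (mean, weight) cluster per distinct value.

    Hypothetical simplification -- a real implementation would compress
    clusters according to the target scale function.
    """
    sketch = {}
    for p in points:
        sketch[p] = sketch.get(p, 0) + 1
    return sorted(sketch.items())

def merge_sketches(first, second):
    """Aggregate the cached (first) sketch with the newly built (second) one."""
    merged = {}
    for mean, weight in first + second:
        merged[mean] = merged.get(mean, 0) + weight
    return sorted(merged.items())

# Only the newly arrived data points are sketched; the cached sketch is reused.
cached = build_sketch([1, 2, 2, 3])               # built during an earlier query
target = merge_sketches(cached, build_sketch([3, 4]))
# target now covers all data points without re-reading the old ones
```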
  • the cached sketch built based on some of the data points among the multiple data points and the target scale function is obtained as follows.
  • the first sketch can be obtained by: obtaining the target time window to be queried, the target data point being a data point whose timestamp is located within the target time window; and obtaining a metadata set, which includes the metadata of multiple sketches in the cache.
  • the multiple sketches are sketches built based on the target scale function.
  • the metadata of each sketch includes the sketch time window and the sketch timeline identifier.
  • the sketch time window is the time window corresponding to the timestamp of the data point that constructs the corresponding sketch.
  • the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong; based on the target time window and the identifier of the timeline to which the target data point belongs,
  • the first metadata is determined from the metadata set;
  • the sketch time window in the first metadata is part or all of the target time window;
  • the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; the sketch corresponding to the first metadata is determined as the first sketch.
  • the target time window to be queried may be the time window carried in the query statement input by the user. For example, if the user inputs a query statement of "query the highest temperature in the last quarter", then the target time window is "last quarter".
  • the metadata set can be maintained by the metadata cache (CacheMeta) shown in Figure 11.
  • the metadata set stores the metadata of each cached sketch in the form of a list.
  • the implementation method of determining the first metadata from the metadata set can be: traverse each metadata in the metadata set; if the sketch timeline identifier of a certain metadata is the same as the identifier of the timeline to which the target data point belongs, and the sketch time window of that metadata is part or all of the target time window, then that metadata is determined to be the first metadata.
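A minimal sketch of this traversal, assuming each metadata entry is a dictionary with illustrative `sid`, `window`, and `sketch` fields (the field names are assumptions, not the embodiment's actual structure):

```python
def find_first_metadata(metadata_set, target_window, target_sid):
    """Traverse the metadata set; return entries whose sketch timeline
    identifier matches the target and whose sketch time window lies
    inside the target time window."""
    start, end = target_window
    hits = []
    for meta in metadata_set:
        w_start, w_end = meta["window"]
        if meta["sid"] == target_sid and start <= w_start and w_end <= end:
            hits.append(meta)
    return hits

metas = [
    {"sid": "SID1", "window": (0, 10), "sketch": "s1"},
    {"sid": "SID1", "window": (10, 20), "sketch": "s2"},
    {"sid": "SID2", "window": (0, 10), "sketch": "s3"},
]
first = find_first_metadata(metas, (0, 15), "SID1")
# only the (0, 10) sketch of SID1 qualifies: (10, 20) spills past the window
```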
  • as shown in Figure 13, each SID represents a timeline, each SID corresponds to multiple time windows (windows), and a corresponding sketch is cached for each time window.
  • the metadata of the metadata set can be stored in a key-value format.
  • the key is the data fragmentation identifier (SharId), where each SharId represents a time range (timerange). Therefore, the value corresponding to each SharId includes multiple metadata, the sketch time window in each metadata is within that time range, and the sketch timeline identifiers in these multiple metadata can be different timeline identifiers.
  • the value corresponding to SharId1 in Figure 13 includes metadata corresponding to SID1.
  • these metadata can be uniformly marked as SID1+timerange11, indicating that the timeline identifier in these metadata is SID1, and the time windows in these metadata are all within the time range timerange11 corresponding to SharId1.
  • the value corresponding to SharId1 also includes metadata corresponding to SID2. These metadata can be uniformly marked as SID2+timerange12, indicating that the timeline identifier in these metadata is SID2, and the time windows in these metadata are all within the time range timerange12 corresponding to SharId1.
  • the value corresponding to SharId1 also includes metadata corresponding to SID3. These metadata can be uniformly marked as SID3+timerange13, indicating that the timeline identifier in these metadata is SID3, and the time windows in these metadata are all within the time range timerange13 corresponding to SharId1.
  • the implementation method of determining the first metadata from the metadata set can be: determining the SharId that matches the target time window, where the time range represented by the matching SharId falls within the target time window; then querying, from the value corresponding to the matching SharId, the metadata whose sketch timeline identifier is the target timeline identifier; the metadata obtained in this way is the first metadata.
  • the implementation process of constructing a sketch based on the data points other than some data points among the multiple data points and the target scale function, to obtain the second sketch, is: obtaining the data points corresponding to a second time window among the multiple data points, where the second time window is the time window in the target time window other than a first time window, and the first time window is the part of the target time window that overlaps with the sketch time window in the first metadata; and constructing the second sketch based on the target scale function and the data points corresponding to the second time window.
  • the temporary construction sketch can be realized through the single timeline component and the single timeline aggregation component in Figure 11.
  • the metadata of the second sketch can also be determined to obtain second metadata; the second sketch can then be cached and the second metadata added to the metadata set, to enable updates to the metadata set.
  • the metadata set corresponds to the scale function.
  • Different metadata sets can be maintained, each metadata set only maintaining metadata for sketches built based on the corresponding scale function.
  • when a data point is written, the cached sketch needs to be invalidated to avoid inconsistency between the query results and the actual data.
  • the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match a third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.
  • that the timestamp of the data point to be written and the identifier of the timeline to which it belongs match the third metadata in the metadata set means: the timestamp of the data point to be written falls within the sketch time window of the third metadata, and the identifier of the timeline to which the data point to be written belongs is the same as the sketch timeline identifier of the third metadata.
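A hedged sketch of this write-path invalidation, using the same illustrative metadata layout as above (field names are assumptions, not the embodiment's actual structure):

```python
def invalidate_on_write(cache, metadata_set, ts, sid):
    """Drop any cached sketch whose sketch time window covers the incoming
    timestamp on the same timeline, so queries never see stale sketches."""
    kept = []
    for meta in metadata_set:
        w_start, w_end = meta["window"]
        if meta["sid"] == sid and w_start <= ts < w_end:
            cache.pop(meta["sketch"], None)   # matched third metadata: delete its sketch
        else:
            kept.append(meta)
    return kept                               # the updated metadata set

cache = {"s1": object(), "s2": object()}
metas = [{"sid": "SID1", "window": (0, 10), "sketch": "s1"},
         {"sid": "SID1", "window": (10, 20), "sketch": "s2"}]
metas = invalidate_on_write(cache, metas, 12, "SID1")
# the sketch for window (10, 20) is removed; the (0, 10) sketch survives
```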
  • the sketch elimination method provided by the embodiment of the present application can eliminate sketches from two aspects.
  • the first aspect is to eliminate some sketches among multiple sketches belonging to the same timeline, so as to eliminate sketches from the time dimension.
  • the second aspect is to eliminate the sketches of certain timelines among different timelines, so as to eliminate sketches from the spatial dimension.
  • the metadata set also includes first usage information corresponding to the sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches that match the sketch timeline identifier.
  • the elimination of sketches based on the time dimension can be implemented by: determining the sketches to be eliminated among the multiple sketches that match the sketch timeline identifier based on the first usage information, and deleting the sketches to be eliminated.
  • elimination can be carried out through the least recently used (LRU) elimination mechanism, that is, the less recently used sketches among the multiple sketches matching the sketch timeline identifier are deleted to save cache space.
  • the metadata set further includes second usage information.
  • the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the plurality of sketch timeline identifiers.
  • the usage information corresponding to each sketch timeline identifier indicates when the sketches that match the corresponding sketch timeline identifier were used.
  • the implementation method of eliminating sketches based on the spatial dimension can be: determining the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers based on the second usage information, and deleting the sketches that match the sketch timeline identifier to be eliminated.
  • this elimination can also be performed through the LRU elimination mechanism, that is, among the sketch timeline identifiers, the sketches corresponding to the least recently used sketch timeline identifiers are deleted to save cache space.
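Both elimination directions can be realized with a standard LRU structure. The following illustrative Python class keys entries by a (timeline identifier, time window) pair; the class name and fixed-capacity policy are assumptions for illustration:

```python
from collections import OrderedDict

class SketchLRU:
    """Least-recently-used eviction for cached sketches (illustrative).

    Keys are (timeline id, time window) pairs, so eviction can act within
    one timeline (time dimension) or across timelines (space dimension),
    depending on how usage information is recorded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def use(self, key, sketch):
        # Record a use: move the entry to the most-recently-used end.
        self.entries[key] = sketch
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used

lru = SketchLRU(capacity=2)
lru.use(("SID1", (0, 10)), "sketch-a")
lru.use(("SID2", (0, 10)), "sketch-b")
lru.use(("SID1", (0, 10)), "sketch-a")         # refresh SID1's entry
lru.use(("SID3", (0, 10)), "sketch-c")         # evicts SID2's sketch
```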
  • the embodiments of the present application provide an incremental update system and an incremental update method, which can eliminate the need to build a target sketch based on a full amount of data points every time a data point is queried, thereby saving computing resources.
  • An embodiment of the present application also provides a data point query device.
  • the device 1400 includes the following modules.
  • the first determination module 1401 is used to determine the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, where different scale functions among the multiple scale functions construct sketches with different cluster densities, and the target quantile indicates the position of the target data point among multiple data points sorted by size. For the specific implementation method, reference can be made to step 101 in the embodiment of Figure 1 .
  • the construction module 1402 is used to construct a target sketch based on the target scale function and multiple data points.
  • the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster.
  • for the specific implementation method, reference can be made to step 102 in the embodiment of Figure 1 .
  • the query module 1403 is used to query target data points based on the target sketch. For the specific implementation, please refer to step 103 in the embodiment of Figure 1 .
  • the multiple scale functions include a first scale function and a second scale function.
  • the clusters in the sketch constructed based on the first scale function are denser on the first quantile interval than the clusters in the sketch constructed based on the second scale function are on the first quantile interval, and
  • the clusters in the sketch constructed based on the second scale function are denser on the second quantile interval than the clusters in the sketch constructed based on the first scale function are on the second quantile interval;
  • the first determination module 1401 is used for:
  • if the target quantile is within the first quantile interval, the first scale function is determined as the target scale function;
  • if the target quantile is within the second quantile interval, the second scale function is determined as the target scale function.
  • the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, both x1 and x2 are greater than 0 and less than 1, and x1 is less than x2;
  • the second quantile interval includes the interval from x1 to x2.
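A minimal sketch of this selection rule, where `x1` and `x2` are the interval endpoints described above and the returned labels merely stand in for the two scale functions:

```python
def choose_scale_function(q, x1, x2):
    """Pick the scale function for target quantile q (0 < q < 1).

    First scale function: denser clusters on the tail intervals
    [0, x1] and [x2, 1]; second: denser on the middle interval [x1, x2].
    """
    if q <= x1 or q >= x2:
        return "first"    # tail quantile -> tail-accurate scale function
    return "second"       # middle quantile -> middle-accurate scale function

# e.g. a p99 query uses the tail-accurate function, a median query the other
choose_scale_function(0.99, x1=0.1, x2=0.9)   # → "first"
choose_scale_function(0.50, x1=0.1, x2=0.9)   # → "second"
```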
  • the query module 1403 is used to:
  • the device 1400 also includes:
  • the receiving module is used to receive a data point query request.
  • the data point query request is used to query the data value of a target data point among multiple data points.
  • the data point query request carries the standard quantile of the target data point;
  • the first determination module is also used to determine the standard quantile carried in the data point query request as the target quantile.
  • the device 1400 also includes:
  • the receiving module is used to receive the equal-height histogram query request.
  • the equal-height histogram query request is used to query the equal-height histogram constructed based on multiple data points.
  • the equal-height histogram query request carries the number of buckets h, and h is an integer greater than 1;
  • the first determination module is also used to determine, based on the number of buckets h and the total number of the multiple data points, the quantiles of the first bucket to the (h-1)th bucket from left to right in the equal-height histogram, to obtain h-1 quantiles;
  • the query module is also used to take each quantile among the h-1 quantiles as a target quantile, and execute the step of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried, and subsequent steps.
  • the apparatus 1400 further includes a drawing module configured to draw an equal-height histogram based on the h-1 data values, the data value of the largest data point, and the data value of the smallest data point among the plurality of data points.
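The h-1 quantiles for an equal-height histogram follow directly from the bucket count: the boundary between bucket i and bucket i+1 sits at quantile i/h. A one-line illustration:

```python
def equal_height_quantiles(h):
    """For an equal-height histogram with h buckets, the bucket boundaries
    sit at the i/h quantiles, i = 1 .. h-1."""
    return [i / h for i in range(1, h)]

# With h = 4 buckets, the sketch is queried at quantiles 0.25, 0.5 and 0.75;
# the histogram is then drawn from those h-1 data values plus the overall
# minimum and maximum data values.
equal_height_quantiles(4)   # → [0.25, 0.5, 0.75]
```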
  • the first determination module is also used to:
  • the query module is used for:
  • the device 1400 also includes:
  • the receiving module is used to receive a quantile query request.
  • the quantile query request is used to query the standard quantile of a target data point among multiple data points.
  • the quantile query request carries the data value of the target data point.
  • the device 1400 also includes:
  • the receiving module is used to receive an equal-width histogram query request.
  • the equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points.
  • the equal-width histogram query request carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals;
  • the query module is used to take each of the n boundary values as the data value of the target data point, and execute the steps performed based on the data value of the target data point, as well as the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • the device also includes a drawing module for drawing an equal-width histogram based on n standard quantiles that correspond one-to-one to the n boundary values.
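As a hedged illustration of the equal-width path, the per-bucket counts can be recovered from the standard quantiles of the n boundary values; `quantile_of` below stands in for the sketch query that maps a data value to its quantile, and the uniform-data example is purely illustrative:

```python
def equal_width_counts(boundaries, quantile_of, total):
    """Given the n bucket boundary values, estimate how many data points
    fall in each of the n+1 intervals from the standard quantile of each
    boundary value."""
    qs = [0.0] + [quantile_of(b) for b in boundaries] + [1.0]
    # Count in interval i = (quantile difference across the interval) * total
    return [round((qs[i + 1] - qs[i]) * total) for i in range(len(qs) - 1)]

# Toy stand-in: data uniform on [0, 100], so quantile(v) = v / 100.
counts = equal_width_counts([25, 50, 75], lambda v: v / 100, total=200)
# → [50, 50, 50, 50]
```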
  • the device 1400 also includes:
  • the generation module is used to generate clusters to be updated corresponding to the data points to be updated in the cache.
  • the clusters to be updated include cluster means, cluster weights and cluster tags.
  • the cluster mean of the cluster to be updated indicates the data value of the data point to be updated;
  • the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data point to be updated;
  • the update module is used to update the target sketch based on the cluster to be updated.
  • the update module is used to:
  • the update module is used to:
  • for the first cluster after sorting, determine the quantile threshold based on the target scale function; traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in sequence:
  • for the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1;
  • the quantile threshold is updated based on the current quantile of the i-th cluster and the target scale function, and the next cluster is traversed.
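The traversal above resembles a single-pass cluster compression. In the following illustrative Python version, a fixed quantile step stands in for the threshold that the target scale function would actually supply:

```python
def compress(clusters, total, max_q_step=0.2):
    """Compress sorted (mean, weight) clusters in one pass.

    A cluster is merged into its left neighbour while its running
    quantile stays at or below the current threshold; otherwise the
    cluster is kept and the threshold advances. A real implementation
    derives the threshold from the target scale function instead of the
    fixed step used here.
    """
    out = [list(clusters[0])]
    seen = clusters[0][1]
    threshold = max_q_step              # threshold set from the first cluster
    for mean, weight in clusters[1:]:
        q = (seen + weight) / total     # current quantile of this cluster
        if q <= threshold:
            # merge into the previous cluster: weighted mean, summed weight
            m, w = out[-1]
            out[-1] = [(m * w + mean * weight) / (w + weight), w + weight]
        else:
            out.append([mean, weight])
            threshold = q + max_q_step  # update the threshold, move on
        seen += weight
    return out

compress([(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)], total=5)
# → [[1, 1], [2.5, 2], [4.5, 2]]
```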
  • the update module is used to:
  • the update module is used to:
  • for the jth cluster, determine the cluster mark of the jth cluster; if the cluster mark of the jth cluster is a to-be-deleted mark, delete the jth cluster and update the cluster weights of the clusters adjacent to the jth cluster, where j is an integer greater than or equal to 1.
  • the update module is used to:
  • if the jth cluster is a middle cluster after sorting, obtain the cluster mean of the left adjacent cluster of the jth cluster and the cluster mean of the right adjacent cluster of the jth cluster;
  • the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster are determined respectively;
  • the cluster weight of the left adjacent cluster is updated based on the deletion weight corresponding to the left adjacent cluster, and the cluster weight of the right adjacent cluster is updated based on the deletion weight corresponding to the right adjacent cluster.
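A minimal illustration of deleting a middle cluster and pushing its weight onto both neighbours; the inverse-distance split rule used here is an assumption, since the embodiment only specifies that both adjacent cluster weights are updated:

```python
def delete_middle_cluster(clusters, j):
    """Remove the j-th (middle) cluster from a sorted list of
    (mean, weight) tuples and distribute its weight to the two adjacent
    clusters, split in inverse proportion to their distance from the
    deleted mean (assumed split rule)."""
    mean, weight = clusters[j]
    (lm, lw), (rm, rw) = clusters[j - 1], clusters[j + 1]
    span = rm - lm
    left_share = weight * (rm - mean) / span   # the closer neighbour gets more
    right_share = weight - left_share
    clusters[j - 1] = (lm, lw + left_share)
    clusters[j + 1] = (rm, rw + right_share)
    del clusters[j]
    return clusters

cs = [(1.0, 4), (2.0, 2), (5.0, 4)]
delete_middle_cluster(cs, 1)
# the deleted weight 2 is split 1.5 / 0.5 between the neighbours
```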
  • the construction module is used to:
  • the construction module is used to:
  • the target data points are data points whose timestamps are within the target time window;
  • the metadata set includes the metadata of multiple sketches in the cache.
  • the multiple sketches are sketches built based on the target scale function.
  • the metadata of each sketch includes the sketch time window and the sketch timeline identifier.
  • the sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong.
  • the first metadata is determined from the metadata set, the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs;
  • the sketch corresponding to the first metadata is determined as the first sketch.
  • the device 1400 also includes:
  • the second determination module is used to determine the metadata of the second sketch and obtain the second metadata
  • a cache module that caches the second sketch and adds the second metadata to the metadata set.
  • the device 1400 also includes:
  • the third determination module is used to determine the timestamp of the data point to be written and the identification of the timeline to which the data point to be written belongs;
  • the first deletion module is used to delete the sketch corresponding to the third metadata and update the metadata set if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match the third metadata in the metadata set.
  • the metadata set further includes first usage information corresponding to any sketch timeline identification, and the first usage information is used to record the usage time of each of the multiple sketches matching any sketch timeline identification;
  • Device 1400 also includes:
  • the second deletion module is configured to determine, based on the first usage information, the sketches to be eliminated among the plurality of sketches that match any sketch timeline identifier, and delete the sketches to be eliminated.
  • the metadata set also includes second usage information.
  • the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the multiple sketch timeline identifiers in the metadata set.
  • the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier; the device 1400 further includes:
  • the third deletion module is configured to determine the sketch timeline identifier to be eliminated among the plurality of sketch timeline identifiers based on the second usage information; and delete the sketch that matches the sketch timeline identifier to be eliminated.
  • the first determination module 1401, the construction module 1402, the query module 1403 and other modules can all be implemented by software, or can be implemented by hardware.
  • the implementation of the first determination module 1401 is introduced below, taking the first determination module 1401 as an example.
  • the implementation of the building module 1402, the query module 1403 and other modules can refer to the implementation of the first determination module 1401.
  • the first determination module 1401 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the above computing instance may be one or more.
  • the first determination module 1401 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. Furthermore, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Usually, a region can include multiple AZs.
  • the multiple hosts/VMs/containers used to run the code can be distributed in the same virtual private cloud (VPC), or across multiple VPCs.
  • communication between two VPCs in the same region, as well as cross-region communication between VPCs in different regions, requires a communication gateway in each VPC, and the interconnection between VPCs is realized through the communication gateways.
  • the first determination module 1401 may include at least one computing device, such as a server.
  • the first determination module 1401 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same region or in different regions.
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same AZ or in different AZs.
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same VPC, or may be distributed across multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the first determination module 1401 can be used to perform any step in the data point query method;
  • the construction module 1402 can be used to perform any step in the data point query method;
  • the query module 1403 can be used to perform any step in the data point query method. The steps that the first determination module 1401, the construction module 1402, and the query module 1403 are responsible for implementing can be specified as needed, so that the first determination module 1401, the construction module 1402, and the query module 1403 respectively implement different steps in the data point query method, thereby realizing all functions of the data point query device.
  • computing device 1500 includes: bus 1502, processor 1504, memory 1506, and communication interface 1508.
  • the processor 1504, the memory 1506 and the communication interface 1508 communicate through a bus 1502.
  • Computing device 1500 may be a server or terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1500.
  • the bus 1502 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one line is used in Figure 15, but this does not mean that there is only one bus or one type of bus.
  • bus 1502 may include a path that carries information between various components of computing device 1500 (e.g., memory 1506, processor 1504, communication interface 1508).
  • the processor 1504 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • Memory 1506 may include volatile memory, such as random access memory (RAM).
  • the memory 1506 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a mechanical hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 1506 stores executable program code, and the processor 1504 executes the executable program code to respectively realize the functions of the aforementioned first determination module, construction module, query module and other modules, thereby realizing the data point query method provided by the embodiments of this application.
  • that is, the memory 1506 stores instructions for executing the data point query method provided by the embodiments of the present application.
  • the communication interface 1508 uses a transceiver module such as, but not limited to, a network interface card or transceiver to implement communication between the computing device 1500 and other devices or communication networks.
  • An embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • the computing device cluster includes at least one computing device 1500.
  • the memory 1506 in one or more computing devices 1500 in the computing device cluster may store the same instructions for executing the data point query method provided by the embodiment of the present application.
  • the memory 1506 of one or more computing devices 1500 in the computing device cluster may also store part of the instructions for executing the data point query method provided by the embodiment of the present application.
  • in other words, a combination of one or more computing devices 1500 can jointly execute the instructions for executing the data point query method provided by the embodiments of the present application.
  • the memories 1506 in different computing devices 1500 in the computing device cluster can store different instructions, respectively used to execute part of the functions of the data point query device. That is, instructions stored in the memory 1506 in different computing devices 1500 may implement the functions of one or more modules among the first determination module, the construction module, and the query module.
  • one or more computing devices in a cluster of computing devices may be connected through a network.
  • the network may be a wide area network or a local area network, etc.
  • Figure 17 shows a possible implementation. As shown in Figure 17, two computing devices 1500A and 1500B are connected through a network. Specifically, the connection to the network is made through a communication interface in each computing device.
  • the memory 1506 in the computing device 1500A stores instructions for performing the functions of the first determining module and the building module. At the same time, instructions for performing the functions of the query module are stored in memory 1506 in computing device 1500B.
  • the connection method between the computing devices in the cluster shown in Figure 17 may take into account that the data point query method provided by the embodiment of the present application requires a large amount of computation on data; therefore, the functions implemented by the first determination module and the construction module are handed over to the computing device 1500A for execution.
  • it should be understood that the functions of the computing device 1500A shown in FIG. 17 may also be performed by multiple computing devices 1500.
  • similarly, the functions of the computing device 1500B may also be performed by multiple computing devices 1500.
  • the embodiment of the present application also provides another computing device cluster.
  • the connection relationship between the computing devices in the computing device cluster can be similar to the connection method of the computing device cluster described in FIG. 16 and FIG. 17 .
  • the difference is that the memories 1506 in one or more computing devices 1500 in the computing device cluster may store the same instructions for executing the data point query method provided by the embodiments of the present application.
  • alternatively, the memories 1506 of one or more computing devices 1500 in the computing device cluster may each store part of the instructions for executing the data point query method provided by the embodiments of the present application.
  • in other words, a combination of one or more computing devices 1500 can jointly execute all of the instructions for the data point query method provided by the embodiments of the present application.
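When multiple computing devices 1500 jointly execute the method, one possible division of labor is for each device to summarize its local data points as (centroid, count) clusters and for a single device to merge the partial summaries. The following Python sketch illustrates such a merge with a simple greedy neighbor-combining rule; the merge rule, the `max_clusters` bound, and the function name are illustrative assumptions, not details specified by the application:

```python
def merge_sketches(sketches, max_clusters=8):
    """Merge per-device (centroid, count) summaries by sorting all
    clusters by centroid and greedily combining the adjacent pair with
    the smallest combined count until the size bound is met."""
    merged = sorted(c for sk in sketches for c in sk)
    while len(merged) > max_clusters:
        # pick the adjacent pair whose combined count is smallest
        i = min(range(len(merged) - 1),
                key=lambda j: merged[j][1] + merged[j + 1][1])
        (m1, n1), (m2, n2) = merged[i], merged[i + 1]
        # weighted centroid keeps the merged cluster's mean exact
        merged[i:i + 2] = [((m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2)]
    return merged
```

Because each cluster carries its count, the total number of summarized data points is preserved across the merge, which is what allows quantile positions to remain meaningful after partial sketches from several devices are combined.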
  • An embodiment of the present application also provides a computer program product containing instructions.
  • the computer program product may be software or a program product containing instructions that can run on a computing device or be stored in any available medium.
  • when the computer program product is run on at least one computing device, the at least one computing device is caused to execute the data point query method provided by the embodiments of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can access, or a data storage device, such as a data center, that contains one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the data point query method provided by embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to the field of cloud computing technology, and provide a data point query method and apparatus, a device cluster, a program product and a storage medium. The method comprises: determining a target scale function among a plurality of scale functions on the basis of a target quantile corresponding to a target data point to be queried; constructing a target sketch on the basis of the target scale function and a plurality of data points; and querying the target data point on the basis of the target sketch. Since the cluster densities in sketches constructed on the basis of different scale functions differ, in the embodiments of the present application the target scale function can be adaptively selected on the basis of the target quantile corresponding to the target data point to be queried, so that the sketch constructed on the basis of the target scale function has dense clusters near the target quantile. When the clusters of a sketch are dense, the clusters in the sketch can more accurately represent the data point characteristics of the groups obtained by clustering, so that the accuracy of querying the target data point on the basis of the sketch is improved.
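The three steps in the abstract (select a scale function from the target quantile, build a sketch, query it) can be sketched in Python. The scale functions `k0`/`k1`, the `delta` parameter, and the per-cluster size bound of one k-unit follow the t-digest literature cited in this document; the specific functions and the selection heuristic are illustrative assumptions, since the application only requires choosing among "a plurality of scale functions":

```python
import math

def k0(q, delta=10):
    # flat scale function: uniform cluster sizes across all quantiles
    return delta * q / 2

def k1(q, delta=10):
    # arcsine scale function: changes fastest near q=0 and q=1,
    # so clusters are small (dense) at the tails
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def pick_scale_function(target_q):
    # adaptive choice: tail queries get the tail-dense k1, mid-range
    # queries the uniform k0 (illustrative heuristic, not from the text)
    return k0 if 0.25 <= target_q <= 0.75 else k1

def build_sketch(points, k, delta=10):
    """Group sorted data points into clusters spanning at most one
    unit in k-space, summarizing each as (centroid, count)."""
    points = sorted(points)
    n = len(points)
    clusters, current = [], []
    q_left = 0.0
    for i, x in enumerate(points):
        q_right = (i + 1) / n
        if current and k(q_right, delta) - k(q_left, delta) > 1:
            clusters.append(current)   # close the full cluster
            current = []
            q_left = i / n             # next cluster starts at quantile i/n
        current.append(x)
    if current:
        clusters.append(current)
    return [(sum(c) / len(c), len(c)) for c in clusters]
```

A quantile query near q=0.99 would therefore be served by a sketch built with `k1`, whose tail clusters are dense enough to represent the extreme data points accurately, while a median query could use the flatter `k0`.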
PCT/CN2023/086007 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product and storage medium WO2024016731A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210855232 2022-07-19
CN202210855232.X 2022-07-19
CN202211091505.4 2022-09-07
CN202211091505.4A CN117472975A (zh) 2022-07-19 2022-09-07 Data point query method, device, equipment cluster, program product and storage medium

Publications (1)

Publication Number Publication Date
WO2024016731A1 (fr)

Family

ID=89616930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086007 WO2024016731A1 (fr) 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product and storage medium

Country Status (1)

Country Link
WO (1) WO2024016731A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180088813A1 (en) * 2016-09-23 2018-03-29 Samsung Electronics Co., Ltd. Summarized data storage management system for streaming data
CN108388603A (zh) * 2018-02-05 2018-08-10 中国科学院信息工程研究所 基于Spark框架的分布式概要数据结构的构建方法及查询方法
US10248476B2 (en) * 2017-05-22 2019-04-02 Sas Institute Inc. Efficient computations and network communications in a distributed computing environment
CN110968835A (zh) * 2019-12-12 2020-04-07 清华大学 一种近似分位数计算方法及装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180088813A1 (en) * 2016-09-23 2018-03-29 Samsung Electronics Co., Ltd. Summarized data storage management system for streaming data
US10248476B2 (en) * 2017-05-22 2019-04-02 Sas Institute Inc. Efficient computations and network communications in a distributed computing environment
CN108388603A (zh) * 2018-02-05 2018-08-10 中国科学院信息工程研究所 基于Spark框架的分布式概要数据结构的构建方法及查询方法
CN110968835A (zh) * 2019-12-12 2020-04-07 清华大学 一种近似分位数计算方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAIDU GEEK SPEAKING: "A System and Method Based on Real-time Quantile Calculation", CSDN BLOG, 27 May 2021 (2021-05-27), XP093131717, Retrieved from the Internet <URL:https://blog.csdn.net/lihui49/article/details/117250392> [retrieved on 20240215] *
LOVE TO EAT CORIANDER AND SCALLION: "T-digest", CSDN BLOG, 20 July 2020 (2020-07-20), XP093131713, Retrieved from the Internet <URL:https://blog.csdn.net/qq_41648804/article/details/107474870> [retrieved on 20240215] *

Similar Documents

Publication Publication Date Title
US7603339B2 (en) Merging synopses to determine number of distinct values in large databases
US7636731B2 (en) Approximating a database statistic
CN114168608B Data processing system for updating a knowledge graph
CN111061758B Data storage method, device and storage medium
EP3379415A1 Memory and storage space management for a data operation
AU2020101071A4 A Parallel Association Mining Algorithm for Analyzing Passenger Travel Characteristics
Awad et al. Dynamic graphs on the GPU
CN112558869B Remote sensing image caching method based on big data
CN105045806A Dynamic splitting and maintenance method for summary data oriented to quantile queries
CN116756494B Data outlier processing method and apparatus, computer device and readable storage medium
CN108829343B Cache optimization method based on artificial intelligence
CN112925821A Parallel frequent itemset incremental data mining method based on MapReduce
WO2015168988A1 Data index creation method and device, and computer storage medium
CN113867627A Storage system performance optimization method and system
Beyer et al. Distinct-value synopses for multiset operations
WO2023009182A1 Chaining of bloom filters to estimate the number of low-frequency tones in a data set
Hershberger et al. Adaptive sampling for geometric problems over data streams
CN107846327A Method and apparatus for processing network management performance data
CN113704565B Learned spatiotemporal index method, apparatus and medium based on global interval error
WO2024016731A1 Data point query method and apparatus, device cluster, program product and storage medium
CN113544683B Data generalization device, data generalization method, and program
Wang et al. Stull: Unbiased online sampling for visual exploration of large spatiotemporal data
JP6006740B2 Index management device
CN117472975A Data point query method, device, equipment cluster, program product and storage medium
Nabil et al. Mining frequent itemsets from online data streams: Comparative study

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23841804

Country of ref document: EP

Kind code of ref document: A1