WO2024016731A1 - Data point query method and apparatus, device cluster, program product, and storage medium - Google Patents

Publication number
WO2024016731A1
WO2024016731A1 (PCT/CN2023/086007)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
target
sketch
quantile
data point
Prior art date
Application number
PCT/CN2023/086007
Other languages
French (fr)
Chinese (zh)
Inventor
刘超
叶冠宇
李云川
李仕林
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211091505.4A (published as CN117472975A)
Application filed by 华为云计算技术有限公司
Publication of WO2024016731A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Definitions

  • Embodiments of the present application relate to the field of cloud computing technology, and in particular to a data point query method, device, equipment cluster, program product and storage medium.
  • Data points refer to data collected by relevant devices in Internet of Things technology, such as temperatures collected by temperature sensing devices.
  • Data point query is used to query the characteristics of a certain data point in a batch of data points, for example, querying the quantile of a data point in the batch based on its data value, or querying the data value of a data point based on its quantile.
  • The quantile indicates the position of the data point in a batch of data points sorted by size.
  • Embodiments of the present application provide a data point query method, device, equipment cluster, program product and storage medium, which can efficiently and accurately query a certain data point from massive data points.
  • the technical solutions are as follows:
  • In a first aspect, a data point query method is provided.
  • Based on the target quantile corresponding to the target data point to be queried, a target scale function is determined from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size.
  • A target sketch is constructed based on the target scale function and the multiple data points. The target sketch includes multiple clusters, and each cluster includes a cluster mean and a cluster weight.
  • The cluster mean indicates the mean value of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. The target data point is then queried based on the target sketch.
  • In this way, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile. When the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, which improves the accuracy of querying the target data point based on the sketch.
  • The multiple scale functions include a first scale function and a second scale function.
  • The density of the clusters in the sketch constructed based on the first scale function on the first quantile interval is greater than the density of the clusters in the sketch constructed based on the second scale function on the first quantile interval.
  • The density of the clusters in the sketch constructed based on the first scale function on the second quantile interval is less than the density of the clusters in the sketch constructed based on the second scale function on the second quantile interval.
  • In this case, determining the target scale function from the multiple scale functions can be implemented as follows: if the target quantile is located in the first quantile interval, the first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • The sketch constructed based on the first scale function has denser clusters on the first quantile interval, and the sketch constructed based on the second scale function has denser clusters on the second quantile interval.
  • The first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2.
  • In this way, the method provided by the embodiments of the present application can accurately query the data point corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query over the entire range can be achieved.
  • In one implementation, querying the target data point based on the target sketch may be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • In the embodiments of the present application, the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. That is to say, the method provided by the embodiments of the present application is suitable for data point queries in various scenarios, which improves the flexibility of the method.
  • A data point query request may be received. The data point query request is used to query the data value of a target data point among multiple data points, and it carries the standard quantile of the target data point.
  • In this case, the standard quantile carried in the data point query request is determined as the target quantile, so that the target sketch can be constructed based on the target quantile and the data value of the target data point can then be queried. This improves the accuracy of the queried data value.
  • An equal-height histogram query request may also be received. The equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points, and it carries the number of buckets h, where h is an integer greater than 1.
  • Based on the number of buckets h and the total number of the multiple data points, the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, are determined to obtain h-1 quantiles.
  • Each of the h-1 quantiles is used in turn as the target quantile, and the operation of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried is performed, so as to obtain h-1 data values that correspond to the h-1 quantiles one-to-one.
  • In another implementation, querying the target data point based on the target sketch may be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • In this scenario the quantile of the target data point is not known in advance. Therefore, a quantile can be estimated based on the data value of the data point, the estimated quantile is used as the target quantile, and the scale function is adaptively selected to construct the sketch, which improves the accuracy of the standard quantile obtained by the subsequent query.
  • Specifically, the estimated quantile of the target data point is determined based on the data value of the target data point, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • A quantile query request may be received. The quantile query request is used to query the standard quantile of the target data point among the multiple data points, and it carries the data value of the target data point.
  • Querying the standard quantile of the target data point based on its data value can thus be applied in the scenario where a quantile query request is received, which improves the accuracy of the standard quantile queried in this scenario.
  • An equal-width histogram query request may also be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points, and it carries a bucket boundary array.
  • The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals.
  • Each of the n boundary values is used in turn as the data value of the target data point, and the operation of determining the estimated quantile of the target data point based on that data value, together with the data values of the largest and smallest data points among the multiple data points, is performed, so as to obtain n standard quantiles that correspond to the n boundary values one-to-one.
  • Querying the quantile of the target data point based on its data value can thus be applied in the scenario where an equal-width histogram query request is received, which improves the accuracy of the equal-width histogram queried in this scenario.
  • A cluster to be updated corresponding to a data point to be updated in the cache can also be generated. The cluster to be updated includes a cluster mean, a cluster weight and a cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster mark of the cluster to be updated indicates the update type of the data point to be updated. The target sketch is then updated based on the cluster to be updated.
  • In this way, the data points in the cache are expressed as clusters to be updated in the form of triplets, as mentioned above, which makes it convenient to subsequently update the target sketch based on the data points to be updated in the cache.
  • In one case, updating the target sketch may be implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-merged mark to obtain the clusters to be merged, and then merge the clusters to be merged into the target sketch.
  • In this way, the clusters to be merged, that is, the data points that need to be added, can be filtered out from the cache based on the cluster marks, and the clusters to be merged are then merged into the target sketch.
  • Merging the clusters to be merged into the target sketch may be implemented as follows: sort the clusters in the target sketch and the clusters to be merged in order of cluster mean from small to large; for the first cluster after sorting, determine a quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn. For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • In this way, the clusters to be updated that are to be merged can be added to the other clusters of the target sketch to update the target sketch.
  • In another case, updating the target sketch may be implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-deleted mark to obtain the clusters to be deleted, and then delete the clusters to be deleted from the target sketch.
  • In this way, the clusters to be deleted, that is, the data points that need to be deleted, can be filtered out from the cache based on the cluster marks, and the clusters to be deleted are then removed from the target sketch.
  • Deleting the clusters to be deleted from the target sketch can be implemented as follows: sort the clusters in the target sketch and the clusters to be deleted in order of cluster mean from small to large; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster in turn. For the j-th cluster, determine the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • In this way, the clusters to be deleted can be removed from the target sketch to update the target sketch.
  • Updating the cluster weights of the clusters adjacent to the j-th cluster can be implemented as follows: if the j-th cluster is an intermediate cluster after sorting, obtain the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster; then update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
  • Constructing the target sketch based on the target scale function and the multiple data points may be implemented as follows: obtain a cached sketch that was built based on some of the multiple data points and the target scale function, to obtain a first sketch; construct a sketch based on the data points other than those data points among the multiple data points and the target scale function, to obtain a second sketch; and aggregate the first sketch and the second sketch to obtain the target sketch.
  • In this way, when part of the data points already have a cached sketch, only the remaining data points need to be used to construct the current sketch, and the currently constructed sketch and the previously constructed sketch are merged to obtain the target sketch. This avoids building the target sketch from the full set of data points for every query, thereby saving computing resources.
  • Obtaining the cached sketch that was built based on some of the multiple data points and the target scale function, that is, obtaining the first sketch, may be implemented as follows: obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window, and obtain a metadata set, which includes the metadata of multiple sketches in the cache.
  • The multiple sketches are sketches built based on the target scale function.
  • The metadata includes a sketch time window and a sketch timeline identifier.
  • The sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points of the corresponding sketch belong. Based on the target time window and the timeline to which the target data point belongs, first metadata is determined from the metadata set; the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs. The sketch corresponding to the first metadata is determined as the first sketch.
  • In this way, the cached sketches can be managed through the metadata set, so that when a certain data point is queried, the cached sketches can be obtained based on the metadata set, which improves the efficiency of obtaining cached sketches.
  • After the second sketch is constructed based on the data points other than the partial data points among the multiple data points and the target scale function, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set.
  • In this way, the metadata set can also be updated based on the second sketch, so that subsequent query operations can be performed based on the updated metadata set.
  • In addition, the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined. If the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.
  • In this way, when new data points are written, the cached sketch that covers them is invalidated, so as to avoid inconsistency between the query results and the actual data.
  • The metadata set further includes first usage information corresponding to any sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches that match that sketch timeline identifier.
  • In this case, the sketch to be eliminated among the multiple sketches matching that sketch timeline identifier can be determined based on the first usage information, and the sketch to be eliminated is deleted.
  • The metadata set further includes second usage information, and the second usage information is used to record the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set.
  • The usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches that match the corresponding sketch timeline identifier.
  • In this case, the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers can be determined based on the second usage information, and the sketches matching the sketch timeline identifier to be eliminated are deleted; a minimal code sketch of this usage-based cache management is given below.
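  • A minimal Python sketch of this metadata bookkeeping follows; the class and field names (SketchMetadata, MetadataSet, last_used) are illustrative assumptions rather than terms from the application, and least-recently-used eviction is only one possible reading of selecting the sketch to be eliminated based on usage time.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SketchMetadata:
    timeline_id: str                  # identifier of the timeline the sketch's data points belong to
    time_window: Tuple[float, float]  # (start, end) of the timestamps used to build the sketch
    sketch_key: str                   # key under which the sketch itself is cached
    last_used: float = field(default_factory=time.time)

class MetadataSet:
    def __init__(self) -> None:
        self.entries: List[SketchMetadata] = []

    def find_cached(self, timeline_id: str, window: Tuple[float, float]) -> List[SketchMetadata]:
        """Return metadata whose timeline matches and whose sketch time window is
        part or all of the target time window, recording the usage time."""
        start, end = window
        hits = [m for m in self.entries
                if m.timeline_id == timeline_id
                and start <= m.time_window[0] and m.time_window[1] <= end]
        now = time.time()
        for m in hits:
            m.last_used = now
        return hits

    def evict_one(self, timeline_id: str) -> Optional[SketchMetadata]:
        """Pick the sketch to be eliminated for one timeline based on its usage time."""
        candidates = [m for m in self.entries if m.timeline_id == timeline_id]
        if not candidates:
            return None
        victim = min(candidates, key=lambda m: m.last_used)
        self.entries.remove(victim)
        return victim
```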
  • In a second aspect, a data point query apparatus is provided, which has the function of implementing the behavior of the data point query method in the first aspect.
  • The data point query apparatus includes at least one module, and the at least one module is used to implement the data point query method provided in the first aspect.
  • In a third aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the data point query method provided in the first aspect.
  • In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores instructions that, when run on a computer, cause the computer to execute the data point query method described in the first aspect.
  • In a fifth aspect, a computer program product containing instructions is provided that, when run on a computer, causes the computer to execute the data point query method described in the first aspect.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application
  • Figure 2 is a schematic diagram of the curve change trend of a first scale function S1(q) and of the derivative of S1(q) provided by an embodiment of the present application;
  • Figure 3 is a schematic diagram of the curve change trend of a second scale function S2(q) and of the derivative of S2(q) provided by an embodiment of the present application;
  • Figure 4 is a schematic diagram of a query process for querying data values based on target sketches and target quantiles provided by an embodiment of the present application
  • Figure 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a query process for querying the standard quantile q of a target data point based on the target sketch and the data value Q of the target data point provided by the embodiment of the present application;
  • Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application.
  • Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application.
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application.
  • Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application.
  • Figure 12 is a flow chart of an incremental update method provided by an embodiment of the present application.
  • Figure 13 is a schematic diagram of managing metadata from the spatial and temporal dimensions provided by the embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a data point query device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 17 is a schematic diagram of a connection method between computing device clusters provided by an embodiment of the present application.
  • A quantile is used to characterize the position of a certain data point in a sequence of a large number of data points sorted by size. Compared with using extreme values (maximum and/or minimum values) to characterize a large number of data points, quantiles can shield the false extreme-value information caused by abnormal data points, and can therefore represent the real information at each stage of a large number of data points. Based on this, for companies that provide Internet services, quantiles can serve as one of the important indicators for measuring a company's network operating status. In addition, quantile query is also used in weather temperature trends, log mining, stock trend analysis, virtual currency volume and price indicators, financial data analysis and other fields.
  • To compute quantiles exactly, all data points need to be sorted, and the quantile corresponding to each data point is then calculated based on the position of each sorted data point.
  • The value range of q is a real number between 0 and 1, that is, q ∈ [0,1].
  • The time and space complexity of determining quantiles with this technique is O(NlogN), where N is the total number of data points.
  • Once the quantile of each data point is known, if the quantile of the data point to be queried is q, the item at the position corresponding to q among all sorted data points is located, and the result obtained is the data value of that data point, that is, the query result; a minimal example of this exact computation is sketched below.
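  • The following Python sketch illustrates the exact (full-sort) baseline described above; the function name and the rounding convention at the boundary are illustrative assumptions, not part of the application.

```python
import math

def exact_quantile_value(data, q):
    """Sort all N points, then read off the element whose position corresponds
    to quantile q (q in [0, 1]); this is the O(N log N) baseline."""
    ordered = sorted(data)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# e.g. exact_quantile_value([5, 1, 9, 3, 7], 0.5) -> 5
```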
  • The t-digest algorithm (an online clustering algorithm) is currently a commonly used algorithm in approximate quantile calculation technology.
  • The basic principle of this algorithm is to cluster all data points to obtain multiple clusters.
  • Each cluster has a corresponding cluster mean and cluster weight.
  • The cluster mean indicates the mean value of the data points aggregated into the corresponding cluster, and the cluster weight indicates the number of data points aggregated into the corresponding cluster. The set of clusters built in this way is usually called a sketch.
  • The quantile of each cluster can be determined based on the cluster mean and cluster weight corresponding to each cluster in the sketch.
  • When querying, linear interpolation is used to calculate the approximate data value of a data point based on the quantile and cluster mean of each cluster in the sketch.
  • The accuracy and efficiency of queries in this algorithm can be adjusted through the number of clusters in the sketch.
  • Histograms can intuitively describe the data distribution characteristics of multiple data points, so histograms are widely used in the field of network monitoring and operation and maintenance.
  • The abscissa of a histogram represents the data value of the data points, and the ordinate represents the number of data points.
  • A histogram includes multiple bars, and each bar can be called a bucket. The height of each bucket represents the number of data points whose data values fall into the data value interval corresponding to that bucket.
  • Histograms include equal-height histograms and equal-width histograms.
  • An equal-height histogram is a histogram in which the heights of all buckets are close to each other.
  • An equal-width histogram is a histogram in which all buckets have the same width.
  • To this end, embodiments of this application provide a data point query method.
  • The method provided by the embodiments of the present application can achieve the following technical effects: first, high-precision query of data point quantiles over the entire range; second, support for deleting data points from the sketch; and third, incremental update, which avoids rebuilding the sketch for every query and thus avoids wasting resources.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application. As shown in Figure 1, the method includes the following steps 101 to 103.
  • Step 101: Based on the target quantile corresponding to the target data point to be queried, determine the target scale function from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size.
  • The scale function is used to control the density of the clusters in the sketch.
  • The density of the clusters in the sketch is related to the size of each cluster.
  • The size of a cluster indicates the number of data points aggregated into that cluster. The larger the cluster, the more data points it aggregates, and the cluster mean then represents the data value of a large number of data points.
  • In that case the clusters in the sketch are relatively sparse, making it difficult to distinguish the data values of individual data points from the sketch, so the accuracy of the sketch is lower. The smaller the cluster, the fewer data points it aggregates, and the cluster mean then represents the data value of a small number of data points.
  • In that case the clusters in the sketch are denser, making it easier to distinguish the data value of each data point from the sketch, so the accuracy of the sketch is higher.
  • Therefore, the scale function can be used to control the accuracy of the sketch so as to improve the accuracy of subsequent queries.
  • In the embodiments of the present application, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile. When the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data point based on the sketch.
  • The multiple scale functions include a first scale function and a second scale function.
  • The clusters in the sketch constructed based on the first scale function are denser on the first quantile interval than the clusters in the sketch constructed based on the second scale function are on the first quantile interval.
  • The clusters in the sketch constructed based on the first scale function are less dense on the second quantile interval than the clusters in the sketch constructed based on the second scale function are on the second quantile interval.
  • the implementation process of determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point in step 101 can be: if the target quantile is located in the first quantile interval, then The first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • the sketches constructed based on the first scale function have denser clusters on the first quantile interval
  • the sketches constructed based on the second scale function have denser clusters on the second quantile interval
  • The first quantile interval and the second quantile interval can be any intervals within the global quantile interval [0,1].
  • The union of the first quantile interval and the second quantile interval is the global quantile interval [0,1].
  • In this way, the method provided by the embodiments of the present application can accurately query the data point corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query over the entire range is achieved.
  • The first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2. That is, the first quantile interval covers the two ends of the global quantile interval [0,1], and the second quantile interval is the middle of the global quantile interval [0,1].
  • For example, x1 can be 0.2 and x2 can be 0.8.
  • In this case, the first quantile interval corresponding to the first scale function is [0,0.2] and [0.8,1], and the second quantile interval corresponding to the second scale function is [0.2,0.8].
  • Of course, x1 and x2 can also be other real numbers in the global quantile interval [0,1]; the embodiments of this application do not list the examples one by one here.
  • For example, the first scale function can be designed as the function shown in the following formula (1), and the second scale function can be designed as the function shown in the following formula (2):
  • q in formula (1) and formula (2) represents the quantile, the hyperparameter in the formulas indicates the number of clusters, and S1(q) and S2(q) represent the first scale function and the second scale function respectively.
  • The derivatives of S1(q) and S2(q) characterize the density of the clusters in the constructed sketch.
  • FIG. 2 is a schematic diagram of the curve change trend of the first scale function S1(q) and the derivative of S1(q) provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the curve change trend of the second scale function S2(q) and the derivative of S2(q) provided by an embodiment of the present application.
  • When the target quantile falls in the first quantile interval, the first scale function S1(q) can be selected to construct the sketch; when the target quantile falls in the second quantile interval, the second scale function S2(q) can be selected to construct the sketch, which improves the accuracy of the constructed sketch and thereby improves the accuracy of querying data points. That is to say, the embodiments of the present application provide a method for adaptively selecting the scale function to construct the sketch based on the query environment.
  • In other words, different scale functions correspond to different cluster densities on different intervals of the global quantile interval [0,1], that is, these scale functions perform differently on different intervals of the global quantile interval [0,1], which makes possible the adaptive selection of the scale function for constructing the sketch based on the query environment provided by the embodiments of this application; a minimal code sketch of this selection is given below.
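  • To make the adaptive selection in step 101 concrete, the following Python sketch chooses between two scale functions by interval. The application's own formulas (1) and (2) are not reproduced above, so s1 and s2 here are illustrative stand-ins (an arcsine-style function that is dense at the tails, and a smooth function that is dense in the middle); DELTA, X1 and X2 are likewise assumed example values.

```python
import math

DELTA = 100  # example hyperparameter controlling the number of clusters

def s1(q):
    # arcsine-style scale function: its derivative ~ 1/sqrt(q(1-q)), so clusters
    # are small (dense) near q = 0 and q = 1, i.e. on the first quantile interval
    return DELTA / (2 * math.pi) * math.asin(2 * q - 1)

def s2(q):
    # a smooth monotone function whose derivative 6q(1-q) peaks at q = 0.5, so
    # clusters are small (dense) in the middle, i.e. on the second quantile interval
    return DELTA * (3 * q**2 - 2 * q**3)

X1, X2 = 0.2, 0.8  # example interval boundaries from the description

def pick_scale_function(target_quantile):
    """Adaptive selection of step 101: tails -> first function, middle -> second."""
    if target_quantile <= X1 or target_quantile >= X2:
        return s1
    return s2
```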
  • Step 102: Construct the target sketch based on the target scale function and the multiple data points.
  • The target sketch includes multiple clusters.
  • Each cluster includes a cluster mean and a cluster weight.
  • The cluster mean indicates the mean value of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster.
  • The implementation of constructing the target sketch based on the target scale function and the multiple data points may follow the t-digest algorithm or other clustering methods, which is not limited in the embodiments of the present application; one possible construction is sketched below.
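  • The following Python sketch shows one possible construction along the lines of the t-digest algorithm referenced above, assuming at least one data point; the cluster-size limit of one scale-function unit and the helper names are assumptions of this sketch, not the application's exact procedure.

```python
def build_sketch(data_points, scale_fn):
    """Minimal t-digest-style construction (step 102). Each cluster is a
    [mean, weight] pair; a cluster keeps absorbing points until the scale
    function has advanced by more than one unit since the cluster was opened."""
    pts = sorted(data_points)
    total = float(len(pts))
    clusters = []
    w_done = 0.0                       # total weight of clusters already closed
    cur_mean, cur_w = pts[0], 1.0
    k_limit = scale_fn(0.0) + 1.0      # k-value the open cluster may grow up to
    for x in pts[1:]:
        q = (w_done + cur_w + 1.0) / total   # quantile if x is absorbed
        if scale_fn(min(q, 1.0)) <= k_limit:
            cur_mean = (cur_mean * cur_w + x) / (cur_w + 1.0)
            cur_w += 1.0
        else:
            clusters.append([cur_mean, cur_w])
            w_done += cur_w
            k_limit = scale_fn(w_done / total) + 1.0
            cur_mean, cur_w = x, 1.0
    clusters.append([cur_mean, cur_w])
    return clusters
```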
  • Step 103 Query target data points based on the target sketch.
  • the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. This is explained below in two application scenarios.
  • The first application scenario: querying a data value based on a quantile.
  • In this scenario, step 103 can be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • The target quantile is denoted as q, where q is a decimal between 0 and 1.
  • The query result obtained based on the target sketch and q is an approximate estimate of the element at the position corresponding to q in the sorted result of all data points; this query result is the data value of the target data point.
  • C1.weight is the cluster weight of the first cluster in the target sketch, and C1.value is the cluster mean of the first cluster in the target sketch.
  • The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean from small to large.
  • Cm.weight is the cluster weight of the last cluster in the target sketch, and Cm.value is the cluster mean of the last cluster in the target sketch.
  • The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean from small to large.
  • Wi is the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster), that is, Wi = C1.weight + C2.weight + ... + Ci.weight.
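  • The interpolation described above can be sketched as follows, where each cluster is a [mean, weight] pair sorted by mean; the centre-of-cluster interpolation rule follows the t-digest convention and is one plausible reading, not the application's exact formula.

```python
def query_value(clusters, q):
    """Query the data value at quantile q from a sketch whose clusters
    ([mean, weight] pairs) are sorted by mean, interpolating linearly between
    the cumulative-weight centres of neighbouring clusters."""
    total = sum(w for _, w in clusters)
    target = q * total
    w_before = 0.0
    for i, (mean, w) in enumerate(clusters):
        w_center = w_before + w / 2.0            # cumulative weight at this cluster's centre
        if target <= w_center or i == len(clusters) - 1:
            if i == 0:
                return mean                      # left tail: return the first cluster mean
            prev_mean, prev_w = clusters[i - 1]
            prev_center = w_before - prev_w / 2.0
            frac = (target - prev_center) / (w_center - prev_center)
            frac = min(1.0, max(0.0, frac))
            return prev_mean + frac * (mean - prev_mean)
        w_before += w
    return clusters[-1][0]
```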
  • Case 1: Query in response to a data point query request.
  • In this case, a data point query request can be received.
  • The data point query request is used to query the data value of a target data point among multiple data points, and the data point query request carries the standard quantile of the target data point. The standard quantile carried in the data point query request is determined as the target quantile.
  • The standard quantile can be a quantile input by the user; that is, when the user triggers the data point query request, the user also inputs a quantile, so that the method provided by the embodiments of the application can subsequently query a specific data value based on the quantile input by the user.
  • In this way, the scale function can be adaptively selected according to the quantile input by the user, and the sketch constructed accordingly is relatively dense in the interval near the quantile input by the user, thereby improving the accuracy of the query result.
  • In another case, an equal-height histogram query request may be received, which is used to query an equal-height histogram constructed based on multiple data points; the request carries the number of buckets h.
  • In this case, the way to determine the target quantile is: based on the number of buckets h and the total number of the multiple data points, determine the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, to obtain h-1 quantiles; use each of the h-1 quantiles in turn as the target quantile, and perform steps 101 to 103 to obtain h-1 data values that correspond to the h-1 quantiles one-to-one.
  • After the h-1 data values are obtained, an equal-height histogram can be drawn based on the h-1 data values, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • The height of each bucket in the equal-height histogram is equal, namely the ratio of the total number N to the number of buckets h.
  • The coordinates on the horizontal axis of the equal-height histogram increase from left to right.
  • The h buckets from left to right in the equal-height histogram are denoted as the first bucket, the second bucket, ..., and the h-th bucket.
  • The quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, can be determined as follows:
  • The quantile of the i-th bucket can be expressed as i/h, where i is an integer greater than or equal to 1 and less than or equal to h.
  • Each bucket in the equal-height histogram has a corresponding left boundary value and right boundary value on the abscissa, and the quantile of each bucket mentioned above specifically refers to the quantile corresponding to the right boundary value of that bucket. Therefore, the quantile corresponding to the h-th bucket is 1.
  • The h-1 data values that correspond to the h-1 quantiles one-to-one serve as the left and right boundary values of the buckets (the right boundary of one bucket is the left boundary of the next), and the height of each bucket is the ratio between the total number and the number of buckets h; a short code sketch of this equal-height query is given below.
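  • Building on the query_value sketch shown earlier, the equal-height query could look like the following; the function names and the return shape are illustrative.

```python
def equal_height_histogram(clusters, total_n, h):
    """Equal-height histogram query: the right boundary of bucket i sits at
    quantile i/h, so the h-1 interior boundaries are obtained by querying the
    sketch at quantiles 1/h, 2/h, ..., (h-1)/h."""
    quantiles = [i / h for i in range(1, h)]
    boundaries = [query_value(clusters, q) for q in quantiles]
    bucket_height = total_n / h        # every bucket holds the same number of points
    return boundaries, bucket_height
```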
  • FIG. 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application. As shown in Figure 5, the process of querying the equal height histogram includes the following steps:
  • The second application scenario: querying a quantile based on a data value.
  • In this scenario, the quantile of the target data point is not known in advance.
  • Therefore, a quantile is first estimated, the estimated quantile is used as the target quantile, and the scale function is adaptively selected to build the sketch.
  • The implementation of determining the target quantile may be: based on the data value of the target data point, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points, determine the estimated quantile of the target data point, and use the estimated quantile as the target quantile.
  • The estimated quantile of the target data point can be determined by the following formula, where Q is the data value of the target data point to be queried:
  • Of course, determining the estimated quantile of the target data point can also be implemented in other ways, which the embodiments of the present application do not limit; one plausible estimate is sketched below.
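  • Since the estimation formula itself is not reproduced above, the following Python sketch is only one plausible reading: the queried value Q is placed linearly between the smallest and largest data values, and the result is clamped to [0,1].

```python
def estimate_quantile(value_q, v_min, v_max):
    """Estimate a quantile for data value Q from the smallest and largest
    data values; an assumed linear placement, not the application's formula."""
    if v_max == v_min:
        return 0.5
    q = (value_q - v_min) / (v_max - v_min)
    return min(1.0, max(0.0, q))
```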
  • step 103 can be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • the quantiles obtained by the query are called standard quantiles.
  • the data value of the target data point is marked as Q
  • the standard quantile is marked as q
  • the query result obtained based on the target sketch and Q is q.
  • C1.weight is the cluster weight of the first cluster in the target sketch, and C1.value is the cluster mean of the first cluster in the target sketch.
  • The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean from small to large.
  • Cm.weight is the cluster weight of the last cluster in the target sketch, and Cm.value is the cluster mean of the last cluster in the target sketch.
  • The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean from small to large.
  • Wi is the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster), that is, Wi = C1.weight + C2.weight + ... + Ci.weight.
  • The queried standard quantile q can then be obtained by the following formula:
  • In this way, a quantile can be estimated based on the data value input by the user, the scale function can be adaptively selected based on the estimated quantile, and a sketch can be constructed whose clusters are relatively dense in the interval near the corresponding quantile, thereby improving the accuracy of the query result.
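  • A minimal sketch of this reverse query is given below, mirroring the earlier query_value example: the data value Q is located between neighbouring cluster means and the cumulative weights are interpolated. The exact formula used by the application is not reproduced above, so this is an assumed implementation.

```python
def query_quantile(clusters, value_q):
    """Estimate the standard quantile of data value Q from a sketch whose
    clusters ([mean, weight] pairs) are sorted by mean."""
    total = sum(w for _, w in clusters)
    if value_q <= clusters[0][0]:
        return 0.0
    if value_q >= clusters[-1][0]:
        return 1.0
    w_before = 0.0
    for i in range(1, len(clusters)):
        left_mean, left_w = clusters[i - 1]
        right_mean, right_w = clusters[i]
        if left_mean <= value_q <= right_mean:
            left_center = w_before + left_w / 2.0
            right_center = w_before + left_w + right_w / 2.0
            frac = ((value_q - left_mean) / (right_mean - left_mean)
                    if right_mean > left_mean else 0.0)
            return (left_center + frac * (right_center - left_center)) / total
        w_before += left_w
    return 1.0
```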
  • In another case, an equal-width histogram query request may be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points, and it carries a bucket boundary array.
  • The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals.
  • Each of the n boundary values is used in turn as the data value of the target data point, and steps 101 to 103 are performed to obtain n standard quantiles corresponding to the n boundary values one-to-one.
  • After the n standard quantiles are obtained, an equal-width histogram can be drawn based on the n standard quantiles that correspond to the n boundary values one-to-one.
  • The n boundary values in the bucket boundary array are arranged in order from small to large, and the n boundary values form an arithmetic sequence so that each bucket in the equal-width histogram has the same width.
  • The coordinates on the horizontal axis of the equal-width histogram increase from left to right.
  • The n+1 buckets from left to right in the equal-width histogram are denoted as the first bucket, the second bucket, ..., and the (n+1)-th bucket.
  • The left boundary value of the first bucket is the data value of the smallest data point among all the data points; the left boundary value of the second bucket (that is, the right boundary value of the first bucket) is the first boundary value in the bucket boundary array; the left boundary value of the third bucket (that is, the right boundary value of the second bucket) is the second boundary value in the bucket boundary array; and so on, the left boundary value of the (n+1)-th bucket (that is, the right boundary value of the n-th bucket) is the n-th boundary value in the bucket boundary array, and the right boundary value of the (n+1)-th bucket is the data value of the largest data point among all data points.
  • The specific process of drawing the equal-width histogram can be: after the quantile corresponding to each boundary value in the bucket boundary array is determined, the number of data points falling between two adjacent boundary values can be determined based on the total number and the quantile corresponding to each boundary value; based on the number of data points falling between two adjacent boundary values, the height of each bucket in the equal-width histogram can be obtained.
  • The specific implementation method is explained in detail later.
  • FIG. 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application. As shown in Figure 7, the process of querying an equal-width histogram includes the following steps:
  • Each element in the array C obtained in this way is the height of one bucket.
  • The height of each bucket represents the ratio between the number of data points whose data values fall within the boundaries of that bucket and the total number N; a short code sketch of this equal-width query is given below.
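  • Reusing the query_quantile sketch above, the equal-width query and the array C of bucket heights could be computed as follows; the boundary handling and the names are illustrative assumptions.

```python
def equal_width_histogram(clusters, boundaries, total_n):
    """Equal-width histogram query: boundaries is the bucket boundary array of n
    values sorted ascending; each boundary is converted to a standard quantile,
    and the height of a bucket is the share of points between adjacent quantiles."""
    qs = [0.0] + [query_quantile(clusters, b) for b in boundaries] + [1.0]
    heights = [qs[k + 1] - qs[k] for k in range(len(qs) - 1)]  # the array C
    counts = [round(h * total_n) for h in heights]             # approximate point counts
    return heights, counts
```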
  • In summary, the scale function can be adaptively selected according to the target quantile corresponding to the target data point to be queried, so as to improve the accuracy of the constructed target sketch near the target quantile, thereby improving the accuracy of the query result.
  • This method of adaptively selecting scale functions can be applied in the scenario of querying data values based on quantiles, in the scenario of querying quantiles based on data values, in the scenario of querying equal-height histograms, and in the scenario of querying equal-width histograms. Therefore, the method provided by the embodiments of the present application can improve the accuracy of query results in various query scenarios.
  • The above embodiments explain how to adaptively select a scale function to construct the target sketch.
  • The embodiments of the present application also provide a method of inserting data points into or deleting data points from the target sketch to update the target sketch.
  • FIG. 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps 801 to 802.
  • Step 801: Generate a cluster to be updated corresponding to the data point to be updated in the cache.
  • The cluster to be updated includes a cluster mean, a cluster weight and a cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
  • The cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster mark of the cluster to be updated indicates the update type of the data point to be updated.
  • Step 802: Update the target sketch based on the cluster to be updated.
  • a triplet may be used to represent a cluster.
  • This triplet can be expressed as <v, w, f>, where v represents the cluster mean of the cluster, w represents the cluster weight of the cluster, and f represents the cluster mark of the cluster.
  • the cluster mark indicates whether the cluster is to be deleted or merged.
  • In the embodiments of the present application, the data points in the cache are expressed as clusters to be updated in the form of triplets as above; that is, each data point to be updated in the cache corresponds to a cluster to be updated.
  • The cluster to be updated includes the cluster mean, the cluster weight and the cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
  • The cluster weight of the cluster to be updated indicates the number of data points to be updated.
  • The cluster mark of the cluster to be updated indicates the update type of the data point to be updated.
  • the cluster mark of the cluster to be updated includes a mark to be merged and a mark to be deleted.
  • the cluster mark is a mark to be merged, indicating that the corresponding cluster is a cluster to be merged into the target sketch.
  • the cluster mark is a mark to be deleted, indicating that the corresponding cluster is a cluster to be deleted from the target sketch.
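  • A minimal representation of the <v, w, f> triplet kept in the cache could look like the following; the class name PendingCluster and the string values of the mark are illustrative, not terms from the application.

```python
from dataclasses import dataclass

MERGE = "to_merge"    # illustrative value of the mark f for clusters to be merged
DELETE = "to_delete"  # illustrative value of the mark f for clusters to be deleted

@dataclass
class PendingCluster:
    """The <v, w, f> triplet kept in the cache for a data point to be updated."""
    v: float   # cluster mean  = data value of the point(s) to be updated
    w: float   # cluster weight = number of points to be updated with this value
    f: str     # cluster mark   = update type (merge into or delete from the sketch)
```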
  • the current update operation of the target sketch includes inserting data points into the target sketch or deleting data points from the target sketch. This is explained in two cases below.
  • In the first case, step 802 is implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-merged mark to obtain the clusters to be merged, and merge the clusters to be merged into the target sketch.
  • In this way, the clusters to be merged, that is, the data points that need to be added, can be filtered out from the cache based on the cluster marks, and the clusters to be merged are then merged into the target sketch.
  • The process of merging the clusters to be merged into the target sketch may be: sort the clusters in the target sketch and the clusters to be merged in order of cluster mean from small to large; for the first cluster after sorting, determine the quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn:
  • For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • The quantile threshold indicates the capacity limit of the corresponding cluster.
  • In the formula for the quantile threshold q_threshold, k(q0) represents the value of the target scale function at the quantile q0 of the current cluster.
  • The current quantile of the i-th cluster can be determined as follows: determine the sum of the cluster weights of the clusters that have been traversed (including the i-th cluster), determine the sum of the cluster weights of all clusters after sorting, and use the ratio between the two sums as the current quantile of the i-th cluster.
  • When the current quantile of the i-th cluster is lower than the quantile threshold, the i-th cluster is merged into the previous cluster.
  • Merging the i-th cluster into the previous cluster means updating the cluster weight and cluster mean of the previous cluster based on the cluster weight and cluster mean of the i-th cluster.
  • Specifically, the cluster mean of the i-th cluster and the cluster mean of the previous cluster are averaged with weights given by their respective cluster weights, and the resulting value is used as the updated cluster mean of the previous cluster; the cluster weight of the i-th cluster is added to the cluster weight of the previous cluster, and the resulting value is used as the updated cluster weight of the previous cluster.
  • Updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function can likewise refer to the above-mentioned formula for determining the quantile threshold q_threshold, which is not described again here; a minimal merge sketch is given below.
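  • The merge procedure can be sketched as follows; for simplicity the capacity check is done in scale-function space (k(q) − k(q0) < 1) instead of inverting the scale function into a quantile threshold, which is an implementation convenience of this sketch rather than the application's exact wording.

```python
def merge_into_sketch(sketch, to_merge, scale_fn):
    """Merge the clusters marked as to-be-merged into the target sketch.
    Clusters are [mean, weight] pairs."""
    clusters = sorted(sketch + to_merge, key=lambda c: c[0])
    total = sum(w for _, w in clusters)
    merged = [list(clusters[0])]
    w_done = 0.0                              # weight of clusters already closed
    k_limit = scale_fn(0.0) + 1.0
    for mean, w in clusters[1:]:
        q = (w_done + merged[-1][1] + w) / total   # quantile if this cluster is absorbed
        if scale_fn(min(q, 1.0)) <= k_limit:
            prev = merged[-1]                      # merge into the previous cluster:
            prev[0] = (prev[0] * prev[1] + mean * w) / (prev[1] + w)  # weighted mean
            prev[1] += w                                              # weights added
        else:
            w_done += merged[-1][1]
            k_limit = scale_fn(w_done / total) + 1.0  # update the threshold, move on
            merged.append([mean, w])
    return merged
```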
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • As shown in Figure 9, the newly added data points are first placed in the cache (that is, the buffer), and the new data points in the cache are expressed in the form of triplets to obtain the clusters to be merged.
  • If the current quantile of a traversed cluster exceeds the quantile threshold, the quantile threshold is recalculated based on the quantile of the current cluster and the next cluster is traversed.
  • In the second case, step 802 is implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-deleted mark to obtain the clusters to be deleted, and delete the clusters to be deleted from the target sketch.
  • In this way, the clusters to be deleted, that is, the data points that need to be deleted, can be filtered out from the cache based on the cluster marks, and the clusters to be deleted are then removed from the target sketch.
  • Before the deletion, clusters with the same cluster mean among the clusters to be deleted can be merged, and the cluster weight of the merged cluster is the sum of the cluster weights of the clusters before the merge. The target sketch is then updated based on the merged clusters to be deleted.
  • The process of deleting the clusters to be deleted from the target sketch may be: sort the clusters in the target sketch and the clusters to be deleted in order of cluster mean from small to large; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each traversed cluster: for the j-th cluster, determine the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • Updating the cluster weights of the clusters adjacent to the j-th cluster includes the following situations:
  • Case 1: If the j-th cluster is the first cluster after sorting, the cluster weight of the first cluster is subtracted from the cluster weight of the right adjacent cluster of the first cluster, and the resulting value is used as the updated cluster weight of the right adjacent cluster.
  • If the cluster weight of the right adjacent cluster of the first cluster is less than the cluster weight of the first cluster, the right adjacent cluster of the first cluster is deleted, the difference between the cluster weight of the first cluster and the cluster weight of its right adjacent cluster is determined, and the cluster weight of the next right adjacent cluster is updated based on this difference. If the difference is still greater than the cluster weight of that next right adjacent cluster, its cluster weight continues to be updated in the same way, until the cluster weight of the most recently reached right adjacent cluster is greater than the last determined difference. This approach can be called recursively updating the cluster weights to the right.
  • In addition, if deleting the first cluster changes the minimum value of the target sketch (that is, the data value of the smallest data point among all data points in the target sketch), the minimum value of the target sketch needs to be updated; for example, the cluster mean of the first cluster in the updated target sketch can be used as the minimum value of the target sketch.
  • Case 2: If the j-th cluster is the last cluster after sorting, the cluster weight of the last cluster is subtracted from the cluster weight of the left adjacent cluster of the last cluster, and the resulting value is used as the updated cluster weight of the left adjacent cluster.
  • If the cluster weight of the left adjacent cluster of the last cluster is less than the cluster weight of the last cluster, the left adjacent cluster of the last cluster is deleted, the difference between the cluster weight of the last cluster and the cluster weight of its left adjacent cluster is determined, and the cluster weight of the next left adjacent cluster is updated based on this difference. If the difference is still greater than the cluster weight of that next left adjacent cluster, its cluster weight continues to be updated in the same way, until the cluster weight of the most recently reached left adjacent cluster is greater than the last determined difference. This approach can be called recursively updating the cluster weights to the left.
  • Similarly, if deleting the last cluster changes the maximum value of the target sketch (that is, the data value of the largest data point among all data points in the target sketch), the maximum value of the target sketch needs to be updated; for example, the cluster mean of the last cluster in the updated target sketch can be used as the maximum value of the target sketch.
  • Case 3 If the jth cluster is the middle cluster after sorting, the cluster weight of the left adjacent cluster and the cluster weight of the right adjacent cluster of the jth cluster need to be updated.
  • the implementation process of updating the cluster weights of clusters adjacent to j clusters can be as follows: obtaining the cluster mean of the left adjacent clusters of j clusters and the cluster mean of the right adjacent clusters of j clusters; based on the left The cluster mean of the adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the jth cluster determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively; based on The cluster weight of the left adjacent cluster is updated based on the deletion weight corresponding to the left adjacent cluster, and the cluster weight of the left adjacent cluster is updated based on the deletion weight corresponding to the right adjacent cluster.
The deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster can be determined through a formula (not reproduced in this text), in which d_l represents the deletion weight corresponding to the left adjacent cluster, d_r represents the deletion weight corresponding to the right adjacent cluster, w_c represents the cluster weight of the jth cluster, v_c represents the cluster mean of the jth cluster, v_l represents the cluster mean of the left adjacent cluster, and v_r represents the cluster mean of the right adjacent cluster.
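Since the formula itself is not reproduced here, the following is only a plausible form written as an illustrative assumption, consistent with the variable definitions above but not necessarily the formula used by this application: the weight w_c of the jth cluster is split between the two neighbors in proportion to how close the cluster mean v_c lies to each neighbor's mean, so that the two deletion weights sum to w_c.

```latex
d_l = w_c \cdot \frac{v_r - v_c}{v_r - v_l}, \qquad
d_r = w_c \cdot \frac{v_c - v_l}{v_r - v_l}, \qquad d_l + d_r = w_c
```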
Updating the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster may, for example, be: subtracting the deletion weight corresponding to the left adjacent cluster from the cluster weight of the left adjacent cluster, and using the resulting value as the updated cluster weight of the left adjacent cluster. Similarly, updating the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster can be: subtracting the deletion weight corresponding to the right adjacent cluster from the cluster weight of the right adjacent cluster, and using the resulting value as the updated cluster weight of the right adjacent cluster. In addition, updating the cluster weight of the left adjacent cluster can also refer to the aforementioned leftward recursive update of cluster weights, and updating the cluster weight of the right adjacent cluster can refer to the aforementioned rightward recursive update of cluster weights; the explanation is not repeated here.
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application. As shown in Figure 10, each data point to be deleted in the buffer is counted, and each data point is represented by the aforementioned triplet, that is, each data point to be deleted is represented in the form of a cluster, so as to construct the clusters to be deleted. The clusters to be deleted and the clusters in the target sketch are then sorted by cluster mean from small to large, and the sorted clusters are traversed.

If the current cluster to be deleted is the first cluster, the deletion is applied to its right adjacent cluster, that is, the cluster weight of the right adjacent cluster is modified. If the cluster weight of the right adjacent cluster is not enough to absorb the cluster weight of the current cluster, the deletion continues through the above-mentioned method of recursively updating cluster weights to the right. If deleting the first cluster of the target sketch affects the minimum value of the target sketch, the minimum value of the target sketch needs to be updated based on the updated first cluster of the target sketch. After the cluster weight of the right adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the current cluster to be deleted is deleted and the traversal continues with the next cluster.

If the current cluster is the last cluster, the deletion is applied to its left adjacent cluster, that is, the cluster weight of the left adjacent cluster is modified. If the cluster weight of the left adjacent cluster is not enough to absorb the cluster weight of the current cluster, the deletion continues through the above-mentioned method of recursively updating cluster weights to the left. If deleting the last cluster of the target sketch affects the maximum value of the target sketch, the maximum value of the target sketch can be updated based on the last cluster of the updated target sketch. After the cluster weight of the left adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the cluster to be deleted is deleted and the deletion operation is completed.

If the current cluster is located in a middle position, the deletion weight of the left adjacent cluster and the deletion weight of the right adjacent cluster of the current cluster are determined, and the deletion is then carried out recursively to the left and to the right, that is, the cluster weight of the left adjacent cluster is updated based on the deletion weight of the left adjacent cluster, and the cluster weight of the right adjacent cluster is updated based on the deletion weight of the right adjacent cluster. After that, the cluster to be deleted is deleted and the traversal continues with the next cluster.
In summary, the data points to be updated in the cache can be expressed as clusters to be updated in the form of triplets. Because the cluster tag in a cluster to be updated can indicate whether the cluster to be updated is a cluster to be deleted or a cluster to be merged, based on the cluster tags, the data points to be inserted in the cache can be inserted into the target sketch, or the data points to be deleted in the cache can be deleted from the target sketch.

In the above embodiments, the target sketch is temporarily constructed in the manner shown in Figure 1 each time data points are queried, which wastes computing resources. For this reason, the embodiments of this application further provide an incremental update method. With the incremental update method, when querying data points, a sketch is constructed based only on the newly added data points, and the constructed sketch is then aggregated with the existing sketches in the cache to obtain the target sketch, thus avoiding the waste of computing resources.
In addition, the data points stored in a time series database have corresponding timestamps, and the timestamp of each data point can represent the collection time of that data point. Therefore, the data points stored in the time series database have time series characteristics. The data points stored in the time series database can usually include data points on different indicators, such as data points collected for temperature and data points collected for humidity. The data points on each indicator are called the data points on one timeline. Based on this, the data points in the time series database include data points corresponding to multiple timelines, and each timeline represents one indicator.
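As a minimal illustration of the data model just described, a data point can be represented as a value together with a timestamp and the identifier of the timeline (indicator) it belongs to. The field names below are assumptions introduced for illustration only and do not reflect the storage format of any particular time series database.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    timeline_id: str   # identifies the indicator, e.g. "temperature" or "humidity"
    timestamp: int     # collection time of the data point (e.g. Unix seconds)
    value: float       # the collected data value

points = [
    DataPoint("temperature", 1_690_000_000, 21.5),
    DataPoint("temperature", 1_690_000_060, 21.9),
    DataPoint("humidity",    1_690_000_000, 48.0),
]
# Data points on the same timeline share a timeline identifier and are ordered by timestamp.
temperature_timeline = sorted(
    (p for p in points if p.timeline_id == "temperature"), key=lambda p: p.timestamp
)
```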
On this basis, the embodiments of the present application also provide an incremental update system. The incremental update system provided by the embodiments of the present application is explained here first. Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application. As shown in Figure 11, the incremental update system includes the following components.
• The single-timeline component (seriesCusor), also known as the single-timeline read data executor, is responsible for reading the original data points within the specified time range of one timeline in response to the query statement.
• The single-timeline aggregation component, also known as the single-timeline aggregation executor, is responsible for calculating the data points of the timeline according to a specific aggregation method and outputting the aggregation results. For example, the data points of the timeline are constructed into sketches, and the insertion and deletion operations on sketches in the aforementioned embodiments can be implemented through this component.
• The single-timeline sketch cache component, also known as the single-timeline sketch cache executor, is responsible for caching the sketches that have already been built. As shown in Figure 11, the incremental update system also includes a data cache (CacheData) and a metadata cache (CacheMeta). These two caches are used to store the built sketches and the metadata of the sketches respectively, where the metadata of a sketch is used to index the sketch.
• The multi-timeline sorting component (tagSetCursor), also known as the multi-timeline sorting and merging executor, is responsible for sorting the sketches aggregated based on the data points of multiple timelines according to the space and time dimensions, to ensure the orderliness of the cached sketches.
• The multi-timeline inter-group component, also known as the multi-timeline inter-group executor, is responsible for aggregating the output results of multiple multi-timeline sorting components, so as to implement serial scheduling of different multi-timeline sorting components.
• The logical concurrency component, also known as the logical concurrency executor, serves as the smallest-granularity parallel scheduling unit and is responsible for the conversion of data structures and the assembly of metadata. The conversion of data structures refers to converting the storage-layer data structure into a query data structure to output query results, and the assembly of metadata is used to generate the metadata of sketches.
• The aggregation transformation component, also known as the multi-timeline aggregation executor, is responsible for further aggregating the output results of the multi-timeline inter-group components, such as the merging of sketches.
Figure 12 is a flowchart of an incremental update method provided by an embodiment of the present application. As shown in Figure 12, the method includes the following steps 1201 to 1203.

Step 1201: obtain a cached sketch that was built based on some of the multiple data points and the target scale function, to obtain a first sketch.

Step 1202: construct a sketch based on the data points other than those data points among the multiple data points and the target scale function, to obtain a second sketch.

Step 1203: aggregate the first sketch and the second sketch to obtain the target sketch.

That is, when the target data point needs to be queried, if some sketches have already been constructed in advance based on some of the data points and the target scale function, a sketch only needs to be constructed for the remaining data points, and the currently constructed sketch and the previously constructed sketches are merged to obtain the target sketch. In this way, there is no need to build the target sketch based on the full set of data points for each query, thereby saving computing resources.
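At a high level, steps 1201 to 1203 can be sketched as follows. This Python snippet is an illustration under assumptions, not the implementation of this application: the toy sketch type, the placeholder build and merge functions, and the cache key are all introduced for illustration only (a real construction and aggregation would compress clusters according to the target scale function).

```python
from typing import Callable, Dict, List, Tuple

# A toy "sketch" for illustration only: a list of (cluster_mean, cluster_weight) pairs.
Sketch = List[Tuple[float, float]]

def build_sketch(points: List[float], scale_fn: Callable[[float], float]) -> Sketch:
    # Placeholder construction: one cluster per point; a real build would merge
    # points into clusters according to scale_fn.
    return sorted((p, 1.0) for p in points)

def merge_sketches(a: Sketch, b: Sketch) -> Sketch:
    # Placeholder aggregation: concatenate and re-sort by cluster mean; a real
    # implementation would re-compress the clusters according to the scale function.
    return sorted(a + b)

def incremental_target_sketch(cache: Dict[str, Sketch], cache_key: str,
                              new_points: List[float],
                              scale_fn: Callable[[float], float]) -> Sketch:
    first_sketch = cache.get(cache_key, [])             # step 1201: reuse the cached sketch
    second_sketch = build_sketch(new_points, scale_fn)  # step 1202: build only from new points
    return merge_sketches(first_sketch, second_sketch)  # step 1203: aggregate into the target sketch

cache = {"SID1/window-2023Q1": build_sketch([1.0, 2.0, 3.0], lambda q: q)}
target = incremental_target_sketch(cache, "SID1/window-2023Q1", [4.0, 5.0], lambda q: q)
```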
In some embodiments, obtaining the cached sketch built based on some of the data points among the multiple data points and the target scale function to obtain the first sketch can be implemented as follows: obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong; based on the target time window and the target data point, determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.

The target time window to be queried may be the time window carried in the query statement input by the user. For example, if the user inputs a query statement of "query the highest temperature in the last quarter", then the target time window is "the last quarter".

The metadata set can be maintained by the metadata cache (CacheMeta) shown in Figure 11. In some embodiments, the metadata set stores the metadata of each cached sketch in the form of a list. In this case, the implementation of determining the first metadata from the metadata set can be: traverse each piece of metadata in the metadata set; if the sketch timeline identifier of a certain piece of metadata is the same as the identifier of the timeline to which the target data point belongs, and the sketch time window of that metadata is part or all of the target time window, then that metadata is determined to be the first metadata.
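The list-traversal lookup just described can be sketched as follows. This is a minimal illustration under assumptions; the metadata field names and the time-window representation are introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TimeWindow = Tuple[int, int]  # (start_timestamp, end_timestamp)

@dataclass
class SketchMetadata:
    timeline_id: str     # sketch timeline identifier (SID)
    window: TimeWindow   # sketch time window of the cached sketch
    sketch_key: str      # key used to look the cached sketch up

def find_first_metadata(metadata_set: List[SketchMetadata],
                        target_timeline_id: str,
                        target_window: TimeWindow) -> Optional[SketchMetadata]:
    """Return the first metadata whose timeline identifier matches the target and
    whose sketch time window is part or all of the target time window."""
    t_start, t_end = target_window
    for meta in metadata_set:
        s_start, s_end = meta.window
        if meta.timeline_id == target_timeline_id and t_start <= s_start and s_end <= t_end:
            return meta
    return None
```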
In other embodiments, the metadata in the metadata set can be stored in a key-value format. As shown in Figure 13, each SID represents a timeline, each SID corresponds to multiple time windows (windows), and a corresponding sketch is cached for each time window. The key is the data shard identifier (SharId), where each SharId represents a time range (timerange); therefore, the value corresponding to each SharId includes multiple pieces of metadata, the sketch time window in each piece of metadata is within that time range, and the sketch timeline identifiers in these pieces of metadata can be different timeline identifiers.

For example, the value corresponding to SharId1 in Figure 13 includes the metadata corresponding to SID1. These metadata can be uniformly marked as SID1+timerange11, indicating that the timeline identifier in these metadata is SID1 and the time windows in these metadata are all within the time range timerange11 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID2, which can be uniformly marked as SID2+timerange12, indicating that the timeline identifier in these metadata is SID2 and the time windows in these metadata are all within the time range timerange12 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID3, which can be uniformly marked as SID3+timerange13, indicating that the timeline identifier in these metadata is SID3 and the time windows in these metadata are all within the time range timerange13 corresponding to SharId1.

In this case, the implementation of determining the first metadata from the metadata set can be: determine the SharId that matches the target time window, where the time range represented by the matching SharId falls within the target time window; then query, from the value corresponding to the matching SharId, the metadata whose sketch timeline identifier is the target timeline identifier, and the metadata obtained is the first metadata.
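The key-value layout and the two-step lookup can be illustrated with the following Python sketch. The dictionary shape, the field names and the example time ranges are assumptions introduced for illustration only, not the actual storage format of this application.

```python
# Hypothetical key-value layout of the metadata set described above:
# key   -> data shard identifier (SharId), standing for a time range,
# value -> list of metadata entries whose sketch time windows fall inside that range.
metadata_set = {
    "SharId1": [  # covers, say, timestamps 0 .. 999
        {"sid": "SID1", "window": (0, 499),   "sketch_key": "SID1-w1"},
        {"sid": "SID2", "window": (0, 999),   "sketch_key": "SID2-w1"},
        {"sid": "SID3", "window": (500, 999), "sketch_key": "SID3-w1"},
    ],
}
shard_ranges = {"SharId1": (0, 999)}   # time range represented by each SharId

def lookup_first_metadata(target_sid, target_window):
    t_start, t_end = target_window
    for shar_id, (r_start, r_end) in shard_ranges.items():
        # The SharId matches when its time range falls within the target time window.
        if not (t_start <= r_start and r_end <= t_end):
            continue
        for meta in metadata_set[shar_id]:
            if meta["sid"] == target_sid:
                return meta
    return None

print(lookup_first_metadata("SID2", (0, 1000)))
```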
In some embodiments, the implementation process of constructing a sketch based on the data points other than some of the data points among the multiple data points and the target scale function to obtain the second sketch is: obtain the data points corresponding to a second time window among the multiple data points, where the second time window is the time window in the target time window other than a first time window, and the first time window is the part of the target time window that overlaps with the sketch time window in the first metadata; then construct the second sketch based on the target scale function and the data points corresponding to the second time window. The temporary construction of the sketch can be realized through the single-timeline component and the single-timeline aggregation component in Figure 11.

Further, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch can then be cached and the second metadata added to the metadata set, so as to update the metadata set. It should be noted that the metadata set corresponds to the scale function: different metadata sets can be maintained, and each metadata set only maintains the metadata of the sketches built based on the corresponding scale function.
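Caching the newly built second sketch and registering its metadata, with one metadata set per scale function, can be sketched as follows. The cache structures, the key format and the scale-function names are assumptions introduced for illustration only.

```python
sketch_cache = {}                       # CacheData: sketch_key -> sketch
metadata_sets = {"k1": [], "k2": []}    # CacheMeta: one metadata list per scale function

def cache_second_sketch(scale_fn_name, timeline_id, second_window, second_sketch):
    """Store the second sketch in the data cache and add its metadata (the second
    metadata) to the metadata set that corresponds to the scale function used."""
    sketch_key = f"{timeline_id}:{second_window[0]}-{second_window[1]}:{scale_fn_name}"
    sketch_cache[sketch_key] = second_sketch
    second_metadata = {"sid": timeline_id, "window": second_window, "sketch_key": sketch_key}
    metadata_sets[scale_fn_name].append(second_metadata)   # update the metadata set
    return second_metadata

cache_second_sketch("k1", "SID1", (500, 999), [(4.0, 1.0), (5.0, 1.0)])
```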
In addition, when new data points are written, the cached sketches need to be invalidated to avoid inconsistency between the query results and the actual data. Therefore, the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated. Here, the timestamp of the data point to be written and the identifier of the timeline to which it belongs matching the third metadata in the metadata set means that the timestamp of the data point to be written falls within the sketch time window of the third metadata, and the identifier of the timeline to which the data point to be written belongs is the same as the sketch timeline identifier of the third metadata.
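A minimal sketch of this invalidation step is shown below, reusing the hypothetical cache structures from the previous snippets (field names are assumptions for illustration only).

```python
def invalidate_on_write(metadata_sets, sketch_cache, written_timestamp, written_timeline_id):
    """Delete any cached sketch whose metadata matches a newly written data point,
    i.e. the written timestamp falls inside the sketch time window and the timeline
    identifiers are the same, then update the metadata set accordingly."""
    for metadata_set in metadata_sets.values():
        kept = []
        for meta in metadata_set:
            start, end = meta["window"]
            if meta["sid"] == written_timeline_id and start <= written_timestamp <= end:
                sketch_cache.pop(meta["sketch_key"], None)   # drop the now-stale sketch
            else:
                kept.append(meta)
        metadata_set[:] = kept                               # keep only still-valid metadata

invalidate_on_write({"k1": [{"sid": "SID1", "window": (0, 999), "sketch_key": "SID1-w1"}]},
                    {"SID1-w1": [(1.0, 1.0)]}, written_timestamp=500, written_timeline_id="SID1")
```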
In addition, in order to save cache space, the sketch elimination method provided by the embodiments of the present application can eliminate sketches from two aspects. The first aspect is to eliminate some of the multiple sketches belonging to the same timeline, so as to eliminate sketches from the time dimension. The second aspect is to eliminate the sketches of a certain timeline among different timelines, so as to eliminate sketches from the spatial dimension.

In some embodiments, the metadata set also includes first usage information corresponding to a sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches that match that sketch timeline identifier. In this case, the elimination of sketches based on the time dimension can be implemented by: determining, based on the first usage information, the sketches to be eliminated among the multiple sketches that match the sketch timeline identifier, and deleting the sketches to be eliminated. For example, elimination can be carried out through a least recently used (LRU) elimination mechanism, that is, the less recently used sketches among the multiple sketches matching the sketch timeline identifier are deleted to save cache space.

In some embodiments, the metadata set further includes second usage information, and the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the multiple sketch timeline identifiers, where the usage information corresponding to each sketch timeline identifier indicates when the sketches matching the corresponding sketch timeline identifier were used. In this case, the elimination of sketches based on the spatial dimension can be implemented by: determining, based on the second usage information, the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers, and deleting the sketches that match the sketch timeline identifier to be eliminated. Here, elimination can also be performed through the LRU elimination mechanism, that is, among the sketch timeline identifiers, the sketches corresponding to the sketch timeline identifiers that have been used less recently are deleted to save cache space.
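Both elimination dimensions can be illustrated with a small LRU-style sketch. This is an illustration under assumptions (the usage-information structures, thresholds and function names are hypothetical), not the eviction policy of any particular system.

```python
import time

def evict_time_dimension(sketch_entries, keep):
    """Time dimension: among the cached sketches of one timeline, keep only the
    `keep` most recently used ones (first usage information = per-sketch usage time)."""
    sketch_entries.sort(key=lambda e: e["last_used"], reverse=True)
    return sketch_entries[:keep]

def evict_space_dimension(per_timeline_last_used, cached_timelines, keep):
    """Spatial dimension: among all timelines, drop the sketches of the least
    recently used timelines (second usage information = per-timeline usage time)."""
    ranked = sorted(per_timeline_last_used, key=per_timeline_last_used.get, reverse=True)
    for sid in ranked[keep:]:
        cached_timelines.pop(sid, None)

# Usage sketch: each cached sketch entry carries a last-used timestamp.
timeline_entries = [{"sketch_key": "w1", "last_used": time.time() - 600},
                    {"sketch_key": "w2", "last_used": time.time()}]
timeline_entries = evict_time_dimension(timeline_entries, keep=1)
```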
  • the embodiments of the present application provide an incremental update system and an incremental update method, which can eliminate the need to build a target sketch based on a full amount of data points every time a data point is queried, thereby saving computing resources.
An embodiment of the present application also provides a data point query device. The device 1400 includes the following modules.

The first determination module 1401 is used to determine the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, where the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among the multiple data points sorted by size. For the specific implementation, reference can be made to step 101 in the embodiment of Figure 1.

The construction module 1402 is used to construct the target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. For the specific implementation, reference can be made to step 102 in the embodiment of Figure 1.

The query module 1403 is used to query the target data point based on the target sketch. For the specific implementation, reference can be made to step 103 in the embodiment of Figure 1.
In some embodiments, the multiple scale functions include a first scale function and a second scale function, where the clusters in the sketch constructed based on the first scale function are denser on a first quantile interval than the clusters in the sketch constructed based on the second scale function, and the clusters in the sketch constructed based on the first scale function are less dense on a second quantile interval than the clusters in the sketch constructed based on the second scale function. In this case, the first determination module 1401 is used to: determine the first scale function as the target scale function if the target quantile is located in the first quantile interval; and determine the second scale function as the target scale function if the target quantile is located in the second quantile interval.

In some embodiments, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2.
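The selection rule just described reduces to a simple interval check on the target quantile. The following Python snippet is a minimal illustration; the concrete values of x1 and x2 and the two lambda functions are placeholders, not the scale functions of this application.

```python
def choose_scale_function(target_quantile, x1, x2, first_scale_fn, second_scale_fn):
    """Pick the scale function whose sketch is dense around the target quantile:
    the first scale function for the tail intervals [0, x1] and [x2, 1],
    the second scale function for the middle interval [x1, x2]."""
    if target_quantile <= x1 or target_quantile >= x2:
        return first_scale_fn
    return second_scale_fn

k1 = lambda q: q          # stands in for the first scale function
k2 = lambda q: q ** 2     # stands in for the second scale function
target_fn = choose_scale_function(0.99, x1=0.05, x2=0.95, first_scale_fn=k1, second_scale_fn=k2)
```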
  • the query module 1403 is used to:
  • the device 1400 also includes:
  • the receiving module is used to receive a data point query request.
  • the data point query request is used to query the data value of a target data point among multiple data points.
  • the data point query request carries the standard quantile of the target data point;
  • the first determination module is also used to determine the standard quantile carried in the data point query request as the target quantile.
  • the device 1400 also includes:
  • the receiving module is used to receive the equal-height histogram query request.
  • the equal-height histogram query request is used to query the equal-height histogram constructed based on multiple data points.
  • the equal-height histogram query request carries the number of buckets h, and h is greater than 1. an integer;
  • the first determination module is also used to determine the quantiles from the first bucket to the h-1th bucket from left to right in the equal-height histogram based on the number of buckets h and the total number of multiple data points, and obtain h- 1 quantile;
The query module is also used to take each of the h-1 quantiles as the target quantile and perform the operation of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried, so as to obtain h-1 data values corresponding one-to-one to the h-1 quantiles. The apparatus 1400 further includes a drawing module configured to draw the equal-height histogram based on the h-1 data values and the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • the first determination module is also used to:
  • the query module is used for:
  • the device 1400 also includes:
  • the receiving module is used to receive a quantile query request.
  • the quantile query request is used to query the standard quantile of a target data point among multiple data points.
  • the quantile query request carries the data value of the target data point.
  • the device 1400 also includes:
  • the receiving module is used to receive an equal-width histogram query request.
  • the equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points.
  • the equal-width histogram query request carries a bucket boundary array, and the bucket boundary array includes n boundary values, n boundary values divide n+1 intervals between the data value of the smallest data point and the data value of the largest data point among multiple data points;
The query module is used to take each of the n boundary values as the data value of the target data point and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data value of the largest data point and the data value of the smallest data point among the multiple data points, so as to obtain n standard quantiles corresponding one-to-one to the n boundary values. The device also includes a drawing module for drawing the equal-width histogram based on the n standard quantiles that correspond one-to-one to the n boundary values.
  • the device 1400 also includes:
The generation module is used to generate clusters to be updated corresponding to the data points to be updated in the cache, where each cluster to be updated includes a cluster mean, a cluster weight and a cluster tag, the cluster mean of the cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data point to be updated;
  • the update module is used to update the target sketch based on the cluster to be updated.
  • update modules are used to:
  • update modules are used to:
For the first cluster after sorting, determine the quantile threshold based on the target scale function; traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in sequence: for the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1; if the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • update modules are used to:
  • update modules are used to:
For the jth cluster, determine the cluster tag of the jth cluster; if the cluster tag of the jth cluster is the to-be-deleted tag, delete the jth cluster and update the cluster weights of the clusters adjacent to the jth cluster, where j is an integer greater than or equal to 1.
  • update modules are used to:
If the jth cluster is the middle cluster after sorting, obtain the cluster mean of the left adjacent cluster of the jth cluster and the cluster mean of the right adjacent cluster of the jth cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the jth cluster, determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively; update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
  • building blocks are used to:
  • building blocks are used to:
obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong; determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.
  • the device 1400 also includes:
  • the second determination module is used to determine the metadata of the second sketch and obtain the second metadata
  • a cache module that caches the second sketch and adds the second metadata to the metadata set.
  • the device 1400 also includes:
  • the third determination module is used to determine the timestamp of the data point to be written and the identification of the timeline to which the data point to be written belongs;
  • the first deletion module is used to delete the sketch corresponding to the third metadata and update the metadata set if the timestamp of the data point to be written and the identifier of the corresponding timeline match the third metadata in the metadata set.
  • the metadata set further includes first usage information corresponding to any sketch timeline identification, and the first usage information is used to record the usage time of each of the multiple sketches matching any sketch timeline identification;
  • Device 1400 also includes:
  • the second deletion module is configured to determine, based on the first usage information, the sketches to be eliminated among the plurality of sketches that match any sketch timeline identifier, and delete the sketches to be eliminated.
In some embodiments, the metadata set also includes second usage information, where the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier. The device 1400 further includes:
  • the third deletion module is configured to determine the sketch timeline identifier to be eliminated among the plurality of sketch timeline identifiers based on the second usage information; and delete the sketch that matches the sketch timeline identifier to be eliminated.
  • the first determination module 1401, the construction module 1402, the query module 1403 and other modules can all be implemented by software, or can be implemented by hardware.
Taking the first determination module 1401 as an example, its implementation is introduced below; the implementations of the construction module 1402, the query module 1403 and the other modules can refer to the implementation of the first determination module 1401.
  • the first determination module 1401 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the above computing instance may be one or more.
The first determination module 1401 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. Furthermore, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Usually, one region can include multiple AZs.
  • the multiple hosts/VMs/containers used to run the code can be distributed in the same virtual private cloud (VPC), or across multiple VPCs.
  • VPC virtual private cloud
For communication between two VPCs in the same region, as well as cross-region communication between VPCs in different regions, a communication gateway is set up in each VPC, and the interconnection between VPCs is realized through the communication gateways.
  • the first determination module 1401 may include at least one computing device, such as a server.
  • the first determination module 1401 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • ASIC application-specific integrated circuit
  • PLD programmable logic device
  • the above-mentioned PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL), or any combination thereof.
  • CPLD complex programmable logical device
  • FPGA field-programmable gate array
  • GAL general array logic
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same region or in different regions.
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same AZ or in different AZs.
  • multiple computing devices included in the first determination module 1401 may be distributed in the same VPC, It can also be distributed across multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
The first determination module 1401 can be used to perform any step in the data point query method, the construction module 1402 can be used to perform any step in the data point query method, and the query module 1403 can be used to perform any step in the data point query method. The steps that the first determination module 1401, the construction module 1402 and the query module 1403 are responsible for implementing can be specified as needed, and the first determination module 1401, the construction module 1402 and the query module 1403 respectively implement different steps in the data point query method, so as to realize all the functions of the data point query device.
An embodiment of the present application also provides a computing device 1500. The computing device 1500 includes a bus 1502, a processor 1504, a memory 1506 and a communication interface 1508. The processor 1504, the memory 1506 and the communication interface 1508 communicate with each other through the bus 1502. The computing device 1500 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1500.
The bus 1502 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used in Figure 15, but this does not mean that there is only one bus or one type of bus. The bus 1502 may include a path that carries information between the various components of the computing device 1500 (for example, the memory 1506, the processor 1504 and the communication interface 1508).
The processor 1504 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).
The memory 1506 may include volatile memory, such as random access memory (RAM). The memory 1506 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
The memory 1506 stores executable program code, and the processor 1504 executes the executable program code to respectively realize the functions of the aforementioned first determination module, construction module, query module and other modules, thereby realizing the data point query method provided by the embodiments of this application. That is, the memory 1506 stores instructions for executing the data point query method provided by the embodiments of the present application. The communication interface 1508 uses a transceiver module, such as, but not limited to, a network interface card or a transceiver, to implement communication between the computing device 1500 and other devices or communication networks.
  • An embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • the computing device cluster includes at least one computing device 1500.
  • the memory 1506 in one or more computing devices 1500 in the computing device cluster may store the same instructions for executing the data point query method provided by the embodiment of the present application.
Optionally, the memory 1506 of one or more computing devices 1500 in the computing device cluster may also store part of the instructions for executing the data point query method provided by the embodiments of the present application. In other words, a combination of one or more computing devices 1500 can jointly execute the instructions for executing the data point query method provided by the embodiments of the present application. It should be noted that the memories 1506 in different computing devices 1500 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the data point query device. That is, the instructions stored in the memories 1506 in different computing devices 1500 may implement the functions of one or more of the first determination module, the construction module and the query module.
  • one or more computing devices in a cluster of computing devices may be connected through a network.
  • the network may be a wide area network or a local area network, etc.
Figure 17 shows a possible implementation. As shown in Figure 17, two computing devices 1500A and 1500B are connected through a network; specifically, each computing device is connected to the network through its communication interface. In this possible implementation, the memory 1506 in the computing device 1500A stores instructions for performing the functions of the first determination module and the construction module, and the memory 1506 in the computing device 1500B stores instructions for performing the functions of the query module. The connection mode of the computing device cluster shown in Figure 17 may take into account that the data point query method provided by the embodiments of the present application requires a large amount of computation on data, and therefore the functions implemented by the first determination module and the construction module are handed over to the computing device 1500A for execution.
It should be understood that the functions of the computing device 1500A shown in Figure 17 may also be performed by multiple computing devices 1500, and likewise the functions of the computing device 1500B may also be performed by multiple computing devices 1500.
The embodiment of the present application also provides another computing device cluster. The connection relationship between the computing devices in this computing device cluster can be similar to the connection modes of the computing device cluster described in Figure 16 and Figure 17. The difference is that the memory 1506 in one or more computing devices 1500 in this computing device cluster may store the same instructions for executing the data point query method provided by the embodiments of the present application. Optionally, the memory 1506 of one or more computing devices 1500 in this computing device cluster may also store part of the instructions for executing the data point query method provided by the embodiments of the present application. In other words, a combination of one or more computing devices 1500 can jointly execute the instructions for executing the data point query method provided by the embodiments of the present application.
  • An embodiment of the present application also provides a computer program product containing instructions.
The computer program product may be a software or program product containing instructions that can run on a computing device or be stored in any available medium. When the computer program product is run on at least one computing device, the at least one computing device is caused to execute the data point query method provided by the embodiments of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium.
The computer-readable storage medium may be any available medium that can be stored by a computing device, or a data storage device, such as a data center, containing one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the data point query method provided by embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to the technical field of cloud computing. Disclosed are a data point query method and apparatus, a device cluster, a program product, and a storage medium. The method comprises: determining a target scale function from a plurality of scale functions on the basis of a target quantile corresponding to a target data point to be queried; constructing a target sketch on the basis of the target scale function and a plurality of data points; and querying the target data point on the basis of the target sketch. Because the densities of clusters in sketches constructed on the basis of different scale functions are different, in the embodiments of the present application, the target scale function can be adaptively selected on the basis of the target quantile corresponding to the target data point to be queried, such that the sketch constructed on the basis of the target scale function has dense clusters near the target quantile. When clusters of a sketch are dense, the clusters in the sketch can more accurately represent features of data points of the clusters obtained by clustering, so that the precision of querying the target data point on the basis of the sketch is improved.

Description

数据点查询方法、装置、设备集群、程序产品及存储介质Data point query method, device, equipment cluster, program product and storage medium
This application claims priority to the Chinese patent application No. 202210855232.X, filed on July 19, 2022 and entitled "An efficient aggregation system and method based on statistical analysis operators", and to the Chinese patent application No. 202211091505.4, filed on September 7, 2022 and entitled "Data point query method, device, equipment cluster, program product and storage medium", the entire contents of which are incorporated into this application by reference.
Technical field
本申请实施例涉及云计算技术领域,特别涉及一种数据点查询方法、装置、设备集群、程序产品及存储介质。Embodiments of the present application relate to the field of cloud computing technology, and in particular to a data point query method, device, equipment cluster, program product and storage medium.
Background
数据点(data point)是指物联网技术中相关设备采集的一个个数据,比如温度感应设备采集的一个个温度。数据点查询用于查询一批数据点中某个数据点的特征,比如基于该数据点的数据值查询该数据点在一批数据点中的分位数,或者基于该数据点的分位数查询该数据点的数据值。其中,分位数指示该数据点在按照大小排序后的一批数据点中的位置。随着物联网技术的发展,各行业的数据点数量呈现爆炸式增长,这种场景下如何从海量数据点中高效且准确地查询某个数据点是当前研究的热点。Data points refer to data collected by relevant devices in Internet of Things technology, such as temperatures collected by temperature sensing devices. Data point query is used to query the characteristics of a certain data point in a batch of data points, such as querying the quantile of the data point in a batch of data points based on the data value of the data point, or based on the quantile of the data point Query the data value of this data point. Among them, the quantile indicates the position of the data point in a batch of data points sorted by size. With the development of Internet of Things technology, the number of data points in various industries has exploded. In this scenario, how to efficiently and accurately query a certain data point from massive data points is a current research hotspot.
Summary of the invention
本申请实施例提供了一种数据点查询方法、装置、设备集群、程序产品及存储介质,可以高效且准确地从海量数据点中查询某个数据点。所述技术方案如下:Embodiments of the present application provide a data point query method, device, equipment cluster, program product and storage medium, which can efficiently and accurately query a certain data point from massive data points. The technical solutions are as follows:
第一方面,提供了一种数据点查询方法,在该方法中,基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数,多个尺度函数中不同尺度函数构建的草图中的簇的密集程度不同,目标分位数指示目标数据点在按照大小排序后的多个数据点中的位置;基于目标尺度函数和多个数据点构建目标草图,目标草图包括多个簇,每个簇包括簇均值和簇权重,簇均值指示聚类得到相应簇的数据点的均值,簇权重指示聚类得到相应簇的数据点的数量;基于目标草图查询目标数据点。In the first aspect, a data point query method is provided. In this method, based on the target quantile corresponding to the target data point to be queried, the target scale function is determined from multiple scale functions. Different scales in the multiple scale functions The density of clusters in the sketch constructed by the function is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size; the target sketch is constructed based on the target scale function and multiple data points, and the target sketch includes Multiple clusters, each cluster includes a cluster mean and a cluster weight. The cluster mean indicates the mean value of the data points of the corresponding cluster obtained by clustering, and the cluster weight indicates the number of data points obtained by clustering of the corresponding cluster; query the target data points based on the target sketch.
由于不同尺度函数构建的草图中的簇的密集程度不同,因此在本申请实施例中,可以基于待查询的目标数据点对应的目标分位数,自适应选择目标尺度函数,以使基于目标尺度函数构建的草图在目标分位数附近的簇的比较密集。草图的簇比较密集时,该草图中的簇能够更准确表征聚类得到簇的数据点的特征,从而提高基于草图查询目标数据点的精度。Since the clusters in the sketches constructed by different scale functions have different density, in this embodiment of the present application, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the target scale function can be adaptively selected based on the target scale. The sketches constructed by the function have dense clusters near the target quantile. When the clusters in the sketch are relatively dense, the clusters in the sketch can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data points based on the sketch.
基于第一方面提供的方法,在一些实施例中,多个尺度函数包括第一尺度函数和第二尺度函数,基于第一尺度函数构建的草图中的簇在第一分位数区间上的密集程度,大于基于第二尺度函数构建的草图中的簇在第一分位数区间上的密集程度,基于第一尺度函数构建的草图中的簇在第二分位数区间上的密集程度,小于基于第二尺度函数构建的草图中的簇在第二 分位数区间上的密集程度。Based on the method provided in the first aspect, in some embodiments, the multiple scale functions include a first scale function and a second scale function, and the clusters in the sketch constructed based on the first scale function are dense on the first quantile interval. The degree is greater than the density of the clusters in the first quantile interval in the sketch constructed based on the second scale function. The density of the clusters in the sketch constructed based on the first scale function in the second quantile interval is less than The clusters in the sketch built based on the second scale function are in the second Intensity on the quantile interval.
这种场景下,基于目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数的实现方式可以为:如果目标分位数位于第一分位数区间,则将第一尺度函数确定为目标尺度函数;如果目标分位数位于第二分位数区间,则将第二尺度函数确定为目标尺度函数。In this scenario, based on the target quantile corresponding to the target data point, the implementation method of determining the target scale function from multiple scale functions can be: if the target quantile is located in the first quantile interval, then the first scale The function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
由于基于第一尺度函数构建的草图在第一分位数区间上的簇更为密集,基于第二尺度函数构建的草图在第二分位数区间上的簇更为密集,因此可以根据目标数据点对应的目标分位数,自适应选择第一尺度函数或第二尺度函数来构建草图,以使构建的草图在目标分位数附近的区间上的簇比较密集。Since the sketches constructed based on the first scale function have denser clusters on the first quantile interval, and the sketches constructed based on the second scale function have denser clusters on the second quantile interval, it can be determined based on the target data At the target quantile corresponding to the point, the first scale function or the second scale function is adaptively selected to construct the sketch, so that the constructed sketch has dense clusters in the interval near the target quantile.
基于第一方面提供的方法,在一些实施例中,第一分位数区间包括从0至x1的区间、以及从x2至1的区间,x1和x2均大于0且小于1,且x1小于x2;第二分位数区间包括从x1至x2的区间。Based on the method provided in the first aspect, in some embodiments, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, both x1 and x2 are greater than 0 and less than 1, and x1 is less than x2 ;The second quantile interval includes the interval from x1 to x2.
此时通过本申请实施例提供的方法能够实现对全局分位数区间[0,1]上任一分位数对应的数据点的精准查询,也即实现全范围内的高精度查询。At this time, the method provided by the embodiment of the present application can realize accurate query of the data points corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query in the entire range can be achieved.
基于第一方面提供的方法,在一些实施例中,基于目标草图查询目标数据点的实现方式可以为:基于目标草图和目标分位数,查询目标数据点的数据值。Based on the method provided in the first aspect, in some embodiments, querying the target data point based on the target sketch may be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
在本申请实施例中,可以基于目标数据点的分位数查询目标数据点的数据值,也可以基于目标数据点的数据值查询目标数据点的分位数。也即本申请实施例提供的方法适应于各种场景下的数据点查询,提高了本申请实施例的灵活性。In the embodiment of the present application, the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. That is to say, the method provided by the embodiment of the present application is suitable for data point query in various scenarios, which improves the flexibility of the embodiment of the present application.
基于第一方面提供的方法,在一些实施例中,在基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数之前,在该方法中,还接收数据点查询请求,数据点查询请求用于查询多个数据点中的目标数据点的数据值,数据点查询请求携带目标数据点的标准分位数;将数据点查询请求中携带的标准分位数确定为目标分位数。Based on the method provided in the first aspect, in some embodiments, before determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, in this method, the data point is also received Query request, data point query request is used to query the data value of the target data point among multiple data points. The data point query request carries the standard quantile of the target data point; determine the standard quantile carried in the data point query request. is the target quantile.
在构建草图之前,还可以接收数据点查询请求,该数据点查询请求用于查询多个数据点中的目标数据点的数据值,且该数据点查询请求携带目标数据点的标准分位数。这种情况下,将该数据点查询请求中携带的标准分位数确定为目标分位数,以便基于目标分位数构建目标草图,进而查询目标数据点的数据值。这种情况下,可以提高查询到的数据值的准确性。Before building the sketch, you can also receive a data point query request, which is used to query the data value of a target data point among multiple data points, and the data point query request carries the standard quantile of the target data point. In this case, the standard quantile carried in the data point query request is determined as the target quantile, so that the target sketch can be constructed based on the target quantile, and then the data value of the target data point can be queried. In this case, the accuracy of the queried data values can be improved.
基于第一方面提供的方法,在一些实施例中,在该方法中,还可以接收等高直方图查询请求,等高直方图查询请求用于查询基于多个数据点构建的等高直方图,且等高直方图查询请求携带桶数量h,h为大于1的整数;基于桶数量h和多个数据点的总数量,确定等高直方图中从左到右第一个桶至第h-1个桶的分位数,得到h-1个分位数;将h-1个分位数中每个分位数分别作为目标分位数,并执行基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数的操作,以得到与h-1个分位数一一对应的h-1个数据值。基于h-1个数据值、以及多个数据点中的最大数据点的数据值和最小数据点的数据值,绘制等高直方图。Based on the method provided in the first aspect, in some embodiments, in this method, a equal height histogram query request may also be received, and the equal height histogram query request is used to query a equal height histogram constructed based on multiple data points, And the equal height histogram query request carries the number of buckets h, h is an integer greater than 1; based on the number of buckets h and the total number of multiple data points, determine the first bucket from left to right in the equal height histogram to the h-th The quantiles of 1 bucket are obtained by h-1 quantiles; each quantile in the h-1 quantiles is used as the target quantile, and the target corresponding to the target data point to be queried is executed. Quantile, the operation of determining the target scale function from multiple scale functions to obtain h-1 data values that correspond to h-1 quantiles one-to-one. Draw a contour histogram based on h-1 data values, as well as the data value of the largest data point and the data value of the smallest data point among the plurality of data points.
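For the equal-height histogram case described above, the bucket-boundary quantiles follow directly from the bucket count. The following minimal Python sketch is an illustration, not the implementation of this application; note that, combined with the total number N of data points, the quantile i/h corresponds to the rank position i*N/h.

```python
def equal_height_bucket_quantiles(h: int) -> list:
    """For an equal-height histogram with h buckets, the boundary between bucket i
    and bucket i+1 sits at quantile i / h, giving h - 1 quantiles in total."""
    return [i / h for i in range(1, h)]

print(equal_height_bucket_quantiles(4))   # [0.25, 0.5, 0.75]
```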
在构建草图之前,还可以接收等高直方图查询请求,该等高直方图查询请求用于查询基于多个数据点构建的等高直方图。这种情况下,可以提高构建的等高直方图的准确性。Before building a sketch, you can also receive a contour histogram query request, which is used to query a contour histogram built based on multiple data points. In this case, the accuracy of the constructed equal-height histogram can be improved.
基于第一方面提供的方法,在一些实施例中,在基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数之前,在该方法中,还可以基于目标数据点的数据值,以及多个数据点中的最大数据点的数据值和最小数据点的数据值,确定目标数据点的 估计分位数,将估计分位数作为目标分位数。Based on the method provided in the first aspect, in some embodiments, before determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, in this method, the target scale function may also be determined based on the target quantile corresponding to the target data point to be queried. The data value of the data point, as well as the data value of the largest data point and the data value of the smallest data point among multiple data points, determine the target data point Estimate the quantile and use the estimated quantile as the target quantile.
这种场景下,基于目标草图查询目标数据点的实现方式可以为:基于目标草图和目标数据点的数据值,查询目标数据点的标准分位数。In this scenario, querying the target data point based on the target sketch can be implemented by querying the standard quantile of the target data point based on the data value of the target sketch and the target data point.
在本申请实施例中,可以基于目标数据点的分位数查询目标数据点的数据值,这种场景下,可以先根据数据点的数据值预估一个分位数,将预估的分位数作为目标分位数并自适应选择尺度函数来构建草图,以提高后续查询到的标准分位数的准确性。In the embodiment of this application, the data value of the target data point can be queried based on the quantile of the target data point. In this scenario, a quantile can be estimated based on the data value of the data point, and the estimated quantile can be The number is used as the target quantile and the scale function is adaptively selected to construct the sketch to improve the accuracy of the standard quantile obtained by subsequent queries.
基于第一方面提供的方法,在一些实施例中,在基于目标数据点的数据值,以及多个数据点中的最大数据点的数据值和最小数据点的数据值,确定目标数据点的估计分位数之前,还可以接收分位数查询请求,分位数查询请求用于查询多个数据点中的目标数据点的标准分位数,分位数查询请求携带目标数据点的数据值。Based on the method provided in the first aspect, in some embodiments, an estimate of the target data point is determined based on the data value of the target data point, and the data value of the largest data point and the data value of the smallest data point among the plurality of data points. Before quantile, you can also receive a quantile query request. The quantile query request is used to query the standard quantile of the target data point among multiple data points. The quantile query request carries the data value of the target data point.
基于目标数据点的分位数查询目标数据点的数据值可以应用在接收到分位数查询请求的场景中,提高了这种场景下查询到的标准分位数的准确性。Querying the data value of the target data point based on the quantile of the target data point can be applied in the scenario where a quantile query request is received, which improves the accuracy of the standard quantile queried in this scenario.
Based on the method provided in the first aspect, in some embodiments, the method may further include: receiving an equal-width histogram query request, where the equal-width histogram query request is used to query an equal-width histogram constructed from the multiple data points and carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals; using each of the n boundary values in turn as the data value of the target data point, and performing the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the largest and smallest data points, so as to obtain n standard quantiles in one-to-one correspondence with the n boundary values; and drawing the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
Querying the standard quantile of the target data point based on its data value can be applied in the scenario where an equal-width histogram query request is received, which improves the accuracy of the equal-width histogram obtained in this scenario.
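As an illustration of this embodiment, the following is a minimal sketch that assumes a helper query_quantile(sketch, value) returning the standard quantile of a data value as described above; the function and variable names are illustrative only and are not part of the claimed method.

```python
def equal_width_histogram(sketch, boundaries, total_points):
    # boundaries: the n bucket boundary values carried by the query request,
    # assumed to be sorted in ascending order between min and max.
    quantiles = [query_quantile(sketch, b) for b in boundaries]  # assumed helper
    # Bucket heights are derived from the differences between adjacent quantiles
    # (with 0 and 1 as the outer edges), scaled by the total number of data points.
    edges = [0.0] + quantiles + [1.0]
    heights = [(edges[i + 1] - edges[i]) * total_points for i in range(len(edges) - 1)]
    return heights
```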
Based on the method provided in the first aspect, in some embodiments, after the target sketch is constructed based on the target scale function and the multiple data points, a to-be-updated cluster corresponding to a to-be-updated data point in the cache may also be generated. The to-be-updated cluster includes a cluster mean, a cluster weight and a cluster mark: the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight indicates the number of to-be-updated data points, and the cluster mark indicates the update type of the to-be-updated data point. The target sketch is then updated based on the to-be-updated cluster.
In this embodiment of the application, in order to support inserting data points into or deleting data points from the target sketch, the data points in the cache are represented as to-be-updated clusters in the form of the above triple, so that the target sketch can subsequently be updated according to the to-be-updated data points in the cache.
Based on the method provided in the first aspect, in some embodiments, updating the target sketch based on the to-be-updated clusters may be implemented as: obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster mark is the to-be-merged mark, so as to obtain to-be-merged clusters; and merging the to-be-merged clusters into the target sketch.
Since the cache holds data points that need to be deleted or added, after the data points in the cache are represented as to-be-updated clusters, the to-be-merged clusters, that is, the data points that need to be added, can be filtered out of the cache according to the cluster marks and then merged into the target sketch.
Based on the method provided in the first aspect, in some embodiments, merging the to-be-merged clusters into the target sketch may be implemented as: sorting the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean; for the first cluster after sorting, determining a quantile threshold based on the target scale function; and traversing each cluster starting from the second cluster after sorting, performing the following operations on each cluster in turn: for the i-th cluster, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1; if the current quantile of the i-th cluster is below the quantile threshold, merging the i-th cluster into the previous cluster and continuing the traversal from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traversing the next cluster.
In this way the to-be-merged clusters can be added to the other clusters of the target sketch, so that the target sketch is updated.
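The following is a minimal sketch of the merge procedure described above. The quantile threshold is written here with a generic scale function and its inverse (scale, scale_inverse), whose concrete forms are assumptions for illustration rather than the formulas of this application; all names are illustrative.

```python
def merge_into_sketch(sketch_clusters, to_merge_clusters, scale, scale_inverse):
    # Each cluster is a dict {"mean": float, "weight": float}.
    clusters = sorted(sketch_clusters + to_merge_clusters, key=lambda c: c["mean"])
    if not clusters:
        return []
    total = sum(c["weight"] for c in clusters)
    merged = [dict(clusters[0])]
    seen = clusters[0]["weight"]
    # Quantile threshold for the first cluster, derived from the scale function (assumed form).
    threshold = scale_inverse(scale(seen / total) + 1.0)
    for c in clusters[1:]:
        q = (seen + c["weight"]) / total          # current quantile of this cluster
        if q <= threshold:
            # Below the threshold: fold this cluster into the previous one.
            prev = merged[-1]
            w = prev["weight"] + c["weight"]
            prev["mean"] = (prev["mean"] * prev["weight"] + c["mean"] * c["weight"]) / w
            prev["weight"] = w
        else:
            # Threshold exceeded: keep this cluster and advance the threshold.
            merged.append(dict(c))
            threshold = scale_inverse(scale(q) + 1.0)
        seen += c["weight"]
    return merged
```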
Based on the method provided in the first aspect, in some embodiments, updating the target sketch based on the to-be-updated clusters may be implemented as: obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster mark is the to-be-deleted mark, so as to obtain to-be-deleted clusters; and deleting the to-be-deleted clusters from the target sketch.
Since the cache holds data points that need to be deleted or added, after all the data points in the cache are represented as to-be-updated clusters, the to-be-deleted clusters, that is, the data points that need to be deleted, can be filtered out of the cache according to the cluster marks and then removed from the target sketch.
Based on the method provided in the first aspect, in some embodiments, deleting the to-be-deleted clusters from the target sketch may be implemented as: sorting the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean; traversing each cluster starting from the first cluster after sorting, and performing the following operations on each cluster in turn: for the j-th cluster, determining the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, deleting the j-th cluster and updating the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
In this way the to-be-deleted clusters can be removed from the target sketch, so that the target sketch is updated.
Based on the method provided in the first aspect, in some embodiments, updating the cluster weights of the clusters adjacent to the j-th cluster may be implemented as: if the j-th cluster is an intermediate cluster after sorting, obtaining the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, determining a deletion weight corresponding to the left adjacent cluster and a deletion weight corresponding to the right adjacent cluster; and updating the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster and the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
Since deleting a cluster from the target sketch affects the cluster weights of the clusters adjacent to that cluster, when a cluster is deleted the cluster weights of its adjacent clusters also need to be updated.
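The following is a minimal sketch of the deletion procedure and the adjacent-weight update described above. The embodiment only states that the deletion weights are determined from the three cluster means and the deleted cluster's weight, so the proportional redistribution rule used here is an assumption for illustration, as are all names.

```python
def delete_marked_clusters(clusters):
    # clusters: list of dicts {"mean": float, "weight": float, "mark": str},
    # containing both the sketch clusters and the to-be-deleted clusters.
    clusters = sorted(clusters, key=lambda c: c["mean"])
    j = 0
    while j < len(clusters):
        c = clusters[j]
        if c.get("mark") != "to_delete":
            j += 1
            continue
        if 0 < j < len(clusters) - 1:  # intermediate cluster: update both neighbors
            left, right = clusters[j - 1], clusters[j + 1]
            span = right["mean"] - left["mean"]
            # Assumed rule: the closer adjacent cluster absorbs the larger deletion weight.
            share_left = (right["mean"] - c["mean"]) / span if span > 0 else 0.5
            left["weight"] -= c["weight"] * share_left
            right["weight"] -= c["weight"] * (1.0 - share_left)
        clusters.pop(j)  # remove the to-be-deleted cluster itself
    return clusters
```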
Based on the method provided in the first aspect, in some embodiments, constructing the target sketch based on the target scale function and the multiple data points may be implemented as: obtaining a sketch that has already been cached and was built from some of the multiple data points and the target scale function, so as to obtain a first sketch; constructing a sketch from the data points other than those some data points among the multiple data points and the target scale function, so as to obtain a second sketch; and aggregating the first sketch and the second sketch to obtain the target sketch.
In this embodiment of the application, when the target data point needs to be queried, if some sketches have already been built in advance from part of the data points and the target scale function, a sketch can now be built from the remaining data points only, and the newly built sketch can be merged with the previously built sketches to obtain the target sketch. This avoids having to rebuild the target sketch from the full set of data points for every query, thereby saving computing resources.
Based on the method provided in the first aspect, in some embodiments, obtaining the cached sketch built from some of the multiple data points and the target scale function to obtain the first sketch may be implemented as: obtaining a target time window to be queried, where the target data point is a data point whose timestamp falls within the target time window; obtaining a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was built, and the sketch timeline identifier is the identifier of the timeline to which the data points of the corresponding sketch belong; determining first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs, where the sketch time window in the first metadata is part or all of the target time window and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determining the sketch corresponding to the first metadata as the first sketch.
The cached sketches can be managed through the metadata set, so that when a data point is queried the cached sketches can be located via the metadata set, which improves the efficiency of obtaining cached sketches.
Based on the method provided in the first aspect, in some embodiments, after the sketch is constructed from the data points other than the some data points among the multiple data points and the target scale function to obtain the second sketch, the metadata of the second sketch may also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set.
Since the second sketch is newly constructed, the metadata set can also be updated based on the second sketch, so that subsequent query operations can be performed based on the updated metadata set.
Based on the method provided in the first aspect, in some embodiments, the method may further include: determining the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs; and if the timestamp of the data point to be written and the identifier of its timeline match third metadata in the metadata set, deleting the sketch corresponding to the third metadata and updating the metadata set.
In this embodiment of the application, when new data points are written over the time range corresponding to an already cached sketch, the cached sketch needs to be invalidated, so as to avoid inconsistency between the query result and the actual data.
Based on the method provided in the first aspect, in some embodiments, the metadata set further includes first usage information corresponding to any sketch timeline identifier, and the first usage information records the usage time of each of the multiple sketches matching that sketch timeline identifier.
In this scenario, the method may further include: determining, based on the first usage information, the sketch to be evicted among the multiple sketches matching that sketch timeline identifier, and deleting the sketch to be evicted.
In this embodiment of the application, more and more sketches are cached as time goes by. To prevent excessive sketches from wasting cache space, sketches can also be evicted. Specifically, some of the sketches belonging to the same timeline can be evicted, so that sketches are evicted along the time dimension.
Based on the method provided in the first aspect, in some embodiments, the metadata set further includes second usage information, which records, for each of the multiple sketch timeline identifiers in the metadata set, the usage information corresponding to that sketch timeline identifier; the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching that sketch timeline identifier.
In this scenario, the method may further include: determining, based on the second usage information, the sketch timeline identifier to be evicted among the multiple sketch timeline identifiers, and deleting the sketches matching the sketch timeline identifier to be evicted.
In addition, the sketches of a particular timeline among the different timelines can be evicted, so that sketches are evicted along the space dimension.
In a second aspect, a data point query apparatus is provided. The data point query apparatus has the function of implementing the behaviour of the data point query method in the first aspect. The data point query apparatus includes at least one module, and the at least one module is used to implement the data point query method provided in the first aspect.
In a third aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the data point query method provided in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to perform the data point query method described in the first aspect.
In a fifth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to perform the data point query method described in the first aspect.
The technical effects obtained by the second, third, fourth and fifth aspects are similar to those obtained by the corresponding technical means in the first aspect, and are not repeated here.
Description of drawings
Figure 1 is a flowchart of a data point query method provided by an embodiment of this application;
Figure 2 is a schematic diagram of the curve trends of a first scale function S1(q) and the derivative of S1(q) provided by an embodiment of this application;
Figure 3 is a schematic diagram of the curve trends of a second scale function S2(q) and the derivative of S2(q) provided by an embodiment of this application;
Figure 4 is a schematic diagram of a query flow for querying a data value based on a target sketch and a target quantile provided by an embodiment of this application;
Figure 5 is a schematic flowchart of querying an equal-height histogram provided by an embodiment of this application;
Figure 6 is a schematic diagram of a query flow for querying the standard quantile q of a target data point based on a target sketch and the data value Q of the target data point provided by an embodiment of this application;
Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of this application;
Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of this application;
Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of this application;
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of this application;
Figure 11 is a schematic architecture diagram of an incremental update system provided by an embodiment of this application;
Figure 12 is a flowchart of an incremental update method provided by an embodiment of this application;
Figure 13 is a schematic diagram of managing metadata in the space and time dimensions provided by an embodiment of this application;
Figure 14 is a schematic structural diagram of a data point query apparatus provided by an embodiment of this application;
Figure 15 is a schematic structural diagram of a computing device provided by an embodiment of this application;
Figure 16 is a schematic structural diagram of a computing device cluster provided by an embodiment of this application;
Figure 17 is a schematic diagram of a connection mode between computing device clusters provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions and advantages of the embodiments of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
It should be understood that "multiple" mentioned herein means two or more. In the description of this application, unless otherwise stated, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, to describe the technical solutions of the embodiments of this application clearly, words such as "first" and "second" are used in the embodiments of this application to distinguish between identical or similar items whose functions and effects are basically the same. Those skilled in the art can understand that the words "first" and "second" do not limit the quantity or the execution order, and do not necessarily indicate a difference.
Before the embodiments of this application are explained in detail, the application scenarios of the embodiments of this application are first introduced.
With the rapid development of the fifth-generation mobile communication technology (5th Generation Mobile Communication Technology, 5G) and Internet of Things (IoT) technology, the data points generated in various industries are growing explosively and on a large scale. Each data point represents a specific piece of data, such as a temperature, a humidity or a weather value. It is therefore necessary to perform statistics and analysis on large numbers of data points in order to mine useful features from them.
Current data point analysis methods include the quantile method and the histogram method.
A quantile characterizes the position of a data point in the sequence obtained by sorting a large number of data points by size. Compared with using extreme values (maximum and/or minimum) to characterize a large number of data points, quantiles can shield the false extreme-value information caused by abnormal data points and thus represent the real information of each stage within the large number of data points. On this basis, for a company that provides Internet services, quantiles can serve as one of the important indicators for measuring the operating status of the company's network. In addition, quantile queries are also applied in fields such as weather temperature trends, log mining, stock trend analysis, virtual currency volume-price indicators, and financial data analysis.
In some techniques, in order to compute quantiles precisely, the full set of data points needs to be sorted, and the quantile corresponding to each data point is then computed from the position of each data point after sorting. For example, for the quantile q of a certain data point, the value of q is a real number between 0 and 1, that is, q ∈ [0, 1]; q = 1 means that the data point is the largest data point in the full set, and q = 0.5 means that the data point is the middle data point of the sorted full set. The time and space complexity of determining quantiles with this technique is O(NlogN), where N is the total number of data points in the full set.
In a scenario where the quantile of each data point is known, if the quantile of the data point to be queried is q, the item at the position corresponding to q in the sorted full set of data points is determined based on the quantile q, and the result obtained is the data value of that data point, namely the query result.
However, in the fields of IoT and DevOps (a combination of "development" and "operations", a collective term for a set of processes, methods and systems), data points are usually stored in a time series database, and the large volume of data points makes the time series database large in scale. For example, the volume of data points in a large-scale time series database reaches the TB (terabyte, a storage unit) or even PB (petabyte, a storage unit) level, so the memory of an ordinary computer cannot hold the full set of data points. Moreover, for such a huge data volume, the computational overhead required for a strict sort of all data points is also very large. In this scenario the technique of precisely computing quantiles no longer has practical value, so approximate quantile computation techniques have gradually emerged. Approximate quantile computation refers to techniques that compute quantiles with approximate algorithms.
The t-digest algorithm (an online clustering algorithm) is currently a commonly used algorithm in approximate quantile computation. The basic principle of the algorithm is to cluster the full set of data to obtain multiple clusters. Each cluster has a corresponding cluster mean and cluster weight: the cluster mean indicates the average value of the data points aggregated into the corresponding cluster, and the cluster weight indicates the number of data points aggregated into the corresponding cluster. The multiple clusters constructed are usually called a sketch. The quantile of each cluster can be determined from the cluster mean and cluster weight of each cluster in the sketch. Later, when the data value of a data point needs to be queried based on a quantile q, the approximate data value of that data point is computed by linear interpolation from the quantiles and cluster means of the clusters in the sketch. In this algorithm, the accuracy and efficiency of queries can be adjusted through the number of clusters in the sketch.
In addition, as a simple and efficient statistical analysis tool, a histogram can intuitively describe the data distribution characteristics of multiple data points, so histograms are widely used in the field of network monitoring and operation and maintenance. In a histogram, the abscissa represents the data values of the data points and the ordinate represents the number of data points. A histogram includes multiple bars, each of which can be called a bucket, and the height of each bucket represents the number of data points whose data values fall into the data value interval corresponding to that bucket.
Histograms currently include equal-height histograms and equal-width histograms. An equal-height histogram is a histogram in which the heights of the buckets are close to one another; an equal-width histogram is a histogram in which every bucket has the same width.
Based on the above application scenarios, the embodiments of this application provide a data point query method. The method provided by the embodiments of this application can achieve the following technical effects: first, high-precision query of the quantiles of data points over the full range; second, deletion of data points from the sketch; and third, incremental updates that avoid rebuilding the sketch for every query, thereby avoiding a waste of resources.
The data point query method provided by the embodiments of this application is explained in detail below.
Figure 1 is a flowchart of a data point query method provided by an embodiment of this application. As shown in Figure 1, the method includes the following steps 101 to 103.
Step 101: Based on the target quantile corresponding to the target data point to be queried, determine a target scale function from multiple scale functions, where sketches constructed with different ones of the multiple scale functions differ in how densely their clusters are packed, and the target quantile indicates the position of the target data point among the multiple data points sorted by size.
The scale function controls how densely the clusters in a sketch are packed, which is related to the size of each cluster. The size of a cluster indicates how many data points were aggregated into it. The larger a cluster, the more data points it aggregates; its cluster mean then represents the data values of a large number of data points, the clusters of the sketch are correspondingly sparse, it is hard to distinguish the data values of individual data points from the sketch, and the accuracy of the sketch is therefore low. The smaller a cluster, the fewer data points it aggregates; its cluster mean then represents the data values of a small number of data points, the clusters of the sketch are correspondingly dense, it is easy to distinguish the data values of individual data points from the sketch, and the accuracy of the sketch is therefore high. On this basis, in the embodiments of this application the scale function can be used to control the accuracy of the sketch, so as to improve the accuracy of subsequent queries.
Since sketches constructed with different scale functions differ in how densely their clusters are packed, in the embodiments of this application the target scale function can be selected adaptively based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed with the target scale function has dense clusters near the target quantile. When the clusters of the sketch are dense, they characterize the data points from which they were clustered more accurately, which improves the accuracy of querying the target data point based on the sketch.
In some embodiments, the multiple scale functions include a first scale function and a second scale function. The clusters of a sketch constructed with the first scale function are denser on a first quantile interval than the clusters of a sketch constructed with the second scale function, and the clusters of a sketch constructed with the first scale function are less dense on a second quantile interval than the clusters of a sketch constructed with the second scale function.
In this scenario, determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point in step 101 may be implemented as: if the target quantile lies in the first quantile interval, determining the first scale function as the target scale function; if the target quantile lies in the second quantile interval, determining the second scale function as the target scale function.
Since a sketch constructed with the first scale function has denser clusters on the first quantile interval and a sketch constructed with the second scale function has denser clusters on the second quantile interval, the first scale function or the second scale function can be selected adaptively according to the target quantile corresponding to the target data point, so that the constructed sketch has dense clusters on the interval around the target quantile.
The first quantile interval and the second quantile interval may each be any interval within the global quantile interval [0, 1]. As an example, the union of the first quantile interval and the second quantile interval is the global quantile interval [0, 1]; in that case the method provided by the embodiments of this application enables accurate querying of the data point corresponding to any quantile in [0, 1], that is, high-precision queries over the full range.
For example, the first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1 and x1 is less than x2, and the second quantile interval is the interval from x1 to x2. That is, the first quantile interval consists of the intervals near the two ends of the global quantile interval [0, 1], and the second quantile interval is the middle interval of [0, 1]. For example, x1 may be 0.2 and x2 may be 0.8; in this scenario the quantile intervals corresponding to the first scale function are [0, 0.2] and [0.8, 1], and the quantile interval corresponding to the second scale function is [0.2, 0.8]. Optionally, x1 and x2 may also take other real values in the global quantile interval [0, 1], which are not enumerated here one by one.
In the embodiments of this application, the first scale function may be designed as the function shown in formula (1) below, and the second scale function as the function shown in formula (2) below.
In formula (1) and formula (2), q denotes the quantile, α denotes a hyperparameter that indicates the number of clusters, and S1(q) and S2(q) denote the first scale function and the second scale function respectively; the derivatives of S1(q) and S2(q) characterize how densely the clusters of the constructed sketch are packed.
Figure 2 is a schematic diagram of the curve trends of the first scale function S1(q) and its derivative provided by an embodiment of this application. Figure 3 is a schematic diagram of the curve trends of the second scale function S2(q) and its derivative provided by an embodiment of this application.
As shown in Figure 2, it can be seen from the curve of the first scale function S1(q) that the first scale function grows quickly on the intervals near the two ends of the global quantile interval [0, 1] and grows slowly in the middle of the interval, so the derivative of S1(q) takes relatively large values on the intervals near the two ends of [0, 1]; this can be verified from the curve of the derivative of S1(q) in Figure 2. Therefore a sketch constructed with the first scale function S1(q) has relatively dense, that is, relatively small, clusters on the intervals near the two ends of the global quantile interval [0, 1], and the sketch is correspondingly more accurate on those intervals.
As shown in Figure 3, it can be seen from the curve of the second scale function S2(q) that the second scale function grows slowly on the intervals near the two ends of the global quantile interval [0, 1] and grows quickly in the middle of the interval, so the derivative of S2(q) takes relatively large values in the middle of [0, 1]; this can be verified from the curve of the derivative of S2(q) in Figure 3. Therefore a sketch constructed with the second scale function S2(q) has relatively dense, that is, relatively small, clusters in the middle of the global quantile interval [0, 1], and the sketch is correspondingly more accurate on that interval.
Based on the two scale functions shown in Figures 2 and 3, when the target quantile corresponding to the target data point to be queried lies in an interval near either end of the global quantile interval [0, 1], for example [0, 0.2] or [0.8, 1], the first scale function S1(q) can be selected to construct the sketch; when the target quantile lies in the middle of the global quantile interval [0, 1], for example [0.2, 0.8], the second scale function S2(q) can be selected to construct the sketch, so as to improve the accuracy of the constructed sketch and thereby the accuracy of querying data points. In other words, the embodiments of this application provide a method of adaptively selecting a scale function to construct a sketch according to the query environment.
The above description uses the first scale function and the second scale function as examples. Optionally, more than two scale functions may also be designed, with these scale functions corresponding to different cluster densities on different intervals of the global quantile interval [0, 1]; that is, the scale functions behave differently on different intervals of [0, 1], thereby realizing the method, provided by the embodiments of this application, of adaptively selecting a scale function to construct a sketch according to the query environment.
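As an illustration of the adaptive selection described above, the following is a minimal sketch that uses the example thresholds x1 = 0.2 and x2 = 0.8; the scale functions are passed in as plain callables because their concrete forms in formulas (1) and (2) are not reproduced here, and all names are illustrative.

```python
def choose_scale_function(target_q, scale_fn_1, scale_fn_2, x1=0.2, x2=0.8):
    # scale_fn_1: denser clusters near both ends of [0, 1] (first scale function)
    # scale_fn_2: denser clusters in the middle of [0, 1] (second scale function)
    # target_q: target quantile corresponding to the data point to be queried
    if target_q <= x1 or target_q >= x2:
        # Target quantile lies near an end of [0, 1]: use the first scale function.
        return scale_fn_1
    # Target quantile lies in the middle interval [x1, x2]: use the second scale function.
    return scale_fn_2
```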
Step 102: Construct a target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster.
Constructing the target sketch based on the target scale function and the multiple data points may follow the t-digest algorithm or other clustering methods, which is not limited in the embodiments of this application.
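Since the construction may follow a t-digest-style procedure, the following is a simplified illustration of one such construction under that assumption: sorted data points are greedily accumulated into clusters whose sizes are bounded by the chosen scale function. It is only a sketch, not the exact construction of this application; scale stands for the selected target scale function.

```python
def build_sketch(data_points, scale):
    # data_points: iterable of numeric values; scale: the selected target scale function S(q).
    # Each cluster is kept as a dict {"mean": ..., "weight": ...}.
    points = sorted(data_points)
    n = len(points)
    clusters = []
    start = 0
    while start < n:
        end = start + 1
        # Grow the current cluster while the scale function allows it:
        # the cluster may span at most one unit of the scale S(q).
        while end < n and scale(end / n) - scale(start / n) <= 1.0:
            end += 1
        chunk = points[start:end]
        clusters.append({"mean": sum(chunk) / len(chunk), "weight": len(chunk)})
        start = end
    return clusters
```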
Step 103: Query the target data point based on the target sketch.
In the embodiments of this application, the data value of the target data point can be queried based on its quantile, or the quantile of the target data point can be queried based on its data value. The two application scenarios are explained below.
First application scenario: querying a data value based on a quantile
In the first application scenario, step 103 may be implemented as: querying the data value of the target data point based on the target sketch and the target quantile.
For ease of description, the target quantile is denoted q, a decimal between 0 and 1. Assuming that the total number of data points from which the target sketch is built is N, the query result obtained based on the target sketch and q is an approximate estimate of the element at the position corresponding to q in the sorted sequence of the full set of data points; this query result is the data value of the target data point.
Assuming that the data value of the largest data point in the full set is max and the data value of the smallest data point is min, an example of the query flow for querying a data value based on the target sketch and the target quantile is shown in Figure 4; the flow in Figure 4 is as follows:
(1) If N*q < 0.5*C1_weight, the query result is obtained by interpolation.
Here C1_weight is the cluster weight of the first cluster in the target sketch and C1_value is the cluster mean of the first cluster in the target sketch, where the first cluster in the target sketch refers to the first cluster after the clusters are sorted in ascending order of cluster mean.
(2) If N*q > N - 0.5*Cm_weight, the query result is obtained by interpolation.
Here Cm_weight is the cluster weight of the last cluster in the target sketch and Cm_value is the cluster mean of the last cluster in the target sketch, where the last cluster in the target sketch refers to the last cluster after the clusters are sorted in ascending order of cluster mean.
(3) If neither the condition in (1) nor the condition in (2) is satisfied, all clusters are traversed starting from the first cluster. Assuming that the traversal has currently reached the i-th cluster, the following operations are performed on the i-th cluster:
a) Compute the cumulative sum Wi of the cluster weights of the clusters traversed so far (including the current cluster), that is, the sum of the cluster weights of the first i clusters.
b) If Wi <= N*q < Wi+1, continue to traverse the next cluster; otherwise, compute the query result by interpolation based on the current cluster and the next cluster. An example of the interpolation is as follows:
Assume that the cluster means of the two clusters on the left and right of the interpolation are vl and vr respectively, their cluster weights are wl and wr respectively, and the final query result is denoted Qq. Qq is then obtained with two formulas: the first gives an interpolation coefficient p, and the second is
Qq = p*(vr - vl) + vl
It should be noted that the query flow shown in Figure 4 is for illustration; the embodiments of this application do not limit the implementation of querying the data value of the target data point based on the constructed target sketch and the target quantile. The formulas in the above interpolation are likewise only examples of the interpolation method and are not limiting.
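The following is a minimal sketch in the spirit of the Figure 4 flow, under explicit assumptions: clusters are sorted by mean, Wi is the cumulative weight up to and including cluster i, the traversal interpolates between the two cluster means whose cumulative weights bracket N*q with a linear coefficient p, and the two boundary cases interpolate linearly towards min and max. The exact boundary formulas of this application are not reproduced here, and all names are illustrative.

```python
def query_value_by_quantile(clusters, q, n, vmin, vmax):
    # clusters: list of {"mean": ..., "weight": ...} sorted by ascending mean;
    # q: target quantile; n: total number of data points; vmin/vmax: min/max data values.
    rank = n * q
    first, last = clusters[0], clusters[-1]
    if rank < 0.5 * first["weight"]:
        # Case (1): assumed linear interpolation between min and the first cluster mean.
        return vmin + (first["mean"] - vmin) * rank / (0.5 * first["weight"])
    if rank > n - 0.5 * last["weight"]:
        # Case (2): assumed linear interpolation between the last cluster mean and max.
        return vmax - (vmax - last["mean"]) * (n - rank) / (0.5 * last["weight"])
    cum = 0.0
    for i in range(len(clusters) - 1):
        cum += clusters[i]["weight"]                  # Wi, including the current cluster
        nxt = cum + clusters[i + 1]["weight"]         # W(i+1)
        if cum <= rank < nxt:
            vl, vr = clusters[i]["mean"], clusters[i + 1]["mean"]
            p = (rank - cum) / (nxt - cum)            # assumed interpolation coefficient
            return p * (vr - vl) + vl                 # Qq = p*(vr - vl) + vl
    return last["mean"]
```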
In addition, in the first application scenario there are, for example, the following two situations in which a data value needs to be queried based on a quantile. They are explained below.
First situation: querying in response to a data point query request
In the first situation, before the sketch is constructed, a data point query request may be received. The data point query request is used to query the data value of the target data point among the multiple data points and carries the standard quantile of the target data point. In this situation, the standard quantile carried in the data point query request is determined as the target quantile.
The standard quantile may be a quantile entered by the user; that is, when triggering the data point query request, the user also enters a quantile, so that the method provided by the embodiments of this application can subsequently query the specific data value based on the quantile entered by the user.
Thus, in the first situation, the scale function can be selected adaptively according to the quantile entered by the user and the sketch constructed accordingly; the constructed sketch is dense on the interval around the quantile entered by the user, which improves the accuracy of the query result.
Second situation: querying in response to an equal-height histogram query request
In the second situation, before the sketch is constructed, an equal-height histogram query request may be received. The equal-height histogram query request is used to query an equal-height histogram constructed from the multiple data points and carries the number of buckets h. In this situation, the target quantile is determined as follows: based on the number of buckets h and the total number of the multiple data points, determine the quantiles of the first to the (h-1)-th bucket, counted from left to right in the equal-height histogram, to obtain h-1 quantiles; then take each of the h-1 quantiles in turn as the target quantile and perform steps 101 to 103, so as to obtain h-1 data values in one-to-one correspondence with the h-1 quantiles.
After the h-1 data values corresponding one-to-one to the h-1 quantiles are obtained, the equal-height histogram can be drawn based on the h-1 data values together with the data values of the largest and smallest data points among the multiple data points.
In the equal-height histogram, all buckets have the same height, namely the ratio of the total number N to the number of buckets h, and the coordinates on the horizontal axis increase from left to right. For ease of description, the h buckets from left to right are labelled the first bucket, the second bucket, ..., the h-th bucket. In this case, determining the quantiles of the first to the (h-1)-th bucket based on the number of buckets h and the total number of data points can be implemented as: the quantile of the i-th bucket is i/h, where i is an integer greater than or equal to 1 and less than h.
It should be noted that each bucket of the equal-height histogram has a left boundary value and a right boundary value on the abscissa, and the quantile of a bucket mentioned above refers specifically to the quantile corresponding to the right boundary value of that bucket. The quantile corresponding to the h-th bucket is therefore 1.
In addition, the determination of the h-1 data values corresponding one-to-one to the h-1 quantiles can follow the flow shown in Figure 4 and is not repeated here.
After the h-1 data values are obtained, the equal-height histogram is drawn from them together with the data values of the largest and smallest data points. As an example, the data value of the smallest data point, the h-1 data values and the data value of the largest data point are sorted in ascending order; after sorting, every two adjacent data values are respectively the left boundary value and the right boundary value of one bucket of the equal-height histogram, and the height of each bucket is the ratio of the total number of data points to the number of buckets h.
Figure 5 is a schematic flowchart of querying an equal-height histogram provided by an embodiment of this application. As shown in Figure 5, the flow of querying the equal-height histogram includes the following steps:
a) Determine the number of buckets h of the equal-height histogram, and initialise a quantile array T = [0, 0, ..., 1] and a boundary array B = [0, ..., 0], both of length h+1.
b) Compute the quantile q value of each bucket from the first bucket to the (h-1)-th bucket, where the quantile of each bucket indicates the quantile corresponding to the right boundary value of that bucket. The q value of the i-th bucket is qi, which is filled into the (i+1)-th position of the quantile array T, giving T = [0, q1, q2, ..., qh-1, 1].
c) Traverse the array T, determine the Qi value corresponding to each qi value using the target sketch, and add the Qi value to the i-th position of the boundary array B, giving B = [Q0, Q1, Q2, ..., Qh-1, Qh], where Q0 is the data value of the smallest data point in the full set and Qh is the data value of the largest data point in the full set.
d) Finally, construct the equal-height histogram from the boundary array B = [Q0, Q1, Q2, ..., Qh-1, Qh].
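A minimal sketch of the Figure 5 flow, reusing the quantile-to-value query sketched earlier (query_value_by_quantile); that helper's name and signature are assumptions for illustration.

```python
def equal_height_histogram(clusters, h, n, vmin, vmax):
    # h: number of buckets; n: total number of data points; vmin/vmax: min/max data values.
    # Steps a/b: quantile array T = [0, 1/h, 2/h, ..., (h-1)/h, 1].
    t = [i / h for i in range(h + 1)]
    # Step c: boundary array B, with B[0] = min and B[h] = max.
    b = [vmin] + [query_value_by_quantile(clusters, q, n, vmin, vmax) for q in t[1:-1]] + [vmax]
    # Step d: each adjacent pair of boundaries forms one bucket; every bucket holds about n/h points.
    return [(b[i], b[i + 1], n / h) for i in range(h)]
```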
It should be noted that the above two situations are only examples of application scenarios in which a data value is queried based on a quantile; the embodiments of this application do not limit the application scenarios of querying data values based on quantiles.
Second application scenario: querying a quantile based on a data value
In the second application scenario, the quantile is queried based on the data value of the data point, so the quantile of the data point is not known in advance. In this scenario, a quantile can first be estimated from the data value of the data point, and the estimated quantile is then used as the target quantile for adaptively selecting the scale function and constructing the sketch. On this basis, in some embodiments, determining the target quantile may be implemented as: determining an estimated quantile of the target data point based on the data value of the target data point and the data values of the largest and smallest data points among the multiple data points, and using the estimated quantile as the target quantile.
As an example, the estimated quantile of the target data point can be determined from the data value of the target data point and the data values of the largest and smallest data points with a formula in which Q denotes the data value of the target data point to be queried.
Optionally, the estimated quantile of the target data point may also be determined in other ways based on the data value of the target data point and the data values of the largest and smallest data points among the multiple data points, which is not limited in the embodiments of this application.
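Since the formula itself is not reproduced here, the following one-line sketch uses a simple linear estimate between min and max as an assumed stand-in, consistent with the inputs named above; the actual formula of this embodiment may differ.

```python
def estimate_quantile(value, vmin, vmax):
    # Assumed linear estimate of the quantile between the smallest and largest data values.
    if vmax == vmin:
        return 0.5
    return (value - vmin) / (vmax - vmin)
```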
In this scenario, step 103 may be implemented as: querying the standard quantile of the target data point based on the target sketch and the data value of the target data point. To distinguish it from the estimated quantile mentioned above, the quantile obtained by the query is called the standard quantile.
For ease of description, the data value of the target data point is denoted Q and the standard quantile is denoted q; the query result obtained based on the target sketch and Q is then q.
Assuming that the total number of data points from which the target sketch is built is N, the data value of the largest data point in the full set is max, and the data value of the smallest data point is min, an example of the query flow for querying the standard quantile q of the target data point based on the target sketch and the data value Q of the target data point is shown in Figure 6; the flow in Figure 6 is as follows:
(1) If Q < C1_value, the query result is obtained by interpolation using the first cluster, where C1_weight is the cluster weight of the first cluster in the target sketch and C1_value is the cluster mean of the first cluster. The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean in ascending order.

(2) If Q ≥ Cm_value, the query result is obtained by interpolation using the last cluster, where Cm_weight is the cluster weight of the last cluster in the target sketch and Cm_value is the cluster mean of the last cluster. The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean in ascending order.
(3) If neither of the conditions in (1) and (2) is satisfied, all clusters are traversed starting from the first cluster. Assuming the traversal has currently reached the i-th cluster, the following operations are performed on the i-th cluster:

a) Compute the cumulative sum W_i of the cluster weights of the clusters traversed so far (including the current cluster), that is, W_i = C1_weight + C2_weight + ... + Ci_weight.

b) If Ci_value ≤ Q < Ci+1_value, compute the query result by interpolation based on the current cluster, the next cluster, and W_i. If Q does not satisfy Ci_value ≤ Q < Ci+1_value, continue traversing the next cluster. An example of the interpolation is as follows:

Assume that the cluster means of the two clusters on the left and right of the interpolation position are v_l and v_r respectively, and that their cluster weights are w_l and w_r respectively; the queried standard quantile q can then be obtained by interpolating between these two clusters.

It should be noted that the query process shown in Figure 6 is given by way of example; the embodiments of the present application do not limit the implementation of querying quantiles based on the already constructed target sketch and the data value of the target data point. The interpolation described above is likewise only an example, and the embodiments of the present application do not limit it either.
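To make the flow of Figure 6 concrete, the sketch below walks the sorted clusters and interpolates a quantile for a given data value Q. The boundary handling and the exact interpolation weights are assumptions (the corresponding formulas are not reproduced in the text above), so this is illustrative rather than the embodiments' definitive implementation.

```python
def query_quantile(clusters, q_value, n_total, v_min, v_max):
    """clusters: list of (mean, weight) pairs sorted by mean ascending.
    Returns an approximate quantile in [0, 1] for the data value q_value."""
    means = [c[0] for c in clusters]
    weights = [c[1] for c in clusters]
    # (1) value below the first cluster mean: interpolate against the minimum (assumption)
    if q_value < means[0]:
        if means[0] == v_min:
            return 0.0
        frac = (q_value - v_min) / (means[0] - v_min)
        return max(frac * (weights[0] / 2.0) / n_total, 0.0)
    # (2) value at or above the last cluster mean: interpolate against the maximum (assumption)
    if q_value >= means[-1]:
        if v_max == means[-1]:
            return 1.0
        frac = (q_value - means[-1]) / (v_max - means[-1])
        return min(1.0 - (1.0 - frac) * (weights[-1] / 2.0) / n_total, 1.0)
    # (3) otherwise traverse the clusters and interpolate between adjacent cluster means
    cum = 0.0  # cumulative weight W_i of the traversed clusters
    for i in range(len(clusters) - 1):
        cum += weights[i]
        if means[i] <= q_value < means[i + 1]:
            frac = (q_value - means[i]) / (means[i + 1] - means[i])
            # split half of each neighbouring cluster across the gap (assumed weighting)
            between = (weights[i] + weights[i + 1]) / 2.0
            return (cum - weights[i] / 2.0 + frac * between) / n_total
    return 1.0
```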
In addition, in the second application scenario, there are the following two situations in which quantiles need to be queried based on data values. These are explained below.

First case: querying in response to a quantile query request

In the first case, before the sketch is constructed, a quantile query request may be received. The quantile query request is used to query the standard quantile of a target data point among multiple data points, and the quantile query request carries the data value of the target data point.

In this way, in the first case, a quantile can be estimated from the data value input by the user, the scale function can then be selected adaptively based on the estimated quantile, and a sketch can be constructed. The constructed sketch is denser in the interval near the quantile corresponding to the data value input by the user, thereby improving the accuracy of the query result.

Second case: querying in response to an equal-width histogram query request
In the second case, before the sketch is constructed, an equal-width histogram query request may be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points, and the request carries a bucket boundary array. The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals. Each of the n boundary values is used in turn as the data value of the target data point, and steps 101 to 103 are performed, so as to obtain n standard quantiles in one-to-one correspondence with the n boundary values.

After the n standard quantiles corresponding one-to-one to the n boundary values are obtained, the equal-width histogram can be drawn based on these n standard quantiles.

The n boundary values in the bucket boundary array are arranged in ascending order and form an arithmetic sequence, so that every bucket in the equal-width histogram has the same width.

In addition, the coordinates on the horizontal axis of the equal-width histogram increase from left to right. For ease of description, the n+1 buckets from left to right in the equal-width histogram are successively labeled the first bucket, the second bucket, ..., the (n+1)-th bucket. Thus, the left boundary value of the first bucket is the data value of the smallest data point among the full set of data points; the left boundary value of the second bucket (that is, the right boundary value of the first bucket) is the first boundary value in the bucket boundary array; the left boundary value of the third bucket (that is, the right boundary value of the second bucket) is the second boundary value in the bucket boundary array; and so on. The left boundary value of the (n+1)-th bucket (that is, the right boundary value of the n-th bucket) is the n-th boundary value in the bucket boundary array, and the right boundary value of the (n+1)-th bucket is the data value of the largest data point among the full set of data points.

In this way, based on the n standard quantiles corresponding one-to-one to the n boundary values, the equal-width histogram may be drawn as follows: after the quantile corresponding to each boundary value in the bucket boundary array is determined, the number of data points falling between two adjacent boundary values can be determined based on the total number of data points and the quantile corresponding to each boundary value, and the height of each bucket in the equal-width histogram can be obtained from the number of data points falling between two adjacent boundary values. The specific implementation is described in detail later.

In addition, for determining the n standard quantiles corresponding one-to-one to the n boundary values, reference can be made to the flowchart shown in Figure 6; details are not repeated here.
Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application. As shown in Figure 7, the process of querying an equal-width histogram includes the following steps:

1) Input the bucket boundaries B = [b1, b2, b3, ..., bh] of the equal-width histogram to be queried (that is, the bucket boundary array), and initialize an array C = [0, 0, ..., 0] whose length is h+1.

2) Based on steps 101 to 103, compute the q value corresponding to each element in B by adaptively selecting the scale function. Assuming the traversal has reached the i-th element and the computed q value is qi, each element of the array C is determined as follows:

a) For the first element, set the first element of C to q1.

b) For the last element, set the last element of C to 1 - qh.

c) Otherwise, set the i-th element of C to the difference qi - qi-1 between the current q value and the previous one.

3) If the vertical axis of the equal-width histogram represents frequency, set C[i] = N * C[i]. Each element of the resulting array C is then the height of one bucket, and the height of each bucket represents the number of data points whose data values fall within the boundary range of that bucket.

Optionally, if the vertical axis of the equal-width histogram represents probability, there is no need to set C[i] = N * C[i]. Each element of the resulting array C is still the height of one bucket, and the height of each bucket then represents the ratio of the number of data points whose data values fall within the boundary range of that bucket to the total number N.
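The following sketch mirrors the Figure 7 flow under two assumptions made for illustration: `query_quantile` is the value-to-quantile routine sketched earlier, and the first and last buckets absorb everything below the first and above the last boundary.

```python
def equal_width_histogram(clusters, boundaries, n_total, v_min, v_max, as_frequency=True):
    """boundaries: ascending, equally spaced boundary values b1..bh.
    Returns h+1 bucket heights (counts if as_frequency, else probabilities)."""
    q_values = [query_quantile(clusters, b, n_total, v_min, v_max) for b in boundaries]
    heights = [0.0] * (len(boundaries) + 1)
    heights[0] = q_values[0]            # share of points below the first boundary
    heights[-1] = 1.0 - q_values[-1]    # share of points at or above the last boundary
    for i in range(1, len(boundaries)):
        heights[i] = q_values[i] - q_values[i - 1]  # share between adjacent boundaries
    if as_frequency:
        heights = [h * n_total for h in heights]    # convert probabilities to counts
    return heights
```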
It should be noted that the above two situations are used to illustrate application scenarios of querying quantiles based on data values; the embodiments of the present application do not limit the application scenarios of querying quantiles based on data values.

Based on the embodiment shown in Figure 1, the scale function can be adaptively selected according to the target quantile corresponding to the target data point to be queried, so as to improve the accuracy of the constructed target sketch near the target quantile and thereby improve the accuracy of the query result. This way of adaptively selecting the scale function can be applied in the scenario of querying data values based on quantiles, in the scenario of querying quantiles based on data values, in the scenario of querying an equal-height histogram, and in the scenario of querying an equal-width histogram. Therefore, the method provided by the embodiments of the present application can improve the accuracy of query results in various query scenarios.
The above embodiment explains how to adaptively select a scale function to construct the target sketch. The embodiments of the present application further provide, for a target sketch that has already been constructed, a method of inserting data points into or deleting data points from the target sketch, in order to update the target sketch.

Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps 801 to 802.

Step 801: Generate to-be-updated clusters corresponding to the to-be-updated data points in the cache, where a to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag; the cluster mean of a to-be-updated cluster indicates the data value of the to-be-updated data points, the cluster weight indicates the number of to-be-updated data points, and the cluster tag indicates the update type of the to-be-updated data points.

Step 802: Update the target sketch based on the to-be-updated clusters.
In the embodiments of the present application, in order to update the target sketch, a triplet may be used to represent a cluster. The triplet can be expressed as <v, w, f>, where v represents the cluster mean of the cluster, w represents the cluster weight of the cluster, and f represents the cluster tag of the cluster. The cluster tag indicates whether the cluster is a cluster to be deleted or a cluster to be merged.

Based on this, the data points in the cache are represented as to-be-updated clusters in the form of the above triplet. That is, the to-be-updated data points in the cache correspond to to-be-updated clusters; each to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag, where the cluster mean indicates the data value of the to-be-updated data points, the cluster weight indicates the number of to-be-updated data points, and the cluster tag indicates the update type of the to-be-updated data points.

For example, the cluster tag of a to-be-updated cluster is either a to-be-merged tag or a to-be-deleted tag. For instance, when f = 1 in the triplet, the cluster tag is the to-be-merged tag, indicating that the corresponding cluster is a cluster to be merged into the target sketch; when f = -1, the cluster tag is the to-be-deleted tag, indicating that the corresponding cluster is a cluster to be deleted from the target sketch.
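A minimal representation of the <v, w, f> triplet described above might look as follows; the field and constant names are illustrative, not taken from the embodiments.

```python
from dataclasses import dataclass

MERGE = 1    # f = 1: cluster to be merged into the target sketch
DELETE = -1  # f = -1: cluster to be deleted from the target sketch

@dataclass
class Cluster:
    mean: float       # v: cluster mean (the data value for a single buffered point)
    weight: float     # w: cluster weight (number of data points)
    tag: int = MERGE  # f: update type
```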
The update operations on the target sketch include inserting data points into the target sketch and deleting data points from the target sketch. These two cases are explained below.

First case: inserting data points into the target sketch

In the first case, step 802 is implemented as follows: obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is the to-be-merged tag, to obtain the to-be-merged clusters; and merge the to-be-merged clusters into the target sketch.

Since the cache holds data points that need to be deleted or newly added, after the data points in the cache are represented as to-be-updated clusters, the to-be-merged clusters, that is, the newly added data points, can be filtered out of the cache according to the cluster tags, and the to-be-merged clusters are then merged into the target sketch.
In some embodiments, merging the to-be-merged clusters into the target sketch may be implemented as follows: sort the clusters in the target sketch together with the to-be-merged clusters in ascending order of cluster mean; for the first cluster after sorting, determine a quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn:

For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster. If the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.

That is, whether an adjacent cluster is suitable to be merged into the current cluster is judged based on the quantile threshold of a given cluster, where the quantile threshold indicates the capacity limit of the corresponding cluster.
For example, for the first cluster after sorting, the quantile threshold may be determined based on the target scale function as follows: set the current quantile q_0 of the first cluster to 0, and determine the quantile threshold q_threshold by the following formula:

q_threshold = k^(-1)(k(q_0) + 1)

where k(·) denotes the target scale function and k^(-1)(·) denotes its inverse.
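For concreteness, one commonly used arcsine-style family of quantile-sketch scale functions and its inverse is shown below; the embodiments do not name their scale functions at this point, so this particular pair is an assumption used only to make the threshold formula q_threshold = k^(-1)(k(q_0) + 1) executable. The returned pair can be passed as k and k_inv to the merge routine sketched after Figure 9.

```python
import math

def make_scale_function(delta):
    """An arcsine-style scale function k(q) with compression factor delta, together
    with its inverse; it makes clusters denser near q = 0 and q = 1 (assumption)."""
    def k(q):
        q = min(max(q, 0.0), 1.0)
        return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)
    def k_inv(y):
        y = min(max(y, -delta / 4.0), delta / 4.0)  # clamp to the invertible range
        return 0.5 * (math.sin(2.0 * math.pi * y / delta) + 1.0)
    return k, k_inv
```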
In addition, for example, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster may be implemented as follows: determine the sum of the cluster weights of the clusters traversed so far (including the i-th cluster), determine the sum of the cluster weights of all sorted clusters, and use the ratio of the two sums as the current quantile of the i-th cluster.

Furthermore, if the current quantile of the i-th cluster is lower than the quantile threshold, the i-th cluster is merged into the previous cluster. For example, merging the i-th cluster into the previous cluster means updating the cluster weight and cluster mean of the previous cluster based on the cluster weight and cluster mean of the i-th cluster. For instance, the cluster mean of the i-th cluster and the cluster mean of the previous cluster are combined by weighting them with their respective cluster weights, and the resulting value is used as the updated cluster mean of the previous cluster; the cluster weight of the i-th cluster is added to the cluster weight of the previous cluster, and the resulting value is used as the updated cluster weight of the previous cluster.

In addition, for updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, reference can likewise be made to the above formula for determining the quantile threshold q_threshold; details are not repeated here.
Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application. As shown in Figure 9, newly added data points are first placed in the cache (that is, the buffer), and the newly added data points in the cache are represented as triplets to obtain the to-be-merged clusters. The to-be-merged clusters and the clusters in the target sketch are sorted. The quantile threshold is computed from the first cluster after sorting, and the traversal then starts from the second cluster. For any current cluster reached by the traversal, it is judged whether the quantile of the current cluster is less than or equal to the quantile threshold. If the quantile of the current cluster is less than or equal to the quantile threshold, the current cluster is merged into the previous cluster, the current cluster is deleted, the updated previous cluster is redesignated as the current cluster, and the above operations continue. If the quantile of the current cluster is greater than the quantile threshold, the quantile threshold is recomputed based on the quantile of the current cluster, and the traversal moves on to the next cluster.

It should be noted that the detailed implementation of the insertion process shown in Figure 9 is given by way of example; the embodiments of the present application do not limit the detailed implementation of the insertion operation.
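Under the assumptions that clusters follow the <v, w, f> triplet sketched earlier and that q_threshold = k^(-1)(k(q) + 1) for the chosen scale function k, a minimal merge routine for the insertion flow of Figure 9 might look as follows; the weighted-mean update and the threshold handling are illustrative, not the embodiments' definitive implementation.

```python
def merge_into_sketch(sketch, to_merge, k, k_inv):
    """sketch, to_merge: lists of Cluster instances with tag MERGE.
    k, k_inv: the target scale function and its inverse."""
    clusters = sorted(sketch + to_merge, key=lambda c: c.mean)
    total = sum(c.weight for c in clusters)
    merged = [clusters[0]]
    q_threshold = k_inv(k(0.0) + 1)   # threshold derived from the first cluster (q0 = 0)
    cum = clusters[0].weight          # cumulative weight of traversed clusters
    for cur in clusters[1:]:
        cum += cur.weight
        q = cum / total               # current quantile of the traversed prefix
        prev = merged[-1]
        if q <= q_threshold:
            # merge the current cluster into the previous one (weighted mean, summed weight)
            w = prev.weight + cur.weight
            prev.mean = (prev.mean * prev.weight + cur.mean * cur.weight) / w
            prev.weight = w
        else:
            q_threshold = k_inv(k(q) + 1)  # recompute the threshold from the current quantile
            merged.append(cur)
    return merged
```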
Second case: deleting data points from the target sketch

In the second case, step 802 is implemented as follows: obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is the to-be-deleted tag, to obtain the to-be-deleted clusters; and delete the to-be-deleted clusters from the target sketch.

Since the cache holds data points that need to be deleted or newly added, after all the data points in the cache are represented as to-be-updated clusters, the to-be-deleted clusters, that is, the data points that need to be deleted, can be filtered out of the cache according to the cluster tags, and the to-be-deleted clusters are then deleted from the target sketch.

Optionally, the data points to be deleted may contain data points with the same data value, that is, two of the to-be-updated clusters may have the same cluster mean. In this scenario, in order to improve deletion efficiency, the to-be-deleted clusters with the same cluster mean can be merged, and the cluster weight of a merged cluster is the sum of the cluster weights of the clusters before the merge. The target sketch is then updated based on the merged to-be-deleted clusters.
In some embodiments, deleting the to-be-deleted clusters from the target sketch may be implemented as follows: sort the clusters in the target sketch together with the to-be-deleted clusters in ascending order of cluster mean; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each traversed cluster in turn: for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is the to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.

Since deleting a cluster from the target sketch affects the cluster weights of the clusters adjacent to that cluster, the cluster weights of the adjacent clusters also need to be updated when a cluster is deleted.

Updating the cluster weights of the clusters adjacent to the j-th cluster covers the following situations:
Situation 1: If the j-th cluster is the first cluster after sorting, only the cluster weight of the right neighbor of the first cluster needs to be updated.

For example, the cluster weight of the first cluster is subtracted from the cluster weight of the right neighbor of the first cluster, and the resulting value is used as the updated cluster weight of the right neighbor of the first cluster.

It should be noted that if the cluster weight of the right neighbor of the first cluster is smaller than the cluster weight of the first cluster, the right neighbor of the first cluster is deleted, the difference between the cluster weight of the first cluster and that of its right neighbor is determined, and the cluster weight of the next right neighbor (the cluster adjacent to the deleted right neighbor) is updated based on this difference. If this difference is still larger than the cluster weight of the next right neighbor, the cluster weight of the following right neighbor continues to be updated in the same way, until the most recently reached right neighbor has a cluster weight larger than the most recently determined difference. This procedure can be called recursively updating the cluster weights to the right.

In the above scenario, since the first cluster after sorting is deleted, the minimum value of the target sketch (that is, the data value of the smallest data point among the full set of data points used to construct the target sketch) changes, and the minimum value of the target sketch can be updated at this time. For example, the cluster mean of the first cluster in the updated target sketch can be used as the minimum value of the target sketch.
Situation 2: If the j-th cluster is the last cluster after sorting, only the cluster weight of the left neighbor of the last cluster needs to be updated.

For example, the cluster weight of the last cluster is subtracted from the cluster weight of the left neighbor of the last cluster, and the resulting value is used as the updated cluster weight of the left neighbor of the last cluster.

It should be noted that if the cluster weight of the left neighbor of the last cluster is smaller than the cluster weight of the last cluster, the left neighbor of the last cluster is deleted, the difference between the cluster weight of the last cluster and that of its left neighbor is determined, and the cluster weight of the next left neighbor (the cluster adjacent to the deleted left neighbor) is updated based on this difference. If this difference is still larger than the cluster weight of the next left neighbor, the cluster weight of the following left neighbor continues to be updated in the same way, until the most recently reached left neighbor has a cluster weight larger than the most recently determined difference. This procedure can be called recursively updating the cluster weights to the left.

In the above scenario, since the last cluster after sorting is deleted, the maximum value of the target sketch (that is, the data value of the largest data point among the full set of data points used to construct the target sketch) changes, and the maximum value of the target sketch can be updated at this time. For example, the cluster mean of the last cluster in the updated target sketch can be used as the maximum value of the target sketch.
Situation 3: If the j-th cluster is an intermediate cluster after sorting, the cluster weight of the left neighbor and the cluster weight of the right neighbor of the j-th cluster need to be updated.

In situation 3, updating the cluster weights of the clusters adjacent to the j-th cluster may be implemented as follows: obtain the cluster mean of the left neighbor of the j-th cluster and the cluster mean of the right neighbor of the j-th cluster; based on the cluster mean of the left neighbor, the cluster mean of the right neighbor, and the cluster mean and cluster weight of the j-th cluster, determine a deletion weight corresponding to the left neighbor and a deletion weight corresponding to the right neighbor; update the cluster weight of the left neighbor based on the deletion weight corresponding to the left neighbor, and update the cluster weight of the right neighbor based on the deletion weight corresponding to the right neighbor.
For example, the deletion weight corresponding to the left neighbor and the deletion weight corresponding to the right neighbor are determined from the cluster mean of the left neighbor, the cluster mean of the right neighbor, and the cluster mean and cluster weight of the j-th cluster, where d_l denotes the deletion weight corresponding to the left neighbor, d_r denotes the deletion weight corresponding to the right neighbor, w_c denotes the cluster weight of the j-th cluster, v_c denotes the cluster mean of the j-th cluster, v_l denotes the cluster mean of the left neighbor, and v_r denotes the cluster mean of the right neighbor.
In addition, updating the cluster weight of the left neighbor based on the deletion weight corresponding to the left neighbor may, for example, be: subtract the deletion weight corresponding to the left neighbor from the cluster weight of the left neighbor, and use the resulting value as the updated cluster weight of the left neighbor. Updating the cluster weight of the right neighbor based on the deletion weight corresponding to the right neighbor may, for example, be: subtract the deletion weight corresponding to the right neighbor from the cluster weight of the right neighbor, and use the resulting value as the updated cluster weight of the right neighbor.

Furthermore, optionally, updating the cluster weight of the left neighbor may likewise refer to the aforementioned leftward recursive update of cluster weights, and updating the cluster weight of the right neighbor may likewise refer to the aforementioned rightward recursive update of cluster weights. Details are not repeated here.
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application. As shown in Figure 10, the data points to be deleted in the buffer are first collected, and each data point is represented by the aforementioned triplet, that is, each to-be-deleted data point is represented as a cluster, so as to construct the to-be-deleted clusters. The to-be-deleted clusters and the clusters in the target sketch are sorted in ascending order of cluster mean.

The first cluster after sorting is set as the current cluster, and the traversal proceeds backward from the current cluster. If the current cluster has f = 1, the backward traversal continues. If the current cluster has f = -1, the current cluster is a to-be-deleted cluster and needs to be deleted. The deletion rules are as follows:

1) If the current cluster is the first cluster, delete data from the right neighbor, that is, modify the cluster weight of the right neighbor. If the cluster weight of the right neighbor is not sufficient to absorb the cluster weight of the current cluster, the deletion continues by recursively updating the cluster weights to the right as described above. If the deletion of the first cluster of the target sketch affects the minimum value of the target sketch, the minimum value of the target sketch is updated according to the updated first cluster of the target sketch. After the cluster weight of the right neighbor has been updated based on the cluster weight of the to-be-deleted cluster, the to-be-deleted cluster is deleted, the first cluster is marked as the current cluster, and the backward traversal continues.

2) If the current cluster is the last cluster, delete data from the left neighbor, that is, modify the cluster weight of the left neighbor. If the cluster weight of the left neighbor is not sufficient to absorb the cluster weight of the current cluster, the deletion continues by recursively updating the cluster weights to the left as described above. If the deletion of the last cluster of the target sketch affects the maximum value of the target sketch, the maximum value of the target sketch can be updated according to the updated last cluster of the target sketch. After the cluster weight of the left neighbor has been updated based on the cluster weight of the to-be-deleted cluster, the to-be-deleted cluster is deleted and the deletion operation ends.

3) If the current cluster is in an intermediate position, determine the deletion weight of the left neighbor and the deletion weight of the right neighbor of the current cluster, and then delete recursively to the left and to the right, that is, update the cluster weight of the left neighbor based on the deletion weight of the left neighbor and update the cluster weight of the right neighbor based on the deletion weight of the right neighbor. After the cluster weights of the left and right neighbors have been updated based on the cluster weight of the to-be-deleted cluster, the left neighbor of the to-be-deleted cluster is marked as the current cluster, the to-be-deleted cluster is deleted, and the backward traversal continues.
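A compact sketch of the deletion flow in Figure 10 is given below, reusing the Cluster triplet and the DELETE tag sketched earlier. The proportional split of the deleted weight between the two neighbors of an intermediate cluster, and the carry-over when a neighbor's weight is insufficient, are assumptions chosen for illustration, since the corresponding formulas are not reproduced in the text above.

```python
def delete_from_sketch(clusters):
    """clusters: Cluster list sorted by mean ascending, mixing sketch clusters
    (tag MERGE) and buffered to-be-deleted clusters (tag DELETE)."""

    def carry_delete(idx, amount, step):
        # Subtract `amount` from the cluster at idx; if its weight is insufficient,
        # delete that cluster and carry the remainder onward in direction `step`
        # (the rightward / leftward recursive weight update described above).
        while amount > 0 and 0 <= idx < len(clusters):
            if clusters[idx].weight > amount:
                clusters[idx].weight -= amount
                return
            amount -= clusters[idx].weight
            del clusters[idx]
            if step < 0:
                idx -= 1

    i = 0
    while i < len(clusters):
        cur = clusters[i]
        if cur.tag != DELETE:
            i += 1
            continue
        del clusters[i]                      # remove the to-be-deleted cluster itself
        if i == 0:                           # it was the first cluster: push rightwards
            carry_delete(i, cur.weight, step=1)
        elif i == len(clusters):             # it was the last cluster: push leftwards
            carry_delete(i - 1, cur.weight, step=-1)
        else:                                # intermediate cluster: split between the neighbours
            left, right = clusters[i - 1], clusters[i]
            span = right.mean - left.mean
            ratio = 0.5 if span == 0 else (cur.mean - left.mean) / span   # assumed split
            carry_delete(i, cur.weight * ratio, step=1)                   # d_r to the right
            carry_delete(i - 1, cur.weight * (1 - ratio), step=-1)        # d_l to the left
        i = max(i - 1, 0)                    # resume from the previous cluster, as in Figure 10
    return clusters
```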
Based on the embodiment shown in Figure 8, the to-be-updated data points in the cache can be represented as to-be-updated clusters in the form of triplets. Since the cluster tag in a to-be-updated cluster indicates whether the cluster is a cluster to be deleted or a cluster to be merged, the data points held in the cache that are to be inserted can be inserted into the target sketch, or the data points held in the cache that are to be deleted can be deleted from the target sketch, based on the cluster tags.

In the foregoing embodiments, when a target data point needs to be queried, the target sketch is constructed on the fly in the manner shown in Figure 1. However, for the data points in a time series database, the number of data points is very large; in this case, computing resources are easily wasted if the target sketch is constructed on the fly every time data points need to be queried. Based on this, the embodiments of the present application provide an incremental update method. With the incremental update method, when data points are queried, a sketch is constructed based only on the newly added data points, and the constructed sketch is then aggregated with the sketches already in the cache to obtain the target sketch, thereby avoiding the waste of computing resources.
For ease of understanding, the characteristics of a time series database are explained first. The data points stored in a time series database have corresponding timestamps, and the timestamp of each data point can represent the collection time of that data point, so the data points stored in the time series database have time series characteristics. In addition, the data points stored in a time series database usually include data points for different metrics, such as data points collected for temperature and data points collected for humidity. To distinguish the data points of different metrics, the data points of each metric are called the data points of one timeline. Based on this, the data points in the time series database include data points corresponding to multiple timelines, and each timeline represents one metric.

In addition, in order to implement the incremental update method provided by the embodiments of the present application, the embodiments of the present application further provide an incremental update system. To facilitate subsequent understanding, the incremental update system provided by the embodiments of the present application is explained here first.
Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application. As shown in Figure 11, the incremental update system includes the following components.

1) Single-timeline component (seriesCusor), also called the single-timeline data-reading executor, which is responsible for reading, in response to a query statement, the original data points of a timeline within a specified time range.

2) Single-timeline aggregation component (aggregateCursor), also called the single-timeline aggregation executor, which is responsible for computing the data points of the timeline according to a specific aggregation method and outputting the aggregation result, for example constructing the data points of the timeline into a sketch. The sketch insertion and deletion operations in the foregoing embodiments can both be implemented by this component.

3) Single-timeline sketch cache component (SketchCacheCursor), also called the single-timeline sketch cache executor, which is responsible for caching sketches that have already been constructed. As shown in Figure 11, the incremental update system further includes a data cache (CacheData) and a metadata cache (CacheMeta), which are used to store the constructed sketches and the metadata of the sketches respectively. The metadata of a sketch is used to index the sketch; its specific function is described in detail in subsequent embodiments and is not expanded here.

4) Multi-timeline sorting component (tagSetCursor), also called the multi-timeline sorting and merging executor, which is responsible for sorting the sketches aggregated from the data points of multiple timelines according to the space and time dimensions, ensuring the orderliness of the cached sketches.
5) Multi-timeline inter-group component (groupCursor), also called the multi-timeline inter-group executor, which is responsible for aggregating the results output by multiple multi-timeline sorting components, so as to implement serial scheduling between different multi-timeline sorting components.

6) Logical concurrency component (ChunkReader), also called the logical concurrency executor, which serves as the parallel scheduling unit of the smallest granularity and is responsible for data structure conversion and metadata assembly. Data structure conversion refers to converting the storage-layer data structure into the query data structure in order to output the query result; metadata assembly is used to generate the metadata of the sketches.

7) Aggregation transform component (AggregateTransform), also called the multi-timeline aggregation executor, which is responsible for further aggregating the output results of the multi-timeline inter-group components, for example merging sketches.

In addition, as shown in Figure 11, based on the responsibilities of the components in the incremental update system, the following three functions can be implemented: 1. sketch construction, 2. sketch caching, and 3. sketch aggregation.
Based on the incremental update system shown in Figure 11, the incremental update method provided by the embodiments of the present application is explained in detail below, taking the construction of the target sketch in step 102 of the embodiment shown in Figure 1 as an example. Figure 12 is a flowchart of an incremental update method provided by an embodiment of the present application. As shown in Figure 12, the method includes the following steps 1201 to 1203.

Step 1201: Obtain a sketch that has already been cached and was constructed based on some of the multiple data points and the target scale function, to obtain a first sketch.

Step 1202: Construct a sketch based on the target scale function and the data points among the multiple data points other than the aforementioned part, to obtain a second sketch.

Step 1203: Aggregate the first sketch and the second sketch to obtain the target sketch.

In the embodiments of the present application, when a target data point needs to be queried, if some sketches have already been constructed in advance based on some of the data points and the target scale function, a sketch can now be constructed based only on the remaining data points, and the currently constructed sketch is merged with the previously constructed sketches to obtain the target sketch. In this way, there is no need to construct the target sketch based on the full set of data points for every query, which saves computing resources.
In some embodiments, obtaining the cached sketch constructed based on some of the multiple data points and the target scale function, to obtain the first sketch, may be implemented as follows: obtain a target time window to be queried, where the target data point is a data point whose timestamp falls within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches constructed based on the target scale function, and the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window being the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier being the identifier of the timeline to which the data points used to construct the corresponding sketch belong; based on the target time window and the timeline to which the target data point belongs, determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.

The target time window to be queried may be the time window carried in a query statement input by the user. For example, if the user inputs the query statement "query the highest temperature in the last quarter", the target time window is "the last quarter".

In addition, the metadata set may be maintained by the metadata cache (CacheMeta) shown in Figure 11. For example, the metadata set stores the metadata of the cached sketches in the form of a list. In this case, determining the first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs may be implemented as follows: traverse each piece of metadata in the metadata set; if the sketch timeline identifier of a piece of metadata is the same as the identifier of the timeline to which the target data point belongs, and the sketch time window of that metadata is part or all of the target time window, determine that metadata as the first metadata.
Optionally, in order to improve metadata query efficiency, the metadata in the metadata set can be managed according to the space and time dimensions. Figure 13 is a schematic diagram of managing metadata in the space and time dimensions provided by an embodiment of the present application. As shown in Figure 13, each SID represents one timeline, each SID corresponds to multiple time windows (windows), and a corresponding sketch is cached for each time window.

In this scenario, the metadata of the metadata set can be stored in key-value form. The key is a data shard identifier (SharId), where each SharId represents a time range (timerange). The value corresponding to each SharId thus includes multiple pieces of metadata, the sketch time window in each piece of metadata falls within that time range, and the sketch timeline identifiers in these pieces of metadata can be the identifiers of different timelines.

For example, in Figure 13, the value corresponding to SharId1 includes the metadata corresponding to SID1; this metadata can be collectively labeled SID1+timerange11, indicating that the timeline identifier in this metadata is SID1 and that the time windows in this metadata all fall within the time range timerange11 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID2; this metadata can be collectively labeled SID2+timerange12, indicating that the timeline identifier in this metadata is SID2 and that the time windows in this metadata all fall within the time range timerange12 corresponding to SharId1. The value corresponding to SharId1 further includes the metadata corresponding to SID3; this metadata can be collectively labeled SID3+timerange13, indicating that the timeline identifier in this metadata is SID3 and that the time windows in this metadata all fall within the time range timerange13 corresponding to SharId1.

For the values corresponding to the other data shard identifiers SharId2 and SharId3 in Figure 13, reference can likewise be made to the above explanation.

At this time, determining the first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs may be implemented as follows: determine the SharId that matches the target time window, where the time range represented by the matching SharId falls within the target time window; then query, from the value corresponding to the matching SharId, the metadata whose sketch timeline identifier is the target timeline identifier; the metadata obtained is the first metadata.
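A minimal illustration of this key-value lookup might look as follows; the dictionary layout, the class and field names, and the containment check are assumptions for illustration and do not reproduce the CacheMeta implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SketchMeta:
    sid: str          # sketch timeline identifier
    window: tuple     # sketch time window as (start, end) timestamps
    sketch_key: str   # key used to fetch the cached sketch from CacheData

def find_first_metadata(meta_set: Dict[str, List[SketchMeta]],
                        shard_ranges: Dict[str, tuple],
                        target_sid: str,
                        target_window: tuple) -> List[SketchMeta]:
    """Return the cached-sketch metadata whose timeline matches target_sid and whose
    sketch time window lies inside the target time window."""
    t_start, t_end = target_window
    hits = []
    for shard_id, (r_start, r_end) in shard_ranges.items():
        if r_end < t_start or r_start > t_end:
            continue  # this shard's time range does not intersect the target window
        for meta in meta_set.get(shard_id, []):
            w_start, w_end = meta.window
            if meta.sid == target_sid and w_start >= t_start and w_end <= t_end:
                hits.append(meta)  # sketch window is part (or all) of the target window
    return hits
```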
The metadata lookup process described above can be implemented by the multi-timeline inter-group component in Figure 11.

Correspondingly, constructing a sketch based on the target scale function and the data points among the multiple data points other than the part already covered, to obtain the second sketch, is implemented as follows: obtain the data points corresponding to a second time window among the multiple data points, where the second time window is the portion of the target time window other than a first time window, and the first time window is the portion of the target time window that overlaps with the sketch time window in the first metadata; and construct the second sketch based on the target scale function and the data points corresponding to the second time window.

That is, for the data points for which no sketch has yet been constructed, a sketch is constructed on the fly, and this newly constructed sketch is used as the second sketch so that it can subsequently be merged with the cached first sketch. Constructing the sketch on the fly can be implemented by the single-timeline component and the single-timeline aggregation component in Figure 11.

In addition, after the second sketch is constructed based on the remaining data points and the target scale function, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set, so as to update the metadata set. This process can be implemented by the single-timeline sketch cache component in Figure 11.

Moreover, as can be seen from the embodiment shown in Figure 1, the same batch of data points yields different sketches when constructed with different scale functions. Therefore, in the embodiments of the present application, a metadata set corresponds to one scale function: for sketches constructed with different scale functions, different metadata sets can be maintained, and each metadata set maintains only the metadata of the sketches constructed based on the corresponding scale function.
In addition, in the embodiments of the present application, when new data points are written over the time range corresponding to an already cached sketch, the cached sketch needs to be invalidated, so as to avoid inconsistency between query results and the actual data.

Based on this, in some embodiments, for a data point to be written, the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.

Here, the timestamp of the data point to be written and the identifier of its timeline matching the third metadata in the metadata set means that the timestamp of the data point to be written falls within the sketch time window of the third metadata, and that the identifier of the timeline to which the data point to be written belongs is the same as the sketch timeline identifier of the third metadata. This process can be implemented by the single-timeline sketch cache component in Figure 11.
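A minimal illustration of this write-time invalidation, reusing the hypothetical SketchMeta layout from the lookup sketch above, might be:

```python
def invalidate_on_write(meta_set, cache_data, write_sid, write_ts):
    """Drop any cached sketch whose timeline matches the written point and whose
    sketch time window covers the written timestamp, then prune its metadata."""
    for shard_id, metas in meta_set.items():
        kept = []
        for meta in metas:
            w_start, w_end = meta.window
            if meta.sid == write_sid and w_start <= write_ts <= w_end:
                cache_data.pop(meta.sketch_key, None)  # invalidate the cached sketch
            else:
                kept.append(meta)
        meta_set[shard_id] = kept
```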
In addition, in the embodiments of the present application, more and more sketches are cached as time passes. To prevent excessive sketches from wasting the cache, sketches can also be evicted.

The sketch eviction method provided by the embodiments of the present application can evict sketches in two respects. The first is to evict some of the multiple sketches belonging to the same timeline, so as to evict sketches in the time dimension. The second is to evict the sketches of a particular timeline among different timelines, so as to evict sketches in the space dimension.
在一些实施例中,对于元数据集中的任一草图时间线标识,元数据集还包括与该草图时间线标识对应的第一使用信息,第一使用信息用于记录与该草图时间线标识匹配的多个草图中每个草图的使用时间。这种场景下,基于时间维度淘汰草图的实现方式可以为:基于第一使用信息确定与该草图时间线标识匹配的多个草图中待淘汰的草图,并删除待淘汰的草图。 In some embodiments, for any sketch timeline identification in the metadata set, the metadata set also includes first usage information corresponding to the sketch timeline identification, and the first usage information is used to record matches with the sketch timeline identification. The time spent on each of the multiple sketches. In this scenario, the elimination of sketches based on the time dimension can be implemented by: determining the sketches to be eliminated among the multiple sketches that match the sketch timeline identifier based on the first usage information, and deleting the sketches to be eliminated.
示例地,可以通过最近最少使用(least recently used,LRU)淘汰机制进行淘汰。也即,将与该草图时间线标识匹配的多个草图中近期使用频率较低的草图删除,以节省缓存。For example, elimination can be carried out through the least recently used (LRU) elimination mechanism. That is, the less frequently used sketches among the multiple sketches matching the sketch timeline ID will be deleted to save cache.
另外,在一些实施例中,元数据集还包括第二使用信息,第二使用信息用于记录多个草图时间线标识中每个草图时间线标识对应的使用信息,每个草图时间线标识对应的使用信息指示与相应草图时间线标识匹配的草图的使用时间。这种场景下,基于空间维度淘汰草图的实现方式可以为:基于第二使用信息确定多个草图时间线标识中待淘汰的草图时间线标识;将与待淘汰的草图时间线标识匹配的草图删除。In addition, in some embodiments, the metadata set further includes second usage information. The second usage information is used to record the usage information corresponding to each sketch timeline identification among the plurality of sketch timeline identifications. Each sketch timeline identification corresponds to The usage information indicates when the sketch that matches the corresponding sketch timeline ID was used. In this scenario, the implementation method of eliminating sketches based on the spatial dimension can be: determining the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers based on the second usage information; and deleting the sketch that matches the sketch timeline identifier to be eliminated. .
示例地,同样可以通过LRU淘汰机制进行淘汰。也即,将各个草图时间线标识中近期使用频率较低的草图时间线标识对应的草图删除,以节省缓存。For example, elimination can also be performed through the LRU elimination mechanism. That is, among the various sketch timeline identifiers, the sketches corresponding to the sketch timeline identifiers that have been used less frequently recently are deleted to save cache.
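The two eviction directions can be illustrated with a small sketch. The usage-information layout (plain dictionaries keyed by sketch or timeline with a last-used timestamp) is an assumption made for demonstration; any LRU bookkeeping would serve the same purpose.

```python
# Illustrative sketch of the two LRU-style eviction directions described above:
# per-timeline (time dimension) and across timelines (space dimension).
def evict_within_timeline(sketch_keys_of_timeline, last_used, keep_at_most):
    """Time dimension: keep only the most recently used sketches of one timeline.

    Returns the set of sketch keys that should be deleted.
    """
    ordered = sorted(sketch_keys_of_timeline, key=lambda k: last_used[k], reverse=True)
    return set(ordered[keep_at_most:])

def evict_timeline(last_used_per_timeline):
    """Space dimension: pick the least recently used timeline; all of its
    cached sketches would then be deleted."""
    return min(last_used_per_timeline, key=last_used_per_timeline.get)
```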
In summary, the embodiments of this application provide an incremental update system and an incremental update method, so that the target sketch does not need to be rebuilt from the full set of data points every time a data point is queried, which saves computing resources.
The apparatus and devices involved in the embodiments of this application are explained below.
An embodiment of this application further provides a data point query apparatus. As shown in Figure 14, the apparatus 1400 includes the following modules.
A first determination module 1401, configured to determine a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, where the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size. For the specific implementation, refer to step 101 in the embodiment of Figure 1.
A construction module 1402, configured to construct a target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. For the specific implementation, refer to step 102 in the embodiment of Figure 1.
A query module 1403, configured to query the target data point based on the target sketch. For the specific implementation, refer to step 103 in the embodiment of Figure 1.
Optionally, the multiple scale functions include a first scale function and a second scale function; the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval.
The first determination module 1401 is configured to:
if the target quantile lies in the first quantile interval, determine the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determine the second scale function as the target scale function.
Optionally, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval includes an interval from x1 to x2.
Optionally, the query module 1403 is configured to:
query the data value of the target data point based on the target sketch and the target quantile.
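As one way to picture this query, the following hedged sketch reads a data value off a target sketch whose clusters are (mean, weight) pairs sorted by mean. The linear interpolation between neighbouring cluster centroids is a common choice for such sketches and is an assumption here, not a rule prescribed by the embodiments.

```python
# Sketch of answering "which data value sits at quantile q" from a list of
# clusters given as (mean, weight) pairs sorted by mean; weights are assumed
# to be positive.
def value_at_quantile(clusters, q):
    total = sum(w for _, w in clusters)
    target = q * total                          # target rank within the sketch
    seen = 0.0                                  # cumulative weight before cluster i
    for i, (mean, weight) in enumerate(clusters):
        center = seen + weight / 2.0            # rank of this cluster's centroid
        if target <= center:
            if i == 0:
                return mean
            prev_mean, prev_weight = clusters[i - 1]
            prev_center = seen - prev_weight / 2.0
            frac = (target - prev_center) / (center - prev_center)
            return prev_mean + frac * (mean - prev_mean)
        seen += weight
    return clusters[-1][0]
```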
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive a data point query request, where the data point query request is used to query the data value of a target data point among multiple data points and carries the standard quantile of the target data point;
the first determination module is further configured to determine the standard quantile carried in the data point query request as the target quantile.
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive an equal-height histogram query request, where the equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points and carries a bucket count h, h being an integer greater than 1;
the first determination module is further configured to determine, based on the bucket count h and the total number of the multiple data points, the quantiles of the first to the (h-1)-th buckets from left to right in the equal-height histogram, to obtain h-1 quantiles;
the query module is further configured to take each of the h-1 quantiles in turn as the target quantile and perform the operation of determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values corresponding one-to-one to the h-1 quantiles.
The apparatus 1400 further includes a drawing module, configured to draw the equal-height histogram based on the h-1 data values and the data values of the maximum data point and the minimum data point among the multiple data points.
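A short sketch of how the h-1 quantile queries translate into equal-height bucket boundaries is given below. value_at_quantile is the hypothetical helper sketched earlier, and h, v_min, v_max stand for the requested bucket count and the minimum and maximum data values; these names are assumptions for illustration.

```python
# Sketch: derive the boundaries of an equal-height histogram from h-1
# quantile queries against the sketch.
def equal_height_boundaries(clusters, h, v_min, v_max):
    """Return the h+1 bucket boundaries of an equal-height histogram."""
    inner = [value_at_quantile(clusters, k / h) for k in range(1, h)]  # h-1 values
    return [v_min] + inner + [v_max]
```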
Optionally, the first determination module is further configured to:
determine an estimated quantile of the target data point based on the data value of the target data point and the data values of the maximum data point and the minimum data point among the multiple data points, and use the estimated quantile as the target quantile;
the query module is configured to:
query the standard quantile of the target data point based on the target sketch and the data value of the target data point.
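For this opposite query direction, the following sketch first forms the estimated quantile from the data value and the minimum and maximum data values (a simple linear estimate is assumed), and then reads a standard quantile from the sketch by accumulating cluster weights. Both routines are illustrative approximations rather than the exact formulas of the embodiments.

```python
# Sketch of the value-to-quantile direction.
def estimated_quantile(value, v_min, v_max):
    """Rough quantile estimate used only to pick a scale function."""
    if v_max == v_min:
        return 0.5
    return (value - v_min) / (v_max - v_min)

def quantile_of_value(clusters, value):
    """clusters: list of (mean, weight) sorted by mean; returns a rank in [0, 1]."""
    total = sum(w for _, w in clusters)
    below = 0.0
    for mean, weight in clusters:
        if mean <= value:
            below += weight     # count the weight of clusters centred at or below value
        else:
            break
    return below / total
```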
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive a quantile query request, where the quantile query request is used to query the standard quantile of a target data point among multiple data points and carries the data value of the target data point.
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive an equal-width histogram query request, where the equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points and carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the minimum data point and the data value of the maximum data point among the multiple data points into n+1 intervals;
the query module is configured to take each of the n boundary values in turn as the data value of the target data point and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the maximum data point and the minimum data point among the multiple data points, to obtain n standard quantiles corresponding one-to-one to the n boundary values.
The apparatus further includes a drawing module, configured to draw the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
Optionally, the apparatus 1400 further includes:
a generation module, configured to generate a to-be-updated cluster corresponding to a to-be-updated data point in the cache, where the to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag, the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight of the to-be-updated cluster indicates the number of to-be-updated data points, and the cluster tag of the to-be-updated cluster indicates the update type of the to-be-updated data point;
an update module, configured to update the target sketch based on the to-be-updated cluster.
Optionally, the update module is configured to:
obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-merged tag, to obtain to-be-merged clusters;
merge the to-be-merged clusters into the target sketch.
Optionally, the update module is configured to:
sort the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean;
for the first cluster after sorting, determine a quantile threshold based on the target scale function, traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn (an illustrative code sketch of this merge pass follows the list):
for the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1;
if the current quantile of the i-th cluster is below the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster;
if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
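A runnable sketch of this merge pass is given below. The helper scale_limit(q) stands in for the quantile threshold derived from the target scale function; its exact form depends on the scale function chosen and is treated as an input here rather than implemented.

```python
# Sketch of the merge pass: sort all clusters by mean, then fold each cluster
# into its predecessor while the cumulative quantile stays under the threshold
# given by the scale function.
def merge_clusters(sketch_clusters, pending_clusters, scale_limit):
    """Each cluster is a (mean, weight) pair; returns the merged, sorted list."""
    clusters = sorted(
        [list(c) for c in list(sketch_clusters) + list(pending_clusters)],
        key=lambda c: c[0],
    )
    total = sum(w for _, w in clusters)
    merged = [clusters[0]]
    seen = clusters[0][1]                       # cumulative weight so far
    limit = scale_limit(seen / total)           # threshold from the first cluster
    for mean, weight in clusters[1:]:
        seen += weight
        q = seen / total                        # current quantile of this cluster
        if q <= limit:
            # Fold this cluster into the previous one (weighted mean update).
            prev = merged[-1]
            new_w = prev[1] + weight
            prev[0] = (prev[0] * prev[1] + mean * weight) / new_w
            prev[1] = new_w
        else:
            merged.append([mean, weight])
            limit = scale_limit(q)              # advance the threshold
    return merged
```

As a usage illustration, passing scale_limit = lambda q: min(1.0, q + 0.05) would cap every merged cluster at roughly five percent of the total weight; a real scale function would instead allocate smaller clusters near the quantile region it is designed to resolve finely.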
Optionally, the update module is configured to:
obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-deleted tag, to obtain to-be-deleted clusters;
delete the to-be-deleted clusters from the target sketch.
Optionally, the update module is configured to:
sort the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean;
traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster in turn:
for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is a to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
Optionally, the update module is configured to:
if the j-th cluster is an intermediate cluster after sorting, obtain the cluster mean of the left neighbouring cluster of the j-th cluster and the cluster mean of the right neighbouring cluster of the j-th cluster;
determine, based on the cluster mean of the left neighbouring cluster, the cluster mean of the right neighbouring cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left neighbouring cluster and a deletion weight corresponding to the right neighbouring cluster;
update the cluster weight of the left neighbouring cluster based on the deletion weight corresponding to the left neighbouring cluster, and update the cluster weight of the right neighbouring cluster based on the deletion weight corresponding to the right neighbouring cluster (an illustrative sketch of this delete pass follows).
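The following sketch shows one possible implementation of the delete pass. How the deleted weight is split between the two neighbours is an assumption (linear in the distance between cluster means); the embodiments only state that both neighbouring cluster weights are updated based on deletion weights derived from the means of the neighbours and the mean and weight of the deleted cluster.

```python
# Sketch of the delete pass over the sorted cluster list. Clusters are
# [mean, weight, tag] with tag either "keep" (already in the sketch) or
# "delete" (a to-be-deleted cluster generated from the cache).
def delete_clusters(sketch_clusters, pending_deletes):
    clusters = sorted(
        [[m, w, "keep"] for m, w in sketch_clusters]
        + [[m, w, "delete"] for m, w in pending_deletes],
        key=lambda c: c[0],
    )
    for j, (mean, weight, tag) in enumerate(clusters):
        if tag != "delete":
            continue
        if 0 < j < len(clusters) - 1:
            # Intermediate cluster: split the deletion weight between the
            # two neighbours according to how close each neighbour's mean is.
            left, right = clusters[j - 1], clusters[j + 1]
            span = (right[0] - left[0]) or 1.0
            left[1] = max(0.0, left[1] - weight * (right[0] - mean) / span)
            right[1] = max(0.0, right[1] - weight * (mean - left[0]) / span)
        elif j > 0:                              # rightmost: charge the left neighbour
            clusters[j - 1][1] = max(0.0, clusters[j - 1][1] - weight)
        elif len(clusters) > 1:                  # leftmost: charge the right neighbour
            clusters[j + 1][1] = max(0.0, clusters[j + 1][1] - weight)
    return [(m, w) for m, w, tag in clusters if tag == "keep" and w > 0]
```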
Optionally, the construction module is configured to:
obtain a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch;
construct a sketch based on the data points of the multiple data points other than that part of the data points and the target scale function, to obtain a second sketch;
aggregate the first sketch and the second sketch to obtain the target sketch (a sketch of this incremental construction follows).
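A minimal sketch of this incremental construction is shown below. It reuses the merge_clusters routine sketched earlier and treats every uncached data point as a singleton cluster (mean equal to its value, weight 1); the names are illustrative assumptions.

```python
# Sketch: reuse a cached partial sketch and only summarise the data points it
# does not cover, then aggregate the two with the same merge routine.
def build_target_sketch(cached_clusters, uncovered_points, scale_limit):
    second = [[float(v), 1.0] for v in sorted(uncovered_points)]   # second sketch
    return merge_clusters(cached_clusters, second, scale_limit)    # aggregated target sketch
```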
Optionally, the construction module is configured to:
obtain a target time window to be queried, where the target data points are data points whose timestamps lie within the target time window;
obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches constructed based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
determine first metadata from the metadata set based on the target time window and the timeline to which the target data points belong, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data points belong;
determine the sketch corresponding to the first metadata as the first sketch.
Optionally, the apparatus 1400 further includes:
a second determination module, configured to determine the metadata of the second sketch, to obtain second metadata;
a caching module, configured to cache the second sketch and add the second metadata to the metadata set.
Optionally, the apparatus 1400 further includes:
a third determination module, configured to determine the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs;
a first deletion module, configured to delete the sketch corresponding to third metadata in the metadata set and update the metadata set if the timestamp and the timeline identifier of the data point to be written match the third metadata.
Optionally, the metadata set further includes first usage information corresponding to any sketch timeline identifier, where the first usage information records the usage time of each of the multiple sketches matching that sketch timeline identifier; the apparatus 1400 further includes:
a second deletion module, configured to determine, based on the first usage information, the sketch to be evicted among the multiple sketches matching that sketch timeline identifier, and delete the sketch to be evicted.
Optionally, the metadata set further includes second usage information, where the second usage information records the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching that sketch timeline identifier; the apparatus 1400 further includes:
a third deletion module, configured to determine, based on the second usage information, the sketch timeline identifier to be evicted among the multiple sketch timeline identifiers, and delete the sketches matching the sketch timeline identifier to be evicted.
The first determination module 1401, the construction module 1402, the query module 1403, and the other modules may all be implemented in software or in hardware. By way of example, the implementation of the first determination module 1401 is described below. Similarly, the implementations of the construction module 1402, the query module 1403, and the other modules may refer to the implementation of the first determination module 1401.
As an example of a module as a software functional unit, the first determination module 1401 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, there may be one or more computing instances. For example, the first determination module 1401 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. A region usually includes multiple AZs.
Likewise, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same virtual private cloud (VPC) or across multiple VPCs. A VPC is usually set up within one region; for cross-region communication between two VPCs in the same region or between VPCs in different regions, a communication gateway needs to be configured in each VPC, and the interconnection between the VPCs is implemented through the communication gateways.
As an example of a module as a hardware functional unit, the first determination module 1401 may include at least one computing device, such as a server. Alternatively, the first determination module 1401 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the first determination module 1401 may be distributed in the same region or in different regions, and may be distributed in the same AZ or in different AZs. Likewise, the multiple computing devices included in the first determination module 1401 may be distributed in the same VPC or across multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that in other embodiments the first determination module 1401 may be used to perform any step of the data point query method, the construction module 1402 may be used to perform any step of the data point query method, and the query module 1403 may be used to perform any step of the data point query method. The steps that the first determination module 1401, the construction module 1402, and the query module 1403 are responsible for implementing can be specified as needed, and the first determination module 1401, the construction module 1402, and the query module 1403 each implement different steps of the data point query method so as to realize all the functions of the data point query apparatus.
An embodiment of this application further provides a computing device. As shown in Figure 15, the computing device 1500 includes a bus 1502, a processor 1504, a memory 1506, and a communication interface 1508. The processor 1504, the memory 1506, and the communication interface 1508 communicate over the bus 1502. The computing device 1500 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1500.
The bus 1502 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in Figure 15, but this does not mean that there is only one bus or only one type of bus. The bus 1502 may include a path for transferring information between the components of the computing device 1500 (for example, the memory 1506, the processor 1504, and the communication interface 1508).
The processor 1504 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1506 may include volatile memory, such as random access memory (RAM). The memory 1506 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1506 stores executable program code, and the processor 1504 executes the executable program code to implement the functions of the aforementioned first determination module, construction module, and query module, respectively, thereby implementing the data point query method provided by the embodiments of this application. That is, the memory 1506 stores instructions for performing the data point query method provided by the embodiments of this application.
The communication interface 1508 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 1500 and other devices or communication networks.
An embodiment of this application further provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device may be a server, for example a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
As shown in Figure 16, the computing device cluster includes at least one computing device 1500. The memories 1506 of one or more computing devices 1500 in the computing device cluster may store the same instructions for performing the data point query method provided by the embodiments of this application.
In some possible implementations, the memories 1506 of one or more computing devices 1500 in the computing device cluster may each store part of the instructions for performing the data point query method provided by the embodiments of this application. In other words, a combination of one or more computing devices 1500 may jointly execute the instructions for performing the data point query method provided by the embodiments of this application.
It should be noted that the memories 1506 of different computing devices 1500 in the computing device cluster may store different instructions, each used to perform part of the functions of the data point query apparatus. That is, the instructions stored in the memories 1506 of different computing devices 1500 may implement the functions of one or more of the first determination module, the construction module, and the query module.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. Figure 17 shows one possible implementation. As shown in Figure 17, two computing devices 1500A and 1500B are connected through a network; specifically, each computing device connects to the network through its communication interface. In this type of possible implementation, the memory 1506 of the computing device 1500A stores instructions for performing the functions of the first determination module and the construction module, and the memory 1506 of the computing device 1500B stores instructions for performing the functions of the query module.
The connection manner between the computing devices of the cluster shown in Figure 17 takes into account that the data point query method provided by the embodiments of this application requires a large amount of computation on data; it is therefore considered appropriate to have the functions implemented by the first determination module and the construction module performed by the computing device 1500A.
It should be understood that the functions of the computing device 1500A shown in Figure 17 may also be performed by multiple computing devices 1500. Likewise, the functions of the computing device 1500B may also be performed by multiple computing devices 1500.
An embodiment of this application further provides another computing device cluster. The connection relationship between the computing devices in this cluster may be similar to the connection manner of the computing device clusters described with reference to Figures 16 and 17. The difference is that the memories 1506 of one or more computing devices 1500 in this computing device cluster may store the same instructions for performing the data point query method provided by the embodiments of this application.
In some possible implementations, the memories 1506 of one or more computing devices 1500 in this computing device cluster may each store part of the instructions for performing the data point query method provided by the embodiments of this application. In other words, a combination of one or more computing devices 1500 may jointly execute the instructions for performing the data point query method provided by the embodiments of this application.
An embodiment of this application further provides a computer program product containing instructions. The computer program product may be software or a program product containing instructions that can run on a computing device or be stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to perform the data point query method provided by the embodiments of this application.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store, or a data storage device such as a data center containing one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform the data point query method provided by the embodiments of this application.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of protection of the technical solutions of the embodiments of this application.

Claims (45)

1. A data point query method, wherein the method comprises:
determining a target scale function from multiple scale functions based on a target quantile corresponding to a target data point to be queried, wherein the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size;
constructing a target sketch based on the target scale function and the multiple data points, wherein the target sketch comprises multiple clusters, each cluster comprises a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster;
querying the target data point based on the target sketch.
2. The method according to claim 1, wherein the multiple scale functions comprise a first scale function and a second scale function, the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval;
the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point comprises:
if the target quantile lies in the first quantile interval, determining the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determining the second scale function as the target scale function.
3. The method according to claim 2, wherein the first quantile interval comprises an interval from 0 to x1 and an interval from x2 to 1, x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval comprises an interval from x1 to x2.
4. The method according to any one of claims 1-3, wherein the querying the target data point based on the target sketch comprises:
querying the data value of the target data point based on the target sketch and the target quantile.
5. The method according to claim 4, wherein before the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, the method further comprises:
receiving a data point query request, wherein the data point query request is used to query the data value of a target data point among multiple data points and carries a standard quantile of the target data point;
determining the standard quantile carried in the data point query request as the target quantile.
6. The method according to claim 4, wherein the method further comprises:
receiving an equal-height histogram query request, wherein the equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points and carries a bucket count h, h being an integer greater than 1;
determining, based on the bucket count h and the total number of the multiple data points, the quantiles of the first to the (h-1)-th buckets from left to right in the equal-height histogram, to obtain h-1 quantiles;
taking each of the h-1 quantiles in turn as the target quantile and performing the operation of determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values corresponding one-to-one to the h-1 quantiles;
drawing the equal-height histogram based on the h-1 data values and the data value of the maximum data point and the data value of the minimum data point among the multiple data points.
7. The method according to any one of claims 1-3, wherein before the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, the method further comprises:
determining an estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, and using the estimated quantile as the target quantile;
the querying the target data point based on the target sketch comprises:
querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
8. The method according to claim 7, wherein before the determining an estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, the method further comprises:
receiving a quantile query request, wherein the quantile query request is used to query the standard quantile of a target data point among multiple data points and carries the data value of the target data point.
9. The method according to claim 7, wherein the method further comprises:
receiving an equal-width histogram query request, wherein the equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points and carries a bucket boundary array, the bucket boundary array comprises n boundary values, and the n boundary values divide the range between the data value of the minimum data point and the data value of the maximum data point among the multiple data points into n+1 intervals;
taking each of the n boundary values in turn as the data value of the target data point and performing the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, to obtain n standard quantiles corresponding one-to-one to the n boundary values;
drawing the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
10. The method according to any one of claims 1-9, wherein after the constructing a target sketch based on the target scale function and the multiple data points, the method further comprises:
generating a to-be-updated cluster corresponding to a to-be-updated data point in a cache, wherein the to-be-updated cluster comprises a cluster mean, a cluster weight, and a cluster tag, the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight of the to-be-updated cluster indicates the number of to-be-updated data points, and the cluster tag of the to-be-updated cluster indicates the update type of the to-be-updated data point;
updating the target sketch based on the to-be-updated cluster.
11. The method according to claim 10, wherein the updating the target sketch based on the to-be-updated cluster comprises:
obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-merged tag, to obtain to-be-merged clusters;
merging the to-be-merged clusters into the target sketch.
12. The method according to claim 11, wherein the merging the to-be-merged clusters into the target sketch comprises:
sorting the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean;
for the first cluster after sorting, determining a quantile threshold based on the target scale function, traversing each cluster starting from the second cluster after sorting, and performing the following operations on each cluster in turn:
for an i-th cluster, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, i being an integer greater than 1;
if the current quantile of the i-th cluster is below the quantile threshold, merging the i-th cluster into the previous cluster and continuing the traversal from the previous cluster;
if the current quantile of the i-th cluster exceeds the quantile threshold, updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traversing the next cluster.
13. The method according to claim 10, wherein the updating the target sketch based on the to-be-updated cluster comprises:
obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-deleted tag, to obtain to-be-deleted clusters;
deleting the to-be-deleted clusters from the target sketch.
14. The method according to claim 13, wherein the deleting the to-be-deleted clusters from the target sketch comprises:
sorting the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean;
traversing each cluster starting from the first cluster after sorting, and performing the following operations on each cluster in turn:
for a j-th cluster, determining the cluster tag of the j-th cluster, and if the cluster tag of the j-th cluster is a to-be-deleted tag, deleting the j-th cluster and updating the cluster weights of the clusters adjacent to the j-th cluster, j being an integer greater than or equal to 1.
15. The method according to claim 14, wherein the updating the cluster weights of the clusters adjacent to the j-th cluster comprises:
if the j-th cluster is an intermediate cluster after sorting, obtaining the cluster mean of the left neighbouring cluster of the j-th cluster and the cluster mean of the right neighbouring cluster of the j-th cluster;
determining, based on the cluster mean of the left neighbouring cluster, the cluster mean of the right neighbouring cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left neighbouring cluster and a deletion weight corresponding to the right neighbouring cluster;
updating the cluster weight of the left neighbouring cluster based on the deletion weight corresponding to the left neighbouring cluster, and updating the cluster weight of the right neighbouring cluster based on the deletion weight corresponding to the right neighbouring cluster.
16. The method according to any one of claims 1-15, wherein the constructing a target sketch based on the target scale function and the multiple data points comprises:
obtaining a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch;
constructing a sketch based on the data points of the multiple data points other than the part of the data points and the target scale function, to obtain a second sketch;
aggregating the first sketch and the second sketch to obtain the target sketch.
17. The method according to claim 16, wherein the obtaining a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch, comprises:
obtaining a target time window to be queried, wherein the target data points are data points whose timestamps lie within the target time window;
obtaining a metadata set, wherein the metadata set comprises the metadata of multiple sketches in a cache, the multiple sketches are sketches constructed based on the target scale function, the metadata of each sketch comprises a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
determining first metadata from the metadata set based on the target time window and the timeline to which the target data points belong, wherein the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data points belong;
determining the sketch corresponding to the first metadata as the first sketch.
18. The method according to claim 17, wherein after the constructing a sketch based on the data points of the multiple data points other than the part of the data points and the target scale function, to obtain a second sketch, the method further comprises:
determining the metadata of the second sketch, to obtain second metadata;
caching the second sketch, and adding the second metadata to the metadata set.
19. The method according to claim 17, wherein the method further comprises:
determining the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs;
if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, deleting the sketch corresponding to the third metadata, and updating the metadata set.
20. The method according to claim 17, wherein the metadata set further comprises first usage information corresponding to any sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches matching the any sketch timeline identifier; the method further comprises:
determining, based on the first usage information, a sketch to be evicted among the multiple sketches matching the any sketch timeline identifier, and deleting the sketch to be evicted.
21. The method according to claim 17, wherein the metadata set further comprises second usage information, the second usage information is used to record the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier; the method further comprises:
determining, based on the second usage information, a sketch timeline identifier to be evicted among the multiple sketch timeline identifiers;
deleting the sketches matching the sketch timeline identifier to be evicted.
22. A data point query apparatus, wherein the apparatus comprises:
a first determination module, configured to determine a target scale function from multiple scale functions based on a target quantile corresponding to a target data point to be queried, wherein the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size;
a construction module, configured to construct a target sketch based on the target scale function and the multiple data points, wherein the target sketch comprises multiple clusters, each cluster comprises a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster;
a query module, configured to query the target data point based on the target sketch.
23. The apparatus according to claim 22, wherein the multiple scale functions comprise a first scale function and a second scale function, the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval;
the first determination module is configured to:
if the target quantile lies in the first quantile interval, determine the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determine the second scale function as the target scale function.
24. The apparatus according to claim 23, wherein the first quantile interval comprises an interval from 0 to x1 and an interval from x2 to 1, x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval comprises an interval from x1 to x2.
  25. The device according to any one of claims 22 to 24, wherein the query module is configured to:
    query a data value of the target data point based on the target sketch and the target quantile.
  26. The device according to claim 25, wherein the device further includes:
    a receiving module, configured to receive a data point query request, wherein the data point query request is used to query the data value of a target data point among a plurality of data points, and the data point query request carries a standard quantile of the target data point;
    the first determination module is further configured to determine the standard quantile carried in the data point query request as the target quantile.
  27. The device according to claim 25, wherein the device further includes:
    a receiving module, configured to receive an equal-height histogram query request, wherein the equal-height histogram query request is used to query an equal-height histogram constructed based on the plurality of data points, and the equal-height histogram query request carries a bucket count h, where h is an integer greater than 1;
    the first determination module is further configured to determine, based on the bucket count h and the total number of the plurality of data points, the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right, in the equal-height histogram, to obtain h-1 quantiles;
    the query module is further configured to take each of the h-1 quantiles as the target quantile in turn and perform the operation of determining the target scale function from the plurality of scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values in one-to-one correspondence with the h-1 quantiles;
    the device further includes a drawing module, configured to draw the equal-height histogram based on the h-1 data values and the data values of the largest data point and the smallest data point among the plurality of data points.
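
Illustrative only: how the bucket boundaries of an equal-height histogram could be obtained from claim 27. The helper query_value_at_quantile is a placeholder for the claimed flow (pick the scale function for the quantile, build the target sketch, read off the data value); the k/h quantile formula assumes buckets that each hold an equal share of the points.

    def equal_height_boundaries(points, h, query_value_at_quantile):
        # h-1 interior quantiles: the right edge of bucket k covers k/h of the points
        quantiles = [k / h for k in range(1, h)]
        inner = [query_value_at_quantile(q) for q in quantiles]
        # h+1 bucket edges: minimum, the h-1 queried values, maximum
        return [min(points)] + inner + [max(points)]
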
  28. The device according to any one of claims 22 to 24, wherein the first determination module is further configured to:
    determine an estimated quantile of the target data point based on the data value of the target data point and the data values of the largest data point and the smallest data point among the plurality of data points, and take the estimated quantile as the target quantile;
    the query module is configured to:
    query a standard quantile of the target data point based on the target sketch and the data value of the target data point.
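
Illustrative only: a possible reading of claim 28. The estimated quantile is a coarse linear interpolation between the smallest and largest data values and is used only to pick the scale function; the standard quantile is then read from the target sketch. The half-weight treatment of the boundary cluster is an assumption, and the clusters reuse the Cluster class sketched after claim 22.

    def estimated_quantile(value: float, v_min: float, v_max: float) -> float:
        # coarse estimate used only to choose the scale function
        if v_max == v_min:
            return 0.0
        return (value - v_min) / (v_max - v_min)

    def standard_quantile(sketch, value: float) -> float:
        # accumulate cluster weights below `value`; clusters are sorted by mean
        total = sum(c.weight for c in sketch)
        covered = 0.0
        for c in sketch:
            if c.mean < value:
                covered += c.weight
            else:
                covered += c.weight / 2.0   # assume half of the boundary cluster lies below
                break
        return covered / total
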
  29. The device according to claim 28, wherein the device further includes:
    a receiving module, configured to receive a quantile query request, wherein the quantile query request is used to query the standard quantile of a target data point among the plurality of data points, and the quantile query request carries the data value of the target data point.
  30. The device according to claim 28, wherein the device further includes:
    a receiving module, configured to receive an equal-width histogram query request, wherein the equal-width histogram query request is used to query an equal-width histogram constructed based on the plurality of data points, and the equal-width histogram query request carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the plurality of data points into n+1 intervals;
    the query module is configured to take each of the n boundary values as the data value of the target data point in turn and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the largest data point and the smallest data point among the plurality of data points, to obtain n standard quantiles in one-to-one correspondence with the n boundary values;
    the device further includes a drawing module, configured to draw the equal-width histogram based on the n standard quantiles in one-to-one correspondence with the n boundary values.
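
Illustrative only: turning the n standard quantiles of claim 30 into bucket frequencies for an equal-width histogram. query_quantile_at_value stands in for the claimed flow (estimate the quantile from the boundary value, pick the scale function, build the sketch, read the standard quantile); treating consecutive quantile differences as bucket heights is an assumption about how the drawing module uses the result.

    def equal_width_frequencies(boundaries, query_quantile_at_value):
        # n boundary values -> n standard quantiles -> n+1 bucket frequencies
        qs = [0.0] + [query_quantile_at_value(v) for v in boundaries] + [1.0]
        return [qs[k + 1] - qs[k] for k in range(len(qs) - 1)]
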
  31. The device according to any one of claims 22 to 30, wherein the device further includes:
    a generation module, configured to generate clusters to be updated corresponding to data points to be updated in a cache, wherein each cluster to be updated includes a cluster mean, a cluster weight and a cluster tag, the cluster mean of a cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data points to be updated;
    an update module, configured to update the target sketch based on the clusters to be updated.
  32. The device according to claim 31, wherein the update module is configured to:
    obtain, from the clusters to be updated, the clusters whose cluster tag is a to-be-merged tag, to obtain clusters to be merged;
    merge the clusters to be merged into the target sketch.
  33. The device according to claim 32, wherein the update module is configured to:
    sort the clusters in the target sketch and the clusters to be merged in ascending order of cluster mean;
    for the first sorted cluster, determine a quantile threshold based on the target scale function, traverse each cluster starting from the second sorted cluster, and perform the following operations on each cluster in turn:
    for the i-th cluster, determine a current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1;
    if the current quantile of the i-th cluster is below the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster;
    if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
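
Illustrative only: a compact version of the merge traversal in claims 32 and 33, reusing the Cluster class and scale functions sketched earlier. Expressing the quantile threshold as "the scale function may grow by at most 1 since the last retained cluster" is an assumption; the claims only require that a cluster below the threshold is merged backwards and that the threshold is updated otherwise.

    def merge_into_sketch(sketch, to_merge, scale):
        clusters = sorted(sketch + to_merge, key=lambda c: c.mean)
        total = sum(c.weight for c in clusters)
        merged = [Cluster(clusters[0].mean, clusters[0].weight)]
        q_seen = clusters[0].weight / total
        k_low = scale(0.0)                        # threshold anchor for the first cluster
        for c in clusters[1:]:
            q_cur = q_seen + c.weight / total     # current quantile of the i-th cluster
            if scale(q_cur) - k_low <= 1.0:       # below the threshold: merge into the previous cluster
                last = merged[-1]
                w = last.weight + c.weight
                last.mean = (last.mean * last.weight + c.mean * c.weight) / w
                last.weight = w
            else:                                 # threshold exceeded: keep the cluster, update the threshold
                merged.append(Cluster(c.mean, c.weight))
                k_low = scale(q_seen)
            q_seen = q_cur
        return merged
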
  34. The device according to claim 31, wherein the update module is configured to:
    obtain, from the clusters to be updated, the clusters whose cluster tag is a to-be-deleted tag, to obtain clusters to be deleted;
    delete the clusters to be deleted from the target sketch.
  35. The device according to claim 34, wherein the update module is configured to:
    sort the clusters in the target sketch and the clusters to be deleted in ascending order of cluster mean;
    traverse each cluster starting from the first sorted cluster, and perform the following operations on each cluster in turn:
    for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is the to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  36. The device according to claim 35, wherein the update module is configured to:
    if the j-th cluster is a middle cluster after sorting, obtain the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster;
    determine, based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left adjacent cluster and a deletion weight corresponding to the right adjacent cluster, respectively;
    update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
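
Illustrative only: one way a middle cluster can be deleted under claims 35 and 36. Splitting the deletion weight between the neighbours in inverse proportion to the distance of their means is an assumption; the claims only state that deletion weights for the left and right neighbours are derived from the three cluster means and the deleted cluster's weight, and that both neighbours' weights are updated.

    def delete_middle_cluster(left, mid, right):
        span = right.mean - left.mean
        share_left = (right.mean - mid.mean) / span if span else 0.5
        delete_left = mid.weight * share_left        # deletion weight for the left neighbour
        delete_right = mid.weight - delete_left      # deletion weight for the right neighbour
        left.weight = max(left.weight - delete_left, 0.0)
        right.weight = max(right.weight - delete_right, 0.0)
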
  37. The device according to any one of claims 22 to 36, wherein the construction module is configured to:
    obtain a sketch that has already been cached and that is based on some of the plurality of data points and the target scale function, to obtain a first sketch;
    construct a sketch based on the data points among the plurality of data points other than those some data points and on the target scale function, to obtain a second sketch;
    aggregate the first sketch and the second sketch to obtain the target sketch.
  38. The device according to claim 37, wherein the construction module is configured to:
    obtain a target time window to be queried, wherein the target data point is a data point whose timestamp is located within the target time window;
    obtain a metadata set, wherein the metadata set includes metadata of a plurality of sketches in the cache, the plurality of sketches are sketches constructed based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
    determine first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs, wherein the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs;
    determine the sketch corresponding to the first metadata as the first sketch.
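
Illustrative only: a minimal lookup over the metadata set of claim 38. SketchMeta and find_first_sketches are hypothetical names; the filter keeps exactly the cached sketches whose timeline identifier matches and whose sketch time window is part or all of the target time window, and these become the first sketch(es) to be aggregated with the freshly built second sketch.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SketchMeta:
        timeline_id: str
        window_start: int   # sketch time window, e.g. epoch seconds
        window_end: int
        sketch_key: str     # cache key of the sketch this metadata describes

    def find_first_sketches(metadata_set: List[SketchMeta], timeline_id: str,
                            target_start: int, target_end: int) -> List[SketchMeta]:
        return [m for m in metadata_set
                if m.timeline_id == timeline_id
                and m.window_start >= target_start
                and m.window_end <= target_end]
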
  39. The device according to claim 38, wherein the device further includes:
    a second determination module, configured to determine metadata of the second sketch to obtain second metadata;
    a caching module, configured to cache the second sketch and add the second metadata to the metadata set.
  40. The device according to claim 38, wherein the device further includes:
    a third determination module, configured to determine a timestamp of a data point to be written and an identifier of the timeline to which the data point to be written belongs;
    a first deletion module, configured to: if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, delete the sketch corresponding to the third metadata and update the metadata set.
  41. The device according to claim 38, wherein the metadata set further includes first usage information corresponding to any one sketch timeline identifier, and the first usage information is used to record the usage time of each of the plurality of sketches matching that sketch timeline identifier; the device further includes:
    a second deletion module, configured to determine, based on the first usage information, a sketch to be eliminated among the plurality of sketches matching that sketch timeline identifier, and delete the sketch to be eliminated.
  42. The device according to claim 38, wherein the metadata set further includes second usage information, the second usage information is used to record usage information corresponding to each of a plurality of sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier; the device further includes:
    a third deletion module, configured to determine, based on the second usage information, a sketch timeline identifier to be eliminated among the plurality of sketch timeline identifiers, and delete the sketches matching the sketch timeline identifier to be eliminated.
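
Illustrative only: an eviction step matching claims 41 and 42 under the assumption that the recorded usage time is used as a least-recently-used criterion. second_usage_info maps each sketch timeline identifier to its last usage time, sketches_by_timeline maps it to the cache keys of the matching sketches, and cache holds the sketches themselves; all three structures are hypothetical.

    def evict_lru_timeline(second_usage_info: dict, sketches_by_timeline: dict, cache: dict) -> str:
        # the timeline identifier with the oldest usage time is eliminated
        victim = min(second_usage_info, key=second_usage_info.get)
        for key in sketches_by_timeline.pop(victim, []):
            cache.pop(key, None)        # delete every sketch matching the eliminated identifier
        del second_usage_info[victim]
        return victim
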
  43. A computing device cluster, characterized in that it includes at least one computing device, and each computing device includes a processor and a memory;
    the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method according to any one of claims 1 to 21.
  44. A computer program product containing instructions, characterized in that, when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 21.
  45. A computer-readable storage medium, characterized in that it includes computer program instructions, and when the computer program instructions are executed by a computing device cluster, the computing device cluster performs the method according to any one of claims 1 to 21.
PCT/CN2023/086007 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product, and storage medium WO2024016731A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210855232 2022-07-19
CN202210855232.X 2022-07-19
CN202211091505.4A CN117472975A (en) 2022-07-19 2022-09-07 Data point query method, data point query device cluster, data point query program product and data point query storage medium
CN202211091505.4 2022-09-07

Publications (1)

Publication Number Publication Date
WO2024016731A1 true WO2024016731A1 (en) 2024-01-25

Family

ID=89616930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086007 WO2024016731A1 (en) 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product, and storage medium

Country Status (1)

Country Link
WO (1) WO2024016731A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180088813A1 (en) * 2016-09-23 2018-03-29 Samsung Electronics Co., Ltd. Summarized data storage management system for streaming data
US10248476B2 (en) * 2017-05-22 2019-04-02 Sas Institute Inc. Efficient computations and network communications in a distributed computing environment
CN108388603A (en) * 2018-02-05 2018-08-10 中国科学院信息工程研究所 The construction method and querying method of distributed summary data structure based on Spark frames
CN110968835A (en) * 2019-12-12 2020-04-07 清华大学 Approximate quantile calculation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAIDU GEEK SPEAKING: "A System and Method Based on Real-time Quantile Calculation", CSDN BLOG, 27 May 2021 (2021-05-27), XP093131717, Retrieved from the Internet <URL:https://blog.csdn.net/lihui49/article/details/117250392> [retrieved on 20240215] *
LOVE TO EAT CORIANDER AND SCALLION: "T-digest", CSDN BLOG, 20 July 2020 (2020-07-20), XP093131713, Retrieved from the Internet <URL:https://blog.csdn.net/qq_41648804/article/details/107474870> [retrieved on 20240215] *

Similar Documents

Publication Publication Date Title
US7603339B2 (en) Merging synopses to determine number of distinct values in large databases
US7636731B2 (en) Approximating a database statistic
US10042914B2 (en) Database index for constructing large scale data level of details
EP2997472B1 (en) Managing memory and storage space for a data operation
CN114168608B (en) Data processing system for updating knowledge graph
CN111061758B (en) Data storage method, device and storage medium
CN112925821B (en) MapReduce-based parallel frequent item set incremental data mining method
Awad et al. Dynamic graphs on the GPU
CN105045806A (en) Dynamic splitting and maintenance method of quantile query oriented summary data
CN112925859A (en) Data storage method and device
WO2015168988A1 (en) Data index creation method and device, and computer storage medium
Beyer et al. Distinct-value synopses for multiset operations
Hershberger et al. Adaptive sampling for geometric problems over data streams
CN108829343B (en) Cache optimization method based on artificial intelligence
CN116756494B (en) Data outlier processing method, apparatus, computer device, and readable storage medium
AU2020101071A4 (en) A Parallel Association Mining Algorithm for Analyzing Passenger Travel Characteristics
WO2024016731A1 (en) Data point query method and apparatus, device cluster, program product, and storage medium
Wang et al. Stull: Unbiased online sampling for visual exploration of large spatiotemporal data
JP6006740B2 (en) Index management device
US11520834B1 (en) Chaining bloom filters to estimate the number of keys with low frequencies in a dataset
CN117472975A (en) Data point query method, data point query device cluster, data point query program product and data point query storage medium
CN107846327A (en) A kind of processing method and processing device of network management performance data
Nabil et al. Mining frequent itemsets from online data streams: Comparative study
CN110990394A (en) Distributed column database table-oriented line number statistical method and device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23841804

Country of ref document: EP

Kind code of ref document: A1