WO2024016731A1 - Data point query method and apparatus, device cluster, program product, and storage medium - Google Patents

Publication number
WO2024016731A1
WO2024016731A1 (PCT/CN2023/086007)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
target
sketch
quantile
data point
Prior art date
Application number
PCT/CN2023/086007
Other languages
French (fr)
Chinese (zh)
Inventor
刘超
叶冠宇
李云川
李仕林
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211091505.4A (published as CN117472975A)
Application filed by 华为云计算技术有限公司
Publication of WO2024016731A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Definitions

  • Embodiments of the present application relate to the field of cloud computing technology, and in particular to a data point query method, device, equipment cluster, program product and storage medium.
  • Data points refer to data collected by relevant devices in Internet of Things technology, such as temperatures collected by temperature sensing devices.
  • Data point query is used to query the characteristics of a certain data point in a batch of data points, for example, querying the quantile of a data point in the batch based on its data value, or querying the data value of a data point based on its quantile.
  • The quantile indicates the position of the data point in a batch of data points sorted by size.
  • Embodiments of the present application provide a data point query method, device, equipment cluster, program product and storage medium, which can efficiently and accurately query a certain data point from massive data points.
  • the technical solutions are as follows:
  • In a first aspect, a data point query method is provided.
  • Based on the target quantile corresponding to the target data point to be queried, a target scale function is determined from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size.
  • A target sketch is constructed based on the target scale function and the multiple data points. The target sketch includes multiple clusters, and each cluster includes a cluster mean and a cluster weight.
  • The cluster mean indicates the mean value of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. The target data point is then queried based on the target sketch.
  • In this way, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile. When the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, which improves the accuracy of querying the target data point based on the sketch.
  • The multiple scale functions include a first scale function and a second scale function.
  • The density of the clusters in the sketch constructed based on the first scale function on the first quantile interval is greater than the density of the clusters in the sketch constructed based on the second scale function on the first quantile interval.
  • The density of the clusters in the sketch constructed based on the first scale function on the second quantile interval is less than the density of the clusters in the sketch constructed based on the second scale function on the second quantile interval.
  • In this case, determining the target scale function from the multiple scale functions can be implemented as follows: if the target quantile is located in the first quantile interval, the first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • The sketch constructed based on the first scale function has denser clusters on the first quantile interval, and the sketch constructed based on the second scale function has denser clusters on the second quantile interval.
  • The first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2.
  • In this way, the method provided by the embodiments of the present application can accurately query the data point corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query over the entire range can be achieved.
  • In one implementation, querying the target data point based on the target sketch may be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • In the embodiments of the present application, the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. That is to say, the method provided by the embodiments of the present application is suitable for data point queries in various scenarios, which improves the flexibility of the method.
  • A data point query request may be received. The data point query request is used to query the data value of a target data point among multiple data points, and it carries the standard quantile of the target data point.
  • In this case, the standard quantile carried in the data point query request is determined as the target quantile, so that the target sketch can be constructed based on the target quantile and the data value of the target data point can then be queried. This improves the accuracy of the queried data value.
  • An equal-height histogram query request may also be received. The equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points, and it carries the number of buckets h, where h is an integer greater than 1.
  • Based on the number of buckets h and the total number of the multiple data points, the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, are determined to obtain h-1 quantiles.
  • Each of the h-1 quantiles is used in turn as the target quantile, and the operation of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried is performed, so as to obtain h-1 data values that correspond to the h-1 quantiles one-to-one.
  • In another implementation, querying the target data point based on the target sketch may be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • In this scenario the quantile of the target data point is not known in advance. Therefore, a quantile can be estimated based on the data value of the data point, the estimated quantile is used as the target quantile, and the scale function is adaptively selected to construct the sketch, which improves the accuracy of the standard quantile obtained by the subsequent query.
  • Specifically, the estimated quantile of the target data point is determined based on the data value of the target data point, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • A quantile query request may be received. The quantile query request is used to query the standard quantile of the target data point among the multiple data points, and it carries the data value of the target data point.
  • Querying the standard quantile of the target data point based on its data value can thus be applied in the scenario where a quantile query request is received, which improves the accuracy of the standard quantile queried in this scenario.
  • An equal-width histogram query request may also be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points, and it carries a bucket boundary array.
  • The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals.
  • Each of the n boundary values is used in turn as the data value of the target data point, and the operation of determining the estimated quantile of the target data point based on that data value, together with the data values of the largest and smallest data points among the multiple data points, is performed, so as to obtain n standard quantiles that correspond to the n boundary values one-to-one.
  • Querying the quantile of the target data point based on its data value can thus be applied in the scenario where an equal-width histogram query request is received, which improves the accuracy of the equal-width histogram queried in this scenario.
  • A cluster to be updated corresponding to a data point to be updated in the cache can also be generated. The cluster to be updated includes a cluster mean, a cluster weight and a cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster mark of the cluster to be updated indicates the update type of the data point to be updated. The target sketch is then updated based on the cluster to be updated.
  • In this way, the data points in the cache are expressed as clusters to be updated in the form of triplets, as mentioned above, which makes it convenient to subsequently update the target sketch based on the data points to be updated in the cache.
  • In one case, updating the target sketch may be implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-merged mark to obtain the clusters to be merged, and then merge the clusters to be merged into the target sketch.
  • In this way, the clusters to be merged, that is, the data points that need to be added, can be filtered out from the cache based on the cluster marks, and the clusters to be merged are then merged into the target sketch.
  • Merging the clusters to be merged into the target sketch may be implemented as follows: sort the clusters in the target sketch and the clusters to be merged in order of cluster mean from small to large; for the first cluster after sorting, determine a quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn. For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • In this way, the clusters to be updated that are to be merged can be added to the other clusters of the target sketch to update the target sketch.
  • In another case, updating the target sketch may be implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-deleted mark to obtain the clusters to be deleted, and then delete the clusters to be deleted from the target sketch.
  • In this way, the clusters to be deleted, that is, the data points that need to be deleted, can be filtered out from the cache based on the cluster marks, and the clusters to be deleted are then removed from the target sketch.
  • Deleting the clusters to be deleted from the target sketch can be implemented as follows: sort the clusters in the target sketch and the clusters to be deleted in order of cluster mean from small to large; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster in turn. For the j-th cluster, determine the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • In this way, the clusters to be deleted can be removed from the target sketch to update the target sketch.
  • Updating the cluster weights of the clusters adjacent to the j-th cluster can be implemented as follows: if the j-th cluster is an intermediate cluster after sorting, obtain the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster; then update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
  • Constructing the target sketch based on the target scale function and the multiple data points may be implemented as follows: obtain a cached sketch that was built based on some of the multiple data points and the target scale function, to obtain a first sketch; construct a sketch based on the data points other than those data points among the multiple data points and the target scale function, to obtain a second sketch; and aggregate the first sketch and the second sketch to obtain the target sketch.
  • In this way, when part of the data points already have a cached sketch, only the remaining data points need to be used to construct the current sketch, and the currently constructed sketch and the previously constructed sketch are merged to obtain the target sketch. This avoids building the target sketch from the full set of data points for every query, thereby saving computing resources.
  • Obtaining the cached sketch that was built based on some of the multiple data points and the target scale function, that is, obtaining the first sketch, may be implemented as follows: obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window, and obtain a metadata set, which includes the metadata of multiple sketches in the cache.
  • The multiple sketches are sketches built based on the target scale function.
  • The metadata includes a sketch time window and a sketch timeline identifier.
  • The sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points of the corresponding sketch belong. Based on the target time window and the timeline to which the target data point belongs, first metadata is determined from the metadata set; the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs. The sketch corresponding to the first metadata is determined as the first sketch.
  • In this way, the cached sketches can be managed through the metadata set, so that when a certain data point is queried, the cached sketches can be obtained based on the metadata set, which improves the efficiency of obtaining cached sketches.
  • After the second sketch is constructed based on the data points other than the partial data points among the multiple data points and the target scale function, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set.
  • In this way, the metadata set can also be updated based on the second sketch, so that subsequent query operations can be performed based on the updated metadata set.
  • In addition, the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined. If the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.
  • In this way, when new data points are written, the cached sketch that covers them is invalidated, so as to avoid inconsistency between the query results and the actual data.
  • The metadata set further includes first usage information corresponding to any sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches that match that sketch timeline identifier.
  • In this case, the sketch to be eliminated among the multiple sketches matching that sketch timeline identifier can be determined based on the first usage information, and the sketch to be eliminated is deleted.
  • The metadata set further includes second usage information, and the second usage information is used to record the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set.
  • The usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches that match the corresponding sketch timeline identifier.
  • In this case, the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers can be determined based on the second usage information, and the sketches matching the sketch timeline identifier to be eliminated are deleted; a minimal code sketch of this usage-based cache management is given below.
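  • A minimal Python sketch of this metadata bookkeeping follows; the class and field names (SketchMetadata, MetadataSet, last_used) are illustrative assumptions rather than terms from the application, and least-recently-used eviction is only one possible reading of selecting the sketch to be eliminated based on usage time.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SketchMetadata:
    timeline_id: str                  # identifier of the timeline the sketch's data points belong to
    time_window: Tuple[float, float]  # (start, end) of the timestamps used to build the sketch
    sketch_key: str                   # key under which the sketch itself is cached
    last_used: float = field(default_factory=time.time)

class MetadataSet:
    def __init__(self) -> None:
        self.entries: List[SketchMetadata] = []

    def find_cached(self, timeline_id: str, window: Tuple[float, float]) -> List[SketchMetadata]:
        """Return metadata whose timeline matches and whose sketch time window is
        part or all of the target time window, recording the usage time."""
        start, end = window
        hits = [m for m in self.entries
                if m.timeline_id == timeline_id
                and start <= m.time_window[0] and m.time_window[1] <= end]
        now = time.time()
        for m in hits:
            m.last_used = now
        return hits

    def evict_one(self, timeline_id: str) -> Optional[SketchMetadata]:
        """Pick the sketch to be eliminated for one timeline based on its usage time."""
        candidates = [m for m in self.entries if m.timeline_id == timeline_id]
        if not candidates:
            return None
        victim = min(candidates, key=lambda m: m.last_used)
        self.entries.remove(victim)
        return victim
```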
  • In a second aspect, a data point query apparatus is provided, which has the function of implementing the behavior of the data point query method in the first aspect.
  • The data point query apparatus includes at least one module, and the at least one module is used to implement the data point query method provided in the first aspect.
  • In a third aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the data point query method provided in the first aspect.
  • In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores instructions that, when run on a computer, cause the computer to execute the data point query method described in the first aspect.
  • In a fifth aspect, a computer program product containing instructions is provided that, when run on a computer, causes the computer to execute the data point query method described in the first aspect.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application
  • Figure 2 is a schematic diagram of the curve change trend of a first scale function S1(q) and of the derivative of S1(q) provided by an embodiment of the present application;
  • Figure 3 is a schematic diagram of the curve change trend of a second scale function S2(q) and of the derivative of S2(q) provided by an embodiment of the present application;
  • Figure 4 is a schematic diagram of a query process for querying data values based on target sketches and target quantiles provided by an embodiment of the present application
  • Figure 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a query process for querying the standard quantile q of a target data point based on the target sketch and the data value Q of the target data point provided by the embodiment of the present application;
  • Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application.
  • Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application.
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application.
  • Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application.
  • Figure 12 is a flow chart of an incremental update method provided by an embodiment of the present application.
  • Figure 13 is a schematic diagram of managing metadata from the spatial and temporal dimensions provided by the embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a data point query device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 17 is a schematic diagram of a connection method between computing device clusters provided by an embodiment of the present application.
  • A quantile is used to characterize the position of a certain data point in a sequence of a large number of data points sorted by size. Compared with using extreme values (maximum and/or minimum values) to characterize a large number of data points, quantiles can shield the false extreme-value information caused by abnormal data points, and can therefore represent the real information at each stage of a large number of data points. Based on this, for companies that provide Internet services, quantiles can serve as one of the important indicators for measuring a company's network operating status. In addition, quantile query is also used in weather temperature trends, log mining, stock trend analysis, virtual currency volume and price indicators, financial data analysis and other fields.
  • To compute quantiles exactly, all data points need to be sorted, and the quantile corresponding to each data point is then calculated based on the position of each sorted data point.
  • The value range of q is a real number between 0 and 1, that is, q ∈ [0,1].
  • The time and space complexity of determining quantiles with this technique is O(NlogN), where N is the total number of data points.
  • Once the quantile of each data point is known, if the quantile of the data point to be queried is q, the item at the position corresponding to q among all sorted data points is located, and the result obtained is the data value of that data point, that is, the query result; a minimal example of this exact computation is sketched below.
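  • The following Python sketch illustrates the exact (full-sort) baseline described above; the function name and the rounding convention at the boundary are illustrative assumptions, not part of the application.

```python
import math

def exact_quantile_value(data, q):
    """Sort all N points, then read off the element whose position corresponds
    to quantile q (q in [0, 1]); this is the O(N log N) baseline."""
    ordered = sorted(data)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# e.g. exact_quantile_value([5, 1, 9, 3, 7], 0.5) -> 5
```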
  • The t-digest algorithm (an online clustering algorithm) is currently a commonly used algorithm in approximate quantile calculation technology.
  • The basic principle of this algorithm is to cluster all data points to obtain multiple clusters.
  • Each cluster has a corresponding cluster mean and cluster weight.
  • The cluster mean indicates the mean value of the data points aggregated into the corresponding cluster, and the cluster weight indicates the number of data points aggregated into the corresponding cluster. The set of clusters built in this way is usually called a sketch.
  • The quantile of each cluster can be determined based on the cluster mean and cluster weight corresponding to each cluster in the sketch.
  • When querying, linear interpolation is used to calculate the approximate data value of a data point based on the quantile and cluster mean of each cluster in the sketch.
  • The accuracy and efficiency of queries in this algorithm can be adjusted through the number of clusters in the sketch.
  • Histograms can intuitively describe the data distribution characteristics of multiple data points, so histograms are widely used in the field of network monitoring and operation and maintenance.
  • The abscissa of a histogram represents the data value of the data points, and the ordinate represents the number of data points.
  • A histogram includes multiple bars, and each bar can be called a bucket. The height of each bucket represents the number of data points whose data values fall into the data value interval corresponding to that bucket.
  • Histograms include equal-height histograms and equal-width histograms.
  • An equal-height histogram is a histogram in which the heights of all buckets are close to each other.
  • An equal-width histogram is a histogram in which all buckets have the same width.
  • To this end, embodiments of this application provide a data point query method.
  • The method provided by the embodiments of the present application can achieve the following technical effects: first, high-precision query of data point quantiles over the entire range; second, support for deleting data points from the sketch; and third, incremental update, which avoids rebuilding the sketch for every query and thus avoids wasting resources.
  • Figure 1 is a flow chart of a data point query method provided by an embodiment of the present application. As shown in Figure 1, the method includes the following steps 101 to 103.
  • Step 101: Based on the target quantile corresponding to the target data point to be queried, determine the target scale function from multiple scale functions. The density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size.
  • The scale function is used to control the density of the clusters in the sketch.
  • The density of the clusters in the sketch is related to the size of each cluster.
  • The size of a cluster indicates the number of data points aggregated into that cluster. The larger the cluster, the more data points it aggregates, and the cluster mean then represents the data value of a large number of data points.
  • In that case the clusters in the sketch are relatively sparse, making it difficult to distinguish the data values of individual data points from the sketch, so the accuracy of the sketch is lower. The smaller the cluster, the fewer data points it aggregates, and the cluster mean then represents the data value of a small number of data points.
  • In that case the clusters in the sketch are denser, making it easier to distinguish the data value of each data point from the sketch, so the accuracy of the sketch is higher.
  • Therefore, the scale function can be used to control the accuracy of the sketch so as to improve the accuracy of subsequent queries.
  • In the embodiments of the present application, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed by the target scale function has dense clusters near the target quantile. When the clusters in the sketch are relatively dense, they can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data point based on the sketch.
  • The multiple scale functions include a first scale function and a second scale function.
  • The clusters in the sketch constructed based on the first scale function are denser on the first quantile interval than the clusters in the sketch constructed based on the second scale function are on the first quantile interval.
  • The clusters in the sketch constructed based on the first scale function are less dense on the second quantile interval than the clusters in the sketch constructed based on the second scale function are on the second quantile interval.
  • the implementation process of determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point in step 101 can be: if the target quantile is located in the first quantile interval, then The first scale function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
  • the sketches constructed based on the first scale function have denser clusters on the first quantile interval
  • the sketches constructed based on the second scale function have denser clusters on the second quantile interval
  • The first quantile interval and the second quantile interval can be any intervals within the global quantile interval [0,1].
  • The union of the first quantile interval and the second quantile interval is the global quantile interval [0,1].
  • In this way, the method provided by the embodiments of the present application can accurately query the data point corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query over the entire range is achieved.
  • The first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2. That is, the first quantile interval covers the two ends of the global quantile interval [0,1], and the second quantile interval is the middle of the global quantile interval [0,1].
  • For example, x1 can be 0.2 and x2 can be 0.8.
  • In this case, the first quantile interval corresponding to the first scale function is [0,0.2] and [0.8,1], and the second quantile interval corresponding to the second scale function is [0.2,0.8].
  • Of course, x1 and x2 can also be other real numbers in the global quantile interval [0,1]; the embodiments of this application do not list the examples one by one here.
  • For example, the first scale function can be designed as the function shown in the following formula (1), and the second scale function can be designed as the function shown in the following formula (2):
  • q in formula (1) and formula (2) represents the quantile, the hyperparameter in the formulas indicates the number of clusters, and S1(q) and S2(q) represent the first scale function and the second scale function respectively.
  • The derivatives of S1(q) and S2(q) characterize the density of the clusters in the constructed sketch.
  • FIG. 2 is a schematic diagram of the curve change trend of the first scale function S1(q) and the derivative of S1(q) provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the curve change trend of the second scale function S2(q) and the derivative of S2(q) provided by an embodiment of the present application.
  • When the target quantile falls in the first quantile interval, the first scale function S1(q) can be selected to construct the sketch; when the target quantile falls in the second quantile interval, the second scale function S2(q) can be selected to construct the sketch, which improves the accuracy of the constructed sketch and thereby improves the accuracy of querying data points. That is to say, the embodiments of the present application provide a method for adaptively selecting the scale function to construct the sketch based on the query environment.
  • In other words, different scale functions correspond to different cluster densities on different intervals of the global quantile interval [0,1], that is, these scale functions perform differently on different intervals of the global quantile interval [0,1], which makes possible the adaptive selection of the scale function for constructing the sketch based on the query environment provided by the embodiments of this application; a minimal code sketch of this selection is given below.
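  • To make the adaptive selection in step 101 concrete, the following Python sketch chooses between two scale functions by interval. The application's own formulas (1) and (2) are not reproduced above, so s1 and s2 here are illustrative stand-ins (an arcsine-style function that is dense at the tails, and a smooth function that is dense in the middle); DELTA, X1 and X2 are likewise assumed example values.

```python
import math

DELTA = 100  # example hyperparameter controlling the number of clusters

def s1(q):
    # arcsine-style scale function: its derivative ~ 1/sqrt(q(1-q)), so clusters
    # are small (dense) near q = 0 and q = 1, i.e. on the first quantile interval
    return DELTA / (2 * math.pi) * math.asin(2 * q - 1)

def s2(q):
    # a smooth monotone function whose derivative 6q(1-q) peaks at q = 0.5, so
    # clusters are small (dense) in the middle, i.e. on the second quantile interval
    return DELTA * (3 * q**2 - 2 * q**3)

X1, X2 = 0.2, 0.8  # example interval boundaries from the description

def pick_scale_function(target_quantile):
    """Adaptive selection of step 101: tails -> first function, middle -> second."""
    if target_quantile <= X1 or target_quantile >= X2:
        return s1
    return s2
```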
  • Step 102: Construct the target sketch based on the target scale function and the multiple data points.
  • The target sketch includes multiple clusters.
  • Each cluster includes a cluster mean and a cluster weight.
  • The cluster mean indicates the mean value of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster.
  • The implementation of constructing the target sketch based on the target scale function and the multiple data points may follow the t-digest algorithm or other clustering methods, which is not limited in the embodiments of the present application; one possible construction is sketched below.
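  • The following Python sketch shows one possible construction along the lines of the t-digest algorithm referenced above, assuming at least one data point; the cluster-size limit of one scale-function unit and the helper names are assumptions of this sketch, not the application's exact procedure.

```python
def build_sketch(data_points, scale_fn):
    """Minimal t-digest-style construction (step 102). Each cluster is a
    [mean, weight] pair; a cluster keeps absorbing points until the scale
    function has advanced by more than one unit since the cluster was opened."""
    pts = sorted(data_points)
    total = float(len(pts))
    clusters = []
    w_done = 0.0                       # total weight of clusters already closed
    cur_mean, cur_w = pts[0], 1.0
    k_limit = scale_fn(0.0) + 1.0      # k-value the open cluster may grow up to
    for x in pts[1:]:
        q = (w_done + cur_w + 1.0) / total   # quantile if x is absorbed
        if scale_fn(min(q, 1.0)) <= k_limit:
            cur_mean = (cur_mean * cur_w + x) / (cur_w + 1.0)
            cur_w += 1.0
        else:
            clusters.append([cur_mean, cur_w])
            w_done += cur_w
            k_limit = scale_fn(w_done / total) + 1.0
            cur_mean, cur_w = x, 1.0
    clusters.append([cur_mean, cur_w])
    return clusters
```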
  • Step 103 Query target data points based on the target sketch.
  • the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. This is explained below in two application scenarios.
  • The first application scenario: querying a data value based on a quantile.
  • In this scenario, step 103 can be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
  • The target quantile is denoted as q, where q is a decimal between 0 and 1.
  • The query result obtained based on the target sketch and q is an approximate estimate of the element at the position corresponding to q in the sorted result of all data points; this query result is the data value of the target data point.
  • C1.weight is the cluster weight of the first cluster in the target sketch, and C1.value is the cluster mean of the first cluster in the target sketch.
  • The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean from small to large.
  • Cm.weight is the cluster weight of the last cluster in the target sketch, and Cm.value is the cluster mean of the last cluster in the target sketch.
  • The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean from small to large.
  • Wi is the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster), that is, Wi = C1.weight + C2.weight + ... + Ci.weight.
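  • The interpolation described above can be sketched as follows, where each cluster is a [mean, weight] pair sorted by mean; the centre-of-cluster interpolation rule follows the t-digest convention and is one plausible reading, not the application's exact formula.

```python
def query_value(clusters, q):
    """Query the data value at quantile q from a sketch whose clusters
    ([mean, weight] pairs) are sorted by mean, interpolating linearly between
    the cumulative-weight centres of neighbouring clusters."""
    total = sum(w for _, w in clusters)
    target = q * total
    w_before = 0.0
    for i, (mean, w) in enumerate(clusters):
        w_center = w_before + w / 2.0            # cumulative weight at this cluster's centre
        if target <= w_center or i == len(clusters) - 1:
            if i == 0:
                return mean                      # left tail: return the first cluster mean
            prev_mean, prev_w = clusters[i - 1]
            prev_center = w_before - prev_w / 2.0
            frac = (target - prev_center) / (w_center - prev_center)
            frac = min(1.0, max(0.0, frac))
            return prev_mean + frac * (mean - prev_mean)
        w_before += w
    return clusters[-1][0]
```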
  • Case 1: Query in response to a data point query request.
  • In this case, a data point query request can be received.
  • The data point query request is used to query the data value of a target data point among multiple data points, and the data point query request carries the standard quantile of the target data point. The standard quantile carried in the data point query request is determined as the target quantile.
  • The standard quantile can be a quantile input by the user; that is, when the user triggers the data point query request, the user also inputs a quantile, so that the method provided by the embodiments of the application can subsequently query a specific data value based on the quantile input by the user.
  • In this way, the scale function can be adaptively selected according to the quantile input by the user, and the sketch constructed accordingly is relatively dense in the interval near the quantile input by the user, thereby improving the accuracy of the query result.
  • In another case, an equal-height histogram query request may be received, which is used to query an equal-height histogram constructed based on multiple data points; the request carries the number of buckets h.
  • In this case, the way to determine the target quantile is: based on the number of buckets h and the total number of the multiple data points, determine the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, to obtain h-1 quantiles; use each of the h-1 quantiles in turn as the target quantile, and perform steps 101 to 103 to obtain h-1 data values that correspond to the h-1 quantiles one-to-one.
  • After the h-1 data values are obtained, an equal-height histogram can be drawn based on the h-1 data values, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • The height of each bucket in the equal-height histogram is equal, namely the ratio of the total number N to the number of buckets h.
  • The coordinates on the horizontal axis of the equal-height histogram increase from left to right.
  • The h buckets from left to right in the equal-height histogram are denoted as the first bucket, the second bucket, ..., and the h-th bucket.
  • The quantiles of the first bucket to the (h-1)-th bucket, counted from left to right in the equal-height histogram, can be determined as follows:
  • The quantile of the i-th bucket can be expressed as i/h, where i is an integer greater than or equal to 1 and less than or equal to h.
  • Each bucket in the equal-height histogram has a corresponding left boundary value and right boundary value on the abscissa, and the quantile of each bucket mentioned above specifically refers to the quantile corresponding to the right boundary value of that bucket. Therefore, the quantile corresponding to the h-th bucket is 1.
  • The h-1 data values that correspond to the h-1 quantiles one-to-one serve as the left and right boundary values of the buckets (the right boundary of one bucket is the left boundary of the next), and the height of each bucket is the ratio between the total number and the number of buckets h; a short code sketch of this equal-height query is given below.
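  • Building on the query_value sketch shown earlier, the equal-height query could look like the following; the function names and the return shape are illustrative.

```python
def equal_height_histogram(clusters, total_n, h):
    """Equal-height histogram query: the right boundary of bucket i sits at
    quantile i/h, so the h-1 interior boundaries are obtained by querying the
    sketch at quantiles 1/h, 2/h, ..., (h-1)/h."""
    quantiles = [i / h for i in range(1, h)]
    boundaries = [query_value(clusters, q) for q in quantiles]
    bucket_height = total_n / h        # every bucket holds the same number of points
    return boundaries, bucket_height
```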
  • FIG. 5 is a schematic flowchart of querying equal-height histograms provided by an embodiment of the present application. As shown in Figure 5, the process of querying the equal height histogram includes the following steps:
  • The second application scenario: querying a quantile based on a data value.
  • In this scenario, the quantile of the target data point is not known in advance.
  • Therefore, a quantile is first estimated, the estimated quantile is used as the target quantile, and the scale function is adaptively selected to build the sketch.
  • The implementation of determining the target quantile may be: based on the data value of the target data point, together with the data value of the largest data point and the data value of the smallest data point among the multiple data points, determine the estimated quantile of the target data point, and use the estimated quantile as the target quantile.
  • The estimated quantile of the target data point can be determined by the following formula, where Q is the data value of the target data point to be queried:
  • Of course, determining the estimated quantile of the target data point can also be implemented in other ways, which the embodiments of the present application do not limit; one plausible estimate is sketched below.
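  • Since the estimation formula itself is not reproduced above, the following Python sketch is only one plausible reading: the queried value Q is placed linearly between the smallest and largest data values, and the result is clamped to [0,1].

```python
def estimate_quantile(value_q, v_min, v_max):
    """Estimate a quantile for data value Q from the smallest and largest
    data values; an assumed linear placement, not the application's formula."""
    if v_max == v_min:
        return 0.5
    q = (value_q - v_min) / (v_max - v_min)
    return min(1.0, max(0.0, q))
```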
  • step 103 can be implemented by querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
  • the quantiles obtained by the query are called standard quantiles.
  • the data value of the target data point is marked as Q
  • the standard quantile is marked as q
  • the query result obtained based on the target sketch and Q is q.
  • C1.weight is the cluster weight of the first cluster in the target sketch, and C1.value is the cluster mean of the first cluster in the target sketch.
  • The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean from small to large.
  • Cm.weight is the cluster weight of the last cluster in the target sketch, and Cm.value is the cluster mean of the last cluster in the target sketch.
  • The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean from small to large.
  • Wi is the cumulative sum of the cluster weights of the clusters that have been traversed (including the current cluster), that is, Wi = C1.weight + C2.weight + ... + Ci.weight.
  • The queried standard quantile q can then be obtained by the following formula:
  • In this way, a quantile can be estimated based on the data value input by the user, the scale function can be adaptively selected based on the estimated quantile, and a sketch can be constructed whose clusters are relatively dense in the interval near the corresponding quantile, thereby improving the accuracy of the query result.
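  • A minimal sketch of this reverse query is given below, mirroring the earlier query_value example: the data value Q is located between neighbouring cluster means and the cumulative weights are interpolated. The exact formula used by the application is not reproduced above, so this is an assumed implementation.

```python
def query_quantile(clusters, value_q):
    """Estimate the standard quantile of data value Q from a sketch whose
    clusters ([mean, weight] pairs) are sorted by mean."""
    total = sum(w for _, w in clusters)
    if value_q <= clusters[0][0]:
        return 0.0
    if value_q >= clusters[-1][0]:
        return 1.0
    w_before = 0.0
    for i in range(1, len(clusters)):
        left_mean, left_w = clusters[i - 1]
        right_mean, right_w = clusters[i]
        if left_mean <= value_q <= right_mean:
            left_center = w_before + left_w / 2.0
            right_center = w_before + left_w + right_w / 2.0
            frac = ((value_q - left_mean) / (right_mean - left_mean)
                    if right_mean > left_mean else 0.0)
            return (left_center + frac * (right_center - left_center)) / total
        w_before += left_w
    return 1.0
```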
  • In another case, an equal-width histogram query request may be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points, and it carries a bucket boundary array.
  • The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals.
  • Each of the n boundary values is used in turn as the data value of the target data point, and steps 101 to 103 are performed to obtain n standard quantiles corresponding to the n boundary values one-to-one.
  • After the n standard quantiles are obtained, an equal-width histogram can be drawn based on the n standard quantiles that correspond to the n boundary values one-to-one.
  • The n boundary values in the bucket boundary array are arranged in order from small to large, and the n boundary values form an arithmetic sequence so that each bucket in the equal-width histogram has the same width.
  • The coordinates on the horizontal axis of the equal-width histogram increase from left to right.
  • The n+1 buckets from left to right in the equal-width histogram are denoted as the first bucket, the second bucket, ..., and the (n+1)-th bucket.
  • The left boundary value of the first bucket is the data value of the smallest data point among all the data points; the left boundary value of the second bucket (that is, the right boundary value of the first bucket) is the first boundary value in the bucket boundary array; the left boundary value of the third bucket (that is, the right boundary value of the second bucket) is the second boundary value in the bucket boundary array; and so on, the left boundary value of the (n+1)-th bucket (that is, the right boundary value of the n-th bucket) is the n-th boundary value in the bucket boundary array, and the right boundary value of the (n+1)-th bucket is the data value of the largest data point among all data points.
  • The specific process of drawing the equal-width histogram can be: after the quantile corresponding to each boundary value in the bucket boundary array is determined, the number of data points falling between two adjacent boundary values can be determined based on the total number and the quantile corresponding to each boundary value; based on the number of data points falling between two adjacent boundary values, the height of each bucket in the equal-width histogram can be obtained.
  • The specific implementation method is explained in detail later.
  • FIG. 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application. As shown in Figure 7, the process of querying an equal-width histogram includes the following steps:
  • Each element in the array C obtained in this way is the height of one bucket.
  • The height of each bucket represents the ratio between the number of data points whose data values fall within the boundaries of that bucket and the total number N; a short code sketch of this equal-width query is given below.
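  • Reusing the query_quantile sketch above, the equal-width query and the array C of bucket heights could be computed as follows; the boundary handling and the names are illustrative assumptions.

```python
def equal_width_histogram(clusters, boundaries, total_n):
    """Equal-width histogram query: boundaries is the bucket boundary array of n
    values sorted ascending; each boundary is converted to a standard quantile,
    and the height of a bucket is the share of points between adjacent quantiles."""
    qs = [0.0] + [query_quantile(clusters, b) for b in boundaries] + [1.0]
    heights = [qs[k + 1] - qs[k] for k in range(len(qs) - 1)]  # the array C
    counts = [round(h * total_n) for h in heights]             # approximate point counts
    return heights, counts
```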
  • In summary, the scale function can be adaptively selected according to the target quantile corresponding to the target data point to be queried, so as to improve the accuracy of the constructed target sketch near the target quantile, thereby improving the accuracy of the query result.
  • This method of adaptively selecting scale functions can be applied in the scenario of querying data values based on quantiles, in the scenario of querying quantiles based on data values, in the scenario of querying equal-height histograms, and in the scenario of querying equal-width histograms. Therefore, the method provided by the embodiments of the present application can improve the accuracy of query results in various query scenarios.
  • The above embodiments explain how to adaptively select a scale function to construct the target sketch.
  • The embodiments of the present application also provide a method of inserting data points into or deleting data points from the target sketch to update the target sketch.
  • FIG. 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps 801 to 802.
  • Step 801: Generate a cluster to be updated corresponding to the data point to be updated in the cache.
  • The cluster to be updated includes a cluster mean, a cluster weight and a cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
  • The cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster mark of the cluster to be updated indicates the update type of the data point to be updated.
  • Step 802: Update the target sketch based on the cluster to be updated.
  • a triplet may be used to represent a cluster.
  • This triplet can be expressed as <v, w, f>, where v represents the cluster mean of the cluster, w represents the cluster weight of the cluster, and f represents the cluster mark of the cluster.
  • the cluster mark indicates whether the cluster is to be deleted or merged.
  • In the embodiments of the present application, the data points in the cache are expressed as clusters to be updated in the form of triplets as above; that is, each data point to be updated in the cache corresponds to a cluster to be updated.
  • The cluster to be updated includes the cluster mean, the cluster weight and the cluster mark.
  • The cluster mean of the cluster to be updated indicates the data value of the data point to be updated.
  • The cluster weight of the cluster to be updated indicates the number of data points to be updated.
  • The cluster mark of the cluster to be updated indicates the update type of the data point to be updated.
  • the cluster mark of the cluster to be updated includes a mark to be merged and a mark to be deleted.
  • the cluster mark is a mark to be merged, indicating that the corresponding cluster is a cluster to be merged into the target sketch.
  • the cluster mark is a mark to be deleted, indicating that the corresponding cluster is a cluster to be deleted from the target sketch.
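  • A minimal representation of the <v, w, f> triplet kept in the cache could look like the following; the class name PendingCluster and the string values of the mark are illustrative, not terms from the application.

```python
from dataclasses import dataclass

MERGE = "to_merge"    # illustrative value of the mark f for clusters to be merged
DELETE = "to_delete"  # illustrative value of the mark f for clusters to be deleted

@dataclass
class PendingCluster:
    """The <v, w, f> triplet kept in the cache for a data point to be updated."""
    v: float   # cluster mean  = data value of the point(s) to be updated
    w: float   # cluster weight = number of points to be updated with this value
    f: str     # cluster mark   = update type (merge into or delete from the sketch)
```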
  • the current update operation of the target sketch includes inserting data points into the target sketch or deleting data points from the target sketch. This is explained in two cases below.
  • In the first case, step 802 is implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-merged mark to obtain the clusters to be merged, and merge the clusters to be merged into the target sketch.
  • In this way, the clusters to be merged, that is, the data points that need to be added, can be filtered out from the cache based on the cluster marks, and the clusters to be merged are then merged into the target sketch.
  • The process of merging the clusters to be merged into the target sketch may be: sort the clusters in the target sketch and the clusters to be merged in order of cluster mean from small to large; for the first cluster after sorting, determine the quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn:
  • For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • The quantile threshold indicates the capacity limit of the corresponding cluster.
  • In the formula for the quantile threshold q_threshold, k(q0) represents the value of the target scale function at the quantile q0 of the current cluster.
  • The current quantile of the i-th cluster can be determined as follows: determine the sum of the cluster weights of the clusters that have been traversed (including the i-th cluster), determine the sum of the cluster weights of all clusters after sorting, and use the ratio between the two sums as the current quantile of the i-th cluster.
  • When the current quantile of the i-th cluster is lower than the quantile threshold, the i-th cluster is merged into the previous cluster.
  • Merging the i-th cluster into the previous cluster means updating the cluster weight and cluster mean of the previous cluster based on the cluster weight and cluster mean of the i-th cluster.
  • Specifically, the cluster mean of the i-th cluster and the cluster mean of the previous cluster are averaged with weights given by their respective cluster weights, and the resulting value is used as the updated cluster mean of the previous cluster; the cluster weight of the i-th cluster is added to the cluster weight of the previous cluster, and the resulting value is used as the updated cluster weight of the previous cluster.
  • Updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function can likewise refer to the above-mentioned formula for determining the quantile threshold q_threshold, which is not described again here; a minimal merge sketch is given below.
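  • The merge procedure can be sketched as follows; for simplicity the capacity check is done in scale-function space (k(q) − k(q0) < 1) instead of inverting the scale function into a quantile threshold, which is an implementation convenience of this sketch rather than the application's exact wording.

```python
def merge_into_sketch(sketch, to_merge, scale_fn):
    """Merge the clusters marked as to-be-merged into the target sketch.
    Clusters are [mean, weight] pairs."""
    clusters = sorted(sketch + to_merge, key=lambda c: c[0])
    total = sum(w for _, w in clusters)
    merged = [list(clusters[0])]
    w_done = 0.0                              # weight of clusters already closed
    k_limit = scale_fn(0.0) + 1.0
    for mean, w in clusters[1:]:
        q = (w_done + merged[-1][1] + w) / total   # quantile if this cluster is absorbed
        if scale_fn(min(q, 1.0)) <= k_limit:
            prev = merged[-1]                      # merge into the previous cluster:
            prev[0] = (prev[0] * prev[1] + mean * w) / (prev[1] + w)  # weighted mean
            prev[1] += w                                              # weights added
        else:
            w_done += merged[-1][1]
            k_limit = scale_fn(w_done / total) + 1.0  # update the threshold, move on
            merged.append([mean, w])
    return merged
```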
  • Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application.
  • As shown in Figure 9, the newly added data points are first placed in the cache (that is, the buffer), and the new data points in the cache are expressed in the form of triplets to obtain the clusters to be merged.
  • If the current quantile of a traversed cluster exceeds the quantile threshold, the quantile threshold is recalculated based on the quantile of the current cluster and the next cluster is traversed.
  • In the second case, step 802 is implemented as follows: obtain, from the clusters to be updated, the clusters whose cluster mark is the to-be-deleted mark to obtain the clusters to be deleted, and delete the clusters to be deleted from the target sketch.
  • In this way, the clusters to be deleted, that is, the data points that need to be deleted, can be filtered out from the cache based on the cluster marks, and the clusters to be deleted are then removed from the target sketch.
  • Before the deletion, clusters with the same cluster mean among the clusters to be deleted can be merged, and the cluster weight of the merged cluster is the sum of the cluster weights of the clusters before the merge. The target sketch is then updated based on the merged clusters to be deleted.
  • The process of deleting the clusters to be deleted from the target sketch may be: sort the clusters in the target sketch and the clusters to be deleted in order of cluster mean from small to large; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each traversed cluster: for the j-th cluster, determine the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  • Updating the cluster weights of the clusters adjacent to the j-th cluster includes the following situations:
  • Case 1: If the j-th cluster is the first cluster after sorting, the cluster weight of the first cluster is subtracted from the cluster weight of the right adjacent cluster of the first cluster, and the resulting value is used as the updated cluster weight of the right adjacent cluster.
  • If the cluster weight of the right adjacent cluster of the first cluster is less than the cluster weight of the first cluster, the right adjacent cluster of the first cluster is deleted, the difference between the cluster weight of the first cluster and the cluster weight of its right adjacent cluster is determined, and the cluster weight of the next right adjacent cluster is updated based on this difference. If the difference is still greater than the cluster weight of that next right adjacent cluster, its cluster weight continues to be updated in the same way, until the cluster weight of the most recently reached right adjacent cluster is greater than the last determined difference. This approach can be called recursively updating the cluster weights to the right.
  • In addition, if deleting the first cluster changes the minimum value of the target sketch (that is, the data value of the smallest data point among all data points in the target sketch), the minimum value of the target sketch needs to be updated; for example, the cluster mean of the first cluster in the updated target sketch can be used as the minimum value of the target sketch.
  • Case 2: If the j-th cluster is the last cluster after sorting, the cluster weight of the last cluster is subtracted from the cluster weight of the left adjacent cluster of the last cluster, and the resulting value is used as the updated cluster weight of the left adjacent cluster.
  • If the cluster weight of the left adjacent cluster of the last cluster is less than the cluster weight of the last cluster, the left adjacent cluster of the last cluster is deleted, the difference between the cluster weight of the last cluster and the cluster weight of its left adjacent cluster is determined, and the cluster weight of the next left adjacent cluster is updated based on this difference. If the difference is still greater than the cluster weight of that next left adjacent cluster, its cluster weight continues to be updated in the same way, until the cluster weight of the most recently reached left adjacent cluster is greater than the last determined difference. This approach can be called recursively updating the cluster weights to the left.
  • Similarly, if deleting the last cluster changes the maximum value of the target sketch (that is, the data value of the largest data point among all data points in the target sketch), the maximum value of the target sketch needs to be updated; for example, the cluster mean of the last cluster in the updated target sketch can be used as the maximum value of the target sketch.
  • Case 3 If the jth cluster is the middle cluster after sorting, the cluster weight of the left adjacent cluster and the cluster weight of the right adjacent cluster of the jth cluster need to be updated.
  • the implementation process of updating the cluster weights of clusters adjacent to j clusters can be as follows: obtaining the cluster mean of the left adjacent clusters of j clusters and the cluster mean of the right adjacent clusters of j clusters; based on the left The cluster mean of the adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the jth cluster determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively; based on The cluster weight of the left adjacent cluster is updated based on the deletion weight corresponding to the left adjacent cluster, and the cluster weight of the left adjacent cluster is updated based on the deletion weight corresponding to the right adjacent cluster.
The deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster can be determined through a formula (not reproduced in this text), in which d_l represents the deletion weight corresponding to the left adjacent cluster, d_r represents the deletion weight corresponding to the right adjacent cluster, w_c represents the cluster weight of the jth cluster, v_c represents the cluster mean of the jth cluster, v_l represents the cluster mean of the left adjacent cluster, and v_r represents the cluster mean of the right adjacent cluster.
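Since the formula itself is not reproduced here, the following is only a plausible form written as an illustrative assumption, consistent with the variable definitions above but not necessarily the formula used by this application: the weight w_c of the jth cluster is split between the two neighbors in proportion to how close the cluster mean v_c lies to each neighbor's mean, so that the two deletion weights sum to w_c.

```latex
d_l = w_c \cdot \frac{v_r - v_c}{v_r - v_l}, \qquad
d_r = w_c \cdot \frac{v_c - v_l}{v_r - v_l}, \qquad d_l + d_r = w_c
```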
Updating the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster may, for example, be: subtracting the deletion weight corresponding to the left adjacent cluster from the cluster weight of the left adjacent cluster, and using the resulting value as the updated cluster weight of the left adjacent cluster. Similarly, updating the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster can be: subtracting the deletion weight corresponding to the right adjacent cluster from the cluster weight of the right adjacent cluster, and using the resulting value as the updated cluster weight of the right adjacent cluster. In addition, updating the cluster weight of the left adjacent cluster can also refer to the aforementioned leftward recursive update of cluster weights, and updating the cluster weight of the right adjacent cluster can refer to the aforementioned rightward recursive update of cluster weights; the explanation is not repeated here.
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application. As shown in Figure 10, each data point to be deleted in the buffer is counted, and each data point is represented by the aforementioned triplet, that is, each data point to be deleted is represented in the form of a cluster, so as to construct the clusters to be deleted. The clusters to be deleted and the clusters in the target sketch are then sorted by cluster mean from small to large, and the sorted clusters are traversed.

If the current cluster to be deleted is the first cluster, the deletion is applied to its right adjacent cluster, that is, the cluster weight of the right adjacent cluster is modified. If the cluster weight of the right adjacent cluster is not enough to absorb the cluster weight of the current cluster, the deletion continues through the above-mentioned method of recursively updating cluster weights to the right. If deleting the first cluster of the target sketch affects the minimum value of the target sketch, the minimum value of the target sketch needs to be updated based on the updated first cluster of the target sketch. After the cluster weight of the right adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the current cluster to be deleted is deleted and the traversal continues with the next cluster.

If the current cluster is the last cluster, the deletion is applied to its left adjacent cluster, that is, the cluster weight of the left adjacent cluster is modified. If the cluster weight of the left adjacent cluster is not enough to absorb the cluster weight of the current cluster, the deletion continues through the above-mentioned method of recursively updating cluster weights to the left. If deleting the last cluster of the target sketch affects the maximum value of the target sketch, the maximum value of the target sketch can be updated based on the last cluster of the updated target sketch. After the cluster weight of the left adjacent cluster has been updated based on the cluster weight of the cluster to be deleted, the cluster to be deleted is deleted and the deletion operation is completed.

If the current cluster is located in a middle position, the deletion weight of the left adjacent cluster and the deletion weight of the right adjacent cluster of the current cluster are determined, and the deletion is then carried out recursively to the left and to the right, that is, the cluster weight of the left adjacent cluster is updated based on the deletion weight of the left adjacent cluster, and the cluster weight of the right adjacent cluster is updated based on the deletion weight of the right adjacent cluster. After that, the cluster to be deleted is deleted and the traversal continues with the next cluster.
In summary, the data points to be updated in the cache can be expressed as clusters to be updated in the form of triplets. Because the cluster tag in a cluster to be updated can indicate whether the cluster to be updated is a cluster to be deleted or a cluster to be merged, based on the cluster tags, the data points to be inserted in the cache can be inserted into the target sketch, or the data points to be deleted in the cache can be deleted from the target sketch.

In the above embodiments, the target sketch is temporarily constructed in the manner shown in Figure 1 each time data points are queried, which wastes computing resources. For this reason, the embodiments of this application further provide an incremental update method. With the incremental update method, when querying data points, a sketch is constructed based only on the newly added data points, and the constructed sketch is then aggregated with the existing sketches in the cache to obtain the target sketch, thus avoiding the waste of computing resources.
In addition, the data points stored in a time series database have corresponding timestamps, and the timestamp of each data point can represent the collection time of that data point. Therefore, the data points stored in the time series database have time series characteristics. The data points stored in the time series database can usually include data points on different indicators, such as data points collected for temperature and data points collected for humidity. The data points on each indicator are called the data points on one timeline. Based on this, the data points in the time series database include data points corresponding to multiple timelines, and each timeline represents one indicator.
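As a minimal illustration of the data model just described, a data point can be represented as a value together with a timestamp and the identifier of the timeline (indicator) it belongs to. The field names below are assumptions introduced for illustration only and do not reflect the storage format of any particular time series database.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    timeline_id: str   # identifies the indicator, e.g. "temperature" or "humidity"
    timestamp: int     # collection time of the data point (e.g. Unix seconds)
    value: float       # the collected data value

points = [
    DataPoint("temperature", 1_690_000_000, 21.5),
    DataPoint("temperature", 1_690_000_060, 21.9),
    DataPoint("humidity",    1_690_000_000, 48.0),
]
# Data points on the same timeline share a timeline identifier and are ordered by timestamp.
temperature_timeline = sorted(
    (p for p in points if p.timeline_id == "temperature"), key=lambda p: p.timestamp
)
```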
On this basis, the embodiments of the present application also provide an incremental update system. The incremental update system provided by the embodiments of the present application is explained here first. Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application. As shown in Figure 11, the incremental update system includes the following components.
• The single-timeline component (seriesCusor), also known as the single-timeline read data executor, is responsible for reading the original data points within the specified time range of one timeline in response to the query statement.
• The single-timeline aggregation component, also known as the single-timeline aggregation executor, is responsible for calculating the data points of the timeline according to a specific aggregation method and outputting the aggregation results. For example, the data points of the timeline are constructed into sketches, and the insertion and deletion operations on sketches in the aforementioned embodiments can be implemented through this component.
• The single-timeline sketch cache component, also known as the single-timeline sketch cache executor, is responsible for caching the sketches that have already been built. As shown in Figure 11, the incremental update system also includes a data cache (CacheData) and a metadata cache (CacheMeta). These two caches are used to store the built sketches and the metadata of the sketches respectively, where the metadata of a sketch is used to index the sketch.
• The multi-timeline sorting component (tagSetCursor), also known as the multi-timeline sorting and merging executor, is responsible for sorting the sketches aggregated based on the data points of multiple timelines according to the space and time dimensions, to ensure the orderliness of the cached sketches.
• The multi-timeline inter-group component, also known as the multi-timeline inter-group executor, is responsible for aggregating the output results of multiple multi-timeline sorting components, so as to implement serial scheduling of different multi-timeline sorting components.
• The logical concurrency component, also known as the logical concurrency executor, serves as the smallest-granularity parallel scheduling unit and is responsible for the conversion of data structures and the assembly of metadata. The conversion of data structures refers to converting the storage-layer data structure into a query data structure to output query results, and the assembly of metadata is used to generate the metadata of sketches.
• The aggregation transformation component, also known as the multi-timeline aggregation executor, is responsible for further aggregating the output results of the multi-timeline inter-group components, such as the merging of sketches.
Figure 12 is a flowchart of an incremental update method provided by an embodiment of the present application. As shown in Figure 12, the method includes the following steps 1201 to 1203.

Step 1201: obtain a cached sketch that was built based on some of the multiple data points and the target scale function, to obtain a first sketch.

Step 1202: construct a sketch based on the data points other than those data points among the multiple data points and the target scale function, to obtain a second sketch.

Step 1203: aggregate the first sketch and the second sketch to obtain the target sketch.

That is, when the target data point needs to be queried, if some sketches have already been constructed in advance based on some of the data points and the target scale function, a sketch only needs to be constructed for the remaining data points, and the currently constructed sketch and the previously constructed sketches are merged to obtain the target sketch. In this way, there is no need to build the target sketch based on the full set of data points for each query, thereby saving computing resources.
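At a high level, steps 1201 to 1203 can be sketched as follows. This Python snippet is an illustration under assumptions, not the implementation of this application: the toy sketch type, the placeholder build and merge functions, and the cache key are all introduced for illustration only (a real construction and aggregation would compress clusters according to the target scale function).

```python
from typing import Callable, Dict, List, Tuple

# A toy "sketch" for illustration only: a list of (cluster_mean, cluster_weight) pairs.
Sketch = List[Tuple[float, float]]

def build_sketch(points: List[float], scale_fn: Callable[[float], float]) -> Sketch:
    # Placeholder construction: one cluster per point; a real build would merge
    # points into clusters according to scale_fn.
    return sorted((p, 1.0) for p in points)

def merge_sketches(a: Sketch, b: Sketch) -> Sketch:
    # Placeholder aggregation: concatenate and re-sort by cluster mean; a real
    # implementation would re-compress the clusters according to the scale function.
    return sorted(a + b)

def incremental_target_sketch(cache: Dict[str, Sketch], cache_key: str,
                              new_points: List[float],
                              scale_fn: Callable[[float], float]) -> Sketch:
    first_sketch = cache.get(cache_key, [])             # step 1201: reuse the cached sketch
    second_sketch = build_sketch(new_points, scale_fn)  # step 1202: build only from new points
    return merge_sketches(first_sketch, second_sketch)  # step 1203: aggregate into the target sketch

cache = {"SID1/window-2023Q1": build_sketch([1.0, 2.0, 3.0], lambda q: q)}
target = incremental_target_sketch(cache, "SID1/window-2023Q1", [4.0, 5.0], lambda q: q)
```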
In some embodiments, obtaining the cached sketch built based on some of the data points among the multiple data points and the target scale function to obtain the first sketch can be implemented as follows: obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong; based on the target time window and the target data point, determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.

The target time window to be queried may be the time window carried in the query statement input by the user. For example, if the user inputs a query statement of "query the highest temperature in the last quarter", then the target time window is "the last quarter".

The metadata set can be maintained by the metadata cache (CacheMeta) shown in Figure 11. In some embodiments, the metadata set stores the metadata of each cached sketch in the form of a list. In this case, the implementation of determining the first metadata from the metadata set can be: traverse each piece of metadata in the metadata set; if the sketch timeline identifier of a certain piece of metadata is the same as the identifier of the timeline to which the target data point belongs, and the sketch time window of that metadata is part or all of the target time window, then that metadata is determined to be the first metadata.
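The list-traversal lookup just described can be sketched as follows. This is a minimal illustration under assumptions; the metadata field names and the time-window representation are introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TimeWindow = Tuple[int, int]  # (start_timestamp, end_timestamp)

@dataclass
class SketchMetadata:
    timeline_id: str     # sketch timeline identifier (SID)
    window: TimeWindow   # sketch time window of the cached sketch
    sketch_key: str      # key used to look the cached sketch up

def find_first_metadata(metadata_set: List[SketchMetadata],
                        target_timeline_id: str,
                        target_window: TimeWindow) -> Optional[SketchMetadata]:
    """Return the first metadata whose timeline identifier matches the target and
    whose sketch time window is part or all of the target time window."""
    t_start, t_end = target_window
    for meta in metadata_set:
        s_start, s_end = meta.window
        if meta.timeline_id == target_timeline_id and t_start <= s_start and s_end <= t_end:
            return meta
    return None
```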
In other embodiments, the metadata in the metadata set can be stored in a key-value format. As shown in Figure 13, each SID represents a timeline, each SID corresponds to multiple time windows (windows), and a corresponding sketch is cached for each time window. The key is the data shard identifier (SharId), where each SharId represents a time range (timerange); therefore, the value corresponding to each SharId includes multiple pieces of metadata, the sketch time window in each piece of metadata is within that time range, and the sketch timeline identifiers in these pieces of metadata can be different timeline identifiers.

For example, the value corresponding to SharId1 in Figure 13 includes the metadata corresponding to SID1. These metadata can be uniformly marked as SID1+timerange11, indicating that the timeline identifier in these metadata is SID1 and the time windows in these metadata are all within the time range timerange11 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID2, which can be uniformly marked as SID2+timerange12, indicating that the timeline identifier in these metadata is SID2 and the time windows in these metadata are all within the time range timerange12 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID3, which can be uniformly marked as SID3+timerange13, indicating that the timeline identifier in these metadata is SID3 and the time windows in these metadata are all within the time range timerange13 corresponding to SharId1.

In this case, the implementation of determining the first metadata from the metadata set can be: determine the SharId that matches the target time window, where the time range represented by the matching SharId falls within the target time window; then query, from the value corresponding to the matching SharId, the metadata whose sketch timeline identifier is the target timeline identifier, and the metadata obtained is the first metadata.
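The key-value layout and the two-step lookup can be illustrated with the following Python sketch. The dictionary shape, the field names and the example time ranges are assumptions introduced for illustration only, not the actual storage format of this application.

```python
# Hypothetical key-value layout of the metadata set described above:
# key   -> data shard identifier (SharId), standing for a time range,
# value -> list of metadata entries whose sketch time windows fall inside that range.
metadata_set = {
    "SharId1": [  # covers, say, timestamps 0 .. 999
        {"sid": "SID1", "window": (0, 499),   "sketch_key": "SID1-w1"},
        {"sid": "SID2", "window": (0, 999),   "sketch_key": "SID2-w1"},
        {"sid": "SID3", "window": (500, 999), "sketch_key": "SID3-w1"},
    ],
}
shard_ranges = {"SharId1": (0, 999)}   # time range represented by each SharId

def lookup_first_metadata(target_sid, target_window):
    t_start, t_end = target_window
    for shar_id, (r_start, r_end) in shard_ranges.items():
        # The SharId matches when its time range falls within the target time window.
        if not (t_start <= r_start and r_end <= t_end):
            continue
        for meta in metadata_set[shar_id]:
            if meta["sid"] == target_sid:
                return meta
    return None

print(lookup_first_metadata("SID2", (0, 1000)))
```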
In some embodiments, the implementation process of constructing a sketch based on the data points other than some of the data points among the multiple data points and the target scale function to obtain the second sketch is: obtain the data points corresponding to a second time window among the multiple data points, where the second time window is the time window in the target time window other than a first time window, and the first time window is the part of the target time window that overlaps with the sketch time window in the first metadata; then construct the second sketch based on the target scale function and the data points corresponding to the second time window. The temporary construction of the sketch can be realized through the single-timeline component and the single-timeline aggregation component in Figure 11.

Further, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch can then be cached and the second metadata added to the metadata set, so as to update the metadata set. It should be noted that the metadata set corresponds to the scale function: different metadata sets can be maintained, and each metadata set only maintains the metadata of the sketches built based on the corresponding scale function.
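Caching the newly built second sketch and registering its metadata, with one metadata set per scale function, can be sketched as follows. The cache structures, the key format and the scale-function names are assumptions introduced for illustration only.

```python
sketch_cache = {}                       # CacheData: sketch_key -> sketch
metadata_sets = {"k1": [], "k2": []}    # CacheMeta: one metadata list per scale function

def cache_second_sketch(scale_fn_name, timeline_id, second_window, second_sketch):
    """Store the second sketch in the data cache and add its metadata (the second
    metadata) to the metadata set that corresponds to the scale function used."""
    sketch_key = f"{timeline_id}:{second_window[0]}-{second_window[1]}:{scale_fn_name}"
    sketch_cache[sketch_key] = second_sketch
    second_metadata = {"sid": timeline_id, "window": second_window, "sketch_key": sketch_key}
    metadata_sets[scale_fn_name].append(second_metadata)   # update the metadata set
    return second_metadata

cache_second_sketch("k1", "SID1", (500, 999), [(4.0, 1.0), (5.0, 1.0)])
```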
In addition, when new data points are written, the cached sketches need to be invalidated to avoid inconsistency between the query results and the actual data. Therefore, the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated. Here, the timestamp of the data point to be written and the identifier of the timeline to which it belongs matching the third metadata in the metadata set means that the timestamp of the data point to be written falls within the sketch time window of the third metadata, and the identifier of the timeline to which the data point to be written belongs is the same as the sketch timeline identifier of the third metadata.
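A minimal sketch of this invalidation step is shown below, reusing the hypothetical cache structures from the previous snippets (field names are assumptions for illustration only).

```python
def invalidate_on_write(metadata_sets, sketch_cache, written_timestamp, written_timeline_id):
    """Delete any cached sketch whose metadata matches a newly written data point,
    i.e. the written timestamp falls inside the sketch time window and the timeline
    identifiers are the same, then update the metadata set accordingly."""
    for metadata_set in metadata_sets.values():
        kept = []
        for meta in metadata_set:
            start, end = meta["window"]
            if meta["sid"] == written_timeline_id and start <= written_timestamp <= end:
                sketch_cache.pop(meta["sketch_key"], None)   # drop the now-stale sketch
            else:
                kept.append(meta)
        metadata_set[:] = kept                               # keep only still-valid metadata

invalidate_on_write({"k1": [{"sid": "SID1", "window": (0, 999), "sketch_key": "SID1-w1"}]},
                    {"SID1-w1": [(1.0, 1.0)]}, written_timestamp=500, written_timeline_id="SID1")
```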
In addition, in order to save cache space, the sketch elimination method provided by the embodiments of the present application can eliminate sketches from two aspects. The first aspect is to eliminate some of the multiple sketches belonging to the same timeline, so as to eliminate sketches from the time dimension. The second aspect is to eliminate the sketches of a certain timeline among different timelines, so as to eliminate sketches from the spatial dimension.

In some embodiments, the metadata set also includes first usage information corresponding to a sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches that match that sketch timeline identifier. In this case, the elimination of sketches based on the time dimension can be implemented by: determining, based on the first usage information, the sketches to be eliminated among the multiple sketches that match the sketch timeline identifier, and deleting the sketches to be eliminated. For example, elimination can be carried out through a least recently used (LRU) elimination mechanism, that is, the less recently used sketches among the multiple sketches matching the sketch timeline identifier are deleted to save cache space.

In some embodiments, the metadata set further includes second usage information, and the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the multiple sketch timeline identifiers, where the usage information corresponding to each sketch timeline identifier indicates when the sketches matching the corresponding sketch timeline identifier were used. In this case, the elimination of sketches based on the spatial dimension can be implemented by: determining, based on the second usage information, the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers, and deleting the sketches that match the sketch timeline identifier to be eliminated. Here, elimination can also be performed through the LRU elimination mechanism, that is, among the sketch timeline identifiers, the sketches corresponding to the sketch timeline identifiers that have been used less recently are deleted to save cache space.
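Both elimination dimensions can be illustrated with a small LRU-style sketch. This is an illustration under assumptions (the usage-information structures, thresholds and function names are hypothetical), not the eviction policy of any particular system.

```python
import time

def evict_time_dimension(sketch_entries, keep):
    """Time dimension: among the cached sketches of one timeline, keep only the
    `keep` most recently used ones (first usage information = per-sketch usage time)."""
    sketch_entries.sort(key=lambda e: e["last_used"], reverse=True)
    return sketch_entries[:keep]

def evict_space_dimension(per_timeline_last_used, cached_timelines, keep):
    """Spatial dimension: among all timelines, drop the sketches of the least
    recently used timelines (second usage information = per-timeline usage time)."""
    ranked = sorted(per_timeline_last_used, key=per_timeline_last_used.get, reverse=True)
    for sid in ranked[keep:]:
        cached_timelines.pop(sid, None)

# Usage sketch: each cached sketch entry carries a last-used timestamp.
timeline_entries = [{"sketch_key": "w1", "last_used": time.time() - 600},
                    {"sketch_key": "w2", "last_used": time.time()}]
timeline_entries = evict_time_dimension(timeline_entries, keep=1)
```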
  • the embodiments of the present application provide an incremental update system and an incremental update method, which can eliminate the need to build a target sketch based on a full amount of data points every time a data point is queried, thereby saving computing resources.
An embodiment of the present application also provides a data point query device. The device 1400 includes the following modules.

The first determination module 1401 is used to determine the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, where the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among the multiple data points sorted by size. For the specific implementation, reference can be made to step 101 in the embodiment of Figure 1.

The construction module 1402 is used to construct the target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. For the specific implementation, reference can be made to step 102 in the embodiment of Figure 1.

The query module 1403 is used to query the target data point based on the target sketch. For the specific implementation, reference can be made to step 103 in the embodiment of Figure 1.
In some embodiments, the multiple scale functions include a first scale function and a second scale function, where the clusters in the sketch constructed based on the first scale function are denser on a first quantile interval than the clusters in the sketch constructed based on the second scale function, and the clusters in the sketch constructed based on the first scale function are less dense on a second quantile interval than the clusters in the sketch constructed based on the second scale function. In this case, the first determination module 1401 is used to: determine the first scale function as the target scale function if the target quantile is located in the first quantile interval; and determine the second scale function as the target scale function if the target quantile is located in the second quantile interval.

In some embodiments, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2; the second quantile interval includes the interval from x1 to x2.
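The selection rule just described reduces to a simple interval check on the target quantile. The following Python snippet is a minimal illustration; the concrete values of x1 and x2 and the two lambda functions are placeholders, not the scale functions of this application.

```python
def choose_scale_function(target_quantile, x1, x2, first_scale_fn, second_scale_fn):
    """Pick the scale function whose sketch is dense around the target quantile:
    the first scale function for the tail intervals [0, x1] and [x2, 1],
    the second scale function for the middle interval [x1, x2]."""
    if target_quantile <= x1 or target_quantile >= x2:
        return first_scale_fn
    return second_scale_fn

k1 = lambda q: q          # stands in for the first scale function
k2 = lambda q: q ** 2     # stands in for the second scale function
target_fn = choose_scale_function(0.99, x1=0.05, x2=0.95, first_scale_fn=k1, second_scale_fn=k2)
```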
  • the query module 1403 is used to:
  • the device 1400 also includes:
  • the receiving module is used to receive a data point query request.
  • the data point query request is used to query the data value of a target data point among multiple data points.
  • the data point query request carries the standard quantile of the target data point;
  • the first determination module is also used to determine the standard quantile carried in the data point query request as the target quantile.
  • the device 1400 also includes:
  • the receiving module is used to receive the equal-height histogram query request.
  • the equal-height histogram query request is used to query the equal-height histogram constructed based on multiple data points.
  • the equal-height histogram query request carries the number of buckets h, and h is greater than 1. an integer;
  • the first determination module is also used to determine the quantiles from the first bucket to the h-1th bucket from left to right in the equal-height histogram based on the number of buckets h and the total number of multiple data points, and obtain h- 1 quantile;
The query module is also used to take each of the h-1 quantiles as the target quantile and perform the operation of determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point to be queried, so as to obtain h-1 data values corresponding one-to-one to the h-1 quantiles. The apparatus 1400 further includes a drawing module configured to draw the equal-height histogram based on the h-1 data values and the data value of the largest data point and the data value of the smallest data point among the multiple data points.
  • the first determination module is also used to:
  • the query module is used for:
  • the device 1400 also includes:
  • the receiving module is used to receive a quantile query request.
  • the quantile query request is used to query the standard quantile of a target data point among multiple data points.
  • the quantile query request carries the data value of the target data point.
  • the device 1400 also includes:
  • the receiving module is used to receive an equal-width histogram query request.
  • the equal-width histogram query request is used to query an equal-width histogram constructed based on multiple data points.
  • the equal-width histogram query request carries a bucket boundary array, and the bucket boundary array includes n boundary values, n boundary values divide n+1 intervals between the data value of the smallest data point and the data value of the largest data point among multiple data points;
The query module is used to take each of the n boundary values as the data value of the target data point and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data value of the largest data point and the data value of the smallest data point among the multiple data points, so as to obtain n standard quantiles corresponding one-to-one to the n boundary values. The device also includes a drawing module for drawing the equal-width histogram based on the n standard quantiles that correspond one-to-one to the n boundary values.
  • the device 1400 also includes:
The generation module is used to generate clusters to be updated corresponding to the data points to be updated in the cache, where each cluster to be updated includes a cluster mean, a cluster weight and a cluster tag, the cluster mean of the cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data point to be updated;
  • the update module is used to update the target sketch based on the cluster to be updated.
  • update modules are used to:
  • update modules are used to:
For the first cluster after sorting, determine the quantile threshold based on the target scale function; traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in sequence: for the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1; if the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue traversing from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
  • update modules are used to:
  • update modules are used to:
For the jth cluster, determine the cluster tag of the jth cluster; if the cluster tag of the jth cluster is the to-be-deleted tag, delete the jth cluster and update the cluster weights of the clusters adjacent to the jth cluster, where j is an integer greater than or equal to 1.
  • update modules are used to:
If the jth cluster is the middle cluster after sorting, obtain the cluster mean of the left adjacent cluster of the jth cluster and the cluster mean of the right adjacent cluster of the jth cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the jth cluster, determine the deletion weight corresponding to the left adjacent cluster and the deletion weight corresponding to the right adjacent cluster respectively; update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
  • building blocks are used to:
  • building blocks are used to:
obtain the target time window to be queried, where the target data point is a data point whose timestamp is within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier is the identifier of the timeline to which the data points used to construct the corresponding sketch belong; determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.
  • the device 1400 also includes:
  • the second determination module is used to determine the metadata of the second sketch and obtain the second metadata
  • a cache module that caches the second sketch and adds the second metadata to the metadata set.
  • the device 1400 also includes:
  • the third determination module is used to determine the timestamp of the data point to be written and the identification of the timeline to which the data point to be written belongs;
  • the first deletion module is used to delete the sketch corresponding to the third metadata and update the metadata set if the timestamp of the data point to be written and the identifier of the corresponding timeline match the third metadata in the metadata set.
  • the metadata set further includes first usage information corresponding to any sketch timeline identification, and the first usage information is used to record the usage time of each of the multiple sketches matching any sketch timeline identification;
  • Device 1400 also includes:
  • the second deletion module is configured to determine, based on the first usage information, the sketches to be eliminated among the plurality of sketches that match any sketch timeline identifier, and delete the sketches to be eliminated.
In some embodiments, the metadata set also includes second usage information, where the second usage information is used to record the usage information corresponding to each sketch timeline identifier among the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier. The device 1400 further includes:
  • the third deletion module is configured to determine the sketch timeline identifier to be eliminated among the plurality of sketch timeline identifiers based on the second usage information; and delete the sketch that matches the sketch timeline identifier to be eliminated.
  • the first determination module 1401, the construction module 1402, the query module 1403 and other modules can all be implemented by software, or can be implemented by hardware.
Taking the first determination module 1401 as an example, its implementation is introduced below; the implementations of the construction module 1402, the query module 1403 and the other modules can refer to the implementation of the first determination module 1401.
  • the first determination module 1401 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the above computing instance may be one or more.
The first determination module 1401 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. Furthermore, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Usually, one region can include multiple AZs.
  • the multiple hosts/VMs/containers used to run the code can be distributed in the same virtual private cloud (VPC), or across multiple VPCs.
  • VPC virtual private cloud
For communication between two VPCs in the same region, as well as cross-region communication between VPCs in different regions, a communication gateway is set up in each VPC, and the interconnection between VPCs is realized through the communication gateways.
  • the first determination module 1401 may include at least one computing device, such as a server.
  • the first determination module 1401 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • ASIC application-specific integrated circuit
  • PLD programmable logic device
  • the above-mentioned PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL), or any combination thereof.
  • CPLD complex programmable logical device
  • FPGA field-programmable gate array
  • GAL general array logic
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same region or in different regions.
  • the multiple computing devices included in the first determination module 1401 may be distributed in the same AZ or in different AZs.
  • multiple computing devices included in the first determination module 1401 may be distributed in the same VPC, It can also be distributed across multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
The first determination module 1401 can be used to perform any step in the data point query method, the construction module 1402 can be used to perform any step in the data point query method, and the query module 1403 can be used to perform any step in the data point query method. The steps that the first determination module 1401, the construction module 1402 and the query module 1403 are responsible for implementing can be specified as needed, and the first determination module 1401, the construction module 1402 and the query module 1403 respectively implement different steps in the data point query method, so as to realize all the functions of the data point query device.
An embodiment of the present application also provides a computing device 1500. The computing device 1500 includes a bus 1502, a processor 1504, a memory 1506 and a communication interface 1508. The processor 1504, the memory 1506 and the communication interface 1508 communicate with each other through the bus 1502. The computing device 1500 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1500.
The bus 1502 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used in Figure 15, but this does not mean that there is only one bus or one type of bus. The bus 1502 may include a path that carries information between the various components of the computing device 1500 (for example, the memory 1506, the processor 1504 and the communication interface 1508).
The processor 1504 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).
The memory 1506 may include volatile memory, such as random access memory (RAM). The memory 1506 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
The memory 1506 stores executable program code, and the processor 1504 executes the executable program code to respectively realize the functions of the aforementioned first determination module, construction module, query module and other modules, thereby realizing the data point query method provided by the embodiments of this application. That is, the memory 1506 stores instructions for executing the data point query method provided by the embodiments of the present application. The communication interface 1508 uses a transceiver module, such as, but not limited to, a network interface card or a transceiver, to implement communication between the computing device 1500 and other devices or communication networks.
  • An embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • the computing device cluster includes at least one computing device 1500.
  • the memory 1506 in one or more computing devices 1500 in the computing device cluster may store the same instructions for executing the data point query method provided by the embodiment of the present application.
Optionally, the memory 1506 of one or more computing devices 1500 in the computing device cluster may also store part of the instructions for executing the data point query method provided by the embodiments of the present application. In other words, a combination of one or more computing devices 1500 can jointly execute the instructions for executing the data point query method provided by the embodiments of the present application. It should be noted that the memories 1506 in different computing devices 1500 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the data point query device. That is, the instructions stored in the memories 1506 in different computing devices 1500 may implement the functions of one or more of the first determination module, the construction module and the query module.
  • one or more computing devices in a cluster of computing devices may be connected through a network.
  • the network may be a wide area network or a local area network, etc.
Figure 17 shows a possible implementation. As shown in Figure 17, two computing devices 1500A and 1500B are connected through a network; specifically, each computing device is connected to the network through its communication interface. In this possible implementation, the memory 1506 in the computing device 1500A stores instructions for performing the functions of the first determination module and the construction module, and the memory 1506 in the computing device 1500B stores instructions for performing the functions of the query module. The connection mode of the computing device cluster shown in Figure 17 may take into account that the data point query method provided by the embodiments of the present application requires a large amount of computation on data, and therefore the functions implemented by the first determination module and the construction module are handed over to the computing device 1500A for execution.
It should be understood that the functions of the computing device 1500A shown in Figure 17 may also be performed by multiple computing devices 1500, and likewise the functions of the computing device 1500B may also be performed by multiple computing devices 1500.
The embodiment of the present application also provides another computing device cluster. The connection relationship between the computing devices in this computing device cluster can be similar to the connection modes of the computing device cluster described in Figure 16 and Figure 17. The difference is that the memory 1506 in one or more computing devices 1500 in this computing device cluster may store the same instructions for executing the data point query method provided by the embodiments of the present application. Optionally, the memory 1506 of one or more computing devices 1500 in this computing device cluster may also store part of the instructions for executing the data point query method provided by the embodiments of the present application. In other words, a combination of one or more computing devices 1500 can jointly execute the instructions for executing the data point query method provided by the embodiments of the present application.
  • An embodiment of the present application also provides a computer program product containing instructions.
The computer program product may be a software or program product containing instructions that can run on a computing device or be stored in any available medium. When the computer program product is run on at least one computing device, the at least one computing device is caused to execute the data point query method provided by the embodiments of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium.
The computer-readable storage medium may be any available medium that can be stored by a computing device, or a data storage device, such as a data center, containing one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the data point query method provided by embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to the technical field of cloud computing. Disclosed are a data point query method and apparatus, a device cluster, a program product, and a storage medium. The method comprises: determining a target scale function from a plurality of scale functions on the basis of a target quantile corresponding to a target data point to be queried; constructing a target sketch on the basis of the target scale function and a plurality of data points; and querying the target data point on the basis of the target sketch. Because the densities of clusters in sketches constructed on the basis of different scale functions are different, in the embodiments of the present application, the target scale function can be adaptively selected on the basis of the target quantile corresponding to the target data point to be queried, such that the sketch constructed on the basis of the target scale function has dense clusters near the target quantile. When clusters of a sketch are dense, the clusters in the sketch can more accurately represent features of data points of the clusters obtained by clustering, so that the precision of querying the target data point on the basis of the sketch is improved.

Description

数据点查询方法、装置、设备集群、程序产品及存储介质Data point query method, device, equipment cluster, program product and storage medium
This application claims priority to the Chinese patent application No. 202210855232.X, filed on July 19, 2022 and entitled "An efficient aggregation system and method based on statistical analysis operators", and to the Chinese patent application No. 202211091505.4, filed on September 7, 2022 and entitled "Data point query method, device, equipment cluster, program product and storage medium", the entire contents of which are incorporated into this application by reference.
Technical field
本申请实施例涉及云计算技术领域,特别涉及一种数据点查询方法、装置、设备集群、程序产品及存储介质。Embodiments of the present application relate to the field of cloud computing technology, and in particular to a data point query method, device, equipment cluster, program product and storage medium.
Background
数据点(data point)是指物联网技术中相关设备采集的一个个数据,比如温度感应设备采集的一个个温度。数据点查询用于查询一批数据点中某个数据点的特征,比如基于该数据点的数据值查询该数据点在一批数据点中的分位数,或者基于该数据点的分位数查询该数据点的数据值。其中,分位数指示该数据点在按照大小排序后的一批数据点中的位置。随着物联网技术的发展,各行业的数据点数量呈现爆炸式增长,这种场景下如何从海量数据点中高效且准确地查询某个数据点是当前研究的热点。Data points refer to data collected by relevant devices in Internet of Things technology, such as temperatures collected by temperature sensing devices. Data point query is used to query the characteristics of a certain data point in a batch of data points, such as querying the quantile of the data point in a batch of data points based on the data value of the data point, or based on the quantile of the data point Query the data value of this data point. Among them, the quantile indicates the position of the data point in a batch of data points sorted by size. With the development of Internet of Things technology, the number of data points in various industries has exploded. In this scenario, how to efficiently and accurately query a certain data point from massive data points is a current research hotspot.
Summary of the invention
本申请实施例提供了一种数据点查询方法、装置、设备集群、程序产品及存储介质,可以高效且准确地从海量数据点中查询某个数据点。所述技术方案如下:Embodiments of the present application provide a data point query method, device, equipment cluster, program product and storage medium, which can efficiently and accurately query a certain data point from massive data points. The technical solutions are as follows:
第一方面,提供了一种数据点查询方法,在该方法中,基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数,多个尺度函数中不同尺度函数构建的草图中的簇的密集程度不同,目标分位数指示目标数据点在按照大小排序后的多个数据点中的位置;基于目标尺度函数和多个数据点构建目标草图,目标草图包括多个簇,每个簇包括簇均值和簇权重,簇均值指示聚类得到相应簇的数据点的均值,簇权重指示聚类得到相应簇的数据点的数量;基于目标草图查询目标数据点。In the first aspect, a data point query method is provided. In this method, based on the target quantile corresponding to the target data point to be queried, the target scale function is determined from multiple scale functions. Different scales in the multiple scale functions The density of clusters in the sketch constructed by the function is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size; the target sketch is constructed based on the target scale function and multiple data points, and the target sketch includes Multiple clusters, each cluster includes a cluster mean and a cluster weight. The cluster mean indicates the mean value of the data points of the corresponding cluster obtained by clustering, and the cluster weight indicates the number of data points obtained by clustering of the corresponding cluster; query the target data points based on the target sketch.
由于不同尺度函数构建的草图中的簇的密集程度不同,因此在本申请实施例中,可以基于待查询的目标数据点对应的目标分位数,自适应选择目标尺度函数,以使基于目标尺度函数构建的草图在目标分位数附近的簇的比较密集。草图的簇比较密集时,该草图中的簇能够更准确表征聚类得到簇的数据点的特征,从而提高基于草图查询目标数据点的精度。Since the clusters in the sketches constructed by different scale functions have different density, in this embodiment of the present application, the target scale function can be adaptively selected based on the target quantile corresponding to the target data point to be queried, so that the target scale function can be adaptively selected based on the target scale. The sketches constructed by the function have dense clusters near the target quantile. When the clusters in the sketch are relatively dense, the clusters in the sketch can more accurately represent the characteristics of the data points obtained by clustering, thereby improving the accuracy of querying the target data points based on the sketch.
基于第一方面提供的方法,在一些实施例中,多个尺度函数包括第一尺度函数和第二尺度函数,基于第一尺度函数构建的草图中的簇在第一分位数区间上的密集程度,大于基于第二尺度函数构建的草图中的簇在第一分位数区间上的密集程度,基于第一尺度函数构建的草图中的簇在第二分位数区间上的密集程度,小于基于第二尺度函数构建的草图中的簇在第二 分位数区间上的密集程度。Based on the method provided in the first aspect, in some embodiments, the multiple scale functions include a first scale function and a second scale function, and the clusters in the sketch constructed based on the first scale function are dense on the first quantile interval. The degree is greater than the density of the clusters in the first quantile interval in the sketch constructed based on the second scale function. The density of the clusters in the sketch constructed based on the first scale function in the second quantile interval is less than The clusters in the sketch built based on the second scale function are in the second Intensity on the quantile interval.
这种场景下,基于目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数的实现方式可以为:如果目标分位数位于第一分位数区间,则将第一尺度函数确定为目标尺度函数;如果目标分位数位于第二分位数区间,则将第二尺度函数确定为目标尺度函数。In this scenario, based on the target quantile corresponding to the target data point, the implementation method of determining the target scale function from multiple scale functions can be: if the target quantile is located in the first quantile interval, then the first scale The function is determined as the target scale function; if the target quantile is located in the second quantile interval, the second scale function is determined as the target scale function.
由于基于第一尺度函数构建的草图在第一分位数区间上的簇更为密集,基于第二尺度函数构建的草图在第二分位数区间上的簇更为密集,因此可以根据目标数据点对应的目标分位数,自适应选择第一尺度函数或第二尺度函数来构建草图,以使构建的草图在目标分位数附近的区间上的簇比较密集。Since the sketches constructed based on the first scale function have denser clusters on the first quantile interval, and the sketches constructed based on the second scale function have denser clusters on the second quantile interval, it can be determined based on the target data At the target quantile corresponding to the point, the first scale function or the second scale function is adaptively selected to construct the sketch, so that the constructed sketch has dense clusters in the interval near the target quantile.
基于第一方面提供的方法,在一些实施例中,第一分位数区间包括从0至x1的区间、以及从x2至1的区间,x1和x2均大于0且小于1,且x1小于x2;第二分位数区间包括从x1至x2的区间。Based on the method provided in the first aspect, in some embodiments, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, both x1 and x2 are greater than 0 and less than 1, and x1 is less than x2 ;The second quantile interval includes the interval from x1 to x2.
此时通过本申请实施例提供的方法能够实现对全局分位数区间[0,1]上任一分位数对应的数据点的精准查询,也即实现全范围内的高精度查询。At this time, the method provided by the embodiment of the present application can realize accurate query of the data points corresponding to any quantile in the global quantile interval [0,1], that is, high-precision query in the entire range can be achieved.
基于第一方面提供的方法,在一些实施例中,基于目标草图查询目标数据点的实现方式可以为:基于目标草图和目标分位数,查询目标数据点的数据值。Based on the method provided in the first aspect, in some embodiments, querying the target data point based on the target sketch may be implemented by querying the data value of the target data point based on the target sketch and the target quantile.
在本申请实施例中,可以基于目标数据点的分位数查询目标数据点的数据值,也可以基于目标数据点的数据值查询目标数据点的分位数。也即本申请实施例提供的方法适应于各种场景下的数据点查询,提高了本申请实施例的灵活性。In the embodiment of the present application, the data value of the target data point can be queried based on the quantile of the target data point, or the quantile of the target data point can be queried based on the data value of the target data point. That is to say, the method provided by the embodiment of the present application is suitable for data point query in various scenarios, which improves the flexibility of the embodiment of the present application.
基于第一方面提供的方法,在一些实施例中,在基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数之前,在该方法中,还接收数据点查询请求,数据点查询请求用于查询多个数据点中的目标数据点的数据值,数据点查询请求携带目标数据点的标准分位数;将数据点查询请求中携带的标准分位数确定为目标分位数。Based on the method provided in the first aspect, in some embodiments, before determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, in this method, the data point is also received Query request, data point query request is used to query the data value of the target data point among multiple data points. The data point query request carries the standard quantile of the target data point; determine the standard quantile carried in the data point query request. is the target quantile.
在构建草图之前,还可以接收数据点查询请求,该数据点查询请求用于查询多个数据点中的目标数据点的数据值,且该数据点查询请求携带目标数据点的标准分位数。这种情况下,将该数据点查询请求中携带的标准分位数确定为目标分位数,以便基于目标分位数构建目标草图,进而查询目标数据点的数据值。这种情况下,可以提高查询到的数据值的准确性。Before building the sketch, you can also receive a data point query request, which is used to query the data value of a target data point among multiple data points, and the data point query request carries the standard quantile of the target data point. In this case, the standard quantile carried in the data point query request is determined as the target quantile, so that the target sketch can be constructed based on the target quantile, and then the data value of the target data point can be queried. In this case, the accuracy of the queried data values can be improved.
基于第一方面提供的方法,在一些实施例中,在该方法中,还可以接收等高直方图查询请求,等高直方图查询请求用于查询基于多个数据点构建的等高直方图,且等高直方图查询请求携带桶数量h,h为大于1的整数;基于桶数量h和多个数据点的总数量,确定等高直方图中从左到右第一个桶至第h-1个桶的分位数,得到h-1个分位数;将h-1个分位数中每个分位数分别作为目标分位数,并执行基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数的操作,以得到与h-1个分位数一一对应的h-1个数据值。基于h-1个数据值、以及多个数据点中的最大数据点的数据值和最小数据点的数据值,绘制等高直方图。Based on the method provided in the first aspect, in some embodiments, in this method, a equal height histogram query request may also be received, and the equal height histogram query request is used to query a equal height histogram constructed based on multiple data points, And the equal height histogram query request carries the number of buckets h, h is an integer greater than 1; based on the number of buckets h and the total number of multiple data points, determine the first bucket from left to right in the equal height histogram to the h-th The quantiles of 1 bucket are obtained by h-1 quantiles; each quantile in the h-1 quantiles is used as the target quantile, and the target corresponding to the target data point to be queried is executed. Quantile, the operation of determining the target scale function from multiple scale functions to obtain h-1 data values that correspond to h-1 quantiles one-to-one. Draw a contour histogram based on h-1 data values, as well as the data value of the largest data point and the data value of the smallest data point among the plurality of data points.
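For the equal-height histogram case described above, the bucket-boundary quantiles follow directly from the bucket count. The following minimal Python sketch is an illustration, not the implementation of this application; note that, combined with the total number N of data points, the quantile i/h corresponds to the rank position i*N/h.

```python
def equal_height_bucket_quantiles(h: int) -> list:
    """For an equal-height histogram with h buckets, the boundary between bucket i
    and bucket i+1 sits at quantile i / h, giving h - 1 quantiles in total."""
    return [i / h for i in range(1, h)]

print(equal_height_bucket_quantiles(4))   # [0.25, 0.5, 0.75]
```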
在构建草图之前,还可以接收等高直方图查询请求,该等高直方图查询请求用于查询基于多个数据点构建的等高直方图。这种情况下,可以提高构建的等高直方图的准确性。Before building a sketch, you can also receive a contour histogram query request, which is used to query a contour histogram built based on multiple data points. In this case, the accuracy of the constructed equal-height histogram can be improved.
基于第一方面提供的方法,在一些实施例中,在基于待查询的目标数据点对应的目标分位数,从多个尺度函数中确定目标尺度函数之前,在该方法中,还可以基于目标数据点的数据值,以及多个数据点中的最大数据点的数据值和最小数据点的数据值,确定目标数据点的 估计分位数,将估计分位数作为目标分位数。Based on the method provided in the first aspect, in some embodiments, before determining the target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, in this method, the target scale function may also be determined based on the target quantile corresponding to the target data point to be queried. The data value of the data point, as well as the data value of the largest data point and the data value of the smallest data point among multiple data points, determine the target data point Estimate the quantile and use the estimated quantile as the target quantile.
这种场景下,基于目标草图查询目标数据点的实现方式可以为:基于目标草图和目标数据点的数据值,查询目标数据点的标准分位数。In this scenario, querying the target data point based on the target sketch can be implemented by querying the standard quantile of the target data point based on the data value of the target sketch and the target data point.
在本申请实施例中,可以基于目标数据点的分位数查询目标数据点的数据值,这种场景下,可以先根据数据点的数据值预估一个分位数,将预估的分位数作为目标分位数并自适应选择尺度函数来构建草图,以提高后续查询到的标准分位数的准确性。In the embodiment of this application, the data value of the target data point can be queried based on the quantile of the target data point. In this scenario, a quantile can be estimated based on the data value of the data point, and the estimated quantile can be The number is used as the target quantile and the scale function is adaptively selected to construct the sketch to improve the accuracy of the standard quantile obtained by subsequent queries.
基于第一方面提供的方法,在一些实施例中,在基于目标数据点的数据值,以及多个数据点中的最大数据点的数据值和最小数据点的数据值,确定目标数据点的估计分位数之前,还可以接收分位数查询请求,分位数查询请求用于查询多个数据点中的目标数据点的标准分位数,分位数查询请求携带目标数据点的数据值。Based on the method provided in the first aspect, in some embodiments, an estimate of the target data point is determined based on the data value of the target data point, and the data value of the largest data point and the data value of the smallest data point among the plurality of data points. Before quantile, you can also receive a quantile query request. The quantile query request is used to query the standard quantile of the target data point among multiple data points. The quantile query request carries the data value of the target data point.
基于目标数据点的分位数查询目标数据点的数据值可以应用在接收到分位数查询请求的场景中,提高了这种场景下查询到的标准分位数的准确性。Querying the data value of the target data point based on the quantile of the target data point can be applied in the scenario where a quantile query request is received, which improves the accuracy of the standard quantile queried in this scenario.
Based on the method provided in the first aspect, in some embodiments, the method may further include: receiving an equal-width histogram query request, where the equal-width histogram query request is used to query an equal-width histogram constructed from the multiple data points and carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals; using each of the n boundary values in turn as the data value of the target data point, and performing the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the largest and smallest data points, so as to obtain n standard quantiles in one-to-one correspondence with the n boundary values; and drawing the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
Querying the standard quantile of the target data point based on its data value can be applied in the scenario where an equal-width histogram query request is received, which improves the accuracy of the equal-width histogram obtained in this scenario.
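As an illustration of this embodiment, the following is a minimal sketch that assumes a helper query_quantile(sketch, value) returning the standard quantile of a data value as described above; the function and variable names are illustrative only and are not part of the claimed method.

```python
def equal_width_histogram(sketch, boundaries, total_points):
    # boundaries: the n bucket boundary values carried by the query request,
    # assumed to be sorted in ascending order between min and max.
    quantiles = [query_quantile(sketch, b) for b in boundaries]  # assumed helper
    # Bucket heights are derived from the differences between adjacent quantiles
    # (with 0 and 1 as the outer edges), scaled by the total number of data points.
    edges = [0.0] + quantiles + [1.0]
    heights = [(edges[i + 1] - edges[i]) * total_points for i in range(len(edges) - 1)]
    return heights
```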
Based on the method provided in the first aspect, in some embodiments, after the target sketch is constructed based on the target scale function and the multiple data points, a to-be-updated cluster corresponding to a to-be-updated data point in the cache may also be generated. The to-be-updated cluster includes a cluster mean, a cluster weight and a cluster mark: the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight indicates the number of to-be-updated data points, and the cluster mark indicates the update type of the to-be-updated data point. The target sketch is then updated based on the to-be-updated cluster.
In this embodiment of the application, in order to support inserting data points into or deleting data points from the target sketch, the data points in the cache are represented as to-be-updated clusters in the form of the above triple, so that the target sketch can subsequently be updated according to the to-be-updated data points in the cache.
Based on the method provided in the first aspect, in some embodiments, updating the target sketch based on the to-be-updated clusters may be implemented as: obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster mark is the to-be-merged mark, so as to obtain to-be-merged clusters; and merging the to-be-merged clusters into the target sketch.
Since the cache holds data points that need to be deleted or added, after the data points in the cache are represented as to-be-updated clusters, the to-be-merged clusters, that is, the data points that need to be added, can be filtered out of the cache according to the cluster marks and then merged into the target sketch.
Based on the method provided in the first aspect, in some embodiments, merging the to-be-merged clusters into the target sketch may be implemented as: sorting the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean; for the first cluster after sorting, determining a quantile threshold based on the target scale function; and traversing each cluster starting from the second cluster after sorting, performing the following operations on each cluster in turn: for the i-th cluster, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1; if the current quantile of the i-th cluster is below the quantile threshold, merging the i-th cluster into the previous cluster and continuing the traversal from the previous cluster; if the current quantile of the i-th cluster exceeds the quantile threshold, updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traversing the next cluster.
In this way the to-be-merged clusters can be added to the other clusters of the target sketch, so that the target sketch is updated.
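The following is a minimal sketch of the merge procedure described above. The quantile threshold is written here with a generic scale function and its inverse (scale, scale_inverse), whose concrete forms are assumptions for illustration rather than the formulas of this application; all names are illustrative.

```python
def merge_into_sketch(sketch_clusters, to_merge_clusters, scale, scale_inverse):
    # Each cluster is a dict {"mean": float, "weight": float}.
    clusters = sorted(sketch_clusters + to_merge_clusters, key=lambda c: c["mean"])
    if not clusters:
        return []
    total = sum(c["weight"] for c in clusters)
    merged = [dict(clusters[0])]
    seen = clusters[0]["weight"]
    # Quantile threshold for the first cluster, derived from the scale function (assumed form).
    threshold = scale_inverse(scale(seen / total) + 1.0)
    for c in clusters[1:]:
        q = (seen + c["weight"]) / total          # current quantile of this cluster
        if q <= threshold:
            # Below the threshold: fold this cluster into the previous one.
            prev = merged[-1]
            w = prev["weight"] + c["weight"]
            prev["mean"] = (prev["mean"] * prev["weight"] + c["mean"] * c["weight"]) / w
            prev["weight"] = w
        else:
            # Threshold exceeded: keep this cluster and advance the threshold.
            merged.append(dict(c))
            threshold = scale_inverse(scale(q) + 1.0)
        seen += c["weight"]
    return merged
```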
Based on the method provided in the first aspect, in some embodiments, updating the target sketch based on the to-be-updated clusters may be implemented as: obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster mark is the to-be-deleted mark, so as to obtain to-be-deleted clusters; and deleting the to-be-deleted clusters from the target sketch.
Since the cache holds data points that need to be deleted or added, after all the data points in the cache are represented as to-be-updated clusters, the to-be-deleted clusters, that is, the data points that need to be deleted, can be filtered out of the cache according to the cluster marks and then removed from the target sketch.
Based on the method provided in the first aspect, in some embodiments, deleting the to-be-deleted clusters from the target sketch may be implemented as: sorting the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean; traversing each cluster starting from the first cluster after sorting, and performing the following operations on each cluster in turn: for the j-th cluster, determining the cluster mark of the j-th cluster; if the cluster mark of the j-th cluster is the to-be-deleted mark, deleting the j-th cluster and updating the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
In this way the to-be-deleted clusters can be removed from the target sketch, so that the target sketch is updated.
Based on the method provided in the first aspect, in some embodiments, updating the cluster weights of the clusters adjacent to the j-th cluster may be implemented as: if the j-th cluster is an intermediate cluster after sorting, obtaining the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster; based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, determining a deletion weight corresponding to the left adjacent cluster and a deletion weight corresponding to the right adjacent cluster; and updating the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster and the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
Since deleting a cluster from the target sketch affects the cluster weights of the clusters adjacent to that cluster, when a cluster is deleted the cluster weights of its adjacent clusters also need to be updated.
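The following is a minimal sketch of the deletion procedure and the adjacent-weight update described above. The embodiment only states that the deletion weights are determined from the three cluster means and the deleted cluster's weight, so the proportional redistribution rule used here is an assumption for illustration, as are all names.

```python
def delete_marked_clusters(clusters):
    # clusters: list of dicts {"mean": float, "weight": float, "mark": str},
    # containing both the sketch clusters and the to-be-deleted clusters.
    clusters = sorted(clusters, key=lambda c: c["mean"])
    j = 0
    while j < len(clusters):
        c = clusters[j]
        if c.get("mark") != "to_delete":
            j += 1
            continue
        if 0 < j < len(clusters) - 1:  # intermediate cluster: update both neighbors
            left, right = clusters[j - 1], clusters[j + 1]
            span = right["mean"] - left["mean"]
            # Assumed rule: the closer adjacent cluster absorbs the larger deletion weight.
            share_left = (right["mean"] - c["mean"]) / span if span > 0 else 0.5
            left["weight"] -= c["weight"] * share_left
            right["weight"] -= c["weight"] * (1.0 - share_left)
        clusters.pop(j)  # remove the to-be-deleted cluster itself
    return clusters
```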
Based on the method provided in the first aspect, in some embodiments, constructing the target sketch based on the target scale function and the multiple data points may be implemented as: obtaining a sketch that has already been cached and was built from some of the multiple data points and the target scale function, so as to obtain a first sketch; constructing a sketch from the data points other than those some data points among the multiple data points and the target scale function, so as to obtain a second sketch; and aggregating the first sketch and the second sketch to obtain the target sketch.
In this embodiment of the application, when the target data point needs to be queried, if some sketches have already been built in advance from part of the data points and the target scale function, a sketch can now be built from the remaining data points only, and the newly built sketch can be merged with the previously built sketches to obtain the target sketch. This avoids having to rebuild the target sketch from the full set of data points for every query, thereby saving computing resources.
Based on the method provided in the first aspect, in some embodiments, obtaining the cached sketch built from some of the multiple data points and the target scale function to obtain the first sketch may be implemented as: obtaining a target time window to be queried, where the target data point is a data point whose timestamp falls within the target time window; obtaining a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches built based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was built, and the sketch timeline identifier is the identifier of the timeline to which the data points of the corresponding sketch belong; determining first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs, where the sketch time window in the first metadata is part or all of the target time window and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determining the sketch corresponding to the first metadata as the first sketch.
The cached sketches can be managed through the metadata set, so that when a data point is queried the cached sketches can be located via the metadata set, which improves the efficiency of obtaining cached sketches.
Based on the method provided in the first aspect, in some embodiments, after the sketch is constructed from the data points other than the some data points among the multiple data points and the target scale function to obtain the second sketch, the metadata of the second sketch may also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set.
Since the second sketch is newly constructed, the metadata set can also be updated based on the second sketch, so that subsequent query operations can be performed based on the updated metadata set.
Based on the method provided in the first aspect, in some embodiments, the method may further include: determining the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs; and if the timestamp of the data point to be written and the identifier of its timeline match third metadata in the metadata set, deleting the sketch corresponding to the third metadata and updating the metadata set.
In this embodiment of the application, when new data points are written over the time range corresponding to an already cached sketch, the cached sketch needs to be invalidated, so as to avoid inconsistency between the query result and the actual data.
Based on the method provided in the first aspect, in some embodiments, the metadata set further includes first usage information corresponding to any sketch timeline identifier, and the first usage information records the usage time of each of the multiple sketches matching that sketch timeline identifier.
In this scenario, the method may further include: determining, based on the first usage information, the sketch to be evicted among the multiple sketches matching that sketch timeline identifier, and deleting the sketch to be evicted.
In this embodiment of the application, more and more sketches are cached as time goes by. To prevent excessive sketches from wasting cache space, sketches can also be evicted. Specifically, some of the sketches belonging to the same timeline can be evicted, so that sketches are evicted along the time dimension.
Based on the method provided in the first aspect, in some embodiments, the metadata set further includes second usage information, which records, for each of the multiple sketch timeline identifiers in the metadata set, the usage information corresponding to that sketch timeline identifier; the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching that sketch timeline identifier.
In this scenario, the method may further include: determining, based on the second usage information, the sketch timeline identifier to be evicted among the multiple sketch timeline identifiers, and deleting the sketches matching the sketch timeline identifier to be evicted.
In addition, the sketches of a particular timeline among the different timelines can be evicted, so that sketches are evicted along the space dimension.
In a second aspect, a data point query apparatus is provided. The data point query apparatus has the function of implementing the behaviour of the data point query method in the first aspect. The data point query apparatus includes at least one module, and the at least one module is used to implement the data point query method provided in the first aspect.
In a third aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the data point query method provided in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to perform the data point query method described in the first aspect.
In a fifth aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to perform the data point query method described in the first aspect.
The technical effects obtained by the second, third, fourth and fifth aspects are similar to those obtained by the corresponding technical means in the first aspect, and are not repeated here.
Description of drawings
Figure 1 is a flowchart of a data point query method provided by an embodiment of this application;
Figure 2 is a schematic diagram of the curve trends of a first scale function S1(q) and the derivative of S1(q) provided by an embodiment of this application;
Figure 3 is a schematic diagram of the curve trends of a second scale function S2(q) and the derivative of S2(q) provided by an embodiment of this application;
Figure 4 is a schematic diagram of a query flow for querying a data value based on a target sketch and a target quantile provided by an embodiment of this application;
Figure 5 is a schematic flowchart of querying an equal-height histogram provided by an embodiment of this application;
Figure 6 is a schematic diagram of a query flow for querying the standard quantile q of a target data point based on a target sketch and the data value Q of the target data point provided by an embodiment of this application;
Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of this application;
Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of this application;
Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of this application;
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of this application;
Figure 11 is a schematic architecture diagram of an incremental update system provided by an embodiment of this application;
Figure 12 is a flowchart of an incremental update method provided by an embodiment of this application;
Figure 13 is a schematic diagram of managing metadata in the space and time dimensions provided by an embodiment of this application;
Figure 14 is a schematic structural diagram of a data point query apparatus provided by an embodiment of this application;
Figure 15 is a schematic structural diagram of a computing device provided by an embodiment of this application;
Figure 16 is a schematic structural diagram of a computing device cluster provided by an embodiment of this application;
Figure 17 is a schematic diagram of a connection mode between computing device clusters provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions and advantages of the embodiments of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
It should be understood that "multiple" mentioned herein means two or more. In the description of this application, unless otherwise stated, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, to describe the technical solutions of the embodiments of this application clearly, words such as "first" and "second" are used in the embodiments of this application to distinguish between identical or similar items whose functions and effects are basically the same. Those skilled in the art can understand that the words "first" and "second" do not limit the quantity or the execution order, and do not necessarily indicate a difference.
Before the embodiments of this application are explained in detail, the application scenarios of the embodiments of this application are first introduced.
With the rapid development of the fifth-generation mobile communication technology (5th Generation Mobile Communication Technology, 5G) and Internet of Things (IoT) technology, the data points generated in various industries are growing explosively and on a large scale. Each data point represents a specific piece of data, such as a temperature, a humidity or a weather value. It is therefore necessary to perform statistics and analysis on large numbers of data points in order to mine useful features from them.
Current data point analysis methods include the quantile method and the histogram method.
A quantile characterizes the position of a data point in the sequence obtained by sorting a large number of data points by size. Compared with using extreme values (maximum and/or minimum) to characterize a large number of data points, quantiles can shield the false extreme-value information caused by abnormal data points and thus represent the real information of each stage within the large number of data points. On this basis, for a company that provides Internet services, quantiles can serve as one of the important indicators for measuring the operating status of the company's network. In addition, quantile queries are also applied in fields such as weather temperature trends, log mining, stock trend analysis, virtual currency volume-price indicators, and financial data analysis.
In some techniques, in order to compute quantiles precisely, the full set of data points needs to be sorted, and the quantile corresponding to each data point is then computed from the position of each data point after sorting. For example, for the quantile q of a certain data point, the value of q is a real number between 0 and 1, that is, q ∈ [0, 1]; q = 1 means that the data point is the largest data point in the full set, and q = 0.5 means that the data point is the middle data point of the sorted full set. The time and space complexity of determining quantiles with this technique is O(NlogN), where N is the total number of data points in the full set.
In a scenario where the quantile of each data point is known, if the quantile of the data point to be queried is q, the item at the position corresponding to q in the sorted full set of data points is determined based on the quantile q, and the result obtained is the data value of that data point, namely the query result.
However, in the fields of IoT and DevOps (a combination of "development" and "operations", a collective term for a set of processes, methods and systems), data points are usually stored in a time series database, and the large volume of data points makes the time series database large in scale. For example, the volume of data points in a large-scale time series database reaches the TB (terabyte, a storage unit) or even PB (petabyte, a storage unit) level, so the memory of an ordinary computer cannot hold the full set of data points. Moreover, for such a huge data volume, the computational overhead required for a strict sort of all data points is also very large. In this scenario the technique of precisely computing quantiles no longer has practical value, so approximate quantile computation techniques have gradually emerged. Approximate quantile computation refers to techniques that compute quantiles with approximate algorithms.
The t-digest algorithm (an online clustering algorithm) is currently a commonly used algorithm in approximate quantile computation. The basic principle of the algorithm is to cluster the full set of data to obtain multiple clusters. Each cluster has a corresponding cluster mean and cluster weight: the cluster mean indicates the average value of the data points aggregated into the corresponding cluster, and the cluster weight indicates the number of data points aggregated into the corresponding cluster. The multiple clusters constructed are usually called a sketch. The quantile of each cluster can be determined from the cluster mean and cluster weight of each cluster in the sketch. Later, when the data value of a data point needs to be queried based on a quantile q, the approximate data value of that data point is computed by linear interpolation from the quantiles and cluster means of the clusters in the sketch. In this algorithm, the accuracy and efficiency of queries can be adjusted through the number of clusters in the sketch.
In addition, as a simple and efficient statistical analysis tool, a histogram can intuitively describe the data distribution characteristics of multiple data points, so histograms are widely used in the field of network monitoring and operation and maintenance. In a histogram, the abscissa represents the data values of the data points and the ordinate represents the number of data points. A histogram includes multiple bars, each of which can be called a bucket, and the height of each bucket represents the number of data points whose data values fall into the data value interval corresponding to that bucket.
Histograms currently include equal-height histograms and equal-width histograms. An equal-height histogram is a histogram in which the heights of the buckets are close to one another; an equal-width histogram is a histogram in which every bucket has the same width.
Based on the above application scenarios, the embodiments of this application provide a data point query method. The method provided by the embodiments of this application can achieve the following technical effects: first, high-precision query of the quantiles of data points over the full range; second, deletion of data points from the sketch; and third, incremental updates that avoid rebuilding the sketch for every query, thereby avoiding a waste of resources.
The data point query method provided by the embodiments of this application is explained in detail below.
Figure 1 is a flowchart of a data point query method provided by an embodiment of this application. As shown in Figure 1, the method includes the following steps 101 to 103.
Step 101: Based on the target quantile corresponding to the target data point to be queried, determine a target scale function from multiple scale functions, where sketches constructed with different ones of the multiple scale functions differ in how densely their clusters are packed, and the target quantile indicates the position of the target data point among the multiple data points sorted by size.
The scale function controls how densely the clusters in a sketch are packed, which is related to the size of each cluster. The size of a cluster indicates how many data points were aggregated into it. The larger a cluster, the more data points it aggregates; its cluster mean then represents the data values of a large number of data points, the clusters of the sketch are correspondingly sparse, it is hard to distinguish the data values of individual data points from the sketch, and the accuracy of the sketch is therefore low. The smaller a cluster, the fewer data points it aggregates; its cluster mean then represents the data values of a small number of data points, the clusters of the sketch are correspondingly dense, it is easy to distinguish the data values of individual data points from the sketch, and the accuracy of the sketch is therefore high. On this basis, in the embodiments of this application the scale function can be used to control the accuracy of the sketch, so as to improve the accuracy of subsequent queries.
Since sketches constructed with different scale functions differ in how densely their clusters are packed, in the embodiments of this application the target scale function can be selected adaptively based on the target quantile corresponding to the target data point to be queried, so that the sketch constructed with the target scale function has dense clusters near the target quantile. When the clusters of the sketch are dense, they characterize the data points from which they were clustered more accurately, which improves the accuracy of querying the target data point based on the sketch.
In some embodiments, the multiple scale functions include a first scale function and a second scale function. The clusters of a sketch constructed with the first scale function are denser on a first quantile interval than the clusters of a sketch constructed with the second scale function, and the clusters of a sketch constructed with the first scale function are less dense on a second quantile interval than the clusters of a sketch constructed with the second scale function.
In this scenario, determining the target scale function from the multiple scale functions based on the target quantile corresponding to the target data point in step 101 may be implemented as: if the target quantile lies in the first quantile interval, determining the first scale function as the target scale function; if the target quantile lies in the second quantile interval, determining the second scale function as the target scale function.
Since a sketch constructed with the first scale function has denser clusters on the first quantile interval and a sketch constructed with the second scale function has denser clusters on the second quantile interval, the first scale function or the second scale function can be selected adaptively according to the target quantile corresponding to the target data point, so that the constructed sketch has dense clusters on the interval around the target quantile.
The first quantile interval and the second quantile interval may each be any interval within the global quantile interval [0, 1]. As an example, the union of the first quantile interval and the second quantile interval is the global quantile interval [0, 1]; in that case the method provided by the embodiments of this application enables accurate querying of the data point corresponding to any quantile in [0, 1], that is, high-precision queries over the full range.
For example, the first quantile interval includes the interval from 0 to x1 and the interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1 and x1 is less than x2, and the second quantile interval is the interval from x1 to x2. That is, the first quantile interval consists of the intervals near the two ends of the global quantile interval [0, 1], and the second quantile interval is the middle interval of [0, 1]. For example, x1 may be 0.2 and x2 may be 0.8; in this scenario the quantile intervals corresponding to the first scale function are [0, 0.2] and [0.8, 1], and the quantile interval corresponding to the second scale function is [0.2, 0.8]. Optionally, x1 and x2 may also take other real values in the global quantile interval [0, 1], which are not enumerated here one by one.
In the embodiments of this application, the first scale function may be designed as the function shown in formula (1) below, and the second scale function as the function shown in formula (2) below.
In formula (1) and formula (2), q denotes the quantile, α denotes a hyperparameter that indicates the number of clusters, and S1(q) and S2(q) denote the first scale function and the second scale function respectively; the derivatives of S1(q) and S2(q) characterize how densely the clusters of the constructed sketch are packed.
Figure 2 is a schematic diagram of the curve trends of the first scale function S1(q) and its derivative provided by an embodiment of this application. Figure 3 is a schematic diagram of the curve trends of the second scale function S2(q) and its derivative provided by an embodiment of this application.
As shown in Figure 2, it can be seen from the curve of the first scale function S1(q) that the first scale function grows quickly on the intervals near the two ends of the global quantile interval [0, 1] and grows slowly in the middle of the interval, so the derivative of S1(q) takes relatively large values on the intervals near the two ends of [0, 1]; this can be verified from the curve of the derivative of S1(q) in Figure 2. Therefore a sketch constructed with the first scale function S1(q) has relatively dense, that is, relatively small, clusters on the intervals near the two ends of the global quantile interval [0, 1], and the sketch is correspondingly more accurate on those intervals.
As shown in Figure 3, it can be seen from the curve of the second scale function S2(q) that the second scale function grows slowly on the intervals near the two ends of the global quantile interval [0, 1] and grows quickly in the middle of the interval, so the derivative of S2(q) takes relatively large values in the middle of [0, 1]; this can be verified from the curve of the derivative of S2(q) in Figure 3. Therefore a sketch constructed with the second scale function S2(q) has relatively dense, that is, relatively small, clusters in the middle of the global quantile interval [0, 1], and the sketch is correspondingly more accurate on that interval.
Based on the two scale functions shown in Figures 2 and 3, when the target quantile corresponding to the target data point to be queried lies in an interval near either end of the global quantile interval [0, 1], for example [0, 0.2] or [0.8, 1], the first scale function S1(q) can be selected to construct the sketch; when the target quantile lies in the middle of the global quantile interval [0, 1], for example [0.2, 0.8], the second scale function S2(q) can be selected to construct the sketch, so as to improve the accuracy of the constructed sketch and thereby the accuracy of querying data points. In other words, the embodiments of this application provide a method of adaptively selecting a scale function to construct a sketch according to the query environment.
The above description uses the first scale function and the second scale function as examples. Optionally, more than two scale functions may also be designed, with these scale functions corresponding to different cluster densities on different intervals of the global quantile interval [0, 1]; that is, the scale functions behave differently on different intervals of [0, 1], thereby realizing the method, provided by the embodiments of this application, of adaptively selecting a scale function to construct a sketch according to the query environment.
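As an illustration of the adaptive selection described above, the following is a minimal sketch that uses the example thresholds x1 = 0.2 and x2 = 0.8; the scale functions are passed in as plain callables because their concrete forms in formulas (1) and (2) are not reproduced here, and all names are illustrative.

```python
def choose_scale_function(target_q, scale_fn_1, scale_fn_2, x1=0.2, x2=0.8):
    # scale_fn_1: denser clusters near both ends of [0, 1] (first scale function)
    # scale_fn_2: denser clusters in the middle of [0, 1] (second scale function)
    # target_q: target quantile corresponding to the data point to be queried
    if target_q <= x1 or target_q >= x2:
        # Target quantile lies near an end of [0, 1]: use the first scale function.
        return scale_fn_1
    # Target quantile lies in the middle interval [x1, x2]: use the second scale function.
    return scale_fn_2
```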
Step 102: Construct a target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster.
Constructing the target sketch based on the target scale function and the multiple data points may follow the t-digest algorithm or other clustering methods, which is not limited in the embodiments of this application.
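Since the construction may follow a t-digest-style procedure, the following is a simplified illustration of one such construction under that assumption: sorted data points are greedily accumulated into clusters whose sizes are bounded by the chosen scale function. It is only a sketch, not the exact construction of this application; scale stands for the selected target scale function.

```python
def build_sketch(data_points, scale):
    # data_points: iterable of numeric values; scale: the selected target scale function S(q).
    # Each cluster is kept as a dict {"mean": ..., "weight": ...}.
    points = sorted(data_points)
    n = len(points)
    clusters = []
    start = 0
    while start < n:
        end = start + 1
        # Grow the current cluster while the scale function allows it:
        # the cluster may span at most one unit of the scale S(q).
        while end < n and scale(end / n) - scale(start / n) <= 1.0:
            end += 1
        chunk = points[start:end]
        clusters.append({"mean": sum(chunk) / len(chunk), "weight": len(chunk)})
        start = end
    return clusters
```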
Step 103: Query the target data point based on the target sketch.
In the embodiments of this application, the data value of the target data point can be queried based on its quantile, or the quantile of the target data point can be queried based on its data value. The two application scenarios are explained below.
First application scenario: querying a data value based on a quantile
In the first application scenario, step 103 may be implemented as: querying the data value of the target data point based on the target sketch and the target quantile.
For ease of description, the target quantile is denoted q, a decimal between 0 and 1. Assuming that the total number of data points from which the target sketch is built is N, the query result obtained based on the target sketch and q is an approximate estimate of the element at the position corresponding to q in the sorted sequence of the full set of data points; this query result is the data value of the target data point.
Assuming that the data value of the largest data point in the full set is max and the data value of the smallest data point is min, an example of the query flow for querying a data value based on the target sketch and the target quantile is shown in Figure 4; the flow in Figure 4 is as follows:
(1) If N*q < 0.5*C1_weight, the query result is obtained by interpolation.
Here C1_weight is the cluster weight of the first cluster in the target sketch and C1_value is the cluster mean of the first cluster in the target sketch, where the first cluster in the target sketch refers to the first cluster after the clusters are sorted in ascending order of cluster mean.
(2) If N*q > N - 0.5*Cm_weight, the query result is obtained by interpolation.
Here Cm_weight is the cluster weight of the last cluster in the target sketch and Cm_value is the cluster mean of the last cluster in the target sketch, where the last cluster in the target sketch refers to the last cluster after the clusters are sorted in ascending order of cluster mean.
(3) If neither the condition in (1) nor the condition in (2) is satisfied, all clusters are traversed starting from the first cluster. Assuming that the traversal has currently reached the i-th cluster, the following operations are performed on the i-th cluster:
a) Compute the cumulative sum Wi of the cluster weights of the clusters traversed so far (including the current cluster), that is, the sum of the cluster weights of the first i clusters.
b) If Wi <= N*q < Wi+1, continue to traverse the next cluster; otherwise, compute the query result by interpolation based on the current cluster and the next cluster. An example of the interpolation is as follows:
Assume that the cluster means of the two clusters on the left and right of the interpolation are vl and vr respectively, their cluster weights are wl and wr respectively, and the final query result is denoted Qq. Qq is then obtained with two formulas: the first gives an interpolation coefficient p, and the second is
Qq = p*(vr - vl) + vl
It should be noted that the query flow shown in Figure 4 is for illustration; the embodiments of this application do not limit the implementation of querying the data value of the target data point based on the constructed target sketch and the target quantile. The formulas in the above interpolation are likewise only examples of the interpolation method and are not limiting.
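The following is a minimal sketch in the spirit of the Figure 4 flow, under explicit assumptions: clusters are sorted by mean, Wi is the cumulative weight up to and including cluster i, the traversal interpolates between the two cluster means whose cumulative weights bracket N*q with a linear coefficient p, and the two boundary cases interpolate linearly towards min and max. The exact boundary formulas of this application are not reproduced here, and all names are illustrative.

```python
def query_value_by_quantile(clusters, q, n, vmin, vmax):
    # clusters: list of {"mean": ..., "weight": ...} sorted by ascending mean;
    # q: target quantile; n: total number of data points; vmin/vmax: min/max data values.
    rank = n * q
    first, last = clusters[0], clusters[-1]
    if rank < 0.5 * first["weight"]:
        # Case (1): assumed linear interpolation between min and the first cluster mean.
        return vmin + (first["mean"] - vmin) * rank / (0.5 * first["weight"])
    if rank > n - 0.5 * last["weight"]:
        # Case (2): assumed linear interpolation between the last cluster mean and max.
        return vmax - (vmax - last["mean"]) * (n - rank) / (0.5 * last["weight"])
    cum = 0.0
    for i in range(len(clusters) - 1):
        cum += clusters[i]["weight"]                  # Wi, including the current cluster
        nxt = cum + clusters[i + 1]["weight"]         # W(i+1)
        if cum <= rank < nxt:
            vl, vr = clusters[i]["mean"], clusters[i + 1]["mean"]
            p = (rank - cum) / (nxt - cum)            # assumed interpolation coefficient
            return p * (vr - vl) + vl                 # Qq = p*(vr - vl) + vl
    return last["mean"]
```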
In addition, in the first application scenario there are, for example, the following two situations in which a data value needs to be queried based on a quantile. They are explained below.
First situation: querying in response to a data point query request
In the first situation, before the sketch is constructed, a data point query request may be received. The data point query request is used to query the data value of the target data point among the multiple data points and carries the standard quantile of the target data point. In this situation, the standard quantile carried in the data point query request is determined as the target quantile.
The standard quantile may be a quantile entered by the user; that is, when triggering the data point query request, the user also enters a quantile, so that the method provided by the embodiments of this application can subsequently query the specific data value based on the quantile entered by the user.
Thus, in the first situation, the scale function can be selected adaptively according to the quantile entered by the user and the sketch constructed accordingly; the constructed sketch is dense on the interval around the quantile entered by the user, which improves the accuracy of the query result.
Second situation: querying in response to an equal-height histogram query request
In the second situation, before the sketch is constructed, an equal-height histogram query request may be received. The equal-height histogram query request is used to query an equal-height histogram constructed from the multiple data points and carries the number of buckets h. In this situation, the target quantile is determined as follows: based on the number of buckets h and the total number of the multiple data points, determine the quantiles of the first to the (h-1)-th bucket, counted from left to right in the equal-height histogram, to obtain h-1 quantiles; then take each of the h-1 quantiles in turn as the target quantile and perform steps 101 to 103, so as to obtain h-1 data values in one-to-one correspondence with the h-1 quantiles.
After the h-1 data values corresponding one-to-one to the h-1 quantiles are obtained, the equal-height histogram can be drawn based on the h-1 data values together with the data values of the largest and smallest data points among the multiple data points.
In the equal-height histogram, all buckets have the same height, namely the ratio of the total number N to the number of buckets h, and the coordinates on the horizontal axis increase from left to right. For ease of description, the h buckets from left to right are labelled the first bucket, the second bucket, ..., the h-th bucket. In this case, determining the quantiles of the first to the (h-1)-th bucket based on the number of buckets h and the total number of data points can be implemented as: the quantile of the i-th bucket is i/h, where i is an integer greater than or equal to 1 and less than h.
It should be noted that each bucket of the equal-height histogram has a left boundary value and a right boundary value on the abscissa, and the quantile of a bucket mentioned above refers specifically to the quantile corresponding to the right boundary value of that bucket. The quantile corresponding to the h-th bucket is therefore 1.
In addition, the determination of the h-1 data values corresponding one-to-one to the h-1 quantiles can follow the flow shown in Figure 4 and is not repeated here.
After the h-1 data values are obtained, the equal-height histogram is drawn from them together with the data values of the largest and smallest data points. As an example, the data value of the smallest data point, the h-1 data values and the data value of the largest data point are sorted in ascending order; after sorting, every two adjacent data values are respectively the left boundary value and the right boundary value of one bucket of the equal-height histogram, and the height of each bucket is the ratio of the total number of data points to the number of buckets h.
Figure 5 is a schematic flowchart of querying an equal-height histogram provided by an embodiment of this application. As shown in Figure 5, the flow of querying the equal-height histogram includes the following steps:
a) Determine the number of buckets h of the equal-height histogram, and initialise a quantile array T = [0, 0, ..., 1] and a boundary array B = [0, ..., 0], both of length h+1.
b) Compute the quantile q value of each bucket from the first bucket to the (h-1)-th bucket, where the quantile of each bucket indicates the quantile corresponding to the right boundary value of that bucket. The q value of the i-th bucket is qi, which is filled into the (i+1)-th position of the quantile array T, giving T = [0, q1, q2, ..., qh-1, 1].
c) Traverse the array T, determine the Qi value corresponding to each qi value using the target sketch, and add the Qi value to the i-th position of the boundary array B, giving B = [Q0, Q1, Q2, ..., Qh-1, Qh], where Q0 is the data value of the smallest data point in the full set and Qh is the data value of the largest data point in the full set.
d) Finally, construct the equal-height histogram from the boundary array B = [Q0, Q1, Q2, ..., Qh-1, Qh].
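A minimal sketch of the Figure 5 flow, reusing the quantile-to-value query sketched earlier (query_value_by_quantile); that helper's name and signature are assumptions for illustration.

```python
def equal_height_histogram(clusters, h, n, vmin, vmax):
    # h: number of buckets; n: total number of data points; vmin/vmax: min/max data values.
    # Steps a/b: quantile array T = [0, 1/h, 2/h, ..., (h-1)/h, 1].
    t = [i / h for i in range(h + 1)]
    # Step c: boundary array B, with B[0] = min and B[h] = max.
    b = [vmin] + [query_value_by_quantile(clusters, q, n, vmin, vmax) for q in t[1:-1]] + [vmax]
    # Step d: each adjacent pair of boundaries forms one bucket; every bucket holds about n/h points.
    return [(b[i], b[i + 1], n / h) for i in range(h)]
```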
It should be noted that the above two situations are only examples of application scenarios in which a data value is queried based on a quantile; the embodiments of this application do not limit the application scenarios of querying data values based on quantiles.
Second application scenario: querying a quantile based on a data value
In the second application scenario, the quantile is queried based on the data value of the data point, so the quantile of the data point is not known in advance. In this scenario, a quantile can first be estimated from the data value of the data point, and the estimated quantile is then used as the target quantile for adaptively selecting the scale function and constructing the sketch. On this basis, in some embodiments, determining the target quantile may be implemented as: determining an estimated quantile of the target data point based on the data value of the target data point and the data values of the largest and smallest data points among the multiple data points, and using the estimated quantile as the target quantile.
As an example, the estimated quantile of the target data point can be determined from the data value of the target data point and the data values of the largest and smallest data points with a formula in which Q denotes the data value of the target data point to be queried.
Optionally, the estimated quantile of the target data point may also be determined in other ways based on the data value of the target data point and the data values of the largest and smallest data points among the multiple data points, which is not limited in the embodiments of this application.
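Since the formula itself is not reproduced here, the following one-line sketch uses a simple linear estimate between min and max as an assumed stand-in, consistent with the inputs named above; the actual formula of this embodiment may differ.

```python
def estimate_quantile(value, vmin, vmax):
    # Assumed linear estimate of the quantile between the smallest and largest data values.
    if vmax == vmin:
        return 0.5
    return (value - vmin) / (vmax - vmin)
```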
In this scenario, step 103 may be implemented as: querying the standard quantile of the target data point based on the target sketch and the data value of the target data point. To distinguish it from the estimated quantile mentioned above, the quantile obtained by the query is called the standard quantile.
For ease of description, the data value of the target data point is denoted Q and the standard quantile is denoted q; the query result obtained based on the target sketch and Q is then q.
Assuming that the total number of data points from which the target sketch is built is N, the data value of the largest data point in the full set is max, and the data value of the smallest data point is min, an example of the query flow for querying the standard quantile q of the target data point based on the target sketch and the data value Q of the target data point is shown in Figure 6; the flow in Figure 6 is as follows:
(1) If Q < C1_value, the query result is obtained by interpolation using the first cluster, where C1_weight is the cluster weight of the first cluster in the target sketch and C1_value is the cluster mean of the first cluster. The first cluster in the target sketch refers to the first cluster after the clusters are sorted by cluster mean in ascending order.

(2) If Q ≥ Cm_value, the query result is obtained by interpolation using the last cluster, where Cm_weight is the cluster weight of the last cluster in the target sketch and Cm_value is the cluster mean of the last cluster. The last cluster in the target sketch refers to the last cluster after the clusters are sorted by cluster mean in ascending order.
(3) If neither of the conditions in (1) and (2) is satisfied, all clusters are traversed starting from the first cluster. Assuming the traversal has currently reached the i-th cluster, the following operations are performed on the i-th cluster:

a) Compute the cumulative sum W_i of the cluster weights of the clusters traversed so far (including the current cluster), that is, W_i = C1_weight + C2_weight + ... + Ci_weight.

b) If Ci_value ≤ Q < Ci+1_value, compute the query result by interpolation based on the current cluster, the next cluster, and W_i. If Q does not satisfy Ci_value ≤ Q < Ci+1_value, continue traversing the next cluster. An example of the interpolation is as follows:

Assume that the cluster means of the two clusters on the left and right of the interpolation position are v_l and v_r respectively, and that their cluster weights are w_l and w_r respectively; the queried standard quantile q can then be obtained by interpolating between these two clusters.

It should be noted that the query process shown in Figure 6 is given by way of example; the embodiments of the present application do not limit the implementation of querying quantiles based on the already constructed target sketch and the data value of the target data point. The interpolation described above is likewise only an example, and the embodiments of the present application do not limit it either.
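To make the flow of Figure 6 concrete, the sketch below walks the sorted clusters and interpolates a quantile for a given data value Q. The boundary handling and the exact interpolation weights are assumptions (the corresponding formulas are not reproduced in the text above), so this is illustrative rather than the embodiments' definitive implementation.

```python
def query_quantile(clusters, q_value, n_total, v_min, v_max):
    """clusters: list of (mean, weight) pairs sorted by mean ascending.
    Returns an approximate quantile in [0, 1] for the data value q_value."""
    means = [c[0] for c in clusters]
    weights = [c[1] for c in clusters]
    # (1) value below the first cluster mean: interpolate against the minimum (assumption)
    if q_value < means[0]:
        if means[0] == v_min:
            return 0.0
        frac = (q_value - v_min) / (means[0] - v_min)
        return max(frac * (weights[0] / 2.0) / n_total, 0.0)
    # (2) value at or above the last cluster mean: interpolate against the maximum (assumption)
    if q_value >= means[-1]:
        if v_max == means[-1]:
            return 1.0
        frac = (q_value - means[-1]) / (v_max - means[-1])
        return min(1.0 - (1.0 - frac) * (weights[-1] / 2.0) / n_total, 1.0)
    # (3) otherwise traverse the clusters and interpolate between adjacent cluster means
    cum = 0.0  # cumulative weight W_i of the traversed clusters
    for i in range(len(clusters) - 1):
        cum += weights[i]
        if means[i] <= q_value < means[i + 1]:
            frac = (q_value - means[i]) / (means[i + 1] - means[i])
            # split half of each neighbouring cluster across the gap (assumed weighting)
            between = (weights[i] + weights[i + 1]) / 2.0
            return (cum - weights[i] / 2.0 + frac * between) / n_total
    return 1.0
```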
In addition, in the second application scenario, there are the following two situations in which quantiles need to be queried based on data values. These are explained below.

First case: querying in response to a quantile query request

In the first case, before the sketch is constructed, a quantile query request may be received. The quantile query request is used to query the standard quantile of a target data point among multiple data points, and the quantile query request carries the data value of the target data point.

In this way, in the first case, a quantile can be estimated from the data value input by the user, the scale function can then be selected adaptively based on the estimated quantile, and a sketch can be constructed. The constructed sketch is denser in the interval near the quantile corresponding to the data value input by the user, thereby improving the accuracy of the query result.

Second case: querying in response to an equal-width histogram query request
In the second case, before the sketch is constructed, an equal-width histogram query request may be received. The equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points, and the request carries a bucket boundary array. The bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the multiple data points into n+1 intervals. Each of the n boundary values is used in turn as the data value of the target data point, and steps 101 to 103 are performed, so as to obtain n standard quantiles in one-to-one correspondence with the n boundary values.

After the n standard quantiles corresponding one-to-one to the n boundary values are obtained, the equal-width histogram can be drawn based on these n standard quantiles.

The n boundary values in the bucket boundary array are arranged in ascending order and form an arithmetic sequence, so that every bucket in the equal-width histogram has the same width.

In addition, the coordinates on the horizontal axis of the equal-width histogram increase from left to right. For ease of description, the n+1 buckets from left to right in the equal-width histogram are successively labeled the first bucket, the second bucket, ..., the (n+1)-th bucket. Thus, the left boundary value of the first bucket is the data value of the smallest data point among the full set of data points; the left boundary value of the second bucket (that is, the right boundary value of the first bucket) is the first boundary value in the bucket boundary array; the left boundary value of the third bucket (that is, the right boundary value of the second bucket) is the second boundary value in the bucket boundary array; and so on. The left boundary value of the (n+1)-th bucket (that is, the right boundary value of the n-th bucket) is the n-th boundary value in the bucket boundary array, and the right boundary value of the (n+1)-th bucket is the data value of the largest data point among the full set of data points.

In this way, based on the n standard quantiles corresponding one-to-one to the n boundary values, the equal-width histogram may be drawn as follows: after the quantile corresponding to each boundary value in the bucket boundary array is determined, the number of data points falling between two adjacent boundary values can be determined based on the total number of data points and the quantile corresponding to each boundary value, and the height of each bucket in the equal-width histogram can be obtained from the number of data points falling between two adjacent boundary values. The specific implementation is described in detail later.

In addition, for determining the n standard quantiles corresponding one-to-one to the n boundary values, reference can be made to the flowchart shown in Figure 6; details are not repeated here.
Figure 7 is a schematic flowchart of querying an equal-width histogram provided by an embodiment of the present application. As shown in Figure 7, the process of querying an equal-width histogram includes the following steps:

1) Input the bucket boundaries B = [b1, b2, b3, ..., bh] of the equal-width histogram to be queried (that is, the bucket boundary array), and initialize an array C = [0, 0, ..., 0] whose length is h+1.

2) Based on steps 101 to 103, compute the q value corresponding to each element in B by adaptively selecting the scale function. Assuming the traversal has reached the i-th element and the computed q value is qi, each element of the array C is determined as follows:

a) For the first element, set the first element of C to q1.

b) For the last element, set the last element of C to 1 - qh.

c) Otherwise, set the i-th element of C to the difference qi - qi-1 between the current q value and the previous one.

3) If the vertical axis of the equal-width histogram represents frequency, set C[i] = N * C[i]. Each element of the resulting array C is then the height of one bucket, and the height of each bucket represents the number of data points whose data values fall within the boundary range of that bucket.

Optionally, if the vertical axis of the equal-width histogram represents probability, there is no need to set C[i] = N * C[i]. Each element of the resulting array C is still the height of one bucket, and the height of each bucket then represents the ratio of the number of data points whose data values fall within the boundary range of that bucket to the total number N.
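The following sketch mirrors the Figure 7 flow under two assumptions made for illustration: `query_quantile` is the value-to-quantile routine sketched earlier, and the first and last buckets absorb everything below the first and above the last boundary.

```python
def equal_width_histogram(clusters, boundaries, n_total, v_min, v_max, as_frequency=True):
    """boundaries: ascending, equally spaced boundary values b1..bh.
    Returns h+1 bucket heights (counts if as_frequency, else probabilities)."""
    q_values = [query_quantile(clusters, b, n_total, v_min, v_max) for b in boundaries]
    heights = [0.0] * (len(boundaries) + 1)
    heights[0] = q_values[0]            # share of points below the first boundary
    heights[-1] = 1.0 - q_values[-1]    # share of points at or above the last boundary
    for i in range(1, len(boundaries)):
        heights[i] = q_values[i] - q_values[i - 1]  # share between adjacent boundaries
    if as_frequency:
        heights = [h * n_total for h in heights]    # convert probabilities to counts
    return heights
```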
It should be noted that the above two situations are used to illustrate application scenarios of querying quantiles based on data values; the embodiments of the present application do not limit the application scenarios of querying quantiles based on data values.

Based on the embodiment shown in Figure 1, the scale function can be adaptively selected according to the target quantile corresponding to the target data point to be queried, so as to improve the accuracy of the constructed target sketch near the target quantile and thereby improve the accuracy of the query result. This way of adaptively selecting the scale function can be applied in the scenario of querying data values based on quantiles, in the scenario of querying quantiles based on data values, in the scenario of querying an equal-height histogram, and in the scenario of querying an equal-width histogram. Therefore, the method provided by the embodiments of the present application can improve the accuracy of query results in various query scenarios.
The above embodiment explains how to adaptively select a scale function to construct the target sketch. The embodiments of the present application further provide, for a target sketch that has already been constructed, a method of inserting data points into or deleting data points from the target sketch, in order to update the target sketch.

Figure 8 is a schematic flowchart of updating a target sketch provided by an embodiment of the present application. As shown in Figure 8, the method includes the following steps 801 to 802.

Step 801: Generate to-be-updated clusters corresponding to the to-be-updated data points in the cache, where a to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag; the cluster mean of a to-be-updated cluster indicates the data value of the to-be-updated data points, the cluster weight indicates the number of to-be-updated data points, and the cluster tag indicates the update type of the to-be-updated data points.

Step 802: Update the target sketch based on the to-be-updated clusters.
In the embodiments of the present application, in order to update the target sketch, a triplet may be used to represent a cluster. The triplet can be expressed as <v, w, f>, where v represents the cluster mean of the cluster, w represents the cluster weight of the cluster, and f represents the cluster tag of the cluster. The cluster tag indicates whether the cluster is a cluster to be deleted or a cluster to be merged.

Based on this, the data points in the cache are represented as to-be-updated clusters in the form of the above triplet. That is, the to-be-updated data points in the cache correspond to to-be-updated clusters; each to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag, where the cluster mean indicates the data value of the to-be-updated data points, the cluster weight indicates the number of to-be-updated data points, and the cluster tag indicates the update type of the to-be-updated data points.

For example, the cluster tag of a to-be-updated cluster is either a to-be-merged tag or a to-be-deleted tag. For instance, when f = 1 in the triplet, the cluster tag is the to-be-merged tag, indicating that the corresponding cluster is a cluster to be merged into the target sketch; when f = -1, the cluster tag is the to-be-deleted tag, indicating that the corresponding cluster is a cluster to be deleted from the target sketch.
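A minimal representation of the <v, w, f> triplet described above might look as follows; the field and constant names are illustrative, not taken from the embodiments.

```python
from dataclasses import dataclass

MERGE = 1    # f = 1: cluster to be merged into the target sketch
DELETE = -1  # f = -1: cluster to be deleted from the target sketch

@dataclass
class Cluster:
    mean: float       # v: cluster mean (the data value for a single buffered point)
    weight: float     # w: cluster weight (number of data points)
    tag: int = MERGE  # f: update type
```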
The update operations on the target sketch include inserting data points into the target sketch and deleting data points from the target sketch. These two cases are explained below.

First case: inserting data points into the target sketch

In the first case, step 802 is implemented as follows: obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is the to-be-merged tag, to obtain the to-be-merged clusters; and merge the to-be-merged clusters into the target sketch.

Since the cache holds data points that need to be deleted or newly added, after the data points in the cache are represented as to-be-updated clusters, the to-be-merged clusters, that is, the newly added data points, can be filtered out of the cache according to the cluster tags, and the to-be-merged clusters are then merged into the target sketch.
In some embodiments, merging the to-be-merged clusters into the target sketch may be implemented as follows: sort the clusters in the target sketch together with the to-be-merged clusters in ascending order of cluster mean; for the first cluster after sorting, determine a quantile threshold based on the target scale function; then traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn:

For the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1. If the current quantile of the i-th cluster is lower than the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster. If the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.

That is, whether an adjacent cluster is suitable to be merged into the current cluster is judged based on the quantile threshold of a given cluster, where the quantile threshold indicates the capacity limit of the corresponding cluster.
For example, for the first cluster after sorting, the quantile threshold may be determined based on the target scale function as follows: set the current quantile q_0 of the first cluster to 0, and determine the quantile threshold q_threshold by the following formula:

q_threshold = k^(-1)(k(q_0) + 1)

where k(·) denotes the target scale function and k^(-1)(·) denotes its inverse.
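For concreteness, one commonly used arcsine-style family of quantile-sketch scale functions and its inverse is shown below; the embodiments do not name their scale functions at this point, so this particular pair is an assumption used only to make the threshold formula q_threshold = k^(-1)(k(q_0) + 1) executable. The returned pair can be passed as k and k_inv to the merge routine sketched after Figure 9.

```python
import math

def make_scale_function(delta):
    """An arcsine-style scale function k(q) with compression factor delta, together
    with its inverse; it makes clusters denser near q = 0 and q = 1 (assumption)."""
    def k(q):
        q = min(max(q, 0.0), 1.0)
        return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)
    def k_inv(y):
        y = min(max(y, -delta / 4.0), delta / 4.0)  # clamp to the invertible range
        return 0.5 * (math.sin(2.0 * math.pi * y / delta) + 1.0)
    return k, k_inv
```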
In addition, for example, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster may be implemented as follows: determine the sum of the cluster weights of the clusters traversed so far (including the i-th cluster), determine the sum of the cluster weights of all sorted clusters, and use the ratio of the two sums as the current quantile of the i-th cluster.

Furthermore, if the current quantile of the i-th cluster is lower than the quantile threshold, the i-th cluster is merged into the previous cluster. For example, merging the i-th cluster into the previous cluster means updating the cluster weight and cluster mean of the previous cluster based on the cluster weight and cluster mean of the i-th cluster. For instance, the cluster mean of the i-th cluster and the cluster mean of the previous cluster are combined by weighting them with their respective cluster weights, and the resulting value is used as the updated cluster mean of the previous cluster; the cluster weight of the i-th cluster is added to the cluster weight of the previous cluster, and the resulting value is used as the updated cluster weight of the previous cluster.

In addition, for updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, reference can likewise be made to the above formula for determining the quantile threshold q_threshold; details are not repeated here.
Figure 9 is a schematic flowchart of inserting data points into a target sketch provided by an embodiment of the present application. As shown in Figure 9, newly added data points are first placed in the cache (that is, the buffer), and the newly added data points in the cache are represented as triplets to obtain the to-be-merged clusters. The to-be-merged clusters and the clusters in the target sketch are sorted. The quantile threshold is computed from the first cluster after sorting, and the traversal then starts from the second cluster. For any current cluster reached by the traversal, it is judged whether the quantile of the current cluster is less than or equal to the quantile threshold. If the quantile of the current cluster is less than or equal to the quantile threshold, the current cluster is merged into the previous cluster, the current cluster is deleted, the updated previous cluster is redesignated as the current cluster, and the above operations continue. If the quantile of the current cluster is greater than the quantile threshold, the quantile threshold is recomputed based on the quantile of the current cluster, and the traversal moves on to the next cluster.

It should be noted that the detailed implementation of the insertion process shown in Figure 9 is given by way of example; the embodiments of the present application do not limit the detailed implementation of the insertion operation.
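Under the assumptions that clusters follow the <v, w, f> triplet sketched earlier and that q_threshold = k^(-1)(k(q) + 1) for the chosen scale function k, a minimal merge routine for the insertion flow of Figure 9 might look as follows; the weighted-mean update and the threshold handling are illustrative, not the embodiments' definitive implementation.

```python
def merge_into_sketch(sketch, to_merge, k, k_inv):
    """sketch, to_merge: lists of Cluster instances with tag MERGE.
    k, k_inv: the target scale function and its inverse."""
    clusters = sorted(sketch + to_merge, key=lambda c: c.mean)
    total = sum(c.weight for c in clusters)
    merged = [clusters[0]]
    q_threshold = k_inv(k(0.0) + 1)   # threshold derived from the first cluster (q0 = 0)
    cum = clusters[0].weight          # cumulative weight of traversed clusters
    for cur in clusters[1:]:
        cum += cur.weight
        q = cum / total               # current quantile of the traversed prefix
        prev = merged[-1]
        if q <= q_threshold:
            # merge the current cluster into the previous one (weighted mean, summed weight)
            w = prev.weight + cur.weight
            prev.mean = (prev.mean * prev.weight + cur.mean * cur.weight) / w
            prev.weight = w
        else:
            q_threshold = k_inv(k(q) + 1)  # recompute the threshold from the current quantile
            merged.append(cur)
    return merged
```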
Second case: deleting data points from the target sketch

In the second case, step 802 is implemented as follows: obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is the to-be-deleted tag, to obtain the to-be-deleted clusters; and delete the to-be-deleted clusters from the target sketch.

Since the cache holds data points that need to be deleted or newly added, after all the data points in the cache are represented as to-be-updated clusters, the to-be-deleted clusters, that is, the data points that need to be deleted, can be filtered out of the cache according to the cluster tags, and the to-be-deleted clusters are then deleted from the target sketch.

Optionally, the data points to be deleted may contain data points with the same data value, that is, two of the to-be-updated clusters may have the same cluster mean. In this scenario, in order to improve deletion efficiency, the to-be-deleted clusters with the same cluster mean can be merged, and the cluster weight of a merged cluster is the sum of the cluster weights of the clusters before the merge. The target sketch is then updated based on the merged to-be-deleted clusters.
In some embodiments, deleting the to-be-deleted clusters from the target sketch may be implemented as follows: sort the clusters in the target sketch together with the to-be-deleted clusters in ascending order of cluster mean; traverse each cluster starting from the first cluster after sorting, and perform the following operations on each traversed cluster in turn: for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is the to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.

Since deleting a cluster from the target sketch affects the cluster weights of the clusters adjacent to that cluster, the cluster weights of the adjacent clusters also need to be updated when a cluster is deleted.

Updating the cluster weights of the clusters adjacent to the j-th cluster covers the following situations:
Situation 1: If the j-th cluster is the first cluster after sorting, only the cluster weight of the right neighbor of the first cluster needs to be updated.

For example, the cluster weight of the first cluster is subtracted from the cluster weight of the right neighbor of the first cluster, and the resulting value is used as the updated cluster weight of the right neighbor of the first cluster.

It should be noted that if the cluster weight of the right neighbor of the first cluster is smaller than the cluster weight of the first cluster, the right neighbor of the first cluster is deleted, the difference between the cluster weight of the first cluster and that of its right neighbor is determined, and the cluster weight of the next right neighbor (the cluster adjacent to the deleted right neighbor) is updated based on this difference. If this difference is still larger than the cluster weight of the next right neighbor, the cluster weight of the following right neighbor continues to be updated in the same way, until the most recently reached right neighbor has a cluster weight larger than the most recently determined difference. This procedure can be called recursively updating the cluster weights to the right.

In the above scenario, since the first cluster after sorting is deleted, the minimum value of the target sketch (that is, the data value of the smallest data point among the full set of data points used to construct the target sketch) changes, and the minimum value of the target sketch can be updated at this time. For example, the cluster mean of the first cluster in the updated target sketch can be used as the minimum value of the target sketch.
Situation 2: If the j-th cluster is the last cluster after sorting, only the cluster weight of the left neighbor of the last cluster needs to be updated.

For example, the cluster weight of the last cluster is subtracted from the cluster weight of the left neighbor of the last cluster, and the resulting value is used as the updated cluster weight of the left neighbor of the last cluster.

It should be noted that if the cluster weight of the left neighbor of the last cluster is smaller than the cluster weight of the last cluster, the left neighbor of the last cluster is deleted, the difference between the cluster weight of the last cluster and that of its left neighbor is determined, and the cluster weight of the next left neighbor (the cluster adjacent to the deleted left neighbor) is updated based on this difference. If this difference is still larger than the cluster weight of the next left neighbor, the cluster weight of the following left neighbor continues to be updated in the same way, until the most recently reached left neighbor has a cluster weight larger than the most recently determined difference. This procedure can be called recursively updating the cluster weights to the left.

In the above scenario, since the last cluster after sorting is deleted, the maximum value of the target sketch (that is, the data value of the largest data point among the full set of data points used to construct the target sketch) changes, and the maximum value of the target sketch can be updated at this time. For example, the cluster mean of the last cluster in the updated target sketch can be used as the maximum value of the target sketch.
Situation 3: If the j-th cluster is an intermediate cluster after sorting, the cluster weight of the left neighbor and the cluster weight of the right neighbor of the j-th cluster need to be updated.

In situation 3, updating the cluster weights of the clusters adjacent to the j-th cluster may be implemented as follows: obtain the cluster mean of the left neighbor of the j-th cluster and the cluster mean of the right neighbor of the j-th cluster; based on the cluster mean of the left neighbor, the cluster mean of the right neighbor, and the cluster mean and cluster weight of the j-th cluster, determine a deletion weight corresponding to the left neighbor and a deletion weight corresponding to the right neighbor; update the cluster weight of the left neighbor based on the deletion weight corresponding to the left neighbor, and update the cluster weight of the right neighbor based on the deletion weight corresponding to the right neighbor.
For example, the deletion weight corresponding to the left neighbor and the deletion weight corresponding to the right neighbor are determined from the cluster mean of the left neighbor, the cluster mean of the right neighbor, and the cluster mean and cluster weight of the j-th cluster, where d_l denotes the deletion weight corresponding to the left neighbor, d_r denotes the deletion weight corresponding to the right neighbor, w_c denotes the cluster weight of the j-th cluster, v_c denotes the cluster mean of the j-th cluster, v_l denotes the cluster mean of the left neighbor, and v_r denotes the cluster mean of the right neighbor.
In addition, updating the cluster weight of the left neighbor based on the deletion weight corresponding to the left neighbor may, for example, be: subtract the deletion weight corresponding to the left neighbor from the cluster weight of the left neighbor, and use the resulting value as the updated cluster weight of the left neighbor. Updating the cluster weight of the right neighbor based on the deletion weight corresponding to the right neighbor may, for example, be: subtract the deletion weight corresponding to the right neighbor from the cluster weight of the right neighbor, and use the resulting value as the updated cluster weight of the right neighbor.

Furthermore, optionally, updating the cluster weight of the left neighbor may likewise refer to the aforementioned leftward recursive update of cluster weights, and updating the cluster weight of the right neighbor may likewise refer to the aforementioned rightward recursive update of cluster weights. Details are not repeated here.
Figure 10 is a schematic flowchart of deleting data points from a target sketch provided by an embodiment of the present application. As shown in Figure 10, the data points to be deleted in the buffer are first collected, and each data point is represented by the aforementioned triplet, that is, each to-be-deleted data point is represented as a cluster, so as to construct the to-be-deleted clusters. The to-be-deleted clusters and the clusters in the target sketch are sorted in ascending order of cluster mean.

The first cluster after sorting is set as the current cluster, and the traversal proceeds backward from the current cluster. If the current cluster has f = 1, the backward traversal continues. If the current cluster has f = -1, the current cluster is a to-be-deleted cluster and needs to be deleted. The deletion rules are as follows:

1) If the current cluster is the first cluster, delete data from the right neighbor, that is, modify the cluster weight of the right neighbor. If the cluster weight of the right neighbor is not sufficient to absorb the cluster weight of the current cluster, the deletion continues by recursively updating the cluster weights to the right as described above. If the deletion of the first cluster of the target sketch affects the minimum value of the target sketch, the minimum value of the target sketch is updated according to the updated first cluster of the target sketch. After the cluster weight of the right neighbor has been updated based on the cluster weight of the to-be-deleted cluster, the to-be-deleted cluster is deleted, the first cluster is marked as the current cluster, and the backward traversal continues.

2) If the current cluster is the last cluster, delete data from the left neighbor, that is, modify the cluster weight of the left neighbor. If the cluster weight of the left neighbor is not sufficient to absorb the cluster weight of the current cluster, the deletion continues by recursively updating the cluster weights to the left as described above. If the deletion of the last cluster of the target sketch affects the maximum value of the target sketch, the maximum value of the target sketch can be updated according to the updated last cluster of the target sketch. After the cluster weight of the left neighbor has been updated based on the cluster weight of the to-be-deleted cluster, the to-be-deleted cluster is deleted and the deletion operation ends.

3) If the current cluster is in an intermediate position, determine the deletion weight of the left neighbor and the deletion weight of the right neighbor of the current cluster, and then delete recursively to the left and to the right, that is, update the cluster weight of the left neighbor based on the deletion weight of the left neighbor and update the cluster weight of the right neighbor based on the deletion weight of the right neighbor. After the cluster weights of the left and right neighbors have been updated based on the cluster weight of the to-be-deleted cluster, the left neighbor of the to-be-deleted cluster is marked as the current cluster, the to-be-deleted cluster is deleted, and the backward traversal continues.
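A compact sketch of the deletion flow in Figure 10 is given below, reusing the Cluster triplet and the DELETE tag sketched earlier. The proportional split of the deleted weight between the two neighbors of an intermediate cluster, and the carry-over when a neighbor's weight is insufficient, are assumptions chosen for illustration, since the corresponding formulas are not reproduced in the text above.

```python
def delete_from_sketch(clusters):
    """clusters: Cluster list sorted by mean ascending, mixing sketch clusters
    (tag MERGE) and buffered to-be-deleted clusters (tag DELETE)."""

    def carry_delete(idx, amount, step):
        # Subtract `amount` from the cluster at idx; if its weight is insufficient,
        # delete that cluster and carry the remainder onward in direction `step`
        # (the rightward / leftward recursive weight update described above).
        while amount > 0 and 0 <= idx < len(clusters):
            if clusters[idx].weight > amount:
                clusters[idx].weight -= amount
                return
            amount -= clusters[idx].weight
            del clusters[idx]
            if step < 0:
                idx -= 1

    i = 0
    while i < len(clusters):
        cur = clusters[i]
        if cur.tag != DELETE:
            i += 1
            continue
        del clusters[i]                      # remove the to-be-deleted cluster itself
        if i == 0:                           # it was the first cluster: push rightwards
            carry_delete(i, cur.weight, step=1)
        elif i == len(clusters):             # it was the last cluster: push leftwards
            carry_delete(i - 1, cur.weight, step=-1)
        else:                                # intermediate cluster: split between the neighbours
            left, right = clusters[i - 1], clusters[i]
            span = right.mean - left.mean
            ratio = 0.5 if span == 0 else (cur.mean - left.mean) / span   # assumed split
            carry_delete(i, cur.weight * ratio, step=1)                   # d_r to the right
            carry_delete(i - 1, cur.weight * (1 - ratio), step=-1)        # d_l to the left
        i = max(i - 1, 0)                    # resume from the previous cluster, as in Figure 10
    return clusters
```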
Based on the embodiment shown in Figure 8, the to-be-updated data points in the cache can be represented as to-be-updated clusters in the form of triplets. Since the cluster tag in a to-be-updated cluster indicates whether the cluster is a cluster to be deleted or a cluster to be merged, the data points held in the cache that are to be inserted can be inserted into the target sketch, or the data points held in the cache that are to be deleted can be deleted from the target sketch, based on the cluster tags.

In the foregoing embodiments, when a target data point needs to be queried, the target sketch is constructed on the fly in the manner shown in Figure 1. However, for the data points in a time series database, the number of data points is very large; in this case, computing resources are easily wasted if the target sketch is constructed on the fly every time data points need to be queried. Based on this, the embodiments of the present application provide an incremental update method. With the incremental update method, when data points are queried, a sketch is constructed based only on the newly added data points, and the constructed sketch is then aggregated with the sketches already in the cache to obtain the target sketch, thereby avoiding the waste of computing resources.
For ease of understanding, the characteristics of a time series database are explained first. The data points stored in a time series database have corresponding timestamps, and the timestamp of each data point can represent the collection time of that data point, so the data points stored in the time series database have time series characteristics. In addition, the data points stored in a time series database usually include data points for different metrics, such as data points collected for temperature and data points collected for humidity. To distinguish the data points of different metrics, the data points of each metric are called the data points of one timeline. Based on this, the data points in the time series database include data points corresponding to multiple timelines, and each timeline represents one metric.

In addition, in order to implement the incremental update method provided by the embodiments of the present application, the embodiments of the present application further provide an incremental update system. To facilitate subsequent understanding, the incremental update system provided by the embodiments of the present application is explained here first.
Figure 11 is a schematic architectural diagram of an incremental update system provided by an embodiment of the present application. As shown in Figure 11, the incremental update system includes the following components.

1) Single-timeline component (seriesCusor), also called the single-timeline data-reading executor, which is responsible for reading, in response to a query statement, the original data points of a timeline within a specified time range.

2) Single-timeline aggregation component (aggregateCursor), also called the single-timeline aggregation executor, which is responsible for computing the data points of the timeline according to a specific aggregation method and outputting the aggregation result, for example constructing the data points of the timeline into a sketch. The sketch insertion and deletion operations in the foregoing embodiments can both be implemented by this component.

3) Single-timeline sketch cache component (SketchCacheCursor), also called the single-timeline sketch cache executor, which is responsible for caching sketches that have already been constructed. As shown in Figure 11, the incremental update system further includes a data cache (CacheData) and a metadata cache (CacheMeta), which are used to store the constructed sketches and the metadata of the sketches respectively. The metadata of a sketch is used to index the sketch; its specific function is described in detail in subsequent embodiments and is not expanded here.

4) Multi-timeline sorting component (tagSetCursor), also called the multi-timeline sorting and merging executor, which is responsible for sorting the sketches aggregated from the data points of multiple timelines according to the space and time dimensions, ensuring the orderliness of the cached sketches.
5) Multi-timeline inter-group component (groupCursor), also called the multi-timeline inter-group executor, which is responsible for aggregating the results output by multiple multi-timeline sorting components, so as to implement serial scheduling between different multi-timeline sorting components.

6) Logical concurrency component (ChunkReader), also called the logical concurrency executor, which serves as the parallel scheduling unit of the smallest granularity and is responsible for data structure conversion and metadata assembly. Data structure conversion refers to converting the storage-layer data structure into the query data structure in order to output the query result; metadata assembly is used to generate the metadata of the sketches.

7) Aggregation transform component (AggregateTransform), also called the multi-timeline aggregation executor, which is responsible for further aggregating the output results of the multi-timeline inter-group components, for example merging sketches.

In addition, as shown in Figure 11, based on the responsibilities of the components in the incremental update system, the following three functions can be implemented: 1. sketch construction, 2. sketch caching, and 3. sketch aggregation.
Based on the incremental update system shown in Figure 11, the incremental update method provided by the embodiments of the present application is explained in detail below, taking the construction of the target sketch in step 102 of the embodiment shown in Figure 1 as an example. Figure 12 is a flowchart of an incremental update method provided by an embodiment of the present application. As shown in Figure 12, the method includes the following steps 1201 to 1203.

Step 1201: Obtain a sketch that has already been cached and was constructed based on some of the multiple data points and the target scale function, to obtain a first sketch.

Step 1202: Construct a sketch based on the target scale function and the data points among the multiple data points other than the aforementioned part, to obtain a second sketch.

Step 1203: Aggregate the first sketch and the second sketch to obtain the target sketch.

In the embodiments of the present application, when a target data point needs to be queried, if some sketches have already been constructed in advance based on some of the data points and the target scale function, a sketch can now be constructed based only on the remaining data points, and the currently constructed sketch is merged with the previously constructed sketches to obtain the target sketch. In this way, there is no need to construct the target sketch based on the full set of data points for every query, which saves computing resources.
In some embodiments, obtaining the cached sketch constructed based on some of the multiple data points and the target scale function, to obtain the first sketch, may be implemented as follows: obtain a target time window to be queried, where the target data point is a data point whose timestamp falls within the target time window; obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches constructed based on the target scale function, and the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window being the time window corresponding to the timestamps of the data points used to construct the corresponding sketch, and the sketch timeline identifier being the identifier of the timeline to which the data points used to construct the corresponding sketch belong; based on the target time window and the timeline to which the target data point belongs, determine first metadata from the metadata set, where the sketch time window in the first metadata is part or all of the target time window and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs; and determine the sketch corresponding to the first metadata as the first sketch.

The target time window to be queried may be the time window carried in a query statement input by the user. For example, if the user inputs the query statement "query the highest temperature in the last quarter", the target time window is "the last quarter".

In addition, the metadata set may be maintained by the metadata cache (CacheMeta) shown in Figure 11. For example, the metadata set stores the metadata of the cached sketches in the form of a list. In this case, determining the first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs may be implemented as follows: traverse each piece of metadata in the metadata set; if the sketch timeline identifier of a piece of metadata is the same as the identifier of the timeline to which the target data point belongs, and the sketch time window of that metadata is part or all of the target time window, determine that metadata as the first metadata.
Optionally, in order to improve metadata query efficiency, the metadata in the metadata set can be managed according to the space and time dimensions. Figure 13 is a schematic diagram of managing metadata in the space and time dimensions provided by an embodiment of the present application. As shown in Figure 13, each SID represents one timeline, each SID corresponds to multiple time windows (windows), and a corresponding sketch is cached for each time window.

In this scenario, the metadata of the metadata set can be stored in key-value form. The key is a data shard identifier (SharId), where each SharId represents a time range (timerange). The value corresponding to each SharId thus includes multiple pieces of metadata, the sketch time window in each piece of metadata falls within that time range, and the sketch timeline identifiers in these pieces of metadata can be the identifiers of different timelines.

For example, in Figure 13, the value corresponding to SharId1 includes the metadata corresponding to SID1; this metadata can be collectively labeled SID1+timerange11, indicating that the timeline identifier in this metadata is SID1 and that the time windows in this metadata all fall within the time range timerange11 corresponding to SharId1. The value corresponding to SharId1 also includes the metadata corresponding to SID2; this metadata can be collectively labeled SID2+timerange12, indicating that the timeline identifier in this metadata is SID2 and that the time windows in this metadata all fall within the time range timerange12 corresponding to SharId1. The value corresponding to SharId1 further includes the metadata corresponding to SID3; this metadata can be collectively labeled SID3+timerange13, indicating that the timeline identifier in this metadata is SID3 and that the time windows in this metadata all fall within the time range timerange13 corresponding to SharId1.

For the values corresponding to the other data shard identifiers SharId2 and SharId3 in Figure 13, reference can likewise be made to the above explanation.

At this time, determining the first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs may be implemented as follows: determine the SharId that matches the target time window, where the time range represented by the matching SharId falls within the target time window; then query, from the value corresponding to the matching SharId, the metadata whose sketch timeline identifier is the target timeline identifier; the metadata obtained is the first metadata.
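A minimal illustration of this key-value lookup might look as follows; the dictionary layout, the class and field names, and the containment check are assumptions for illustration and do not reproduce the CacheMeta implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SketchMeta:
    sid: str          # sketch timeline identifier
    window: tuple     # sketch time window as (start, end) timestamps
    sketch_key: str   # key used to fetch the cached sketch from CacheData

def find_first_metadata(meta_set: Dict[str, List[SketchMeta]],
                        shard_ranges: Dict[str, tuple],
                        target_sid: str,
                        target_window: tuple) -> List[SketchMeta]:
    """Return the cached-sketch metadata whose timeline matches target_sid and whose
    sketch time window lies inside the target time window."""
    t_start, t_end = target_window
    hits = []
    for shard_id, (r_start, r_end) in shard_ranges.items():
        if r_end < t_start or r_start > t_end:
            continue  # this shard's time range does not intersect the target window
        for meta in meta_set.get(shard_id, []):
            w_start, w_end = meta.window
            if meta.sid == target_sid and w_start >= t_start and w_end <= t_end:
                hits.append(meta)  # sketch window is part (or all) of the target window
    return hits
```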
The metadata lookup process described above can be implemented by the multi-timeline inter-group component in Figure 11.

Correspondingly, constructing a sketch based on the target scale function and the data points among the multiple data points other than the part already covered, to obtain the second sketch, is implemented as follows: obtain the data points corresponding to a second time window among the multiple data points, where the second time window is the portion of the target time window other than a first time window, and the first time window is the portion of the target time window that overlaps with the sketch time window in the first metadata; and construct the second sketch based on the target scale function and the data points corresponding to the second time window.

That is, for the data points for which no sketch has yet been constructed, a sketch is constructed on the fly, and this newly constructed sketch is used as the second sketch so that it can subsequently be merged with the cached first sketch. Constructing the sketch on the fly can be implemented by the single-timeline component and the single-timeline aggregation component in Figure 11.

In addition, after the second sketch is constructed based on the remaining data points and the target scale function, the metadata of the second sketch can also be determined to obtain second metadata; the second sketch is cached, and the second metadata is added to the metadata set, so as to update the metadata set. This process can be implemented by the single-timeline sketch cache component in Figure 11.

Moreover, as can be seen from the embodiment shown in Figure 1, the same batch of data points yields different sketches when constructed with different scale functions. Therefore, in the embodiments of the present application, a metadata set corresponds to one scale function: for sketches constructed with different scale functions, different metadata sets can be maintained, and each metadata set maintains only the metadata of the sketches constructed based on the corresponding scale function.
In addition, in the embodiments of the present application, when new data points are written over the time range corresponding to an already cached sketch, the cached sketch needs to be invalidated, so as to avoid inconsistency between query results and the actual data.

Based on this, in some embodiments, for a data point to be written, the timestamp of the data point to be written and the identifier of the timeline to which the data point to be written belongs can also be determined; if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, the sketch corresponding to the third metadata is deleted and the metadata set is updated.

Here, the timestamp of the data point to be written and the identifier of its timeline matching the third metadata in the metadata set means that the timestamp of the data point to be written falls within the sketch time window of the third metadata, and that the identifier of the timeline to which the data point to be written belongs is the same as the sketch timeline identifier of the third metadata. This process can be implemented by the single-timeline sketch cache component in Figure 11.
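A minimal illustration of this write-time invalidation, reusing the hypothetical SketchMeta layout from the lookup sketch above, might be:

```python
def invalidate_on_write(meta_set, cache_data, write_sid, write_ts):
    """Drop any cached sketch whose timeline matches the written point and whose
    sketch time window covers the written timestamp, then prune its metadata."""
    for shard_id, metas in meta_set.items():
        kept = []
        for meta in metas:
            w_start, w_end = meta.window
            if meta.sid == write_sid and w_start <= write_ts <= w_end:
                cache_data.pop(meta.sketch_key, None)  # invalidate the cached sketch
            else:
                kept.append(meta)
        meta_set[shard_id] = kept
```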
In addition, in the embodiments of the present application, more and more sketches are cached as time passes. To prevent excessive sketches from wasting the cache, sketches can also be evicted.

The sketch eviction method provided by the embodiments of the present application can evict sketches in two respects. The first is to evict some of the multiple sketches belonging to the same timeline, so as to evict sketches in the time dimension. The second is to evict the sketches of a particular timeline among different timelines, so as to evict sketches in the space dimension.
在一些实施例中,对于元数据集中的任一草图时间线标识,元数据集还包括与该草图时间线标识对应的第一使用信息,第一使用信息用于记录与该草图时间线标识匹配的多个草图中每个草图的使用时间。这种场景下,基于时间维度淘汰草图的实现方式可以为:基于第一使用信息确定与该草图时间线标识匹配的多个草图中待淘汰的草图,并删除待淘汰的草图。 In some embodiments, for any sketch timeline identification in the metadata set, the metadata set also includes first usage information corresponding to the sketch timeline identification, and the first usage information is used to record matches with the sketch timeline identification. The time spent on each of the multiple sketches. In this scenario, the elimination of sketches based on the time dimension can be implemented by: determining the sketches to be eliminated among the multiple sketches that match the sketch timeline identifier based on the first usage information, and deleting the sketches to be eliminated.
示例地,可以通过最近最少使用(least recently used,LRU)淘汰机制进行淘汰。也即,将与该草图时间线标识匹配的多个草图中近期使用频率较低的草图删除,以节省缓存。For example, elimination can be carried out through the least recently used (LRU) elimination mechanism. That is, the less frequently used sketches among the multiple sketches matching the sketch timeline ID will be deleted to save cache.
另外,在一些实施例中,元数据集还包括第二使用信息,第二使用信息用于记录多个草图时间线标识中每个草图时间线标识对应的使用信息,每个草图时间线标识对应的使用信息指示与相应草图时间线标识匹配的草图的使用时间。这种场景下,基于空间维度淘汰草图的实现方式可以为:基于第二使用信息确定多个草图时间线标识中待淘汰的草图时间线标识;将与待淘汰的草图时间线标识匹配的草图删除。In addition, in some embodiments, the metadata set further includes second usage information. The second usage information is used to record the usage information corresponding to each sketch timeline identification among the plurality of sketch timeline identifications. Each sketch timeline identification corresponds to The usage information indicates when the sketch that matches the corresponding sketch timeline ID was used. In this scenario, the implementation method of eliminating sketches based on the spatial dimension can be: determining the sketch timeline identifier to be eliminated among the multiple sketch timeline identifiers based on the second usage information; and deleting the sketch that matches the sketch timeline identifier to be eliminated. .
示例地,同样可以通过LRU淘汰机制进行淘汰。也即,将各个草图时间线标识中近期使用频率较低的草图时间线标识对应的草图删除,以节省缓存。For example, elimination can also be performed through the LRU elimination mechanism. That is, among the various sketch timeline identifiers, the sketches corresponding to the sketch timeline identifiers that have been used less frequently recently are deleted to save cache.
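The two eviction directions can be illustrated with a small sketch. The usage-information layout (plain dictionaries keyed by sketch or timeline with a last-used timestamp) is an assumption made for demonstration; any LRU bookkeeping would serve the same purpose.

```python
# Illustrative sketch of the two LRU-style eviction directions described above:
# per-timeline (time dimension) and across timelines (space dimension).
def evict_within_timeline(sketch_keys_of_timeline, last_used, keep_at_most):
    """Time dimension: keep only the most recently used sketches of one timeline.

    Returns the set of sketch keys that should be deleted.
    """
    ordered = sorted(sketch_keys_of_timeline, key=lambda k: last_used[k], reverse=True)
    return set(ordered[keep_at_most:])

def evict_timeline(last_used_per_timeline):
    """Space dimension: pick the least recently used timeline; all of its
    cached sketches would then be deleted."""
    return min(last_used_per_timeline, key=last_used_per_timeline.get)
```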
In summary, the embodiments of this application provide an incremental update system and an incremental update method, so that the target sketch does not need to be rebuilt from the full set of data points every time a data point is queried, which saves computing resources.
The apparatus and devices involved in the embodiments of this application are explained below.
An embodiment of this application further provides a data point query apparatus. As shown in Figure 14, the apparatus 1400 includes the following modules.
A first determination module 1401, configured to determine a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, where the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size. For the specific implementation, refer to step 101 in the embodiment of Figure 1.
A construction module 1402, configured to construct a target sketch based on the target scale function and the multiple data points, where the target sketch includes multiple clusters, each cluster includes a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster. For the specific implementation, refer to step 102 in the embodiment of Figure 1.
A query module 1403, configured to query the target data point based on the target sketch. For the specific implementation, refer to step 103 in the embodiment of Figure 1.
Optionally, the multiple scale functions include a first scale function and a second scale function; the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval.
The first determination module 1401 is configured to:
if the target quantile lies in the first quantile interval, determine the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determine the second scale function as the target scale function.
Optionally, the first quantile interval includes an interval from 0 to x1 and an interval from x2 to 1, where x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval includes an interval from x1 to x2.
Optionally, the query module 1403 is configured to:
query the data value of the target data point based on the target sketch and the target quantile.
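As one way to picture this query, the following hedged sketch reads a data value off a target sketch whose clusters are (mean, weight) pairs sorted by mean. The linear interpolation between neighbouring cluster centroids is a common choice for such sketches and is an assumption here, not a rule prescribed by the embodiments.

```python
# Sketch of answering "which data value sits at quantile q" from a list of
# clusters given as (mean, weight) pairs sorted by mean; weights are assumed
# to be positive.
def value_at_quantile(clusters, q):
    total = sum(w for _, w in clusters)
    target = q * total                          # target rank within the sketch
    seen = 0.0                                  # cumulative weight before cluster i
    for i, (mean, weight) in enumerate(clusters):
        center = seen + weight / 2.0            # rank of this cluster's centroid
        if target <= center:
            if i == 0:
                return mean
            prev_mean, prev_weight = clusters[i - 1]
            prev_center = seen - prev_weight / 2.0
            frac = (target - prev_center) / (center - prev_center)
            return prev_mean + frac * (mean - prev_mean)
        seen += weight
    return clusters[-1][0]
```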
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive a data point query request, where the data point query request is used to query the data value of a target data point among multiple data points and carries the standard quantile of the target data point;
the first determination module is further configured to determine the standard quantile carried in the data point query request as the target quantile.
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive an equal-height histogram query request, where the equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points and carries a bucket count h, h being an integer greater than 1;
the first determination module is further configured to determine, based on the bucket count h and the total number of the multiple data points, the quantiles of the first to the (h-1)-th buckets from left to right in the equal-height histogram, to obtain h-1 quantiles;
the query module is further configured to take each of the h-1 quantiles in turn as the target quantile and perform the operation of determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values corresponding one-to-one to the h-1 quantiles.
The apparatus 1400 further includes a drawing module, configured to draw the equal-height histogram based on the h-1 data values and the data values of the maximum data point and the minimum data point among the multiple data points.
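A short sketch of how the h-1 quantile queries translate into equal-height bucket boundaries is given below. value_at_quantile is the hypothetical helper sketched earlier, and h, v_min, v_max stand for the requested bucket count and the minimum and maximum data values; these names are assumptions for illustration.

```python
# Sketch: derive the boundaries of an equal-height histogram from h-1
# quantile queries against the sketch.
def equal_height_boundaries(clusters, h, v_min, v_max):
    """Return the h+1 bucket boundaries of an equal-height histogram."""
    inner = [value_at_quantile(clusters, k / h) for k in range(1, h)]  # h-1 values
    return [v_min] + inner + [v_max]
```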
Optionally, the first determination module is further configured to:
determine an estimated quantile of the target data point based on the data value of the target data point and the data values of the maximum data point and the minimum data point among the multiple data points, and use the estimated quantile as the target quantile;
the query module is configured to:
query the standard quantile of the target data point based on the target sketch and the data value of the target data point.
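For this opposite query direction, the following sketch first forms the estimated quantile from the data value and the minimum and maximum data values (a simple linear estimate is assumed), and then reads a standard quantile from the sketch by accumulating cluster weights. Both routines are illustrative approximations rather than the exact formulas of the embodiments.

```python
# Sketch of the value-to-quantile direction.
def estimated_quantile(value, v_min, v_max):
    """Rough quantile estimate used only to pick a scale function."""
    if v_max == v_min:
        return 0.5
    return (value - v_min) / (v_max - v_min)

def quantile_of_value(clusters, value):
    """clusters: list of (mean, weight) sorted by mean; returns a rank in [0, 1]."""
    total = sum(w for _, w in clusters)
    below = 0.0
    for mean, weight in clusters:
        if mean <= value:
            below += weight     # count the weight of clusters centred at or below value
        else:
            break
    return below / total
```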
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive a quantile query request, where the quantile query request is used to query the standard quantile of a target data point among multiple data points and carries the data value of the target data point.
Optionally, the apparatus 1400 further includes:
a receiving module, configured to receive an equal-width histogram query request, where the equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points and carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the minimum data point and the data value of the maximum data point among the multiple data points into n+1 intervals;
the query module is configured to take each of the n boundary values in turn as the data value of the target data point and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the maximum data point and the minimum data point among the multiple data points, to obtain n standard quantiles corresponding one-to-one to the n boundary values.
The apparatus further includes a drawing module, configured to draw the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
Optionally, the apparatus 1400 further includes:
a generation module, configured to generate a to-be-updated cluster corresponding to a to-be-updated data point in the cache, where the to-be-updated cluster includes a cluster mean, a cluster weight, and a cluster tag, the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight of the to-be-updated cluster indicates the number of to-be-updated data points, and the cluster tag of the to-be-updated cluster indicates the update type of the to-be-updated data point;
an update module, configured to update the target sketch based on the to-be-updated cluster.
Optionally, the update module is configured to:
obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-merged tag, to obtain to-be-merged clusters;
merge the to-be-merged clusters into the target sketch.
Optionally, the update module is configured to:
sort the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean;
for the first cluster after sorting, determine a quantile threshold based on the target scale function, traverse each cluster starting from the second cluster after sorting, and perform the following operations on each cluster in turn (an illustrative code sketch of this merge pass follows the list):
for the i-th cluster, determine the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1;
if the current quantile of the i-th cluster is below the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster;
if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
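A runnable sketch of this merge pass is given below. The helper scale_limit(q) stands in for the quantile threshold derived from the target scale function; its exact form depends on the scale function chosen and is treated as an input here rather than implemented.

```python
# Sketch of the merge pass: sort all clusters by mean, then fold each cluster
# into its predecessor while the cumulative quantile stays under the threshold
# given by the scale function.
def merge_clusters(sketch_clusters, pending_clusters, scale_limit):
    """Each cluster is a (mean, weight) pair; returns the merged, sorted list."""
    clusters = sorted(
        [list(c) for c in list(sketch_clusters) + list(pending_clusters)],
        key=lambda c: c[0],
    )
    total = sum(w for _, w in clusters)
    merged = [clusters[0]]
    seen = clusters[0][1]                       # cumulative weight so far
    limit = scale_limit(seen / total)           # threshold from the first cluster
    for mean, weight in clusters[1:]:
        seen += weight
        q = seen / total                        # current quantile of this cluster
        if q <= limit:
            # Fold this cluster into the previous one (weighted mean update).
            prev = merged[-1]
            new_w = prev[1] + weight
            prev[0] = (prev[0] * prev[1] + mean * weight) / new_w
            prev[1] = new_w
        else:
            merged.append([mean, weight])
            limit = scale_limit(q)              # advance the threshold
    return merged
```

As a usage illustration, passing scale_limit = lambda q: min(1.0, q + 0.05) would cap every merged cluster at roughly five percent of the total weight; a real scale function would instead allocate smaller clusters near the quantile region it is designed to resolve finely.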
Optionally, the update module is configured to:
obtain, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-deleted tag, to obtain to-be-deleted clusters;
delete the to-be-deleted clusters from the target sketch.
Optionally, the update module is configured to:
sort the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean;
traverse each cluster starting from the first cluster after sorting, and perform the following operations on each cluster in turn:
for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is a to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
Optionally, the update module is configured to:
if the j-th cluster is an intermediate cluster after sorting, obtain the cluster mean of the left neighbouring cluster of the j-th cluster and the cluster mean of the right neighbouring cluster of the j-th cluster;
determine, based on the cluster mean of the left neighbouring cluster, the cluster mean of the right neighbouring cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left neighbouring cluster and a deletion weight corresponding to the right neighbouring cluster;
update the cluster weight of the left neighbouring cluster based on the deletion weight corresponding to the left neighbouring cluster, and update the cluster weight of the right neighbouring cluster based on the deletion weight corresponding to the right neighbouring cluster (an illustrative sketch of this delete pass follows).
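The following sketch shows one possible implementation of the delete pass. How the deleted weight is split between the two neighbours is an assumption (linear in the distance between cluster means); the embodiments only state that both neighbouring cluster weights are updated based on deletion weights derived from the means of the neighbours and the mean and weight of the deleted cluster.

```python
# Sketch of the delete pass over the sorted cluster list. Clusters are
# [mean, weight, tag] with tag either "keep" (already in the sketch) or
# "delete" (a to-be-deleted cluster generated from the cache).
def delete_clusters(sketch_clusters, pending_deletes):
    clusters = sorted(
        [[m, w, "keep"] for m, w in sketch_clusters]
        + [[m, w, "delete"] for m, w in pending_deletes],
        key=lambda c: c[0],
    )
    for j, (mean, weight, tag) in enumerate(clusters):
        if tag != "delete":
            continue
        if 0 < j < len(clusters) - 1:
            # Intermediate cluster: split the deletion weight between the
            # two neighbours according to how close each neighbour's mean is.
            left, right = clusters[j - 1], clusters[j + 1]
            span = (right[0] - left[0]) or 1.0
            left[1] = max(0.0, left[1] - weight * (right[0] - mean) / span)
            right[1] = max(0.0, right[1] - weight * (mean - left[0]) / span)
        elif j > 0:                              # rightmost: charge the left neighbour
            clusters[j - 1][1] = max(0.0, clusters[j - 1][1] - weight)
        elif len(clusters) > 1:                  # leftmost: charge the right neighbour
            clusters[j + 1][1] = max(0.0, clusters[j + 1][1] - weight)
    return [(m, w) for m, w, tag in clusters if tag == "keep" and w > 0]
```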
Optionally, the construction module is configured to:
obtain a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch;
construct a sketch based on the data points of the multiple data points other than that part of the data points and the target scale function, to obtain a second sketch;
aggregate the first sketch and the second sketch to obtain the target sketch (a sketch of this incremental construction follows).
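A minimal sketch of this incremental construction is shown below. It reuses the merge_clusters routine sketched earlier and treats every uncached data point as a singleton cluster (mean equal to its value, weight 1); the names are illustrative assumptions.

```python
# Sketch: reuse a cached partial sketch and only summarise the data points it
# does not cover, then aggregate the two with the same merge routine.
def build_target_sketch(cached_clusters, uncovered_points, scale_limit):
    second = [[float(v), 1.0] for v in sorted(uncovered_points)]   # second sketch
    return merge_clusters(cached_clusters, second, scale_limit)    # aggregated target sketch
```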
Optionally, the construction module is configured to:
obtain a target time window to be queried, where the target data points are data points whose timestamps lie within the target time window;
obtain a metadata set, where the metadata set includes the metadata of multiple sketches in the cache, the multiple sketches are sketches constructed based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
determine first metadata from the metadata set based on the target time window and the timeline to which the target data points belong, where the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data points belong;
determine the sketch corresponding to the first metadata as the first sketch.
Optionally, the apparatus 1400 further includes:
a second determination module, configured to determine the metadata of the second sketch, to obtain second metadata;
a caching module, configured to cache the second sketch and add the second metadata to the metadata set.
Optionally, the apparatus 1400 further includes:
a third determination module, configured to determine the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs;
a first deletion module, configured to delete the sketch corresponding to third metadata in the metadata set and update the metadata set if the timestamp and the timeline identifier of the data point to be written match the third metadata.
Optionally, the metadata set further includes first usage information corresponding to any sketch timeline identifier, where the first usage information records the usage time of each of the multiple sketches matching that sketch timeline identifier; the apparatus 1400 further includes:
a second deletion module, configured to determine, based on the first usage information, the sketch to be evicted among the multiple sketches matching that sketch timeline identifier, and delete the sketch to be evicted.
Optionally, the metadata set further includes second usage information, where the second usage information records the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching that sketch timeline identifier; the apparatus 1400 further includes:
a third deletion module, configured to determine, based on the second usage information, the sketch timeline identifier to be evicted among the multiple sketch timeline identifiers, and delete the sketches matching the sketch timeline identifier to be evicted.
The first determination module 1401, the construction module 1402, the query module 1403, and the other modules may all be implemented in software or in hardware. By way of example, the implementation of the first determination module 1401 is described below. Similarly, the implementations of the construction module 1402, the query module 1403, and the other modules may refer to the implementation of the first determination module 1401.
As an example of a module as a software functional unit, the first determination module 1401 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, there may be one or more computing instances. For example, the first determination module 1401 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. A region usually includes multiple AZs.
Likewise, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same virtual private cloud (VPC) or across multiple VPCs. A VPC is usually set up within one region; for cross-region communication between two VPCs in the same region or between VPCs in different regions, a communication gateway needs to be configured in each VPC, and the interconnection between the VPCs is implemented through the communication gateways.
As an example of a module as a hardware functional unit, the first determination module 1401 may include at least one computing device, such as a server. Alternatively, the first determination module 1401 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the first determination module 1401 may be distributed in the same region or in different regions, and may be distributed in the same AZ or in different AZs. Likewise, the multiple computing devices included in the first determination module 1401 may be distributed in the same VPC or across multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that in other embodiments the first determination module 1401 may be used to perform any step of the data point query method, the construction module 1402 may be used to perform any step of the data point query method, and the query module 1403 may be used to perform any step of the data point query method. The steps that the first determination module 1401, the construction module 1402, and the query module 1403 are responsible for implementing can be specified as needed, and the first determination module 1401, the construction module 1402, and the query module 1403 each implement different steps of the data point query method so as to realize all the functions of the data point query apparatus.
An embodiment of this application further provides a computing device. As shown in Figure 15, the computing device 1500 includes a bus 1502, a processor 1504, a memory 1506, and a communication interface 1508. The processor 1504, the memory 1506, and the communication interface 1508 communicate over the bus 1502. The computing device 1500 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1500.
The bus 1502 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in Figure 15, but this does not mean that there is only one bus or only one type of bus. The bus 1502 may include a path for transferring information between the components of the computing device 1500 (for example, the memory 1506, the processor 1504, and the communication interface 1508).
The processor 1504 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1506 may include volatile memory, such as random access memory (RAM). The memory 1506 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1506 stores executable program code, and the processor 1504 executes the executable program code to implement the functions of the aforementioned first determination module, construction module, and query module, respectively, thereby implementing the data point query method provided by the embodiments of this application. That is, the memory 1506 stores instructions for performing the data point query method provided by the embodiments of this application.
The communication interface 1508 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 1500 and other devices or communication networks.
An embodiment of this application further provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device may be a server, for example a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
As shown in Figure 16, the computing device cluster includes at least one computing device 1500. The memories 1506 of one or more computing devices 1500 in the computing device cluster may store the same instructions for performing the data point query method provided by the embodiments of this application.
In some possible implementations, the memories 1506 of one or more computing devices 1500 in the computing device cluster may each store part of the instructions for performing the data point query method provided by the embodiments of this application. In other words, a combination of one or more computing devices 1500 may jointly execute the instructions for performing the data point query method provided by the embodiments of this application.
It should be noted that the memories 1506 of different computing devices 1500 in the computing device cluster may store different instructions, each used to perform part of the functions of the data point query apparatus. That is, the instructions stored in the memories 1506 of different computing devices 1500 may implement the functions of one or more of the first determination module, the construction module, and the query module.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. Figure 17 shows one possible implementation. As shown in Figure 17, two computing devices 1500A and 1500B are connected through a network; specifically, each computing device connects to the network through its communication interface. In this type of possible implementation, the memory 1506 of the computing device 1500A stores instructions for performing the functions of the first determination module and the construction module, and the memory 1506 of the computing device 1500B stores instructions for performing the functions of the query module.
The connection manner between the computing devices of the cluster shown in Figure 17 takes into account that the data point query method provided by the embodiments of this application requires a large amount of computation on data; it is therefore considered appropriate to have the functions implemented by the first determination module and the construction module performed by the computing device 1500A.
It should be understood that the functions of the computing device 1500A shown in Figure 17 may also be performed by multiple computing devices 1500. Likewise, the functions of the computing device 1500B may also be performed by multiple computing devices 1500.
An embodiment of this application further provides another computing device cluster. The connection relationship between the computing devices in this cluster may be similar to the connection manner of the computing device clusters described with reference to Figures 16 and 17. The difference is that the memories 1506 of one or more computing devices 1500 in this computing device cluster may store the same instructions for performing the data point query method provided by the embodiments of this application.
In some possible implementations, the memories 1506 of one or more computing devices 1500 in this computing device cluster may each store part of the instructions for performing the data point query method provided by the embodiments of this application. In other words, a combination of one or more computing devices 1500 may jointly execute the instructions for performing the data point query method provided by the embodiments of this application.
An embodiment of this application further provides a computer program product containing instructions. The computer program product may be software or a program product containing instructions that can run on a computing device or be stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to perform the data point query method provided by the embodiments of this application.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store, or a data storage device such as a data center containing one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform the data point query method provided by the embodiments of this application.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of protection of the technical solutions of the embodiments of this application.

Claims (45)

1. A data point query method, wherein the method comprises:
determining a target scale function from multiple scale functions based on a target quantile corresponding to a target data point to be queried, wherein the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size;
constructing a target sketch based on the target scale function and the multiple data points, wherein the target sketch comprises multiple clusters, each cluster comprises a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster;
querying the target data point based on the target sketch.
2. The method according to claim 1, wherein the multiple scale functions comprise a first scale function and a second scale function, the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval;
the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point comprises:
if the target quantile lies in the first quantile interval, determining the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determining the second scale function as the target scale function.
3. The method according to claim 2, wherein the first quantile interval comprises an interval from 0 to x1 and an interval from x2 to 1, x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval comprises an interval from x1 to x2.
4. The method according to any one of claims 1-3, wherein the querying the target data point based on the target sketch comprises:
querying the data value of the target data point based on the target sketch and the target quantile.
5. The method according to claim 4, wherein before the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, the method further comprises:
receiving a data point query request, wherein the data point query request is used to query the data value of a target data point among multiple data points and carries a standard quantile of the target data point;
determining the standard quantile carried in the data point query request as the target quantile.
6. The method according to claim 4, wherein the method further comprises:
receiving an equal-height histogram query request, wherein the equal-height histogram query request is used to query an equal-height histogram constructed based on the multiple data points and carries a bucket count h, h being an integer greater than 1;
determining, based on the bucket count h and the total number of the multiple data points, the quantiles of the first to the (h-1)-th buckets from left to right in the equal-height histogram, to obtain h-1 quantiles;
taking each of the h-1 quantiles in turn as the target quantile and performing the operation of determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values corresponding one-to-one to the h-1 quantiles;
drawing the equal-height histogram based on the h-1 data values and the data value of the maximum data point and the data value of the minimum data point among the multiple data points.
7. The method according to any one of claims 1-3, wherein before the determining a target scale function from multiple scale functions based on the target quantile corresponding to the target data point to be queried, the method further comprises:
determining an estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, and using the estimated quantile as the target quantile;
the querying the target data point based on the target sketch comprises:
querying the standard quantile of the target data point based on the target sketch and the data value of the target data point.
8. The method according to claim 7, wherein before the determining an estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, the method further comprises:
receiving a quantile query request, wherein the quantile query request is used to query the standard quantile of a target data point among multiple data points and carries the data value of the target data point.
9. The method according to claim 7, wherein the method further comprises:
receiving an equal-width histogram query request, wherein the equal-width histogram query request is used to query an equal-width histogram constructed based on the multiple data points and carries a bucket boundary array, the bucket boundary array comprises n boundary values, and the n boundary values divide the range between the data value of the minimum data point and the data value of the maximum data point among the multiple data points into n+1 intervals;
taking each of the n boundary values in turn as the data value of the target data point and performing the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data value of the maximum data point and the data value of the minimum data point among the multiple data points, to obtain n standard quantiles corresponding one-to-one to the n boundary values;
drawing the equal-width histogram based on the n standard quantiles corresponding one-to-one to the n boundary values.
10. The method according to any one of claims 1-9, wherein after the constructing a target sketch based on the target scale function and the multiple data points, the method further comprises:
generating a to-be-updated cluster corresponding to a to-be-updated data point in a cache, wherein the to-be-updated cluster comprises a cluster mean, a cluster weight, and a cluster tag, the cluster mean of the to-be-updated cluster indicates the data value of the to-be-updated data point, the cluster weight of the to-be-updated cluster indicates the number of to-be-updated data points, and the cluster tag of the to-be-updated cluster indicates the update type of the to-be-updated data point;
updating the target sketch based on the to-be-updated cluster.
11. The method according to claim 10, wherein the updating the target sketch based on the to-be-updated cluster comprises:
obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-merged tag, to obtain to-be-merged clusters;
merging the to-be-merged clusters into the target sketch.
12. The method according to claim 11, wherein the merging the to-be-merged clusters into the target sketch comprises:
sorting the clusters in the target sketch and the to-be-merged clusters in ascending order of cluster mean;
for the first cluster after sorting, determining a quantile threshold based on the target scale function, traversing each cluster starting from the second cluster after sorting, and performing the following operations on each cluster in turn:
for an i-th cluster, determining the current quantile of the i-th cluster based on the cluster weight of the i-th cluster, i being an integer greater than 1;
if the current quantile of the i-th cluster is below the quantile threshold, merging the i-th cluster into the previous cluster and continuing the traversal from the previous cluster;
if the current quantile of the i-th cluster exceeds the quantile threshold, updating the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traversing the next cluster.
13. The method according to claim 10, wherein the updating the target sketch based on the to-be-updated cluster comprises:
obtaining, from the to-be-updated clusters, the to-be-updated clusters whose cluster tag is a to-be-deleted tag, to obtain to-be-deleted clusters;
deleting the to-be-deleted clusters from the target sketch.
14. The method according to claim 13, wherein the deleting the to-be-deleted clusters from the target sketch comprises:
sorting the clusters in the target sketch and the to-be-deleted clusters in ascending order of cluster mean;
traversing each cluster starting from the first cluster after sorting, and performing the following operations on each cluster in turn:
for a j-th cluster, determining the cluster tag of the j-th cluster, and if the cluster tag of the j-th cluster is a to-be-deleted tag, deleting the j-th cluster and updating the cluster weights of the clusters adjacent to the j-th cluster, j being an integer greater than or equal to 1.
15. The method according to claim 14, wherein the updating the cluster weights of the clusters adjacent to the j-th cluster comprises:
if the j-th cluster is an intermediate cluster after sorting, obtaining the cluster mean of the left neighbouring cluster of the j-th cluster and the cluster mean of the right neighbouring cluster of the j-th cluster;
determining, based on the cluster mean of the left neighbouring cluster, the cluster mean of the right neighbouring cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left neighbouring cluster and a deletion weight corresponding to the right neighbouring cluster;
updating the cluster weight of the left neighbouring cluster based on the deletion weight corresponding to the left neighbouring cluster, and updating the cluster weight of the right neighbouring cluster based on the deletion weight corresponding to the right neighbouring cluster.
16. The method according to any one of claims 1-15, wherein the constructing a target sketch based on the target scale function and the multiple data points comprises:
obtaining a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch;
constructing a sketch based on the data points of the multiple data points other than the part of the data points and the target scale function, to obtain a second sketch;
aggregating the first sketch and the second sketch to obtain the target sketch.
17. The method according to claim 16, wherein the obtaining a sketch that has already been cached based on part of the multiple data points and the target scale function, to obtain a first sketch, comprises:
obtaining a target time window to be queried, wherein the target data points are data points whose timestamps lie within the target time window;
obtaining a metadata set, wherein the metadata set comprises the metadata of multiple sketches in a cache, the multiple sketches are sketches constructed based on the target scale function, the metadata of each sketch comprises a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
determining first metadata from the metadata set based on the target time window and the timeline to which the target data points belong, wherein the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data points belong;
determining the sketch corresponding to the first metadata as the first sketch.
18. The method according to claim 17, wherein after the constructing a sketch based on the data points of the multiple data points other than the part of the data points and the target scale function, to obtain a second sketch, the method further comprises:
determining the metadata of the second sketch, to obtain second metadata;
caching the second sketch, and adding the second metadata to the metadata set.
19. The method according to claim 17, wherein the method further comprises:
determining the timestamp of a data point to be written and the identifier of the timeline to which the data point to be written belongs;
if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, deleting the sketch corresponding to the third metadata, and updating the metadata set.
20. The method according to claim 17, wherein the metadata set further comprises first usage information corresponding to any sketch timeline identifier, and the first usage information is used to record the usage time of each of the multiple sketches matching the any sketch timeline identifier; the method further comprises:
determining, based on the first usage information, a sketch to be evicted among the multiple sketches matching the any sketch timeline identifier, and deleting the sketch to be evicted.
21. The method according to claim 17, wherein the metadata set further comprises second usage information, the second usage information is used to record the usage information corresponding to each of the multiple sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier; the method further comprises:
determining, based on the second usage information, a sketch timeline identifier to be evicted among the multiple sketch timeline identifiers;
deleting the sketches matching the sketch timeline identifier to be evicted.
22. A data point query apparatus, wherein the apparatus comprises:
a first determination module, configured to determine a target scale function from multiple scale functions based on a target quantile corresponding to a target data point to be queried, wherein the density of the clusters in the sketches constructed by different scale functions among the multiple scale functions is different, and the target quantile indicates the position of the target data point among multiple data points sorted by size;
a construction module, configured to construct a target sketch based on the target scale function and the multiple data points, wherein the target sketch comprises multiple clusters, each cluster comprises a cluster mean and a cluster weight, the cluster mean indicates the mean of the data points clustered into the corresponding cluster, and the cluster weight indicates the number of data points clustered into the corresponding cluster;
a query module, configured to query the target data point based on the target sketch.
23. The apparatus according to claim 22, wherein the multiple scale functions comprise a first scale function and a second scale function, the clusters in a sketch constructed based on the first scale function are denser over a first quantile interval than the clusters in a sketch constructed based on the second scale function are over the first quantile interval, and the clusters in the sketch constructed based on the first scale function are less dense over a second quantile interval than the clusters in the sketch constructed based on the second scale function are over the second quantile interval;
the first determination module is configured to:
if the target quantile lies in the first quantile interval, determine the first scale function as the target scale function;
if the target quantile lies in the second quantile interval, determine the second scale function as the target scale function.
24. The apparatus according to claim 23, wherein the first quantile interval comprises an interval from 0 to x1 and an interval from x2 to 1, x1 and x2 are both greater than 0 and less than 1, and x1 is less than x2;
the second quantile interval comprises an interval from x1 to x2.
  25. The device according to any one of claims 22 to 24, wherein the query module is configured to:
    query a data value of the target data point based on the target sketch and the target quantile.
  26. The device according to claim 25, wherein the device further includes:
    a receiving module, configured to receive a data point query request, wherein the data point query request is used to query the data value of a target data point among a plurality of data points, and the data point query request carries a standard quantile of the target data point;
    the first determination module is further configured to determine the standard quantile carried in the data point query request as the target quantile.
  27. The device according to claim 25, wherein the device further includes:
    a receiving module, configured to receive an equal-height histogram query request, wherein the equal-height histogram query request is used to query an equal-height histogram constructed based on the plurality of data points, and the equal-height histogram query request carries a bucket count h, where h is an integer greater than 1;
    the first determination module is further configured to determine, based on the bucket count h and the total number of the plurality of data points, the quantiles of the first bucket to the (h-1)-th bucket, counted from left to right, in the equal-height histogram, to obtain h-1 quantiles;
    the query module is further configured to take each of the h-1 quantiles as the target quantile in turn and perform the operation of determining the target scale function from the plurality of scale functions based on the target quantile corresponding to the target data point to be queried, to obtain h-1 data values in one-to-one correspondence with the h-1 quantiles;
    the device further includes a drawing module, configured to draw the equal-height histogram based on the h-1 data values and the data values of the largest data point and the smallest data point among the plurality of data points.
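
Illustrative only: how the bucket boundaries of an equal-height histogram could be obtained from claim 27. The helper query_value_at_quantile is a placeholder for the claimed flow (pick the scale function for the quantile, build the target sketch, read off the data value); the k/h quantile formula assumes buckets that each hold an equal share of the points.

    def equal_height_boundaries(points, h, query_value_at_quantile):
        # h-1 interior quantiles: the right edge of bucket k covers k/h of the points
        quantiles = [k / h for k in range(1, h)]
        inner = [query_value_at_quantile(q) for q in quantiles]
        # h+1 bucket edges: minimum, the h-1 queried values, maximum
        return [min(points)] + inner + [max(points)]
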
  28. The device according to any one of claims 22 to 24, wherein the first determination module is further configured to:
    determine an estimated quantile of the target data point based on the data value of the target data point and the data values of the largest data point and the smallest data point among the plurality of data points, and take the estimated quantile as the target quantile;
    the query module is configured to:
    query a standard quantile of the target data point based on the target sketch and the data value of the target data point.
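
Illustrative only: a possible reading of claim 28. The estimated quantile is a coarse linear interpolation between the smallest and largest data values and is used only to pick the scale function; the standard quantile is then read from the target sketch. The half-weight treatment of the boundary cluster is an assumption, and the clusters reuse the Cluster class sketched after claim 22.

    def estimated_quantile(value: float, v_min: float, v_max: float) -> float:
        # coarse estimate used only to choose the scale function
        if v_max == v_min:
            return 0.0
        return (value - v_min) / (v_max - v_min)

    def standard_quantile(sketch, value: float) -> float:
        # accumulate cluster weights below `value`; clusters are sorted by mean
        total = sum(c.weight for c in sketch)
        covered = 0.0
        for c in sketch:
            if c.mean < value:
                covered += c.weight
            else:
                covered += c.weight / 2.0   # assume half of the boundary cluster lies below
                break
        return covered / total
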
  29. The device according to claim 28, wherein the device further includes:
    a receiving module, configured to receive a quantile query request, wherein the quantile query request is used to query the standard quantile of a target data point among the plurality of data points, and the quantile query request carries the data value of the target data point.
  30. The device according to claim 28, wherein the device further includes:
    a receiving module, configured to receive an equal-width histogram query request, wherein the equal-width histogram query request is used to query an equal-width histogram constructed based on the plurality of data points, and the equal-width histogram query request carries a bucket boundary array, the bucket boundary array includes n boundary values, and the n boundary values divide the range between the data value of the smallest data point and the data value of the largest data point among the plurality of data points into n+1 intervals;
    the query module is configured to take each of the n boundary values as the data value of the target data point in turn and perform the operation of determining the estimated quantile of the target data point based on the data value of the target data point and the data values of the largest data point and the smallest data point among the plurality of data points, to obtain n standard quantiles in one-to-one correspondence with the n boundary values;
    the device further includes a drawing module, configured to draw the equal-width histogram based on the n standard quantiles in one-to-one correspondence with the n boundary values.
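
Illustrative only: turning the n standard quantiles of claim 30 into bucket frequencies for an equal-width histogram. query_quantile_at_value stands in for the claimed flow (estimate the quantile from the boundary value, pick the scale function, build the sketch, read the standard quantile); treating consecutive quantile differences as bucket heights is an assumption about how the drawing module uses the result.

    def equal_width_frequencies(boundaries, query_quantile_at_value):
        # n boundary values -> n standard quantiles -> n+1 bucket frequencies
        qs = [0.0] + [query_quantile_at_value(v) for v in boundaries] + [1.0]
        return [qs[k + 1] - qs[k] for k in range(len(qs) - 1)]
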
  31. The device according to any one of claims 22 to 30, wherein the device further includes:
    a generation module, configured to generate clusters to be updated corresponding to data points to be updated in a cache, wherein each cluster to be updated includes a cluster mean, a cluster weight and a cluster tag, the cluster mean of a cluster to be updated indicates the data value of the data point to be updated, the cluster weight of the cluster to be updated indicates the number of data points to be updated, and the cluster tag of the cluster to be updated indicates the update type of the data points to be updated;
    an update module, configured to update the target sketch based on the clusters to be updated.
  32. The device according to claim 31, wherein the update module is configured to:
    obtain, from the clusters to be updated, the clusters whose cluster tag is a to-be-merged tag, to obtain clusters to be merged;
    merge the clusters to be merged into the target sketch.
  33. The device according to claim 32, wherein the update module is configured to:
    sort the clusters in the target sketch and the clusters to be merged in ascending order of cluster mean;
    for the first sorted cluster, determine a quantile threshold based on the target scale function, traverse each cluster starting from the second sorted cluster, and perform the following operations on each cluster in turn:
    for the i-th cluster, determine a current quantile of the i-th cluster based on the cluster weight of the i-th cluster, where i is an integer greater than 1;
    if the current quantile of the i-th cluster is below the quantile threshold, merge the i-th cluster into the previous cluster and continue the traversal from the previous cluster;
    if the current quantile of the i-th cluster exceeds the quantile threshold, update the quantile threshold based on the current quantile of the i-th cluster and the target scale function, and traverse the next cluster.
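
Illustrative only: a compact version of the merge traversal in claims 32 and 33, reusing the Cluster class and scale functions sketched earlier. Expressing the quantile threshold as "the scale function may grow by at most 1 since the last retained cluster" is an assumption; the claims only require that a cluster below the threshold is merged backwards and that the threshold is updated otherwise.

    def merge_into_sketch(sketch, to_merge, scale):
        clusters = sorted(sketch + to_merge, key=lambda c: c.mean)
        total = sum(c.weight for c in clusters)
        merged = [Cluster(clusters[0].mean, clusters[0].weight)]
        q_seen = clusters[0].weight / total
        k_low = scale(0.0)                        # threshold anchor for the first cluster
        for c in clusters[1:]:
            q_cur = q_seen + c.weight / total     # current quantile of the i-th cluster
            if scale(q_cur) - k_low <= 1.0:       # below the threshold: merge into the previous cluster
                last = merged[-1]
                w = last.weight + c.weight
                last.mean = (last.mean * last.weight + c.mean * c.weight) / w
                last.weight = w
            else:                                 # threshold exceeded: keep the cluster, update the threshold
                merged.append(Cluster(c.mean, c.weight))
                k_low = scale(q_seen)
            q_seen = q_cur
        return merged
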
  34. The device according to claim 31, wherein the update module is configured to:
    obtain, from the clusters to be updated, the clusters whose cluster tag is a to-be-deleted tag, to obtain clusters to be deleted;
    delete the clusters to be deleted from the target sketch.
  35. The device according to claim 34, wherein the update module is configured to:
    sort the clusters in the target sketch and the clusters to be deleted in ascending order of cluster mean;
    traverse each cluster starting from the first sorted cluster, and perform the following operations on each cluster in turn:
    for the j-th cluster, determine the cluster tag of the j-th cluster; if the cluster tag of the j-th cluster is the to-be-deleted tag, delete the j-th cluster and update the cluster weights of the clusters adjacent to the j-th cluster, where j is an integer greater than or equal to 1.
  36. The device according to claim 35, wherein the update module is configured to:
    if the j-th cluster is a middle cluster after sorting, obtain the cluster mean of the left adjacent cluster of the j-th cluster and the cluster mean of the right adjacent cluster of the j-th cluster;
    determine, based on the cluster mean of the left adjacent cluster, the cluster mean of the right adjacent cluster, and the cluster mean and cluster weight of the j-th cluster, a deletion weight corresponding to the left adjacent cluster and a deletion weight corresponding to the right adjacent cluster, respectively;
    update the cluster weight of the left adjacent cluster based on the deletion weight corresponding to the left adjacent cluster, and update the cluster weight of the right adjacent cluster based on the deletion weight corresponding to the right adjacent cluster.
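
Illustrative only: one way a middle cluster can be deleted under claims 35 and 36. Splitting the deletion weight between the neighbours in inverse proportion to the distance of their means is an assumption; the claims only state that deletion weights for the left and right neighbours are derived from the three cluster means and the deleted cluster's weight, and that both neighbours' weights are updated.

    def delete_middle_cluster(left, mid, right):
        span = right.mean - left.mean
        share_left = (right.mean - mid.mean) / span if span else 0.5
        delete_left = mid.weight * share_left        # deletion weight for the left neighbour
        delete_right = mid.weight - delete_left      # deletion weight for the right neighbour
        left.weight = max(left.weight - delete_left, 0.0)
        right.weight = max(right.weight - delete_right, 0.0)
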
  37. The device according to any one of claims 22 to 36, wherein the construction module is configured to:
    obtain a sketch that has already been cached and that is based on some of the plurality of data points and the target scale function, to obtain a first sketch;
    construct a sketch based on the data points among the plurality of data points other than those some data points and on the target scale function, to obtain a second sketch;
    aggregate the first sketch and the second sketch to obtain the target sketch.
  38. The device according to claim 37, wherein the construction module is configured to:
    obtain a target time window to be queried, wherein the target data point is a data point whose timestamp is located within the target time window;
    obtain a metadata set, wherein the metadata set includes metadata of a plurality of sketches in the cache, the plurality of sketches are sketches constructed based on the target scale function, the metadata of each sketch includes a sketch time window and a sketch timeline identifier, the sketch time window is the time window corresponding to the timestamps of the data points from which the corresponding sketch was constructed, and the sketch timeline identifier is the identifier of the timeline to which the data points from which the corresponding sketch was constructed belong;
    determine first metadata from the metadata set based on the target time window and the timeline to which the target data point belongs, wherein the sketch time window in the first metadata is part or all of the target time window, and the sketch timeline identifier in the first metadata is the same as the identifier of the timeline to which the target data point belongs;
    determine the sketch corresponding to the first metadata as the first sketch.
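
Illustrative only: a minimal lookup over the metadata set of claim 38. SketchMeta and find_first_sketches are hypothetical names; the filter keeps exactly the cached sketches whose timeline identifier matches and whose sketch time window is part or all of the target time window, and these become the first sketch(es) to be aggregated with the freshly built second sketch.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SketchMeta:
        timeline_id: str
        window_start: int   # sketch time window, e.g. epoch seconds
        window_end: int
        sketch_key: str     # cache key of the sketch this metadata describes

    def find_first_sketches(metadata_set: List[SketchMeta], timeline_id: str,
                            target_start: int, target_end: int) -> List[SketchMeta]:
        return [m for m in metadata_set
                if m.timeline_id == timeline_id
                and m.window_start >= target_start
                and m.window_end <= target_end]
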
  39. The device according to claim 38, wherein the device further includes:
    a second determination module, configured to determine metadata of the second sketch to obtain second metadata;
    a caching module, configured to cache the second sketch and add the second metadata to the metadata set.
  40. The device according to claim 38, wherein the device further includes:
    a third determination module, configured to determine a timestamp of a data point to be written and an identifier of the timeline to which the data point to be written belongs;
    a first deletion module, configured to: if the timestamp of the data point to be written and the identifier of the timeline to which it belongs match third metadata in the metadata set, delete the sketch corresponding to the third metadata and update the metadata set.
  41. The device according to claim 38, wherein the metadata set further includes first usage information corresponding to any one sketch timeline identifier, and the first usage information is used to record the usage time of each of the plurality of sketches matching that sketch timeline identifier; the device further includes:
    a second deletion module, configured to determine, based on the first usage information, a sketch to be eliminated among the plurality of sketches matching that sketch timeline identifier, and delete the sketch to be eliminated.
  42. The device according to claim 38, wherein the metadata set further includes second usage information, the second usage information is used to record usage information corresponding to each of a plurality of sketch timeline identifiers in the metadata set, and the usage information corresponding to each sketch timeline identifier indicates the usage time of the sketches matching the corresponding sketch timeline identifier; the device further includes:
    a third deletion module, configured to determine, based on the second usage information, a sketch timeline identifier to be eliminated among the plurality of sketch timeline identifiers, and delete the sketches matching the sketch timeline identifier to be eliminated.
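
Illustrative only: an eviction step matching claims 41 and 42 under the assumption that the recorded usage time is used as a least-recently-used criterion. second_usage_info maps each sketch timeline identifier to its last usage time, sketches_by_timeline maps it to the cache keys of the matching sketches, and cache holds the sketches themselves; all three structures are hypothetical.

    def evict_lru_timeline(second_usage_info: dict, sketches_by_timeline: dict, cache: dict) -> str:
        # the timeline identifier with the oldest usage time is eliminated
        victim = min(second_usage_info, key=second_usage_info.get)
        for key in sketches_by_timeline.pop(victim, []):
            cache.pop(key, None)        # delete every sketch matching the eliminated identifier
        del second_usage_info[victim]
        return victim
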
  43. A computing device cluster, characterized in that it includes at least one computing device, and each computing device includes a processor and a memory;
    the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method according to any one of claims 1 to 21.
  44. A computer program product containing instructions, characterized in that, when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 21.
  45. A computer-readable storage medium, characterized in that it includes computer program instructions, and when the computer program instructions are executed by a computing device cluster, the computing device cluster performs the method according to any one of claims 1 to 21.
PCT/CN2023/086007 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product, and storage medium WO2024016731A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210855232 2022-07-19
CN202210855232.X 2022-07-19
CN202211091505.4A CN117472975A (en) 2022-07-19 2022-09-07 Data point query method, data point query device cluster, data point query program product and data point query storage medium
CN202211091505.4 2022-09-07

Publications (1)

Publication Number Publication Date
WO2024016731A1 true WO2024016731A1 (en) 2024-01-25

Family

ID=89616930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086007 WO2024016731A1 (en) 2022-07-19 2023-04-03 Data point query method and apparatus, device cluster, program product, and storage medium

Country Status (1)

Country Link
WO (1) WO2024016731A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180088813A1 (en) * 2016-09-23 2018-03-29 Samsung Electronics Co., Ltd. Summarized data storage management system for streaming data
US10248476B2 (en) * 2017-05-22 2019-04-02 Sas Institute Inc. Efficient computations and network communications in a distributed computing environment
CN108388603A (en) * 2018-02-05 2018-08-10 中国科学院信息工程研究所 The construction method and querying method of distributed summary data structure based on Spark frames
CN110968835A (en) * 2019-12-12 2020-04-07 清华大学 Approximate quantile calculation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAIDU GEEK SPEAKING: "A System and Method Based on Real-time Quantile Calculation", CSDN BLOG, 27 May 2021 (2021-05-27), XP093131717, Retrieved from the Internet <URL:https://blog.csdn.net/lihui49/article/details/117250392> [retrieved on 20240215] *
LOVE TO EAT CORIANDER AND SCALLION: "T-digest", CSDN BLOG, 20 July 2020 (2020-07-20), XP093131713, Retrieved from the Internet <URL:https://blog.csdn.net/qq_41648804/article/details/107474870> [retrieved on 20240215] *

Similar Documents

Publication Publication Date Title
US7603339B2 (en) Merging synopses to determine number of distinct values in large databases
US7636731B2 (en) Approximating a database statistic
US10042914B2 (en) Database index for constructing large scale data level of details
EP2997472B1 (en) Managing memory and storage space for a data operation
CN114168608B (en) Data processing system for updating knowledge graph
CN111061758B (en) Data storage method, device and storage medium
CN112925821B (en) MapReduce-based parallel frequent item set incremental data mining method
Awad et al. Dynamic graphs on the GPU
CN105045806A (en) Dynamic splitting and maintenance method of quantile query oriented summary data
CN112925859A (en) Data storage method and device
WO2015168988A1 (en) Data index creation method and device, and computer storage medium
Beyer et al. Distinct-value synopses for multiset operations
Hershberger et al. Adaptive sampling for geometric problems over data streams
CN108829343B (en) Cache optimization method based on artificial intelligence
CN116756494B (en) Data outlier processing method, apparatus, computer device, and readable storage medium
AU2020101071A4 (en) A Parallel Association Mining Algorithm for Analyzing Passenger Travel Characteristics
WO2024016731A1 (en) Data point query method and apparatus, device cluster, program product, and storage medium
Wang et al. Stull: Unbiased online sampling for visual exploration of large spatiotemporal data
JP6006740B2 (en) Index management device
US11520834B1 (en) Chaining bloom filters to estimate the number of keys with low frequencies in a dataset
CN117472975A (en) Data point query method, data point query device cluster, data point query program product and data point query storage medium
CN107846327A (en) A kind of processing method and processing device of network management performance data
Nabil et al. Mining frequent itemsets from online data streams: Comparative study
CN110990394A (en) Distributed column database table-oriented line number statistical method and device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23841804

Country of ref document: EP

Kind code of ref document: A1