CN108304409B

CN108304409B - Carry-based data frequency estimation method of Sketch data structure

Info

Publication number: CN108304409B
Application number: CN201710024141.0A
Authority: CN
Inventors: 杨仝; 姜雨萌; 李晓明
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2017-01-13
Filing date: 2017-01-13
Publication date: 2021-11-16
Anticipated expiration: 2037-01-13
Also published as: CN108304409A

Abstract

The invention relates to a carry-based data frequency estimation method for a Sketch data structure. The method comprises the following steps: 1) establishing a Sketch data structure, which is a two-dimensional array of counters in which each position is an n-bit counter, and dividing the n-bit space of each counter into flag bits and count bits; 2) when an update operation is performed, mapping the data item into the two-dimensional array through hash functions, counting with the count bits, and using the flag bits as a carry when the count bits reach their upper limit; 3) when a query operation is performed, returning the minimum of the query values from each row of the two-dimensional array as the query result. The method can use either fixed flag bits or multi-level dynamic flag bits. With the counter size unchanged, the invention significantly raises the upper limit of the count and can improve counting accuracy.

Description

Carry-based data frequency estimation method of Sketch data structure

Technical Field

The invention relates to several important fields, including network security, financial analysis, machine learning, and natural language processing, and in particular to a carry-based data frequency estimation method for a Sketch data structure.

Background

Currently, the Count-Min Sketch (Graham Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications [M]) is the most widely used sketch, with the best performance and the most general fit to various kinds of data. It is lightweight, simple, and fast at real-time counting, scales well, and has low storage and computational complexity.

However, as a lightweight data structure (Y. Wang, Y. Zu, et al. Wire Speed Name Lookup: A GPU-based Approach. In Proc. USENIX NSDI, pages 199-. Meanwhile, the design of the data structure is simple, so the upper limit of the values it can store is very limited.

Disclosure of Invention

To overcome this inherent limitation of the conventional Count-Min Sketch counting scheme, the invention provides a counting method that raises the upper limit of the value range that a given number of bits can represent.

The technical scheme adopted by the invention is as follows:

A carry-based data frequency estimation method for a Sketch data structure comprises the following steps:

1) establishing a Sketch data structure, which is a two-dimensional array of counters in which each position is an n-bit counter, and dividing the n-bit space of each counter into flag bits and count bits;

2) when an update operation is performed, mapping the data item into the two-dimensional array through hash functions, counting with the count bits, and using the flag bits as a carry when the count bits reach their upper limit;

3) when a query operation is performed, returning the minimum of the query values from each row of the two-dimensional array as the query result.

Further, step 1) may use fixed flag bits: the high x bits of the counter's n-bit space serve as flag bits, and the remaining n-x bits serve as count bits.

Alternatively, step 1) may use multi-level dynamic flag bits, in which the number of flag bits and the number of count bits are adjusted dynamically according to the stored value.

A statistical method for query string frequency comprises the following steps:

1) recording, with the Sketch data structure of claim 1, the number of occurrences of the search string used in each user search;

2) for each query string, obtaining a query value for its number of occurrences from the Sketch data structure, and thereby obtaining the k query strings with the most occurrences.

Further, if the query value obtained for a query string in step 2) is too small to rank among the k most frequent query strings, its true value need not be fetched from the off-chip hash table.

The invention has the beneficial effects that:

With the counter size unchanged, the upper limit of the count is significantly raised. Conversely, if the upper limit of the count is held constant, smaller counters can be used, so more counters fit in the same space and counting accuracy improves. Because the invention is an improvement on the Count-Min Sketch, it applies to all use cases of the Count-Min Sketch, including natural language processing, data stream statistics, pointwise mutual information calculation, sparse approximation in compressed sensing, detection of abnormal network traffic, distributed data set processing, and the like.

Drawings

FIG. 1 is a diagram of a fixed flag bit version of the present invention when performing an update operation.

Fig. 2 is a schematic diagram of a multi-level dynamic flag bit version counter, showing the distinction between flag bits and count bits.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

The technical scheme adopted by the invention is divided into 2 versions:

Fixed flag bit version of Carry-In Sketch

1) Data structure

Carry-In Sketch has the same data structure as CM Sketch: a two-dimensional array of counters with width w and height d, C[1,1]…C[d,w]. Each position is an n-bit counter initialized to 0. In this n-bit space, the high x (x<n) bits are used as flag bits and the remaining (n-x) bits are used as count bits. Furthermore, we need d pairwise independent hash functions h₁…h_d: {1…∞} → {1…w}, that is, each hash function maps any positive integer to a value from 1 to w. We also define the expansion coefficient m.
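The bit layout described above can be sketched in Python as follows. This is only an illustration: the values n = 8 and x = 2 and the helper names `split`, `pack`, and `new_sketch` are assumptions made for the example and do not come from the patent.

```python
N_BITS = 8                        # n: total bits per counter (illustrative)
FLAG_BITS = 2                     # x: high bits used as flag bits (assumed value)
COUNT_BITS = N_BITS - FLAG_BITS   # n-x: low bits used as count bits

def split(counter):
    """Return the (flag, count) fields stored in one n-bit counter."""
    flag = counter >> COUNT_BITS
    count = counter & ((1 << COUNT_BITS) - 1)
    return flag, count

def pack(flag, count):
    """Pack the flag and count fields back into one n-bit counter."""
    return (flag << COUNT_BITS) | count

def new_sketch(d, w):
    """Build d rows by w columns of counters, all initialized to 0."""
    return [[0] * w for _ in range(d)]

sketch = new_sketch(4, 16)
example = pack(1, 5)              # flag = 1, count = 5, i.e. 0b01000101
```

Keeping both fields in a single integer mirrors the n-bit counter of the text; an implementation could equally store them separately for readability.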

2) Operation of

Update: when an update request (k, c) arrives, we need to insert the element with key value k into the Sketch c times. Each time, we perform one insert operation on each row of the Carry-In Sketch. For row r, we first locate the element's position using h_r(k), the value to which the r-th hash function maps k. We then perform one insert operation at position C[r,h_r(k)], using the flag bits as a carry. While the flag bits of a counter are 0, an insert operation simply adds 1 to the count bits. When the count bits reach their upper limit 2^(n-x), we add 1 to the flag bits and reset the count bits to 0. From then on, each insertion adds 1 to the count bits only with probability 1/m. Whenever the count bits reach their upper limit again, we repeat the carry: add 1 to the flag bits and reset the count bits to 0.

FIG. 1 is a diagram of the fixed flag bit version of the present invention performing an update operation. There are 4 rows; in each row, the counter corresponding to the element is located using h_r(k) and then operated on. When the flag bits are 0, the count bits are incremented by 1 on every insertion; otherwise, the count bits are incremented by 1 with probability 1/m.
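The fixed flag bit update can be sketched in Python as below. Several details are assumptions made for illustration rather than part of the patent: the parameter values (n = 8, x = 2, m = 16), storing each counter as a (flag, count) tuple, saturating once the flag bits are full, and the stand-in hash functions, which are not guaranteed to be pairwise independent.

```python
import random

N_BITS, FLAG_BITS, M = 8, 2, 16      # n, x, m: illustrative values only
COUNT_BITS = N_BITS - FLAG_BITS
COUNT_LIMIT = 1 << COUNT_BITS        # 2^(n-x), the upper limit of the count bits
FLAG_MAX = (1 << FLAG_BITS) - 1

def insert_once(flag, count, rng=random):
    """Apply one insertion to a single counter held as (flag, count)."""
    if flag > 0 and rng.random() >= 1.0 / M:
        return flag, count               # after the first carry, count only with probability 1/m
    count += 1
    if count == COUNT_LIMIT:             # count bits reached their upper limit: carry
        if flag == FLAG_MAX:
            return flag, COUNT_LIMIT - 1  # assumption: saturate when the flag bits are full
        return flag + 1, 0
    return flag, count

def update(sketch, hashes, key, c=1, rng=random):
    """Insert `key` c times, touching one counter in every row each time."""
    for _ in range(c):
        for r, h in enumerate(hashes):
            col = h(key)
            sketch[r][col] = insert_once(*sketch[r][col], rng=rng)

D, W = 4, 32
# Stand-in hash functions (hypothetical; not guaranteed pairwise independent).
hashes = [lambda k, s=s: hash((s, k)) % W for s in range(D)]
sketch = [[(0, 0)] * W for _ in range(D)]
update(sketch, hashes, key=12345, c=10)
```

While the flag bits are 0 the counter behaves exactly like a plain counter, so small counts stay exact; only counts beyond 2^(n-x) become probabilistic.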

Query: when a query with a key value of k comes, we calculate:

C[1,h₁(k)], C[2,h₂(k)] … C[d,h_d(k)]:

if value(x flag bits) = 0

value(C[i,h_i(k)]) = value((n-x) count bits);

if value(x flag bits) > 0

value(C[i,h_i(k)]) = 2^(n-x) + m*2^(n-x)*(value(x flag bits)-1) + m*value((n-x) count bits).

In natural language, the above reads:

For each C[i,h_i(k)], the query value is calculated as follows:

1) if the value of the flag bit is 0, the value of the count bit is the query value.

2) If the flag bit is not 0, the query value is divided into two parts:

One part is the contribution of the count bits. Once the flag bits are nonzero, the count bits are expected to increase by 1 for every m insertions, so this part equals the expansion coefficient m times the value of the count bits.

The other part is the value represented by the flag bits. There are (n-x) count bits, and under the update procedure above, every time the count bits advance by 2^(n-x) the flag bits are incremented by 1 and the count bits return to 0. Growing the flag bits from 0 to 1 therefore takes 2^(n-x) insertions. Once the flag bits are nonzero, the count bits increase by 1 every m insertions in expectation, and incrementing the flag bits by 1 requires 2^(n-x) such increases, so each further increment takes m*2^(n-x) insertions. Therefore, if the value of the flag bits is v, this part is 2^(n-x) + m*2^(n-x)*(v-1).

Adding the two parts together gives the query value of C[i,h_i(k)].

value(element) in the above code denotes the value of an element. Finally, we return the smallest of the query values of all C[i,h_i(k)] as the final query result.
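The query rule for the fixed flag bit version can be sketched in Python as follows. The parameter values (n = 8, x = 2, m = 16), the (flag, count) tuple representation, and the names `query_value` and `query` are assumptions made for this illustration.

```python
N_BITS, FLAG_BITS, M = 8, 2, 16      # n, x, m: illustrative values only
COUNT_BITS = N_BITS - FLAG_BITS

def query_value(flag, count):
    """Estimated number of insertions represented by one counter."""
    base = 1 << COUNT_BITS            # 2^(n-x)
    if flag == 0:
        return count                  # no carry yet: the count bits are exact
    # 2^(n-x) insertions for the first carry, m*2^(n-x) for each later carry,
    # plus m insertions per unit currently held in the count bits.
    return base + M * base * (flag - 1) + M * count

def query(sketch, hashes, key):
    """Return the minimum query value over the d rows."""
    return min(query_value(*sketch[r][h(key)]) for r, h in enumerate(hashes))

largest = query_value((1 << FLAG_BITS) - 1, (1 << COUNT_BITS) - 1)
```

Under these assumed parameters, `largest` evaluates to 64 + 16*64*2 + 16*63 = 3120, compared with 255 for a plain 8-bit counter.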

Multi-stage dynamic mark bit version of Carry-In Sketch

1) Data structure

The data structure of this version is still a two-dimensional array C[d,w], but this time the flag bits are multi-level: the number of flag bits and the number of count bits are adjusted dynamically according to the stored value. The counter is scanned from the highest bit toward the lowest until the first bit with value 0 is found; all bits above this 0 are flag bits, and all bits below it are count bits. We again define the expansion coefficient as m. See Fig. 2.
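Decoding a counter under this dynamic layout can be sketched in Python as below. The counter width n = 8, the function name `split_dynamic`, and the handling of an all-ones counter are assumptions made for the example.

```python
N_BITS = 8   # n: total bits per counter (illustrative)

def split_dynamic(counter, n=N_BITS):
    """Return (x, count): x leading 1 bits are flag bits; the bits below
    the first 0 (scanning from the highest bit) are the count bits."""
    x = 0
    while x < n and (counter >> (n - 1 - x)) & 1:
        x += 1
    if x == n:
        return n, 0                   # assumption: all ones leaves no count bits
    count_width = n - x - 1           # the separating 0 itself is not a count bit
    count = counter & ((1 << count_width) - 1)
    return x, count

print(split_dynamic(0b00010110))      # no flag bits, 7 count bits
print(split_dynamic(0b11010110))      # 2 flag bits, 5 count bits
```

Both calls decode a count of 22; the second counter additionally carries two flag bits.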

2) Operation of

Update: the update operations of the multi-level flag bit Carry-In Sketch and the fixed flag bit Carry-In Sketch differ only in how +1 is applied to each counter. Assume the high x bits are flag bits, so the low (n-x-1) bits are count bits. On each insertion, we add 1 to the count bits with a probability determined by the expansion coefficient m. When the count bits are full and a carry is needed, we set the first 0 found above to 1 and then reset all count bits to 0. The flag bits are thus extended by one bit and the count bits shortened by one bit; in other words, the first 0 found when scanning from high to low moves one position to the right.
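The carry step of the dynamic version can be sketched in Python as follows. The probabilistic decision of whether to apply +1 is deliberately left out, so `bump` shows only what happens when a +1 is applied. The width n = 8, the function names, the choice to trigger a carry when +1 would overflow the count bits, and the saturation once no count bits remain are all assumptions made for illustration.

```python
N_BITS = 8   # n: total bits per counter (illustrative)

def leading_ones(counter, n=N_BITS):
    """Number of flag bits: consecutive 1 bits starting from the highest bit."""
    x = 0
    while x < n and (counter >> (n - 1 - x)) & 1:
        x += 1
    return x

def carry(counter, n=N_BITS):
    """Set the first 0 (from the top) to 1 and clear all count bits."""
    x = leading_ones(counter, n)
    if x >= n - 1:
        return counter                # assumption: no room left to extend, saturate
    return ((1 << (x + 1)) - 1) << (n - x - 1)

def bump(counter, n=N_BITS):
    """Apply +1 to the count bits, carrying when they are already full."""
    x = leading_ones(counter, n)
    if x >= n - 1:
        return counter                # assumption: saturated counter stays unchanged
    mask = (1 << (n - x - 1)) - 1     # mask of the count bits
    if counter & mask == mask:
        return carry(counter, n)
    return counter + 1

print(bin(bump(0b01111111)))          # count bits full: becomes 0b10000000
```

After a carry the bit just below the new flag bit is 0, so it automatically becomes the new separator and the count field is one bit shorter.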

Query: when a query with key value k arrives, we compute the values of C[1,h₁(k)], C[2,h₂(k)] … C[d,h_d(k)], each calculated as follows:

value = m^0*2^(n-1) + m^1*2^(n-2) + … + m^(x-1)*2^(n+1-x) + m^(x-1)*(value of counter bits)

where "value of counter bits" denotes the value of the count bits. After these values are calculated, the smallest is selected as the final query value.

The invention has the beneficial effect that under the condition that the size of the counter is not changed, the upper limit of the counting is obviously improved. When the number of bits n of each counter is 8 and the expansion coefficient m is 16, the specific effects are shown in tables 1 and 2:

Table 1. Fixed flag bit version:

Table 2. Dynamic flag bit version:

Application scenarios:

A search engine records in its log files every search string that users submit. Suppose there are currently a number of such records, each corresponding to one query for some query string, and the k hottest query strings must be found.

The conventional approach is to use a hash table to record the number of occurrences of each query string, then maintain a min-heap of size k while traversing all the query strings, which finally yields the k query strings with the most occurrences. A prior-art Count-Min Sketch (CM Sketch for short, likewise below) structure can be added on top of the hash table to speed up processing. The specific steps are as follows:

First, in practical application scenarios the hash table can be assumed to be large, so it must be placed off-chip, and off-chip access is slow relative to on-chip access. We can therefore add an on-chip CM Sketch structure to record the number of occurrences of each query string. The CM Sketch is small enough to fit on-chip, so it is fast to access (accessing the sketch takes much less time than accessing the hash table). By the nature of the CM Sketch, the value obtained by querying it is not exact, but it is never smaller than the true value. Therefore, if the query value obtained from the CM Sketch for a query string is too small to rank among the top k, the true value need not be fetched from the off-chip hash table; one off-chip access is avoided and processing efficiency improves.

However, it can still happen that the CM Sketch query value of a query string is large enough to rank among the top k while its true value is not. This situation should be minimized, which requires the sketch to be as accurate as possible for the same memory footprint. The Carry-In Sketch of the present invention provides exactly this improvement over the CM Sketch.

In a specific implementation, the Count-Min Sketch originally used in this scenario is replaced by the Carry-In Sketch, whose data structure and operations have been described in detail above.

Specific examples are as follows:

Assume there are 5 different query strings, a, b, c, d, and e, with frequencies 1000, 300, 200, 1200, and 400 respectively. In the original CM Sketch, a and c map to the same position, whose count is 1000+200 = 1200; b and d map to the same position, whose count is 300+1200 = 1500.

Now assume we traverse these strings in the order e, d, c, b, a to find the top 3, and that the 3 largest values found so far are 350, 340, and 330. At e, the query value 400 is large enough, so the hash table is queried and returns the true value 400; the top 3 becomes 400, 350, 340. At d, the query value is 1500 and the true value is 1200; the top 3 becomes 1200, 400, 350. At c, the query value is 1200, so the hash table is queried again, but the true value 200 is too small and c is ignored. Similarly, b is looked up and ignored. Finally, at a, the true value 1000 is obtained, and the final top 3 is 1200, 1000, 400. In this process the hash table is queried once for each of the 5 elements, i.e., 5 off-chip accesses.

If the Carry-In Sketch is used, the number of counters increases for the same space, so the elements are likely to map to different positions. Querying c then yields the value 200, which is too small to rank in the top 3, so its true value need not be fetched. The same applies to b. This saves 2 hash table queries, i.e., 2 off-chip accesses, thereby improving efficiency.
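The example above can be replayed with a short Python sketch that counts off-chip lookups. The function name `top_counts`, the dictionaries standing in for the sketch estimates and the off-chip hash table, and the idealized collision-free case (where every estimate equals the true count) are assumptions made for illustration; the heap tracks counts only, not the strings themselves.

```python
import heapq

def top_counts(order, estimate, true_count, initial_top):
    """Return (top counts in descending order, number of off-chip lookups)."""
    heap = list(initial_top)          # min-heap holding the current k largest counts
    heapq.heapify(heap)
    lookups = 0
    for q in order:
        if estimate[q] <= heap[0]:
            continue                  # estimate cannot enter the top k: skip the lookup
        lookups += 1
        real = true_count[q]          # stands in for the off-chip hash table access
        if real > heap[0]:
            heapq.heapreplace(heap, real)
    return sorted(heap, reverse=True), lookups

true_count = {"a": 1000, "b": 300, "c": 200, "d": 1200, "e": 400}
# CM Sketch estimates with the collisions from the example (a with c, b with d).
cm_estimate = {"a": 1200, "b": 1500, "c": 1200, "d": 1500, "e": 400}
# Idealized case: more counters in the same space, so no collisions occur.
no_collision = dict(true_count)

order = ["e", "d", "c", "b", "a"]
print(top_counts(order, cm_estimate, true_count, [350, 340, 330]))
print(top_counts(order, no_collision, true_count, [350, 340, 330]))
```

Both runs end with the same top 3 (1200, 1000, 400), but the first needs 5 lookups and the second only 3, matching the saving of 2 off-chip accesses described in the text.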

In the update step described above, the insert operation is performed on all d rows. In a variant, the minimum (possibly shared by several counters) among the d row results is found, and an insertion is performed only on the counter or counters holding this minimum; the remaining rows are left untouched. All other operations are unchanged. This variant applies to both versions described above.
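This minimum-only variant can be sketched in Python for the fixed flag bit version as below. The parameter values (n = 8, x = 2, m = 16), the (flag, count) tuple representation, the saturation rule, and the function names are assumptions made for this illustration, not details taken from the patent.

```python
import random

N_BITS, FLAG_BITS, M = 8, 2, 16      # n, x, m: illustrative values only
COUNT_BITS = N_BITS - FLAG_BITS
LIMIT = 1 << COUNT_BITS              # 2^(n-x)

def value(flag, count):
    """Query value of one counter in the fixed flag bit version."""
    return count if flag == 0 else LIMIT + M * LIMIT * (flag - 1) + M * count

def insert_once(flag, count, rng=random):
    """One insertion into a single counter, with carry into the flag bits."""
    if flag > 0 and rng.random() >= 1.0 / M:
        return flag, count
    count += 1
    if count == LIMIT:
        if flag == (1 << FLAG_BITS) - 1:
            return flag, LIMIT - 1    # assumption: saturate when the flag bits are full
        return flag + 1, 0
    return flag, count

def update_min_only(sketch, hashes, key, rng=random):
    """Insert only into the row counter(s) currently holding the minimum."""
    cells = [(r, h(key)) for r, h in enumerate(hashes)]
    smallest = min(value(*sketch[r][c]) for r, c in cells)
    for r, c in cells:
        if value(*sketch[r][c]) == smallest:
            sketch[r][c] = insert_once(*sketch[r][c], rng=rng)

sketch = [[(0, 5)], [(0, 2)], [(0, 2)]]
hashes = [lambda k: 0] * 3            # trivial stand-in hashes for a 1-column sketch
update_min_only(sketch, hashes, "k")
print(sketch)                         # only the two minimum counters advance
```

Counters that already overestimate the key are left alone, which limits how far collisions inflate the final minimum.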

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A data frequency estimation method of a carry-based Sketch data structure is characterized by comprising the following steps:

3) when the query operation is carried out, the minimum value in the query values of each row in the two-dimensional array is returned to be used as a query result;

wherein step 1) uses fixed flag bits, that is, the high x bits of the counter's n-bit space are used as flag bits and the remaining n-x bits as count bits; the update operation in step 2) is performed as follows: when an update request (k, c) arrives, the element with key value k is inserted into the Sketch c times, each time performing one insert operation on each row; for row r, the position of the element is first located using h_r(k), the value to which the r-th hash function maps k, and one insert operation is then performed at position C[r,h_r(k)], using the flag bits as a carry; while the flag bits of a counter are 0, the insert operation simply adds 1 to the count bits, until the count bits reach their upper limit 2^(n-x), at which point 1 is added to the flag bits and the count bits are set to 0; thereafter, for each insertion, 1 is added to the count bits with probability 1/m, wherein m is the expansion coefficient; if the count bits reach their upper limit again, the previous operation is repeated once, adding 1 to the flag bits and setting the count bits to 0;

or, step 1) uses multi-level dynamic flag bits, in which the number of flag bits and the number of count bits are adjusted dynamically according to the stored value; in this mode the counter is scanned from the highest bit toward the lowest bit until the first bit with value 0 is found, whereupon all bits above this 0 are flag bits and all bits below it are count bits.

2. The method of claim 1, characterized in that: in the query operation of step 3), the query value of each C[i,h_i(k)] is calculated as follows:

a) if the value of the flag bits is 0, the value of the count bits is the query value;

b) if the value of the flag bits is not 0, the query value is divided into two parts: one part is the contribution of the count bits, which equals the expansion coefficient m multiplied by the value of the count bits; the other part is the value represented by the flag bits, which, if the value of the flag bits is v, is 2^(n-x) + m*2^(n-x)*(v-1); the two parts are added to obtain the query value of C[i,h_i(k)], and the smallest of the query values of all C[i,h_i(k)] is then returned as the final query result.

3. The method of claim 1, wherein the update operation in step 2) is performed as follows: let the high x bits be the flag bits and the low n-x-1 bits be the count bits; on each insertion, a +1 operation is performed on the count bits with a probability determined by m, where m is the expansion coefficient; when the count bits are full and a carry is needed, the first 0 found when determining the flag bits is set to 1 and all count bits are then set to 0, so that the flag bits are extended by one bit and the count bits shortened by one bit.

4. The method as claimed in claim 3, wherein the specific method for performing the query operation in step 3) is to calculate, when a query request with a key value of k comes:

value = m^0*2^(n-1) + m^1*2^(n-2) + … + m^(x-1)*2^(n+1-x) + m^(x-1)*(value of counter bits),

wherein value of counter bits represents the value of the count bits; after these values are calculated, the smallest value is selected as the final query value.

5. A statistical method for query string frequency is characterized by comprising the following steps:

6. The method of claim 5, wherein if the query value obtained in step 2) for a query string is too small to rank among the k query strings with the most occurrences, the true value is not fetched from the off-chip hash table.