CN108304409B - Carry-based data frequency estimation method of Sketch data structure - Google Patents

Carry-based data frequency estimation method of Sketch data structure

Info

Publication number
CN108304409B
Authority
CN
China
Prior art keywords
bit
bits
value
query
counting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710024141.0A
Other languages
Chinese (zh)
Other versions
CN108304409A (en)
Inventor
杨仝
姜雨萌
李晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201710024141.0A priority Critical patent/CN108304409B/en
Publication of CN108304409A publication Critical patent/CN108304409A/en
Application granted granted Critical
Publication of CN108304409B publication Critical patent/CN108304409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution

Abstract

The invention relates to a carry-based data frequency estimation method for a Sketch data structure. The method comprises the following steps: 1) establishing a Sketch data structure, which is a two-dimensional array of counters in which each position is an n-bit counter, and setting up flag bits and count bits within the n-bit space of each counter; 2) during an update operation, mapping data items into the two-dimensional array through hash functions, counting with the count bits, and carrying into the flag bits when the count bits reach their upper limit; 3) during a query operation, returning the minimum of the query values of the rows of the two-dimensional array as the query result. The method can use either fixed flag bits or multi-level dynamic flag bits. With the counter size unchanged, the invention significantly raises the upper limit of the count and can improve counting accuracy.

Description

Carry-based data frequency estimation method of Sketch data structure
Technical Field
The invention relates to several important fields such as network security, financial analysis, machine learning, and natural language processing, and in particular to a carry-based data frequency estimation method for the Sketch data structure.
Background
Currently, the most widely used and best-performing sketch, and the one that fits the widest variety of data, is the Count-Min Sketch (Graham Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications [M]). It is lightweight, simple and fast for real-time counting, highly scalable, and low in storage and computational complexity.
However, as a lightweight data structure (Y. Wang, Y. Zu, et al. Wire speed name lookup: a GPU-based approach. In Proc. USENIX NSDI, pages 199-. Meanwhile, because the data structure design is simple, the upper limit of the values it can store is very limited.
Disclosure of Invention
To overcome this inherent limitation of the conventional Count-Min Sketch counting scheme, the invention provides a counting method that raises the upper limit of the value range that a given number of bits can represent.
The technical scheme adopted by the invention is as follows:
A carry-based data frequency estimation method for a Sketch data structure comprises the following steps:
1) establishing a Sketch data structure, which is a two-dimensional array of counters in which each position is an n-bit counter, and setting up flag bits and count bits within the n-bit space of each counter;
2) during an update operation, mapping data items into the two-dimensional array through hash functions, counting with the count bits, and carrying into the flag bits when the count bits reach their upper limit;
3) during a query operation, returning the minimum of the query values of the rows of the two-dimensional array as the query result.
Further, step 1) uses fixed flag bits: the high x bits of the counter's n-bit space are used as flag bits and the remaining n-x bits as count bits.
Alternatively, step 1) uses multi-level dynamic flag bits: the number of flag bits and the number of count bits are adjusted dynamically according to the stored value.
A statistical method for query string frequency comprises the following steps:
1) recording, with the Sketch data structure described above, the number of occurrences of the search string used in each user search;
2) for each query string, obtaining from the Sketch data structure a query value for its number of occurrences, and thereby obtaining the k query strings with the most occurrences.
Further, if the query value obtained for a query string in step 2) is too small to rank among the k most frequent query strings, its true value does not need to be fetched from the off-chip hash table.
The beneficial effects of the invention are as follows:
With the counter size unchanged, the upper limit of the count is significantly raised. Conversely, if the upper limit of the count is held constant, smaller counters can be used, so more of them fit in the same space and counting accuracy improves. Because the invention is an improvement on the Count-Min Sketch, it is suitable for all use cases of the Count-Min Sketch, including natural language processing, data stream statistics, pointwise mutual information computation, sparse approximation in compressed sensing, detection of abnormal network traffic, and distributed data set processing.
Drawings
FIG. 1 is a diagram of a fixed flag bit version of the present invention when performing an update operation.
Fig. 2 is a schematic diagram of a multi-level dynamic flag bit version counter, showing the distinction between flag bits and count bits.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
The technical scheme adopted by the invention is divided into 2 versions:
Fixed flag bit version of Carry-In Sketch
1) Data structure
Carry-In Sketch has the same data structure as CM Sketch: a two-dimensional array of counters with width w and height d, $C[1,1] \dots C[d,w]$. Each position is an n-bit counter initialized to 0. In this n-bit space, the high x (x < n) bits are used as flag bits and the remaining (n-x) bits are used as count bits. Furthermore, we need d pairwise independent hash functions $h_1 \dots h_d : \{1, 2, \dots\} \to \{1, \dots, w\}$; that is, each of the d hash functions maps any positive integer to a value from 1 to w. We also define an expansion coefficient m.
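As a minimal illustration of this layout, the Python sketch below packs the flag bits and count bits into one n-bit integer and allocates the d-by-w array. The names `CounterLayout`, `split`, `join` and `make_sketch` are hypothetical helpers introduced here for illustration only; they do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterLayout:
    """Fixed split of an n-bit counter: high x flag bits, low n-x count bits."""
    n: int  # total counter width in bits
    x: int  # number of high bits used as flag bits (x < n)

    def split(self, raw: int) -> tuple[int, int]:
        """Return (flag value, count value) stored in the raw n-bit integer."""
        count_width = self.n - self.x
        return raw >> count_width, raw & ((1 << count_width) - 1)

    def join(self, flag: int, count: int) -> int:
        """Pack a flag value and a count value back into one n-bit integer."""
        return (flag << (self.n - self.x)) | count


def make_sketch(d: int, w: int) -> list[list[int]]:
    """Allocate d rows by w columns of counters, all initialised to 0."""
    return [[0] * w for _ in range(d)]
```

For n = 8 and x = 2, the raw value `0b10000101` splits into flag value 2 and count value 5.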
2) Operations
Update: when an update request (k, c) arrives, we need to insert the element with key k into the Sketch c times. Each time, we perform one insert operation on every row of the Carry-In Sketch. For row r, we first locate the element's position using $h_r(k)$, the value to which the r-th hash function maps k. We then perform one insert operation at position $C[r, h_r(k)]$, using the flag bits as a carry. While the flag bits of a counter are 0, the insert operation simply adds 1 to the counter. When the count bits reach their upper limit $2^{n-x}$, we add 1 to the flag bits and set the count bits to 0. From then on, each insertion adds 1 to the count bits with probability $\frac{1}{m}$. If the count bits reach their upper limit again, we repeat the previous operation: add 1 to the flag bits and set the count bits to 0.
FIG. 1 shows the fixed flag bit version performing an update operation. There are 4 rows; in each row, the counter corresponding to the element is located using $h_r(k)$ and then updated. When the flag bits are 0, the count bits are incremented by 1 on every insertion; otherwise they are incremented by 1 with probability $\frac{1}{m}$.
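The update procedure for the fixed flag bit version can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the increment probability once the flag bits are nonzero is taken to be 1/m (consistent with the query description, where the count bits are expected to grow by 1 every m insertions); Python's built-in `hash` stands in for the d pairwise independent hash functions; and a counter whose bits are all 1 is simply left unchanged, a saturation rule the source does not discuss. The function names are hypothetical.

```python
import random


def insert_once(raw: int, n: int, x: int, m: int, rng: random.Random) -> int:
    """Apply one insertion to a single n-bit counter with x high flag bits."""
    if raw == (1 << n) - 1:
        return raw  # assumption: a fully saturated counter stays as it is
    flag = raw >> (n - x)
    # Flag bits 0: always count. Otherwise: count with probability 1/m.
    if flag == 0 or rng.random() < 1.0 / m:
        # Adding 1 to the raw integer overflows full count bits into the
        # flag bits, i.e. "flag bits +1, count bits set to 0".
        raw += 1
    return raw


def update(sketch, key, c: int, n: int, x: int, m: int, rng: random.Random) -> None:
    """Insert `key` c times; each time, one insertion is made in every row."""
    w = len(sketch[0])
    for _ in range(c):
        for r, row in enumerate(sketch):
            col = hash((r, key)) % w  # stand-in for h_r(key)
            row[col] = insert_once(row[col], n, x, m, rng)
```

Passing the random generator in explicitly keeps runs reproducible; with n = 8 and x = 2, the first 64 insertions into an empty counter are deterministic and leave flag value 1 and count value 0.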
Query: when a query with a key value of k comes, we calculate:
C[1,h1(k)],C[2,h2(k)]…C[d,hd(k)]:
if value(x sign bits)=0
value(C[i,hi(k)])=value((n-x)count bits);
if value(x sign bits)>0
value(C[i,hi(k)])=2^(n-x)+m*2^(n-x)*(value(x sign bits)-1)+m*(value((n-x)count bits)).
In natural language, the above reads as follows.
For each $C[i, h_i(k)]$, the query value is calculated as follows:
1) If the value of the flag bits is 0, the value of the count bits is the query value.
2) If the value of the flag bits is not 0, the query value consists of two parts:
One part is the query value of the count bits. In expectation the count bits are incremented by 1 for every m insertions, so this part equals the expansion coefficient m times the value of the count bits.
The other part is the value represented by the flag bits. Since there are (n-x) count bits, the update procedure above increments the flag bits by 1, and resets the count bits to 0, every time the count bits have been incremented $2^{n-x}$ times. Growing the flag bits from 0 to 1 therefore takes $2^{n-x}$ insertions. Once the flag bits are nonzero, the count bits are expected to be incremented by 1 every m insertions, so each further increment of the flag bits takes $m \cdot 2^{n-x}$ insertions in expectation. Therefore, if the flag bits hold the value F, this part of the query value is $2^{n-x} + m \cdot 2^{n-x} \cdot (F-1)$.
Adding the two parts together gives the query value of $C[i, h_i(k)]$.
value(element) in the above code denotes the value of an element. Finally, we return the smallest of all the $C[i, h_i(k)]$ query values as the final query result.
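The query rule for the fixed flag bit version translates directly into code. The Python sketch below follows the two cases given above; the function names are hypothetical, and Python's built-in `hash` is again only a stand-in for the pairwise independent hash functions.

```python
def counter_value(raw: int, n: int, x: int, m: int) -> int:
    """Estimated number of insertions represented by one n-bit counter."""
    width = n - x                       # number of count bits
    flag = raw >> width                 # value of the x flag bits
    count = raw & ((1 << width) - 1)    # value of the count bits
    if flag == 0:
        return count
    # 2^(n-x) insertions to reach flag = 1, then m * 2^(n-x) per further
    # flag increment, plus m per unit currently held in the count bits.
    return (1 << width) + m * (1 << width) * (flag - 1) + m * count


def query(sketch, key, n: int, x: int, m: int) -> int:
    """Return the minimum estimate over the d rows."""
    return min(
        counter_value(row[hash((r, key)) % len(row)], n, x, m)
        for r, row in enumerate(sketch)
    )
```

For example, with n = 8, x = 2 and m = 16, a counter holding flag value 2 and count value 3 is read as 64 + 16·64·1 + 16·3 = 1136.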
Multi-level dynamic flag bit version of Carry-In Sketch
1) Data structure
The data structure of this version is still a two-dimensional array C[d, w], but this time the flag bits are multi-level: the number of flag bits and the number of count bits are adjusted dynamically according to the stored value. The counter is scanned from the highest bit toward the lowest until the first bit with value 0 is found; all bits above this 0 are flag bits, and all bits below it are count bits. As before, we define an expansion coefficient m. See Fig. 2.
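The scan that separates flag bits from count bits can be sketched in Python as follows. The function name `split_dynamic` is hypothetical, and treating a counter with no 0 bit at all as having zero count bits is an assumption made here for completeness.

```python
def split_dynamic(raw: int, n: int) -> tuple[int, int, int]:
    """Scan an n-bit counter from the highest bit down to the first 0.

    Returns (x, count_width, count): the number of flag bits above the
    first 0, the number of count bits below it, and the count value.
    """
    x = 0
    while x < n and (raw >> (n - 1 - x)) & 1:
        x += 1
    # One bit is taken by the terminating 0; if there is none, no count bits.
    count_width = max(n - x - 1, 0)
    count = raw & ((1 << count_width) - 1)
    return x, count_width, count
```

For the 8-bit value `0b11010110`, the scan stops at the third bit, giving 2 flag bits and 5 count bits holding the value 22.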
2) Operations
Update: the update operation of the multi-level flag bit Carry-In Sketch differs from that of the fixed flag bit Carry-In Sketch only in how +1 is performed on each counter. Suppose the high x bits are flag bits, so that the low (n-x-1) bits are count bits. Each time, we perform +1 on the count bits with the probability given by [formula image BDA0001208858960000042]. When the count bits are full and a carry is needed, we set the first 0 found above to 1 and then set all count bits to 0. The flag bits are thus extended by one bit and the count bits reduced by one bit; that is, the first 0 found when scanning from high to low moves one bit to the right.
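The deterministic part of this update, the +1 on the count bits together with the carry that extends the flag bits, can be sketched in Python as follows. The probability with which a given insertion triggers the +1 is given only as a formula image in the source, so the sketch leaves that decision to the caller. The function names are hypothetical, and leaving an all-ones counter unchanged is an assumption.

```python
def leading_ones(raw: int, n: int) -> int:
    """Number of consecutive 1 bits at the top of an n-bit counter."""
    x = 0
    while x < n and (raw >> (n - 1 - x)) & 1:
        x += 1
    return x


def add_one(raw: int, n: int) -> int:
    """Perform +1 on the count bits, carrying into the flag bits when full."""
    x = leading_ones(raw, n)
    if x == n:
        return raw  # assumption: no 0 bit left, counter is saturated
    count_width = n - x - 1
    count_mask = (1 << count_width) - 1
    if (raw & count_mask) != count_mask:
        return raw + 1  # count bits not yet full: ordinary increment
    # Count bits full: set the terminating 0 to 1 and clear the count bits,
    # so the flag bits grow by one bit and the count bits shrink by one.
    return ((1 << (x + 1)) - 1) << count_width
```

For instance, `0b10111111` (one flag bit, six full count bits) becomes `0b11000000` (two flag bits, five empty count bits).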
Query: when a query with key k arrives, we calculate the values of $C[1, h_1(k)], C[2, h_2(k)], \dots, C[d, h_d(k)]$, each as follows:
$$\text{value} = m^{0} \cdot 2^{n-1} + m^{1} \cdot 2^{n-2} + \dots + m^{x-1} \cdot 2^{n+1-x} + m^{x-1} \cdot (\text{value of counter bits})$$
where "value of counter bits" denotes the value of the count bits. After these values are calculated, the smallest one is selected as the final query value.
The invention has the beneficial effect that under the condition that the size of the counter is not changed, the upper limit of the counting is obviously improved. When the number of bits n of each counter is 8 and the expansion coefficient m is 16, the specific effects are shown in tables 1 and 2:
Table 1. Fixed flag bit version:
[Table image BDA0001208858960000041]
Table 2. Dynamic flag bit version:
[Table image BDA0001208858960000051]
Application scenario:
A search engine records in a log file every search string used in each user search. Suppose a number of such records exist, each corresponding to one query for some query string. The task is to find the k most popular query strings.
The conventional approach is to use a hash table to record the number of occurrences of each query string, then maintain a min-heap of size k while traversing all the query strings, which finally yields the k query strings with the most occurrences. On top of the hash table, we can add a prior-art Count-Min Sketch (CM Sketch for short, likewise below) structure to speed up processing. The specific steps are as follows:
First, depending on the actual application scenario, the hash table may be large and have to be placed off chip, where access is slow (relative to on-chip access). We can then add a CM Sketch on chip to record the number of occurrences of each query string. The CM Sketch is small enough to fit on chip, so access to it is fast (accessing the Sketch takes much less time than accessing the hash table). By the nature of the CM Sketch, the value obtained by querying it is not exact, but it is never smaller than the true value. Therefore, if the query value obtained from the CM Sketch for a query string is too small to rank among the largest k, the true value of that query string does not need to be fetched from the off-chip hash table, which avoids one off-chip access and improves processing efficiency.
However, there remains the case where the CM Sketch query value of a query string is large enough to rank among the largest k while its true value is not. This case should be made as rare as possible, which requires the sketch to be as accurate as possible while the memory it consumes stays constant. The Carry-In Sketch of the present invention provides exactly this improvement over the CM Sketch.
In a specific implementation, the Count-Min Sketch originally used in this scenario is replaced by Carry-In Sketch. The data structure and operations of the Carry-In Sketch have been described in detail above.
Specific examples are as follows:
Assume there are 5 different query strings a, b, c, d, e, with frequencies 1000, 300, 200, 1200, 400 respectively. In the original CM Sketch, a and c map to the same position, whose count is 1000+200 = 1200; b and d map to the same position, whose count is 300+1200 = 1500.
Now assume we traverse these strings in the order e, d, c, b, a to find the top-3, and that 3 maxima 350, 340, 330 have already been found. For e, the query value 400 is large enough, so the hash table is queried for the true value 400, and the top-3 becomes 400, 350, 340. For d, the query value is 1500 and the true value found is 1200, so the top-3 becomes 1200, 400, 350. For c, the query value is 1200, but the true value turns out to be 200, so c is ignored. Similarly, b is ignored. Finally, for a, the true value 1000 is found, and the final top-3 is 1200, 1000, 400. In this process the hash table is queried 5 times for the 5 elements, that is, 5 off-chip accesses.
If Carry-In Sketch is used, the number of counters increases while the space consumed is unchanged, so the elements are likely to map to different positions. The query for c then returns 200, which is not enough to rank in the top-3, so its true value does not need to be looked up. The same applies to b. This saves 2 hash table queries, that is, 2 off-chip accesses, thereby improving efficiency.
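The worked example can be replayed with a short Python sketch. It tracks only the counts of the current top-k in a min-heap, uses plain dictionaries in place of the sketch query and the off-chip hash table, and counts how many off-chip lookups the filter lets through. The function and parameter names are hypothetical.

```python
import heapq


def top_k_with_filter(order, estimate, truth, current_top):
    """Scan strings in `order`, skipping the off-chip lookup when the
    sketch estimate cannot beat the smallest count in the current top-k.

    estimate: string -> sketch query value (never below the true value here)
    truth:    string -> exact count, standing in for the off-chip hash table
    Returns (top-k counts in descending order, number of off-chip lookups).
    """
    heap = list(current_top)
    heapq.heapify(heap)               # min-heap: heap[0] is the k-th largest
    off_chip = 0
    for s in order:
        if estimate[s] <= heap[0]:
            continue                  # estimate too small: no lookup needed
        off_chip += 1                 # one off-chip hash table access
        if truth[s] > heap[0]:
            heapq.heapreplace(heap, truth[s])
    return sorted(heap, reverse=True), off_chip


truth = {"a": 1000, "b": 300, "c": 200, "d": 1200, "e": 400}
colliding = {"a": 1200, "c": 1200, "b": 1500, "d": 1500, "e": 400}
with_collisions = top_k_with_filter("edcba", colliding, truth, [350, 340, 330])
without_collisions = top_k_with_filter("edcba", truth, truth, [350, 340, 330])
```

With the colliding estimates, all 5 strings trigger a lookup; when every string has its own counter, only e, d and a do, and both runs end with the same top-3 counts.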
In the update step described above, the insert operation is performed on all d rows. Another variant is to find the minimum (there may be several) among the d row results and then perform the insertion only on the counter or counters holding that minimum; the remaining rows are left untouched. All other operations are unchanged. This variant applies to both versions described above.
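A minimal Python sketch of this variant is given below. To stay independent of either counter version, it takes the per-counter query rule and the per-counter insert rule as callables; `value_of`, `insert_once` and `update_min_only` are hypothetical names, and Python's built-in `hash` again stands in for the row hash functions.

```python
def update_min_only(sketch, key, c, value_of, insert_once):
    """Insert `key` c times, touching only the rows whose mapped counter
    currently has the smallest query value.

    value_of:    raw counter -> query value (version-specific)
    insert_once: raw counter -> raw counter after one insertion
    """
    for _ in range(c):
        cells = [(r, hash((r, key)) % len(row)) for r, row in enumerate(sketch)]
        values = [value_of(sketch[r][col]) for r, col in cells]
        lowest = min(values)
        for (r, col), v in zip(cells, values):
            if v == lowest:  # ties: every minimal counter is updated
                sketch[r][col] = insert_once(sketch[r][col])
```

With plain integer counters, `value_of=lambda v: v` and `insert_once=lambda v: v + 1` reduce this to incrementing only the smallest mapped counters.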
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. A carry-based data frequency estimation method for a Sketch data structure, characterized by comprising the following steps:
1) establishing a Sketch data structure, which is a two-dimensional array of counters in which each position is an n-bit counter, and setting up flag bits and count bits within the n-bit space of each counter;
2) during an update operation, mapping data items into the two-dimensional array through hash functions, counting with the count bits, and carrying into the flag bits when the count bits reach their upper limit;
3) during a query operation, returning the minimum of the query values of the rows of the two-dimensional array as the query result;
wherein step 1) uses fixed flag bits, that is, the high x bits of the counter's n-bit space are used as flag bits and the remaining n-x bits as count bits; the specific method of the update operation in step 2) is: when an update request (k, c) arrives, the element with key k is inserted into the Sketch c times, each time performing one insert operation on every row; for row r, the element's position is first located using $h_r(k)$, the value to which the r-th hash function maps k, and one insert operation is then performed at position $C[r, h_r(k)]$, using the flag bits as a carry; while the flag bits of a counter are 0, the insert operation simply adds 1 to the counter, until the count bits reach their upper limit $2^{n-x}$, whereupon 1 is added to the flag bits and the count bits are set to 0; thereafter, each insertion adds 1 to the count bits with probability $\frac{1}{m}$, where m is the expansion coefficient; if the count bits reach their upper limit again, the previous operation is repeated, adding 1 to the flag bits and setting the count bits to 0;
or, step 1) uses multi-level dynamic flag bits, the number of flag bits and the number of count bits being adjusted dynamically according to the stored value; in the multi-level dynamic flag bit mode, the counter is scanned from the highest bit toward the lowest bit until the first bit with value 0 is found, all bits above this 0 being flag bits and all bits below it being count bits.
2. The method of claim 1, characterized in that, in the query operation of step 3), the query value of each $C[i, h_i(k)]$ is calculated as follows:
a) if the value of the flag bits is 0, the value of the count bits is the query value;
b) if the value of the flag bits is not 0, the query value consists of two parts: one part is the query value of the count bits, which equals the expansion coefficient m multiplied by the value of the count bits; the other part is the value represented by the flag bits, which, if the flag bits hold the value F, is $2^{n-x} + m \cdot 2^{n-x} \cdot (F-1)$; adding the two parts gives the query value of $C[i, h_i(k)]$, and the smallest of all the $C[i, h_i(k)]$ query values is then returned as the final query result.
3. The method as claimed in claim 1, wherein the specific method of the update operation in step 2) is: let the high x bits be flag bits and the low n-x-1 bits be count bits; each time, +1 is performed on the count bits with the probability given by [formula image FDA0003198325760000012], where m is the expansion coefficient; when the count bits are full and a carry is needed, the first 0 found when determining the flag bits is set to 1 and all count bits are then set to 0, so that the flag bits are extended by one bit and the count bits are reduced by one bit.
4. The method as claimed in claim 3, wherein the specific method for performing the query operation in step 3) is to calculate, when a query request with a key value of k comes:
$$\text{value} = m^{0} \cdot 2^{n-1} + m^{1} \cdot 2^{n-2} + \dots + m^{x-1} \cdot 2^{n+1-x} + m^{x-1} \cdot (\text{value of counter bits}),$$
wherein value of counter bits represents the value of the count bits; after these values are calculated, the smallest value is selected as the final query value.
5. A statistical method for query string frequency is characterized by comprising the following steps:
1) recording the number of occurrences of a search string used by a user per search using the Sketch data structure of claim 1;
2) for each query string, obtaining from the Sketch data structure a query value for its number of occurrences, and thereby obtaining the k query strings with the most occurrences.
6. The method of claim 5, wherein if the query value obtained in step 2) for a query string is not enough to be arranged in the k query strings with the largest number of times, the actual value is not obtained from the off-chip hash table.
CN201710024141.0A 2017-01-13 2017-01-13 Carry-based data frequency estimation method of Sketch data structure Active CN108304409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710024141.0A CN108304409B (en) 2017-01-13 2017-01-13 Carry-based data frequency estimation method of Sketch data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710024141.0A CN108304409B (en) 2017-01-13 2017-01-13 Carry-based data frequency estimation method of Sketch data structure

Publications (2)

Publication Number Publication Date
CN108304409A CN108304409A (en) 2018-07-20
CN108304409B true CN108304409B (en) 2021-11-16

Family

ID=62872335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710024141.0A Active CN108304409B (en) 2017-01-13 2017-01-13 Carry-based data frequency estimation method of Sketch data structure

Country Status (1)

Country Link
CN (1) CN108304409B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532307B (en) * 2019-07-11 2022-05-03 北京大学 Data storage method and query method of stream sliding window
CN110535825B (en) * 2019-07-16 2020-08-14 北京大学 Data identification method of characteristic network flow
CN110830322B (en) * 2019-09-16 2021-07-06 北京大学 Network flow measuring method and system based on probability measurement data structure Sketch
CN111782700B (en) * 2020-08-05 2023-08-18 中国人民解放军国防科技大学 Data stream frequency estimation method, system and medium based on double-layer structure
CN112422579B (en) * 2020-11-30 2021-11-30 福州大学 Execution body set construction method based on mimicry defense Sketch
US11934401B2 (en) 2022-08-04 2024-03-19 International Business Machines Corporation Scalable count based interpretability for database artificial intelligence (AI)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102882798A (en) * 2012-09-04 2013-01-16 中国人民解放军理工大学 Statistical counting method facing to backbone network flow analysis
CN103647670A (en) * 2013-12-20 2014-03-19 北京理工大学 Sketch based data center network flow analysis method
CN103763154A (en) * 2014-01-11 2014-04-30 浪潮电子信息产业股份有限公司 Network flow detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An improved data stream summary: the count-min sketch and its applications;Graham Cormode 等;《JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC》;20050430(第55期);正文第2、3、4.1节 *
Count-Min Sketch;Graham Cormode;《Encyclopedia of Database Systems》;Springer, Boston, MA;20090131;全文 *

Also Published As

Publication number Publication date
CN108304409A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304409B (en) Carry-based data frequency estimation method of Sketch data structure
US11048966B2 (en) Method and device for comparing similarities of high dimensional features of images
KR100545477B1 (en) Image retrieval using distance measure
US10949467B2 (en) Random draw forest index structure for searching large scale unstructured data
US7080091B2 (en) Inverted index system and method for numeric attributes
CN101404032B (en) Video retrieval method and system based on contents
US11106708B2 (en) Layered locality sensitive hashing (LSH) partition indexing for big data applications
CN105574212B (en) A kind of image search method of more index disk hash data structures
KR100903961B1 (en) Indexing And Searching Method For High-Demensional Data Using Signature File And The System Thereof
US8972415B2 (en) Similarity search initialization
CN106503223B (en) online house source searching method and device combining position and keyword information
CN106777388B (en) Double-compensation multi-table Hash image retrieval method
CN112163145B (en) Website retrieval method, device and equipment based on editing distance and cosine included angle
EP3115908A1 (en) Method and apparatus for multimedia content indexing and retrieval based on product quantization
CN100476824C (en) Method and system for storing element and method and system for searching element
CN113806601B (en) Peripheral interest point retrieval method and storage medium
CN105302833A (en) Content based video retrieval mathematic model establishment method
CN108595508B (en) Adaptive index construction method and system based on suffix array
Egas et al. Adapting kd trees to visual retrieval
Gilbert et al. A retrieval pattern-based inter-query learning approach for content-based image retrieval
CN115129915A (en) Repeated image retrieval method, device, equipment and storage medium
CN105117733A (en) Method and device for determining clustering sample difference
CN110598020B (en) Binary image retrieval method
CN110609914B (en) Online Hash learning image retrieval method based on rapid category updating
CN106021360A (en) Method and device for autonomously learning and optimizing MapReduce processing data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant