CN105989061A

CN105989061A - Rapid indexing method for repeated detection of multi-dimensional data under sliding window

Info

Publication number: CN105989061A
Application number: CN201510066798.4A
Authority: CN
Inventors: 王勇; 王树鹏; 王振宇; 王曦
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-02-09
Filing date: 2015-02-09
Publication date: 2016-10-05
Anticipated expiration: 2035-02-09
Also published as: CN105989061B

Abstract

The invention relates to a fast indexing method for duplicate detection of multi-dimensional data under a sliding window. The method uses a compressed counting Bloom filter matrix array to maintain the data items in the sliding window. Multiple sub-windows are maintained within the sliding window: the head sub-window receives new elements in a sliding manner and the tail sub-window evicts old elements in a sliding manner. Each independent sub-window consists of one counting Bloom filter matrix, which provides dimension reduction for multi-dimensional data and maintains counter units internally. Because all counting Bloom filter matrices have the same design capacity and share the same group of k hash functions, the efficiency of duplicate-element detection is effectively improved; because a system base clock is maintained in the counter units, implicit deletion of elements from the sliding window is effectively supported; and because multi-dimensional data are maintained in matrix form, the combination error rate of multi-dimensional data is effectively reduced, lowering the overall misjudgment rate.

Description

A fast indexing method for duplicate detection of multi-dimensional data under a sliding window

Technical field

The present invention relates to a fast indexing method and system for duplicate detection of massive multi-dimensional data, and in particular to an indexing method for detecting duplicates in massive multi-dimensional data under the sliding-window data stream model. It belongs to the field of big data computing.

Background art

With the development of the mobile Internet and Web 2.0, the global data volume is growing at a surprising rate: the data generated worldwide amounted to 0.49 ZB in 2008 (1 ZB = 10²¹ bytes), 0.8 ZB in 2009, 1.2 ZB in 2010 and 1.82 ZB in 2011. IDC expects that by 2020 mankind will produce more than 40 ZB of data. This high-speed, massive network data carries complex information, including various business data streams such as IP service flows, user click streams, user query streams and web server logs. It may also contain various security incidents that pose a serious threat to network security, so network traffic monitoring is particularly important.

In network traffic monitoring systems, duplicate detection of multi-dimensional data is a very important preprocessing step. Take the network service flows in a network monitoring and management system as an example: each flow is uniquely identified by a five-tuple (source address, dest address, source port, dest port, protocol), and an efficient algorithm is needed to represent and query this set of five-dimensional elements so as to improve system efficiency.

In stream computing scenarios, computation windows are currently divided into the following types according to how the window boundaries move. The first is the fixed window model, in which both ends of the window are fixed; it does little to reflect the timeliness of the data. The second is the landmark window model, in which the left end of the window is fixed and the right end moves forward; a landmark window contains the data items that occur between a specific point in time and the current time, and setting multiple landmarks over the lifetime of a data stream is equivalent to splitting the stream into several independent small streams that are examined separately. The third is the jumping window model, in which the left end of the window advances in jumps while the right end slides forward; it reflects the continuous evolution of the data stream better than the landmark window model, but because elements are evicted from the window tail in batches, the number of valid elements in the window fluctuates noticeably. The fourth is the sliding window model, in which both ends of the window slide forward simultaneously; stale data items are deleted as new data items are inserted, and it is regarded as the ideal model for data stream monitoring and analysis.

Under the sliding window model, the main fast indexing methods for detecting duplicate multi-dimensional data are the following:

The first is an index that combines hashing with counting. Hash indexing is a convenient and efficient mechanism for indexing big data: by counting the data items that share the same hash value, it records the existence of multi-dimensional data. When an element is inserted into the sliding window, the corresponding counter is incremented by 1; when an element is deleted, the corresponding counter is decremented by 1, and if the counter reaches 0 the corresponding element entry is removed. This index strategy has shortcomings, however. First, a hash function does not map data items uniquely, so hash collisions between data items are unavoidable, and the quality of the collision-handling method has a decisive effect on the index. Second, hash-based indexing consumes a large amount of memory.
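The hash-plus-counting scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `HashCountingWindow` and its methods are hypothetical, and Python's built-in hash table stands in for the hash index, so collision handling is delegated to the language runtime rather than shown explicitly.

```python
from collections import Counter, deque

class HashCountingWindow:
    """Illustrative hash + counter index over a sliding window (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total number of items kept in the window
        self.items = deque()       # arrival order, oldest item on the left
        self.counts = Counter()    # item -> number of copies inside the window

    def is_duplicate(self, item):
        # Counter returns 0 for a missing key without creating an entry.
        return self.counts[item] > 0

    def insert(self, item):
        self.items.append(item)
        self.counts[item] += 1                 # insertion: counter + 1
        if len(self.items) > self.capacity:
            old = self.items.popleft()
            self.counts[old] -= 1              # deletion: counter - 1
            if self.counts[old] == 0:
                del self.counts[old]           # counter reached 0: drop the entry

window = HashCountingWindow(capacity=2)
for flow in ["flow-a", "flow-b", "flow-c"]:
    window.insert(flow)
print(window.is_duplicate("flow-a"), window.is_duplicate("flow-c"))
```

Note that the table stores every full key, which is one way to see the memory cost mentioned above.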

The second is the multi-dimensional Bloom filter (MDBF) index. An MDBF consists of as many standard Bloom filters as the element has dimensions: it decomposes the representation and querying of a multi-dimensional element directly into the representation and querying of sets of single attribute values, with one standard Bloom filter representing each attribute. To query an element, each attribute value of the multi-dimensional element is tested against its corresponding standard Bloom filter, and the element is judged to belong to the set only if every attribute value is found. This method also has shortcomings. First, its ability to delete elements from the sliding window is weak, so an exact sliding window over data items cannot be achieved. Second, because the multiple hash functions in a Bloom filter may collide, and because only the existence of each dimension is checked independently, the element misjudgment rate is high.
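The per-dimension weakness of MDBF can be reproduced with a small sketch. The classes `BloomFilter` and `MDBF`, the filter size, and the SHA-256-based hash positions are all assumptions made for illustration; they are not taken from the MDBF literature or from this application.

```python
import hashlib

class BloomFilter:
    """Minimal standard Bloom filter (illustrative parameters)."""

    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, value):
        # k positions derived from SHA-256 with a per-function prefix.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = True

    def __contains__(self, value):
        return all(self.bits[p] for p in self._positions(value))

class MDBF:
    """One standard Bloom filter per dimension, queried independently."""

    def __init__(self, dims):
        self.filters = [BloomFilter() for _ in range(dims)]

    def add(self, element):
        for f, attr in zip(self.filters, element):
            f.add(attr)

    def __contains__(self, element):
        return all(attr in f for f, attr in zip(self.filters, element))

mdbf = MDBF(dims=2)
mdbf.add(("10.0.0.1", 80))
mdbf.add(("10.0.0.2", 443))
# ("10.0.0.1", 443) was never inserted, but each attribute was seen separately.
print(("10.0.0.1", 443) in mdbf)  # True: a combination error
```

Because a Bloom filter never yields a false negative, any recombination of previously seen attribute values is reported as present.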

In summary, fast indexing is essential to the problem of detecting duplicate data in a sliding window. Improving the efficiency of duplicate detection and reducing its misjudgment rate are key issues in the design of a fast index structure.

Summary of the invention

The main object of the present invention is to provide a fast indexing method and system for duplicate detection of multi-dimensional data under a sliding window that improves the efficiency of duplicate-element detection and reduces the misjudgment rate, thereby effectively solving the problem of multi-dimensional duplicate detection under the sliding window model.

The invention mainly comprises the following aspects.

First, for the fast index structure, the present invention uses a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) to maintain the data items in the sliding window. Specifically, multiple sub-windows are maintained within the sliding window; the head sub-window receives new elements in a sliding manner and the tail sub-window evicts old elements in a sliding manner. Each independent sub-window consists of one compressed counting Bloom filter matrix (CCBFM), which provides dimension reduction for multi-dimensional data and maintains counter units internally.

Second, with respect to duplicate-detection efficiency, all counting Bloom filter matrices in this index structure use the same design capacity and share the same group of k hash functions. This reduces the time complexity of an element query from O(kg) to O(k), where g is the number of sub-windows, and effectively improves duplicate-detection efficiency.

Third, with respect to applicability to sliding-window stream computing, the data items of each independent sub-window are maintained by a counting Bloom filter matrix (CCBFM). By maintaining the system base clock in the counter units, the invention effectively supports implicit deletion of elements in the sliding window and improves the system's suitability for sliding-window operations.

Compared with the prior art, the main innovations and beneficial effects of the present invention are as follows:

1) For the design of fast index structures for big data, the present invention proposes a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure. The index structure maintains multiple sub-windows within the sliding window, and each independent sub-window consists of one counting Bloom filter matrix (CCBFM).

2) Based on this index structure, having all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions effectively improves duplicate-detection efficiency; maintaining the system base clock in the counter units effectively supports implicit deletion of elements from the sliding window; and maintaining multi-dimensional data in matrix form effectively reduces the combination error rate of multi-dimensional data and lowers the overall misjudgment rate.

Brief description of the drawings

Fig. 1 is a schematic diagram of the sliding window model;

Fig. 2 is a schematic diagram of the index structure for duplicate detection under the sliding window model;

Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model;

Fig. 4 is a flow chart of multi-dimensional duplicate detection at a data processing node;

Fig. 5 is a plot of the error-rate test.

Detailed description of the invention

To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.

For the sliding-window fast index structure, the present invention uses a compressed counting Bloom filter matrix array (CCBFMA) to maintain the data items in the sliding window. This indexing method offers better duplicate-detection efficiency and a lower misjudgment rate than existing solutions.

The multi-dimensional compressed counting Bloom filter matrix array (CCBFMA) consists of a group of isomorphic compressed counting Bloom filter matrices (CCBFM). Each CCBFM consists of m counter units of bit width d, where d = log₂(N/g). Let the total element capacity of the sliding window be N. The present invention maintains g sub-windows within the sliding window, each with a design capacity of N/g. All sub-windows maintain the data items flowing through in FIFO order: the head sub-window receives new elements in a sliding manner and the tail sub-window evicts old elements in a sliding manner.
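As a worked example of the bit-width formula d = log₂(N/g), the following sketch computes the counter width for a given window configuration. The function name `counter_bit_width`, the rounding up to a whole number of bits, and the assumption that g divides N evenly are choices made here for illustration; the text states only the formula itself.

```python
import math

def counter_bit_width(total_capacity, num_subwindows):
    """Bit width d = log2(N/g), rounded up to a whole bit (rounding is an assumption)."""
    sub_capacity = total_capacity // num_subwindows   # design capacity N/g
    return math.ceil(math.log2(sub_capacity))

# N = 1,000,000 elements split into g = 10 sub-windows of 100,000 elements each.
print(counter_bit_width(1_000_000, 10))  # 17, since 2**16 < 100000 <= 2**17
```

With N = 1024 and g = 4 the sub-window capacity is 256 and the formula gives exactly 8 bits.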

Each CCBFM maintains the data of all dimensions of one independent sub-window. By maintaining the system base clock in the counter units of the counting Bloom filter (CBF), the invention performs implicit deletion of elements in the sliding window. This works as follows. To determine whether an element x is a valid element of the current sliding window: first, for the existence test on x, the CCBFM effectively eliminates the combination error rate by means of its matrix form, greatly reducing the element misjudgment rate; second, if the corresponding CCBFM judges that x exists, the counter maintained in its counter units is checked against the current base clock, and if it exceeds the base clock, x is considered not to be a valid element.

Fig. 1 shows the sliding-window data stream computation model. Both ends of the sliding window slide forward simultaneously; stale data items are deleted as new data items are inserted, and the model is regarded as the ideal model for data stream monitoring and analysis.

Fig. 2 shows the index structure for duplicate detection under the sliding window model. According to this structure, the present invention maintains a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index in RAM. The index maintains multiple sub-windows within the sliding window, and the data of all dimensions of each independent sub-window are held in one counting Bloom filter matrix (CCBFM).

Fig. 3 shows how hash functions are merged and shared under the sliding window model, where k is the number of hash functions, g is the number of sliding sub-windows, and d = log₂(N/g) is the bit width. As shown in the figure, all CCBFMs are isomorphic, and the counter units that have the same coordinates in different Bloom filters are mapped to and stored in the same vector, so that they can be read together in a single memory access. Because all g Bloom filters share the same group of k hash functions, the time complexity of determining whether an element x is a valid element in any of the current sub-windows is reduced from O(kg) to O(k).
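The shared-hash layout can be sketched as a single array of m slots, where each slot holds the g counter units that share one coordinate. Everything named here (`SharedHashArray`, `write`, `lookup`, the SHA-256-based positions, and hashing the whole tuple as one key) is a hypothetical illustration, not the structure of Fig. 3 itself. In Python the per-slot comparison still loops over g values; the point of the layout is that only k hash evaluations and k slot reads are needed, instead of k per sub-window.

```python
import hashlib

class SharedHashArray:
    """g isomorphic counting filters stored as m slots of g counter units each."""

    def __init__(self, g, m=1024, k=3):
        self.g, self.m, self.k = g, m, k
        self.slots = [[0] * g for _ in range(m)]   # 0 marks an empty counter unit

    def _positions(self, element):
        # One shared group of k hash functions for all g sub-windows.
        key = repr(element).encode()
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def write(self, sub, element, stamp):
        # Store a non-zero stamp in the k counter units of sub-window `sub`.
        for p in self._positions(element):
            self.slots[p][sub] = stamp

    def lookup(self, element):
        # k slot reads; each read returns the counter units of all g sub-windows.
        rows = [self.slots[p] for p in self._positions(element)]
        return [j for j in range(self.g) if all(row[j] != 0 for row in rows)]

array = SharedHashArray(g=4)
array.write(2, ("10.0.0.1", "10.0.0.9", 1234, 80, "tcp"), stamp=5)
print(array.lookup(("10.0.0.1", "10.0.0.9", 1234, 80, "tcp")))  # [2]
```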

Fig. 4 shows the workflow of multi-dimensional duplicate detection. As shown in the figure, element duplicate detection in the sliding-window data stream scenario mainly comprises the following core steps.

(1) Initialize the system base clock, the element detection flag `flag`, and the system data structures;

(2) Receive the input element e, which consists of w dimensions, i.e. (e1, e2, ..., ew);

(3) Check whether element e exists in the CCBFMA. If it does not exist, go to step (4) and enter the new-element insertion flow; if it exists, go to step (8);

(4) Write e (e1, e2, ..., ew) into the head CCBFM;

(5) Write the system base clock into the k corresponding counter units of that CCBFM;

(6) Determine whether e is the last element of the head sub-window. If so, reset the base clock, delete the tail sub-window, and create a new head sliding sub-window; if not, increment the base clock;

(7) Set the global flag `flag` to false and end the flow;

(8) Determine whether element e is present in the tail sub-window. If not, go to step (9); if so, go to step (10);

(9) Set the global flag `flag` to true and end the flow;

(10) Determine whether the value of the corresponding counter units is greater than the system base clock. If so, set the global flag `flag` to true and end the flow; if not, set the global flag `flag` to false and end the flow.
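Steps (1) to (10) can be followed in a compact sketch. Several points are assumptions made here because the text does not fix them: the class name `CCBFMASketch`; hashing the whole w-dimensional tuple jointly as a stand-in for the matrix representation; SHA-256-derived positions; a base clock that starts at 1 so that 0 can mark an empty counter unit; and taking the minimum of the k counter units as the element's stamp in step (10). The returned value corresponds to `flag`, read here as "e is a duplicate within the window".

```python
import hashlib
from collections import deque

class CCBFMASketch:
    """Illustrative walk through steps (1)-(10); not the patented structure itself."""

    def __init__(self, sub_capacity, num_subwindows, m=4096, k=3):
        # Step (1): initialise the base clock and the data structures.
        self.cap, self.m, self.k = sub_capacity, m, k
        # Index 0 is the tail (oldest) sub-window, index -1 is the head.
        self.subs = deque([0] * m for _ in range(num_subwindows))
        self.clock = 1

    def _positions(self, element):
        key = repr(tuple(element)).encode()   # assumption: joint hash of all dimensions
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    @staticmethod
    def _present(sub, positions):
        return all(sub[p] != 0 for p in positions)

    def process(self, element):
        """Step (2): take element e; return the resulting flag."""
        pos = self._positions(element)
        if not any(self._present(s, pos) for s in self.subs):   # step (3)
            head = self.subs[-1]
            for p in pos:                                       # steps (4)-(5)
                head[p] = self.clock
            if self.clock == self.cap:                          # step (6): head is full
                self.clock = 1
                self.subs.popleft()                             # drop the tail sub-window
                self.subs.append([0] * self.m)                  # new head sub-window
            else:
                self.clock += 1
            return False                                        # step (7)
        tail = self.subs[0]
        if not self._present(tail, pos):                        # step (8)
            return True                                         # step (9)
        stamp = min(tail[p] for p in pos)                       # assumption: min of k units
        return stamp > self.clock                               # step (10)

detector = CCBFMASketch(sub_capacity=2, num_subwindows=3)
flow = ("10.0.0.1", "10.0.0.9", 1234, 80, "tcp")
print(detector.process(flow), detector.process(flow))
```

As worded, steps (8) to (10) end the flow without re-inserting an element that is judged expired in the tail sub-window; the sketch mirrors that behaviour rather than adding a re-insertion step.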

To demonstrate the applicability of the present invention relative to conventional methods in multi-dimensional duplicate detection, the following experiment was constructed on a real data set.

Experimental environment: a single server with two six-core processors and 32 GB of memory;

Experimental data: a real domain-name data set.

Experiment content: compare the detection error rates of this method and the MDBF indexing method on multi-dimensional data. 1,000 records are inserted into the data set, and 6,000,000 query requests are tested at each of the coverage rates 0, 0.2, 0.4, 0.6, 0.8 and 1. The coverage rate is the probability that each attribute of a queried record has an identical copy in the data set; a coverage rate of 1 means that all attributes are duplicates.

Experimental results: Table 1 lists the detailed data, and Fig. 5 plots the error-rate test, with the coverage rate on the abscissa and the misjudgment rate on the ordinate.

Table 1. Experimental results

No.	Indexing method	Coverage rate	Number of queries	Number of errors
1	MDBF	0	6000000	272
2	MDBF	0.2	6000000	1200235
3	MDBF	0.4	6000000	2400175
4	MDBF	0.6	6000000	3600105
5	MDBF	0.8	6000000	4800060
6	MDBF	1	6000000	6000000
7	CCBFMA	0	6000000	1814
8	CCBFMA	0.2	6000000	3374
9	CCBFMA	0.4	6000000	3024
10	CCBFMA	0.6	6000000	4678
11	CCBFMA	0.8	6000000	5316
12	CCBFMA	1	6000000	6994

These results show that the misjudgment rate of the CCBFMA indexing method is far lower than that of the MDBF indexing method at every non-zero coverage rate; at a coverage rate of 0 both methods produce few errors (272 for MDBF and 1814 for CCBFMA out of 6,000,000 queries). Moreover, because the MDBF indexing method does not eliminate the combination error rate, all of its queries are misjudged when the coverage rate is 1, whereas CCBFMA does not have this problem.
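The misjudgment rates implied by Table 1 can be computed directly from the error counts. The snippet below only restates the table's figures; the variable names are illustrative.

```python
QUERIES = 6_000_000

# (indexing method, coverage rate) -> number of errors, copied from Table 1.
errors = {
    ("MDBF", 0.0): 272,
    ("MDBF", 0.2): 1_200_235,
    ("MDBF", 0.4): 2_400_175,
    ("MDBF", 0.6): 3_600_105,
    ("MDBF", 0.8): 4_800_060,
    ("MDBF", 1.0): 6_000_000,
    ("CCBFMA", 0.0): 1_814,
    ("CCBFMA", 0.2): 3_374,
    ("CCBFMA", 0.4): 3_024,
    ("CCBFMA", 0.6): 4_678,
    ("CCBFMA", 0.8): 5_316,
    ("CCBFMA", 1.0): 6_994,
}

# Misjudgment rate = errors / queries for each row of the table.
rates = {key: count / QUERIES for key, count in errors.items()}

for (method, coverage), rate in sorted(rates.items()):
    print(f"{method:7s} coverage={coverage:.1f} misjudgment rate={rate:.4%}")
```

The MDBF rate tracks the coverage rate almost exactly, which is consistent with the combination error described in the background section.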

The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Those of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be as defined in the claims.

Claims

1. A fast indexing method for duplicate detection of multi-dimensional data under a sliding window, comprising the steps of:

1) maintaining multiple sub-windows within the sliding window, all sub-windows maintaining the data items flowing through in FIFO order, the head sub-window receiving new elements in a sliding manner and the tail sub-window evicting old elements in a sliding manner;

2) maintaining the data items in the sliding window by means of a compressed counting Bloom filter matrix array index structure, each counting Bloom filter matrix maintaining one sub-window of the sliding window and containing data of multiple dimensions.

2. the method for claim 1, it is characterised in that: each attribute Bloom filter matrix is by some counter unit structures Become, bit wide d=log of counter unit₂(N/g), wherein N is total element capacity of sliding window, g be sliding window in son Window number, N/g is the design capacity of each subwindow.

3. the method for claim 1, it is characterised in that: all attribute Bloom filter matrixes all use identical design to hold Measure and share same group of k hash function.

4. The method of claim 3, characterized in that the counter units having the same coordinates in all counting Bloom filter matrices are mapped to and stored in the same vector and are read together in a single memory access.

5. the method for claim 1, it is characterised in that: in the counter unit of attribute Bloom filter matrix, safeguard system System Base clock, in order to carry out the implicit expression deletion action of sliding window interior element.

6. The method of claim 5, characterized in that whether an element x is a valid element of the current sliding window is determined as follows: first, for the existence test on x, the counting Bloom filter matrix effectively eliminates the combination error rate by means of its matrix form, thereby greatly reducing the element misjudgment rate; second, if the corresponding counting Bloom filter matrix judges that x exists, the counter maintained in its counter units is checked against the current base clock, and if it exceeds the base clock, x is considered not to be a valid element.

7. The method of claim 6, characterized in that element duplicate detection in the sliding-window data stream scenario comprises the following steps:

(3) checking whether element e exists in the compressed counting Bloom filter matrix array; if it does not exist, going to step (4) and entering the new-element insertion flow; if it exists, going to step (8);

(4) writing e (e1, e2, ..., ew) into the head counting Bloom filter matrix;

(5) writing the system base clock into the k corresponding counter units of that counting Bloom filter matrix;

(6) determining whether e is the last element of the head sub-window; if so, resetting the base clock, deleting the tail sub-window, and creating a new head sliding sub-window; if not, incrementing the base clock;

(7) setting the global flag `flag` to false and ending the flow;

(9) setting the global flag `flag` to true and ending the flow;

(10) determining whether the value of the corresponding counter units is greater than the system base clock; if so, setting the global flag `flag` to true and ending the flow; if not, setting the global flag `flag` to false and ending the flow.