CN105989061B

CN105989061B - Fast indexing method for duplicate detection of multidimensional data under a sliding window

Info

Publication number: CN105989061B
Application number: CN201510066798.4A
Authority: CN
Inventors: 王勇; 王树鹏; 王振宇; 王曦
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-02-09
Filing date: 2015-02-09
Publication date: 2019-11-26
Anticipated expiration: 2035-02-09
Also published as: CN105989061A

Abstract

The present invention relates to a fast indexing method for duplicate detection of multidimensional data under a sliding window. The method maintains the data items in the sliding window with a compressed counting Bloom filter matrix array. Multiple sub-windows are maintained within the sliding window: the head sub-window receives new elements in a sliding manner, and the tail sub-window eliminates old elements in a sliding manner. Each independent sub-window consists of one counting Bloom filter matrix, which supports dimension-aware representation of multidimensional data and element deletion, and which maintains counter units internally. Because all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions, duplicate-element detection efficiency is effectively improved. Because a system base clock is maintained in the counter units, implicit deletion of sliding-window elements is effectively supported. Because multidimensional data are maintained in matrix form, the combination error rate of multidimensional data is effectively reduced, which lowers the overall false-positive rate.

Description

Fast indexing method for duplicate detection of multidimensional data under a sliding window

Technical field

The present invention relates to a fast indexing method and system for duplicate detection of massive multidimensional data, and more particularly to an indexing method for duplicate detection of massive multidimensional data under the sliding-window data stream model. It belongs to the field of big data computing.

Background technique

With the development of the mobile Internet and Web 2.0, the global volume of data is growing at an astonishing rate: the data generated worldwide amounted to 0.49 ZB in 2008 (1 ZB = 10^21 bytes), 0.8 ZB in 2009, 1.2 ZB in 2010, and 1.82 ZB in 2011. IDC predicts that by 2020 humanity will generate more than 40 ZB of data. This high-speed, massive network data contains intricate information, including various business data streams such as IP service flows, user click streams, user query streams, and web server logs. It is also likely to contain various security events, which pose a great threat to network security, so network traffic monitoring is particularly important.

In network traffic monitoring systems, duplicate detection of multidimensional data is a very important preprocessing step. Take network service flows in a network management and monitoring system as an example: each service flow is uniquely determined by a five-tuple (source address, dest address, source port, dest port, protocol), and an efficient algorithm is needed to represent and query this set of five-dimensional elements in order to improve system efficiency.

In stream computing scenarios, the main computation windows currently in use are classified, by how the window boundaries move, into the following types. The first is the fixed window model, in which both ends of the computation window are fixed; it does little to reflect the timeliness of the data. The second is the landmark window model, in which the left end of the window is fixed and the right end moves forward; a landmark window covers the data items that occur between a particular point in time and the current time, and if the data stream is periodic, setting multiple landmarks is equivalent to dividing the stream into several independent small streams that are examined separately. The third is the jumping window model, in which the left end of the window advances in jumps and the right end advances by sliding; it reflects the continuous change of the data stream better than the landmark window model, but because elements are eliminated in batches at the end of the window, the number of valid elements in the window fluctuates noticeably. The fourth is the sliding window model, in which both ends of the window slide forward simultaneously; the sliding window deletes expired data items while inserting new ones and is regarded as the ideal model for data stream monitoring and analysis.
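The sliding behaviour described above can be illustrated with a minimal sketch. It assumes a count-based window and uses Python's `collections.deque`; the function name `sliding_window` and the sample stream are hypothetical and are not part of the specification.

```python
from collections import deque


def sliding_window(stream, capacity):
    """Yield the window contents after each arrival.

    Once the window is full, every new item evicts the oldest one,
    so both ends of the window move forward together.
    """
    window = deque(maxlen=capacity)
    for item in stream:
        window.append(item)  # the deque drops the oldest item when full
        yield list(window)


snapshots = list(sliding_window([1, 2, 3, 4, 5], capacity=3))
print(snapshots[-1])  # [3, 4, 5]
```

An exact window like this stores every element, which is what the index structures discussed below try to avoid for massive streams.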

Under the sliding window model, the main fast indexing methods for duplicate detection of multidimensional data are the following:

The first method is an index that combines hashing with counting. Hash indexing is a very convenient and efficient big data indexing mechanism that records the existence of multidimensional data by counting the data items that have the same hash value. When the sliding window inserts an element, the corresponding counter is incremented by 1; when it deletes an element, the corresponding counter is decremented by 1, and the corresponding element entry is removed if the counter reaches 0. However, the hash-counting index strategy has disadvantages. First, hashing is a non-deterministic algorithm, so hash collisions between data items are inevitable, and the quality of the collision handling method has a decisive effect on the index. Second, hashing occupies a large amount of memory.
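The hash-plus-counting scheme can be sketched as follows. This is an illustration under assumptions, not the prior-art implementation: the class name `HashCountingWindow` is hypothetical, and a Python dict stands in for the hash table, so collision handling is delegated to the language runtime.

```python
from collections import deque


class HashCountingWindow:
    """Sliding window that tracks duplicates with per-item counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()   # arrival order, needed to know what expires
        self.counts = {}       # item -> number of copies currently in the window

    def is_duplicate(self, item):
        return self.counts.get(item, 0) > 0

    def insert(self, item):
        if len(self.items) == self.capacity:
            old = self.items.popleft()
            self.counts[old] -= 1          # deletion: decrement the counter
            if self.counts[old] == 0:
                del self.counts[old]       # remove the entry when it reaches 0
        self.items.append(item)
        self.counts[item] = self.counts.get(item, 0) + 1  # insertion: increment


window = HashCountingWindow(capacity=2)
window.insert(("10.0.0.1", 80))
window.insert(("10.0.0.2", 443))
print(window.is_duplicate(("10.0.0.1", 80)))  # True
window.insert(("10.0.0.3", 22))               # evicts ("10.0.0.1", 80)
print(window.is_duplicate(("10.0.0.1", 80)))  # False
```

The sketch keeps every element in memory, which reflects the memory cost noted above.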

The second method is the multidimensional Bloom filter (MDBF) index. An MDBF consists of as many standard Bloom filters as the element has dimensions, and it decomposes the representation and querying of multidimensional elements into the representation and querying of subsets of single attribute values: an element with a given number of dimensions uses that many standard Bloom filters, one per attribute. To query an element, the MDBF checks whether every attribute value of the multidimensional element is present in the corresponding standard Bloom filter and thereby decides whether the element belongs to the set. This method also has shortcomings. First, its ability to delete elements within the sliding window is weak, so it cannot implement a sliding window that tracks data items precisely. Second, because the multiple hash functions in a Bloom filter may collide, and because the method relies only on existence checks on each independent dimension, the element false-positive rate can be high.
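The per-dimension weakness can be shown with a small sketch. As a simplifying assumption, Python sets stand in for the standard Bloom filters, so hash collisions play no role and only the combination effect remains; the class name `IdealizedMDBF` and the sample values are hypothetical.

```python
class IdealizedMDBF:
    """One membership structure per dimension, queried independently."""

    def __init__(self, dims):
        # Assumption: exact sets replace the standard Bloom filters.
        self.filters = [set() for _ in range(dims)]

    def insert(self, element):
        for per_dim, value in zip(self.filters, element):
            per_dim.add(value)

    def contains(self, element):
        # The element is reported present if every attribute value
        # is present in the filter of its own dimension.
        return all(value in per_dim for per_dim, value in zip(self.filters, element))


mdbf = IdealizedMDBF(dims=2)
mdbf.insert(("a1", "b1"))
mdbf.insert(("a2", "b2"))
print(mdbf.contains(("a1", "b2")))  # True, although ("a1", "b2") was never inserted
```

The mixed tuple is accepted because each of its attributes was seen in some inserted element, even though the combination itself never occurred.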

In conclusion fast indexing method is extremely important for the repeated data test problems in sliding window.Quick In indexing means, element repeated data detection efficiency is promoted, repeated data is reduced and detects False Rate, is the design of quick indexing structure In extremely important problem.

Summary of the invention

The main object of the present invention is to provide a fast indexing method and system for duplicate detection of multidimensional data under a sliding window, which improves the efficiency of duplicate-element detection, reduces the duplicate-detection false-positive rate, and effectively solves the problem of duplicate detection of multidimensional data under the sliding window model.

The present invention mainly comprises the following aspects.

First, in the design of the fast index structure, the present invention uses a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) to maintain the data items in the sliding window. Specifically, multiple sub-windows are maintained within the sliding window; the head sub-window receives new elements in a sliding manner, and the tail sub-window eliminates old elements in a sliding manner. Each independent sub-window consists of one compressed counting Bloom filter matrix (CCBFM), which supports dimension-aware representation of multidimensional data and element deletion, and which maintains counter units internally.

Second, based on the above index structure and with regard to duplicate-element detection efficiency, all counting Bloom filter matrices in the present invention use the same design capacity and share the same group of k hash functions. This reduces the time complexity of an element query from O(kn) to O(k) and effectively improves duplicate-element detection efficiency.

Third, based on the above index structure and with regard to applicability to sliding-window stream computing, the data items of each independent sub-window are maintained by one compressed counting Bloom filter matrix (CCBFM). By maintaining a system base clock in the counter units, the invention effectively supports implicit deletion of elements in the sliding window and improves the system's suitability for sliding-window operations.

Compared with the prior art, the main innovations and beneficial effects of the present invention are as follows:

1) In the design of fast index structures for big data, the present invention proposes a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure. The index structure maintains multiple sub-windows within the sliding window, and each independent sub-window consists of one compressed counting Bloom filter matrix (CCBFM).

2) Based on the above index structure, because all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions, duplicate-element detection efficiency is effectively improved; because a system base clock is maintained in the counter units, implicit deletion of sliding-window elements is effectively supported; and because multidimensional data are maintained in matrix form, the combination error rate of multidimensional data is effectively reduced, which lowers the overall false-positive rate.

Detailed description of the invention

Fig. 1 is a schematic diagram of the sliding window model;

Fig. 2 is a schematic diagram of the duplicate-detection index structure under the sliding window model;

Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model;

Fig. 4 is a flow chart of duplicate detection of multidimensional data at a data processing node;

Fig. 5 is a graph of the false-positive rate test.

Specific embodiment

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.

In the design of the sliding-window fast index structure, the present invention uses a compressed counting Bloom filter matrix array (CCBFMA) to maintain the data items in the sliding window. This structure offers better duplicate-element detection efficiency and a lower false-positive rate than existing solutions.

The multidimensional compressed counting Bloom filter matrix array (CCBFMA) consists of a group of isomorphic compressed counting Bloom filter matrices (CCBFM). Each CCBFM consists of m counter units of bit width d, where d = log₂(N/g). Let the total element capacity of the sliding window be N. The present invention maintains g sub-windows within the sliding window, each with a design capacity of N/g. All sub-windows maintain the data items flowing through them in first-in, first-out order: the head sub-window receives new elements in a sliding manner, and the tail sub-window eliminates old elements in a sliding manner.
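As a worked example of the bit-width formula d = log₂(N/g), the sketch below computes d for hypothetical values of N and g. The concrete numbers, the function name `counter_bit_width`, and the rounding up for capacities that are not powers of two are assumptions made for illustration.

```python
def counter_bit_width(total_capacity, num_subwindows):
    """Bits needed for a counter unit that must distinguish N/g clock values."""
    per_window = total_capacity // num_subwindows  # design capacity N/g
    # (x - 1).bit_length() equals ceil(log2(x)) for integers x >= 1,
    # which avoids floating-point rounding in math.log2.
    return (per_window - 1).bit_length()


# Hypothetical sizing: N = 2**20 elements split over g = 16 sub-windows.
print(counter_bit_width(1 << 20, 16))  # 16, since N/g = 65536 = 2**16
```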

Each CCBFM maintains all dimension data of one independent sub-window. The present invention performs implicit deletion of sliding-window elements by maintaining a system base clock in the counter units of the counting Bloom filter (CBF). Specifically, whether an element x is a valid element of the current sliding window is determined as follows. First, for the existence check on x, the CCBFM uses its matrix form to eliminate the combination error rate, which greatly reduces the element false-positive rate. Second, if the corresponding CCBFM judges that x exists, the counter maintained in its counter unit is checked against the current base clock; if it exceeds the base clock, x is considered not to be a valid element.

Fig. 1 shows the sliding-window data stream computation model. Both ends of the sliding window slide forward simultaneously; the sliding window deletes expired data items while inserting new ones and is regarded as the ideal model for data stream monitoring and analysis.

Fig. 2 shows the duplicate-detection index structure under the sliding window model. According to this index structure, the present invention maintains a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure in memory. The index structure maintains multiple sub-windows within the sliding window, and all dimension data of each independent sub-window are held in one compressed counting Bloom filter matrix (CCBFM).

Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model, where k is the number of hash functions, g is the number of sliding sub-windows, and d = log₂(N/g) is the bit width. As shown in the figure, all CCBFMs are isomorphic, and the counter units that have the same coordinates in different Bloom filters are mapped to and stored in the same vector, so that they can be read together in a single memory access. Because all g Bloom filters share the same group of k hash functions, determining whether an element x is a valid element in any of the current landmark sub-windows has its query time complexity reduced from O(kg) to O(k).
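The shared-hash layout can be sketched as below. This is a minimal illustration, assuming that the whole tuple is hashed by k salted BLAKE2b functions and that a Python list of length g models the vector holding the same-coordinate counter units; the names `shared_positions` and `InterleavedFilters` are hypothetical.

```python
import hashlib


def shared_positions(element, k, m):
    """Compute the k cell coordinates once; every sub-window filter reuses them."""
    key = repr(tuple(element)).encode("utf-8")
    positions = []
    for i in range(k):
        digest = hashlib.blake2b(
            key, digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        positions.append(int.from_bytes(digest, "little") % m)
    return positions


class InterleavedFilters:
    def __init__(self, m, g, k):
        self.m, self.g, self.k = m, g, k
        # cells[p][j] is the counter unit at coordinate p of sub-window filter j,
        # so one row holds the same coordinate of all g filters.
        self.cells = [[0] * g for _ in range(m)]

    def insert(self, element, window_index, stamp=1):
        for p in shared_positions(element, self.k, self.m):
            self.cells[p][window_index] = stamp

    def windows_containing(self, element):
        candidates = [True] * self.g
        for p in shared_positions(element, self.k, self.m):  # k row reads in total
            row = self.cells[p]
            candidates = [hit and cell != 0 for hit, cell in zip(candidates, row)]
        return [j for j, hit in enumerate(candidates) if hit]


filters = InterleavedFilters(m=1024, g=4, k=3)
flow = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
filters.insert(flow, window_index=2)
print(filters.windows_containing(flow))  # [2]
```

In this Python model each row is still scanned element by element; the O(k) figure in the text relies on a row of g counters of d bits each being fetched in a single memory access.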

Fig. 4 shows the workflow of duplicate detection of multidimensional data. As shown in the figure, duplicate-element detection in the sliding-window data stream scenario mainly comprises the following core steps.

(1) Initialize the system base clock, the element detection flag, and the system data structures;

(2) Receive the input element e, which consists of w dimensions, i.e., (e1, e2, ..., ew);

(3) Check whether element e exists in the CCBFMA; if it does not, go to step (4) and enter the new-element insertion procedure; if it does, go to step (8);

(4) Write e (e1, e2, ..., ew) into the head CCBFM;

(5) Write the system base clock into the k counter units of the corresponding CCBFM;

(6) Determine whether e is the last element of the head sub-window; if so, reset the base clock, delete the first sub-window at the tail, and generate a new head sliding sub-window; if not, increment the base clock;

(7) Set the global flag to false; the procedure ends;

(8) Determine whether element ei exists in the tail sub-window; if not, go to step (9); if so, go to step (10);

(9) Set the global flag to true; the procedure ends;

(10) Determine whether the value of the corresponding counter unit is greater than the system base clock; if so, set the global flag to true and the procedure ends; if not, set the global flag to false and the procedure ends.
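Steps (1) to (10) can be sketched as follows. This is an illustrative reading of the flow, not the patented implementation, and it rests on several assumptions: the whole w-dimensional tuple is hashed jointly by k salted BLAKE2b functions, clock stamps run from 1 to N/g so that 0 marks an empty counter unit, and the minimum of the k counter values is taken as the stamp in step (10). The class name `CCBFMASketch` and all parameter values are hypothetical.

```python
import hashlib
from collections import deque


class CCBFMASketch:
    """Illustrative duplicate detector that follows steps (1)-(10) as listed."""

    def __init__(self, total_capacity, g, m, k):
        assert g >= 2 and total_capacity % g == 0
        self.w = total_capacity // g   # design capacity N/g of each sub-window
        self.m, self.k = m, k
        self.clock = 0                 # step (1): system base clock
        # windows[0] is the tail and windows[-1] the head; 0 means "empty cell".
        self.windows = deque([0] * m for _ in range(g))

    def _positions(self, element):
        # Assumption: the full tuple is hashed jointly by k shared functions.
        key = repr(tuple(element)).encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(key, digest_size=8,
                                salt=i.to_bytes(8, "little")).digest(),
                "little",
            ) % self.m
            for i in range(self.k)
        ]

    def _insert(self, positions):
        stamp = self.clock + 1
        head = self.windows[-1]
        for p in positions:            # steps (4)-(5): write the clock into k cells
            head[p] = stamp
        if stamp == self.w:            # step (6): e was the last element of the head
            self.clock = 0
            self.windows.popleft()                 # eliminate the tail sub-window
            self.windows.append([0] * self.m)      # open a new head sub-window
        else:
            self.clock += 1

    def process(self, element):
        """Return the flag: True if the element is judged a duplicate."""
        positions = self._positions(element)       # step (2)
        hits = [all(win[p] for p in positions) for win in self.windows]  # step (3)
        if not any(hits):
            self._insert(positions)
            return False                           # step (7)
        if not hits[0]:                            # step (8): not in the tail
            return True                            # step (9)
        stamp = min(self.windows[0][p] for p in positions)
        return stamp > self.clock                  # step (10)


detector = CCBFMASketch(total_capacity=4, g=2, m=1 << 16, k=3)
flow = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
print(detector.process(flow))  # False: first sighting, written into the head
print(detector.process(flow))  # True: found in the head sub-window
```

In this reading, a tail entry whose stamp is not greater than the base clock has already slid out of the window, so no explicit per-element deletion is needed; the whole tail filter is discarded when the head fills up.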

To demonstrate the applicability of the present invention relative to conventional methods in multidimensional duplicate-detection scenarios, the following experiment was constructed on a real data set.

Experimental environment: a single dual-socket, six-core server with 32 GB of memory;

Experimental data: a real domain name data set

Experiment content: compare the detection error rates of the present method and the MDBF indexing method on multidimensional data. 1000 records are inserted into the data set, and queries over 6,000,000 records are tested at coverage rates of 0, 0.2, 0.4, 0.6, 0.8, and 1. (The coverage rate is the probability that an attribute of the queried data has an identical copy in the data set; a coverage rate of 1 means that all attributes are duplicates.)

Experimental results: Table 1 lists the detailed data, and Fig. 5 plots the false-positive rate test, with the coverage rate on the abscissa and the false-positive rate on the ordinate.

Table 1. Experimental results

No.	Indexing method	Coverage rate	Queries	Errors
1	MDBF	0	6000000	272
2	MDBF	0.2	6000000	1200235
3	MDBF	0.4	6000000	2400175
4	MDBF	0.6	6000000	3600105
5	MDBF	0.8	6000000	4800060
6	MDBF	1	6000000	6000000
7	CCBFMA	0	6000000	1814
8	CCBFMA	0.2	6000000	3374
9	CCBFMA	0.4	6000000	3024
10	CCBFMA	0.6	6000000	4678
11	CCBFMA	0.8	6000000	5316
12	CCBFMA	1	6000000	6994

From the above experimental results it can be concluded that the false-positive rate of the CCBFMA indexing method is significantly lower than that of the MDBF indexing method at every coverage rate from 0.2 to 1 (at a coverage rate of 0 both error counts are small: 272 for MDBF and 1814 for CCBFMA). In addition, because the MDBF indexing method does not eliminate the combination error rate, all queries are misjudged when the coverage rate is 1; CCBFMA does not have this problem.
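The error counts in Table 1 can be converted into error rates with a short calculation. The sketch below only divides each error count by the 6,000,000 queries reported in the table; the variable names are hypothetical.

```python
QUERIES = 6_000_000

# Error counts copied from Table 1, keyed by coverage rate.
errors = {
    "MDBF": {0: 272, 0.2: 1200235, 0.4: 2400175, 0.6: 3600105, 0.8: 4800060, 1: 6000000},
    "CCBFMA": {0: 1814, 0.2: 3374, 0.4: 3024, 0.6: 4678, 0.8: 5316, 1: 6994},
}

rates = {
    method: {coverage: count / QUERIES for coverage, count in by_coverage.items()}
    for method, by_coverage in errors.items()
}

for method, by_coverage in rates.items():
    for coverage, rate in by_coverage.items():
        print(f"{method:7s} coverage={coverage}: error rate {rate:.4%}")
```

At a coverage rate of 1 the MDBF error rate is exactly 100%, while the CCBFMA error rate stays close to 0.1%.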

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the invention, and the scope of protection of the invention shall be defined by the claims.

Claims

1. A fast indexing method for duplicate detection of multidimensional data under a sliding window, the steps of which comprise:

1) maintaining multiple sub-windows within the sliding window, wherein all sub-windows maintain the data items flowing through them in first-in, first-out order, the head sub-window receives new elements in a sliding manner, and the tail sub-window eliminates old elements in a sliding manner;

2) maintaining the data items in the sliding window by means of a compressed counting Bloom filter matrix array index structure, wherein the compressed counting Bloom filter matrix array consists of a group of isomorphic counting Bloom filter matrices, and each counting Bloom filter matrix maintains all dimension data of one independent sub-window of the sliding window;

each counting Bloom filter matrix consists of a number of counter units, the bit width of a counter unit being d = log₂(N/g), where N is the total element capacity of the sliding window, g is the number of sub-windows in the sliding window, and N/g is the design capacity of each sub-window;

all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions;

a system base clock is maintained in the counter units of the counting Bloom filter matrices in order to perform implicit deletion of elements in the sliding window;

the counting Bloom filter matrix maintains multidimensional data in matrix form, which effectively reduces the combination error rate of multidimensional data.

2. The method according to claim 1, characterized in that the counter units that have the same coordinates in all counting Bloom filter matrices are mapped to and stored in the same vector and are read together in a single memory access.

3. The method according to claim 1, characterized in that whether an element x is a valid element of the current sliding window is determined as follows: first, for the existence check on x, the counting Bloom filter matrix uses its matrix form to eliminate the combination error rate, thereby greatly reducing the element false-positive rate; second, if the corresponding counting Bloom filter matrix judges that x exists, the counter maintained in its counter unit is checked against the current base clock, and if it exceeds the base clock, x is considered not to be a valid element.

4. The method according to claim 3, characterized in that duplicate-element detection in the sliding-window data stream scenario comprises the following steps:

(3) checking whether element e exists in the compressed counting Bloom filter matrix array; if it does not, going to step (4) and entering the new-element insertion procedure; if it does, going to step (8);

(4) writing e (e1, e2, ..., ew) into the head counting Bloom filter matrix;

(5) writing the system base clock into the k counter units of the corresponding counting Bloom filter matrix;

(6) determining whether e is the last element of the head sub-window; if so, resetting the base clock, deleting the first sub-window at the tail, and generating a new head sliding sub-window; if not, incrementing the base clock;

(7) setting the global flag to false, whereupon the procedure ends;

(9) setting the global flag to true, whereupon the procedure ends;

(10) determining whether the value of the corresponding counter unit is greater than the system base clock; if so, setting the global flag to true, whereupon the procedure ends; if not, setting the global flag to false, whereupon the procedure ends.