CN105989061B - Fast indexing method for multidimensional duplicate-data detection under a sliding window - Google Patents

Fast indexing method for multidimensional duplicate-data detection under a sliding window

Info

Publication number
CN105989061B
Authority
CN
China
Prior art keywords
bloom filter
window
sliding window
filter matrix
attribute bloom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510066798.4A
Other languages
Chinese (zh)
Other versions
CN105989061A (en)
Inventor
王勇
王树鹏
王振宇
王曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201510066798.4A
Publication of CN105989061A
Application granted
Publication of CN105989061B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a fast indexing method for detecting duplicate multidimensional data under a sliding window. The method maintains the data items in the sliding window with a compressed counting Bloom filter matrix array (CCBFMA). The sliding window is divided into multiple sub-windows: the sub-window at the head of the queue receives new elements as the window slides, and the sub-window at the tail of the queue evicts old elements. Each independent sub-window is backed by one compressed counting Bloom filter matrix (CCBFM), which can maintain and delete multidimensional data and internally maintains counter units. Because all CCBFMs use the same design capacity and share the same group of k hash functions, duplicate-element detection becomes more efficient; because the system base clock is recorded in the counter units, elements of the sliding window can be deleted implicitly; and because multidimensional data is maintained in matrix form, the combination error rate of multidimensional data, and with it the overall misjudgment rate, is reduced.

Description

Fast indexing method for multidimensional duplicate-data detection under a sliding window
Technical field
The present invention relates to a fast indexing method and system for duplicate detection in massive multidimensional data, and more particularly to an indexing method for detecting duplicates in massive multidimensional data under the sliding-window data stream model. It belongs to the field of big data computing.
Background art
With the development of the mobile Internet and Web 2.0, the global volume of data is growing at an astonishing rate: the data generated worldwide amounted to 0.49 ZB in 2008 (1 ZB = 10²¹ bytes), 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011. IDC expects that by 2020 humanity will generate more than 40 ZB of data. This high-speed, massive network data contains intricate information, including various business data streams such as IP service flows, user click streams, user query streams, and web server logs. It is also likely to contain various security incidents, which pose a great threat to network security, so network traffic monitoring is particularly important.
In network traffic monitoring systems, duplicate detection on multidimensional data is a very important preprocessing step. Take network service flows in a network management and monitoring system as an example: each flow is uniquely determined by a five-tuple (source address, dest address, source port, dest port, protocol), and representing and querying sets of these five-dimensional elements requires efficient algorithms to improve system efficiency.
In stream computing scenarios, the computation windows in common use fall into the following types, according to how the window boundaries move. The first is the fixed window model, in which both ends of the window are fixed; it does little to reflect the timeliness of the data. The second is the landmark window model, in which the left end is fixed and the right end moves forward; a landmark window contains the data items that occurred between a particular point in time and the present, and periodically setting multiple landmarks in a data stream is equivalent to dividing the stream into several independent small streams that are examined separately. The third is the jumping window model, in which the left end advances in jumps while the right end slides forward; it reflects the continuous change of the data stream better than the landmark window model, but because elements are evicted from the end of the window in batches, the number of valid elements in the window fluctuates noticeably. The fourth is the sliding window model, in which both ends slide forward together: the window deletes stale data items while inserting new ones, and it is regarded as the ideal model for data stream monitoring and analysis.
Under the sliding window model, the main fast indexing methods for detecting duplicate multidimensional data are the following:
The first is the hash-plus-counting indexing method. Hash indexing is a convenient and efficient big-data indexing mechanism; it records the existence of multidimensional data by counting the data items that have the same hash value. When an element is inserted into the sliding window, the corresponding counter is incremented by 1; when an element is deleted, the corresponding counter is decremented by 1, and the element entry is removed once its counter reaches 0. This index strategy has drawbacks, however. First, a hash function is not a one-to-one mapping, so collisions between data items are inevitable, and the quality of the collision-handling method has a decisive effect on the index. Second, hash-based indexing occupies a great deal of memory.
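The counting scheme described above can be sketched in a few lines. This is only an illustration under assumptions of my own: the class name `CountingWindowIndex` and the method `observe` are hypothetical, and Python's built-in `Counter` (itself a hash table) stands in for the hash index, so collision handling is delegated to the language runtime rather than shown explicitly.

```python
from collections import Counter, deque


class CountingWindowIndex:
    """Hypothetical hash-plus-counting index over the last `capacity` items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.window = deque()    # arrival order, oldest item on the left
        self.counts = Counter()  # item -> number of copies currently in the window

    def observe(self, item):
        """Record `item` and report whether it was already in the window."""
        duplicate = item in self.counts
        self.window.append(item)
        self.counts[item] += 1               # insertion: counter plus 1
        if len(self.window) > self.capacity:
            old = self.window.popleft()
            self.counts[old] -= 1            # deletion: counter minus 1
            if self.counts[old] == 0:
                del self.counts[old]         # drop the entry once it reaches 0
        return duplicate
```

Note that the sketch has to keep every item of the window in memory twice (once in the queue, once as a key), which is the memory cost the text points out.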
The second is the multidimensional Bloom filter (MDBF) indexing method. MDBF consists of as many standard Bloom filters as the elements have dimensions. It decomposes the representation and querying of a multidimensional element into the representation and querying of single-attribute subsets: each attribute of the element is represented by its own standard Bloom filter. To query an element, MDBF checks whether every attribute value of the element is present in the corresponding standard Bloom filter. This method also has shortcomings. First, its ability to delete elements within the sliding window is weak, so it cannot implement a sliding window that is accurate at the level of individual data items. Second, because the multiple hash functions in a Bloom filter can collide, and because only the existence of each attribute in its own dimension is checked, the misjudgment rate for elements is high.
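A minimal sketch may help to show why per-dimension checks alone misjudge combinations. The classes `BloomFilter` and `MDBF` below are illustrative stand-ins written for this example (the filter size, the number of hash functions, and the SHA-256 based hashing are all assumptions), not the implementation evaluated in this patent.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter with k salted SHA-256 hash functions (illustrative)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, value):
        data = repr(value).encode("utf-8")
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + data).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = 1

    def __contains__(self, value):
        return all(self.bits[p] for p in self._positions(value))


class MDBF:
    """One independent Bloom filter per dimension, as described in the text."""

    def __init__(self, dims):
        self.filters = [BloomFilter() for _ in range(dims)]

    def add(self, element):
        for f, v in zip(self.filters, element):
            f.add(v)

    def __contains__(self, element):
        return all(v in f for f, v in zip(self.filters, element))


mdbf = MDBF(dims=2)
mdbf.add(("10.0.0.1", 80))
mdbf.add(("10.0.0.2", 443))
print(("10.0.0.1", 80) in mdbf)   # True: this pair was inserted
print(("10.0.0.1", 443) in mdbf)  # True: never inserted, but each attribute is known
```

The second query is a combination error: both attribute values are present in their own filters, so the pair is reported as present even though it was never inserted.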
In conclusion fast indexing method is extremely important for the repeated data test problems in sliding window.Quick In indexing means, element repeated data detection efficiency is promoted, repeated data is reduced and detects False Rate, is the design of quick indexing structure In extremely important problem.
Summary of the invention
The main object of the present invention is to provide a fast indexing method and system for duplicate detection on multidimensional data under a sliding window, which improves the efficiency of duplicate detection, reduces its misjudgment rate, and effectively solves the problem of multidimensional duplicate detection under the sliding window model.
The invention mainly comprises the following aspects.
First, for the design of the fast index structure, the present invention maintains the data items in the sliding window with a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array). Specifically, multiple sub-windows are maintained in the sliding window; the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements. Each independent sub-window is backed by one counting Bloom filter matrix (CCBFM), which can maintain and delete multidimensional data and internally maintains counter units.
Second, building on this index structure design and with regard to duplicate-detection efficiency, all counting Bloom filter matrices in the invention use the same design capacity and share the same group of k hash functions. This reduces the time complexity of an element query from O(kn) to O(k), where n is the number of filter matrices (one per sub-window), and so improves duplicate-detection efficiency.
Third, building on this index structure design and with regard to suitability for sliding-window stream computing, the data items of each independent sub-window are maintained by one counting Bloom filter matrix (CCBFM). By recording the system base clock in the counter units, the invention supports implicit deletion of elements in the sliding window and makes the system better suited to sliding-window operation.
Compared with the prior art, the main innovations and beneficial effects of the present invention are as follows:
1) For the design of fast index structures for big data, the present invention proposes a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure. The index structure maintains multiple sub-windows in the sliding window, and each independent sub-window is backed by one counting Bloom filter matrix (CCBFM).
2) Building on this index structure design, the invention gives all counting Bloom filter matrices the same design capacity and lets them share the same group of k hash functions, which improves duplicate-detection efficiency; it records the system base clock in the counter units, which supports implicit deletion of sliding-window elements; and it maintains multidimensional data in matrix form, which reduces the combination error rate of multidimensional data and the overall misjudgment rate.
Brief description of the drawings
Fig. 1 is a schematic diagram of the sliding window model;
Fig. 2 is a schematic diagram of the index structure for duplicate-data detection under the sliding window model;
Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model;
Fig. 4 is a workflow diagram of multidimensional duplicate detection at a data processing node;
Fig. 5 is a graph of the error-rate test.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.
In the design of the sliding-window fast index structure, the present invention maintains the data items in the sliding window with a compressed counting Bloom filter matrix array (CCBFMA); this structure offers higher duplicate-detection efficiency and a lower misjudgment rate than existing solutions.
The multidimensional compressed counting Bloom filter matrix array (CCBFMA) consists of a group of isomorphic compressed counting Bloom filter matrices (CCBFMs). Each CCBFM consists of m counter units of bit width d, where d = log₂(N/g). Here N is the total element capacity of the sliding window and g is the number of sub-windows that the invention maintains in the sliding window, so the design capacity of each sub-window is N/g. All sub-windows maintain the data items flowing through them in first-in, first-out order: the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements.
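As a small worked example of the sizing rule d = log₂(N/g): the helper below is hypothetical, and rounding up to a whole number of bits when N/g is not a power of two is my assumption, since the text only states the formula.

```python
def counter_bit_width(total_capacity, num_subwindows):
    """Bits needed so a counter can hold any clock value in 0 .. N/g - 1."""
    per_window = total_capacity // num_subwindows  # design capacity N/g
    # (per_window - 1).bit_length() equals ceil(log2(per_window)) for per_window >= 1
    return (per_window - 1).bit_length()


# N = 2**20 elements split into g = 16 sub-windows gives N/g = 65536, so d = 16 bits.
print(counter_bit_width(2**20, 16))
```

With these example values each counter unit needs 16 bits, enough to store the position of an element within its sub-window of 65536 elements.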
Each CCBFM maintains all dimensions of the data in one independent sub-window. The invention records the system base clock in the counter units of the CCBFM in order to delete elements in the sliding window implicitly. Whether an element x is a valid element of the current sliding window is determined as follows. First, for the existence test on x, the CCBFM eliminates the combination error rate through its matrix form, which greatly reduces the misjudgment rate. Second, if the corresponding CCBFM judges that x exists, the value kept in its counter units is checked against the current base clock; if it exceeds the base clock, x is considered not to be a valid element.
Fig. 1 shows the sliding-window computation model for data streams. Both ends of the sliding window slide forward together; the window deletes stale data items while inserting new ones and is regarded as the ideal model for data stream monitoring and analysis.
Fig. 2 shows the index structure for duplicate-data detection under the sliding window model. Following this structure, the invention maintains a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure in memory. The index structure maintains multiple sub-windows in the sliding window, and all dimensions of the data in each independent sub-window are maintained by one counting Bloom filter matrix (CCBFM).
Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model, where k is the number of hash functions, g is the number of sliding sub-windows, and d = log₂(N/g) is the counter bit width. As shown, all CCBFMs are isomorphic, and the counter units that have the same coordinates in different Bloom filters are mapped to and stored in the same vector, so that they can be read together in a single memory access. Because all g Bloom filters share the same group of k hash functions, the time complexity of determining whether element x is a valid element in any of the current sub-windows is reduced from O(kg) to O(k).
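The shared-hash layout can be illustrated roughly as follows. The class `SharedHashFilters`, its `hash_calls` counter, and the SHA-256 based hash functions are all assumptions made for this sketch; a Python list per coordinate stands in for the vector that would be fetched in one memory access.

```python
import hashlib


class SharedHashFilters:
    """g counting filters that share k hash functions and a column-wise layout."""

    def __init__(self, g, m, k):
        self.g, self.m, self.k = g, m, k
        # columns[p] groups the g counters that have coordinate p, one per filter
        self.columns = [[0] * g for _ in range(m)]
        self.hash_calls = 0  # instrumentation for the example only

    def _positions(self, item):
        self.hash_calls += self.k
        data = repr(item).encode("utf-8")
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + data).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item, window):
        for p in self._positions(item):
            self.columns[p][window] += 1

    def query(self, item):
        """Return the indices of the filters that may contain `item`."""
        cols = [self.columns[p] for p in self._positions(item)]  # k column reads
        return [w for w in range(self.g) if all(col[w] > 0 for col in cols)]


filters = SharedHashFilters(g=4, m=1 << 12, k=3)
filters.add(("10.0.0.1", 80), window=2)
print(filters.query(("10.0.0.1", 80)))
```

A query evaluates the k hash functions once and reads k columns, instead of hashing and probing each of the g filters separately. The final per-filter comparison is still a loop in this Python sketch; the text's O(k) figure relies on the g counters of a column being fetched together.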
Fig. 4 shows the workflow of multidimensional duplicate detection. As shown, duplicate detection for an element in the sliding-window data stream scenario mainly comprises the following core steps.
(1) Initialize the system base clock, the element detection flag, and the system data structures;
(2) Receive the input element e, which consists of w dimensions, i.e. (e1, e2, ..., ew);
(3) Check whether element e exists in the CCBFMA. If it does not, go to step (4) and start the new-element insertion process; if it does, go to step (8);
(4) Write e (e1, e2, ..., ew) into the head CCBFM;
(5) Write the system base clock into the k corresponding counter units of that CCBFM;
(6) Determine whether e is the last element of the head sub-window. If it is, reset the base clock, delete the first sub-window at the tail of the queue, and generate a new head sub-window; if it is not, increment the base clock;
(7) Set the global flag to false; the process ends;
(8) Determine whether element ei exists in the tail sub-window. If it does not, go to step (9); if it does, go to step (10);
(9) Set the global flag to true; the process ends;
(10) Determine whether the value of the corresponding counter units is greater than the system base clock. If it is, set the global flag to true and end the process; if it is not, set the global flag to false and end the process.
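Steps (1) to (10) can be sketched as follows. This is an illustration under several assumptions of mine, not the patented structure: the class `CCBFMASketch` is hypothetical; the matrix layout is not spelled out in the text, so the whole tuple is simply hashed together; a dict per sub-window stands in for the m counter units; the tail comparison of step (10) is applied only once g sub-windows exist (g is assumed to be at least 2); and when later writes overwrite a shared counter, the smallest of the k stored values is taken as the element's clock.

```python
import hashlib
from collections import deque


class CCBFMASketch:
    """Illustrative sliding-window duplicate detector following steps (1)-(10)."""

    def __init__(self, total_capacity, num_subwindows, num_cells=1 << 20, k=4):
        self.cap = total_capacity // num_subwindows  # N/g elements per sub-window
        self.g = num_subwindows
        self.m = num_cells
        self.k = k
        self.clock = 0              # step (1): system base clock
        self.windows = deque([{}])  # index 0 = tail (oldest), -1 = head (newest)

    def _cells(self, element):
        # One shared group of k hash functions applied to the whole tuple.
        data = repr(element).encode("utf-8")
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + data).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def process(self, element):
        """Return the flag: True if `element` is a duplicate inside the window."""
        cells = self._cells(element)
        full = len(self.windows) == self.g
        live = list(self.windows)[1:] if full else list(self.windows)
        # Step (9): present in a sub-window other than the tail, so a duplicate.
        if any(all(c in w for c in cells) for w in live):
            return True
        # Steps (8) and (10): present only in the tail, compare with the base clock.
        if full:
            tail = self.windows[0]
            if all(c in tail for c in cells):
                stored = min(tail[c] for c in cells)
                return stored > self.clock
        # Steps (4) and (5): write into the head together with the base clock.
        head = self.windows[-1]
        for c in cells:
            head[c] = self.clock
        # Step (6): rotate the sub-windows when the head is full, else tick the clock.
        if self.clock == self.cap - 1:
            self.clock = 0
            self.windows.append({})
            if len(self.windows) > self.g:
                self.windows.popleft()
        else:
            self.clock += 1
        return False  # step (7)


index = CCBFMASketch(total_capacity=8, num_subwindows=4)
flow = ("10.0.0.1", "10.0.1.1", 1001, 80, "tcp")
print(index.process(flow), index.process(flow))
```

The first call inserts the five-tuple and reports it as new; the second call finds it in the window and reports a duplicate. As in step (10), an element found only in the tail sub-window whose stored clock value is not greater than the current base clock is reported as not duplicated and, following the flow as written, is not re-inserted.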
To demonstrate the applicability of the present invention to multidimensional duplicate detection relative to conventional methods, the following experiment was constructed on a real data set.
Experimental environment: a single server with two six-core processors and 32 GB of memory;
Experimental data: a real domain name data set;
Experiment content: compare the detection error rates of the present method and the MDBF indexing method on multidimensional data. 1,000 records are inserted into the data set, and query requests for 6,000,000 records are tested at coverage rates of 0, 0.2, 0.4, 0.6, 0.8, and 1. (The coverage rate is the probability that an attribute of a queried record has an identical copy in the data set; a coverage rate of 1 means that all attributes are duplicates.)
Experimental results: Table 1 lists the detailed data, and Fig. 5 plots the error-rate test; the horizontal axis is the coverage rate and the vertical axis is the misjudgment rate.
Table 1. Experimental results
No.  Indexing method  Coverage rate  Queries  Errors
1    MDBF             0              6000000  272
2    MDBF             0.2            6000000  1200235
3    MDBF             0.4            6000000  2400175
4    MDBF             0.6            6000000  3600105
5    MDBF             0.8            6000000  4800060
6    MDBF             1              6000000  6000000
7    CCBFMA           0              6000000  1814
8    CCBFMA           0.2            6000000  3374
9    CCBFMA           0.4            6000000  3024
10   CCBFMA           0.6            6000000  4678
11   CCBFMA           0.8            6000000  5316
12   CCBFMA           1              6000000  6994
From these results it can be concluded that the CCBFMA indexing method has a significantly lower misjudgment rate than the MDBF indexing method. In addition, because the MDBF indexing method does not eliminate the combination error rate, all queries are misjudged when the coverage rate is 1, whereas CCBFMA does not have this problem.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention; the scope of protection shall be as defined in the claims.
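The error counts in Table 1 can be converted into misjudgment rates with a few lines of arithmetic. The snippet below only restates the table's numbers; the variable and function names are my own.

```python
QUERIES = 6_000_000  # queries per test run, from Table 1

# Error counts from Table 1, keyed by coverage rate.
errors = {
    "MDBF": {0: 272, 0.2: 1200235, 0.4: 2400175, 0.6: 3600105, 0.8: 4800060, 1: 6000000},
    "CCBFMA": {0: 1814, 0.2: 3374, 0.4: 3024, 0.6: 4678, 0.8: 5316, 1: 6994},
}


def misjudgment_rates(error_counts, queries=QUERIES):
    """Misjudgment rate = number of erroneous answers / number of queries."""
    return {coverage: count / queries for coverage, count in error_counts.items()}


rates = {name: misjudgment_rates(counts) for name, counts in errors.items()}
for name, by_coverage in rates.items():
    for coverage, rate in by_coverage.items():
        print(f"{name:7s} coverage={coverage:<4} misjudgment rate={rate:.4%}")
```

At a coverage rate of 1, for example, the MDBF rate is 6000000 / 6000000 = 100%, while the CCBFMA rate is 6994 / 6000000, which is roughly 0.12%.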

Claims (4)

1. A fast indexing method for multidimensional duplicate-data detection under a sliding window, the steps of which comprise:
1) maintaining multiple sub-windows in the sliding window, wherein all sub-windows maintain the data items flowing through them in first-in, first-out order, the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements;
2) maintaining the data items in the sliding window with a compressed counting Bloom filter matrix array index structure, wherein the compressed counting Bloom filter matrix array consists of a group of isomorphic counting Bloom filter matrices, and each counting Bloom filter matrix maintains all dimensions of the data of one independent sub-window of the sliding window;
wherein each counting Bloom filter matrix consists of several counter units whose bit width is d = log₂(N/g), where N is the total element capacity of the sliding window, g is the number of sub-windows in the sliding window, and N/g is the design capacity of each sub-window;
all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions;
the system base clock is recorded in the counter units of the counting Bloom filter matrices in order to delete elements in the sliding window implicitly;
the counting Bloom filter matrices maintain multidimensional data in matrix form, which reduces the combination error rate of multidimensional data.
2. The method of claim 1, characterized in that the counter units that have the same coordinates in all counting Bloom filter matrices are mapped to and stored in the same vector and are read together in a single memory access.
3. The method of claim 1, characterized in that whether an element x is a valid element of the current sliding window is determined as follows: first, for the existence test on x, the counting Bloom filter matrix eliminates the combination error rate through its matrix form, thereby greatly reducing the misjudgment rate; second, if the corresponding counting Bloom filter matrix judges that x exists, the value kept in its counter units is checked against the current base clock, and if it exceeds the base clock, x is considered not to be a valid element.
4. The method of claim 3, characterized in that duplicate detection for an element in the sliding-window data stream scenario comprises the following steps:
(1) initializing the system base clock, the element detection flag, and the system data structures;
(2) receiving the input element e, which consists of w dimensions, i.e. (e1, e2, ..., ew);
(3) checking whether element e exists in the compressed counting Bloom filter matrix array; if it does not, going to step (4) and starting the new-element insertion process; if it does, going to step (8);
(4) writing e (e1, e2, ..., ew) into the counting Bloom filter matrix at the head of the queue;
(5) writing the system base clock into the k corresponding counter units of that counting Bloom filter matrix;
(6) determining whether e is the last element of the head sub-window; if it is, resetting the base clock, deleting the first sub-window at the tail of the queue, and generating a new head sub-window; if it is not, incrementing the base clock;
(7) setting the global flag to false and ending the process;
(8) determining whether element ei exists in the tail sub-window; if it does not, going to step (9); if it does, going to step (10);
(9) setting the global flag to true and ending the process;
(10) determining whether the value of the corresponding counter units is greater than the system base clock; if it is, setting the global flag to true and ending the process; if it is not, setting the global flag to false and ending the process.
CN201510066798.4A 2015-02-09 2015-02-09 Multidimensional data repeats detection fast indexing method under a kind of sliding window Active CN105989061B (en)

Publications (2)

Publication Number Publication Date
CN105989061A CN105989061A (en) 2016-10-05
CN105989061B true CN105989061B (en) 2019-11-26

Family

ID=57038169

Country Status (1)

Country Link
CN (1) CN105989061B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant