CN105989061B - Fast indexing method for duplicate detection of multidimensional data under a sliding window - Google Patents
- Publication number
- CN105989061B (application number CN201510066798.4A)
- Authority
- CN
- China
- Prior art keywords
- bloom filter
- window
- sliding window
- filter matrix
- attribute bloom
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a fast indexing method for duplicate detection of multidimensional data under a sliding window. The method maintains the data items in the sliding window with a compressed counting Bloom filter matrix array. Multiple sub-windows are maintained within the sliding window: the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements as the window slides. Each independent sub-window consists of one counting Bloom filter matrix, which supports multidimensional data and element deletion and internally maintains counter units. Because all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions, duplicate-detection efficiency is effectively improved. Because the system base clock is recorded in the counter units, implicit deletion of sliding-window elements is effectively supported. Because multidimensional data is maintained in matrix form, the combination error rate of multidimensional data is effectively reduced, which lowers the overall misjudgment rate.
Description
Technical field
The invention relates to a fast indexing method and system for duplicate detection of massive multidimensional data, and in particular to an indexing method for duplicate detection of massive multidimensional data under the sliding-window data stream model. It belongs to the field of big data computing.
Background technique
With the development of the mobile Internet and Web 2.0, the global volume of data is growing at an astonishing rate: the world generated 0.49 ZB of data in 2008 (1 ZB = $10^{21}$ bytes), 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011. IDC expects that by 2020 humanity will generate more than 40 ZB of data. High-speed, massive network data contains intricate information, including many kinds of business data streams such as IP service flows, user click streams, user query streams, and web server logs. It is also likely to contain various security events, which pose a great threat to network security, so network traffic monitoring is particularly important.
In network traffic monitoring systems, duplicate detection of multidimensional data is a very important preprocessing step. Take the monitoring of network service flows in a network management system as an example: each flow is uniquely determined by a five-tuple (source address, dest address, source port, dest port, protocol), and representing and querying this set of five-dimensional elements requires efficient algorithms to improve system efficiency.
In data stream computing, computation windows fall into the following types according to how the window boundaries move. The first is the fixed window model, in which both ends of the window are fixed; it does little to reflect the timeliness of data. The second is the landmark window model, in which the left end is fixed and the right end moves forward; a landmark window contains the data items that occurred between a particular point in time and the current time, and setting multiple landmarks periodically in a data stream is equivalent to dividing the stream into several independent smaller streams that are examined separately. The third is the jumping window model, in which the left end advances in jumps and the right end slides forward; it reflects the continuous change of a data stream better than the landmark window model, but because elements are evicted in batches at the window tail, the number of valid elements in the window fluctuates noticeably. The fourth is the sliding window model, in which both ends slide forward together; a sliding window deletes stale data items as it inserts new ones and is regarded as the ideal model for data stream monitoring and analysis.
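The sliding window model described above can be illustrated with a minimal sketch. It assumes a count-based window of fixed capacity; the function name `slide` and the parameter `capacity` are hypothetical and chosen only for illustration.

```python
from collections import deque


def slide(stream, capacity):
    """Yield (inserted, evicted) pairs for a count-based sliding window."""
    window = deque()
    for item in stream:
        # Once the window is full, the oldest item leaves as the new one enters.
        evicted = window.popleft() if len(window) == capacity else None
        window.append(item)
        yield item, evicted


events = list(slide("abcde", capacity=3))
```

Each new item displaces exactly one old item once the window is full, which is the element-by-element behavior that distinguishes a sliding window from the batch eviction of a jumping window.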
Under the sliding window model, the main fast indexing methods for duplicate detection of multidimensional data are as follows:
The first is indexing by hashing combined with counting. Hash indexing is a convenient and efficient indexing mechanism for big data; it records the existence of multidimensional data items by counting the data items that have the same hash value. When the sliding window inserts an element, the corresponding counter is incremented by 1; when it deletes an element, the corresponding counter is decremented by 1, and the element entry is removed if the counter reaches 0. The hash-counting strategy has drawbacks, however. First, a hash algorithm is a non-deterministic algorithm in which collisions between data items are inevitable, and the quality of the collision-handling method is decisive for the index. Second, hashing consumes a great deal of memory.
The second is the multidimensional Bloom filter (MDBF) indexing method. An MDBF consists of as many standard Bloom filters as the element has dimensions, and it decomposes the representation and querying of a multidimensional element directly into the representation and querying of subsets of single attribute values: however many dimensions an element has, that many standard Bloom filters are used, one per attribute. To query an element, the method checks whether every attribute value of the multidimensional element is present in its corresponding standard Bloom filter and decides on that basis whether the element belongs to the set. This method also has shortcomings. First, its ability to delete elements within a sliding window is weak, so it cannot realize a sliding window that is accurate to the data item. Second, because the multiple hash functions in a Bloom filter can collide and the method relies only on existence checks in each independent dimension, the element misjudgment rate is relatively high.
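The weakness of per-dimension existence checks can be shown with a small sketch. The classes `BloomFilter` and `MDBF` below are illustrative only; the filter size, the number of hash functions, and the use of SHA-256 to derive positions are assumptions, not details taken from the patent.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over m bits with k derived hash positions."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, value):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = 1

    def __contains__(self, value):
        return all(self.bits[p] for p in self._positions(value))


class MDBF:
    """One standard Bloom filter per attribute, as in the MDBF description."""

    def __init__(self, dims, m=1024, k=3):
        self.filters = [BloomFilter(m, k) for _ in range(dims)]

    def add(self, record):
        for f, attr in zip(self.filters, record):
            f.add(attr)

    def __contains__(self, record):
        return all(attr in f for f, attr in zip(self.filters, record))


mdbf = MDBF(dims=2)
mdbf.add(("10.0.0.1", 80))
mdbf.add(("10.0.0.2", 443))
# This combination was never inserted, yet each attribute exists on its own.
combo_hit = ("10.0.0.1", 443) in mdbf
```

Here `combo_hit` is `True` even though no hash collision occurred: the address came from one record and the port from another. This is the combination error that a purely per-dimension check cannot rule out.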
In summary, a fast indexing method is essential for duplicate detection within a sliding window. Improving duplicate-detection efficiency and reducing the duplicate-detection misjudgment rate are critical issues in designing a fast index structure.
Summary of the invention
The main object of the invention is to provide a fast indexing method and system for duplicate detection of multidimensional data under a sliding window that improves the efficiency of element duplicate detection, reduces the duplicate-detection misjudgment rate, and effectively solves the problem of duplicate detection of multidimensional data under the sliding window model.
The contents of the present invention mainly include the following aspects.
First, for the fast index structure, the invention uses a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) to maintain the data items in the sliding window. Specifically, multiple sub-windows are maintained within the sliding window; the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements as the window slides. Each independent sub-window consists of one counting Bloom filter matrix (CCBFM), which supports multidimensional data and deletion and internally maintains counter units.
Second, building on this index structure, and with regard to duplicate-detection efficiency, all counting Bloom filter matrices in the invention use the same design capacity and share the same group of k hash functions. This reduces the time complexity of an element query from O(kn) to O(k), where n is the number of sub-window filters, and effectively improves duplicate-detection efficiency.
Third, building on this index structure, and with regard to suitability for sliding-window data stream computing, the data items of each independent sub-window are maintained by one counting Bloom filter matrix (CCBFM). Recording the system base clock in the counter units effectively supports implicit deletion of elements in the sliding window and makes the system better suited to sliding-window operation.
Compared with the prior art, the main innovations and beneficial effects of the invention are as follows:
1) For the design of fast index structures for big data, the invention proposes a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure. The index structure maintains multiple sub-windows within the sliding window, and each independent sub-window consists of a counting Bloom filter matrix (CCBFM).
2) Building on this index structure, the invention effectively improves duplicate-detection efficiency by having all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions; it effectively supports implicit deletion of sliding-window elements by recording the system base clock in the counter units; and it effectively reduces the combination error rate of multidimensional data, and hence the overall misjudgment rate, by maintaining multidimensional data in matrix form.
Brief description of the drawings
Fig. 1 is a schematic diagram of the sliding window model;
Fig. 2 is a schematic diagram of the duplicate-detection index structure under the sliding window model;
Fig. 3 is a schematic diagram of merged, shared hash functions under the sliding window model;
Fig. 4 is a workflow diagram of multidimensional duplicate detection at a data processing node;
Fig. 5 is a graph of the error-rate test.
Specific embodiment
To make the above objects, features, and advantages of the invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.
In the design of the fast index structure for the sliding window, the invention uses a compressed counting Bloom filter matrix array (CCBFMA) to maintain the data items in the sliding window. This structure offers higher duplicate-detection efficiency and a lower misjudgment rate than existing solutions.
The multidimensional compressed counting Bloom filter matrix array (CCBFMA) consists of a group of isomorphic counting Bloom filter matrices (CCBFM). Each CCBFM consists of m counter units of bit width d, where $d = \log_2(N/g)$. Suppose the total element capacity of the sliding window is N. The invention maintains g sub-windows within the sliding window, each with a design capacity of N/g. All sub-windows maintain the data items flowing through them in first-in, first-out order: the head sub-window receives new elements as the window slides, and the tail sub-window evicts old elements as the window slides.
Each CCBFM maintains all dimensions of the data of one independent sub-window. The invention performs implicit deletion of sliding-window elements by recording the system base clock in the counter units of the counting Bloom filter (CBF). Specifically, to decide whether an element x is a valid element of the current sliding window: first, for the existence check on x, the CCBFM effectively eliminates the combination error rate through its matrix form, greatly reducing the element misjudgment rate; second, if the corresponding CCBFM judges that x exists, the counter stored in its counter units must be checked against the current base clock, and if it exceeds the base clock, x is not considered a valid element.
Fig. 1 shows the computation model of the sliding data stream window. Both ends of the sliding window slide forward together; the sliding window deletes stale data items as it inserts new ones and is regarded as the ideal model for data stream monitoring and analysis.
Fig. 2 shows the duplicate-detection index structure under the sliding window model. According to this index structure, the invention maintains a compressed counting Bloom filter matrix array (CCBFMA, Compressed Counting Bloom Filter Matrix Array) index structure in RAM. The index structure maintains multiple sub-windows within the sliding window, and all dimensions of the data of each independent sub-window are held by one counting Bloom filter matrix (CCBFM).
Fig. 3 shows how hash functions are merged and shared under the sliding window model, where k is the number of hash functions, g is the number of sliding sub-windows, and $d = \log_2(N/g)$ is the bit width. As the figure shows, all CCBFMs are isomorphic, and the counter units that have the same coordinates in different Bloom filters are mapped to and stored in the same vector, so that they can be read together in a single memory access. Because all g Bloom filters share the same group of k hash functions, the query time complexity of deciding whether an element x is a valid element in any of the current sub-windows is reduced from O(kg) to O(k).
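The shared-hash layout can be sketched as a table in which row p holds the g counters that sit at coordinate p in each sub-window's filter. The constants `K`, `M`, and `G`, the SHA-256-based `positions` helper, and the use of plain occurrence counters are all assumptions made for illustration; the patent's figure is not reproduced here.

```python
import hashlib

K, M, G = 3, 64, 4  # hash functions, cells per filter, sub-windows (illustrative)

# cells[p][j] is the counter at coordinate p of sub-window j's filter,
# so one row groups the same coordinate across all G filters.
cells = [[0] * G for _ in range(M)]


def positions(element):
    """Compute the K shared positions once; they are valid for every filter."""
    out = []
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{element!r}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % M)
    return out


def insert(element, window):
    for p in positions(element):
        cells[p][window] += 1


def windows_containing(element):
    """Return the sub-windows whose filter reports the element as present."""
    hits = [True] * G
    for p in positions(element):  # K hash evaluations and K row reads in total
        row = cells[p]            # one read yields the counters of all G filters
        hits = [h and c > 0 for h, c in zip(hits, row)]
    return [j for j, h in enumerate(hits) if h]


insert(("10.0.0.1", 80), window=2)
found = windows_containing(("10.0.0.1", 80))
```

In Python the inner comparison still visits g counters per row; the O(k) figure in the text presumably relies on the g counters of one coordinate being fetched and tested in a single memory access, which this sketch only mimics through the row layout.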
Fig. 4 shows the workflow of multidimensional duplicate detection. As the figure shows, element duplicate detection in the sliding-window data stream scenario mainly comprises the following core steps.
(1) Initialize the system base clock, the element-detection flag bit flag, and the system data structures;
(2) Receive the input element e, which consists of w dimensions, i.e. (e1, e2, ..., ew);
(3) Check whether element e exists in the CCBFMA; if it does not, go to step (4) to begin the new-element insertion process; if it does, go to step (8);
(4) Write e (e1, e2, ..., ew) into the head CCBFM;
(5) Write the system base clock into the k counter units of the corresponding CCBFM;
(6) Check whether e is the last element of the head sub-window; if so, reset the base clock, delete the first sub-window at the tail of the queue, and create a new head sliding sub-window; if not, increment the base clock;
(7) Set the global flag to false, and the process ends;
(8) Check whether element ei exists in the tail sub-window; if not, go to step (9); if so, go to step (10);
(9) Set the global flag to true, and the process ends;
(10) Check whether the value of the corresponding counter unit is greater than the system base clock; if so, set the global flag to true, and the process ends; if not, set the global flag to false, and the process ends.
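The ten steps above can be sketched in code as follows. This is a simplified illustration under several assumptions, not the patented implementation: the class name `SlidingDuplicateDetector` is hypothetical, each sub-window is a flat array of timestamp cells rather than a matrix, the whole tuple is hashed jointly as a stand-in for the matrix layout, and the smallest of an element's k cell values is taken as its stamp in step (10).

```python
import hashlib
from collections import deque


class SlidingDuplicateDetector:
    """Sketch of steps (1)-(10); cell layout and hashing are assumptions."""

    def __init__(self, sub_windows=3, per_window=2, cells=1024, k=3):
        self.per_window, self.cells, self.k = per_window, cells, k
        self.clock = 0  # step (1): system base clock
        # One timestamp array per sub-window; index 0 is the tail, -1 the head.
        self.filters = deque([None] * cells for _ in range(sub_windows))

    def _positions(self, element):
        # Assumption: joint hashing of the tuple replaces the matrix addressing.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{element!r}".encode()).digest()[:8], "big"
            ) % self.cells
            for i in range(self.k)
        ]

    def is_duplicate(self, element):
        pos = self._positions(element)  # step (2)
        hits = [f for f in self.filters if all(f[p] is not None for p in pos)]
        if not hits:  # step (3): not present anywhere, so insert
            head = self.filters[-1]
            for p in pos:  # steps (4)-(5): write the base clock into k cells
                head[p] = self.clock
            if self.clock == self.per_window - 1:  # step (6): head is full
                self.clock = 0
                self.filters.popleft()  # drop the tail sub-window
                self.filters.append([None] * self.cells)  # new head
            else:
                self.clock += 1
            return False  # step (7)
        tail = self.filters[0]
        if not any(f is tail for f in hits):  # step (8)
            return True  # step (9)
        stamp = min(tail[p] for p in pos)  # assumption: oldest cell value
        return stamp > self.clock  # step (10)


detector = SlidingDuplicateDetector(sub_windows=3, per_window=2)
stream = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("d", 4), ("b", 2), ("a", 1)]
flags = [detector.is_duplicate(e) for e in stream]
```

Barring hash collisions, the third and sixth arrivals are flagged as duplicates, while the final `("a", 1)` is not: by then its sub-window has become the tail and its stamp 0 is no longer greater than the base clock. As in the listed steps, an expired hit simply ends the process without re-inserting the element.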
To demonstrate the applicability of the invention relative to conventional methods in multidimensional duplicate-detection scenarios, the following experiment was constructed on a real data set.
Experimental environment: stand-alone server, dual-socket six-core CPUs, 32 GB memory;
Experimental data: a real domain-name data set
Experimental content: compare the detection error rate of this method with that of the MDBF indexing method in a multidimensional data scenario. 1,000 records are inserted into the data set, and query requests for 6,000,000 records are tested at coverage rates of 0, 0.2, 0.4, 0.6, 0.8, and 1 (the coverage rate is the probability that an attribute of the queried data has an identical copy in the data set; a coverage rate of 1 means that all attributes are duplicates).
Experimental result: Table 1 lists the detailed data, and Fig. 5 plots the error-rate test, with the coverage rate on the horizontal axis and the misjudgment rate on the vertical axis.
Table 1. Experimental results
No. | Indexing method | Coverage rate | Queries | Errors |
---|---|---|---|---|
1 | MDBF indexing method | 0 | 6000000 | 272 |
2 | MDBF indexing method | 0.2 | 6000000 | 1200235 |
3 | MDBF indexing method | 0.4 | 6000000 | 2400175 |
4 | MDBF indexing method | 0.6 | 6000000 | 3600105 |
5 | MDBF indexing method | 0.8 | 6000000 | 4800060 |
6 | MDBF indexing method | 1 | 6000000 | 6000000 |
7 | CCBFMA indexing method | 0 | 6000000 | 1814 |
8 | CCBFMA indexing method | 0.2 | 6000000 | 3374 |
9 | CCBFMA indexing method | 0.4 | 6000000 | 3024 |
10 | CCBFMA indexing method | 0.6 | 6000000 | 4678 |
11 | CCBFMA indexing method | 0.8 | 6000000 | 5316 |
12 | CCBFMA indexing method | 1 | 6000000 | 6994 |
The experimental results above show that the CCBFMA indexing method has a markedly lower misjudgment rate than the MDBF indexing method. In addition, because the MDBF indexing method does not eliminate the combination error rate, every query is a misjudgment when the coverage rate is 1; CCBFMA does not have this problem.
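The misjudgment rates plotted in Fig. 5 can be recovered from Table 1 by dividing each error count by the 6,000,000 queries. The short sketch below does this; the variable names are hypothetical, and the numbers are copied from the table.

```python
QUERIES = 6_000_000

# Error counts from Table 1, keyed by coverage rate.
errors = {
    "MDBF": {0: 272, 0.2: 1_200_235, 0.4: 2_400_175,
             0.6: 3_600_105, 0.8: 4_800_060, 1: 6_000_000},
    "CCBFMA": {0: 1_814, 0.2: 3_374, 0.4: 3_024,
               0.6: 4_678, 0.8: 5_316, 1: 6_994},
}

# Misjudgment rate = errors / queries for every method and coverage rate.
rates = {
    method: {coverage: count / QUERIES for coverage, count in row.items()}
    for method, row in errors.items()
}
```

Under the table's figures, the MDBF rate closely tracks the coverage rate (about 0.20 at coverage 0.2, and exactly 1.0 at coverage 1), while the CCBFMA rate stays below roughly 0.12% at every coverage rate tested.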
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the invention or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention shall be as defined in the claims.
Claims (4)
1. A fast indexing method for duplicate detection of multidimensional data under a sliding window, comprising the steps of:
1) maintaining multiple sub-windows within the sliding window, all sub-windows maintaining the data items flowing through them in first-in, first-out order, the head sub-window receiving new elements as the window slides and the tail sub-window evicting old elements as the window slides;
2) maintaining the data items in the sliding window through a compressed counting Bloom filter matrix array index structure, the compressed counting Bloom filter matrix array consisting of a group of isomorphic counting Bloom filter matrices, each counting Bloom filter matrix maintaining all dimensions of the data of one independent sub-window of the sliding window;
wherein each counting Bloom filter matrix consists of several counter units, the bit width of a counter unit being $d = \log_2(N/g)$, where N is the total element capacity of the sliding window, g is the number of sub-windows in the sliding window, and N/g is the design capacity of each sub-window;
all counting Bloom filter matrices use the same design capacity and share the same group of k hash functions;
the system base clock is recorded in the counter units of the counting Bloom filter matrices in order to perform implicit deletion of elements in the sliding window;
the counting Bloom filter matrix maintains multidimensional data in matrix form, effectively reducing the combination error rate of multidimensional data.
2. The method according to claim 1, characterized in that the counter units having the same coordinates in all counting Bloom filter matrices are mapped to and stored in the same vector and are read together in a single memory access.
3. The method according to claim 1, characterized in that whether an element x is a valid element of the current sliding window is decided as follows: first, for the existence check on x, the counting Bloom filter matrix effectively eliminates the combination error rate through its matrix form, thereby greatly reducing the element misjudgment rate; second, if the corresponding counting Bloom filter matrix judges that x exists, the counter stored in its counter units is checked against the current base clock, and if it exceeds the base clock, x is not considered a valid element.
4. The method according to claim 3, characterized in that element duplicate detection in the sliding-window data stream scenario comprises the following steps:
(1) initialize the system base clock, the element-detection flag bit flag, and the system data structures;
(2) receive the input element e, which consists of w dimensions, i.e. (e1, e2, ..., ew);
(3) check whether element e exists in the compressed counting Bloom filter matrix array; if it does not, go to step (4) to begin the new-element insertion process; if it does, go to step (8);
(4) write e (e1, e2, ..., ew) into the counting Bloom filter matrix at the head of the queue;
(5) write the system base clock into the k counter units of the corresponding counting Bloom filter matrix;
(6) check whether e is the last element of the head sub-window; if so, reset the base clock, delete the first sub-window at the tail of the queue, and create a new head sliding sub-window; if not, increment the base clock;
(7) set the global flag to false, and the process ends;
(8) check whether element ei exists in the tail sub-window; if not, go to step (9); if so, go to step (10);
(9) set the global flag to true, and the process ends;
(10) check whether the value of the corresponding counter unit is greater than the system base clock; if so, set the global flag to true, and the process ends; if not, set the global flag to false, and the process ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510066798.4A CN105989061B (en) | 2015-02-09 | 2015-02-09 | Multidimensional data repeats detection fast indexing method under a kind of sliding window |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105989061A CN105989061A (en) | 2016-10-05 |
CN105989061B true CN105989061B (en) | 2019-11-26 |
Family
ID=57038169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510066798.4A Active CN105989061B (en) | 2015-02-09 | 2015-02-09 | Multidimensional data repeats detection fast indexing method under a kind of sliding window |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105989061B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108694074B (en) * | 2017-04-07 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Method for acquiring counting information and server |
CN106997391B (en) * | 2017-04-10 | 2020-11-03 | 华北电力大学(保定) | Method for rapidly screening steady-state working condition data in large-scale process data |
CN110704419A (en) * | 2018-06-21 | 2020-01-17 | 中兴通讯股份有限公司 | Data structure, data indexing method, device and equipment, and storage medium |
CN109582640B (en) * | 2018-11-15 | 2020-12-01 | 深圳市酷开网络科技有限公司 | Sliding window-based data deduplication storage method and device and storage medium |
CN109815234B (en) * | 2018-12-29 | 2021-01-08 | 杭州中科先进技术研究院有限公司 | Multiple cuckoo filter under STREAMING computational model |
CN110083743B (en) * | 2019-03-28 | 2021-11-16 | 哈尔滨工业大学(深圳) | Rapid similar data detection method based on unified sampling |
CN112529613A (en) * | 2020-11-27 | 2021-03-19 | 广州华多网络科技有限公司 | Method and device for processing user continuous login data and transferring virtual resources |
CN112751869B (en) * | 2020-12-31 | 2023-07-14 | 中国人民解放军战略支援部队航天工程大学 | Method and device for detecting abnormal network traffic based on sliding window group |
CN112688837B (en) * | 2021-03-17 | 2021-06-08 | 中国人民解放军国防科技大学 | Network measurement method and device based on time sliding window |
CN114595280B (en) * | 2022-05-10 | 2022-08-02 | 鹏城实验室 | Time member query method, device, terminal and medium based on sliding window |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102253820A (en) * | 2011-06-16 | 2011-11-23 | 华中科技大学 | Stream type repetitive data detection method |
CN103336771A (en) * | 2013-04-02 | 2013-10-02 | 江苏大学 | Data similarity detection method based on sliding window |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||