CN108595581A - Method and system for mining frequent items in a data stream - Google Patents
- Publication number: CN108595581A
- Application number: CN201810345014.5A
- Authority
- CN
- China
- Prior art keywords
- data item
- array
- frequent item
- data
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a mining method and a mining system for frequent items in a data stream. The mining method comprises: reading a data item from the data stream; looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequency values; when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array. In this way, the frequency of every data item can be updated, the frequent items in the data stream can be mined quickly, space is greatly saved, and, at the same time, estimation accuracy can be improved.
Description
Technical field
This application relates to the field of data streams, and more particularly to a method and a system for mining frequent items in a data stream.
Background art
Over the past ten-plus years, industry has come to appreciate the importance of statistically analyzing data streams and of mining useful information from them. Data streams are widely used in many fields. A data stream is an unbounded data sequence that evolves over time and is characterized by unboundedness, continuity, and rapidity. The volume of data in a stream is so large that, in general, not all of it can be kept in memory: while some data is read from the stream into memory, other data must be discarded, and the discarded data is unrecoverable.
A fundamental problem on data streams is to find the several most frequently occurring data items in a stream and to report their frequencies. Finding the most frequent items has many practical applications, such as analyzing click streams, telephone call records, and network packet logs, detecting network fraud, and filtering spam addresses. The defining characteristics of the data-stream model are that the input is very large, that the entire stream cannot be held in memory, and that the data can only be read sequentially in a single pass. The traditional approach of counting the frequency of every occurring data item and then sorting the frequency values to find the items with the largest frequencies is therefore impractical, in both space and time, under the data-stream model. Mining information over data streams is thus a major challenge facing the field of data mining.
Summary of the invention
In view of the above shortcomings of the related art, the purpose of this application is to disclose a method and a system for mining frequent items in a data stream, so as to solve problems such as the large space and time cost of the frequent-item mining techniques for data streams in the related art.
To achieve the above and other purposes, a first aspect of the application discloses a method for mining frequent items in a data stream, comprising the following steps: reading a data item from the data stream; looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequency values; when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array.
A second aspect of the application discloses a system for mining frequent items in a data stream, comprising: a reading module for reading a data item from the data stream; a lookup module for looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequency values; and a record-updating module for: when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array.
A third aspect of the application discloses a computer-readable storage medium storing a program applied to data mining; when the program is executed by at least one processor, each step of the method for mining frequent items in a data stream as described above is realized.
A fourth aspect of the application discloses a data processing device comprising: at least one processor; at least one memory; and at least one program, wherein the at least one program is stored in the at least one memory and configured to be executed as instructions by the at least one processor, the instructions causing the data processing device to execute each step of the method for mining frequent items in a data stream as described above.
As described above, the mining method and mining system for frequent items in a data stream of this application have the following beneficial effects: each data item in the data stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. When a data item is read from the data stream, it is first determined whether the data item exists in the frequent-item array; if so, the frequency value of that data item in the frequent-item array is updated; if not, the frequency value of that data item is updated in the auxiliary data-item array. In this way, the frequency of every data item can be updated, the frequent items in the data stream can be mined quickly, space is greatly saved, and, at the same time, estimation accuracy can be improved.
Description of the drawings
Fig. 1 is a flow diagram of the method for mining frequent items in a data stream of this application in one embodiment.
Fig. 2 is a structural diagram of a Bloom filter array.
Fig. 3 is a flow diagram of the method for mining frequent items in a data stream of this application in another embodiment.
Fig. 4 is a diagram of the relationship between the frequent-item hash table and the heap space in the frequent-item array.
Fig. 5 is a structural diagram of the system for mining frequent items in a data stream of this application in one embodiment.
Fig. 6 is a diagram of the influence of the Zipf distribution parameter alpha on the error.
Fig. 7 is a diagram comparing the effectiveness of three common algorithms, including that of this application, on Zipf-distributed data.
Fig. 8 is a diagram of the influence of the size of the Bloom filter array on the error.
Fig. 9 is a diagram of the influence of the number d of hash functions on the error.
Fig. 10 is a structural diagram of the data processing device of this application in one embodiment.
Detailed description of the embodiments
The embodiments of the present application are illustrated below by way of particular specific examples; those skilled in the art can readily understand the other advantages and effects of the application from the content disclosed in this specification. In the following description, reference is made to the accompanying drawings, which depict several embodiments of the application. It should be understood that other embodiments may also be used, and that compositional and operational changes may be made without departing from the spirit and scope of the present disclosure. The following detailed description should not be considered limiting; the scope of the embodiments herein is limited only by the claims of this patent. The terms used herein serve merely to describe particular embodiments and are not intended to limit the application.
Although the terms first, second, and so on are used herein in some instances to describe various elements, these elements should not be limited by these terms. The terms are only used to distinguish one element from another. For example, a first predetermined threshold could be called a second predetermined threshold and, similarly, a second predetermined threshold could be called a first predetermined threshold, without departing from the scope of the various described embodiments. The first predetermined threshold and the second predetermined threshold each describe a threshold, but unless the context explicitly indicates otherwise, they are not the same predetermined threshold.
Furthermore, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises" and "comprising" indicate the presence of the stated features, steps, operations, elements, components, items, types, and/or groups, but do not exclude the presence, appearance, or addition of one or more other features, steps, operations, elements, components, items, types, and/or groups. The terms "or" and "and/or" used herein are to be interpreted as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
A data stream is an unbounded data sequence that evolves over time. One of the most basic problems on data streams is to find the frequent items in the stream, a frequent item being a data item whose frequency of occurrence in the stream reaches a certain level; that is, to find the several most frequently occurring data items in the stream. Owing to the unboundedness, continuity, and rapidity of data streams, the data volume is very large and the entire stream cannot be held in memory. In the related art, the frequency of each occurring data item is usually counted, and the frequency values of all data items are then sorted to find the corresponding frequent items; this approach is impractical in both space consumption and time consumption. In order to reduce the corresponding space and time consumption, this application discloses a method for mining frequent items in a data stream in which each data item in the stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. When a data item is read from the data stream, it is first determined whether the data item exists in the frequent-item array; if so, the frequency value of that data item in the frequent-item array is updated; if not, the frequency value of that data item is updated in the auxiliary data-item array. In this way, the frequency of every data item can be updated, the frequent items in the data stream can be mined quickly, space is greatly saved, and, at the same time, estimation accuracy can be improved.
The method for mining frequent items in a data stream may be executed by an information processing device such as a computer device. The computer device may be any suitable device, such as a handheld computing device, a tablet computing device, a notebook computer, a desktop computer, or a server. The computer device may include one or more of the following components: a display, an input device, input/output (I/O) ports, one or more processors, a memory, a non-volatile storage device, a network interface, and a power supply. The various components may include hardware elements (such as chips and circuits), software elements (such as a tangible, non-transitory computer-readable medium storing instructions), or a combination of hardware and software elements. In addition, it may be noted that the various components can be combined into fewer components or separated into additional components; for example, the memory and the non-volatile storage device can be included in a single component. The computer device may execute the mining method for frequent items alone or in cooperation with other computer devices.
This application discloses a method for mining frequent items in a data stream. Referring to Fig. 1, which is a flow diagram of the method for mining frequent items in a data stream of this application in one embodiment, the method comprises the following steps:
Step S11: read a data item from the data stream. A data stream is an unbounded data sequence that evolves over time and is characterized by unboundedness, continuity, and rapidity. Data streams arise in a variety of application environments; for example, large amounts of streaming data are produced in financial securities trading, weather forecasting, hydrological observation, website click-stream analysis, telephone call records, network packet logs, network fraud detection, spam address filtering, and the like. In general, therefore, processing a data stream means receiving a part of the stream and reading its data items sequentially; the part of the stream that has been read may then be discarded, after which the other parts of the stream continue to be received and read sequentially.
In step S11, reading a data item from the data stream means reading one data item in the course of this sequential reading of the data items in the stream.
Step S13: look up whether the data item exists in a frequent-item array.
Clearly, a fundamental problem on data streams is to find the frequent items in the stream, that is, to find the several most frequently occurring data items and to report their frequencies of occurrence. For example, given an alphabet A = {E_i | 1 ≤ i ≤ |A|}, let S be a data stream composed of data items from A. The frequency f_i of a data item E_i ∈ A is its number of occurrences in the data stream S. Without loss of generality, assume f_1 ≥ f_2 ≥ ... ≥ f_|A|, and assume that the goal for the data stream S is to find the K data items with the highest frequencies of occurrence and to report the values of those K frequencies. Any data-stream algorithm can only read S sequentially; a data item that has been read cannot be read again, and the algorithm cannot, as with RAM (Random Access Memory), read the data item at a specified position in S. In theory, an exact solution to the above problem requires complete information about the frequencies of all data items, which is impractical in the data-stream model; the focus is therefore mainly on mining approximate solutions. Approximate here means giving a good estimate of each data item's frequency of occurrence, so that the error with respect to the true value is as small as possible, and so that the K items the algorithm reports as having the highest frequencies of occurrence are, in general, the K items that truly do.
Traditionally, in the related art, each data item in the data stream S is read sequentially, the data item and information such as its frequency are placed in a corresponding data structure, the frequency values of the data items in that data structure are then sorted, and the several data items with the largest frequency values are found. In the related art described above, however, both the space and the time consumption are rather large.
As described in step S13, the method for mining frequent items in a data stream of this application notably designs a frequent-item array, in which the data items with the highest frequencies in the data stream (that is, the frequent items) and their frequencies can be recorded, namely the K most frequently occurring data items and their frequencies; hereinafter, the data items recorded in the frequent-item array may also be called frequent items. The method therefore further includes a step of creating a frequent-item array in advance. As before, assume the purpose is to mine the K data items with the highest frequencies of occurrence in the data stream S: a frequent-item array of size K is created, with addresses from 0 to K-1, and each entry of the frequent-item array holds some data item from the alphabet A together with its frequency.
To realize the lookup of step S13, in this embodiment, creating the frequent-item array further includes creating a frequent-item hash table for it. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, in step S13, when some data item x is read from the data stream S, the frequent-item hash table can be used to judge whether the data item x is in the frequent-item array, that is, whether the data item x is recorded in the frequent-item array. In this way, testing whether a data item belongs to the records of the frequent-item array TOPK takes only O(1) time using the frequent-item hash table. In one embodiment, the frequent-item hash table can be obtained by applying a hash function to each data item recorded in the frequent-item array. Thus, when a data item is read, the hash function is applied to it to obtain a computed value, the computed value is matched against the frequent-item hash table, and according to the matching result it is judged whether the data item exists in the frequent-item array. Although the foregoing embodiment uses a frequent-item hash table to realize fast access to the data items, the application is not limited thereto, and those skilled in the art may also use other data structures with similar functions.
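The pairing of a size-K array with a hash table for O(1) membership and update can be sketched as follows; the class and method names are illustrative assumptions, not taken from the patent:

```python
class TopKArray:
    """Frequent-item array of size K plus a hash table for O(1) lookup.

    A sketch of the structure described above: the hash table (a dict)
    maps each recorded item to its slot in the array, so membership
    tests and frequency updates both take expected O(1) time.
    """

    def __init__(self, k):
        self.k = k
        self.items = []          # list of [item, frequency] pairs, size <= K
        self.index = {}          # hash table: item -> position in self.items

    def contains(self, item):
        return item in self.index            # O(1) expected

    def increment(self, item):
        pos = self.index[item]
        self.items[pos][1] += 1              # update frequency in place

    def insert(self, item, freq=1):
        self.index[item] = len(self.items)
        self.items.append([item, freq])

top = TopKArray(k=3)
top.insert("x")
top.increment("x")
print(top.contains("x"), top.items)  # True [['x', 2]]
```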
Step S15: when the data item is found in the frequent-item array, update the frequency value of the corresponding data item in the frequent-item array. As mentioned above, in step S13 the frequent-item hash table makes it easy to determine whether the data item that was read is in the frequent-item array. Therefore, in step S15, when the data item is judged to be present in the records of the frequent-item array, that is, when the data item is shown to belong to the K data items with the highest frequencies of occurrence, the frequency value of the corresponding data item in the frequent-item array is increased by 1, realizing the accumulation of the data item's frequency value.
Step S17: when the data item is not found in the frequent-item array, update the frequency value of the corresponding data item in an auxiliary data-item array.
In this application, each data item in the data stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. In this embodiment, the auxiliary data-item array can use a conventional data structure, for example a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly judge whether an element belongs to a set. Given a set X, the Bloom filter algorithm iterates over each element of X and judges whether that element belongs to a set Y.
The application of a Bloom filter array is briefly described below.
Assume the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial values 0. For each element y in Y, use d independent hash functions h_1, h_2, ..., h_d to map y to the addresses y_1, y_2, ..., y_d (0 ≤ y_i ≤ M-1, 1 ≤ i ≤ d), and set the values at those addresses to 1. For each element x in X, use the same d hash functions to obtain the d hash values of x, namely h_1(x), h_2(x), ..., h_d(x), and check whether the values at these d addresses in the array are all 1. If some address holds the value 0, it can be concluded that x ∉ Y; if all of them are 1, we conclude that x ∈ Y with probability very close to 1.
The power of the Bloom filter array is that it does not need to store all the elements of Y completely, which is very effective when |Y| is very large. Compared with other data structures, a Bloom filter therefore has great advantages in both space and time. The Bloom filter algorithm is a randomized algorithm, requiring O(|X|) operations and O(|Y|) space.
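As a rough illustration of the membership-only Bloom filter just described, the following minimal sketch uses Python's salted built-in hash as a stand-in for the d independent hash functions (an assumption for illustration only):

```python
class BloomFilter:
    """Minimal membership-only Bloom filter with M bits and d hash functions."""

    def __init__(self, m, d):
        self.m = m
        self.d = d
        self.bits = [0] * m

    def _addresses(self, x):
        # Stand-in for d independent hash functions: salt the input with i.
        return [hash((i, x)) % self.m for i in range(self.d)]

    def add(self, x):
        for a in self._addresses(x):
            self.bits[a] = 1

    def might_contain(self, x):
        # All d bits set: x is in the set with probability close to 1.
        # Any bit zero: x is definitely not in the set.
        return all(self.bits[a] == 1 for a in self._addresses(x))

bf = BloomFilter(m=64, d=3)
bf.add("apple")
print(bf.might_contain("apple"))   # True
```

Note that a query can produce a false positive but never a false negative, which matches the one-sided guarantee stated above.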
Therefore, based on the Bloom filter array, the step of updating the frequency value of the corresponding data item further includes:
Step S171: use the Bloom filter array to judge whether the data item exists in the Bloom filter array.
As mentioned above, a Bloom filter is a data structure that can be used to quickly judge whether an element belongs to a set. Therefore, in this embodiment, the Bloom filter can be used to judge whether the data item that was read exists in the Bloom filter array.
In advance, a Bloom filter is created. Assume the size of the Bloom filter array is M, with addresses from 0 to M-1 and the initial value of each entry being 0.
In addition, d mutually independent hash functions h_1, h_2, ..., h_d are created for the Bloom filter array. In this embodiment, any one of the d hash functions is obtained as follows: let M be a prime, uniformly choose r+1 numbers a_0, a_1, ..., a_r from {0, 1, 2, ..., M-1}, and obtain a hash function h_a(x) = (a_0 + a_1·x + ... + a_r·x^r) mod M.
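The construction just described can be sketched as follows; reading the (garbled) source formula as the standard polynomial hash h_a(x) = (a_0 + a_1·x + ... + a_r·x^r) mod M is an assumption of this sketch:

```python
import random

def make_hash(m, r):
    """Construct one hash function h_a as described above.

    m must be prime; the coefficients a_0, ..., a_r are drawn uniformly
    from {0, ..., m-1}.  The polynomial form of h_a is an assumption
    reconstructed from the source, not a verbatim quote.
    """
    a = [random.randrange(m) for _ in range(r + 1)]

    def h(x):
        value = 0
        for coeff in reversed(a):        # Horner's rule: a_r*x^r + ... + a_0
            value = (value * x + coeff) % m
        return value

    return h

M = 1009                                     # a prime
hs = [make_hash(M, r=3) for _ in range(4)]   # d = 4 independent functions
print(all(0 <= h(42) < M for h in hs))       # True
```

Drawing fresh coefficients for each of the d functions is what makes them mutually independent.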
In the judging process, the d mutually independent hash functions h_1, h_2, ..., h_d are used to map the data item x that was read into multiple addresses of the Bloom filter array; that is, the d hash functions are applied to x to obtain the d hash values h_1(x), h_2(x), ..., h_d(x) (see Fig. 2). The values at these addresses (that is, at h_1(x), h_2(x), ..., h_d(x)) are then inspected: when the values at all of these addresses are 1, it can be judged that the data item x exists in the Bloom filter array; conversely, when at least one of these addresses holds the value 0, it can be judged that the data item x is not in the Bloom filter array.
Step S173: based on the judgment result that the data item exists in the Bloom filter array, update the frequency value of the corresponding data item in the Bloom filter array.
As described in step S171, the Bloom filter array can be used to judge whether the data item that was read exists in it. When it is determined, based on step S171, that the data item exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array can be updated.
Specifically, updating the frequency value of the corresponding data item in the Bloom filter array includes the following steps:
Use the multiple mutually independent hash functions to map the data item that was read into the multiple addresses of the Bloom filter, and increase the values at those addresses by 1. In this embodiment, the d mutually independent hash functions h_1, h_2, ..., h_d map the data item x into d addresses of the Bloom filter array; that is, applying the d hash functions to x yields the d hash values h_1(x), h_2(x), ..., h_d(x). In fact, this computation is identical to the hash computation on x in step S171. Subsequently, the values at these d addresses of the Bloom filter array are increased by 1.
Choose the minimum among the values at the multiple addresses, and record that minimum as the frequency value of the data item. In this embodiment, choosing the minimum among the values at the multiple addresses means taking the minimum of the values at the d addresses, denoted min{BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x))}, and recording this minimum as the frequency value of the data item x. Here, min{BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x))} approximates the actual frequency of the data item x, for the following reasons. First, the inventors of the present application found that each of the d values BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x)) is necessarily greater than or equal to the current frequency of occurrence of x, because on each appearance of x the values at these d addresses are all increased by one; and if x enters the frequent-item array, is recorded there, and is later replaced back into the Bloom filter array, the effect is the same. Moreover, a large error arises only when all d addresses h_1(x), h_2(x), ..., h_d(x) are also shared by the hash values of data items other than x, and the frequencies of those sharing items are all very large. Since the probability that all d addresses are shared by items with very large frequencies is small, and since elements with relatively large frequencies are recorded by the frequent-item array, the probability of producing a large error is very small.
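Steps S171 and S173 together amount to a counting variant of the Bloom filter array: each occurrence increments the item's d addresses, and the frequency estimate is the minimum over those addresses. A minimal sketch follows; the salted built-in hash stands in for the d independent hash functions, and deduplicating addresses per item is a simplifying assumption:

```python
class CountingBloomArray:
    """Bloom filter array whose entries are counters rather than single bits.

    update() increments the counters at an item's (distinct) addresses;
    estimate() returns the minimum over those addresses, which, as argued
    above, never underestimates the true frequency.
    """

    def __init__(self, m, d):
        self.m, self.d = m, d
        self.counts = [0] * m

    def _addresses(self, x):
        # Distinct addresses produced by d salted hash functions.
        return {hash((i, x)) % self.m for i in range(self.d)}

    def update(self, x):
        addrs = self._addresses(x)
        for a in addrs:
            self.counts[a] += 1          # each occurrence adds 1 per address
        return min(self.counts[a] for a in addrs)

    def estimate(self, x):
        # min{BF(h_1(x)), ..., BF(h_d(x))} as in the text above.
        return min(self.counts[a] for a in self._addresses(x))

cba = CountingBloomArray(m=128, d=3)
for _ in range(5):
    cba.update("x")
print(cba.estimate("x"))   # 5: exact here, since nothing else was inserted
```

When other items share some (but not all) of x's addresses, the minimum discards the inflated counters, which is why the estimate stays close to the true frequency.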
In addition, it should be noted that, based on step S171, it may also be judged that the data item that was read does not exist in the Bloom filter array. For this situation, in one embodiment, information about the data item can be newly added to the Bloom filter array, recording the data item and its frequency. In another embodiment, the characteristics of the data items in the stream are taken into account; for example, the data items in the stream may follow a Zipf distribution. The Zipf distribution, proposed by the American scholar G. K. Zipf, can be roughly stated as follows: in a corpus of natural language, the frequency with which a word occurs is inversely proportional to its rank in the frequency table. For example, the frequency of the second most common word is about 1/2 that of the most common word, the frequency of the third most common word is about 1/3 that of the most common word, and so on; the frequency of the N-th most common word is about 1/N that of the most common word. Therefore, for a data stream following a Zipf distribution, since the data item that was read is neither in the frequent-item array nor in the Bloom filter array serving as the auxiliary data-item array, it is reasonable to think that the probability of this data item becoming one of the most frequent items is very small; a relatively simple way of handling it is thus to discard the data item directly, without recording it.
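The rank-frequency relationship just described can be checked numerically; the exponent alpha = 1 used here matches the classical form of Zipf's law and is an assumption of this illustration:

```python
def zipf_frequencies(n_items, total, alpha=1.0):
    """Expected frequencies under a Zipf law: f(rank) proportional to 1/rank**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]
    z = sum(weights)                       # normalizing constant
    return [total * w / z for w in weights]

freqs = zipf_frequencies(n_items=5, total=1000)
# The 2nd item occurs about half as often as the 1st, the 3rd about a third, ...
print([round(f / freqs[0], 2) for f in freqs])   # [1.0, 0.5, 0.33, 0.25, 0.2]
```

Under such a skewed distribution, the tail items that step S171 fails to find are exactly the ones that contribute little to the top-K answer, which justifies discarding them.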
S11 reads a data item to step S17 from data flow through the above steps, auxiliary in a frequent episode array or one
Help in data item array and search whether there are the data item, whether there is in the frequent episode array according to the data item or
The judgement of the auxiliary data item array is as a result, to institute in the corresponding frequent episode array or the auxiliary data item array
The frequency values for stating data item are updated, to complete the record of a data item.
After the recording of one data item in the data stream is completed (its frequency value having been updated either in the frequent-item array in step S15 or in the auxiliary data item array in step S17), the above steps S11 to S17 can be repeated to read and record the next data item, until every data item in the data stream has been recorded in sequence. At any given stage, or at the end, the several most frequently occurring data items (that is, those with the highest frequency of occurrence) in the data stream being processed can be obtained simply by retrieving the records in the frequent-item array, which is quick and convenient.
The present application discloses a method for mining frequent items in a data stream, in which each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data item array records the other data items. When a certain data item is read from the data stream, it is first judged whether the data item exists in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data item array. In this way, the frequency of each data item can be updated and the frequent items in the data stream can be mined quickly, which greatly saves space while also improving estimation accuracy.
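The per-item dispatch summarized above can be sketched as follows. This is a minimal illustration only, not the full method: a plain counter stands in for the auxiliary data item array, the frequent-item array is pre-seeded with one item, and the names `topk`, `aux`, and `record` are illustrative.

```python
from collections import Counter

topk = {"a": 0}      # frequent-item array (here pre-seeded with one item)
aux = Counter()      # auxiliary data item array, a plain counter for illustration

def record(x):
    if x in topk:
        topk[x] += 1     # item is a current frequent item: update it in place
    else:
        aux[x] += 1      # otherwise its count is kept in the auxiliary structure

for x in "aabacbd":      # a toy data stream, read item by item
    record(x)
print(topk, dict(aux))
```

Running the sketch on the toy stream leaves `"a"` with count 3 in the frequent-item array while the other items accumulate in the auxiliary counter.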
As described in the foregoing embodiments, when the data item read is present in the frequent-item array, the frequency value of the data item is updated in the frequent-item array; when the data item read is present in the auxiliary data item array, the frequency value of the data item is updated in the auxiliary data item array. Since the frequency values of the data items in the frequent-item array and in the auxiliary data item array are updated dynamically, as the data items in the data stream are read and updated, the frequency value of a certain data item (or items) in the auxiliary data item array may come to exceed the frequency value of a certain data item (or items) in the frequent-item array; that is, certain data items in the frequent-item array may no longer belong to the several data items with the highest frequency (that is, the frequent items). Therefore, the method for mining frequent items in a data stream of the present application further includes an operation of replacing a certain data item (or items) in the frequent-item array with a certain data item (or items) in the auxiliary data item array.
Referring to Fig. 3, it is a flow diagram of the method for mining frequent items in a data stream of the present application in another embodiment. As shown in Fig. 3, the method for mining frequent items in the data stream includes the following steps:
Step S21: a data item is read from the data stream.

In step S21, reading a data item from the data stream means reading one data item in the course of sequentially reading the data items in the data stream.
Step S22: a frequent-item array is searched to determine whether the data item exists in it.

As described in step S22, in the method for mining frequent items in a data stream of the present application, a frequent-item array is specifically designed; in the frequent-item array, the data items with the highest frequencies in the data stream (that is, the frequent items) and their frequencies, namely the K most frequently occurring data items and their frequencies, can be recorded. Therefore, the method for mining frequent items in a data stream of the present application further includes a step of creating a frequent-item array in advance. As described above, assuming our purpose is to mine the K data items with the highest frequency of occurrence in a data stream S, a frequent-item array of size K is created, with addresses from 0 to K-1, each entry of which is some data item in the alphabet A together with its frequency.
To realize searching the frequent-item array for the data item in step S22, in this embodiment, when the frequent-item array is created, a frequent-item hash table (Hash table) is also created for the frequent-item array. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, in step S22, when some data item x is read from the data stream S, the frequent-item hash table can be used to judge whether the data item x just read is in the frequent-item array, that is, to judge whether the data item x read is recorded in the frequent-item array. In this way, detecting whether a data item belongs to the records of the frequent-item array TOPK requires only O(1) time using the frequent-item hash table. In one embodiment, the frequent-item hash table can be obtained by computing a hash function over each data item recorded in the frequent-item array. In this way, when a certain data item is read, the hash function is applied to the data item read to obtain a computation result, the computation result is matched against the corresponding frequent-item hash table, and whether the data item exists in the frequent-item array is judged according to the matching result. Although the foregoing embodiment uses a frequent-item hash table to realize fast access to the data items, this is not a limitation; those skilled in the art may also use other data structures with similar functions.
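The O(1) membership check described above can be sketched with an ordinary hash map. The names `topk`, `topk_index`, and `lookup`, and the sample entries, are illustrative assumptions, not part of the disclosure.

```python
K = 3
topk = [("a", 9), ("b", 7), ("c", 5)]   # frequent-item array: (data item, frequency)
# The frequent-item hash table maps each recorded item to its slot index.
topk_index = {item: i for i, (item, _) in enumerate(topk)}

def lookup(x):
    """Return the slot of x in the frequent-item array, or None, in O(1) expected time."""
    return topk_index.get(x)

print(lookup("b"))  # → 1
print(lookup("z"))  # → None
```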
Step S23: when the data item has been found in the frequent-item array, the frequency value of the corresponding data item in the frequent-item array is updated. As described above, in step S22, the frequent-item hash table can be used to determine very conveniently whether the data item just read is in the frequent-item array. Therefore, in step S23, when it is judged that the data item is present in the records of the frequent-item array, that is, the data item belongs to the K data items with the highest frequency of occurrence, the frequency value of the corresponding data item in the frequent-item array is increased by 1, realizing the accumulation of the frequency value of the data item.
Step S24: the minimum frequency value and its corresponding data item are looked up in the frequent-item array.

A conventional way to look up the minimum frequency value in the frequent-item array is a traversal method, that is, traversing the frequency values of all data items in the frequent-item array once and computing the minimum among them. But this implementation is relatively cumbersome, with a time cost of O(K). In this embodiment, first, as described above, a frequent-item hash table (Hash table) is created for the frequent-item array when the frequent-item array is created. Second, additionally, a heap (Heap) is created according to the frequencies of the data items in the frequent-item array; using the heap, the frequency of each data item in the frequent-item array can be maintained. In addition, the frequent-item hash table and the heap are connected by a doubly linked list (for the relationship between the frequent-item hash table and the heap, see Fig. 4). In this way, the frequent-item hash table is used to look up a data item x, that is, to judge whether the data item x just read is in the frequent-item array; the heap is used to look up the minimum frequency value minTOPK (that is, the minimum among the frequency values of the data items in the frequent-item array); and the frequent-item hash table is then used to find the data item corresponding to the minimum frequency value; for example, the data item in the frequent-item array corresponding to the minimum frequency value may be denoted data item y. In this way, the frequent-item hash table can quickly and accurately detect whether a data item read belongs to the records of the frequent-item array TOPK, with the time cost of the detection process being O(1), and the heap can quickly and accurately find the minimum frequency value in the frequent-item array, with the time cost of the lookup also being only O(1). In addition, for the heap, the time cost of an insertion or deletion in the heap is O(log K).
Step S25: when the data item has not been found in the frequent-item array, the frequency value of the corresponding data item is updated in an auxiliary data item array.

In the present application, each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data item array records the other data items. In this embodiment, the auxiliary data item array may use a conventional data structure, for example, a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly judge whether an element belongs to a set. Given a set X, the Bloom filter algorithm iterates over each element in X and judges whether that element belongs to a set Y.

The application of the Bloom filter array is briefly described below.

Suppose the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial values of 0. For each element y in Y, d independent hash functions h1, h2, ..., hd are used to map y to the addresses y1, y2, ..., yd (0 ≤ yi ≤ M-1, 1 ≤ i ≤ d), and the values at these addresses are set to 1. For each element x in X, the aforementioned d hash functions yield the d hash values h1(x), h2(x), ..., hd(x) of x, and the values at these d addresses in the array are checked to see whether they are all 1. If the value at some address is 0, it can be concluded that x ∉ Y; if they are all 1, we conclude x ∈ Y with probability very close to 1.

The power of the Bloom filter array is that it does not need to store all the elements of Y in full, which is very effective when |Y| is very large. Therefore, compared with other data structures, the Bloom filter has a great advantage in terms of both space and time. The Bloom filter algorithm is a randomized algorithm requiring O(|X|) operations and O(|Y|) space.
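The membership test just described can be sketched as follows. The array size, number of hash functions, and use of salted SHA-256 as the d independent hash functions are illustrative assumptions.

```python
import hashlib

M, d = 256, 4          # array size and number of hash functions (illustrative)
bits = [0] * M

def addresses(x):
    """Map element x to d addresses via d salted hashes."""
    return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % M
            for i in range(d)]

def insert(y):
    for a in addresses(y):
        bits[a] = 1

def maybe_member(x):
    """False: definitely not in the set. True: in the set with probability close to 1."""
    return all(bits[a] == 1 for a in addresses(x))

insert("apple")
print(maybe_member("apple"))  # → True (no false negatives)
```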
Therefore, the step of updating the frequency value of the corresponding data item based on the Bloom filter array further includes:

Step S251: the Bloom filter array is used to judge whether the data item exists in the Bloom filter array.

Since, as described above, a Bloom filter is a data structure that can be used to quickly judge whether an element belongs to a set, in this embodiment the Bloom filter can be used to judge whether the Bloom filter array contains the data item read.
In advance, a Bloom filter is created. Suppose the size of the Bloom filter array is M, with addresses from 0 to M-1 and each entry initialized to 0.

In addition, d mutually independent hash functions h1, h2, ..., hd are created for the Bloom filter array. In this embodiment, any one of the d hash functions is obtained in the following way: let M be a prime, and choose r+1 numbers a0, a1, ..., ar uniformly from {0, 1, 2, ..., M-1}; then h_a(x) = (a0 + a1·x + a2·x² + ... + ar·x^r) mod M, which yields a hash function h_a.
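The hash-function construction just described (a degree-r polynomial with uniformly random coefficients, evaluated modulo a prime M) can be sketched as follows; `make_hash` and the seed are illustrative names.

```python
import random

def make_hash(M, r, rng):
    """Return h_a(x) = (a0 + a1*x + ... + ar*x^r) mod M, with each a_i uniform in {0,...,M-1}."""
    coeffs = [rng.randrange(M) for _ in range(r + 1)]
    def h(x):
        v = 0
        for a in reversed(coeffs):   # Horner's rule for the polynomial
            v = (v * x + a) % M
        return v
    return h

M = 101                              # M must be prime
rng = random.Random(2018)
h1, h2 = make_hash(M, 3, rng), make_hash(M, 3, rng)
print(all(0 <= h(x) < M for h in (h1, h2) for x in range(10)))  # → True
```

Drawing fresh coefficients for each function is what makes the d functions mutually independent.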
In the judging process, the d mutually independent hash functions h1, h2, ..., hd are used to map the data item x read into multiple addresses of the Bloom filter array; that is, the d mutually independent hash functions h1, h2, ..., hd are applied to the data item x read to compute the d hash values h1(x), h2(x), ..., hd(x) (see Fig. 2). The values at these addresses (that is, at h1(x), h2(x), ..., hd(x)) are examined to see whether they are all 1. When the values at all of these addresses are 1, it can be judged that the data item x exists in the Bloom filter array; conversely, when the value at at least one of these addresses is 0, it can be judged that the data item x is not in the Bloom filter array.
Step S253: based on the judgment result that the data item exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.

As described in step S251, the Bloom filter array can be used to judge whether the data item read exists in the Bloom filter array. When it is determined based on step S251 that the data item read exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array can be updated.
Specifically, updating the frequency value of the corresponding data item in the Bloom filter array includes the following steps:

The multiple mutually independent hash functions are used to map the data item read into multiple addresses of the Bloom filter, and the values at these addresses are each increased by 1. In this embodiment, the d mutually independent hash functions h1, h2, ..., hd map the data item x read into d addresses of the Bloom filter array; that is, the d mutually independent hash functions h1, h2, ..., hd are applied to the data item x read to compute the d hash values h1(x), h2(x), ..., hd(x). In fact, this computation is identical to the hash computation of the data item x in step S251. Then the values at these d addresses in the Bloom filter array are each increased by 1.

The minimum value among the values at these addresses is taken, and this minimum value is recorded as the frequency value of the data item. In this embodiment, taking the minimum among the values at the d addresses means min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}, which can be denoted minBF, and this minimum value minBF is recorded as the frequency value of the data item x. Here, min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))} is used to approximate the actual frequency of the data item x, for the following reasons. First, the inventors of the present application discovered that each of the d values BF(h1(x)), BF(h2(x)), ..., BF(hd(x)) is necessarily greater than or equal to the frequency of occurrence of the current data item x, because every occurrence of the data item x increases the values at all d addresses by one; and if the data item x enters the frequent-item array to be recorded and is later replaced back into the Bloom filter array, the effect is the same. Moreover, a larger error is produced only when all d addresses h1(x), h2(x), ..., hd(x) are also shared by the hash values of other data items z ≠ x, and the frequencies of these colliding data items are all very large. Since the probability that all d addresses are shared by data items with very large frequencies is small, and since the elements with relatively larger frequencies are recorded in the frequent-item array, the probability of producing a large error is very small.
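Assuming the same salted hashing as in the earlier sketch, the counting update and the minimum-based estimate minBF can be sketched as follows; `BF`, `addrs`, and `record` are illustrative names.

```python
import hashlib

M, d = 256, 4
BF = [0] * M          # counting array standing in for the Bloom filter array

def addrs(x):
    return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % M
            for i in range(d)]

def record(x):
    """Add one occurrence of x and return minBF, its current frequency estimate."""
    for a in addrs(x):
        BF[a] += 1                       # each occurrence raises all d counters
    return min(BF[a] for a in addrs(x))  # minBF: never below the true count

for _ in range(5):
    est = record("x")
print(est)  # → 5 (equals the true count here, since nothing collides)
```

As the text argues, each counter overcounts at worst, so the minimum is an upper bound on the true frequency and is exact when at least one of the d counters is collision-free.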
In addition, it should be noted that, based on step S251, the judgment may also find that the data item read is not present in the Bloom filter array. For this situation, in one embodiment, new information for the data item can be added to the Bloom filter array, recording the data item and its frequency. In another embodiment, the characteristics of the data items in the data stream are taken into account; for example, the data items in a data stream often follow a Zipf distribution. In that case, since the data item read is neither in the frequent-item array nor in the Bloom filter array serving as the auxiliary data item array, it is reasonable to conclude that the possibility of this data item becoming one of the most frequent data items is very small. Accordingly, a relatively simple way of handling this case is simply to discard the data item without recording it.
Through the above steps S21 to S25, a data item is read from the data stream, a frequent-item array or an auxiliary data item array is searched to determine whether the data item exists in it, and, according to the result of judging whether the data item exists in the frequent-item array or the auxiliary data item array, the frequency value of the data item is updated in the corresponding frequent-item array or auxiliary data item array, thereby completing the recording of one data item.
Step S26: the frequency value of the corresponding data item in the auxiliary data item array is compared with the minimum frequency value in the frequent-item array.

From the foregoing: in step S24, the minimum frequency value minTOPK and its corresponding data item y are found in the frequent-item array. In practical applications, the minimum frequency value minTOPK and its corresponding data item y found in the frequent-item array can be recorded after each update and retrieved in step S26. In the aforementioned step S25, in the auxiliary data item array (taking the Bloom filter array as an example), the frequency value of the data item x read is recorded as minBF = min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}. Therefore, in step S26, the frequency value of the corresponding data item in the auxiliary data item array obtained in step S25 is compared with the recorded minimum value in the frequent-item array; specifically, the frequency value minBF of the data item x recorded in the Bloom filter array is compared with the minimum frequency value minTOPK recorded in the frequent-item array.
Step S27: when the frequency value of a certain data item in the auxiliary data item array is greater than or equal to the minimum frequency value in the frequent-item array, the data item corresponding to the minimum frequency value in the frequent-item array is replaced with the data item corresponding to that frequency value in the auxiliary data item array, so that the latter is recorded in the frequent-item array, while the displaced data item corresponding to the minimum frequency value is moved into the auxiliary data item array to be recorded there.

Taking the Bloom filter as the auxiliary data item array, in step S27, when the frequency value minBF of the data item x recorded in the Bloom filter array is greater than or equal to the minimum frequency value minTOPK recorded in the frequent-item array, the operations performed include: deleting the data item y corresponding to the minimum frequency value minTOPK from the frequent-item array, while inserting the data item x corresponding to the frequency value minBF in the Bloom filter array into the frequent-item array and recording the frequency value minBF corresponding to the inserted data item x; correspondingly, deleting the data item x from the Bloom filter array, while inserting the data item y into the Bloom filter array. The time cost of deleting and inserting a data item in the frequent-item array is O(log K), since the insertion and deletion carried out in the heap have a time cost of O(log K).
Additionally, since the data items x and y are exchanged, in the Bloom filter array, on the one hand, the values at the d addresses corresponding to the d hash functions applied to the data item x are updated: the values at the addresses h1(x), h2(x), ..., hd(x) in the Bloom filter array are decreased by minBF, that is, BF(hi(x)) = BF(hi(x)) - minBF, where 1 ≤ i ≤ d; on the other hand, the values at the d addresses corresponding to the d hash functions applied to the data item y are updated: the values at the addresses h1(y), h2(y), ..., hd(y) in the Bloom filter array are increased by minTOPK, that is, BF(hi(y)) = BF(hi(y)) + minTOPK, where 1 ≤ i ≤ d.
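The replacement of step S27, including the counter adjustments just described, can be sketched as follows, reusing the salted-hash counting array from the earlier sketch; the dict-based frequent-item array and all names are illustrative, and the heap bookkeeping is omitted for brevity.

```python
import hashlib

M, d = 256, 4
BF = [0] * M          # counting Bloom filter array
topk = {"y": 6}       # frequent-item array as a dict; y holds minTOPK = 6

def addrs(x):
    return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % M
            for i in range(d)]

def swap(x, minBF, y, minTOPK):
    """Promote x into the frequent-item array; demote y into the counting array."""
    del topk[y]
    topk[x] = minBF
    for a in addrs(x):
        BF[a] -= minBF        # BF(hi(x)) = BF(hi(x)) - minBF
    for a in addrs(y):
        BF[a] += minTOPK      # BF(hi(y)) = BF(hi(y)) + minTOPK

for a in addrs("x"):          # suppose x has been counted 8 times so far
    BF[a] += 8
minBF = min(BF[a] for a in addrs("x"))
if minBF >= topk["y"]:        # the comparison of step S26
    swap("x", minBF, "y", topk["y"])
print(topk, min(BF[a] for a in addrs("y")))
```

After the swap, x sits in the frequent-item array with frequency 8, while y's estimate in the counting array is its former top-K frequency 6, matching the bookkeeping described above.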
In fact, when the frequency value of a certain data item in the Bloom filter array is less than the minimum frequency value in the frequent-item array, the procedure simply ends, completing the recording of one data item x in the data stream.

After the recording of one data item in the data stream is completed, the above steps S21 to S27 can be repeated to read and record the next data item, until every data item in the data stream has been recorded in sequence. At any given stage, or at the end, the multiple most frequently occurring data items (that is, the frequent items) in the data stream being processed can be obtained simply by retrieving the records in the frequent-item array, which is quick and convenient.
The present application discloses a method for mining frequent items in a data stream, in which each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data item array records the other data items. When a certain data item is read from the data stream, it is first judged whether the data item exists in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data item array. In this way, the frequency of each data item can be updated and the frequent items in the data stream can be mined quickly, which greatly saves space while also improving estimation accuracy.
In addition, in the method for mining frequent items in a data stream of the present application, the frequent-item array is used to record the several most frequently occurring data items, and the auxiliary data item array is used to record the other data items with relatively lower frequencies of occurrence; the data items in the frequent-item array and the data items in the auxiliary data item array can change dynamically according to their corresponding frequency values. For example, when the frequency value of some data item (or items) recorded in the auxiliary data item array increases, it may replace a data item in the frequent-item array, and the replaced data item is then recorded by the auxiliary data item array instead. Taking the Bloom filter array as an example, one purpose of the Bloom filter is to reduce the space used by the auxiliary data item array, because without the Bloom filter array, simply recording the frequencies of occurrence with an ordinary array would require O(|A|) space. One purpose of the frequent-item array is to record the multiple data items with the currently highest frequencies of occurrence (that is, the frequent items); the other is to reduce the error of the Bloom filter array (for details, see the description in the aforementioned step S253).
Referring to Fig. 5, it is a structural diagram of the system for mining frequent items in a data stream of the present application in one embodiment. As shown in Fig. 5, the system for mining frequent items in the data stream includes: a reading module 51, a searching module 53, and a record updating module 55.
The reading module 51 is used to read a data item from the data stream. A data stream is an unbounded sequence of data evolving over time, characterized by unboundedness, continuity, and rapidity. Data streams arise in a variety of application environments; for example, financial instrument trading, weather forecasting, hydrological observation, website click-stream detection, telephone call records, network packet records, network fraud detection, and spam address filtering all produce large amounts of streaming data. Therefore, usually, processing a data stream means receiving a part of the data stream and reading its data items in sequence; once read, the data may be discarded, and subsequently the other parts of the data stream continue to be read in sequence.

The reading module 51 can be used to read the data stream sequentially; reading a data item from the data stream means reading one data item in the course of sequentially reading the data items in the data stream.
The searching module 53 is used to search a frequent-item array to determine whether the data item exists in it.

Clearly, a fundamental problem for data streams is to find the multiple most frequently occurring data items in the data stream and to give the frequencies with which these data items occur. For example: given an alphabet A = {Ei | 1 ≤ i ≤ |A|}, let S be a data stream composed of data items from the alphabet A. The frequency fi of a data item Ei ∈ A is its number of occurrences in the data stream S. Without loss of generality, assume f1 ≥ f2 ≥ ... ≥ f|A|, and assume that the goal for the data stream S is to find the K data items with the highest frequency of occurrence and to give the frequency values of these K data items. Any data stream algorithm can only read S sequentially; a data item that has been read cannot be read again, and the algorithm cannot, as with RAM (Random Access Memory), read the data item at a specified position in the data stream S. Theoretically, giving an exact solution to the above problem requires complete information about the frequencies of all data items, which is impractical in the data stream model; therefore, the focus is mainly on mining approximate solutions. "Approximate" here means giving a good estimate of the frequencies of occurrence of the data items, such that the error with respect to the actual values is as small as possible, and the K data items with the highest frequency of occurrence given by the algorithm are, in general, the actual top K in practice.
In the traditional related art, each data item in the data stream S is read in sequence, the data item read and information such as its frequency are placed in a corresponding data structure, the frequency values of the data items in the data structure are then sorted, and the several data items with the largest frequency values are found. However, in the above related art, both the space and time consumption are relatively large.
In the system for mining frequent items in a data stream of the present application, particularly, a frequent-item array is provided; in the frequent-item array, the data items with the highest frequencies in the data stream (that is, the frequent items) and their frequencies, namely the K most frequently occurring data items and their frequencies, can be recorded. Therefore, the system for mining frequent items in a data stream of the present application further includes a frequent-item array creation module 52 for creating the frequent-item array. As described above, assuming our purpose is to mine the K data items with the highest frequency of occurrence in a data stream S, a frequent-item array of size K is created using the frequent-item array creation module 52, with addresses from 0 to K-1, each entry of which is some data item in the alphabet A together with its frequency.
To realize searching the frequent-item array for the data item, in this embodiment, when the frequent-item array creation module 52 creates the frequent-item array, a frequent-item hash table is also created for the frequent-item array. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, when the reading module 51 reads some data item x from the data stream S, the searching module 53 invokes the frequent-item hash table and can use it to judge whether the data item x just read is in the frequent-item array, that is, to judge whether the data item x read is recorded in the frequent-item array. In this way, detecting whether a data item belongs to the records of the frequent-item array TOPK requires only O(1) time using the frequent-item hash table. In one embodiment, the frequent-item hash table can be obtained by computing a hash function over each data item recorded in the frequent-item array. In this way, when a certain data item is read, the hash function is applied to the data item read to obtain a computation result, the computation result is matched against the corresponding frequent-item hash table, and whether the data item exists in the frequent-item array is judged according to the matching result. Although the foregoing embodiment uses a frequent-item hash table to realize fast access to the data items, this is not a limitation; those skilled in the art may also use other data structures with similar functions.
The record updating module 55 is used to: when the data item has been found in the frequent-item array, update the frequency value of the corresponding data item in the frequent-item array; and, when the data item has not been found in the frequent-item array, update the frequency value of the corresponding data item in an auxiliary data item array.

On the one hand, when the data item has been found in the frequent-item array, the frequency value of the corresponding data item in the frequent-item array is updated. As described above, the searching module 53 invokes the frequent-item hash table and uses it to judge whether the data item x just read is in the frequent-item array. When it is judged that the data item is present in the records of the frequent-item array, that is, the data item belongs to the K data items with the highest frequency of occurrence, the record updating module 55 can increase the frequency value of the corresponding data item in the frequent-item array by 1, realizing the accumulation of the frequency value of the data item.
On the other hand, when the data item has not been found in the frequent-item array, the frequency value of the corresponding data item is updated in an auxiliary data item array.

In the present application, each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data item array records the other data items. In this embodiment, the auxiliary data item array may use a conventional data structure, for example, a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly judge whether an element belongs to a set. Given a set X, the Bloom filter algorithm cycles through each element in X and judges whether that element belongs to a set Y.
The application of the Bloom filter array is briefly described below.
Assume the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial values 0. For each element y in Y, use d independent hash functions h1, h2, ..., hd to map y to the addresses y1, y2, ..., yd (0 ≤ yi ≤ M-1, 1 ≤ i ≤ d), and set the values at these addresses to 1. For each element x in X, use the aforementioned d hash functions to obtain the d hash values h1(x), h2(x), ..., hd(x) of element x, and check whether the values at these d addresses in the array are all 1. If some address holds the value 0, it can be concluded that x ∉ Y; if all are 1, then x ∈ Y is judged with probability very close to 1.
The power of the Bloom filter array is that it does not need to store all the elements of Y completely, which is very effective when |Y| is very large. Therefore, compared with other data structures, the Bloom filter has great advantages in terms of space and time. The Bloom filter algorithm is a randomized algorithm, requiring O(|X|) operations and O(|Y|) space.
Therefore, the digging system of frequent episodes in a data stream of the present application further includes Bloom filter array creation module 54, for creating the Bloom filter array. Assume the size of the Bloom filter array is M, with addresses from 0 to M-1 and the initial value of each entry being 0. In addition, Bloom filter array creation module 54 is also used to create the multiple mutually independent hash functions used for judging whether the data item is recorded in the Bloom filter array. Assume the number of mutually independent hash functions is d: h1, h2, ..., hd. In the present embodiment, any one of the d hash functions is obtained in the following manner: let M be a prime number, uniformly choose r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1}, and let ha(x) = (a0 + a1·x + ... + ar·x^r) mod M, obtaining a hash function ha.
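The hash family stated above (a degree-r polynomial with coefficients drawn uniformly from {0, ..., M-1}, evaluated modulo the prime M) can be sketched as follows; the values of M, r, d and the fixed seed are illustrative choices, not values taken from the application.

```python
import random

def make_hash(M, r, rng=random.Random(42)):
    """Draw h_a(x) = (a0 + a1*x + ... + ar*x^r) mod M, with M prime
    and coefficients a0..ar chosen uniformly from {0, ..., M-1}."""
    a = [rng.randrange(M) for _ in range(r + 1)]
    def h(x):
        # Horner's rule evaluation of the polynomial modulo M
        acc = 0
        for coeff in reversed(a):
            acc = (acc * x + coeff) % M
        return acc
    return h

M, d, r = 997, 5, 3                  # M prime; d functions of degree r
hashes = [make_hash(M, r) for _ in range(d)]
print([h(123456) for h in hashes])   # d addresses, each in 0 .. M-1
```

Each call to `make_hash` draws fresh coefficients, so the d resulting functions are independent draws from the same family.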
In this way, the Bloom filter array is used to filter the data item so as to judge whether the data item exists in the Bloom filter array.
Since, as mentioned above, a Bloom filter is a data structure that can be used to quickly judge whether an element belongs to a set, in the present embodiment the Bloom filter can be used to judge whether the data item just read exists in the Bloom filter array.
In the decision process, the d mutually independent hash functions h1, h2, ..., hd are used to map the data item x just read into multiple addresses of the Bloom filter array; that is, the d mutually independent hash functions are used to perform a hash calculation on data item x to obtain the d hash values h1(x), h2(x), ..., hd(x) (see Fig. 2). It is then detected whether the values at the multiple addresses (that is, at h1(x), h2(x), ..., hd(x)) are all 1. When the values at the multiple addresses are all 1, it can be judged that data item x exists in the Bloom filter array; conversely, when the value at at least one of the multiple addresses is 0, it can be judged that data item x is not in the Bloom filter array.
Subsequently, based on the judgement result that the data item exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated. When it is determined that the data item just read exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array can be updated.
Specifically, the process of updating the frequency value of the corresponding data item in the Bloom filter array may include:
Using the multiple mutually independent hash functions, mapping the data item just read to multiple addresses of the Bloom filter array, and increasing the value at each of the multiple addresses by 1. In the present embodiment, the d mutually independent hash functions h1, h2, ..., hd map the data item x to d addresses of the Bloom filter array; that is, a hash calculation on data item x yields the d hash values h1(x), h2(x), ..., hd(x). Record update module 55 then increases the value at each of these d addresses in the Bloom filter array by 1.
Choosing the minimum value among the values at the multiple addresses, and recording the minimum value as the frequency value of the data item. In the present embodiment, this means taking the minimum of the values at the d addresses, denoted min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}, and recording this minimum as the frequency value of data item x. Here, min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))} is used to approximate the actual frequency of data item x, for the following reasons. First, the inventors of the present application found that each of the d values BF(h1(x)), BF(h2(x)), ..., BF(hd(x)) is necessarily greater than or equal to the current frequency of occurrence of data item x, because each appearance of data item x increases the values at these d addresses by one; and if data item x enters the frequent episode array, is recorded there, and is later replaced back into the Bloom filter array, the effect is the same. Moreover, a larger error is produced only when all d addresses h1(x), h2(x), ..., hd(x) are also shared by the hash values of data items other than x, and the frequencies of these sharing data items are all very large. Since the probability that all d addresses are shared by data items of very large frequency is small, and since elements of relatively larger frequency are recorded by the frequent episode array, the probability of producing a large error is very small.
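The counting update just described (increase the d addresses of x by 1 on each occurrence, then approximate the frequency of x by the minimum of the d counters) can be sketched as follows; the hash functions are again illustrative stand-ins.

```python
# Counting variant of the filter: each slot holds a counter rather than a bit,
# and an item's frequency is estimated as the minimum over its d counters.
class CountingBF:
    def __init__(self, M, d):
        self.M, self.d = M, d
        self.counts = [0] * M

    def addresses(self, x):
        return [hash((i, x)) % self.M for i in range(self.d)]

    def update(self, x):
        # each occurrence of x increases all d of its addresses by 1
        for a in self.addresses(x):
            self.counts[a] += 1

    def estimate(self, x):
        # minBF = min{BF(h1(x)), ..., BF(hd(x))} >= true frequency of x
        return min(self.counts[a] for a in self.addresses(x))

cbf = CountingBF(M=1009, d=7)
for item in ["a", "a", "b", "a", "c", "b"]:
    cbf.update(item)
print(cbf.estimate("a"))  # 3 unless all 7 of a's addresses collide
```

The estimate can only err upward, and only when every one of the d addresses is also incremented by other items, which matches the error argument given above.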
In addition, it should be noted that the judgement may still find that the data item just read does not exist in the Bloom filter array. For this situation, in one embodiment, the information of the data item can be newly added to the Bloom filter array, recording the data item and its frequency. In another embodiment, considering the characteristics of the data items in a data stream (for example, the data items in a data stream follow a Zipf distribution), since the data item just read is in neither the frequent episode array nor the Bloom filter array serving as the auxiliary data item array, it is reasonable to consider the possibility of this data item becoming one of the most frequent data items to be very small. A relatively simple processing manner is therefore simply to discard the data item without recording it.
It should be noted that when the data item just read is present in the frequent episode array, the frequency value of the data item is updated in the frequent episode array; when the data item just read is present in the auxiliary data item array, the frequency value of the data item is updated in the auxiliary data item array. Since the frequency values of data items in both the frequent episode array and the auxiliary data item array are updated dynamically, through the reading and frequency updating of each data item in the data stream, the frequency value of certain data item(s) in the auxiliary data item array may come to exceed the frequency value of certain data item(s) in the frequent episode array; that is, certain data item(s) in the frequent episode array no longer belong to the several data items of highest frequency (that is, the frequent episodes). Therefore, the method for digging frequent episodes in a data stream of the present application further includes the operation of replacing certain data item(s) in the frequent episode array with certain data item(s) from the auxiliary data item array.
The digging system of frequent episodes in a data stream of the present application further includes: data item comparison module 57 and data item replacement module 59.
Data item comparison module 57 is used to compare the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent episode array.
To search the minimum frequency value in the frequent episode array, the conventional implementation is a traversal calculation: the frequency values of all the data items in the frequent episode array are traversed once, and the minimum frequency value is computed from them. However, this implementation is relatively cumbersome, with a time cost of O(K). In the present embodiment, instead, as mentioned above, when the frequent episode array is created, a frequent episode Hash table is created for it. Additionally, a heap (Heap) is created according to the frequencies of the data items in the frequent episode array; the heap can be used to maintain the frequency of each data item in the frequent episode array. Furthermore, the frequent episode Hash table and the heap are connected by a doubly linked list. In this way, the frequent episode Hash table is used to look up data item x so as to judge whether the data item x just read is in the frequent episode array; the heap is used to search the minimum frequency value minTOPK (that is, the minimum among the frequencies of the data items in the frequent episode array); and the frequent episode Hash table is then used to find the data item corresponding to the minimum frequency value. For example, the data item having the minimum frequency value in the frequent episode array is denoted data item y. Thus, the frequent episode Hash table can quickly and accurately detect whether a data item just read belongs to the TOPK record of the frequent episode array, and the time cost of the detection process is O(1); the heap can quickly and accurately find the minimum frequency value in the frequent episode array, and the time cost of the search process is likewise only O(1). In addition, for the heap, the time cost of an insertion or deletion in the heap is O(logK).
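As a rough illustration of the structure just described, the sketch below couples a hash table (for O(1) membership) with a heap (for access to the minimum frequency). For simplicity it uses lazy invalidation of stale heap entries instead of the doubly linked list between the Hash table and the heap described in the text.

```python
import heapq

class TopK:
    """TOPK record: a dict gives O(1) membership, a heap gives access to
    the current minimum; stale heap entries are skipped lazily."""
    def __init__(self, k):
        self.k = k
        self.freq = {}      # data item -> frequency value
        self.heap = []      # (frequency, item) candidates for the minimum

    def contains(self, x):              # O(1) membership test
        return x in self.freq

    def increment(self, x):             # frequency accumulation for a hit
        self.freq[x] += 1
        heapq.heappush(self.heap, (self.freq[x], x))   # O(log K)

    def min_entry(self):
        # discard heap entries whose recorded frequency is stale
        while self.heap:
            f, x = self.heap[0]
            if x in self.freq and self.freq[x] == f:
                return f, x             # minTOPK and data item y
            heapq.heappop(self.heap)
        return None

topk = TopK(k=2)
topk.freq = {"a": 5, "b": 2}
for item, f in topk.freq.items():
    heapq.heappush(topk.heap, (f, item))
topk.increment("b")
print(topk.min_entry())  # (3, 'b'): the stale (2, 'b') entry is skipped
```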
In the Bloom filter array (taking the Bloom filter array as the auxiliary data item array), the frequency value of the data item x just read is denoted minBF = min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}.
Therefore, comparing the frequency value of the corresponding data item in the auxiliary data item array with the minimum value in the frequent episode array specifically means comparing the frequency value minBF of the data item x recorded in the Bloom filter array with the minimum frequency value minTOPK recorded in the frequent episode array.
Data item replacement module 59 is used, when the frequency value of a certain data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent episode array, to put the data item corresponding to that frequency value in the Bloom filter array into the frequent episode array in place of the data item corresponding to the minimum frequency value, so that it is recorded in the frequent episode array, while the replaced data item corresponding to the minimum frequency value goes into the Bloom filter array and is recorded there.
When the frequency value minBF of the data item x recorded in the Bloom filter array is greater than or equal to the minimum frequency value minTOPK recorded in the frequent episode array, the operations executed include: deleting the data item y corresponding to the minimum frequency value minTOPK from the frequent episode array, while inserting the data item x corresponding to the frequency value minBF in the Bloom filter array into the frequent episode array and recording the frequency value minBF together with the inserted data item x; correspondingly, data item x is deleted from the Bloom filter array, while data item y is inserted into the Bloom filter array. The time cost of deleting and inserting one data item in the frequent episode array is O(logK); this follows from the O(logK) time cost of an insertion or deletion in the heap.
Additionally, since data item x and data item y are exchanged, in the Bloom filter array, on the one hand, the values at the d addresses corresponding to the d hash functions for data item x are updated: the values at the addresses h1(x), h2(x), ..., hd(x) in the Bloom filter array are decreased by minBF, that is, BF(hi(x)) = BF(hi(x)) - minBF, where 1 ≤ i ≤ d; on the other hand, the values at the d addresses corresponding to the d hash functions for data item y are updated: the values at the addresses h1(y), h2(y), ..., hd(y) in the Bloom filter array are increased by minTOPK, that is, BF(hi(y)) = BF(hi(y)) + minTOPK, where 1 ≤ i ≤ d.
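The exchange of data item x and data item y, together with the counter adjustments BF(hi(x)) - minBF and BF(hi(y)) + minTOPK, can be sketched as follows; the counter array, address function, and example values are illustrative.

```python
def swap(topk_freq, BF, addrs, x, y, minBF, minTOPK):
    """Exchange x (frequency minBF in the counter array) with y
    (minimum-frequency item minTOPK in the frequent episode record).
    `addrs(item)` returns the d addresses h1..hd of an item;
    `topk_freq` is the frequent episode record {item: frequency}."""
    # frequent episode record: delete y, insert x with frequency minBF
    del topk_freq[y]
    topk_freq[x] = minBF
    # counter array: remove x's contribution, add y's
    for a in addrs(x):
        BF[a] -= minBF          # BF(hi(x)) = BF(hi(x)) - minBF
    for a in addrs(y):
        BF[a] += minTOPK        # BF(hi(y)) = BF(hi(y)) + minTOPK

M, d = 1009, 4
BF = [0] * M
addrs = lambda item: [hash((i, item)) % M for i in range(d)]
for a in addrs("x"):
    BF[a] += 7                  # pretend "x" has accumulated minBF = 7
topk = {"y": 5, "z": 9}         # minTOPK = 5, held by item "y"
swap(topk, BF, addrs, "x", "y", minBF=7, minTOPK=5)
print(topk)                     # {'z': 9, 'x': 7}
```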
In fact, when data item comparison module 57 judges that the frequency value of the data item in the Bloom filter array is less than the minimum frequency value in the frequent episode array, the operation ends, completing the record of one data item x in the data stream.
After completing the record of one data item in the data stream, the next data item can be read and recorded, until the record of each data item in the data stream is completed in sequence. At a certain stage, or at the end, to obtain the multiple data items that appear most frequently in the processed data stream (that is, the frequent episodes), it is only necessary to retrieve the records in the frequent episode array, which is quick and convenient.
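Putting the pieces together, the per-item flow described above (frequent episode array hit, counter array update, and the promotion check) can be sketched as a single loop. This is a simplified illustration, not the patented implementation: the minimum is found by a linear scan rather than by the heap, and the hash functions are stand-ins.

```python
def mine_topk(stream, k, M=1009, d=5):
    addrs = lambda x: [hash((i, x)) % M for i in range(d)]
    BF = [0] * M                     # auxiliary counter array
    topk = {}                        # frequent episode record: item -> freq
    for x in stream:
        if x in topk:                # found in the frequent episode record
            topk[x] += 1
            continue
        for a in addrs(x):           # otherwise update the counter array
            BF[a] += 1
        minBF = min(BF[a] for a in addrs(x))
        if len(topk) < k:            # record not yet full: just admit x
            topk[x] = minBF
            for a in addrs(x):
                BF[a] -= minBF
            continue
        y = min(topk, key=topk.get)  # minimum-frequency item (heap elided)
        minTOPK = topk[y]
        if minBF >= minTOPK:         # promote x, demote y
            del topk[y]
            topk[x] = minBF
            for a in addrs(x):
                BF[a] -= minBF
            for a in addrs(y):
                BF[a] += minTOPK
    return topk

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + ["d"] * 5
print(sorted(mine_topk(stream, k=2)))  # ['a', 'b'] in the typical case
```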
It should be noted that all the modules of the digging system of frequent episodes in a data stream can be configured on a single computer device; alternatively, the modules of the digging system can each be arranged on one or more servers distributed over a network.
The present application discloses a digging system of frequent episodes in a data stream, in which each data item in the data stream is recorded by a frequent episode array and an auxiliary data item array, wherein the frequent episode array is used to record the several data items that appear most frequently and the auxiliary data item array is used to record the other data items. When a read module reads a data item from the data stream, a searching module searches whether the data item exists in the frequent episode array; if it is found, a record update module updates the frequency value of the data item in the frequent episode array; if it is not found, the record update module updates the frequency value of the data item in the auxiliary data item array. In this way, the frequency of each data item can be updated, the frequent episodes in the data stream can be mined quickly, space is greatly saved, and at the same time estimation accuracy can be improved.
In addition, in the digging system of frequent episodes in a data stream of the present application, the frequent episode array is used to record the several data items that appear most frequently, and the auxiliary data item array is used to record the other data items of relatively low frequency of occurrence; the data items in the frequent episode array and the data items in the auxiliary data item array can change dynamically according to their corresponding frequency values. For example, when the frequency values of certain data item(s) recorded in the auxiliary data item array increase, they may replace data items in the frequent episode array, and the replaced data items are instead recorded by the auxiliary data item array.
The performance of the method for digging and the digging system of frequent episodes in a data stream of the present application in experiments is described in detail below.
The experimental environment is as follows. Central processing unit: Intel Pentium 3.06GHz; memory: 504MB; operating system: Microsoft Windows XP.
In the experiments, IP address data streams of scale 10^8 with various arbitrary distributions were used, and the size M of the Bloom filter array (Bloom Filter, BF) was set to a scale of about 5 × 10^4. For the i-th most frequent element, the frequency of occurrence recorded by the algorithm lies in the range [fi, fi + ε|S|], where ε = 0.001. Studies have shown that the data in a data stream can be approximately considered to obey a Zipf distribution. For a data stream of scale 10^8 obeying a Zipf distribution, a Bloom filter array size M of only about 5000 suffices to guarantee the error above. Moreover, the larger the parameter α of the Zipf distribution, the smaller the space required by the Bloom filter array. Experiments show that, using the digging technology of frequent episodes in a data stream of the present application, the error of data item mining decreases as the parameter α of the Zipf distribution increases; Fig. 6 shows this trend. Referring to Fig. 6, which is shown as a schematic diagram of the influence of the parameter α of the Zipf distribution on the error: using synthesized string data obeying a Zipf distribution, with a data scale of 10^8 strings of length 20, a Bloom filter array of size M = 1003, and d = 20 hash functions, the error of the experiment went from 3.32% at α = 1.1 to 0.0001% at α = 3.0. It can be seen that the error is significantly reduced as the parameter α of the Zipf distribution increases.
Fig. 7 is shown as a schematic diagram comparing the effects of three common algorithms, including that of the present application, on data obeying a Zipf distribution. It can be seen that the effect of the present application is always better than the CountSketch algorithm (for the CountSketch algorithm, see: M. Charikar, K. Chen, M. Farach-Colton. Finding Frequent Items in Data Streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP), pp 693-703, 2002). When α < 2.2, its effect is not as good as the Space-Saving algorithm (for the Space-Saving algorithm, see: A. Metwally, D. Agrawal, A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the 10th International Conference on Database Theory (ICDT), pp 398-412, 2005); but when α ≥ 2.2, the effect of the present application is better than the Space-Saving algorithm.
Fig. 8 is shown as a schematic diagram of the influence of the size of the Bloom filter array on the error. In the experiments, the scale of the data stream is still 10^8, and three groups of Zipf-distributed data were used, with parameter α of 1.5, 2.0, and 3.0 respectively. Bloom filter arrays of sizes 103, 211, 307, 499, 997, 1999, 5003, 10007, 20011, 30011, and 39989 were used respectively (the size M of the Bloom filter array must be a prime number). For example, when α = 1.5, the error of the present application went from 6.02% when the size of the Bloom filter array is 103 to 0.001% when the size of the Bloom filter array is 39989.
Fig. 9 is shown as a schematic diagram of the influence of the number d of hash functions on the error. In the experiments, the size of the Bloom filter array used is 997. When d = 1, the Bloom filter array degenerates to an ordinary hash table, and the effect is very poor: the error exceeds 30%. As d becomes larger, the error quickly becomes smaller; when d = 15, the error is smallest. When d grows to the scale of the size of BF, the error starts to become larger again. The experience of multiple groups of experiments indicates that the effect is best when d ≈ 1.3log|B| to 1.5log|B|.
Referring to Fig. 10, it is shown as a structural schematic diagram of the data processing equipment of the present application in one embodiment. As shown in Fig. 10, the data processing equipment 41 provided in this embodiment mainly includes memory 410, one or more processors 411, and one or more programs stored in memory 410, wherein memory 410 stores execution instructions, and when data processing equipment 41 operates, processor 411 and memory 410 communicate with each other.
The one or more programs are stored in the memory and configured to be executed as instructions by the one or more processors, and the one or more processors execute the instructions so that the data processing equipment executes the aforementioned method for digging frequent episodes in a data stream; that is, processor 411 executes the instructions so that data processing equipment 41 executes the method shown in Fig. 1 or Fig. 3, whereby the frequency of each data item can be updated, the frequent episodes in the data stream can be mined quickly, space is greatly saved, and at the same time estimation accuracy can be improved.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that some or all of the present application can be realized by software in combination with a necessary general hardware platform. Based on this understanding, the technical solution of the present application, in essence the part that contributes beyond the existing technology, can be embodied in the form of a software product, which may include one or more machine-readable media storing machine-executable instructions; when these instructions are executed by one or more machines such as a computer, a computer network, or other electronic equipment, the one or more machines may execute operations according to the embodiments of the present application.
Based on this, the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method for digging frequent episodes in a data stream are realized.
In embodiments, the machine-readable media may include, but are not limited to, floppy disks, optical discs, CD-ROM (compact disc read-only memory), magneto-optical disks, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other kinds of media/machine-readable media suitable for storing machine-executable instructions.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or equipment.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are executed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
It should be noted that those skilled in the art will understand that the above-mentioned components can be programmable logic devices, including one or more of: programmable logic arrays (Programmable Array Logic, PAL), generic array logic (Generic Array Logic, GAL), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), and complex programmable logic devices (Complex Programmable Logic Device, CPLD); the present application does not particularly limit this.
The above embodiments only illustrate the principles and effects of the present application and are not intended to limit the present application. Anyone familiar with this technology can modify or change the above embodiments without departing from the spirit and scope of the present application. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical thought disclosed herein shall be covered by the claims of the present application.
Claims (22)
1. A method for digging frequent episodes in a data stream, characterized by comprising the following steps:
reading a data item from the data stream;
searching whether the data item exists in a frequent episode array, the frequent episode array recording the multiple data items of largest frequency and their frequencies;
when the data item is found in the frequent episode array, updating the frequency value of the corresponding data item in the frequent episode array;
when the data item is not found in the frequent episode array, updating the frequency value of the corresponding data item in an auxiliary data item array.
2. The method for digging frequent episodes in a data stream according to claim 1, characterized by further comprising the step of creating a frequent episode array.
3. The method for digging frequent episodes in a data stream according to claim 2, characterized by further comprising the following steps:
creating a frequent episode Hash table for the frequent episode array; and
judging, using the frequent episode Hash table, whether the data item read is recorded in the frequent episode array.
4. The method for digging frequent episodes in a data stream according to claim 1, characterized in that the auxiliary data item array is a Bloom filter array;
the step of updating the frequency value of the corresponding data item in the auxiliary data item array includes:
filtering the data item using the Bloom filter array so as to judge whether the data item exists in the Bloom filter array; and
based on the judgement result that the data item exists in the Bloom filter array, updating the frequency value of the corresponding data item in the Bloom filter array.
5. The method for digging frequent episodes in a data stream according to claim 4, characterized by further comprising the step of creating a Bloom filter array, the size of the Bloom filter array being M, the addresses running from 0 to M-1, and the initial value of each entry being 0, and creating multiple mutually independent hash functions for the Bloom filter array.
6. The method for digging frequent episodes in a data stream according to claim 5, characterized in that any one hash function among the multiple hash functions is obtained through the following steps:
letting M be a prime number, uniformly choosing r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1}, and letting ha(x) = (a0 + a1·x + ... + ar·x^r) mod M, obtaining a hash function ha(x).
7. The method for digging frequent episodes in a data stream according to claim 5, characterized in that the step of filtering the data item using the Bloom filter array so as to judge whether the data item exists in the Bloom filter array includes:
using the multiple mutually independent hash functions, mapping the data item read to multiple addresses of the Bloom filter array; and
detecting whether the values at the multiple addresses are all 1; when the values at the multiple addresses are all 1, judging that the data item exists in the Bloom filter array; conversely, when the value at at least one of the multiple addresses is 0, judging that the data item is not in the Bloom filter array.
8. The method for digging frequent episodes in a data stream according to claim 5, characterized in that the step of updating the frequency value of the corresponding data item in the Bloom filter array includes:
using the multiple mutually independent hash functions, mapping the data item read to multiple addresses of the Bloom filter array, and increasing the value at each of the multiple addresses by 1; and
choosing the minimum value among the values at the multiple addresses, and recording the minimum value as the frequency value of the data item.
9. The method for digging frequent episodes in a data stream according to claim 4, characterized by further comprising the following steps:
comparing the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent episode array; and
when the frequency value of a certain data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent episode array, having the data item corresponding to that frequency value in the Bloom filter array replace the data item corresponding to the minimum frequency value in the frequent episode array so as to be recorded in the frequent episode array, while the replaced data item corresponding to the minimum frequency value goes into the Bloom filter array and is recorded there.
10. The method for digging frequent episodes in a data stream according to claim 9, characterized by further comprising the following step:
searching the minimum frequency value and its corresponding data item from the frequent episode array.
11. The method for digging frequent episodes in a data stream according to claim 10, characterized by further comprising the following steps:
creating a frequent episode Hash table according to the data items in the frequent episode array and creating a heap according to the frequencies of the data items in the frequent episode array, and establishing a doubly linked list between the frequent episode Hash table and the heap;
searching the minimum frequency value using the heap, and searching the data item corresponding to the minimum frequency value using the frequent episode Hash table.
12. A digging system of frequent episodes in a data stream, characterized by comprising:
a read module, for reading a data item from the data stream;
a searching module, for searching whether the data item exists in a frequent episode array, the frequent episode array recording the multiple data items of largest frequency and their frequencies; and
a record update module, used for: when the data item is found in the frequent episode array, updating the frequency value of the corresponding data item in the frequent episode array; and when the data item is not found in the frequent episode array, updating the frequency value of the corresponding data item in an auxiliary data item array.
13. The digging system of frequent episodes in a data stream according to claim 12, characterized by further comprising a frequent episode array creation module, for creating the frequent episode array.
14. The digging system of frequent episodes in a data stream according to claim 13, characterized in that the frequent episode array creation module is further used to create, according to the data items in the frequent episode array, a frequent episode Hash table used for judging whether the data item read is recorded in the frequent episode array.
15. The system for mining frequent items in a data stream according to claim 12, characterized in that the auxiliary data item array is a Bloom filter array; when the data item is not found in the frequent item array, the Bloom filter array is used to filter the data item so as to determine whether the data item exists in the Bloom filter array, and, based on the result of determining that the data item exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
16. The system for mining frequent items in a data stream according to claim 15, characterized by further comprising a Bloom filter array creation module for creating the Bloom filter array and creating, for the Bloom filter array, multiple mutually independent hash functions used to determine whether the data item is recorded in the Bloom filter array; the size of the Bloom filter array is M, its addresses range from 0 to M-1, and the initial value of each entry is 0.
17. The system for mining frequent items in a data stream according to claim 16, characterized in that any one of the multiple hash functions is obtained in the following manner:
Let M be a prime number, and uniformly choose r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1} to obtain a hash function ha(x).
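The hash-function construction above can be sketched in Python. Note that the claim text elides the exact formula for ha(x); the polynomial form h_a(x) = (a0 + a1*x + ... + ar*x^r) mod M used below is an assumed reconstruction of the standard universal hash family over a prime modulus, not taken from the source.

```python
import random

def make_hash(M, r, seed=None):
    """Build one hash function per claim 17: uniformly choose r+1
    coefficients a0..ar from {0, 1, ..., M-1}, with M prime.

    The polynomial evaluation below is an assumed reconstruction;
    the claim does not spell out the formula for ha(x).
    """
    rng = random.Random(seed)
    a = [rng.randrange(M) for _ in range(r + 1)]

    def h(x):
        # Horner's rule evaluation of a0 + a1*x + ... + ar*x^r mod M
        acc = 0
        for coef in reversed(a):
            acc = (acc * x + coef) % M
        return acc

    return h
```

Independent draws of the coefficient vector yield the mutually independent hash functions required by claim 16.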
18. The system for mining frequent items in a data stream according to claim 16, characterized in that the record update module updating the frequency value of the corresponding data item in the Bloom filter array comprises:
Using the multiple mutually independent hash functions to map the read data item to multiple addresses of the Bloom filter, and increasing the value at each of the multiple addresses by 1; and
Choosing the minimum among the values at the multiple addresses and taking that minimum as the frequency value of the data item.
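The update rule of claim 18 (increment every hashed address, then take the minimum counter as the frequency estimate) can be sketched like this. The affine hash functions used in the example are illustrative stand-ins, and items are reduced to integers with Python's built-in hash; neither detail comes from the source.

```python
class CountingBloomArray:
    """Counting Bloom filter array of size M with k hash functions,
    following claims 16 and 18 (a sketch, not the patent's exact
    implementation)."""

    def __init__(self, M, hash_fns):
        self.M = M
        self.hash_fns = hash_fns
        self.counters = [0] * M   # addresses 0..M-1, initialized to 0

    def add(self, item):
        key = hash(item)
        addrs = [h(key) % self.M for h in self.hash_fns]
        for a in addrs:
            self.counters[a] += 1   # increase every mapped address by 1
        # minimum over the mapped addresses is the frequency estimate
        return min(self.counters[a] for a in addrs)

    def estimate(self, item):
        key = hash(item)
        return min(self.counters[h(key) % self.M] for h in self.hash_fns)
```

Because counters are shared across items, the minimum over the k addresses can overestimate a frequency (hash collisions) but never underestimate it, which is why the later claims may safely compare it against the exact minimum in the frequent item array.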
19. The system for mining frequent items in a data stream according to claim 15, characterized by further comprising:
A data item comparison module for comparing the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent item array; and
A data item replacement module for, when the frequency value of a certain data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent item array, replacing the data item corresponding to the minimum frequency value in the frequent item array with that data item from the Bloom filter array so that it is recorded in the frequent item array, and moving the superseded data item corresponding to the minimum frequency value into the Bloom filter array for recording.
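Claims 12, 15 and 19 together describe a top-k maintenance loop: exact counts for the items currently in the frequent item array, approximate counts for everything else, and promotion/demotion around the current minimum. A compact sketch of one update step, assuming the structures above; a plain dict stands in for the Bloom filter array here, so its counts are exact rather than approximate:

```python
def process_item(x, frequent, approx):
    """One update step following claims 12, 15 and 19 (a sketch).

    `frequent` (item -> frequency) stands in for the frequent item
    array; `approx` (a plain dict) stands in for the Bloom filter
    array of the auxiliary data item structure.
    """
    if x in frequent:                     # claim 12: found in frequent array
        frequent[x] += 1
        return
    approx[x] = approx.get(x, 0) + 1      # claim 15: update auxiliary array
    min_item = min(frequent, key=frequent.get)
    if approx[x] >= frequent[min_item]:   # claim 19: compare with minimum
        # promote x into the frequent array and demote the superseded
        # minimum-frequency item into the auxiliary array
        frequent[x] = approx.pop(x)
        approx[min_item] = frequent.pop(min_item)
```

The `min(...)` scan is where the heap of claim 20 would be used in practice, so that the minimum-frequency item is found without a full pass over the array.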
20. The system for mining frequent items in a data stream according to claim 19, characterized in that the frequent item array creation module is further configured to:
Create a frequent-item hash table from the data items in the frequent item array, create a heap space from the frequencies of the data items in the frequent item array, and establish a doubly linked list between the frequent-item hash table and the heap space; and
Use the heap space to find the minimum frequency value, and use the frequent-item hash table to look up the data item corresponding to the minimum frequency value.
21. A computer-readable storage medium, characterized in that it stores a program applied to data mining; when the program is executed by at least one processor, each step of the method for mining frequent items in a data stream according to any one of claims 1 to 11 is implemented.
22. A data processing device, characterized by comprising:
At least one processor;
At least one memory; and
At least one program, wherein the at least one program is stored in the at least one memory and is configured to be executed by the at least one processor, the instructions causing the data processing device to perform each step of the method for mining frequent items in a data stream according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810345014.5A CN108595581A (en) | 2018-04-17 | 2018-04-17 | Mining method and mining system for frequent items in a data stream |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810345014.5A CN108595581A (en) | 2018-04-17 | 2018-04-17 | Mining method and mining system for frequent items in a data stream |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108595581A true CN108595581A (en) | 2018-09-28 |
Family
ID=63611218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810345014.5A Pending CN108595581A (en) | 2018-04-17 | 2018-04-17 | Mining method and mining system for frequent items in a data stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108595581A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610429A (en) * | 2019-09-25 | 2019-12-24 | 中国银行股份有限公司 | Data processing method and device |
CN112988892A (en) * | 2021-03-12 | 2021-06-18 | 北京航空航天大学 | Distributed system hot spot data management method |
CN116881338A (en) * | 2023-09-07 | 2023-10-13 | 北京傲星科技有限公司 | Data mining method and related equipment for data stream based on large model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060143170A1 (en) * | 2004-12-29 | 2006-06-29 | Lucent Technologies, Inc. | Processing data-stream join aggregates using skimmed sketches |
CN101499097A (en) * | 2009-03-16 | 2009-08-05 | 浙江工商大学 | Hash table based data stream frequent pattern internal memory compression and storage method |
CN102760132A (en) * | 2011-04-28 | 2012-10-31 | 中国移动通信集团浙江有限公司 | Excavation method and device for data stream frequent item |
- 2018-04-17: CN patent application CN201810345014.5A filed, published as CN108595581A (status: Pending)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060143170A1 (en) * | 2004-12-29 | 2006-06-29 | Lucent Technologies, Inc. | Processing data-stream join aggregates using skimmed sketches |
CN101499097A (en) * | 2009-03-16 | 2009-08-05 | 浙江工商大学 | Hash table based data stream frequent pattern internal memory compression and storage method |
CN102760132A (en) * | 2011-04-28 | 2012-10-31 | 中国移动通信集团浙江有限公司 | Excavation method and device for data stream frequent item |
Non-Patent Citations (6)
Title |
---|
袁志坚 et al.: "Research on Typical Bloom Filters and Their Data Stream Applications" (典型Bloom过滤器的研究及其数据流应用), Computer Engineering (《计算机工程》), vol. 35, no. 7, 30 April 2009 (2009-04-30), pages 5-7 *
袁志坚: "Research on Key Technologies of Burst Detection in Data Streams" (数据流突发检测若干关键技术研究), China Doctoral Dissertations Full-text Database, Information Science and Technology (《中国博士学位论文全文数据库 信息科技辑》), no. 4, 15 April 2010 (2010-04-15), pages 15-17 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610429A (en) * | 2019-09-25 | 2019-12-24 | 中国银行股份有限公司 | Data processing method and device |
CN110610429B (en) * | 2019-09-25 | 2022-03-18 | 中国银行股份有限公司 | Data processing method and device |
CN112988892A (en) * | 2021-03-12 | 2021-06-18 | 北京航空航天大学 | Distributed system hot spot data management method |
CN112988892B (en) * | 2021-03-12 | 2022-04-29 | 北京航空航天大学 | Distributed system hot spot data management method |
CN116881338A (en) * | 2023-09-07 | 2023-10-13 | 北京傲星科技有限公司 | Data mining method and related equipment for data stream based on large model |
CN116881338B (en) * | 2023-09-07 | 2024-01-26 | 北京傲星科技有限公司 | Data mining method and related equipment for data stream based on large model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112597284B (en) | Company name matching method and device, computer equipment and storage medium | |
CN108595581A (en) | Mining method and mining system for frequent items in a data stream | |
JP2022003509A (en) | Entity relation mining method, device, electronic device, computer readable storage medium, and computer program | |
CN112559522A (en) | Data storage method and device, query method, electronic device and readable medium | |
CN114706894A (en) | Information processing method, apparatus, device, storage medium, and program product | |
Zhu et al. | Making smart contract classification easier and more effective | |
Cao et al. | Mapping elements with the hungarian algorithm: An efficient method for querying business process models | |
Henning et al. | ShuffleBench: A benchmark for large-scale data shuffling operations with distributed stream processing frameworks | |
US10229223B2 (en) | Mining relevant approximate subgraphs from multigraphs | |
CN111221690A (en) | Model determination method and device for integrated circuit design and terminal | |
CN115225308B (en) | Attack partner identification method for large-scale group attack flow and related equipment | |
US20150006578A1 (en) | Dynamic search system | |
CN115099798A (en) | Abnormal bitcoin address tracking system based on entity identification | |
Huang et al. | Efficient Algorithms for Parallel Bi-core Decomposition | |
US11921690B2 (en) | Custom object paths for object storage management | |
CN105677801A (en) | Data processing method and system based on graph | |
CN112328807A (en) | Anti-cheating method, device, equipment and storage medium | |
Chembu et al. | Scalable and Globally Optimal Generalized L₁ K-center Clustering via Constraint Generation in Mixed Integer Linear Programming | |
CN112035486B (en) | Partition establishing method, device and equipment of partition table | |
CN117539948B (en) | Service data retrieval method and device based on deep neural network | |
CN112667679B (en) | Data relationship determination method, device and server | |
JP2019144873A (en) | Block diagram analyzer | |
CN108304671A (en) | The data managing method and relevant apparatus of Building Information Model | |
Liang et al. | Unsupervised clustering strategy based on label propagation | |
CN111199156B (en) | Named entity recognition method, device, storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20190509
Address after: 710000 Room 101, Block B, Yunhui Valley, 156 Tiangu Eighth Road, Yuhua Street Software New Town, Yanta District, Xi'an City, Shaanxi Province
Applicant after: Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.
Address before: 100084 Tsinghua Yuan, Beijing, Haidian District
Applicant before: Tsinghua University