CN108595581A - Mining method and mining system for frequent items in a data stream - Google Patents

Mining method and mining system for frequent items in a data stream

Info

Publication number
CN108595581A
Authority
CN
China
Prior art keywords
data item
array
frequent item
data
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810345014.5A
Other languages
Chinese (zh)
Inventor
Li Jian (李建)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810345014.5A
Publication of CN108595581A
Legal status: Pending

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

This application discloses a mining method and a mining system for frequent items in a data stream. The mining method for frequent items in the data stream comprises: reading a data item from the data stream; looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequencies; when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array. In this way, the frequency of each data item can be kept up to date, the frequent items in the data stream can be mined quickly, space is greatly saved, and estimation accuracy is improved.

Description

Mining method and mining system for frequent items in a data stream
Technical field
This application relates to the field of data streams, and more particularly to a mining method and a mining system for frequent items in a data stream.
Background
Over the past ten-plus years, industry has come to appreciate the importance of collecting statistics on and analyzing data streams, and of mining useful information from them. Data streams are widely used in many fields. A data stream is an unbounded data sequence that evolves over time, characterized by unboundedness, continuity, and high speed. The volume of data in a data stream is enormous, and it is generally impossible to keep all of the data in the stream in memory; while some data is read from the stream into memory, other data must be discarded, and the discarded data cannot be recovered.
A fundamental problem over data streams is to find the data items that appear most frequently in the stream and to report the frequencies with which these items appear. Finding the most frequent items in a data stream has many practical applications, such as analyzing click streams, telephone call records, and network packet logs, detecting network fraud, and filtering spam addresses. The defining characteristic of the data-stream model is that the scale of the input data is very large: we cannot fit the entire stream in memory, and we can only read the data once, in sequence. The traditional approach of counting the frequency of every item that appears and then sorting by frequency to find the items with the largest frequency values is therefore impractical in the data-stream model, in both space and time. Mining information over data streams is thus a major challenge facing the field of data mining.
Summary of the invention
In view of the shortcomings of the related art described above, the purpose of this application is to disclose a mining method and a mining system for frequent items in a data stream, so as to solve problems in the related art such as the large space and time cost of techniques for mining frequent items in a data stream.
To achieve the above and other purposes, a first aspect of this application discloses a mining method for frequent items in a data stream, comprising the following steps: reading a data item from the data stream; looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequencies; when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array.
A second aspect of this application discloses a mining system for frequent items in a data stream, comprising: a reading module for reading a data item from the data stream; a lookup module for looking up whether the data item exists in a frequent-item array, the frequent-item array recording the data items with the highest frequencies together with their frequencies; and a record-update module for: when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data-item array.
A third aspect of this application discloses a computer-readable storage medium storing a program for data mining; when the program is executed by at least one processor, each step of the mining method for frequent items in a data stream described above is carried out.
A fourth aspect of this application discloses a data processing device, comprising: at least one processor; at least one memory; and at least one program, wherein the at least one program is stored in the at least one memory and configured to be executed by the at least one processor, and execution of the instructions by the at least one processor causes the data processing device to perform each step of the mining method for frequent items in a data stream described above.
As described above, the mining method and mining system for frequent items in a data stream of this application have the following beneficial effects. Each data item in the data stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the most frequently occurring data items and the auxiliary data-item array records the remaining data items. When a data item is read from the data stream, it is first determined whether the data item exists in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data-item array. In this way, the frequency of each data item can be kept up to date, the frequent items in the data stream can be mined quickly, space is greatly saved, and estimation accuracy is improved.
Description of the drawings
Fig. 1 is a flow diagram of the mining method for frequent items in a data stream of this application, in one embodiment.
Fig. 2 is a structural diagram of a Bloom filter array.
Fig. 3 is a flow diagram of the mining method for frequent items in a data stream of this application, in another embodiment.
Fig. 4 is a diagram of the relationship between the frequent-item hash table and the heap space of the frequent-item array.
Fig. 5 is a structural diagram of the mining system for frequent items in a data stream of this application, in one embodiment.
Fig. 6 is a diagram of the influence of the Zipf distribution parameter alpha on the error.
Fig. 7 is a diagram comparing the effect of three common algorithms, including the algorithm of this application, on Zipf-distributed data.
Fig. 8 is a diagram of the influence of the size of the Bloom filter array on the error.
Fig. 9 is a diagram of the influence of the number d of hash functions on the error.
Fig. 10 is a structural diagram of the data processing device of this application, in one embodiment.
Detailed description of the embodiments
The embodiments of this application are described below by way of specific examples; those skilled in the art can readily understand other advantages and effects of this application from the content disclosed in this specification.
The following description refers to the accompanying drawings, which describe several embodiments of this application. It should be understood that other embodiments may also be used, and that changes in composition and operation may be made without departing from the spirit and scope of this disclosure. The following detailed description should not be considered limiting; the scope of the embodiments of this application is limited only by the claims of this patent. The terms used herein are merely for describing specific embodiments and are not intended to limit this application.
Although the terms first, second, and so on are used in some instances herein to describe various elements, these elements should not be limited by these terms. These terms are used only to distinguish one element from another. For example, a first predetermined threshold could be called a second predetermined threshold, and similarly a second predetermined threshold could be called a first predetermined threshold, without departing from the scope of the various described embodiments. The first predetermined threshold and the second predetermined threshold both describe a threshold, but unless the context explicitly indicates otherwise, they are not the same predetermined threshold.
Furthermore, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprise" and "include" indicate the presence of the stated features, steps, operations, elements, components, items, types, and/or groups, but do not exclude the presence, appearance, or addition of one or more other features, steps, operations, elements, components, items, types, and/or groups. The terms "or" and "and/or" used herein are interpreted inclusively, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps, or operations is in some way inherently mutually exclusive.
A data stream is an unbounded data sequence that evolves over time. One of the most basic problems over data streams is to find the frequent items in the stream, where a frequent item is a data item whose frequency of occurrence in the stream reaches a certain level; in other words, the goal is to find the data items that appear most frequently in the stream. Because of the unboundedness, continuity, and high speed of data streams, the data volume is enormous and the entire stream cannot be kept in memory. In the related art, the frequency of every item that appears is typically counted, and the frequency values of all data items are then sorted to find the corresponding frequent items; this approach is impractical in both space consumption and time consumption. In order to reduce the corresponding space and time consumption, this application discloses a mining method for frequent items in a data stream, in which each data item in the stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the most frequently occurring data items and the auxiliary data-item array records the remaining data items. When a data item is read from the data stream, it is first determined whether the data item exists in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data-item array. In this way, the frequency of each data item can be kept up to date, the frequent items in the data stream can be mined quickly, space is greatly saved, and estimation accuracy is improved.
The mining method for frequent items in the data stream can be executed by an information processing device such as a computer device. The computer device may be any suitable device, such as a handheld computing device, a tablet computing device, a notebook computer, a desktop computer, or a server. The computer device may include one or more of the following components: a display, an input device, input/output (I/O) ports, one or more processors, memory, non-volatile storage, a network interface, a power supply, and so on. The various components may include hardware elements (such as chips and circuits), software elements (such as a tangible non-transitory computer-readable medium storing instructions), or a combination of hardware elements and software elements. In addition, note that the various components may be combined into fewer components or separated into additional components; for example, the memory and the non-volatile storage may be included in a single component. The mining method for frequent items may be performed by the computer device alone, or in cooperation with other computer devices.
This application discloses a mining method for frequent items in a data stream. Referring to Fig. 1, which is a flow diagram of the mining method for frequent items in a data stream of this application in one embodiment: as shown in Fig. 1, the mining method for frequent items in the data stream comprises the following steps.
Step S11: read a data item from the data stream. A data stream is an unbounded data sequence that evolves over time, characterized by unboundedness, continuity, and high speed. Data streams arise in many application environments; for example, large volumes of streaming data are produced in financial securities trading, weather forecasting, hydrological observation, website click-stream analysis, telephone call records, network packet logging, network fraud detection, spam address filtering, and so on. Therefore, in general, processing a data stream means receiving a portion of the stream and reading its data items in sequence; the portion of the stream that has been read may then be discarded, after which further portions of the stream continue to be received and read in sequence.
In step S11, reading a data item from the data stream means reading one data item in the course of sequentially reading the data items in the stream.
Step S13: look up whether the data item exists in a frequent-item array.
Clearly, a fundamental problem over data streams is to find the frequent items in the stream, that is, to find the most frequently occurring data items and report the frequencies with which they occur. For example, given an alphabet A = {E_i : 1 ≤ i ≤ |A|}, let S be a data stream composed of data items from the alphabet A. The frequency f_i of a data item E_i in A is its number of occurrences in the stream S. Without loss of generality, assume f_1 ≥ f_2 ≥ ... ≥ f_|A|, and assume that the goal for the stream S is to find the K data items with the highest frequency of occurrence and to report the frequency values of these K data items. Any data-stream algorithm can only read S in sequence; a data item that has been read cannot be read again, and the algorithm cannot read a data item at a specified position in the stream S the way random access memory (RAM) allows. In theory, providing an exact solution to this problem requires complete information about the frequencies of all data items, which is impractical in the data-stream model; work therefore concentrates mainly on mining approximate solutions. Approximate here means providing a good estimate of each data item's frequency of occurrence, so that the error relative to the true value is as small as possible and the K data items that the algorithm reports as having the highest frequency of occurrence are, in general, indeed the true top K.
Traditionally, in the related art, each data item in the stream S is read in sequence, the data item that has been read is placed together with information such as its frequency in a corresponding data structure, and the frequency values of the data items in the data structure are then sorted to find the several data items with the largest frequency values. In this related art, however, both the space consumption and the time consumption are large.
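For contrast, the related-art baseline described above, counting every item and then sorting by frequency, can be sketched as follows. This is an illustration of the baseline only, not of the patented method; the function name and sample stream are our own:

```python
from collections import Counter

def naive_top_k(stream, k):
    """Related-art baseline: count every distinct item, then sort.

    Needs space proportional to the number of distinct items plus a
    full sort at the end, which is exactly what the data-stream model
    cannot afford for large alphabets."""
    counts = Counter()
    for item in stream:           # one sequential pass, as the model allows
        counts[item] += 1
    return counts.most_common(k)  # (item, frequency) pairs, highest first

print(naive_top_k(["a", "b", "a", "c", "a", "b"], 2))  # [('a', 3), ('b', 2)]
```

The per-pass work is cheap, but the counter grows with the number of distinct items, which motivates the bounded structures the method introduces next.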
As described in step S13, the mining method for frequent items in a data stream of this application specifically devises a frequent-item array. The frequent-item array records the data items with the largest frequencies in the data stream (that is, the frequent items) together with their frequencies, namely the K most frequently occurring data items and their frequencies. Hereinafter, the data items recorded in the frequent-item array are also referred to as frequent items. Therefore, the mining method for frequent items in a data stream of this application further comprises the step of creating a frequent-item array in advance. As previously described, assume that our purpose is to mine the K data items with the highest frequency of occurrence in the data stream S. A frequent-item array is created with size K and addresses from 0 to K-1, each entry of which holds some data item from the alphabet A together with its frequency.
In step S13, to implement the lookup of whether the data item exists in the frequent-item array, in the present embodiment a frequent-item hash table is also created for the frequent-item array when the frequent-item array is created. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, in step S13, when some data item x is read from the data stream S, the frequent-item hash table can be used to determine whether the data item x that has been read is in the frequent-item array, that is, whether the data item x is recorded in the frequent-item array. In this way, testing whether a data item belongs to the records of the frequent-item array TOPK takes only O(1) time using the frequent-item hash table. In one embodiment, the frequent-item hash table can be obtained by applying a hash function to each data item recorded in the frequent-item array. Thus, when a data item is read, the hash function is applied to the data item to obtain a computed result, the result is matched against the corresponding frequent-item hash table, and whether the data item exists in the frequent-item array is determined from the matching result. Although the foregoing embodiment uses a frequent-item hash table to achieve fast access to data items, the method is not limited thereto; those skilled in the art may also use other data structures with similar functionality.
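One way to realize the frequent-item array together with its hash table, giving the O(1) membership test described above, is sketched below. This is a minimal illustration under our own naming (the patent does not prescribe this code); a Python dict plays the role of the frequent-item hash table, mapping each recorded data item to its slot address in the array:

```python
class FrequentItemArray:
    """Frequent-item array of size K plus a hash table mapping each
    recorded data item to its slot index, giving O(1) lookup."""

    def __init__(self, k):
        self.k = k
        self.slots = []   # list of [item, frequency] pairs, at most K entries
        self.index = {}   # frequent-item hash table: item -> slot address

    def add(self, item, freq):
        """Place a new item into the next free slot (used while filling)."""
        self.index[item] = len(self.slots)
        self.slots.append([item, freq])

    def contains(self, item):
        return item in self.index             # O(1) membership test

    def increment(self, item):
        self.slots[self.index[item]][1] += 1  # step S15: frequency += 1

    def frequency(self, item):
        return self.slots[self.index[item]][1]

topk = FrequentItemArray(k=2)
topk.add("a", 5)
topk.add("b", 3)
topk.increment("a")
print(topk.frequency("a"))  # 6
print(topk.contains("z"))   # False
```

Using a hash table alongside the array avoids scanning all K slots on every stream item, which is the point of the O(1) claim above.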
Step S15: when the data item is found in the frequent-item array, update the frequency value of the corresponding data item in the frequent-item array. As mentioned above, in step S13 the frequent-item hash table makes it very convenient to determine whether the data item that has been read is in the frequent-item array. Therefore, in step S15, when it is determined that the data item is present in the records of the frequent-item array, which shows that the data item belongs to the K data items with the highest frequency of occurrence, the frequency value of the corresponding data item in the frequent-item array is increased by 1, accumulating the frequency value of the data item.
Step S17: when the data item is not found in the frequent-item array, update the frequency value of the corresponding data item in an auxiliary data-item array.
In this application, each data item in the data stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the most frequently occurring data items and the auxiliary data-item array records the remaining data items. In the present embodiment, the auxiliary data-item array may use a conventional data structure, for example a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly determine whether an element belongs to a set. Given a set X, the Bloom filter algorithm iterates over each element in X and determines whether that element belongs to a set Y.
The use of the Bloom filter array is briefly described below.
Assume the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial values of 0. For each element y in Y, d independent hash functions h_1, h_2, ..., h_d map y to addresses y_1, y_2, ..., y_d (0 ≤ y_i ≤ M-1, 1 ≤ i ≤ d), and the values at these addresses are set to 1. For each element x in X, the same d hash functions give the d hash values h_1(x), h_2(x), ..., h_d(x) of x; the values at these d addresses in the array are then checked. If the value at some address is 0, it can be concluded that x does not belong to Y; if they are all 1, we conclude that x belongs to Y with probability very close to 1.
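The Bloom filter scheme just described can be sketched as follows. The d independent hash functions are simulated here by salting a cryptographic hash, which is our illustrative choice rather than the construction the patent uses:

```python
import hashlib

class BloomFilter:
    """Array of M bits with d salted hashes per element: a sketch of
    the classic scheme described in the text, not the patent's code."""

    def __init__(self, m, d):
        self.m, self.d = m, d
        self.bits = [0] * m

    def _addresses(self, x):
        # Derive d addresses by salting SHA-256 with the function index.
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, x):
        """Insert an element of Y: set all d addressed bits to 1."""
        for a in self._addresses(x):
            self.bits[a] = 1

    def might_contain(self, x):
        """Membership query: True only if all d addressed bits are 1."""
        return all(self.bits[a] == 1 for a in self._addresses(x))

bf = BloomFilter(m=1000, d=4)
for y in ["apple", "banana"]:
    bf.add(y)
print(bf.might_contain("apple"))   # True
print(bf.might_contain("durian"))  # almost certainly False (false positives possible)
```

A 0 at any of the d addresses proves absence; all 1s only indicate probable presence, which is the one-sided error the text alludes to.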
The power of the Bloom filter array lies in the fact that it does not need to store all the elements of Y in full, which is very effective when |Y| is very large. Compared with other data structures, therefore, the Bloom filter has a large advantage in both space and time. The Bloom filter algorithm is a randomized algorithm, requiring O(|X|) operations and O(|Y|) space.
Accordingly, based on the Bloom filter array, the step of updating the frequency value of the corresponding data item further comprises the following.
Step S171: use the Bloom filter array to determine whether the data item exists in the Bloom filter array.
As mentioned above, a Bloom filter is a data structure that can be used to quickly determine whether an element belongs to a set. Therefore, in the present embodiment, the Bloom filter can be used to determine whether the data item that has been read exists in the Bloom filter array.
In advance, a Bloom filter is created. Assume the size of the Bloom filter array is M, with addresses from 0 to M-1 and each entry initialized to 0.
In addition, d mutually independent hash functions h_1, h_2, ..., h_d are created for the Bloom filter array. In this embodiment, any one of the d hash functions is obtained in the following manner: let M be a prime number, uniformly choose r+1 numbers a_0, a_1, ..., a_r from {0, 1, 2, ..., M-1}, and define h_a(x) = (a_0 + a_1·x + a_2·x^2 + ... + a_r·x^r) mod M, obtaining a hash function h_a.
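The construction above, with M prime and r+1 coefficients drawn uniformly from {0, ..., M-1}, is the standard polynomial universal-hash family h_a(x) = (a_0 + a_1·x + ... + a_r·x^r) mod M. The original formula is garbled in the source, so the sketch below follows that standard form under our own function names:

```python
import random

def make_hash(m, r, rng=random):
    """Return one hash function h_a(x) = (a_0 + a_1*x + ... + a_r*x^r) mod M,
    with the r+1 coefficients drawn uniformly from {0, ..., M-1}.
    M should be prime for the usual universality guarantees."""
    coeffs = [rng.randrange(m) for _ in range(r + 1)]

    def h(x):
        acc = 0
        for a in reversed(coeffs):   # Horner's rule: ((a_r*x + a_{r-1})*x + ...)
            acc = (acc * x + a) % m
        return acc

    return h

M = 10007                  # a prime modulus
h = make_hash(M, r=3)
print(0 <= h(42) < M)      # True: outputs always land in [0, M-1]
print(h(42) == h(42))      # True: each drawn function is deterministic
```

Drawing fresh coefficients d times yields the d mutually independent hash functions the embodiment calls for.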
In the determination process, the d mutually independent hash functions h_1, h_2, ..., h_d map the data item x that has been read into multiple addresses of the Bloom filter array; that is, the d mutually independent hash functions h_1, h_2, ..., h_d perform hash calculations on the data item x that has been read, yielding the d hash values h_1(x), h_2(x), ..., h_d(x) (see Fig. 2). The values at the multiple addresses (that is, at h_1(x), h_2(x), ..., h_d(x)) are then checked: when the values at all of the addresses are nonzero (at least 1), it can be determined that the data item x exists in the Bloom filter array; conversely, when at least one of the addresses has the value 0, it can be determined that the data item x is not in the Bloom filter array.
Step S173: based on the determination result that the data item exists in the Bloom filter array, update the frequency value of the corresponding data item in the Bloom filter array.
As described in step S171, the Bloom filter array can be used to determine whether the data item that has been read exists in the Bloom filter array. When it is determined, on the basis of step S171, that the data item that has been read exists in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
Specifically, updating the frequency value of the corresponding data item in the Bloom filter array comprises the following steps.
Use the multiple mutually independent hash functions to map the data item that has been read into multiple addresses of the Bloom filter, and increase the values at the multiple addresses by 1. In the present embodiment, the d mutually independent hash functions h_1, h_2, ..., h_d map the data item x that has been read into d addresses of the Bloom filter array; that is, the d mutually independent hash functions h_1, h_2, ..., h_d perform hash calculations on x, yielding the d hash values h_1(x), h_2(x), ..., h_d(x). This calculation is in fact identical to the hash calculation on x in step S171. The values at those d addresses of the Bloom filter array are then increased by 1.
Choose the minimum among the values at the multiple addresses, and record the minimum as the frequency value of the data item. In the present embodiment, choosing the minimum among the values at the multiple addresses means taking the minimum of the values at the d addresses, denoted min{BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x))}, and recording this minimum as the frequency value of data item x. Here, min{BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x))} approximates the actual frequency of data item x, for the following reasons. First, the inventor of this application found that each of the d values BF(h_1(x)), BF(h_2(x)), ..., BF(h_d(x)) is necessarily greater than or equal to the frequency of occurrence of the current data item x, because each occurrence of x increases the values at all d of these addresses by 1; and if x enters the frequent-item array and is recorded there, then is later swapped back into the Bloom filter array, the effect is the same. Moreover, a larger error is produced only when every one of the d addresses h_1(x), h_2(x), ..., h_d(x) is also shared by the hash values of data items other than x, and the frequencies of those colliding data items are all very large. Since the probability that all d addresses are shared by very high-frequency data items is small, and since elements with relatively large frequencies are recorded by the frequent-item array, the probability of producing a large error is very small.
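The update-and-estimate step just described, incrementing all d counters for x and taking their minimum as the estimated frequency, closely resembles a counting Bloom filter or Count-Min sketch. A minimal sketch, with our own salted-hash choice standing in for the d independent hash functions:

```python
import hashlib

class CountingBloomFilter:
    """Array of M counters (not bits); each occurrence of x increments
    the d addressed counters, and min over them estimates x's frequency."""

    def __init__(self, m, d):
        self.m, self.d = m, d
        self.counters = [0] * m

    def _addresses(self, x):
        return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % self.m
                for i in range(self.d)]

    def update(self, x):
        """One occurrence of x: increment the d addressed counters, then
        return min{BF(h_1(x)), ..., BF(h_d(x))} as the estimated frequency."""
        addrs = self._addresses(x)
        for a in addrs:
            self.counters[a] += 1
        return min(self.counters[a] for a in addrs)

    def estimate(self, x):
        return min(self.counters[a] for a in self._addresses(x))

cbf = CountingBloomFilter(m=1024, d=4)
for _ in range(5):
    cbf.update("x")
print(cbf.estimate("x"))  # at least 5; exactly 5 unless addresses collide
```

Because every counter touched by x rises with every occurrence of x, the minimum never underestimates, matching the one-sided error argument in the text.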
In addition, note that on the basis of step S171 it may also be determined that the data item that has been read does not exist in the Bloom filter array. For this situation, in one embodiment, information for the data item can be newly added in the Bloom filter array, recording the data item and its frequency. In another embodiment, the characteristics of the data items in the stream are taken into account; for example, the data items in the stream may follow a Zipf distribution. The Zipf distribution was proposed by the American scholar G. K. Zipf and can be roughly stated as: in a corpus of natural language, the frequency with which a word occurs is inversely proportional to its rank in the frequency table. For example, the frequency of occurrence of the second most common word is about 1/2 that of the most common word, the frequency of the third most common word is about 1/3 that of the most common word, and so on: the frequency of the Nth most common word is about 1/N that of the most common word. Therefore, for a data stream that follows a Zipf distribution, since the data item that has been read is neither in the frequent-item array nor in the Bloom filter array serving as the auxiliary data-item array, it is reasonable to think that the possibility of this data item becoming a most frequent item is very small; a relatively simple way to handle it is thus to simply discard the data item without recording it.
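The 1/N rank-frequency relationship described above (the alpha = 1 case of the Zipf distribution) can be illustrated numerically; the heavy skew is what justifies discarding an item that appears in neither array, since under a Zipf distribution most of the mass sits on the first few ranks. The helper below is our own illustration:

```python
def zipf_frequencies(n_ranks, alpha=1.0):
    """Relative frequency of the item at each rank under Zipf's law:
    f(rank) proportional to 1 / rank**alpha, normalized to sum to 1."""
    weights = [1.0 / rank ** alpha for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(1000)
print(round(freqs[0] / freqs[1], 2))  # 2.0: rank 1 occurs twice as often as rank 2
print(round(freqs[0] / freqs[9], 1))  # 10.0: rank 1 occurs ten times as often as rank 10
print(sum(freqs[:10]))                # roughly 0.39: the top 10 of 1000 ranks carry most of the head
```

Larger alpha (Fig. 6 studies this parameter) makes the head even heavier, so the items worth tracking concentrate further into the frequent-item array.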
Through steps S11 to S17 above, a data item is read from the data stream, a lookup for the data item is performed in a frequent-item array or an auxiliary data-item array, and, according to the result of determining whether the data item exists in the frequent-item array or the auxiliary data-item array, the frequency value of the data item is updated in the corresponding frequent-item array or auxiliary data-item array, completing the recording of one data item.
After the recording of one data item in the stream is completed (the frequency value of the corresponding data item having been updated in the frequent-item array in step S15, or in the auxiliary data-item array in step S17), steps S11 to S17 can be repeated to read and record the next data item, until the recording of every data item in the stream has been completed in sequence. At any given stage, or at the end, the most frequently occurring data items (that is, those with the highest frequency of occurrence) in the processed stream can be obtained simply by retrieving the records in the frequent-item array, which is quick and convenient.
This application discloses a mining method for frequent items in a data stream, in which each data item in the stream is recorded by a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the most frequently occurring data items and the auxiliary data-item array records the remaining data items. When a data item is read from the data stream, it is first determined whether the data item exists in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data-item array. In this way, the frequency of each data item can be kept up to date, the frequent items in the data stream can be mined quickly, space is greatly saved, and estimation accuracy is improved.
As described in the foregoing embodiment, when a data item that has been read is present in the frequent-item array, the frequency value of the data item is updated in the frequent-item array; when the data item is present in the auxiliary data-item array, the frequency value of the data item is updated in the auxiliary data-item array. Since the frequency values of the data items in the frequent-item array and in the auxiliary data-item array are updated dynamically, as the data items in the data stream are read and updated, the frequency value of a certain data item (or items) in the auxiliary data-item array may come to exceed the frequency value of a certain data item (or items) in the frequent-item array; that is, certain data items in the frequent-item array may no longer belong to the several items with the highest frequencies (that is, the frequent items). Therefore, the method for mining frequent items in a data stream of the present application further includes an operation of replacing a certain data item (or items) in the frequent-item array with a certain data item (or items) in the auxiliary data-item array.
Referring to Fig. 3, a flow diagram of the method for mining frequent items in a data stream of the present application in another embodiment is shown. As shown in Fig. 3, the method for mining frequent items in the data stream includes the following steps:
In step S21, a data item is read from the data stream.
In step S21, reading a data item from the data stream means reading one data item in the course of sequentially reading the data items in the data stream.
In step S22, a frequent-item array is searched to determine whether the data item is present therein.
As described in step S22, in the method for mining frequent items in a data stream of the present application, a frequent-item array is specially designed; the frequent-item array records the data items with the highest frequencies in the data stream (that is, the frequent items) together with their frequencies, i.e., the K most frequently occurring data items and their frequencies. Accordingly, the method for mining frequent items in a data stream of the present application further includes a step of creating the frequent-item array in advance. As stated above, assuming that the goal is to mine the K data items with the highest frequency of occurrence in a data stream S, a frequent-item array of size K is created, with addresses from 0 to K-1, each entry of which holds a data item from the alphabet A and its frequency.
In step S22, in order to search the frequent-item array for the data item, the present embodiment further creates a frequent-item hash table (Hash table) for the frequent-item array when the frequent-item array is created. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, in step S22, when a data item x is read from the data stream S, the frequent-item hash table can be used to determine whether the data item x that has been read is in the frequent-item array, that is, whether the data item x is recorded in the frequent-item array. In this way, detecting whether a data item belongs to the records of the frequent-item array TOPK takes only O(1) time when the frequent-item hash table is used. In one embodiment, the frequent-item hash table may be obtained by applying a hash function to each data item recorded in the frequent-item array. Thus, when a data item is read, a hash function is applied to the data item to obtain a computation result, the result is matched against the frequent-item hash table, and whether the data item is present in the frequent-item array is determined from the matching result. Although the foregoing embodiment uses a frequent-item hash table to achieve fast access to the data items, the application is not limited thereto, and those skilled in the art may also use other data structures with similar functionality.
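The O(1) membership test of step S22 can be sketched as follows. This is a minimal illustration, not the patent's implementation: a Python dict plays the role of the frequent-item hash table, mapping each recorded item to its slot in the frequent-item array; all item names and counts are invented for the example.

```python
# Frequent-item array of size K: (item, frequency) pairs.
K = 3
topk = [("a", 9), ("b", 7), ("c", 5)]

# Frequent-item hash table: item -> slot index in the array.
topk_index = {item: i for i, (item, _) in enumerate(topk)}

def in_topk(x):
    """Step S22: O(1) expected-time check whether x is recorded in TOPK."""
    return x in topk_index

assert in_topk("b")
assert not in_topk("z")
```

The dict gives expected O(1) lookup, matching the stated cost of detecting whether a data item belongs to the TOPK records.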
In step S23, when the data item is found in the frequent-item array, the frequency value of the corresponding data item in the frequent-item array is updated. As stated above, in step S22 the frequent-item hash table makes it very convenient to determine whether the data item that has been read is in the frequent-item array. Therefore, in step S23, when it is determined that the data item is present in the records of the frequent-item array, that is, when the data item is shown to belong to the K items with the highest frequency of occurrence, the frequency value of the corresponding data item in the frequent-item array is increased by 1, thereby accumulating the frequency value of the data item.
In step S24, the minimum frequency value and its corresponding data item are looked up in the frequent-item array.
A conventional way to find the minimum frequency value in the frequent-item array is a traversal computation, that is, traversing the frequency values of all data items in the frequent-item array once and computing the minimum from them. However, this implementation is rather cumbersome, with a time cost of O(K). In the present embodiment, first, as stated above, when the frequent-item array is created, a frequent-item hash table (Hash table) is created for the frequent-item array. Second, a heap (Heap) is additionally created over the frequencies of the data items in the frequent-item array, by means of which the frequency of each data item in the frequent-item array can be maintained. Furthermore, the frequent-item hash table and the heap are connected by a doubly linked list (the relationship between the frequent-item hash table and the heap is shown in Fig. 4). In this way, the frequent-item hash table is used to look up a data item x, that is, to determine whether the data item x that has been read is in the frequent-item array; the heap is used to find the minimum frequency value minTOPK (that is, the minimum among the frequency values of the data items in the frequent-item array); and the frequent-item hash table is then used to find the data item corresponding to the minimum frequency value, which may be denoted as data item y. Thus, the frequent-item hash table can quickly and accurately detect whether a data item that has been read belongs to the records of the frequent-item array TOPK, with a detection time cost of O(1), and the heap can quickly and accurately find the minimum frequency value in the frequent-item array, with a lookup time cost of only O(1). In addition, for the heap, the time cost of an insertion or deletion is O(log K).
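The heap maintenance described above can be sketched in a few lines. This is a hypothetical simplification: a `heapq` min-heap of (frequency, item) pairs gives O(1) access to minTOPK at the root and O(log K) insertion, while a dict stands in for the frequent-item hash table; "lazy deletion" of stale heap entries replaces the doubly-linked-list bookkeeping of Fig. 4.

```python
import heapq

# Frequent-item hash table: item -> current frequency (counts are invented).
counts = {"a": 9, "b": 7, "c": 5}

# Min-heap over (frequency, item); the root is (minTOPK, y).
heap = [(f, item) for item, f in counts.items()]
heapq.heapify(heap)

def min_topk():
    # Discard stale entries whose frequency changed since they were pushed.
    while heap and heap[0][0] != counts.get(heap[0][1]):
        heapq.heappop(heap)
    return heap[0]                               # (minTOPK, data item y), O(1) amortized

def increment(item):
    """Step S23: accumulate the frequency of an item already in TOPK."""
    counts[item] += 1
    heapq.heappush(heap, (counts[item], item))   # O(log K)

assert min_topk() == (5, "c")
increment("c")
assert min_topk() == (6, "c")
```

Lazy deletion keeps the sketch short; a production structure would instead track each item's heap position, as the patent's doubly linked list between the hash table and the heap suggests.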
In step S25, when the data item is not found in the frequent-item array, the frequency value of the corresponding data item is updated in an auxiliary data-item array.
In the present application, each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. In the present embodiment, the auxiliary data-item array may use a conventional data structure, for example, a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly determine whether an element belongs to a set. Given a set X, the Bloom filter algorithm iterates over each element in X and determines whether that element belongs to a set Y.
The use of a Bloom filter array is briefly described below.
Assume the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial value 0. For each element y in Y, d independent hash functions h1, h2, ..., hd are used to map y to addresses y1, y2, ..., yd (0 ≤ yi ≤ M-1, 1 ≤ i ≤ d), and the values at these addresses are set to 1. For each element x in X, the d hash values of x under the foregoing d hash functions, h1(x), h2(x), ..., hd(x), are computed, and it is checked whether the values at these d addresses in the array are all 1. If some address holds the value 0, it can be concluded that x ∉ Y; if they are all 1, we determine that x ∈ Y with probability very close to 1.
The power of the Bloom filter array lies in the fact that it does not need to store all the elements of Y in full, which is very effective when |Y| is very large. Therefore, compared with other data structures, a Bloom filter has great advantages in terms of space and time. The Bloom filter algorithm is a randomized algorithm, requiring O(|X|) operations and O(|Y|) space.
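The plain (bit-valued) Bloom filter just described can be sketched as follows. The choice of SHA-256 to derive the d addresses is purely illustrative and is not the hash family used in this application; the element values are likewise invented.

```python
import hashlib

M, d = 97, 4
bf = [0] * M           # Bloom filter array, addresses 0..M-1, initial value 0

def addresses(elem):
    # Derive d addresses from one digest (an illustrative choice only).
    digest = hashlib.sha256(elem.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(d)]

def insert(y):
    """Record y in Y: set the d addresses h1(y), ..., hd(y) to 1."""
    for a in addresses(y):
        bf[a] = 1

def maybe_contains(x):
    """Report x in Y only if all d addresses are 1 (false positives possible)."""
    return all(bf[a] == 1 for a in addresses(x))

insert("apple")
assert maybe_contains("apple")   # no false negatives for inserted elements
```

If any of the d addresses holds 0, the element is definitely absent; if all hold 1, the element is present with probability close to 1, exactly as described above.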
Accordingly, the step of updating the frequency value of the corresponding data item based on the Bloom filter array further includes:
In step S251, the Bloom filter array is used to filter the data item so as to determine whether the data item is present in the Bloom filter array.
As stated above, a Bloom filter is a data structure that can be used to quickly determine whether an element belongs to a set. Therefore, in the present embodiment, a Bloom filter can be used to determine whether the data item that has been read is present in the Bloom filter array.
A Bloom filter is created in advance. Assume the size of the Bloom filter array is M, with addresses from 0 to M-1, and the initial value of each entry is 0.
In addition, d mutually independent hash functions h1, h2, ..., hd are created for the Bloom filter array. In the present embodiment, any one of the d hash functions may be obtained in the following manner: let M be a prime, uniformly choose r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1}, and let ha(x) = (a0 + a1·x + a2·x² + ... + ar·x^r) mod M, thereby obtaining a hash function ha.
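The hash-function construction above can be sketched as follows. Note that the polynomial form is a reconstruction (the formula is garbled in the source text); with M prime and r+1 uniformly random coefficients, this is the standard degree-r polynomial hash family, and the concrete values of M, r, and d below are illustrative.

```python
import random

def make_hash(M, r, rng):
    """Draw a0, ..., ar uniformly from {0, ..., M-1} and return
    ha(x) = (a0 + a1*x + ... + ar*x^r) mod M."""
    coeffs = [rng.randrange(M) for _ in range(r + 1)]
    def ha(x):
        acc = 0
        for a in reversed(coeffs):   # Horner's rule keeps arithmetic mod M
            acc = (acc * x + a) % M
        return acc
    return ha

rng = random.Random(0)
M = 101                                               # a prime
hashes = [make_hash(M, r=3, rng=rng) for _ in range(4)]  # d = 4 functions
assert all(0 <= h(12345) < M for h in hashes)         # addresses stay in 0..M-1
```

Independent coefficient draws yield the d mutually independent functions h1, ..., hd used for the Bloom filter array.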
In the determination process, the d mutually independent hash functions h1, h2, ..., hd are used to map the data item x that has been read into multiple addresses of the Bloom filter array; that is, hash computations are performed on the data item x using the d mutually independent hash functions h1, h2, ..., hd to obtain the d hash values h1(x), h2(x), ..., hd(x) (see Fig. 2). It is then detected whether the values at these multiple addresses (that is, at h1(x), h2(x), ..., hd(x)) are all at least 1. When the values at all of the multiple addresses are at least 1, it can be determined that the data item x is present in the Bloom filter array; conversely, when at least one of the multiple addresses holds the value 0, it can be determined that the data item x is not in the Bloom filter array.
In step S253, based on the determination result that the data item is present in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
As described in step S251, the Bloom filter array can be used to determine whether the data item that has been read is present in it. When it is determined, based on step S251, that the data item that has been read is present in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array can be updated.
Specifically, updating the frequency value of the corresponding data item in the Bloom filter array includes the following steps:
The data item that has been read is mapped into multiple addresses of the Bloom filter using the multiple mutually independent hash functions, and the values at those addresses are increased by 1. In the present embodiment, the d mutually independent hash functions h1, h2, ..., hd map the data item x that has been read into d addresses of the Bloom filter array; that is, hash computations are performed on x using h1, h2, ..., hd to obtain the d hash values h1(x), h2(x), ..., hd(x). In fact, this computation is identical to the hash computation on data item x in step S251. Subsequently, the values at these d addresses of the Bloom filter array are increased by 1.
The minimum is chosen among the values at the multiple addresses, and this minimum is taken as the frequency value of the data item. In the present embodiment, choosing the minimum among the values at the multiple addresses means taking the minimum among the values at the d addresses: min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}, which may be denoted as minBF, and this minimum minBF is taken as the frequency value of the data item x. Here, min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))} approximates the actual frequency of the data item x, for the following reasons. First, the inventors of the present application have found that each of the d values BF(h1(x)), BF(h2(x)), ..., BF(hd(x)) is necessarily greater than or equal to the current frequency of occurrence of the data item x, because every occurrence of x increases the values at these d addresses by one; if x enters the frequent-item array to be recorded there and is later moved back into the Bloom filter array, the effect is the same. Moreover, only when every one of the d addresses h1(x), h2(x), ..., hd(x) is also shared with the hash values of other data items distinct from x, and the frequencies of those shared data items are all very large, will a larger error be produced. Since the probability that all d addresses are shared with very high-frequency data items is small, and since the elements with comparatively large frequencies are recorded in the frequent-item array, the probability of a large error is very small.
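The increment step and the minBF estimate above can be sketched with a counter-valued Bloom filter. This is a hedged illustration, not the patent's implementation: two tiny modular hash functions stand in for the hash family, and the item values are invented; the key property shown is that the estimate never undercounts, since counters only over-count on collisions.

```python
M, d = 11, 2
BF = [0] * M                                   # counter-valued Bloom filter array
hashes = [lambda x: x % M,                     # toy stand-ins for h1, h2
          lambda x: (3 * x + 5) % M]

def record(x):
    """One arrival of x: increase BF[h_i(x)] by 1 for each of the d hashes."""
    for h in hashes:
        BF[h(x)] += 1

def min_bf(x):
    """Frequency estimate minBF = min_i BF[h_i(x)]."""
    return min(BF[h(x)] for h in hashes)

for _ in range(4):
    record(7)                                  # item 7 occurs 4 times
record(9)                                      # another item occurs once

assert min_bf(7) >= 4                          # estimate never below the true count
```

Taking the minimum over the d counters discards the largest collision inflation, which is why minBF is the preferred estimate rather than any single counter.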
In addition, it should be noted that, based on step S251, it may also be determined that the data item that has been read is not present in the Bloom filter array. For this situation, in one embodiment, information on the data item may be newly added to the Bloom filter array, recording the data item and its frequency. In another embodiment, in view of the characteristics of the data items in the data stream, for example, that the data items in the data stream follow a Zipf distribution, and since the data item that has been read is neither in the frequent-item array nor in the Bloom filter array serving as the auxiliary data-item array, it is reasonable to consider the possibility that this data item could become one of the most frequent data items to be very small; thus, a relatively simple way of handling it is simply to discard the data item without recording it.
Through steps S21 to S25 described above, a data item is read from the data stream, the frequent-item array or the auxiliary data-item array is searched for the data item, and, according to the result of the determination as to whether the data item is present in the frequent-item array or in the auxiliary data-item array, the frequency value of the data item is updated in the corresponding frequent-item array or auxiliary data-item array, thereby completing the recording of one data item.
In step S26, the frequency value of the corresponding data item in the auxiliary data-item array is compared with the minimum frequency value in the frequent-item array.
As stated above, in step S24 the minimum frequency value minTOPK and its corresponding data item y are found in the frequent-item array. In practical applications, the minimum frequency value minTOPK and its corresponding data item y found in the frequent-item array can be recorded upon each update and retrieved in step S26. In the foregoing step S25, in the Bloom filter array (taking a Bloom filter array as the example of the auxiliary data-item array), the frequency value of the data item x that has been read is denoted as minBF = min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}. Therefore, in step S26, the recorded minimum value in the frequent-item array is retrieved, and the frequency value of the corresponding data item in the auxiliary data-item array obtained in step S25 is compared with the recorded minimum value in the frequent-item array; specifically, the frequency value minBF of the data item x recorded in the Bloom filter array is compared with the minimum frequency value minTOPK recorded in the frequent-item array.
In step S27, when the frequency value of a certain data item in the auxiliary data-item array is greater than or equal to the minimum frequency value in the frequent-item array, the data item corresponding to the minimum frequency value in the frequent-item array is replaced with the data item corresponding to that frequency value in the auxiliary data-item array, so that the latter is recorded in the frequent-item array, and the displaced data item corresponding to the minimum frequency value is transferred to the auxiliary data-item array to be recorded there.
Taking a Bloom filter as the auxiliary data-item array, in step S27, when the frequency value minBF of the data item x recorded in the Bloom filter array is greater than or equal to the minimum frequency value minTOPK recorded in the frequent-item array, the operations performed include: deleting the data item y corresponding to the minimum frequency value minTOPK from the frequent-item array, while inserting the data item x corresponding to the frequency value minBF in the Bloom filter array into the frequent-item array and recording the frequency value minBF corresponding to the inserted data item x; and, correspondingly, deleting the data item x from the Bloom filter array while inserting the data item y into the Bloom filter array. The time cost of deleting and inserting a data item in the frequent-item array is O(log K), since insertion into and deletion from the heap each cost O(log K).
Additionally, since the data items x and y are exchanged, in the Bloom filter array, on the one hand, the values at the d addresses corresponding to the d hash functions applied to data item x are updated by subtracting minBF from the values at the addresses h1(x), h2(x), ..., hd(x), that is, BF(hi(x)) = BF(hi(x)) - minBF, where 1 ≤ i ≤ d; on the other hand, the values at the d addresses corresponding to the d hash functions applied to data item y are updated by adding minTOPK to the values at the addresses h1(y), h2(y), ..., hd(y), that is, BF(hi(y)) = BF(hi(y)) + minTOPK, where 1 ≤ i ≤ d.
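The exchange of step S27 can be sketched under the same counter-Bloom-filter assumptions: x (estimate minBF) displaces y (count minTOPK) in the frequent-item array, x's counters are reduced by minBF, and y's counters are increased by minTOPK as y moves back. All concrete items, hashes, and counts are invented for the illustration; a dict stands in for the frequent-item array.

```python
M = 13
BF = [0] * M
hashes = [lambda v: v % M, lambda v: (5 * v + 1) % M]   # toy h1, h2
topk = {2: 6}                    # frequent-item array: data item y=2, minTOPK=6

def bump(v, delta):
    """Add delta to BF[h_i(v)] for each hash function."""
    for h in hashes:
        BF[h(v)] += delta

bump(4, 7)                       # data item x=4 has accumulated minBF=7 in BF
x, minBF = 4, 7
y, minTOPK = 2, 6

if minBF >= minTOPK:             # step S27 condition
    del topk[y]                  # delete y from the frequent-item array
    topk[x] = minBF              # insert x with frequency minBF
    bump(x, -minBF)              # BF(h_i(x)) -= minBF  (x leaves the filter)
    bump(y, +minTOPK)            # BF(h_i(y)) += minTOPK (y re-enters the filter)

assert topk == {4: 7}
assert min(BF[h(y)] for h in hashes) == 6   # y's estimate now reflects minTOPK
```

Subtracting minBF rather than zeroing x's counters preserves whatever contribution other colliding items made to those addresses, which keeps their estimates valid.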
Conversely, when the frequency value of the data item in the Bloom filter array is less than the minimum frequency value in the frequent-item array, the process simply ends, completing the recording of the data item x in the data stream.
After the recording of one data item in the data stream is completed, the process returns to repeat steps S21 to S27, so that the next data item is read and recorded, until the recording of every data item in the data stream has been completed in sequence. At any given stage, or at the end, the several most frequently occurring data items (that is, the frequent items) in the data stream being processed can be obtained simply by retrieving the records in the frequent-item array, which is quick and convenient.
The present application discloses a method for mining frequent items in a data stream, in which each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. When a data item is read from the data stream, it is first determined whether the data item is present in the frequent-item array; if so, the frequency value of the data item in the frequent-item array is updated; if not, the frequency value of the data item is updated in the auxiliary data-item array. In this way, the frequency of each data item can be updated, the frequent items in the data stream can be mined quickly, considerable space is saved, and the estimation accuracy can be improved.
Furthermore, in the method for mining frequent items in a data stream of the present application, the frequent-item array records the several most frequently occurring data items, while the auxiliary data-item array records the other data items, whose frequency of occurrence is relatively low; the data items in the frequent-item array and in the auxiliary data-item array can change dynamically according to their corresponding frequency values. For example, when the frequency value of some data item recorded in the auxiliary data-item array increases, it may replace a data item in the frequent-item array, and the replaced data item is then recorded by the auxiliary data-item array instead. Taking a Bloom filter array as the auxiliary data-item array, one purpose of the Bloom filter is to reduce the space used by the auxiliary data-item array, for if no Bloom filter array were used and the frequencies of occurrence were simply recorded with an ordinary array, O(|A|) space would be required. One purpose of the frequent-item array is to record the data items with the currently highest frequencies of occurrence (that is, the frequent items); another is to reduce the error of the Bloom filter array (for details, see the description in step S253 above).
Referring to Fig. 5, a structural diagram of the system for mining frequent items in a data stream of the present application in one embodiment is shown. As shown in Fig. 5, the system for mining frequent items in the data stream includes: a reading module 51, a search module 53, and a record updating module 55.
The reading module 51 is configured to read a data item from the data stream. A data stream is an unbounded data sequence evolving over time, characterized by unboundedness, continuity, and rapidity. Data streams arise in a variety of application environments; for example, large volumes of streaming data are produced in financial instrument trading, weather forecasting, hydrological observation, website click-stream monitoring, telephone call records, network packet logging, network fraud detection, spam address filtering, and the like. Therefore, typically, processing a data stream consists of receiving a portion of the stream and reading its data items in sequence; the data that has been read may then be discarded, and the other portions of the stream continue to be read in sequence.
The reading module 51 can read the data stream sequentially; reading a data item from the data stream means reading one data item in the course of sequentially reading the data items in the data stream.
The search module 53 is configured to search a frequent-item array to determine whether the data item is present therein.
Clearly, a fundamental problem for data streams is to find the several most frequently occurring data items in the stream and to give the frequencies with which these items occur. For example, given an alphabet A = {Ei | 1 ≤ i ≤ |A|}, let S be a data stream composed of data items from the alphabet A. The frequency fi of a data item Ei ∈ A is its number of occurrences in the data stream S. Without loss of generality, assume f1 ≥ f2 ≥ ... ≥ f|A|, and assume that the goal for the data stream S is to find the K data items with the highest frequency of occurrence and to give the frequency values of these K data items. Any data stream algorithm can only read S sequentially; a data item that has been read cannot be read again, and a data item at a designated position in the stream S cannot be read on demand as it could from RAM (Random Access Memory). In theory, giving an exact solution to the above problem requires complete information on the frequencies of all data items, which is impractical in the data stream model; attention is therefore concentrated mainly on mining approximate solutions. Approximate here means giving a good estimate of the frequency of occurrence of each data item, so that its error relative to the actual value is as small as possible, and so that the K items with the highest frequencies given by the algorithm are, in practice, generally the true top K.
In the traditional related art, each data item in the data stream S is read in sequence, information such as the data item read and its frequency is placed in a corresponding data structure, the frequency values of the data items in the data structure are then sorted, and the several data items with the largest frequency values are found. However, the above related art consumes relatively large amounts of both space and time.
In the system for mining frequent items in a data stream of the present application, a frequent-item array is specially provided; the frequent-item array records the data items with the highest frequencies in the data stream (that is, the frequent items) together with their frequencies, i.e., the K most frequently occurring data items and their frequencies. Accordingly, the system for mining frequent items in a data stream of the present application further includes a frequent-item array creation module 52, configured to create the frequent-item array. As stated above, assuming that the goal is to mine the K data items with the highest frequency of occurrence in a data stream S, the frequent-item array creation module 52 creates a frequent-item array of size K, with addresses from 0 to K-1, each entry of which holds a data item from the alphabet A and its frequency.
To search the frequent-item array for the data item, in the present embodiment a frequent-item hash table is further created for the frequent-item array when the frequent-item array creation module 52 creates the frequent-item array. The frequent-item hash table is associated with the data items recorded in the frequent-item array. Therefore, when the reading module 51 reads a data item x from the data stream S, the search module 53 invokes the frequent-item hash table and uses it to determine whether the data item x that has been read is in the frequent-item array, that is, whether the data item x is recorded in the frequent-item array. In this way, detecting whether a data item belongs to the records of the frequent-item array TOPK takes only O(1) time when the frequent-item hash table is used. In one embodiment, the frequent-item hash table may be obtained by applying a hash function to each data item recorded in the frequent-item array. Thus, when a data item is read, a hash function is applied to the data item to obtain a computation result, the result is matched against the frequent-item hash table, and whether the data item is present in the frequent-item array is determined from the matching result. Although the foregoing embodiment uses a frequent-item hash table to achieve fast access to the data items, the application is not limited thereto, and those skilled in the art may also use other data structures with similar functionality.
The record updating module 55 is configured to: when the data item is found in the frequent-item array, update the frequency value of the corresponding data item in the frequent-item array; and, when the data item is not found in the frequent-item array, update the frequency value of the corresponding data item in an auxiliary data-item array.
On the one hand, when the data item is found in the frequent-item array, the frequency value of the corresponding data item in the frequent-item array is updated. As described above, the search module 53 invokes the frequent-item hash table and uses it to determine whether the data item x that has been read is in the frequent-item array. When it is determined that the data item is present in the records of the frequent-item array, this shows that the data item belongs to the K items with the highest frequency of occurrence; accordingly, the record updating module 55 increases the frequency value of the corresponding data item in the frequent-item array by 1, thereby accumulating the frequency value of the data item.
On the other hand, when the data item is not found in the frequent-item array, the frequency value of the corresponding data item is updated in an auxiliary data-item array.
In the present application, each data item in the data stream is recorded by means of a frequent-item array and an auxiliary data-item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data-item array records the other data items. In the present embodiment, the auxiliary data-item array may use a conventional data structure, for example, a Bloom filter array.
A Bloom filter array is a data structure that can be used to quickly determine whether an element belongs to a set. Given a set X, the Bloom filter algorithm iterates over each element in X and determines whether that element belongs to a set Y.
The use of a Bloom filter array is briefly described below.
Assume the Bloom filter array is an array of size M, with addresses from 0 to M-1 and initial value 0. For each element y in Y, d independent hash functions h1, h2, ..., hd are used to map y to addresses y1, y2, ..., yd (0 ≤ yi ≤ M-1, 1 ≤ i ≤ d), and the values at these addresses are set to 1. For each element x in X, the d hash values of x under the foregoing d hash functions, h1(x), h2(x), ..., hd(x), are computed, and it is checked whether the values at these d addresses in the array are all 1. If some address holds the value 0, it can be concluded that x ∉ Y; if they are all 1, we determine that x ∈ Y with probability very close to 1.
The power of the Bloom filter array lies in the fact that it does not need to store all the elements of Y in full, which is very effective when |Y| is very large. Therefore, compared with other data structures, a Bloom filter has great advantages in terms of space and time. The Bloom filter algorithm is a randomized algorithm, requiring O(|X|) operations and O(|Y|) space.
Accordingly, the system for mining frequent items in a data stream of the present application further includes a Bloom filter array creation module 54, configured to create the Bloom filter array. Assume the size of the Bloom filter array is M, with addresses from 0 to M-1, and the initial value of each entry is 0. In addition, the Bloom filter array creation module 54 is further configured to create the multiple mutually independent hash functions used to determine whether the data item is recorded in the Bloom filter array. Assume the number of mutually independent hash functions is d: h1, h2, ..., hd. In the present embodiment, any one of the d hash functions may be obtained in the following manner: let M be a prime, uniformly choose r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1}, and let ha(x) = (a0 + a1·x + a2·x² + ... + ar·x^r) mod M, thereby obtaining a hash function ha.
In this way, the Bloom filter array is used to filter the data item and determine whether the data item is present in the Bloom filter array.
As mentioned above, a Bloom filter is a data structure that can quickly determine whether an element belongs to a set. Therefore, in the present embodiment, the Bloom filter can be used to determine whether the read data item is present in the Bloom filter array.
In the decision process, the d mutually independent hash functions h1, h2, ..., hd map the read data item x into multiple addresses of the Bloom filter array; that is, applying h1, h2, ..., hd to x yields d hash values h1(x), h2(x), ..., hd(x) (see Fig. 2). Whether the values at these addresses (i.e., h1(x), h2(x), ..., hd(x)) are 1 is then detected: when the values at all of the addresses are 1, it can be determined that data item x is present in the Bloom filter array; conversely, when the value at at least one of the addresses is 0, it can be determined that data item x is not in the Bloom filter array.
Subsequently, based on the determination result that the data item is present in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
That is, when it is determined that the read data item is present in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
Specifically, updating the frequency value of the corresponding data item in the Bloom filter array may include:
Mapping the read data item into multiple addresses of the Bloom filter using the multiple mutually independent hash functions, and increasing the values at those addresses by 1. In the present embodiment, the d mutually independent hash functions h1, h2, ..., hd map the read data item x into d addresses of the Bloom filter array; that is, applying h1, h2, ..., hd to x yields d hash values h1(x), h2(x), ..., hd(x). The record update module 55 then increases the values at these d addresses of the Bloom filter array by 1.
Choosing the minimum among the values at those addresses, and recording that minimum as the frequency value of the data item. In the present embodiment, this means taking the minimum of the values at the d addresses, denoted min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}, and recording this minimum as the frequency value of data item x. Here, min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))} approximates the actual frequency of data item x, for the following reasons. First, the inventor of the present application found that each of the d values BF(h1(x)), BF(h2(x)), ..., BF(hd(x)) is necessarily at least the number of occurrences of data item x, because every occurrence of x increases all d of these addresses by one; if data item x enters the frequent-item array, is recorded there, and is later moved back into the Bloom filter array, the effect is the same. Moreover, a large error arises only when all d addresses h1(x), h2(x), ..., hd(x) are also hit by hash values of data items other than x, and those colliding data items all have very large frequencies. Since the probability that all d addresses are shared with very high-frequency data items is small, and since the elements with relatively large frequencies are recorded by the frequent-item array instead, the probability of a large error is very small.
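The two update steps above — increment all d counters, then take their minimum as the frequency estimate — can be sketched as follows (names are ours; this is the counting variant of the filter described here, not a plain bit-array Bloom filter):

```python
class CountingFilter:
    """Counter array in the spirit of the auxiliary structure described above:
    each arrival of x adds 1 to x's d counters, and the frequency estimate is
    min{BF(h1(x)), ..., BF(hd(x))}, which never undercounts the true frequency."""

    def __init__(self, M, hash_fns):
        self.M = M
        self.counts = [0] * M           # addresses 0..M-1, initially 0
        self.hash_fns = hash_fns        # the d independent hash functions

    def update(self, x):
        for h in self.hash_fns:         # every occurrence bumps all d counters
            self.counts[h(x) % self.M] += 1
        return self.estimate(x)

    def estimate(self, x):
        return min(self.counts[h(x) % self.M] for h in self.hash_fns)

# toy hash functions of our own choosing, for integer items
cf = CountingFilter(M=97, hash_fns=[lambda x, s=s: x * 31 + s for s in (7, 11, 13)])
for _ in range(5):
    cf.update(42)
assert cf.estimate(42) == 5             # exact here; collisions can only inflate it
```

Because collisions only ever add to a counter, the minimum over the d counters is an overestimate of the true frequency, never an underestimate — which is what makes the comparison against the frequent-item array's minimum safe.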
In addition, it should be noted that it is still possible to determine that the read data item is not present in the Bloom filter array. For this situation, in one embodiment, information about the data item can be newly added to the Bloom filter array, recording the data item and its frequency. In another embodiment, in view of the characteristics of data items in a data stream — for example, the data items in a data stream follow a Zipf distribution — a read data item that is found neither in the frequent-item array nor in the Bloom filter array serving as the auxiliary data item array can reasonably be considered very unlikely to become one of the most frequent data items. A relatively simple treatment is therefore to discard the data item directly, without recording it.
It should be noted that when a read data item is present in the frequent-item array, its frequency value is updated in the frequent-item array; when a read data item is present in the auxiliary data item array, its frequency value is updated in the auxiliary data item array. Since the frequency values of data items in both the frequent-item array and the auxiliary data item array are updated dynamically, as data items are read from the stream and their frequency values updated, one or more data items in the auxiliary data item array may come to exceed one or more data items in the frequent-item array in frequency; that is, those data items in the frequent-item array no longer belong to the several data items with the highest frequencies (i.e., the frequent items). Therefore, the frequent-item mining method for data streams of the present application further includes an operation of replacing such data items in the frequent-item array with the corresponding data items from the auxiliary data item array.
The frequent-item mining system for data streams of the present application further includes a data item comparison module 57 and a data item replacement module 59.
The data item comparison module 57 is configured to compare the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent-item array.
To find the minimum frequency value in the frequent-item array, the conventional implementation is a traversal: the frequency values of all data items in the frequent-item array are scanned once and the minimum is computed. This is cumbersome and costs O(K) time. In the present embodiment, by contrast, first, as described above, a frequent-item hash table is created for the frequent-item array when the array itself is created. Second, a heap is additionally created over the frequencies of the data items in the frequent-item array, and this heap is used to maintain the frequency of each data item in the array; the frequent-item hash table and the heap are connected by a doubly linked list. In this way, the frequent-item hash table is used to look up a data item x and determine whether the read data item x is in the frequent-item array; the heap is used to find the minimum frequency value minTOPK (i.e., the smallest among the frequencies of the data items in the frequent-item array); and the frequent-item hash table is then used to find the data item corresponding to that minimum frequency value — for example, the data item with the minimum frequency value in the frequent-item array is denoted data item y. Using the frequent-item hash table, whether a read data item belongs to the TOPK records of the frequent-item array can be detected quickly and accurately at a time cost of O(1); using the heap, the minimum frequency value in the frequent-item array can likewise be found quickly and accurately at a time cost of only O(1). For the heap itself, an insertion or deletion costs O(log K) time.
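The hash-table-plus-heap arrangement can be sketched as follows (a simplification of our own: instead of a doubly linked list between the two structures, stale heap entries are discarded lazily on lookup, which preserves the O(1) minimum lookup and O(log K) update costs):

```python
import heapq

class TopK:
    """Frequent-item table: a dict gives O(1) membership (the hash table of the
    description), a min-heap gives the minimum frequency; instead of the doubly
    linked list, stale heap entries are skipped lazily."""

    def __init__(self, k):
        self.k = k
        self.freq = {}                  # item -> current frequency
        self.heap = []                  # (frequency, item), possibly stale

    def __contains__(self, x):
        return x in self.freq           # O(1) membership test

    def insert(self, x, f):             # O(log K)
        self.freq[x] = f
        heapq.heappush(self.heap, (f, x))

    def increment(self, x):             # O(log K)
        self.insert(x, self.freq[x] + 1)

    def min_entry(self):                # amortized O(1): (item y, minTOPK)
        while self.heap:
            f, x = self.heap[0]
            if self.freq.get(x) == f:
                return x, f
            heapq.heappop(self.heap)    # drop an out-of-date entry
        return None

    def evict(self, x):
        del self.freq[x]

topk = TopK(k=3)
for item, f in [("a", 5), ("b", 2), ("c", 9)]:
    topk.insert(item, f)
topk.increment("b")
assert topk.min_entry() == ("b", 3)     # minimum tracked across updates
```

Lazy deletion trades the doubly linked list's eager bookkeeping for occasional extra pops; either way the asymptotic costs quoted in the description hold.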
In the Bloom filter array (taking the Bloom filter array as the example of the auxiliary data item array), the frequency value of the read data item x is denoted minBF = min{BF(h1(x)), BF(h2(x)), ..., BF(hd(x))}.
Comparing the frequency value of the corresponding data item in the auxiliary data item array with the minimum value in the frequent-item array therefore means, specifically, comparing the frequency value minBF of data item x recorded in the Bloom filter array with the minimum frequency value minTOPK recorded in the frequent-item array.
The data item replacement module 59 is configured so that, when the frequency value of a data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent-item array, the data item with that frequency value in the Bloom filter array replaces the data item with the minimum frequency value in the frequent-item array and is recorded in the frequent-item array, while the replaced data item corresponding to the minimum frequency value is moved into the Bloom filter array and recorded there.
When the frequency value minBF of data item x recorded in the Bloom filter array is greater than or equal to the minimum frequency value minTOPK recorded in the frequent-item array, the operations performed include: deleting from the frequent-item array the data item y corresponding to the minimum frequency value minTOPK, while inserting into the frequent-item array the data item x corresponding to the frequency value minBF in the Bloom filter array and recording the frequency value minBF with the inserted data item x; and, correspondingly, deleting data item x from the Bloom filter array while inserting data item y into the Bloom filter array. The time cost of deleting and inserting a data item in the frequent-item array is O(log K), based on the O(log K) cost of insertion and deletion in the heap.
Additionally, since data item x and data item y are exchanged, the Bloom filter array is updated as follows: on the one hand, the values of the d addresses corresponding to data item x's d hash functions are updated by subtracting minBF from the values at addresses h1(x), h2(x), ..., hd(x), i.e., BF(hi(x)) = BF(hi(x)) - minBF, where 1 ≤ i ≤ d; on the other hand, the values of the d addresses corresponding to data item y's d hash functions are updated by adding minTOPK to the values at addresses h1(y), h2(y), ..., hd(y), i.e., BF(hi(y)) = BF(hi(y)) + minTOPK, where 1 ≤ i ≤ d.
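The exchange of data items x and y, together with the counter adjustments BF(hi(x)) = BF(hi(x)) - minBF and BF(hi(y)) = BF(hi(y)) + minTOPK, can be sketched as follows (a toy sketch with integer items and ad-hoc hash functions of our own choosing):

```python
# Shared counter array BF of prime size M, with d = 3 ad-hoc hash functions.
M = 97
BF = [0] * M
hash_fns = [lambda x, s=s: (x * 31 + s) % M for s in (7, 11, 13)]

def addrs(x):
    return [h(x) for h in hash_fns]

def estimate(x):                        # minBF = min over x's d counters
    return min(BF[a] for a in addrs(x))

def swap_if_needed(x, topk):
    """If minBF >= minTOPK, exchange x with the minimum item y of the top-K
    table, moving minBF into the table and minTOPK back into the counters."""
    min_bf = estimate(x)
    y = min(topk, key=topk.get)         # item y with the minimum frequency
    min_topk = topk[y]
    if min_bf < min_topk:
        return False                    # nothing to do; recording of x is done
    del topk[y]                         # y leaves the frequent-item table
    topk[x] = min_bf                    # x enters with its estimated count
    for a in addrs(x):
        BF[a] -= min_bf                 # BF(hi(x)) = BF(hi(x)) - minBF
    for a in addrs(y):
        BF[a] += min_topk               # BF(hi(y)) = BF(hi(y)) + minTOPK
    return True

# item 5 has arrived 3 times; the table currently holds {1: 2, 2: 8}
for _ in range(3):
    for a in addrs(5):
        BF[a] += 1
topk = {1: 2, 2: 8}
assert swap_if_needed(5, topk) and topk == {2: 8, 5: 3}
```

After the swap, x's mass has moved out of the counter array and y's recorded frequency has moved back in, so both structures stay consistent with the counts they are responsible for.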
Conversely, when the data item comparison module 57 determines that the frequency value of a data item in the Bloom filter array is less than the minimum frequency value in the frequent-item array, the operation ends, completing the recording of one data item x from the data stream.
After completing the recording of one data item from the data stream, the next data item can be read and recorded, until each data item in the data stream has been recorded in turn. At any given stage, or at the end, the most frequently occurring data items in the processed data stream (i.e., the frequent items) can be obtained simply by retrieving the records in the frequent-item array, which is fast and convenient.
It should be noted that all modules of the frequent-item mining system for data streams may be deployed on a single computer device, or the modules may be deployed separately on one or more servers in a distributed network.
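The overall per-item loop can be condensed into the following sketch (simplified: the auxiliary structure here is an exact dictionary rather than the shared counter array, so only the promotion/demotion control flow of the description is illustrated):

```python
def mine_topk(stream, k):
    """Per-item loop: look in the frequent-item table first; otherwise update
    the auxiliary count and, if it reaches the table's minimum, swap items."""
    topk, aux = {}, {}
    for x in stream:
        if x in topk:                       # found in the frequent-item array
            topk[x] += 1
        else:                               # update the auxiliary structure
            aux[x] = aux.get(x, 0) + 1
            if len(topk) < k:               # table not yet full: promote x
                topk[x] = aux.pop(x)
            else:
                y = min(topk, key=topk.get) # item y with minTOPK
                if aux[x] >= topk[y]:       # promotion/demotion swap
                    aux[y] = topk.pop(y)
                    topk[x] = aux.pop(x)
    return topk                             # the frequent items, on demand

result = mine_topk(list("abracadabra"), k=2)
assert result["a"] == 5                     # 'a' occurs 5 times and survives
```

Because counts in this sketch are exact, it shows only the control flow; in the described system the aux dict is replaced by the shared counter array, whose min-estimates can overcount and thus occasionally promote an item slightly early.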
The present application discloses a frequent-item mining system for data streams, in which each data item in the data stream is recorded by a frequent-item array and an auxiliary data item array, wherein the frequent-item array records the several most frequently occurring data items and the auxiliary data item array records the other data items. When the read module reads a data item from the data stream, the search module searches the frequent-item array for that data item; if found, the record update module updates the data item's frequency value in the frequent-item array, and if not found, the record update module updates the data item's frequency value in the auxiliary data item array. In this way, the frequency of each data item can be updated, the frequent items in the data stream can be mined quickly, space is greatly saved, and at the same time estimation accuracy can be improved.
Furthermore, in the frequent-item mining system for data streams of the present application, the frequent-item array records the several most frequently occurring data items, the auxiliary data item array records the other data items with relatively low occurrence frequencies, and the data items in the two arrays can change dynamically according to their frequency values. For example, when the frequency values of one or more data items recorded in the auxiliary data item array increase, they may replace data items in the frequent-item array, and the replaced data items are in turn recorded by the auxiliary data item array.
The experimental performance of the frequent-item mining method and mining system for data streams of the present application is described in detail below.
The experimental environment is as follows. Central processing unit: Intel Pentium 3.06 GHz; memory: 504 MB; operating system: Microsoft Windows XP.
In the experiments, IP address data streams of scale 10^8 with various arbitrary distributions were used, and the size M of the Bloom filter array (Bloom Filter, BF) was set to a scale of about 5 × 10^4; the occurrence frequency that the algorithm records for the element with the i-th largest occurrence frequency lies in the range [fi, fi + ε|S|], where ε = 0.001. Studies have shown that the data in a data stream can be approximately considered to follow a Zipf distribution. For a data stream of scale 10^8 following a Zipf distribution, a Bloom filter array size M of only about 5000 suffices to guarantee the above error, and the larger the parameter α of the Zipf distribution, the smaller the space required by the Bloom filter array. Experiments demonstrate that, using the frequent-item mining technique for data streams of the present application, the error of data item mining decreases as the Zipf distribution parameter α increases; Fig. 6 shows this trend. Referring to Fig. 6, the influence of the Zipf distribution parameter α on the error is shown schematically. Synthetic Zipf-distributed string data were used, with a data scale of 10^8 strings of length 20; the Bloom filter array size M used was 1003 and the number of hash functions d was 20. The experimental error ranges from 3.32% at α = 1.1 to 0.0001% at α = 3.0, showing that the error decreases significantly as the Zipf distribution parameter α increases.
Fig. 7 shows a schematic comparison of the effects of three common algorithms, including that of the present application, on Zipf-distributed data. It can be seen that the present application always outperforms the CountSketch algorithm (for CountSketch, see: M. Charikar, K. Chen, M. Farach-Colton. Finding Frequent Items in Data Streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP), pp. 693-703, 2002). For α < 2.2 it does not perform as well as the Space-Saving algorithm (for Space-Saving, see: A. Metwally, D. Agrawal, A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the 10th International Conference on Database Theory (ICDT), pp. 398-412, 2005), but when α ≥ 2.2, the present application outperforms Space-Saving.
Fig. 8 shows a schematic of the influence of the Bloom filter array size on the error. In the experiments, the scale of the data stream is still 10^8, and three groups of Zipf-distributed data are used, with parameter α equal to 1.5, 2.0, and 3.0 respectively. Bloom filter arrays of sizes 103, 211, 307, 499, 997, 1999, 5003, 10007, 20011, 30011, and 39989 were used (the size M of the Bloom filter array must be prime). For example, when α = 1.5, the error of the present application falls from 6.02% when the Bloom filter array size is 103 to 0.001% when the size is 39989.
Fig. 9 shows a schematic of the influence of the number of hash functions d on the error. In the experiments, the Bloom filter array size used is 997. When d = 1, the Bloom filter array degenerates into an ordinary hash table, and the effect is very poor, with an error exceeding 30%. As d grows, the error shrinks quickly; at d = 15 the error is smallest. When d grows toward the scale of the BF size, the error starts to grow again. Experience from multiple groups of experiments indicates that the effect is best when d ≈ 1.3 log|B| to 1.5 log|B|.
Referring to Fig. 10, a structural schematic of the data processing device of the present application in one embodiment is shown. As shown in Fig. 10, the data processing device 41 provided in this embodiment mainly includes a memory 410, one or more processors 411, and one or more programs stored in the memory 410, wherein the memory 410 stores execution instructions, and when the data processing device 41 runs, the processor 411 communicates with the memory 410.
The one or more programs are stored in the memory and configured as execution instructions to be executed by the one or more processors; the one or more processors execute these instructions so that the data processing device performs the above-described frequent-item mining method for data streams. That is, the processor 411 executes the instructions so that the data processing device 41 performs the method shown in Fig. 1 or Fig. 3, whereby the frequency of each data item can be updated, the frequent items in the data stream can be mined quickly, space is greatly saved, and at the same time estimation accuracy can be improved.
It should be noted that, from the description of the above embodiments, those skilled in the art can clearly understand that part or all of the present application can be realized by software in combination with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part that in essence contributes over the prior art, can be embodied in the form of a software product, which may include one or more machine-readable media storing machine-executable instructions that, when executed by one or more machines such as a computer, computer network, or other electronic device, cause the one or more machines to perform operations according to embodiments of the present application.
On this basis, the present application further provides a computer-readable storage medium on which a computer program for mining frequent items in a data stream is stored, the computer program, when executed by a processor, implementing the steps of the above-described frequent-item mining method for data streams.
In embodiments, the machine-readable media may include, but are not limited to, floppy disks, optical discs, CD-ROM (compact disc read-only memory), magneto-optical disks, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, such as: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.
It should be noted that those skilled in the art will understand that the above components may be programmable logic devices, including one or more of: programmable array logic (Programmable Array Logic, PAL), generic array logic (Generic Array Logic, GAL), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), and complex programmable logic devices (Complex Programmable Logic Device, CPLD); the present application places no particular limitation on this.
The above embodiments merely illustrate the principles and effects of the present application and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present application.

Claims (22)

1. A method for mining frequent items in a data stream, characterized by comprising the following steps:
reading a data item from the data stream;
searching a frequent-item array for the data item, the frequent-item array recording the multiple data items with the largest frequencies and their frequencies;
when the data item is found in the frequent-item array, updating the frequency value of the corresponding data item in the frequent-item array; and
when the data item is not found in the frequent-item array, updating the frequency value of the corresponding data item in an auxiliary data item array.
2. The method for mining frequent items in a data stream according to claim 1, characterized by further comprising the step of creating the frequent-item array.
3. The method for mining frequent items in a data stream according to claim 2, characterized by further comprising the following steps:
creating a frequent-item hash table for the frequent-item array; and
using the frequent-item hash table to determine whether the read data item is recorded in the frequent-item array.
4. The method for mining frequent items in a data stream according to claim 1, characterized in that the auxiliary data item array is a Bloom filter array; and
the step of updating the frequency value of the corresponding data item in an auxiliary data item array comprises:
filtering the data item using the Bloom filter array to determine whether the data item is present in the Bloom filter array; and
based on a determination result that the data item is present in the Bloom filter array, updating the frequency value of the corresponding data item in the Bloom filter array.
5. The method for mining frequent items in a data stream according to claim 4, characterized by further comprising the step of creating the Bloom filter array, wherein the size of the Bloom filter array is M, its addresses run from 0 to M-1, the initial value of each entry is 0, and multiple mutually independent hash functions are created for the Bloom filter array.
6. The method for mining frequent items in a data stream according to claim 5, characterized in that any one of the multiple hash functions is obtained through the following steps:
letting M be prime, uniformly choosing r+1 numbers a0, a1, ..., ar from {0, 1, 2, ..., M-1}, and setting ha(x) = (a0 + a1·x + ... + ar·x^r) mod M, obtaining a hash function ha(x).
7. The method for mining frequent items in a data stream according to claim 5, characterized in that the step of filtering the data item using the Bloom filter array to determine whether the data item is present in the Bloom filter array comprises:
mapping the read data item into multiple addresses of the Bloom filter array using the multiple mutually independent hash functions; and
detecting whether the values at the multiple addresses are 1: when the values at the multiple addresses are all 1, determining that the data item is present in the Bloom filter array; conversely, when the value at at least one of the multiple addresses is 0, determining that the data item is not in the Bloom filter array.
8. The method for mining frequent items in a data stream according to claim 5, characterized in that the step of updating the frequency value of the corresponding data item in the Bloom filter array comprises:
mapping the read data item into multiple addresses of the Bloom filter using the multiple mutually independent hash functions, and increasing the values at the multiple addresses by 1; and
choosing the minimum among the values at the multiple addresses and recording the minimum as the frequency value of the data item.
9. The method for mining frequent items in a data stream according to claim 4, characterized by further comprising the following steps:
comparing the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent-item array; and
when the frequency value of a data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent-item array, replacing the data item corresponding to the minimum frequency value in the frequent-item array with the data item corresponding to that frequency value in the Bloom filter array so as to record it in the frequent-item array, and moving the replaced data item corresponding to the minimum frequency value into the Bloom filter array to be recorded there.
10. The method for mining frequent items in a data stream according to claim 9, characterized by further comprising the step of searching the frequent-item array for the minimum frequency value and its corresponding data item.
11. The method for mining frequent items in a data stream according to claim 10, characterized by further comprising the following steps:
creating a frequent-item hash table according to the data items in the frequent-item array, creating a heap according to the frequencies of the data items in the frequent-item array, and establishing a doubly linked list between the frequent-item hash table and the heap; and
searching for the minimum frequency value using the heap, and searching for the data item corresponding to the minimum frequency value using the frequent-item hash table.
12. A system for mining frequent items in a data stream, characterized by comprising:
a read module for reading a data item from the data stream;
a search module for searching a frequent-item array for the data item, the frequent-item array recording the multiple data items with the largest frequencies and their frequencies; and
a record update module for: updating the frequency value of the corresponding data item in the frequent-item array when the data item is found in the frequent-item array; and updating the frequency value of the corresponding data item in an auxiliary data item array when the data item is not found in the frequent-item array.
13. The system for mining frequent items in a data stream according to claim 12, characterized by further comprising a frequent-item array creation module for creating the frequent-item array.
14. The system for mining frequent items in a data stream according to claim 13, characterized in that the frequent-item array creation module is further configured to create, according to the data items in the frequent-item array, a frequent-item hash table used to determine whether the read data item is recorded in the frequent-item array.
15. The system for mining frequent items in a data stream according to claim 12, characterized in that the auxiliary data item array is a Bloom filter array; when the data item is not found in the frequent-item array, the Bloom filter array is used to filter the data item to determine whether the data item is present in the Bloom filter array, and based on a determination result that the data item is present in the Bloom filter array, the frequency value of the corresponding data item in the Bloom filter array is updated.
16. The system for mining frequent items in a data stream according to claim 15, characterized by further comprising a Bloom filter array creation module for creating the Bloom filter array and for creating, for the Bloom filter array, multiple mutually independent hash functions used to determine whether the data item is recorded in the Bloom filter array, wherein the size of the Bloom filter array is M, its addresses run from 0 to M-1, and the initial value of each entry is 0.
17. the digging system of frequent episode in data flow according to claim 16, which is characterized in that the multiple Hash letter Any one hash function in number obtains in the following manner:
It is prime number to enable M, and r+1 number a is uniformly chosen from { 0,1,2 ... ..., M-1 }0, a1... ..., ar,Obtain a hash function ha(x)。
18. The mining system for frequent items in a data stream according to claim 16, wherein the record update module updating the frequency value of the corresponding data item in the Bloom filter array comprises:
mapping the data item read, using the plurality of mutually independent hash functions, to a plurality of addresses of the Bloom filter array, and incrementing the value at each of the plurality of addresses by 1; and
selecting the minimum among the values at the plurality of addresses, and taking that minimum as the frequency value of the data item.
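The update rule of claim 18 — hash the item to several counter addresses, increment each, and report the minimum — is the familiar Count-Min-sketch estimate. A minimal sketch, with Python's built-in `hash` salted per row standing in for the patent's independent hash functions (an assumption, not the claimed construction):

```python
class CountMinSketch:
    """Minimal sketch of the claim-18 update: d hash functions map an
    item to d counter addresses; each counter is incremented, and the
    minimum of those counters serves as the frequency estimate."""

    def __init__(self, width, depth):
        self.width = width
        self.counts = [[0] * width for _ in range(depth)]

    def _addresses(self, item):
        # Stand-in hashes: salt each row with its index. This is an
        # assumption; the patent uses mutually independent hash functions.
        return [hash((row, item)) % self.width
                for row in range(len(self.counts))]

    def update(self, item):
        for row, addr in enumerate(self._addresses(item)):
            self.counts[row][addr] += 1

    def estimate(self, item):
        return min(self.counts[row][addr]
                   for row, addr in enumerate(self._addresses(item)))
```

Taking the minimum over the addressed counters means the estimate never undercounts; collisions can only inflate it, which is why the minimum is the tightest value available.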
19. The mining system for frequent items in a data stream according to claim 15, further comprising:
a data-item comparison module, configured to compare the frequency value of a data item in the Bloom filter array with the minimum frequency value in the frequent-item array; and
a data-item replacement module, configured to, when the frequency value of a certain data item in the Bloom filter array is greater than or equal to the minimum frequency value in the frequent-item array, replace the data item corresponding to the minimum frequency value in the frequent-item array with the data item corresponding to that frequency value in the Bloom filter array, so that the latter is recorded in the frequent-item array, and move the evicted data item corresponding to the minimum frequency value into the Bloom filter array for recording.
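The swap in claim 19 — promote a sketched item whose estimate reaches the current minimum of the frequent-item array, and demote the evicted minimum back into the sketch — resembles Space-Saving-style eviction. A sketch under stated assumptions: `frequent` is a plain dict of item to count, and the `demote` callback is a hypothetical stand-in for re-recording the evicted item in the Bloom filter array:

```python
def maybe_promote(item, est, frequent, capacity, demote):
    """Sketch of claim 19: if the sketch estimate `est` of `item` is
    >= the minimum count in the frequent-item table, evict the minimum
    item (handing it to `demote`, a stand-in for re-recording it in the
    Bloom filter array) and record `item` instead. `frequent` maps
    item -> count; `capacity` bounds its size. Returns True on insert."""
    if len(frequent) < capacity:
        frequent[item] = est
        return True
    min_item = min(frequent, key=frequent.get)
    if est >= frequent[min_item]:
        demote(min_item, frequent.pop(min_item))
        frequent[item] = est
        return True
    return False
```

The linear `min` scan here is only for clarity; claim 20 replaces it with a heap so the minimum is found without scanning.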
20. The mining system for frequent items in a data stream according to claim 19, wherein the frequent-item array creation module is further configured to:
create a frequent-item hash table from the data items in the frequent-item array, create a heap space from the frequencies of the data items in the frequent-item array, and establish a doubly linked list between the frequent-item hash table and the heap space; and
use the heap space to find the minimum frequency value, and use the frequent-item hash table to look up the data item corresponding to the minimum frequency value.
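Claim 20 couples a hash table to a heap so the minimum-frequency item can be located quickly. A minimal Python sketch using `heapq` with lazy deletion of stale entries, a simplification of the patent's doubly-linked-list coupling:

```python
import heapq

class FrequentItems:
    """Sketch of claim 20: a hash table (dict) holds item -> count,
    and a heap orders (count, item) pairs so the minimum-frequency
    item is found in O(log n). Stale heap entries are skipped lazily,
    a simplification of the claimed doubly-linked-list linkage."""

    def __init__(self):
        self.table = {}
        self.heap = []

    def update(self, item, count):
        self.table[item] = count
        heapq.heappush(self.heap, (count, item))

    def min_item(self):
        # Pop entries whose count no longer matches the table (stale),
        # then return the true minimum-frequency item, if any.
        while self.heap:
            count, item = self.heap[0]
            if self.table.get(item) == count:
                return item, count
            heapq.heappop(self.heap)
        return None
```

Lazy deletion trades a little heap memory for simplicity; the doubly linked list of the claim achieves the same effect by letting the hash table invalidate its heap entry directly.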
21. A computer-readable storage medium storing a program for data mining, wherein the program, when executed by at least one processor, implements each step of the method for mining frequent items in a data stream according to any one of claims 1 to 11.
22. A data processing device, comprising:
at least one processor;
at least one memory; and
at least one program, wherein the at least one program is stored in the at least one memory and configured to be executed by the at least one processor, such that execution causes the data processing device to perform each step of the method for mining frequent items in a data stream according to any one of claims 1 to 11.
CN201810345014.5A 2018-04-17 2018-04-17 Method and system for mining frequent items in a data stream Pending CN108595581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810345014.5A CN108595581A (en) 2018-04-17 2018-04-17 Method and system for mining frequent items in a data stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810345014.5A CN108595581A (en) 2018-04-17 2018-04-17 Method and system for mining frequent items in a data stream

Publications (1)

Publication Number Publication Date
CN108595581A true CN108595581A (en) 2018-09-28

Family

ID=63611218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810345014.5A Pending CN108595581A (en) 2018-04-17 2018-04-17 The method for digging and digging system of frequent episode in data flow

Country Status (1)

Country Link
CN (1) CN108595581A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060143170A1 (en) * 2004-12-29 2006-06-29 Lucent Technologies, Inc. Processing data-stream join aggregates using skimmed sketches
CN101499097A (en) * 2009-03-16 2009-08-05 浙江工商大学 Hash table based data stream frequent pattern internal memory compression and storage method
CN102760132A (en) * 2011-04-28 2012-10-31 中国移动通信集团浙江有限公司 Excavation method and device for data stream frequent item

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yuan Zhijian et al.: "Research on Typical Bloom Filters and Their Application to Data Streams", Computer Engineering, vol. 35, no. 7, 30 April 2009 (2009-04-30), pages 5-7 *
Yuan Zhijian: "Research on Key Technologies of Burst Detection in Data Streams", China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 4, 15 April 2010 (2010-04-15), pages 15-17 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610429A (en) * 2019-09-25 2019-12-24 中国银行股份有限公司 Data processing method and device
CN110610429B (en) * 2019-09-25 2022-03-18 中国银行股份有限公司 Data processing method and device
CN112988892A (en) * 2021-03-12 2021-06-18 北京航空航天大学 Distributed system hot spot data management method
CN112988892B (en) * 2021-03-12 2022-04-29 北京航空航天大学 Distributed system hot spot data management method
CN116881338A (en) * 2023-09-07 2023-10-13 北京傲星科技有限公司 Data mining method and related equipment for data stream based on large model
CN116881338B (en) * 2023-09-07 2024-01-26 北京傲星科技有限公司 Data mining method and related equipment for data stream based on large model

Similar Documents

Publication Publication Date Title
CN112597284B (en) Company name matching method and device, computer equipment and storage medium
CN108595581A (en) Method and system for mining frequent items in a data stream
JP2022003509A (en) Entity relation mining method, device, electronic device, computer readable storage medium, and computer program
CN112559522A (en) Data storage method and device, query method, electronic device and readable medium
CN114706894A (en) Information processing method, apparatus, device, storage medium, and program product
Zhu et al. Making smart contract classification easier and more effective
Cao et al. Mapping elements with the hungarian algorithm: An efficient method for querying business process models
Henning et al. ShuffleBench: A benchmark for large-scale data shuffling operations with distributed stream processing frameworks
US10229223B2 (en) Mining relevant approximate subgraphs from multigraphs
CN111221690A (en) Model determination method and device for integrated circuit design and terminal
CN115225308B (en) Attack partner identification method for large-scale group attack flow and related equipment
US20150006578A1 (en) Dynamic search system
CN115099798A (en) Abnormal bitcoin address tracking system based on entity identification
Huang et al. Efficient Algorithms for Parallel Bi-core Decomposition
US11921690B2 (en) Custom object paths for object storage management
CN105677801A (en) Data processing method and system based on graph
CN112328807A (en) Anti-cheating method, device, equipment and storage medium
Chembu et al. Scalable and Globally Optimal Generalized L₁ K-center Clustering via Constraint Generation in Mixed Integer Linear Programming
CN112035486B (en) Partition establishing method, device and equipment of partition table
CN117539948B (en) Service data retrieval method and device based on deep neural network
CN112667679B (en) Data relationship determination method, device and server
JP2019144873A (en) Block diagram analyzer
CN108304671A (en) The data managing method and relevant apparatus of Building Information Model
Liang et al. Unsupervised clustering strategy based on label propagation
CN111199156B (en) Named entity recognition method, device, storage medium and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190509

Address after: 710000 Room 101, Block B, Yunhui Valley, 156 Tiangu Eighth Road, Yuhua Street Software New Town, Yanta District, Xi'an City, Shaanxi Province

Applicant after: Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.

Address before: 100084 Tsinghua Yuan, Beijing, Haidian District

Applicant before: Tsinghua University