Disclosure of Invention
In view of the defects or improvement demands of the prior art, the invention provides a distributed skewed-stream processing method and system based on high-frequency key counting, aiming to solve the technical problems of existing skewed-stream processing methods: the extremely high memory overhead of downstream instances under random grouping, and the load imbalance among downstream instances under key-value grouping.
To achieve the above object, according to one aspect of the present invention, there is provided a distributed skewed-stream processing method based on high-frequency key counting, comprising the steps of:
(1) Acquiring a data item e_i to be processed in a data stream, and the total number M of data items in the data stream that have been processed before data item e_i;
(2) Judging whether the key of data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then entering step (10); otherwise, entering step (3);
(3) Processing data item e_i with a counting Bloom filter (CBF) to obtain the frequency f_i of data item e_i;
(4) Judging whether the frequency f_i of data item e_i is greater than or equal to the high-frequency key threshold ε; if so, proceeding to step (5), otherwise proceeding to step (6);
(5) Judging whether the number of existing keys in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then proceeding to step (10); otherwise, inserting data item e_i and its frequency f_i into the high-frequency key set S as a new key-value pair, and then proceeding to step (10);
(6) Judging whether the frequency f_i of data item e_i is greater than or equal to the low-frequency key threshold θ; if so, proceeding to step (9), otherwise proceeding to step (7);
(7) Judging whether the low-frequency key queue Q is full; if so, deleting the data item e_h at the head node of the low-frequency key queue Q, then inserting data item e_i into the low-frequency key queue Q, and then proceeding to step (8); otherwise, directly inserting data item e_i into the low-frequency key queue Q, and then proceeding to step (9);
(8) Judging whether the decay probability p = b^(-f_h) of the data item e_h at the head node of the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h at the head node of the low-frequency key queue Q using the counting Bloom filter to obtain the updated frequency of data item e_h, and then entering step (9), where b is a preset exponent base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h at the head node of the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
(9) Allocating a downstream instance to data item e_i using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) Determining, according to the value in the high-frequency key set S corresponding to the key that is the same as data item e_i, the number of downstream instances to which that key can be allocated, selecting one downstream instance according to the determined number of downstream instances, allocating the selected downstream instance to data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process.
Preferably, the high-frequency key set S in step (2) is implemented with a data structure based on the Stream Summary of the Space Saving algorithm: keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket, and the different parent buckets in the high-frequency key set S are linked by a doubly linked list.
Preferably, the counting Bloom filter in step (3) is an array B = {B[0], B[1], …, B[w-1]} containing w counters. First, the counting Bloom filter uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to data item e_i; then the results of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, are calculated; then 1 is added to each element of the array B addressed by these results, and the minimum of all the resulting elements is taken as the frequency f_i of data item e_i.
Preferably, the high-frequency key threshold ε in step (4) is determined by the total number M of data items in the data stream that were processed before data item e_i was acquired, so that the threshold grows dynamically with M.
Preferably, in step (8), using the counting Bloom filter to update the data item e_h at the head node of the low-frequency key queue Q means decrementing by 1 the elements of the array B in the CBF that correspond to the data item e_h at the head node of the low-frequency key queue Q.
Preferably, step (10) comprises the sub-steps of:
(10-1) judging whether the difference f_max - f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2), otherwise entering step (10-5);
(10-2) judging whether data item e_i is the key with the largest value in the high-frequency key set S; if so, allocating m downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process; otherwise, entering step (10-3);
(10-3) judging whether data item e_i is the key with the smallest value in the high-frequency key set S; if so, allocating 2 downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) allocating to the key, whose value is centered in the high-frequency key set S, a number of downstream instances determined by its value, randomly appending to data item e_i one of the preset random suffixes whose number equals the number of allocated downstream instances, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process;
(10-5) allocating 2 downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process.
According to another aspect of the present invention, there is provided a distributed skewed-stream processing system based on high-frequency key counting, including:
a first module for acquiring a data item e_i to be processed in a data stream, and the total number M of data items in the data stream that have been processed before data item e_i;
a second module for judging whether the key of data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then entering the tenth module; otherwise entering the third module;
a third module for processing data item e_i with a counting Bloom filter to obtain the frequency f_i of data item e_i;
a fourth module for judging whether the frequency f_i of data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering the fifth module, otherwise entering the sixth module;
a fifth module for judging whether the number of existing keys in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then transferring to the tenth module; otherwise, inserting data item e_i and its frequency f_i into the high-frequency key set S as a new key-value pair, and then transferring to the tenth module;
a sixth module for judging whether the frequency f_i of data item e_i is greater than or equal to the low-frequency key threshold θ; if so, transferring to the ninth module, otherwise transferring to the seventh module;
a seventh module for judging whether the low-frequency key queue Q is full; if so, deleting the data item e_h at the head node of the low-frequency key queue Q, then inserting data item e_i into the low-frequency key queue Q, and then entering the eighth module; otherwise, directly inserting data item e_i into the low-frequency key queue Q, and then transferring to the ninth module;
an eighth module for judging whether the decay probability p = b^(-f_h) of the data item e_h at the head node of the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h at the head node of the low-frequency key queue Q using the counting Bloom filter to obtain the updated frequency of data item e_h, and then entering the ninth module, where b is a preset exponent base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h at the head node of the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering the ninth module;
a ninth module for allocating a downstream instance to data item e_i using a key-value grouping algorithm, and adding 1 to the total number M of processed data items in the data stream;
a tenth module for determining, according to the value in the high-frequency key set S corresponding to the key that is the same as data item e_i, the number of downstream instances to which that key can be allocated, selecting one downstream instance according to the determined number of downstream instances, allocating the selected downstream instance to data item e_i, and adding 1 to the total number M of processed data items in the data stream.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
(1) In step (2) of the invention, the high-frequency key set is stored in a lightweight data structure, so space efficiency is very high and downstream instances can easily load the set into memory; at the same time the structure supports fast online query and update of the high-frequency key set, so the keys allocated to downstream instances can be selected quickly;
(2) In step (3) of the invention, each data item in the data stream is monitored with a counting Bloom filter; the counting Bloom filter has high computational and memory efficiency and supports both insertion and deletion of data items;
(3) In step (6) of the invention, a fixed-length low-frequency key queue is used to cache low-frequency keys, and the low-frequency keys stored in the counting Bloom filter are removed with a certain decay probability following the first-in-first-out order of the queue; the decay probability filters out the relatively smaller low-frequency keys, which saves memory in the counting Bloom filter and reduces the probability of hash collisions among different data items;
(4) In step (7) of the invention, the keys in the high-frequency key set are distinguished according to their values: keys with small values are evenly distributed over 2 downstream instances, while keys with large values are allocated different numbers of downstream instances according to their values; the number of allocated downstream instances can be dynamically updated as the data stream changes, thereby achieving load balancing among downstream instances, and fewer downstream instances are allocated to keys with relatively small values, reducing unnecessary memory overhead.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the invention, as shown in fig. 1, is to count the data stream with a counting Bloom filter (Counting Bloom Filter, abbreviated as CBF) and to classify data items as high-frequency keys, potential high-frequency keys, or low-frequency keys according to their frequencies, thereby obtaining the distribution of the different data items; the high-frequency keys are stored in a high-frequency key set and the low-frequency keys in a low-frequency key queue; downstream instances are allocated to high-frequency keys with a strategy of appending random suffixes for regrouping and aggregation, and to non-high-frequency keys with a key-value grouping strategy, so that load balancing among different downstream instances is achieved and system performance is improved.
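Purely as an illustration of this control flow, the following Python sketch routes one key at a time. Plain dictionaries stand in for the Stream Summary set S and for the counting Bloom filter, the dynamic threshold formula and the constants (C, m, b, θ) are assumed placeholder values, and the helper names key_group / suffix_group are not terms used by the invention.

```python
import hashlib
import random
from collections import deque

C, m, b, theta = 20, 6, 1.08, 5   # set capacity, downstream instances, decay base, low-frequency threshold
S = {}                            # high-frequency key set: key -> value
counts = {}                       # stand-in for the CBF: key -> estimated frequency
Q = deque(maxlen=C)               # low-frequency key queue
M = 0                             # total number of processed data items

def key_group(key, n_instances):
    """Key-value grouping: the same key always maps to the same instance."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_instances

def suffix_group(key, n_suffixes, n_instances):
    """Hot-key grouping: append one of n_suffixes random suffixes, then hash."""
    return key_group(f"{key}_{random.randint(1, n_suffixes)}", n_instances)

def process(key):
    """Return the downstream instance for one data item, following steps (1)-(10)."""
    global M
    epsilon = max(2, M // (2 * C))               # assumed dynamic high-frequency threshold
    if key in S:                                 # step (2): already a high-frequency key
        S[key] += 1
        target = suffix_group(key, 2, m)         # step (10), simplified to 2 suffixes
    else:
        counts[key] = counts.get(key, 0) + 1     # step (3): CBF-style frequency estimate
        f = counts[key]
        if f >= epsilon:                         # steps (4)-(5): promote into S
            if len(S) >= C:
                victim = min(S, key=S.get)       # evict the minimum-value key
                f += S.pop(victim)               # new value is f_i + f_min
            S[key] = f
            target = suffix_group(key, 2, m)
        else:
            if f < theta:                        # steps (6)-(8): cache low-frequency key
                if len(Q) == Q.maxlen:
                    head = Q.popleft()           # evict head e_h
                    if b ** -counts.get(head, 0) > random.random():
                        counts[head] = max(0, counts.get(head, 0) - 1)
                Q.append(key)
            target = key_group(key, m)           # step (9): plain key-value grouping
    M += 1
    return target
```

This sketch keeps only the routing decisions; the memory-efficient data structures that the invention actually relies on are detailed in the steps below.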
As shown in fig. 2, the present invention provides a distributed skewed-stream processing method based on high-frequency key counting, which includes the following steps:
(1) Acquiring a data item e_i to be processed in a data stream, and the total number M of data items in the data stream that have been processed before data item e_i;
specifically, the initial value of the total number M of processed data items in the data stream is 0, and the total number M of processed data items is counted and updated as each data item in the data stream is processed in turn.
(2) Judging whether the key of data item e_i is in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then entering step (10); otherwise, entering step (3);
specifically, the format of the high-frequency key set S is, for example, {(value_1, key_1, key_2), (value_2, key_3), …}, where (value_1, key_1, key_2) is one record in the high-frequency key set S; each record holds exactly one value but can hold one or more keys. The maximum number of keys in the high-frequency key set S is C, where C is set according to the preset expected error E, with C = 1/E; key_1, key_2, key_3 denote keys and value_1, value_2 denote values.
In this step, whether data item e_i is located in the high-frequency key set S is determined by judging whether any key in the high-frequency key set S matches data item e_i. If so, this means that before data item e_i was acquired, the same data item had already been recorded in the data stream as a high-frequency key, so data item e_i is counted directly by accumulating the value of the corresponding key in the high-frequency key set.
Preferably, the high-frequency key set S is implemented with a data structure based on the Stream Summary of the Space Saving algorithm: keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket (Bucket), and the different parent buckets in the high-frequency key set S are linked by a doubly linked list.
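The behaviour of such a set can be sketched as follows; this is a simplified illustration in which a dict from count to key set approximates the doubly linked bucket list of Stream Summary (same query/update semantics as steps (2) and (5), without the constant-time pointer moves), and the class and method names are assumptions rather than terms from the invention.

```python
class HotKeySet:
    """Simplified Space-Saving-style set: keys sharing a count share a bucket."""

    def __init__(self, capacity):
        self.capacity = capacity     # maximum number of keys C
        self.count = {}              # key   -> value (count)
        self.bucket = {}             # value -> set of keys holding that value

    def __contains__(self, key):
        return key in self.count

    def _move(self, key, old, new):
        """Detach key from its old bucket (if any) and attach it to bucket `new`."""
        if old is not None:
            self.bucket[old].discard(key)
            if not self.bucket[old]:
                del self.bucket[old]
        self.bucket.setdefault(new, set()).add(key)
        self.count[key] = new

    def increment(self, key):
        """Step (2): the key is already in S, add 1 to its value."""
        self._move(key, self.count[key], self.count[key] + 1)

    def insert(self, key, freq):
        """Step (5): insert a new hot key, evicting a minimum-value key if S is full."""
        if len(self.count) >= self.capacity:
            f_min = min(self.bucket)                 # smallest value in S
            victim = self.bucket[f_min].pop()        # any key holding that value
            if not self.bucket[f_min]:
                del self.bucket[f_min]
            del self.count[victim]
            freq += f_min                            # replacement gets f_i + f_min
        self._move(key, None, freq)

# Usage mirroring the example in the next paragraph:
S = HotKeySet(capacity=20)
for k, v in [("namespace", 255), ("first", 84), ("case", 84), ("letter", 61), ("word", 35)]:
    S.insert(k, v)
S.increment("namespace")   # 255 -> 256
S.increment("first")       # first leaves the bucket it shared with case (84 -> 85)
```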
For example, suppose the currently received data items to be processed are {talk, namespace, first, title, wiki} and the high-frequency key set S already holds the records {(255, namespace), (84, first, case), (61, letter), (35, word)}. The data items talk, title and wiki are not in the high-frequency key set S, so processing for them continues at step (3); the data items namespace and first are in the high-frequency key set S, so their corresponding values are increased by 1 and the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)}.
(3) Processing data item e_i with a counting Bloom filter (Counting Bloom Filter, CBF for short) to obtain the frequency f_i of data item e_i;
specifically, the CBF is an array B = {B[0], B[1], …, B[w-1]} containing w counters. In this step, the CBF first uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to data item e_i; then the results of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, are calculated; then 1 is added to each element of the array B addressed by these results, and the minimum of all the resulting elements is taken as the frequency f_i of data item e_i.
Wherein the number t of hash functions of the CBF is preferably set according to n, the number of categories of data items in the data stream, and the number w of counters is preferably set according to δ, the error rate of the CBF; the CBF supports adding 1 to the counters when a data item is inserted and subtracting 1 from the counters when a data item is deleted.
The frequency f_i of data item e_i obtained in this step includes the occurrences of data item e_i processed in the data stream before data item e_i was acquired, and the CBF is updated continuously as the data stream arrives; the CBF uses multiple hash functions and takes the minimum counter value as the statistical frequency, which reduces the probability of hash collisions among different data items and improves statistical accuracy.
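A compact, self-contained sketch of such a counting Bloom filter is shown below; the digest-based construction of the t hash functions and the default sizes of w and t are illustrative assumptions rather than the parameters prescribed above.

```python
import hashlib

class CountingBloomFilter:
    """Array B of w counters addressed by t hash functions; min counter = estimate."""

    def __init__(self, w=1 << 16, t=4):
        self.w, self.t = w, t
        self.B = [0] * w

    def _slots(self, key):
        # Derive t hash values h_1(e_i) .. h_t(e_i) from salted digests, each modulo w.
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.w

    def add(self, key):
        """Insert e_i: add 1 to every addressed counter and return the minimum as f_i."""
        slots = list(self._slots(key))
        for s in slots:
            self.B[s] += 1
        return min(self.B[s] for s in slots)

    def remove(self, key):
        """Deletion / decay: subtract 1 from every counter addressed by the key."""
        for s in self._slots(key):
            if self.B[s] > 0:
                self.B[s] -= 1

# e.g. after 24 earlier insertions of "talk", cbf.add("talk") returns about 25
# (exactly 25 unless hash collisions inflate the counters).
```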
For the example in step (2), assume that before the data item talk is acquired, the frequency of talk recorded in the CBF over the processed data items in the data stream is 24 and the frequency of title recorded in the CBF is 29; after the processing in this step, the returned frequency of talk is 25, the returned frequency of title is 30, and the returned frequency of wiki is 1;
(4) Judging whether the frequency f_i of data item e_i is greater than or equal to the high-frequency key threshold ε; if so, the data item is a high-frequency key and step (5) is carried out, otherwise the data item is a non-high-frequency key and step (6) is carried out;
specifically, the high-frequency key threshold ε is determined by the total number M of data items in the data stream that were processed before data item e_i was acquired, and grows dynamically with M.
For the example in step (2), assume that the total number of processed data items in the data stream before the data item talk is acquired is M = 1203 and the expected error is E = 0.05 (i.e. C = 1/E = 20); the resulting high-frequency key threshold exceeds the frequency 25 of talk, so the data item talk is a non-high-frequency key. Before the data item title is acquired, the total number of processed data items in the data stream is M = 1206 and the high-frequency key threshold is ε = 30, so the data item title is a high-frequency key; the data item wiki is a non-high-frequency key;
(5) Judging whether the number of existing keys in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then proceeding to step (10); otherwise, inserting data item e_i and its frequency f_i into the high-frequency key set S as a new key-value pair, and then proceeding to step (10);
for the example in step (2), the data item title is a high-frequency key and the number of existing keys in the current high-frequency key set S does not exceed C, so the data item title is directly inserted into the high-frequency key set S, giving the updated high-frequency key set S = {(256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title)}.
(6) Judging whether the frequency f_i of data item e_i is greater than or equal to the low-frequency key threshold θ; if so, data item e_i is considered a potential high-frequency key and the process goes to step (9), otherwise it goes to step (7);
specifically, the value range of the low frequency key threshold value θ is [2,10], preferably 5;
(7) Judging whether the low-frequency key queue Q is full; if so, deleting the data item e_h at the head node of the low-frequency key queue Q, then inserting data item e_i into the low-frequency key queue Q, and then proceeding to step (8); otherwise, directly inserting data item e_i into the low-frequency key queue Q, and then proceeding to step (9);
the length of the low-frequency key queue Q is equal to C, namely the length is the same as the maximum key number of the high-frequency key set, and the length of the queue can be adjusted according to the size of the data stream in practical application;
(8) Judging whether the decay probability p = b^(-f_h) of the data item e_h at the head node of the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h at the head node of the low-frequency key queue Q using the CBF to obtain the updated frequency of data item e_h, and then entering step (9), where b is a preset exponent base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h at the head node of the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
specifically, updating the data item e_h at the head node of the low-frequency key queue Q with the CBF means decrementing by 1 the elements of the array B in the CBF that correspond to the data item e_h at the head node of the low-frequency key queue Q.
For the example in step (2), the frequency of the data item talk is 25, so talk continues to be stored in the CBF. The data item wiki needs to be inserted into the low-frequency key queue Q, but at this point the low-frequency key queue Q already holds the data items {page, trans-format, ns, ns, ce, page, …, user}, its length is 20 and it is full; therefore, before the data item wiki is inserted, the data item page at the head node of the low-frequency key queue Q is deleted, the frequency f stored for that data item page in the CBF is queried, the decay probability p = b^(-f) is calculated, a random number in [0, 1) is generated, and when the random number is smaller than the decay probability p, the array elements corresponding to the data item page in the CBF are decremented by 1.
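The eviction-with-decay of steps (7) and (8) can be sketched as follows; a plain dict stands in for the CBF counters, and the decay base b = 1.08 is only an assumed example value satisfying b > 1 and b ≈ 1.

```python
import random
from collections import deque

def push_low_freq(key, queue, counts, b=1.08):
    """Insert a low-frequency key; decay the evicted head key with probability b**(-f_h)."""
    if len(queue) == queue.maxlen:            # the low-frequency key queue Q is full
        head = queue.popleft()                # delete the head node e_h
        f_h = counts.get(head, 0)             # frequency of e_h kept by the CBF stand-in
        if b ** -f_h > random.random():       # decay probability p = b^(-f_h) > r
            counts[head] = max(0, f_h - 1)    # reverse-update the CBF by 1
    queue.append(key)

# With b close to 1, keys with a very small f_h are decayed almost surely,
# while keys that have already accumulated counts are usually left untouched.
Q = deque(maxlen=20)
counts = {"page": 3}
for k in ["page", "trans-format", "ns", "wiki"]:
    push_low_freq(k, Q, counts)
```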
The advantage of steps (3), (4) and (8) is that the frequency of every data item in the data stream is monitored with the memory-efficient CBF. On the one hand, a dynamic high-frequency key threshold is set, which adapts to changes in the size of the data stream, identifies the high-frequency key set more accurately and improves the identification precision of high-frequency keys. On the other hand, a low-frequency key threshold is set and the low-frequency keys are stored in a low-frequency key queue; following the first-in-first-out order of the queue, the corresponding array elements in the CBF are decremented by 1 with the decay probability, i.e. the CBF is reverse-updated. In practical application data most of the low-frequency keys can be filtered out this way, so the CBF can monitor the data stream with less memory and memory overhead is reduced.
(9) Allocating a downstream instance to data item e_i using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
specifically, the key-value grouping algorithm is implemented with a hash operation: the data item is hashed to obtain the allocated downstream instance number, and the downstream instance corresponding to that number is allocated to the data item, i.e. data items corresponding to the same key are allocated to the same downstream instance;
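As a minimal illustration of step (9), key-value grouping reduces to one deterministic hash; md5 is an arbitrary choice here, the point being that the hash must be stable across processes so that every occurrence of the same key lands on the same downstream instance (Python's built-in hash() is salted per process and would not satisfy this across workers).

```python
import hashlib

def key_group(key: str, num_instances: int) -> int:
    """Map a key to a downstream instance number in [0, num_instances)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_instances

# Every call with the same key returns the same instance:
assert key_group("wiki", 6) == key_group("wiki", 6)
```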
(10) Determining, according to the value in the high-frequency key set S corresponding to the key that is the same as data item e_i, the number of downstream instances to which that key can be allocated, selecting one downstream instance according to the determined number of downstream instances, allocating the selected downstream instance to data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process;
specifically, the method comprises the following sub-steps:
(10-1) judging whether the difference f_max - f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2), otherwise entering step (10-5);
(10-2) judging whether data item e_i is the key with the largest value in the high-frequency key set S; if so, allocating m downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process; otherwise, entering step (10-3);
specifically, different keys may be allocated different numbers of downstream instances, and each key is preset with as many random suffixes as the number of downstream instances it may be allocated; the suffix whose sequence number equals the random number generated by the random function is appended to the data item.
(10-3) judging whether data item e_i is the key with the smallest value in the high-frequency key set S; if so, allocating 2 downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) allocating to the key, whose value is centered in the high-frequency key set S, a number of downstream instances determined by its value, randomly appending to data item e_i one of the preset random suffixes whose number equals the number of allocated downstream instances, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process;
(10-5) allocating 2 downstream instances to the key corresponding to the data item, randomly appending to data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the allocated downstream instance number, allocating the downstream instance corresponding to that number to data item e_i, and ending the process.
For the example in step (2), assume the number of downstream instances is m = 6. After the data item namespace in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (84, first, case), (61, letter), (35, word)}, with M = 1204; f_max - f_min = 221 is greater than M/m ≈ 200, so the values of the keys in the current high-frequency key set S are considered to differ greatly, and namespace is the key with the largest value, which can be allocated to 6 downstream instances: a random number in [1, 6] is generated, the suffix with the corresponding sequence number among {_1, _2, _3, _4, _5, _6} is appended to the data item, the suffixed data item is hashed to obtain the assignable downstream instance number, and the corresponding downstream instance is allocated to the data item. After the data item first in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)}, with M = 1205; f_max - f_min = 221 is greater than M/m ≈ 200, and the data item first, a key with a centered value, can be allocated to 2 downstream instances: a random number in [1, 2] is generated, the suffix with the corresponding sequence number among {_1, _2} is appended to the data item first, the suffixed data item is hashed to obtain the assignable downstream instance number, and the corresponding downstream instance is allocated to the data item. After the data item title arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title)}, with M = 1206; f_max - f_min = 226 is greater than M/m = 201, and the data item title is the key with the smallest value, which can be allocated 2 downstream instances: a random number in [1, 2] is generated, the suffix with the corresponding sequence number among {_1, _2} is appended to the data item title, the suffixed data item is hashed to obtain the assignable downstream instance number, and the corresponding downstream instance is allocated to the data item.
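The random-suffix regrouping of steps (10-2) to (10-5) can be sketched as below; the hash function is an illustrative choice, and the per-key suffix counts (6 for namespace, 2 for first and title) follow the worked example above.

```python
import hashlib
import random

def suffix_group(key: str, num_suffixes: int, num_instances: int) -> int:
    """Append a random preset suffix ("_1" .. "_k") to a hot key, then hash."""
    suffixed = f"{key}_{random.randint(1, num_suffixes)}"   # e.g. "namespace_4"
    return int(hashlib.md5(suffixed.encode()).hexdigest(), 16) % num_instances

m = 6                                    # number of downstream instances
print(suffix_group("namespace", 6, m))   # largest value  -> spread over up to m instances
print(suffix_group("first", 2, m))       # centered value -> spread over 2 instances
print(suffix_group("title", 2, m))       # smallest value -> spread over 2 instances
```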
The advantage of this is that the keys in the high-frequency key set are distinguished according to their values: keys with small values are evenly distributed over 2 downstream instances, while keys with large values are allocated different numbers of downstream instances according to their values; the number of allocated downstream instances can be dynamically updated as the data stream changes, achieving load balancing among downstream instances, and fewer downstream instances are allocated to keys with relatively small values, reducing unnecessary memory overhead.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.