Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a distributed skewed stream processing method and system based on high-frequency key value counting, and aims to solve the technical problems, in existing skewed stream processing methods, of the large memory overhead of downstream instances under random (shuffle) grouping and of the load imbalance among downstream instances under key-value grouping.
To achieve the above object, according to one aspect of the present invention, there is provided a distributed skewed stream processing method based on high-frequency key value counting, comprising the steps of:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
(2) judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S and then entering step (10); otherwise, entering step (3);
(3) processing the data item e_i by using a counting bloom filter to obtain the frequency f_i of the data item e_i;
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering step (5); otherwise, entering step (6);
(5) judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering step (10); otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering step (10);
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering step (9); otherwise, entering step (7);
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering step (8); otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering step (9);
(8) judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter to obtain the updated frequency of the data item e_h, and then entering step (9), wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
(9) allocating a downstream instance to the data item e_i by using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process.
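For illustration only, the decision flow of steps (1) to (10) may be summarized by the following Python sketch. It is a simplified sketch rather than the claimed implementation: a plain Counter stands in for the counting bloom filter, a single hash of a suffixed item stands in for the full assignment of step (10), the thresholds ε and θ are supplied by the caller, and all function and variable names are illustrative assumptions.

    import random
    from collections import Counter, deque

    def process_item(e_i, state, epsilon, theta, b=1.05, m=6):
        # state holds S (high-frequency key set), Q (low-frequency key queue),
        # freq (Counter standing in for the CBF), C (max keys in S) and M (items processed)
        S, Q, freq = state["S"], state["Q"], state["freq"]
        if e_i in S:                                     # step (2): already a high-frequency key
            S[e_i] += 1
            instance = hash((e_i, "_1")) % m             # placeholder for the step (10) assignment
        else:
            freq[e_i] += 1                               # step (3): CBF stand-in returns the frequency
            f_i = freq[e_i]
            if f_i >= epsilon:                           # steps (4)-(5): promote to high-frequency key
                if len(S) >= state["C"]:
                    k_min = min(S, key=S.get)            # replace the key with the smallest value
                    S[e_i] = f_i + S.pop(k_min)
                else:
                    S[e_i] = f_i
                instance = hash((e_i, "_1")) % m         # placeholder for the step (10) assignment
            else:
                if f_i < theta:                          # steps (6)-(8): a low-frequency key
                    if len(Q) == state["C"]:             # queue full: evict the head with decay
                        e_h = Q.popleft()
                        if b ** (-freq[e_h]) > random.random():
                            freq[e_h] -= 1
                    Q.append(e_i)
                instance = hash(e_i) % m                 # step (9): key-value grouping
        state["M"] += 1                                  # one more processed data item
        return instance

    state = {"S": {}, "Q": deque(), "freq": Counter(), "C": 20, "M": 0}
    for item in ["talk", "namespace", "first", "title", "wiki"]:
        process_item(item, state, epsilon=30, theta=5)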
Preferably, the high-frequency key set S in step (2) is implemented by a data structure based on the Stream Summary in the Space Saving algorithm; keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket, and different parent buckets in the high-frequency key set S are linked by a doubly linked list.
Preferably, the counting bloom filter in step (3) is an array B = B[0], B[1], …, B[w-1] containing w counters; the counting bloom filter first uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to the data item e_i, then computes the result of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, thereafter adds 1 to the element of the array B at each of these positions, and takes the minimum value among all the resulting elements as the frequency f_i of the data item e_i;
Preferably, the high-frequency key threshold ε in step (4) is determined from the total number M of data items processed in the data stream before the data item e_i was obtained.
Preferably, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter in step (8) means decrementing by 1 the elements of the array B of the counting bloom filter that correspond to the data item e_h of the head node in the low-frequency key queue Q.
Preferably, step (10) comprises the sub-steps of:
(10-1) judging whether the difference f_max − f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) judging whether the data item e_i is the key with the maximum value in the high-frequency key set S; if so, assigning m downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-3);
(10-3) judging whether the data item e_i is the key with the minimum value in the high-frequency key set S; if so, assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) assigning, to the key whose value lies between the maximum and the minimum in the high-frequency key set S, a number of downstream instances determined according to its value, randomly adding to the data item e_i one of preset random suffixes equal in number to the assigned downstream instances, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process.
According to another aspect of the present invention, there is provided a distributed skewed stream processing system based on high-frequency key value counting, comprising the following modules:
a first module for acquiring a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
a second module for judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value of the key in the high-frequency key set S that is identical to the data item, and then entering the tenth module; otherwise, entering the third module;
a third module for processing the data item e_i by using a counting bloom filter to obtain the frequency f_i of the data item e_i;
a fourth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering the fifth module; otherwise, entering the sixth module;
a fifth module for judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering the tenth module; otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering the tenth module;
a sixth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering the ninth module; otherwise, entering the seventh module;
a seventh module for judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering the eighth module; otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering the ninth module;
an eighth module for judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter to obtain the updated frequency of the data item e_h, and then entering the ninth module, wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering the ninth module;
a ninth module for allocating a downstream instance to the data item e_i by using a key-value grouping algorithm and adding 1 to the total number M of processed data items in the data stream;
a tenth module for determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, and adding 1 to the total number M of processed data items in the data stream.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) in step (2), the high-frequency key set is stored in a lightweight data structure, so the space efficiency is high and the set can easily be loaded into memory by a downstream instance; meanwhile, the high-frequency key set supports fast O(1) online query and update, so the keys assigned to downstream instances can be selected quickly;
(2) in step (3), each data item in the data stream is monitored by a counting bloom filter, which has high computational and memory efficiency while supporting both insertion and deletion of data items;
(3) in steps (6) to (8), low-frequency keys are cached in a low-frequency key queue of limited length, and the low-frequency keys stored in the counting bloom filter are removed with a certain decay probability according to the first-in-first-out property of the queue; in this way, relatively infrequent low-frequency keys are filtered out by the decay probability, the memory occupied by the counting bloom filter is saved, and the probability of hash collisions between different data items is also reduced;
(4) in step (10), the keys in the high-frequency key set are differentiated according to their values: when the key values differ only slightly, the keys are each assigned 2 downstream instances, and when the key values differ greatly, the keys are assigned different numbers of downstream instances according to their values; the number of assigned downstream instances can be dynamically updated as the data stream changes, so that load balance among the downstream instances is achieved, and assigning fewer downstream instances to keys with relatively small values reduces unnecessary memory overhead.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the present invention is to count the data stream by using a Counting Bloom Filter (CBF for short), as shown in Fig. 1, and to classify data items according to frequency into high-frequency keys, potential high-frequency keys and low-frequency keys, thereby obtaining the distribution of the different data items; high-frequency keys are stored in a high-frequency key set and low-frequency keys are stored in a low-frequency key queue; downstream instances are assigned to high-frequency keys by a strategy of adding random suffixes and then grouping and aggregating, and downstream instances are assigned to non-high-frequency keys by a key-value grouping strategy, thereby balancing the load among different downstream instances and improving system performance.
As shown in Fig. 2, the present invention provides a distributed skewed stream processing method based on high-frequency key value counting, which comprises the following steps:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
specifically, the total number M of processed data items in the data stream is initially set to 0, and the total number M of processed data items is counted and updated as each data item in the data stream is processed in turn.
(2) judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is identical to the data item, and then entering step (10); otherwise, entering step (3);
Specifically, the format of the high-frequency key set S is, for example, {(value_1, key_1, key_2), (value_2, key_3), …}, wherein (value_1, key_1, key_2) is one record in the high-frequency key set S; each record has only one value but may have one or more keys. The maximum number of keys in the high-frequency key set S is C, where C is set according to a preset expected error e, with C = 1/e; key_1, key_2 and key_3 denote keys, and value_1 and value_2 denote values.
In this step, whether each data item e_i is located in the high-frequency key set S is judged by determining whether a key in the high-frequency key set S is identical to the data item e_i; if so, the data item was already recorded as a high-frequency key among the data items of the data stream processed before the data item e_i was acquired, and the value of the key in the high-frequency key set corresponding to the data item e_i is directly accumulated.
Preferably, the high-frequency key set S is implemented by a data structure based on the Stream Summary in the Space Saving algorithm; keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket, and different parent buckets in the high-frequency key set S are linked by a doubly linked list.
For example, suppose the currently received data items to be processed are {talk, namespace, first, title, wiki}, and the high-frequency key set S contains the records {(255, namespace), (84, first, case), (61, letter), (35, word)}. The data items talk, title and wiki are not in the high-frequency key set S, so step (3) is entered for them; the data items namespace and first are in the high-frequency key set S, so their corresponding values are incremented by 1, and the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)}.
(3) processing the data item e_i by using a Counting Bloom Filter (CBF) to obtain the frequency f_i of the data item e_i;
Specifically, the CBF is an array B = B[0], B[1], …, B[w-1] containing w counters. First, the CBF uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to the data item e_i; then the result of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, is computed; thereafter, 1 is added to the element of the array B at each of these positions, and the minimum value among all the resulting elements is taken as the frequency f_i of the data item e_i.
The number t of hash functions and the number w of counters of the CBF are preferably set according to n, the number of distinct data items in the data stream, and δ, the error rate of the CBF. The CBF supports incrementing its counters by 1 when a data item is inserted, and also supports decrementing its counters by 1 when a data item is deleted.
The frequency f_i of the data item e_i obtained in this step includes the occurrences of the data item e_i among the data items in the data stream processed before the data item e_i was obtained; the CBF is continuously updated as the data stream arrives. The CBF uses a plurality of hash functions and takes the minimum value as the statistical frequency, so as to reduce the probability of hash collisions between different data items and improve the statistical accuracy.
For the example in step (2), assume that before the data item talk is obtained, the frequency of the data item talk among the processed data items of the data stream recorded in the CBF is 24, and the frequency of the data item title recorded in the CBF is 29; after the processing of this step, the frequency returned for the data item talk is 25, the frequency returned for the data item title is 30, and the frequency returned for the data item wiki is 1.
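As an illustration of step (3), a counting bloom filter of the kind described above may be sketched in Python as follows; the use of the seeded built-in hash in place of t independent hash functions and the parameter values w = 1024, t = 4 are assumptions made only for brevity, not the preferred settings of the embodiment.

    class CountingBloomFilter:
        """Array B of w counters addressed by t hash functions (illustrative sketch)."""
        def __init__(self, w=1024, t=4):
            self.w, self.t = w, t
            self.B = [0] * w

        def _positions(self, item):
            # h_j(e_i) % w for j = 1..t; a seeded built-in hash stands in for the t hash functions
            return [hash((j, item)) % self.w for j in range(self.t)]

        def insert(self, item):
            # increment every addressed counter and return the minimum as the frequency estimate
            pos = self._positions(item)
            for p in pos:
                self.B[p] += 1
            return min(self.B[p] for p in pos)

        def estimate(self, item):
            return min(self.B[p] for p in self._positions(item))

        def delete(self, item):
            # decrement the addressed counters (used when a low-frequency key is decayed)
            for p in self._positions(item):
                self.B[p] -= 1

    cbf = CountingBloomFilter()
    for e in ["talk"] * 25 + ["title"] * 30 + ["wiki"]:
        f = cbf.insert(e)   # f is the running frequency estimate of e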
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, the data item is a high-frequency key and step (5) is entered; otherwise, the data item is a non-high-frequency key and step (6) is entered;
Specifically, the high-frequency key threshold ε is determined from the total number M of data items processed in the data stream before the data item e_i was obtained.
For the example in step (2), assume that before the data item talk is obtained, the total number M of processed data items in the data stream is 1203 and the expected error e is 0.05 (i.e. C = 1/e = 20), giving a high-frequency key threshold ε of 30; the data item talk, with frequency 25, is therefore a non-high-frequency key. Before the data item title is obtained, the total number M of processed data items in the data stream is 1206 and the high-frequency key threshold ε is 30, so the data item title, with frequency 30, is a high-frequency key; the data item wiki is not a high-frequency key.
(5) judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering step (10); otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering step (10);
for the example in step (2), the data item title is a high-frequency key, and the number of existing keys in the current high-frequency key set S does not exceed C, the data item title is directly inserted into the high-frequency key set S, so that the updated high-frequency key set S is { (256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title) }.
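To make the record format of the high-frequency key set S and the updates of steps (2) and (5) concrete, the following Python sketch keeps a key-to-value index together with value buckets; it is a simplified stand-in for the Stream-Summary-based structure (parent buckets and the doubly linked list are not modelled), and the class and method names are illustrative assumptions.

    class HighFrequencyKeySet:
        """Simplified stand-in for the Stream-Summary-based set S: records are (value, keys...)."""
        def __init__(self, C):
            self.C = C                      # maximum number of keys, C = 1/e
            self.value_of = {}              # key -> value
            self.bucket = {}                # value -> set of keys sharing that count

        def __contains__(self, key):
            return key in self.value_of

        def increment(self, key):           # step (2): key already in S
            v = self.value_of[key]
            self.bucket[v].discard(key)
            if not self.bucket[v]:
                del self.bucket[v]
            self._put(key, v + 1)

        def insert(self, key, f):           # step (5): key newly promoted with frequency f
            if len(self.value_of) < self.C:
                self._put(key, f)
            else:                           # replace the key with the smallest value
                f_min = min(self.bucket)
                victim = next(iter(self.bucket[f_min]))
                self.bucket[f_min].discard(victim)
                if not self.bucket[f_min]:
                    del self.bucket[f_min]
                del self.value_of[victim]
                self._put(key, f + f_min)

        def _put(self, key, v):
            self.value_of[key] = v
            self.bucket.setdefault(v, set()).add(key)

    S = HighFrequencyKeySet(C=20)
    for k, v in [("namespace", 255), ("first", 84), ("case", 84), ("letter", 61), ("word", 35)]:
        S.insert(k, v)
    S.increment("namespace")                # namespace: 255 -> 256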
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, the data item e_i is regarded as a potential high-frequency key and step (9) is entered; otherwise, step (7) is entered;
specifically, the value range of the low-frequency key threshold θ is [2,10], preferably 5;
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering step (8); otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering step (9);
the length of the low-frequency key queue Q is equal to C, namely the length is the same as the maximum key number of the high-frequency key set, and the queue length can be adjusted according to the size of the data stream in practical application;
(8) judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the CBF to obtain the updated frequency of the data item e_h, and then entering step (9), wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
Specifically, updating the data item e_h of the head node in the low-frequency key queue Q by using the CBF means decrementing by 1 the elements of the array B of the CBF that correspond to the data item e_h of the head node in the low-frequency key queue Q.
For the example in step (2), the frequency of the data item talk is 25, and the data item talk continues to be stored in the CBF. The data item wiki needs to be inserted into the low-frequency key queue Q, but at this time the low-frequency key queue Q already holds the data items {page, translation, ns, ns, ce, page, …, user} and has length 20, i.e. it is full; therefore, before the data item wiki is inserted, the data item page of the head node in the low-frequency key queue Q must be deleted: the frequency f of the data item page of the head node in the low-frequency key queue Q stored in the CBF is queried, the decay probability p = b^(-f) is calculated, a random number in [0, 1) is generated, and when the random number is smaller than the decay probability p, the array elements corresponding to the data item page in the CBF are decremented by 1.
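A minimal sketch of the eviction performed in steps (7) and (8) is given below; the Counter standing in for the CBF counters, the base b = 1.08 and the queue contents are illustrative assumptions.

    import random
    from collections import Counter, deque

    def evict_head_with_decay(Q, cbf_counts, b=1.08):
        """Steps (7)-(8): drop the head key and, with probability b**(-f_h), decrement its counters."""
        e_h = Q.popleft()                        # first-in-first-out eviction
        f_h = cbf_counts[e_h]                    # frequency of the head key as recorded in the CBF
        if b ** (-f_h) > random.random():        # decay probability p = b^(-f_h)
            cbf_counts[e_h] -= 1                 # reverse update of the CBF
        return e_h

    Q = deque(["page", "translation", "ns", "ns", "ce", "page", "user"])   # "..." items omitted
    cbf_counts = Counter(Q)                      # Counter stands in for the CBF counters
    evict_head_with_decay(Q, cbf_counts)         # queue assumed full: evict the head ("page")
    Q.append("wiki")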
The advantage of steps (3), (4) and (8) is that the frequency of every data item of the data stream is monitored with a memory-efficient CBF. On the one hand, a dynamic high-frequency key threshold is set that adapts to changes in the size of the data stream, so the high-frequency key set is identified more accurately and the identification precision of high-frequency keys is improved. On the other hand, a low-frequency key threshold is set, low-frequency keys are stored in a low-frequency key queue, and according to the first-in-first-out property of the queue and in combination with the decay probability, the corresponding array elements in the CBF are decremented by 1, updating the CBF in the reverse direction; in practical application data, most low-frequency keys can be filtered out in this way, which ensures that the CBF can monitor the data stream with a small amount of memory and reduces memory overhead.
(9) allocating a downstream instance to the data item e_i by using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
Specifically, the key-value grouping algorithm is implemented on the basis of a hash operation: the assigned downstream instance number is obtained by performing a hash operation on the data item, and the downstream instance corresponding to that number is assigned to the data item, that is, data items corresponding to the same key are assigned to the same downstream instance;
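For reference, a key-value grouping of this kind reduces to hashing the data item and taking the result modulo the number m of downstream instances; the sketch below uses a stable hash from hashlib, which is an assumption made for illustration rather than a specific grouping API.

    import hashlib

    def key_group(e_i, m):
        # same key -> same downstream instance
        digest = hashlib.md5(str(e_i).encode("utf-8")).hexdigest()
        return int(digest, 16) % m

    assert key_group("wiki", 6) == key_group("wiki", 6)   # deterministic per key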
(10) determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process;
specifically, the present step includes the following substeps:
(10-1) judging whether the difference f_max − f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) judging whether the data item e_i is the key with the maximum value in the high-frequency key set S; if so, assigning m downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-3);
Specifically, different keys may be assigned different numbers of downstream instances, and each key is preset with the same number of random suffixes as the number of downstream instances that may be assigned to it; the suffix whose sequence number corresponds to the random number generated by a random function is appended to the data item.
(10-3) judging whether the data item e_i is the key with the minimum value in the high-frequency key set S; if so, assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) assigning, to the key whose value lies between the maximum and the minimum in the high-frequency key set S, a number of downstream instances determined according to its value, randomly adding to the data item e_i one of preset random suffixes equal in number to the assigned downstream instances, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process.
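Substeps (10-1) to (10-5) may be sketched in Python as follows. The rule used for middle-valued keys in (10-4), which scales the number of instances with the key's value between 2 and m, is an assumption chosen to be consistent with the worked example below, since the exact count is not fixed here; key_group is the hypothetical hash helper from step (9).

    import math
    import random
    import hashlib

    def key_group(item, m):
        return int(hashlib.md5(str(item).encode("utf-8")).hexdigest(), 16) % m

    def assign_high_frequency(e_i, S, M, m):
        """S maps each high-frequency key to its value; returns the downstream instance for e_i."""
        f_max, f_min = max(S.values()), min(S.values())
        if f_max - f_min > M / m:                  # (10-1): key values differ greatly
            if S[e_i] == f_max:                    # (10-2): key with the largest value gets m instances
                k = m
            elif S[e_i] == f_min:                  # (10-3): key with the smallest value gets 2 instances
                k = 2
            else:                                  # (10-4): middle keys get a value-proportional count
                k = max(2, math.ceil(S[e_i] * m / f_max))   # assumed rule, matches the worked example
        else:                                      # (10-5): values are close, 2 instances for every key
            k = 2
        suffix = "_%d" % random.randint(1, k)      # pick one of the k preset random suffixes
        return key_group(str(e_i) + suffix, m)     # hash the suffixed item to a downstream instance

    S = {"namespace": 256, "first": 85, "case": 84, "letter": 61, "word": 35}
    print(assign_high_frequency("first", S, M=1205, m=6))   # "first" is spread over 2 instances

In this sketch the key with the largest value is spread over all m downstream instances, the key with the smallest value over 2, and the remaining keys over a number that grows with their value, which reflects the intent of differentiating keys by value.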
For the example in step (2), assume that the number of downstream instances is m = 6. After the data item namespace in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (84, first, case), (61, letter), (35, word)} and M = 1204; f_max − f_min = 221 is greater than M/m ≈ 200, so the values of the keys in the current high-frequency key set S are considered to differ greatly. namespace is the key with the largest value and can be assigned to 6 downstream instances: a random number in [1, 6] is generated, the suffix with the corresponding sequence number in {_1, _2, _3, _4, _5, _6} is appended to namespace, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item first in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)} and M = 1205; f_max − f_min = 221 is greater than M/m ≈ 200, and the data item first is a key whose value lies in the middle, so it can be assigned to 2 downstream instances: a random number in [1, 2] is generated, the suffix with the corresponding sequence number in {_1, _2} is appended to the data item first, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item title arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title)} and M = 1206; f_max − f_min = 226 is greater than M/m = 201, and the data item title is the key with the smallest value, so 2 downstream instances can be assigned: a random number in [1, 2] is generated, the suffix with the corresponding sequence number in {_1, _2} is appended to the data item, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item.
The advantage of this step is that the keys in the high-frequency key set are differentiated according to their values: keys with small values are assigned 2 downstream instances, and keys with large values are assigned different numbers of downstream instances according to their values; the number of assigned downstream instances can be dynamically updated as the data stream changes, so that load balance among the downstream instances is achieved, and assigning fewer downstream instances to keys with relatively small values reduces unnecessary memory overhead.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.