Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a distributed skewed stream processing method and system based on high-frequency key value counting, and aims to solve the technical problems, in existing skewed stream processing methods, of the large memory overhead of downstream instances under random (shuffle) grouping and of the load imbalance among downstream instances under key-value grouping.
To achieve the above object, according to one aspect of the present invention, there is provided a distributed skewed stream processing method based on high-frequency key value counting, comprising the steps of:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
(2) judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S and then entering step (10); otherwise, entering step (3);
(3) processing the data item e_i by using a counting bloom filter to obtain the frequency f_i of the data item e_i;
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering step (5); otherwise, entering step (6);
(5) judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering step (10); otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering step (10);
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering step (9); otherwise, entering step (7);
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering step (8); otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering step (9);
(8) judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter to obtain the updated frequency of the data item e_h, and then entering step (9), wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
(9) allocating a downstream instance to the data item e_i by using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process.
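For illustration only, the decision flow of steps (1) to (10) may be summarized by the following Python sketch. It is a simplified sketch rather than the claimed implementation: a plain Counter stands in for the counting bloom filter, a single hash of a suffixed item stands in for the full assignment of step (10), the thresholds ε and θ are supplied by the caller, and all function and variable names are illustrative assumptions.

    import random
    from collections import Counter, deque

    def process_item(e_i, state, epsilon, theta, b=1.05, m=6):
        # state holds S (high-frequency key set), Q (low-frequency key queue),
        # freq (Counter standing in for the CBF), C (max keys in S) and M (items processed)
        S, Q, freq = state["S"], state["Q"], state["freq"]
        if e_i in S:                                     # step (2): already a high-frequency key
            S[e_i] += 1
            instance = hash((e_i, "_1")) % m             # placeholder for the step (10) assignment
        else:
            freq[e_i] += 1                               # step (3): CBF stand-in returns the frequency
            f_i = freq[e_i]
            if f_i >= epsilon:                           # steps (4)-(5): promote to high-frequency key
                if len(S) >= state["C"]:
                    k_min = min(S, key=S.get)            # replace the key with the smallest value
                    S[e_i] = f_i + S.pop(k_min)
                else:
                    S[e_i] = f_i
                instance = hash((e_i, "_1")) % m         # placeholder for the step (10) assignment
            else:
                if f_i < theta:                          # steps (6)-(8): a low-frequency key
                    if len(Q) == state["C"]:             # queue full: evict the head with decay
                        e_h = Q.popleft()
                        if b ** (-freq[e_h]) > random.random():
                            freq[e_h] -= 1
                    Q.append(e_i)
                instance = hash(e_i) % m                 # step (9): key-value grouping
        state["M"] += 1                                  # one more processed data item
        return instance

    state = {"S": {}, "Q": deque(), "freq": Counter(), "C": 20, "M": 0}
    for item in ["talk", "namespace", "first", "title", "wiki"]:
        process_item(item, state, epsilon=30, theta=5)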
Preferably, the high-frequency key set S in step (2) is implemented by a data structure based on the Stream Summary in the Space Saving algorithm; keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket, and different parent buckets in the high-frequency key set S are linked by a doubly linked list.
Preferably, the counting bloom filter in step (3) is an array B = B[0], B[1], …, B[w-1] containing w counters; the counting bloom filter first uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to the data item e_i, then computes the result of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, thereafter adds 1 to the element of the array B at each of these positions, and takes the minimum value among all the resulting elements as the frequency f_i of the data item e_i;
Preferably, the high-frequency key threshold ε in step (4) is determined from the total number M of data items processed in the data stream before the data item e_i was obtained.
Preferably, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter in step (8) means decrementing by 1 the elements of the array B of the counting bloom filter that correspond to the data item e_h of the head node in the low-frequency key queue Q.
Preferably, step (10) comprises the sub-steps of:
(10-1) judging whether the difference f_max − f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) judging whether the data item e_i is the key with the maximum value in the high-frequency key set S; if so, assigning m downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-3);
(10-3) judging whether the data item e_i is the key with the minimum value in the high-frequency key set S; if so, assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) assigning, to the key whose value lies between the maximum and the minimum in the high-frequency key set S, a number of downstream instances determined according to its value, randomly adding to the data item e_i one of preset random suffixes equal in number to the assigned downstream instances, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process.
According to another aspect of the present invention, there is provided a distributed skewed stream processing system based on high-frequency key value counting, comprising the following modules:
a first module for acquiring a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
a second module for judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value of the key in the high-frequency key set S that is identical to the data item, and then entering the tenth module; otherwise, entering the third module;
a third module for processing the data item e_i by using a counting bloom filter to obtain the frequency f_i of the data item e_i;
a fourth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering the fifth module; otherwise, entering the sixth module;
a fifth module for judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering the tenth module; otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering the tenth module;
a sixth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering the ninth module; otherwise, entering the seventh module;
a seventh module for judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering the eighth module; otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering the ninth module;
an eighth module for judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the counting bloom filter to obtain the updated frequency of the data item e_h, and then entering the ninth module, wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering the ninth module;
a ninth module for allocating a downstream instance to the data item e_i by using a key-value grouping algorithm and adding 1 to the total number M of processed data items in the data stream;
a tenth module for determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, and adding 1 to the total number M of processed data items in the data stream.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) in step (2), the high-frequency key set is stored in a lightweight data structure, so the space efficiency is high and the set can easily be loaded into memory by a downstream instance; meanwhile, the high-frequency key set supports fast O(1) online query and update, so the keys assigned to downstream instances can be selected quickly;
(2) in step (3), each data item in the data stream is monitored by a counting bloom filter, which has high computational and memory efficiency while supporting both insertion and deletion of data items;
(3) in steps (6) to (8), low-frequency keys are cached in a low-frequency key queue of limited length, and the low-frequency keys stored in the counting bloom filter are removed with a certain decay probability according to the first-in-first-out property of the queue; in this way, relatively infrequent low-frequency keys are filtered out by the decay probability, the memory occupied by the counting bloom filter is saved, and the probability of hash collisions between different data items is also reduced;
(4) in step (10), the keys in the high-frequency key set are differentiated according to their values: when the key values differ only slightly, the keys are each assigned 2 downstream instances, and when the key values differ greatly, the keys are assigned different numbers of downstream instances according to their values; the number of assigned downstream instances can be dynamically updated as the data stream changes, so that load balance among the downstream instances is achieved, and assigning fewer downstream instances to keys with relatively small values reduces unnecessary memory overhead.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the present invention is to count the data stream by using a Counting Bloom Filter (CBF for short), as shown in Fig. 1, and to classify data items according to frequency into high-frequency keys, potential high-frequency keys and low-frequency keys, thereby obtaining the distribution of the different data items; high-frequency keys are stored in a high-frequency key set and low-frequency keys are stored in a low-frequency key queue; downstream instances are assigned to high-frequency keys by a strategy of adding random suffixes and then grouping and aggregating, and downstream instances are assigned to non-high-frequency keys by a key-value grouping strategy, thereby balancing the load among different downstream instances and improving system performance.
As shown in Fig. 2, the present invention provides a distributed skewed stream processing method based on high-frequency key value counting, which comprises the following steps:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
specifically, the total number M of processed data items in the data stream is initially set to 0, and the total number M of processed data items is counted and updated as each data item in the data stream is processed in turn.
(2) judging whether the data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is identical to the data item, and then entering step (10); otherwise, entering step (3);
Specifically, the format of the high-frequency key set S is, for example, {(value_1, key_1, key_2), (value_2, key_3), …}, wherein (value_1, key_1, key_2) is one record in the high-frequency key set S; each record has only one value but may have one or more keys. The maximum number of keys in the high-frequency key set S is C, where C is set according to a preset expected error e, with C = 1/e; key_1, key_2 and key_3 denote keys, and value_1 and value_2 denote values.
In this step, whether each data item e_i is located in the high-frequency key set S is judged by determining whether a key in the high-frequency key set S is identical to the data item e_i; if so, the data item was already recorded as a high-frequency key among the data items of the data stream processed before the data item e_i was acquired, and the value of the key in the high-frequency key set corresponding to the data item e_i is directly accumulated.
Preferably, the high-frequency key set S is implemented by a data structure based on the Stream Summary in the Space Saving algorithm; keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent bucket, and different parent buckets in the high-frequency key set S are linked by a doubly linked list.
For example, suppose the currently received data items to be processed are {talk, namespace, first, title, wiki}, and the high-frequency key set S contains the records {(255, namespace), (84, first, case), (61, letter), (35, word)}. The data items talk, title and wiki are not in the high-frequency key set S, so step (3) is entered for them; the data items namespace and first are in the high-frequency key set S, so their corresponding values are incremented by 1, and the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)}.
(3) processing the data item e_i by using a Counting Bloom Filter (CBF) to obtain the frequency f_i of the data item e_i;
Specifically, the CBF is an array B = B[0], B[1], …, B[w-1] containing w counters. First, the CBF uses t different hash functions h_1(), h_2(), …, h_t() to calculate the hash values h_1(e_i), h_2(e_i), …, h_t(e_i) corresponding to the data item e_i; then the result of each hash value modulo w, namely h_1(e_i)%w, h_2(e_i)%w, …, h_t(e_i)%w, is computed; thereafter, 1 is added to the element of the array B at each of these positions, and the minimum value among all the resulting elements is taken as the frequency f_i of the data item e_i.
The number t of hash functions and the number w of counters of the CBF are preferably set according to n, the number of distinct data items in the data stream, and δ, the error rate of the CBF. The CBF supports incrementing its counters by 1 when a data item is inserted, and also supports decrementing its counters by 1 when a data item is deleted.
The frequency f_i of the data item e_i obtained in this step includes the occurrences of the data item e_i among the data items in the data stream processed before the data item e_i was obtained; the CBF is continuously updated as the data stream arrives. The CBF uses a plurality of hash functions and takes the minimum value as the statistical frequency, so as to reduce the probability of hash collisions between different data items and improve the statistical accuracy.
For the example in step (2), assume that before the data item talk is obtained, the frequency of the data item talk among the processed data items of the data stream recorded in the CBF is 24, and the frequency of the data item title recorded in the CBF is 29; after the processing of this step, the frequency returned for the data item talk is 25, the frequency returned for the data item title is 30, and the frequency returned for the data item wiki is 1.
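As an illustration of step (3), a counting bloom filter of the kind described above may be sketched in Python as follows; the use of the seeded built-in hash in place of t independent hash functions and the parameter values w = 1024, t = 4 are assumptions made only for brevity, not the preferred settings of the embodiment.

    class CountingBloomFilter:
        """Array B of w counters addressed by t hash functions (illustrative sketch)."""
        def __init__(self, w=1024, t=4):
            self.w, self.t = w, t
            self.B = [0] * w

        def _positions(self, item):
            # h_j(e_i) % w for j = 1..t; a seeded built-in hash stands in for the t hash functions
            return [hash((j, item)) % self.w for j in range(self.t)]

        def insert(self, item):
            # increment every addressed counter and return the minimum as the frequency estimate
            pos = self._positions(item)
            for p in pos:
                self.B[p] += 1
            return min(self.B[p] for p in pos)

        def estimate(self, item):
            return min(self.B[p] for p in self._positions(item))

        def delete(self, item):
            # decrement the addressed counters (used when a low-frequency key is decayed)
            for p in self._positions(item):
                self.B[p] -= 1

    cbf = CountingBloomFilter()
    for e in ["talk"] * 25 + ["title"] * 30 + ["wiki"]:
        f = cbf.insert(e)   # f is the running frequency estimate of e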
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, the data item is a high-frequency key and step (5) is entered; otherwise, the data item is a non-high-frequency key and step (6) is entered;
Specifically, the high-frequency key threshold ε is determined from the total number M of data items processed in the data stream before the data item e_i was obtained.
For the example in step (2), assume that before the data item talk is obtained, the total number M of processed data items in the data stream is 1203 and the expected error e is 0.05 (i.e. C = 1/e = 20), giving a high-frequency key threshold ε of 30; the data item talk, with frequency 25, is therefore a non-high-frequency key. Before the data item title is obtained, the total number M of processed data items in the data stream is 1206 and the high-frequency key threshold ε is 30, so the data item title, with frequency 30, is a high-frequency key; the data item wiki is not a high-frequency key.
(5) judging whether the number of existing keys in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value among the keys in the high-frequency key set S, and then entering step (10); otherwise, inserting the data item e_i and the frequency f_i as a new key-value pair into the high-frequency key set S, and then entering step (10);
for the example in step (2), the data item title is a high-frequency key, and the number of existing keys in the current high-frequency key set S does not exceed C, the data item title is directly inserted into the high-frequency key set S, so that the updated high-frequency key set S is { (256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title) }.
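To make the record format of the high-frequency key set S and the updates of steps (2) and (5) concrete, the following Python sketch keeps a key-to-value index together with value buckets; it is a simplified stand-in for the Stream-Summary-based structure (parent buckets and the doubly linked list are not modelled), and the class and method names are illustrative assumptions.

    class HighFrequencyKeySet:
        """Simplified stand-in for the Stream-Summary-based set S: records are (value, keys...)."""
        def __init__(self, C):
            self.C = C                      # maximum number of keys, C = 1/e
            self.value_of = {}              # key -> value
            self.bucket = {}                # value -> set of keys sharing that count

        def __contains__(self, key):
            return key in self.value_of

        def increment(self, key):           # step (2): key already in S
            v = self.value_of[key]
            self.bucket[v].discard(key)
            if not self.bucket[v]:
                del self.bucket[v]
            self._put(key, v + 1)

        def insert(self, key, f):           # step (5): key newly promoted with frequency f
            if len(self.value_of) < self.C:
                self._put(key, f)
            else:                           # replace the key with the smallest value
                f_min = min(self.bucket)
                victim = next(iter(self.bucket[f_min]))
                self.bucket[f_min].discard(victim)
                if not self.bucket[f_min]:
                    del self.bucket[f_min]
                del self.value_of[victim]
                self._put(key, f + f_min)

        def _put(self, key, v):
            self.value_of[key] = v
            self.bucket.setdefault(v, set()).add(key)

    S = HighFrequencyKeySet(C=20)
    for k, v in [("namespace", 255), ("first", 84), ("case", 84), ("letter", 61), ("word", 35)]:
        S.insert(k, v)
    S.increment("namespace")                # namespace: 255 -> 256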
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, the data item e_i is regarded as a potential high-frequency key and step (9) is entered; otherwise, step (7) is entered;
specifically, the value range of the low-frequency key threshold θ is [2,10], preferably 5;
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node in the low-frequency key queue Q, then inserting the data item e_i into the low-frequency key queue Q, and then entering step (8); otherwise, directly inserting the data item e_i into the low-frequency key queue Q, and then entering step (9);
the length of the low-frequency key queue Q is equal to C, namely the length is the same as the maximum key number of the high-frequency key set, and the queue length can be adjusted according to the size of the data stream in practical application;
(8) judging whether the decay probability p = b^(-f_h) of the data item e_h of the head node in the low-frequency key queue Q is greater than the random number r; if so, updating the data item e_h of the head node in the low-frequency key queue Q by using the CBF to obtain the updated frequency of the data item e_h, and then entering step (9), wherein b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h of the head node in the low-frequency key queue Q, and r is a random number in the range [0, 1) generated by a random number generator; otherwise, entering step (9);
Specifically, updating the data item e_h of the head node in the low-frequency key queue Q by using the CBF means decrementing by 1 the elements of the array B of the CBF that correspond to the data item e_h of the head node in the low-frequency key queue Q.
For the example in step (2), the frequency of the data item talk is 25, and the data item talk continues to be stored in the CBF. The data item wiki needs to be inserted into the low-frequency key queue Q, but at this time the low-frequency key queue Q already holds the data items {page, translation, ns, ns, ce, page, …, user} and has length 20, i.e. it is full; therefore, before the data item wiki is inserted, the data item page of the head node in the low-frequency key queue Q must be deleted: the frequency f of the data item page of the head node in the low-frequency key queue Q stored in the CBF is queried, the decay probability p = b^(-f) is calculated, a random number in [0, 1) is generated, and when the random number is smaller than the decay probability p, the array elements corresponding to the data item page in the CBF are decremented by 1.
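A minimal sketch of the eviction performed in steps (7) and (8) is given below; the Counter standing in for the CBF counters, the base b = 1.08 and the queue contents are illustrative assumptions.

    import random
    from collections import Counter, deque

    def evict_head_with_decay(Q, cbf_counts, b=1.08):
        """Steps (7)-(8): drop the head key and, with probability b**(-f_h), decrement its counters."""
        e_h = Q.popleft()                        # first-in-first-out eviction
        f_h = cbf_counts[e_h]                    # frequency of the head key as recorded in the CBF
        if b ** (-f_h) > random.random():        # decay probability p = b^(-f_h)
            cbf_counts[e_h] -= 1                 # reverse update of the CBF
        return e_h

    Q = deque(["page", "translation", "ns", "ns", "ce", "page", "user"])   # "..." items omitted
    cbf_counts = Counter(Q)                      # Counter stands in for the CBF counters
    evict_head_with_decay(Q, cbf_counts)         # queue assumed full: evict the head ("page")
    Q.append("wiki")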
The advantage of steps (3), (4) and (8) is that the frequency of every data item of the data stream is monitored with a memory-efficient CBF. On the one hand, a dynamic high-frequency key threshold is set that adapts to changes in the size of the data stream, so the high-frequency key set is identified more accurately and the identification precision of high-frequency keys is improved. On the other hand, a low-frequency key threshold is set, low-frequency keys are stored in a low-frequency key queue, and according to the first-in-first-out property of the queue and in combination with the decay probability, the corresponding array elements in the CBF are decremented by 1, updating the CBF in the reverse direction; in practical application data, most low-frequency keys can be filtered out in this way, which ensures that the CBF can monitor the data stream with a small amount of memory and reduces memory overhead.
(9) allocating a downstream instance to the data item e_i by using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
Specifically, the key-value grouping algorithm is implemented on the basis of a hash operation: the assigned downstream instance number is obtained by performing a hash operation on the data item, and the downstream instance corresponding to that number is assigned to the data item, that is, data items corresponding to the same key are assigned to the same downstream instance;
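For reference, a key-value grouping of this kind reduces to hashing the data item and taking the result modulo the number m of downstream instances; the sketch below uses a stable hash from hashlib, which is an assumption made for illustration rather than a specific grouping API.

    import hashlib

    def key_group(e_i, m):
        # same key -> same downstream instance
        digest = hashlib.md5(str(e_i).encode("utf-8")).hexdigest()
        return int(digest, 16) % m

    assert key_group("wiki", 6) == key_group("wiki", 6)   # deterministic per key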
(10) determining, according to the value of the key in the high-frequency key set S that is identical to the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance according to the determined number of downstream instances, assigning the selected downstream instance to the data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process;
specifically, the present step includes the following substeps:
(10-1) judging whether the difference f_max − f_min between the maximum value and the minimum value of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) judging whether the data item e_i is the key with the maximum value in the high-frequency key set S; if so, assigning m downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of m preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-3);
Specifically, different keys may be assigned different numbers of downstream instances, and each key is preset with the same number of random suffixes as the number of downstream instances that may be assigned to it; the suffix whose sequence number corresponds to the random number generated by a random function is appended to the data item.
(10-3) judging whether the data item e_i is the key with the minimum value in the high-frequency key set S; if so, assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) assigning, to the key whose value lies between the maximum and the minimum in the high-frequency key set S, a number of downstream instances determined according to its value, randomly adding to the data item e_i one of preset random suffixes equal in number to the assigned downstream instances, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, randomly adding to the data item e_i one of 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the assigned downstream instance number, assigning the downstream instance corresponding to that number to the data item e_i, and ending the process.
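Substeps (10-1) to (10-5) may be sketched in Python as follows. The rule used for middle-valued keys in (10-4), which scales the number of instances with the key's value between 2 and m, is an assumption chosen to be consistent with the worked example below, since the exact count is not fixed here; key_group is the hypothetical hash helper from step (9).

    import math
    import random
    import hashlib

    def key_group(item, m):
        return int(hashlib.md5(str(item).encode("utf-8")).hexdigest(), 16) % m

    def assign_high_frequency(e_i, S, M, m):
        """S maps each high-frequency key to its value; returns the downstream instance for e_i."""
        f_max, f_min = max(S.values()), min(S.values())
        if f_max - f_min > M / m:                  # (10-1): key values differ greatly
            if S[e_i] == f_max:                    # (10-2): key with the largest value gets m instances
                k = m
            elif S[e_i] == f_min:                  # (10-3): key with the smallest value gets 2 instances
                k = 2
            else:                                  # (10-4): middle keys get a value-proportional count
                k = max(2, math.ceil(S[e_i] * m / f_max))   # assumed rule, matches the worked example
        else:                                      # (10-5): values are close, 2 instances for every key
            k = 2
        suffix = "_%d" % random.randint(1, k)      # pick one of the k preset random suffixes
        return key_group(str(e_i) + suffix, m)     # hash the suffixed item to a downstream instance

    S = {"namespace": 256, "first": 85, "case": 84, "letter": 61, "word": 35}
    print(assign_high_frequency("first", S, M=1205, m=6))   # "first" is spread over 2 instances

In this sketch the key with the largest value is spread over all m downstream instances, the key with the smallest value over 2, and the remaining keys over a number that grows with their value, which reflects the intent of differentiating keys by value.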
For the example in step (2), assume that the number of downstream instances is m = 6. After the data item namespace in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (84, first, case), (61, letter), (35, word)} and M = 1204; f_max − f_min = 221 is greater than M/m ≈ 200, so the values of the keys in the current high-frequency key set S are considered to differ greatly. namespace is the key with the largest value and can be assigned to 6 downstream instances: a random number in [1, 6] is generated, the suffix with the corresponding sequence number in {_1, _2, _3, _4, _5, _6} is appended to namespace, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item first in the data stream arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word)} and M = 1205; f_max − f_min = 221 is greater than M/m ≈ 200, and the data item first is a key whose value lies in the middle, so it can be assigned to 2 downstream instances: a random number in [1, 2] is generated, the suffix with the corresponding sequence number in {_1, _2} is appended to the data item first, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item title arrives, the high-frequency key set S is updated to {(256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title)} and M = 1206; f_max − f_min = 226 is greater than M/m = 201, and the data item title is the key with the smallest value, so 2 downstream instances can be assigned: a random number in [1, 2] is generated, the suffix with the corresponding sequence number in {_1, _2} is appended to the data item, a hash operation is performed on the suffixed data item to obtain the assignable downstream instance number, and the corresponding downstream instance is assigned to the data item.
The advantage of this step is that the keys in the high-frequency key set are differentiated according to their values: keys with small values are assigned 2 downstream instances, and keys with large values are assigned different numbers of downstream instances according to their values; the number of assigned downstream instances can be dynamically updated as the data stream changes, so that load balance among the downstream instances is achieved, and assigning fewer downstream instances to keys with relatively small values reduces unnecessary memory overhead.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.