CN112783644A - Distributed inclined stream processing method and system based on high-frequency key value counting - Google Patents

Info

Publication number
CN112783644A
Authority
CN
China
Prior art keywords
data item
key
frequency
value
downstream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011629933.9A
Other languages
Chinese (zh)
Other versions
CN112783644B (en)
Inventor
唐卓
郭耀莲
李肯立
刘园春
罗文明
宋莹洁
阳王东
曹嵘晖
肖国庆
刘楚波
周旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202011629933.9A
Publication of CN112783644A
Application granted
Publication of CN112783644B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention discloses a distributed skewed-stream processing method and system based on high-frequency key value counting. The basic idea is to count each data item in a data stream with a counting Bloom filter, classify the data items by frequency as high-frequency keys, potential high-frequency keys, or low-frequency keys, and thereby obtain the distribution of the different data items. High-frequency keys are assigned to downstream instances by appending a random suffix and then grouping and aggregating, while non-high-frequency keys are assigned to downstream instances by key-value grouping, so that the load is balanced across downstream instances and system performance improves. The invention addresses two technical problems of existing skewed-stream processing methods: the large memory overhead of downstream instances under random grouping, and the load imbalance among downstream instances under key-value grouping.

Description

Distributed inclined stream processing method and system based on high-frequency key value counting
Technical Field
The invention belongs to the field of big data processing, and in particular relates to a distributed skewed-stream processing method and system based on high-frequency key value counting.
Background
With the development of big data technologies, many applications based on data streams have appeared in fields such as social networks, financial data analysis, and e-commerce transactions. Compared with traditional data, a data stream is dynamic, fast, massive, and unbounded. Traditional distributed processing methods cannot predict or control the arrival time and scale of a data stream, and their processing performance drops sharply when the volume of arriving data is extremely large. To address these challenges, methods based on distributed stream processing systems such as S4, Storm, Spark Streaming, and Flink have been developed. In addition, the data distribution in practical data streams is highly skewed, that is, the frequencies of the different data items in the stream differ greatly.
A distributed stream processing method organizes the running nodes of a distributed stream processing system into an application processing flow by means of a logical topology. The topology is usually represented as a directed acyclic graph in which a vertex represents an operation of the application and an edge represents the flow of data between operations. The distributed stream processing system creates multiple downstream instances for each data operation, and the grouping strategy groups the data sent by the upstream operation and distributes it to the downstream instances, so the grouping strategy directly determines the amount and distribution of the data processed by each downstream instance. The basic grouping methods of existing distributed stream processing are random grouping and key-value grouping. Random grouping uses a polling mechanism to distribute each data item to the downstream instances with equal probability, which makes it easy to spread the system workload evenly. Key-value grouping uses a hash operation to assign data items with the same key to one downstream instance, so the state of each key is maintained by only one downstream instance.
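The two basic grouping methods can be illustrated with a minimal Python sketch. It is not taken from S4, Storm, Spark Streaming, or Flink; the function names `make_random_grouping` and `key_value_grouping` and the choice of MD5 as the hash are assumptions made only for this illustration.

```python
import hashlib
from itertools import count

def make_random_grouping(num_instances):
    """Random grouping by polling: successive items go to successive instances."""
    counter = count()
    def assign(_item):
        return next(counter) % num_instances
    return assign

def key_value_grouping(key, num_instances):
    """Key-value grouping: the same key always hashes to the same instance."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances

# A skewed stream: key "a" dominates.
stream = ["a", "a", "a", "a", "b", "c"]
assign_random = make_random_grouping(3)
random_targets = [assign_random(k) for k in stream]      # [0, 1, 2, 0, 1, 2]
keyed_targets = [key_value_grouping(k, 3) for k in stream]
```

Under polling, the four occurrences of "a" are spread over all three instances, so each instance must keep state for "a". Under key-value grouping, all four land on one instance, which is where the load imbalance described next comes from.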
However, existing distributed skewed-stream processing methods have the following technical problems. Under random grouping, every downstream instance must maintain the state of all keys, so the memory overhead of each downstream instance is extremely large. Under key-value grouping, the same key is always sent to the same downstream instance; because the values of different keys differ greatly, the load across downstream instances is unbalanced, and the imbalance becomes more serious as the skew of the data stream increases.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a distributed skewed-stream processing method and system based on high-frequency key value counting. It aims to solve two technical problems of existing skewed-stream processing methods: the large memory overhead of downstream instances under random grouping, and the load imbalance among downstream instances under key-value grouping.
To achieve the above object, according to one aspect of the present invention, there is provided a distributed skewed-stream processing method based on high-frequency key value counting, including the following steps:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
(2) judging whether the data item e_i is in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then entering step (10); otherwise, entering step (3);
(3) processing the data item e_i with a counting Bloom filter to obtain the frequency f_i of the data item e_i;
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering step (5); otherwise, entering step (6);
(5) judging whether the number of keys currently in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the minimum value in S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value of the keys in S, and then entering step (10); otherwise, inserting the data item e_i and its frequency f_i into S as a new key-value pair, and then entering step (10);
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering step (9); otherwise, entering step (7);
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node of Q, then inserting the data item e_i into Q, and then entering step (8); otherwise, directly inserting the data item e_i into Q, and then entering step (9);
(8) judging whether the attenuation probability p = b^{-f_h} of the head-node data item e_h deleted from the low-frequency key queue Q is greater than a random number r; if so, updating the data item e_h with the counting Bloom filter to obtain the updated frequency of e_h, and then entering step (9); otherwise, entering step (9); where b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h, and r is a random number in the range [0, 1) generated by a random number generator;
(9) assigning a downstream instance to the data item e_i using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) determining, according to the value of the key in the high-frequency key set S that is the same as the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance from among that number of downstream instances, assigning the selected downstream instance to the data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process.
Preferably, the high-frequency key set S in step (2) is implemented with the Stream Summary data structure of the Space-Saving algorithm: keys with the same count value in S are linked in the same linked list and point to the same parent bucket, and the different parent buckets in S are linked in a doubly linked list.
Preferably, the counting Bloom filter in step (3) is an array B = {B[0], B[1], …, B[w-1]} of w counters. The counting Bloom filter first uses t different hash functions h_1(), h_2(), ..., h_t() to compute the hash values h_1(e_i), h_2(e_i), ..., h_t(e_i) of the data item e_i, then takes each hash value modulo w to obtain the results h_1(e_i) % w, h_2(e_i) % w, ..., h_t(e_i) % w, then adds 1 to each element of array B whose index equals one of these results, and finally takes the minimum of those elements as the frequency f_i of the data item e_i.
Preferably, the high-frequency key threshold ε in step (4) is determined by the total number M of data items in the data stream processed before the data item e_i is acquired (the formula appears only as image BDA0002878296830000041 in the source).
Preferably, updating the head-node data item e_h of the low-frequency key queue Q with the counting Bloom filter in step (8) means subtracting 1 from the elements of array B in the CBF that correspond to the data item e_h.
Preferably, step (10) comprises the sub-steps of:
(10-1) judging, from the difference f_max - f_min between the maximum value and the minimum value of the keys in the high-frequency key set S and the number m of downstream instances, whether the key values differ greatly; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) judging whether the data item e_i is the key with the maximum value in the high-frequency key set S; if so, allocating m downstream instances to the key corresponding to the data item, appending to the data item e_i a suffix chosen at random from m preset random suffixes, performing a hash operation on the suffixed data item to obtain the number of the assigned downstream instance, assigning the downstream instance with that number to the data item e_i, and ending the process; otherwise, entering step (10-3);
(10-3) judging whether the data item e_i is the key with the minimum value in the high-frequency key set S; if so, allocating 2 downstream instances to the key corresponding to the data item, appending to the data item e_i a suffix chosen at random from 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the number of the assigned downstream instance, assigning the downstream instance with that number to the data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) allocating to a key whose value lies between the maximum and the minimum in the high-frequency key set S a number of downstream instances given by a formula (which appears only as image BDA0002878296830000042 in the source), appending to the data item e_i a suffix chosen at random from as many preset random suffixes as there are allocated downstream instances, performing a hash operation on the suffixed data item to obtain the number of the assigned downstream instance, assigning the downstream instance with that number to the data item e_i, and ending the process;
(10-5) allocating 2 downstream instances to the key corresponding to the data item, appending to the data item e_i a suffix chosen at random from 2 preset random suffixes, performing a hash operation on the suffixed data item to obtain the number of the assigned downstream instance, assigning the downstream instance with that number to the data item e_i, and ending the process.
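The suffix-and-hash assignment shared by sub-steps (10-2) to (10-5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the suffix format `key#j`, the MD5-based `stable_hash`, and the function name `assign_with_suffix` are all hypothetical, and how many suffixes a key receives is simply passed in as a parameter.

```python
import hashlib
import random

def stable_hash(text):
    """Deterministic hash of a string (MD5 is an assumption for the sketch)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def assign_with_suffix(key, num_suffixes, num_instances, rng=random):
    """Append one of `num_suffixes` preset suffixes at random, then hash
    the suffixed key to obtain a downstream instance number."""
    suffix = rng.randrange(num_suffixes)      # preset suffixes are 0..num_suffixes-1 here
    salted = f"{key}#{suffix}"                # hypothetical suffix format
    return stable_hash(salted) % num_instances

# A hot key given 4 suffixes over 8 downstream instances.
rng = random.Random(0)
targets = {assign_with_suffix("namespace", 4, 8, rng) for _ in range(200)}
```

With k suffixes, a key reaches at most k downstream instances; two suffixes may hash to the same instance, so the number actually reached can be smaller. Because the partial results for one key are spread over several instances, they still have to be aggregated afterwards, which is the "grouping and aggregating" part of the strategy.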
According to another aspect of the present invention, there is provided a distributed skewed-stream processing system based on high-frequency key value counting, comprising the following modules:
a first module for obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
a second module for judging whether the data item e_i is in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then passing to the tenth module; otherwise, passing to the third module;
a third module for processing the data item e_i with a counting Bloom filter to obtain the frequency f_i of the data item e_i;
a fourth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, passing to the fifth module; otherwise, passing to the sixth module;
a fifth module for judging whether the number of keys currently in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the minimum value in S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value of the keys in S, and then passing to the tenth module; otherwise, inserting the data item e_i and its frequency f_i into S as a new key-value pair, and then passing to the tenth module;
a sixth module for judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, passing to the ninth module; otherwise, passing to the seventh module;
a seventh module for judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node of Q, then inserting the data item e_i into Q, and then passing to the eighth module; otherwise, directly inserting the data item e_i into Q, and then passing to the ninth module;
an eighth module for judging whether the attenuation probability p = b^{-f_h} of the head-node data item e_h deleted from the low-frequency key queue Q is greater than a random number r; if so, updating the data item e_h with the counting Bloom filter to obtain the updated frequency of e_h, and then passing to the ninth module; otherwise, passing to the ninth module; where b is a preset exponential base with b > 1 and b ≈ 1, f_h is the frequency of the data item e_h, and r is a random number in the range [0, 1) generated by a random number generator;
a ninth module for assigning a downstream instance to the data item e_i using a key-value grouping algorithm and adding 1 to the total number M of processed data items in the data stream;
a tenth module for determining, according to the value of the key in the high-frequency key set S that is the same as the data item e_i, the number of downstream instances to which the key can be assigned, selecting one downstream instance from among that number of downstream instances, assigning the selected downstream instance to the data item e_i, and adding 1 to the total number M of processed data items in the data stream.
In general, compared with the prior art, the above technical solution of the present invention achieves the following beneficial effects:
(1) in step (2), the high-frequency key set is stored in a lightweight data structure with high space efficiency, so that a downstream instance can easily load the set into memory; the high-frequency key set also supports fast O(1) online query and update, so that the keys to be distributed to downstream instances can be selected quickly;
(2) in step (3), each data item in the data stream is monitored with a counting Bloom filter, which is efficient in both computation and memory and supports insertion as well as deletion of data items;
(3) in steps (7) and (8), low-frequency keys are cached in a low-frequency key queue of limited length, and, following the first-in first-out order of the queue, low-frequency keys stored in the counting Bloom filter are removed with a certain attenuation probability; the attenuation probability filters out keys with relatively small frequency, which saves memory in the counting Bloom filter and also reduces the probability of hash collisions between different data items;
(4) in step (10), the keys in the high-frequency key set are distinguished by value: when the key values differ only slightly, keys are assigned 2 downstream instances, and when the key values differ greatly, keys are assigned different numbers of downstream instances according to their values; the number of assigned downstream instances is updated dynamically as the data stream changes, which balances the load across all downstream instances, and assigning fewer downstream instances to keys with relatively small values reduces unnecessary memory overhead.
Drawings
FIG. 1 is a schematic view of the process of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the present invention, as shown in FIG. 1, is to count a data stream with a Counting Bloom Filter (CBF for short), classify data items by frequency as high-frequency keys, potential high-frequency keys, or low-frequency keys, and thereby obtain the distribution of the different data items. High-frequency keys are stored in a high-frequency key set and low-frequency keys are stored in a low-frequency key queue. Downstream instances are assigned to high-frequency keys by appending a random suffix and then grouping and aggregating, and to non-high-frequency keys by key-value grouping, which balances the load across downstream instances and improves system performance.
As shown in FIG. 2, the present invention provides a distributed skewed-stream processing method based on high-frequency key value counting, which includes the following steps:
(1) obtaining a data item e_i to be processed in a data stream and the total number M of data items in the data stream processed before the data item e_i;
Specifically, the total number M of processed data items in the data stream is initially set to 0 and is updated as each data item in the data stream is processed in turn.
(2) judging whether the data item e_i is in the high-frequency key set S; if so, adding 1 to the value (value) corresponding to the key (key) in the high-frequency key set S that is the same as the data item, and then entering step (10); otherwise, entering step (3);
Specifically, the format of the high-frequency key set S is, for example, {(value_1, key_1, key_2), (value_2, key_3), …}, where (value_1, key_1, key_2) is one record in S; each record has exactly one value but may have one or more keys; key_1, key_2, and key_3 denote keys, and value_1 and value_2 denote values. The maximum number of keys in S is C, which is set according to a preset expected error e as C = 1/e.
In this step, whether a data item e_i is in the high-frequency key set S is judged by checking whether S contains a key identical to e_i. If it does, the data item was already recorded as a high-frequency key among the data items processed before e_i was acquired, so the value of the corresponding key in the high-frequency key set is simply incremented.
Preferably, the high-frequency key set S is implemented with the Stream Summary data structure of the Space-Saving algorithm: keys with the same count value in S are linked in the same linked list and point to the same parent bucket (Bucket), and the different parent buckets in S are linked in a doubly linked list.
For example, suppose the data items currently to be processed are {talk, namespace, first, title, wiki} and the high-frequency key set S holds the records {(255, namespace), (84, first, case), (61, lett), (35, word)}. The data items talk, title, and wiki are not in S, so they proceed to step (3). The data items namespace and first are in S, so their values are each increased by 1 and S is updated to {(256, namespace), (85, first), (84, case), (61, lett), (35, word)}.
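The lookup-and-increment of step (2) can be reproduced on this example with a short Python sketch. For brevity it stores S as a plain dictionary from key to value rather than the bucketed Stream Summary structure, which is an assumption of the sketch; the function name `step2_lookup` is also hypothetical.

```python
def step2_lookup(high_freq, item):
    """Return True and add 1 to the key's value if `item` is already in S."""
    if item in high_freq:
        high_freq[item] += 1
        return True
    return False

# S from the example; the record (84, first, case) becomes two dictionary entries.
S = {"namespace": 255, "first": 84, "case": 84, "lett": 61, "word": 35}
batch = ["talk", "namespace", "first", "title", "wiki"]

# Items not found in S continue to step (3).
to_step3 = [e for e in batch if not step2_lookup(S, e)]
```

After the batch, `to_step3` holds talk, title, and wiki, and the values of namespace and first are 256 and 85, matching the updated set in the example.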
(3) processing the data item e_i with a Counting Bloom Filter (CBF) to obtain the frequency f_i of the data item e_i;
Specifically, the CBF is an array B = {B[0], B[1], …, B[w-1]} of w counters. The CBF first uses t different hash functions h_1(), h_2(), ..., h_t() to compute the hash values h_1(e_i), h_2(e_i), ..., h_t(e_i) of the data item e_i, then takes each hash value modulo w to obtain the results h_1(e_i) % w, h_2(e_i) % w, ..., h_t(e_i) % w, then adds 1 to each element of array B whose index equals one of these results, and finally takes the minimum of those elements as the frequency f_i of the data item e_i.
Here, the number t of hash functions of the CBF is preferably set by a formula (which appears only as image BDA0002878296830000091 in the source), where n is the number of distinct data items in the data stream; the number w of counters is preferably set by a formula (which appears only as image BDA0002878296830000092 in the source), where δ is the error rate of the CBF. The CBF supports adding 1 to the counters when a data item is inserted and also supports subtracting 1 from the counters when a data item is deleted.
The frequency f_i of the data item e_i obtained in this step includes the occurrences of e_i among the data items processed before e_i was acquired, because the CBF is updated continuously as the data stream arrives. The CBF uses several hash functions and takes the minimum counter as the counted frequency, which reduces the effect of hash collisions between different data items and improves counting accuracy.
For the example in step (2), assume that before the data item talk is acquired, the CBF records a frequency of 24 for talk among the processed data items and a frequency of 29 for title. After the processing of this step, the frequency returned for talk is 25, the frequency returned for title is 30, and the frequency returned for wiki is 1;
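A counting Bloom filter of the kind described in step (3) can be sketched in Python as follows. The sketch makes several assumptions that are not in the source: the t hash functions are derived by salting SHA-256 with the function number, an index hit by two hash functions is counted once per insertion, and the sizes w = 1024 and t = 4 are arbitrary demonstration values rather than the values given by the formulas for t and w.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, w, t):
        self.w = w              # number of counters
        self.t = t              # number of hash functions
        self.B = [0] * w        # counter array B[0..w-1]

    def _indexes(self, item):
        """Indexes h_j(item) % w for j = 1..t (salted SHA-256 is an assumption)."""
        out = []
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.w)
        return out

    def estimate(self, item):
        """Frequency estimate: the minimum of the item's counters."""
        return min(self.B[i] for i in self._indexes(item))

    def add(self, item):
        """Add 1 to each of the item's counters and return the new estimate."""
        for i in set(self._indexes(item)):
            self.B[i] += 1
        return self.estimate(item)

    def remove(self, item):
        """Subtract 1 from each of the item's counters (never below zero)."""
        for i in set(self._indexes(item)):
            if self.B[i] > 0:
                self.B[i] -= 1

cbf = CountingBloomFilter(w=1024, t=4)
for _ in range(24):
    cbf.add("talk")             # talk already seen 24 times
f_talk = cbf.add("talk")        # the 25th occurrence returns 25
```

Because only talk has been inserted here, no collisions inflate its counters and the returned frequency is exact. In a real stream the estimate can overcount when other items share all of an item's counters, which is why the minimum over several hash functions is used.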
(4) judging whether the frequency f_i of the data item e_i is greater than or equal to the high-frequency key threshold ε; if so, the data item is a high-frequency key and the process enters step (5); otherwise, the data item is a non-high-frequency key and the process enters step (6);
Specifically, the high-frequency key threshold ε is determined by the total number M of data items in the data stream processed before the data item e_i is acquired (the formula appears only as image BDA0002878296830000093 in the source).
For the example in step (2), assume that before the data item talk is acquired, the total number M of processed data items in the data stream is 1203 and the expected error e is 0.05 (that is, C = 1/e = 20). The high-frequency key threshold ε is then obtained from the threshold formula (its evaluation appears only as image BDA0002878296830000094 in the source), and the data item talk is a non-high-frequency key. Before the data item title is acquired, the total number M of processed data items in the data stream is 1206 and the high-frequency key threshold ε is 30, so the data item title, whose frequency is 30, is a high-frequency key. The data item wiki is not a high-frequency key;
(5) judging whether the number of keys currently in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the minimum value in S with the data item e_i and setting the value of that key to f_i + f_min, where f_min is the minimum value of the keys in S, and then entering step (10); otherwise, inserting the data item e_i and its frequency f_i into S as a new key-value pair, and then entering step (10);
For the example in step (2), the data item title is a high-frequency key and the number of keys currently in the high-frequency key set S is less than C, so title is inserted directly into S, and the updated high-frequency key set S is {(256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title)}.
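The insert-or-replace rule of step (5) can be sketched in Python as follows. As an assumption of the sketch, S is again a plain dictionary instead of the Stream Summary structure, so finding the minimum is a linear scan here rather than a constant-time operation; the function name `step5_insert` and the small demonstration values are hypothetical.

```python
def step5_insert(high_freq, item, freq, capacity):
    """Insert `item` with frequency `freq`; when S already holds `capacity`
    keys, evict the key with the minimum value and give the new key the
    value freq + f_min."""
    if len(high_freq) >= capacity:
        min_key = min(high_freq, key=high_freq.get)   # key with the minimum value
        f_min = high_freq.pop(min_key)
        high_freq[item] = freq + f_min
    else:
        high_freq[item] = freq
    return high_freq

# Not full (C = 20): title is inserted with its own frequency, as in the example.
S = {"namespace": 256, "first": 85, "case": 84, "word": 35}
step5_insert(S, "title", 30, capacity=20)

# Full (C = 2): the minimum key "b" is replaced and "c" gets 5 + 3 = 8.
full = {"a": 10, "b": 3}
step5_insert(full, "c", 5, capacity=2)
```

Giving the newcomer the value f_i + f_min follows the step as written; it keeps the evicted key's count as part of the new key's value, which is the convention of the Space-Saving algorithm.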
(6) judging whether the frequency f_i of the data item e_i is greater than or equal to the low-frequency key threshold θ; if so, the data item e_i is considered a potential high-frequency key and the process enters step (9); otherwise, the process enters step (7);
Specifically, the low-frequency key threshold θ takes a value in the range [2, 10], preferably 5;
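The two threshold tests of steps (4) and (6) amount to a three-way classification, sketched below in Python. The function name `classify` and the string labels are hypothetical; the high-frequency key threshold ε is passed in as a parameter because its formula is not reproduced in the text, and θ defaults to the preferred value 5.

```python
def classify(freq, epsilon, theta=5):
    """Classify a data item by its counted frequency."""
    if freq >= epsilon:
        return "high"        # step (4): goes on to step (5)
    if freq >= theta:
        return "potential"   # step (6): stays in the CBF, goes on to step (9)
    return "low"             # step (6): goes on to the low-frequency queue, step (7)

# Frequencies from the running example with epsilon = 30.
labels = {name: classify(f, epsilon=30)
          for name, f in {"talk": 25, "title": 30, "wiki": 1}.items()}
```

With these inputs, title is classified as a high-frequency key, talk as a potential high-frequency key, and wiki as a low-frequency key, in line with the example.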
(7) judging whether the low-frequency key queue Q is full; if so, first deleting the data item e_h of the head node of Q, then inserting the data item e_i into Q, and then entering step (8); otherwise, directly inserting the data item e_i into Q, and then entering step (9);
The length of the low-frequency key queue Q is equal to C, that is, the same as the maximum number of keys of the high-frequency key set; in practical applications the queue length can be adjusted according to the size of the data stream;
(8) judging the data item e of the head node in the low frequency key queue QhProbability of attenuation of
Figure BDA0002878296830000101
If the number is larger than the random number r, the data item e of the head node in the low-frequency key queue Q is subjected to CBFhUpdating to obtain the data item ehUpdated frequencyCounting, and then entering the step (9), wherein b is a preset exponential base number, b>1 and b ≈ 1, fhFor data item e of head node in low frequency key queue QhR is a random number generated by the random number generator in the range of [0, 1); otherwise, entering the step (9);
Specifically, updating the head-node data item e_h of the low-frequency key queue Q with the CBF means decrementing by 1 each element of the CBF array B that corresponds to e_h.
For the example in step (2), the frequency of the data item talk is 25, so talk remains stored in the CBF. The data item wiki needs to be inserted into the low-frequency key queue Q, but Q already holds the data items { page, translation, ns, ns, ce, page, …, user }; its length is 20, so the queue is full. The head-node data item page must therefore be deleted from Q before wiki is inserted. The frequency f of page stored in the CBF is queried, the decay probability p = b^{-f} is calculated, and a random number in [0, 1) is generated; when the random number is smaller than the decay probability p, the array elements corresponding to page in the CBF are each decremented by 1.
The advantage of steps (3), (4) and (8) is that the frequency of each data item in the data stream is monitored with a memory-efficient CBF. On the one hand, a dynamic high-frequency key threshold is set; it adapts to changes in the size of the data stream, so the high-frequency key set is identified more accurately. On the other hand, a low-frequency key threshold is set and low-frequency keys are stored in a low-frequency key queue; following the first-in, first-out order of the queue and the decay probability, the corresponding array elements in the CBF are decremented by 1, which updates the CBF in reverse. In practical data this filters out most low-frequency keys, so the CBF can monitor the data stream with a small memory footprint and memory overhead is reduced.
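Steps (7) and (8), together with the counting bloom filter they rely on, can be sketched as below. This is an illustrative sketch under stated assumptions: the class and function names are hypothetical, the t hash functions are derived by salting SHA-256 with an index (the patent does not specify the hash functions), and b = 1.08 is an arbitrary value satisfying b > 1 and b ≈ 1.

```python
import hashlib
import random
from collections import deque


class CountingBloomFilter:
    """Minimal counting bloom filter: an array B of w counters, t hashes."""

    def __init__(self, w=1024, t=3):
        self.w, self.t = w, t
        self.B = [0] * w

    def _slots(self, item):
        # Assumed construction: salt one hash function with the index j.
        return [
            int(hashlib.sha256(f"{j}:{item}".encode()).hexdigest(), 16) % self.w
            for j in range(self.t)
        ]

    def add(self, item):
        """Increment the item's counters and return its estimated frequency."""
        slots = self._slots(item)
        for s in set(slots):          # touch each distinct counter once
            self.B[s] += 1
        return min(self.B[s] for s in slots)

    def estimate(self, item):
        return min(self.B[s] for s in self._slots(item))

    def decrement(self, item):
        """Reverse update: subtract 1 from each of the item's counters."""
        for s in set(self._slots(item)):
            if self.B[s] > 0:
                self.B[s] -= 1


def enqueue_low_freq(Q, cbf, item, max_len, b=1.08, rng=random):
    """Steps (7)-(8): append `item`; on overflow evict the head and decay it."""
    if len(Q) >= max_len:
        head = Q.popleft()            # first in, first out
        f_h = cbf.estimate(head)
        p = b ** (-f_h)               # decay probability p = b^{-f_h}
        if p > rng.random():          # r is drawn from [0, 1)
            cbf.decrement(head)
    Q.append(item)


cbf = CountingBloomFilter()
Q = deque()
for word in ["page", "translation", "ns", "ns", "wiki"]:
    if cbf.add(word) < 5:             # below the low-frequency threshold θ = 5
        enqueue_low_freq(Q, cbf, word, max_len=3)
```

Because p shrinks as f_h grows, keys that have accumulated a higher count are less likely to be decremented on eviction, while keys seen only once or twice are decremented almost every time.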
(9) Use the key-value grouping algorithm to assign a downstream instance to data item e_i, add 1 to the total number M of processed data items in the data stream, and end the process;
Specifically, the key-value grouping algorithm is based on a hash operation: hashing the data item yields a downstream instance number, and the downstream instance with that number is assigned to the data item. Data items with the same key are therefore always assigned to the same downstream instance;
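A hash-based key grouping of this kind can be sketched as follows. The function name `key_grouping` and the choice of MD5 are assumptions made for illustration; the patent only states that a hash operation is used.

```python
import hashlib


def key_grouping(key, num_instances):
    """Map a key to a downstream instance number in [0, num_instances)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


# Every occurrence of the same key lands on the same downstream instance.
instance_a = key_grouping("talk", 6)
instance_b = key_grouping("talk", 6)
```

A digest from `hashlib` is used instead of Python's built-in `hash()` because the built-in string hash is randomized per process by default, which would break the guarantee that all upstream instances route a given key to the same downstream instance.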
(10) Determine, according to the value corresponding to the key in the high-frequency key set S that is the same as data item e_i, the number of downstream instances to which that key can be assigned; select one downstream instance from among them, assign the selected downstream instance to data item e_i, add 1 to the total number M of processed data items in the data stream, and end the process;
specifically, the present step includes the following substeps:
(10-1) Determine whether the difference f_max − f_min between the maximum and minimum values of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances. If so, go to step (10-2); otherwise, go to step (10-5);
(10-2) Determine whether data item e_i is the key with the maximum value in the high-frequency key set S. If so, assign m downstream instances to the key corresponding to the data item, append to e_i one suffix chosen at random from m preset random suffixes, hash the suffixed data item to obtain a downstream instance number, assign the downstream instance with that number to e_i, and end the process; otherwise, go to step (10-3);
Specifically, different keys may be assigned different numbers of downstream instances, and each key is preset with as many random suffixes as the number of downstream instances it may be assigned; the suffix whose serial number matches the random number produced by a random function is appended to the data item.
(10-3) Determine whether data item e_i is the key with the minimum value in the high-frequency key set S. If so, assign 2 downstream instances to the key corresponding to the data item, append to e_i one suffix chosen at random from 2 preset random suffixes, hash the suffixed data item to obtain a downstream instance number, assign the downstream instance with that number to e_i, and end the process; otherwise, go to step (10-4);
(10-4) For a key whose value lies between the minimum and the maximum in the high-frequency key set S, assign the number of downstream instances given by the formula in Figure BDA0002878296830000121 (formula image not reproduced here), append to data item e_i one suffix chosen at random from preset random suffixes equal in number to those downstream instances, hash the suffixed data item to obtain a downstream instance number, assign the downstream instance with that number to e_i, and end the process;
(10-5) Assign 2 downstream instances to the key corresponding to the data item, append to data item e_i one suffix chosen at random from 2 preset random suffixes, hash the suffixed data item to obtain a downstream instance number, assign the downstream instance with that number to e_i, and end the process.
For the example in step (2), assume that the number of downstream instances is m = 6. After the data item namespace arrives, the high-frequency key set S is updated to { (256, namespace), (84, first, case), (61, letter), (35, word) }. At this point M = 1204 and f_max − f_min = 221 is greater than M/m ≈ 200, so the values of the keys in the current high-frequency key set S are considered to differ widely. namespace is the key with the largest value and can be assigned 6 downstream instances: a random number in [1, 6] is generated, the suffix with that serial number from { _1, _2, _3, _4, _5, _6 } is appended to namespace, the suffixed data item is hashed to obtain a downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item first arrives, S is updated to { (256, namespace), (85, first), (84, case), (61, letter), (35, word) }. At this point M = 1205 and f_max − f_min = 221 is greater than M/m ≈ 200; first is a key with an intermediate value and can be assigned the number of downstream instances given by the formula in Figure BDA0002878296830000131 (formula image not reproduced here). A random number in [1, 2] is generated, the suffix with that serial number from { _1, _2 } is appended to first, the suffixed data item is hashed to obtain a downstream instance number, and the corresponding downstream instance is assigned to the data item. When the data item title arrives, S is updated to { (256, namespace), (85, first), (84, case), (61, letter), (35, word), (30, title) }. At this point M = 1206 and f_max − f_min = 226 is greater than M/m = 201; title is the key with the smallest value and can be assigned 2 downstream instances. A random number in [1, 2] is generated, the suffix with that serial number from { _1, _2 } is appended to title, the suffixed data item is hashed to obtain a downstream instance number, and the corresponding downstream instance is assigned to the data item.
The advantage of this step is that keys in the high-frequency key set are distinguished by value: keys with small values are assigned 2 downstream instances, while keys with larger values are assigned a number of downstream instances that depends on their value. The number of assigned downstream instances is updated dynamically as the data stream changes, which balances the load across downstream instances; because keys with small values receive fewer downstream instances, unnecessary memory overhead is also reduced.
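Sub-steps (10-1) to (10-5) can be sketched as follows. All function names are hypothetical, and MD5 again stands in for the unspecified hash operation. The formula for intermediate keys in step (10-4) is not recoverable from this text, so the sketch takes it as a caller-supplied function; `placeholder_mid_count` simply returns 2, which matches the suffix range the example reports for the key first and is not the patent's formula.

```python
import hashlib
import random


def _instance_for(text, m):
    """Hash a (suffixed) key to a downstream instance number in [0, m)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def choose_instance_count(S, key, M, m, mid_count):
    """Return how many downstream instances `key` may be spread over."""
    f_max, f_min = max(S.values()), min(S.values())
    if f_max - f_min <= M / m:        # (10-1) not satisfied
        return 2                      # (10-5)
    if S[key] == f_max:               # (10-2) key with the maximum value
        return m
    if S[key] == f_min:               # (10-3) key with the minimum value
        return 2
    return mid_count(S[key], f_max, m)   # (10-4) formula supplied by caller


def route_hot_key(S, key, M, m, mid_count, rng=random):
    """Append a random suffix _1.._d to the key and hash the result."""
    d = choose_instance_count(S, key, M, m, mid_count)
    suffix = rng.randint(1, d)        # random serial number in [1, d]
    return _instance_for(f"{key}_{suffix}", m)


def placeholder_mid_count(f, f_max, m):
    # Stand-in only: the real formula is an image in the original document.
    return 2


S = {"namespace": 256, "first": 85, "case": 84, "letter": 61, "word": 35}
count_for_namespace = choose_instance_count(S, "namespace", 1205, 6, placeholder_mid_count)
instance = route_hot_key(S, "namespace", 1205, 6, placeholder_mid_count)
```

Note that distinct suffixes are not guaranteed to hash to distinct downstream instances, so a key given d suffixes is spread over at most d instances.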
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A distributed inclined stream processing method based on high-frequency key value counting is characterized by comprising the following steps:
(1) obtaining a data item e_i to be processed in a data stream, and the total number M of data items processed in the data stream before data item e_i;
(2) determining whether data item e_i is located in the high-frequency key set S; if so, adding 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then entering step (10); otherwise, entering step (3);
(3) processing data item e_i with a counting bloom filter to obtain the frequency f_i of data item e_i;
(4) determining whether the frequency f_i of data item e_i is greater than or equal to the high-frequency key threshold ε; if so, entering step (5); otherwise, entering step (6);
(5) determining whether the number of keys already in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, replacing the key with the smallest value in the high-frequency key set S with data item e_i and setting the value of the key to f_i + f_min, where f_min is the minimum value of the keys in the high-frequency key set S, and then entering step (10); otherwise, inserting data item e_i and its frequency f_i into the high-frequency key set S as a new key-value pair, and then entering step (10);
(6) determining whether the frequency f_i of data item e_i is greater than or equal to the low-frequency key threshold θ; if so, entering step (9); otherwise, entering step (7);
(7) determining whether the low-frequency key queue Q is full; if so, first deleting the head-node data item e_h from the low-frequency key queue Q, then inserting data item e_i into the low-frequency key queue Q, and then entering step (8); otherwise, inserting data item e_i directly into the low-frequency key queue Q, and then entering step (9);
(8) determining whether the decay probability p = b^{-f_h} of the head-node data item e_h in the low-frequency key queue Q is greater than a random number r; if so, updating the head-node data item e_h in the low-frequency key queue Q with the counting bloom filter to obtain the updated frequency of data item e_h, and then entering step (9), where b is a preset exponential base, b > 1 and b ≈ 1, f_h is the frequency of the head-node data item e_h in the low-frequency key queue Q, and r is a random number in [0, 1) produced by a random number generator; otherwise, entering step (9);
(9) assigning a downstream instance to data item e_i using a key-value grouping algorithm, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) determining, according to the value corresponding to the key in the high-frequency key set S that is the same as data item e_i, the number of downstream instances to which that key can be assigned, selecting one downstream instance from among them, assigning the selected downstream instance to data item e_i, adding 1 to the total number M of processed data items in the data stream, and ending the process.
2. The distributed inclined stream processing method based on high-frequency key value counting according to claim 1, characterized in that the high-frequency key set S in step (2) is implemented by a data structure based on stream summary in a space-saving algorithm, keys with the same count value in the high-frequency key set S are linked in the same linked list and are directed to the same parent bucket, and different parent buckets in the high-frequency key set S are linked by a bi-directional linked list.
3. The distributed inclined stream processing method based on high-frequency key value counting according to claim 1, wherein the counting bloom filter in step (3) is an array B = { B[0], B[1], ..., B[w-1] } containing w counters; the counting bloom filter first uses t different hash functions h_1(), h_2(), ..., h_t() to compute the hash values h_1(e_i), h_2(e_i), ..., h_t(e_i) of data item e_i, then takes each hash value modulo w to obtain the results h_1(e_i) % w, h_2(e_i) % w, ..., h_t(e_i) % w, then adds 1 to each element of array B whose index equals one of these results, and takes the minimum of all these elements as the frequency f_i of data item e_i.
4. The distributed inclined stream processing method based on high-frequency key value counting according to claim 1, wherein the high-frequency key threshold ε in step (4) is determined by the total number M of data items processed in the data stream before data item e_i was acquired, according to the formula in Figure FDA0002878296820000021 (formula image not reproduced here).
5. The distributed inclined stream processing method based on high-frequency key value counting according to claim 1, wherein in step (8), updating the head-node data item e_h in the low-frequency key queue Q with the counting bloom filter means decrementing by 1 each element of the counting bloom filter's array B that corresponds to e_h.
6. The distributed inclined stream processing method based on high-frequency key value counting according to claim 1, wherein step (10) includes the following sub-steps:
(10-1) determining whether the difference f_max − f_min between the maximum and minimum values of the keys in the high-frequency key set S is greater than M/m, where m is the number of downstream instances; if so, entering step (10-2); otherwise, entering step (10-5);
(10-2) determining whether data item e_i is the key with the maximum value in the high-frequency key set S; if so, assigning m downstream instances to the key corresponding to the data item, appending to data item e_i one suffix chosen at random from m preset random suffixes, hashing the suffixed data item to obtain a downstream instance number, assigning the downstream instance with that number to data item e_i, and ending the process; otherwise, entering step (10-3);
(10-3) determining whether data item e_i is the key with the minimum value in the high-frequency key set S; if so, assigning 2 downstream instances to the key corresponding to the data item, appending to data item e_i one suffix chosen at random from 2 preset random suffixes, hashing the suffixed data item to obtain a downstream instance number, assigning the downstream instance with that number to data item e_i, and ending the process; otherwise, entering step (10-4);
(10-4) for a key whose value lies between the minimum and the maximum in the high-frequency key set S, assigning the number of downstream instances given by the formula in Figure FDA0002878296820000031 (formula image not reproduced here), appending to data item e_i one suffix chosen at random from preset random suffixes equal in number to those downstream instances, hashing the suffixed data item to obtain a downstream instance number, assigning the downstream instance with that number to data item e_i, and ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, appending to data item e_i one suffix chosen at random from 2 preset random suffixes, hashing the suffixed data item to obtain a downstream instance number, assigning the downstream instance with that number to data item e_i, and ending the process.
7. A distributed inclined stream processing system based on high-frequency key value counting is characterized by comprising the following modules:
a first module, configured to obtain a data item e_i to be processed in a data stream, and the total number M of data items processed in the data stream before data item e_i;
a second module, configured to determine whether data item e_i is located in the high-frequency key set S; if so, to add 1 to the value corresponding to the key in the high-frequency key set S that is the same as the data item, and then enter the tenth module; otherwise, to enter the third module;
a third module, configured to process data item e_i with a counting bloom filter to obtain the frequency f_i of data item e_i;
a fourth module, configured to determine whether the frequency f_i of data item e_i is greater than or equal to the high-frequency key threshold ε; if so, to enter the fifth module; otherwise, to enter the sixth module;
a fifth module, configured to determine whether the number of keys already in the high-frequency key set S is equal to the maximum number of keys C of the high-frequency key set; if so, to replace the key with the smallest value in the high-frequency key set S with data item e_i and set the value of the key to f_i + f_min, where f_min is the minimum value of the keys in the high-frequency key set S, and then enter the tenth module; otherwise, to insert data item e_i and its frequency f_i into the high-frequency key set S as a new key-value pair, and then enter the tenth module;
a sixth module, configured to determine whether the frequency f_i of data item e_i is greater than or equal to the low-frequency key threshold θ; if so, to enter the ninth module; otherwise, to enter the seventh module;
a seventh module, configured to determine whether the low-frequency key queue Q is full; if so, to first delete the head-node data item e_h from the low-frequency key queue Q, then insert data item e_i into the low-frequency key queue Q, and then enter the eighth module; otherwise, to insert data item e_i directly into the low-frequency key queue Q, and then enter the ninth module;
an eighth module, configured to determine whether the decay probability p = b^{-f_h} of the head-node data item e_h in the low-frequency key queue Q is greater than a random number r; if so, to update the head-node data item e_h in the low-frequency key queue Q with the counting bloom filter to obtain the updated frequency of data item e_h, and then enter the ninth module, where b is a preset exponential base, b > 1 and b ≈ 1, f_h is the frequency of the head-node data item e_h in the low-frequency key queue Q, and r is a random number in [0, 1) produced by a random number generator; otherwise, to enter the ninth module;
a ninth module, configured to assign a downstream instance to data item e_i using a key-value grouping algorithm, and to add 1 to the total number M of processed data items in the data stream;
a tenth module, configured to determine, according to the value corresponding to the key in the high-frequency key set S that is the same as data item e_i, the number of downstream instances to which that key can be assigned, to select one downstream instance from among them, to assign the selected downstream instance to data item e_i, and to add 1 to the total number M of processed data items in the data stream.
CN202011629933.9A 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting Active CN112783644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011629933.9A CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011629933.9A CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Publications (2)

Publication Number Publication Date
CN112783644A true CN112783644A (en) 2021-05-11
CN112783644B CN112783644B (en) 2023-06-23

Family

ID=75754673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011629933.9A Active CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Country Status (1)

Country Link
CN (1) CN112783644B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116319381A (en) * 2023-05-25 2023-06-23 中国地质大学(北京) Communication and resource-aware data stream grouping method and system
CN116346827A (en) * 2023-05-30 2023-06-27 中国地质大学(北京) Real-time grouping method and system for inclined data flow

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776698A (en) * 2018-06-08 2018-11-09 湖南大学 A kind of data fragmentation method of the skew-resistant based on Spark
US10862827B1 (en) * 2016-10-12 2020-12-08 Barefoot Networks, Inc. Network forwarding element with key-value processing in the data plane

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10862827B1 (en) * 2016-10-12 2020-12-08 Barefoot Networks, Inc. Network forwarding element with key-value processing in the data plane
CN108776698A (en) * 2018-06-08 2018-11-09 湖南大学 A kind of data fragmentation method of the skew-resistant based on Spark

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MICHAEL MITZENMACHER,ET AL.: "Hierarchical Heavy Hitters with the Space Saving Algorithm", 《HTTPS://ARXIV.ORG/ABS/1102.5540》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116319381A (en) * 2023-05-25 2023-06-23 中国地质大学(北京) Communication and resource-aware data stream grouping method and system
CN116319381B (en) * 2023-05-25 2023-07-25 中国地质大学(北京) Communication and resource-aware data stream grouping method and system
CN116346827A (en) * 2023-05-30 2023-06-27 中国地质大学(北京) Real-time grouping method and system for inclined data flow
CN116346827B (en) * 2023-05-30 2023-08-11 中国地质大学(北京) Real-time grouping method and system for inclined data flow

Also Published As

Publication number Publication date
CN112783644B (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Li Kenli, Liu Chubo, Zhou Xu, Guo Yaolian, Tang Zhuo, Liu Yuanchun, Luo Wenming, Song Yingjie, Yang Wangdong, Cao Ronghui, Xiao Guoqing
Inventor before: Tang Zhuo, Liu Chubo, Zhou Xu, Guo Yaolian, Li Kenli, Liu Yuanchun, Luo Wenming, Song Yingjie, Yang Wangdong, Cao Ronghui, Xiao Guoqing
GR01 Patent grant