CN112783644B - Distributed inclined flow processing method and system based on high-frequency key value counting - Google Patents

Distributed inclined flow processing method and system based on high-frequency key value counting Download PDF

Info

Publication number
CN112783644B
CN112783644B CN202011629933.9A CN202011629933A CN112783644B CN 112783644 B CN112783644 B CN 112783644B CN 202011629933 A CN202011629933 A CN 202011629933A CN 112783644 B CN112783644 B CN 112783644B
Authority
CN
China
Prior art keywords
data item
key
frequency
downstream
frequency key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011629933.9A
Other languages
Chinese (zh)
Other versions
CN112783644A (en
Inventor
李肯立
郭耀莲
唐卓
刘园春
罗文明
宋莹洁
阳王东
曹嵘晖
肖国庆
刘楚波
周旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202011629933.9A priority Critical patent/CN112783644B/en
Publication of CN112783644A publication Critical patent/CN112783644A/en
Application granted granted Critical
Publication of CN112783644B publication Critical patent/CN112783644B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/50Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed inclined flow processing method and a system based on high-frequency key value counting, wherein the basic idea is that a counting bloom filter is used for counting each data item in a data flow, the data item is respectively identified as a high-frequency key, a potential high-frequency key and a low-frequency key according to frequency number, so that the distribution of different data items is obtained, a strategy of adding random suffix to regroupe and aggregate is adopted for the high-frequency key to distribute a downstream example, and a key value grouping strategy is adopted for a non-high-frequency key to distribute the downstream example, thereby realizing the load balance among different downstream examples and improving the system performance. The invention can solve the technical problems of extremely high memory overhead of random packet downstream examples and unbalanced load among key value packet downstream examples in the inclined flow processing method.

Description

Distributed inclined flow processing method and system based on high-frequency key value counting
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a distributed inclined flow processing method and system based on high-frequency key value counting.
Background
With the development of big data technology, a large number of applications based on data flow are presented in the fields of social networks, financial data analysis, e-commerce transactions, etc. Compared with the traditional data, the data flow has the characteristics of dynamics, high speed, mass, infinite and the like, the traditional distributed processing method cannot predict and control the arrival time and the arrival scale of the data flow, and when the arrival scale of the data is extremely large, the processing performance of the traditional distributed processing method is drastically reduced. To address the challenges described above, methods based on distributed stream processing systems such as S4, storm, spark Streaming, flink, etc. have been developed. In addition, the data stream distribution in practical application is highly inclined, that is, the frequency of each data in the data stream is greatly different.
The distributed stream processing method organizes and connects the running nodes in the distributed stream processing system into an application processing flow in a logic topology mode, the connection information is usually expressed as a directed acyclic graph, the vertex in the graph represents an operation in the application, and the edge represents the flow direction of data streams among the operations. The distributed stream processing system creates a plurality of downstream examples for each data operation, and the purpose of the grouping strategy in the stream processing method is to group the data sent by the upstream operation and distribute the data to each downstream example, so that the grouping strategy of the stream processing directly influences the quantity and distribution condition of the data processed by each downstream example. The existing basic grouping method for distributed stream processing comprises random grouping and key value grouping, wherein the random grouping adopts a polling mechanism to distribute each data item to each downstream instance in an equiprobable mode, so that uniform distribution of system workload is easy to realize; key-value groupings assign data items of the same key to one downstream instance based on hash operations, the state of the key of each data item being maintained by only one downstream instance.
However, the existing distributed oblique flow processing method has the following technical problems: each downstream instance in the random packet maintains the state of all keys, and the memory overhead of the downstream instance is extremely high; while the key value groupings assign the same key to the same downstream instance, the values of different keys differ significantly, resulting in load imbalance between downstream instances, and as the data flow gradient increases, load imbalance between downstream instances is more severe.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a distributed oblique flow processing method and a system based on high-frequency key value counting, and aims to solve the technical problems of extremely high memory overhead of random grouping downstream examples and unbalanced load among key value grouping downstream examples in the oblique flow processing method.
To achieve the above object, according to one aspect of the present invention, there is provided a distributed oblique flow processing method based on high frequency key value counting, comprising the steps of:
(1) Acquiring a data item e to be processed in a data stream i And in data item e in the data stream i The total number M of previously processed data items;
(2) Judging data item e i If the key is positioned in the high-frequency key set S, adding 1 to the value corresponding to the key which is the same as the data item in the high-frequency key set S, and then entering the step (10), otherwise, entering the step (3);
(3) Pair data item e using a computational bloom filter i Processing to obtain the data item e i Frequency f of (f) i
(4) Judging data item e i Frequency f of (f) i If the size is larger than or equal to the high-frequency key threshold epsilon, the step (5) is carried out, otherwise, the step (6) is carried out;
(5) Judging whether the existing key number in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set, if so, adding the data item e i Replace the key with the smallest value in the high-frequency key set S and set the value of the key to f i +f min Wherein f min Is the minimum value of the keys in the high-frequency key set S, and then the step (10) is carried out; otherwise, data item e i Frequency f i Inserting the new key value into the high-frequency key set S, and then turning to the step (10);
(6) Judging data item e i Frequency f of (f) i If the size is larger than or equal to the low frequency key threshold value theta, the step (9) is carried out, otherwise, the step (7) is carried out;
(7) Judging whether the low-frequency key queue Q is full, if so, deleting the data item e of the head node in the low-frequency key queue Q h And then the data item e i Insert into the low frequency key queue Q, then go to step (8), otherwise, directly insert data item e i Inserting the low-frequency key into the low-frequency key queue Q, and then turning to the step (9);
(8) Judging a data item e of a head node in the low-frequency key queue Q h Attenuation probability of (2)
Figure BDA0002878296830000031
Whether or not the data item e is larger than the random number r, if yes, using a counter bloom filter to the data item e of the head node in the low frequency key queue Q h Update to get the data item e h The updated frequency number is then entered into step (9), wherein b is a preset exponent base, b>1 and b.apprxeq.1, f h Data item e being the head node in the low frequency key queue Q h R is a random number with the range of [0, 1) generated by a random number generator; otherwise, entering a step (9);
(9) Data item e using key grouping algorithm i Distributing downstream examples, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) According to the high frequency key set S and the data item e i The value size corresponding to the same key is used for determining the number of downstream instances to which the key can be allocated, selecting one downstream instance from the downstream instances according to the determined number of downstream instances, and allocating the selected downstream instance to the data item e i And adds 1 to the total number of processed data items M in the data stream, the process ends.
Preferably, the high-frequency key set S in the step (2) is implemented through a data structure based on a stream summary in a space saving algorithm, keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same father bucket, and a bidirectional linked list link is used between different father buckets in the high-frequency key set S.
Preferably, the counter bloom filter in step (3) is an array b= { B [ 0] containing w counters],B[1],…,B[w-1]First, the computational bloom filter uses t different hash functions h 1 (),h 2 (),...,h t () Calculate data item e i Hash values h respectively corresponding to 1 (e i ),h 2 (e i ),...,h t (e i ) Then, processing results h after each hash value modulo w are calculated 1 (e i )%w,h 2 (e i )%w,...,h t (e i ) % w, then, adding 1 to each element equal to each processing result in the array B, and taking the minimum value in all the obtained elements as a data item e i Frequency f of (f) i
Preferably, the high frequency key threshold epsilon in step (4) is defined by the acquisition data item e i Previously, the total number M of processed data items in the data stream was determined and there were
Figure BDA0002878296830000041
Preferably, in step (8), a counter bloom filter is used to count the data item e of the head node in the low frequency key queue Q h Updating is performed on the data item e of the head node in the low-frequency key queue Q h The elements in the corresponding array B in the CBF are decremented by 1.
Preferably, step (10) comprises the sub-steps of:
(10-1) determining the difference f between the maximum value and the minimum value of the keys in the high-frequency key set S max -f min If the size of the (a) is larger than M/M, wherein M is the number of downstream examples, if so, the step (10-2) is entered, otherwise, the step (10-5) is entered;
(10-2) judging the data item e i If the key with the largest value is in the high-frequency key set S, m downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding one suffix in m preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-3);
(10-3) judging the data item e i If the key with the minimum value in the high-frequency key set S is the key with the minimum value, 2 downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding a presetOne of the 2 random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-4);
(10-4) centered key assignment to high frequency keyset S-median
Figure BDA0002878296830000042
A downstream instance of the data item e i Randomly adding one suffix in the preset random suffixes with the same number as the downstream examples, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream example number, and allocating the downstream example corresponding to the downstream example number to the data item e i Ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, for the data item e i Randomly adding one suffix of 2 preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i The process ends.
According to another aspect of the present invention, there is provided a distributed oblique flow processing system based on high frequency key value counting, including:
a first module for acquiring a data item e to be processed in a data stream i And in data item e in the data stream i The total number M of previously processed data items;
a second module for judging the data item e i If the key is positioned in the high-frequency key set S, adding 1 to the value corresponding to the key which is the same as the data item in the high-frequency key set S, and then entering a tenth module, otherwise entering a third module;
a third module for using a counter bloom filter to the data item e i Processing to obtain the data item e i Frequency f of (f) i
A fourth module for judging the data item e i Frequency f of (f) i The size isIf not, the method enters a fifth module, and if not, the method enters a sixth module;
a fifth module for judging whether the existing key number in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set, if so, the data item e is selected i Replace the key with the smallest value in the high-frequency key set S and set the value of the key to f i +f min Wherein f min Is the minimum value of the keys in the high-frequency key set S, and then the tenth module is shifted to; otherwise, data item e i Frequency f i Inserting the new key value into the high-frequency key set S, and then transferring to a tenth module;
a sixth module for judging the data item e i Frequency f of (f) i If the size is larger than or equal to the low frequency key threshold value theta, the method is switched to a ninth module, otherwise, the method is switched to a seventh module;
a seventh module for judging whether the low frequency key queue Q is full, if yes, deleting the data item e of the head node in the low frequency key queue Q h And then the data item e i Insert into the low frequency key queue Q, then enter the eighth module, otherwise, directly insert the data item e i Inserting the code into the low-frequency key queue Q, and then transferring to a ninth module;
an eighth module for determining a data item e of the head node in the low frequency key queue Q h Attenuation probability of (2)
Figure BDA0002878296830000051
Figure BDA0002878296830000052
Whether or not the data item e is larger than the random number r, if yes, using a counter bloom filter to the data item e of the head node in the low frequency key queue Q h Update to get the data item e h The updated frequency number then enters a ninth module, wherein b is a preset exponent base, b>1 and b.apprxeq.1, f h Data item e being the head node in the low frequency key queue Q h R is a random number with the range of [0, 1) generated by a random number generator; otherwise, entering a ninth module;
a ninth module for using keysThe value grouping algorithm is data item e i Allocating downstream examples, and adding 1 to the total number M of processed data items in the data stream;
tenth module for selecting data item e according to the high frequency key set S i The value size corresponding to the same key is used for determining the number of downstream instances to which the key can be allocated, selecting one downstream instance from the downstream instances according to the determined number of downstream instances, and allocating the selected downstream instance to the data item e i And adds 1 to the total number M of processed data items in the data stream.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) In the step (2) of the invention, the high-frequency key set is stored by adopting a lightweight data structure, so that the space efficiency is very high, the downstream instance is convenient to load the set into the memory, and meanwhile, the high-frequency key set is supported to quickly inquire and update on line, so that the keys allocated to the downstream instance are quickly selected;
(2) In the step (3) of the invention, each data item in the data stream is monitored by using a counting bloom filter, and the Ji Shushi bloom filter has high calculation and memory efficiency and supports the insertion and deletion of the data item;
(3) In the step (6), the low-frequency keys with limited length are used for caching the low-frequency keys in the queue, and the low-frequency keys stored in the counter type bloom filter are removed according to the characteristic of first-in first-out of the queue and with certain attenuation probability, the attenuation probability can filter out relatively smaller low-frequency keys, so that the memory occupation of the counter type bloom filter is saved, and the probability of conflict of hash operation results among different data items is reduced;
(4) In the step (7) of the invention, the keys in the high-frequency key set are distinguished according to the value, the keys with small key values are evenly distributed to 2 downstream examples, the keys with large key values are distributed with different downstream examples according to the value, the number of the distributed downstream examples can be dynamically updated according to the change of the data flow, thus realizing the load balancing among the downstream examples, and the less downstream examples are distributed for the keys with relatively small key values, thereby reducing unnecessary memory expenditure.
Drawings
FIG. 1 is a schematic illustration of the process of the present invention;
fig. 2 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The basic idea of the invention is that as shown in fig. 1, a counter type bloom filter (Counting Bloom Filter, abbreviated as CBF) is used for counting data flows, data items are respectively identified as a high-frequency key, a potential high-frequency key and a low-frequency key according to frequency numbers, so that the distribution of different data items is obtained, the high-frequency keys are stored in a high-frequency key set, the low-frequency keys are stored in a low-frequency key queue, downstream examples are allocated to the high-frequency keys by adopting a strategy of adding random suffix for regrouping and aggregation, and downstream examples are allocated to non-high-frequency keys by adopting a key value grouping strategy, so that the loads among different downstream examples are realized, and the system performance is improved.
As shown in fig. 2, the present invention provides a distributed oblique flow processing method based on high-frequency key value counting, which includes the following steps:
(1) Acquiring a data item e to be processed in a data stream i And in data item e in the data stream i The total number M of previously processed data items;
specifically, the initial value of the total number M of processed data items in the data stream is 0, and the total number M of processed data items is counted and updated as each data item in the data stream is processed in turn.
(2) Judging data item e i If the key is in the high-frequency key set S, adding 1 to a value (value) corresponding to a key (key) which is the same as the data item in the high-frequency key set S, and then entering a step (10), otherwise, entering a step (3);
specifically, the format of the high-frequency key set S is, for example { (value) 1 ,key 1 ,key 2 ),(value 2 ,key 3 ) … }, wherein (value) 1 ,key 1 ,key 2 ) For one record in the set of high frequency keys S, there is one value in each record, but there can be one or more keys, the maximum number of keys in the set of high frequency keys S is C, where C is set according to the preset expected error E, and there is
Figure BDA0002878296830000081
key 1 ,key 2 ,key 3 Representing a key 1 And value 2 Representing the value.
In this step, each data item e is judged i Whether the key is located in the high-frequency key set S is determined by judging whether the key in the high-frequency key set S is matched with the data item e i In agreement, if so, then the acquisition data item e is described i Previously, the processed data item in the data stream was recorded as a high frequency key, i.e. the data item e was directly recorded i And accumulating statistics of the values of the corresponding keys in the high-frequency key set.
Preferably, the high-frequency key set S is implemented through a data structure based on Stream Summary (Stream Summary) in a Space Saving (Space Saving) algorithm, keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent Bucket (i.e. socket), and two-way linked lists are used for linking different parent buckets in the high-frequency key set S.
For example, the currently received data item to be processed is { talk, nano space, first, title, wiki }, there is already a record { (255, nano space), (84, first, case), (61, left), (35, word) }, the data item talk, title and wiki is not in the high frequency key set S, step (2) is entered, and the next data item nano space and first are in the high frequency key set S, the corresponding value is increased by 1, so the high frequency key set S is updated to { (256, nano space), (85, first), (84, case), (61, left), (35, word) }.
(3) Data item e using a computational bloom filter (Counting Bloom Filter, CBF for short) i Processing to obtain the data item e i Frequency f of (f) i
Specifically, CBF is an array b= { B [ 0] containing w counters],B[1],…,B[w-1]Specifically, the present step is that, first, the CBF uses t different hash functions h 1 (),h 2 (),...,h t () Calculate data item e i Hash values h respectively corresponding to 1 (e i ),h 2 (e i ),...,h t (e i ) Then, processing results h after each hash value modulo w are calculated 1 (e i )%w,h 2 (e i )%w,...,h t (e i ) % w, then, adding 1 to each element equal to each processing result in the array B, and taking the minimum value in all the obtained elements as a data item e i Frequency f of (f) i
Wherein the hash function number t of the CBF is preferably set as
Figure BDA0002878296830000091
n is the number of categories of data items in the data stream; the number w of counters is preferably +.>
Figure BDA0002878296830000092
Delta is error rate error of CBF; the CBF supports counter-plus-1 processing when data items are inserted and counter-minus-1 processing when data items are deleted.
The data item e obtained in this step i Frequency f of (f) i Including the acquisition of data item e i Previously, the processed data item e in the data stream i CBF is updated continuously with the arrival of the data stream; the CBF uses a plurality of hash functions, and takes the minimum value as the statistical frequency, so as to reduce the probability of hash collision among different data items and improve the statistical accuracy.
For the example in step (2), it is assumed that before the data item talk is acquired, the frequency of the data item talk recorded in the CBF in the processed data item in the data stream is 24, the frequency of the data item title recorded in the CBF is 29, after the processing in this step, the frequency of the data item talk returned is 25, the frequency of the data item title returned is 30, and the frequency of the data item wiki returned is 1;
(4) Judging data item e i Frequency f of (f) i Whether the size is larger than or equal to a high-frequency key threshold epsilon, if so, the data item is indicated to be a high-frequency key, then the step (5) is carried out, otherwise, the data item is indicated to be a non-high-frequency key, and then the step (6) is carried out;
specifically, the high frequency key threshold ε is determined by the capture data item e i Previously, the total number M of processed data items in the data stream was determined and there were
Figure BDA0002878296830000093
For the example in step (2), assuming that the total number of processed data items m=1203 in the data stream before the data item talk is acquired, the expected error e=0.05 (i.e. c=1/e=20), the high frequency key threshold value
Figure BDA0002878296830000094
The data item talk is a non-high frequency key; before acquiring the title of the data item, if the total number of processed data items m=1206 and the high-frequency key threshold epsilon=30 in the data stream, the title of the data item is a high-frequency key; the data item wiki is a non-high frequency key;
(5) Judging whether the existing key number in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set, if so, adding the data item e i Replace the key with the smallest value in the high-frequency key set S and set the value of the key to f i +f min Wherein f min Is the minimum value of the keys in the high-frequency key set S, and then the step (10) is carried out; otherwise, data item e i Frequency f i Inserting the new key value into the high-frequency key set S, and then turning to the step (10);
for the example in step (2), the data item title is a high frequency key, and the number of existing keys in the current high frequency key set S does not exceed C, and the data item title is directly inserted into the high frequency key set S, so as to obtain an updated high frequency key set S as { (256, nasspace), (85, first), (84, case), (61, filter), (35, word), (30, title) }.
(6) Judging data item e i Frequency f of (f) i Whether the magnitude is greater than or equal to the low frequency key threshold value theta, if so,the considered data item e i Is a potential high frequency key, then go to step (9), otherwise go to step (7);
specifically, the value range of the low frequency key threshold value θ is [2,10], preferably 5;
(7) Judging whether the low-frequency key queue Q is full, if so, deleting the data item e of the head node in the low-frequency key queue Q h And then the data item e i Insert into the low frequency key queue Q, then go to step (8), otherwise, directly insert data item e i Inserting the low-frequency key into the low-frequency key queue Q, and then turning to the step (9);
the length of the low-frequency key queue Q is equal to C, namely the length is the same as the maximum key number of the high-frequency key set, and the length of the queue can be adjusted according to the size of the data stream in practical application;
(8) Judging a data item e of a head node in the low-frequency key queue Q h Attenuation probability of (2)
Figure BDA0002878296830000101
Whether or not the data item e is larger than the random number r, if yes, using CBF to the data item e of the head node in the low frequency key queue Q h Update to get the data item e h The updated frequency number is then entered into step (9), wherein b is a preset exponent base, b>1 and b.apprxeq.1, f h Data item e being the head node in the low frequency key queue Q h R is a random number with the range of [0, 1) generated by a random number generator; otherwise, entering a step (9);
specifically, the CBF is used for the data item e of the head node in the low frequency key queue Q h Updating is performed on the data item e of the head node in the low-frequency key queue Q h The elements in the corresponding array B in the CBF are decremented by 1.
For the example in step (2), the frequency of the data item talk is 25, and then the data item talk is continuously stored in the CBF; the data item wiki needs to be inserted into the low-frequency key queue Q, but at this time, the low-frequency key queue Q is already provided with data items { page, trans-format, ns, ns, ce, page, …, user }, the length is 20, and is full, the data item page of the head node in the low-frequency key queue Q is deleted before the data item wiki is inserted, and meanwhile the data item page of the head node in the low-frequency key queue Q is queried and is kept in the CBFStored frequency f, calculating decay probability p=b -f Generating random numbers between 0 and 1), and subtracting 1 from the array element corresponding to the data item page in the CBF when the random numbers are smaller than the attenuation probability p.
The step (3), the step (4) and the step (8) have the advantages that the frequency of each data item of the data stream is monitored by using the CBF with high memory efficiency, on one hand, the dynamic high-frequency key threshold is set, the size change of the data stream can be adapted, the high-frequency key set can be identified more accurately, and the identification precision of the high-frequency keys is improved; on the other hand, a low-frequency key threshold value is set, the low-frequency keys are stored in a low-frequency key queue, and according to the characteristic of first-in first-out of the queue, the corresponding array elements in the CBF are subtracted by 1 according to the attenuation probability, the CBF is reversely updated, most of the low-frequency keys can be filtered in actual application data, so that the CBF can monitor data flow with smaller memory, and memory overhead is reduced.
(9) Data item e using key grouping algorithm i Distributing downstream examples, adding 1 to the total number M of processed data items in the data stream, and ending the process;
specifically, the key value grouping algorithm is realized based on hash operation, the assigned downstream instance number is obtained after the data item is subjected to hash operation, the downstream instance corresponding to the downstream instance number is assigned to the data item, namely, the data item corresponding to the same key is assigned to the same downstream instance;
(10) According to the high frequency key set S and the data item e i The value size corresponding to the same key is used for determining the number of downstream instances to which the key can be allocated, selecting one downstream instance from the downstream instances according to the determined number of downstream instances, and allocating the selected downstream instance to the data item e i Adding 1 to the total number M of the processed data items in the data stream, and ending the process;
specifically, the method comprises the following sub-steps:
(10-1) determining the difference f between the maximum value and the minimum value of the keys in the high-frequency key set S max -f min If the size of the (a) is larger than M/M, wherein M is the number of downstream examples, if so, the step (10-2) is entered, otherwise, the step (10-5) is entered;
(10-2) judging the data item e i At high heightIf the key with the largest value in the frequency key set S is the key with the largest value, m downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding one suffix in m preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-3);
specifically, different keys may be assigned to different numbers of downstream instances, each key being preset with the same number of random suffixes as the number of downstream instances that may be assigned; and adding a suffix of the corresponding sequence number to the data item according to the random number sequence number generated by the random function.
(10-3) judging the data item e i If the key with the minimum value in the high-frequency key set S is the key with the minimum value, 2 downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding one suffix of 2 preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-4);
(10-4) centered key assignment to high frequency keyset S-median
Figure BDA0002878296830000121
A downstream instance of the data item e i Randomly adding one suffix in the preset random suffixes with the same number as the downstream examples, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream example number, and allocating the downstream example corresponding to the downstream example number to the data item e i Ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, for the data item e i Randomly adding one suffix of 2 preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Process knotA bundle.
For the instance in step (2), assuming that the number of downstream instances is m=6, after the arrival of the data item nano space in the data stream, the high frequency key set S is updated to { (256, nano space), (84, first, case), (61, filter), (35, word) }, where m=1204, f max -f min 221 is greater than M/m=200, so the key in the current high frequency key set S is considered to have a large difference in value, and the naspace is the key with the largest value, which can be assigned to 6 downstream instances, randomly generating one at [1,6]Random numbers in the space are added with suffixes of sequence numbers corresponding to the random numbers in { 1, _2, _3, _4, _5, _6}, and downstream instance numbers which can be allocated are obtained after hash operation is carried out on the data item added with the suffixes, and the corresponding downstream instance is allocated to the data item; after arrival of the first data item in the data stream, the set of high frequency keys S is updated to { (256, naspace), (85, first), (84, case), (61, letter), (35, word) }, where m=1205, f max -f min The data item first, which is a key with centered value, can be assigned to 221 greater than M/m=200
Figure BDA0002878296830000131
A downstream instance, randomly generates an intermediate [1,2 ]]The random number in between and the suffix of the random number corresponding to the sequence number in the data item first is added in { _1, _2}, the data item added with the suffix is subjected to hash operation to obtain a downstream instance number which can be allocated, and the corresponding downstream instance is allocated to the data item; after the data item title arrives, the high frequency key set S is updated to { (256, naspace), (85, first), (84, case), (61, letter), (35, word), (30, title) }, where m=1206, f max -f min The data item title is the smallest key with 226 greater than M/m=201, and 2 downstream instances can be allocated, randomly generating one at [1,2]And (3) adding a suffix of a random number corresponding to the sequence number in { _1 and _2} to the data item title, carrying out hash operation on the data item added with the suffix to obtain an assignable downstream instance number, and assigning the corresponding downstream instance to the data item.
The method has the advantages that keys in the high-frequency key set are distinguished according to the value, keys with small key values are evenly distributed to 2 downstream instances, keys with large key values are distributed with different downstream instances according to the value, the number of the distributed downstream instances can be dynamically updated according to the change of data flow, load balancing among the downstream instances is achieved, fewer downstream instances are distributed to keys with relatively small key values, and unnecessary memory expenditure is reduced.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. The distributed oblique flow processing method based on the high-frequency key value counting is characterized by comprising the following steps of:
(1) Acquiring a data item e to be processed in a data stream i And in data item e in the data stream i The total number M of previously processed data items;
(2) Judging data item e i If the key is positioned in the high-frequency key set S, adding 1 to the value corresponding to the key which is the same as the data item in the high-frequency key set S, and then entering the step (10), otherwise, entering the step (3);
(3) Pair data item e using a computational bloom filter i Processing to obtain the data item e i Frequency f of (f) i
(4) Judging data item e i Frequency f of (f) i If the size is larger than or equal to the high-frequency key threshold epsilon, the step (5) is carried out, otherwise, the step (6) is carried out;
(5) Judging whether the existing key number in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set, if so, adding the data item e i Replace the key with the smallest value in the high-frequency key set S and set the value of the key to f i +f min Wherein f min Is the minimum value of the keys in the high-frequency key set S, and then the step (10) is carried out; otherwise, data item e i Frequency f i Inserting the new key value into the high-frequency key set S, and then turning to the step (10);
(6) Judging data item e i Frequency of (2)Number f i If the size is larger than or equal to the low frequency key threshold value theta, the step (9) is carried out, otherwise, the step (7) is carried out;
(7) Judging whether the low-frequency key queue Q is full, if so, deleting the data item e of the head node in the low-frequency key queue Q h And then the data item e i Insert into the low frequency key queue Q, then go to step (8), otherwise, directly insert data item e i Inserting the low-frequency key into the low-frequency key queue Q, and then turning to the step (9);
(8) Judging a data item e of a head node in the low-frequency key queue Q h Attenuation probability of (2)
Figure FDA0002878296820000011
Whether or not the data item e is larger than the random number r, if yes, using a counter bloom filter to the data item e of the head node in the low frequency key queue Q h Update to get the data item e h The updated frequency is then entered into step (9), where b is a preset exponent base, b > 1 and b≡1, f h Data item e being the head node in the low frequency key queue Q h R is a random number with the range of [0, 1) generated by a random number generator; otherwise, entering a step (9);
(9) Data item e using key grouping algorithm i Distributing downstream examples, adding 1 to the total number M of processed data items in the data stream, and ending the process;
(10) According to the high frequency key set S and the data item e i The value size corresponding to the same key is used for determining the number of downstream instances to which the key can be allocated, selecting one downstream instance from the downstream instances according to the determined number of downstream instances, and allocating the selected downstream instance to the data item e i And adds 1 to the total number of processed data items M in the data stream, the process ends.
2. The distributed oblique flow processing method based on high-frequency key value counting as claimed in claim 1, wherein the high-frequency key set S in the step (2) is implemented by a data structure based on a flow digest in a space saving algorithm, keys with the same count value in the high-frequency key set S are linked in the same linked list and point to the same parent barrel, and a bidirectional linked list is used between different parent barrels in the high-frequency key set S.
3. The method of high frequency key counting based distributed oblique flow processing as set forth in claim 1, wherein the counted bloom filter in step (3) is an array b= { B [ 0] containing w counters],B[1],...,B[w-1]First, the computational bloom filter uses t different hash functions h 1 (),h 2 (),...,h t () Calculate data item e i Hash values h respectively corresponding to 1 (e i ),h 2 (e i ),...,h t (e i ) Then, processing results h after each hash value modulo w are calculated 1 (e i )%w,h 2 (e i )%w,…,h t (e i ) % w, then, adding 1 to each element equal to each processing result in the array B, and taking the minimum value in all the obtained elements as a data item e i Frequency f of (f) i
4. The distributed oblique flow processing method based on high frequency key value count as claimed in claim 1, wherein the high frequency key threshold epsilon in step (4) is obtained by obtaining the data item e i Previously, the total number M of processed data items in the data stream was determined and there were
Figure FDA0002878296820000021
5. The high frequency key value count based distributed oblique flow processing method as set forth in claim 1, wherein the counter bloom filter is used in step (8) for the data item e of the head node in the low frequency key queue Q h Updating is performed on the data item e of the head node in the low-frequency key queue Q h The elements in the corresponding array B in the CBF are decremented by 1.
6. The distributed oblique flow processing method based on high-frequency key value counting as claimed in claim 1, wherein the step (10) includes the sub-steps of:
(10-1) determining the difference f between the maximum value and the minimum value of the keys in the high-frequency key set S max -f min If the size of the (a) is larger than M/M, wherein M is the number of downstream examples, if so, the step (10-2) is entered, otherwise, the step (10-5) is entered;
(10-2) judging the data item e i If the key with the largest value is in the high-frequency key set S, m downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding one suffix in m preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-3);
(10-3) judging the data item e i If the key with the minimum value in the high-frequency key set S is the key with the minimum value, 2 downstream examples are allocated to the key corresponding to the data item, and the data item e is allocated to the key i Randomly adding one suffix of 2 preset random suffixes, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instance number, and allocating the downstream instance corresponding to the downstream instance number to the data item e i Ending the process, otherwise, entering the step (10-4);
(10-4) centered key assignment to high frequency keyset S-median
Figure FDA0002878296820000031
A downstream instance of the data item e i Randomly adding one suffix in the preset random suffixes with the same number as the downstream examples, carrying out hash operation on the data item added with the suffix to obtain an allocated downstream example number, and allocating the downstream example corresponding to the downstream example number to the data item e i Ending the process;
(10-5) assigning 2 downstream instances to the key corresponding to the data item, for the data item e i Randomly adding one suffix of 2 preset random suffixes, and carrying out hash operation on the data item added with the suffix to obtain an allocated downstream instanceNumber, the downstream instance corresponding to the downstream instance number is allocated to the data item e i The process ends.
7. A distributed oblique flow processing system based on high frequency key value counting, comprising the following modules:
a first module for acquiring a data item e to be processed in a data stream i And in data item e in the data stream i The total number M of previously processed data items;
a second module for judging the data item e i If the key is positioned in the high-frequency key set S, adding 1 to the value corresponding to the key which is the same as the data item in the high-frequency key set S, and then entering a tenth module, otherwise entering a third module;
a third module for using a counter bloom filter to the data item e i Processing to obtain the data item e i Frequency f of (f) i
A fourth module for judging the data item e i Frequency f of (f) i If the size is larger than or equal to the high-frequency key threshold epsilon, entering a fifth module, otherwise, entering a sixth module;
a fifth module for judging whether the existing key number in the high-frequency key set S is equal to the maximum key number C of the high-frequency key set, if so, the data item e is selected i Replace the key with the smallest value in the high-frequency key set S and set the value of the key to f i +f min Wherein f min Is the minimum value of the keys in the high-frequency key set S, and then the tenth module is shifted to; otherwise, data item e i Frequency f i Inserting the new key value into the high-frequency key set S, and then transferring to a tenth module;
a sixth module for judging the data item e i Frequency f of (f) i If the size is larger than or equal to the low frequency key threshold value theta, the method is switched to a ninth module, otherwise, the method is switched to a seventh module;
a seventh module for judging whether the low frequency key queue Q is full, if yes, deleting the data item e of the head node in the low frequency key queue Q h And then the data item e i Inserted into the low frequency key queue QThen enter the eighth module, otherwise, directly enter the data item e i Inserting the code into the low-frequency key queue Q, and then transferring to a ninth module;
an eighth module for determining a data item e of the head node in the low frequency key queue Q h Attenuation probability p=
Figure FDA0002878296820000041
Whether or not the data item e is larger than the random number r, if yes, using a counter bloom filter to the data item e of the head node in the low frequency key queue Q h Update to get the data item e h The updated frequency number then enters a ninth module, wherein b is a preset exponent base, b is greater than 1 and b is approximately equal to 1, f h Data item e being the head node in the low frequency key queue Q h R is a random number with the range of [0, 1) generated by a random number generator; otherwise, entering a ninth module;
a ninth module for grouping the data item e by using the key value grouping algorithm i Allocating downstream examples, and adding 1 to the total number M of processed data items in the data stream;
tenth module for selecting data item e according to the high frequency key set S i The value size corresponding to the same key is used for determining the number of downstream instances to which the key can be allocated, selecting one downstream instance from the downstream instances according to the determined number of downstream instances, and allocating the selected downstream instance to the data item e i And adds 1 to the total number M of processed data items in the data stream.
CN202011629933.9A 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting Active CN112783644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011629933.9A CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011629933.9A CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Publications (2)

Publication Number Publication Date
CN112783644A CN112783644A (en) 2021-05-11
CN112783644B true CN112783644B (en) 2023-06-23

Family

ID=75754673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011629933.9A Active CN112783644B (en) 2020-12-31 2020-12-31 Distributed inclined flow processing method and system based on high-frequency key value counting

Country Status (1)

Country Link
CN (1) CN112783644B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116319381B (en) * 2023-05-25 2023-07-25 中国地质大学(北京) Communication and resource-aware data stream grouping method and system
CN116346827B (en) * 2023-05-30 2023-08-11 中国地质大学(北京) Real-time grouping method and system for inclined data flow
CN118657141A (en) * 2024-08-20 2024-09-17 浙江大华技术股份有限公司 Data analysis method and device for inclined partition and computer storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776698A (en) * 2018-06-08 2018-11-09 湖南大学 A kind of data fragmentation method of the skew-resistant based on Spark
US10862827B1 (en) * 2016-10-12 2020-12-08 Barefoot Networks, Inc. Network forwarding element with key-value processing in the data plane

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10862827B1 (en) * 2016-10-12 2020-12-08 Barefoot Networks, Inc. Network forwarding element with key-value processing in the data plane
CN108776698A (en) * 2018-06-08 2018-11-09 湖南大学 A kind of data fragmentation method of the skew-resistant based on Spark

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hierarchical Heavy Hitters with the Space Saving Algorithm;Michael Mitzenmacher,et al.;《https://arxiv.org/abs/1102.5540》;全文 *

Also Published As

Publication number Publication date
CN112783644A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112783644B (en) Distributed inclined flow processing method and system based on high-frequency key value counting
JP6716727B2 (en) Streaming data distributed processing method and apparatus
CN111427844B (en) Data migration system and method for file hierarchical storage
Zhong et al. Burstsketch: Finding bursts in data streams
Zhou et al. Designing low-complexity heavy-traffic delay-optimal load balancing schemes: Theory to algorithms
US10846795B2 (en) Order book management device in a hardware platform
AU2012217617B2 (en) Sorting
CN102521347B (en) Pattern matching intermediate result management method based on priority
CN103312825A (en) Method and device for data distribution and storage
CN113518130B (en) Packet burst load balancing method and system based on multi-core processor
CN112131005A (en) Resource adjustment strategy determination method and device
CN103428099A (en) Flow control method for universal multi-core network processor
Li et al. Ladderfilter: Filtering infrequent items with small memory and time overhead
CN105554069B (en) A kind of big data processing distributed cache system and its method
Fan et al. Onesketch: A generic and accurate sketch for data streams
CN100466622C (en) Method and system for random packet interval sampling on network
CN114020471B (en) Sketch-based lightweight elephant flow detection method and platform
Deng et al. Spatial-keyword skyline publish/subscribe query processing over distributed sliding window streaming data
CN109981726A (en) A kind of distribution method of memory node, server and system
CN109445934B (en) Query request distribution method and system
CN114063931A (en) Data storage method based on big data
Sun et al. Hee-sketch: an efficient sketch for sliding-window frequency estimation over skewed data streams
CN110716931A (en) Bloom filter based on Hash fingerprint
CN104901947A (en) Continuous numerical matching method and continuous numerical matching device based on TCAM
Qi et al. MapReduce-based data stream processing over large history data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Kenli

Inventor after: Liu Chubo

Inventor after: Zhou Xu

Inventor after: Guo Yaolian

Inventor after: Tang Zhuo

Inventor after: Liu Yuanchun

Inventor after: Luo Wenming

Inventor after: Song Yingjie

Inventor after: Yang Wangdong

Inventor after: Cao Ronghui

Inventor after: Xiao Guoqing

Inventor before: Tang Zhuo

Inventor before: Liu Chubo

Inventor before: Zhou Xu

Inventor before: Guo Yaolian

Inventor before: Li Kenli

Inventor before: Liu Yuanchun

Inventor before: Luo Wenming

Inventor before: Song Yingjie

Inventor before: Yang Wangdong

Inventor before: Cao Ronghui

Inventor before: Xiao Guoqing

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant