CN111770025B - Parallel data partitioning method and device, electronic equipment and storage medium - Google Patents

Parallel data partitioning method and device, electronic equipment and storage medium

Info

Publication number
CN111770025B
CN111770025B (application CN202010571805.7A)
Authority
CN
China
Prior art keywords
data
frequency
frequency key
key value
key values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010571805.7A
Other languages
Chinese (zh)
Other versions
CN111770025A (en)
Inventor
刘刚
荔轲建
毛睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202010571805.7A priority Critical patent/CN111770025B/en
Publication of CN111770025A publication Critical patent/CN111770025A/en
Application granted granted Critical
Publication of CN111770025B publication Critical patent/CN111770025B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/12: Avoiding congestion; Recovering from congestion
    • H04L47/125: Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00: Routing or path finding of packets in data switching networks
    • H04L45/54: Organization of routing tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a parallel data partitioning method, a device, electronic equipment and a storage medium. The method is applied to data interaction between an upstream operator node module and a downstream operator node module, wherein the upstream operator node module comprises a plurality of upstream operator nodes and the downstream operator node module comprises a plurality of downstream operator nodes. The method comprises the following steps: the plurality of upstream operator nodes acquire, in parallel, the outlines of their respective data sub-streams; the outlines of the multiple data sub-streams are aggregated to obtain the outline of the data stream to be partitioned; the outline of the data stream to be partitioned is constructed by using a shortest processing time first algorithm to obtain a routing table of new high-frequency key values; and the data stream to be partitioned is distributed to the downstream operator node module according to the routing table of the new high-frequency key values. The invention solves the problem that the existing active data partitioning method for stateful operators cannot be parallelized, and adapts well to a parallel data stream distribution mode.

Description

Parallel data partitioning method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for partitioning parallel data, an electronic device, and a storage medium.
Background
According to whether the processing logic of an operator depends on the key value of a data tuple in the data stream, data-level load balancing is divided into two modes: tuple-based and key-based. Tuple-based load balancing takes a single tuple as the distribution unit and generally adopts a random grouping strategy, which uses random or round-robin distribution to balance the number of data tuples sent to each node as far as possible. Because the random grouping strategy ignores the content of the data tuples, it is only suitable for stateless operators; if a stateful operator uses a random grouping strategy, the state integrity of the operator cannot be guaranteed. Key-based load balancing depends on the key values of the data tuples and generally adopts a key-value grouping strategy, which is aimed at data distribution for stateful operators. The key-value grouping strategy is usually implemented by applying a hash function h(t) and taking the result modulo the parallelism of the downstream operator; this implementation is simple, convenient and deterministic. However, when the frequencies of the data tuples differ greatly in a certain dimension (attribute), the key-value grouping strategy based on the hash function h(t) causes some downstream operator nodes to receive an excessive amount of data.
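As a minimal illustration of this skew problem (a Python sketch with an artificial skewed stream; the stream contents and function name are hypothetical, not taken from the patent), hash-modulo key grouping routes every tuple of a hot key to the same downstream node, so one node's load dwarfs the others:

```python
from collections import Counter
import random

def hash_grouping(key, parallelism):
    """Key-value grouping: route a tuple to a downstream node by hashing its key."""
    return hash(key) % parallelism

# A skewed stream: the key "hot" accounts for 80% of all tuples.
random.seed(7)
stream = ["hot"] * 800 + [f"k{i}" for i in range(200)]
random.shuffle(stream)

# Count how many tuples each of 4 downstream nodes receives.
loads = Counter(hash_grouping(k, 4) for k in stream)
# All 800 "hot" tuples land on a single node, whatever its hash happens to be.
print(sorted(loads.values(), reverse=True))
```

Because `hash(key)` is deterministic within one run, the 800 "hot" tuples cannot be split across nodes, which is exactly the imbalance the key-value grouping strategy above suffers from.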
Therefore, aiming at the unbalanced distribution of data among downstream operator nodes caused by the hash-based key-value grouping strategy, scholars have proposed various improved key-value grouping strategies. Among the most important are: the mixed routing distribution method that treats high-frequency and low-frequency key values differently, namely the partitioning functions proposed by Gedik (Gedik B. Partitioning functions for stateful data parallelism in stream processing [J]. The VLDB Journal, 2014, 23(4): 517-539); the framework supporting dynamic data distribution proposed by Fang et al. (Fang J, Zhang R, Fu T Z J, et al. Parallel Stream Processing Against Workload Skewness and Variance [C]. Washington, DC, USA. 2017. Association for Computing Machinery: 15-26); and the active data partitioning policy DKG (Distribution-aware Key Grouping), capable of sensing the data tuple distribution, proposed by Rivetti et al. (Rivetti N, Querzoni L, Anceaume E, et al. Efficient Key Grouping for Near-Optimal Load Balancing in Stream Processing Systems [C]. Oslo, Norway. 2015. Association for Computing Machinery: 80-91).
As a matter of common knowledge, rebalancing the load requires setting parameters, such as how often to check whether the load is in an unbalanced state and how often to rebalance the load. These parameters are tied to the data flow application, and different data flow applications have different parameters; they embody a trade-off between load imbalance and the cost of rebalancing the load. The cost of rebalancing depends on the size of the key-value state that needs to be migrated, and rebalancing may halt the normal execution of the data flow application. Different data stream applications have different key-value state types, sizes and processing-delay requirements. In addition, a rebalancing operation requires migrating the affected key values and their states from one downstream operator node (instance) to another, which means the stream processing system must support this migration operation. Storm (Toshniwal A, Taneja S, Shukla A, et al. Storm @twitter [C]. Snowbird, Utah, USA. 2014. Association for Computing Machinery: 147-156) and Samza (Noghabi S A, Paramasivam K, Pan Y, et al. Samza: stateful scalable stream processing at LinkedIn [J]. Proc. VLDB Endow., 2017, 10(12): 1634-1645) use a coarse-grained stream partitioning paradigm in which each data stream is sliced into as many data sub-streams as there are downstream operator nodes, and the sub-streams are processed in parallel. However, because a key value cannot be decoupled from its original data sub-stream, the coarse-grained stream partitioning paradigm is incompatible with the key-value migration process. In contrast, S4 (Neumeyer L, Robbins B, Nair A, et al. S4: Distributed Stream Computing Platform [C]. 2010: 170-177) and Flink (Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink™: Stream and Batch Processing in a Single Engine [J]. IEEE Data Engineering Bulletin, 2015, 38(4)) provide a fine-grained data partitioning paradigm that divides a data stream into multiple data sub-streams according to the partition key values, forming a one-to-one mapping of key values to operator nodes. Since each key value is processed independently by one operator node, the fine-grained data partitioning paradigm easily supports the migration of key values.
In addition, most existing load rebalancing methods with state migration operations are passive. Such passive methods monitor the system state to determine whether the load is unbalanced; if so, a new partition function is formulated and a state migration process is initiated to eliminate the imbalance. This passive approach has obvious drawbacks: additional overhead is required to monitor the system state in real time, and if the distribution of the data stream fluctuates frequently, the system responds frequently to load imbalances, which can significantly reduce system performance. Although the aforementioned DKG method overcomes many drawbacks of the passive approach, it still has two distinct disadvantages: DKG is executed off-line and therefore cannot really be applied in a real-time stream processing system; and DKG cannot be parallelized, that is, the upstream operator that distributes the data has only one node, which also limits the application of DKG in a parallelized stream data processing system.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the utility model provides a parallel data partitioning method, a device, an electronic device and a storage medium, aiming at solving the problem that the existing state operator active data partitioning method DKG can not be used in a parallelization way.
In order to solve the technical problems, the invention adopts the technical scheme that:
the first aspect of the embodiments of the present invention provides a parallel data partitioning method, which is applied to data interaction between an upstream operator node module and a downstream operator node module, where the upstream operator node module includes a plurality of upstream operator nodes, the downstream operator node module includes a plurality of downstream operator nodes, a data stream to be partitioned includes a plurality of data sub-streams having a plurality of data tuples, and there is a one-to-one correspondence relationship between the plurality of data sub-streams and the plurality of upstream operator nodes, and the method includes the following steps:
the plurality of upstream operator nodes acquire, in parallel, the outlines of their respective data sub-streams, wherein the outline of a data sub-stream consists of the frequency array of the low-frequency key values of the data sub-stream and its [high-frequency key value, frequency of the high-frequency key value] pairs;
aggregating the outlines of the data sub-streams to obtain the outline of the data stream to be partitioned;
constructing the outline of the data stream to be partitioned by using a shortest processing time first algorithm to obtain a routing table of new high-frequency key values;
and distributing the data stream to be partitioned to the downstream operator node module according to the routing table of the new high-frequency key values.
In some embodiments, the constructing the outline of the data stream to be partitioned by using a shortest processing time first algorithm to obtain a routing table of new high-frequency key values specifically includes the following steps:
sorting all [high-frequency key value, frequency of the high-frequency key value] pairs of the data stream to be partitioned;
adding the frequency array of the low-frequency key values of the data stream to be partitioned to the historical data distribution record array to obtain a new data distribution record array;
and on the basis of the new data distribution record array, constructing the high-frequency key values of the data stream to be partitioned one by one, in descending order of frequency, by using the shortest processing time first algorithm, that is, mapping the highest-frequency unassigned key value to the downstream operator node with the lowest load, and repeating this cycle to finally obtain the routing table of the new high-frequency key values.
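The greedy construction in the steps above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; `hot_keys` and `base_load` are assumed names standing for the sorted high-frequency pairs and the new data distribution record array:

```python
def build_routing_table(hot_keys, base_load):
    """Greedy shortest-processing-time-first construction: map each
    high-frequency key, in descending order of frequency, to the downstream
    operator node that currently has the lowest load.

    hot_keys:  dict {key: frequency} of the high-frequency key values
    base_load: per-node load array (history plus low-frequency key mapping)
    """
    load = list(base_load)
    routing_table = {}
    for key, freq in sorted(hot_keys.items(), key=lambda kv: -kv[1]):
        target = min(range(len(load)), key=lambda i: load[i])  # least-loaded node
        routing_table[key] = target
        load[target] += freq
    return routing_table, load

# Hypothetical outline of a stream with 3 downstream nodes.
table, load = build_routing_table(
    hot_keys={"a": 50, "b": 30, "c": 20, "d": 10},
    base_load=[5, 0, 0],   # low-frequency keys already mapped by hash
)
print(table, load)  # → {'a': 1, 'b': 2, 'c': 0, 'd': 0} [35, 50, 30]
```

Starting from the recorded base load (rather than zero) is what lets the construction compensate for the data already sent by low-frequency keys and by earlier cycles.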
In some embodiments, the constructing the summary of the data stream to be partitioned by using the shortest processing time first algorithm to obtain the routing table of the new high-frequency key value further includes the following steps:
comparing the routing table of the new high-frequency key value with the routing table of the old high-frequency key value to obtain a migration table of the key value state;
and carrying out key value state migration on each downstream operator node in the downstream operator node module according to the migration table of the key value state.
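A migration table of this kind can be obtained by a simple diff of the two routing tables; the sketch below is a hedged Python illustration (names are my own), in which only keys whose owning downstream node changed appear in the result:

```python
def migration_table(old_table, new_table):
    """Compare the old and new high-frequency-key routing tables; every key
    whose owner changed must have its state migrated between the two
    downstream operator nodes."""
    moves = {}
    for key, new_node in new_table.items():
        old_node = old_table.get(key)
        if old_node is not None and old_node != new_node:
            moves[key] = (old_node, new_node)   # (source node, target node)
    return moves

moves = migration_table(old_table={"a": 0, "b": 1, "c": 2},
                        new_table={"a": 0, "b": 2, "d": 1})
print(moves)  # → {'b': (1, 2)}
```

Keys that kept their owner ("a") and keys newly promoted to high-frequency ("d", which has no state to move yet) produce no migration entries, which keeps the migration overhead limited to genuinely re-routed keys.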
In some embodiments, the plurality of upstream operator nodes acquiring, in parallel, the outlines of their respective data sub-streams specifically includes the following steps:
the plurality of upstream operator nodes acquire, in parallel, the [high-frequency key value, frequency of the high-frequency key value] pairs of their respective data sub-streams;
and the plurality of upstream operator nodes acquire, in parallel, the frequency arrays of the low-frequency key values of their respective data sub-streams.
In some embodiments, the plurality of upstream operator nodes acquiring, in parallel, the [high-frequency key value, frequency of the high-frequency key value] pairs of their respective data sub-streams specifically includes the following steps:
the plurality of upstream operator nodes each identify their respective data sub-stream by using the Space Saving algorithm to obtain the high-frequency key values of the data sub-stream and the frequencies of those high-frequency key values;
and the plurality of upstream operator nodes combine the high-frequency key values and their frequencies for each data sub-stream to obtain the [high-frequency key value, frequency of the high-frequency key value] pairs of the data sub-stream.
In some embodiments, the plurality of upstream operator nodes acquiring, in parallel, the frequency arrays of the low-frequency key values of their respective data sub-streams includes:
the plurality of upstream operator nodes each map all data tuples of their respective data sub-stream into a respective array by using a hash function, the number of elements in the array being equal to the number of downstream operator nodes in the downstream operator node module;
the plurality of upstream operator nodes each subtract the frequencies of the high-frequency key values of their respective data sub-stream from the respective array to obtain the mapped frequency of the low-frequency key values of the data sub-stream on each downstream operator node in the downstream operator node module;
and the plurality of upstream operator nodes each summarize the mapped frequencies of the low-frequency key values of their respective data sub-stream on each downstream operator node to obtain the frequency array of the low-frequency key values of the data sub-stream.
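These three steps can be sketched for one upstream node as follows (a hedged Python illustration; the function name, example stream and hot-key set are hypothetical):

```python
def low_frequency_array(tuples, hot_freq, n_downstream):
    """Frequency array of the low-frequency key values of one data sub-stream.

    Every tuple is first counted into the bucket its key hashes to (array
    length = number of downstream operator nodes); the counts of the detected
    high-frequency keys are then subtracted, leaving only the load that the
    low-frequency keys would place on each downstream node.
    """
    buckets = [0] * n_downstream
    for key in tuples:                       # step S1121: hash all tuples
        buckets[hash(key) % n_downstream] += 1
    for key, freq in hot_freq.items():       # step S1122: remove hot-key counts
        buckets[hash(key) % n_downstream] -= freq
    return buckets                           # step S1123: the frequency array

# 6 tuples of the hot key "x" plus 3 low-frequency tuples, 2 downstream nodes.
sub_stream = ["x"] * 6 + ["y", "z", "y"]
arr = low_frequency_array(sub_stream, hot_freq={"x": 6}, n_downstream=2)
print(arr, sum(arr))   # the 3 low-frequency tuples remain, split by hash
```

Subtracting rather than tracking low-frequency keys individually is what keeps the outline compact: the array has one counter per downstream node, regardless of how many distinct low-frequency keys exist.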
In some embodiments, the distributing the data stream to be partitioned to a downstream operator node module according to the routing table of the new high-frequency key value specifically includes the following steps:
and distributing all data tuples in each data sub-stream to each downstream operator node in the downstream operator node module according to the routing table of the new high-frequency key value.
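A per-tuple routing decision consistent with the above could look like the sketch below. This is an assumption-laden illustration: the patent only states that tuples are distributed "according to the routing table", and the hash fallback for low-frequency keys is inferred from the hash mapping used when building their frequency array:

```python
def route(key, routing_table, n_downstream):
    """High-frequency keys follow the constructed routing table; low-frequency
    keys fall back to plain hash-modulo, matching the mapping assumed when
    their frequency array was built."""
    if key in routing_table:
        return routing_table[key]
    return hash(key) % n_downstream

# Hot key "a" is pinned to node 2; any other key is hashed across 4 nodes.
print(route("a", {"a": 2}, 4))
```

Because only high-frequency keys appear in the routing table, this lookup stays O(1) per tuple with a table whose size is bounded by the number of hot keys, not by the key universe.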
A second aspect of the embodiments of the present invention provides a parallel data partitioning apparatus, including:
the upstream operator node module is used for utilizing a plurality of upstream operator nodes to obtain the outlines of the respective data sub-streams in parallel, and the outlines of the data sub-streams are frequency arrays of low-frequency key values of the data sub-streams and pairs of [ high-frequency key values and frequency of high-frequency key values ];
the scheduling modules have a one-to-one correspondence relationship with the upstream operator nodes and are used for extracting the outline of each data sub-stream in the upstream operator nodes;
the construction module is used for aggregating the outlines of the data sub-streams to obtain the outline of the data stream to be partitioned, and constructing the outline of the data stream to be partitioned by using a shortest processing time first algorithm to obtain a routing table of new high-frequency key values;
the scheduling module is further configured to distribute the data stream to be partitioned to a downstream operator node module according to a routing table of the new high-frequency key value;
and the downstream operator node module is used for receiving the data stream to be partitioned distributed by the scheduling module.
A third aspect of embodiments of the present invention provides an electronic device, which includes a storage device and one or more processors, where the storage device is configured to store one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are configured to execute the method according to the first aspect of embodiments of the present invention.
A fourth aspect of embodiments of the present invention provides a storage medium having stored thereon executable instructions that, when executed, perform a method according to the first aspect of embodiments of the present invention.
From the above description, compared with the prior art, the invention has the following beneficial effects:
the method comprises the steps of utilizing a plurality of upstream operator nodes to parallelly acquire the outlines of data sub-streams to be partitioned, then aggregating the outlines of all the data sub-streams to obtain the global outline of the data stream to be partitioned, finally constructing the outlines of the data stream to be partitioned by utilizing a shortest processing time priority algorithm to obtain a routing table of a new high-frequency key value, and then routing and distributing the respective data sub-streams by each upstream operator node according to the routing table of the new high-frequency key value. The invention solves the problem that the existing active data partitioning method of the stateful operator can not be used in a parallelization way, and can be well adapted to a parallel data stream distribution mode.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below. It is to be understood that the drawings in the following description are of some, but not all, embodiments of the invention. For a person skilled in the art, other figures can also be obtained from the provided figures without inventive effort.
FIG. 1 is a flowchart of a parallel data partitioning method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for partitioning parallel data according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a parallel data partitioning method provided by the second embodiment of the present invention connected to FIG. 2;
fig. 4 is a block diagram of a parallel data partitioning apparatus according to a third embodiment of the present invention;
FIG. 5 is a pseudo code of a finite state machine model of any scheduling module according to a third embodiment of the present invention;
FIG. 6 is pseudo code for a finite state machine model of a build module provided by a third embodiment of the present invention;
fig. 7 is a block diagram of an electronic device according to a fourth embodiment of the present invention;
fig. 8 is a block diagram of a storage medium according to a fifth embodiment of the present invention.
Detailed Description
For purposes of promoting a clear understanding of the objects, aspects and advantages of the invention, reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements throughout. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to fig. 1, fig. 1 is a flowchart illustrating a parallel data partitioning method according to a first embodiment of the present invention.
As shown in fig. 1, a parallel data partitioning method provided in a first embodiment of the present invention is applied to data interaction between an upstream operator node module and a downstream operator node module, where the upstream operator node module includes a plurality of upstream operator nodes, the downstream operator node module includes a plurality of downstream operator nodes, a data stream to be partitioned includes a plurality of data sub-streams having a plurality of data tuples, and there is a one-to-one correspondence between the plurality of data sub-streams and the plurality of upstream operator nodes, and the method includes the following steps:
s11, a plurality of upstream operator nodes parallelly acquire the outlines of respective data sub-streams, wherein the outlines of the data sub-streams are frequency arrays of low-frequency key values of the data sub-streams and [ high-frequency key values and frequency of the high-frequency key values ] pairs;
s12, aggregating the outlines of the data sub-streams to obtain the outlines of the data streams to be partitioned;
s13, constructing a skeleton of the data stream to be partitioned by using a shortest processing time priority algorithm to obtain a routing table of a new high-frequency key value;
and S14, distributing the data flow to be partitioned to a downstream operator node module according to the routing table of the new high-frequency key value.
It should be noted that the parallel data partitioning method provided in this embodiment is applied in the scientific research field, and can be used as a comparison algorithm for key value grouping of big data, where the application mode is programming to implement the method, and then test comparison is performed; in the production field, the method can be integrated in an actual distributed stream processing system so as to perform large-scale data analysis, data mining, weblog analysis and the like, and the application mode is programming so as to realize the integration into the distributed stream processing system.
The parallel data partitioning method provided in the first embodiment of the present invention acquires, in parallel, the outlines of the data sub-streams to be partitioned by using a plurality of upstream operator nodes, aggregates the outlines of all the data sub-streams to obtain the global outline of the data stream to be partitioned, constructs the outline of the data stream to be partitioned by using a shortest processing time first algorithm to obtain a routing table of new high-frequency key values, and then has each upstream operator node route and distribute its data sub-stream according to the routing table of the new high-frequency key values. The invention solves the problem that the existing active data partitioning method for stateful operators cannot be parallelized, and adapts well to a parallel data stream distribution mode.
Referring to fig. 2 and fig. 3, fig. 2 is a flowchart illustrating a parallel data partitioning method according to a second embodiment of the present invention, and fig. 3 is a flowchart illustrating a parallel data partitioning method according to the second embodiment of the present invention connected to fig. 2.
Compared with the parallel data partitioning method provided by the first embodiment of the present invention, the parallel data partitioning method provided by the second embodiment of the present invention is designed in detail with respect to the specific flows of steps S11, S13, and S14, and in the second embodiment of the present invention:
further, as shown in fig. 2 and 3, step S13 specifically includes:
s131, sorting all high-frequency key values and pairs of [ high-frequency key values and frequencies of high-frequency key values ] of the data stream to be partitioned;
s132, adding the frequency array of the low-frequency key values of the data stream to be partitioned to the historical data distribution record array to obtain a new data distribution record array;
and S133, on the basis of the new data distribution record array, sequentially constructing a plurality of high-frequency key values of the data stream to be partitioned according to the sequence of the frequencies from large to small by using a shortest processing time priority algorithm, namely mapping the high-frequency key value with the highest frequency to a downstream operator node with the lowest load, and continuously circulating to obtain a routing table of new high-frequency key values.
Further, as shown in fig. 3, step S13 is followed by:
s21, comparing the routing table of the new high-frequency key value with the routing table of the old high-frequency key value to obtain a transition table of the key value state;
and S22, carrying out key value state migration on each downstream operator node in the downstream operator node module according to the migration table of the key value state.
Further, as shown in fig. 2, step S11 specifically includes:
s111, a plurality of upstream operator nodes parallelly acquire a pair of [ high-frequency key values and frequencies of the high-frequency key values ] of the data sub-streams;
and S112, the upstream operator nodes parallelly acquire frequency arrays of low-frequency key values of the data substreams.
Further, step S111 specifically includes:
S1111, the plurality of upstream operator nodes each identify their respective data sub-stream by using the Space Saving algorithm to obtain the high-frequency key values of the data sub-stream and the frequencies of those high-frequency key values;
and S1112, the plurality of upstream operator nodes combine the high-frequency key values and their frequencies for each data sub-stream to obtain the [high-frequency key value, frequency of the high-frequency key value] pairs of the data sub-stream.
Step S112 specifically includes:
s1121, respectively mapping all data tuples in respective data substreams to respective arrays by using a hash function through a plurality of upstream operator nodes, wherein the number of elements in the arrays is equal to that of downstream operator nodes in a downstream operator node module;
s1122, the multiple upstream operator nodes respectively use the respective arrays to subtract the frequency of the high-frequency key values of the respective data sub-streams to obtain the low-frequency key values of the data sub-streams, and the mapping frequency of each downstream operator node in the downstream operator node modules is obtained;
s1123, the multiple upstream operator nodes respectively summarize low-frequency key values of the respective data substreams, and map frequencies on each downstream operator node in the downstream operator node module to obtain a frequency array of low-frequency key values of the data substreams.
It should be appreciated that the Space Saving algorithm is a count-based, deterministic method for estimating the frequency of data stream elements. The algorithm has two parameters θ and ε, with 0 < θ < ε ≤ 1. The algorithm maintains at most ⌈1/ε⌉ [element, count] pairs and returns every element i whose frequency of occurrence satisfies f_i > θ·m, where m is the number of data stream elements. Furthermore, Metwally et al. (Metwally A, Agrawal D, El Abbadi A. Efficient computation of frequent and top-k elements in data streams [C]. Edinburgh, UK. 2005. Springer-Verlag: 398-412) have proved that when 0 < θ < ε, the algorithm's estimate f̂_i of the frequency of element i satisfies f_i ≤ f̂_i ≤ f_i + ε·m.
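A minimal dictionary-based version of Space Saving can be sketched as follows (the published algorithm uses a Stream-Summary structure for O(1) updates; this simplified sketch trades that for clarity, and the class and method names are my own):

```python
class SpaceSaving:
    """Minimal Space Saving sketch: keeps at most `capacity` [element, count]
    counters; when full, the smallest counter is reassigned to the incoming
    element and inherits (then increments) its count, which is what bounds
    the overestimation by ε·m."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def offer(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            victim = min(self.counts, key=self.counts.get)  # smallest counter
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + 1   # estimate may overestimate

    def heavy_hitters(self, theta, m):
        """Return elements whose estimated frequency exceeds theta * m."""
        return {k: v for k, v in self.counts.items() if v > theta * m}

ss = SpaceSaving(capacity=2)
for t in ["a"] * 6 + ["b"] * 3 + ["c"]:
    ss.offer(t)
print(ss.heavy_hitters(theta=0.4, m=10))  # → {'a': 6}
```

With capacity 2, the late arrival "c" evicts "b" and inherits its count, yet the genuinely high-frequency element "a" still clears the θ·m threshold, illustrating why the guarantee above only promises overestimation, never underestimation.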
Further, as shown in fig. 3, step S14 specifically includes:
s141, distributing all data tuples in each data sub-stream to each downstream operator node in the downstream operator node module according to the routing table of the new high-frequency key value.
In the parallel data partitioning method provided in the second embodiment of the present invention, each parallel upstream operator node in the upstream operator node module uses an array to record the number of data tuples sent to each parallel downstream operator node in the downstream operator node module, and this array is applied when constructing the routing table of the new high-frequency key values, thereby reducing the final load imbalance. In addition, when constructing the routing table of the new high-frequency key values, only the high-frequency key values are constructed, so that low-frequency key values do not participate in key-value migration and the migration overhead is minimized. Experiments show that, compared with the DKG method, the parallel data partitioning method provided by this embodiment (abbreviated as the OKG method) reduces the load imbalance ratio at the highest parallelism and at the highest skew by 87% and 89%, respectively, so that the load distribution result is more uniform.
Referring to fig. 4, 5 and 6, fig. 4 is a block diagram of a parallel data partitioning apparatus according to a third embodiment of the present invention, fig. 5 is pseudo code of a finite state machine model of any scheduling module according to the third embodiment of the present invention, and fig. 6 is pseudo code of a finite state machine model of a building module according to the third embodiment of the present invention.
As shown in fig. 4, corresponding to the parallel data partitioning method provided in the first embodiment of the present invention, the parallel data partitioning apparatus 100 provided in the third embodiment of the present invention is an apparatus for distributing a data stream to be partitioned, located in an upstream operator node module, to a downstream operator node module, where the upstream operator node module includes a plurality of upstream operator nodes and the downstream operator node module includes a plurality of downstream operator nodes, and the parallel data partitioning apparatus 100 includes:
an upstream operator node module 101, configured to use multiple upstream operator nodes to obtain the outlines of their respective data sub-streams in parallel, where the outline of a data sub-stream is the frequency array of its low-frequency key values together with its [high-frequency key value, frequency of high-frequency key value] pairs;
a plurality of scheduling modules 102, in one-to-one correspondence with the upstream operator nodes, configured to extract the outline of each data sub-stream from the corresponding upstream operator node;
a constructing module 103, configured to aggregate the outlines of the multiple data sub-streams into the outline of the data stream to be partitioned, and to apply the shortest-processing-time-first algorithm to this outline to obtain a routing table of new high-frequency key values;
the scheduling modules 102 are further configured to distribute the data stream to be partitioned to the downstream operator node module according to the routing table of the new high-frequency key values;
and the downstream operator node module 104 is configured to receive the data streams to be partitioned, which are distributed by the multiple scheduling modules 102.
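The synopsis (outline) exchanged between these modules, and its aggregation by the constructing module, can be sketched as follows. This is a minimal illustration only; the class and function names (`Synopsis`, `merge`, `buckets`, `heavy_hitters`) are assumptions for exposition and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Synopsis:
    """Outline of one data sub-stream, as described above.

    buckets[i] holds the total frequency of low-frequency key values
    hashed to downstream operator node i; heavy_hitters maps each
    high-frequency key value to its observed frequency.
    """
    buckets: List[int]                       # one slot per downstream node
    heavy_hitters: Dict[str, int] = field(default_factory=dict)

def merge(synopses: List[Synopsis]) -> Synopsis:
    """Aggregate the per-sub-stream synopses into a global synopsis
    (the aggregation step performed by the constructing module)."""
    n = len(synopses[0].buckets)
    out = Synopsis(buckets=[0] * n)
    for s in synopses:
        for i, f in enumerate(s.buckets):
            out.buckets[i] += f            # sum low-frequency loads per node
        for k, f in s.heavy_hitters.items():
            out.heavy_hitters[k] = out.heavy_hitters.get(k, 0) + f
    return out
```

Because the low-frequency keys are kept only as per-node frequency totals, the synopsis stays small regardless of how many distinct low-frequency keys the sub-stream contains.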
In this embodiment, the scheduling module 102, the construction module 103, and the downstream operator node module 104 all run periodically in the form of a finite state machine. The following describes the finite state machine models of the scheduling module 102, the constructing module 103, and the downstream operator node module 104, respectively:
Pseudo code for the finite state machine model of the scheduling module 102 is shown in FIG. 5. The finite state machine model of the scheduling module 102 includes 4 states: COLLECT, LEARN, WAIT, and ASSIGN. The COLLECT state is the initial state of the scheduling module 102 in each cycle. In this state, the scheduling module 102 receives and stores the data tuples of a data sub-stream in the upstream operator node module 101 (for example, m data tuples per data sub-stream); once m data tuples have been received, the scheduling module 102 enters the LEARN state. In the LEARN state, the scheduling module 102 learns the key value distribution of the current data sub-stream. Specifically, the scheduling module 102 feeds all data tuples one by one to the Space Saving algorithm node (line 2 of code segment 1 in fig. 5) and the Buckets array (line 3 of code segment 1 in fig. 5). The scheduling module then combines the high-frequency key values returned by the Space Saving algorithm with the Buckets array to form the synopsis, i.e., the sketch, of the data sub-stream (Make function of code segment 1 in fig. 5, lines 5-11), and sends the sketch to the constructing module 103. After the sketch is successfully sent, the scheduling module 102 enters the WAIT state to wait for the constructing module 103 to return the routing table of the new high-frequency key values. After receiving the routing table of the new high-frequency key values, the scheduling module 102 enters the ASSIGN state and begins to actually distribute the data tuples to the downstream operator node module 104 (Assign function of code segment 1 in fig. 5, lines 12-20). After the data tuples are distributed, the scheduling module 102 returns to the COLLECT state to begin the next cycle.
Pseudo code for the finite state machine model of the construction module 103 is shown in FIG. 6. The finite state machine model of the construction module 103 includes two states: WAIT ALL and COMPILE. The construction module 103 remains in the WAIT ALL state until all the scheduling modules 102 have sent the sketches of their data sub-streams; its state then transitions to the COMPILE state. In the COMPILE state, the construction module 103 constructs the routing table of new high-frequency key values (COMPILE function of code segment 2 in fig. 6). Specifically, the COMPILE function executes the following steps:
step one, aggregating the sketches sent by all the scheduling modules 102 into a complete global sketch, and sorting the high-frequency key values and the [high-frequency key value, frequency of high-frequency key value] pairs in the global sketch (COMPILE function of code segment 2 in fig. 6, line 2);
step two, adding the low-frequency Buckets of the global sketch to the old data distribution record array HB (Historical Buckets) to obtain a new data distribution record array (COMPILE function of code segment 2 in fig. 6, lines 4-7);
step three, on the basis of the new data distribution record array, mapping each [high-frequency key value, frequency] pair to a downstream operator node according to the shortest-processing-time-first principle, thereby obtaining the routing table of new high-frequency key values (COMPILE function of code segment 2 in fig. 6, lines 9-13).
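The three steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions (dict/list representations, a linear scan for the least-loaded node); the function and parameter names are not taken from the patent's pseudo code.

```python
def build_routing_table(heavy_hitters, low_freq_buckets, historical_buckets):
    """Shortest-processing-time-first placement of high-frequency keys.

    heavy_hitters: {key: frequency} pairs from the global sketch.
    low_freq_buckets / historical_buckets: per-downstream-node frequency
    totals of low-frequency keys, for the current cycle and accumulated
    history (HB) respectively.
    Returns ({key: node_index}, updated historical array).
    """
    # Step two: fold this cycle's low-frequency load into the history (HB),
    # giving the starting load of each downstream node.
    load = [h + b for h, b in zip(historical_buckets, low_freq_buckets)]
    routing = {}
    # Steps one and three: place keys in descending frequency order,
    # each onto the currently least-loaded downstream node.
    for key, freq in sorted(heavy_hitters.items(), key=lambda kv: -kv[1]):
        node = min(range(len(load)), key=lambda i: load[i])
        routing[key] = node
        load[node] += freq
    return routing, load
```

For example, with high-frequency keys {a: 10, b: 6, c: 4} and per-node low-frequency loads [1, 2], key a goes to node 0, and b and c both go to node 1, leaving the two nodes with nearly equal total load.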
Next, the constructing module 103 compares the routing tables of the new and old high-frequency key values to obtain a migration table of key value states, and sends the migration table to the downstream operator node module 104. After receiving notifications of successful state migration from all downstream operator nodes in the downstream operator node module 104, the constructing module 103 sends the routing table of the new high-frequency key values to all scheduling modules 102 and re-enters the WAIT ALL state.
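The migration table can be derived by a simple diff of the old and new routing tables: any high-frequency key whose target node changed must have its accumulated state moved. A minimal sketch (the `(key, from_node, to_node)` triple format is an assumption for illustration):

```python
def migration_table(old_routing, new_routing):
    """Diff the old and new high-frequency routing tables.

    Returns a list of (key, from_node, to_node) moves for every key
    whose assigned downstream node changed. Keys that are new this
    cycle have no state to move and are skipped.
    """
    moves = []
    for key, new_node in new_routing.items():
        old_node = old_routing.get(key)
        if old_node is not None and old_node != new_node:
            moves.append((key, old_node, new_node))
    return moves
```

Because only high-frequency keys appear in the routing tables, only they can appear in the migration table, which is precisely how the method keeps migration overhead small.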
The finite state machine model of the downstream operator node module 104 includes two states: RUN and MIGRATE. After receiving the migration table of the key value state sent by the construction module 103, the downstream operator node module 104 enters the MIGRATE state, and immediately starts to execute the state migration work. After the migration is completed, the downstream operator node module 104 notifies the construction module 103 that the state migration is completed, and then enters the RUN state. In RUN state, the downstream operator node module 104 receives all data tuples sent from the scheduler module 102 and performs the subsequent actual processing.
In the parallel data partitioning apparatus provided in the third embodiment of the present invention, the scheduling modules, the constructing module, and the downstream operator node module all operate periodically in the form of finite state machines. As a result, the parallel data partitioning method of the first embodiment, the second embodiment, or their combination, as implemented by this apparatus, becomes an online data distribution method that, unlike the DKG method, can actually be deployed in a real-time stream processing system.
Referring to fig. 7, fig. 7 is a block diagram of an electronic device according to a fourth embodiment of the invention.
As shown in fig. 7, an electronic device 200 according to a fourth embodiment of the present invention includes:
a storage device 201 and one or more processors 202, wherein the storage device 201 is configured to store one or more programs which, when executed by the one or more processors 202, cause the one or more processors 202 to perform the parallel data partitioning method provided by the first embodiment, the second embodiment, or a combination thereof of the present invention.
In this embodiment, the electronic apparatus 200 further includes a bus 203, and the bus 203 is used for connecting the storage device 201 and the one or more processors 202.
Referring to fig. 8, fig. 8 is a block diagram of a storage medium according to a fifth embodiment of the present invention.
As shown in fig. 8, a storage medium 300 according to a fifth embodiment of the present invention has executable instructions 301 stored thereon which, when executed, perform the parallel data partitioning method according to the first embodiment, the second embodiment, or a combination thereof.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk), among others.
It should be noted that, in this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the method embodiments are similar to the product embodiments, their description is brief, and the relevant points can be found in the description of the product embodiments.
It is further noted that, in the present disclosure, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this disclosure may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A parallel data partitioning method is characterized by being realized based on a parallel data partitioning device, wherein the parallel data partitioning device comprises an upstream operator node module, a plurality of scheduling modules, a construction module and a downstream operator node module, the upstream operator node module comprises a plurality of upstream operator nodes, the downstream operator node module comprises a plurality of downstream operator nodes, the scheduling modules and the upstream operator nodes are in one-to-one correspondence, a data flow to be partitioned comprises a plurality of data sub-flows with a plurality of data tuples, the data sub-flows and the upstream operator nodes are in one-to-one correspondence, and the scheduling modules, the construction module and the downstream operator node module periodically run in the form of finite state machines;
the method comprises the following steps:
a plurality of upstream operator nodes in the upstream operator node module parallelly acquire the outlines of the respective data sub-streams; wherein, the outline of the data sub-stream is a frequency array of low-frequency key values of the data sub-stream and a [ high-frequency key values, frequency of high-frequency key values ] pair;
the dispatching modules respectively extract the outlines of all the data sub-streams in the upstream operator nodes;
the construction module is used for aggregating the outlines of the data sub-streams to obtain the outlines of the data streams to be partitioned, and constructing the outlines of the data streams to be partitioned by utilizing a shortest processing time priority algorithm to obtain a routing table of a new high-frequency key value;
and the scheduling modules distribute the data flow to be partitioned to a downstream operator node module according to the routing table of the new high-frequency key value.
2. The method according to claim 1, wherein the step of constructing a skeleton of the data stream to be partitioned by using a shortest processing time first algorithm to obtain a routing table of a new high-frequency key value comprises the following steps:
sorting all high-frequency key values and [ high-frequency key values and frequency of high-frequency key values ] pairs of the data stream to be partitioned;
adding the frequency array of the low-frequency key values of the data stream to be partitioned to a historical data distribution record array to obtain a new data distribution record array;
and on the basis of the new data distribution record array, using the shortest processing time priority algorithm to place the high-frequency key values of the data stream to be partitioned in descending order of frequency, that is, mapping the high-frequency key value with the highest frequency to the downstream operator node with the lowest load, repeating this process, and finally obtaining the routing table of new high-frequency key values.
3. The method of claim 2, wherein the step of constructing the outlines of the data streams to be partitioned by using the shortest processing time first algorithm to obtain the routing table of the new high frequency key values further comprises the steps of:
comparing the routing table of the new high-frequency key value with the routing table of the old high-frequency key value to obtain a migration table of the key value state;
and carrying out key value state migration on each downstream operator node in the downstream operator node module according to the migration table of the key value state.
4. The method according to claim 1, wherein said upstream operator nodes obtain respective outlines of data sub-streams in parallel, and specifically comprises the steps of:
a plurality of upstream operator nodes parallelly acquire pairs of high-frequency key values and frequencies of the high-frequency key values of the data sub-streams;
and a plurality of upstream operator nodes parallelly acquire frequency arrays of low-frequency key values of the respective data sub-streams.
5. The method of claim 4, wherein said plurality of upstream operator nodes obtain [ high frequency key values, frequency of high frequency key values ] pairs of respective data substreams in parallel, comprising the steps of:
the plurality of upstream operator nodes respectively identify the respective data substreams by utilizing a Space Saving algorithm to obtain high-frequency key values and frequency of the high-frequency key values of the data substreams;
the upstream operator nodes combine the high-frequency key value and the frequency of the high-frequency key value of each data sub-stream to obtain a pair of the high-frequency key value and the frequency of the high-frequency key value of the data sub-stream.
6. The method of claim 5, wherein said plurality of upstream operator nodes concurrently obtain frequency arrays of low frequency key values for respective data substreams, comprising the steps of:
the upstream operator nodes respectively map all data tuples in the respective data sub-streams to respective arrays by utilizing a hash function, and the number of elements in the arrays is equal to that of downstream operator nodes in a downstream operator node module;
the upstream operator nodes respectively subtract the frequency of the high-frequency key value of the respective data sub-stream by using the respective arrays to obtain the low-frequency key value of the data sub-stream and the mapping frequency on each downstream operator node in the downstream operator node module;
and the upstream operator nodes respectively summarize the low-frequency key values of their data sub-streams and the mapping frequencies on each downstream operator node in the downstream operator node module, to obtain the frequency array of low-frequency key values of each data sub-stream.
7. The method for partitioning parallel data according to claim 1, wherein the distributing the data stream to be partitioned to a downstream operator node module according to the routing table of the new high-frequency key value specifically comprises the following steps:
and distributing all data tuples in each data sub-stream to each downstream operator node in the downstream operator node module according to the routing table of the new high-frequency key value.
8. An electronic device, comprising: a storage device configured to store one or more programs, and one or more processors, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
9. A storage medium having stored thereon executable instructions that, when executed, perform the method of any one of claims 1-7.
CN202010571805.7A 2020-06-22 2020-06-22 Parallel data partitioning method and device, electronic equipment and storage medium Active CN111770025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010571805.7A CN111770025B (en) 2020-06-22 2020-06-22 Parallel data partitioning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111770025A CN111770025A (en) 2020-10-13
CN111770025B true CN111770025B (en) 2022-12-30

Family

ID=72721525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010571805.7A Active CN111770025B (en) 2020-06-22 2020-06-22 Parallel data partitioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111770025B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656369A (en) * 2021-08-13 2021-11-16 辽宁华盾安全技术有限责任公司 Log distributed streaming acquisition and calculation method in big data scene

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021126A (en) * 2016-05-31 2016-10-12 腾讯科技(深圳)有限公司 Cache data processing method, server and configuration device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201739155A (en) * 2016-04-18 2017-11-01 Inno-Tech Co Ltd Power controller reducing the conduction loss and the switching loss and maintaining the best operation efficiency under different input voltages
US10162830B2 (en) * 2016-06-22 2018-12-25 Oath (Americas) Inc. Systems and methods for dynamic partitioning in distributed environments

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021126A (en) * 2016-05-31 2016-10-12 腾讯科技(深圳)有限公司 Cache data processing method, server and configuration device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《A Holistic Stream Partitioning Algorithm for Distributed Stream Processing Systems》;刘刚等;《2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies》;20191201;正文1-6页 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant