CN114780541A - Data partitioning method, device, equipment and medium in micro-batch stream processing system - Google Patents


Info

Publication number
CN114780541A
CN114780541A (application CN202210339704.6A)
Authority
CN
China
Prior art keywords
key
data
batch
count
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210339704.6A
Other languages
Chinese (zh)
Other versions
CN114780541B (en)
Inventor
李书亮
高杨
王霄阳
Current Assignee
HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY
Zhejiang University ZJU
Original Assignee
HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY and Zhejiang University ZJU
Priority to CN202210339704.6A
Priority claimed from CN202210339704.6A
Publication of CN114780541A
Application granted
Publication of CN114780541B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — of structured data, e.g. relational data
    • G06F 16/22 — Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 — Indexing structures
    • G06F 16/2255 — Hash tables
    • G06F 16/2282 — Tablespace storage structures; Management thereof

Abstract

The application relates to the technical field of real-time data stream processing, and provides a data partitioning method and apparatus, computer device, storage medium and computer program product for a micro-batch stream processing system. In the method, a frequency-aware buffering technique minimizes the time required for preparation before batch partitioning: traversing a balanced binary tree yields an ordered list of keys and their frequency information, reducing the sorting time of the processing stage. The batch partitioning stage is abstracted as a classical bin packing problem that limits the degree of key fragmentation, minimizes the cardinality difference among data blocks and keeps the data block sizes equal, achieving load balance across data partitions. The processing stage is abstracted as a variable-capacity bin packing problem in which key clusters are assigned by a worst-fit algorithm, guaranteeing load balance among tasks. Data processing throughput can thereby be greatly improved without increasing latency.

Description

Data partitioning method, device, equipment and medium in micro-batch stream processing system
Technical Field
The present application relates to the field of real-time data stream processing technologies, and in particular, to a data partitioning method and apparatus in a micro-batch stream processing system, a computer device, a storage medium, and a computer program product.
Background
With the maturing of big data technology, the demand for real-time processing has become more extensive; it arises frequently in applications such as social network analysis and click-stream analysis. The importance of processing large data streams in real time is self-evident, which has led to the creation of a large number of distributed stream processing systems.
In the prior art, micro-batch stream processing systems such as Spark Streaming, Comet and Google Dataflow adopt a one-batch-at-a-time processing model to improve throughput; compared with traditional one-tuple-at-a-time stream processing systems, they offer higher speed and a more efficient fault-tolerance mechanism. However, with the basic data partitioning techniques used in this art, the performance of a micro-batch stream processing system is very sensitive to dynamic changes in load characteristics, and its resource utilization depends heavily on the workload being evenly divided across the processing units.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data partitioning method, apparatus, computer device, storage medium and computer program product for a micro-batch stream processing system.
In a first aspect, the present application provides a data partitioning method in a micro-batch stream processing system. The method comprises the following steps:
acquiring a data stream tuple;
maintaining the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key; the frequency count of the key is also saved in the balanced binary search tree; each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree;
traversing the balanced binary search tree to generate an ordered list; wherein the ordered list comprises each key, the frequency count of the key, and the tuple list corresponding to the key;
partitioning the data stream tuples in the ordered list in batches according to a preset partition condition; each partition is a data block, and each data block stores information on whether its keys have been divided; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; the output of the Map stage is a set of clusters formed by key-value pairs, and each key cluster has all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
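The buffering structure named in these steps can be sketched minimally in Python. This is a non-authoritative illustration: the class and method names are assumptions, and a library sort stands in for the in-order traversal of the balanced binary search tree (CountTree) that the method actually performs.

```python
class FrequencyAwareBuffer:
    """Sketch of the frequency-aware buffer: a hash table (HTable) maps each
    key to its tuple list and frequency count; ordered_list() stands in for
    an in-order traversal of the balanced count tree (CountTree)."""

    def __init__(self):
        self.htable = {}  # key -> [tuple_list, frequency_count]

    def insert(self, key, value):
        entry = self.htable.setdefault(key, [[], 0])
        entry[0].append((key, value))  # first pointer: tuple list of the key
        entry[1] += 1                  # second pointer would update CountTree directly

    def ordered_list(self):
        # Yields <k_i, count_i, tupleList_i> ordered by frequency; sorted()
        # replaces the CountTree traversal in this sketch.
        return sorted(((k, e[1], e[0]) for k, e in self.htable.items()),
                      key=lambda t: t[1], reverse=True)
```

In the actual method a balanced tree keeps the counts ordered incrementally, so no full sort is needed at batch boundaries; the sketch only reproduces the resulting ordered list.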
In one embodiment, the method further comprises: recording the processing time and batch interval of each batch; obtaining the ratio of each batch's processing time to its batch interval; obtaining, according to a preset ratio threshold, the count of consecutive batches whose ratio meets the preset ratio threshold; and adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count.

In one embodiment, the preset ratio threshold comprises a first ratio threshold, and obtaining the count of consecutive batches whose ratio meets the preset ratio threshold comprises: obtaining the count of consecutive batches whose ratio is greater than the first ratio threshold.

In one embodiment, adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count comprises: when the first consecutive batch count reaches a preset count threshold, adding a Map task if the data rate has increased, and adding a Reduce task if the data distribution has increased; wherein the first consecutive batch count is the count of consecutive batches whose ratio is greater than the first ratio threshold.

In one embodiment, the preset ratio threshold comprises a second ratio threshold, and obtaining the count of consecutive batches whose ratio meets the preset ratio threshold comprises: obtaining the count of consecutive batches whose ratio is smaller than the second ratio threshold.

In one embodiment, adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count comprises: when the second consecutive batch count reaches a preset count threshold, removing a Map task if the data rate has decreased, and removing a Reduce task if the data distribution has decreased; wherein the second consecutive batch count is the count of consecutive batches whose ratio is less than the second ratio threshold.
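The threshold-based adjustment described in these embodiments can be sketched as follows. The threshold values and function name are illustrative assumptions, not values taken from the application; the function only derives the scale-up/scale-down decision from the ratio of batch processing time to batch interval.

```python
def scaling_decision(ratios, upper=0.9, lower=0.5, count_threshold=3):
    """ratios[i] is the processing time of batch i divided by the batch
    interval.  Returns 'scale_up' when the last `count_threshold` ratios
    all exceed `upper` (add Map tasks if the data rate rose, Reduce tasks
    if the data distribution grew), 'scale_down' when they all fall below
    `lower`, and None otherwise."""
    if len(ratios) < count_threshold:
        return None
    tail = ratios[-count_threshold:]
    if all(r > upper for r in tail):
        return 'scale_up'
    if all(r < lower for r in tail):
        return 'scale_down'
    return None
```

A ratio near 1 means a batch barely finishes within its interval, so a sustained run above the upper threshold signals that parallelism should grow; a sustained run below the lower threshold signals spare capacity.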
In a second aspect, the present application further provides a data partitioning device in a micro-batch stream processing system.
The device comprises:
the acquisition module is used for acquiring a data stream tuple;
a maintenance module to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module, configured to traverse the balanced binary search tree to generate an ordered list, wherein the ordered list comprises each key, the frequency count of the key, and the tuple list corresponding to the key;
a partitioning module, configured to partition the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
a distribution processing module, configured to distribute, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
According to the data partitioning method, apparatus, computer device, storage medium and computer program product in the micro-batch stream processing system, the frequency-aware buffering technique minimizes the time required for preparation before batch partitioning: traversing the balanced binary tree yields an ordered list of keys and their frequency information, reducing the sorting time of the processing stage. The batch partitioning stage is abstracted as a classical bin packing problem that limits the degree of key fragmentation, minimizes the cardinality difference among data blocks and keeps the data block sizes equal, achieving load balance across data partitions; the processing stage is abstracted as a variable-capacity bin packing problem in which key clusters are assigned by the worst-fit algorithm, guaranteeing load balance among tasks. Data processing throughput can thereby be greatly improved without increasing latency.
Drawings
FIG. 1 is a flow diagram that illustrates a method for partitioning data in a micro batch stream processing system, according to one embodiment;
FIG. 2 is a flow diagram illustrating data caching and dynamic partitioning in a micro batch stream processing system, according to an embodiment;
FIG. 3 is a flow diagram of a frequency sensing technique in one embodiment;
FIG. 4 is a flow diagram of batch-stage load partitioning in one embodiment;
FIG. 5 is a flow diagram of process stage partitioning in one embodiment;
FIG. 6 is a block diagram of an apparatus for partitioning data in a micro batch processing system, according to one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The data partitioning method in the micro batch streaming system provided by the embodiment of the application can be applied to a server, and the server can be realized by an independent server or a server cluster formed by a plurality of servers.
The following describes a data partitioning method in a micro batch stream processing system provided in the present application in detail with reference to various embodiments and accompanying drawings.
In one embodiment, as shown in fig. 1 and in conjunction with fig. 2, there is provided a method for data partitioning in a micro-batch stream processing system, comprising:
step S101, acquiring a data stream tuple;
step S102, maintaining data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of a data stream tuple, a first pointer pointing to a tuple list corresponding to the key and frequency count of the key; the frequency count of the key is also saved to a balanced binary search tree; each key in the hash table is provided with a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
step S103, traversing the balanced binary search tree to generate an ordered list; wherein, the ordered list comprises keys, frequency counts of the keys and tuple lists corresponding to the keys;
the above steps S101 to S103 mainly use frequency-aware buffering technique to prepare the batch for partitioningThe time required is minimized. Specifically, the frequency-aware buffering technique: a batch of data stream tuples are obtained, and a hash table and a balanced binary search tree are established to maintain statistical information of the data stream tuples. Specifically, the data stream tuple is stored in the hash table HTable according to the key of the data stream tuple<k,vi>Wherein k is a bond, vitupleList for pointing to tuple list corresponding to each keyiThe hash table HTable also stores a frequency count of the keyiWhile, the key frequency countsiIt is saved in the balanced binary tree CountTree and each key in the hash table HTable possesses a bidirectional pointer (i.e. a second pointer) pointing to the corresponding frequency count node in the balanced binary tree CountTree, which second pointer allows to directly update the count node of the key. Based on this, traversal of the balanced binary tree CountTree generates an ordered list of keys and their associated frequency information<ki,counti,tupleListi>,kiRepresenting the ith key in the data stream tuple. With reference to fig. 3, the more detailed procedure is as follows:
input data stream S, batch interval tstart-tendAnd setting the update compensation budget and the initial frequency compensation f. Firstly, resetting a hash table HTable and a countTree used for saving frequency counts; then circularly traversing the data stream tuples received in the batch interval, and adding 1 to the data stream tuple count Nc; if the key of the data flow tuple is in the hash table HTable, the data flow tuple is inserted into the linked list of the key in the hash table HTable, and the frequency k.freq of the current key is updatedcurrDifference Delta between current key frequency and frequency before updatefreq=k.Freqcurr-k.FrequpdatedThe difference Delta between the present time and the last update timetime=Timenow-klastUpdateTimeIf the current frequency step k of the keyf.stepEqual to DeltafreqOr the current time step kt.stepEqual to DeltatimeThen k.freq in CountTree is updatedcurrBudget and k.frequpdatedIf k isf.stepEqual to DeltafreqThen update
Figure BDA0003578585830000071
If k ist.stepEqual to DeltatimeThen k is updatedt.step=(tend-TimeNow) And k, budget. If the key of the tuple is not in the hash table HTable, the counting value K of different keys is incremented by 1, the tuple is inserted into the hash table HTable, the key of the tuple is inserted into a balanced binary tree, and k.freq is initializedcurr、k.FrequpdatedTo 1, initialize kt.step=(tend-TimeNow)/budget、kf.step=f。
In order to increase the data processing rate, the data processing rate is often updated in a coarser granularity manner, that is, the budget is periodically updated within a certain time interval, where the budget is a compensation value determined according to the requirement, a control parameter fstep is defined, the count of the node is updated once every time fstep new tuples of the same key are received, and the initial fstep is set to a constant that can reflect the optimal step size
Figure BDA0003578585830000072
Wherein N isestIs the number of data elements in the next batch interval, K, at the average data rateAvgIs the average of the different bonds in the past several batches. Fstep is to adaptively update the estimation for each key according to the ratio of the frequency of the current key and the total number of tuples received in the current batch interval, namely, the key with higher frequency needs to receive more data tuples to trigger the update; to ensure that the nodes of all tuples are updated, a time-based control parameter tstep is set to update keys that have not been updated for a long time, which is estimated based on the elapsed time for the key's widget update and the batch interval remaining duration.
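The step-size computations for fstep and tstep can be sketched as small helper functions. The closed-form expressions and variable names here are assumptions derived from the description (N_est is the expected tuple count of the next batch, K_Avg the average number of distinct keys, and `budget` the number of count updates allotted per key per batch):

```python
def initial_fstep(n_est, k_avg, budget):
    # f = N_est / (K_Avg * budget): an average key's tuples, spread over `budget` updates
    return n_est / (k_avg * budget)

def adaptive_fstep(freq_curr, n_c, n_est, budget):
    # a key holding a share freq_curr / N_c of the traffic still triggers
    # roughly `budget` count updates over the next batch interval
    return (freq_curr / n_c) * n_est / budget

def tstep(t_end, t_now, budget):
    # time-based fallback: remaining batch time divided over the remaining budget
    return (t_end - t_now) / budget
```

Note that for a key of average frequency (freq_curr / n_c = 1 / K_Avg), adaptive_fstep reduces to initial_fstep, which is the consistency the description relies on.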
Step S104: partitioning the data stream tuples in the ordered list in batches based on a preset partition condition; each partition is a data block, and each data block stores information on whether its keys have been divided; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
Step S104 mainly performs balanced load partitioning. Specifically, all data stream tuples sharing the same key value are modeled as a single item, the data stream tuples in the ordered list of step S103 are partitioned in batches, each partition is a data block, each data block stores information on whether its keys have been divided, and the partitioning process must satisfy the preset partition conditions above.
Specifically, step S104 defines the batch partitioning problem as a balanced bin packing problem with divisible items. Given a set of N distinct items k_1, k_2, …, k_N, where item n has size S_n (1 ≤ n ≤ N), and a set of B bins B = {b_1, b_2, …, b_B}, each with capacity C, the balanced bin packing problem with divisible items is to assign the items to the bins such that the following conditions hold simultaneously: (1) for any b_j, j ∈ [1, B], the number of tuples in the bin equals the bin capacity C; (2) for any b_j, j ∈ [1, B], the number of distinct items in the bin is at least N/B; (3) every item is divided as few times as possible. When packing, a key whose frequency count count_i is greater than the ratio of the data block size to the data block cardinality is split into two items: one item, whose tuple count equals that ratio, is put into the current data block, and the other item is put into a new list. The remaining keys in the ordered list are then assigned to the data blocks in serpentine order, and finally the keys in the new list are assigned to the data blocks using a best-fit algorithm.
With reference to fig. 4, the more detailed procedure is as follows:
the process may consist of three independent loop traversal algorithms. The method comprises the following specific steps: a) traversing the binary tree to obtain an ordered list of keys and frequency information thereof<ki,counti,tupleListi>The tuple count value Nc and the count values K of different keys are used as input, and the required data partition number P is set; defining a partition size PSize=Nc/PPartition cardinality PkK/P, threshold S of split keycut=PSize/PkSetting the current partition bjIs the first partition b1(ii) a b) Traverse the keys in the list, when their countiGreater than ScutWhen it is, will ScutPut a tuple into bjMeanwhile, the rest part is put into a temporary list RList, and b corresponding to the key is updatedjAt position Pos (k) ═ bj(ii) a And is provided with bj=bj%PJ is self-incremented by 1, then step b) is repeated until no count is presentiGreater than Scut(ii) a c) Traversing the rest keys in the List, traversing the partitions bj, sequentially putting one key, reversing the partition sequence after traversing the partitions, and repeating the step c); d) and traversing the key in the temporary list RList, setting b to Pos (k), putting the key into b if the key can be put into b completely, otherwise filling the b completely, and then packing the rest part into the partition with the minimum residual capacity capable of accommodating the key.
Step S105: distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; the output of the Map stage is a set of clusters formed by key-value pairs, and each key cluster has all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
Step S105 is the processing-stage partition. In the balanced-load batch partitioning step, each data block carries information on whether its keys have been divided, and the data block is processed by a Map task that uses this key-division information to assign key clusters to the buckets of the Reduce stage for processing. The output of the Map stage is a set of clusters of key-value pairs, each key cluster holding all data values of the same key; a key cluster C_k can be represented as C_k = {(k, v_i) | v_i ∈ k}, where v_i is a data value corresponding to the key k. Suppose the K key clusters output by a given Map stage must be allocated to r Reduce buckets; the output of the Map task is I = {C_k | k ∈ K}. To guarantee load balance in the Reduce stage, the distribution over the Reduce buckets must be kept uniform, so the bucket capacity is set to

BucketSize = |I| / r.

The partitioning problem of the processing stage can thus be reduced to a bin packing problem, with a key cluster regarded as an item and a Reduce bucket as a bin. Unlike the batch partitioning problem, the processing-stage problem is a variable-capacity balanced bin packing problem, defined as follows: given a set of M items and A bins a_1, a_2, …, a_A, where bin a_j has capacity C_j, the variable-capacity balanced bin packing problem is to assign the items to the bins a_1, a_2, …, a_A such that (1) for any a_j, j ∈ [1, A], the number of tuples in the bin is smaller than the bin capacity C_j; and (2) for any a_j, j ∈ [1, A], the number of distinct items in the bin is at least M/A. With reference to fig. 5, the more detailed procedure is as follows:
as shown in fig. 2, the partition result will enter the Map task for processing, and fig. 5 shows a detailed process of allocating the Map task intermediate result to Reduce buckets. Firstly, the input information is a key cluster C obtained by a Map task, the data partition obtained through the steps contains information whether the keys are divided, a set R of all sockets in the Reduce stage is set with Bucksize |/| R |, and the divided keys are distributed by using a Hash algorithm, so that only the keys which are not divided are left in the key cluster, and the keys are sorted in a descending order. And traversing keys in the key cluster, allocating a larger key cluster to the R-th bucket as much as possible according to a worst adaptation algorithm, deleting the R-th bucket from the R, resetting the R to all the buckets if no bucket exists in the R, and continuously traversing the rest keys in the key cluster.
According to the data partitioning method in the micro-batch stream processing system, the frequency-aware buffering technique minimizes the preparation time needed before batch partitioning: an ordered list of keys and their frequency information is obtained by traversing the balanced binary tree, reducing sorting time in the processing stage. In the batch partitioning stage the problem is abstracted into a classical bin-packing problem, the degree of key fragmentation is limited, the cardinality difference between data blocks is minimized, and the data blocks are kept equal in size, achieving load-balanced data partitioning. In the processing stage the problem is abstracted into a variable-capacity bin-packing problem and key clusters are distributed with a worst-fit algorithm, ensuring load balance among tasks. As a result, data processing throughput can be greatly improved without increasing delay.
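The frequency-aware buffering step can be sketched as follows. The class and method names are illustrative only; note also that the patent keeps the frequency counts in a balanced binary search tree so that the ordered list falls out of an in-order traversal, whereas this sketch substitutes a sort at traversal time.

```python
class FrequencyAwareBuffer:
    """Sketch of the frequency-aware buffering structure (names assumed).

    A hash table (Python dict) maps each key to its tuple list and
    frequency count; the patent additionally mirrors the counts in a
    balanced binary search tree, with a second pointer from each key to
    its count node, so updates stay O(log n) and the ordered list comes
    from an in-order traversal. Here a sort stands in for that traversal.
    """

    def __init__(self):
        self.table = {}                     # key -> [tuple_list, count]

    def insert(self, key, value):
        entry = self.table.setdefault(key, [[], 0])
        entry[0].append(value)              # tuple list the first pointer targets
        entry[1] += 1                       # frequency count (mirrored in the BST)

    def ordered_list(self):
        # In-order traversal of the balanced BST would yield this directly.
        return [(k, cnt, vals)
                for k, (vals, cnt) in sorted(self.table.items(),
                                             key=lambda kv: kv[1][1],
                                             reverse=True)]
```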
In some embodiments, the above method may further comprise the steps of:
recording the processing time and batch interval of each batch; acquiring the proportion of each batch processing time to each batch interval; acquiring continuous batch counts with the proportion meeting a preset proportion threshold according to the preset proportion threshold; the Map task and/or Reduce task are adjusted based on the consecutive batch counts.
In this embodiment, dynamic resource management is achieved by setting a threshold on Map/Reduce task processing time to change the degree of parallelism at runtime: the Map and/or Reduce tasks are adjusted as the workload changes, specifically and continuously according to the ratio of each batch's processing time to the time interval between two batches of data stream tuples.
In some of these embodiments, the preset scaling threshold comprises a first scaling threshold; the obtaining of the continuous batch count whose proportion meets the preset proportion threshold according to the preset proportion threshold includes: a consecutive batch count is obtained having a proportion greater than a first proportion threshold.
Further, the adjusting Map tasks and/or Reduce tasks according to the continuous batch count specifically includes:
when the first continuous batch count reaches a preset count threshold, adding a Map task under the condition of increasing the data rate, and adding a Reduce task under the condition of increasing the data distribution; wherein the first consecutive batch count is a consecutive batch count having a ratio greater than a first ratio threshold.
In some further embodiments, the preset scaling threshold comprises a second scaling threshold; the aforementioned obtaining, according to the preset proportion threshold, the continuous batch count whose proportion satisfies the preset proportion threshold includes: a continuous batch count is obtained with a proportion less than a second proportion threshold.
Further, the adjusting Map tasks and/or Reduce tasks according to the continuous batch count specifically includes:
when the second continuous batch count reaches a preset count threshold, reducing Map tasks under the condition of reducing the data rate, and reducing Reduce tasks under the condition of reducing the data distribution; wherein the second consecutive batch count is a consecutive batch count having a proportion less than a second proportion threshold.
In the above embodiment, when the ratio of the processing time of each batch to the time interval between two batches of data stream tuples exceeds or falls below the set threshold in consecutive batches, the adjustment of the Map-Reduce task is triggered. The method comprises the following specific steps:
Stats_d is used to record the processing-time-to-interval ratio of the most recent d batches, together with the data rate and data-distribution state information. The ratio of processing time to batch interval for batch i is defined as

W_i = ProcessingTime_i / BatchInterval_i
For each batch, this ratio, the data rate, and the data distribution are added to Stats_d. Let the first proportional threshold be thres_1, and let count represent the number of consecutive batches with W_i > thres_1, i.e., the first consecutive batch count; whenever W_i < thres_1 occurs, the first consecutive batch count is reset to zero and counting restarts. When the first consecutive batch count equals d, that is, W_i exceeds thres_1 for d consecutive batches (the preset count threshold), the corresponding Map tasks are increased if the data rate has increased, and Reduce tasks are increased if the data distribution has increased. Similarly, let the second proportional threshold be thres_2; when W_i < thres_2 for d consecutive batches, the corresponding tasks are reduced according to the changes in data rate and data distribution: Map tasks are reduced if the data rate has decreased, and Reduce tasks are reduced if the data distribution has decreased.
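A minimal sketch of this threshold rule follows; the names (`ScalingController`, `observe`) are illustrative, and the rate/distribution changes are passed in as signed deltas rather than measured, which is an assumption made to keep the sketch self-contained.

```python
from collections import deque

class ScalingController:
    """Threshold-based Map/Reduce scaling rule (a sketch).

    Stats_d keeps the last d batches; W_i = processing time / batch
    interval. d consecutive batches with W_i > thres1 trigger scale-up;
    d consecutive batches with W_i < thres2 trigger scale-down.
    """

    def __init__(self, d, thres1, thres2):
        self.d, self.thres1, self.thres2 = d, thres1, thres2
        self.stats = deque(maxlen=d)   # (ratio, rate delta, distribution delta)
        self.high = 0                  # consecutive batches with W_i > thres1
        self.low = 0                   # consecutive batches with W_i < thres2

    def observe(self, proc_time, interval, rate_delta, dist_delta):
        w = proc_time / interval
        self.stats.append((w, rate_delta, dist_delta))
        self.high = self.high + 1 if w > self.thres1 else 0
        self.low = self.low + 1 if w < self.thres2 else 0
        actions = []
        if self.high >= self.d:        # sustained overload: add tasks
            self.high = 0
            if rate_delta > 0:
                actions.append("add_map")
            if dist_delta > 0:
                actions.append("add_reduce")
        elif self.low >= self.d:       # sustained slack: remove tasks
            self.low = 0
            if rate_delta < 0:
                actions.append("remove_map")
            if dist_delta < 0:
                actions.append("remove_reduce")
        return actions
```

Resetting the counter on each trigger means another full run of d out-of-threshold batches is required before the next adjustment, which damps oscillation.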
This embodiment adopts a dynamic resource management technique to adjust the load dynamically and change the degree of parallelism at runtime, making the method robust to fluctuations in data distribution and arrival rate and greatly improving data processing throughput without increasing delay.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the illustrated order and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a data partitioning apparatus in a micro-batch stream processing system for implementing the data partitioning method described above. The implementation scheme provided by the apparatus for solving the problem is similar to that described for the method, so for the specific limitations in the embodiments of the data partitioning apparatus provided below, reference may be made to the limitations on the data partitioning method in the micro-batch stream processing system, which are not repeated here.
In one embodiment, as shown in fig. 6, a data partitioning apparatus in a micro-batch stream processing system is provided, and the apparatus 600 may include:
an obtaining module 601, configured to obtain a data stream tuple;
a maintaining module 602, configured to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module 603, configured to traverse the balanced binary search tree to generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
a partitioning module 604, configured to partition the data stream tuples in the ordered list in batches based on a preset partitioning condition; each partition is a data block, and information whether a key is divided is stored in each data block; modeling all data stream tuples sharing the same key value as a single item, wherein the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
the allocation processing module 605 is configured to allocate, through a Map task and based on the worst-fit algorithm, the key clusters to buckets of the Reduce stage for processing, using the information about whether the keys in the data block were split; the output of the Map stage is a set of key clusters of key-value pairs, each key cluster holding all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
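The batch-partitioning condition enforced by the partitioning module (equal-capacity blocks, a cap on how many times a single item may be split, whole keys placed first so each block holds few distinct keys) can be sketched as follows. The function name, the `max_splits` parameter, and the greedy emptiest-block heuristic are assumptions for illustration, not the patent's exact algorithm.

```python
def partition_batch(ordered_list, num_blocks, max_splits=1):
    """Sketch of the balanced batch partition.

    ordered_list: (key, count, values) tuples in descending frequency
    order, as produced by the frequency-aware buffer. Returns num_blocks
    data blocks; each block entry is (key, values, was_split), the last
    field being the per-block "whether the key is divided" information
    that the processing stage later consumes.
    """
    total = sum(cnt for _, cnt, _ in ordered_list)
    capacity = -(-total // num_blocks)      # ceil: keeps block sizes equal
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for key, cnt, values in ordered_list:
        remaining = list(values)
        splits = 0
        while remaining:
            b = min(range(num_blocks), key=lambda i: loads[i])  # emptiest block
            room = capacity - loads[b]
            if len(remaining) <= room or splits >= max_splits:
                # Place the rest whole (or, once the split budget is spent,
                # accept a slight overflow rather than fragment further).
                blocks[b].append((key, remaining, splits > 0))
                loads[b] += len(remaining)
                remaining = []
            else:
                # Split the oversized key, consuming one of its splits.
                blocks[b].append((key, remaining[:room], True))
                loads[b] += room
                remaining = remaining[room:]
                splits += 1
    return blocks
```

Filling the emptiest block with the most frequent keys first keeps the number of distinct keys per block low while the capacity bound keeps the block sizes equal.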
In one embodiment, the apparatus 600 may further include:
the task adjusting module is used for recording each batch processing time and each batch interval; acquiring the proportion of each batch processing time to the batch interval; acquiring a continuous batch count of which the proportion meets a preset proportion threshold according to a preset proportion threshold; and adjusting the Map task and/or the Reduce task according to the continuous batch count.
In one embodiment, the preset proportional threshold comprises a first proportional threshold; and the task adjusting module is used for acquiring the continuous batch count of which the proportion is greater than the first proportion threshold value.
In one embodiment, the task adjusting module is configured to, when the first continuous batch count reaches a preset count threshold, add a Map task when the data rate is increased, and add a Reduce task when the data distribution is increased; wherein the first consecutive batch count is a consecutive batch count for which the ratio is greater than the first ratio threshold.
In one embodiment, the preset proportion threshold comprises a second proportion threshold; and the task adjusting module is used for acquiring the continuous batch count of which the proportion is smaller than the second proportion threshold value.
In one embodiment, the task adjusting module is configured to Reduce the Map task when the data rate is reduced and Reduce the Reduce task when the data distribution is reduced when the second consecutive batch count reaches a preset count threshold; wherein the second consecutive batch count is a consecutive batch count for which the ratio is less than the second ratio threshold.
The modules in the data partitioning device in the micro batch processing system can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as data stream tuples and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data partitioning in a micro-batch processing system.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, or the like.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of data partitioning in a micro-batch stream processing system, the method comprising:
acquiring a data stream tuple;
maintaining the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
traversing the balanced binary search tree to generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
partitioning the data stream tuples in the ordered list according to a preset partitioning condition; each partition is a data block, and information about whether the key is divided is stored in each data block; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
distributing, through a Map task and based on a worst-fit algorithm, the key clusters to buckets of a Reduce stage for processing by using the information about whether the keys in the data block are divided; wherein the output of the Map stage is a set of key clusters of key-value pairs, each key cluster having all data values of the same key; and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
2. The method of claim 1, further comprising:
recording the processing time and batch interval of each batch;
acquiring the proportion of each batch processing time to the batch interval;
acquiring a continuous batch count of which the proportion meets a preset proportion threshold according to a preset proportion threshold;
and adjusting the Map task and/or the Reduce task according to the continuous batch count.
3. The method of claim 2, wherein the preset scaling threshold comprises a first scaling threshold; the acquiring, according to a preset proportion threshold, a continuous batch count of which the proportion meets the preset proportion threshold includes:
obtaining a count of consecutive batches for which the ratio is greater than the first ratio threshold.
4. The method of claim 3, wherein said adjusting Map tasks and/or Reduce tasks based on said consecutive batch counts comprises:
when the first continuous batch count reaches a preset count threshold, adding a Map task under the condition of increasing the data rate, and adding a Reduce task under the condition of increasing the data distribution; wherein the first consecutive batch count is a consecutive batch count for which the ratio is greater than the first ratio threshold.
5. The method of claim 2, wherein the preset scaling threshold comprises a second scaling threshold; the acquiring, according to a preset proportion threshold, the continuous batch count of which the proportion meets the preset proportion threshold includes:
acquiring a continuous batch count of which the proportion is smaller than the second proportion threshold.
6. The method of claim 5, wherein said adjusting Map tasks and/or Reduce tasks based on said consecutive batch counts comprises:
when the second continuous batch count reaches a preset count threshold, reducing Map tasks under the condition of reducing the data rate, and reducing Reduce tasks under the condition of reducing the data distribution; wherein the second consecutive batch count is a consecutive batch count for which the ratio is less than the second ratio threshold.
7. An apparatus for dynamic data partitioning in a micro-batch streaming system, the apparatus comprising:
the acquisition module is used for acquiring data stream tuples;
a maintenance module to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module, configured to traverse the balanced binary search tree and generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
the partitioning module is used for partitioning the data stream tuples in the ordered list according to a preset partitioning condition; each partition is a data block, and information whether a key is divided is stored in each data block; modeling all data stream tuples sharing the same key value as a single item, wherein the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
the distribution processing module is configured to distribute, through a Map task and based on a worst-fit algorithm, the key clusters to buckets of the Reduce stage for processing by using the information about whether the keys in the data blocks are divided; wherein the output of the Map stage is a set of key clusters of key-value pairs, each key cluster having all data values of the same key; and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202210339704.6A 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system Active CN114780541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339704.6A CN114780541B (en) 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339704.6A CN114780541B (en) 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system

Publications (2)

Publication Number Publication Date
CN114780541A true CN114780541A (en) 2022-07-22
CN114780541B CN114780541B (en) 2024-04-12

Family

ID=

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462582A (en) * 2014-12-30 2015-03-25 武汉大学 Web data similarity detection method based on two-stage filtration of structure and content
WO2017031961A1 (en) * 2015-08-24 2017-03-02 华为技术有限公司 Data processing method and apparatus
US9613127B1 (en) * 2014-06-30 2017-04-04 Quantcast Corporation Automated load-balancing of partitions in arbitrarily imbalanced distributed mapreduce computations
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 A kind of data distributing method, device and computer readable storage medium based on MapReduce
CN109325034A (en) * 2018-10-12 2019-02-12 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN110955732A (en) * 2019-12-16 2020-04-03 湖南大学 Method and system for realizing partition load balance in Spark environment
CN111858607A (en) * 2020-07-24 2020-10-30 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
CN113468178A (en) * 2021-07-07 2021-10-01 武汉达梦数据库股份有限公司 Data partition loading method and device of association table

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
S. ANCY等: "Locality based data partitioning in Map reduce", 2016 INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS, AND OPTIMIZATION TECHNIQUES, pages 4869 - 4874 *
周华平;刘光宗;张贝贝;: "基于索引偏移的MapReduce聚类负载均衡策略", 计算机科学, no. 05, pages 310 - 316 *
门威;: "基于MapReduce的大数据处理算法综述", 濮阳职业技术学院学报, no. 05, pages 91 - 94 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant