CN114780541A - Data partitioning method, device, equipment and medium in micro-batch stream processing system - Google Patents


Info

Publication number
CN114780541A
CN114780541A (application CN202210339704.6A)
Authority
CN
China
Prior art keywords
key
data
batch
count
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210339704.6A
Other languages
Chinese (zh)
Other versions
CN114780541B (en)
Inventor
李书亮
高杨
王霄阳
Current Assignee
HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY
Zhejiang University ZJU
Original Assignee
HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by HONG KONG-ZHUHAI-MACAO BRIDGE AUTHORITY and Zhejiang University ZJU
Priority to CN202210339704.6A
Priority claimed from CN202210339704.6A
Publication of CN114780541A
Application granted
Publication of CN114780541B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — of structured data, e.g. relational data
    • G06F 16/22 — Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 — Indexing structures
    • G06F 16/2255 — Hash tables
    • G06F 16/2282 — Tablespace storage structures; Management thereof

Abstract

The application relates to the technical field of real-time data stream processing, and provides a data partitioning method and apparatus, computer device, storage medium and computer program product for a micro-batch stream processing system. In the method, a frequency-aware buffering technique minimizes the time required for preparation before batch partitioning: traversing a balanced binary tree yields an ordered list of keys and their frequency information, reducing the sorting time of the processing stage. The batch partitioning stage is abstracted as a classical bin packing problem that limits the degree of key fragmentation, minimizes the cardinality difference among data blocks and keeps the data block sizes equal, achieving load balance across data partitions. The processing stage is abstracted as a variable-capacity bin packing problem in which key clusters are assigned by a worst-fit algorithm, guaranteeing load balance among tasks. Data processing throughput can thereby be greatly improved without increasing latency.

Description

Data partitioning method, device, equipment and medium in micro-batch stream processing system
Technical Field
The present application relates to the field of real-time data stream processing technologies, and in particular, to a data partitioning method and apparatus in a micro-batch stream processing system, a computer device, a storage medium, and a computer program product.
Background
With the maturing of big data technology, the demand for real-time processing has become more extensive; it arises frequently in applications such as social network analysis and click-stream analysis. The importance of processing large data streams in real time is self-evident, which has led to the creation of a large number of distributed stream processing systems.
In the prior art, micro-batch stream processing systems such as Spark Streaming, Comet and Google Dataflow adopt a one-batch-at-a-time processing model to improve throughput; compared with traditional one-tuple-at-a-time stream processing systems, they offer higher speed and a more efficient fault-tolerance mechanism. However, with the basic data partitioning techniques used in this art, the performance of a micro-batch stream processing system is very sensitive to dynamic changes in load characteristics, and its resource utilization depends heavily on the workload being evenly divided across the processing units.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data partitioning method, apparatus, computer device, storage medium and computer program product for a micro-batch stream processing system.
In a first aspect, the present application provides a data partitioning method in a micro-batch stream processing system. The method comprises the following steps:
acquiring a data stream tuple;
maintaining the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key; the frequency count of the key is also saved in the balanced binary search tree; each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree;
traversing the balanced binary search tree to generate an ordered list; wherein the ordered list comprises each key, the frequency count of the key, and the tuple list corresponding to the key;
partitioning the data stream tuples in the ordered list in batches according to a preset partition condition; each partition is a data block, and each data block stores information on whether its keys have been divided; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; the output of the Map stage is a set of clusters formed by key-value pairs, and each key cluster has all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
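The buffering structure named in these steps can be sketched minimally in Python. This is a non-authoritative illustration: the class and method names are assumptions, and a library sort stands in for the in-order traversal of the balanced binary search tree (CountTree) that the method actually performs.

```python
class FrequencyAwareBuffer:
    """Sketch of the frequency-aware buffer: a hash table (HTable) maps each
    key to its tuple list and frequency count; ordered_list() stands in for
    an in-order traversal of the balanced count tree (CountTree)."""

    def __init__(self):
        self.htable = {}  # key -> [tuple_list, frequency_count]

    def insert(self, key, value):
        entry = self.htable.setdefault(key, [[], 0])
        entry[0].append((key, value))  # first pointer: tuple list of the key
        entry[1] += 1                  # second pointer would update CountTree directly

    def ordered_list(self):
        # Yields <k_i, count_i, tupleList_i> ordered by frequency; sorted()
        # replaces the CountTree traversal in this sketch.
        return sorted(((k, e[1], e[0]) for k, e in self.htable.items()),
                      key=lambda t: t[1], reverse=True)
```

In the actual method a balanced tree keeps the counts ordered incrementally, so no full sort is needed at batch boundaries; the sketch only reproduces the resulting ordered list.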
In one embodiment, the method further comprises: recording the processing time and batch interval of each batch; obtaining the ratio of each batch's processing time to its batch interval; obtaining, according to a preset ratio threshold, the count of consecutive batches whose ratio meets the preset ratio threshold; and adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count.

In one embodiment, the preset ratio threshold comprises a first ratio threshold, and obtaining the count of consecutive batches whose ratio meets the preset ratio threshold comprises: obtaining the count of consecutive batches whose ratio is greater than the first ratio threshold.

In one embodiment, adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count comprises: when the first consecutive batch count reaches a preset count threshold, adding a Map task if the data rate has increased, and adding a Reduce task if the data distribution has increased; wherein the first consecutive batch count is the count of consecutive batches whose ratio is greater than the first ratio threshold.

In one embodiment, the preset ratio threshold comprises a second ratio threshold, and obtaining the count of consecutive batches whose ratio meets the preset ratio threshold comprises: obtaining the count of consecutive batches whose ratio is smaller than the second ratio threshold.

In one embodiment, adjusting the Map tasks and/or Reduce tasks according to the consecutive batch count comprises: when the second consecutive batch count reaches a preset count threshold, removing a Map task if the data rate has decreased, and removing a Reduce task if the data distribution has decreased; wherein the second consecutive batch count is the count of consecutive batches whose ratio is less than the second ratio threshold.
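The threshold-based adjustment described in these embodiments can be sketched as follows. The threshold values and function name are illustrative assumptions, not values taken from the application; the function only derives the scale-up/scale-down decision from the ratio of batch processing time to batch interval.

```python
def scaling_decision(ratios, upper=0.9, lower=0.5, count_threshold=3):
    """ratios[i] is the processing time of batch i divided by the batch
    interval.  Returns 'scale_up' when the last `count_threshold` ratios
    all exceed `upper` (add Map tasks if the data rate rose, Reduce tasks
    if the data distribution grew), 'scale_down' when they all fall below
    `lower`, and None otherwise."""
    if len(ratios) < count_threshold:
        return None
    tail = ratios[-count_threshold:]
    if all(r > upper for r in tail):
        return 'scale_up'
    if all(r < lower for r in tail):
        return 'scale_down'
    return None
```

A ratio near 1 means a batch barely finishes within its interval, so a sustained run above the upper threshold signals that parallelism should grow; a sustained run below the lower threshold signals spare capacity.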
In a second aspect, the present application further provides a data partitioning device in a micro-batch stream processing system.
The device comprises:
the acquisition module is used for acquiring a data stream tuple;
a maintenance module to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module, configured to traverse the balanced binary search tree to generate an ordered list, wherein the ordered list comprises each key, the frequency count of the key, and the tuple list corresponding to the key;
a partitioning module, configured to partition the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
a distribution processing module, configured to distribute, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring a data stream tuple; maintaining the data stream tuples based on a hash table and a balanced binary search tree, wherein the hash table stores a key of each data stream tuple, a first pointer to the tuple list corresponding to the key, and a frequency count of the key, the frequency count of the key is also saved in the balanced binary search tree, and each key in the hash table has a second pointer to the corresponding frequency-count node in the balanced binary search tree; traversing the balanced binary search tree to generate an ordered list comprising each key, its frequency count, and the tuple list corresponding to the key; partitioning the data stream tuples in the ordered list in batches based on a preset partition condition, wherein each partition is a data block, each data block stores information on whether its keys have been divided, all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal; and distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing, wherein the output of the Map stage is a set of clusters formed by key-value pairs, each key cluster has all data values of the same key, and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
According to the data partitioning method, apparatus, computer device, storage medium and computer program product in the micro-batch stream processing system, the frequency-aware buffering technique minimizes the time required for preparation before batch partitioning: traversing the balanced binary tree yields an ordered list of keys and their frequency information, reducing the sorting time of the processing stage. The batch partitioning stage is abstracted as a classical bin packing problem that limits the degree of key fragmentation, minimizes the cardinality difference among data blocks and keeps the data block sizes equal, achieving load balance across data partitions; the processing stage is abstracted as a variable-capacity bin packing problem in which key clusters are assigned by the worst-fit algorithm, guaranteeing load balance among tasks. Data processing throughput can thereby be greatly improved without increasing latency.
Drawings
FIG. 1 is a flow diagram that illustrates a method for partitioning data in a micro batch stream processing system, according to one embodiment;
FIG. 2 is a flow diagram illustrating data caching and dynamic partitioning in a micro batch stream processing system, according to an embodiment;
FIG. 3 is a flow diagram of a frequency sensing technique in one embodiment;
FIG. 4 is a flow diagram of batch-stage load partitioning in one embodiment;
FIG. 5 is a flow diagram of process stage partitioning in one embodiment;
FIG. 6 is a block diagram of an apparatus for partitioning data in a micro batch processing system, according to one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The data partitioning method in the micro batch streaming system provided by the embodiment of the application can be applied to a server, and the server can be realized by an independent server or a server cluster formed by a plurality of servers.
The following describes a data partitioning method in a micro batch stream processing system provided in the present application in detail with reference to various embodiments and accompanying drawings.
In one embodiment, as shown in fig. 1 and in conjunction with fig. 2, there is provided a method for data partitioning in a micro-batch stream processing system, comprising:
step S101, acquiring a data stream tuple;
step S102, maintaining data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of a data stream tuple, a first pointer pointing to a tuple list corresponding to the key and frequency count of the key; the frequency count of the key is also saved to a balanced binary search tree; each key in the hash table is provided with a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
step S103, traversing the balanced binary search tree to generate an ordered list; wherein, the ordered list comprises keys, frequency counts of the keys and tuple lists corresponding to the keys;
the above steps S101 to S103 mainly use frequency-aware buffering technique to prepare the batch for partitioningThe time required is minimized. Specifically, the frequency-aware buffering technique: a batch of data stream tuples are obtained, and a hash table and a balanced binary search tree are established to maintain statistical information of the data stream tuples. Specifically, the data stream tuple is stored in the hash table HTable according to the key of the data stream tuple<k,vi>Wherein k is a bond, vitupleList for pointing to tuple list corresponding to each keyiThe hash table HTable also stores a frequency count of the keyiWhile, the key frequency countsiIt is saved in the balanced binary tree CountTree and each key in the hash table HTable possesses a bidirectional pointer (i.e. a second pointer) pointing to the corresponding frequency count node in the balanced binary tree CountTree, which second pointer allows to directly update the count node of the key. Based on this, traversal of the balanced binary tree CountTree generates an ordered list of keys and their associated frequency information<ki,counti,tupleListi>,kiRepresenting the ith key in the data stream tuple. With reference to fig. 3, the more detailed procedure is as follows:
input data stream S, batch interval tstart-tendAnd setting the update compensation budget and the initial frequency compensation f. Firstly, resetting a hash table HTable and a countTree used for saving frequency counts; then circularly traversing the data stream tuples received in the batch interval, and adding 1 to the data stream tuple count Nc; if the key of the data flow tuple is in the hash table HTable, the data flow tuple is inserted into the linked list of the key in the hash table HTable, and the frequency k.freq of the current key is updatedcurrDifference Delta between current key frequency and frequency before updatefreq=k.Freqcurr-k.FrequpdatedThe difference Delta between the present time and the last update timetime=Timenow-klastUpdateTimeIf the current frequency step k of the keyf.stepEqual to DeltafreqOr the current time step kt.stepEqual to DeltatimeThen k.freq in CountTree is updatedcurrBudget and k.frequpdatedIf k isf.stepEqual to DeltafreqThen update
Figure BDA0003578585830000071
If k ist.stepEqual to DeltatimeThen k is updatedt.step=(tend-TimeNow) And k, budget. If the key of the tuple is not in the hash table HTable, the counting value K of different keys is incremented by 1, the tuple is inserted into the hash table HTable, the key of the tuple is inserted into a balanced binary tree, and k.freq is initializedcurr、k.FrequpdatedTo 1, initialize kt.step=(tend-TimeNow)/budget、kf.step=f。
In order to increase the data processing rate, the data processing rate is often updated in a coarser granularity manner, that is, the budget is periodically updated within a certain time interval, where the budget is a compensation value determined according to the requirement, a control parameter fstep is defined, the count of the node is updated once every time fstep new tuples of the same key are received, and the initial fstep is set to a constant that can reflect the optimal step size
Figure BDA0003578585830000072
Wherein N isestIs the number of data elements in the next batch interval, K, at the average data rateAvgIs the average of the different bonds in the past several batches. Fstep is to adaptively update the estimation for each key according to the ratio of the frequency of the current key and the total number of tuples received in the current batch interval, namely, the key with higher frequency needs to receive more data tuples to trigger the update; to ensure that the nodes of all tuples are updated, a time-based control parameter tstep is set to update keys that have not been updated for a long time, which is estimated based on the elapsed time for the key's widget update and the batch interval remaining duration.
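The step-size computations for fstep and tstep can be sketched as small helper functions. The closed-form expressions and variable names here are assumptions derived from the description (N_est is the expected tuple count of the next batch, K_Avg the average number of distinct keys, and `budget` the number of count updates allotted per key per batch):

```python
def initial_fstep(n_est, k_avg, budget):
    # f = N_est / (K_Avg * budget): an average key's tuples, spread over `budget` updates
    return n_est / (k_avg * budget)

def adaptive_fstep(freq_curr, n_c, n_est, budget):
    # a key holding a share freq_curr / N_c of the traffic still triggers
    # roughly `budget` count updates over the next batch interval
    return (freq_curr / n_c) * n_est / budget

def tstep(t_end, t_now, budget):
    # time-based fallback: remaining batch time divided over the remaining budget
    return (t_end - t_now) / budget
```

Note that for a key of average frequency (freq_curr / n_c = 1 / K_Avg), adaptive_fstep reduces to initial_fstep, which is the consistency the description relies on.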
Step S104: partitioning the data stream tuples in the ordered list in batches based on a preset partition condition; each partition is a data block, and each data block stores information on whether its keys have been divided; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the number of times the single items are split, minimizing the number of distinct single items in the data blocks, and keeping the capacities of the data blocks equal;
Step S104 mainly performs balanced load partitioning. Specifically, all data stream tuples sharing the same key value are modeled as a single item, the data stream tuples in the ordered list of step S103 are partitioned in batches, each partition is a data block, each data block stores information on whether its keys have been divided, and the partitioning process must satisfy the preset partition conditions above.
Specifically, step S104 defines the batch partitioning problem as a balanced bin packing problem with divisible items. Given a set of N distinct items k_1, k_2, …, k_N, where item n has size S_n (1 ≤ n ≤ N), and a set of B bins B = {b_1, b_2, …, b_B}, each with capacity C, the balanced bin packing problem with divisible items is to assign the items to the bins such that the following conditions hold simultaneously: (1) for any b_j, j ∈ [1, B], the number of tuples in the bin equals the bin capacity C; (2) for any b_j, j ∈ [1, B], the number of distinct items in the bin is at least N/B; (3) every item is divided as few times as possible. When packing, a key whose frequency count count_i is greater than the ratio of the data block size to the data block cardinality is split into two items: one item, whose tuple count equals that ratio, is put into the current data block, and the other item is put into a new list. The remaining keys in the ordered list are then assigned to the data blocks in serpentine order, and finally the keys in the new list are assigned to the data blocks using a best-fit algorithm.
With reference to fig. 4, the more detailed procedure is as follows:
the process may consist of three independent loop traversal algorithms. The method comprises the following specific steps: a) traversing the binary tree to obtain an ordered list of keys and frequency information thereof<ki,counti,tupleListi>The tuple count value Nc and the count values K of different keys are used as input, and the required data partition number P is set; defining a partition size PSize=Nc/PPartition cardinality PkK/P, threshold S of split keycut=PSize/PkSetting the current partition bjIs the first partition b1(ii) a b) Traverse the keys in the list, when their countiGreater than ScutWhen it is, will ScutPut a tuple into bjMeanwhile, the rest part is put into a temporary list RList, and b corresponding to the key is updatedjAt position Pos (k) ═ bj(ii) a And is provided with bj=bj%PJ is self-incremented by 1, then step b) is repeated until no count is presentiGreater than Scut(ii) a c) Traversing the rest keys in the List, traversing the partitions bj, sequentially putting one key, reversing the partition sequence after traversing the partitions, and repeating the step c); d) and traversing the key in the temporary list RList, setting b to Pos (k), putting the key into b if the key can be put into b completely, otherwise filling the b completely, and then packing the rest part into the partition with the minimum residual capacity capable of accommodating the key.
Step S105: distributing, by a Map task based on a worst-fit algorithm and using the information on whether the keys in the data block have been divided, the key clusters to the buckets of the Reduce stage for processing; the output of the Map stage is a set of clusters formed by key-value pairs, and each key cluster has all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
Step S105 is the processing-stage partition. In the balanced-load batch partitioning step, each data block carries information on whether its keys have been divided, and the data block is processed by a Map task that uses this key-division information to assign key clusters to the buckets of the Reduce stage for processing. The output of the Map stage is a set of clusters of key-value pairs, each key cluster holding all data values of the same key; a key cluster C_k can be represented as C_k = {(k, v_i) | v_i ∈ k}, where v_i is a data value corresponding to the key k. Suppose the K key clusters output by a given Map stage must be allocated to r Reduce buckets; the output of the Map task is I = {C_k | k ∈ K}. To guarantee load balance in the Reduce stage, the distribution over the Reduce buckets must be kept uniform, so the bucket capacity is set to

BucketSize = |I| / r.

The partitioning problem of the processing stage can thus be reduced to a bin packing problem, with a key cluster regarded as an item and a Reduce bucket as a bin. Unlike the batch partitioning problem, the processing-stage problem is a variable-capacity balanced bin packing problem, defined as follows: given a set of M items and A bins a_1, a_2, …, a_A, where bin a_j has capacity C_j, the variable-capacity balanced bin packing problem is to assign the items to the bins a_1, a_2, …, a_A such that (1) for any a_j, j ∈ [1, A], the number of tuples in the bin is smaller than the bin capacity C_j; and (2) for any a_j, j ∈ [1, A], the number of distinct items in the bin is at least M/A. With reference to fig. 5, the more detailed procedure is as follows:
as shown in fig. 2, the partition result will enter the Map task for processing, and fig. 5 shows a detailed process of allocating the Map task intermediate result to Reduce buckets. Firstly, the input information is a key cluster C obtained by a Map task, the data partition obtained through the steps contains information whether the keys are divided, a set R of all sockets in the Reduce stage is set with Bucksize |/| R |, and the divided keys are distributed by using a Hash algorithm, so that only the keys which are not divided are left in the key cluster, and the keys are sorted in a descending order. And traversing keys in the key cluster, allocating a larger key cluster to the R-th bucket as much as possible according to a worst adaptation algorithm, deleting the R-th bucket from the R, resetting the R to all the buckets if no bucket exists in the R, and continuously traversing the rest keys in the key cluster.
According to the data partitioning method in the micro-batch stream processing system, the frequency-aware buffering technique minimizes the preparation time needed before batch partitioning: an ordered list of keys and their frequency information is obtained by traversing the balanced binary tree, reducing sorting time in the processing stage. In the batch partitioning stage the problem is abstracted into a classical bin-packing problem, the degree of key fragmentation is limited, the cardinality difference between data blocks is minimized, and the data blocks are kept equal in size, achieving load-balanced data partitioning. In the processing stage the problem is abstracted into a variable-capacity bin-packing problem and key clusters are distributed with a worst-fit algorithm, ensuring load balance among tasks. As a result, data processing throughput can be greatly improved without increasing delay.
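The frequency-aware buffering step can be sketched as follows. The class and method names are illustrative only; note also that the patent keeps the frequency counts in a balanced binary search tree so that the ordered list falls out of an in-order traversal, whereas this sketch substitutes a sort at traversal time.

```python
class FrequencyAwareBuffer:
    """Sketch of the frequency-aware buffering structure (names assumed).

    A hash table (Python dict) maps each key to its tuple list and
    frequency count; the patent additionally mirrors the counts in a
    balanced binary search tree, with a second pointer from each key to
    its count node, so updates stay O(log n) and the ordered list comes
    from an in-order traversal. Here a sort stands in for that traversal.
    """

    def __init__(self):
        self.table = {}                     # key -> [tuple_list, count]

    def insert(self, key, value):
        entry = self.table.setdefault(key, [[], 0])
        entry[0].append(value)              # tuple list the first pointer targets
        entry[1] += 1                       # frequency count (mirrored in the BST)

    def ordered_list(self):
        # In-order traversal of the balanced BST would yield this directly.
        return [(k, cnt, vals)
                for k, (vals, cnt) in sorted(self.table.items(),
                                             key=lambda kv: kv[1][1],
                                             reverse=True)]
```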
In some embodiments, the above method may further comprise the steps of:
recording the processing time and batch interval of each batch; acquiring the proportion of each batch processing time to each batch interval; acquiring continuous batch counts with the proportion meeting a preset proportion threshold according to the preset proportion threshold; the Map task and/or Reduce task are adjusted based on the consecutive batch counts.
In this embodiment, dynamic resource management is achieved by setting a threshold on Map/Reduce task processing time to change the degree of parallelism at runtime: the Map and/or Reduce tasks are adjusted as the workload changes, specifically and continuously according to the ratio of each batch's processing time to the time interval between two batches of data stream tuples.
In some of these embodiments, the preset scaling threshold comprises a first scaling threshold; the obtaining of the continuous batch count whose proportion meets the preset proportion threshold according to the preset proportion threshold includes: a consecutive batch count is obtained having a proportion greater than a first proportion threshold.
Further, the adjusting Map tasks and/or Reduce tasks according to the continuous batch count specifically includes:
when the first continuous batch count reaches a preset count threshold, adding a Map task under the condition of increasing the data rate, and adding a Reduce task under the condition of increasing the data distribution; wherein the first consecutive batch count is a consecutive batch count having a ratio greater than a first ratio threshold.
In some further embodiments, the preset scaling threshold comprises a second scaling threshold; the aforementioned obtaining, according to the preset proportion threshold, the continuous batch count whose proportion satisfies the preset proportion threshold includes: a continuous batch count is obtained with a proportion less than a second proportion threshold.
Further, the adjusting Map tasks and/or Reduce tasks according to the continuous batch count specifically includes:
when the second continuous batch count reaches a preset count threshold, reducing Map tasks under the condition of reducing the data rate, and reducing Reduce tasks under the condition of reducing the data distribution; wherein the second consecutive batch count is a consecutive batch count having a proportion less than a second proportion threshold.
In the above embodiment, when the ratio of the processing time of each batch to the time interval between two batches of data stream tuples exceeds or falls below the set threshold in consecutive batches, the adjustment of the Map-Reduce task is triggered. The method comprises the following specific steps:
Stats_d is used to record the processing-time-to-interval ratio of the most recent d batches, together with the data rate and data-distribution state information. The ratio of processing time to batch interval for batch i is defined as

W_i = ProcessingTime_i / BatchInterval_i
For each batch, this ratio, the data rate, and the data distribution are added to Stats_d. Let the first proportional threshold be thres_1, and let count represent the number of consecutive batches with W_i > thres_1, i.e., the first consecutive batch count; whenever W_i < thres_1 occurs, the first consecutive batch count is reset to zero and counting restarts. When the first consecutive batch count equals d, that is, W_i exceeds thres_1 for d consecutive batches (the preset count threshold), the corresponding Map tasks are increased if the data rate has increased, and Reduce tasks are increased if the data distribution has increased. Similarly, let the second proportional threshold be thres_2; when W_i < thres_2 for d consecutive batches, the corresponding tasks are reduced according to the changes in data rate and data distribution: Map tasks are reduced if the data rate has decreased, and Reduce tasks are reduced if the data distribution has decreased.
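A minimal sketch of this threshold rule follows; the names (`ScalingController`, `observe`) are illustrative, and the rate/distribution changes are passed in as signed deltas rather than measured, which is an assumption made to keep the sketch self-contained.

```python
from collections import deque

class ScalingController:
    """Threshold-based Map/Reduce scaling rule (a sketch).

    Stats_d keeps the last d batches; W_i = processing time / batch
    interval. d consecutive batches with W_i > thres1 trigger scale-up;
    d consecutive batches with W_i < thres2 trigger scale-down.
    """

    def __init__(self, d, thres1, thres2):
        self.d, self.thres1, self.thres2 = d, thres1, thres2
        self.stats = deque(maxlen=d)   # (ratio, rate delta, distribution delta)
        self.high = 0                  # consecutive batches with W_i > thres1
        self.low = 0                   # consecutive batches with W_i < thres2

    def observe(self, proc_time, interval, rate_delta, dist_delta):
        w = proc_time / interval
        self.stats.append((w, rate_delta, dist_delta))
        self.high = self.high + 1 if w > self.thres1 else 0
        self.low = self.low + 1 if w < self.thres2 else 0
        actions = []
        if self.high >= self.d:        # sustained overload: add tasks
            self.high = 0
            if rate_delta > 0:
                actions.append("add_map")
            if dist_delta > 0:
                actions.append("add_reduce")
        elif self.low >= self.d:       # sustained slack: remove tasks
            self.low = 0
            if rate_delta < 0:
                actions.append("remove_map")
            if dist_delta < 0:
                actions.append("remove_reduce")
        return actions
```

Resetting the counter on each trigger means another full run of d out-of-threshold batches is required before the next adjustment, which damps oscillation.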
This embodiment adopts a dynamic resource management technique to adjust the load dynamically and change the degree of parallelism at runtime, making the method robust to fluctuations in data distribution and arrival rate and greatly improving data processing throughput without increasing delay.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the illustrated order and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a data partitioning apparatus in a micro-batch stream processing system for implementing the data partitioning method described above. The implementation scheme provided by the apparatus for solving the problem is similar to that described for the method, so for the specific limitations in the embodiments of the data partitioning apparatus provided below, reference may be made to the limitations on the data partitioning method in the micro-batch stream processing system, which are not repeated here.
In one embodiment, as shown in fig. 6, a data partitioning apparatus in a micro-batch stream processing system is provided, and the apparatus 600 may include:
an obtaining module 601, configured to obtain a data stream tuple;
a maintaining module 602, configured to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module 603, configured to traverse the balanced binary search tree to generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
a partitioning module 604, configured to partition the data stream tuples in the ordered list in batches based on a preset partitioning condition; each partition is a data block, and information whether a key is divided is stored in each data block; modeling all data stream tuples sharing the same key value as a single item, wherein the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
the allocation processing module 605 is configured to allocate, through a Map task and based on the worst-fit algorithm, the key clusters to buckets of the Reduce stage for processing, using the information about whether the keys in the data block were split; the output of the Map stage is a set of key clusters of key-value pairs, each key cluster holding all data values of the same key; the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
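The batch-partitioning condition enforced by the partitioning module (equal-capacity blocks, a cap on how many times a single item may be split, whole keys placed first so each block holds few distinct keys) can be sketched as follows. The function name, the `max_splits` parameter, and the greedy emptiest-block heuristic are assumptions for illustration, not the patent's exact algorithm.

```python
def partition_batch(ordered_list, num_blocks, max_splits=1):
    """Sketch of the balanced batch partition.

    ordered_list: (key, count, values) tuples in descending frequency
    order, as produced by the frequency-aware buffer. Returns num_blocks
    data blocks; each block entry is (key, values, was_split), the last
    field being the per-block "whether the key is divided" information
    that the processing stage later consumes.
    """
    total = sum(cnt for _, cnt, _ in ordered_list)
    capacity = -(-total // num_blocks)      # ceil: keeps block sizes equal
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for key, cnt, values in ordered_list:
        remaining = list(values)
        splits = 0
        while remaining:
            b = min(range(num_blocks), key=lambda i: loads[i])  # emptiest block
            room = capacity - loads[b]
            if len(remaining) <= room or splits >= max_splits:
                # Place the rest whole (or, once the split budget is spent,
                # accept a slight overflow rather than fragment further).
                blocks[b].append((key, remaining, splits > 0))
                loads[b] += len(remaining)
                remaining = []
            else:
                # Split the oversized key, consuming one of its splits.
                blocks[b].append((key, remaining[:room], True))
                loads[b] += room
                remaining = remaining[room:]
                splits += 1
    return blocks
```

Filling the emptiest block with the most frequent keys first keeps the number of distinct keys per block low while the capacity bound keeps the block sizes equal.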
In one embodiment, the apparatus 600 may further include:
the task adjusting module is used for recording each batch processing time and each batch interval; acquiring the proportion of each batch processing time to the batch interval; acquiring a continuous batch count of which the proportion meets a preset proportion threshold according to a preset proportion threshold; and adjusting the Map task and/or the Reduce task according to the continuous batch count.
In one embodiment, the preset proportional threshold comprises a first proportional threshold; and the task adjusting module is used for acquiring the continuous batch count of which the proportion is greater than the first proportion threshold value.
In one embodiment, the task adjusting module is configured to, when the first continuous batch count reaches a preset count threshold, add a Map task when the data rate is increased, and add a Reduce task when the data distribution is increased; wherein the first consecutive batch count is a consecutive batch count for which the ratio is greater than the first ratio threshold.
In one embodiment, the preset proportion threshold comprises a second proportion threshold; and the task adjusting module is used for acquiring the continuous batch count of which the proportion is smaller than the second proportion threshold value.
In one embodiment, the task adjusting module is configured to Reduce the Map task when the data rate is reduced and Reduce the Reduce task when the data distribution is reduced when the second consecutive batch count reaches a preset count threshold; wherein the second consecutive batch count is a consecutive batch count for which the ratio is less than the second ratio threshold.
The modules in the data partitioning device in the micro batch processing system can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as data stream tuples and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data partitioning in a micro-batch processing system.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, or the like.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of data partitioning in a micro-batch stream processing system, the method comprising:
acquiring a data stream tuple;
maintaining the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
traversing the balanced binary search tree to generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
partitioning the data stream tuples in the ordered list according to a preset partitioning condition; each partition is a data block, and information about whether the key is divided is stored in each data block; all data stream tuples sharing the same key value are modeled as a single item, and the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
distributing, through a Map task and based on a worst-fit algorithm, the key clusters to buckets of a Reduce stage for processing by using the information about whether the keys in the data block are divided; wherein the output of the Map stage is a set of key clusters of key-value pairs, each key cluster having all data values of the same key; and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
2. The method of claim 1, further comprising:
recording the processing time and batch interval of each batch;
acquiring the proportion of each batch processing time to the batch interval;
acquiring a continuous batch count of which the proportion meets a preset proportion threshold according to a preset proportion threshold;
and adjusting the Map task and/or the Reduce task according to the continuous batch count.
3. The method of claim 2, wherein the preset scaling threshold comprises a first scaling threshold; the acquiring, according to a preset proportion threshold, a continuous batch count of which the proportion meets the preset proportion threshold includes:
obtaining a count of consecutive batches for which the ratio is greater than the first ratio threshold.
4. The method of claim 3, wherein said adjusting Map tasks and/or Reduce tasks based on said consecutive batch counts comprises:
when the first continuous batch count reaches a preset count threshold, adding a Map task under the condition of increasing the data rate, and adding a Reduce task under the condition of increasing the data distribution; wherein the first consecutive batch count is a consecutive batch count for which the ratio is greater than the first ratio threshold.
5. The method of claim 2, wherein the preset scaling threshold comprises a second scaling threshold; the acquiring, according to a preset proportion threshold, the continuous batch count of which the proportion meets the preset proportion threshold includes:
acquiring a continuous batch count of which the proportion is smaller than the second proportion threshold.
6. The method of claim 5, wherein said adjusting Map tasks and/or Reduce tasks based on said consecutive batch counts comprises:
when the second continuous batch count reaches a preset count threshold, reducing Map tasks under the condition of reducing the data rate, and reducing Reduce tasks under the condition of reducing the data distribution; wherein the second consecutive batch count is a consecutive batch count for which the ratio is less than the second ratio threshold.
7. An apparatus for dynamic data partitioning in a micro-batch streaming system, the apparatus comprising:
the acquisition module is used for acquiring data stream tuples;
a maintenance module to maintain the data stream tuples based on a hash table and a balanced binary search tree; the hash table stores a key of the data stream tuple, a first pointer pointing to a tuple list corresponding to the key and a frequency count of the key; the frequency count of the key is also saved to the balanced binary search tree; each key in the hash table has a second pointer pointing to a corresponding frequency counting node in the balanced binary search tree;
a generating module, configured to traverse the balanced binary search tree and generate the ordered list; wherein the ordered list comprises the key, a frequency count of the key, and a tuple list corresponding to the key;
the partitioning module is used for partitioning the data stream tuples in the ordered list according to a preset partitioning condition; each partition is a data block, and information whether a key is divided is stored in each data block; modeling all data stream tuples sharing the same key value as a single item, wherein the preset partition condition comprises: limiting the splitting times of the single items, minimizing the number of different single items in the data blocks and maintaining the capacity of each data block to be equal;
the distribution processing module is configured to distribute, through a Map task and based on a worst-fit algorithm, the key clusters to buckets of the Reduce stage for processing by using the information about whether the keys in the data blocks are divided; wherein the output of the Map stage is a set of key clusters of key-value pairs, each key cluster having all data values of the same key; and the capacity of the buckets is determined according to the ratio of the number of key clusters to the number of buckets.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202210339704.6A 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system Active CN114780541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339704.6A CN114780541B (en) 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339704.6A CN114780541B (en) 2022-04-01 Data partitioning method, device, equipment and medium in micro batch flow processing system

Publications (2)

Publication Number Publication Date
CN114780541A true CN114780541A (en) 2022-07-22
CN114780541B CN114780541B (en) 2024-04-12

Family

ID=

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462582A (en) * 2014-12-30 2015-03-25 武汉大学 Web data similarity detection method based on two-stage filtration of structure and content
WO2017031961A1 (en) * 2015-08-24 2017-03-02 华为技术有限公司 Data processing method and apparatus
US9613127B1 (en) * 2014-06-30 2017-04-04 Quantcast Corporation Automated load-balancing of partitions in arbitrarily imbalanced distributed mapreduce computations
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 A kind of data distributing method, device and computer readable storage medium based on MapReduce
CN109325034A (en) * 2018-10-12 2019-02-12 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN110955732A (en) * 2019-12-16 2020-04-03 湖南大学 Method and system for realizing partition load balance in Spark environment
CN111858607A (en) * 2020-07-24 2020-10-30 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
CN113468178A (en) * 2021-07-07 2021-10-01 武汉达梦数据库股份有限公司 Data partition loading method and device of association table

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
S. ANCY等: "Locality based data partitioning in Map reduce", 2016 INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS, AND OPTIMIZATION TECHNIQUES, pages 4869 - 4874 *
周华平;刘光宗;张贝贝;: "基于索引偏移的MapReduce聚类负载均衡策略", 计算机科学, no. 05, pages 310 - 316 *
门威;: "基于MapReduce的大数据处理算法综述", 濮阳职业技术学院学报, no. 05, pages 91 - 94 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant