CN114675969A - Elastic scaling stream processing method and system based on self-adaptive load partition

Elastic scaling stream processing method and system based on self-adaptive load partition

Info

Publication number
CN114675969A
Authority
CN
China
Prior art keywords
data
operator
time
load
index
Prior art date
Legal status
Pending
Application number
CN202210313490.5A
Other languages
Chinese (zh)
Inventor
邹北骥
张涛
朱承璋
Current Assignee
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN202210313490.5A
Publication of CN114675969A

Classifications

    • G06F 9/505: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering the load
    • G06F 16/176: Support for shared access to files; file sharing support
    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources


Abstract

The invention discloses an elastic scaling stream processing method based on adaptive load partitioning, which comprises: constructing a stream processing system based on a Flink prototype; constructing a DKG model for distributing data to downstream operator instances and managing the computation state in those instances; constructing an index collector model for collecting and storing performance index data of the stream processing system; sharing the performance index data; constructing a discriminator model for calculating an elastic scaling strategy implementation factor and a load partitioning strategy implementation factor; constructing the corresponding elastic scaling strategy and load partitioning strategy; and constructing a reconfiguration controller module that applies the strategies to the stream processing system, completing elastic scaling stream processing based on adaptive load partitioning. The invention also discloses a system implementing this elastic scaling stream processing method. The invention achieves lower end-to-end processing delay and higher throughput on both balanced and skewed data streams, with high reliability, good practical effect, and a scientific and reasonable design.

Description

Elastic scaling stream processing method and system based on self-adaptive load partition
Technical Field
The invention belongs to the field of computer data processing, and particularly relates to an elastic scaling stream processing method and system based on self-adaptive load partitioning.
Background
Real-world data streams tend to change dynamically, which poses significant challenges to the processing performance of distributed stream processing systems. Under high-speed input, insufficient processing capacity may cause high processing delay or even loss of input data. Under low-speed input, a large share of the system's computing resources sits idle and resource utilization is low. Moreover, a skewed distribution of the input data stream further aggravates the resource utilization imbalance of a distributed stream processing system, degrading processing performance.
To address resource allocation during task execution in distributed data stream processing systems, researchers have proposed a variety of resource allocation schemes. The static resource allocation scheme is the simplest: it provisions a resource configuration that can satisfy the foreseeable maximum load. Static schemes are simple to configure, but the maximum workload is difficult to predict in a complex environment, and the allocation must be adjusted manually for each environment. Furthermore, load peaks occur only rarely, so under a static allocation the provisioned resources sit idle for most of the processing time, wasting a great amount of computing capacity.
To address the resource waste of static allocation, a large number of dynamic resource allocation schemes have been proposed. They fall broadly into rule-based methods and model-based methods. A rule-based method generates a resource allocation scheme using predefined rules: it first collects the necessary performance indicators, applies them to predefined expert rules, determines the system's runtime symptoms according to those rules, and generates resource reconfiguration actions. Because the rules must be designed by domain experts, rule-based methods cannot be applied universally across different production environments. A model-based method first builds a system model of the distributed stream processing system, formulates an optimization problem relating system performance to resource allocation, solves it with an optimization method, and takes the computed optimal solution as the next resource allocation. Although model-based methods scale well on balanced data streams, it is difficult to build an accurate performance model for skewed data streams, so their effectiveness degrades there.
Disclosure of Invention
The invention aims to provide an elastic scaling stream processing method based on adaptive load partitioning that is highly reliable, effective in practice, scientific and reasonable.
A second objective of the invention is to provide a system implementing this elastic scaling stream processing method based on adaptive load partitioning.
The invention provides an elastic scaling stream processing method based on self-adaptive load partitioning, which comprises the following steps:
S1, constructing a stream processing system based on the existing Flink framework as a prototype;
S2, based on the stream processing system constructed in step S1, constructing a DKG (Replicated Key Group) model for distributing data to downstream operator instances and managing the computation state in those instances;
S3, constructing an index collector model for collecting and storing performance index data of the stream processing system;
S4, sharing the performance index data stored in step S3;
S5, constructing a discriminator model for calculating an elastic scaling strategy implementation factor and a load partitioning strategy implementation factor;
S6, constructing the corresponding elastic scaling strategy and load partitioning strategy according to the implementation factors obtained in step S5;
S7, constructing a reconfiguration controller module for applying the strategies obtained in step S6 to the stream processing system, completing elastic scaling stream processing based on adaptive load partitioning.
Constructing the DKG model in step S2, which distributes data to downstream operator instances and manages the computation state in those instances, specifically comprises the following steps:
for each input datum, computing the sending channel according to the load partitioning algorithm, and sending the datum to the downstream instance through the selected channel;
after receiving the input datum, the downstream instance checks whether it carries the set flag:
if the set flag is present, the datum is sent directly to the corresponding downstream operator instance, thereby applying the physical partition to the logical partition;
otherwise, logical partitioning is performed by the hash partitioning method, and the datum is then sent to a downstream operator instance according to the result of the logical partitioning;
finally, the state of the input data is managed as follows:
when a datum is transmitted to a downstream operator instance, its KG value is calculated as
KG = murmurhash(hashcode(key)) % P_max
where murmurhash() is the MurmurHash function, hashcode() is a multiplicative hash function, P_max is the maximum parallelism supported by the stream processing system, key is the key of the input datum, and % is the modulo operation;
the storage location SI corresponding to the KG value is then calculated as
SI = ⌊KG × N_inst / N_KG⌋
where N_KG is the maximum number of key groups supported by the stream processing system, N_inst is the number of instances, and ⌊·⌋ truncates the division result to its integer part;
the state is then retrieved from the local state backend according to the obtained storage location SI:
if no state exists for the input datum, a new state is created in the local state backend;
if a state exists, the corresponding state is retrieved directly;
the stateful computation is executed, and the updated state is stored back to the state backend;
finally, the partial results of the several operator instances are summed into a unified result, ensuring the correctness of the computation.
Constructing the index collector model in step S3, which collects and stores performance index data of the stream processing system, specifically comprises the following steps:
each operator instance initializes a local index collector that stores an effective time index, a processed data volume index, an output data volume index, and a time overhead index for the output channel selection process; at initialization, the collector reads the window length and the storage path for index persistence from the configuration file;
each time a datum is processed, the effective processing time is calculated: the current nanosecond time is recorded before deserialization and again after processing and serialization finish; the difference between the two recorded times is the effective time, which is accumulated into the collector's effective time index;
the processed data volume index, the output data volume index, and the time overhead index of the output channel selection process are updated: after a datum is deserialized, the processed data volume index is incremented by 1; when a datum has been serialized and is waiting for output, the output data volume index is incremented by 1; the time overhead of selecting an output channel is measured as a time difference;
after a datum is processed, it is checked at the end of the processing step whether the difference between the recorded nanosecond time and the initial nanosecond time exceeds the configured window length:
if it does not exceed the configured window length, no operation is performed;
if it exceeds the configured window length, index calculation and storage are performed:
the index calculation includes the true processing rate and the true output rate:
R_true-proc = N_proc / T_useful
R_true-output = N_output / T_useful
where R_true-proc is the true processing rate, N_proc is the amount of processed data, T_useful is the effective time of data processing, R_true-output is the true output rate, and N_output is the amount of output data;
after the calculation finishes, the results are stored into a performance index data file.
Sharing the performance index data stored in step S3, as recited in step S4, specifically comprises the following steps:
real-time sharing of the performance index data files is implemented with the Samba, inotify, and mv tools;
before the stream processing system starts, Samba is configured for folder sharing, and inotify is set to monitor the performance index file storage path configured in the stream processing system;
whenever the stream processing system stores a performance index data file, inotify produces the file's full path and triggers an mv operation that moves the file into the local shared folder;
Samba shares the performance index data files among multiple host nodes through the SMB protocol, so that the system can access the performance index data of the whole stream processing system.
Constructing the discriminator model in step S5, which calculates the elastic scaling strategy implementation factor and the load partitioning strategy implementation factor, specifically comprises the following steps:
the elastic scaling strategy implementation factor is calculated as follows:
the performance index data are read from the Samba shared file system;
the discriminator reads each performance index file and adds the rate information of the operator instance in the file to the task's topological structure;
after all performance index data are aggregated per operator, the true input rate and the average true processing rate of each operator are calculated; the ratio of the true input rate to the true processing rate, rounded up, is taken as the optimal operator parallelism (i.e., the resource allocation amount);
the optimal operator parallelism is compared with the current operator parallelism to compute the elastic scaling strategy implementation factor:
if the total difference between the optimal and current operator parallelism exceeds a set threshold, the factor is set to a first set value, indicating that the elastic scaling strategy needs to be executed;
if the total difference does not exceed the set threshold, the factor is set to a second set value, indicating that the current operator parallelism remains unchanged;
the load partitioning strategy implementation factor is obtained as follows:
the performance index data of the operators downstream of the logical partition are filtered out of all performance index data, the performance index data of all instances of those operators are read, and the reciprocal of the observed processing rate in each performance index file is taken as the observed processing time of the operator instance;
the maximum and minimum observed processing times are obtained, and their difference is taken as the maximum waiting time under load imbalance;
the queuing time under load balance is calculated with queuing theory: each operator instance is modeled as a GI/G/1 queue, and the average queuing time of each operator instance is estimated as
T_queue = (ρ / (1 − ρ)) × ((c_a² + c_s²) / 2) × (1 / R_true-proc)
where T_queue is the estimated average queuing time of each operator instance, ρ is the utilization, c_a is the coefficient of variation of the inter-arrival time, c_s is the coefficient of variation of the service time, and R_true-proc is the true processing rate of the operator instance;
the difference between the maximum waiting time of each operator under load imbalance and its average queuing time under load balance is computed; this difference represents the extra waiting time caused by load imbalance;
the average utilization of each operator is multiplied by the extra waiting time to obtain the final extra waiting time decision value;
the time spent on load partitioning (i.e., on selecting an output channel) is measured and multiplied by a set coefficient to obtain the final load partitioning time decision value;
finally, the extra waiting time decision value and the load partitioning time decision value are compared:
if the extra waiting time decision value is greater than the load partitioning time decision value, the load partitioning strategy implementation factor is set to a third set value, indicating that the load partitioning strategy needs to be executed;
if the extra waiting time decision value is less than or equal to the load partitioning time decision value, the factor is set to a fourth set value, indicating that the load partitioning strategy need not be executed.
The elastic scaling strategy described in step S6 is specifically constructed as follows:
the performance index data are acquired;
the directed acyclic graph representing the processing task is read and stored as an adjacency list;
a topological sorting algorithm produces the operator ordering of the task from the source operator (Source) to the sink operator (Sink);
the expected output rate set for the source operator is taken as the input rate of the first downstream operator, and the operator parallelism P is calculated as
P = ⌈R_true-input / R_true-proc⌉
where R_true-input is the true input rate and R_true-proc is the true processing rate;
the true output rate R_true-output is calculated as
R_true-output = N_output / T_useful
where N_output is the number of data output by the operator instance and T_useful is the effective time of data processing; this true output rate is used as the true input rate of the downstream operator;
this calculation is iterated, computing the parallelism of the operators one by one from the source operator to the sink operator of the topology.
The load partitioning strategy of step S6 is specifically constructed as follows:
the performance index data are acquired;
a fixed-capacity hash map is created to store the frequencies of different input data; the hash map stores only candidate hot data; an array is also created to store the amount of data sent through each channel;
when a datum arrives, it is checked whether the datum is already recorded in the hash map:
if it is recorded in the hash map, its frequency is incremented by 1;
if it is not recorded and the hash map has not reached its capacity limit, the datum is recorded in the hash map with frequency 1;
if it is not recorded and the hash map has reached its capacity limit, the recorded datum with the lowest frequency is replaced by the current datum, which is recorded with that frequency plus 1;
the input data count is updated on every input; when the count reaches a set value, the frequencies of all data recorded in the hash map are reduced by a set proportion;
whenever the frequencies are updated, the current hot data are determined as follows:
the frequency ratio of every datum in the hash map relative to the total input volume is calculated as
F_key = N_key / N_total
where F_key is the frequency ratio of the datum, N_key is its recorded count in the hash map, and N_total is the total input data volume;
the hot data threshold is then calculated from the set parameters as
θ_hot = θ_def / P
where θ_hot is the hot data threshold, P is the parallelism, and θ_def is a user-defined parameter for adjusting the hot data threshold;
the frequency ratio of each datum in the hash map is compared with the hot data threshold: a datum whose frequency ratio exceeds the threshold is considered hot data;
each input datum is then classified:
if the input datum is hot data, the channel with the least output data volume among all output channels is selected as the output channel;
if the input datum is not hot data, the channel with the least data sent is selected from the two set candidate channels;
after the channel is selected, the data volume sent through that channel is updated.
The invention also discloses a system for realizing the elastic scaling stream processing method based on the self-adaptive load partition, which comprises a Flink system module, a DKG module, an index collector module, an index file sharing module, a discriminator module, an elastic scaling strategy generation module, a load partition strategy generation module and a reconfiguration control module; the Flink system module, the DKG module, the index collector module, the index file sharing module and the discriminator module are sequentially connected in series; the output end of the discriminator module is simultaneously connected with the input ends of the elastic scaling strategy generation module and the load partitioning strategy generation module; the output ends of the elastic scaling strategy generation module and the load partition strategy generation module are simultaneously connected with the reconfiguration control module; the output end of the reconfiguration control module is connected with the Flink system module; the Flink system module is used for constructing a stream processing system; the DKG module is used for constructing a DKG model, distributing data to downstream operator examples and managing the calculation states in the examples; the index collector module is used for constructing an index collector model and collecting and storing index data of the stream processing system; the index file sharing module is used for sharing the stored performance index data files; the discriminator module is used for constructing a discriminator model and calculating an elastic scaling strategy implementation factor and a load partitioning strategy implementation factor according to the shared performance index data; the elastic scaling strategy generating module is used for generating a corresponding elastic scaling strategy according to the implementation factor of the elastic scaling strategy; the load partition strategy generating module is used for generating a corresponding load partition strategy according to the load partition strategy implementation factor; and the reconfiguration control module is used for applying the obtained elastic scaling strategy and the load partition strategy to the stream processing system so as to complete the elastic scaling stream processing based on the self-adaptive load partition.
The invention provides an elastic scaling stream processing method and system based on adaptive load partitioning, which combine the load partitioning technique with the elastic scaling technique and additionally provide an adaptive load partitioning scheme; the invention therefore achieves lower end-to-end processing delay and higher throughput not only on balanced data streams but also on skewed data streams; in addition, the invention is highly reliable, effective in practice, scientific and reasonable.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic process diagram of DKG model processing in the method of the present invention.
FIG. 3 is a schematic diagram of the process of the index collector model in the method of the present invention.
FIG. 4 is a diagram illustrating index file sharing in the method of the present invention.
FIG. 5 is a schematic diagram of frequency updating in the load partitioning strategy in the method of the present invention.
FIG. 6 is a functional block diagram of the system of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides an elastic scaling stream processing method based on self-adaptive load partitioning, which comprises the following steps:
S1, constructing a stream processing system based on the existing Flink framework as a prototype;
S2, based on the stream processing system constructed in step S1, constructing a DKG (Replicated Key Group) model for distributing data to downstream operator instances and managing the computation state in those instances; this specifically comprises the following steps (as shown in FIG. 2):
for each input datum, computing the sending channel according to the load partitioning algorithm, and sending the datum to the downstream instance through the selected channel;
after receiving the input datum, the downstream instance checks whether it carries the set flag:
if the set flag is present, the datum is sent directly to the corresponding downstream operator instance, thereby applying the physical partition to the logical partition;
otherwise, logical partitioning is performed by the hash partitioning method, and the datum is then sent to a downstream operator instance according to the result of the logical partitioning;
finally, the state of the input data is managed as follows:
when a datum is transmitted to a downstream operator instance, its KG value is calculated as
KG = murmurhash(hashcode(key)) % P_max
where murmurhash() is the MurmurHash function, hashcode() is a multiplicative hash function, P_max is the maximum parallelism supported by the stream processing system, key is the key of the input datum, and % is the modulo operation;
the storage location SI corresponding to the KG value is then calculated as
SI = ⌊KG × N_inst / N_KG⌋
where N_KG is the maximum number of key groups supported by the stream processing system, N_inst is the number of instances, and ⌊·⌋ truncates the division result to its integer part;
the state is then retrieved from the local state backend according to the obtained storage location SI:
if no state exists for the input datum, a new state is created in the local state backend;
if a state exists, the corresponding state is retrieved directly;
the stateful computation is executed, and the updated state is stored back to the state backend;
finally, the partial results of the several operator instances are summed into a unified result, ensuring the correctness of the computation.
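By way of illustration, the key-group routing just described can be sketched in Java along the following lines. This is a minimal sketch, not the patent's implementation: the class and member names are assumed for the example, and a 32-bit MurmurHash3 finalizer stands in for the full MurmurHash function.

    public final class KeyGroupRouter {
        private final int maxParallelism; // P_max
        private final int numKeyGroups;   // N_KG
        private final int numInstances;   // N_inst

        public KeyGroupRouter(int maxParallelism, int numKeyGroups, int numInstances) {
            this.maxParallelism = maxParallelism;
            this.numKeyGroups = numKeyGroups;
            this.numInstances = numInstances;
        }

        // KG = murmurhash(hashcode(key)) % P_max
        public int keyGroup(Object key) {
            return Math.floorMod(murmurFinalize(key.hashCode()), maxParallelism);
        }

        // SI = floor(KG * N_inst / N_KG): which instance's local state backend holds this key group
        public int storageIndex(int keyGroup) {
            return (int) ((long) keyGroup * numInstances / numKeyGroups);
        }

        // Simplified stand-in for MurmurHash (the 32-bit MurmurHash3 finalizer)
        private static int murmurFinalize(int h) {
            h ^= h >>> 16;
            h *= 0x85ebca6b;
            h ^= h >>> 13;
            h *= 0xc2b2ae35;
            h ^= h >>> 16;
            return h;
        }
    }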
S3, constructing an index collector model for collecting and storing performance index data of the stream processing system; this specifically comprises the following steps (as shown in FIG. 3):
each operator instance initializes a local index collector that stores an effective time index, a processed data volume index, an output data volume index, and a time overhead index for the output channel selection process; at initialization, the collector reads the window length and the storage path for index persistence from the configuration file;
each time a datum is processed, the effective processing time is calculated: the current nanosecond time is recorded before deserialization and again after processing and serialization finish; the difference between the two recorded times is the effective time, which is accumulated into the collector's effective time index; because the effective time excludes time spent waiting for reads and writes during the computation, it reflects the system's processing capacity more accurately;
the processed data volume index, the output data volume index, and the time overhead index of the output channel selection process are updated: after a datum is deserialized, the processed data volume index is incremented by 1; when a datum has been serialized and is waiting for output, the output data volume index is incremented by 1; the time overhead of selecting an output channel is measured as a time difference;
after a datum is processed, it is checked at the end of the processing step whether the difference between the recorded nanosecond time and the initial nanosecond time exceeds the configured window length:
if it does not exceed the configured window length, no operation is performed;
if it exceeds the configured window length, index calculation and storage are performed:
the index calculation includes the true processing rate and the true output rate:
R_true-proc = N_proc / T_useful
R_true-output = N_output / T_useful
where R_true-proc is the true processing rate, N_proc is the amount of processed data, T_useful is the effective time of data processing, R_true-output is the true output rate, and N_output is the amount of output data;
after the calculation finishes, the results are stored into a performance index data file.
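As an illustrative sketch of the collector just described (the class name, the call convention, and the console output are assumptions of the example; the patent persists the results to a performance index data file instead):

    public final class MetricCollector {
        private final long windowNanos;  // window length read from the configuration file
        private long windowStart = System.nanoTime();
        private long usefulNanos;        // accumulated effective time T_useful
        private long processed;          // N_proc
        private long emitted;            // N_output

        public MetricCollector(long windowNanos) {
            this.windowNanos = windowNanos;
        }

        // Called once per record with the nanosecond timestamps taken before
        // deserialization and after serialization, plus the number of outputs.
        public void record(long startNanos, long endNanos, int outputs) {
            usefulNanos += endNanos - startNanos;
            processed++;
            emitted += outputs;
            if (endNanos - windowStart >= windowNanos && usefulNanos > 0) {
                flush(endNanos);
            }
        }

        private void flush(long now) {
            double seconds = usefulNanos / 1e9;
            double trueProcRate = processed / seconds; // R_true-proc = N_proc / T_useful
            double trueOutRate = emitted / seconds;    // R_true-output = N_output / T_useful
            System.out.printf("R_true-proc=%.1f/s, R_true-output=%.1f/s%n", trueProcRate, trueOutRate);
            usefulNanos = 0; processed = 0; emitted = 0;
            windowStart = now;
        }
    }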
After the DKG model and the index collector model are designed, the Flink source program is compiled with the Maven tool; after compilation, the generated Flink system is deployed into the distributed environment.
S4, sharing the performance index data stored in step S3; this specifically comprises the following steps (as shown in FIG. 4):
real-time sharing of the performance index data files is implemented with the Samba, inotify, and mv tools;
before the stream processing system starts, Samba is configured for folder sharing, and inotify is set to monitor the performance index file storage path configured in the stream processing system;
whenever the stream processing system stores a performance index data file, inotify produces the file's full path and triggers an mv operation that moves the file into the local shared folder;
Samba shares the performance index data files among multiple host nodes through the SMB protocol, so that the system can access the performance index data of the whole stream processing system.
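The patent's sharing pipeline relies on Linux tools (inotify watching the metric directory, a triggered mv, and a Samba share). Purely as an analogous sketch, the same "on new metric file, move it into the shared folder" step can be expressed with Java's WatchService; both directory paths below are placeholders, not paths from the patent:

    import java.io.IOException;
    import java.nio.file.*;

    public final class MetricFileMover {
        public static void main(String[] args) throws IOException, InterruptedException {
            Path metricsDir = Paths.get("/tmp/flink-metrics");      // assumed metric output path
            Path sharedDir = Paths.get("/srv/samba/metric-share");  // assumed Samba-shared folder
            WatchService watcher = FileSystems.getDefault().newWatchService();
            metricsDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until an inotify-style create event
                for (WatchEvent<?> ev : key.pollEvents()) {
                    Path created = metricsDir.resolve((Path) ev.context());
                    // the equivalent of the triggered mv into the shared folder
                    Files.move(created, sharedDir.resolve(created.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                }
                key.reset();
            }
        }
    }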
S5, constructing a discriminator model for calculating the elastic scaling strategy implementation factor and the load partitioning strategy implementation factor; this specifically comprises the following steps:
the elastic scaling strategy implementation factor is calculated as follows:
the performance index data are read from the Samba shared file system;
the discriminator reads each performance index file and adds the rate information of the operator instance in the file to the task's topological structure;
after all performance index data are aggregated per operator, the true input rate and the average true processing rate of each operator are calculated; the ratio of the true input rate to the true processing rate, rounded up, is taken as the optimal operator parallelism (i.e., the resource allocation amount);
the optimal operator parallelism is compared with the current operator parallelism to compute the elastic scaling strategy implementation factor:
if the total difference between the optimal and current operator parallelism exceeds a set threshold, the factor is set to a first set value, indicating that the elastic scaling strategy needs to be executed;
if the total difference does not exceed the set threshold, the factor is set to a second set value, indicating that the current operator parallelism remains unchanged;
because the load partitioning strategy is applied to every input datum, its time overhead is large compared with the processing time of a single datum; moreover, in some cases a load partitioning strategy is unnecessary; the load partitioning strategy implementation factor is therefore calculated as follows:
the performance index data of the operators downstream of the logical partition are filtered out of all performance index data, the performance index data of all instances of those operators are read, and the reciprocal of the observed processing rate in each performance index file is taken as the observed processing time of the operator instance;
the maximum and minimum observed processing times are obtained, and their difference is taken as the maximum waiting time under load imbalance; since the actual processing time hardly differs between instances of the same operator, differences in observed processing time appear as differences in the time spent waiting for data input; if load imbalance degrades processing performance, at least one operator instance is fully loaded, and that instance has the minimum observed processing time; the difference between the instances' maximum and minimum observed processing times therefore represents the maximum waiting time caused by load imbalance;
to eliminate the influence of the queuing time that is present even in the balanced state, the queuing time under load balance is calculated with queuing theory: in an ideal load-balanced scenario, all operator instances have the same input rate and true processing rate; each operator instance is therefore modeled as a GI/G/1 queue, and the average queuing time of each operator instance is estimated as
T_queue = (ρ / (1 − ρ)) × ((c_a² + c_s²) / 2) × (1 / R_true-proc)
where T_queue is the estimated average queuing time of each operator instance, ρ is the utilization, c_a is the coefficient of variation of the inter-arrival time, c_s is the coefficient of variation of the service time, and R_true-proc is the true processing rate of the operator instance;
the difference between the maximum waiting time of each operator under load imbalance and its average queuing time under load balance is computed; this difference represents the extra waiting time caused by load imbalance;
when the load is balanced and the utilization of each operator instance is low, the waiting time computed above can be large, because at low resource utilization even small errors between the observed processing rates of instances are amplified; the average utilization of each operator is therefore multiplied by the extra waiting time to obtain the final extra waiting time decision value;
the time spent on load partitioning (i.e., on selecting an output channel) is measured and multiplied by a set coefficient to obtain the final load partitioning time decision value;
finally, the extra waiting time decision value and the load partitioning time decision value are compared:
if the extra waiting time decision value is greater than the load partitioning time decision value, the load partitioning strategy implementation factor is set to a third set value, indicating that the load partitioning strategy needs to be executed;
if the extra waiting time decision value is less than or equal to the load partitioning time decision value, the factor is set to a fourth set value, indicating that the load partitioning strategy need not be executed.
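The two decisions of the discriminator can be condensed into a short Java sketch; the method names and the way the aggregated metrics are passed in are assumptions of the example:

    public final class Discriminator {

        // Optimal parallelism: ratio of true input rate to mean true processing rate, rounded up.
        public static int optimalParallelism(double trueInputRate, double meanTrueProcRate) {
            return (int) Math.ceil(trueInputRate / meanTrueProcRate);
        }

        // Elastic scaling implementation factor: scale only if the total parallelism gap
        // exceeds the configured threshold.
        public static boolean scalingNeeded(int totalOptimal, int totalCurrent, int threshold) {
            return Math.abs(totalOptimal - totalCurrent) > threshold;
        }

        // GI/G/1 estimate of the balanced-state queuing time:
        // T_queue = rho/(1-rho) * (c_a^2 + c_s^2)/2 * 1/R_true-proc
        public static double balancedQueueTime(double rho, double ca, double cs, double trueProcRate) {
            return rho / (1 - rho) * (ca * ca + cs * cs) / 2 / trueProcRate;
        }

        // Load partitioning implementation factor: partition only when the utilization-weighted
        // extra waiting time exceeds the scaled channel-selection overhead.
        public static boolean partitioningNeeded(double maxObservedWait, double minObservedWait,
                                                 double balancedQueueTime, double avgUtilization,
                                                 double partitionTime, double coefficient) {
            double extraWait = (maxObservedWait - minObservedWait) - balancedQueueTime;
            return avgUtilization * extraWait > coefficient * partitionTime;
        }
    }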
S6, constructing the corresponding elastic scaling strategy and load partitioning strategy according to the implementation factors obtained in step S5;
in a specific implementation, the elastic scaling strategy is constructed as follows:
the performance index data are acquired;
the directed acyclic graph representing the processing task is read and stored as an adjacency list;
a topological sorting algorithm produces the operator ordering of the task from the source operator (Source) to the sink operator (Sink);
the expected output rate set for the source operator is taken as the input rate of the first downstream operator, and the operator parallelism P is calculated as
P = ⌈R_true-input / R_true-proc⌉
where R_true-input is the true input rate and R_true-proc is the true processing rate;
the true output rate R_true-output is calculated as
R_true-output = N_output / T_useful
where N_output is the number of data output by the operator instance and T_useful is the effective time of data processing; this true output rate is used as the true input rate of the downstream operator;
this calculation is iterated, computing the parallelism of the operators one by one from the source operator to the sink operator of the topology.
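A compact sketch of this per-operator pass follows; the record type and the reduction of the DAG to a topologically ordered chain are simplifications assumed for the example:

    import java.util.*;

    public final class ScalingPlanner {
        // Per-operator averages computed from the shared performance index files.
        record OperatorMetrics(String name, double trueProcRate, double trueOutputRate) {}

        // Operators must already be in topological order from Source to Sink.
        public static Map<String, Integer> plan(List<OperatorMetrics> topoOrder,
                                                double expectedSourceRate) {
            Map<String, Integer> parallelism = new LinkedHashMap<>();
            double inputRate = expectedSourceRate; // R_true-input of the first downstream operator
            for (OperatorMetrics op : topoOrder) {
                int p = (int) Math.ceil(inputRate / op.trueProcRate()); // P = ceil(R_true-input / R_true-proc)
                parallelism.put(op.name(), p);
                inputRate = op.trueOutputRate(); // R_true-output becomes the downstream R_true-input
            }
            return parallelism;
        }
    }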
the basic idea of load partitioning is to distribute frequently occurring hot data across all downstream operator instances, while infrequent cold data are distributed to two fixed operator instances; the candidate instances for infrequent data are generated by two hash functions; the load partitioning strategy is therefore constructed as follows (as shown in FIG. 5):
the performance index data are acquired;
a fixed-capacity hash map is created to store the frequencies of different input data; the hash map stores only candidate hot data; an array is also created to store the amount of data sent through each channel;
when a datum arrives, it is checked whether the datum is already recorded in the hash map:
if it is recorded in the hash map, its frequency is incremented by 1;
if it is not recorded and the hash map has not reached its capacity limit, the datum is recorded in the hash map with frequency 1;
if it is not recorded and the hash map has reached its capacity limit, the recorded datum with the lowest frequency is replaced by the current datum, which is recorded with that frequency plus 1;
replacing low-frequency data in this way keeps the set of candidate hot keys up to date, so new data can become hot without first accumulating a large count; low-frequency data and unrecorded data are both cold data and are treated identically by the data partitioner; as shown in FIG. 5, when the next input datum is "two" and is not stored in the hash map, the least frequent entry ("the", 1) is found and replaced by ("two", 2); when the next input datum "a" already exists in the hash map, the frequency of its entry is incremented by 1;
most load partitioning methods ignore that the heat of historical data decays as time advances, and are therefore insensitive to the load distribution at the current moment; the load partitioning strategy generator therefore reduces historical frequencies over time: the input data count is updated on every input, and when it reaches a set value, the frequencies of all data recorded in the hash map are reduced by a set proportion (for example, multiplied by a coefficient of 0.5);
whenever the frequencies are updated, the current hot data are determined as follows:
the frequency ratio of every datum in the hash map relative to the total input volume is calculated as
F_key = N_key / N_total
where F_key is the frequency ratio of the datum, N_key is its recorded count in the hash map, and N_total is the total input data volume;
the hot data threshold is then calculated from the set parameters as
θ_hot = θ_def / P
where θ_hot is the hot data threshold, P is the parallelism, and θ_def is a user-defined parameter for adjusting the hot data threshold;
the frequency ratio of each datum in the hash map is compared with the hot data threshold: a datum whose frequency ratio exceeds the threshold is considered hot data;
each input datum is then classified:
if the input datum is hot data, the channel with the least output data volume among all output channels is selected as the output channel;
if the input datum is not hot data, the channel with the least data sent is selected from the two set candidate channels;
after the channel is selected, the data volume sent through that channel is updated.
S7, constructing a reconfiguration controller module for applying the strategies obtained in step S6 to the stream processing system, completing elastic scaling stream processing based on adaptive load partitioning.
FIG. 6 is a functional block diagram of the system of the present invention: the system for realizing the elastic scaling stream processing method based on the self-adaptive load partition comprises a Flink system module, a DKG module, an index collector module, an index file sharing module, a discriminator module, an elastic scaling strategy generation module, a load partition strategy generation module and a reconfiguration control module; the Flink system module, the DKG module, the index collector module, the index file sharing module and the discriminator module are sequentially connected in series; the output end of the discriminator module is simultaneously connected with the input ends of the elastic scaling strategy generation module and the load partitioning strategy generation module; the output ends of the elastic scaling strategy generation module and the load partition strategy generation module are simultaneously connected with the reconfiguration control module; the output end of the reconfiguration control module is connected with the Flink system module; the Flink system module is used for constructing a stream processing system; the DKG module is used for constructing a DKG model, distributing data to downstream operator instances and managing the computing state in the instances; the index collector module is used for constructing an index collector model and collecting and storing index data of the stream processing system; the index file sharing module is used for sharing the stored performance index data files; the discriminator module is used for constructing a discriminator model and calculating an elastic scaling strategy implementation factor and a load partitioning strategy implementation factor according to the shared performance index data; the elastic scaling strategy generating module is used for generating a corresponding elastic scaling strategy according to the implementation factor of the elastic scaling strategy; the load partition strategy generating module is used for generating a corresponding load partition strategy according to the load partition strategy implementation factor; and the reconfiguration control module is used for applying the obtained elastic scaling strategy and the load partition strategy to the stream processing system so as to complete the elastic scaling stream processing based on the self-adaptive load partition.
The elastic scaling stream processing method and system based on adaptive load partitioning can be applied to fields such as network public opinion analysis and industrial sensor data monitoring.
In a network public opinion analysis scenario, for example counting the popularity of microblog messages, the method and system not only provide the basic keyword popularity statistics during low-popularity events, but also keep functioning normally during high-popularity events without causing a system breakdown.
In an industrial sensor data monitoring scenario, the large volume of data collected by industrial sensors is treated as stream data and processed and analyzed with the stream processing method; the stream processing of industrial sensor data then has the characteristics of low delay and high throughput, guaranteeing the real-time performance of data processing so that production problems can be discovered in time and losses avoided.

Claims (8)

1. An elastic scaling stream processing method based on adaptive load partitioning, comprising the following steps:
S1, constructing a stream processing system based on the existing Flink framework as a prototype;
S2, based on the stream processing system constructed in step S1, constructing a DKG (Replicated Key Group) model for distributing data to downstream operator instances and managing the computation state in those instances;
S3, constructing an index collector model for collecting and storing performance index data of the stream processing system;
S4, sharing the performance index data stored in step S3;
S5, constructing a discriminator model for calculating an elastic scaling strategy implementation factor and a load partitioning strategy implementation factor;
S6, constructing the corresponding elastic scaling strategy and load partitioning strategy according to the implementation factors obtained in step S5;
S7, constructing a reconfiguration controller module for applying the strategies obtained in step S6 to the stream processing system, completing elastic scaling stream processing based on adaptive load partitioning.
2. The adaptive load partitioning-based elastic scaling stream processing method according to claim 1, wherein constructing the DKG model in step S2 for distributing data to downstream operator instances and managing the computation state in those instances comprises the following steps:
for each input datum, computing the sending channel according to the load partitioning algorithm, and sending the datum to the downstream instance through the selected channel;
after receiving the input datum, the downstream instance checks whether it carries the set flag:
if the set flag is present, the datum is sent directly to the corresponding downstream operator instance, thereby applying the physical partition to the logical partition;
otherwise, logical partitioning is performed by the hash partitioning method, and the datum is then sent to a downstream operator instance according to the result of the logical partitioning;
finally, the state of the input data is managed as follows:
when a datum is transmitted to a downstream operator instance, its KG value is calculated as
KG = murmurhash(hashcode(key)) % P_max
where murmurhash() is the MurmurHash function, hashcode() is a multiplicative hash function, P_max is the maximum parallelism supported by the stream processing system, key is the key of the input datum, and % is the modulo operation;
the storage location SI corresponding to the KG value is then calculated as
SI = ⌊KG × N_inst / N_KG⌋
where N_KG is the maximum number of key groups supported by the stream processing system, N_inst is the number of instances, and ⌊·⌋ truncates the division result to its integer part;
the state is then retrieved from the local state backend according to the obtained storage location SI:
if no state exists for the input datum, a new state is created in the local state backend;
if a state exists, the corresponding state is retrieved directly;
the stateful computation is executed, and the updated state is stored back to the state backend;
finally, the partial results of the several operator instances are summed into a unified result, ensuring the correctness of the computation.
3. The adaptive load partitioning-based elastic scaling stream processing method according to claim 2, wherein constructing the index collector model in step S3 for collecting and storing performance index data of the stream processing system comprises the following steps:
each operator instance initializes a local index collector that stores an effective time index, a processed data volume index, an output data volume index, and a time overhead index for the output channel selection process; at initialization, the collector reads the window length and the storage path for index persistence from the configuration file;
each time a datum is processed, the effective processing time is calculated: the current nanosecond time is recorded before deserialization and again after processing and serialization finish; the difference between the two recorded times is the effective time, which is accumulated into the collector's effective time index;
the processed data volume index, the output data volume index, and the time overhead index of the output channel selection process are updated: after a datum is deserialized, the processed data volume index is incremented by 1; when a datum has been serialized and is waiting for output, the output data volume index is incremented by 1; the time overhead of selecting an output channel is measured as a time difference;
after a datum is processed, it is checked at the end of the processing step whether the difference between the recorded nanosecond time and the initial nanosecond time exceeds the configured window length:
if it does not exceed the configured window length, no operation is performed;
if it exceeds the configured window length, index calculation and storage are performed:
the index calculation includes the true processing rate and the true output rate:
R_true-proc = N_proc / T_useful
R_true-output = N_output / T_useful
where R_true-proc is the true processing rate, N_proc is the amount of processed data, T_useful is the effective time of data processing, R_true-output is the true output rate, and N_output is the amount of output data;
after the calculation finishes, the results are stored into a performance index data file.
4. The adaptive load partitioning-based elastic scaling stream processing method according to claim 3, wherein sharing the performance index data stored in step S3, as recited in step S4, comprises the following steps:
real-time sharing of the performance index data files is implemented with the Samba, inotify, and mv tools;
before the stream processing system starts, Samba is configured for folder sharing, and inotify is set to monitor the performance index file storage path configured in the stream processing system;
whenever the stream processing system stores a performance index data file, inotify produces the file's full path and triggers an mv operation that moves the file into the local shared folder;
Samba shares the performance index data files among multiple host nodes through the SMB protocol, so that the system can access the performance index data of the whole stream processing system.
5. The adaptive load partitioning-based elastic scaling stream processing method according to claim 4, wherein constructing the discriminator model in step S5 for calculating the elastic scaling strategy implementation factor and the load partitioning strategy implementation factor comprises the following steps:
the elastic scaling strategy implementation factor is calculated as follows:
the performance index data are read from the Samba shared file system;
the discriminator reads each performance index file and adds the rate information of the operator instance in the file to the task's topological structure;
after all performance index data are aggregated per operator, the true input rate and the average true processing rate of each operator are calculated; the ratio of the true input rate to the true processing rate, rounded up, is taken as the optimal operator parallelism;
the optimal operator parallelism is compared with the current operator parallelism to compute the elastic scaling strategy implementation factor:
if the total difference between the optimal and current operator parallelism exceeds a set threshold, the factor is set to a first set value, indicating that the elastic scaling strategy needs to be executed;
if the total difference does not exceed the set threshold, the factor is set to a second set value, indicating that the current operator parallelism remains unchanged;
the load partitioning strategy implementation factor is obtained as follows:
the performance index data of the operators downstream of the logical partition are filtered out of all performance index data, the performance index data of all instances of those operators are read, and the reciprocal of the observed processing rate in each performance index file is taken as the observed processing time of the operator instance;
the maximum and minimum observed processing times are obtained, and their difference is taken as the maximum waiting time under load imbalance;
the queuing time under load balance is calculated with queuing theory: each operator instance is modeled as a GI/G/1 queue, and the average queuing time of each operator instance is estimated as
T_queue = (ρ / (1 − ρ)) × ((c_a² + c_s²) / 2) × (1 / R_true-proc)
where T_queue is the estimated average queuing time of each operator instance, ρ is the utilization, c_a is the coefficient of variation of the inter-arrival time, c_s is the coefficient of variation of the service time, and R_true-proc is the true processing rate of the operator instance;
calculating, for each operator, the difference between the maximum waiting time in the load-unbalanced state and the average queuing time in the load-balanced state; this difference represents the extra waiting time caused by load unbalance;
multiplying the extra waiting time by the average utilization of each operator to obtain the final extra waiting time decision value;
measuring the time consumed by load partitioning (i.e., the process of selecting an output channel) and multiplying it by a set coefficient to obtain the final load partition time decision value;
finally, comparing the extra waiting time decision value with the load partition time decision value:
if the extra waiting time decision value is greater than the load partition time decision value, setting the load partitioning policy enforcement factor to a third set value, indicating that the load partitioning strategy needs to be executed;
if the extra waiting time decision value is less than or equal to the load partition time decision value, setting the load partitioning policy enforcement factor to a fourth set value, indicating that the load partitioning strategy does not need to be executed (see the sketch below).
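A hedged sketch of this decision, using the Kingman GI/G/1 approximation for the balanced-state queuing time; the method names, the aggregation of ρ, c_a, c_s over instances, and the 1/0 encoding of the third and fourth set values are assumptions, not claim language:

```java
public class PartitionDiscriminator {

    /** Kingman's approximation of the mean queuing time of one operator instance. */
    static double kingmanQueueTime(double rho, double ca, double cs, double procRate) {
        return (rho / (1 - rho)) * ((ca * ca + cs * cs) / 2.0) * (1.0 / procRate);
    }

    /** Returns 1 (execute the load partitioning strategy) or 0 (skip it). */
    static int partitionFactor(double[] observedProcRates, double avgUtilization,
                               double rho, double ca, double cs, double procRate,
                               double partitionTimePerTuple, double coefficient) {
        // Observed processing time = reciprocal of the observed processing rate.
        double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
        for (double r : observedProcRates) {
            double t = 1.0 / r;
            max = Math.max(max, t);
            min = Math.min(min, t);
        }
        double maxWaitUnbalanced = max - min;              // waiting time under imbalance
        double balancedQueueTime = kingmanQueueTime(rho, ca, cs, procRate);
        // Extra waiting time caused by imbalance, weighted by average utilization.
        double extraWait = (maxWaitUnbalanced - balancedQueueTime) * avgUtilization;
        // Cost of choosing an output channel, scaled by the set coefficient.
        double partitionCost = partitionTimePerTuple * coefficient;
        return extraWait > partitionCost ? 1 : 0;
    }
}
```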
6. The adaptive load partition-based elastic scaling stream processing method according to claim 5, wherein the elastic scaling strategy of step S6 is constructed by the following steps:
acquiring the performance index data;
reading the directed acyclic graph representing the processing task and storing it as an adjacency list;
using a topological sorting algorithm to obtain the ordering of the task's operators from the source operator (Source) to the sink operator (Sink);
taking the expected output rate set for the source operator as the input rate of the first downstream operator, and calculating the parallelism $P$ of the operator as
$$P = \left\lceil \frac{R_{true\text{-}input}}{R_{true\text{-}proc}} \right\rceil$$
where $R_{true\text{-}input}$ is the real input rate and $R_{true\text{-}proc}$ is the real processing rate;
calculating the real output rate $R_{true\text{-}output}$ as
$$R_{true\text{-}output} = \frac{N_{output}}{T_{useful}}$$
where $N_{output}$ is the number of output data items of the operator instance and $T_{useful}$ is the effective time spent on data processing; the real output rate is taken as the real input rate of the downstream operator;
and iterating this calculation to compute the parallelism of each operator in turn, from the source operator to the sink operator of the topology (see the sketch below).
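A minimal sketch of this source-to-sink sweep, assuming an adjacency-list DAG with measured per-operator rates; summing the rates of several upstream operators into one input rate is an assumption the claim does not spell out:

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelismPlanner {
    static class Op {
        final String name;
        final double trueProcRate;   // measured average processing rate of one instance
        final double nOutput;        // measured number of output data items
        final double tUseful;        // measured effective processing time
        double trueInputRate = 0.0;  // accumulated from upstream during the sweep
        int parallelism;
        final List<Op> downstream = new ArrayList<>();
        Op(String name, double trueProcRate, double nOutput, double tUseful) {
            this.name = name; this.trueProcRate = trueProcRate;
            this.nOutput = nOutput; this.tUseful = tUseful;
        }
    }

    /**
     * topoOrder: operators already topologically sorted from source to sink;
     * sourceRate: the expected output rate configured for the source operator.
     */
    static void plan(List<Op> topoOrder, double sourceRate) {
        topoOrder.get(0).trueInputRate = sourceRate;
        for (Op op : topoOrder) {
            // P = ceil(R_true-input / R_true-proc)
            op.parallelism = (int) Math.ceil(op.trueInputRate / op.trueProcRate);
            // R_true-output = N_output / T_useful
            double trueOutputRate = op.nOutput / op.tUseful;
            // The real output rate feeds the downstream operators' input rates.
            for (Op next : op.downstream) next.trueInputRate += trueOutputRate;
        }
    }
}
```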
7. The adaptive load partition-based elastic scaling stream processing method according to claim 5, wherein the load partitioning strategy of step S6 is constructed by the following steps:
acquiring the performance index data;
creating a fixed-capacity hash map to store the frequencies of different input data, the hash map storing only potentially hot data; meanwhile, creating an array to store the data volume sent on each output channel;
when a datum is input, determining whether it is already recorded in the hash map:
if the datum is recorded in the hash map, increasing its frequency by 1;
if the datum is not recorded in the hash map and the hash map has not reached its capacity limit, recording the datum in the hash map with a frequency of 1;
if the datum is not recorded in the hash map and the hash map has reached its capacity limit, replacing the lowest-frequency datum recorded in the hash map with the current datum, and setting the frequency of the current datum to that lowest frequency plus 1;
updating the total amount of input data on every input; when the input amount reaches a set value, reducing the frequencies of all data recorded in the hash map by a set proportion;
and on every frequency update, calculating the current hot data by the following steps:
calculating the frequency of each datum in the hash map as a fraction of the total input data volume:
$$F_{key} = \frac{N_{key}}{N_{total}}$$
where $F_{key}$ is the frequency of the datum, $N_{key}$ is its count recorded in the hash map, and $N_{total}$ is the total amount of input data;
calculating the hot-data threshold from the set parameters as
$$\theta_{hot} = \frac{\theta_{def}}{P}$$
where $\theta_{hot}$ is the hot-data threshold, $P$ is the parallelism, and $\theta_{def}$ is a user-defined parameter for adjusting the hot-data threshold;
comparing each data frequency in the hash map with the hot-data threshold: data whose frequency exceeds the hot-data threshold are regarded as hot data;
classifying each input datum:
if the input datum is hot data, selecting, among all the output channels, the channel that has sent the least data as the output channel;
if the input datum is not hot data, selecting, between the two preset candidate channels, the one that has sent less data;
and after the channel is selected, updating the data volume sent on that channel (see the sketch below).
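A hedged sketch of this partitioner, combining Space-Saving-style frequency bookkeeping with the two-path channel choice; the decay proportion (halving), the hash-based choice of the two candidate channels, and counting tuples instead of bytes as the sent volume are illustrative assumptions:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HotKeyPartitioner {
    private final Map<String, Long> freq = new HashMap<>(); // frequencies of potential hot data
    private final long[] channelLoad;                       // data volume sent per channel
    private final int capacity, decayEvery, parallelism;
    private final double thetaDef;
    private long totalInput = 0;

    HotKeyPartitioner(int channels, int capacity, int decayEvery, double thetaDef) {
        this.channelLoad = new long[channels];
        this.capacity = capacity;
        this.decayEvery = decayEvery;
        this.parallelism = channels;
        this.thetaDef = thetaDef;
    }

    int selectChannel(String key) {
        // --- frequency bookkeeping (Space-Saving style) ---
        if (freq.containsKey(key)) {
            freq.merge(key, 1L, Long::sum);          // known datum: frequency + 1
        } else if (freq.size() < capacity) {
            freq.put(key, 1L);                       // room left: record with frequency 1
        } else {
            // Full: evict the lowest-frequency entry, inherit its count + 1.
            String minKey = Collections.min(freq.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            long minCount = freq.remove(minKey);
            freq.put(key, minCount + 1);
        }
        if (++totalInput % decayEvery == 0) {
            freq.replaceAll((k, v) -> v / 2);        // assumed decay proportion: halve
        }
        // --- channel selection ---
        double thetaHot = thetaDef / parallelism;    // theta_hot = theta_def / P
        boolean hot = freq.get(key) / (double) totalInput > thetaHot;
        int chosen;
        if (hot) {
            chosen = 0;                              // least-loaded of all channels
            for (int c = 1; c < channelLoad.length; c++)
                if (channelLoad[c] < channelLoad[chosen]) chosen = c;
        } else {
            // Two preset candidate channels, here derived from two hashes.
            int c1 = Math.floorMod(key.hashCode(), channelLoad.length);
            int c2 = Math.floorMod(key.hashCode() * 31 + 17, channelLoad.length);
            chosen = channelLoad[c1] <= channelLoad[c2] ? c1 : c2;
        }
        channelLoad[chosen]++;                       // update the volume sent on the channel
        return chosen;
    }
}
```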
8. A system for implementing the adaptive load partition-based elastic scaling stream processing method according to any one of claims 1 to 7, characterized by comprising a Flink system module, a DKG module, an index collector module, an index file sharing module, a discriminator module, an elastic scaling strategy generation module, a load partitioning strategy generation module, and a reconfiguration control module;
the Flink system module, the DKG module, the index collector module, the index file sharing module, and the discriminator module are connected in series in that order; the output of the discriminator module is connected to the inputs of both the elastic scaling strategy generation module and the load partitioning strategy generation module; the outputs of both strategy generation modules are connected to the reconfiguration control module; and the output of the reconfiguration control module is connected back to the Flink system module;
the Flink system module is used to construct the stream processing system; the DKG module is used to construct the DKG model, distribute data to downstream operator instances, and manage the computing state within the instances; the index collector module is used to construct the index collector model and to collect and store index data of the stream processing system; the index file sharing module is used to share the stored performance index data files; the discriminator module is used to construct the discriminator model and to calculate the elastic scaling policy enforcement factor and the load partitioning policy enforcement factor from the shared performance index data; the elastic scaling strategy generation module is used to generate the corresponding elastic scaling strategy according to the elastic scaling policy enforcement factor; the load partitioning strategy generation module is used to generate the corresponding load partitioning strategy according to the load partitioning policy enforcement factor; and the reconfiguration control module is used to apply the resulting elastic scaling strategy and load partitioning strategy to the stream processing system, thereby completing the elastic scaling stream processing based on adaptive load partitioning (a sketch of this control loop follows).
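A compact sketch of the control loop this module wiring implies; all interfaces and names here are illustrative assumptions, since the claim defines module roles rather than APIs:

```java
public class ControlLoop {
    interface Discriminator { int scalingFactor(); int partitionFactor(); }
    interface StrategyGenerator<S> { S generate(); }
    interface ReconfigurationController {
        void applyScaling(Object scalingPlan);
        void applyPartitioning(Object partitionPlan);
    }

    static void runOnce(Discriminator d,
                        StrategyGenerator<Object> scalingGen,
                        StrategyGenerator<Object> partitionGen,
                        ReconfigurationController ctrl) {
        // Elastic scaling and load partitioning are decided independently,
        // mirroring the discriminator's two enforcement factors.
        if (d.scalingFactor() == 1)   ctrl.applyScaling(scalingGen.generate());
        if (d.partitionFactor() == 1) ctrl.applyPartitioning(partitionGen.generate());
    }
}
```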
CN202210313490.5A 2022-03-28 2022-03-28 Elastic scaling stream processing method and system based on self-adaptive load partition Pending CN114675969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210313490.5A CN114675969A (en) 2022-03-28 2022-03-28 Elastic scaling stream processing method and system based on self-adaptive load partition

Publications (1)

Publication Number Publication Date
CN114675969A true CN114675969A (en) 2022-06-28

Family

ID=82075675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210313490.5A Pending CN114675969A (en) 2022-03-28 2022-03-28 Elastic scaling stream processing method and system based on self-adaptive load partition

Country Status (1)

Country Link
CN (1) CN114675969A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116319381A (en) * 2023-05-25 2023-06-23 中国地质大学(北京) Communication and resource-aware data stream grouping method and system
CN116319381B (en) * 2023-05-25 2023-07-25 中国地质大学(北京) Communication and resource-aware data stream grouping method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination