CN106845708A

CN106845708A - Multi-objective optimization method of a data stream processing system based on uncertainty

Info

Publication number: CN106845708A
Application number: CN201710044897.1A
Authority: CN
Inventors: 曹朝; 盛伟; 曲大成
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2017-01-20
Filing date: 2017-01-20
Publication date: 2017-06-13
Anticipated expiration: 2037-01-20
Also published as: CN106845708B

Abstract

The invention discloses a multi-objective optimization method of a data stream processing system based on uncertainty, relates to a multi-objective optimization method for data stream processing systems, and belongs to the fields of computer application technology and real-time big data analysis. From the upper and lower bounds of response delay specified by the user and the corresponding upper and lower bounds of throughput rate, the invention computes the area of the uncertain region; with the goal of reducing that area, it obtains a set of representative Pareto optimal solutions through a recursive bisection probing method, giving the user a range of choices between response delay and throughput rate. The invention suits the multi-objective optimization scenarios of different real-time big data analysis systems, has a wide range of application, is highly practical, and is easy to popularize. In addition, the invention operates only on the data itself and is not restricted by the source of the data, so it is applicable to data processing in all engineering applications.

Description

Multi-objective optimization method of data stream processing system based on uncertainty

Technical Field

The invention relates to a multi-objective optimization method of a data stream processing system based on uncertainty, in particular to a multi-objective optimization method for the data stream processing system, and belongs to the field of computer application technology and real-time big data analysis.

Background

In recent years, a large number of real-time big data analysis applications have emerged, such as dynamic social network analysis, intelligent traffic data analysis, large-scale data center monitoring, and gene data analysis. Such applications handle large data volumes, generate or update data continuously and rapidly, and require the data analysis system to return or update analysis results continuously in real time; this is called real-time big data (Big & fast data) analysis. These applications urgently need real-time big data analysis systems that give quantitative guarantees on response delay and throughput rate.

At present, in real-time big data analysis applications, users' requirements on response delay and throughput rate rely on historical experience: IT personnel manually configure an execution plan for each analysis job in the data stream processing system, and there is no quantitative guarantee on the response delay and throughput rate of the analysis. Even experienced IT personnel cannot ensure that a good execution plan is configured, so analysis jobs run inefficiently and the real-time requirements of upper-layer applications cannot be met.

The method is a multi-objective optimization method designed around two important indexes of real-time big data analysis, namely response delay and throughput rate. A multi-objective optimization model is constructed from given response delay and throughput rate models, which theoretically ensures that the optimal execution plan is selected. Multi-objective optimization of real-time big data analysis systems is important for providing real-time big data analysis cloud services with quality-of-service guarantees, and for providing a real-time big data analysis platform and optimization framework for key national industries and important monitoring applications.

Although the existing multi-objective optimization method based on weighted summation solves the Pareto optimization problem of response delay and throughput rate for convex objective functions under certain constraints, it cannot solve the Pareto optimization problem for concave objective functions. In addition, the weighted-summation method returns to the user a set of solutions that is unevenly distributed, hard to interpret, and not representative, whereas the user actually needs a representative set of solutions on the Pareto curve. Therefore, the weighted-summation method cannot meet the needs of multi-objective optimization in scenarios where IT personnel interact with the system.
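By way of non-limiting illustration, this limitation can be reproduced with a small sketch. The ten points below form a made-up trade-off curve (an assumption for illustration, not data from the invention) in which throughput grows as the square of the allowed delay, so every point is Pareto optimal but the front is non-convex. Sweeping the weight of a weighted sum over a fine grid only ever selects the two end points:

```python
# Toy trade-off curve (hypothetical): (delay, throughput) pairs where
# throughput grows as the square of the delay. Delay is to be minimised and
# throughput maximised, so all ten points are Pareto optimal.
points = [(float(d), float(d * d)) for d in range(1, 11)]


def weighted_sum_pick(points, w):
    """Return the point maximising w * throughput - (1 - w) * delay."""
    return max(points, key=lambda p: w * p[1] - (1.0 - w) * p[0])


# Sweep the weight over a fine grid and collect every point that is ever chosen.
weights = [i / 1000.0 for i in range(1001)]
picked = sorted({weighted_sum_pick(points, w) for w in weights})

print(picked)  # [(1.0, 1.0), (10.0, 100.0)]
```

The eight interior points are never returned for any weight, which is the kind of unrepresentative result set described above.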

Disclosure of Invention

Existing multi-objective optimization methods based on weighted summation do not consider that users trade off response delay against throughput rate during deployment and use, so the Pareto optimal solutions they return are arbitrary. To address this defect, the invention discloses a multi-objective optimization method of a data stream processing system based on uncertainty, which aims to solve the following technical problem: for the multi-objective optimization problem of a data stream processing system, avoid the arbitrariness of the Pareto optimal solutions, obtain a set of representative Pareto optimal solutions, and give the user a range of choices between response delay and throughput rate.

The purpose of the invention is realized by the following technical scheme:

The invention discloses a multi-objective optimization method of a data stream processing system based on uncertainty, which computes the area of the uncertain region from the upper and lower bounds of response delay and the upper and lower bounds of throughput rate specified by the user; with the goal of reducing the area of the uncertain region, a set of representative Pareto optimal solutions is obtained through a recursive bisection probing method, giving the user a range of choices between response delay and throughput rate.

The invention discloses a multi-objective optimization method of a data stream processing system based on uncertainty, which comprises the following steps:

Step 1: input an upper bound of the current response delay, denoted L_upper; input a lower bound of the current response delay, denoted L_lower; and input a threshold for the area of the uncertain region, denoted UA.

Step 2: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate the upper bound and the lower bound of the current throughput rate, respectively.

Step 2.1: based on the upper bound L_upper of the current response delay, calculate the upper bound of the current throughput rate, denoted T_upper, according to formula (1),

wherein s.t. denotes a constraint; c denotes a specific system configuration; λ denotes the real-time input data rate; the maximized quantity, subscripted (c,λ), denotes the throughput rate under system configuration c and real-time input data rate λ; and ψ_(c,λ) denotes the response delay under system configuration c and real-time input data rate λ. Formula (1) states that, given the input data rate λ, among the system configurations c whose response delay ψ_(c,λ) is less than the upper bound L_upper, the configuration that maximizes the throughput rate is sought, and the maximum throughput rate is recorded as T_upper;

Step 2.2: based on the lower bound L_lower of the current response delay, calculate the lower bound of the current throughput rate, denoted T_lower, according to formula (2).

Formula (2) states that, given the input data rate λ, among the system configurations c whose response delay ψ_(c,λ) is less than the lower bound L_lower, the configuration that maximizes the throughput rate is sought, and the maximum throughput rate is recorded as T_lower.
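Formulas (1) and (2) themselves are not reproduced in this text. From the verbal description of steps 2.1 and 2.2, they can plausibly be written as below. This is a reconstruction under stated assumptions: the symbol $\phi_{(c,\lambda)}$ is a placeholder for the throughput rate, whose original symbol does not appear in the text, and the non-strict inequality is assumed (the description says "less than").

```latex
\begin{align}
T_{upper} &= \max_{c}\; \phi_{(c,\lambda)} \quad \text{s.t.} \quad \psi_{(c,\lambda)} \le L_{upper} \tag{1} \\
T_{lower} &= \max_{c}\; \phi_{(c,\lambda)} \quad \text{s.t.} \quad \psi_{(c,\lambda)} \le L_{lower} \tag{2}
\end{align}
```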

Step 3: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate by bisection the current probe response delay, the maximum probe throughput rate, and the system configuration that achieves the maximum probe throughput rate.

Step 3.1: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate the current probe response delay, denoted L_middle. The calculation formula is as follows:

L_middle＝(L_lower+L_upper)/2； (3)

Step 3.2: based on the current probe response delay L_middle, calculate the current maximum probe throughput rate and the system configuration that achieves it, denoted T_middle and c_middle respectively, according to formula (4).
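By way of non-limiting illustration, one probe of step 3 can be sketched as follows, assuming the candidate configurations form a finite list that can be searched exhaustively. The function name `probe`, the callables `delay` and `throughput`, and the toy model are hypothetical and stand in for whatever response delay and throughput rate models the system provides; the `<=` comparison is likewise an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple


def probe(
    configs: Iterable[float],
    delay: Callable[[float, float], float],
    throughput: Callable[[float, float], float],
    rate: float,
    l_lower: float,
    l_upper: float,
) -> Tuple[float, Optional[float], Optional[float]]:
    """Return (L_middle, T_middle, c_middle) for one bisection probe."""
    l_middle = (l_lower + l_upper) / 2.0  # formula (3)
    # Keep only configurations whose modelled delay meets the probe bound.
    feasible = [c for c in configs if delay(c, rate) <= l_middle]
    if not feasible:
        return l_middle, None, None
    # Among those, pick the configuration with the highest modelled throughput.
    c_middle = max(feasible, key=lambda c: throughput(c, rate))
    return l_middle, throughput(c_middle, rate), c_middle


# Toy model (hypothetical): a configuration is a batch interval in seconds,
# its delay equals the interval, and throughput is capped by the input rate.
configs = [0.5, 1.0, 2.0, 4.0, 8.0]
result = probe(
    configs,
    delay=lambda c, rate: c,
    throughput=lambda c, rate: min(rate, 1000.0 * c),
    rate=5000.0,
    l_lower=0.5,
    l_upper=10.0,
)
print(result)  # (5.25, 4000.0, 4.0)
```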

Step 4: based on the upper bound L_upper and the lower bound L_lower of the current response delay, the upper bound T_upper and the lower bound T_lower of the throughput rate, the probe response delay L_middle, and the maximum probe throughput rate T_middle, calculate the uncertain region areas of the current left half and right half, respectively.

Step 4.1: based on the lower bound L_lower of the current response delay and the probe response delay L_middle, together with the lower bound T_lower of the current throughput rate and the maximum probe throughput rate T_middle, calculate the uncertain region area of the current left half, denoted ua_left. The calculation formula is as follows:

ua_left＝(L_middle-L_lower)×(T_middle-T_lower)； (5)

Step 4.2: based on the upper bound L_upper of the current response delay and the probe response delay L_middle, together with the upper bound T_upper of the current throughput rate and the maximum probe throughput rate T_middle, calculate the uncertain region area of the current right half, denoted ua_right. The calculation formula is as follows:

ua_right＝(L_upper-L_middle)×(T_upper-T_middle)； (6)
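By way of non-limiting illustration, formulas (5) and (6) can be sketched as follows. Because relaxing the delay bound can only enlarge the set of feasible configurations, the maximum throughput rate is non-decreasing in the bound, so any undiscovered Pareto point with delay between L_lower and L_middle has throughput between T_lower and T_middle. ua_left is the area of that rectangle, and ua_right is its counterpart on the right. The function name and the numbers are illustrative assumptions.

```python
def uncertainty_areas(l_lower, l_upper, t_lower, t_upper, l_middle, t_middle):
    """Return (ua_left, ua_right), the two rectangle areas around a probe point."""
    ua_left = (l_middle - l_lower) * (t_middle - t_lower)    # formula (5)
    ua_right = (l_upper - l_middle) * (t_upper - t_middle)   # formula (6)
    return ua_left, ua_right


# Illustrative values: delay bounds 1.0 and 5.0, throughput bounds 100.0 and
# 500.0, and a probe at delay 3.0 that achieved throughput 400.0.
areas = uncertainty_areas(1.0, 5.0, 100.0, 500.0, 3.0, 400.0)
print(areas)  # (600.0, 200.0)
```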

Step 5: judge whether the uncertain region areas of the current left half and right half are smaller than or equal to the threshold UA, and thereby determine whether to perform recursive iterative probing.

Step 5.1: judge whether the left-half uncertain region area ua_left is smaller than or equal to the threshold UA, and thereby determine whether to perform recursive iterative probing. The specific process is as follows:

Step 5.1.1: if the left-half uncertain region area ua_left is smaller than or equal to the threshold UA, set the left-half probe result set to the empty set and go to step 5.2; otherwise, go to step 5.1.2;

wherein the left-half probe result set is denoted plan_left, and the calculation formula is as follows:

plan_left＝∅； (7)

Step 5.1.2: take the current probe response delay L_middle as the response delay upper bound L_upper of the next probe, take the current response delay lower bound L_lower as the response delay lower bound L_lower of the next probe, and recursively iterate on the left half; finally, record the left-half probe result set plan_left;

wherein the calculation formula of the left-half probe result set plan_left is as follows:

plan_left＝prob(L_lower,L_middle)； (8)

wherein prob(L_lower,L_upper) denotes recursive iterative probing;

Step 5.2: judge whether the right-half uncertain region area ua_right is smaller than or equal to the threshold UA, and thereby determine whether to perform recursive iterative probing. The specific process is as follows:

Step 5.2.1: if the right-half uncertain region area ua_right is smaller than or equal to the threshold UA, set the right-half probe result set to the empty set and go to step 6; otherwise, go to step 5.2.2;

wherein the right-half probe result set is denoted plan_right, and the calculation formula is as follows:

plan_right＝∅； (9)

Step 5.2.2: take the current probe response delay L_middle as the response delay lower bound L_lower of the next probe, take the current response delay upper bound L_upper as the response delay upper bound L_upper of the next probe, and recursively iterate on the right half; finally, record the right-half probe result set plan_right;

wherein the calculation formula of the right-half probe result set plan_right is as follows:

plan_right＝prob(L_middle,L_upper)； (10)

Step 6: calculate the current probe result set, merge it with the left-half probe result set plan_left and the right-half probe result set plan_right, and return the final probe result set, thereby obtaining a set of representative Pareto optimal solutions for the multi-objective optimization of the data stream processing system.

Step 6.1: based on the current probe response delay L_middle, the maximum probe throughput rate T_middle, and the system configuration c_middle that achieves it, calculate the current probe result set, denoted plan_middle. The calculation formula is as follows:

plan_middle＝{(L_middle,T_middle,c_middle)}； (11)

Step 6.2: merge the current probe result set plan_middle with the left-half probe result set plan_left and the right-half probe result set plan_right, and return the final probe result set, denoted plan. The calculation formula is as follows:

plan＝plan_left∪plan_middle∪plan_right； (12)

The final probe result set plan that is returned is the set of representative Pareto optimal solutions for the multi-objective optimization of the data stream processing system.
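By way of non-limiting illustration, steps 2 to 6 can be combined into a single recursive routine. The sketch below assumes a finite list of candidate configurations searched exhaustively, represents a configuration as a plain float, and uses a toy delay and throughput model; `make_prob`, `best`, and the model are hypothetical names and assumptions, and only `prob(L_lower, L_upper)` mirrors the notation of the text.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[float, float, float]  # (L_middle, T_middle, c_middle)


def make_prob(
    configs: Sequence[float],
    delay: Callable[[float, float], float],
    throughput: Callable[[float, float], float],
    rate: float,
    ua: float,
) -> Callable[[float, float], List[Plan]]:
    """Build prob(L_lower, L_upper) for a finite set of candidate configurations."""

    def best(bound: float) -> Tuple[float, float]:
        # Constrained maximisation: highest throughput with delay <= bound.
        # Assumes at least one candidate satisfies the bound.
        feasible = [c for c in configs if delay(c, rate) <= bound]
        c_best = max(feasible, key=lambda c: throughput(c, rate))
        return throughput(c_best, rate), c_best

    def prob(l_lower: float, l_upper: float) -> List[Plan]:
        t_upper, _ = best(l_upper)                                # step 2.1
        t_lower, _ = best(l_lower)                                # step 2.2
        l_middle = (l_lower + l_upper) / 2.0                      # step 3.1
        t_middle, c_middle = best(l_middle)                       # step 3.2
        ua_left = (l_middle - l_lower) * (t_middle - t_lower)     # step 4.1
        ua_right = (l_upper - l_middle) * (t_upper - t_middle)    # step 4.2
        # Steps 5.1 and 5.2: recurse only into halves that are still too uncertain.
        plan_left = [] if ua_left <= ua else prob(l_lower, l_middle)
        plan_right = [] if ua_right <= ua else prob(l_middle, l_upper)
        # Step 6: merge the three result sets, ordered by response delay.
        return plan_left + [(l_middle, t_middle, c_middle)] + plan_right

    return prob


# Toy model (hypothetical): a configuration is a batch interval in seconds,
# its delay equals the interval, and throughput saturates as the interval grows.
configs = [0.5 + 0.25 * i for i in range(39)]  # 0.5, 0.75, ..., 10.0
prob = make_prob(
    configs,
    delay=lambda c, rate: c,
    throughput=lambda c, rate: rate * c / (c + 1.0),
    rate=1000.0,
    ua=50.0,
)
plan = prob(0.5, 10.0)
for l_mid, t_mid, c_mid in plan:
    print(f"delay <= {l_mid:.4f}  throughput = {t_mid:.2f}  config = {c_mid}")
```

In this sketch the recursion stops because each level halves the delay interval while the throughput difference stays bounded, so every rectangle area eventually falls below the threshold.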

Advantageous effects:

1. The multi-objective optimization method of a data stream processing system based on uncertainty disclosed by the invention starts from the upper and lower bounds of response delay and the relation between Pareto optimal points and constrained optimization solutions, and uses the area of the uncertain region as the measure of uncertainty, thereby providing a quantitative criterion for deciding the probing depth.

2. The method uses bisection probing with the goal of reducing the area of the uncertain region, which improves the efficiency of finding representative Pareto optimal solutions.

3. The method can return a series of meaningful and representative Pareto optimal solutions within the response delay or throughput rate range specified by the user, ensuring that the user can obtain the desired optimal solution within that range.

4. The method suits the multi-objective optimization scenarios of different real-time big data analysis systems, and has a wide range of application, strong practicability, and is easy to popularize.

5. The method operates only on the data itself, can obtain a set of representative Pareto optimal solutions without being limited by the source of the data, and is suitable for data processing in all engineering applications.

Drawings

FIG. 1 is a schematic flowchart of the multi-objective optimization method of a data stream processing system based on uncertainty according to the present invention and of embodiment 1;

FIG. 2 is a schematic flowchart of the recursive iterative probing in embodiment 2 of the present invention;

FIG. 3 is an experimental comparison of the present method and the weighted-summation method in embodiment 1 of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the drawings and examples, but the present invention is not limited to these examples.

Example 1:

This embodiment describes the application of the multi-objective optimization method of a data stream processing system based on uncertainty to a specific real-time big data analysis system, Apache Spark Streaming.

Fig. 1 is a flowchart of an algorithm of the method and a flowchart of the present embodiment. As can be seen from the figure, the method comprises the following steps:

Step 1: the upper bound L_upper of the current response delay is initialized to 10.0, the lower bound L_lower of the current response delay is initialized to 0.5, and the threshold UA for the area of the uncertain region is initialized to 10000.

Step 2: an upper bound and a lower bound of the current throughput rate are calculated, respectively, based on the upper bound 10.0 and the lower bound 0.5 of the current response delay.

Step 2.1: based on the upper bound 10.0 of the current response delay, the upper bound T_upper of the current throughput rate is calculated to be 1677367.1230139078, according to formula (1),

wherein s.t. denotes a constraint; c denotes a specific system configuration; λ denotes the real-time input data rate; the maximized quantity, subscripted (c,λ), denotes the throughput rate under system configuration c and real-time input data rate λ; and ψ_(c,λ) denotes the response delay under system configuration c and real-time input data rate λ. Formula (1) states that, given the input data rate λ, among the system configurations c whose response delay ψ_(c,λ) is less than the upper bound L_upper, the configuration that maximizes the throughput rate is sought, and the maximum throughput rate is recorded as T_upper;

Step 2.2: based on the lower bound 0.5 of the current response delay, the lower bound T_lower of the current throughput rate is calculated to be 1288034.188034188, according to formula (2).

Formula (2) states that, given the input data rate λ, among the system configurations c whose response delay ψ_(c,λ) is less than the lower bound L_lower, the configuration that maximizes the throughput rate is sought, and the maximum throughput rate is recorded as T_lower;

Step 3: based on the upper bound 10.0 and the lower bound 0.5 of the current response delay, calculate by bisection the current probe response delay, the maximum probe throughput rate, and the system configuration that achieves the maximum probe throughput rate.

Step 3.1: based on the upper bound 10.0 and the lower bound 0.5 of the current response delay, the current probe response delay L_middle is calculated to be 5.25. The calculation is as follows:

L_middle＝(L_lower+L_upper)/2＝(0.5+10.0)/2＝5.25； (3)

Step 3.2: based on the current probe response delay 5.25, the current maximum probe throughput rate T_middle and the system configuration c_middle that achieves it are calculated to be 1674561.0772396186 and c₁, respectively, according to formula (4).

Step 4: based on the upper bound 10.0 and the lower bound 0.5 of the current response delay, the upper bound 1677367.1230139078 and the lower bound 1288034.188034188 of the throughput rate, the probe response delay 5.25, and the maximum probe throughput rate 1674561.0772396186, calculate the uncertain region areas of the current left half and right half, respectively.

Step 4.1: based on the lower bound 0.5 of the current response delay and the probe response delay 5.25, together with the lower bound 1288034.188034188 of the current throughput rate and the maximum probe throughput rate 1674561.0772396186, the uncertain region area ua_left of the current left half is calculated to be 1836002.723725793. The calculation is as follows:

ua_left＝(L_middle-L_lower)×(T_middle-T_lower)＝(5.25-0.5)×(1674561.0772396186-1288034.188034188)＝1836002.723725793； (5)

Step 4.2: based on the upper bound 10.0 of the current response delay and the probe response delay 5.25, together with the upper bound 1677367.1230139078 of the current throughput rate and the maximum probe throughput rate 1674561.0772396186, the uncertain region area ua_right of the current right half is calculated to be 13328.71742787275. The calculation is as follows:

ua_right＝(L_upper-L_middle)×(T_upper-T_middle)＝(10.0-5.25)×(1677367.1230139078-1674561.0772396186)＝13328.71742787275； (6)

Step 5: judge whether the uncertain region areas of the current left half and right half are smaller than or equal to the threshold 10000, and thereby determine whether to perform recursive iterative probing.
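As a check on step 5, substituting the values stated in steps 1 to 3 into formulas (5) and (6) gives the following, with results rounded to two decimals:

```latex
\begin{align}
ua_{left}  &= (5.25 - 0.5) \times (1674561.0772396186 - 1288034.188034188)
            = 4.75 \times 386526.8892054306 \approx 1836002.72 > 10000 \\
ua_{right} &= (10.0 - 5.25) \times (1677367.1230139078 - 1674561.0772396186)
            = 4.75 \times 2806.0457742892 \approx 13328.72 > 10000
\end{align}
```

Both areas exceed the threshold UA = 10000, so both halves are probed recursively.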

Step 6.1: based on the current probe response delay 5.25, the maximum probe throughput rate 1674561.0772396186, and the system configuration c₁ that achieves it, the current probe result set plan_middle is calculated to be {(5.25,1674561.0772396186,c₁)}. The calculation is as follows:

plan_middle＝{(L_middle,T_middle,c_middle)}＝{(5.25,1674561.0772396186,c₁)}； (11)

Step 6.2: merge the current probe result set plan_middle with the left-half probe result set plan_left and the right-half probe result set plan_right, and return the final probe result set plan. The calculation is as follows:

plan＝plan_left∪plan_middle∪plan_right＝plan_left∪{(5.25,1674561.0772396186,c₁)}∪plan_right； (12)

The experimental comparison between the present method and the weighted-summation method is shown in fig. 3, where the abscissa represents response delay (sec), the ordinate represents throughput rate (million records/sec), and each point represents the maximum throughput rate achievable at a given response delay, i.e., a Pareto optimal solution. The left panel shows the weighted-summation method and the right panel shows the present method. As fig. 3 shows, the Pareto optimal solutions of the weighted-summation method are concentrated in a small part of the region; they cannot represent the distribution of response delay and throughput rate over the whole space and do not give the user a representative set of Pareto optimal solutions. The solutions of the present method are distributed evenly over the whole space, offer the user a variety of optimal trade-offs between response delay and throughput rate, and form a representative set of Pareto optimal solutions.

Example 2:

This embodiment illustrates in detail the recursive iterative probing of step 5 of the invention, as applied in step 5 of embodiment 1; the algorithm flow is shown in fig. 2. As can be seen from fig. 2, the specific steps of recursive iterative probing are:

Step 5.1: judge whether the left-half uncertain region area 1836002.723725793 is smaller than or equal to the threshold 10000, and thereby determine whether to perform recursive iterative probing. The specific process is as follows:

Step 5.1.1: if the left-half uncertain region area 1836002.723725793 is smaller than or equal to the threshold 10000, set the left-half probe result set to the empty set and go to step 5.2; otherwise, go to step 5.1.2;

Step 5.1.2: take the current probe response delay 5.25 as the response delay upper bound L_upper of the next probe, take the current response delay lower bound 0.5 as the response delay lower bound L_lower of the next probe, and recursively iterate on the left half; finally, record the left-half probe result set plan_left;

wherein prob(L_lower,L_upper) denotes recursive iterative probing;

Step 5.2: judge whether the right-half uncertain region area 13328.71742787275 is smaller than or equal to the threshold 10000, and thereby determine whether to perform recursive iterative probing. The specific process is as follows:

Step 5.2.1: if the right-half uncertain region area 13328.71742787275 is smaller than or equal to the threshold 10000, set the right-half probe result set to the empty set and go to step 6; otherwise, go to step 5.2.2;

wherein the calculation formula of the right-half probe result set plan_right is as follows:

Step 5.2.2: take the current probe response delay 5.25 as the response delay lower bound L_lower of the next probe, take the current response delay upper bound 10.0 as the response delay upper bound L_upper of the next probe, and recursively iterate on the right half; finally, record the right-half probe result set plan_right;

Thus, steps 5.1 to 5.2 complete the recursive iterative probing of step 5 in embodiment 1.
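For illustration, applying formula (3) to the two sub-intervals formed in steps 5.1.2 and 5.2.2 gives the probe response delays of the next recursion level. The corresponding throughput rates depend on the system model and are not stated in the text.

```latex
\begin{align}
L_{middle}^{(left)}  &= (0.5 + 5.25)/2 = 2.875 \\
L_{middle}^{(right)} &= (5.25 + 10.0)/2 = 7.625
\end{align}
```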

Example 3:

The specific real-time big data analysis system Apache Spark Streaming in embodiment 1 can be replaced with other real-time big data analysis systems such as Apache Storm or Google Dataflow; that is, the multi-objective optimization method provided by the invention is not limited by the source of the data and is suitable for data processing in all engineering applications.

Technical content not described in the above embodiments can be implemented by adopting or referring to existing technologies.

While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims

1. A multi-objective optimization method of a data stream processing system based on uncertainty, characterized by comprising the following steps:

Step 1: input an upper bound of the current response delay, denoted L_upper; input a lower bound of the current response delay, denoted L_lower; and input a threshold for the area of the uncertain region, denoted UA;

Step 2: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate the upper bound and the lower bound of the current throughput rate, respectively;

Step 3: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate by bisection the current probe response delay, the maximum probe throughput rate, and the system configuration that achieves the maximum probe throughput rate;

Step 3.1: based on the upper bound L_upper and the lower bound L_lower of the current response delay, calculate the current probe response delay, denoted L_middle. The calculation formula is as follows:

L_middle＝(L_lower+L_upper)/2； (3)

Step 4: based on the upper bound L_upper and the lower bound L_lower of the current response delay, the upper bound T_upper and the lower bound T_lower of the throughput rate, the probe response delay L_middle, and the maximum probe throughput rate T_middle, calculate the uncertain region areas of the current left half and right half, respectively;

ua_left＝(L_middle-L_lower)×(T_middle-T_lower)； (5)

ua_right＝(L_upper-L_middle)×(T_upper-T_middle)； (6)

Step 5: judge whether the uncertain region areas of the current left half and right half are smaller than or equal to the threshold UA, and thereby determine whether to perform recursive iterative probing;

2. The multi-objective optimization method of a data stream processing system based on uncertainty according to claim 1, wherein step 2 is specifically implemented as follows:

Step 2.1: based on the upper bound L_upper of the current response delay, calculate the upper bound of the current throughput rate, denoted T_upper, according to formula (1);

3. The multi-objective optimization method of a data stream processing system based on uncertainty according to claim 1 or 2, wherein step 6 is specifically implemented as follows:

Step 6.1: based on the current probe response delay L_middle, the maximum probe throughput rate T_middle, and the system configuration c_middle that achieves it, calculate the current probe result set, denoted plan_middle. The calculation formula is as follows:

plan_middle＝{(L_middle,T_middle,c_middle)}； (11)

plan＝plan_left∪plan_middle∪plan_right； (12)

4. The multi-objective optimization method of a data stream processing system based on uncertainty according to claim 3, wherein step 5 is specifically implemented as follows:

plan_left＝prob(L_lower,L_middle)； (8)

wherein prob(L_lower,L_upper) denotes recursive iterative probing;

Step 5.2: judge whether the right-half uncertain region area ua_right is smaller than or equal to the threshold UA, and thereby determine whether to perform recursive iterative probing. The specific process is as follows:

plan_right＝prob(L_middle,L_upper)； (10)。

5. A multi-objective optimization method of a data stream processing system based on uncertainty, characterized in that: the area of the uncertain region is computed from the upper and lower bounds of response delay and the upper and lower bounds of throughput rate specified by the user; with the goal of reducing the area of the uncertain region, a set of representative Pareto optimal solutions is obtained through a recursive bisection probing method, giving the user a range of choices between response delay and throughput rate.