CN116346827B - Real-time grouping method and system for skewed data streams - Google Patents

Real-time grouping method and system for skewed data streams

Info

Publication number
CN116346827B
CN116346827B (application CN202310625541.2A)
Authority
CN
China
Prior art keywords
instance
data stream
frequency
key
candidate
Prior art date
Legal status
Active
Application number
CN202310625541.2A
Other languages
Chinese (zh)
Other versions
CN116346827A (en)
Inventor
孙大为 (Sun Dawei)
雷思 (Lei Si)
Current Assignee
China University of Geosciences Beijing
Original Assignee
China University of Geosciences Beijing
Priority date
Filing date
Publication date
Application filed by China University of Geosciences Beijing
Priority to CN202310625541.2A
Publication of CN116346827A
Application granted
Publication of CN116346827B
Status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0852 Delays
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of data stream grouping, and in particular to a real-time grouping method and system for skewed data streams. The method comprises the following steps: a monitor periodically acquires the operation information of the system; when the upstream instance outputs a data stream, the grouper obtains the frequency of each key value in the data stream; the grouper classifies the key values according to their frequencies, the classification result comprising high-frequency keys and low-frequency keys; the grouper determines a candidate instance set according to the classification result; the grouper then determines a target instance in the candidate instance set according to a candidate-instance allocation weight table and allocates the elements in the data stream to the target instance. With the invention, high-frequency keys can be distributed to all downstream instances, instance weights are calculated from average processing delay and network delay, and after multiple rounds of feedback adjustment the load between instances settles at a relatively balanced level.

Description

Real-time grouping method and system for skewed data streams
Technical Field
The invention relates to the technical field of data stream grouping, and in particular to a real-time grouping method and system for skewed data streams.
Background
In distributed stream computing systems, data skew and cluster heterogeneity can result in load maldistribution among the parallel processing tasks of stateful operators. Existing stream grouping schemes mainly focus on balancing the data partitioning of stateful operators while ignoring the influence of differences in computing-node processing capability and of network cost on system performance; they cannot meet the high runtime demands on the elasticity and scalability of a distributed stream computing system, resulting in high delay and low throughput.
Shuffle Grouping (SG) and Key Grouping (KG) are the most representative grouping schemes in distributed stream computing systems. SG allocates tuples to downstream parallel instances in round-robin fashion, ensuring that the number of tuples processed by each instance is substantially the same. KG takes a specified field as the key and assigns tuples to downstream instances according to a hash function. However, for stateful operators, although SG can effectively achieve data-level load balancing, its memory and state-aggregation cost is too high for it to scale easily. KG can store state simply, but it easily causes load imbalance among multiple instances. Moreover, neither SG nor KG takes into account inter-instance network cost, instance processing rates, or variations in data stream content and rate.
To handle the unbalanced load caused by data skew, Partial Key Grouping (PKG) uses two new techniques, key splitting and local load estimation, to adapt the classic "power of two choices" to the distributed stream computing setting. Where two candidate instances are insufficient for a high-frequency key, D-Choices assigns hot keys to d ≥ 2 candidate instances according to frequency.
Likewise, to address hot keys that vary over time, a new load balancing mechanism (FISH) proposed identifying recent hot keys based on tuple-count decay and allocating tuples by heuristically estimating the states of downstream workers. Subsequently, a popularity-aware differentiated distributed stream processing system (PStream) assigns hot keys using SG and the less common keys using KG. PStream uses a lightweight probabilistic counting scheme to identify the current hot keys and designs an adaptive threshold configuration scheme to accommodate dynamic popularity variations in real-time streams.
PFG proposes a sketch-based pre-filter grouping algorithm that uses a heavy-hitter algorithm to dynamically monitor the items in the stream. Detected high-frequency keys are mapped to more than two candidate instances within a limited set of workers, with consecutive candidate-instance IDs; the less frequent keys, on the other hand, are mapped directly to two candidate instances. PFG uses local load estimation to select, among the candidate instances, the one that has been allocated the least data to process as the target instance.
Many studies have considered locality to improve system performance by reducing network cost; in this case, associated keys are assigned to instances hosted on the same computing node. Among them, a stochastic locality-aware stream partitioning (SLSP) method was proposed that considers both task locality and downstream state. Squirrel proposes a network-aware grouping method that sets dynamic weights and priorities for each downstream instance based on the network locations of, and the load between, instances.
In summary, the above solutions provide valuable insights for data stream grouping, but they lack sufficient elastic scalability, and problems remain such as excessive memory overhead and the resource waste caused by load backlog on instances with slow processing capability.
Disclosure of Invention
The embodiment of the invention provides a real-time grouping method and a real-time grouping system for skewed data streams. The technical scheme is as follows:
in one aspect, a real-time grouping method for a skewed data stream is provided, the method being implemented by a real-time grouping system for skewed data streams, wherein the system comprises a grouper installed on an upstream instance and a monitor installed on a downstream instance;
the method comprises the following steps:
the monitor periodically acquires the operation information of the system;
when an upstream instance outputs a data stream, the grouper acquires the frequency of each key value in the data stream;
the grouper classifies the key values in the data stream according to their frequencies, wherein the classification result of a key value comprises a high-frequency key and a low-frequency key;
the grouper determines a candidate instance set according to the classification result of the key values in the data stream;
and the grouper determines a target instance in the candidate instance set according to a candidate-instance allocation weight table, and allocates the tuples in the data stream to the target instance.
Optionally, the acquiring, by the grouper, of the frequencies of the key values in the data stream includes:
based on a double-layer frequency statistical model, the grouper acquires the frequencies of the key values in the data stream;
the double-layer frequency statistical model consists of two layers: the first layer is a filter composed of counters, used for storing the high-frequency keys in the data stream together with their counts; the second layer is a sketch, which tracks the counts of the other key values in real time using the classic CMS (Count-Min Sketch) architecture.
Optionally, the grouper classifying the key values in the data stream according to their frequencies, wherein the classification result of a key value comprises a high-frequency key and a low-frequency key, includes:
setting a classification threshold θ and dynamically adjusting θ according to the downstream instance load;
when the frequency of a key value in the data stream is greater than or equal to the classification threshold θ, determining the key value to be a high-frequency key;
when the frequency of a key value in the data stream is less than the classification threshold θ, determining the key value to be a low-frequency key.
Optionally, the setting of the classification threshold θ and dynamically adjusting it according to the downstream instance load includes:
setting the classification threshold θ and initializing it to 1/n, where n is the total number of downstream instances;
dynamically adjusting θ: if the load imbalance of the downstream instances is greater than a preset maximum load imbalance, θ is decreased by a multiplicative factor; if the load imbalance of the downstream instances is less than a preset minimum load imbalance, θ is increased linearly.
Optionally, the grouper determining a candidate instance set according to the classification result of the key values in the data stream includes:
when the classification result of a key value in the data stream is a high-frequency key, the grouper determines all downstream instances as the candidate instance set of that key value;
when the classification result of a key value in the data stream is a low-frequency key, the grouper maps out the candidate instance set of that key value among the downstream instances through two independent hash functions.
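The candidate-set construction described above can be sketched as follows; MD5-derived hashes stand in for the two independent hash functions, which the patent does not specify:

```python
import hashlib

def candidates(key: str, n: int, is_high_frequency: bool) -> set:
    """Build the candidate instance set for a key.

    High-frequency keys are eligible for every downstream instance;
    low-frequency keys get (at most) two candidates picked by two
    independent hash functions, as in Partial Key Grouping."""
    if is_high_frequency:
        return set(range(n))                 # all n downstream instances
    # Two independent hash functions, emulated by salting MD5 differently.
    h1 = int(hashlib.md5(b"1:" + key.encode()).hexdigest(), 16) % n
    h2 = int(hashlib.md5(b"2:" + key.encode()).hexdigest(), 16) % n
    return {h1, h2}                          # one or two candidates
```

Because the two hashes are deterministic, every tuple of the same low-frequency key always sees the same candidate pair, which keeps that key's state confined to at most two instances.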
Optionally, the operation information includes the instance processing rate, the data input rate, and the network delay between the upstream and downstream instances;
the grouper determining a target instance in the candidate instance set according to a candidate-instance allocation weight table, and allocating the tuples in the data stream to the target instance, comprises the following steps:
according to the instance processing rate and the data input rate acquired by the monitor, the grouper estimates the average processing delay of each instance through a queuing model;
according to the average processing delay of an instance and the network delay between the upstream and downstream instances, the weight corresponding to that instance is determined; the weight corresponding to each instance is determined in turn, and the candidate-instance allocation weight table is generated;
and according to the generated candidate-instance allocation weight table, a target instance is determined in the candidate instance set, and the tuples in the data stream are allocated to the target instance.
Optionally, the grouper determining a target instance in the candidate instance set according to the candidate-instance allocation weight table, and allocating the tuples in the data stream to the target instance, includes:
calculating the sum of the weights of all candidate instances in the candidate instance set, drawing a random number between 0 and the sum of the weights, and traversing the candidate-instance allocation weight table; when the random number falls within the weight range of a certain candidate instance in the candidate instance set, that candidate instance is determined to be the target instance, and the tuples in the data stream are allocated to the target instance.
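The weighted random selection just described can be sketched as follows (the instance names and weights are illustrative):

```python
import random

def pick_target(weight_table):
    """Draw a random number in [0, sum of weights) and walk the
    candidate-instance weight table until the cumulative weight
    passes it; the instance whose range contains the draw wins."""
    total = sum(w for _, w in weight_table)
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for instance_id, w in weight_table:
        cumulative += w
        if r < cumulative:
            return instance_id
    return weight_table[-1][0]           # guard against float rounding

# A candidate set where inst-0 carries three times inst-1's weight.
table = [("inst-0", 3.0), ("inst-1", 1.0)]
picks = [pick_target(table) for _ in range(1000)]
```

Over many tuples the allocation frequency of each candidate converges to its weight share (here roughly 3:1), while individual tuples are still spread stochastically.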
In another aspect, a real-time grouping system for a skewed data stream is provided, where the system is used to implement the real-time grouping method for a skewed data stream, and comprises a grouper installed on an upstream instance and a monitor installed on a downstream instance; wherein,
the monitor periodically acquires the operation information of the system;
the grouper obtains the frequencies of the key values in the data stream when the upstream instance outputs the data stream; classifies the key values according to their frequencies, wherein the classification result of a key value comprises a high-frequency key and a low-frequency key; determines a candidate instance set according to the classification result; and determines a target instance in the candidate instance set according to a candidate-instance allocation weight table, allocating the tuples in the data stream to the target instance.
Optionally, the grouper is configured to:
acquire, based on a double-layer frequency statistical model, the frequencies of the key values in the data stream;
the double-layer frequency statistical model consists of two layers: the first layer is a filter composed of counters, used for storing the high-frequency keys in the data stream together with their counts; the second layer is a sketch, which tracks the counts of the other key values in real time using the classic CMS architecture.
Optionally, the grouper is configured to:
set a classification threshold θ and dynamically adjust θ according to the downstream instance load;
when the frequency of a key value in the data stream is greater than or equal to the classification threshold θ, determine the key value to be a high-frequency key;
when the frequency of a key value in the data stream is less than the classification threshold θ, determine the key value to be a low-frequency key.
The technical scheme provided by the embodiment of the invention has at least the following beneficial effects:
(1) High-frequency key values in the data stream are identified by counting key-value frequencies, and key splitting is used to distribute a high-frequency key value to multiple parallel instances for joint processing, solving the system bottleneck caused by high-frequency keys.
(2) The invention uses a monitor to obtain the instance processing rates and data input rates, estimates the average processing delay of tuples in each instance through a queuing model, and, by monitoring the network delay between instances, steers communication between components toward nearer, lower-delay paths; the instance processing rate and the network delay are combined to jointly adjust the allocation of tuples in the data stream.
(3) The invention distributes high-frequency keys to all downstream instances and calculates instance weights from average processing delay and network delay; after multiple rounds of feedback adjustment, the load between instances settles at a relatively balanced level.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1-1 is a flow chart of a real-time grouping method for skewed data streams according to an embodiment of the present invention;
Fig. 1-2 is a schematic block diagram of a real-time grouping method for skewed data streams according to an embodiment of the present invention;
Fig. 2-1 is a schematic diagram of CAS frequency statistics provided by embodiments of the present invention;
Fig. 2-2 is a schematic diagram of algorithm 1 for determining key frequencies according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of algorithm 2 for frequency classification according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of inter-system message passing and processing provided by an embodiment of the present invention;
Fig. 5-1 is a schematic diagram of algorithm 3 for delay sensing according to an embodiment of the present invention;
Fig. 5-2 is a schematic diagram of a specific case of weighted random distribution according to an embodiment of the present invention;
Fig. 5-3 is a schematic diagram of algorithm 4 for instance allocation according to an embodiment of the present invention;
Fig. 6 is a block diagram of a real-time grouping system for skewed data streams according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantages more apparent, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a real-time grouping method for skewed data streams, which can be implemented by a real-time grouping system for skewed data streams. The system mainly comprises a grouper installed on the upstream instance and a monitor installed on the downstream instance. The monitor is used to collect and analyze runtime information stored in a database, such as network delay, data stream rate, and instance processing rate. The grouper is used to identify high-frequency keys in the real-time data stream, determine the candidate instances of each key, and dynamically adjust the tuple allocation weights of the downstream instances according to the runtime information.
As shown in fig. 1-1, the flow chart of the real-time grouping method for skewed data streams, and in fig. 1-2, its schematic block diagram, the processing flow of the method may include the following steps:
S1, the monitor periodically acquires the operation information of the system.
The operation information may include the instance processing rate, the data input rate, and the network delay between upstream and downstream instances, where the instance processing rate is the rate at which a downstream instance processes tuples.
S2, when the upstream instance outputs the data stream, the grouper obtains the frequencies of the key values in the data stream.
Optionally, step S2 may specifically include:
based on the double-layer frequency statistical model, the grouper obtains the frequencies of the key values in the data stream.
The double-layer frequency statistical model consists of two layers: the first layer is a filter composed of counters, used for storing the high-frequency keys in the data stream together with their counts; the second layer is a sketch, which tracks the counts of the other key values in real time using the classic CMS architecture.
In a possible implementation, when the upstream instance outputs the data stream, the statistical frequencies of the tuples in the data stream are first updated in real time (including the periodic time-decay of the counts) to obtain the key frequencies.
The frequency statistics mainly computes the frequencies of the key values for the next stage from their occurrence counts in the data stream; the key-value classification algorithm then classifies the keys by these frequencies. Since the grouping algorithm must be lightweight so that it can react quickly, the frequency statistics uses the simplest statistical approach, i.e. empirical probability.
In the data stream S, the total number of tuples input so far is defined as m. When a key value k occurs i times in the data stream S, the empirical probability of k occurring in S is defined as P(k) = i/m. The empirical probability distribution D of the key values in the data stream is shown in formula (1).
D = { P(k) = f(k)/m | k ∈ S }   (1)
where f(k) denotes the number of occurrences of key k in S.
First, the algorithm needs to track the number of occurrences of each key. When the keys in a data stream belong to a large domain (e.g., IP addresses used as keys), the system cannot count high-frequency keys while preserving every key and its frequency: the memory consumed by storing the complete key domain is too large, and doing so is generally unnecessary. The system therefore stores only the high-frequency key values with their corresponding frequencies; for low-frequency key values it stores only the frequencies, not the key contents.
The grouper uses a double-layer frequency counter (CAS) that combines a counter-based filter and a sketch to improve the accuracy of the frequency statistics. The frequency statistical model consists of two layers: a filter of C counters forms the first layer and stores the high-frequency key values in the data stream together with their counts. Here newCount is the total count of a key value, and oldCount records the old count at the time of the last data exchange with the sketch. The second part is the sketch, which tracks key-value frequencies in real time using the classic CMS architecture. The CMS is a matrix of width w and depth d. At counting time, a key value is mapped to one specific position in each of the d rows by d independent hash functions with range w, as shown in FIG. 2-1.
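A minimal sketch of such a two-layer counter, assuming MD5-based row hashes and omitting the filter/sketch exchange step of the full CAS model for brevity:

```python
import hashlib

class TwoLayerCounter:
    """Two-layer frequency model: a small exact filter for the first
    keys seen (capacity C), backed by a Count-Min Sketch (CMS) of
    width w and depth d for all remaining keys. Parameters are
    illustrative, not values from the patent."""

    def __init__(self, capacity=4, width=64, depth=3):
        self.capacity = capacity                        # C: filter size
        self.filter = {}                                # key -> exact count
        self.width, self.depth = width, depth
        self.cms = [[0] * width for _ in range(depth)]  # second layer

    def _cells(self, key):
        # One independent hash per CMS row, emulated by salting MD5.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, key):
        if key in self.filter:
            self.filter[key] += 1
        elif len(self.filter) < self.capacity:
            self.filter[key] = 1
        else:                                           # overflow into the CMS
            for row, col in self._cells(key):
                self.cms[row][col] += 1

    def estimate(self, key):
        if key in self.filter:
            return self.filter[key]                     # exact count
        # CMS only over-counts, so the row minimum is the best estimate.
        return min(self.cms[row][col] for row, col in self._cells(key))

cas = TwoLayerCounter(capacity=2)
for k in ["a"] * 5 + ["b"] * 3 + ["x"]:
    cas.add(k)
```

Keys absorbed by the filter are counted exactly, which is why diverting high-frequency keys there reduces the collision error left in the sketch.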
The filter computes the frequencies of the high-frequency key values essentially exactly, and by filtering them out it reduces the error caused by hash collisions in the sketch, so that low-frequency key values are not mistaken for high-frequency ones. When the frequency of a key value in the sketch rises above the minimum key-value frequency in the filter, a data exchange occurs between the filter and the sketch. For items in the filter that have never taken part in a data exchange, oldCount = 0. The error of the filter arises only when a data exchange with the sketch is performed; therefore, the smaller the sketch error, the more accurate the frequency statistics. The sketch uses d different hash functions h_1, ..., h_d to map a key value to positions in the matrix M. The frequency lookup in the sketch is shown in formula (2).
f̂(k) = min_{1≤j≤d} M[j][h_j(k)]   (2)
To guarantee a certain accuracy of the sketch, define the accuracy parameter as ε and the error probability parameter as δ. Then the sketch width is w = ⌈e/ε⌉ and the depth is d = ⌈ln(1/δ)⌉. With probability at least 1 − δ, the sketch's estimated frequency for any element deviates from its true frequency by less than ε times the sum of the true frequencies of all elements counted in the sketch. That is, if the total frequency of the key values held in the filter is F, then for any key k in the sketch, f̂(k) − f(k) < ε · (m − F) with probability at least 1 − δ.
The expected error of the double-layer statistical model CAS is shown in formula (3).
E(error) < ε · (m − F)   (3)
The frequency statistics algorithm supports frequency lookup of key values. When a key value is stored in the filter, the average lookup time is O(C). When the key value is stored in the sketch, the algorithm first determines whether the key is in the filter by traversing the filter; if it is not found there, the key is located in the sketch, where the lookup time is O(d). The expected lookup time of the statistical model is shown in formula (4).
E(T) = O(C) + (1 − p) · O(d)   (4)
where p is the probability that the key resides in the filter.
Thus, as the size of the filter increases, the error decreases but the processing time increases. Moreover, when the filter employs a different data structure, its lookup complexity also differs.
Since distributed stream computing systems are very sensitive to application delay, both frequency statistics and lookup should be lightweight, requiring little computational and memory overhead. Meanwhile, when the data stream rate is high, the key-value counts in the frequency statistics model grow by orders of magnitude as time accumulates, which easily causes memory overflow and requires special handling. The frequency statistics model therefore applies a time-decay method during counting, weighting the counts in the current time interval more heavily than the counts in previous intervals.
The frequency statistics algorithm sets a time decay period T; decay occurs at the end of each period. Each item count in the CAS is updated as shown in formula (5).
count ← λ · count   (5)
Here λ is the decay coefficient, with value range [0, 1]. At the same time, for computing key-value frequencies, the current tuple total m is also multiplied by the decay coefficient. The value of λ depends on the functional requirements: if historical data should retain some influence on the current data, λ should be relatively large; if the function emphasizes statistics over the most recent interval, λ should be assigned a smaller value, indicating that historical data has little impact on the current data. In the extreme case, setting λ to 0 discards the history entirely: at the end of each period every count is reinitialized to 0 and counting restarts. Setting λ to 1 indicates that all data since input from the stream began carries the same weight in the statistics. In addition, the value of λ is related to how quickly the content of the data stream changes: when the content changes very quickly, λ should be smaller; when the content changes only gradually, λ can be larger. In summary, after considering both the function and the rate of data change, λ should generally be small.
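The periodic decay step can be sketched as follows (λ = 0.5 is only an illustrative value):

```python
def apply_decay(counts: dict, total: float, lam: float = 0.5):
    """At the end of each period T, multiply every stored count and
    the running tuple total m by the decay coefficient lambda, so
    older observations weigh less than recent ones."""
    decayed = {k: c * lam for k, c in counts.items()}
    return decayed, total * lam

# One decay round over example counts and tuple total.
decayed, m = apply_decay({"a": 8.0, "b": 2.0}, 100.0)
```

Note that because both the counts and the total m are scaled by the same λ, the empirical probabilities count/m of keys seen only in the current period are boosted relative to purely historical keys, which is exactly the recency bias the decay is meant to produce.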
As shown in the schematic diagram of algorithm 1 in fig. 2-2, algorithm 1 counts key frequencies: its input is the output data stream of the upstream instance and the decay period length T, and its output is the key frequency of the tuples in the data stream. When a tuple arrives in the input stream, the algorithm first looks up whether the tuple's key k is in the filter; if so, it increments k's count by 1. If k is not found and the filter has not reached its maximum capacity C, k is added to the filter and its counters are set to the initial values 1 and 0 (lines 6-11 of fig. 2-2). If the filter is full, the algorithm uses the d hash functions to map k to the corresponding position in each of the d rows of the sketch, adds 1 to each mapped count, and takes the minimum of the d mapped counts as k's count (lines 12-18 of fig. 2-2). When the occurrence count of a key in the sketch becomes greater than the minimum count in the filter, the two entries are exchanged (lines 19-24 of fig. 2-2). Finally, the total number of tuples is incremented by 1. According to the time decay model, when a time period ends, the time-decay calculation is performed on the counts in the key-value statistics and the timestamp is updated (lines 1-5 of fig. 2-2). The algorithm complexity is O(n), where n is the size of the filter.
S3, the grouper classifies the key values in the data stream according to their frequencies, where the classification result of a key value comprises a high-frequency key and a low-frequency key.
Optionally, step S3 may include the following steps S31-S33:
S31, setting a classification threshold θ and dynamically adjusting θ according to the downstream instance load.
In one possible implementation, the key-splitting technique can effectively mitigate the load imbalance caused by data skew, but it also incurs some overhead. Compared with key grouping, key splitting distributes the tuples of one key to more than one downstream candidate instance for processing, so each downstream candidate instance that processes the key must maintain a state for that key, resulting in additional memory overhead. Moreover, for stateful nodes, the partial states of these downstream candidate instances must be merged to obtain the complete state of the key, resulting in a certain aggregation overhead. Both overheads are therefore related to the size of the key split: the more keys selected for splitting and the more candidate instances allocated, the greater the overhead; conversely, the fewer keys and candidate instances selected, the smaller the overhead.
Let K be the key domain of the tuples sent by the upstream vertex to all downstream instances. For a key value k that is split, i.e. distributed to multiple instances, the number of candidate instances allocated to k is denoted V(k), with V(k) ≥ 2.
The memory overhead MEM for maintaining the state of the downstream vertices and the aggregation overhead AGG of key splitting are both positively related to the size of the key split, i.e. MEM ∝ Σ V(k) and AGG ∝ Σ V(k), where the sum ranges over the split keys. The overhead of key splitting is shown in formula (6).
Cost = MEM + AGG   (6)
Since the frequency classification algorithm identifies high-frequency key values by comparing frequency against the threshold, the threshold should be set as large as practical so that the key-splitting overhead stays small. However, the threshold size must first guarantee system balance. If the threshold is set too low, a large number of low-frequency key values are identified as high-frequency keys and the key-splitting overhead becomes excessive. If the threshold is set too high, no key is identified as a high-frequency key, the grouping algorithm degenerates into something similar to PKG, and load imbalance and low throughput result.
The key classification model classifies keys by their frequencies and allocates different numbers of candidate instances: the higher the frequency, the more candidate instances are allocated. Although distributing a high-frequency key to multiple instances increases memory overhead, when the threshold is set properly the overhead incurred is acceptable relative to the performance improvement.
Optionally, setting the classification threshold θ and dynamically adjusting it according to the downstream instance load may specifically include steps S311 to S312:
S311, set a classification threshold θ, initialized to 1/n, where n is the total number of downstream instances.
S312, dynamically adjust the classification threshold θ: if the load imbalance of the downstream instances is greater than the preset maximum load imbalance, decrease θ multiplicatively; if the load imbalance of the downstream instances is less than the preset minimum load imbalance, increase θ linearly.
S32, when the frequency of a key value in the data stream is greater than or equal to the classification threshold θ, the key value is determined to be a high-frequency key.
S33, when the frequency of a key value in the data stream is less than the classification threshold θ, the key value is determined to be a low-frequency key.
In one possible implementation, load-balancing studies show that balance is related to the number of downstream instances. When a low-frequency key is assigned to two candidate instances, the classification threshold θ lies between 1/(5n) and 2/n, where n is the number of downstream instances. The ideal average load of the system is 1/n; once a key's frequency exceeds this ideal load it necessarily causes load imbalance, so the classification threshold is initialized to 1/n. When a key's frequency exceeds θ, it is a high-frequency key and is distributed to all n downstream instances; when its frequency is below θ, it is a low-frequency key and is distributed to two downstream instances. The classification threshold is adjusted dynamically whenever the system's load imbalance exceeds the imbalance level the application deems acceptable.
The frequency classification algorithm takes as input the frequency of a key, the classification threshold θ, the number of downstream instances, and the load imbalance tolerance, and outputs the number of candidate instances for the key. Algorithm 2 is shown schematically in FIG. 3: when the key's frequency is greater than θ, the candidate count d is n, the total number of downstream instances; when the key's frequency is less than θ, the key gets 2 candidate instances (lines 1-7 of FIG. 3). At runtime the algorithm computes the system's imbalance from runtime information; when the imbalance exceeds the tolerance defined by the application, the algorithm decreases θ in large steps, aiming to restore balance as fast as possible, while increasing θ only in small steps. This is because some keys whose frequency already exceeded θ have been allocated more than 2 instances, and memory has already been spent maintaining their state, so the algorithm adjusts the instance counts of these keys only slightly (lines 8-14 of FIG. 3). That is, the algorithm increases the number of candidate instances quickly but decreases it very slowly.
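The classification and threshold-adjustment logic above can be sketched in Python. The halving factor and the linear step 1/(10n) are illustrative assumptions; the patent only specifies a multiplicative decrease, a linear increase, and the initial value 1/n:

```python
def candidate_count(freq, theta, n):
    """Number of candidate instances for a key: all n downstream
    instances for a high-frequency key, two for a low-frequency key."""
    return n if freq >= theta else 2

def adjust_threshold(theta, imbalance, max_imb, min_imb, n):
    """Multiplicative decrease / linear increase of the threshold theta."""
    if imbalance > max_imb:          # too unbalanced: classify more keys as high-frequency
        theta /= 2.0                 # fast multiplicative decrease (assumed factor)
    elif imbalance < min_imb:        # comfortably balanced: split fewer keys
        theta += 1.0 / (10 * n)      # slow linear increase (assumed step)
    return min(theta, 1.0 / n)       # never exceed the ideal average load 1/n
```

The asymmetry matches the text: the threshold drops quickly to restore balance, but rises only gradually because state for already-split keys has been paid for.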
S4, determining a candidate instance set by the group according to the classification result of the key values in the data stream.
Optionally, step S4 may specifically include the following steps S41 to S42:
S41, when the classification result of a key value in the data stream is a high-frequency key, the group determines all downstream instances as the candidate instance set of the key value.
S42, when the classification result of the key value in the data stream is a low-frequency key, the group maps out a candidate instance set of the key value in the downstream instance through two independent hash functions.
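Step S42's two-hash mapping can be sketched as follows. The hash construction (a salted MD5 digest) and the fallback to the next instance on a collision are illustrative assumptions, not the patent's exact functions:

```python
import hashlib

def _hash(key, seed, n):
    """One of two independent hash functions, built by salting a stable hash."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % n

def candidate_set(key, n, high_frequency):
    """All n downstream instances for a high-frequency key; otherwise the
    two instances chosen by two independent hash functions (PKG-style)."""
    if high_frequency:
        return list(range(n))
    i, j = _hash(key, 1, n), _hash(key, 2, n)
    # If both hashes collide, fall back to the next instance (assumption).
    return [i, j] if i != j else [i, (i + 1) % n]
```

Using a keyed cryptographic digest rather than Python's built-in `hash` keeps the mapping stable across processes, which matters when every upstream instance runs its own grouper.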
S5, the group determines a target instance in the candidate instance set according to the candidate instance allocation weight table, and allocates the tuples in the data stream to the target instance.
Optionally, S5 may include the following steps S51-S53:
S51, according to the instance processing rate and data input rate acquired by the monitor, the group estimates the average processing delay of each instance through a queuing model.
In a possible implementation, because the data flow fluctuates, the grouping algorithm needs to estimate the completion time of each downstream instance in real time and dynamically adjust the instance weights. In each fixed period, the instance weights sense the runtime state of each downstream instance and adapt accordingly. Delay awareness means computing the average delay of the downstream instances from a delay-estimation model, using the state information collected by the monitor: the network delay between upstream and downstream instances, and the processing rate and input rate of each downstream instance.
Message transmission and processing in the distributed system is shown in FIG. 4. In a worker node, each Worker process has an independent receiving thread that listens on the receiving port for messages arriving over the network and delivers them to the corresponding thread, and an independent sending thread that sends messages to other Worker processes over the network. Each thread has a receive queue and a send queue; the receive queue stores messages to be processed by this Worker or by other threads within the Worker, and the send queue stores messages destined for other threads. Thread 1 and thread 2 are within the same Worker process, so thread 1 sends messages to thread 2 using shared memory. Thread 1 and the threads of process 2 are on the same node, so thread 1 sends messages to the threads of process 2 using a queue. When thread 1 sends a message to an instance on node 2, the sending thread computes the target instance and sends the message to node 2 through a 1-hop route; likewise, messages reach node 3 through an N-hop route. There is therefore a large difference in network delay between instances at different locations, and the grouping mechanism should take network-delay optimization into account.
The downstream instance receives tuples from the upstream instance and places them in a message queue, which typically uses first-come first-served (FCFS) order. This preserves the time order of events, an important requirement in many data stream processing applications. Both instance processing and message queuing are modeled with the widely used M/M/1 queuing model, which makes computation time and latency estimable and captures the randomness of arrival times and service times.
Suppose instance v_i has tuple arrival rate λ_i and processing rate μ_i, and let ρ_i = λ_i/μ_i. When ρ_i ≥ 1, the instance cannot process tuples as fast as they arrive, the number of tuples in the instance's wait queue grows over time, and the queuing delay becomes unbounded. Conversely, when ρ_i < 1, tuples are processed faster than they arrive. The average number of tuples in the instance L_i and the average queue length L_{q,i} can be computed by formulas (7) and (8).
L_i = ρ_i / (1 − ρ_i)  (7)

L_{q,i} = ρ_i² / (1 − ρ_i)  (8)
According to the Erlang formula, the total sojourn time T of a tuple in the instance follows an exponential distribution with parameter μ_i − λ_i. The average processing delay of a tuple comprises its queuing time and its computation time, so the tuple's average processing delay in the instance W_i and its average queuing delay W_{q,i} are as in formulas (9) and (10).

W_i = 1 / (μ_i − λ_i)  (9)

W_{q,i} = ρ_i / (μ_i − λ_i)  (10)
Thus, the completion time of a tuple is related to the communication delay between the upstream and downstream instances and the processing delay of the downstream instance. A monitor is placed at each instance to measure its arrival rate and processing rate. Suppose the network delay between downstream instance v_j and upstream instance v_i is d_{ij}. The delay of a tuple from its emission upstream to the end of processing at the instance is estimated by formula (11):

T_{ij} = d_{ij} + 1 / (μ_j − λ_j)  (11)
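The delay estimate of formula (11) can be sketched as follows, under the standard M/M/1 assumption that the mean sojourn time in an instance is 1/(μ − λ):

```python
def avg_processing_delay(arrival_rate, service_rate):
    """M/M/1 mean sojourn time W = 1/(mu - lambda); requires rho < 1."""
    if arrival_rate >= service_rate:
        return float("inf")          # rho >= 1: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

def estimated_delay(net_delay, arrival_rate, service_rate):
    """Tuple delay from upstream emission to downstream completion:
    network delay plus the instance's average processing delay (eq. 11)."""
    return net_delay + avg_processing_delay(arrival_rate, service_rate)
```

Returning infinity in the overloaded case mirrors the text's observation that the queuing delay becomes unbounded when ρ ≥ 1, and naturally drives such an instance's weight to zero in the weighting step.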
s52, determining weights corresponding to the instances according to the average processing delay of the instances and the network delay between the upstream and downstream instances, further determining the weights corresponding to each instance, and generating a candidate instance distribution weight table.
In one possible implementation, in Es-Stream every upstream instance has an independent grouper and runs the grouping algorithm. Hence, network delays can create a local imbalance at each upstream instance, although a relative global balance is eventually reached across all upstream instances. To reduce the effect of local imbalance on global balance, the traffic between high-delay instance pairs is reduced as much as possible, bringing the local imbalance toward its lower limit. Accordingly, the weight of each downstream instance v_j is computed as in equation (12):

w_j = (1 / T_{ij}) / Σ_k (1 / T_{ik})  (12)
If a downstream instance's weight is set too large, its input rate rises and tuples accumulate in the instance; as the upstream instance observes the real-time input rate and processing rate, that instance's weight is decreased accordingly.
The delay-aware algorithm takes as input an upstream instance and the set of downstream instances, and outputs the downstream-instance weight table for that upstream instance. As the schematic of algorithm 3 in FIG. 5-1 shows, when the system starts running, the upstream instance obtains the communication times to all downstream instances and initializes the instance weights to the inverse of the network delay (lines 2-5 of FIG. 5-1). The upstream instance then periodically obtains the input rate and output rate of each downstream instance, estimates the instance's processing delay from the queuing model and the network delay, and uses the inverse of the delay to compute the weights (lines 6-12 of FIG. 5-1). At initialization and at each runtime adjustment, the algorithm normalizes the weights and sorts the weight table from large to small. The weight of an instance at one stage is adjusted using the processing rates and data input rates of the downstream instances of the previous stage, and a change in this stage's instance weights in turn changes the input rates of the downstream instances of the next stage. The weights are thus adjusted adaptively and dynamically so that the system reaches an optimal state.
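A minimal sketch of the weight-table construction described above, assuming weights are the normalized inverses of the estimated delays (network delay plus M/M/1 sojourn time), sorted from large to small:

```python
def weight_table(net_delays, arrival_rates, service_rates):
    """Per-instance weights proportional to the inverse of the estimated
    delay, normalized to sum to 1 and sorted from large to small.
    Returns a list of (normalized weight, instance id) pairs."""
    delays = [d + 1.0 / (mu - lam)
              for d, lam, mu in zip(net_delays, arrival_rates, service_rates)]
    inverse = [1.0 / t for t in delays]      # lower delay -> higher weight
    total = sum(inverse)
    return sorted(((w / total, i) for i, w in enumerate(inverse)),
                  reverse=True)
```

For example, with two instances at equal network delay but service rates 100 and 150 against an arrival rate of 50, the faster instance's delay is half the slower one's, so it receives twice the weight.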
S53, according to the generated candidate instance allocation weight table, determining a target instance in the candidate instance set, and allocating the tuples in the data stream to the target instance.
Alternatively, S53 may specifically include: computing the sum of the weights of all candidate instances in the candidate instance set, drawing a random number between 0 and that sum, traversing the candidate instance allocation weight table, determining a candidate instance as the target instance when the random number falls within that instance's weight interval, and allocating the tuple in the data stream to the target instance.
In one possible implementation, in a distributed stream computing system the grouper ultimately needs to select one target instance among the downstream instances for distribution. The algorithm uses weighted random selection: it first draws a random number and traverses the weight table; when the random number falls within the weight interval of some instance, that instance is the target instance for tuple distribution.
A specific case of weighted random distribution is shown in FIG. 5-2. First a random number r = 0.4 is drawn. The weight of v_j1 is 0.35; since r − 0.35 = 0.05 > 0, v_j1 is not selected. The weight of v_j2 is 0.25; now 0.05 − 0.25 < 0, so r falls within v_j2's interval and the target instance is v_j2.
The instance allocation algorithm takes as input the output data stream of an upstream instance and the set of downstream instances, and outputs an instance allocation scheme for each tuple. When a tuple arrives, the instance weight table for the current period is obtained according to algorithm 3, and the number of candidate instances for the tuple is then computed according to algorithms 1 and 2. When the number of instances d equals the total number of downstream instances n, the candidate instance set is all downstream instances; as the schematic of algorithm 4 in FIG. 5-3 shows, a number is drawn at random from 0 to 1, the instance weight table is traversed, and when the random number falls within a candidate instance's weight interval, that candidate instance is the final target instance (lines 1-13 of FIG. 5-3). When the number of instances d is 2, the candidate instances of the key among the downstream instances are computed with two independent random hash functions h_1(k) and h_2(k); a number is drawn at random from 0 to the sum of the two instances' weights, and when the random number falls within an instance's weight interval, that candidate instance is the final target instance (lines 14-25 of FIG. 5-3). Finally, the upstream instance distributes the tuple to the target instance.
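The weighted random selection common to both branches can be sketched as follows; the table layout of (weight, instance id) pairs is an assumed representation:

```python
import random

def pick_target(table, rng=random):
    """Weighted random choice: draw r in [0, sum of weights), walk the
    table, and select the instance whose weight interval contains r."""
    total = sum(w for w, _ in table)
    r = rng.uniform(0.0, total)
    for w, instance in table:
        if r < w:
            return instance
        r -= w
    return table[-1][1]              # guard against floating-point round-off
```

Passing the full weight table implements the d = n branch; passing only the two entries for h_1(k) and h_2(k) implements the d = 2 branch, since the draw is scaled to the sum of the supplied weights.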
When the system just begins to run, the instance weights are initialized by network delays. And acquiring running information of the downstream instance at the running time, and calculating the average processing delay of the downstream instance by using the data input rate and the instance processing rate through a queuing model. The instance weights are then jointly calculated based on the average processing delay and the network delay. The lower the delay, the higher the instance weight and the higher the probability that an instance is assigned a tuple. When the number of candidate instances is n, the candidate instance set is all downstream instances. When the candidate instance number is 2, the candidate instance set of the downstream instance is mapped with two independent hash functions. Then, according to the weight, random grouping is carried out in the candidate instance set, and the target instance of tuple distribution is selected. The instance weights are adaptively adjusted according to downstream instance information during running.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) Identifying the high-frequency keys in the data stream by counting key frequencies, and using key splitting to distribute the high-frequency keys to multiple parallel instances for joint processing, thereby solving the system bottleneck caused by high-frequency keys.
(2) The invention uses a monitor to obtain the instance processing rate and data input rate, estimates the average processing delay of tuples in an instance through a queuing model, and, by monitoring the network delay between instances, makes communication between components prefer nearby instances; the instance processing rate is combined with the network delay to jointly adjust the allocation of tuples in the data stream.
(3) The invention distributes the high-frequency key to all downstream examples, calculates the example weight through average processing delay and network delay, and the load among the examples is at a relatively balanced level after multiple rounds of feedback adjustment.
FIG. 6 is a block diagram of a real-time grouping system for inclined data flow, according to an exemplary embodiment, used to implement the real-time grouping method for inclined data flow; the system includes a group installed on the upstream instance and a monitor installed on the downstream instance; wherein,
the monitor periodically acquires the operation information of the system;
the group obtains the frequency of key values in the data stream when the upstream instance outputs the data stream; classifying the key values in the data stream according to the frequency of the key values in the data stream, wherein the classification result of the key values comprises a high-frequency key and a low-frequency key; determining a candidate instance set according to the classification result of the key values in the data stream; and determining a target instance in the candidate instance set according to a candidate instance allocation weight table, and allocating the tuple in the data stream to the target instance.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) Identifying the high-frequency keys in the data stream by counting key frequencies, and using key splitting to distribute the high-frequency keys to multiple parallel instances for joint processing, thereby solving the system bottleneck caused by high-frequency keys.
(2) The invention uses a monitor to obtain the instance processing rate and data input rate, estimates the average processing delay of tuples in an instance through a queuing model, and, by monitoring the network delay between instances, makes communication between components prefer nearby instances; the instance processing rate is combined with the network delay to jointly adjust the allocation of tuples in the data stream.
(3) The invention distributes the high-frequency key to all downstream examples, calculates the example weight through average processing delay and network delay, and the load among the examples is at a relatively balanced level after multiple rounds of feedback adjustment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. The real-time grouping method for the inclined data flow is characterized in that the method is realized by a real-time grouping system for the inclined data flow, and the real-time grouping system for the inclined data flow comprises a grouping device group installed on an upstream instance and a monitor installed on a downstream instance;
the method comprises the following steps:
the monitor periodically acquires the operation information of the system;
when an upstream instance outputs a data stream, the group acquires the frequency of a key value in the data stream;
classifying the key values in the data stream by the group according to the frequency of the key values in the data stream, wherein the classification result of the key values comprises a high-frequency key and a low-frequency key;
the group determines a candidate instance set according to the classification result of the key values in the data stream;
the group determines a target instance in the candidate instance set according to a candidate instance allocation weight table, and allocates the tuples in the data stream to the target instance;
wherein the operation information comprises an instance processing rate, a data input rate and network delay between upstream and downstream instances;
the group determining a target instance in the candidate instance set according to a candidate instance allocation weight table and allocating the tuples in the data stream to the target instance comprises the following steps:
according to the example processing rate and the data input rate acquired by the monitor, the group estimates the average processing delay of the example through a queuing model;
according to the average processing delay of the examples and the network delay between the upstream examples and the downstream examples, determining the weight corresponding to the examples, further determining the weight corresponding to each example, and generating a candidate example distribution weight table;
and according to the generated candidate instance allocation weight table, determining a target instance in the candidate instance set, and allocating the tuple in the data stream to the target instance.
2. The method of claim 1, wherein the obtaining, by the group, a frequency of key values in the data stream comprises:
based on a double-layer frequency statistical model, the group acquires the frequency of key values in the data stream;
the double-layer frequency statistical model consists of two layers, wherein the first layer is a filter consisting of counters and is used for storing the high-frequency keys and their counts in the data stream; the second layer is a sketch, which tracks the counts of the other key values in real time using the classical Count-Min Sketch (CMS) structure.
3. The method according to claim 1, wherein the group classifying the key values in the data stream according to the frequency of the key values in the data stream, the classification result comprising a high-frequency key and a low-frequency key, comprises:
setting a classification threshold θ, and dynamically adjusting the classification threshold θ according to the downstream instance load;
when the frequency of a key value in the data stream is greater than or equal to the classification threshold θ, determining the key value to be a high-frequency key;
when the frequency of a key value in the data stream is less than the classification threshold θ, determining the key value to be a low-frequency key.
4. A method according to claim 3, wherein the setting a classification threshold θ and dynamically adjusting the classification threshold θ according to the downstream instance load comprises:
setting a classification threshold θ, the classification threshold θ being initialized to 1/n, where n is the total number of downstream instances;
dynamically adjusting the classification threshold θ: if the load imbalance of the downstream instances is greater than the preset maximum load imbalance, decreasing the classification threshold θ multiplicatively; if the load imbalance of the downstream instances is less than the preset minimum load imbalance, increasing the classification threshold θ linearly.
5. The method according to claim 1, wherein the group determining a candidate instance set according to the classification result of the key values in the data stream comprises:
when the classification result of the key value in the data stream is a high-frequency key, the group determines all downstream examples as a candidate example set of the key value;
when the classification result of the key value in the data stream is a low-frequency key, the group maps out a candidate instance set of the key value in a downstream instance through two independent hash functions.
6. The method according to claim 1, wherein the group determining a target instance in the candidate instance set according to a candidate instance allocation weight table and allocating the tuples in the data stream to the target instance comprises:
calculating the sum of the weights of all candidate instances in the candidate instance set, drawing a random number between 0 and the sum of the weights, traversing the candidate instance allocation weight table, determining a candidate instance as the target instance when the random number falls within the weight interval of that candidate instance in the candidate instance set, and allocating the tuples in the data stream to the target instance.
7. The real-time grouping system for the inclined data flow is characterized by being used for realizing a real-time grouping method for the inclined data flow, and comprises a group installed on an upstream instance and a monitor installed on a downstream instance; wherein,
the monitor periodically acquires the operation information of the system;
the group obtains the frequency of key values in the data stream when the upstream instance outputs the data stream; classifying the key values in the data stream according to the frequency of the key values in the data stream, wherein the classification result of the key values comprises a high-frequency key and a low-frequency key; determining a candidate instance set according to the classification result of the key values in the data stream; according to a candidate instance allocation weight table, determining a target instance in the candidate instance set, and allocating the tuple in the data stream to the target instance;
wherein the operation information comprises an instance processing rate, a data input rate and network delay between upstream and downstream instances;
the group determining a target instance in the candidate instance set according to a candidate instance allocation weight table and allocating the tuples in the data stream to the target instance comprises the following steps:
according to the example processing rate and the data input rate acquired by the monitor, the group estimates the average processing delay of the example through a queuing model;
according to the average processing delay of the examples and the network delay between the upstream examples and the downstream examples, determining the weight corresponding to the examples, further determining the weight corresponding to each example, and generating a candidate example distribution weight table;
and according to the generated candidate instance allocation weight table, determining a target instance in the candidate instance set, and allocating the tuple in the data stream to the target instance.
8. The system of claim 7, wherein the group is configured to:
based on a double-layer frequency statistical model, the group acquires the frequency of key values in the data stream;
the double-layer frequency statistical model consists of two layers, wherein the first layer is a filter consisting of counters and is used for storing the high-frequency keys and their counts in the data stream; the second layer is a sketch, which tracks the counts of the other key values in real time using the classical Count-Min Sketch (CMS) structure.
9. The system of claim 7, wherein the group is configured to:
setting a classification threshold θ, and dynamically adjusting the classification threshold θ according to the downstream instance load;
when the frequency of a key value in the data stream is greater than or equal to the classification threshold θ, determining the key value to be a high-frequency key;
when the frequency of a key value in the data stream is less than the classification threshold θ, determining the key value to be a low-frequency key.
CN202310625541.2A 2023-05-30 2023-05-30 Real-time grouping method and system for inclined data flow Active CN116346827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310625541.2A CN116346827B (en) 2023-05-30 2023-05-30 Real-time grouping method and system for inclined data flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310625541.2A CN116346827B (en) 2023-05-30 2023-05-30 Real-time grouping method and system for inclined data flow

Publications (2)

Publication Number Publication Date
CN116346827A CN116346827A (en) 2023-06-27
CN116346827B true CN116346827B (en) 2023-08-11

Family

ID=86876346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310625541.2A Active CN116346827B (en) 2023-05-30 2023-05-30 Real-time grouping method and system for inclined data flow

Country Status (1)

Country Link
CN (1) CN116346827B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783644A (en) * 2020-12-31 2021-05-11 湖南大学 Distributed inclined stream processing method and system based on high-frequency key value counting
CN114816715A (en) * 2022-05-20 2022-07-29 中国地质大学(北京) Cross-region-oriented flow calculation delay optimization method and device
CN115203935A (en) * 2022-07-12 2022-10-18 厦门大学 Frequency selection surface structure topology inverse prediction method and device based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202092B (en) * 2015-05-04 2020-03-06 阿里巴巴集团控股有限公司 Data processing method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783644A (en) * 2020-12-31 2021-05-11 湖南大学 Distributed inclined stream processing method and system based on high-frequency key value counting
CN114816715A (en) * 2022-05-20 2022-07-29 中国地质大学(北京) Cross-region-oriented flow calculation delay optimization method and device
CN115203935A (en) * 2022-07-12 2022-10-18 厦门大学 Frequency selection surface structure topology inverse prediction method and device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Wenjuan; Chen Hong; Wu Yuncheng; Zhao Dan; Li Cuiping. Differential privacy protection under continuous monitoring. Journal of Software, 2020, (06), full text. *

Also Published As

Publication number Publication date
CN116346827A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
US11888756B2 (en) Software load balancer to maximize utilization
KR100586283B1 (en) Dynamic thread pool tuning techniques
CN106790726B (en) Priority queue dynamic feedback load balancing resource scheduling method based on Docker cloud platform
KR100429904B1 (en) Router providing differentiated quality-of-service and fast internet protocol packet classification method for the same
CN110365765B (en) Bandwidth scheduling method and device of cache server
US20230093389A1 (en) Service request allocation method and apparatus, computer device, and storage medium
US7610425B2 (en) Approach for managing interrupt load distribution
CN106462460B (en) Dimension-based load balancing
US7243351B2 (en) System and method for task scheduling based upon the classification value and probability
US8898295B2 (en) Achieving endpoint isolation by fairly sharing bandwidth
US20180247265A1 (en) Task grouping method and apparatus, electronic device, and computer storage medium
Li et al. Low-complexity multi-resource packet scheduling for network function virtualization
US20050055694A1 (en) Dynamic load balancing resource allocation
WO2021012663A1 (en) Access log processing method and device
US20140143300A1 (en) Method and Apparatus for Controlling Utilization in a Horizontally Scaled Software Application
Ding et al. Optimal operator state migration for elastic data stream processing
US9609054B2 (en) Load balancing scalable storage utilizing optimization modules
CN111078391A (en) Service request processing method, device and equipment
CN116346827B (en) Real-time grouping method and system for inclined data flow
CN108200185B (en) Method and device for realizing load balance
CN112685167A (en) Resource using method, electronic device and computer program product
Martin et al. Predicting energy consumption with streammine3g
Wang et al. Model-based scheduling for stream processing systems
CN115174583A (en) Server load balancing method based on programmable data plane
CN116489099B (en) Self-adaptive load balancing scheduling method and system based on flow classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant