CN107948007B

CN107948007B - Long flow identification method based on sampling and two-stage CBF

Info

Publication number: CN107948007B
Application number: CN201710934979.3A
Authority: CN
Inventors: 秦文虎; 翟金凤; 孙立博; 鲁凯; 林学勇
Original assignee: Southeast University; Nanjing Institute of Measurement and Testing Technology
Current assignee: Southeast University; Nanjing Institute of Measurement and Testing Technology
Priority date: 2017-10-10
Filing date: 2017-10-10
Publication date: 2021-09-10
Anticipated expiration: 2037-10-10
Also published as: CN107948007A

Abstract

The invention provides a long flow identification method based on sampling and a two-stage Counting Bloom Filter (CBF), comprising the following steps: periodically sampling messages; setting a long flow threshold and configuring the structural parameters of the two-stage CBF; for each sampled message, determining through the second-stage CBF whether the message belongs to an already identified long flow and, if so, inserting it into the second-stage CBF; if not, determining through the first-stage CBF whether the flow to which the message belongs is a long flow and, if so, recording the flow identifier of the message and updating its record in the two-stage CBF, otherwise inserting the message into the first-stage CBF; repeating this process until all sampled messages are processed; and then querying every non-sampled message against the second-stage CBF, inserting it if it belongs to an identified long flow and otherwise leaving it unprocessed. The invention achieves accurate identification of long flows and high-precision measurement of flow length while saving space and time resources.

Description

Long flow identification method based on sampling and two-stage CBF

Technical Field

The invention belongs to the technical field of network flow measurement, relates to a long flow identification method, and particularly relates to a long flow identification method based on sampling and two-stage Counting Bloom Filter.

Background

The increasing speed of high-speed networks and the rapid growth of traffic data make accurate measurement of network traffic more and more difficult. Many studies show that network traffic statistics follow a strongly heavy-tailed distribution: a small number of long flows account for most of the network traffic, so knowing the long flow information is enough to meet the requirements of most practical applications. Identifying long flows is therefore particularly important.

Existing long flow identification methods mainly rely on sampling, hashing, and Bloom Filter techniques. When sampling is used alone, flow identifier information must be maintained throughout the identification process, which incurs a large computational overhead and reduces the processing speed of the system. When hashing or a Bloom Filter is used alone to process every message passing through a link, hash collisions increase and the accuracy of the measurement result suffers. Combining sampling with hashing or with a Bloom Filter effectively overcomes the drawbacks of using a single technique. Compared with plain hashing, a Bloom Filter maintains several independent hash functions, which markedly reduces hash collisions and greatly reduces the storage overhead of maintaining a flow identifier for every flow. One of its improved structures, the Counting Bloom Filter, counts the messages hashed into each storage position and allows the flow identifier of a long flow to be recorded once its message count exceeds the threshold. Combining sampling with the Counting Bloom Filter therefore identifies long flows more efficiently.
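For readers unfamiliar with the structure, a Counting Bloom Filter can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name, the use of salted SHA-1 to derive k hash functions, and the example flow identifier are all assumptions made for the sketch.

```python
import hashlib


class CountingBloomFilter:
    """Array of counters indexed by k hash functions (illustrative sketch)."""

    def __init__(self, size, k):
        self.size = size              # number of counters
        self.k = k                    # number of hash functions
        self.counters = [0] * size

    def _positions(self, key):
        # Assumption: k hash functions are derived by salting SHA-1 with an index.
        # A set is returned so that a counter is never updated twice for one key.
        return {
            int.from_bytes(hashlib.sha1(f"{i}:{key}".encode()).digest()[:8], "big")
            % self.size
            for i in range(self.k)
        }

    def insert(self, key):
        # Count one more message for this key.
        for pos in self._positions(key):
            self.counters[pos] += 1

    def count(self, key):
        # The minimum of the key's counters never underestimates its true count.
        return min(self.counters[pos] for pos in self._positions(key))


cbf = CountingBloomFilter(size=1024, k=3)
for _ in range(5):
    cbf.insert("10.0.0.1->10.0.0.2")   # hypothetical flow identifier
```

Collisions with other keys can only raise a counter, which is why the minimum over the k counters is used as the count.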

Existing long flow identification methods based on sampling and the Counting Bloom Filter (CBF) generally estimate the number of messages in the original long flow by simple linear estimation, which introduces a flow length measurement error and cannot meet higher precision requirements.
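The error of linear estimation can be illustrated with a small sketch. It assumes, purely for illustration, that a flow occupies consecutive positions on the link and that the estimate is the sampled count multiplied by the sampling interval n; the function names and the flow length 21037 are hypothetical.

```python
def sampled_count(flow_length, n, offset):
    """Messages of a flow hit by 1-in-n periodic sampling.

    Assumption: the flow occupies consecutive link positions starting at
    `offset`, and positions divisible by n are the sampled ones.
    """
    return sum(1 for i in range(offset, offset + flow_length) if i % n == 0)


def linear_estimate(count, n):
    """Scale the sampled count back up by the sampling interval."""
    return count * n


true_length = 21037          # hypothetical flow length
estimates = [
    linear_estimate(sampled_count(true_length, n=100, offset=offset), 100)
    for offset in (0, 63)
]
```

Depending on where the flow starts relative to the sampling period, the estimate lands on a multiple of n on either side of the true length, so it cannot recover the exact value.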

Disclosure of Invention

To solve the above problems, the invention provides a long flow identification method based on sampling and a two-stage Counting Bloom Filter, which, on the basis of message sampling, identifies the messages belonging to long flows through the two-stage Counting Bloom Filter.

In order to achieve the purpose, the invention provides the following technical scheme:

the long flow identification method based on sampling and two-stage CBF comprises the following steps:

step 1, periodically sampling messages passing through a link in observation time according to sampling frequency;

step 2, setting a threshold value T of the long flow, and configuring two-stage Counting Bloom Filter structure parameters;

step 3, determining through the second-stage Counting Bloom Filter whether each sampled message belongs to an already identified long flow; if so, inserting the message into the second-stage Counting Bloom Filter and continuing with the next message; if not, executing step 4;

step 4, determining through the first-stage Counting Bloom Filter whether the flow to which the message belongs is a long flow; if so, recording the flow identifier of the message, updating the record of the message in the two-stage Counting Bloom Filter, and continuing with the next message; if not, executing step 5;

step 5, inserting the message into the first-stage Counting Bloom Filter and continuing with the next message;

step 6, after steps 3-5 have been repeated until all sampled messages are processed, querying every non-sampled message against the second-stage Counting Bloom Filter; if the message belongs to an identified long flow, inserting it into the second-stage Counting Bloom Filter; otherwise performing no processing.

Further, the sampling frequency in step 1 is one message extracted every n messages.

Furthermore, the sampling frequency is reduced when the total number of messages is large and increased when the total number of messages is small.

Further, the step 2 specifically includes the following steps:

the long flow threshold is set to T = N·m%, where N is the total number of messages passing through the link in the observation time and m% is the percentage of the total number of messages occupied by a long flow; the threshold for long flow identification using sampled messages is set to T₁ = T/n; the two stages of the Counting Bloom Filter use the same k low-collision hash functions h(1), h(2), …, h(k); the length m₁ of the counter array in the first-stage Counting Bloom Filter structure is set to a power of 2 greater than the total number of sampled messages N/n, and the number of bits b₁ allocated to each counter satisfies the condition:

the length m₂ of the counter array in the second-stage Counting Bloom Filter structure is set to a power of 2 greater than the total number N of messages, and the number of bits b₂ allocated to each counter satisfies the condition:

further, each counter is allocated more bits than the number required to satisfy the condition.

Further, the step 3 specifically includes the following steps:

each sampled message is mapped to the corresponding positions of the second-stage Counting Bloom Filter through the k hash functions; if the k counter values at the corresponding positions are all non-zero, the message is judged to belong to an identified long flow, the message is inserted into the second-stage Counting Bloom Filter, and processing continues with the next message; if any one of the k counter values at the corresponding positions is 0, the message is judged not to belong to an identified long flow, and step 4 is executed.

Further, the step 4 specifically includes the following steps:

the sampled message is mapped to the first-stage Counting Bloom Filter through the k hash functions, and the minimum value of the k counters at the corresponding positions is found; if the minimum value of the k counters is equal to the threshold T₁, the flow to which the message belongs is judged to be a long flow, the flow identifier of the message is recorded, the threshold T₁ is subtracted from each of the k counter values, the message is mapped to the second-stage Counting Bloom Filter, the k counter values at the corresponding positions there are set to T₁+1, and processing continues with the next message; if the minimum value of the k counters is not equal to the threshold T₁, the flow is judged not to be a long flow, and step 5 is executed.

Further, the process of inserting the message into the first-stage Counting Bloom Filter in step 5 comprises: adding 1 to each of the k counter values in the first-stage Counting Bloom Filter.

Further, the process of inserting into the second-stage Counting Bloom Filter in steps 3 and 6 comprises: adding 1 to each of the k counter values in the second-stage Counting Bloom Filter.

Further, the method also comprises step 7: after all non-sampled messages are processed, the recorded flow identifiers are mapped to the second-stage Counting Bloom Filter.

Compared with the prior art, the invention has the following advantages and beneficial effects:

the invention can not only realize accurate identification of the long stream, but also realize high-precision measurement of the original stream length on the basis of effectively saving space and time resources. The invention has good real-time performance, can be well adapted to the current high-speed network link environment, and has great significance for network management application such as network charging, bandwidth planning, safety detection and the like.

Drawings

FIG. 1 is a flow chart of the method steps of the present invention, wherein after all messages sampled in (i) are processed, non-sampled messages in (ii) are processed.

Fig. 2 shows specific long stream information in the implementation data.

Fig. 3 shows simulation results based on implementation data.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.

The whole flow of the long flow identification method based on sampling and two-stage Counting Bloom Filter provided by the invention is shown in figure 1, and the method comprises the following steps:

Step 1, periodically sampling the messages passing through the link in the observation time at a frequency of one message extracted every n messages. When the total number of messages is large, a relatively low sampling frequency can be chosen to improve the processing speed of the method, for example one message extracted every 100 messages; when the total number of messages is small, a relatively high sampling frequency can be chosen to ensure the accuracy of long flow identification, for example one message extracted every 10 messages.
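The periodic sampling of step 1 can be sketched as a split of the message sequence into sampled and non-sampled messages, since both groups are processed later. Which message within each group of n is extracted is not specified in the text; taking the first is an assumption of this sketch, as is the function name.

```python
def split_periodic(messages, n):
    """Return (sampled, non_sampled) for 1-in-n periodic sampling."""
    sampled, non_sampled = [], []
    for index, message in enumerate(messages):
        # Assumption: the first message of each group of n is the one extracted.
        if index % n == 0:
            sampled.append(message)
        else:
            non_sampled.append(message)
    return sampled, non_sampled


# 1000 placeholder messages sampled at one message every 100.
sampled, rest = split_periodic(list(range(1000)), n=100)
```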

Step 2, setting a threshold T of the long flow, and reasonably configuring two-stage Counting Bloom Filter structural parameters:

Let N be the total number of messages passing through the link in the observation time. If a flow occupying more than m% of the total number of messages is defined as a long flow, the threshold is set to T = N·m%, and the threshold for identifying long flows from sampled messages is set to T₁ = T/n. The two stages of the Counting Bloom Filter use the same k low-collision hash functions h(1), h(2), …, h(k), where k is 1 to 3 (the concept of, and the criteria for, a low-collision hash function are known to those skilled in the art). The length m₁ of the counter array in the first-stage Counting Bloom Filter structure is set to a power of 2 greater than the total number of sampled messages N/n, and the number of bits b₁ allocated to each counter must satisfy the condition:

In addition, a few extra bits should be allocated as appropriate to avoid counter overflow. The length m₂ of the counter array in the second-stage Counting Bloom Filter structure is set to a power of 2 greater than the total number N of messages, and the number of bits b₂ allocated to each counter must satisfy the condition:

Again, a few extra bits are allocated as appropriate to avoid counter overflow.
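The threshold and array-length rules of step 2 can be sketched as below. The function names and the example values are hypothetical, the smallest qualifying power of 2 is chosen as an assumption, and the per-counter bit widths b₁ and b₂ are left out because their conditions are not reproduced in this text.

```python
def next_power_of_two_above(x):
    """Smallest power of 2 strictly greater than x."""
    p = 1
    while p <= x:
        p *= 2
    return p


def configure(total_messages, long_flow_percent, n):
    """Return (T, T1, m1, m2) following the step 2 rules."""
    T = total_messages * long_flow_percent / 100        # T = N * m%
    T1 = T / n                                          # T1 = T / n
    m1 = next_power_of_two_above(total_messages / n)    # first-stage array length
    m2 = next_power_of_two_above(total_messages)        # second-stage array length
    return T, T1, m1, m2


# Hypothetical setting: 1,000,000 messages, long flow = 1% of traffic, 1-in-100 sampling.
T, T1, m1, m2 = configure(total_messages=1_000_000, long_flow_percent=1, n=100)
```

Using powers of 2 for the array lengths lets a hash value be reduced to an index with a bit mask instead of a division, which is a common reason for this kind of sizing.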

Step 3, mapping each sampled message to the corresponding positions of the second-stage Counting Bloom Filter through the k hash functions; if the k counter values at the corresponding positions are all non-zero, the message is judged to belong to an identified long flow, the message is inserted into the second-stage Counting Bloom Filter, that is, 1 is added to each of the k counter values, and processing continues with the next message; if any one of the k counter values at the corresponding positions is 0, the message is judged not to belong to an identified long flow, and step 4 is executed.

Step 4, mapping the sampled message to the first-stage Counting Bloom Filter through the k hash functions and finding the minimum value of the k counters at the corresponding positions; if the minimum value of the k counters is equal to the threshold T₁, the flow to which the message belongs is judged to be a long flow, the flow identifier of the message is recorded, the threshold T₁ is subtracted from each of the k counter values, the message is mapped to the second-stage Counting Bloom Filter, the k counter values at the corresponding positions there are set to T₁+1, and processing continues with the next message; if the minimum value of the k counters is not equal to the threshold T₁, the flow is judged not to be a long flow, and step 5 is executed.

Step 5, inserting the message into the first-stage Counting Bloom Filter, that is, adding 1 to each of the k counter values, and continuing with the next message.
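The handling of one sampled message in steps 3 to 5 can be sketched as a single function. This is an illustration under stated assumptions, not the patented implementation: the two stages are plain Python lists of counters, the helper `positions` is hypothetical, and the k hash functions are derived by salting SHA-1 with an index.

```python
import hashlib


def positions(flow_id, k, size):
    """Counter positions of a flow identifier (salted SHA-1, an assumption)."""
    return {
        int.from_bytes(hashlib.sha1(f"{i}:{flow_id}".encode()).digest()[:8], "big")
        % size
        for i in range(k)
    }


def process_sampled(flow_id, cbf1, cbf2, k, t1, long_flows):
    """Apply steps 3-5 to one sampled message; cbf1 and cbf2 are counter lists."""
    pos2 = positions(flow_id, k, len(cbf2))
    # Step 3: all second-stage counters non-zero means an identified long flow.
    if all(cbf2[p] != 0 for p in pos2):
        for p in pos2:
            cbf2[p] += 1
        return
    pos1 = positions(flow_id, k, len(cbf1))
    # Step 4: the minimum first-stage counter equals T1, so promote the flow.
    if min(cbf1[p] for p in pos1) == t1:
        long_flows.append(flow_id)      # record the flow identifier
        for p in pos1:
            cbf1[p] -= t1               # remove the flow's share from stage one
        for p in pos2:
            cbf2[p] = t1 + 1            # T1 earlier messages plus this one
        return
    # Step 5: otherwise count the message in the first stage.
    for p in pos1:
        cbf1[p] += 1


cbf1, cbf2, long_flows = [0] * 64, [0] * 128, []
for _ in range(5):                      # five sampled messages of one flow
    process_sampled("A", cbf1, cbf2, k=2, t1=3, long_flows=long_flows)
```

In this run the first three messages are counted in the first stage, the fourth triggers promotion, and the fifth is counted directly in the second stage.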

Step 6, after steps 3-5 have been repeated until all sampled messages are processed, querying every non-sampled message against the second-stage Counting Bloom Filter; if the k counter values at the positions to which the message maps in the second-stage Counting Bloom Filter are all non-zero, the message is judged to belong to an identified long flow and is inserted into the second-stage Counting Bloom Filter, that is, 1 is added to each of the k counter values at the corresponding positions; otherwise, no processing is performed.

As an improvement, the method further comprises step 7: after all non-sampled messages are processed, the recorded flow identifiers are mapped into the second-stage Counting Bloom Filter. The flow identifiers recorded in step 4 are the identifiers of the long flows identified by the method of the present invention, and the minimum counter value at the corresponding positions in the second-stage Counting Bloom Filter is the flow length of the long flow as measured by the method.
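The non-sampled pass of step 6 and the length read-out of step 7 can be sketched as follows. As before, this is only an illustration: the second stage is a plain list of counters, the helper names are hypothetical, and salted SHA-1 stands in for the k hash functions.

```python
import hashlib


def positions(flow_id, k, size):
    """Counter positions of a flow identifier (salted SHA-1, an assumption)."""
    return {
        int.from_bytes(hashlib.sha1(f"{i}:{flow_id}".encode()).digest()[:8], "big")
        % size
        for i in range(k)
    }


def process_non_sampled(flow_id, cbf2, k):
    """Step 6: count the message only if its flow is an identified long flow."""
    pos = positions(flow_id, k, len(cbf2))
    if all(cbf2[p] != 0 for p in pos):
        for p in pos:
            cbf2[p] += 1


def measured_length(flow_id, cbf2, k):
    """Step 7: the minimum second-stage counter is the measured flow length."""
    return min(cbf2[p] for p in positions(flow_id, k, len(cbf2)))


# Flow "A" is assumed to have been promoted with T1 = 3, so its counters hold 4.
cbf2 = [0] * 128
for p in positions("A", 2, len(cbf2)):
    cbf2[p] = 4
for _ in range(10):                     # ten non-sampled messages of flow "A"
    process_non_sampled("A", cbf2, k=2)

# A flow that was never promoted leaves an empty second stage untouched.
empty = [0] * 128
process_non_sampled("B", empty, k=2)
```

Because every non-sampled message of an identified long flow is counted individually, the second-stage counter accumulates the actual number of messages rather than a scaled-up estimate.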

For simulation analysis, the invention uses real trace data collected in Chicago on March 17, 2016 and made publicly available by the Cooperative Association for Internet Data Analysis (CAIDA); the method is implemented in Visual Studio. The first 5,000,000 messages in the trace are used for the experiment and the threshold T is set to 21,000; the number of real long flows whose message count exceeds the threshold is 3, and Fig. 2 shows the specific long flow information. A flow in the experiment is the set of messages with the same source and destination IP addresses; the specific definition of the flow identifier can be chosen according to the actual application requirements of the network. With the sampling frequency set to 1/100, the SHA-1 algorithm used as the hash function of the two-stage Counting Bloom Filter, and the number of hash functions set to 1, the simulation result of the invention is shown in Fig. 3. Comparing Fig. 2 and Fig. 3 shows that the long flow information identified by the invention is identical to the real long flow information, so the method achieves accurate identification of long flows and high-precision measurement of the original flow length.
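As a worked check, the step 2 rules can be applied to these experimental settings (N = 5,000,000, T = 21,000, n = 100). The text does not state the array lengths actually used in the experiment, so the values of m₁ and m₂ below are only what the rules would give if the smallest qualifying power of 2 is chosen, which is an assumption.

```latex
\begin{aligned}
m\% &= T / N = 21000 / 5000000 = 0.42\% \\
T_1 &= T / n = 21000 / 100 = 210 \\
N / n &= 5000000 / 100 = 50000 \;\Rightarrow\; m_1 = 2^{16} = 65536 \\
N &= 5000000 \;\Rightarrow\; m_2 = 2^{23} = 8388608
\end{aligned}
```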

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. The long stream identification method based on sampling and two-stage CBF is characterized by comprising the following steps:

the step 2 specifically comprises the following processes:

the length m₂ of the counter array in the second-stage Counting Bloom Filter structure is set to a power of 2 greater than the total number N of messages, and the number of bits b₂ allocated to each counter satisfies the condition:

the step 3 specifically comprises the following steps:

each sampled message is mapped to the corresponding positions of the second-stage Counting Bloom Filter through the k hash functions; if the k counter values at the corresponding positions are all non-zero, the message is judged to belong to an identified long flow, the message is inserted into the second-stage Counting Bloom Filter, and processing continues with the next message; if any one of the k counter values at the corresponding positions is 0, the message is judged not to belong to an identified long flow, and step 4 is executed;

the step 4 specifically comprises the following steps:

the sampled message is mapped to the first-stage Counting Bloom Filter through the k hash functions, and the minimum value of the k counters at the corresponding positions is found; if the minimum value of the k counters is equal to the threshold T₁, the flow to which the message belongs is judged to be a long flow, the flow identifier of the message is recorded, the threshold T₁ is subtracted from each of the k counter values, the message is mapped to the second-stage Counting Bloom Filter, the k counter values at the corresponding positions there are set to T₁+1, and processing continues with the next message; if the minimum value of the k counters is not equal to the threshold T₁, the flow is judged not to be a long flow, and step 5 is executed;

2. The long flow identification method based on sampling and two-stage CBF according to claim 1, wherein the sampling frequency in step 1 is one message extracted every n messages.

3. The long flow identification method based on sampling and two-stage CBF according to claim 1, wherein the sampling frequency is reduced when the total number of messages is large and increased when the total number of messages is small.

4. The long flow identification method based on sampling and two-stage CBF according to claim 1, wherein each counter is allocated more bits than the number required to satisfy the condition.

5. The long flow identification method based on sampling and two-stage CBF according to claim 1, wherein the process of inserting the message into the first-stage Counting Bloom Filter in step 5 comprises: adding 1 to each of the k counter values in the first-stage Counting Bloom Filter.

6. The long flow identification method based on sampling and two-stage CBF according to claim 1, wherein the process of inserting into the second-stage Counting Bloom Filter in step 3 and step 6 comprises: adding 1 to each of the k counter values in the second-stage Counting Bloom Filter.

7. The long flow identification method based on sampling and two-stage CBF according to claim 1, further comprising step 7: after all non-sampled messages are processed, mapping the recorded flow identifiers to the second-stage Counting Bloom Filter.