CN117201368A - Network flow base number estimation method and system based on pre-sampling - Google Patents


Info

Publication number
CN117201368A
Authority
CN
China
Prior art keywords
sampling
stream
base
register
radix
Legal status
Pending
Application number
CN202311081497.XA
Other languages
Chinese (zh)
Inventor
黄河
孙玉娥
杜扬
梁嘉琛
宋邦奥
冯敏远
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University
Priority to CN202311081497.XA
Publication of CN117201368A

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a pre-sampling-based method and system for estimating network flow cardinality (base number). The method comprises: performing packet-level sampling on data packets using a counter and a preset threshold; updating registers for the sampled data packets; and processing the flow information stored in the registers with a cardinality estimation algorithm to obtain the target flow cardinality information, which is sent to the client. The application introduces a pre-packet-level sampling technique, PPS, which improves on existing element-level sampling by deciding whether to sample using only a counter and a preset threshold. Compared with common element-hash sampling, this greatly reduces computational complexity and effectively improves throughput. Because the pre-packet-level sampling decision needs no information from the data packet, the time for extracting the element label or flow label is also saved. In addition, the application corrects the original cardinality estimate by means of probability analysis.

Description

Network flow base number estimation method and system based on pre-sampling
Technical Field
The present application relates to the field of network measurement technologies, and in particular, to a method and a system for estimating a network flow base number based on pre-sampling.
Background
With the advent of the era of digitization and intelligence, new network applications keep emerging, and the number of networked devices and the scale of network traffic are growing explosively. To meet the opportunities and challenges of rapidly growing data communication demand, network operators and cloud service companies have in recent years pushed forward the virtualization and fine-grained management of network infrastructure, for which network traffic measurement is a fundamental function that captures network performance and supports network management. Within high-speed traffic measurement, measuring network flow cardinality plays a very important role in current network management and provides the analytical basis for important network functions such as scan attack detection, performance diagnosis, and anomaly detection. It is therefore necessary to measure the cardinality of millions of flows or more accurately and at high speed, given the sharp increase in the number of network devices and in link rates.
Current mainstream practice in the industry falls into two kinds of schemes: sketch-based schemes and sampling-based schemes.
A sketch-based scheme uses data compression to store as much information as possible in a space far smaller than the original traffic data, so that it fits the limited on-chip storage while still providing good accuracy guarantees. It incurs only constant time overhead for the query and update operations on each packet, and its measurement accuracy comes with a probabilistic guarantee. However, it only addresses the shortage of memory resources in cardinality measurement and does little to improve throughput. Typical examples are the virtual Bitmap algorithm proposed by Li et al. and the virtual HyperLogLog algorithm proposed by Xiao et al.
A sampling-based scheme collects and processes only a representative subset of the network flow data, and the on-chip sampling module sends the sampled flow information off-chip for recording. Throughput can be improved by using a lower sampling probability, but trading off sampling probability against estimation accuracy is a serious problem: a smaller sampling probability severely reduces the final measurement accuracy, while a larger one consumes a large amount of storage and communication resources.
There are also approaches that combine the sketch technique with element-level sampling, but these raise new problems. On the one hand, although element-level sampling can use a lower sampling rate to improve data throughput, it causes accuracy loss, and the expected estimate is hard to obtain directly from existing sketch techniques. On the other hand, most existing element-level sampling methods need to generate random numbers in real time or hash the elements. These sampling operations all require complex computation; since hardware computing speed is hard to increase, they consume a great deal of additional time and greatly reduce the processing speed of the algorithm.
No effective solution has yet been proposed for the problem of cardinality measurement in high-speed network environments in the prior art.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present application provides a pre-sampling-based method and system for estimating network flow cardinality. On the premise of guaranteeing a certain accuracy, it reduces the computation and storage costs of the detection system and minimizes the system's impact on the network. It aims to guarantee the real-time performance and validity of the sampled traffic, to improve the accuracy of the algorithm and hence of the cardinality estimate, to increase system speed while reducing storage cost, and to improve the throughput of flow measurement. The application also aims to ensure the scalability, compatibility, and usability of the system.
The application mainly comprises the following aspects:
in a first aspect, the present application provides a network flow base estimation method based on pre-sampling, including:
carrying out packet level sampling on the data packet through a counter and a preset threshold value;
register updating is carried out on the sampled data packet;
and processing the stream information stored in the register through a base estimation algorithm to obtain target stream base information and sending the target stream base information to the client.
In one embodiment of the present application, performing packet level sampling on an initial data packet by using a counter and a preset threshold value includes:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.
In one embodiment of the present application, performing register update on the sampled data packet includes:
parsing the sampled data packet and extracting the flow label and the element label of the data packet;
rehashing the flow label and the element label using a hash function;
locating a register according to a first target value in the element label;
and updating the register according to a second target value in the element label.
In one embodiment of the application, the radix estimation algorithm includes an original stream radix calculation and a correction coefficient calculation.
In one embodiment of the present application, the calculation formula of the original stream radix in the original stream radix calculation is:
$$\hat{n}_s = \frac{\alpha_m m^2}{\sum_{j=1}^{m} 2^{-R_f[j]}}$$
where $\hat{n}_s$ is the original flow cardinality estimate, m is the number of virtual registers used per flow, j is the index of each virtual register in the virtual register set, $\alpha_m$ is the estimation bias-correction factor when m virtual registers are used per flow, and $R_f$ is the virtual register set of flow f.
In one embodiment of the present application, in the calculation of the original flow radix, when a small flow part is identified, the small flow part needs to be corrected, and a specific calculation formula is as follows:
$$\hat{n}_s = m \ln\frac{m}{z}$$
where $\hat{n}_s$ is the cardinality estimate for a small flow, and z is the number of registers whose value is 0.
In one embodiment of the present application, denoising is performed based on radix estimation results in the original flow radix calculation, and a specific formula is:
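$$\hat{n}_f = \frac{N m}{N - m}\left(\frac{\hat{n}_s}{m} - \frac{\hat{n}}{N}\right)$$
where $\hat{n}_f$ is the denoised estimate for flow f, $\hat{n}_s$ is the estimate obtained from the virtual registers of flow f, $\hat{n}$ is the sum of all estimated cardinalities, m is the number of virtual registers used per flow, and N is the number of physical registers in the measurement process.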
In one embodiment of the present application, the correction factor calculation includes: packet-level sample large-stream correction and packet-level sample small-stream correction.
In a second aspect, an embodiment of the present application further provides a network flow base estimation system based on pre-sampling, including:
the sampling module is used for carrying out packet level sampling on the data packet through the counter and a preset threshold value;
the updating module is used for updating the register of the sampled data packet;
and the result processing module is used for processing the stream information stored in the register through a base number estimation algorithm to obtain target stream base number information and sending the target stream base number information to the client.
In one embodiment of the present application, the sampling module is further configured to:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.
Compared with the prior art, the technical scheme of the application has the following advantages:
The application discloses a pre-sampling-based network flow cardinality estimation method and system that introduce a pre-packet-level sampling technique, PPS (Preposition Packet-level Sampling). PPS improves on existing element-level sampling by deciding whether to sample using only one counter and a preset threshold. Compared with common element-hash sampling, this greatly reduces computational complexity and effectively improves throughput. Because the pre-packet-level sampling decision needs no information from the data packet, the time for extracting the element label or flow label is saved. In addition, the application corrects the original cardinality estimate by means of probability analysis. Experimental results on real-world data sets show that the proposed PPS algorithm greatly improves the throughput of flow measurement while preserving the accuracy of cardinality estimation.
To address the mismatch between packet processing speed and high-speed network traffic, the application improves the existing element-level sampling technique, reducing the average time of sampling processing and improving the throughput of the algorithm.
To address the loss of data characteristics caused by sampling, the application analyzes the frequency distribution of register values under the existing register update algorithm and provides a correction coefficient for the final cardinality estimate, effectively ensuring estimation accuracy.
To verify the algorithm, a real network traffic data set is used to simulate a real network environment, and the algorithm is tested under different sampling probabilities. Experimental results show that, compared with the element-level sampling algorithm, the proposed algorithm achieves higher throughput and smaller estimation error.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings, in which:
FIG. 1 is a flow chart of a method for estimating network flow cardinality based on pre-sampling according to an embodiment of the present application;
FIG. 2 is a pseudo code diagram of a sampling and register update algorithm for pre-sampling based network flow radix estimation according to an embodiment of the present application;
FIG. 3 shows a sample analysis diagram under the ES algorithm provided by an embodiment of the present application;
FIG. 4 shows a sample analysis diagram under the PPS algorithm provided by an embodiment of the present application;
FIG. 5 is a pseudo code diagram of a flow base estimation result query algorithm for pre-sampling based network flow base estimation according to an embodiment of the present application;
fig. 6 shows a functional block diagram of a network flow base estimation system based on pre-sampling according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for the purpose of illustration and description only and are not intended to limit the scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present application. It should be appreciated that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flow diagrams by those skilled in the art under the direction of the present disclosure.
In addition, the described embodiments are only some, but not all, embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art based on embodiments of the application without making any inventive effort, fall within the scope of the application.
In order to enable one skilled in the art to use the present disclosure, the following embodiments are presented in connection with a particular application scenario "pre-sampling based network flow radix estimation", and the general principles defined herein may be applied to other embodiments and application scenarios by one skilled in the art without departing from the spirit and scope of the present disclosure.
The method disclosed by the embodiment of the application can be applied to any scene needing to carry out the network flow base estimation based on the pre-sampling, the embodiment of the application does not limit specific application scenes, and any scheme using the network flow base estimation based on the pre-sampling provided by the embodiment of the application is within the protection scope of the application. In order to facilitate understanding of the present application, the following detailed description of the technical solution provided by the present application is provided in connection with specific embodiments.
The scarce memory and computing resources on network processors greatly limit how well existing systems can measure network flow cardinality. Because the upper limit of hardware computing speed is relatively fixed, the measurement speed of existing methods cannot keep pace with modern high-speed links. Systems that use sampling are limited by technology and resources and usually have to trade sampling rate against estimation accuracy: a low sampling rate gives low estimation accuracy, while a high sampling rate cannot match the speed of a high-speed network and so cannot run in real time. Existing architectures must therefore compromise between accuracy and real-time performance and cannot meet the requirements of network flow cardinality measurement.
Many sampling methods cannot compress data effectively and lose much of the original characteristic information of the traffic; they ease the processing and storage burden of network measurement but cause larger estimation errors. In real network environments, many attacks consist of small flows and most traffic on high-speed links is small flows, yet many sampling algorithms cannot sample small flows effectively in real time, that is, they cannot measure the cardinality of these flows. In addition, existing network flow measurement methods scale poorly. Many are tailored to specific application requirements and cannot be extended to high-speed network environments, which leads to inefficient use of resources.
Most sketch-based cardinality estimation algorithms need to maintain one sketch per flow, which consumes a large amount of memory. To address this, some studies propose flow sharing techniques, such as the virtual HyperLogLog algorithm proposed by Xiao et al., which maintains one set of physical registers that all flows share through hashing. However, because of hash collisions, flow sharing inevitably introduces a large amount of noise. Moreover, sketch-based algorithms mainly target the efficient use of on-chip storage and do not significantly improve data throughput.
To solve these problems, the application introduces a pre-packet-level sampling technique, PPS (Preposition Packet-level Sampling). PPS improves on existing element-level sampling by deciding whether to sample using only one counter and a preset threshold. Compared with common element-hash sampling, this greatly reduces computational complexity and effectively improves throughput. Because the pre-packet-level sampling decision needs no information from the data packet, the time for extracting the element label or flow label is saved. In addition, the application corrects the original cardinality estimate by means of probability analysis. Experimental results on real-world data sets show that the proposed PPS algorithm greatly improves the throughput of flow measurement while preserving the accuracy of cardinality estimation. Relative to existing algorithms, the improvements and purposes of the application are as follows:
To address the mismatch between packet processing speed and high-speed network traffic, the existing element-level sampling technique is improved, reducing the average time of sampling processing and improving the throughput of the algorithm.
To address the loss of data characteristics caused by sampling, the frequency distribution of register values under the existing register update algorithm is analyzed, and a correction coefficient is provided for the final cardinality estimate, effectively ensuring estimation accuracy.
To verify the algorithm, a real network traffic data set is used, and the algorithm is simulated and tested under different sampling probabilities. Experimental results show that, compared with the element-level sampling algorithm, the proposed algorithm achieves higher throughput and smaller estimation error.
Fig. 1 is a flowchart of a network flow base estimation method based on pre-sampling according to an embodiment of the present application, where the method provided by the embodiment of the present application includes the following steps:
S101: Perform packet-level sampling on the data packets using a counter and a preset threshold.
In some possible embodiments, the packet-level sampling of the initial data packet by the counter and the preset threshold includes:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.
S102: Update the registers for the sampled data packets.
In some possible embodiments, the register updating of the sampled data packet includes:
parsing the sampled data packet and extracting the flow label and the element label of the data packet;
rehashing the flow label and the element label using a hash function;
locating a register according to a first target value in the element label;
and updating the register according to a second target value in the element label.
The application combines packet-level sampling with the sketch technique, mainly to address the low efficiency of existing element-level sampling. The packet-level sampling decision uses a counter and a preset threshold: a packet is sampled if the counter value is less than the threshold and skipped otherwise. Because the decision requires no complex computation such as packet parsing, the average sampling processing time is shortened and throughput is greatly improved. Register updates are then performed for the sampled data packets.
S103: Process the flow information stored in the registers with a cardinality estimation algorithm to obtain the target flow cardinality information, and send it to the client.
Illustratively, the information stored in the registers is re-encoded, and the encoding result of each flow is analyzed, together with the known distribution, for regularities in the values that occur, so as to correct the errors introduced by packet-level sampling and the sketch technique.
Specifically, let P (0 < P < 1) be the sampling probability in a measurement period. When each data packet arrives, the algorithm first decides whether to sample it according to P. To maximize data throughput, neither the flow label nor the element label of a packet is extracted before the sampling decision. In the actual measurement, a 7-bit counter is maintained; its value is denoted Count and is initialized to 0.
After each packet arrives, the algorithm first checks whether a reset is needed by comparing Count with 100. If Count equals 100, it is reset to 0 and written back to the counter; otherwise nothing is done. The algorithm then decides whether the packet needs to be sampled by comparing Count with the sampling probability P: if Count is less than P × 100, the packet is sampled and the register update algorithm is executed; otherwise the packet is skipped. Finally, the sampling counter is updated by incrementing Count by one.
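The counter logic described above can be illustrated with a minimal Python sketch. The class and method names are hypothetical, and rounding P × 100 to an integer threshold is an assumption made here to avoid floating-point comparison issues; the application's own pseudocode is the one shown in FIG. 2.

```python
class PrePacketSampler:
    """Counter-based packet-level sampling decision; no packet parsing is needed."""

    PERIOD = 100  # the counter wraps after 100 packets, so it fits in 7 bits

    def __init__(self, sampling_probability: float):
        if not 0.0 < sampling_probability < 1.0:
            raise ValueError("sampling probability must satisfy 0 < P < 1")
        # Assumption: precompute P * 100 once and round it to an integer.
        self.threshold = round(sampling_probability * self.PERIOD)
        self.count = 0

    def should_sample(self) -> bool:
        # Step 1: reset the counter when it reaches 100.
        if self.count == self.PERIOD:
            self.count = 0
        # Step 2: sample while the counter is below P * 100.
        sampled = self.count < self.threshold
        # Step 3: advance the counter for the next packet.
        self.count += 1
        return sampled


sampler = PrePacketSampler(0.25)
decisions = [sampler.should_sample() for _ in range(200)]
print(sum(decisions))  # 50 of the 200 packets are sampled
```

With a period of 100, this sketch can only express sampling probabilities in steps of 1%, and the sampled packets form a contiguous run at the start of each period rather than a random subset.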
Once the algorithm decides to sample, the register update algorithm is executed. First, the data packet is parsed and its flow label f and element label e are extracted. Both are then rehashed with a hash function, i.e., f = G(f) and e = G(e). Next, the value b of the first $\log_2 m$ bits of e determines the virtual register $R_f[b]$ to be updated in this round; its location is computed from f and seed[b], where seed[b] is the b-th random number generated before each measurement period. Finally, the value p is computed from the last $32 - \log_2 m$ bits of e; if p is greater than $R_f[b]$, p is written into the register. The detailed pseudocode of the sampling and register update algorithm is shown in FIG. 2.
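A minimal Python sketch of such a register update follows. Several details are assumptions rather than the application's own definitions: BLAKE2b (from the standard library) stands in for the hash function G, the physical register is located as `hash(f XOR seed[b]) mod N` in the style of shared-register sketches, p is taken to be the 1-based position of the leftmost 1 bit as in HyperLogLog, and the sizes `M` and `N` are hypothetical.

```python
import hashlib
import random

M = 128                  # virtual registers per flow (power of two, hypothetical)
N = 1 << 16              # physical registers shared by all flows (hypothetical)
B = M.bit_length() - 1   # log2(M) index bits
REST = 32 - B            # remaining bits used to compute p

rng = random.Random(1)
seed = [rng.getrandbits(32) for _ in range(M)]  # regenerated before each period
registers = [0] * N


def h32(value: int) -> int:
    """32-bit hash of a non-negative integer below 2**64 (stand-in for G)."""
    digest = hashlib.blake2b(value.to_bytes(8, "little"), digest_size=4).digest()
    return int.from_bytes(digest, "little")


def rank(bits: int, width: int) -> int:
    """1-based position of the leftmost 1 in a width-bit value (width + 1 if zero)."""
    return width - bits.bit_length() + 1


def update(flow_label: int, element_label: int) -> None:
    f = h32(flow_label)
    e = h32(element_label)
    b = e >> REST                          # first log2(M) bits select R_f[b]
    p = rank(e & ((1 << REST) - 1), REST)  # value p from the remaining bits
    idx = h32(f ^ seed[b]) % N             # assumed mapping to a physical register
    if p > registers[idx]:
        registers[idx] = p
```

For example, `update(0x0A000001, 0xC0A80001)` would record destination 192.168.0.1 for source 10.0.0.1; repeating the same call leaves the registers unchanged, which is what makes the structure count distinct elements.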
In some possible implementations, the radix estimation algorithm includes an original stream radix calculation and a correction coefficient calculation.
Illustratively, querying the cardinality estimate is split into two parts: calculating the original flow cardinality and calculating a correction coefficient.
The original flow cardinality estimate uses the estimation formula of the existing HLL algorithm. Flow f has m logical registers, $R_f[1], R_f[2], \ldots, R_f[m]$. From the register update algorithm above, each logical register records the value k with probability $1/2^k$, so the cardinality stored in a logical register can be taken to be $2^k$. Because the cardinality of each flow is spread over m buckets, the cardinality stored in each bucket is only $1/m$ of the true value. Therefore, the harmonic mean of the m buckets is computed and multiplied by m to obtain the original flow cardinality estimate. Finally, the original cardinality of each flow is calculated using equation (1).
In some possible embodiments, the calculation formula of the original flow radix in the original flow radix calculation is:
$$\hat{n}_s = \frac{\alpha_m m^2}{\sum_{j=1}^{m} 2^{-R_f[j]}} \quad (1)$$
where $\hat{n}_s$ is the original flow cardinality estimate, m is the number of virtual registers used per flow, j is the index of each virtual register in the virtual register set, $\alpha_m$ is the estimation bias-correction factor when m virtual registers are used per flow, and $R_f$ is the virtual register set of flow f.
Computing the exact value of $\alpha_m$ in the above formula is relatively complex. In practice, the following values are generally used: $\alpha_{16} = 0.673$, $\alpha_{32} = 0.697$, $\alpha_{64} = 0.709$, and $\alpha_m = 0.7213/(1 + 1.079/m)$ for $m \ge 128$.
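As a hedged illustration, the constants above and the standard HyperLogLog estimate, $\alpha_m m^2$ divided by the sum of $2^{-R_f[j]}$, could be coded as follows. The function names are hypothetical, and the input is assumed to be the list of the m virtual register values of one flow.

```python
def alpha(m: int) -> float:
    """Bias-correction constant alpha_m, using the values quoted in the text."""
    table = {16: 0.673, 32: 0.697, 64: 0.709}
    if m in table:
        return table[m]
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    raise ValueError("m must be 16, 32, 64, or at least 128")


def raw_estimate(virtual_registers: list[int]) -> float:
    """HyperLogLog estimate: alpha_m * m^2 / sum over j of 2^(-R_f[j])."""
    m = len(virtual_registers)
    return alpha(m) * m * m / sum(2.0 ** -r for r in virtual_registers)
```

Because the denominator is a sum of reciprocals, a few registers with unusually large values have little influence on the result, which is the reason the harmonic mean is used.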
Note that the above formula can deviate significantly for small flows. When z is not equal to 0, where z is the number of logical registers whose value is 0, the algorithm identifies the flow as a small flow and calculates the original flow cardinality estimate using equation (2).
In some possible embodiments, in the calculation of the original flow radix, when a small flow part is identified, the small flow part needs to be corrected, and a specific calculation formula is as follows:
$$\hat{n}_s = m \ln\frac{m}{z} \quad (2)$$
where $\hat{n}_s$ is the cardinality estimate for a small flow, and z is the number of registers whose value is 0.
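The small-flow case can be sketched in Python under the assumption that the correction is the usual HyperLogLog linear-counting estimate, m · ln(m / z); the function name is hypothetical.

```python
import math


def small_stream_estimate(virtual_registers: list[int]) -> float:
    """Linear-counting estimate m * ln(m / z); defined only when z > 0."""
    m = len(virtual_registers)
    z = virtual_registers.count(0)  # number of registers still at 0
    if z == 0:
        raise ValueError("no empty register: use the large-flow formula instead")
    return m * math.log(m / z)
```

When every register is still 0 the estimate is 0, and it grows as fewer registers remain empty.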
Because flow sharing is used, the cardinality estimate obtained for each flow from the two formulas above inevitably contains noise. The application uses the denoising formula (3) proposed in vHLL, where $\hat{n}$ denotes the sum of all estimated cardinalities and can be calculated using formulas (1) and (2).
In some possible embodiments, denoising is performed in the original stream radix calculation based on radix estimation results, and a specific formula is:
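$$\hat{n}_f = \frac{N m}{N - m}\left(\frac{\hat{n}_s}{m} - \frac{\hat{n}}{N}\right) \quad (3)$$
where $\hat{n}_f$ is the denoised estimate for flow f, $\hat{n}_s$ is the estimate obtained from the virtual registers of flow f, $\hat{n}$ is the sum of all estimated cardinalities, m is the number of virtual registers used per flow, and N is the number of physical registers in the measurement process.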
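A hedged Python sketch of the denoising step is given below. It assumes the formula takes the form published for vHLL, namely (N·m / (N − m)) · (raw/m − total/N); the function and parameter names are hypothetical, and clamping negative results to zero is an addition made here, not something the text states.

```python
def denoise(raw: float, total: float, m: int, n_physical: int) -> float:
    """Remove the expected noise contributed by other flows sharing the registers.

    raw        -- estimate from the m virtual registers of one flow
    total      -- sum of all estimated cardinalities
    m          -- virtual registers per flow
    n_physical -- physical registers in the shared array (N)
    """
    if n_physical <= m:
        raise ValueError("N must be larger than m")
    scale = n_physical * m / (n_physical - m)
    # Assumption: clamp to zero, since a cardinality cannot be negative.
    return max(0.0, scale * (raw / m - total / n_physical))
```

If a flow's raw estimate equals only its expected share of the overall load, total · m / N, the denoised result is zero.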
In some possible implementations, the correction coefficient calculation includes a packet-level sampling large-flow correction and a packet-level sampling small-flow correction. Because sampling is used, the result of equation (3) is not accurate. In the element-level sampling algorithm, sampling is performed on elements, so the final result is obtained simply by dividing the computed cardinality by the element-level sampling probability. Packet-level sampling, however, is performed on data packets, so the result cannot be obtained this directly.
FIG. 3 shows that under the ES algorithm the ratio of the sampled cardinality to the actual cardinality of each flow is close to the sampling probability. FIG. 4 shows that under the PPS algorithm this ratio is close to the packet-level sampling probability only for large flows. The application therefore calculates correction coefficients for the large-flow and small-flow portions separately.
First, let $P_f$ denote the ratio of the sampled cardinality to the actual cardinality of each flow, and let $T_e(P)$ be the correction coefficient function. FIG. 3 and FIG. 4 show that under the pre-packet-level sampling algorithm the sampled cardinality and the actual cardinality are still close to linearly related, so a first-order polynomial is selected for linear fitting, giving the correction coefficient function in expression (4).
$$T_e(P) = \alpha P + \beta \quad (4)$$
The application then uses the least squares method to find the correction coefficient function that best fits the data set. Flows whose estimate exceeds 2.687m are treated as large flows and all others as small flows, and the least-squares target value (5) is given for each of the two groups.
Here cnt is the number of entries Fref[i] in the frequency array Fref of each flow that are greater than or equal to $P2^{i}$. The final objective of the least-squares fit is to minimize the target value S. Finally, equation (6), which converts the packet-level sampling probability P into the element-level sampling probability, is computed from the data set used in the application.
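To make the fitting step concrete, here is a minimal Python sketch of an ordinary least-squares fit of the line $T_e(P) = \alpha P + \beta$. It is a stand-in under stated assumptions: it minimizes the plain unweighted sum of squared residuals rather than the application's own target value S, the function name is hypothetical, and the sample points are synthetic values generated from a known line purely to show the mechanics, not measurements from the application.

```python
def fit_line(ps: list[float], ratios: list[float]) -> tuple[float, float]:
    """Closed-form least-squares fit of ratios ~ slope * P + intercept."""
    n = len(ps)
    mean_p = sum(ps) / n
    mean_r = sum(ratios) / n
    sxx = sum((p - mean_p) ** 2 for p in ps)
    sxy = sum((p - mean_p) * (r - mean_r) for p, r in zip(ps, ratios))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_p
    return slope, intercept


# Synthetic, noise-free points on the line 0.9 * P + 0.02 (illustrative only).
probabilities = [0.1, 0.2, 0.4, 0.8]
observed = [0.9 * p + 0.02 for p in probabilities]
slope, intercept = fit_line(probabilities, observed)
```

In line with the text, the fit would be run twice, once on the large-flow group and once on the small-flow group, giving one (slope, intercept) pair per group.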
Dividing the original flow cardinality obtained from equation (3) by $T_e(P)$ gives the final cardinality estimate. The complete pseudocode of the flow cardinality query algorithm is shown in FIG. 5. In particular embodiments, network flow cardinality measurement is one of the underlying data sources for network management and can be used in many network management functions. For example, to detect scan attacks, the traffic estimator may define the source address as the flow label and each destination address as an element, measure the cardinality of each source-address flow in real time, and detect scanners that access too many destination addresses in a short time. In practical application, the following implementation steps are included:
1) Before measurement begins, an algorithm module is deployed in a designated measurement node.
2) Before each measurement period starts, the manager sets the basic measurement parameters for that period, such as the flow label, the element label, and the sampling probability. Once the parameters are set, the algorithm module initializes the corresponding information.
3) After each measurement period starts, the packet receiver on the measurement node begins to operate. When the algorithm module detects an arriving data packet, it applies the sampling algorithm shown in Algorithm 1 to decide whether to sample the packet (note that the packet is not parsed at this point; for example, its source and destination IP addresses are not extracted). If the sampling condition is not satisfied, the packet is ignored and the module continues with the next packet; otherwise, the packet is parsed and recorded using the register update algorithm shown in fig. 2. After each measurement period ends, the algorithm module deployed on the measurement node processes the stored flow information according to the radix estimation algorithm shown in fig. 5 and sends the final flow radix information to the software end, where the manager can enter the flow label to be queried and the system returns the corresponding radix value.
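The per-packet flow of step 3 can be sketched as below. This is an illustration under stated assumptions, not the algorithms of fig. 2 or fig. 5: the `src,dst` packet format, the helper names (`parse`, `update_registers`, `run_period`, `every_other`), the use of SHA-256 in place of the hash function used in the experiments, and the private per-flow register array (a HyperLogLog-style simplification) are all hypothetical. The point it shows is only the ordering: the sampling decision is taken before the packet is parsed.

```python
import hashlib

REGISTER_BITS = 4  # assumed: 2**4 = 16 registers per flow


def parse(raw_packet):
    """Stand-in parser: 'src,dst' -> (flow label, element label)."""
    src, dst = raw_packet.split(",")
    return src, dst


def update_registers(registers, flow, element):
    """HyperLogLog-style update: choose a register, keep the largest rank."""
    digest = hashlib.sha256(f"{flow}|{element}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
    tail_bits = 64 - REGISTER_BITS
    index = h >> tail_bits                       # top bits pick the register
    tail = h & ((1 << tail_bits) - 1)            # remaining bits
    rank = tail_bits - tail.bit_length() + 1     # 1-based position of first 1-bit
    regs = registers.setdefault(flow, [0] * (1 << REGISTER_BITS))
    regs[index] = max(regs[index], rank)


def run_period(raw_packets, should_sample):
    """Process one measurement period; parse only the sampled packets."""
    registers = {}
    parsed = 0
    for raw_packet in raw_packets:
        if not should_sample():      # decided before the packet is parsed
            continue
        flow, element = parse(raw_packet)
        parsed += 1
        update_registers(registers, flow, element)
    return registers, parsed


def every_other():
    """Toy sampler: accept packets 0, 2, 4, ... of the stream."""
    state = {"counter": 0}

    def should_sample():
        decision = state["counter"] < 1
        state["counter"] = (state["counter"] + 1) % 2
        return decision

    return should_sample


packets = [f"10.0.0.1,192.168.0.{i}" for i in range(10)]
registers, parsed = run_period(packets, every_other())
```

Here only five of the ten packets reach `parse`, which is where the saving in label-extraction time described above would come from.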
Further, the real-world traffic dataset used in the experiments of the present application comes from CAIDA. In this dataset, the number of packets captured per minute averages as high as 31247634, and the number of distinct source IP addresses averages as high as 585298. The application takes one minute as the measurement period and, for the experiments, uses the source IP address in the dataset as the flow label and the destination IP address as the element label. All code in the experiments is implemented in C++. To evaluate the performance of the PPS algorithm presented herein, the ES algorithm was also implemented in C++ for comparison. In addition, both implementations use 1 MB of memory to maintain the register set in the Sketch, and all hash functions involved are MurmurHash3. Table 1 below shows the average relative error (ARE) of the different solutions, and Table 2 shows the mean absolute error (MAE) of the different solutions.
TABLE 1
TABLE 2
Based on the same inventive concept, an embodiment of the application also provides a pre-sampling-based network flow base estimation system corresponding to the pre-sampling-based network flow base estimation method of the above embodiment. Because the principle by which the system solves the problem is similar to that of the method, the implementation of the system may refer to the implementation of the method, and repeated description is omitted.
Fig. 6 is a functional block diagram of a network flow base estimation system 600 based on pre-sampling according to an embodiment of the present application. As shown in fig. 6, the system includes:
a sampling module 610, configured to sample the data packet at a packet level through the counter and a preset threshold;
an updating module 620, configured to update a register of the sampled data packet;
and the result processing module 630 is configured to process the flow information stored in the register through a radix estimation algorithm, obtain target flow radix information, and send the target flow radix information to the client.
In some possible embodiments, the sampling module 610 is further configured to:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.
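The decision rule above can be sketched as follows. The source states only the comparison between the counter and the preset threshold; how the counter advances is an assumption here. In this sketch the counter is incremented once per arriving packet and wraps around after a fixed `period`, so that `threshold / period` plays the role of the packet-level sampling probability. The class name `CounterSampler` and the `period` parameter are hypothetical.

```python
class CounterSampler:
    """Packet-level sampling decided by a counter and a preset threshold."""

    def __init__(self, threshold, period):
        if period <= 0 or not 0 <= threshold <= period:
            raise ValueError("require period > 0 and 0 <= threshold <= period")
        self.threshold = threshold
        self.period = period      # assumed wrap-around length
        self.counter = 0

    def should_sample(self):
        # Sample when the counter is below the threshold, otherwise skip.
        decision = self.counter < self.threshold
        # Assumed update rule: advance once per packet and wrap around.
        self.counter = (self.counter + 1) % self.period
        return decision


sampler = CounterSampler(threshold=3, period=10)
decisions = [sampler.should_sample() for _ in range(20)]
sampled = sum(decisions)
```

No packet field is read and no hash is computed to reach the decision, which is the property the description relies on when it compares this approach with element hash sampling.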
The terms used above are explained as follows. Data packet (packet): user data is transmitted in the network in the form of data packets. A packet contains not only user data but also necessary information such as the source address (src), destination address (dst), source port (src_port), destination port (dst_port), and protocol (protocol). The source address and source port are the IP address and sending port of the computer that sends the packet, and the destination address and destination port are the IP address and receiving port of the computer that receives the packet. The protocol specifies the format and processing method of the data packet.
Hash (Hash): a hash is a mathematical function that converts an input of arbitrary length into an output of fixed length. Applying the same hash function to the same data always yields the same hash value; applying it to different data may yield the same or different hash values, and when two different inputs yield the same hash value, a hash collision is said to occur.
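Both properties can be observed in a few lines. This is an illustrative sketch only: it uses SHA-256 from the Python standard library truncated to one byte, not the hash function used in the experiments, and the helper name `h8` is hypothetical. Truncating to 8 bits leaves only 256 possible outputs, so 257 distinct inputs are guaranteed to contain a collision.

```python
import hashlib


def h8(data: bytes) -> int:
    """First byte of SHA-256: a fixed-length (8-bit) hash of any input."""
    return hashlib.sha256(data).digest()[0]


# The same input always gives the same hash value.
same = h8(b"10.0.0.1") == h8(b"10.0.0.1")

# 257 distinct inputs into 256 possible outputs must collide (pigeonhole).
seen = {}
collision = None
for i in range(257):
    key = str(i).encode()
    value = h8(key)
    if value in seen:
        collision = (seen[value], key)   # two different inputs, same hash
        break
    seen[value] = key
```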
Network flow radix estimation (Flow Cardinality Estimation): the radix of a network flow is the number of distinct elements in that flow, and the definitions of flow and element can be configured flexibly to meet the requirements of different measurement scenarios. The goal of per-flow radix measurement is to measure the number of distinct elements in each flow. That is, given a set of packets P and a set of flows F within a measurement period, the radix of any flow $f \in F$ is denoted $n_f$. The actual radixes of the flows in the measurement period are $\{n_{f_1}, n_{f_2}, n_{f_3}, \ldots\}$, and per-flow radix measurement computes a radix estimate for each of these flows over the measurement period.
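As a small illustration of this definition, the exact radix of each flow can be computed by keeping a set of distinct elements per flow. This is the ground truth that an estimator approximates, not the estimation method of the application; the function name `exact_radixes` and the sample addresses are hypothetical, and the flow label is taken to be the source address with the destination address as the element, as in the scan-detection example.

```python
from collections import defaultdict


def exact_radixes(packets):
    """Return {flow label: number of distinct elements} for (src, dst) pairs."""
    elements = defaultdict(set)
    for src, dst in packets:
        elements[src].add(dst)       # repeated destinations count once
    return {flow: len(dsts) for flow, dsts in elements.items()}


packets = [
    ("10.0.0.1", "192.168.0.1"),
    ("10.0.0.1", "192.168.0.2"),
    ("10.0.0.1", "192.168.0.1"),     # duplicate element for the same flow
    ("10.0.0.2", "192.168.0.9"),
]
radixes = exact_radixes(packets)
```

Keeping one set per flow needs memory proportional to the number of distinct elements, which is why sketch-based estimation with a fixed register budget is used at line rate instead.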
In summary, the pre-sampling-based network flow base estimation method provided by the embodiment of the application introduces a pre-packet-level sampling technique, PPS (Preposition Packet-level Sampling). First, the existing element-level sampling technique is improved so that the decision whether to sample is made using only one counter and a preset threshold. Compared with the common element hash sampling approach, this greatly reduces computational complexity and effectively improves throughput. Moreover, pre-packet-level sampling does not need any information from the data packet, which saves the time otherwise spent extracting the element label or the stream label. In addition, the application corrects the original base estimation result by means of probability analysis. Experimental results on real-world datasets show that the PPS algorithm provided by the application can greatly improve the throughput of flow measurement while maintaining base estimation accuracy.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method or a system. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom are contemplated as falling within the scope of the present application.

Claims (10)

1. A network flow radix estimation method based on pre-sampling, comprising:
carrying out packet level sampling on the data packet through a counter and a preset threshold value;
register updating is carried out on the sampled data packet;
and processing the stream information stored in the register through a base estimation algorithm to obtain target stream base information and sending the target stream base information to the client.
2. The network flow radix estimation method of claim 1, wherein performing packet level sampling on the initial data packet by a counter and a preset threshold comprises:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.
3. The network flow radix estimation method of claim 1, wherein register updating the sampled data packet comprises:
analyzing the sampled data packet, and extracting a stream label and an element label of the data packet;
recalculating the stream label and the element label using a hash function;
obtaining a register according to a first target value in the element label;
and updating the register according to a second target value in the element label.
4. The network flow radix estimation method of claim 1 wherein the radix estimation algorithm comprises an original flow radix calculation and a correction coefficient calculation.
5. The network flow base estimation method according to claim 4, wherein the calculation formula of the original flow base in the original flow base calculation is:
wherein the formula gives the original stream radix estimate value; m is the number of virtual registers used per stream; j is the index of each virtual register in the virtual register set; $\alpha_m$ is the estimation bias factor when m virtual registers are used per stream; and $R_f$ is the virtual register set for stream f.
6. The network flow base estimation method according to claim 5, wherein in the original flow base calculation, when a small flow part is identified, the small flow part needs to be corrected, and a specific calculation formula is as follows:
wherein the formula gives the estimated radix value for small streams, and z is the number of registers whose value is 0.
7. The network flow base estimation method according to claim 6, wherein denoising is performed based on the base estimation result in the original flow base calculation, and the specific formula is:
wherein the formula uses the sum of all estimated radixes, and N is the number of physical registers in the measurement process.
8. The network flow radix estimation method of claim 4, wherein the correction factor calculation comprises: packet-level sample large-stream correction and packet-level sample small-stream correction.
9. A pre-sampling based network flow radix estimation system, comprising:
the sampling module is used for carrying out packet level sampling on the data packet through the counter and a preset threshold value;
the updating module is used for updating the register of the sampled data packet;
and the result processing module is used for processing the stream information stored in the register through a base number estimation algorithm to obtain target stream base number information and sending the target stream base number information to the client.
10. The network flow cardinality estimation system of claim 9, wherein the sampling module is further configured to:
carrying out packet level sampling judgment through a counter and a preset threshold value;
sampling if the value of the counter is smaller than the preset threshold value;
and if the value of the counter is greater than or equal to the preset threshold value, not sampling.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311081497.XA CN117201368A (en) 2023-08-25 2023-08-25 Network flow base number estimation method and system based on pre-sampling

Publications (1)

Publication Number Publication Date
CN117201368A true CN117201368A (en) 2023-12-08

Family

ID=88997128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311081497.XA Pending CN117201368A (en) 2023-08-25 2023-08-25 Network flow base number estimation method and system based on pre-sampling

Country Status (1)

Country Link
CN (1) CN117201368A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination