CN111581489B - Storage space optimized sampling method based on shared counting tree - Google Patents

Publication number
CN111581489B
Authority
CN
China
Prior art keywords
flow
data packet
sampling
stream
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010438372.8A
Other languages
Chinese (zh)
Other versions
CN111581489A (en)
Inventor
杨武
玄世昌
王巍
苘大鹏
吕继光
唐德志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202010438372.8A
Publication of CN111581489A
Application granted
Publication of CN111581489B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of traffic sampling, and in particular relates to a storage-space-optimized sampling method based on a shared counting tree. The invention aims to save the storage space of sampling equipment. The method comprises: determining whether to sample an incoming data packet according to a sampling judgment mechanism; if the packet is to be sampled, looking up the flow node to which the packet belongs in a hash flow tracking table; if no such flow node is found, creating a new flow node for the packet in the flow tracking table; when sampling of a flow terminates, restoring the feature values of the flow stored in its flow node and in the shared counting tree set and importing them into an ordered flow feature record buffer; and writing the sampled flow feature records to a file when the buffer is full.

Description

Storage space optimized sampling method based on shared counting tree
Technical Field
The invention belongs to the technical field of flow sampling, and particularly relates to a storage space optimized sampling method based on a shared counting tree.
Background
In recent years, the variety and number of applications on the Internet have expanded significantly. To cope with the impact of application changes on the network, network managers need to measure the application characteristics of traffic, and the traffic must be classified during measurement. To support application classification, the sampled traffic should retain sufficient application characteristics. A session of a modern application is typically composed of multiple flows, which may share the same source IP but have different destination IPs. The more flows of an application session are sampled, the more application characteristics the sampled traffic retains, which helps a machine learning algorithm identify the application of the sampled traffic accurately. RelSamp samples only the flows whose source IPs fall within a certain range. With the effective sampling ratio held constant, RelSamp can increase the number of flows collected per application session by raising the flow sampling probability and lowering the packet sampling probability, thereby retaining more application characteristics of the sampled traffic. However, for any statistical feature of a flow, such as the flow size, RelSamp must assign a counter to each flow to record it. Every counter is allocated the same amount of space, and the counting range of the counter must cover the count value of the largest flow. Network traffic follows a heavy-tailed distribution: a small proportion of large flows accounts for a large proportion of the traffic. Studies show that, when flows are ordered by size, the top 15% of flows account for 95% of the total traffic.
Recording the size of every flow with a counter of the same size therefore inevitably wastes a great deal of storage space on the flow sampling device. Allocating equally sized counters to each flow to record other statistical features (e.g., the number of FIN, SYN and ACK packets arriving in the flow during sampling) wastes storage space in the same way. In particular, when RelSamp is deployed in a high-speed network with very high flow concurrency, it places heavy storage pressure on the flow sampling device.
Disclosure of Invention
The invention aims to provide a storage space optimized sampling method based on a shared count tree, which saves the storage space of sampling equipment.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
Step 1: according to the pre-configured effective sampling ratio $p_e$, the source IP sampling probability $p_h$ and the target flow sampling probability, determine and output the packet sampling probability $p_p$ and the current flow sampling probability $p_f$;
Step 2: take a data packet from the packet buffer queue and assign it two random numbers $r_f$ and $r_p$, each in the range [0, 1);
Step 3: obtain the source IP of the packet and compute the hash value of the source IP;
Step 4: multiply the hash value of the source IP by the source IP selection probability $p_h$ to obtain a target value;
Step 5: if the target value is within the preset range, execute step 6; otherwise, discard the packet;
Step 6: look up the flow node to which the packet belongs;
if no flow node corresponding to the packet is found and $r_f \le p_f$, sample the packet, create a flow node and a flow feature storage unit for it, and update the flow feature storage unit of the flow;
if the flow node to which the packet belongs is found and $r_p \le p_p$, sample the packet and update the flow feature storage unit of the flow;
if no flow node corresponding to the packet is found and $r_f > p_f$, or the flow node to which the packet belongs is found and $r_p > p_p$, discard the packet;
Step 7: when sampling of a flow ends, restore the flow features stored in the flow node and the flow statistical features stored in the shared counting tree into a complete flow record, and add the record to the flow record buffer queue; when the buffer is full, write the sampled flow feature records to a file.
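The three-stage decision in steps 2 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of SHA-256, and the form of the source IP test (hash value in the lowest $p_h$ fraction of the hash space) are all hypothetical. The method as described multiplies the hash value by $p_h$ and checks the result against a preset range that the text does not specify. The random numbers `r_f` and `r_p` are passed in as arguments so that the behaviour is reproducible.

```python
import hashlib


def source_ip_selected(src_ip: str, p_h: float) -> bool:
    """Source IP selection stage (assumed form): keep the IP when its
    64-bit hash falls in the lowest p_h fraction of the hash space."""
    digest = hashlib.sha256(src_ip.encode()).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value < p_h * 2**64


def decide(src_ip, flow_known, p_h, p_f, p_p, r_f, r_p):
    """Return the action for one packet.

    flow_known: True if a flow node for the packet's five-tuple exists.
    r_f, r_p:   the two random numbers in [0, 1) assigned to the packet.
    """
    # Stage 1: source IP selection.
    if not source_ip_selected(src_ip, p_h):
        return "drop"
    # Stage 2: flow selection, only for packets of untracked flows.
    if not flow_known:
        return "sample_new_flow" if r_f <= p_f else "drop"
    # Stage 3: packet selection for packets of tracked flows.
    return "sample_existing_flow" if r_p <= p_p else "drop"
```

A caller would create the flow node on `"sample_new_flow"` and update the existing flow feature storage unit on `"sample_existing_flow"`.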
The present invention may further comprise:
In step 1, determining and outputting the packet sampling probability $p_p$ and the current flow sampling probability $p_f$ from the pre-configured effective sampling ratio $p_e$, the source IP sampling probability $p_h$ and the target flow sampling probability specifically comprises:
Step 1.1: input the pre-configured effective sampling ratio $p_e$, the source IP sampling probability $p_h$ and the target flow sampling probability;
Step 1.2: initialize the packet sampling probability $p_p$ and the current flow sampling probability $p_f$:
$p_p = p_e / p_h$
$p_f = p_e / p_h$
Step 1.3: let
Figure BDA0002503152910000023
Step 1.4: obtain the current sampling ratio $p_x$;
Step 1.5: if
Figure BDA0002503152910000024
then output the packet sampling probability $p_p$ and the current flow sampling probability $p_f$ and end the calculation; otherwise, execute step 1.6; here $\alpha$ is the configured precision;
Step 1.6: if $|p_x - p_e| \le \alpha$, return to step 1.3; otherwise, execute step 1.7;
Step 1.7: if the current sampling ratio $p_x$ is greater than the pre-configured effective sampling ratio $p_e$, let $p_p = 0.5\,(p_p + t)$ with $t = 0.00001$ and return to step 1.6; otherwise, let $p_p = 0.5\,(p_p + 1)$ and return to step 1.6.
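The initialization in step 1.2 and the bisection-style update of $p_p$ in step 1.7 can be written out as a small Python sketch. Only these two steps are covered, because the formulas of steps 1.3 and 1.5 are given as images and are not reproduced here. The function names are hypothetical.

```python
T_LOWER = 0.00001  # the constant t from step 1.7


def initial_probabilities(p_e: float, p_h: float):
    """Step 1.2: both p_p and p_f start at p_e / p_h."""
    start = p_e / p_h
    return start, start


def adjust_packet_probability(p_p: float, p_x: float, p_e: float,
                              t: float = T_LOWER) -> float:
    """Step 1.7: halve the distance between p_p and a bound.

    If the measured sampling ratio p_x is above the configured p_e,
    p_p moves toward the lower bound t; otherwise toward the upper bound 1.
    """
    if p_x > p_e:
        return 0.5 * (p_p + t)
    return 0.5 * (p_p + 1.0)
```

For example, with $p_p = 0.5$ and a measured ratio below $p_e$, the update yields $0.5 \times (0.5 + 1) = 0.75$.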
In step 6, the flow feature storage unit is created for the packet as follows: parse the packet and write the five-tuple information, flow arrival time, latest flow update time, minimum payload length and maximum payload length into the flow node, where the latest update time and the arrival time are both the current time, and the minimum and maximum payload lengths are both the application-layer payload length of the packet; if the packet is a TCP packet, parse the TCP header and check whether the ACK, FIN, SYN and RST flag bits are set; for each flag bit that is set, count the flow once in the corresponding shared counting tree, which records the number of packets of the flow carrying that flag; count the flow once in the shared counting tree that stores the flow size; compute the packet length len and, taking 32 B as one data block, obtain the number c of data blocks occupied by the packet,
Figure BDA0002503152910000031
and count the flow c times in the shared counting tree that stores the flow length; if the packet is not a TCP packet, or none of these flag bits is set, the numbers of ACK, SYN, FIN and RST packets of the flow need not be counted in the shared counting trees.
The specific steps of updating the stream feature storage unit of the stream in the step 6 are as follows:
step 6.1: updating the stream latest update time in the stream node;
step 6.2: analyzing the data packet, and calculating the effective load length of the application layer of the data packet;
step 6.3: if the effective load length of the data packet application layer is greater than the original maximum effective load length, updating the maximum effective load length; if the effective load length of the data packet application layer is smaller than the original minimum effective load length, updating the minimum effective load length;
step 6.4: the statistical feature values are updated in the shared count tree.
The specific steps of updating the statistical characteristic value in the shared count tree in the step 6.4 are as follows:
Step 6.4.1: extract the five-tuple information of the packet, hash it, locate the flow node corresponding to the packet in the flow table, and extract the flow label $f_{tuple}$ of the flow from the flow node;
Step 6.4.2: generate a random number i, i ∈ [0, r);
Step 6.4.3: select S[i] from the random number set S and XOR it with the flow label $f_{tuple}$; pass the XOR value as the argument to the main hash function H, which yields the hash function $h_i$; compute the hash value
Figure BDA0002503152910000032
to locate the enhanced counter used for this count, where p is the number of enhanced counters;
Step 6.4.4: update the statistical feature value C[u]: C[u] ← C[u] + 1;
if C[u] does not overflow, the update of the statistical feature value is complete;
if C[u] overflows and the subscript of its parent node is greater than or equal to the subscript of the virtual root node, the enhanced counter L[u] has overflowed and the shared counting tree is no longer used for counting;
if C[u] overflows and the subscript of its parent node is less than the subscript of the virtual root node, update u,
Figure BDA0002503152910000033
and re-execute step 6.4.4.
In step 7, the statistical feature value of a flow is recovered from the shared counting tree as follows:
Step 7.1: extract the flow label $f_{tuple}$ from the flow node of the flow whose sampling has ended;
Step 7.2: XOR each element S[i] of the random number set S in turn with the flow label of flow f, pass the XOR values as arguments to the main hash function H, and compute the r hash values
Figure BDA0002503152910000041
to locate the r enhanced counters and the subtree counters that record the statistical feature value of the flow;
Step 7.3: compute the statistical feature value s of the flow:
Figure BDA0002503152910000042
where $X_i$ is the value of the subtree counter, $l_i$ is the height of the subtree counter, n is the total count value of the current shared counting tree, and $k_i$ is the number of leaf nodes in the subtree counter,
Figure BDA0002503152910000043
The beneficial effects of the invention are as follows:
The storage-space-optimized sampling method based on a shared counting tree does not affect the accuracy with which an SVM classifier or a C4.5 classifier identifies the application of the sampled traffic, and it needs only a small amount of storage space during sampling. Specifically, when an SVM classifier is used for application identification on the sampled traffic of the invention, the average accuracy over the identified applications is 0.867; when a C4.5 classifier is used, the average precision over the identified applications is 0.891. With 9 GB of traffic as input, the invention needs at most 700 KB and as little as 200 KB of storage space during sampling.
Drawings
Fig. 1 is an overall framework diagram of the present invention.
Fig. 2 is a flow chart for determining the relevant sampling probabilities in the present invention.
Fig. 3 is a flow chart of the sampling judgment strategy in the present invention.
FIG. 4 is a diagram of an example of an update stream characteristics storage unit according to the present invention.
FIG. 5 is a three-level shared count tree storage structure diagram.
FIG. 6 is a three-level shared count tree logical structure diagram.
Fig. 7 is a diagram of a counter memory structure.
Fig. 8 is a diagram of an enhanced counter.
Fig. 9 is a diagram of enhanced counter vectors.
Fig. 10 is a diagram showing an example of a stream profile restoration process in which quintuple information is tuple 3.
Fig. 11 is a diagram for explaining a selection method of a root node in a sub-tree counter.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides MiniSamp, a storage-space-optimized sampling method based on a shared counting tree. As shown in Fig. 1, the technical scheme of MiniSamp is as follows:
(1) Determine whether to sample an incoming data packet according to the sampling judgment mechanism.
(2) If the incoming packet is to be sampled, look up the flow node to which the packet belongs in the hash flow tracking table; if no such flow node is found, create a new flow node for the packet in the flow tracking table.
(3) The flow feature storage structure is composed of flow feature storage units; each unit consists of the flow node of a sampled flow and the counters in the shared counting tree set that record the statistical feature values of that flow. After the flow node to which the sampled packet belongs has been located in the flow tracking table, the relevant features of the packet are extracted, and the feature values of the sampled flow are updated in the flow node and in the shared counting tree set respectively.
(4) When sampling of a flow terminates, restore the feature values of the flow stored in its flow node and in the shared counting tree set and import them into an ordered flow feature record buffer; when the buffer is full, write the sampled flow feature records to a file.
(5) Input the sampled flow feature record file into a pre-trained classifier, which produces application classification results flow by flow.
(1) Determining the relevant sampling probabilities
During sampling, MiniSamp judges each incoming data packet in three stages and decides from the results whether to sample it. The three stages are the source IP selection stage, the flow selection stage and the packet selection stage, and each is assigned a probability: $p_h$, $p_f$ and $p_p$ respectively. $p_h$ controls the size of the set of sampled source IPs in the source IP selection stage, $p_f$ controls the number of flows collected per sampled source IP in the flow selection stage, and $p_p$ controls the number of packets sampled in each sampled flow.
Before sampling formally starts, the values of the sampling probabilities $p_h$, $p_f$ and $p_p$ are determined. The network operator typically pre-configures the effective sampling ratio $p_e$ and $p_h$ ($p_h \ge p_e$) and specifies the target flow sampling probability according to the sampling purpose. To determine the sampling probabilities, an iterative algorithm takes the incoming packets as input and performs a binary search on $p_f$ and $p_p$, so that $p_f$ continuously approaches the target flow sampling probability while the effective sampling ratio stays at $p_e$, and then outputs the value of $p_p$ at that point. Once the three sampling probabilities $p_h$, $p_f$ and $p_p$ have been determined, all sampled flow feature records generated while determining them are discarded, and formal sampling begins. The flow chart for determining the sampling probabilities is shown in Fig. 2.
(2) Sampling mechanism
The packet capture thread captures packets in a loop at the network card of the flow sampling device and stores the captured TCP or UDP packets in the packet buffer queue. A status flag, tag, is set to 1 during the sampling stage.
The sampling judgment logic is as follows. First, a packet is taken from the packet buffer queue and assigned two random numbers in the range [0, 1), denoted $r_f$ and $r_p$. In the source IP selection stage, the source IP of the packet is hashed and the hash value is multiplied by $p_h$ to obtain a target value; if the target value is within the preset range, the flow selection stage is entered, and otherwise the packet is discarded. In the flow selection stage, the five-tuple information of the packet (source IP, source port, destination IP, destination port, transport layer protocol) is hashed, and the flow node of the flow to which the packet belongs is looked up in the hash flow table using the hash value. If no flow node corresponding to the packet is found and $r_f \le p_f$, the packet is sampled and a flow node is created for it in the flow table. If the flow node of the packet is found, the packet selection stage is entered: if $r_p \le p_p$, the packet is sampled and the flow feature storage unit to which it belongs is updated. In all other cases the packet is discarded. The flow chart is shown in Fig. 3.
(3) Stream feature update
MiniSamp is a sampling algorithm that supports application classification. So that a machine learning algorithm can identify the application type of the sampled traffic, MiniSamp records during sampling the flow statistical features that can be used for application classification. Drawing on existing features that support traffic application classification and on its own setting, the unidirectional sampled flow record generated by MiniSamp consists of the following flow features: transport layer protocol, source port, destination port, minimum payload length, maximum payload length, number of packets, total data length, average segment size, number of ACK packets, number of SYN packets, number of FIN packets, and number of RST packets. Features such as the minimum payload length, maximum payload length and number of packets can be used to identify UDP applications; features such as the numbers of ACK, SYN, FIN and RST packets and the average segment size can be used to identify TCP applications; features such as the transport layer protocol, source port, destination port and total data length can be used to identify both TCP and UDP applications. In a UDP flow record, the statistical feature values related to the TCP flag bits are all set to 0.
MiniSamp stores the application features of a unidirectional flow partly in a flow node and partly in a set of shared counting trees. The flow node records the source IP, destination IP, source port, destination port, transport layer protocol, minimum payload length, maximum payload length, flow arrival time and latest flow update time. The number of packets, the total data length, and the numbers of ACK, SYN, FIN and RST packets are each recorded in their own shared counting tree. A flow feature storage unit consists of the flow node and the counters in the shared counting trees that record the statistical feature values of the flow.
When a flow feature storage unit is created for a sampled packet, the packet is parsed and the five-tuple information, flow arrival time, latest flow update time, minimum payload length and maximum payload length are written into the flow node. The latest update time and the arrival time are both the current time, and the minimum and maximum payload lengths are both the application-layer payload length of the packet. If the packet is a TCP packet, the TCP header is parsed to check whether the ACK, FIN, SYN and RST flag bits are set; for each flag bit that is set, the flow is counted once in the corresponding shared counting tree, which records the number of packets of the flow carrying that flag. The flow is counted once in the shared counting tree that stores the flow size. The packet length len is computed and, taking 32 B as one data block, the number c of data blocks occupied by the packet is obtained. The flow is counted c times in the shared counting tree that stores the flow length. The formula for c is:
Figure BDA0002503152910000071
When the flow feature storage unit of a packet is updated, the latest flow update time in the flow node is updated first. The packet is then parsed, its application-layer payload length is computed, and the result is compared with the maximum and minimum payload lengths in the flow node: if it is greater than the stored maximum payload length, the maximum is updated; if it is less than the stored minimum payload length, the minimum is updated. The update of the shared counting trees is the same as when the flow feature storage unit is created. An example of MiniSamp updating a flow feature storage unit is shown in Fig. 4, where SCT is the abbreviation of Shared Counter Tree.
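The creation and update of the flow node fields described above can be sketched in Python as follows. This is an illustrative sketch only: the `FlowNode` class and the function names are hypothetical, timestamps are plain floats, and the shared counting tree updates are left out.

```python
from dataclasses import dataclass


@dataclass
class FlowNode:
    """Per-flow fields kept in the flow node (hypothetical layout)."""
    five_tuple: tuple      # (src IP, src port, dst IP, dst port, protocol)
    arrival_time: float
    last_update: float
    min_payload: int
    max_payload: int


def create_flow_node(five_tuple, payload_len, now):
    # On creation both timestamps are the current time, and both payload
    # bounds are the payload length of the first sampled packet.
    return FlowNode(five_tuple, now, now, payload_len, payload_len)


def update_flow_node(node, payload_len, now):
    # Refresh the latest update time, then widen the payload bounds.
    node.last_update = now
    if payload_len > node.max_payload:
        node.max_payload = payload_len
    if payload_len < node.min_payload:
        node.min_payload = payload_len
```

A flow whose sampled packets carry payloads of 100, 40 and 300 bytes ends with a minimum of 40 and a maximum of 300, while its arrival time stays at the time of the first packet.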
The process of updating a statistical feature value of a sampled flow in a shared counting tree is described below. To record a given statistical feature value of the sampled flows, a set of shared counters is allocated to all sampled flows during sampling. The counters are logically organized into an approximate binary tree, which forms the shared counting tree in MiniSamp.
The shared counting tree is defined as follows. The memory space allocated to the tree is N bits and the space allocated to each counter is a bits, so the tree contains N/a counters. The tree has height h, that is, h levels in total, with the lowest level numbered 0 and the top level numbered h-1. The tree has p leaf nodes, and every non-leaf node has degree 2. If level h-1 contains more than one node, a virtual root node is added to the tree. The storage structure of a three-level shared counting tree is shown in Fig. 5 and its logical structure in Fig. 6.
In the counter, the most significant bit is used as the status bit, and the remaining a-1 bits are used for counting, and the storage structure of the counter is shown in fig. 7.
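The counter layout can be illustrated with a few lines of Python bit manipulation. The width of 8 bits and the helper names `pack` and `unpack` are assumptions chosen for the example; the text only fixes that the most significant bit is the status bit and the other a-1 bits hold the count.

```python
A_BITS = 8  # hypothetical counter width a, in bits

STATUS_MASK = 1 << (A_BITS - 1)   # most significant bit: status bit
COUNT_MASK = STATUS_MASK - 1      # remaining a-1 bits: count value


def pack(status: int, count: int) -> int:
    """Combine a status bit and a count into one a-bit counter word."""
    if status not in (0, 1) or not 0 <= count <= COUNT_MASK:
        raise ValueError("status or count out of range")
    return (status << (A_BITS - 1)) | count


def unpack(counter: int):
    """Split an a-bit counter word into (status bit, count value)."""
    return (counter >> (A_BITS - 1)) & 1, counter & COUNT_MASK
```

With a = 8, each counter can count up to 127 before it overflows.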
A leaf node of the counting tree is denoted C[i], i ∈ [0, p), and the nodes on the path from C[i] to the root node form an enhanced counter, denoted L[i]. For example, L[0] = {C[0], C[8], C[12]} is an enhanced counter. Since the counting tree has p leaf nodes, there are p enhanced counters. The set of enhanced counters in the counting tree is represented by a vector L, with L = {L[1], L[2], ...}. The enhanced counter is shown in Fig. 8.
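The path that makes up an enhanced counter can be computed with a short Python sketch. It assumes that the counters are stored level by level in one array and that the parent of C[u] is C[p + u // 2]. This indexing rule is inferred from the single example L[0] = {C[0], C[8], C[12]} with p = 8 leaves and is not stated in this form in the text; the function names are hypothetical.

```python
def counter_total(p: int, h: int) -> int:
    """Number of real counters in a tree with p leaves and h levels,
    assuming each level has half as many nodes as the one below."""
    return sum(p >> level for level in range(h))


def parent(u: int, p: int) -> int:
    """Assumed array index of the parent of C[u]."""
    return p + u // 2


def enhanced_counter_path(i: int, p: int, h: int):
    """Indices of the counters on the path from leaf C[i] upward."""
    path = [i]
    for _ in range(h - 1):
        path.append(parent(path[-1], p))
    return path
```

For the three-level tree with 8 leaves this gives 14 real counters, so under the assumed layout the virtual root would have index 14.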
For any sampled flow f, r (r < p) enhanced counters are selected from the p enhanced counters using the five-tuple information of the flow and r independent hash functions $h_i$. These r enhanced counters form the enhanced counter vector of flow f, denoted $L_f$, and its i-th element is denoted $L_f[i]$. $L_f[i]$ is computed as follows, where i ∈ [0, r) and the hash function $h_i$ takes values in [0, p-1].
$L_f[i] = L[h_i(f)]$ (2)
To reduce the difficulty of designing the hash functions, r mutually independent hash functions are not actually designed. Instead, a single main hash function H is designed, and a set S of r elements is used to simulate r mutually independent hash functions. The formula for hashing flow f with the hash function $h_i$ is:
Figure BDA0002503152910000081
Since $r \ll p$, the probability that r enhanced counters chosen at random from the p enhanced counters of the shared counting tree are all distinct is
Figure BDA0002503152910000082
Thus the enhanced counters in the vector $L_f$ can be regarded as mutually distinct.
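To see why r randomly chosen enhanced counters are almost always distinct when r is much smaller than p, the probability can be evaluated numerically. The sketch below assumes the r hash values are independent and uniform over the p enhanced counters, in which case the probability is the standard product (1 - 0/p)(1 - 1/p)...(1 - (r-1)/p). It is a worked illustration and not necessarily the exact expression shown in the patent figure; the function name is hypothetical.

```python
from math import prod


def prob_all_distinct(p: int, r: int) -> float:
    """Probability that r independent uniform picks from p counters
    are pairwise distinct."""
    return prod((p - k) / p for k in range(r))
```

With p = 1,000,000 enhanced counters and r = 3, the probability is about 0.999997, so collisions within one flow's vector are rare.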
In Fig. 9, r = 3; the enhanced counter vector of flow f is $L_f = \{L[1], L[0], L[4]\}$ and that of flow g is $L_g = \{L[3], L[4], L[6]\}$. The enhanced counter L[4] is shared by flow f and flow g. Because different flows can share the same enhanced counter, the number of flows that the shared counting tree can record is much larger than the number of its enhanced counters.
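Simulating r hash functions with one main hash function H and a set S of r random values can be sketched in Python as follows. The choice of SHA-256 as H, the folding of the five-tuple into a 64-bit flow label, and the reduction modulo p are all assumptions made for the example; the names `master_hash`, `flow_label` and `enhanced_counter_vector` are hypothetical.

```python
import hashlib


def master_hash(value: int) -> int:
    """Stand-in for the main hash function H (assumed: SHA-256, 64 bits)."""
    digest = hashlib.sha256(value.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def flow_label(five_tuple) -> int:
    """Fold a five-tuple into a 64-bit integer label (assumed encoding)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def h(i: int, label: int, S, p: int) -> int:
    """Simulated i-th hash function: XOR the label with S[i], apply H,
    and reduce to a leaf index in [0, p)."""
    return master_hash(label ^ S[i]) % p


def enhanced_counter_vector(label: int, S, p: int):
    """Leaf indices of the r enhanced counters assigned to a flow."""
    return [h(i, label, S, p) for i in range(len(S))]
```

The elements of S must fit in 64 bits here so that the XOR result can be encoded in 8 bytes. The same flow label always maps to the same r enhanced counters, which is what allows the counters to be located again when the flow record is restored.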
The steps for updating statistical feature values such as the numbers of packets, ACK packets, SYN packets, FIN packets and RST packets of a sampled flow in the shared counting tree are as follows:
Step 1: extract the five-tuple information of the incoming packet, hash it, locate the flow node corresponding to the packet in the flow table, and extract the flow label of the flow from the flow node. The flow label contains information such as the source IP, source port, destination IP, destination port, transport layer protocol and arrival time of the flow. Let the flow of the incoming packet be f and its flow label be $f_{tuple}$.
Step 2: generate a random number i, where i ∈ [0, r).
Step 3: let H be the main hash function. Select S[i] from the random number set S, whose size is r, and XOR it with the flow label $f_{tuple}$. Pass the XOR value as the argument to H, which yields the hash function $h_i$, and compute the hash value
Figure BDA0002503152910000083
to locate the enhanced counter used for this count.
Step 4: C[u] ← C[u] + 1. If C[u] does not overflow, the procedure ends. If C[u] overflows and the subscript of its parent node is greater than or equal to the subscript of the virtual root node, the enhanced counter L[u] has overflowed and the shared counting tree is no longer used for counting. If C[u] overflows and the subscript of its parent node is less than the subscript of the virtual root node, the status bit of C[u] is set to 1, u is updated according to equation 5, and step 4 is repeated.
Figure BDA0002503152910000084
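The update steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: Equation 5 is available only as a figure, so the parent subscript `p + u // 2` is an assumption chosen to match the Fig. 6 example (C[4] and C[5] under C[10], C[10] and C[11] under C[13]). The class name `SharedCountTree`, the counter width `bits`, the virtual root subscript `2 * p - 2`, and resetting an overflowed counter to 0 are likewise illustrative.

```python
import random


class SharedCountTree:
    """Toy shared counting tree with p leaf counters (p a power of two)."""

    def __init__(self, p: int, bits: int):
        self.p = p
        self.limit = 1 << bits            # a counter overflows on reaching this value
        self.virtual_root = 2 * p - 2     # assumed subscript of the virtual root
        self.C = [0] * self.virtual_root  # node counters, leaves first
        self.status = [0] * self.virtual_root

    def parent(self, u: int) -> int:
        # Assumed layout, not taken from Equation 5.
        return self.p + u // 2

    def increment(self, u: int) -> bool:
        """Count once starting at leaf u; return False if the enhanced counter overflowed."""
        while True:
            self.C[u] += 1
            if self.C[u] < self.limit:
                return True               # no overflow: done
            self.C[u] = 0                 # assumed wrap-around on overflow
            if self.parent(u) >= self.virtual_root:
                return False              # enhanced counter overflow
            self.status[u] = 1            # record that a carry left this node
            u = self.parent(u)            # continue counting at the parent


def update_flow(tree: SharedCountTree, counter_vector: list[int]) -> bool:
    """Steps 2 to 4: pick one of the flow's r enhanced counters at random, count once."""
    i = random.randrange(len(counter_vector))
    return tree.increment(counter_vector[i])
```

With `SharedCountTree(8, 2)`, four calls to `increment(4)` leave C[4] at 0 with its status bit set and C[10] at 1, which mirrors case (b) of Fig. 11.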
The steps for recording the total data length of a flow in the shared counting tree during sampling are as follows:
Step 1: Parse the incoming data packet, calculate its length len and, taking 32 B as one data block, calculate the number of data blocks c of the data packet.
Step 2: c ← c - 1.
Step 3: Count the number of data blocks of the flow to which the data packet belongs once in the shared counting tree.
Step 4: If c is less than 0, the counting of the total data length of the flow ends; otherwise, go to step 2.
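A minimal sketch of the block counting above. It assumes that the number of 32 B blocks is the packet length rounded up (the patent's own formula for c is given only as a figure) and that `count_once` is a hypothetical callback that counts the flow once in the shared counting tree storing flow length. Read literally, the loop in steps 2 to 4 would execute the count c + 1 times; the sketch counts exactly c times, which is what claim 2 states.

```python
BLOCK_BYTES = 32  # size of one data block, as stated in the text


def block_count(packet_len: int) -> int:
    """Number of 32 B blocks occupied by a packet (assumed to round up)."""
    return (packet_len + BLOCK_BYTES - 1) // BLOCK_BYTES


def record_length(packet_len: int, count_once) -> int:
    """Count the flow once per data block; return the number of blocks counted."""
    c = block_count(packet_len)
    for _ in range(c):
        count_once()  # hypothetical hook into the shared counting tree
    return c
```

Counting in 32 B blocks rather than bytes keeps the number of counter increments per packet small, at the cost of rounding each packet up to a block boundary.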
(4) Flow characteristic restoration
The flow characteristic restoration thread is started every 10 s. After sampling of a flow has finished, the thread restores the flow characteristics stored in the flow node and the flow statistical characteristics stored in the shared counting tree into a complete flow record, and adds the flow record to the flow record buffer queue.
The flow characteristic restoration thread first sets the state flag tag to 0. While the flag is 0 no sampling takes place, and captured data packets are held temporarily in the data packet buffer queue. So as not to disturb sampling operations already in progress, the thread waits 0.25 s before acquiring the mutual exclusion lock of the flow table and obtaining the current time. It then scans the flow nodes in the flow table in sequence. For each flow node, it calculates the interval between the latest update time of the flow and the current time. If the interval is greater than 16 seconds, the flow is considered no longer active in the network and its sampling is finished. In that case the thread uses the flow five-tuple information and the flow arrival time to restore, from the set of shared counting trees, the statistical characteristic values of the sampled flow, such as the number of data packets, the total length, and the numbers of ACK, SYN, FIN and RST packets. The restored total length, which is counted in 32 B data blocks, is multiplied by 32 and then divided by the number of data packets to obtain the average length of the flow. The thread writes the five-tuple information, flow arrival time, flow duration, minimum payload length, maximum payload length, each statistical characteristic value restored from the set of shared counting trees, and the average length into a flow record buffer, and adds the flow record to the flow record buffer queue. If the number of flow records in the flow record buffer queue exceeds m (where m is far smaller than the size of the hash table), the thread acquires the mutual exclusion lock of the ordered flow record buffer queue, which keeps the flow records ordered by source IP and flow arrival time.
After all flow records in the flow record buffer queue have been dequeued and added to the ordered flow record buffer queue, the mutual exclusion lock of the ordered flow record buffer queue is released. When all flow nodes in the flow table have been scanned and restored, the mutual exclusion lock of the flow table is released, the state flag tag is set to 1, the flow characteristic restoration stage ends, and sampling restarts. An example of the flow characteristic record restoration process for the tuple information tuple3 is shown in Fig. 10.
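The scan performed by the restoration thread can be sketched as follows. This is a simplified, single-threaded sketch under stated assumptions: locking, the state flag and the 0.25 s wait are omitted, the `FlowNode` fields and the record layout are illustrative, and `restore_counts` is a hypothetical callback that returns the restored packet count and data block count of a flow from the shared counting trees. Removal of finished nodes from the flow table is not shown.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_S = 16.0  # a flow idle for longer than this is treated as finished
BLOCK_BYTES = 32       # total length is counted in 32 B data blocks


@dataclass
class FlowNode:
    five_tuple: tuple
    arrival: float
    last_update: float
    min_payload: int
    max_payload: int


def restore_finished_flows(flow_table: dict, now: float, restore_counts) -> list[dict]:
    """Build flow records for every flow whose last update is older than the timeout."""
    records = []
    for node in flow_table.values():
        if now - node.last_update <= IDLE_TIMEOUT_S:
            continue                                  # flow is still active
        packets, blocks = restore_counts(node)        # hypothetical tree lookup
        total_bytes = blocks * BLOCK_BYTES
        records.append({
            "five_tuple": node.five_tuple,
            "arrival": node.arrival,
            "duration": node.last_update - node.arrival,
            "min_payload": node.min_payload,
            "max_payload": node.max_payload,
            "packets": packets,
            "total_bytes": total_bytes,
            "avg_len": total_bytes / packets if packets else 0.0,
        })
    return records
```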
MiniSamp uses the notion of a host active period as an approximate substitute for an application session. A host active period is a set of flows with the same source IP. The flows in the set may belong to different application types, are ordered by arrival time, and each arrives less than t seconds after the flow immediately preceding it. The output thread writes the flow records in the ordered flow record buffer queue to the sampled flow record file in units of host active periods.
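Grouping flow records into host active periods can be sketched as below. The record fields `src_ip` and `arrival` and the function name are assumptions made for illustration; the patent does not specify a record layout.

```python
from itertools import groupby


def host_active_periods(records: list[dict], t: float) -> list[list[dict]]:
    """Split flow records into host active periods.

    Records are ordered by (source IP, arrival time). A new period starts when
    the source IP changes or when a flow arrives t seconds or more after the
    previous flow of the same host.
    """
    ordered = sorted(records, key=lambda r: (r["src_ip"], r["arrival"]))
    periods = []
    for _, flows in groupby(ordered, key=lambda r: r["src_ip"]):
        current = []
        for rec in flows:
            if current and rec["arrival"] - current[-1]["arrival"] >= t:
                periods.append(current)   # gap too large: close the period
                current = []
            current.append(rec)
        periods.append(current)
    return periods
```

Sorting by source IP and then arrival time matches the ordering that the ordered flow record buffer queue is described as maintaining, so in the real pipeline the sort would already be done.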
The process of restoring the statistical characteristic values of a sampled flow from the shared counting tree is introduced next. First, the concept of a subtree counter is introduced. In Fig. 6, the subtree rooted at C[13] contains the nodes C[13], C[10], C[11], C[4], C[5], C[6] and C[7]. All nodes in a subtree together constitute a subtree counter, whose value can be calculated. For example, the value of the subtree counter rooted at C[13] is Value(13) = 2^{2a} C[13] + 2^{a} (C[10] + C[11]) + (C[4] + C[5] + C[6] + C[7]). Whether a node in the shared counting tree can serve as the root node of a subtree depends on the status bits of the node counters, and each enhanced counter corresponds to one subtree counter. Fig. 11 takes the enhanced counter L[4] as an example to describe how the root node of a subtree counter is selected.
In (a), the counter C[4] has not overflowed, so the status bit of C[4] is not set. C[4] is therefore the root node of its subtree, and there is no need to check whether an ancestor of C[4] could be the root node. The subtree contains a single leaf node, C[4], and its height is l = 1.
In (b), the counter C[4] has overflowed and the counter C[10] has not. The status bit of C[4] is therefore set and C[10] is the root node of the subtree; there is no need to check whether an ancestor of C[10] could be the root node. The subtree contains two leaf nodes, C[4] and C[5], and one non-leaf node, C[10], and its height is l = 2.
In (c), both C[4] and C[10] have overflowed. The status bits of C[4] and C[10] are therefore set and C[13] is the root node of the subtree. The subtree contains four leaf nodes, C[4], C[5], C[6] and C[7], and three non-leaf nodes, C[10], C[11] and C[13], and its height is l = 3.
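Root selection and the subtree counter value can be sketched as follows. This is a sketch under assumptions, not the patent's implementation: the parent subscript `p + u // 2` and child subscripts `2 * (v - p)` and `2 * (v - p) + 1` are an assumed layout consistent with the Fig. 6 node numbers, and `a` is taken to be the bit width of one node counter, as the weights in the Value(13) example suggest.

```python
def subtree_root(status: list[int], p: int, leaf: int) -> tuple[int, int]:
    """Follow set status bits upward from a leaf; return (root subscript, height)."""
    u, height = leaf, 1
    while status[u] == 1:
        u = p + u // 2        # assumed parent subscript
        height += 1
    return u, height


def subtree_value(C: list[int], p: int, root: int, height: int, a: int) -> int:
    """Weighted sum of all node counters in the subtree, one level at a time."""
    level, total = [root], 0
    for h in range(height, 0, -1):
        # Nodes h levels above the bottom of the subtree carry weight 2^(a*(h-1)).
        total += (1 << (a * (h - 1))) * sum(C[v] for v in level)
        if h > 1:
            # Descend to the children (assumed layout).
            level = [c for v in level for c in (2 * (v - p), 2 * (v - p) + 1)]
    return total


# Case (c) of Fig. 11 with illustrative counter values, p = 8 leaves and a = 2 bits.
C = [0] * 14
C[13], C[10], C[11] = 1, 2, 1
C[4], C[5], C[6], C[7] = 3, 0, 1, 2
status = [0] * 14
status[4] = status[10] = 1
root, height = subtree_root(status, 8, 4)
value = subtree_value(C, 8, root, height, 2)
```

Here `value` equals 2^4 · 1 + 2^2 · (2 + 1) + (3 + 0 + 1 + 2), following the same pattern as Value(13).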
The steps for restoring the statistical characteristic values of a flow from the shared counting tree, such as the number of data packets, ACK packets, SYN packets, FIN packets, RST packets and the total number of data blocks, are as follows:
Step 1: Extract the flow label f_tuple from the flow node of the flow f whose sampling has ended.
Step 2: H is the main hash function. XOR each element S[i], i ∈ [0, r), of the random number set S in turn with the flow label of flow f, pass each XOR value as the parameter to the main hash function H, and calculate the r hash values
Figure BDA0002503152910000101
Figure BDA0002503152910000102
to locate the r enhanced counters and subtree counters that record the statistical characteristic values of the flow.
Step 3: Calculate the values X_i, i ∈ [0, r), of the r subtree counters.
Step 4: Calculate the heights l_i of the r subtree counters and the number of leaf nodes in each of the r subtree counters,
Figure BDA0002503152910000103
Step 5: Calculate the total count value n of the current shared counting tree.
Step 6: Calculate the statistical characteristic value of flow f using Equation 6. Equation 6 is derived from statistical principles; its result is an estimate of the statistical characteristic value being restored, and this estimate can be shown to be unbiased.
Figure BDA0002503152910000104
When sampling of flow f ends, MiniSamp restores the total data length of the flow from the shared counting tree as follows:
Step 1: Restore the total number of data blocks c of flow f.
Step 2: Calculate the length len of flow f, where len ← 32 × c.
The storage space optimized sampling method based on the shared counting tree does not impair the accuracy with which an SVM classifier or a C4.5 classifier identifies the applications of sampled flows, and it needs only a small amount of storage space during sampling. Specifically, when an SVM classifier is used for application identification on the flows sampled by the invention, the average precision over the identified applications is 0.867; when a C4.5 classifier is used, the average precision over the identified applications is 0.891. With 9 GB of traffic as input, the invention needs at most 700 KB of storage space during sampling, and as little as 200 KB.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A storage space optimized sampling method based on a shared counting tree, characterized by comprising the following steps:
Step 1: According to a pre-configured finite sampling ratio p_e, source IP sampling probability p_h and target flow sampling probability
Figure FDA0003833182600000011
determine the output packet sampling probability p_p and the current flow sampling probability p_f;
Step 1.1: Input the pre-configured finite sampling ratio p_e, source IP sampling probability p_h and target flow sampling probability
Figure FDA0003833182600000012
Step 1.2: Initialize the packet sampling probability p_p and the current flow sampling probability p_f:
p_p = p_e / p_h
p_f = p_e / p_h
Step 1.3: Let
Figure FDA0003833182600000013
Step 1.4: Obtain the current sampling ratio p_x;
Step 1.5: If
Figure FDA0003833182600000014
then output the packet sampling probability p_p and the current flow sampling probability p_f and end the calculation; otherwise, execute step 1.6; wherein α is the set precision;
Step 1.6: If |p_x - p_e| ≤ α, return to step 1.3; otherwise, execute step 1.7;
Step 1.7: If the current sampling ratio p_x is greater than the pre-configured finite sampling ratio p_e, let p_p = 0.5 × (p_p + t), where t = 0.00001, and return to step 1.6; otherwise, let p_p = 0.5 × (p_p + 1) and return to step 1.6;
Step 2: Extract a data packet from the data packet buffer queue, and assign it two random numbers r_f and r_p with value range [0, 1];
Step 3: Obtain the source IP of the data packet and calculate the hash value of the source IP;
Step 4: Multiply the hash value of the source IP by the source IP selection probability p_h to obtain a target value;
Step 5: If the target value is within the preset range, execute step 6; otherwise, discard the data packet;
Step 6: Search for the flow node to which the data packet belongs;
if the flow node corresponding to the data packet is not found and r_f ≤ p_f, sample the data packet, create a flow node for the data packet, create a flow characteristic storage unit, and update the flow characteristic storage unit of the flow;
if the flow node to which the data packet belongs is found and r_p ≤ p_p, sample the data packet and update the flow characteristic storage unit of the flow;
if the flow node corresponding to the data packet is not found and r_f > p_f, or the flow node to which the data packet belongs is found and r_p > p_p, discard the data packet;
Step 7: When sampling of a flow is finished, restore the flow characteristics stored in the flow node and the flow statistical characteristics stored in the shared counting tree into a complete flow record, and add the flow record to the flow record buffer queue; when the buffer is full, write the sampled flow characteristic records to a file.
2. The storage space optimized sampling method based on a shared counting tree according to claim 1, characterized in that the specific method for creating the flow characteristic storage unit for the data packet in step 6 is as follows: parse the data packet, and write the five-tuple information, flow arrival time, flow latest update time, minimum payload length and maximum payload length into the flow node; wherein the flow latest update time and the flow arrival time are both the current time, and the minimum payload length and the maximum payload length are both the application layer payload length of the data packet; if the data packet is a TCP data packet, parse the TCP header and detect whether the flag bits ACK, FIN, SYN and RST are set; if a flag bit of the TCP data packet is set, count once, in the corresponding shared counting tree, the number of data packets of the flow carrying that flag bit; count the flow once in the shared counting tree that stores the flow size; calculate the length len of the data packet and, taking 32 B as one data block, obtain the number c of data blocks occupied by the data packet,
Figure FDA0003833182600000021
and count the flow c times in the shared counting tree that stores the flow length; if the data packet is not a TCP data packet, or no flag bit of the TCP data packet is set, the numbers of ACK packets, SYN packets, FIN packets and RST packets of the flow to which the data packet belongs do not need to be counted in the shared counting tree.
3. The storage space optimized sampling method based on a shared counting tree according to claim 1 or 2, characterized in that the specific steps of updating the flow characteristic storage unit of the flow in step 6 are as follows:
Step 6.1: Update the flow latest update time in the flow node;
Step 6.2: Parse the data packet and calculate the application layer payload length of the data packet;
Step 6.3: If the application layer payload length of the data packet is greater than the original maximum payload length, update the maximum payload length; if the application layer payload length of the data packet is smaller than the original minimum payload length, update the minimum payload length;
Step 6.4: Update the statistical characteristic values in the shared counting tree.
4. The storage space optimized sampling method based on a shared counting tree according to claim 3, characterized in that the specific steps of updating the statistical characteristic values in the shared counting tree in step 6.4 are as follows:
Step 6.4.1: Extract the five-tuple information of the data packet, hash it, locate the flow node corresponding to the data packet in the flow table, and extract the flow label f_tuple of the flow from the flow node;
Step 6.4.2: Generate a random number i, i ∈ [0, r), where r is the number of hash functions h_i;
Step 6.4.3: Select S[i] from the random number set S and XOR it with the flow label f_tuple, and pass the XOR value as the parameter to the main hash function H to generate the hash function h_i; calculate the hash value
Figure FDA0003833182600000022
to locate the enhanced counter used for the current count, where p is the number of enhanced counters;
Step 6.4.4: Update the statistical characteristic value C[u]: C[u] ← C[u] + 1;
if C[u] does not overflow, the step ends and the update of the statistical characteristic value is complete;
if C[u] overflows and the subscript of its parent node is greater than or equal to the subscript of the virtual root node, the enhanced counter L[u] has overflowed and the shared counting tree is no longer used for counting;
if C[u] overflows and the subscript of its parent node is less than the subscript of the virtual root node, update u,
Figure FDA0003833182600000031
and re-execute step 6.4.4.
5. The method according to claim 4, characterized in that the specific steps of restoring the flow characteristics from the shared counting tree in step 7 are as follows:
Step 7.1: Extract the flow label f_tuple from the flow node of the flow whose sampling has ended;
Step 7.2: XOR each element S[i] of the random number set S in turn with the flow label of flow f, pass the XOR values as parameters to the main hash function H, and calculate the r hash values
Figure FDA0003833182600000032
to locate the r enhanced counters and subtree counters that record the statistical characteristic values of the flow;
Step 7.3: Calculate the statistical characteristic value s of the flow:
Figure FDA0003833182600000033
wherein X_i is the value of the subtree counter; l_i is the height of the subtree counter; n is the total count value of the current shared counting tree; k_i is the number of leaf nodes in the subtree counter,
Figure FDA0003833182600000034
CN202010438372.8A 2020-05-22 2020-05-22 Storage space optimized sampling method based on shared counting tree Active CN111581489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010438372.8A CN111581489B (en) 2020-05-22 2020-05-22 Storage space optimized sampling method based on shared counting tree

Publications (2)

Publication Number Publication Date
CN111581489A CN111581489A (en) 2020-08-25
CN111581489B true CN111581489B (en) 2023-03-24

Family

ID=72115966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010438372.8A Active CN111581489B (en) 2020-05-22 2020-05-22 Storage space optimized sampling method based on shared counting tree

Country Status (1)

Country Link
CN (1) CN111581489B (en)

Also Published As

Publication number Publication date
CN111581489A (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant