CN113992541A - Network flow measuring method, system, computer equipment, storage medium and application

Info

Publication number: CN113992541A
Application number: CN202111065220.9A
Authority: CN
Inventors: 靖旭阳; 闫峥; 韩惠
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-09-11
Filing date: 2021-09-11
Publication date: 2022-01-28
Anticipated expiration: 2041-09-11
Also published as: CN113992541B

Abstract

The invention belongs to the technical field of network traffic measurement and discloses a network traffic measurement method, system, computer device, storage medium, and application. The method relies on a memory efficient data compression fusion model whose memory footprint expands dynamically and adaptively according to the distribution of network traffic information. The core idea is to monitor low-cardinality hosts and high-cardinality hosts with counters of different sizes: a small counter for low-cardinality hosts, and a counter whose size expands adaptively for high-cardinality hosts. This keeps memory utilization efficient when monitoring low-cardinality hosts and accuracy high when monitoring high-cardinality hosts. Based on this model, the invention provides a fast and efficient network traffic measurement method that accurately estimates host cardinality, quickly identifies super hosts, and efficiently reconstructs abnormal addresses.

Description

Network flow measuring method, system, computer equipment, storage medium and application

Technical Field

The invention belongs to the technical field of network flow measurement, and particularly relates to a network flow measurement method, a system, computer equipment, a storage medium and application.

Background

As network forwarding speeds continue to increase, there is an urgent need for online data processing models that collect data quickly and use memory efficiently. Summary data structures (sketches) are probabilistic data structures that have been widely applied in network traffic analysis, for example in traffic size estimation, host cardinality estimation, and network anomaly detection. They use data-oriented hashing to compress security-related data compactly and produce probabilistic estimates of the data through specific operations. Because summary data structures compress and fuse data efficiently and estimate accurately, they can address the challenge that massive data volumes pose to super host identification in high-speed network environments.

Although many super host identification methods are based on summary data structures, they mainly suffer from low memory efficiency because host cardinality is unevenly distributed in real network environments. Accurate host cardinality analysis can be achieved by enlarging the counters in the summary data structure, but monitoring low-cardinality hosts with large counters wastes significant memory. In other words, most existing host cardinality analysis methods based on summary data structures fail to balance accuracy against memory usage.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) The distribution of network traffic information in real network traffic is unbalanced and changes dynamically. This makes it difficult to choose a counter size: an inappropriate size leads either to inaccurate estimates for high-cardinality hosts or to unnecessary memory consumption when monitoring low-cardinality hosts, which directly affects the accuracy of super host identification.

(2) Network traffic information exhibits different attributes. For example, unlike flow size, which is additive, host cardinality is not additive: the destination cardinality of a host increases only when the host sends data to a new host. This makes it more difficult to monitor high-cardinality and low-cardinality hosts separately than to monitor large and small network flows separately.

(3) Most data compression fusion models based on summary data structures cannot store any information about the original host address, because they compress and fuse data with hashing techniques.

The difficulty in solving the above problems and defects is as follows. Many methods attempt to improve the memory efficiency of the data compression fusion model, but ensuring the accuracy of network traffic measurement while remaining memory efficient still faces many difficulties. First, because network traffic information is unbalanced, low-cardinality hosts dominate and high-cardinality hosts are only a small fraction. Most host cardinality analysis methods allocate counters of the same size to both types of host, which inevitably wastes a large amount of memory. Second, accurately tracing the source of an anomaly is also difficult. It is therefore necessary to design a data compression fusion model that is memory efficient, measures network traffic accurately, and accurately traces abnormal addresses.

The significance of solving the above problems and defects is as follows: the resource consumption of collecting large-scale network traffic can be further reduced; the precision of the collected data is ensured while the data is collected with minimal memory; the effectiveness of large-scale network traffic measurement is further improved; and network anomalies can be identified accurately, which ensures the security of large-scale networks.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a network flow measuring method, a system, computer equipment, a storage medium and application.

The invention is realized as follows: a network traffic measurement method dynamically and adaptively expands the memory occupied by a compression fusion model according to the distribution of network traffic information; low-cardinality hosts and high-cardinality hosts are monitored separately with counters of different sizes, namely a small counter for low-cardinality hosts and a counter whose size expands adaptively for high-cardinality hosts.

Further, the network traffic measurement method comprises:

in the network flow acquisition stage, a memory high-efficiency data compression fusion model is deployed at a plurality of network nodes, and network flow information is recorded by utilizing the updating operation of the model, so that the accuracy of large-scale network flow acquisition is ensured and the memory consumption of an acquisition device is reduced;

after the collection is finished, all the compressed fusion models which record the network flow information are uniformly sent to a data analysis center, the network flow information collected by all the nodes is fused by utilizing the combination operation of the models, the global flow information of the network is obtained, and the distributed monitoring of the network is ensured;

in the network flow analysis stage, the network flow information is measured by utilizing the estimation operation of the memory high-efficiency data compression fusion model, and a super host identification method is utilized to detect a super propagator and a super changer, so that the network abnormity is rapidly and accurately identified;

and in the abnormal tracing stage, reconstructing the abnormal address by utilizing the reversible calculation operation of the memory high-efficiency data compression fusion model, and taking corresponding security defense measures.

Further, the network traffic measurement method is constructed by the following steps:

step one, collecting network traffic by using the memory efficient data compression fusion model;

step two, combining the data compression fusion models deployed at each collection node to obtain global network traffic information;

step three, measuring network traffic according to the traffic information recorded by the data compression fusion model;

and step four, detecting super hosts by using the super host identification method and tracing the source of an anomaly by using the abnormal information recorded by the data compression fusion model.

Further, the measurement of the network flow information is completed by utilizing a memory high-efficiency data compression fusion model, and the structure is as follows: the memory efficient data compression fusion model comprises two parts, wherein one part is a core part for monitoring the base number of a host, and the other part is an extension part for increasing the monitoring capability of the core part;

the core part of the memory efficient data compression fusion model comprises $H$ two-dimensional bit arrays, the $i$-th of size $w \times p_i$, denoted $ES = (ES_1, ES_2, \ldots, ES_H)$. In each $ES_i$, $ES_i[j][l] \in \{0, 1\}$ is the value recorded in block $(j, l)$, where $j \in \{0, 1, \ldots, w-1\}$ and $l \in \{0, 1, \ldots, p_i-1\}$. Each $ES_i$ is associated with a data-oriented hash function that indexes the column position, $h_i(x) \equiv x \bmod p_i$, where $p_1, p_2, \ldots, p_H$ are pairwise coprime numbers close to an integer $P$. Each column carries an extra information group $(cd, et, flag)$: $cd$ is the congestion level, i.e. the proportion of 1-bits in the column; $et$ is the number of times the column has been expanded; and $flag$ is a list recording the position of an address in $ES_{i+1}$ (the columns of $ES_H$ have no $flag$ information). At the initial stage, all rows of the two-dimensional bit arrays in the memory efficient data compression fusion model use the same hash function, $f(x) \equiv x \bmod w$.
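The layout described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Column` class, the helper names, and the concrete values of `w` and the moduli are hypothetical, and each column is stored as a plain list of bits rather than a packed bit array.

```python
from dataclasses import dataclass, field
from math import gcd
from typing import List


@dataclass
class Column:
    """One column of a bit array ES_i plus its extra information group."""
    bits: List[int]                                 # w entries at the initial stage
    cd: float = 0.0                                 # congestion level: share of 1-bits
    et: int = 0                                     # number of expansions so far
    flag: List[int] = field(default_factory=list)   # positions in ES_{i+1}


def pairwise_coprime(moduli: List[int]) -> bool:
    """True if every pair of column counts has greatest common divisor 1."""
    return all(gcd(a, b) == 1
               for k, a in enumerate(moduli) for b in moduli[k + 1:])


def build_core(w: int, moduli: List[int]) -> List[List[Column]]:
    """Create H = len(moduli) arrays; array i has p_i columns of w zero bits."""
    if not pairwise_coprime(moduli):
        raise ValueError("column counts p_1..p_H must be pairwise coprime")
    return [[Column(bits=[0] * w) for _ in range(p)] for p in moduli]


def h(i: int, x: int, moduli: List[int]) -> int:
    """Column hash h_i(x) = x mod p_i (i is 0-based in this sketch)."""
    return x % moduli[i]


def f(x: int, w: int) -> int:
    """Row hash f(x) = x mod w, used before any expansion."""
    return x % w


MODULI = [251, 253, 255, 256]   # hypothetical pairwise coprime values near P = 256
core = build_core(w=8, moduli=MODULI)
```

With these hypothetical values, `core` holds four arrays with 251, 253, 255, and 256 columns of 8 bits each. Keeping the moduli pairwise coprime is what later allows an address to be recovered from its column positions.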

Further, in the first step, network traffic collection is performed by using the memory efficient data compression fusion model, and the specific process is as follows:

in the initial stage, the memory efficient data compression fusion model has only the core part and all blocks are 0; the extra information group of each column is initialized with $cd = et = 0$ and $flag$ set to an empty list; $\varepsilon$ is a preset threshold, $t$ is a natural number, and $s$/$d$ is the source/destination address; given a flow $(s, d)$, the column in which it is located, $ES_i[\cdot][h_i(s)]$ ($1 \le i \le H$), is updated in one of two ways:

(a) $cd < \varepsilon$ and $et = t$: set $ES_i[f'(d)][h_i(s)] = 1$, where $f'(d) \equiv d \bmod (2^t \times w)$, update $cd$, and add $h_{i+1}(s)$ ($1 \le i \le H-1$) to $flag$;

(b) $cd \ge \varepsilon$ and $et = t$: $ES_i[\cdot][h_i(s)]$ has been filled and needs to be expanded; the $cd$ values of all columns of $s$ in the memory efficient data compression fusion model are checked, and the expansion is performed only if all of them are greater than or equal to $\varepsilon$; in column expansion, the length of $ES_i[\cdot][h_i(s)]$ is first increased from $2^t \times w$ to $2^{t+1} \times w$ and $et$ is incremented by 1; the host cardinality information recorded for $s$ in the old columns ($ES_i[\cdot][h_i(s)]$, $1 \le i \le H$) is then transferred to the expanded columns ($ES_i'[\cdot][h_i(s)]$) according to the following strategy:

for $i = 1, 3, 5, \ldots$: $ES_i'[f'(d)][h_i(s)] = ES_i[f'(d)][h_i(s)]$;

for $i = 2, 4, 6, \ldots$: $ES_i'[f'(d) + 2^t \times w][h_i(s)] = ES_i[f'(d)][h_i(s)]$;

where $f'(d) \equiv d \bmod (2^t \times w)$;

In the second step, network traffic information is fused by using the memory efficient data compression fusion model, and the specific process is as follows: given $T$ memory efficient data compression fusion models of the same size, $(ES^1, \ldots, ES^T)$, where $ES^t$ ($1 \le t \le T$) is the $t$-th data compression fusion model, the combining operation is:

$$ES_i[j][l] = ES_i^1[j][l] \vee ES_i^2[j][l] \vee \cdots \vee ES_i^T[j][l]$$

where $ES_i^t[j][l]$ is the value of block $(j, l)$ in the $i$-th two-dimensional bit array of $ES^t$, and $\vee$ is the bitwise OR operation.

Further, in the third step, network traffic measurement is performed by using the memory efficient data compression fusion model, and the specific process is as follows: to estimate the destination cardinality of a source address $s$, first find the $H$ columns in $ES$ in which it is located, denoted $ES_i(s) = ES_i[\cdot][h_i(s)]$ ($1 \le i \le H$); there are two estimation cases:

(a) $et = 0$: the columns $ES_i(s)$ ($1 \le i \le H$) have not been expanded; they are fused with the bitwise AND operation to obtain the bit array used for cardinality estimation, i.e.

$$ES(s) = ES_1(s) \wedge ES_2(s) \wedge \cdots \wedge ES_H(s)$$

According to the probability estimation algorithm, the destination cardinality of $s$ is obtained as:

$$DC(s) = -w \ln(v/w)$$

where $v$ is the number of zeros in $ES(s)$;

(b) $et = t$ ($t \ge 1$): the columns $ES_i(s)$ ($1 \le i \le H$) have been expanded $t$ times and their length has become $2^t \times w$; if the bitwise AND operation were used directly to generate the bit array for estimation, some host cardinality information would be lost; furthermore, the transfer strategy in the update operation may add extra bits; to eliminate the error that $t$ column expansions introduce into the cardinality estimate, $DC_i(s)$ is first calculated separately for each $ES_i(s)$ ($1 \le i \le H$), using a formula in which $w' = 2^t \times w$ is the length of the expanded column $ES_i(s)$, $v'$ is the number of zeros in $ES_i(s)$, and $\lambda = \ln\frac{(2-\varepsilon)^2}{4(1-\varepsilon)}$ is the estimation error compensation for column expansion; finally, the estimated destination cardinality of $s$ is the minimum of $\{DC_1(s), \ldots, DC_H(s)\}$;

in the fourth step, the super host identification method based on the memory efficient data compression fusion model is used to detect super propagators and super changers, and the specific process is as follows: to detect a super propagator, first find the abnormal columns, i.e. those whose expansion count is not zero ($et \ne 0$), in the memory efficient data compression fusion model, which yields $H$ lists of abnormal column numbers; if the host cardinality obtained by the estimation operation of the memory efficient data compression fusion model is greater than a predefined threshold, the host is regarded as a super propagator;

to detect a super changer, the columns that change between two time intervals need to be identified; the abnormal columns ($et \ne 0$) in the two memory efficient data compression fusion models are examined, and the difference between the cardinalities of these columns in the previous and current time periods is calculated; if the cardinality change of a host is greater than a predefined threshold, the host is regarded as a super changer;

in the fourth step, anomaly tracing is performed based on the memory efficient data compression fusion model, and the specific process is as follows: from $ES_1$ to $ES_H$, the $flag$ of each abnormal column is used to generate a link of the columns that record the abnormal original address; given a super propagator $s$, a position link $\{c_1, \ldots, c_H\}$ is generated by matching the $flag$ entries, where $c_i$ ($1 \le i \le H$) satisfies $h_i(s) \equiv c_i \bmod p_i$; based on the Chinese remainder theorem, $s$ is calculated as

$$s \equiv \sum_{i=1}^{H} c_i Q_i Q_i' \pmod{p}$$

where $p = p_1 p_2 \cdots p_H$ is greater than or equal to the size of the source address space, $Q_i = p / p_i$, $Q_i Q_i' \equiv 1 \bmod p_i$, and $i = 1, 2, \ldots, H$.

It is a further object of the invention to provide a computer arrangement comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the network traffic measurement method.

It is a further object of the invention to provide a computer readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the network traffic measurement method.

Another object of the present invention is to provide a network traffic measurement system implementing the network traffic measurement method, the network traffic measurement system comprising:

the acquisition network flow stage module is used for deploying the memory efficient self-adaptive data compression fusion model at a plurality of network nodes, performing self-adaptive expansion according to the storage requirement according to the change of flow, namely monitoring a low-base host by using a small counter, monitoring a high-base host by using a counter for adjusting the size of the memory according to the change trend of the base, and recording the information of network flow by using the updating operation of the model; after the collection is finished, uniformly sending each compressed fusion model which records the network flow information to a data analysis center, and fusing the network flow information collected by each node by using the combined operation of the models so as to obtain the global flow information of the network;

the network traffic analysis stage module is used for measuring network traffic information by the estimation operation of the memory efficient data compression fusion model and for detecting super propagators and super changers by the super host identification method;

and the anomaly tracing stage module is used for reconstructing and tracing the abnormal address by the reversible calculation operation of the memory efficient data compression fusion model and for taking corresponding security defense measures.

The invention also aims to provide application of the network flow measurement method in a network online data processing model.

By combining all the technical schemes, the invention has the advantages and positive effects that: the invention aims to solve the problems of large memory resource consumption and precision loss when the high-speed network flow is acquired, and provides a new method for acquiring and storing large-scale network flow.

Compared with the prior art, the invention has the following advantages:

(1) Real-time performance: in a high-speed network environment, the memory efficient data compression fusion model can collect network traffic in real time and accurately identify super hosts. Collection takes little time, which meets the requirements of processing massive data.

(2) Reversibility: compared with other existing methods, the memory efficient data compression fusion model can accurately reconstruct super host addresses in a very short time, which ensures timely and accurate anomaly tracing. It also supports deployment for anomaly detection in large-scale network environments.

(3) Universality: the memory efficient data compression fusion model is a general data structure that can be used for various data analysis tasks. The keys in the model can be arbitrary identifiers. For example, it can be used to identify spam distribution in LTE networks and to prevent the propagation of malicious information in wireless sensor networks. A heterogeneous network manager can fuse the data collected in each network and perform security analysis on the whole network.

(4) Economy: the memory efficient data compression fusion model ensures the accuracy of network traffic measurement while achieving high memory efficiency, can complete accurate analysis of massive data where memory resources are limited, and is low-cost, easy to deploy, and of high commercial value.

Drawings

Fig. 1 is a flowchart of a network traffic measurement method according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a network traffic measurement system provided by an embodiment of the present invention.

In Fig. 2: 1. network traffic collection stage module; 2. network traffic analysis stage module; 3. anomaly tracing stage module.

Fig. 3 is a structural diagram of a memory efficient data compression fusion model according to an embodiment of the present invention.

Fig. 4 is a diagram of an expansion method of the memory efficient data compression fusion model according to the embodiment of the present invention.

Fig. 5 is a comparison graph of network traffic host cardinality estimates provided by an embodiment of the invention.

Fig. 6 is a diagram of super propagator identification experiment results provided by an embodiment of the present invention.

Fig. 7 is a diagram of super changer identification experiment results provided by an embodiment of the present invention.

Fig. 8 is a graph of a result of a throughput rate comparison experiment of a data compression fusion model according to an embodiment of the present invention.

Fig. 9 is a graph illustrating the time consumption comparison experiment result of the data compression fusion model according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In view of the problems in the prior art, the present invention provides a method, a system, a computer device, a storage medium and an application for measuring network traffic, and the present invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the method for measuring network traffic provided by the present invention includes the following steps:

S101: in the network traffic collection stage, the memory efficient adaptive data compression fusion model is deployed at a plurality of network nodes and expands adaptively as the traffic changes and storage requirements grow: a small counter monitors low-cardinality hosts, and a counter whose memory size adjusts to the cardinality trend monitors high-cardinality hosts. The update operation of the model records the network traffic information. Collecting data with the memory efficient data compression fusion model ensures collection accuracy while greatly reducing memory consumption, so the method can be deployed in various high-speed network environments;

S102: after collection is finished, all compression fusion models that have recorded network traffic information are sent to a data analysis center, and the network traffic information collected by all nodes is fused by the combining operation of the model to obtain the global traffic information of the network. Collecting data in a distributed manner with the memory efficient data compression fusion model ensures complete, network-wide traffic collection, supports comprehensive network security analysis, and suits the monitoring of large-scale network environments;

S103: in the network traffic analysis stage, network traffic information is measured by the estimation operation of the memory efficient data compression fusion model, and super propagators and super changers are detected by the super host identification method. Performing network security analysis with the memory efficient data compression fusion model allows anomalies to be identified accurately in massive data and ensures security monitoring of high-speed networks;

S104: in the anomaly tracing stage, the abnormal address is reconstructed and traced by the reversible calculation operation of the memory efficient data compression fusion model, and corresponding security defense measures are taken.

Those of ordinary skill in the art may also implement the network traffic measurement method provided by the present invention with other steps; the method shown in Fig. 1 is only one specific embodiment.

As shown in fig. 2, the network traffic measurement system provided by the present invention includes:

the acquisition network flow stage module 1 is used for deploying a memory efficient self-adaptive data compression fusion model at a plurality of network nodes, performing self-adaptive expansion according to the storage requirement according to the change of flow, namely monitoring a low-base host by using a small counter, monitoring a high-base host by using a counter capable of adjusting the size of a memory according to the change trend of the base, and recording the information of network flow by using the updating operation of the model; after the collection is finished, uniformly sending each compressed fusion model which records the network flow information to a data analysis center, and fusing the network flow information collected by each node by using the combined operation of the models to obtain the global flow information of the network;

the network traffic analysis stage module 2 is used for measuring network traffic information by the estimation operation of the memory efficient data compression fusion model and for detecting super propagators and super changers by the super host identification method;

and the anomaly tracing stage module 3 is used for reconstructing and tracing the abnormal address by the reversible calculation operation of the memory efficient data compression fusion model and for taking corresponding security defense measures.

The technical solution of the present invention is further described below with reference to the accompanying drawings.

The invention discloses a network flow measuring method based on a memory efficient data compression fusion model, which comprises the following steps:

step one, the memory efficient data compression fusion model maps large-scale network traffic to a compact and constant-size space by using a data-driven hash function and dynamically adjusts the size of the counters in the model as the network information changes, which ensures efficient use of memory and accurate collection of traffic information, specifically as follows:

in the initial stage, the memory efficient data compression fusion model has only the core part and all block values are 0; the extra information group of each column is also initialized, that is, $cd = et = 0$ and $flag$ is an empty list. Let $\varepsilon$ be a preset threshold, $t$ a natural number, and $s$/$d$ the source/destination address. Given a flow $(s, d)$, the column in which it is located, $ES_i[\cdot][h_i(s)]$ ($1 \le i \le H$), is updated in one of two ways:

(a) $cd < \varepsilon$ and $et = t$: the invention sets $ES_i[f'(d)][h_i(s)] = 1$, where $f'(d) \equiv d \bmod (2^t \times w)$. It then updates $cd$ and adds $h_{i+1}(s)$ ($1 \le i \le H-1$) to $flag$.

(b) $cd \ge \varepsilon$ and $et = t$: this means that $ES_i[\cdot][h_i(s)]$ has been filled and needs to be expanded. However, hash collisions (e.g., different source addresses hashed to the same column) can also make $cd$ large. To handle this, the $cd$ values of all columns of $s$ in the memory efficient data compression fusion model are checked, and the expansion is performed only if all of them are greater than or equal to $\varepsilon$. In column expansion, the length of $ES_i[\cdot][h_i(s)]$ is first increased from $2^t \times w$ to $2^{t+1} \times w$ and $et$ is incremented by 1. The recorded host cardinality information must then be transferred from the old column to the expanded column so that it is fully preserved. The information of $s$ in the old columns ($ES_i[\cdot][h_i(s)]$, $1 \le i \le H$) is therefore transferred to the expanded columns ($ES_i'[\cdot][h_i(s)]$) according to the following strategy:

For $i = 1, 3, 5, \ldots$: $ES_i'[f'(d)][h_i(s)] = ES_i[f'(d)][h_i(s)]$.

For $i = 2, 4, 6, \ldots$: $ES_i'[f'(d) + 2^t \times w][h_i(s)] = ES_i[f'(d)][h_i(s)]$.

Here $f'(d) \equiv d \bmod (2^t \times w)$.
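The update and expansion rules above can be illustrated with the following Python sketch. It is a simplified reading of the text, not the patented implementation: the constants `W`, `EPS`, and `MODULI`, the dictionary-based column layout, and the function names are hypothetical. The text does not say what happens when a column is congested but not all columns of the source are, so the sketch assumes the bit is still set without expanding.

```python
from typing import Dict, List

W = 4                 # initial column length w (hypothetical)
EPS = 0.5             # congestion threshold epsilon (hypothetical)
MODULI = [5, 7, 9]    # pairwise coprime column counts p_1..p_H (hypothetical)


def new_column() -> Dict:
    return {"bits": [0] * W, "cd": 0.0, "et": 0, "flag": []}


def new_sketch() -> List[List[Dict]]:
    return [[new_column() for _ in range(p)] for p in MODULI]


def expand(col: Dict, i: int) -> None:
    """Double a column; i is the 1-based array index used by the transfer rule."""
    old = col["bits"]
    offset = 0 if i % 2 == 1 else len(old)    # even i shifts bits by 2^t * w
    new = [0] * (2 * len(old))
    for pos, bit in enumerate(old):
        new[pos + offset] = bit
    col["bits"] = new
    col["et"] += 1
    col["cd"] = sum(new) / len(new)


def update(sketch: List[List[Dict]], s: int, d: int) -> None:
    """Record the flow (s, d) in the H columns selected by h_i(s) = s mod p_i."""
    cols = [sketch[i][s % p] for i, p in enumerate(MODULI)]
    # Expand only when every column of s is congested, so that a single
    # column filled by hash collisions does not trigger an expansion.
    if all(c["cd"] >= EPS for c in cols):
        for i, c in enumerate(cols, start=1):
            expand(c, i)
    # Assumption: if only some columns are congested, the bit is still set.
    for i, c in enumerate(cols):
        c["bits"][d % len(c["bits"])] = 1     # f'(d) = d mod (2^t * w)
        c["cd"] = sum(c["bits"]) / len(c["bits"])
        if i + 1 < len(MODULI):
            nxt = s % MODULI[i + 1]           # h_{i+1}(s), stored in flag
            if nxt not in c["flag"]:
                c["flag"].append(nxt)


sketch = new_sketch()
for dest in range(3):
    update(sketch, 3, dest)
```

In this run, the third update finds all three columns of source 3 at the threshold, doubles them, and copies the old bits to the lower half in the first and third arrays and to the upper half in the second array before setting the new bit.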

step two, measuring the host cardinality according to the traffic information recorded by the data compression fusion model;

network flow information collected by each network node is fused by utilizing the combined operation of a memory efficient data compression fusion model, and the method specifically comprises the following steps:

given $T$ memory efficient data compression fusion models of the same size, namely $(ES^1, \ldots, ES^T)$, where $ES^t$ ($1 \le t \le T$) is the $t$-th data compression fusion model, the combining operation is:

$$ES_i[j][l] = ES_i^1[j][l] \vee ES_i^2[j][l] \vee \cdots \vee ES_i^T[j][l]$$

where $ES_i^t[j][l]$ is the value of block $(j, l)$ in the $i$-th two-dimensional bit array of $ES^t$, and $\vee$ is the bitwise OR operation.
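A minimal Python sketch of this combining operation follows. It assumes that every model is a list of $H$ bit arrays with identical shapes, as the text requires, and it ignores the extra information groups; the type alias and function name are hypothetical.

```python
from typing import List

BitArray = List[List[int]]          # BitArray[j][l]: row j, column l


def merge(models: List[List[BitArray]]) -> List[BitArray]:
    """Combine T same-sized models by OR-ing corresponding blocks."""
    merged = [[row[:] for row in array] for array in models[0]]  # copy model 1
    for model in models[1:]:
        for i, array in enumerate(model):
            for j, row in enumerate(array):
                for l, bit in enumerate(row):
                    merged[i][j][l] |= bit
    return merged


node_a = [[[1, 0], [0, 0]]]         # one 2 x 2 bit array from a first node
node_b = [[[0, 0], [0, 1]]]         # the same array from a second node
combined = merge([node_a, node_b])
```

Because OR is commutative and associative, the analysis center can merge the models in any order and obtain the same global model.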

And step three, detecting the super host by using a super host identification method and tracing the source of the abnormity by using the abnormal information recorded by the data compression fusion model.

The method for measuring network traffic information by using the estimation operation of the memory efficient data compression fusion model is specifically as follows: to estimate the destination cardinality of a source address $s$, first find the $H$ columns in $ES$ in which it is located, denoted $ES_i(s) = ES_i[\cdot][h_i(s)]$ ($1 \le i \le H$). There are two estimation cases:

(a) $et = 0$: this means that the columns $ES_i(s)$ ($1 \le i \le H$) have not been expanded. To eliminate the overestimation caused by hash collisions, the invention fuses $ES_i(s)$ ($1 \le i \le H$) with the bitwise AND operation to obtain the bit array used for cardinality estimation, i.e.

$$ES(s) = ES_1(s) \wedge ES_2(s) \wedge \cdots \wedge ES_H(s)$$

According to the probability estimation algorithm, the destination cardinality of $s$ is obtained as:

$$DC(s) = -w \ln(v/w)$$

where $v$ is the number of zeros in $ES(s)$.

(b) $et = t$ ($t \ge 1$): this means that the columns $ES_i(s)$ ($1 \le i \le H$) have been expanded $t$ times and their length has become $2^t \times w$. If the bitwise AND operation were used directly to generate the bit array for estimation, some host cardinality information would be lost. Moreover, the transfer strategy in the update operation may add extra bits and cause overestimation. To eliminate the error that $t$ column expansions introduce into the cardinality estimate, the invention first calculates $DC_i(s)$ separately for each $ES_i(s)$ ($1 \le i \le H$), using a formula in which $w' = 2^t \times w$ is the length of the expanded column $ES_i(s)$, $v'$ is the number of zeros in $ES_i(s)$, and $\lambda = \ln\frac{(2-\varepsilon)^2}{4(1-\varepsilon)}$ is the estimation error compensation for column expansion. Finally, the estimated destination cardinality of $s$ is the minimum of $\{DC_1(s), \ldots, DC_H(s)\}$.
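Case (a) can be sketched in Python as follows. The function names are hypothetical, each column is a plain list of bits, and the sketch covers only unexpanded columns of equal length; it raises an error when the fused column has no zero left, a situation the text does not discuss.

```python
import math
from typing import List


def fuse_and(columns: List[List[int]]) -> List[int]:
    """Bitwise AND of the H columns ES_1(s), ..., ES_H(s)."""
    return [int(all(bits)) for bits in zip(*columns)]


def estimate_cardinality(columns: List[List[int]]) -> float:
    """Estimate DC(s) = -w * ln(v / w) from the fused column."""
    fused = fuse_and(columns)
    w = len(fused)
    v = fused.count(0)              # number of zeros in ES(s)
    if v == 0:
        raise ValueError("column is saturated; the estimate is undefined")
    return -w * math.log(v / w)


columns = [
    [1, 1, 0, 1, 0, 0, 0, 0],       # ES_1(s), with one bit set by a collision
    [1, 1, 1, 0, 0, 0, 0, 0],       # ES_2(s), with one bit set by a collision
    [1, 1, 0, 0, 1, 0, 0, 0],       # ES_3(s), with one bit set by a collision
]
estimate = estimate_cardinality(columns)
```

Here only the two bits shared by all three columns survive the AND, so $v = 6$, $w = 8$, and the estimate is $-8 \ln(0.75) \approx 2.3$, close to the two destinations actually recorded.

For case (b), one way to see why a compensation term of the form $\lambda$ arises is the short derivation below. It assumes, as a simplification that the text does not state explicitly, that a column of length $w_0$ is doubled exactly when its share of 1-bits equals $\varepsilon$ and that the same estimate is applied before and after the doubling:

$$
\begin{aligned}
n_{\text{before}} &= -w_0 \ln(1-\varepsilon), \\
n_{\text{after}} &= -2w_0 \ln\frac{(2-\varepsilon)\,w_0}{2w_0} = -2w_0 \ln\frac{2-\varepsilon}{2}, \\
n_{\text{before}} - n_{\text{after}} &= w_0\left[2\ln\frac{2-\varepsilon}{2} - \ln(1-\varepsilon)\right] = w_0 \ln\frac{(2-\varepsilon)^2}{4(1-\varepsilon)} = w_0\,\lambda .
\end{aligned}
$$

Since $(2-\varepsilon)^2 - 4(1-\varepsilon) = \varepsilon^2 \ge 0$, $\lambda$ is non-negative, so under these assumptions the estimate taken on the doubled column differs from the estimate taken just before doubling by $w_0 \lambda$.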

The super host identification method is used to detect super propagators and super changers, specifically as follows: to detect super propagators, the invention first finds the abnormal columns, i.e. those whose expansion count is not zero ($et \ne 0$), in the memory efficient data compression fusion model, which yields $H$ lists of abnormal column numbers. The original address of a host can then be generated by the reversible operation. If the host cardinality obtained by the estimation operation of the memory efficient data compression fusion model is greater than a predefined threshold, the host is regarded as a super propagator.

To detect super changers, the invention needs to identify the columns that change between two time intervals. The invention therefore examines the abnormal columns ($et \ne 0$) in the two memory efficient data compression fusion models and calculates the difference between the cardinalities of these columns in the previous and current time periods. If the cardinality change of a host is greater than a predefined threshold, the host is regarded as a super changer.
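The two threshold tests can be sketched in Python as follows. The sketch assumes that the estimated cardinalities of the hosts recovered from the abnormal columns are already available as dictionaries keyed by host address; the function names, the use of the absolute difference, and the treatment of a host missing from one period as cardinality 0 are assumptions made for illustration.

```python
from typing import Dict, List


def super_propagators(card: Dict[int, float], threshold: float) -> List[int]:
    """Hosts whose estimated destination cardinality exceeds the threshold."""
    return sorted(host for host, c in card.items() if c > threshold)


def super_changers(prev: Dict[int, float], curr: Dict[int, float],
                   threshold: float) -> List[int]:
    """Hosts whose cardinality changes by more than the threshold."""
    hosts = set(prev) | set(curr)
    return sorted(host for host in hosts
                  if abs(curr.get(host, 0.0) - prev.get(host, 0.0)) > threshold)


previous = {10: 500.0, 11: 40.0}                 # hypothetical estimates, period 1
current = {10: 520.0, 11: 900.0, 12: 30.0}       # hypothetical estimates, period 2
propagators = super_propagators(current, threshold=400)
changers = super_changers(previous, current, threshold=100)
```

With these hypothetical values, hosts 10 and 11 are reported as super propagators in the current period, and only host 11 is reported as a super changer.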

The method for tracing abnormal addresses using the reversible calculation operation of the memory-efficient data compression fusion model is as follows. From ES₁ to ES_H, the invention uses the flags of the abnormal columns to generate a link of the columns that record the original abnormal address. For example, given a super propagator s, the position link {c₁, ..., c_H} is generated by matching the flags, where c_i (1 ≤ i ≤ H) satisfies h_i(s) ≡ c_i mod p_i. Based on the Chinese remainder theorem, s can be calculated as

s ≡ Σ_{i=1}^{H} c_i Q_i Q_i′ (mod p)

where p = p₁p₂…p_H is greater than or equal to the size of the source address space, Q_i = p/p_i, Q_iQ_i′ ≡ 1 mod p_i, and i = 1, 2, ..., H.
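The reconstruction step can be sketched with the standard Chinese remainder theorem, s ≡ Σ c_i Q_i Q_i′ (mod p). The sketch below assumes the position link {c₁, ..., c_H} has already been recovered from the flags; the function name, the four moduli and the sample IPv4 address are hypothetical and chosen only so that the moduli are pairwise coprime and their product exceeds the 32-bit address space.

```python
from math import gcd, prod
from typing import Sequence


def reconstruct_address(residues: Sequence[int], moduli: Sequence[int]) -> int:
    """Recover s from c_i = s mod p_i using the Chinese remainder theorem."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError("moduli must be pairwise coprime")
    p = prod(moduli)
    s = 0
    for c_i, p_i in zip(residues, moduli):
        q_i = p // p_i              # Q_i = p / p_i
        q_inv = pow(q_i, -1, p_i)   # Q_i' with Q_i * Q_i' ≡ 1 (mod p_i)
        s += c_i * q_i * q_inv
    return s % p


# Hypothetical moduli: pairwise coprime, product 4345232640 > 2**32.
example_moduli = [255, 256, 257, 259]
example_source = 3232235786  # 192.168.1.10 as a 32-bit integer
example_link = [example_source % m for m in example_moduli]
recovered = reconstruct_address(example_link, example_moduli)
print(recovered == example_source)
```

Because the product of the moduli is at least the size of the address space, the recovered value is unique. The three-argument `pow` with a negative exponent requires Python 3.8 or later.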

The technical effects of the present invention will be described in detail with reference to experiments.

The present invention performed a series of experiments to evaluate the performance of the proposed solution. The experiments used 3 traffic sets selected from the CAIDA equinix-nyc and equinix-chicago traces, each 10 minutes long, denoted CAIDA1, CAIDA2 and CAIDA3. Each traffic set contains 10 time periods of 1 minute. CICFlowMeter was used to convert all traffic sets from packet level to flow level.

The data compression fusion model is compared with existing methods that perform well at super host identification, including DCDS, VBF, RASF and SpreadSketch (SS). The number of two-dimensional bit arrays is set to 4 in ExtendedSketch and DCDS and to 5 in VBF. By adjusting the numbers of columns and rows, all comparisons are performed with the same memory size. For ExtendedSketch, 90% of the available memory is allocated to the core part and 10% to the extension part.

Fig. 5 shows the average relative error (ARE) of all compared methods under different memory sizes. The experimental results show that, with very little memory, the memory-efficient data compression fusion model estimates super host cardinality more accurately than the other methods. On the CAIDA1 dataset, with 0.25 MB of memory allocated, the ARE of the data compression fusion model is approximately 1.4, 4.78, 4.82 and 7.22 times lower than that of SS, RASF, DCDS and VBF, respectively (results on the other traffic sets are similar). For RASF, DCDS and VBF, a small memory limits the length of the columns that store source addresses, which caps the maximum estimation capacity of the probabilistic estimation algorithm; as a result, these methods estimate low-cardinality hosts fairly accurately but high-cardinality hosts inaccurately. That is, these three methods need sufficiently long columns to estimate both high- and low-cardinality hosts accurately. For SS, the limited column length causes a large number of hash collisions, which loses some host cardinality information. The ARE values of RASF, DCDS, VBF and SS decrease as memory size increases. The data compression fusion model places no strict requirement on column length and can dynamically increase it according to the host cardinality distribution in the network traffic. The comparison shows that the data compression fusion model achieves high memory efficiency while improving the ability to monitor high-cardinality hosts.
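For reference, the error metric can be sketched as below. The exact definition used in the experiments is not given here; this sketch assumes the common definition of ARE as the mean of |estimated - true| / true over the evaluated hosts. The function name and sample values are hypothetical.

```python
from typing import Dict


def average_relative_error(true_card: Dict[int, float],
                           est_card: Dict[int, float]) -> float:
    """Mean of |estimate - truth| / truth over hosts with nonzero truth.

    Hosts missing from est_card are treated as estimated at 0 (assumption).
    """
    errors = [
        abs(est_card.get(host, 0.0) - truth) / truth
        for host, truth in true_card.items()
        if truth > 0
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical ground truth and estimates for two hosts.
sample_truth = {1: 100.0, 2: 200.0}
sample_estimate = {1: 110.0, 2: 180.0}
are = average_relative_error(sample_truth, sample_estimate)
print(are)
```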

Figs. 6 and 7 show the super host identification performance of all methods under different memory sizes across all traffic sets. The following points can be drawn from the figures. First, when memory is small (from 0.25 MB to 1.25 MB), the memory-efficient data compression fusion model outperforms the other methods in identifying super propagators and super changers, consistently achieving higher precision, recall and F1 scores than all of them. Thanks to the column expansion strategy, the data compression fusion model can dynamically increase the size of each counter according to the cardinality of each host, so for a given memory size it can be configured with a large number of columns (P) and a small number of rows (w). This reduces the probability that different source addresses are hashed to the same column. In addition, the flag-based reversible calculation method further reduces the false alarms generated during reversible calculation. These advantages make the data compression fusion model better than the other methods at super host identification. Second, with little memory, RASF, DCDS, VBF and SS have low precision in detecting super propagators and super changers, i.e. many normal hosts are falsely reported as abnormal. When the number of columns is insufficient, the probability of hash collision is high, and RASF, DCDS, VBF and SS then return a large number of false alarms, which reduces identification precision. False positives are also generated by erroneous column combinations in DCDS and by undesired IP segment combinations in VBF. For RASF without enough memory, the degeneration algorithm must be executed frequently, which reduces the accuracy of host cardinality estimation and in turn the ability to identify super hosts. Finally, although the data compression fusion model is only comparable to SS and VBF in recall of super host identification, its precision is much higher than theirs, especially when memory is limited.
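The identification metrics discussed above can be sketched as follows, assuming the standard definitions of precision, recall and F1 over the set of hosts a method reports and the set of true super hosts. The function name and the sample sets are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(reported: Set[int],
                        actual: Set[int]) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 for a set of reported hosts."""
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 hosts reported, 3 true super hosts, 2 in common.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
print(precision, recall, f1)
```

A method that reports many normal hosts as abnormal keeps its recall but loses precision, which is the failure mode described for the compared methods under small memory.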

Fig. 8 shows the throughput of all compared methods with 1 MB of memory; VBF has the highest throughput. VBF records data using two hash calculations and one bit extraction calculation, whereas the other methods require multiple hash calculations to find the column and row of a source address. The throughput of RASF is the lowest because all blocks must be traversed to calculate the percentage of bits when updating data. Nevertheless, Figs. 7 and 8 show that the data compression fusion model has higher throughput than SS and DCDS. Although its throughput is lower than that of VBF, it is more memory-efficient and more accurate in identifying super hosts than VBF and the other methods, as shown in Figs. 5 through 7.

Fig. 9 shows the time consumed by super host detection and by reconstruction of super host addresses. The memory size of all methods is fixed at 1 MB and results are averaged over all time intervals. Since RASF is irreversible and SS stores the original IP address, only the reversible calculation times of DCDS, VBF and the data compression fusion model are reported. According to the experimental results, the detection time and reversible calculation time of the data compression fusion model are far shorter than those of the other methods. In the detection stage, the data compression fusion model only checks the extra information group associated with each column, finds the abnormal columns with larger expansion counts, and passes them on for further judgment, whereas the other methods must traverse every column or every block to perform cardinality estimation. For reversible calculation, the data compression fusion model reconstructs super propagator (super changer) addresses 120.4 and 26833.3 (95 and 27573.8) times faster than VBF and DCDS, respectively. This is because DCDS produces a large number of redundant column combinations, which greatly slows reversible calculation, and VBF needs to merge multiple strings to obtain an original address. In the memory-efficient data compression fusion model, the flag-based reversible calculation method accurately guides the combination of columns and reduces the amount of computation. Overall, the data compression fusion model takes less time to detect super hosts and reconstruct their addresses.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a disk, CD-ROM or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, by programmable hardware devices such as field-programmable gate arrays and programmable logic devices, by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description merely illustrates the present invention and is not to be construed as limiting its scope, which is intended to cover all modifications, equivalents and improvements within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A network flow measurement method, characterized in that the method dynamically and adaptively expands the memory occupied by a compression fusion model according to the distribution of network traffic information; low-cardinality hosts and high-cardinality hosts are monitored separately with counters of different sizes, a small counter being used to monitor low-cardinality hosts and a counter whose size can be adaptively expanded being used to monitor high-cardinality hosts.

2. The method of network traffic measurement according to claim 1, comprising:

3. The method of network flow measurement according to claim 1, wherein constructing a network flow measurement method comprises the steps of:

4. The method according to claim 3, wherein the network traffic information is measured using a memory-efficient data compression fusion model structured as follows: the memory-efficient data compression fusion model comprises two parts, a core part for monitoring host cardinality and an extension part for increasing the monitoring capability of the core part;

the core part of the memory-efficient data compression fusion model comprises H two-dimensional bit arrays of size w × p_i (i = 1, 2, ..., H), denoted ES = (ES₁, ..., ES_H); in each ES_i, ES_i[j][l] ∈ {0, 1} represents the value recorded in block (j, l), where j ∈ {0, 1, ..., w-1} and l ∈ {0, 1, ..., p_i-1}; each ES_i is associated with a data-oriented hash function for indexing the column position, h_i(x) ≡ x mod p_i, where p₁, p₂, ..., p_H are pairwise coprime numbers around the integer P; each column contains an extra information group (cd, et, flag), where cd is the congestion level and represents the proportion of "1" bits in the column, et is the number of expansions, and flag is a list recording the position of an address in ES_{i+1}, the columns of ES_H having no flag information; all rows of the two-dimensional bit arrays in the memory-efficient data compression fusion model use the same hash function, f(x) ≡ x mod w, at the initial stage.

5. The method according to claim 3, wherein in the first step, the network traffic is collected by using a memory efficient data compression fusion model, and the specific process is as follows:

(a) cd < ε and et = t: set ES_i[f′(d)][h_i(s)] = 1, where f′(d) ≡ d mod (2^t × w), update cd, and add h_{i+1}(s) (1 ≤ i ≤ H-1) to the flag;

(b) cd ≥ ε and et = t: ES_i[·][h_i(s)] has been filled and needs to be expanded; check the cd values of all columns of s in the memory-efficient data compression fusion model, and expand only if they are all greater than or equal to ε; in column expansion, first the length of ES_i[·][h_i(s)] is increased from 2^t × w to 2^(t+1) × w and et += 1; then the recorded host cardinality information is transferred from the old column to the extended column, the information of s in the old column (ES_i[·][h_i(s)], 1 ≤ i ≤ H) being transferred to the extended column (ES′_i[·][h_i(s)]) according to the following strategy:

for i = 1, 3, 5, ..., set ES′_i[f′(d)][h_i(s)] = ES_i[f′(d)][h_i(s)];

for i = 2, 4, 6, ..., set ES′_i[f′(d) + 2^t × w][h_i(s)] = ES_i[f′(d)][h_i(s)];

where f′(d) ≡ d mod (2^t × w);

In the second step, network traffic information is fused using the memory-efficient data compression fusion model, and the specific process is as follows: given T memory-efficient data compression fusion models of the same size, (ES¹, ..., ES^T), where ES^t (1 ≤ t ≤ T) is the t-th data compression fusion model, the merge operation is as follows:

ES_i[j][l] = ES_i¹[j][l] ∨ ES_i²[j][l] ∨ ... ∨ ES_i^T[j][l]

where ES_i^t[j][l] is the value of block (j, l) in the i-th two-dimensional bit array of ES^t, and ∨ is the bitwise OR operation.

6. The method for measuring network flow according to claim 3, wherein in step three, the network flow is measured using the memory-efficient data compression fusion model, and the specific process is as follows: to estimate the destination cardinality of a source address s, first find its H columns in ES, denoted ES_i(s) = ES_i[·][h_i(s)] (1 ≤ i ≤ H); the estimation covers the following two cases:

DC(s) = -w ln(v/w);

where v is the number of zeros in ES(s);

(b) et = t (t ≥ 1): ES_i(s) (1 ≤ i ≤ H) have been expanded t times and their length has become 2^t × w; if the bit array for cardinality estimation were generated directly with the bitwise AND operation, host cardinality information would be lost; furthermore, the transfer strategy in the update operation may add extra bits; to eliminate the error that t column expansions introduce into the cardinality estimate, DC_i(s) is first estimated from each ES_i(s) (1 ≤ i ≤ H) using the following formula:

in the third step, abnormal addresses are traced using the memory-efficient data compression fusion model, and the specific process is as follows: from ES₁ to ES_H, a link of the columns recording the original abnormal address is generated using the flags of the abnormal columns; given a super propagator s, the position link {c₁, ..., c_H} is generated by matching the flags, where c_i (1 ≤ i ≤ H) satisfies h_i(s) ≡ c_i mod p_i; based on the Chinese remainder theorem, s is calculated as

s ≡ Σ_{i=1}^{H} c_i Q_i Q′_i (mod p)

where p = p₁p₂...p_H is greater than or equal to the size of the source address space, Q_i = p/p_i, Q_iQ′_i ≡ 1 mod p_i, and i = 1, 2, ..., H.

7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the network traffic measurement method according to any of claims 1 to 6.

8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the network traffic measurement method of any of claims 1 to 6.

9. A network flow measurement system for implementing the network flow measurement method according to any one of claims 1 to 6, the network flow measurement system comprising:

10. Use of a network traffic measurement method according to any of claims 1 to 6 in a network on-line data processing model.