CN115225528B

CN115225528B - Network flow data distributed measurement scheduling method, system and medium

Info

Publication number: CN115225528B
Application number: CN202210656146.6A
Authority: CN
Inventors: 刁祖龙; 乔铭宇; 张广兴; 谢鲲; 李振宇
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2022-06-10
Filing date: 2022-06-10
Publication date: 2024-04-09
Anticipated expiration: 2042-06-10
Also published as: CN115225528A

Abstract

The invention provides a distributed measurement scheduling method and system for network traffic data based on tensor filling. The method comprises: dividing historical traffic data into a T-1 period and a T period, and calculating the JS divergence of every OD pair in the distributed network between the T-1 period and the T period; decomposing the network data, in three-dimensional tensor form, by CP decomposition to obtain three factor matrices, wherein each row of a factor matrix is a factor, yielding three factor sets V1, V2 and V3 corresponding to the three factor matrices; obtaining the JS divergence of each factor in V1 and V2 from the JS divergences of the OD pairs; combining the JS divergence and the variance of each factor in V1 and V2 to obtain the importance of each factor; selecting the factors with the highest importance together with a factor in V3 to construct a linear equation, each such equation sampling one sample; collecting new data according to the acquisition scheme formed by all sampled samples; and recovering the full data from the factor matrices jointly determined by the historical sampling data and the new sampling data, as the traffic measurement result of the distributed network.

Description

Network flow data distributed measurement scheduling method, system and medium

Technical Field

The invention relates to the technical field of distributed network measurement, in particular to a method, a system and a medium for distributed measurement scheduling of network traffic data based on tensor filling.

Background

In distributed network measurement tasks (such as network attack detection and dynamic network management), when the network contains a very large number of nodes, measuring end-to-end performance for every combination of source node and target node (every OD pair) incurs a large measurement cost and communication overhead, and the additional traffic injected during measurement also affects network performance. Because network measurement data is low-rank, in practice only a small part of the data is measured, and the data of the whole network is then fully recovered by a filling method.

In recent years, accurately recovering whole-network data from the data of a subset of measurement points has become a research hotspot. Some studies abstract network traffic data into traffic matrices and propose matrix-based recovery methods. For example, one method decomposes the network matrix into a low-rank matrix, a sparse anomaly matrix, an error matrix and a small noise matrix, to address the degradation of matrix filling algorithms when the data contains noise or anomalies. Another uses matrix completion to infer the end-to-end network performance between all node pairs from measurements of only a small fraction of the end-to-end paths, and provides an adaptive sampling scheme based on sequence and information, together with a new sampling stop condition, to cope with rank variation in real systems. A further network delay estimation method based on matrix completion provides a novel low-rank matrix filling algorithm that approximates the rank by iteratively minimizing a weighted Schatten-p norm, thereby predicting the missing entries of the extracted network feature matrix.

Other researchers abstract network traffic data into tensors, which carry information in more dimensions than a matrix. Accordingly, many tensor-based methods, such as CANDECOMP/PARAFAC (CP) decomposition and Tucker decomposition, are widely used for network data recovery. CP decomposition expresses a tensor as a sum of rank-one tensors. Informally, just as the SVD of a matrix can be viewed as a weighted sum (the weights being the singular values) of matrices formed by the outer products of the left and right singular vectors, the CP decomposition of a tensor expresses it as a sum of many factor tensors. Tucker decomposition is a higher-order form of principal component analysis (PCA): it represents the original tensor by one core tensor and a factor matrix for each dimension (three factor matrices for a third-order tensor), and each factor matrix can be regarded as a linear transformation along its dimension. There are many methods for recovering data using tensors. For example, the prior art proposes a tensor recovery method based on the Alternating Direction Method of Multipliers (ADMM), which can automatically separate the lowest-rank part and the sparse part of n-order tensor data. That method is more stable and accurate in most cases and converges quickly, but cannot select its parameters automatically. The Rank Sparse Tensor Decomposition (RSTD) algorithm can automatically explore the low-dimensional structure of tensor data, find the best dimension and basis for each mode, and separate irregular modes, but it is relatively complex to implement. A reshape-and-align scheme forms a regular tensor from dynamic measurement data, introduces user-domain and time-domain factor matrices to fully exploit the characteristics of both domains, and converts the matrix completion problem into a CP-decomposition-based tensor completion problem so as to recover missing data more accurately. A sequential tensor completion (STC) algorithm can effectively use the tensor decomposition result of previous traffic data to derive the tensor decomposition of the current data.

Distributed measurement scheduling concerns how to select measurement points in a distributed network and how to recover, from the selected points, the data that was not sampled. Existing matrix or tensor filling methods generally consider how to accurately recover unsampled data given fixed distributed measurement points and a small amount of measurement data. However, fixed measurement locations cannot meet practical application requirements. In recent years, a few works have begun to study distributed measurement point scheduling, in which different measurement positions are flexibly selected at different times so that data is recovered with high precision while sampling is reduced. For example, the automatic selection of measurement nodes has been mapped to a set cover problem and solved with an ant colony algorithm, realizing automatic selection of the measurement factors in distributed network measurement. Clearly, different measurement points contribute differently to network data recovery, and to reduce the number of measurement points, those with high measurement benefit should be selected wherever possible. For an OD pair, if its measured data varies significantly over time, or if its value distribution changes significantly between adjacent periods, the OD pair is considered to have a greater effect on the change of the factor matrices and should be measured. Therefore, distributed measurement scheduling needs to take into account the difference in the value distribution of the measurement points between adjacent periods.

Disclosure of Invention

In summary, the invention introduces Jensen-Shannon divergence (JSD) into network measurement for the first time, using it to calculate how the distribution at each O-D measurement point, from any source point O to a target point D, changes during network measurement. The importance of each O-D measurement point is computed as a weighted combination of the JSD, the value variance and other factors, and a measurement scheduling method based on importance sampling is provided.

In order to realize low-overhead network measurement, the invention provides JDSch, a distributed measurement point scheduling method based on tensor filling, which flexibly schedules the distributed measurement points so that the full data can be recovered from a small number of measurements. Specifically, the invention provides a distributed measurement scheduling method for network traffic data based on tensor filling, which comprises the following steps:

Step 1: deducing complete historical traffic data from the historical sampling data of a distributed network, dividing the historical traffic data into a T-1 period and a T period based on a time window, and calculating the data distributions of the T-1 period and the T period, as well as the JS divergence of every OD pair in the distributed network between the T-1 period and the T period;

Step 2: decomposing the network data, in three-dimensional tensor form, by CP decomposition to obtain three factor matrices, wherein each row of a factor matrix is a factor, yielding three factor sets V1, V2 and V3 corresponding to the three factor matrices; obtaining the JS divergence of each factor in factor sets V1 and V2 from the JS divergences of the OD pairs; combining the JS divergence and the variance of each factor in V1 and V2 to obtain the importance of each factor; selecting the factors with the highest importance together with an unknown factor in V3 to construct a linear equation, sampling one sample through the linear equation, and collecting all sampled samples to form an acquisition scheme;

Step 3: acquiring new data according to the acquisition scheme, and recovering the full data from the factor matrices jointly determined by the historical sampling data and the new sampling data, as the traffic measurement result of the distributed network.

The tensor-filling-based distributed measurement scheduling method for network traffic data further comprises step 4:

establishing linear systems from the historical sampling data and the new sampling data to train and update the data in the three factor matrices respectively, wherein the training update comprises:

fixing the known factors of factor sets V2 and V3, and establishing a linear system to update the factors in factor set V1 whose JS divergence exceeds a threshold;

fixing the known factors of factor sets V1 and V2, and establishing a linear system to update the factors in factor set V3;

fixing the known factors of factor sets V1 and V3, and establishing a linear system to update the factors in factor set V2 whose JS divergence exceeds a threshold.

The method for dispatching the distributed measurement of the network traffic data based on tensor filling comprises the following steps:

combining the value-range length and the mean of each group of data to calculate, for the T-1 period and the T period respectively, the number of intervals into which the data is divided:

where num_bins is the number of intervals to divide into, distance is the length of the value range of the current data, average is the mean of the current data, and round() denotes rounding down the value in brackets;

calculating the data distribution of the measured values according to the number of intervals of the T-1 period and of the T period, and calculating the JS divergence $\mathrm{JSD}(P_\tau \,\|\, P_{\tau-1})$:

$$\mathrm{JSD}(P_\tau \,\|\, P_{\tau-1}) = \tfrac{1}{2} D(P_\tau \,\|\, M) + \tfrac{1}{2} D(P_{\tau-1} \,\|\, M), \qquad M = \tfrac{1}{2}\left(P_\tau + P_{\tau-1}\right)$$

where $D(P \,\|\, Q) = \sum_{x \in \chi} P(x) \log \frac{P(x)}{Q(x)}$ denotes the KL divergence, and $P_\tau$ and $P_{\tau-1}$ denote the data distributions of the T-th period and the (T-1)-th period, respectively.

Step 21: for each factor in V1, summing the JS divergences of all OD pairs whose source point is the node represented by that factor, as the JS divergence of the factor; for each factor in V2, summing the JS divergences of all OD pairs whose target point is the node represented by that factor, as the JS divergence of the factor;

Step 22: for each factor in factor set V1, calculating the sample variance of all data in the T-1 period whose source node is the node represented by the factor; for each factor in factor set V2, calculating the sample variance of all data in the T-1 period whose target node is the node represented by the factor; for each factor in factor set V3, calculating the sample variance of the data represented by the factor over all times in the T-1 period;

Step 23: combining the JS divergence JSD and the sample variance of each factor in factor sets V1 and V2 to obtain the importance (Importance) of each factor according to the following formula:

$$\mathrm{Importance} = \ln(\mathrm{JSD} + 1) \cdot \ln(\mathrm{variance} + 1)$$

Step 24: selecting the most important factors from factor sets V1 and V2 and constructing, with the current unknown factor in factor set V3, a linear equation that corresponds to one sampling point; judging whether the linear equation increases the rank of the coefficient matrix of the system of linear equations formed so far and, if so, adding the sampling point corresponding to the linear equation to the acquisition scheme;

Step 25: repeating step 24 until each unknown factor in factor set V3 has R corresponding sampling points, and obtaining the R unknowns of each unknown factor in V3 by solving the system of linear equations established for it, thereby obtaining the traffic measurement result of the distributed network.

The invention also provides a distributed measurement scheduling system for network traffic data based on tensor filling, which comprises:

an initial module, configured to deduce complete historical traffic data from the historical sampling data of a distributed network, divide the historical traffic data into a T-1 period and a T period based on a time window, and calculate the data distributions of the T-1 period and the T period, as well as the JS divergence of every OD pair in the distributed network between the T-1 period and the T period;

a computing module, configured to decompose the network data, in three-dimensional tensor form, by CP decomposition to obtain three factor matrices, wherein each row of a factor matrix is a factor, yielding three factor sets V1, V2 and V3 corresponding to the three factor matrices; obtain the JS divergence of each factor in factor sets V1 and V2 from the JS divergences of the OD pairs; combine the JS divergence and the variance of each factor in V1 and V2 to obtain the importance of each factor; select the factors with the highest importance together with an unknown factor in V3 to construct a linear equation, sample one sample through the linear equation, and collect all sampled samples to form an acquisition scheme;

an acquisition module, configured to acquire new data according to the acquisition scheme, and recover the full data from the factor matrices jointly determined by the historical sampling data and the new sampling data, as the traffic measurement result of the distributed network.

The tensor-filling-based distributed measurement scheduling system for network traffic data further comprises an updating module, configured to establish linear systems from the historical sampling data and the new sampling data to train and update the data in the three factor matrices respectively, wherein the training update comprises the following steps:

The network traffic data distributed measurement scheduling system based on tensor filling, wherein the initial module is used for:

where $D(P \,\|\, Q) = \sum_{x \in \chi} P(x) \log \frac{P(x)}{Q(x)}$ denotes the KL divergence, and $P_\tau$ and $P_{\tau-1}$ denote the data distributions of the T-th period and the (T-1)-th period, respectively.

The tensor filling-based network flow data distributed measurement scheduling system comprises:

a first sub-module, configured to, for each factor in V1, sum the JS divergences of all OD pairs whose source point is the node represented by that factor, as the JS divergence of the factor; and, for each factor in V2, sum the JS divergences of all OD pairs whose target point is the node represented by that factor, as the JS divergence of the factor;

a second sub-module, configured to, for each factor in factor set V1, calculate the sample variance of all data in the T-1 period whose source node is the node represented by the factor; for each factor in factor set V2, calculate the sample variance of all data in the T-1 period whose target node is the node represented by the factor; and, for each factor in factor set V3, calculate the sample variance of the data represented by the factor over all times in the T-1 period;

a third sub-module, configured to combine the JS divergence JSD and the sample variance of each factor in factor sets V1 and V2 to obtain the importance (Importance) of each factor according to the following formula:

$$\mathrm{Importance} = \ln(\mathrm{JSD} + 1) \cdot \ln(\mathrm{variance} + 1)$$

a fourth sub-module, configured to select factors from factor sets V1 and V2 and construct, with the current unknown factor in factor set V3, a linear equation that corresponds to one sampling point; and to judge whether the linear equation increases the rank of the coefficient matrix of the system of linear equations formed so far and, if so, add the sampling point corresponding to the linear equation to the acquisition scheme;

a fifth sub-module, configured to invoke the fourth sub-module repeatedly until each unknown factor in factor set V3 has R corresponding sampling points, and to obtain the R unknowns of each unknown factor in V3 by solving the system of linear equations established for it, thereby obtaining the traffic measurement result of the distributed network.

The invention also provides a storage medium storing a program for executing any one of the above tensor-filling-based distributed measurement scheduling methods for network traffic data.

The invention also provides a client for use with any one of the above tensor-filling-based distributed measurement scheduling systems for network traffic data.

Drawings

Fig. 1 is a schematic diagram of the O-D pair JSD computing module.

Fig. 2 is a schematic diagram of a network measurement scheduling module.

Fig. 3 is a schematic diagram of a factor matrix training update module.

Detailed Description

The invention discloses a distributed measurement point scheduling method based on tensor filling, which involves the following modules:

A historical measurement data acquisition module. A certain amount of historical data is measured in advance, and the unmeasured historical data is recovered from the measured historical data through tensor filling, so that the complete historical data is obtained.

A JSD computing module. The historical data, which contains the traffic information of every OD pair in the network at historical times, is divided into a T-1 period and a T period, and the JSD value of every OD pair between these two adjacent time windows is calculated.

A factor matrix learning (CP decomposition) module based on historical data. The network data, in the form of a three-dimensional tensor over source points (O points), target points (D points) and time, is decomposed by CP decomposition into three factor matrices, and each row of each factor matrix is abstracted as one factor, giving three factor sets V1, V2 and V3. Subsequent steps then build linear systems to solve for the unknown factors and thereby learn the factor matrices. V3 represents the time dimension; because data at future times has not yet been collected, the factors for those times are unknown. Each factor in V1 represents the information of one O point after CP decomposition, each factor in V2 represents the information of one D point after CP decomposition, and each factor in V3 represents the information of one time instant after CP decomposition.

The network data includes common actively measured network performance indicators, such as RTT, path data, bandwidth, delay, bottlenecks, frequency of bursty traffic, degree of congestion, dynamic bottlenecks, site reachability, throughput, bandwidth utilization, packet loss rate, response time of servers and network devices, maximum network traffic, and network quality of service (QoS, including the quality of service for image, data and voice).

A measurement point scheduling module. This module runs the acquisition algorithm for future data to determine at which locations sampling is required in the future. The importance of each O-D measurement point is computed by combining the value variance (the variance of the measured values associated with the O or D node), the JSD and other information; factors of high importance are selected greedily to build a linear system, and a sampling feasibility check determines whether each selected factor is feasible.

A new measurement data acquisition module. This module collects new data using the acquisition scheme produced by the previous module.

A tensor-based full data recovery module. This module recovers the full data, following the definition of CP decomposition, from the three factor matrices jointly determined by the historical sampling data and the new sampling data.

A factor matrix updating module. This module establishes linear systems from the historical data and the new sampling data to train and update the data in the three factor matrices respectively.

The distributed measurement point scheduling method based on tensor filling mainly comprises the following key points:

Key point 1: the method introduces Jensen-Shannon divergence (JSD) into network measurement for the first time and proposes a dynamic interval division method to discretize the continuous measured values, so as to calculate how the value distribution of any O-D measurement point changes between adjacent periods during network measurement.

Key point 2: a measurement scheduling method based on importance sampling. The importance of each O-D measurement point is computed by combining the value variance (the variance of the measured values associated with the O or D node), the JSD from key point 1, and other information. Based on this importance, different factors in the factor sets V1 and V2 are selected for pre-scheduled sampling at each time instant.

Key point 3: a factor matrix training update method. After each round of low-overhead measurement is completed, the factor matrices of the three directions are updated respectively according to the measured data. Based on the JSD value of each O-D pair, the training update of the V1 and V2 factors is performed only for O-D pairs exceeding a specified threshold (time complexity decreases from O(n+m+c) to O(n+m+c)).
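As an illustration of key point 3, the following is a minimal sketch of a thresholded update of the V1 factors with the V2 and V3 factors held fixed. The text only states that a linear system is established; solving it by least squares through the normal equations is an assumption made here, and the names `solve_square`, `update_row`, `update_v1`, `obs_by_source`, `source_jsd` and `threshold` are hypothetical.

```python
def solve_square(M, y):
    """Solve M a = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular; more observations are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def update_row(B, C, observations):
    """Fit one row a_i of A from observations [((j, k), value), ...].

    Each observation gives the equation sum_r a_ir * b_jr * c_kr = value.
    Assumption: the (possibly overdetermined) system is solved by least squares.
    """
    R = len(B[0])
    rows = [[b * c for b, c in zip(B[j], C[k])] for (j, k), _ in observations]
    vals = [v for _, v in observations]
    # Normal equations: (rows^T rows) a = rows^T vals
    G = [[sum(row[p] * row[q] for row in rows) for q in range(R)] for p in range(R)]
    h = [sum(row[p] * v for row, v in zip(rows, vals)) for p in range(R)]
    return solve_square(G, h)


def update_v1(A, B, C, obs_by_source, source_jsd, threshold):
    """Return a copy of A in which only rows whose JSD exceeds the threshold are refit."""
    new_A = [list(row) for row in A]
    for i, obs in obs_by_source.items():
        if source_jsd[i] > threshold and len(obs) >= len(B[0]):
            new_A[i] = update_row(B, C, obs)
    return new_A
```

Rows of A whose factor JSD stays at or below the threshold are left untouched, which is where the saving in update work comes from. The same pattern would apply to V2 with V1 and V3 fixed.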

In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.

Aiming at the problems of large sample collection quantity and high redundancy of the current network measurement method, the invention provides a tensor filling-based network flow data recovery and measurement method. For a clearer understanding of the technical features, objects and effects of the present invention, the method and system of the present invention will now be described in further detail with reference to the accompanying drawings.

Fig. 1 illustrates the O-D pair JSD computing module. In this module, the measurement data is cleaned, and the change in the network traffic pattern of any measurement point between two consecutive time windows is calculated by the JSD computing module.

Step 101: clean the collected data set to minimize the effect of noisy data on subsequent network measurements.

Step 102: calculate the data distribution of the (T-1)-th period. Because the measured values are continuous, they must be discretized by dividing the value range into several intervals before the data distribution can be calculated. Since the data distribution of each OD pair in the T-1 period is different, the number of intervals should also differ between OD pairs. The invention therefore does not use the conventional static interval division, in which the number of intervals is fixed, but proposes a dynamic interval division method. Specifically, for the data distributions of different OD pairs in the same period, the invention considers two quantities: the difference between the maximum and minimum values (i.e., the length of the value range) and the mean of the data. The total number of data points in the same period is the same for every OD pair, so the difference between the maximum and minimum reflects how sparse the distribution is: the longer the value range, the sparser the distribution. The mean reflects the overall magnitude of the values. The invention holds that the sparser the data distribution, the more intervals it should be divided into, but this is also limited by the magnitude of the values. For example, consider two groups of data with value ranges [10001, 10100] and [1, 100]. The two value ranges have equal length, but the mean of the first group is clearly greater than that of the second. For the first group, the data lies roughly between ten thousand and ten thousand one hundred, so a range of length 100 is small relative to values that are all above ten thousand; for the second group, a range of length 100 is relatively large for values that are all at most 100. That is, when analyzing how the data differs, one should consider not only the differences between values but also their ratios. So although the two value ranges are equal in length, the invention regards the first group as more densely distributed than the second, and the first group should be divided into fewer intervals than the second. In summary, the invention combines the value-range length and the mean of each group of data and proposes a formula for dynamically determining the number of intervals for each group of data:

where num_bins is the number of intervals to divide into, distance is the length of the value range of the group of data (i.e., the difference between the maximum and minimum values), average is the mean of the group of data, and round() denotes rounding down the value in brackets.

Step 103: calculate the data distribution of the T-th period. As in step 102, the number of intervals is determined dynamically and the measured values are discretized, after which the data distribution of the measured values is calculated. Dividing into intervals maps the continuous data onto discrete intervals; after discretization, the number of data points falling in each interval is counted and divided by the total number of data points to obtain the data distribution.
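The discretization described in steps 102 and 103 can be sketched as follows. This is a minimal illustration rather than the patented procedure: the number of intervals is passed in as the parameter `num_bins` (how it is derived from distance and average is not reproduced here), equal-width intervals are assumed, and the function name `empirical_distribution` and its `lo`/`hi` arguments are hypothetical.

```python
def empirical_distribution(values, num_bins, lo, hi):
    """Map continuous values onto num_bins equal-width intervals over [lo, hi]
    and return the fraction of values that falls into each interval."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    if not values:
        raise ValueError("values must be non-empty")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single occupied interval
        else:
            # Clamp so that v == hi lands in the last interval.
            idx = min(int((v - lo) / width), num_bins - 1)
            idx = max(idx, 0)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


# Example: four measurements of one OD pair split into two intervals.
dist = empirical_distribution([1, 2, 3, 4], num_bins=2, lo=1, hi=4)
```

Passing the same `lo`, `hi` and `num_bins` for both periods is one way to make the two resulting distributions comparable entry by entry; that choice is an assumption of this sketch.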

Step 104: calculate the JS divergence (JSD) of each OD pair. The Jensen-Shannon divergence (JSD) is based on the KL divergence and is a numerical measure of the difference between two probability distributions. The more similar the two distributions, the smaller their JSD; when the JSD is zero, the two distributions are identical. The JS divergence also resolves the asymmetry of the KL divergence and its tendency toward vanishing gradients, so the JSD is better suited than the KL divergence to quantifying changes in data. The aim is to measure the similarity between the distributions of network traffic data in two consecutive time windows, since this reveals how the current moment has changed relative to the past. JSD is therefore introduced into network measurement to calculate the similarity of the feature-value distributions:

$$\mathrm{JSD}(P_\tau \,\|\, P_{\tau-1}) = \tfrac{1}{2} D(P_\tau \,\|\, M) + \tfrac{1}{2} D(P_{\tau-1} \,\|\, M), \qquad M = \tfrac{1}{2}\left(P_\tau + P_{\tau-1}\right)$$

where $D(P \,\|\, Q) = \sum_{x \in \chi} P(x) \log \frac{P(x)}{Q(x)}$ denotes the KL divergence, and $P_\tau$ and $P_{\tau-1}$ denote the data distributions of the T-th period and the (T-1)-th period, respectively. After the JSD of each OD pair is obtained, the JSD value of each factor in factor sets V1 and V2 is calculated. For each factor in V1, the JSD values of all OD pairs whose source point is the node represented by the factor are summed to give the JSD of the factor; for each factor in V2, the JSD values of all OD pairs whose target point is the node represented by the factor are summed to give the JSD of the factor. The factors in V3 do not take JSD values into account.
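A minimal sketch of step 104 follows, using the standard definition of the JS divergence as the average KL divergence to the mixture of the two distributions. It assumes both distributions are given over the same intervals; the function names and the `od_jsd` dictionary layout are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D(P||Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x) == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions on the same intervals."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m is positive wherever p or q is positive, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def factor_jsd(od_jsd, num_sources, num_targets):
    """Aggregate OD-pair JSD values into per-factor JSD values.

    od_jsd maps (source_index, target_index) -> JSD of that OD pair.
    Returns (v1_jsd, v2_jsd): sums over all pairs sharing a source / a target.
    """
    v1_jsd = [0.0] * num_sources
    v2_jsd = [0.0] * num_targets
    for (o, d), value in od_jsd.items():
        v1_jsd[o] += value
        v2_jsd[d] += value
    return v1_jsd, v2_jsd
```

With the natural logarithm, the JSD lies between 0 (identical distributions) and ln 2 (distributions with disjoint support), so per-factor sums grow with the number of OD pairs attached to a node.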

Fig. 2 illustrates the network measurement scheduling module. In this module, each row of the three factor matrices A, B, C of the traffic tensor is abstracted as one factor in the factor sets V1, V2, V3, respectively. Any sample can be expressed as

$$x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}$$

where $a_{ir}$ is the element in row i, column r of factor matrix A, $b_{jr}$ is the element in row j, column r of factor matrix B, and $c_{kr}$ is the element in row k, column r of factor matrix C. That is, one factor is selected from each of the factor sets V1, V2, V3, and the three are combined into a linear equation that represents one sample. The importance of each factor is obtained by jointly considering the JSD and variance of each factor in V1 and V2; factors of high importance are selected preferentially, together with an unknown factor in V3, to construct a linear equation, and one sample is sampled through that equation. A sampling feasibility check is also added to determine whether the sample is feasible.
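The relation between one sample and the three factors can be sketched in a few lines, assuming the factor matrices are stored as lists of rows of equal length R; the function names `cp_entry` and `cp_reconstruct` are hypothetical. The same expression, evaluated for every index triple, is what the full data recovery module relies on.

```python
def cp_entry(A, B, C, i, j, k):
    """One tensor entry from a CP model: sum over r of A[i][r] * B[j][r] * C[k][r]."""
    return sum(a * b * c for a, b, c in zip(A[i], B[j], C[k]))


def cp_reconstruct(A, B, C):
    """Recover the full tensor X[i][j][k] from the three factor matrices."""
    return [[[cp_entry(A, B, C, i, j, k) for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


# Two sources, two targets, one time slot, rank R = 2.
A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]
C = [[1, 1]]
X = cp_reconstruct(A, B, C)
```

When the row of C for a time slot is unknown and the rows of A and B are known, each measured entry becomes a linear equation in the R unknown elements of that row.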

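Step 201: calculate the rank of the tensor. The rank of a third-order tensor is the minimum number of rank-one tensors whose sum yields the tensor; that is, the tensor can be formed as the sum of R rank-one tensors, each an outer product of three vectors, expressed as follows:

$$\mathcal{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$$

where $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$ are the r-th columns of the factor matrices A, B and C, and $\circ$ denotes the outer product.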

Step 202: calculate the variance of each factor in the T-1 period, where the variance of a factor is the variance of all data associated with that factor in the T-1 period. For a factor in factor set V1, this is the sample variance of all data in the T-1 period whose source node is the node represented by the factor; for a factor in factor set V2, it is the sample variance of all data in the T-1 period whose target node is the node represented by the factor; for a factor in factor set V3, it is the sample variance of the data represented by the factor over all times in the T-1 period.
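A minimal sketch of the per-factor variances for V1 and V2 follows, assuming the recovered T-1 period data is stored as a nested list `X[o][d][t]`; that layout and the function name `factor_variances` are hypothetical. The V3 variance is left out because only V1 and V2 enter the importance calculation.

```python
from statistics import variance


def factor_variances(X):
    """Sample variances of the T-1 period data grouped by source and by target.

    X[o][d][t] is the traffic of OD pair (o, d) at time t.
    Returns (v1_var, v2_var): one sample variance per source node and per target node.
    """
    n_o, n_d, n_t = len(X), len(X[0]), len(X[0][0])
    v1_var = [
        variance([X[o][d][t] for d in range(n_d) for t in range(n_t)])
        for o in range(n_o)
    ]
    v2_var = [
        variance([X[o][d][t] for o in range(n_o) for t in range(n_t)])
        for d in range(n_d)
    ]
    return v1_var, v2_var


# Two sources, two targets, two time slots.
X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
v1_var, v2_var = factor_variances(X)
```

`statistics.variance` computes the sample variance (dividing by n - 1), so each group needs at least two data points.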

Step 203: combine the JSD, variance and other information of each factor in factor sets V1 and V2 to calculate the importance of the factor. The JSD of a factor represents the magnitude of change in the factor's information: the larger the change, the more information the factor carries and the more it deserves to be sampled, so JSD is positively correlated with importance. The variance of a factor reflects the variability of the samples associated with it: the greater the variability, the more the factor deserves to be sampled, so variance is also positively correlated with importance. Denoting the variance of a factor by variance and its JSD value by JSD, the importance of a factor in V1 or V2 is calculated by the following formula:

$$\mathrm{Importance} = \ln(\mathrm{JSD} + 1) \cdot \ln(\mathrm{variance} + 1)$$

Step 204: sort the factors in the V1 factor set and in the V2 factor set in descending order of importance, and then sample. That is, according to the factor importance calculated in step 203, one factor is selected from each of the factor sets V1 and V2, and a linear equation is constructed with the current unknown factor in the factor set V3. This equation corresponds to one sample point.
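Steps 203 and 204 can be sketched in a few lines of Python. The sketch reads the importance formula as the product of ln(JSD + 1) and ln(variance + 1), which is an interpretation of the formula as printed; the function names `importance` and `rank_factors` are illustrative.

```python
import math

def importance(jsd, variance):
    """Importance of one factor, read as ln(JSD + 1) * ln(variance + 1)."""
    return math.log1p(jsd) * math.log1p(variance)

def rank_factors(jsd_values, variances):
    """Return factor indices sorted by descending importance."""
    scores = [importance(j, v) for j, v in zip(jsd_values, variances)]
    return sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)

# Example: three factors with equal variance; the largest JSD ranks first.
order = rank_factors([0.1, 0.5, 0.3], [2.0, 2.0, 2.0])
```

Because both logarithms are zero when their argument is zero, a factor with no change between periods or no variability receives zero importance under this reading.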

Step 205: sample feasibility check, i.e. determine whether sampling a selected sample is feasible by checking whether the linear equation established for that sample increases the rank of the coefficient matrix of the existing system of equations. If it does, the sample is added to the sampling space; if not, a new sample is selected. Here the coefficient matrix is the matrix formed by the coefficients of the linear equations constructed so far.
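The feasibility check of step 205 amounts to a rank comparison, which can be sketched with NumPy as below. The function name `is_feasible` and the representation of the existing equations as a list of coefficient rows are assumptions for illustration.

```python
import numpy as np

def is_feasible(existing_rows, new_row):
    """Return True if new_row increases the rank of the coefficient matrix.

    existing_rows: list of coefficient rows already accepted (may be empty).
    new_row: coefficient row of the candidate sample's linear equation.
    """
    new_row = np.asarray(new_row, dtype=float)
    if len(existing_rows) == 0:
        # The first equation is useful as long as it is not all zeros.
        return bool(np.any(new_row != 0))
    old = np.vstack(existing_rows).astype(float)
    new = np.vstack([old, new_row])
    return bool(np.linalg.matrix_rank(new) > np.linalg.matrix_rank(old))
```

A candidate whose coefficient row is a linear combination of the accepted rows adds no information about the unknown factor, so it is rejected and the next candidate in importance order is tried.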

Step 206: construct a linear system to calculate the initial value of V3. For a three-dimensional tensor of rank R, one unknown factor represents one row of a factor matrix and therefore contains R unknowns. Each sample contributes one linear equation, and a system of linear equations in R unknowns has a unique solution when its coefficient matrix has full rank, i.e. rank R. Therefore R feasible samples must be selected so that the R equations are linearly independent and form a system whose coefficient matrix has rank R; the R unknowns of the unknown factor can then be solved, the unknown factor becomes a known factor, and the factor matrix is updated. In other words, full recovery can be achieved as long as at least R (the rank) O-D pairs are measured at each time point. Steps 204 and 205 are repeated until R feasible samples have been collected for every unknown factor in V3. The R unknowns of each unknown factor in V3 are then obtained by solving the system of linear equations established for that factor. Once the R unknowns of every factor in the factor set V3 have been solved in this way, the initial value of the factor matrix C is obtained.
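Solving for one unknown V3 factor from R feasible samples can be sketched as follows. The function name `solve_unknown_factor`, the argument layout and the small example matrices are assumptions for illustration; the sketch simply builds the R-by-R coefficient matrix from the chosen (source, target) pairs and solves it.

```python
import numpy as np

def solve_unknown_factor(A, B, pairs, values):
    """Solve the R unknowns of one row of C from R feasible samples.

    pairs: list of (i, j) index pairs of the sampled O-D pairs.
    values: measured traffic of those pairs at the time point of this factor.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    R = A.shape[1]
    M = np.array([A[i] * B[j] for i, j in pairs])  # coefficient matrix
    if M.shape != (R, R) or np.linalg.matrix_rank(M) < R:
        raise ValueError("need R linearly independent equations")
    return np.linalg.solve(M, np.asarray(values, dtype=float))

# Illustrative data (assumed): rank R = 3.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
B = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 1.0], [1.0, 1.0, 2.0]])
c_true = np.array([2.0, 3.0, 4.0])
pairs = [(0, 0), (1, 1), (2, 2)]
values = [float((A[i] * B[j]) @ c_true) for i, j in pairs]
c_row = solve_unknown_factor(A, B, pairs, values)
```

Repeating this for every time point yields the initial value of the factor matrix C row by row.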

Fig. 3 illustrates the factor matrix training update module, which updates the unknown factors using the information of the known factors. The data in the factor sets V1 and V2 come from the factor matrices A and B, both of which are obtained by CP decomposition of the historical data; that is, the data in V1 and V2 have not been updated with the future data obtained by new sampling. The invention considers that a factor with a small JSD value changes little between adjacent periods. To reduce time complexity, a JSD threshold is set (here, 2), and factors whose JSD is below the threshold are not updated. This module updates the factors in V1 and V2 whose JSD exceeds the threshold by building a linear system that combines the historical data with the newly sampled future data. Here, a linear system means that, for each factor, all historical sample data and all new sample data associated with the factor are used to build an overdetermined system of equations, and the solution of that overdetermined system is used to update the values of the R variables of the factor.
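The update of a single V1 factor can be sketched as a least-squares solve of the overdetermined system. Using `numpy.linalg.lstsq` as the solver is an assumption of this sketch, since the text does not name a solver; the function name `update_factor_row` and the observation format are likewise illustrative.

```python
import numpy as np

def update_factor_row(current_row, jsd, B, C, observations, jsd_threshold=2.0):
    """Update one row of A from all samples that involve its source node.

    observations: list of (j, k, value) triples, combining historical and
    newly sampled data for this source node.
    Factors whose JSD does not exceed the threshold are left unchanged.
    """
    if jsd <= jsd_threshold:
        return np.asarray(current_row, dtype=float)
    B = np.asarray(B, dtype=float)
    C = np.asarray(C, dtype=float)
    # One equation per observation: (B[j] * C[k]) @ row = value
    M = np.array([B[j] * C[k] for j, k, _ in observations])
    y = np.array([value for _, _, value in observations], dtype=float)
    new_row, *_ = np.linalg.lstsq(M, y, rcond=None)
    return new_row
```

The same pattern applies to a V2 factor (with A and C fixed) and to a V3 factor (with A and B fixed), which is what the alternating steps of this module describe.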

Step 301: the known factors in the factor sets V2 and V3 are fixed, the factors in the factor set V1 whose JSD exceeds the threshold are regarded as unknown factors, and a linear system is established to update them.

Step 302: the known factors in the factor sets V1 and V2 are fixed, the factors in the factor set V3 are regarded as unknown factors, and a linear system is established to update them.

Step 303: the known factors in the factor sets V1 and V3 are fixed, the factors in the factor set V2 whose JSD exceeds the threshold are regarded as unknown factors, and a linear system is established to update them.

Step 304: repeat steps 301 to 303, continuously updating the values of the unknown factors, until the whole linear system converges or the maximum number of training rounds is reached, at which point iterative training stops. The invention sets a tolerance for the linear system as the criterion for convergence: an endurance index is initialized to 0, the minimum overall recovery error of the future data (error_whole) is maintained throughout training, and the current error is compared with this minimum. If the current error_whole is smaller, the minimum error_whole is updated; otherwise the endurance index is incremented by one. When the endurance index exceeds the tolerance, the model is regarded as converged and iterative training stops. The tolerance of the linear system is set to 20, and the overall recovery error of the future data is calculated as follows:

where $x_{i,j,k}$ denotes the (i, j, k)-th element of the original tensor and $\hat{x}_{i,j,k}$ denotes the (i, j, k)-th element recovered after tensor filling.
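The stopping rule of step 304 can be sketched as a small loop. The callback `run_round`, which is assumed to execute steps 301 to 303 once and return the current error_whole, and the default `max_rounds` value are hypothetical; the tolerance of 20 is the value stated above. As described, the endurance index is never reset.

```python
def train_with_tolerance(run_round, tolerance=20, max_rounds=1000):
    """Iterate until the endurance index exceeds the tolerance.

    run_round: hypothetical callable that performs one pass of
    steps 301 to 303 and returns the current overall recovery error.
    Returns the minimum error seen and the number of rounds executed.
    """
    min_error = float("inf")
    endurance = 0            # endurance index, initialized to 0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        error = run_round()
        if error < min_error:
            min_error = error        # keep the smallest error_whole
        else:
            endurance += 1           # no improvement in this round
        if endurance > tolerance:
            break                    # regarded as converged
    return min_error, rounds
```

For example, with `tolerance=2` and successive errors 5, 4, 3, 3, 3, 3, training stops after the sixth round, when the endurance index reaches 3.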

In summary, the invention introduces JSD into network measurement for the first time and weights factor importance by jointly considering JSD, variance and other information. It provides a measurement scheduling method based on importance sampling and, in combination with CP decomposition theory, uses tensor filling to recover and measure network traffic data. It also provides a factor matrix training update method: after each round of low-overhead measurement is completed, the factor matrices in the three directions are updated respectively according to the measurement data. By applying the method of the invention, information from multiple aspects can be combined, the sampling strategy is optimized, the sampling cost is reduced, and the computational complexity of factor matrix training update is reduced.

The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.

Claims

1. A tensor-filling-based network traffic data distributed measurement scheduling method, comprising:

step 1, deducing complete historical flow data based on historical sampling data of a distributed network, dividing the historical flow data into a T-1 period and a T period based on a time window, respectively calculating data distribution of the T-1 period and the T period, and JS divergence of an OD pair formed by combining all source nodes and target nodes in the distributed network about the T-1 period and the T period;

step 2, decomposing the historical flow data in three-dimensional tensor form by CP decomposition to obtain three factor matrices, wherein each row of a matrix is one factor, obtaining three factor sets V1, V2 and V3 corresponding to the three factor matrices respectively, obtaining the JS divergence of each factor in the factor sets V1 and V2 according to the JS divergences of the OD pairs, combining the JS divergence and variance of each factor in V1 and V2 to obtain the importance of each factor, selecting the factor with the highest importance and an unknown factor in V3 to construct a linear equation, wherein the linear equation corresponds to one sampling point, and collecting all sampling points to form an acquisition scheme;

and step 3, acquiring new sampling data by using the acquisition scheme, and recovering full data by using a factor matrix jointly determined by the historical sampling data and the new sampling data as a flow measurement result of the distributed network.

2. The tensor-fill-based network traffic data distributed measurement scheduling method of claim 1, further comprising step 4:

3. The distributed measurement scheduling method of network traffic data based on tensor padding according to claim 1, wherein the step 1 comprises:

combining the length and the mean of the value range of each group of data, and calculating the number of data partition intervals for the T-1 period and the T period respectively:

According to the number of the data dividing intervals of the T-1 period, mapping the data of the T-1 period to discrete intervals, dividing the number of the data of each discrete interval by the total number of the data of the T-1 period to obtain the data distribution of the T-1 period; according to the number of the data dividing intervals of the T period, mapping the data of the T period to discrete intervals, dividing the number of the data of each discrete interval by the total number of the data of the T period, and obtaining the data distribution of the T period;

and calculating the JS divergence $JSD(P_{\tau} \,\|\, P_{\tau-1})$ of the OD pair between the T-1 period and the T period by:

4. The method for distributed measurement scheduling of network traffic data based on tensor padding of claim 3, wherein the step 2 comprises:

step 21, for each factor in the factor set V1, summing the JS divergences of all OD pairs whose source node is the node represented by the factor, and taking the sum as the JS divergence of that factor; for each factor in the factor set V2, summing the JS divergences of all OD pairs whose target node is the node represented by the factor, and taking the sum as the JS divergence of that factor;

$$\mathrm{Importance} = \ln(\mathrm{JSD} + 1) \cdot \ln(\mathrm{variance} + 1)$$

step 25, executing step 24 again until each unknown factor in the factor set V3 corresponds to R sampling points, and collecting the R sampling points to form an acquisition scheme;

the step 3 comprises the following steps:

and solving a linear equation set established for each unknown factor in V3 to obtain R unknowns of the factor, and obtaining a flow measurement result of the distributed network.

5. A tensor-fill-based network traffic data distributed measurement scheduling system, comprising:

an initialization module, used for deducing complete historical flow data based on historical sampling data of the distributed network, dividing the historical flow data into a T-1 period and a T period based on a time window, respectively calculating the data distribution of the T-1 period and the T period, and calculating the JS divergence, between the T-1 period and the T period, of each OD pair formed by combining the source nodes and target nodes in the distributed network;

an operation module, used for decomposing the historical flow data in three-dimensional tensor form by CP decomposition to obtain three factor matrices, wherein each row of a matrix is one factor, obtaining three factor sets V1, V2 and V3 corresponding to the three factor matrices respectively, obtaining the JS divergence of each factor in the factor sets V1 and V2 according to the JS divergences of the OD pairs, combining the JS divergence and variance of each factor in V1 and V2 to obtain the importance of each factor, selecting the factor with the highest importance and an unknown factor in V3 to construct a linear equation, wherein the linear equation corresponds to one sampling point, and collecting all sampling points to form an acquisition scheme;

and the acquisition module is used for acquiring the new sampling data according to the acquisition scheme, and recovering the total data by using a factor matrix jointly determined by the historical sampling data and the new sampling data as a flow measurement result of the distributed network.

6. The tensor-fill-based network traffic data distributed measurement scheduling system of claim 5, further comprising an update module for establishing a linear system based on the historical sample data and the new sample data to respectively train and update the data in the three factor matrices, the training update comprising:

7. The tensor-fill-based network traffic data distributed measurement scheduling system of claim 5, wherein the initialization module is configured to:

8. The tensor-fill-based network traffic data distributed measurement scheduling system of claim 7, wherein the operation module comprises:

a first sub-module, configured to, for each factor in the factor set V1, sum the JS divergences of all OD pairs whose source node is the node represented by the factor and take the sum as the JS divergence of that factor, and, for each factor in the factor set V2, sum the JS divergences of all OD pairs whose target node is the node represented by the factor and take the sum as the JS divergence of that factor;

$$\mathrm{Importance} = \ln(\mathrm{JSD} + 1) \cdot \ln(\mathrm{variance} + 1)$$

a fifth sub-module, configured to recall the fourth sub-module until each unknown factor in the factor set V3 corresponds to R sampling points, and collect the R sampling points to form an acquisition scheme;

The acquisition module comprises:

9. A storage medium storing a program for executing the tensor-based network traffic data distributed measurement scheduling method according to any one of claims 1 to 4.