CN116132300A - Link identification method based on gradient lifting decision tree feature combination - Google Patents

Info

Publication number
CN116132300A
Authority
CN
China
Prior art keywords
link
links
data
network
microwave
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211122899.5A
Other languages
Chinese (zh)
Inventor
莫李思
李俊彬
戴瑞婷
罗家逸
费高雷
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202211122899.5A
Publication of CN116132300A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/20 Hop count for routing purposes, e.g. TTL
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/09 Mapping addresses
    • H04L61/10 Mapping addresses of different types
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a link identification method based on gradient boosting decision tree (GBDT) feature combination, comprising the following steps: S1, constructing a link feature database and preprocessing it to obtain a training data set; S2, constructing a link identification model based on a gradient boosting decision tree and training the model with the training data set; S3, adding corresponding auxiliary features based on the combined features to obtain an accurate identification result: interpretable data-to-result combined features are obtained from the link identification model, the corresponding auxiliary features are derived from them, and the auxiliary features are added to the link identification to obtain an accurate identification result. The method addresses the problem that conventional link identification techniques, when faced with a complex network, require hardware equipment to probe links in the field; it makes effective use of the feature-combination capability of the gradient boosting decision tree and the learning capability of the deep model, and improves the efficiency and accuracy of link identification.

Description

Link identification method based on gradient boosting decision tree (GBDT) feature combination
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a link identification method based on gradient boosting decision tree feature combination.
Background
With the rapid development of communication technology, different communication networks have been interconnected into one huge network; once heterogeneous networks are connected, a logically interconnected network is formed through a common set of protocols. Examples of heterogeneous link transport networks are optical fiber networks, cellular mobile networks and microwave networks. During long-distance data transmission, the network layer translates the network address into the corresponding physical address so that data can be carried over the various link types, and it provides services to the transport layer, which enables communication across network segments with the data transmitted according to the specific communication rules of each link. As communication networks grow, the number of communication devices in cyberspace increases explosively and the heterogeneous link transport network becomes ever more complex. Data usually traverses several types of links, so the characteristics of an individual link are hard to separate out. As a result, the link types between communication devices are difficult to identify and links are difficult to trace, which poses a hidden danger to cyberspace security and complicates network fault detection and maintenance.
In recent years, research on the topology of large-scale, structurally complex heterogeneous link networks has largely ignored link transmission attributes, and link identification remains a major difficulty. Identifying links accurately and quickly from the characteristics of the link transmission medium helps improve communication quality and prevent and detect link faults; it provides network visibility and application analysis capability for subsequent research on network structure and fault diagnosis, and is of great significance for studying the composition and reliability of communication networks.
Conventional link identification techniques generally target the signal transmitted over the communication link under test: the signal is processed at the receiving end with filters and similar devices, its amplitude-frequency characteristics are analyzed, and the link characteristics are finally obtained. However, noise or network congestion may occur during transmission, so packets may be lost or the signal may never reach the receiving end. Moreover, because nodes compute an optimal path for each transmission, successive transmissions may not follow the same path; hardware cannot be deployed along the transmission path to receive the signal, and an end-to-end test from source to destination is therefore not possible.
Conventional link identification techniques rely on deployed hardware such as signal collectors and filters. In real network transmission, however, and especially in long-distance end-to-end transmission across networks, bandwidth limits and noise interference are common, and network congestion can push the bit error rate too high, so the test performs poorly or the test signal is not received at all. Critical link identification techniques, for their part, do not take into account the different properties of links in a heterogeneous network. Both conventional and critical link identification techniques are therefore limited when applied to the links of heterogeneous networks.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a link identification method based on gradient boosting decision tree (GBDT) feature combination. The method removes the dependence of conventional link identification techniques on extensive hardware support and can identify the category of a link effectively and accurately with limited hardware deployment.
The aim of the invention is achieved by the following technical scheme: a link identification method based on gradient boosting decision tree feature combination, comprising the following steps:
S1, constructing a link feature database and preprocessing: probing target IP addresses by active probing to obtain raw probe data and constructing the link feature database; then cleaning outliers from the probe data with a network behavior analysis method and completing the preprocessing of the link data to obtain a training data set;
S2, constructing a link identification model based on a gradient boosting decision tree, and training the model with the training data set;
S3, adding corresponding auxiliary features based on the combined features to obtain an accurate identification result: interpretable data-to-result combined features are obtained from the link identification model, the corresponding auxiliary features are derived from them, and the auxiliary features are added to the link identification to obtain an accurate identification result.
Further, step S1 is implemented as follows:
S11, using the open-source network probing tool Scamper to actively probe the server IP addresses of a number of known networks, and probing links of different types in a targeted manner in an experimental environment to obtain raw link data; the links are divided into four categories according to the experimental environment, namely optical fiber links, mobile links, satellite links and microwave links, and data are collected through a server terminal, a mobile phone terminal, a satellite terminal and a microwave terminal respectively; Scamper is then used to probe the known receiving terminals several times from the mobile phone terminal, the server terminal, the satellite terminal and the microwave terminal respectively to obtain the raw probe data; outliers in the probe data are cleaned with a network behavior analysis method;
S12, preprocessing all link data: anonymous nodes missing from a link are filled in with private addresses according to the TTL values of the link, restoring the complete link; the open-source alias resolution tool Kapar is used to perform alias resolution on the path data, the result is compared with the original path, and IP addresses are associated with routing information to obtain the feature data of the different links;
S13, for the links in the feature database, calculating the extreme value of the round-trip delay of each hop node and the mean of its variance; the minimum link delay represents the minimum delay of the communication link under ideal conditions, and the mean of the variance represents the delay fluctuation of the link. The extreme value of the round-trip delay and the mean of the variance are fitted as important features, and polynomial fitting, a common way of approximating complex functions, is used to approximate the round-trip delay function. Specifically, assuming a link contains N hops in total, the training data set is $\{(1, y_1), (2, y_2), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$, where $x_i$ is the link hop count and $y_i$ is the link round-trip delay; an M-order polynomial fit is performed on all links of the different categories:
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
Solving the above formula gives the polynomial coefficients in vector form:
$$\mathbf{w} = (w_0, w_1, \ldots, w_M)^{T}$$
The polynomial coefficients are tagged with the corresponding class label (optical fiber network, mobile network, microwave network or satellite network) and stored in the link feature database as the training data set.
Further, step S2 is implemented as follows:
acquiring training data of the different categories (optical fiber network, mobile network, microwave network and satellite network) from the link feature database, and training on a large amount of data on the basis of GBDT feature combination; five-fold cross-validation is used during training: the data are randomly divided into five equal parts, and in each experiment one part serves as the test set and the rest as the training set;
defining the error function $E_{RMS}$ as the root mean square error between the fitting function $y(x, \mathbf{w})$ and the original data:
$$E_{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl( y(x_i, \mathbf{w}) - y_i \bigr)^2}$$
the accuracy and the error function on the training set and the test set are analyzed, and optimization is carried out by repeatedly adjusting the polynomial order and the model training parameters; provided the accuracy on both the test set and the training set meets the requirement, the polynomial coefficients and model training parameters that give the smallest error function are selected.
Further, step S3 is implemented as follows:
S31, finding nodes that are structurally similar across different partial topologies, and merging identical nodes;
S32, acquiring the state mutation nodes in a link: an optimization problem that infers link information from path information is constructed, and the lowest-cost sequence is obtained by solving it, which yields the state mutation nodes; given the sequence of inter-arrival gaps $x = (x_0, x_1, \ldots, x_n)$ between message arrivals, find the state sequence
Figure BDA0003847140970000034
that minimizes the cost, where the cost $c(q \mid x)$ of the optimal solution after each change is calculated as:
Figure BDA0003847140970000035
where $\tau(i_t, i_{t+1})$ represents the cost of a burst state transition from the lower-intensity state $i_t$ to the higher-intensity state $i_{t+1}$, and
Figure BDA0003847140970000036
represents the exponential density function associated with state $i_t$; the minimum cost value of each state node in the link is calculated, and whether the current state node is a mutation node of the link is decided according to the state change rule: a node at which the state changes under the rule is a mutation node;
S33, by analyzing the structural similarity and the mutations of nodes in the link, auxiliary features are added to the link identification model obtained in the previous step to perform binary classification correction, specifically:
correcting some optical fiber links and mobile links according to the RRC mechanism: the round-trip delays and corresponding probe times of links preliminarily classified as optical fiber or mobile are obtained from the feature database, the extreme value of the round-trip delay and the mean of its variance are calculated, and the probe times are sorted in increasing order to analyze whether the round-trip delay of a link jumps across different time intervals; if a link preliminarily classified as optical fiber shows delay jumps, its category is corrected to mobile link; if a link preliminarily classified as mobile shows no delay jumps, its category is corrected to optical fiber link;
correcting some mobile links and microwave links according to transmission distance: the IP geolocations of links preliminarily classified as mobile or microwave are obtained from the feature database and the transmission distance of each link is calculated; if the transmission distance of such a link exceeds 100 km, its category is corrected to optical fiber link;
correcting some optical fiber links and microwave links according to long-term weather variation: the round-trip delays and probe times of links preliminarily classified as optical fiber or microwave are obtained from the feature database, and the weather conditions at each probe time are looked up; if the round-trip delay of a link preliminarily classified as optical fiber fluctuates markedly in overcast or rainy weather compared with sunny weather, its category is corrected to microwave link; if the round-trip delay of a link preliminarily classified as microwave does not fluctuate markedly in overcast or rainy weather compared with sunny weather, its category is corrected to optical fiber link.
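The three correction rules above can be sketched as a small rule function. This is only an illustration under stated assumptions: the function name `correct_label`, the label strings, the boolean inputs (whether delay jumps or weather-related fluctuation were observed) and the order in which the rules are applied are all hypothetical; only the 100 km threshold comes from the text.

```python
FIBER, MOBILE, MICROWAVE = "fiber", "mobile", "microwave"

def correct_label(label, has_rtt_jumps=None, distance_km=None, weather_sensitive=None):
    """Apply the three auxiliary corrections to a preliminary label.

    Each auxiliary feature is optional; a rule is skipped when its input is None.
    """
    # RRC rule: delay jumps between probing intervals indicate a mobile link.
    if label in (FIBER, MOBILE) and has_rtt_jumps is not None:
        label = MOBILE if has_rtt_jumps else FIBER
    # Distance rule: a link longer than 100 km is treated as optical fiber.
    if label in (MOBILE, MICROWAVE) and distance_km is not None and distance_km > 100:
        label = FIBER
    # Weather rule: marked delay fluctuation in rain indicates a microwave link.
    if label in (FIBER, MICROWAVE) and weather_sensitive is not None:
        label = MICROWAVE if weather_sensitive else FIBER
    return label

corrected = correct_label(FIBER, has_rtt_jumps=True)  # fiber with delay jumps
```

Satellite links pass through unchanged here because none of the three rules in the text applies to them.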
The link identification method for heterogeneous networks based on gradient boosting decision tree (GBDT) feature combination classifies the connections between nodes in the network and has the following advantages:
(1) By introducing a link identification method based on gradient boosting decision tree (GBDT) feature combination, the dependence of conventional link identification techniques on extensive hardware support is removed, and the category of a link can be identified effectively and accurately with limited hardware deployment.
(2) By comparing and selecting the link features and auxiliary features that classify best in a heterogeneous network, links can be classified effectively and accurately, and the time spent on link identification is reduced.
Drawings
FIG. 1 is a flow chart of the link identification method of the present invention;
FIG. 2 is a schematic diagram of the process of constructing the link feature database according to the present invention;
FIG. 3 is a schematic diagram of the data fitting for different types of links according to the present invention;
FIG. 4 is a graph comparing round-trip delay changes during state transitions with and without the RRC mechanism;
FIG. 5 is a diagram of the transmission delay variation of a long-distance optical network link according to the present invention.
Detailed Description
The algorithm used in the invention is the gradient boosting decision tree (GBDT) algorithm, also known as the multiple additive regression tree (MART), a learner based on the gradient boosting machine (GBM) algorithm. Learners can be divided into strong learners and weak learners according to their predictive performance. Gradient boosting builds the final strong learner by summing a number of trained weak learners in an additive model; the weak learners used by GBDT are decision trees, generally classification and regression trees (CART). The GBDT classification algorithm handles both continuous and discrete data flexibly and can reach high classification accuracy in relatively little time. GBDT is widely used for both classification and regression, and it offers good classification performance and efficiency compared with other popular classification algorithms such as support vector machines (SVM), random forests (RF) and deep learning. Taking K-class classification as an example, the GBDT classification algorithm is implemented as follows:
Assume the training sample set is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$, where $(x_i, y_i)$ denotes the feature value $x_i$ and its corresponding category $y_i$. First, the weak learner is initialized as the optimal constant model, i.e. the value that minimizes the loss of mapping $x_i$ to $y_i$:
$$F_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$
where the loss function $L(y_i, c)$ reflects the difference between the predicted value and the true value and indicates the direction in which the current model should be optimized. The loss function commonly used by the GBDT classification algorithm is the logarithmic loss:
$$L(y, F(x)) = \log\left(1 + e^{-2yF(x)}\right)$$
Iteration then begins. For iterations $m = 1, 2, \ldots, M$ and samples $i = 1, 2, \ldots, N$, the negative gradient (pseudo-residual) of the $i$-th sample at the $m$-th iteration is:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
The residuals are then fitted to obtain the $m$-th classification tree and its leaf node regions $R_{mj}$, $j = 1, 2, \ldots, J$. For every leaf node, the best fit value of the $j$-th leaf node is:
$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} \log\left(1 + e^{-2 y_i \left(F_{m-1}(x_i) + c\right)}\right)$$
Since the above formula is difficult to optimize, its approximation is generally used instead:
$$c_{mj} = \frac{\sum_{x_i \in R_{mj}} r_{im}}{\sum_{x_i \in R_{mj}} |r_{im}| \left(2 - |r_{im}|\right)}$$
Finally, the current learner model is updated as:
$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$
After the iteration ends, the final classification tree is therefore:
$$F(x) = F_M(x) = F_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$
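The boosting loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes two classes labeled -1 and +1, a single numeric feature, and depth-one trees (stumps) with two leaves. The function names, the example data and the number of rounds are hypothetical. The pseudo-residual and leaf value follow the logarithmic loss $\log(1 + e^{-2yF})$ given in the text.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that minimizes squared error on the residuals."""
    cuts = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    def split_error(thr):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        return sse(left) + sse(right)
    return min(candidates, key=split_error)

def leaf_value(residuals):
    """Approximate leaf value: sum(r) / sum(|r| * (2 - |r|))."""
    den = sum(abs(r) * (2 - abs(r)) for r in residuals)
    return sum(residuals) / den if den > 1e-12 else 0.0

def train_gbdt(xs, ys, rounds=20):
    """ys must contain both -1 and +1. Returns (F0, [(threshold, c_left, c_right), ...])."""
    y_bar = sum(ys) / len(ys)
    f0 = 0.5 * math.log((1 + y_bar) / (1 - y_bar))  # constant minimizing the loss
    scores = [f0] * len(xs)
    trees = []
    for _ in range(rounds):
        # Negative gradient of log(1 + exp(-2yF)) with respect to F.
        res = [2 * y / (1 + math.exp(2 * y * f)) for y, f in zip(ys, scores)]
        thr = fit_stump(xs, res)
        c_left = leaf_value([r for x, r in zip(xs, res) if x <= thr])
        c_right = leaf_value([r for x, r in zip(xs, res) if x > thr])
        trees.append((thr, c_left, c_right))
        scores = [f + (c_left if x <= thr else c_right) for x, f in zip(xs, scores)]
    return f0, trees

def predict(model, x):
    """Sign of the additive model F_0 plus the leaf values selected by x."""
    f0, trees = model
    score = f0 + sum(cl if x <= thr else cr for thr, cl, cr in trees)
    return 1 if score >= 0 else -1

# Hypothetical feature: mean round-trip delay in ms; -1 = low-delay class, +1 = high-delay class.
delays = [1.0, 2.0, 3.0, 40.0, 45.0, 50.0]
labels = [-1, -1, -1, 1, 1, 1]
model = train_gbdt(delays, labels)
```

A production system would more likely use an established library implementation with multi-feature, multi-class trees; the sketch only makes the update equations concrete.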
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the link identification method based on gradient boosting decision tree feature combination of the invention comprises the following steps:
S1, constructing a link feature database and preprocessing: probing target IP addresses by active probing to obtain raw probe data and constructing the link feature database; then cleaning outliers from the probe data with a network behavior analysis method and completing the preprocessing of the link data to obtain a training data set;
The specific implementation is as follows:
S11, using the open-source network probing tool Scamper to actively probe the server IP addresses of a number of known networks, and probing links of different types in a targeted manner in an experimental environment to obtain raw link data; the links are divided into four categories according to the experimental environment, namely optical fiber links, mobile links, satellite links and microwave links, and data are collected through a server terminal, a mobile phone terminal, a satellite terminal and a microwave terminal respectively; the receiving equipment comprises a router, a mobile base station, a satellite station and a microwave relay station. Scamper is then used to probe the known receiving terminals several times from the mobile phone terminal, the server terminal, the satellite terminal and the microwave terminal respectively to obtain the raw probe data, as shown in FIG. 2. Information such as round-trip delay and IP geolocation is then extracted from the raw data as the attribute features to be analyzed.
Outliers in the probe data are then cleaned with a network behavior analysis method. The coefficient of variation of the round-trip delay is calculated; the coefficient of variation is a statistic that measures how much the observed values in the data vary, and it is calculated as:
$$C_v = \frac{\sigma}{\mu}$$
where $C_v$ is the coefficient of variation of the round-trip delay, $\sigma$ is the standard deviation of the round-trip delay and $\mu$ is its mean. Generally, the higher the average level of a variable, the larger the measure of its dispersion, and vice versa. During statistical analysis, if the coefficient of variation exceeds a preset threshold, the data are considered possibly abnormal (for example, the node carries excessive traffic or performs load balancing) and are deleted. This completes the feature labeling and classification of the known links; the data sets for the corresponding communication link types are obtained and the feature database is constructed.
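As a rough illustration of the cleaning step, the snippet below computes $C_v = \sigma / \mu$ per hop and drops hops above a threshold. The threshold value 0.5, the use of the population standard deviation, the function names and the sample delays are assumptions made for this example; the text only states that a preset threshold is used.

```python
from statistics import mean, pstdev

def coefficient_of_variation(rtts):
    """Return C_v = sigma / mu for a list of round-trip delays (ms)."""
    mu = mean(rtts)
    if mu == 0:
        raise ValueError("mean RTT is zero; C_v is undefined")
    return pstdev(rtts) / mu

def drop_unstable_hops(hop_rtts, threshold=0.5):
    """Keep only hops whose C_v does not exceed the (assumed) threshold."""
    return {hop: rtts for hop, rtts in hop_rtts.items()
            if coefficient_of_variation(rtts) <= threshold}

# Hypothetical samples: the second hop fluctuates heavily and is removed.
samples = {
    "10.0.0.1": [10.0, 10.2, 9.8, 10.1],
    "10.0.0.2": [5.0, 60.0, 7.0, 120.0],
}
cleaned = drop_unstable_hops(samples)
```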
The data probed from the different terminals correspond to links of the different categories. Features are extracted and labeled for each category of link and the number of features is counted; the extracted features include round-trip delay, TTL value, IP geolocation, probe time, probe packet size and the label of the link they belong to, and they are stored in the feature database by link category.
S12, preprocessing all link data: because some nodes in the network topology do not return responses to active probes, the probed link data inevitably contain anonymous nodes; the anonymous nodes missing from a link are therefore filled in with private addresses according to the TTL values of the link, restoring the complete link. To resolve the ambiguity of router interfaces in actively probed data, the open-source alias resolution tool Kapar is used to perform alias resolution on the path data; the result is compared with the original path, and IP addresses are associated with routing information to obtain the feature data of the different links;
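One way to picture the anonymous-node completion is shown below. It is a sketch under assumptions: the input format (a mapping from probe TTL to the responding address), the placeholder range starting at 10.255.0.0 and the function name are all hypothetical, and the alias resolution performed with Kapar is not modeled here.

```python
import ipaddress

def fill_anonymous_hops(responses, base="10.255.0.0"):
    """Rebuild a full path from TTL -> IP replies, filling gaps with private placeholders.

    responses: dict mapping probe TTL to the responding IP; a missing TTL means no reply.
    """
    next_placeholder = int(ipaddress.IPv4Address(base))
    path = []
    for ttl in range(1, max(responses) + 1):
        ip = responses.get(ttl)
        if ip is None:
            # Anonymous hop: assign the next unused private address.
            next_placeholder += 1
            ip = str(ipaddress.IPv4Address(next_placeholder))
        path.append((ttl, ip))
    return path

# Hop 3 did not answer the probe, so it receives a placeholder address.
probe = {1: "192.0.2.1", 2: "198.51.100.7", 4: "203.0.113.9"}
path = fill_anonymous_hops(probe)
```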
S13, for the links in the feature database, calculating the extreme value of the round-trip delay of each hop node and the mean of its variance; the minimum link delay represents the minimum delay of the communication link under ideal conditions, and the mean of the variance represents the delay fluctuation of the link. The extreme value of the round-trip delay and the mean of the variance are fitted as important features, and polynomial fitting, a common way of approximating complex functions, is used to approximate the round-trip delay function. Specifically, assuming a link contains N hops in total, the training data set is $\{(1, y_1), (2, y_2), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}$, where $x_i$ is the link hop count and $y_i$ is the link round-trip delay; an M-order polynomial fit is performed on all links of the different categories:
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
Solving the above formula gives the polynomial coefficients in vector form:
$$\mathbf{w} = (w_0, w_1, \ldots, w_M)^{T}$$
The polynomial coefficients are tagged with the corresponding class label (optical fiber network, mobile network, microwave network or satellite network) and stored in the link feature database as the training data set.
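The M-order fit of round-trip delay against hop count can be illustrated with a small least-squares routine. This is a sketch, not the patented procedure: it solves the normal equations directly in plain Python, which is adequate for low orders, and the hop counts and delays are synthetic values invented for the example.

```python
def polyfit(xs, ys, order):
    """Least-squares polynomial fit; returns [w_0, ..., w_M] via the normal equations."""
    n = order + 1
    # Normal equations A w = b with A[j][k] = sum x^(j+k) and b[j] = sum y * x^j.
    a = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * w[k] for k in range(row + 1, n))
        w[row] = (b[row] - tail) / a[row][row]
    return w

hops = [1, 2, 3, 4, 5, 6]
rtts = [2.0 + 0.5 * x + 0.25 * x * x for x in hops]  # synthetic delays in ms
w = polyfit(hops, rtts, order=2)  # coefficient vector used as the link feature
```

For higher orders the normal equations become ill-conditioned, so a numerically more robust solver would be the usual choice.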
S2, constructing a link identification model based on a gradient boosting decision tree, training the model with the training data set, and repeatedly adjusting the training parameters to improve classification accuracy; the specific implementation is as follows:
acquiring training data of the different categories (optical fiber network, mobile network, microwave network and satellite network) from the link feature database, and training on a large amount of data on the basis of GBDT feature combination; five-fold cross-validation is used during training: the data are randomly divided into five equal parts, and in each experiment one part serves as the test set and the rest as the training set;
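The five-fold split described above can be sketched as follows. The function name, the fixed random seed and the integer stand-ins for samples are assumptions for illustration; training and scoring the classifier on each pair is left out.

```python
import random

def five_fold_splits(samples, seed=0):
    """Yield (train, test) pairs; each of the five folds serves as the test set once."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

data = list(range(20))  # stand-ins for labeled link feature vectors
splits = list(five_fold_splits(data))
```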
defining the error function $E_{RMS}$ as the root mean square error between the fitting function $y(x, \mathbf{w})$ and the original data:
$$E_{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl( y(x_i, \mathbf{w}) - y_i \bigr)^2}$$
The accuracy and the error function on the training set and the test set are analyzed, and optimization is carried out by repeatedly adjusting the polynomial order, adjusting model training parameters such as the number of training iterations and the training step size, adding training data, and so on. When the accuracy on both the training set and the test set is low, the model is underfitting, which is addressed by increasing the polynomial order, the number of iterations and the training step size; when the accuracy on the training set is high but the accuracy on the test set is low, the model is overfitting, which is addressed by reducing the polynomial order, the number of iterations and the training step size, and by adding training data. Provided the accuracy on both the test set and the training set meets the requirement, the polynomial coefficients and model training parameters that give the smallest error function are selected.
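The error function and the selection of the order with the smallest error can be illustrated as below. The helper names, the candidate coefficient vectors and the data points are hypothetical; in practice the candidates would come from fitting each order to the training folds, and the accuracy requirement would be checked as well.

```python
import math

def poly_eval(w, x):
    """Evaluate y(x, w) = sum_j w_j * x^j."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def rmse(w, xs, ys):
    """Root mean square error between the fitted polynomial and the data."""
    sq = [(poly_eval(w, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))

def pick_best(candidates, xs, ys):
    """candidates: dict order -> coefficient list; returns the order with the smallest RMSE."""
    return min(candidates, key=lambda m: rmse(candidates[m], xs, ys))

xs, ys = [1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0]   # data lying on y = 1 + 2x
candidates = {0: [6.0], 1: [1.0, 2.0]}         # order 0 (mean) and order 1 fits
best_order = pick_best(candidates, xs, ys)
```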
The data fitting results for the different link types are shown in FIG. 3: the optical fiber link fit is at the upper left, the mobile link fit at the upper right, the microwave link fit at the lower left and the satellite link fit at the lower right.
S3, adding corresponding auxiliary features based on the combined features to obtain an accurate identification result: interpretable data-to-result combined features are obtained from the link identification model, the corresponding auxiliary features are derived from them, and the auxiliary features are added to the link identification to obtain an accurate identification result.
The specific implementation is as follows:
S31, because a real network has a large topology, a complex structure and bridge nodes, analysis and training take a long time, so the information needs to be fused: nodes that are structurally similar across different partial topologies are found, and identical nodes are merged;
S32, in order to analyze link characteristics through the mutation nodes that arise during link transmission, the state mutation nodes in the link are acquired before link identification: an optimization problem that infers link information from path information is constructed using methods such as statistics and signal processing, and the lowest-cost sequence is obtained by solving it, which yields the state mutation nodes;
Let $f_0(x)$ be the exponential density function associated with the initial state. Assuming that the link types between the nodes are similar, the transmission rate is
Figure BDA0003847140970000081
where $\Delta s$ is the transmission interval and $T$ is the transmission time. For each subsequent state, the rate $\alpha_i$ is obtained through a scaling function $s$ such that
Figure BDA0003847140970000082
As $i$ gets smaller, the expected message arrival rate decreases. Given the sequence of intervals between message arrivals $x = (x_0, x_1, \dots, x_n)$, the state sequence
Figure BDA0003847140970000083
that minimizes the cost is sought, where the cost $c(q \mid x)$ of a state sequence $q$ is
Figure BDA0003847140970000084
Here $\tau(i_t, i_{t+1})$ is the cost of the bursty state transition from the lower intensity $i_t$ to the higher intensity $i_{t+1}$. The transition cost is proportional to the change across the intermediate states; for example, an increase from
Figure BDA0003847140970000085
to
Figure BDA0003847140970000086
incurs a cost of $\tau(i_1, i_2)$.
Figure BDA0003847140970000087
denotes the exponential density function associated with state $i_t$. The minimum cost value of each state node in the link is calculated, and whether the current state node is a mutation node of the link is determined from the state change rule; a node that exhibits the mutation rule is a mutation node;
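The minimum-cost state sequence of step S32 can be found by dynamic programming. The sketch below is not the patent's formula, whose equation images are not reproduced in this text. It assumes the cost has the form used in Kleinberg's burst-detection automaton: rates $\alpha_i = \alpha_0 s^i$ with $\alpha_0 = n/T$, an upward transition cost $\tau(i, j) = (j - i)\,\gamma \ln n$, free downward transitions, and a per-interval cost $-\ln f_i(x)$. The names `min_cost_states` and `mutation_points` and the parameters `k`, `s` and `gamma` are hypothetical.

```python
import math


def min_cost_states(gaps, k=2, s=2.0, gamma=1.0):
    """Viterbi search for the state sequence with the lowest total cost.

    Assumed model (Kleinberg-style, not confirmed by the patent text):
    state i emits gaps with density f_i(x) = a_i * exp(-a_i * x).
    """
    n = len(gaps)
    alpha0 = n / sum(gaps)                      # base arrival rate
    alphas = [alpha0 * s ** i for i in range(k)]

    def emit(i, x):                             # -ln f_i(x)
        return alphas[i] * x - math.log(alphas[i])

    def tau(i, j):                              # cost of moving up only
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # cost[j]: cheapest sequence so far that ends in state j (start in 0)
    cost = [tau(0, j) + emit(j, gaps[0]) for j in range(k)]
    back = []
    for x in gaps[1:]:
        prev = cost
        choices = [min(range(k), key=lambda i: prev[i] + tau(i, j))
                   for j in range(k)]
        cost = [prev[choices[j]] + tau(choices[j], j) + emit(j, x)
                for j in range(k)]
        back.append(choices)

    state = min(range(k), key=lambda j: cost[j])
    best = cost[state]
    seq = [state]
    for choices in reversed(back):              # backtrack to the start
        state = choices[state]
        seq.append(state)
    seq.reverse()
    return seq, best


def mutation_points(seq):
    """Indices at which the inferred state changes."""
    return [t for t in range(1, len(seq)) if seq[t] != seq[t - 1]]
```

Calling `min_cost_states` on a list of inter-arrival gaps returns the state sequence and its cost, and `mutation_points` then gives the positions at which the state changes, which correspond to candidate mutation nodes under these assumptions.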
S33, the link identification model obtained by the preliminary training has certain limitations, because links of different types can show similar states in different environments; auxiliary features are therefore added to the model obtained in the previous step, by analyzing the structural similarity and the mutation of the nodes in the link, to correct the classification;
In general, network access is either wired or wireless, and either way of accessing the Internet involves a corresponding protocol conversion. The performance of an access link is affected not only by factors such as transmission distance and weather but also by protocol conversion and the radio resource control (RRC) mechanism, and these performance differences ultimately produce different probe delay results for wired and wireless links. For example, within a certain transmission range, the delay extremum of an optical fiber link can be close to 0 milliseconds and is very stable; the delay of a mobile link is affected by the RRC mechanism; the mean delay of a satellite link exceeds 500 milliseconds; and a microwave link is limited in distance and its delay is strongly affected by weather. Because optical fiber, mobile and microwave links can show similar characteristics under the influence of different factors, auxiliary features that reflect the characteristics of each link type are added to the link identification model for classification correction:
Considering the RRC mechanism of mobile link transmission: the mobile network divides the transmission state into several states, including an idle state and a connected state. To reduce power consumption, the mobile network may switch to the idle state when no signal has been transmitted for a period of time, and a state transition is required when a signal is received in the idle state. As a result, when probing at certain time intervals, the round-trip delay extremum and the mean of the variance of a wireless network are higher than those of a wired network, and obvious jumps in the delay can be observed. Fig. 4 shows the change in round-trip delay with and without RRC state transitions: the left graph shows the probe delay without RRC state transitions, and the right graph shows the probe delay with RRC state transitions. Some optical fiber links and mobile links are corrected according to the RRC mechanism: the round-trip delays and corresponding probe times of the links preliminarily classified as optical fiber or mobile are retrieved from the characteristic database, the round-trip delay extremum and the mean of the variance are calculated, and the delays are arranged in increasing order of probe time to analyze whether the round-trip delay of a link jumps at different time intervals. If a link preliminarily classified as an optical fiber link shows delay jumps, its class is corrected to mobile link; if a link preliminarily classified as a mobile link shows no delay jumps, its class is corrected to optical fiber link.
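The RRC-based correction can be sketched as follows. The sketch assumes that a jump means a change of more than `jump_ms` between consecutive probes, a hypothetical threshold of 50 ms that the patent does not specify, and that a mobile-labelled link without jumps is reassigned to the optical fiber class as the counterpart of the first rule. The function names and label strings are illustrative.

```python
def has_delay_jump(samples, jump_ms=50.0):
    """Return True if consecutive RTTs differ by more than jump_ms.

    samples: iterable of (probe_time, rtt_ms) pairs; they are sorted by
    probe time first, as the correction step requires.
    """
    rtts = [rtt for _, rtt in sorted(samples)]
    return any(abs(b - a) > jump_ms for a, b in zip(rtts, rtts[1:]))


def correct_by_rrc(label, samples, jump_ms=50.0):
    """Re-label fiber/mobile links by the presence of delay jumps."""
    if label not in ("fiber", "mobile"):
        return label                  # other classes are left untouched
    return "mobile" if has_delay_jump(samples, jump_ms) else "fiber"
```

A single threshold on consecutive differences is the simplest possible jump test; a practical implementation would more likely compare probes taken at different intervals, since the RRC inactivity timer determines when the jump appears.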
Considering the difference in link transmission distance: the transmission distance of an optical fiber network is far greater than that of mobile communication and microwave communication, and that of microwave communication is far greater than that of mobile communication. Experiments show that long-distance optical fiber transmission can exhibit a delay pattern similar to that of a mobile network. As shown in Fig. 5, the seventh-to-eighth-hop link drawn with shorter line segments shows the delay fluctuation of mobile link transmission, and the seventh-to-eighth-hop link drawn with longer line segments is an optical fiber link spanning more than one thousand kilometers. Some mobile links and microwave links are therefore corrected according to transmission distance: the IP geographic positions of the links preliminarily classified as mobile or microwave are retrieved from the characteristic database and the transmission distance of each link is calculated; if the transmission distance of a link preliminarily classified as mobile or microwave exceeds 100 km, its class is corrected to optical fiber link.
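The distance-based correction can be illustrated with a great-circle distance between the geolocated endpoints of a link. The patent does not state how the transmission distance is computed, so the haversine formula and a mean Earth radius of 6371 km are assumptions here, as are the function names and label strings. Only the 100 km threshold comes from the text.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km (assumed constant)
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def correct_by_distance(label, src, dst, limit_km=100.0):
    """Re-label mobile/microwave links longer than limit_km as fiber.

    src, dst: (lat, lon) of the two link endpoints from IP geolocation.
    """
    if label in ("mobile", "microwave") and haversine_km(*src, *dst) > limit_km:
        return "fiber"
    return label
```

IP geolocation is often accurate only to city level, so a check of this kind is most reliable for links that far exceed the threshold.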
Considering the influence of weather on link transmission: because a microwave network is strongly affected by weather, its delay fluctuates noticeably more on overcast or rainy days than on sunny days. Some optical fiber and microwave links are corrected according to long-term weather variation: the round-trip delays and probe times of the links preliminarily classified as optical fiber or microwave are retrieved from the characteristic database, the weather conditions at the time of probing are obtained from the probe times, and whether the weather affects the round-trip delay is analyzed. If the round-trip delay of a link preliminarily classified as an optical fiber link fluctuates noticeably more in overcast or rainy weather than on sunny days, its class is corrected to microwave link; if the round-trip delay of a link preliminarily classified as a microwave link does not fluctuate noticeably more in overcast or rainy weather than on sunny days, its class is corrected to optical fiber link.
The final link identification model is obtained through the above corrections. The test data were classified with the model before and after correction and the classification accuracy was compared; the overall identification accuracy of the corrected model improves, as shown in Table 1.
Table 1. Accuracy before and after correction using auxiliary features
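The weather-based correction can be sketched by comparing delay variance under the two weather conditions. The patent does not quantify what counts as obvious fluctuation, so the rule used here, rainy-day variance more than `ratio` times the sunny-day variance with a hypothetical ratio of 2, is an assumption, as are the weather tags `"rain"` and `"clear"` and the function names.

```python
from statistics import pvariance


def weather_sensitive(samples, ratio=2.0):
    """Compare RTT variance in rainy and clear weather.

    samples: iterable of (weather, rtt_ms) pairs, weather in {"rain", "clear"}.
    Returns True/False, or None when either group has fewer than 2 probes.
    """
    rainy = [rtt for w, rtt in samples if w == "rain"]
    clear = [rtt for w, rtt in samples if w == "clear"]
    if len(rainy) < 2 or len(clear) < 2:
        return None
    return pvariance(rainy) > ratio * pvariance(clear)


def correct_by_weather(label, samples, ratio=2.0):
    """Re-label fiber/microwave links by their sensitivity to weather."""
    if label not in ("fiber", "microwave"):
        return label
    verdict = weather_sensitive(samples, ratio)
    if verdict is None:
        return label                  # not enough data to decide
    return "microwave" if verdict else "fiber"
```

Returning the original label when either weather group has too few probes avoids correcting a link on the basis of insufficient data.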
Figure BDA0003847140970000091
Under limited hardware conditions, the invention takes into account the attribute differences between link types and accurately identifies the link types in a heterogeneous network, including optical fiber links, mobile links, satellite links and microwave links. Verification is carried out through classifier model training and on networks of known link classes.
Table 2 shows the accuracy, precision, recall and F1 score obtained by five-fold cross-validation on the training data set using the above link identification scheme. The F1 score is the harmonic mean of precision and recall and evaluates the two jointly to measure classifier performance.
Table 2. Results of five-fold cross-validation
Figure BDA0003847140970000101
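The four metrics reported in the tables can be computed from true and predicted labels as in the sketch below. It evaluates one class at a time, and how the patent aggregates the per-class values across the four link types is not stated. The function name and label strings are illustrative.

```python
def class_metrics(y_true, y_pred, positive):
    """Accuracy, plus precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier cannot reach a high F1 score by trading recall for precision or the reverse.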
Table 3 shows the accuracy, precision, recall and F1 score obtained when the above link identification scheme is applied to a known network environment containing the four link types.
Table 3. Results of identifying links in a known network
Figure BDA0003847140970000102
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the invention, and it should be understood that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of this disclosure without departing from its spirit, and such modifications and combinations remain within the scope of protection of the invention.

Claims (4)

1. A link identification method based on gradient boosting decision tree feature combination, characterized by comprising the following steps:
S1, constructing a link characteristic database and preprocessing: a target IP address is probed by active probing to obtain raw probe data, and a link characteristic database is constructed; outliers in the probe data are then cleaned using a network behavior analysis method, and the link data are preprocessed to obtain a training data set;
S2, constructing a link identification model based on a gradient boosting decision tree, and training the model with the training data set;
S3, adding auxiliary features based on the combined features to obtain an accurate identification result: interpretable data-to-result combined features are obtained from the link identification model, the corresponding auxiliary features are derived from these combined features, and the auxiliary features are added to the link identification to obtain an accurate identification result.
2. The link identification method based on gradient boosting decision tree feature combination according to claim 1, wherein step S1 is specifically implemented as follows:
S11, the server IP addresses of a plurality of known networks are actively probed with the open-source network probing tool scamper, and links of different types are probed in a targeted manner in an experimental environment to obtain raw link data; according to the experimental environment, the links are divided into four classes, namely optical fiber links, mobile links, satellite links and microwave links, whose data are collected through a server terminal, a mobile phone terminal, a satellite terminal and a microwave terminal respectively; scamper is then used to probe the known receiving terminals multiple times from the mobile phone terminal, the server terminal, the satellite terminal and the microwave terminal respectively, so as to obtain the raw probe data; outliers in the probe data are cleaned using a network behavior analysis method;
S12, all link data are preprocessed: anonymous hops missing from a link are filled in with private addresses according to the TTL values of the link, so as to restore the complete link; alias resolution is performed on the path data with the open-source alias resolution tool Kapar, the result is compared with the original path, and IP addresses are associated with routing information to obtain the characteristic data of the different links;
S13, for the links in the characteristic database, the extremum of the round-trip delay of each hop node and the mean of the variance are calculated, where the minimum link delay represents the minimum delay of the communication link under ideal conditions and the mean of the variance represents the delay fluctuation of the link; the extremum of the round-trip delay and the mean of the variance are taken as the key features for fitting, and polynomial fitting is adopted as the approximation method for the complex round-trip delay function, specifically: assuming a link contains $N$ hops in total, the training data set is $\{(1, y_1), (2, y_2), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$, where $x_i$ is the hop number of the link and $y_i$ is the link round-trip delay, and an $M$-order polynomial is fitted to all links of each class:
Figure FDA0003847140960000011
Solving the above formula gives the polynomial coefficients in vector form
Figure FDA0003847140960000012
The polynomial coefficients are labelled with the corresponding class (optical fiber network, mobile network, microwave network or satellite network) and then stored in the link characteristic database as the training data set.
3. The link identification method based on gradient boosting decision tree feature combination according to claim 1, wherein step S2 is specifically implemented as follows:
training data of the different classes, namely optical fiber network, mobile network, microwave network and satellite network, are obtained from the link characteristic database, and the model is trained on a large amount of data on the basis of GBDT feature combination; five-fold cross-validation is used during training: the data are randomly divided into five equal parts, and in each experiment one part is taken as the test set and the rest as the training set;
an error function
Figure FDA0003847140960000021
is defined as the root mean square error between the fitting function
Figure FDA0003847140960000022
and the original data:
Figure FDA0003847140960000023
the accuracy and the error function on the training set and the test set are analyzed, the model is optimized by repeatedly adjusting the polynomial order and the model training parameters, and, among the configurations whose training-set and test-set accuracy meet the requirement, the polynomial coefficients and model training parameters that minimize the error function are selected.
4. The link identification method based on gradient boosting decision tree feature combination according to claim 1, wherein step S3 is specifically implemented as follows:
S31, finding nodes that are structurally similar across different partial topologies, and merging identical nodes;
S32, obtaining the state mutation nodes in a link: an optimization problem that infers link information from path information is constructed, and the lowest-cost state sequence is obtained by solving it, which yields the state mutation nodes; given the sequence of intervals between message arrivals $x = (x_0, x_1, \dots, x_n)$, the state sequence
Figure FDA0003847140960000024
that minimizes the cost is sought, where the cost $c(q \mid x)$ of a state sequence $q$ is
Figure FDA0003847140960000025
where $\tau(i_t, i_{t+1})$ is the cost of the bursty state transition from the lower intensity $i_t$ to the higher intensity $i_{t+1}$, and
Figure FDA0003847140960000026
denotes the exponential density function associated with state $i_t$; the minimum cost value of each state node in the link is calculated, and whether the current state node is a mutation node of the link is determined from the state change rule, a node that exhibits the mutation rule being a mutation node;
S33, auxiliary features are added to the link identification model obtained in the previous step, by analyzing the structural similarity and the mutation of the nodes in the link, to carry out two-class correction, specifically:
some optical fiber links and mobile links are corrected according to the RRC mechanism: the round-trip delays and corresponding probe times of the links preliminarily classified as optical fiber or mobile are retrieved from the characteristic database, the round-trip delay extremum and the mean of the variance are calculated, and the delays are arranged in increasing order of probe time to analyze whether the round-trip delay of a link jumps at different time intervals; if a link preliminarily classified as an optical fiber link shows delay jumps, its class is corrected to mobile link; if a link preliminarily classified as a mobile link shows no delay jumps, its class is corrected to optical fiber link;
some mobile links and microwave links are corrected according to transmission distance: the IP geographic positions of the links preliminarily classified as mobile or microwave are retrieved from the characteristic database and the transmission distance of each link is calculated; if the transmission distance of a link preliminarily classified as mobile or microwave exceeds 100 km, its class is corrected to optical fiber link;
some optical fiber and microwave links are corrected according to long-term weather variation: the round-trip delays and probe times of the links preliminarily classified as optical fiber or microwave are retrieved from the characteristic database, and the weather conditions at the time of probing are obtained from the probe times; if the round-trip delay of a link preliminarily classified as an optical fiber link fluctuates noticeably more in overcast or rainy weather than on sunny days, its class is corrected to microwave link; if the round-trip delay of a link preliminarily classified as a microwave link does not fluctuate noticeably more in overcast or rainy weather than on sunny days, its class is corrected to optical fiber link.
CN202211122899.5A 2022-09-15 2022-09-15 Link identification method based on gradient lifting decision tree feature combination Pending CN116132300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211122899.5A CN116132300A (en) 2022-09-15 2022-09-15 Link identification method based on gradient lifting decision tree feature combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211122899.5A CN116132300A (en) 2022-09-15 2022-09-15 Link identification method based on gradient lifting decision tree feature combination

Publications (1)

Publication Number Publication Date
CN116132300A true CN116132300A (en) 2023-05-16

Family

ID=86299626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211122899.5A Pending CN116132300A (en) 2022-09-15 2022-09-15 Link identification method based on gradient lifting decision tree feature combination

Country Status (1)

Country Link
CN (1) CN116132300A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7490073B1 (en) * 2004-12-21 2009-02-10 Zenprise, Inc. Systems and methods for encoding knowledge for automated management of software application deployments
CN103168443A (en) * 2010-08-13 2013-06-19 高通股份有限公司 Feedback bundling for power-limited devices in wireless communications
WO2015034759A1 (en) * 2013-09-04 2015-03-12 Neural Id Llc Pattern recognition system
CN108270608A (en) * 2017-01-04 2018-07-10 中国科学院声学研究所 A kind of foundation of link prediction model and link prediction method
CN111985270A (en) * 2019-05-22 2020-11-24 中国科学院沈阳自动化研究所 sEMG signal optimal channel selection method based on gradient lifting tree
CN113591787A (en) * 2021-08-13 2021-11-02 广东电网有限责任公司 Method, device, equipment and storage medium for identifying optical fiber link component
CN114499632A (en) * 2021-12-30 2022-05-13 中国电信股份有限公司卫星通信分公司 Data transmission method based on fusion of heaven-earth satellite and broadband satellite

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
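Luo Wen: "Analysis and identification of data link signal characteristics", China Master's Theses Full-text Database, 3 December 2015 (2015-12-03) *
Jiang Xiaoyong: "Estimation of network link characteristic parameters based on end-to-end measurement", China Master's Theses Full-text Database, 1 September 2016 (2016-09-01) *
Zhao Jinlong, Gao Zhonghe, Jia Shengwen: "Network topology identification method based on end-to-end unicast measurement", Computer Engineering, 31 January 2012 (2012-01-31), pages 100 - 102 *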

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination