CN111985569B

CN111985569B - Anonymous node positioning method based on multi-source point clustering idea

Info

Publication number: CN111985569B
Application number: CN202010851544.4A
Authority: CN
Inventors: 夏勇; 栾吉海; 李宁; 张兆心; 赵东
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2022-10-14
Anticipated expiration: 2040-08-21
Also published as: CN111985569A

Abstract

The invention relates to an anonymous node positioning method based on a multi-source point clustering idea, which aims to reduce the interference of anonymous nodes in IP paths obtained by Traceroute on the real network routing nodes. The method comprises the following steps: acquiring domestic IP addresses together with their geographic positions and longitude and latitude; using the ping command to test the IP addresses for liveness and extracting the live IP addresses; storing the IP addresses whose geographic position was detected in a database; deploying a server near each clustering center obtained by the k-means algorithm, and carrying out traceroute detection on the target nodes in the same class; acquiring a time delay curve, extracting its features, performing hierarchical clustering, merging the IP paths obtained by traceroute according to the structure of the resulting hierarchical tree (dendrogram), merging points that may be the same anonymous node, and recording the IP addresses of the previous hop and the next hop; and calculating the center of the set formed by the previous-hop and next-hop IP addresses of the anonymous node pair, with the longitude and latitude calculated using the Euclidean distance, and taking this center as the physical position of the anonymous node.

Description

Anonymous node positioning method based on multi-source point clustering idea

Technical Field

The invention relates to the technical field of computers, in particular to an anonymous node positioning method based on a multi-source point clustering idea.

Background

For reasons of security and service characteristics, a large number of application services on the Internet operate anonymously, so a normal communication peer cannot determine the location of an information sender from its anonymous identifier. However, illegal information and spam on the network are often sent through anonymous communication, which can undermine network security, and national information regulators or individuals sometimes need to trace anonymous communication back to its origin in order to locate the source of such information. Because aliases and anonymous nodes interfere with measurement, the result of IP-level topology measurement differs to some degree from the real network environment. A method for positioning anonymous nodes is therefore urgently needed in order to reduce the interference of anonymous nodes in IP paths obtained by Traceroute on the real network routing nodes.

Disclosure of Invention

The invention provides an anonymous node positioning method based on a multi-source point clustering idea of k-means clustering and hierarchical clustering, which aims to reduce interference of anonymous nodes in an IP path obtained by Traceroute on real network routing nodes.

The invention provides an anonymous node positioning method based on a multi-source point clustering idea, which comprises the following steps:

A. acquiring domestic IP addresses, their geographic positions, and their longitude and latitude;

B. using the ping command to test the IP addresses for liveness, and extracting the live IP addresses;

C. storing the IP addresses whose geographic position was detected in a database, and attaching additional information as a further basis for classification;

D. deploying a server near each clustering center obtained by the k-means algorithm, and carrying out traceroute detection on the target nodes in the same class;

E. meanwhile, carrying out multiple Ping detections on each target IP from the source point to obtain a time delay curve, and extracting the features of the time delay curve;

F. performing hierarchical clustering on the extracted features to obtain a hierarchical tree (dendrogram), merging the IP paths obtained by traceroute according to the structure of the hierarchical tree, merging points that are the same anonymous node, and recording the IP addresses of the previous hop and the next hop;

G. calculating the center of the set formed by the previous-hop and next-hop IP addresses of the anonymous node pair, with the longitude and latitude calculated using the Euclidean distance, and taking this center as the physical position of the anonymous node.

Preferably, the specific method of step A is: judging whether an IP address is alive using Ping and similar commands, and acquiring longitude and latitude information using IPtoregin.

Preferably, the additional information in step C includes the economic level, the degree of development, and the city tier of a city.

Preferably, the specific steps of step C are as follows:

a. collecting the live IP addresses;

b. obtaining the longitude and latitude of the city through an external API (application programming interface), and recording the city;

c. recording publicly published city tiers using Python;

d. using longitude, latitude, and city tier as the clustering features;

e. setting the number of clusters of the K-means clustering method to 3, or automatically calculating an optimal K value through an algorithm.

Preferably, the K-means clustering method in step e specifically comprises selecting K objects from the data as initial clustering centers, and then:

(1) Calculating the distance from each object to every clustering center, and assigning the object to the nearest center;

(2) Recalculating each clustering center;

(3) Repeating steps (1) and (2) until the requirements are met.

Preferably, the specific steps of obtaining the delay curve and obtaining the curve characteristics in step E are as follows:

(A) Pinging a large number of IP addresses belonging to the same class multiple times within a short period, recording the variation of the time delay, and drawing a characteristic curve;

(B) Extracting the feature values of the curve by wavelet decomposition.

Preferably, step F performs hierarchical clustering on the characteristic curves, then fuses the paths obtained by a plurality of traceroutes according to the hierarchical clustering result and merges points that are the same anonymous node. The specific steps are as follows:

1) Performing hierarchical clustering on the feature values obtained by wavelet decomposition, and recording the printed output;

2) Selecting two Traceroute paths to be fused and performing anonymous fusion;

3) Recording the IP addresses of the previous hop and the next hop of the fused anonymous node set as the basis for positioning the anonymous node.

Preferably, the fusion criteria of step 2) are:

a) Merging anonymous nodes that have the same parent node and the same child node into one node;

b) Merging anonymous nodes that have no parent node but the same child node into one node;

c) Merging terminal anonymous nodes that have the same parent node but no child node into one node.
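The three fusion criteria above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `"*"` marker for an anonymous hop, the function name `merge_anonymous_hops`, and the choice to skip runs of consecutive anonymous hops are all assumptions made for illustration. A parent or child of `None` means the anonymous hop is the first or last hop of its path, which covers criteria b) and c).

```python
from collections import defaultdict

ANON = "*"  # assumed marker for a hop that did not reveal its IP address


def merge_anonymous_hops(paths):
    """Group anonymous hops that share the same (parent, child) pair.

    paths: list of traceroute paths, each a list of IP strings or ANON.
    Returns {(parent_ip, child_ip): [(path_index, hop_index), ...]}.
    parent_ip is None when the anonymous hop starts its path (criterion b),
    child_ip is None when it ends its path (criterion c).
    """
    groups = defaultdict(list)
    for p_idx, path in enumerate(paths):
        for h_idx, hop in enumerate(path):
            if hop != ANON:
                continue
            parent = path[h_idx - 1] if h_idx > 0 else None
            child = path[h_idx + 1] if h_idx + 1 < len(path) else None
            # Simplification: a neighbour that is itself anonymous gives no
            # usable key, so such hops are left unmerged in this sketch.
            if parent == ANON or child == ANON:
                continue
            groups[(parent, child)].append((p_idx, h_idx))
    return dict(groups)


example_paths = [
    ["10.0.0.1", ANON, "10.0.0.3"],
    ["10.0.0.1", ANON, "10.0.0.3", "10.0.0.9"],
    [ANON, "10.0.0.3"],
    ["10.0.0.1", ANON],
]
merged = merge_anonymous_hops(example_paths)
```

Each key of `merged` then stands for one fused anonymous node, and the key itself is the previous-hop and next-hop record that the later positioning step relies on.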

Preferably, the positioning of the anonymous node in the step G includes the specific steps of:

a) When the previous hop and the next hop of the anonymous node set contain two or more known IP addresses, a range is determined from the longitude and latitude coordinates of the known IP addresses using the Euclidean distance, and the center of the range is taken as the physical location of the anonymous node;

b) When the previous hop and the next hop of the anonymous node set contain only one known IP address, the coordinates of that IP address and of the destination node are averaged directly, and the result is taken as the physical location of the anonymous node.
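The two positioning cases can be sketched as follows. The patent does not give a formula for "the center of the range", so this sketch assumes it is the arithmetic mean of the known coordinates (the point that minimizes the sum of squared Euclidean distances to them). The function name `locate_anonymous_node` and the `(longitude, latitude)` tuple layout are hypothetical.

```python
def locate_anonymous_node(neighbor_coords, destination_coord=None):
    """Estimate the (longitude, latitude) of a fused anonymous node.

    neighbor_coords: coordinates of the known previous-hop and next-hop IPs.
    destination_coord: coordinates of the traceroute destination, used only
    when a single neighbour is known (case b).
    """
    if len(neighbor_coords) >= 2:
        # Case a): centre of the known neighbours (assumed to be the mean).
        points = list(neighbor_coords)
    elif len(neighbor_coords) == 1 and destination_coord is not None:
        # Case b): average the single known neighbour with the destination.
        points = [neighbor_coords[0], destination_coord]
    else:
        raise ValueError("not enough known coordinates to position the node")
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return (lon, lat)


case_a = locate_anonymous_node([(120.0, 30.0), (122.0, 32.0)])
case_b = locate_anonymous_node([(121.0, 31.0)], destination_coord=(117.0, 39.0))
```

Averaging raw longitude and latitude is only a reasonable approximation over short distances; a production version would likely need a proper geodesic midpoint.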

The beneficial effects of the invention are as follows: because aliases and anonymous nodes interfere with measurement, the result of routing-level topology measurement differs to some degree from the real network environment; by positioning the anonymous nodes, the invention can restore a network environment that is closer to the real one.

Drawings

FIG. 1 is a schematic flow chart of the operation of the present invention;

FIG. 2 is a schematic heat map of the live IP addresses detected by the present invention;

FIG. 3 is a schematic diagram of a data listing for hierarchical clustering in accordance with the present invention;

FIG. 4 is a schematic view of data visualization of hierarchical clustering in accordance with the present invention.

Detailed Description

The present invention is further described below with reference to the drawings and examples so that those skilled in the art can easily practice the present invention.

Example 1: the invention provides a method for positioning anonymous nodes. FIG. 1 shows the operation flow of the invention, which specifically comprises the following steps:

A. Obtain domestic IP addresses, their cities, and their longitude and latitude, in the format: IP address, geographic location, latitude and longitude.

B. Test the IP addresses for liveness using the ping command, and extract the live IP addresses.

C. Store the IP addresses whose geographic position can be detected in a database; other information can be attached to serve as a further basis for classification.

D. Deploy a server near each clustering center obtained by the k-means algorithm, and perform traceroute detection on the target nodes in the same class.

E. Meanwhile, perform Ping detection on each target IP from the source point multiple times, acquire a time delay curve, and extract features from the time delay curve.

F. Perform hierarchical clustering on the extracted features to obtain a hierarchical tree (dendrogram). Merge the IP paths obtained by traceroute according to the structure of the hierarchical tree. Merge the points that may be the same anonymous node, and record the IP addresses of the previous hop and the next hop.

G. Calculate the center of the set formed by the previous-hop and next-hop IP addresses of the anonymous node pair, with the longitude and latitude calculated using the Euclidean distance, and take this center as the physical position of the anonymous node.

The above is the basic flow of the present invention, and the specific flow of each step will be further described below:

In step A, the IP addresses and their longitude and latitude information are obtained. Because the IP address ranges of China are known, Ping and similar commands can be used to judge whether an IP address is alive, and the longitude and latitude information is obtained through IPtoregin, Evian Science and Technology, and other APIs.

In step B, the IP addresses are tested for liveness. Because an IP address is not always reachable, the existing IP addresses are tested for liveness again, which ensures the accuracy of the geographic positions of the anonymous nodes obtained later.
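A liveness check of this kind could look like the sketch below, which calls the system `ping` command from Python and parses the round-trip times from its output. It assumes a Linux or macOS style `ping` that accepts `-c <count>` and prints `time=<value> ms` per reply; the helper names `parse_rtts` and `is_alive` and the default count and timeout are illustrative choices, not values from the patent. The parsed round-trip times are also the raw samples from which a delay curve for step E could be drawn.

```python
import re
import subprocess

# Matches the per-reply latency field printed by common ping implementations,
# e.g. "time=12.3 ms" or "time<1 ms".
RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Return the round-trip times (in ms) found in raw ping output."""
    return [float(value) for value in RTT_PATTERN.findall(ping_output)]


def is_alive(ip, count=3, timeout_s=10):
    """Return True if the host answered at least one echo request.

    Assumes a Unix-like ping that accepts "-c <count>".
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), ip],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and bool(parse_rtts(result.stdout))


sample_output = (
    "64 bytes from 1.2.3.4: icmp_seq=1 ttl=54 time=12.3 ms\n"
    "64 bytes from 1.2.3.4: icmp_seq=2 ttl=54 time=11.8 ms\n"
    "2 packets transmitted, 2 received, 0% packet loss, time 1001ms\n"
)
sample_rtts = parse_rtts(sample_output)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on recorded output without network access.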

In step C, additional information is attached, which may include the economic level, the degree of development, and the city tier. The choice of the probing source point is important, but little information is available before it is chosen. The source point should, as far as possible, be a point with many surrounding IP nodes, so that a topology as complete as possible is obtained and the position of an anonymous node can be inferred more accurately. Economically developed locations have more surrounding IP addresses, so the level of economic development can be added to the classification features. The specific steps of step C are as follows:

a. Collect the live IP addresses.

b. Obtain the longitude and latitude of the city through an external API (application programming interface), and record the city.

c. Record the publicly published city tiers using Python; for example, Beijing, a first-tier city, is recorded as 1 and Harbin as 2. City tier is considered because, within the same region, using the more economically developed city as the center benefits the subsequent operations and their accuracy.

d. Use (longitude, latitude, city tier) as the clustering features.

e. Considering economic feasibility, K can be set to 3; alternatively, an optimal K value can be computed automatically by an algorithm.

The K-means clustering method of step e is further explained:

(1) Selecting K objects from the data as initial clustering centers;

(2) Calculating the distance from each clustering object to a clustering center for division;

(3) Calculating each cluster center again;

(4) Repeating steps (2) and (3) until the requirements are met.
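Steps (1) to (4) correspond to the standard K-means iteration, sketched below in dependency-free Python on (longitude, latitude, city tier) features. The sample coordinates, the choice of initial centers, and the stopping rule (stop when the centers no longer change) are assumptions for illustration; the patent does not say whether the features are rescaled before clustering, so none is applied here.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means: assign each point to its nearest centre, then update."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Step (2): assign every point to the nearest centre (Euclidean).
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Step (3): recompute each centre as the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's centre
        # Step (4): stop once the centres no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical (longitude, latitude, city tier) samples, K = 3.
samples = [
    (116.4, 39.9, 1), (117.2, 39.1, 2),
    (121.5, 31.2, 1), (120.2, 30.3, 2),
    (113.3, 23.1, 1), (114.1, 22.5, 1),
]
centers, labels = kmeans(samples, [samples[0], samples[2], samples[4]])
```

The resulting centers are the candidate locations near which the probing servers of step D would be deployed.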

In step E, a time delay curve is obtained and its features are extracted as follows:

(A) Ping a large number of IP addresses belonging to the same class multiple times within a short period, record the variation of the time delay, and draw a characteristic curve;

(B) Use wavelet decomposition to extract the feature values of the curve.

Further, regarding step (A): because network conditions fluctuate over time, intra-class detection from multiple source points must be completed within a short time. This kind of detection has the following advantages:

Because the number of nodes to be detected is very large, it cannot be guaranteed that a large number of time delay curves can be obtained at the same time. Intra-class detection therefore preserves local completeness and spreads the detection load.

Although the source points are selected in advance, so the clusters are not formed according to network conditions, IP addresses located in the same geographic area have similar characteristics. This also facilitates the later hierarchical clustering.
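The feature extraction of step (B) can be illustrated with a hand-written Haar wavelet decomposition. The patent's example uses the pywt package but does not state the wavelet basis, the decomposition level, or which coefficients form the feature vector, so the Haar basis, the two levels, and the "detail energy per level plus final approximation" feature layout below are all assumptions.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def delay_features(rtts, levels=2):
    """Feature vector of a delay curve.

    Returns the energy (sum of squares) of the detail coefficients at each
    level, followed by the final approximation coefficients.
    """
    if len(rtts) % (2 ** levels) != 0:
        raise ValueError("number of samples must be a multiple of 2**levels")
    approx = [float(v) for v in rtts]
    features = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.append(sum(d * d for d in detail))
    return features + approx


flat_curve = delay_features([10, 10, 10, 10])   # constant delay, no detail energy
bumpy_curve = delay_features([1, 3, 5, 11])     # fluctuating delay
```

A constant delay curve yields zero detail energy at every level, while fluctuations show up as non-zero energies, which gives the later clustering step something to separate on.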

Step F is the core of anonymous fusion: hierarchical clustering is performed on the characteristic curves, the paths obtained by a plurality of traceroutes are fused according to the hierarchical clustering result, and points that may be the same anonymous node are merged. The specific steps are as follows:

1) Perform hierarchical clustering on the feature values obtained by wavelet decomposition, and record the printed output;

2) Select two Traceroute paths to be fused and perform anonymous fusion, where the fusion criteria mainly comprise 3 points:

a) Merge anonymous nodes that have the same parent node and the same child node into one node;

b) Merge anonymous nodes that have no parent node but the same child node into one node;

c) Merge terminal anonymous nodes that have the same parent node but no child node into one node.

3) Record the IP addresses of the previous hop and the next hop of the fused anonymous node set as the basis for positioning the anonymous node.

The positioning of the anonymous node in step G distinguishes two cases:

a) The previous hop and the next hop of the anonymous node set contain two or more known IP addresses. The Euclidean distance is calculated from the longitude and latitude coordinates of the known IP addresses. Because several previous-hop and next-hop IP addresses determine a range, and the anonymous node must lie within that range, the center of the range can be taken as the physical location of the anonymous node.

b) When the previous hop or the next hop contains only one known IP address, the coordinates of that IP address and of the destination node are averaged directly and taken as the physical location. This is justified by the following two considerations:

i) Accuracy is ensured because only IP addresses within the same class are measured.

ii) The physical positions of the previous hop and the next hop are not too far apart, and this case generally arises only when the last node is anonymous, so the coordinates of the node are located using the previous hop and the destination node as the basis.

Example 2: this example probes the IP addresses within a class from only one source point:

Step 0. Preprocessing and k-means clustering. The clustering centers and the IP addresses within their classes are recorded. As shown in FIG. 2, the heat map of a selected portion of the live IP addresses shows that IP addresses are abundant near Shanghai, so this embodiment places the only probing source point in Shanghai.

Step 1. Python calls the system Ping command, more than 10 Ping operations are carried out simultaneously on the IP addresses of the intra-class objects, and a delay curve is drawn.

Step 2. The pywt package of Python is used to perform wavelet decomposition on the time delay curve and obtain the feature vector of the time delay.

Step 3. Python calls the system Traceroute command to acquire the Traceroute paths.

Step 4. The feature values are clustered using a Python hierarchical clustering package. The hierarchical clustering result and the hierarchical tree (dendrogram) are obtained, and anonymous fusion is carried out together with the traceroute paths; the results are shown in FIGS. 3 and 4.

Step 5. According to the first two columns of the data, the corresponding traceroute paths are selected from the database for fusion. The resulting data format is: { (anonymous node pair), (previous-hop and next-hop IP addresses) }

Step 6. The physical location of the anonymous node is determined using the previous-hop and next-hop information.
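The hierarchical clustering of step 4 can be illustrated with a small agglomerative clustering routine. The patent only refers to "a Python hierarchical clustering package" and does not name the linkage method, so the single-linkage rule, the function name `agglomerative`, and the toy feature vectors below are assumptions; the recorded merge list plays the role of the hierarchical tree.

```python
import math


def agglomerative(features, n_clusters):
    """Single-linkage agglomerative clustering.

    Returns (clusters, merges): clusters is a list of sorted index lists,
    merges records each fusion as (members_a, members_b, distance), i.e. the
    bottom-up structure of the hierarchical tree.
    """
    clusters = [[i] for i in range(len(features))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(features[i], features[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges


# Hypothetical delay-feature vectors for four target IP addresses.
feature_vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
groups, merge_history = agglomerative(feature_vectors, n_clusters=2)
```

Targets that end up in the same group have similar delay behaviour, so their traceroute paths are natural candidates to be selected for fusion in step 5.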

The above description is only for the purpose of illustrating preferred embodiments of the present invention and is not to be construed as limiting the present invention, and it is apparent to those skilled in the art that various modifications and variations can be made in the present invention. All modifications, equivalents, improvements and the like which come within the scope of the invention as defined by the claims should be understood as falling within the scope of the invention.

Claims

1. An anonymous node positioning method based on a multi-source point clustering idea is characterized by comprising the following steps:

E. meanwhile, carrying out Ping command detection on a target IP from a source point for multiple times to obtain a time delay curve, and carrying out feature extraction on the time delay curve;

2. The anonymous node location method based on the multi-source point clustering idea of claim 1, wherein the specific method of step A is as follows: judging whether an IP address is alive using Ping and similar commands, and acquiring longitude and latitude information using IPtoregin.

3. The anonymous node location method based on the multi-source clustering idea of claim 1, wherein the additional information in step C comprises economic level, development degree and city level of a city.

4. The anonymous node location method based on the multi-source point clustering idea of claim 3, wherein the specific steps of step C are as follows:

a. collecting surviving IP addresses;

c. recording city grades published on the network by using python;

5. The anonymous node location method based on the multi-source clustering idea of claim 4, wherein the K-means clustering method of step e comprises: selecting K objects from the data as initial clustering centers:

(2) Calculating each cluster center again;

(3) And repeating the steps until the requirements are met.

6. The anonymous node location method based on the multi-source point clustering idea of claim 1, wherein the specific steps of obtaining the delay curve and obtaining the curve characteristics in step E are as follows:

(A) Repeatedly Ping a large number of IP belonging to the same class in a short time, recording the variation of time delay and drawing a characteristic curve;

(B) Wavelet decomposition is used to extract the feature values of the curve.

7. The method for locating anonymous nodes based on the idea of multisource point clustering according to claim 6, wherein the step F carries out hierarchical clustering on the characteristic curve, then carries out fusion on paths obtained by a plurality of corresponding traceroutes according to the result of the hierarchical clustering, and merges the points which are the same anonymous node, and the method comprises the following specific steps:

2) Selecting two Traceroute paths to be fused for anonymous fusion;

3) And recording the IP addresses of the previous hop and the next hop of the fused anonymous node set as a basis for positioning the anonymous node.

8. The anonymous node location method based on the multi-source point clustering idea of claim 7, wherein the fusion criteria of step 2) are:

9. The anonymous node positioning method based on the multi-source point clustering idea of claim 1, wherein the anonymous node positioning in the step G specifically comprises the following steps: