WO2017177591A1 - Method for analyzing source and destination of internet traffic

Info

Publication number: WO2017177591A1
Application number: PCT/CN2016/095672
Authority: WO
Inventors: 张大顺
Original assignee: 上海牙木通讯技术有限公司
Priority date: 2016-04-14
Filing date: 2016-08-17
Publication date: 2017-10-19
Also published as: GB2564057A; CN105704260A; JP7075348B2; RU2702048C1; CN105704260B; JP2019514303A

Abstract

The present invention provides a method for analyzing the source and destination of Internet traffic. In the method, the source and destination of Internet traffic are obtained by processing DNS logs. The method comprises the following steps: a log filtering step of filtering DNS logs that cannot reflect a real access path of a user; a log segmentation step of sequentially segmenting, according to a source IP, a timestamp difference, and a central domain, the DNS logs obtained after the log filtering step to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths. By means of the analysis method of the present invention, the source and destination of Internet traffic can be mastered, so that analysis and optimization of website traffic can be better facilitated. Furthermore, by completely knowing about the flow direction state of the entire Internet traffic, the traffic state of other websites can be analyzed and understood from the whole perspective.

Description

Method for analyzing the source and destination of Internet traffic

Technical field

The invention relates to the field of Internet DNS domain name resolution, and in particular to a method for analyzing the source and destination of Internet traffic.

Background art

The source and destination of Internet traffic refers to a user's sequence of website visits, that is, which website the user visited first and which websites the user visited afterwards. At present there is only one mainstream way to collect statistics on the traffic of one's own website: adding JavaScript monitoring code to the pages of the site. The most common tools are third-party analytics services such as Google Analytics and Baidu Statistics.

These statistical methods have significant limitations. Each website can only learn about visits to itself; it cannot learn which other websites a visitor has visited, nor where the visitor goes after leaving. DNS (Domain Name System) is a distributed database on the Internet that maps domain names and IP addresses to each other. It lets users access the Internet more conveniently without having to remember the numeric IP addresses that machines read directly. In DNS domain name resolution, when a user wants to visit a website, the user enters the domain name of the website in the browser and presses Enter; the browser first issues a DNS request, obtains through DNS the IP address of the server corresponding to the domain name, and then sends an HTTP request to that IP address.

A DNS log records the response to each DNS request and can therefore record the domain names requested by all users. However, the log also contains a great deal of abnormal and invalid information. For example, servers also initiate DNS requests and generate large amounts of domain name information, and Internet crawlers and even network attacks generate large numbers of DNS requests. Such requests do not truly and effectively reflect a user's real access path.

At present, there is no method on the market that analyzes the complete access path of Internet visitors well, and the present invention fills this gap. It reprocesses DNS logs to determine which websites a website's traffic comes from and which websites the traffic goes to after leaving.

Summary of the invention

In view of the above drawbacks, the present invention proposes a method for analyzing the source and destination of Internet traffic. By removing non-human access behavior from the logs as far as possible, the method can effectively obtain the source and destination of Internet traffic.

The method of the present invention for analyzing the source and destination of Internet traffic obtains the source and destination of Internet traffic by processing DNS logs, and includes the following steps:

a log filtering step of filtering out DNS logs that do not reflect the real access path of the user; a log segmentation step of sequentially segmenting the DNS logs obtained after the log filtering step according to the source IP, the timestamp difference, and the central domain, to obtain segmented access paths; and a data summarization step of summarizing all of the segmented access paths.

Preferably, in the log filtering step, a blacklist and a whitelist are set so that DNS logs containing requests for domain names of interest are retained and DNS logs containing non-human domain name requests generated by servers are removed.

Preferably, removing DNS logs further includes removing logs of enterprise IP access and removing logs that have no resolved IP.

Preferably, segmenting the DNS logs according to the source IP means obtaining the continuous DNS logs of the same source IP over a period of time.

Preferably, segmenting according to the timestamp difference means that the logs already segmented by source IP are further segmented according to the difference between the timestamps of the DNS logs: if the difference between the timestamps of two DNS logs is greater than a specified length of time, the two DNS logs are cut apart.

Preferably, the specified length of time is 3 seconds.

Preferably, the step of segmenting the DNS logs according to the timestamp difference further includes a merging step: the domain names in the access path obtained by the segmentation are converted into domains, and consecutive identical domains are merged to obtain the path of the source IP.

Preferably, segmenting according to the central domain means segmenting the path of the source IP around the central domain, and the access path obtained after the segmentation is: source domain n+...+source domain 1+central domain+destination domain 1+...+destination domain n, where the central domain is the domain to be analyzed according to user/system requirements.

Preferably, in the data summarization step, all access paths of the source IP obtained in the central-domain segmentation step are summarized.

With the analysis method of the invention, the source and destination of Internet traffic can be determined, which helps a website analyze and optimize its traffic. Further, with a complete view of how traffic flows across the Internet, the traffic of other websites can be analyzed and understood from a global perspective, so that an operator knows both its own site and others.

Brief description of the drawings

FIGS. 1(a) and 1(b) are flowcharts of the method for analyzing the source and destination of Internet traffic according to the present invention;

FIGS. 2(a) and 2(b) are diagrams showing the traffic sources obtained by the method for analyzing the source and destination of Internet traffic of the present invention.

Detailed description

Hereinafter, the invention will be described in detail with reference to the accompanying drawings and embodiments. The following examples are not intended to limit the invention. Variations and advantages that can be conceived by those skilled in the art are included in the present invention without departing from the spirit and scope of the invention.

As mentioned above, DNS (Domain Name System) is a distributed database on the Internet that maps domain names and IP addresses to each other, letting users access the Internet more conveniently without having to remember the numeric IP addresses that machines read directly. When a user visits a website, the user first enters the domain name of the website in the browser and presses Enter; the browser first issues a DNS request, obtains through DNS the IP address of the server corresponding to the domain name, and then sends an HTTP request to that IP address. This is DNS domain name resolution.

In the process of domain name resolution described above, a DNS log is generated. The DNS log records the response content of each DNS request, and can record the domain name information requested by all users. The format of the DNS log is as follows:

14.***.***.10| www.baidu.com |20141211035932|180.***.***.107;180.***.***.108|0

Source IP|Domain Name|Timestamp|Resolve IP|Status Code

That is, the DNS log includes five parts: "source IP", "domain name", "timestamp", "resolved IP" and "status code".
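The five-field format above can be parsed with a few lines of code. The following is a minimal sketch, not the patented implementation; the names `DnsLogRecord` and `parse_dns_log_line` are hypothetical, and it assumes that fields are separated by `|`, that multiple resolved IPs are separated by `;`, and that the timestamp is a 14-digit `YYYYMMDDhhmmss` string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class DnsLogRecord:
    """One DNS log entry; the class and field names are illustrative."""
    source_ip: str
    domain_name: str
    timestamp: datetime
    resolved_ips: List[str]
    status_code: str


def parse_dns_log_line(line: str) -> DnsLogRecord:
    """Parse 'Source IP|Domain Name|Timestamp|Resolved IP|Status Code'."""
    fields = [field.strip() for field in line.strip().split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    source_ip, domain_name, timestamp, resolved, status_code = fields
    return DnsLogRecord(
        source_ip=source_ip,
        domain_name=domain_name.lower(),
        # Assumption: timestamps are 14 digits, YYYYMMDDhhmmss.
        timestamp=datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        # Several resolved IPs are separated by ';'; an empty field gives [].
        resolved_ips=[ip for ip in resolved.split(";") if ip],
        status_code=status_code,
    )


record = parse_dns_log_line(
    "14.***.***.10| www.baidu.com |20141211035932|180.***.***.107;180.***.***.108|0"
)
print(record.domain_name, record.timestamp.isoformat())
```

Keeping the resolved IPs as a list makes it easy for a later filtering stage to recognize entries that have no resolved IP.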

Since the DNS log contains the domain names requested by all users, the inventors conceived of analyzing the source and destination of a website's traffic by reprocessing DNS logs. However, the DNS log also contains a great deal of abnormal and invalid information. For example, servers also initiate DNS requests and generate large amounts of domain name information, and Internet crawlers and even network attacks generate large numbers of DNS requests. Such requests do not truly and effectively reflect a user's real access path. For this reason, the inventors conceived of obtaining the source and destination of Internet traffic effectively by removing non-human access behavior from the log as far as possible.

FIG. 1 is a flowchart of the method for analyzing the source and destination of Internet traffic according to the present invention. As shown in FIG. 1, the method includes the following steps.

First, DNS logs that do not reflect the user's real access path are filtered out (step S1). As mentioned above, DNS logs include many domain name requests that do not truly and effectively reflect the user's real access path, so they need to be cleaned. For example, a blacklist and a whitelist are set: the blacklist removes non-human domain name requests generated by servers, and the whitelist retains domain names of interest. The whitelist has a higher priority than the blacklist. In addition, removing DNS logs further includes removing logs of enterprise IP access and removing logs that have no resolved IP. Enterprise IPs are removed because an enterprise IP generates many simultaneous access logs, which interferes with determining an individual's access trajectory; removing logs without a resolved IP removes logs of failed accesses. Filtering the logs along these different dimensions yields DNS logs that reflect the user's real access path.
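The filtering rules of step S1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the record shape, the function names `matches` and `filter_logs`, the suffix-matching rule for list entries, and the sample domains and IPs are all hypothetical. The text only states that the whitelist outranks the blacklist; applying the enterprise-IP and resolved-IP checks before the lists is an assumption made here.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record shape: (source_ip, domain_name, timestamp, resolved_ips)
Record = Tuple[str, str, str, List[str]]


def matches(domain_name: str, patterns: Set[str]) -> bool:
    """True if the name equals a listed domain or is a subdomain of one."""
    return any(
        domain_name == p or domain_name.endswith("." + p) for p in patterns
    )


def filter_logs(
    records: Iterable[Record],
    whitelist: Set[str],
    blacklist: Set[str],
    enterprise_ips: Set[str],
) -> List[Record]:
    kept = []
    for record in records:
        source_ip, domain_name, _timestamp, resolved_ips = record
        if source_ip in enterprise_ips:
            continue  # enterprise IPs produce many simultaneous requests
        if not resolved_ips:
            continue  # no resolved IP: the access failed
        if matches(domain_name, whitelist):
            kept.append(record)  # whitelist outranks the blacklist
            continue
        if matches(domain_name, blacklist):
            continue  # non-human requests generated by servers
        kept.append(record)
    return kept


sample = [
    ("1.1.1.1", "www.baidu.com", "20141211035932", ["192.0.2.10"]),
    ("1.1.1.1", "update.example.net", "20141211035933", ["192.0.2.11"]),
    ("1.1.1.1", "keep.example.net", "20141211035934", ["192.0.2.12"]),
    ("9.9.9.9", "www.qq.com", "20141211035935", ["192.0.2.13"]),
    ("2.2.2.2", "www.sina.com", "20141211035936", []),
]
filtered = filter_logs(
    sample,
    whitelist={"keep.example.net"},
    blacklist={"example.net"},
    enterprise_ips={"9.9.9.9"},
)
print([r[1] for r in filtered])
```

Here `keep.example.net` survives even though its parent `example.net` is blacklisted, which is the whitelist-over-blacklist priority described above.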

Next, the DNS logs obtained after the log filtering step are segmented sequentially according to the source IP, the timestamp difference, and the central domain, to obtain segmented access paths (step S2).

The detailed steps are as follows:

1) Segmentation according to the source IP (step S21). Segmenting the DNS logs by source IP yields the continuous DNS logs of the same source IP over a period of time.

For example, source IP 1.1.1.1 and source IP 2.2.2.2 are different source IPs, so their logs are split apart, as follows:

Source IP|Domain Name|Timestamp|Resolve IP|Status Code

1.1.1.1| www.baidu.com |20141211035932|180.***.***.107;180.***.***.108|0

1.1.1.1| www.qq.com |20141211035932|180.***.***.107;180.***.***.108|0

---------------------------------------Log cutting line---------------------------------------

2.2.2.2| www.baidu.com |20141211035932|180.***.***.107;180.***.***.108|0

2.2.2.2| www.qq.com |20141211035932|180.***.***.107;180.***.***.108|0
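Step S21 amounts to grouping log entries by source IP and keeping each group in time order. A minimal sketch follows; the function name `split_by_source_ip` and the tuple layout are hypothetical, and it assumes 14-digit `YYYYMMDDhhmmss` timestamps, which sort chronologically as plain strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (source_ip, domain_name, timestamp)
Record = Tuple[str, str, str]


def split_by_source_ip(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by source IP, each group sorted by timestamp."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    for group in groups.values():
        # sort() is stable, so entries with equal timestamps keep log order.
        group.sort(key=lambda r: r[2])
    return dict(groups)


log = [
    ("1.1.1.1", "www.baidu.com", "20141211035932"),
    ("2.2.2.2", "www.baidu.com", "20141211035932"),
    ("1.1.1.1", "www.qq.com", "20141211035932"),
    ("2.2.2.2", "www.qq.com", "20141211035932"),
]
by_ip = split_by_source_ip(log)
for ip, group in by_ip.items():
    print(ip, [domain for _, domain, _ in group])
```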

2) Next, the logs segmented by source IP are further segmented according to the difference between the timestamps of the DNS logs (step S22). If the difference between the timestamps of two consecutive DNS logs is greater than a specified length of time, the two logs are cut apart, because logs separated by too long an interval are considered to belong to two different behaviors. The specified length of time can be adjusted as needed. In this embodiment it is 3 seconds, that is, logs whose timestamps are more than 3 seconds apart are separated.

For example, the DNS logs of source IP 2.2.2.2 are further segmented according to the timestamp difference, as shown below. (The timestamp 20141211035932 means 03:59:32 on December 11, 2014.)

Source IP|Domain Name|Timestamp|Resolve IP|Status Code

2.2.2.2| www.baidu.com |20141211000001|180.***.***.107;180.***.***.108|0

2.2.2.2| a.qq.com |20141211000002|180.***.***.107;180.***.***.108|0

2.2.2.2| b.baidu.com |20141211000003|180.***.***.107;180.***.***.108|0

2.2.2.2| c.tanx.com |20141211000004|180.***.***.107;180.***.***.108|0

2.2.2.2| c.allyes.com |20141211000005|180.***.***.107;180.***.***.108|0

---------------------------------------Log cutting line---------------------------------------

2.2.2.2| www.sina.com |20141211000009|180.***.***.107;180.***.***.108|0

2.2.2.2| www.qq.com |20141211000015|180.***.***.107;180.***.***.108|0

2.2.2.2| www.qq.com |20141211000019|180.***.***.107;180.***.***.108|0

2.2.2.2| www.a.com |20141211000024|180.***.***.107;180.***.***.108|0

2.2.2.2| www.b.com |20141211000029|180.***.***.107;180.***.***.108|0

As shown above, the 05 seconds of timestamp 20141211000005 and the 09 seconds of timestamp 20141211000009 differ by 4 seconds (greater than 3 seconds), so the log is cut between them. Timestamps 20141211000009 and 20141211000015 are 6 seconds apart, so the log is also cut between them. Each later pair of consecutive entries is likewise more than 3 seconds apart and is cut in the same way.
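The timestamp rule of step S22 can be sketched as below. This is an illustrative sketch, not the patented implementation: the function name `split_by_time_gap` and the input shape are hypothetical, and the timestamps are assumed to be 14-digit `YYYYMMDDhhmmss` strings (so the later entries are written as 20141211000015, 20141211000019, and so on). A cut is made only when the gap is strictly greater than the threshold, matching "greater than the specified length of time".

```python
from datetime import datetime
from typing import List, Sequence, Tuple


def split_by_time_gap(
    entries: Sequence[Tuple[str, str]], max_gap_seconds: float = 3
) -> List[List[str]]:
    """Split time-ordered (domain_name, timestamp) pairs of one source IP.

    A new segment starts whenever two consecutive entries are more than
    max_gap_seconds apart.
    """
    segments: List[List[str]] = []
    previous = None
    for domain_name, timestamp in entries:
        current = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
        gap_too_large = (
            previous is not None
            and (current - previous).total_seconds() > max_gap_seconds
        )
        if previous is None or gap_too_large:
            segments.append([])
        segments[-1].append(domain_name)
        previous = current
    return segments


entries = [
    ("www.baidu.com", "20141211000001"),
    ("a.qq.com", "20141211000002"),
    ("b.baidu.com", "20141211000003"),
    ("c.tanx.com", "20141211000004"),
    ("c.allyes.com", "20141211000005"),
    ("www.sina.com", "20141211000009"),
    ("www.qq.com", "20141211000015"),
    ("www.qq.com", "20141211000019"),
    ("www.a.com", "20141211000024"),
    ("www.b.com", "20141211000029"),
]
segments = split_by_time_gap(entries)
for segment in segments:
    print(segment)
```

On this data the function yields six segments: the first holds the five requests made within one-second gaps, and each later request forms its own segment.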

As shown above, the log is split into six segments. In the first segment, source IP 2.2.2.2 requested five domain names: www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com, and c.allyes.com. According to the method for determining user access behavior, the user actually visited only www.baidu.com; the other four domain names were requested as a side effect of the user clicking www.baidu.com and do not represent real visits. The first segment therefore shows that the user visited the domain name www.baidu.com. The method for determining user access behavior is as follows: when a user clicks a URL, other domain names are requested in addition to the domain name of that URL. Crawler technology can collect all the additional domain name requests that follow the request for the URL's domain name, and these crawled requests are matched against the domain name segments cut from the DNS log to obtain the correspondence between DNS log segments and the domain names the user actually visited. From this correspondence it can be determined that this segment reflects an actual visit to www.baidu.com. The second segment contains only www.sina.com, so www.sina.com is the domain name the user visited.

Connecting the paths of the above segments gives:

www.baidu.com>www.sina.com>www.qq.com>www.qq.com>www.a.com>www.b.com

Then the domain names in the path obtained from the timestamp-based segmentation are converted into domains, and consecutive identical domains are merged. Here the two consecutive qq.com entries are merged, and the result is:

baidu.com>sina.com>qq.com>a.com>b.com

The above path is one of the access paths of this source IP. By the same rule, all access paths of all source IPs can be computed.
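The merging step can be sketched as follows. The names `to_domain` and `merge_path` are hypothetical. Reducing a domain name to its last two labels is a simplifying assumption that works for the names in this example but not for suffixes such as `.com.cn`; a production system would more likely consult the Public Suffix List.

```python
from itertools import groupby
from typing import List


def to_domain(domain_name: str) -> str:
    """Reduce a domain name such as www.qq.com to its domain, qq.com.

    Assumption: the domain is the last two labels of the name.
    """
    labels = domain_name.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def merge_path(domain_names: List[str]) -> List[str]:
    """Convert each name to its domain and collapse consecutive repeats."""
    domains = [to_domain(name) for name in domain_names]
    # groupby groups only adjacent equal items, so a domain that recurs
    # later in the path (not consecutively) is kept as a separate visit.
    return [domain for domain, _ in groupby(domains)]


visited = [
    "www.baidu.com",
    "www.sina.com",
    "www.qq.com",
    "www.qq.com",
    "www.a.com",
    "www.b.com",
]
path = merge_path(visited)
print(">".join(path))
```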

3) Next, the above result is further segmented according to the central domain (step S23). The central domain is the domain to be analyzed according to user/system requirements; the analysis determines from which domains users arrived at the central domain and to which domains they went after leaving it. For example, take a.com in the log as the central domain, as shown below:

baidu.com>sina.com>qq.com>a.com>b.com

For example, the following are four paths of the foregoing source IP. Only the three layers of source domains before the central domain are shown for each path; the paths after the central domain are processed with the same logic as the paths before it. The actual number of layers can be adjusted as needed. See also FIG. 2(a).

Source Domain 3 > Source Domain 2 > Source Domain 1 > Central Domain

Path 1: baidu.com>sina.com>qq.com>a.com (central domain)

Path 2: sina.com>baidu.com>qq.com>a.com (central domain)

Path 3: youku.com>sina.com>baidu.com>a.com (central domain)

Path 4: baidu.com>qq.com>youku.com>a.com (central domain)
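Segmenting a path around the central domain can be sketched as below. This is a sketch under stated assumptions rather than the patented implementation: the function name `split_around_center` and the `layers` parameter are hypothetical, and every occurrence of the central domain in a path is treated as a separate visit.

```python
from typing import List, Tuple


def split_around_center(
    path: List[str], center: str, layers: int = 3
) -> List[Tuple[List[str], List[str]]]:
    """Return (sources, destinations) for each occurrence of the center.

    sources[0] is source domain 1 (visited just before the center) and
    destinations[0] is destination domain 1 (visited just after it).
    """
    results = []
    for index, domain in enumerate(path):
        if domain != center:
            continue
        before = path[max(0, index - layers):index]
        sources = before[::-1]  # nearest source first
        destinations = path[index + 1:index + 1 + layers]
        results.append((sources, destinations))
    return results


path = ["baidu.com", "sina.com", "qq.com", "a.com", "b.com"]
for sources, destinations in split_around_center(path, "a.com"):
    print("sources:", sources, "destinations:", destinations)
```

For the example path, source domain 1 is qq.com, source domain 2 is sina.com, source domain 3 is baidu.com, and destination domain 1 is b.com.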

Finally, the data summarization step summarizes all four access paths of the aforementioned source IP. The summary is shown in FIG. 2(b).

The summary of the central domain is 4 a.com.

The summary of source domain 1 is 2 qq.com, 1 baidu.com, and 1 youku.com.

The summary of source domain 2 is 2 sina.com, 1 baidu.com, and 1 qq.com.

The summary of source domain 3 is 2 baidu.com, 1 sina.com, and 1 youku.com.
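The per-layer counts above can be reproduced with a short counting routine. This is a minimal sketch, not the patented implementation; the function name `summarize_sources` is hypothetical, and for simplicity it uses only the first occurrence of the central domain in each path.

```python
from collections import Counter
from typing import List, Tuple


def summarize_sources(
    paths: List[List[str]], center: str, layers: int = 3
) -> Tuple[int, List[Counter]]:
    """Count central-domain visits and, per layer, the source domains.

    The returned list holds one Counter per layer; index 0 is source
    domain 1, the domain visited immediately before the center.
    """
    center_count = 0
    per_layer = [Counter() for _ in range(layers)]
    for path in paths:
        if center not in path:
            continue
        center_index = path.index(center)  # first occurrence only
        center_count += 1
        for layer in range(1, layers + 1):
            source_index = center_index - layer
            if source_index >= 0:
                per_layer[layer - 1][path[source_index]] += 1
    return center_count, per_layer


paths = [
    ["baidu.com", "sina.com", "qq.com", "a.com"],
    ["sina.com", "baidu.com", "qq.com", "a.com"],
    ["youku.com", "sina.com", "baidu.com", "a.com"],
    ["baidu.com", "qq.com", "youku.com", "a.com"],
]
center_count, per_layer = summarize_sources(paths, "a.com")
print("central domain a.com:", center_count)
for layer, counts in enumerate(per_layer, start=1):
    print(f"source domain {layer}:", dict(counts))
```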

The visualization in FIG. 2(b) makes clear which domains users visited immediately before reaching the central domain a.com, which domains they visited before those, and so on.

When all source IPs are processed by this logic, the traffic sources and destinations of the entire Internet can be seen.

With the above method of the present invention, the source and destination of Internet traffic can be determined for the central domain to be analyzed, which helps the website of the central domain analyze and optimize its traffic. Further, with a complete view of how traffic flows across the Internet, the traffic of other websites can be analyzed and understood from a global perspective, so that an operator knows both its own site and others.

The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. That is, equivalent changes and modifications made in accordance with the scope of the patent claims of the present invention shall fall within the technical scope of the present invention.

Claims

1. A method for analyzing the source and destination of Internet traffic, characterized in that the source and destination of Internet traffic are obtained by processing DNS logs, the method comprising the following steps:

a log filtering step of filtering out DNS logs that do not reflect the user's real access path;

a log segmentation step of sequentially segmenting the DNS logs obtained after the log filtering step according to the source IP, the timestamp difference, and the central domain, to obtain segmented access paths; and

a data summarization step of summarizing all of the segmented access paths.
2. The analysis method according to claim 1, wherein in the log filtering step a blacklist and a whitelist are set so that DNS logs containing requests for domain names of interest are retained and DNS logs containing non-human domain name requests generated by servers are removed.
3. The analysis method according to claim 2, wherein removing DNS logs further comprises removing logs of enterprise IP access and removing logs that have no resolved IP.
4. The analysis method according to claim 3, wherein segmenting the DNS logs according to the source IP means obtaining the continuous DNS logs of the same source IP over a period of time.
5. The analysis method according to claim 4, wherein segmenting according to the timestamp difference means that the logs segmented by source IP are further segmented according to the difference between the timestamps of the DNS logs, and if the difference between the timestamps of two DNS logs is greater than a specified length of time, the two DNS logs are cut apart.
6. The analysis method according to claim 5, wherein the specified length of time is 3 seconds.
7. The analysis method according to claim 6, wherein the step of segmenting the DNS logs according to the timestamp difference further comprises a merging step in which the domain names in the access path obtained by the segmentation are converted into domains and consecutive identical domains are merged to obtain the path of the source IP.
8. The analysis method according to claim 7, wherein segmenting according to the central domain means segmenting the path of the source IP around the central domain, and the access path obtained after the segmentation is:

source domain n+...+source domain 1+central domain+destination domain 1+...+destination domain n,

wherein the central domain is the domain to be analyzed according to user/system requirements.
9. The analysis method according to claim 8, wherein in the data summarization step all access paths of the source IP obtained in the central-domain segmentation step are summarized.